How to save JSON data fetched from a URL in PySpark?

I have fetched some JSON data from an API:
import urllib2
test = urllib2.urlopen('url')
print test
How can I save it as a table or DataFrame? I am using Spark 2.0.

This is how I succeeded in importing JSON data from the web into a DataFrame:
from pyspark.sql import SparkSession
from urllib.request import urlopen

spark = SparkSession.builder.getOrCreate()

url = 'https://web.url'
# Fetch the raw JSON string from the URL
jsonData = urlopen(url).read().decode('utf-8')
# Distribute the string as a one-element RDD and let Spark parse it
rdd = spark.sparkContext.parallelize([jsonData])
df = spark.read.json(rdd)
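Once the DataFrame exists you can inspect it and save it like any other; for example (saveAsTable needs a Hive-enabled session, and the table name is just a placeholder):
# Check the schema Spark inferred from the JSON and preview rows
df.printSchema()
df.show(5)

# Persist it as a table (requires Hive support); the name is an example
df.write.mode('overwrite').saveAsTable('json_from_url')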

For Spark 1.x you can do some research and try using sqlContext. Here is sample code:
>>> df2 = sqlContext.jsonRDD(test)
>>> df2.first()
For more details, see the Spark 1.6.2 Python API docs:
https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html
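Note that jsonRDD expects an RDD of JSON strings rather than the file-like object urlopen returns, so on Spark 1.x (with the shell-provided sc and sqlContext) the full flow would look roughly like this:
import urllib2

# Read the response body as one JSON string, then distribute it
raw = urllib2.urlopen('url').read()
json_rdd = sc.parallelize([raw])

# jsonRDD parses each element of the RDD as JSON
df2 = sqlContext.jsonRDD(json_rdd)
df2.first()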

Adding to Rakesh Kumar's answer, the way to do it in Spark 2.0 is:
http://spark.apache.org/docs/2.1.0/sql-programming-guide.html#data-sources
As an example, the following creates a DataFrame based on the content of a JSON file:
# spark is an existing SparkSession
df = spark.read.json("examples/src/main/resources/people.json")
# Displays the content of the DataFrame to stdout
df.show()
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON. As a consequence, a regular multi-line JSON file will most often fail.
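If your file really is a single multi-line JSON document, Spark 2.2+ can still read it with the multiLine option (the path below is a hypothetical example):
# multiLine makes Spark parse the whole file as one JSON document
# instead of one JSON object per line (available since Spark 2.2)
df = spark.read.option("multiLine", True).json("path/to/multiline.json")
df.show()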

Another option is to download the file to every node with SparkFiles and read it locally (click on "Raw" on GitHub and copy that URL):
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Project").getOrCreate()

zip_url = "https://raw.githubusercontent.com/spark-examples/spark-scala-examples/master/src/main/resources/zipcodes.json"
# Distribute the file to the cluster, then read it from the local copy
spark.sparkContext.addFile(zip_url)
zip_df = spark.read.json("file://" + SparkFiles.get("zipcodes.json"))

Related

Removing the header from JSON and leaving the JSON array

I have a JSON file of the form
{"total_rows":1000,"rows":[{data},{data},{data}]}
and I just want
[{data},{data},{data}]
I know pandas can produce the desired DataFrame like this:
import json
import pandas as pd

file_reading = json.loads(open(url).read())
df = pd.DataFrame.from_dict(file_reading['rows'])
print(df)
But I am hoping for a way to output this as a JSON array, and it's a big dataset, so I don't want to loop.
You opened a file without closing it. There's nothing fancy needed; the JSON just translates into a dictionary in Python:
import json
import pandas as pd

with open(url) as fp:
    file_reading = json.load(fp)
df = pd.DataFrame(file_reading["rows"])
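And if what you really want is just the array as JSON text, with no DataFrame and no explicit loop, json.dumps serializes the whole list in one call (using the same file_reading dict as above):
# file_reading["rows"] is already a Python list of dicts;
# json.dumps turns the entire list into a JSON array at once
rows_json = json.dumps(file_reading["rows"])
print(rows_json[:200])  # preview the first 200 characters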

Python API JSON dict in DataFrame

I want to scrape data at the county level from https://apidocs.covidactnow.org
However, I could only get a DataFrame with one row for each county, and the data for each date is stored in a dictionary in each row/county. I would like to access this data and store it in long format (i.e., one row per county-date).
import requests
import pandas as pd
import os

if __name__ == '__main__':
    os.chdir('/home/username/Desktop/')
    url = 'https://api.covidactnow.org/v2/counties.timeseries.json?apiKey=ENTER_YOUR_KEY'
    response = requests.get(url).json()
    data = pd.DataFrame(response)
This seems like a trivial question, but I've tried for hours. What would be the best way to achieve that?
Do you mean something like this?
import requests

url = 'https://api.covidactnow.org/v2/states.timeseries.csv?apiKey=YOURAPIKEY'
response = requests.get(url)
csv_response = response.text
# Then you can transform the string to CSV
Check this for string to CSV: python parsing string to csv format
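For the long format the question actually asks for (one row per county-date) straight from the JSON endpoint, pd.json_normalize can explode the nested per-date records. A sketch; the field names actualsTimeseries, county, and fips are assumptions about the response shape, so check them against the real API:
import requests
import pandas as pd

url = 'https://api.covidactnow.org/v2/counties.timeseries.json?apiKey=ENTER_YOUR_KEY'
response = requests.get(url).json()

# record_path explodes each county's list of per-date dicts into rows;
# meta repeats the county-level fields on every exploded row.
# These field names are assumptions -- verify against the API docs.
long_df = pd.json_normalize(
    response,
    record_path='actualsTimeseries',
    meta=['county', 'fips'],
)
print(long_df.head())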

Reading json file from remote url in groovy

Suppose some JSON file is at www.github.com/xyz/Hello.json, and I want to read the content of this JSON into a JSON object in Groovy.
You then need:
import groovy.json.JsonSlurper

// toURL() needs a protocol, so prefix the URL with a scheme
def slurped = new JsonSlurper().parse('https://www.github.com/xyz/Hello.json'.toURL())
Here you can find more info.

Not able to read streaming files using Spark structured streaming

I have a set of CSV files which need to be read through Spark Structured Streaming. After creating a DataFrame, I need to load it into a Hive table.
When a file is already present before running the code through spark-submit, the data is loaded into Hive successfully. But when I add new CSV files at runtime, they are not inserted into Hive at all.
Code is:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.OutputMode
val spark = SparkSession.builder()
  .appName("Spark SQL Example")
  .config("hive.metastore.uris", "thrift://hostname:port")
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .enableHiveSupport()
  .getOrCreate()
spark.conf.set("spark.sql.streaming.schemaInference", true)
import spark.implicits._
val df = spark.readStream.option("header", true).csv("file:///folder path/")
val query = df.writeStream.queryName("tab").format("memory").outputMode(OutputMode.Append()).start()
spark.sql("insert into hivetab select * from tab").show()
query.awaitTermination()
Am I missing out something here?
Any suggestions would be helpful.
Thanks
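One likely cause: the memory sink plus a single one-off spark.sql insert runs only once at startup, so files added at runtime never reach Hive. The usual fix is to write every micro-batch to the table with foreachBatch (available since Spark 2.4); a rough PySpark sketch, reusing the question's hivetab table and input path:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("Spark SQL Example")
         .enableHiveSupport()
         .getOrCreate())
spark.conf.set("spark.sql.streaming.schemaInference", True)

df = spark.readStream.option("header", True).csv("file:///folder path/")

# foreachBatch runs once per micro-batch, so CSV files that arrive at
# runtime are appended to the Hive table as the stream picks them up
def write_to_hive(batch_df, batch_id):
    batch_df.write.mode("append").saveAsTable("hivetab")

query = df.writeStream.foreachBatch(write_to_hive).start()
query.awaitTermination()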

Python: Build DataFrame from parts of JSON response

I am trying to develop an application to retrieve stock prices (in JSON) and then do some analysis on them. My problem is getting the JSON response into a pandas DataFrame where I can work with it. Here is my code:
'''
References
http://stackoverflow.com/questions/6862770/python-3-let-json-object-accept-bytes-or-let-urlopen-output-strings
'''
import json
import pandas as pd
from urllib.request import urlopen
#set API call
url = "https://www.quandl.com/api/v3/datasets/WIKI/AAPL.json?start_date=2017-01-01&end_date=2017-01-31"
#make call and receive response
response = urlopen(url).read().decode('utf8')
dataresponse = json.loads(response)
#check incoming
#print(dataresponse)
df = pd.read_json(dataresponse)
print(df)
The application errors at df = pd.read_json... with the error TypeError: Expected String or Unicode.
So I reckon this is the first hurdle.
The second is getting where I need to be. The JSON response contains only two arrays I am interested in, column_names and data. How do I extract only these two and put them into a pandas DataFrame?
To answer your first question, pd.read_json takes a JSON string directly, so you should be doing this:
pd.read_json(response)
But instead, considering how the data is structured, it's best to first convert the JSON string to a dictionary containing the data:
d = json.loads(response)
Then simply build the dataframe from d['dataset']['data'] and d['dataset']['column_names']:
pd.DataFrame(data=d['dataset']['data'], columns=d['dataset']['column_names'])
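Putting it together, a complete runnable sketch (the 'dataset', 'data', and 'column_names' keys come from the question's response; the trailing Date conversion assumes the first column is named 'Date'):
import json
from urllib.request import urlopen

import pandas as pd

url = ("https://www.quandl.com/api/v3/datasets/WIKI/AAPL.json"
       "?start_date=2017-01-01&end_date=2017-01-31")
response = urlopen(url).read().decode('utf8')
d = json.loads(response)

# Build the frame from just the two arrays of interest
df = pd.DataFrame(data=d['dataset']['data'],
                  columns=d['dataset']['column_names'])

# Assumption: the dataset's first column is a date string named 'Date'
df['Date'] = pd.to_datetime(df['Date'])
print(df.head())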