How to save JSON data fetched from a URL in PySpark?

I have fetched some JSON data from an API:
import urllib2
test = urllib2.urlopen('url')
print test
How can I save it as a table or DataFrame? I am using Spark 2.0.

This is how I succeeded in importing JSON data from the web into a DataFrame:
from pyspark.sql import SparkSession
from urllib.request import urlopen

spark = SparkSession.builder.getOrCreate()

url = 'https://web.url'
# Fetch the raw JSON string from the URL
jsonData = urlopen(url).read().decode('utf-8')
# Distribute the string as a one-element RDD and let Spark parse it
rdd = spark.sparkContext.parallelize([jsonData])
df = spark.read.json(rdd)
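Once the DataFrame exists you can inspect it and save it like any other; for example (saveAsTable needs a Hive-enabled session, and the table name is just a placeholder):
# Check the schema Spark inferred from the JSON and preview rows
df.printSchema()
df.show(5)

# Persist it as a table (requires Hive support); the name is an example
df.write.mode('overwrite').saveAsTable('json_from_url')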

For Spark 1.x you can do some research and try using sqlContext. Here is sample code:
>>> df2 = sqlContext.jsonRDD(test)
>>> df2.first()
For more details, see the Spark 1.6.2 Python API docs:
https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html
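Note that jsonRDD expects an RDD of JSON strings rather than the file-like object urlopen returns, so on Spark 1.x (with the shell-provided sc and sqlContext) the full flow would look roughly like this:
import urllib2

# Read the response body as one JSON string, then distribute it
raw = urllib2.urlopen('url').read()
json_rdd = sc.parallelize([raw])

# jsonRDD parses each element of the RDD as JSON
df2 = sqlContext.jsonRDD(json_rdd)
df2.first()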

Adding to Rakesh Kumar's answer, the way to do it in Spark 2.0 is:
http://spark.apache.org/docs/2.1.0/sql-programming-guide.html#data-sources
As an example, the following creates a DataFrame based on the content of a JSON file:
# spark is an existing SparkSession
df = spark.read.json("examples/src/main/resources/people.json")
# Displays the content of the DataFrame to stdout
df.show()
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON. As a consequence, a regular multi-line JSON file will most often fail.
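If your file really is a single multi-line JSON document, Spark 2.2+ can still read it with the multiLine option (the path below is a hypothetical example):
# multiLine makes Spark parse the whole file as one JSON document
# instead of one JSON object per line (available since Spark 2.2)
df = spark.read.option("multiLine", True).json("path/to/multiline.json")
df.show()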

Another option is to download the file to every node with SparkFiles and read it locally (click on "Raw" on GitHub and copy that URL):
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Project").getOrCreate()

zip_url = "https://raw.githubusercontent.com/spark-examples/spark-scala-examples/master/src/main/resources/zipcodes.json"
# Distribute the file to the cluster, then read it from the local copy
spark.sparkContext.addFile(zip_url)
zip_df = spark.read.json("file://" + SparkFiles.get("zipcodes.json"))

Related

Removing the header from JSON and leaving the JSON array

I have a JSON file of the form
{"total_rows":1000,"rows":[{data},{data},{data}]}
and I just want
[{data},{data},{data}]
I know pandas can produce the desired DataFrame like this:
import json
import pandas as pd

file_reading = json.loads(open(url).read())
df = pd.DataFrame.from_dict(file_reading['rows'])
print(df)
But I am hoping for a way to output this as a JSON array, and it's a big dataset, so I don't want to loop.
You opened a file without closing it. There's nothing fancy needed; the JSON just translates into a dictionary in Python:
import json
import pandas as pd

with open(url) as fp:
    file_reading = json.load(fp)
df = pd.DataFrame(file_reading["rows"])
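And if what you really want is just the array as JSON text, with no DataFrame and no explicit loop, json.dumps serializes the whole list in one call (using the same file_reading dict as above):
# file_reading["rows"] is already a Python list of dicts;
# json.dumps turns the entire list into a JSON array at once
rows_json = json.dumps(file_reading["rows"])
print(rows_json[:200])  # preview the first 200 characters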

Python API JSON dict in DataFrame

I want to scrape data at the county level from https://apidocs.covidactnow.org
However, I could only get a DataFrame with one row for each county, and the data for each date is stored in a dictionary in each row/county. I would like to access this data and store it in long format (i.e., one row per county-date).
import requests
import pandas as pd
import os

if __name__ == '__main__':
    os.chdir('/home/username/Desktop/')
    url = 'https://api.covidactnow.org/v2/counties.timeseries.json?apiKey=ENTER_YOUR_KEY'
    response = requests.get(url).json()
    data = pd.DataFrame(response)
This seems like a trivial question, but I've tried for hours. What would be the best way to achieve that?
Do you mean something like this?
import requests

url = 'https://api.covidactnow.org/v2/states.timeseries.csv?apiKey=YOURAPIKEY'
response = requests.get(url)
csv_response = response.text
# Then you can transform the string to CSV
Check this for string to CSV: python parsing string to csv format
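For the long format the question actually asks for (one row per county-date) straight from the JSON endpoint, pd.json_normalize can explode the nested per-date records. A sketch; the field names actualsTimeseries, county, and fips are assumptions about the response shape, so check them against the real API:
import requests
import pandas as pd

url = 'https://api.covidactnow.org/v2/counties.timeseries.json?apiKey=ENTER_YOUR_KEY'
response = requests.get(url).json()

# record_path explodes each county's list of per-date dicts into rows;
# meta repeats the county-level fields on every exploded row.
# These field names are assumptions -- verify against the API docs.
long_df = pd.json_normalize(
    response,
    record_path='actualsTimeseries',
    meta=['county', 'fips'],
)
print(long_df.head())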

Reading json file from remote url in groovy

Suppose some JSON file is at www.github.com/xyz/Hello.json, and I want to read the content of this JSON into a JSON object in Groovy.
You then need:
import groovy.json.JsonSlurper

// toURL() needs a protocol, so prefix the URL with a scheme
def slurped = new JsonSlurper().parse('https://www.github.com/xyz/Hello.json'.toURL())
Here you can find more info.

Not able to read streaming files using Spark structured streaming

I have a set of CSV files which need to be read through Spark Structured Streaming. After creating a DataFrame, I need to load it into a Hive table.
When a file is already present before running the code through spark-submit, the data is loaded into Hive successfully. But when I add new CSV files at runtime, they are not inserted into Hive at all.
Code is:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.OutputMode
val spark = SparkSession.builder()
  .appName("Spark SQL Example")
  .config("hive.metastore.uris", "thrift://hostname:port")
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .enableHiveSupport()
  .getOrCreate()
spark.conf.set("spark.sql.streaming.schemaInference", true)
import spark.implicits._
val df = spark.readStream.option("header", true).csv("file:///folder path/")
val query = df.writeStream.queryName("tab").format("memory").outputMode(OutputMode.Append()).start()
spark.sql("insert into hivetab select * from tab").show()
query.awaitTermination()
Am I missing out something here?
Any suggestions would be helpful.
Thanks
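One likely cause: the memory sink plus a single one-off spark.sql insert runs only once at startup, so files added at runtime never reach Hive. The usual fix is to write every micro-batch to the table with foreachBatch (available since Spark 2.4); a rough PySpark sketch, reusing the question's hivetab table and input path:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("Spark SQL Example")
         .enableHiveSupport()
         .getOrCreate())
spark.conf.set("spark.sql.streaming.schemaInference", True)

df = spark.readStream.option("header", True).csv("file:///folder path/")

# foreachBatch runs once per micro-batch, so CSV files that arrive at
# runtime are appended to the Hive table as the stream picks them up
def write_to_hive(batch_df, batch_id):
    batch_df.write.mode("append").saveAsTable("hivetab")

query = df.writeStream.foreachBatch(write_to_hive).start()
query.awaitTermination()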

Python: Build DataFrame from parts of JSON response

I am trying to develop an application to retrieve stock prices (in JSON) and then do some analysis on them. My problem is getting the JSON response into a pandas DataFrame where I can work with it. Here is my code:
'''
References
http://stackoverflow.com/questions/6862770/python-3-let-json-object-accept-bytes-or-let-urlopen-output-strings
'''
import json
import pandas as pd
from urllib.request import urlopen
#set API call
url = "https://www.quandl.com/api/v3/datasets/WIKI/AAPL.json?start_date=2017-01-01&end_date=2017-01-31"
#make call and receive response
response = urlopen(url).read().decode('utf8')
dataresponse = json.loads(response)
#check incoming
#print(dataresponse)
df = pd.read_json(dataresponse)
print(df)
The application errors at df = pd.read_json... with the error TypeError: Expected String or Unicode.
So I reckon this is the first hurdle.
The second is getting where I need to be. The JSON response contains only two arrays I am interested in, column_names and data. How do I extract only these two and put them into a pandas DataFrame?
To answer your first question, pd.read_json takes a JSON string directly, so you should be doing this:
pd.read_json(response)
But instead, considering how the data is structured, it's best to first convert the JSON string to a dictionary containing the data:
d = json.loads(response)
Then simply build the dataframe from d['dataset']['data'] and d['dataset']['column_names']:
pd.DataFrame(data=d['dataset']['data'], columns=d['dataset']['column_names'])
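Putting it together, a complete runnable sketch (the 'dataset', 'data', and 'column_names' keys come from the question's response; the trailing Date conversion assumes the first column is named 'Date'):
import json
from urllib.request import urlopen

import pandas as pd

url = ("https://www.quandl.com/api/v3/datasets/WIKI/AAPL.json"
       "?start_date=2017-01-01&end_date=2017-01-31")
response = urlopen(url).read().decode('utf8')
d = json.loads(response)

# Build the frame from just the two arrays of interest
df = pd.DataFrame(data=d['dataset']['data'],
                  columns=d['dataset']['column_names'])

# Assumption: the dataset's first column is a date string named 'Date'
df['Date'] = pd.to_datetime(df['Date'])
print(df.head())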