PySpark loading from URL - csv

I want to load CSV files from a URL in PySpark; is it even possible to do so?
I keep the files on GitHub.
Thanks!

There is no native way to do this in PySpark (see here).
However, if you have a function that takes a URL as input and returns the CSV contents:
def read_from_URL(URL):
    # your logic here
    return data
you can use Spark to parallelize this operation:
URL_list = ['http://github.com/file/location/file1.csv', ...]
data = sc.parallelize(URL_list).map(read_from_URL)
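For completeness, here is a minimal sketch of such a read_from_URL using only the Python standard library (the raw.githubusercontent.com URL below is a hypothetical placeholder; for GitHub you need the raw file URL). flatMap is used so that each CSV row becomes its own RDD element:
import csv
import urllib.request

def read_from_URL(url):
    # download the file and decode it to text
    with urllib.request.urlopen(url) as response:
        text = response.read().decode('utf-8')
    # parse the CSV; each element of the returned list is one row (a list of strings)
    return list(csv.reader(text.splitlines()))

URL_list = ['https://raw.githubusercontent.com/user/repo/main/file1.csv']
rows = sc.parallelize(URL_list).flatMap(read_from_URL)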

Related

Merging and/or Reading 88 JSON Files into Dataframe - different datatypes

I basically have a procedure where I make multiple calls to an API and, using a token within the JSON response, pass that back to a function to call the API again and get a "paginated" file.
In total I have to call and download 88 JSON files that total 758 MB. The JSON files are all formatted the same way and have the same "schema", or at least should do. I have tried reading each JSON file into a data frame after it has been downloaded, and then attempted to union that dataframe to a master dataframe, so essentially I'll have one big data frame with all 88 JSON files read into it.
However, the problem I encounter is that at roughly file 66 the system (Python/Databricks/Spark) decides to change the data type of a field. It is always a string, and I'm guessing that when a value actually appears in that field it changes to a boolean. The problem is then that unionByName fails because of the different datatypes.
What is the best way for me to resolve this? I thought about using "extend" to merge all the JSON files into one big file, however a 758 MB JSON file would be a huge read and undertaking.
Could the other solution be to explicitly set the schema that the JSON file is read into so that it is always the same type?
If you know the attributes of those files, you can define the schema before reading them and create an empty df with that schema, so you can do a unionByName with allowMissingColumns=True:
something like:
from pyspark.sql.types import *

my_schema = StructType([
    StructField('file_name', StringType(), True),
    StructField('id', LongType(), True),
    StructField('dataset_name', StringType(), True),
    StructField('snapshotdate', TimestampType(), True)
])

# empty DataFrame with the fixed schema to union everything into
output = sqlContext.createDataFrame(sc.emptyRDD(), my_schema)
df_json = spark.read.[...your JSON file...]
output = output.unionByName(df_json, allowMissingColumns=True)
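Alternatively, to address the explicit-schema idea from the question directly, here is a minimal sketch (the path is a hypothetical placeholder) that reads every downloaded file with the predefined schema, so type inference can never flip the field to boolean:
# read all 88 files at once with the schema defined above (placeholder path)
# add .option('multiLine', True) if each file is a single multi-line JSON document
df_all = spark.read.schema(my_schema).json('/mnt/raw/api_dump/*.json')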
I'm not sure this is what you are looking for, but I hope it helps.

How to load json and value out of json in Pig?

I have a JSON object with a value outside of the JSON:
000000,{"000":{"phoneNumber":null,"firstName":"xyz","lastName":"pqr","email":"email#xyz.com","alternatePickup":true,"sendTextNotification":false,"isSendTextNotification":false,"isAlternatePickup":true}}
I'm trying to load this JSON in Pig using the Elephant Bird JSON loader, but I am unable to do so.
I am able to load the following JSON
{"000":{"phoneNumber":null,"firstName":"xyz","lastName":"pqr","email":"email#xyz.com","alternatePickup":true,"sendTextNotification":false,"isSendTextNotification":false,"isAlternatePickup":true}}
using the following script:
REGISTER json-simple-1.1.1.jar;
REGISTER elephant-bird-pig-4.3.jar;
REGISTER elephant-bird-hadoop-compat-4.3.jar;
json_data = load 'ek.json' using com.twitter.elephantbird.pig.load.JsonLoader() AS (json_key: [(phoneNumber:chararray,firstName:chararray,lastName:chararray,email:chararray,alternatePickup:boolean,sendTextNotification:boolean,isSendTextNotification:boolean,isAlternatePickup:boolean)]);
dump json_data;
But when I include the value outside the JSON
json_data = load 'ek.json' using com.twitter.elephantbird.pig.load.JsonLoader() AS (id:int,json_key: [(phoneNumber:chararray,firstName:chararray,lastName:chararray,email:chararray,alternatePickup:boolean,sendTextNotification:boolean,isSendTextNotification:boolean,isAlternatePickup:boolean)]);
it does not work. I appreciate the help in advance.
JsonLoader only accepts well-formed JSON, while your format is actually CSV (an id field followed by a JSON field). There are three options for you, ordered by increasing complexity:
Adjust your input format and make the id part of the JSON itself
Load the data as CSV with 2 fields (id and json), then use a custom UDF to parse the json field into a tuple
Write a custom loader that handles your original format.
You can use the built-in JsonStorage and JsonLoader():
a = load 'a.json' using JsonLoader('a0:int,a1:{(a10:int,a11:chararray)},a2:(a20:double,a21:bytearray),a3:[chararray]');
In the next example, data is loaded without a schema; it assumes there is a .pig_schema file (produced by JsonStorage) in the input directory:
a = load 'a.json' using JsonLoader();

running nested jobs in spark

I am using PySpark. I have a list of gzipped JSON files on S3 which I have to access, transform, and then export in Parquet to S3. Each JSON file contains around 100k lines, so parallelizing within a file won't make much sense (but I am open to it), but there are around 5k files which I have to parallelize over. My approach is: pass the JSON file list to the script -> run parallelize on the list -> run map (this is where I am getting blocked). How do I access and transform the JSON, create a DF out of the transformed JSON, and dump it as Parquet into S3?
To read json in a distributed fashion, you will need to parallelize your keys as you mention. To do this while reading from s3, you'll need to use boto3. Below is a skeleton sketch of how to do so. You'll likely need to modify distributedJsonRead to fit your use case.
import boto3
import json
from pyspark.sql import Row

def distributedJsonRead(s3Key):
    # fetch the object from S3 and parse its JSON body into a Row
    s3obj = boto3.resource('s3').Object(bucket_name='bucketName', key=s3Key)
    contents = json.loads(s3obj.get()['Body'].read().decode('utf-8'))
    return Row(**contents)

pkeys = sc.parallelize(keyList)  # keyList is a list of s3 keys
dataRdd = pkeys.map(distributedJsonRead)
Boto3 Reference: http://boto3.readthedocs.org/en/latest/guide/quickstart.html
Edit: to address the 1:1 mapping of input files to output files
Later on, it's likely that having a merged parquet data set would be easier to work with. But if this is the way you need to do it, you could try something like this:
for k in keyList:
    rawtext = sqlContext.read.json(k)  # or whichever method you need to use to read in the data
    outpath = k[:-4] + 'parquet'       # e.g. 'data.json' -> 'data.parquet'
    rawtext.write.parquet(outpath)
I don't think you will be able to parallelize these operations if you want a 1:1 mapping of json to parquet files. Spark's read/write functionality is designed to be called by the driver, and needs access to sc and sqlContext. This is another reason why having one parquet directory is likely the way to go.
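If the 1:1 mapping is not a hard requirement, here is a minimal sketch of that merged route, reusing distributedJsonRead from above (assuming Spark 1.4+; the bucket and output path are hypothetical placeholders):
# build the RDD of Rows in parallel, then write a single merged Parquet data set
dataRdd = sc.parallelize(keyList).map(distributedJsonRead)
df = sqlContext.createDataFrame(dataRdd)
df.write.parquet('s3n://bucketName/output/merged_parquet')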

Json-Opening Yelp Data Challenge's data set

I am interested in data mining and I am writing my thesis about it. For my thesis I want to use Yelp's dataset challenge data set; however, I cannot open it since it is in JSON format and almost 2 GB. On its website it is said that the dataset can be opened in Python using mrjob, but I am also not very good with programming. I searched online and looked at some of the code Yelp provided on GitHub, however I couldn't seem to find an article or something which clearly explains how to open the dataset.
Can you please tell me step by step how to open this file and maybe how to convert it to CSV?
https://www.yelp.com.tr/dataset_challenge
https://github.com/Yelp/dataset-examples
The data is in .tar format; when you extract it, it contains another file. Rename that to .tar and extract it again, and you will get all the JSON files.
Yes, you can use pandas. Take a look:
import pandas as pd

# read the entire file into a python list, one JSON object per line
with open('yelp_academic_dataset_review.json', 'r') as f:
    data = f.readlines()

# remove the trailing "\n" from each line
data = [line.rstrip() for line in data]

# wrap the individual JSON objects into one JSON array
data_json_str = "[" + ','.join(data) + "]"

# now, load it into pandas
data_df = pd.read_json(data_json_str)
Now 'data_df' contains the yelp data ;)
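As a side note, newer pandas versions can read such line-delimited JSON directly, which avoids building the whole string in memory:
data_df = pd.read_json('yelp_academic_dataset_review.json', lines=True)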
In case you want to convert it directly to CSV, you can use this script:
https://github.com/Yelp/dataset-examples/blob/master/json_to_csv_converter.py
I hope it helps.
To process huge JSON files, use a streaming parser.
Many of these files aren't a single JSON document, but a stream of JSON objects, one per line (commonly called JSON Lines or NDJSON). A regular JSON parser will then consider everything but the first entry to be junk.
With a streaming parser, you can start reading the file, process parts, and write them to the desired output, then continue reading.
There is no single json-to-csv conversion.
Thus, you will not find a general conversion utility; you have to customize the conversion for your needs.
The reason is that a JSON document is a tree, but a CSV is not. There is no universal and efficient conversion from trees to table rows. I'd stick with JSON unless you are always extracting only the same few attributes from the tree.
Start coding; to succeed with such amounts of data, you need to become a better programmer.
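Here is a minimal sketch of that streaming, fixed-attribute approach (the chosen columns are just example attribute names from the review file): each line is one JSON object, so it can be parsed and written out one record at a time without ever holding 2 GB in memory.
import csv
import json

with open('yelp_academic_dataset_review.json', 'r') as src, \
     open('reviews.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    writer.writerow(['business_id', 'stars', 'text'])  # the attributes you extract
    for line in src:
        record = json.loads(line)
        writer.writerow([record['business_id'], record['stars'], record['text']])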

PySpark: How to Read Many JSON Files, Multiple Records Per File

I have a large dataset stored in a S3 bucket, but instead of being a single large file, it's composed of many (113K to be exact) individual JSON files, each of which contains 100-1000 observations. These observations aren't on the highest level, but require some navigation within each JSON to access.
i.e.
json["interactions"] is a list of dictionaries.
I'm trying to utilize Spark/PySpark (version 1.1.1) to parse through and reduce this data, but I can't figure out the right way to load it into an RDD, because it's neither all records in one file (in which case I'd use sc.textFile, though with the added complication here of JSON) nor each record in its own file (in which case I'd use sc.wholeTextFiles).
Is my best option to use sc.wholeTextFiles and then use a map (or in this case flatMap?) to pull the multiple observations from being stored under a single filename key to their own key? Or is there an easier way to do this that I'm missing?
I've seen answers here that suggest just using json.loads() on all files loaded via sc.textFile, but it doesn't seem like that would work for me because the JSONs aren't simple highest-level lists.
The previous answers are not going to read the files in a distributed fashion (see reference). To do so, you would need to parallelize the s3 keys and then read in the files during a flatMap step like below.
import boto3
import json
from pyspark.sql import Row

def distributedJsonRead(s3Key):
    # fetch the object from S3, parse it, and emit one Row per interaction
    s3obj = boto3.resource('s3').Object(bucket_name='bucketName', key=s3Key)
    contents = json.loads(s3obj.get()['Body'].read().decode('utf-8'))
    for dicts in contents['interactions']:
        yield Row(**dicts)

pkeys = sc.parallelize(keyList)  # keyList is a list of s3 keys
dataRdd = pkeys.flatMap(distributedJsonRead)
Boto3 Reference
What about using DataFrames?
Does
testFrame = sqlContext.read.json('s3n://<bucket>/<key>')
give you what you want from one file?
Does every observation have the same "columns" (# of keys)?
If so, you could use boto to list each object you want to add, read them in, and union them with each other.
from pyspark.sql import SQLContext
from pyspark.sql.types import *
import boto3

sqlContext = SQLContext(sc)

s3 = boto3.resource('s3')
bucket = s3.Bucket('<bucket>')
aws_secret_access_key = '<secret>'
aws_access_key_id = '<key>'

# Configure spark with your S3 access keys
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", aws_access_key_id)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", aws_secret_access_key)

# list every object in the bucket and build the s3n paths
object_list = [k for k in bucket.objects.all()]
key_list = [k.key for k in bucket.objects.all()]
paths = ['s3n://' + o.bucket_name + '/' + o.key for o in object_list]

# read each file into its own DataFrame, then union them all
dataframes = [sqlContext.read.json(path) for path in paths]
df = dataframes[0]
for frame in dataframes[1:]:
    df = df.unionAll(frame)
I'm new to spark myself so I'm wondering if there's a better way to use dataframes with a lot of s3 files, but so far this is working for me.
The name is misleading (because it's singular), but sparkContext.textFile() (at least in the Scala case) also accepts a directory name or a wildcard path, so you should just be able to say textFile("/my/dir/*.json").
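A minimal sketch of that wildcard approach, combined with the nested interactions list from the question (the S3 path is a placeholder, and this assumes each file holds its JSON document on a single line):
import json

raw = sc.textFile('s3n://<bucket>/<prefix>/*.json')
# parse each file's JSON and flatten its interactions list into individual records
interactions = raw.flatMap(lambda line: json.loads(line)['interactions'])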