Reading JSON files using pyspark

I am trying to read multiple JSON files from DBFS in Databricks.
raw_df = spark.read.json('/mnt/testdatabricks/metrics-raw/',recursiveFileLookup=True)
This returns data for only 35 files whereas there are around 1600 files.
I tried to read some of the files (except those 35) using pandas and it returned data.
However, the driver fails when I try to read all 1600 files using pandas.
import pandas as pd
from glob import glob
jsonFiles = glob('/dbfs/mnt/testdatabricks/metrics-raw/***/*.json')
dfList = []
for jsonFile in jsonFiles:
    df = pd.read_json(jsonFile)
    dfList.append(df)
    print("written :", jsonFile)
dfTrainingDF = pd.concat(dfList, axis=0)
Not sure why Spark is not able to read all the files.

Try:
spark.read.option("recursiveFileLookup", "true").json("file:///dir1/subdirectory")
Ref: How to make Spark session read all the files recursively?
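On the DBFS mount from the question that would look roughly like the sketch below; the distinct input_file_name() count is only a way to check how many of the ~1600 files Spark actually picked up (a sketch, not a guaranteed fix):
from pyspark.sql.functions import input_file_name

raw_df = (
    spark.read
         .option("recursiveFileLookup", "true")
         .json("/mnt/testdatabricks/metrics-raw/")
)

# Compare this count against the ~1600 files expected on disk.
print(raw_df.select(input_file_name()).distinct().count())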

dataframe results are not returned while reading csv file

I'm trying to read a CSV file; below is the code I used, but it is not returning any results. The CSV file at the specified path does have data in it. I had an issue when I used ValidFile = spark.read.csv(ValidationFileDest, header = True): that did return a result, but the column data was interchanged and nulls were assigned, which is why I applied mode DROPMALFORMED in my code. But now it is not returning any result.
parquetextension = ".parquet"
BronzeStage_Path = "dbfs:/mnt/bronze/stage/" + parentname + "/" + filename
# validated_path = "dbfs:/mnt/bronze/landing/ClaimDenialsSouce/" + parentname + "/" + "current/" + "Valid/" + todayDate + "_" + "CDAValidFile" + extension
# df_sourcefilevalid.repartition(1).write.format(write_format).option("header", "true").save(BronzeStagePath)
# ValidFileSrc_BS = get_csv_files(exception_path)
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("parquet_example") \
    .getOrCreate()
spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", False)
ValidFile = spark.read.format('csv').option("mode", "DROPMALFORMED").options(header='true', inferSchema='true').load(ValidationFileDest)
display(ValidFile)
Make sure you are providing the correct file path or variable for your CSV file. I have reproduced this in our environment and was able to read the CSV file without any issue.
Reading the CSV file:
filepath="dbfs:/FileStore/test11-1.csv"
df11 = spark.read.format("csv").option("mode", "DROPMALFORMED").option("header", "true").load(filepath)
display(df11)
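If the columns still come back shifted or filled with nulls, a common cause is a schema or delimiter mismatch rather than the path. One option worth trying is passing an explicit schema instead of relying on inferSchema; the column names and types below are placeholders, not taken from the original file:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical schema: replace the fields with the real columns of the CSV.
schema = StructType([
    StructField("claim_id", StringType(), True),
    StructField("amount", IntegerType(), True),
    StructField("status", StringType(), True),
])

ValidFile = (
    spark.read.format("csv")
         .option("header", "true")
         .option("mode", "DROPMALFORMED")
         .schema(schema)
         .load(ValidationFileDest)
)
display(ValidFile)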

How to merge multiple JSON files reading from S3, convert to single .csv and store in S3?

Input:
There are 5 part JSON files named test_par1.json, test_part2.json, test_part3.json, test_part4.json, test_part5.json in s3://test/json_files/data/.
Expected Output:
A single CSV file.
Explanation: All of the JSON files have the same number of columns and the same structure; they are basically part files of the same source.
I want to merge/repartition all of them, convert them into a CSV file, and store it in S3.
import pandas as pd
import os
import boto3
import numpy
# Boto3 clients
resource = boto3.resource('s3')
client = boto3.client('s3')
session = boto3.session.Session()
bucket = 'test'
path = 'json_files/data/'
delimiter = '/'
suffix = '.json'
json_files = client.list_objects(Bucket=bucket, Prefix=path, Delimiter=delimiter)
# print(json_files)
for obj in json_files['Contents']:
    # print(obj)
    obj = client.get_object(Bucket=bucket, Key=obj['Key'])
    # print(obj)
    df = pd.read_json(obj["Body"], lines=True)
    print(df)
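The code above only reads each part file. A minimal sketch of the remaining steps (collect the frames, concatenate, write one CSV back to S3) could look like the following, reusing the bucket and client from above; the output key 'json_files/data/merged.csv' is illustrative, not from the question:
import io

dfs = []
for obj in json_files['Contents']:
    body = client.get_object(Bucket=bucket, Key=obj['Key'])['Body']
    dfs.append(pd.read_json(body, lines=True))

# Merge all part files into one frame, then serialise it as a single CSV.
merged = pd.concat(dfs, ignore_index=True)
csv_buffer = io.StringIO()
merged.to_csv(csv_buffer, index=False)

# Illustrative output key; adjust to wherever the single CSV should live.
client.put_object(Bucket=bucket, Key='json_files/data/merged.csv',
                  Body=csv_buffer.getvalue())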

how to edit large json files with pandas?

I have a 200 MB txt file which contains roughly 25k JSON objects (metadata and the content of newspaper articles). Now I want to manipulate the data so that the file is smaller and only contains the data relevant for my analysis (only 3 out of 16 columns).
Question:
How do I delete/drop columns in a pandas dataframe and save these changes to the .json file?
JSON:
{"_version_":1609422219455234049,
"content": " abc ",
"docType":"shNews",
"id":"SNW_000050a3-38c6-4794-8e73-3ab3464be248",
"publishDate":"2017-08-16T16:01:018Z",
"stakeholderId":482,
"status":"BlackListed",
"systemDate":"2017-08-16T17:42:010Z"
"tags2":"type_de_Institution;subtype_de_Administration;industry_de_Staat;continent_de_Europa;country_de_Deutschland;level_de_National;highrelevance_eu_0;"
,"title":"Waffen schaffen keine Sicherheit. Von Außenminister Sigmar Gabriel",
"url":"http://www.auswaertiges-amt.de/sid_A5AB4A9D659FF8612B357392137BE7EB/DE/Infoservice/Presse/Interviews/2017/170816-BM_Rheinische_Post.html"}
Code:
import pandas as pd
articles=pd.read_json('/Users/Flo/export_harnisch.json', lines=True, orient='columns')
print (type (articles))
df = pd.DataFrame(articles)
df[df['tags2'].str.contains('country_de_Deutschland')==True]
I already tried this:
df.to_json ("example_name.json")
The actual result of that line is a JSON file which is larger than the original, and Atom cannot open it. Moreover, the changes I made in the dataframe (deleting/dropping columns) are not applied to the .json file on my PC.
import pandas as pd
df = pd.read_json('/Users/Flo/export_harnisch.json', lines=True, orient='columns')
# read_json should convert things into dataframe already
print(type(df))
# you forgot to reassign df
df = df[df['tags2'].str.contains('country_de_Deutschland')==True]
df.to_json("example_name.json")

Reading a big JSON file with multiple objects in Python

I have a big GZ compressed JSON file where each line is a JSON object (i.e. a python dictionary).
Here is an example of the first two lines:
{"ID_CLIENTE":"o+AKj6GUgHxcFuaRk6/GSvzEWRYPXDLjtJDI79c7ccE=","ORIGEN":"oaDdZDrQCwqvi1YhNkjIJulA8C0a4mMZ7ESVlEWGwAs=","DESTINO":"OOcb8QTlctDfYOwjBI02hUJ1o3Bro/ir6IsmZRigja0=","PRECIO":0.0023907284768211919,"RESERVA":"2015-05-20","SALIDA":"2015-07-26","LLEGADA":"2015-07-27","DISTANCIA":0.48962542317352847,"EDAD":"19","sexo":"F"}{"ID_CLIENTE":"WHDhaR12zCTCVnNC/sLYmN3PPR3+f3ViaqkCt6NC3mI=","ORIGEN":"gwhY9rjoMzkD3wObU5Ito98WDN/9AN5Xd5DZDFeTgZw=","DESTINO":"OOcb8QTlctDfYOwjBI02hUJ1o3Bro/ir6IsmZRigja0=","PRECIO":0.001103046357615894,"RESERVA":"2015-04-08","SALIDA":"2015-07-24","LLEGADA":"2015-07-24","DISTANCIA":0.21382548869717155,"EDAD":"13","sexo":"M"}
So, I'm using the following code to read each line into a Pandas DataFrame:
import json
import gzip
import pandas as pd
import random
with gzip.GzipFile('data/000000000000.json.gz', 'r',) as fin:
    data_lan = pd.DataFrame()
    for line in fin:
        data_lan = pd.DataFrame([json.loads(line.decode('utf-8'))]).append(data_lan)
But it's taking years.
Any suggestion to read the data quicker?
EDIT:
Finally what solved the problem:
import json
import gzip
import pandas as pd
with gzip.GzipFile('data/000000000000.json.gz', 'r',) as fin:
    data_lan = []
    for line in fin:
        data_lan.append(json.loads(line.decode('utf-8')))
data = pd.DataFrame(data_lan)
I've worked on a similar problem myself; append() is quite slow. I generally use a list of dicts to load the JSON file and then create the DataFrame at once. This way you keep the flexibility of a list, and only when you're sure about the data do you convert it into a DataFrame. Below is an implementation of the concept:
import json
import pandas as pd
import gzip


def get_contents_from_json(file_path) -> dict:
    """
    Reads the contents of the json file into a dict
    :param file_path:
    :return: A dictionary of all contents in the file.
    """
    try:
        with gzip.open(file_path) as file:
            contents = file.read()
        return json.loads(contents.decode('UTF-8'))
    except json.JSONDecodeError:
        print('Error while reading json file')
    except FileNotFoundError:
        print(f'The JSON file was not found at the given path: \n{file_path}')


def main(file_path: str):
    file_contents = get_contents_from_json(file_path)
    if not isinstance(file_contents, list):
        # I've assumed you have a JSON Array in your file
        # if not, let me know in the comments
        raise TypeError("The file doesn't have a JSON Array!!!")
    all_columns = file_contents[0].keys()
    data_frame = pd.DataFrame(columns=all_columns, data=file_contents)
    print(f'Loaded {int(data_frame.size / len(all_columns))} Rows', 'Done!', sep='\n')


if __name__ == '__main__':
    main(r'C:\Users\carrot\Desktop\dummyData.json.gz')
A pandas DataFrame fits into a contiguous block of memory, which means that pandas needs to know the size of the data set when the frame is created. Since append changes the size, new memory must be allocated and the original plus the new data are copied in. As your set grows, the copy gets bigger and bigger.
You can use from_records to avoid this problem. First, you need to know the row count, and that means scanning the file. You could potentially cache that number if you do this often, but it's a relatively fast operation. Now that you have the size, pandas can allocate the memory efficiently.
# count rows
with gzip.GzipFile(file_to_test, 'r',) as fin:
    row_count = sum(1 for _ in fin)

# build dataframe from records, parsing each line so from_records gets dicts rather than raw bytes
with gzip.GzipFile(file_to_test, 'r',) as fin:
    data_lan = pd.DataFrame.from_records((json.loads(line) for line in fin), nrows=row_count)
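For comparison, pandas can also read a gzipped JSON Lines file directly in one call; a minimal sketch, assuming each line really is a single JSON object:
import pandas as pd

# read_json handles the gzip decompression and the line-by-line parsing itself
data_lan = pd.read_json('data/000000000000.json.gz', lines=True, compression='gzip')
print(data_lan.shape)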

pyspark csv at url to dataframe, without writing to disk

How can I read a csv at a url into a dataframe in Pyspark without writing it to disk?
I've tried the following with no luck:
import urllib.request
from io import StringIO
url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv"
response = urllib.request.urlopen(url)
data = response.read()
text = data.decode('utf-8')
f = StringIO(text)
df1 = sqlContext.read.csv(f, header = True, schema=customSchema)
df1.show()
TL;DR It is not possible, and in general transferring data through the driver is a dead end.
Before Spark 2.3, the csv reader can read only from a URI (and http is not supported).
In Spark 2.3 you can use an RDD:
spark.read.csv(sc.parallelize(text.splitlines()))
but data will be written to disk.
You can also createDataFrame from Pandas:
spark.createDataFrame(pd.read_csv(url))
but this once again writes to disk.
If the file is small, I'd just use SparkFiles:
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
spark.read.csv(SparkFiles.get("iris.csv"), header=True)
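Pulling the SparkFiles pieces together, a runnable sketch for the iris URL from the question; inferSchema is used here in place of the asker's customSchema, which isn't shown:
from pyspark import SparkFiles

url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv"

# Download the file once into Spark's temp space...
spark.sparkContext.addFile(url)

# ...then read it like any local CSV.
df1 = spark.read.csv(SparkFiles.get("iris.csv"), header=True, inferSchema=True)
df1.show()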