Takes forever to read and flatten 100K+ json files

Takes forever to read and flatten 100K+ json files - json

I have large number of json file that i have to read, flatten and filter (select specific key-value). below is the code i am using
for i in range(len(json_files)): # is the list of json files in directory
text = []
for index, js in enumerate(json_files):
with open(os.path.join(path_to_json, js)) as json_file: # path_to_json is json directory
jtext = json.load(json_file)
json_text=jtext['KEY_TO_FILTER']
text_json=flatten_data(json_text) # function i'm using to select specific key in json
text.append(text_json)
i really appreciate hint how to use multiprocessing. any other suggestion ow to speed up the process is very welcomed

Related

How can I write to JSON file, without deleting all the content in it?

import json
f = open("filename.json", "w")
data = {"username": "justausername"}
json.dump(data, f)
When I run this code, all the data in the "filename.json" is replaced by "{'username': 'justausername'}". Please help!

Read the file
Parse the JSON to a data structure
Modify the data structure instead of creating a new one
Serialise the data structure back to JSON
Write it to the file
Consider using a real database instead so that you get benefits like automatic protection for concurrent edits. SQLite is a good choice if you want a single file to store the data in.

import json
with open("filename.json", "r") as f: # reading a file
data = json.load(f) # deserialization
data["username"] = "justausername" # modifying the python object
with open("filename.json", "w") as f:
json.dump(data, f) # serializing back to the original file

Fast processing json using Python3.x from s3

I have json files on s3 like:
{'key1':value1, 'key2':'value2'}{'key1':value1, 'key2':'value2'}{'key1':value1, 'key2':'value2'}
{'key1':value1, 'key2':'value2'}{'key1':value1, 'key2':'value2'}{'key1':value1, 'key2':'value2'}
The structure is not an array, concatenated jsons without any newlines. There are 1000s of files from which I need only a couple of fields. How can I process them fast?
I will use this on AWS Lambda.
The code I am thinking of is somewhat like this:
data_chunk = data_file.read()
recs = data_chunk.split('}')
json_recs = []
# This part onwards it becomes inefficient where I have to iterate every record
for rec in recs:
json_recs.append(json.loads(rec + '}'))
# Extract Individual fields
How can this be improved? Will using Pandas dataframe help? Individual files are small about 128 MB.

S3 Select supports this JSON Lines structure. You can query it with a SQL-like langugage. It's fast and cheap.

Spark - How to write a single csv file WITHOUT folder?

Suppose that df is a dataframe in Spark. The way to write df into a single CSV file is
df.coalesce(1).write.option("header", "true").csv("name.csv")
This will write the dataframe into a CSV file contained in a folder called name.csv but the actual CSV file will be called something like part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv.
I would like to know if it is possible to avoid the folder name.csv and to have the actual CSV file called name.csv and not part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv. The reason is that I need to write several CSV files which later on I will read together in Python, but my Python code makes use of the actual CSV names and also needs to have all the single CSV files in a folder (and not a folder of folders).
Any help is appreciated.

A possible solution could be convert the Spark dataframe to a pandas dataframe and save it as csv:
df.toPandas().to_csv("<path>/<filename>")
EDIT: As caujka or snark suggest, this works for small dataframes that fits into driver. It works for real cases that you want to save aggregated data or a sample of the dataframe. Don't use this method for big datasets.

If you want to use only the python standard library this is an easy function that will write to a single file. You don't have to mess with tempfiles or going through another dir.
import csv
def spark_to_csv(df, file_path):
""" Converts spark dataframe to CSV file """
with open(file_path, "w") as f:
writer = csv.DictWriter(f, fieldnames=df.columns)
writer.writerow(dict(zip(fieldnames, fieldnames)))
for row in df.toLocalIterator():
writer.writerow(row.asDict())

If the result size is comparable to spark driver node's free memory, you may have problems with converting the dataframe to pandas.
I would tell spark to save to some temporary location, and then copy the individual csv files into desired folder. Something like this:
import os
import shutil
TEMPORARY_TARGET="big/storage/name"
DESIRED_TARGET="/export/report.csv"
df.coalesce(1).write.option("header", "true").csv(TEMPORARY_TARGET)
part_filename = next(entry for entry in os.listdir(TEMPORARY_TARGET) if entry.startswith('part-'))
temporary_csv = os.path.join(TEMPORARY_TARGET, part_filename)
shutil.copyfile(temporary_csv, DESIRED_TARGET)
If you work with databricks, spark operates with files like dbfs:/mnt/..., and to use python's file operations on them, you need to change the path into /dbfs/mnt/... or (more native to databricks) replace shutil.copyfile with dbutils.fs.cp.

A more databricks'y' solution is here:
TEMPORARY_TARGET="dbfs:/my_folder/filename"
DESIRED_TARGET="dbfs:/my_folder/filename.csv"
spark_df.coalesce(1).write.option("header", "true").csv(TEMPORARY_TARGET)
temporary_csv = os.path.join(TEMPORARY_TARGET, dbutils.fs.ls(TEMPORARY_TARGET)[3][1])
dbutils.fs.cp(temporary_csv, DESIRED_TARGET)
Note if you are working from Koalas data frame you can replace spark_df with koalas_df.to_spark()

For pyspark, you can convert to pandas dataframe and then save it.
df.toPandas().to_csv("<path>/<filename.csv>", header=True, index=False)

There is no dataframe spark API which writes/creates a single file instead of directory as a result of write operation.
Below both options will create one single file inside directory along with standard files (_SUCCESS , _committed , _started).
1. df.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header",
"true").csv("PATH/FOLDER_NAME/x.csv")
2. df.repartition(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header",
"true").csv("PATH/FOLDER_NAME/x.csv")
If you don't use coalesce(1) or repartition(1) and take advantage of sparks parallelism for writing files then it will create multiple data files inside directory.
You need to write function in driver which will combine all data file parts to single file(cat part-00000* singlefilename ) once write operation is done.

I had the same problem and used python's NamedTemporaryFile library to solve this.
from tempfile import NamedTemporaryFile
s3 = boto3.resource('s3')
with NamedTemporaryFile() as tmp:
df.coalesce(1).write.format('csv').options(header=True).save(tmp.name)
s3.meta.client.upload_file(tmp.name, S3_BUCKET, S3_FOLDER + 'name.csv')
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html for more info on upload_file()

Create temp folder inside output folder. Copy file part-00000* with the file name to output folder. Delete the temp folder. Python code snippet to do the same in Databricks.
fpath=output+'/'+'temp'
def file_exists(path):
try:
dbutils.fs.ls(path)
return True
except Exception as e:
if 'java.io.FileNotFoundException' in str(e):
return False
else:
raise
if file_exists(fpath):
dbutils.fs.rm(fpath)
df.coalesce(1).write.option("header", "true").csv(fpath)
else:
df.coalesce(1).write.option("header", "true").csv(fpath)
fname=([x.name for x in dbutils.fs.ls(fpath) if x.name.startswith('part-00000')])
dbutils.fs.cp(fpath+"/"+fname[0], output+"/"+"name.csv")
dbutils.fs.rm(fpath, True)

You can go with pyarrow, as it provides file pointer for hdfs file system. You can write your content to file pointer as a usual file writing. Code example:
import pyarrow.fs as fs
HDFS_HOST: str = 'hdfs://<your_hdfs_name_service>'
FILENAME_PATH: str = '/user/your/hdfs/file/path/<file_name>'
hadoop_file_system = fs.HadoopFileSystem(host=HDFS_HOST)
with hadoop_file_system.open_output_stream(path=FILENAME_PATH) as f:
f.write("Hello from pyarrow!".encode())
This will create a single file with the specified name.
To initiate pyarrow you should define environment CLASSPATH properly, set the output of hadoop classpath --glob to it

df.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")
you can use this and if you don't want to give the name of CSV everytime you can write UDF or create an array of the CSV file name and give it to this it will work

Reading all CSV files in current working directory into pandas with correct filenames

I'm trying to use a loop to read in multiple CSVs (for now but mix of that and xls in the future).
I'd like each data frame in pandas to be the same name excluding file extension in my folder.
import os
import pandas as pd
files = filter(os.path.isfile, os.listdir( os.curdir ) )
files # this shows a list of the files that I want to use/have in my directory- they are all CSVs if that matters
# i want to load these into pandas data frames with the corresponding filenames
# not sure if this is the right approach....
# but what is wrong is the variable is named 'weather_today.csv'... i need to drop the .csv or .xlsx or whatever it might be
for each_file in files:
frame = pd.read_csv( each_file)
each_file = frame
Bernie seems to be great but one problem:
or each_file in files:
frame = pd.read_csv(each_file)
filename_only = os.path.splitext(each_file)[0]
# Right below I am assigning my looped data frame the literal variable name of "filename_only" rather than the value that filename_only represents
#rather than what happens if I print(filename_only)
filename_only = frame
for example if my two files are weather_today, earthquakes.csv (in that order) in my files list, then both 'earthquakes' and 'weather' will not be created.
however, if I simply type 'filename_only' and click the enter key in python - then I will see the earthquake dataframe. If I have 100 files, then the last data frame name in the list loop will be titled 'filename_only' and the other 99 won't because the previous assignments are never made and the 100th one overwrites them.

You can use os.path.splitext() for this to "split the pathname path into a pair (root, ext) such that root + ext == path, and ext is empty or begins with a period and contains at most one period."
for each_file in files:
frame = pd.read_csv(each_file)
filename_only = os.path.splitext(each_file)[0]
filename_only = frame
As asked in a comment we would like a way to filter for just CSV files so you can do something like this:
files = [file for file in os.listdir( os.curdir ) if file.endswith(".csv")]

Use a dictionary to store your frames:
frames = {}
for each_file in files:
frames[os.path.splitext(each_file)[0]] = pd.read_csv(each_file)
Now you can get the DataFrame of your choice with:
frames[filename_without_ext]
Simple, right? Be careful about RAM usage though, reading a bunch of files can quickly fill up system memory and cause a crash.

Read and parse a >400MB .json file in Julia without crashing kernel

The following is crashing my Julia kernel. Is there a better way to read and parse a large (>400 MB) JSON file?
using JSON
data = JSON.parsefile("file.json")

Unless some effort is invested into making a smarter JSON parser, the following might work: There is a good chance file.json has many lines. In this case, reading the file and parsing a big repetitive JSON section line-by-line or chunk-by-chuck (for the right chunk length) could do the trick. A possible way to code this, would be:
using JSON
f = open("file.json","r")
discard_lines = 12 # lines up to repetitive part
important_chunks = 1000 # number of data items
chunk_length = 2 # each data item has a 2-line JSON chunk
for i=1:discard_lines
l = readline(f)
end
for i=1:important_chunks
chunk = join([readline(f) for j=1:chunk_length])
push!(thedata,JSON.parse(chunk))
end
close(f)
# use thedata
There is a good chance this could be a temporary stopgap solution for your problem. Inspect file.json to find out.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

Takes forever to read and flatten 100K+ json files - json

Related

How can I write to JSON file, without deleting all the content in it?

Fast processing json using Python3.x from s3

Spark - How to write a single csv file WITHOUT folder?

Reading all CSV files in current working directory into pandas with correct filenames

Read and parse a >400MB .json file in Julia without crashing kernel

Categories

Resources