Dump Pandas Dataframe to multiple json files - json

I have a DataFrame loaded from a CSV file:
df = pd.read_csv(input_file, header=0)
and I want to process it and eventually save it into multiple JSON files (for example a new file every X rows).
Any suggestions on how to achieve that?

This should work:
import numpy as np

for idx, group in df.groupby(np.arange(len(df)) // 10):
    group.to_json(f'{idx}_name.json', orient='index')  # orient options: split, records, index, values, table, columns
Change the 10 to the number of rows you wish to write out for each iteration.
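If you prefer not to build a grouping key, a roughly equivalent sketch using plain iloc slicing; chunk_size and the output file names here are assumptions, and input_file is the variable from the question above:
import pandas as pd

chunk_size = 10  # assumed number of rows per output file
df = pd.read_csv(input_file, header=0)

for idx, start in enumerate(range(0, len(df), chunk_size)):
    df.iloc[start:start + chunk_size].to_json(f'{idx}_name.json', orient='index')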

Related

Fast processing json using Python3.x from s3

I have json files on s3 like:
{'key1':value1, 'key2':'value2'}{'key1':value1, 'key2':'value2'}{'key1':value1, 'key2':'value2'}
{'key1':value1, 'key2':'value2'}{'key1':value1, 'key2':'value2'}{'key1':value1, 'key2':'value2'}
The structure is not an array; it is just concatenated JSON objects without any newlines. There are thousands of files, and I need only a couple of fields from each. How can I process them quickly?
I will use this on AWS Lambda.
The code I am thinking of is somewhat like this:
data_chunk = data_file.read()
recs = data_chunk.split('}')
json_recs = []
# From this point on it becomes inefficient, since I have to iterate over every record
for rec in recs:
    if rec.strip():  # guard against the empty trailing element left by split
        json_recs.append(json.loads(rec + '}'))
# Extract individual fields
How can this be improved? Will using a Pandas dataframe help? Individual files are small, about 128 MB each.
S3 Select supports this JSON Lines structure. You can query it with a SQL-like language. It's fast and cheap.
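A minimal sketch of what that could look like with boto3's select_object_content; the bucket, key, and field names are assumptions, and the JSON Type may need to be DOCUMENT rather than LINES depending on how the objects are laid out:
import boto3

s3 = boto3.client('s3')

# Bucket, key, and selected fields are placeholders; adjust to your data.
response = s3.select_object_content(
    Bucket='my-bucket',
    Key='path/to/data.json',
    ExpressionType='SQL',
    Expression="SELECT s.key1, s.key2 FROM S3Object s",
    InputSerialization={'JSON': {'Type': 'LINES'}},
    OutputSerialization={'JSON': {}},
)

# The result arrives as an event stream; Records events carry the payload bytes.
for event in response['Payload']:
    if 'Records' in event:
        print(event['Records']['Payload'].decode('utf-8'))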

PySpark Querying Multiple JSON Files

I have loaded into Spark 2.2.0 many JSONL files (the structure is the same for all of them) contained in a directory, using the following (PySpark) commands:
df = spark.read.json(mydirectory)
df.createGlobalTempView("MyDatabase")
sqlDF = spark.sql("SELECT count(*) FROM MyDatabase")
sqlDF.show()
The loading works, but when I query sqlDF (sqlDF.show()), it seems that Spark counts the rows of just one file (the first?) rather than those of all of them. I assumed that "MyDatabase" would be a dataframe containing all the files.
What am I missing?
If I load just one file consisting of a single line of multiple JSON objects {...}, Spark can properly identify the tabular structure. If I have more than one file, I have to put each {} on a new line to get the same result.
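For reference, Spark's JSON reader expects line-delimited JSON by default, which matches the one-object-per-line behaviour described above. A minimal sketch of the two reading modes; the path is a placeholder, multiLine is available from Spark 2.2, and a regular temp view is used so the query does not need the global_temp prefix:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default mode: one JSON object per line (JSON Lines) across every file in the directory.
df_lines = spark.read.json("/path/to/mydirectory")

# For files that each contain a single JSON document spread over multiple lines (Spark 2.2+).
df_multiline = spark.read.option("multiLine", "true").json("/path/to/mydirectory")

df_lines.createOrReplaceTempView("MyDatabase")
spark.sql("SELECT count(*) FROM MyDatabase").show()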

Spark pipeline optimizations :Coalesce over filtered dataset

I'm worried about what Spark (1.5.1) does in the background when I filter a dataset and then perform a coalesce. I want to generate a CSV file with a subset of my dataset. To do that I execute:
df = sqlContext.read.parquet('...') # huge dataset
df_ids = sqlContext.read.parquet('...') # dataframe with the wanted IDs
df = df.join(df_ids, 'id', 'inner') # keep only rows with wanted ids from df
df = df.dropna().filter( df.Ti % 5 == 0 ) # only 2 in 10 rows
Now, if I do the coalesce and save as CSV, the process fails because it can't reserve enough memory:
df.coalesce(1)\
.write.format('com.databricks.spark.csv')\
.mode('overwrite')\
.option('header', 'true')\
.save('/tmp/my.csv')
Instead, if I save the dataframe, load it back, and then do the coalesce, it works like magic:
df.write.parquet('/tmp/df.parquet')
sqlContext.read.parquet('/tmp/df.parquet').coalesce(1)\
.write.format('com.databricks.spark.csv')\
.mode('overwrite')\
.option('header', 'true')\
.save('/tmp/my.csv')
It seems to me that Spark applies the coalesce(1) before the other operations, so the whole dataset can't fit in memory. Is that really what is happening? Can I avoid it?
Thank you!
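One commonly suggested workaround, sketched below, is to use repartition(1) instead of coalesce(1): repartition forces a shuffle, so the join and filter above still run with full parallelism before the data is merged into a single partition. This is a sketch of the general technique, not a tested fix for this particular job:
# repartition(1) is coalesce(1) with a shuffle: upstream stages keep their parallelism.
df.repartition(1)\
    .write.format('com.databricks.spark.csv')\
    .mode('overwrite')\
    .option('header', 'true')\
    .save('/tmp/my.csv')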

Apache Spark - Redundant JSON output after Spark SQL query

"I have done the following:
DataFrame input = sqlContext.read().json("/path/t*json");
DataFrame res = sqlContext.sql("SELECT *, Some_Calculation calc FROM input GROUP BY user_id");
res.write().json("/path/result.json");
This reads all the JSON files in the directory whose names start with 't'. So far, OK. But the output of this program is not a single JSON file, as I intended, but a directory called result.json containing as many .crc and output files as there are input files. Since my calculations are grouped by user_id, the aggregate results check out, but I get as many output files as I have input files, and since the aggregate calculations have the same result, many of those files are identical.
First of all, how can I generate a single JSON file as the output?
Second of all, how can I get rid of the redundancy?
Thank you!
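For reference, the usual way to cut the output down to one part file is to collapse to a single partition before writing. A minimal sketch, shown in PySpark rather than the Java API used above; note that Spark still writes a result.json directory, so the single part-* file inside it has to be moved or renamed afterwards if a bare file is required:
# Sketch only: res corresponds to the res DataFrame above.
res.coalesce(1).write.mode('overwrite').json('/path/result.json')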

Python 3 code to read a CSV file, manipulate it, then create a new file... works, but looking for improvements

This is my first ever post here. I am trying to learn a bit of Python. Using Python 3 and numpy.
I did a few tutorials, then decided to dive in and try a little project I might find useful at work, as that's a good way for me to learn.
I have written a program that reads in data from a CSV file which has a few rows of headers. I then extract certain columns from that file based on the header names and output them to a new CSV file in a particular format.
The program I have works fine and does what I want, but as I'm a newbie I would like some tips on how I can improve my code.
My main data file (CSV) is about 57 columns wide and about 36 rows deep, so not big.
It works fine, but I'm looking for advice and improvements.
import csv
import numpy as np
#make some arrays...at least I think that's what this does
A=[]
B=[]
keep_headers=[]
#open the main data csv file 'map.csv'...need to check what 'r' means
input_file = open('map.csv','r')
#read the contents of the file into 'data'
data=csv.reader(input_file, delimiter=',')
#skip the first 2 header rows as they are junk
next(data)
next(data)
#read in the next line as the 'header'
headers = next(data)
#Now read in the numeric data (float) from the main csv file 'map.csv'
A=np.genfromtxt('map.csv',delimiter=',',dtype='float',skiprows=5)
#Get the length of a column in A
Alen=len(A[:,0])
#now read the column header values I want to keep from 'keepheader.csv'
keep_headers=np.genfromtxt('keepheader.csv',delimiter=',',dtype='unicode_')
#Get the length of keep headers....i.e. how many headers I'm keeping.
head_len=len(keep_headers)
#Now loop round extracting all the columns with the keep header titles and
#append them to array B
i=0
while i < head_len:
    #use index to find the appropriate column number.
    item_num=headers.index(keep_headers[i])
    #append the selected column to array B
    B=np.append(B,A[:,item_num])
    i=i+1
#now reshape the B array
B=np.reshape(B,(head_len,36))
#now transpose it as that's the format I want.
B=np.transpose(B)
#save the array B back to a new csv file called 'cmap.csv'
np.savetxt('cmap.csv',B,fmt='%.3f',delimiter=",")
Thanks.
You can greatly simplify your code by using more of numpy's capabilities.
import numpy as np

# Read everything as strings, skipping the two junk header rows.
A = np.loadtxt('stack.txt', skiprows=2, delimiter=',', dtype=str)
keep_headers = np.loadtxt('keepheader.csv', delimiter=',', dtype=str)
headers = A[0, :]                              # first remaining row holds the column headers
cols_to_keep = np.in1d(headers, keep_headers)  # boolean mask of the columns to keep
B = np.float_(A[1:, cols_to_keep])             # data rows of the kept columns, as floats
np.savetxt('cmap.csv', B, fmt='%.3f', delimiter=",")
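If pandas is an option, a similar result can be sketched with read_csv and usecols; the file names follow the question, and skiprows is an assumption that may need adjusting so the header row lines up with the numeric data:
import pandas as pd

# File names and row offsets follow the question's description; adjust skiprows
# so the header row is read and any remaining non-data rows are dropped.
keep = pd.read_csv('keepheader.csv', header=None).iloc[0].tolist()
df = pd.read_csv('map.csv', skiprows=2, usecols=keep)
df.to_csv('cmap.csv', index=False, float_format='%.3f')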