Create Multiple .json Files from an Excel file with multiple sheets using Pandas

I'm trying to convert a large number of Excel files with multiple sheets (some of them also very big) to .json files. I created a list with the names of the sheets, then looped over it to create a dataframe for each sheet and write that dataframe to a .json file. My code is:
from zipfile import ZipFile
from bs4 import BeautifulSoup
import pandas as pd
file = 'filename.xlsx'
# read the sheet names from the workbook's XML
with ZipFile(file) as zipped_file:
    summary = zipped_file.open(r'xl/workbook.xml').read()
soup = BeautifulSoup(summary, "xml")
sheets = [sheet.get("name") for sheet in soup.find_all("sheet")]
# read each sheet into a dataframe and write it out as JSON
for i in sheets:
    df = pd.read_excel(file, sheet_name=i, header=1)
    df.to_json("{}.json".format(i))
This code works like a charm when the sheets are not very big. When I run it on an Excel file, it creates the JSON files I want up to the point where it hits a very big sheet with a lot of data, and then it crashes.
So my question is: is there a different, more efficient way to do this without crashing the program? When I run the df = pd.read_excel command separately for each sheet it works without a problem, but I need this to happen in a loop.

Import numpy. Declare an empty numpy array, out_array. Then, given a list of paths, paths, for each path in paths, read the file into a temporary dataframe, temp_df, get the values of the temporary dataframe via its .values attribute, store them in a temporary numpy array, temp_array, and concatenate out_array and temp_array using numpy.concatenate.
Once the loop completes, convert out_array to a dataframe, out_df, using pandas.DataFrame, and finally set the column names for your new dataframe.
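A minimal sketch of that procedure, assuming paths is a list of files readable by pd.read_excel and that they all share the same columns (the file and column names below are placeholders):
import numpy as np
import pandas as pd
paths = ['sheet1.xlsx', 'sheet2.xlsx']        # placeholder list of input files
column_names = ['col_a', 'col_b', 'col_c']    # placeholder column names for the final frame
out_array = np.empty((0, len(column_names)))  # empty array with the right number of columns
for path in paths:
    temp_df = pd.read_excel(path, header=1)   # read one file into a temporary dataframe
    temp_array = temp_df.values               # .values is an attribute, not a method
    out_array = np.concatenate((out_array, temp_array))
out_df = pd.DataFrame(out_array, columns=column_names)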

Related

How to convert multiple nested JSON files into single CSV file using python?

I have about 200 nested JSON files with nesting levels varying from one to three. Each JSON file consists of more than a thousand data points, and the keys are the same in all the files. My objective is to combine the data from all the files into a tabular format in a single CSV file so that I can read and analyze all the data. I am looking for simple Python code with a brief explanation of each step to help me understand the whole sequence.
You can use this code snippet.
First of all, install pandas:
pip install pandas
After that, you can use this code to convert the JSON files to CSV.
# code to save all data to a single file
import pandas as pd
import glob
path = './path to directory/*.json'
files = glob.glob(path)
data_frames = []
for file in files:
    with open(file, 'r') as f:
        data_frames.append(pd.read_json(f))
pd.concat(data_frames).to_csv("data.csv")
# code to save each JSON file's data to its own CSV file
import pandas as pd
import glob
path = './path to directory/*.json'
files = glob.glob(path)
for file in files:
    with open(file, 'r') as f:
        jsonData = pd.read_json(f.read())
        jsonData.to_csv(f.name + ".csv")

Zapier Code Step Model Data into CSV

I'm looking for help with some JavaScript to insert inside of a code step in Zapier. I have two inputs that are named/look like the following:
RIDS: 991,992,993
LineIDs: 1,2,3
The two lists always contain the same number of items; there can be 1, 2, or 100 of them, and the order is significant.
What I'm looking for is a code step to model the data into one CSV matching up the positions of each. So using the above data, my output would look like this:
991,1
992,2
993,3
Does anyone have code or easily know how to achieve this? I am not a JavaScript developer.
Zapier doesn't allow you to create files in a code step. You can, though, use the code step to generate text which can then be used in another step. I used Python for my example (I'm not as familiar with JavaScript, but the strategy is the same).
Create CSV file in Zapier from Raw Data
Code Step with LineIDs and RIDs as inputs
import csv
import io
# Convert inputs into lists
lids = input_data['LineIDs'].split(',')
rids = input_data['RIDs'].split(',')
# Create file-like CSV object
csvfile = io.StringIO()
filewriter = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
# Write CSV rows
filewriter.writerow(['LineID', 'RID'])
for x in range(len(lids)):
    filewriter.writerow([lids[x], rids[x]])
# Get CSV object value as text and set to output
output = {'text': csvfile.getvalue()}
Use a Google Drive step to Create File from Text
File Content = Text from Step 1
Convert to Document = no
This will create a *.txt document
Use a CloudConvert step to Convert File from txt to csv.

Spark - How to write a single csv file WITHOUT folder?

Suppose that df is a dataframe in Spark. The way to write df into a single CSV file is
df.coalesce(1).write.option("header", "true").csv("name.csv")
This will write the dataframe into a CSV file contained in a folder called name.csv but the actual CSV file will be called something like part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv.
I would like to know if it is possible to avoid the folder name.csv and to have the actual CSV file called name.csv rather than part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv. The reason is that I need to write several CSV files which I will later read together in Python, and my Python code relies on the actual CSV names and also needs all the individual CSV files to sit in one folder (not a folder of folders).
Any help is appreciated.
A possible solution is to convert the Spark dataframe to a pandas dataframe and save it as CSV:
df.toPandas().to_csv("<path>/<filename>")
EDIT: As caujka and snark suggest, this only works for small dataframes that fit into the driver's memory. It is fine for real cases where you want to save aggregated data or a sample of the dataframe, but don't use this method for big datasets.
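If you go this route, a defensive sketch could guard the collect; the row threshold below is an arbitrary assumption you would tune to your driver's memory:
# rough guard before collecting to the driver; df.count() costs an extra pass over the data
row_count = df.count()
if row_count <= 1000000:  # made-up threshold, adjust for your driver memory
    df.toPandas().to_csv("<path>/<filename>", index=False)
else:
    raise ValueError("result has {} rows, probably too large to collect on the driver".format(row_count))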
If you want to use only the Python standard library, this is an easy function that will write to a single file. You don't have to mess with temp files or go through another directory.
import csv
def spark_to_csv(df, file_path):
    """Converts a Spark dataframe to a single CSV file without collecting it all at once."""
    with open(file_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=df.columns)
        writer.writeheader()  # write the header row from df.columns
        for row in df.toLocalIterator():
            writer.writerow(row.asDict())
If the result size is comparable to the Spark driver node's free memory, you may have problems converting the dataframe to pandas.
I would tell Spark to save to some temporary location and then copy the individual CSV file into the desired location. Something like this:
import os
import shutil
TEMPORARY_TARGET="big/storage/name"
DESIRED_TARGET="/export/report.csv"
df.coalesce(1).write.option("header", "true").csv(TEMPORARY_TARGET)
part_filename = next(entry for entry in os.listdir(TEMPORARY_TARGET) if entry.startswith('part-'))
temporary_csv = os.path.join(TEMPORARY_TARGET, part_filename)
shutil.copyfile(temporary_csv, DESIRED_TARGET)
If you work with Databricks, Spark operates on paths like dbfs:/mnt/..., and to use Python's file operations on them you need to change the path to /dbfs/mnt/... or (more native to Databricks) replace shutil.copyfile with dbutils.fs.cp.
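For illustration, a tiny sketch of that path translation (the mount path is a hypothetical example):
# map a dbfs:/ URI to the /dbfs/ FUSE mount so local file APIs can read it
dbfs_uri = "dbfs:/mnt/my_folder/report.csv"
local_path = dbfs_uri.replace("dbfs:/", "/dbfs/", 1)  # -> "/dbfs/mnt/my_folder/report.csv"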
A more Databricks-native solution is this:
TEMPORARY_TARGET="dbfs:/my_folder/filename"
DESIRED_TARGET="dbfs:/my_folder/filename.csv"
spark_df.coalesce(1).write.option("header", "true").csv(TEMPORARY_TARGET)
temporary_csv = os.path.join(TEMPORARY_TARGET, dbutils.fs.ls(TEMPORARY_TARGET)[3][1])
dbutils.fs.cp(temporary_csv, DESIRED_TARGET)
Note: if you are working from a Koalas dataframe, you can replace spark_df with koalas_df.to_spark().
For PySpark, you can convert to a pandas dataframe and then save it.
df.toPandas().to_csv("<path>/<filename.csv>", header=True, index=False)
There is no Spark dataframe API that writes/creates a single file instead of a directory as the result of a write operation.
Both options below will create one single data file inside a directory, along with the standard marker files (_SUCCESS, _committed, _started).
1. df.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")
2. df.repartition(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")
If you don't use coalesce(1) or repartition(1) and instead take advantage of Spark's parallelism for writing files, it will create multiple data files inside the directory.
In that case you need to write a function in the driver that combines all the data file parts into a single file (e.g. cat part-00000* > singlefilename) once the write operation is done; a sketch of such a helper follows.
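A minimal sketch of such a merge helper, assuming the part files were written to a local directory with header=true (the function name and paths are illustrative):
import glob
def merge_csv_parts(output_dir, merged_path, has_header=True):
    """Concatenate Spark's part-* CSV files into one file, keeping only the first header."""
    part_files = sorted(glob.glob(output_dir + "/part-*"))
    with open(merged_path, "w") as merged:
        for i, part in enumerate(part_files):
            with open(part) as src:
                if has_header and i > 0:
                    next(src, None)  # skip the duplicated header line in later parts
                merged.writelines(src)
# usage: merge_csv_parts("PATH/FOLDER_NAME/x.csv", "PATH/single_file.csv")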
I had the same problem and used Python's NamedTemporaryFile (from the tempfile module) to solve it.
import boto3
from tempfile import NamedTemporaryFile
s3 = boto3.resource('s3')
with NamedTemporaryFile() as tmp:
    df.coalesce(1).write.format('csv').options(header=True).save(tmp.name)
    s3.meta.client.upload_file(tmp.name, S3_BUCKET, S3_FOLDER + 'name.csv')
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html for more info on upload_file()
Create a temp folder inside the output folder, copy the part-00000* file into the output folder under the desired file name, then delete the temp folder. Here is a Python snippet that does this in Databricks.
fpath = output + '/' + 'temp'
def file_exists(path):
    try:
        dbutils.fs.ls(path)
        return True
    except Exception as e:
        if 'java.io.FileNotFoundException' in str(e):
            return False
        else:
            raise
# remove any leftover temp folder, then write the single-partition CSV into it
if file_exists(fpath):
    dbutils.fs.rm(fpath, True)
df.coalesce(1).write.option("header", "true").csv(fpath)
# copy the part file to the desired name, then delete the temp folder
fname = [x.name for x in dbutils.fs.ls(fpath) if x.name.startswith('part-00000')]
dbutils.fs.cp(fpath + "/" + fname[0], output + "/" + "name.csv")
dbutils.fs.rm(fpath, True)
You can go with pyarrow, as it provides a file handle for the HDFS file system, and you can write your content to that handle just like ordinary file writing. Code example:
import pyarrow.fs as fs
HDFS_HOST: str = 'hdfs://<your_hdfs_name_service>'
FILENAME_PATH: str = '/user/your/hdfs/file/path/<file_name>'
hadoop_file_system = fs.HadoopFileSystem(host=HDFS_HOST)
with hadoop_file_system.open_output_stream(path=FILENAME_PATH) as f:
    f.write("Hello from pyarrow!".encode())
This will create a single file with the specified name.
To initialize pyarrow's Hadoop support you need the CLASSPATH environment variable defined properly; set it to the output of hadoop classpath --glob.
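A small sketch of that setup step, assuming the hadoop CLI is on the PATH:
import os
import subprocess
# populate CLASSPATH from `hadoop classpath --glob` before creating the HadoopFileSystem
os.environ["CLASSPATH"] = subprocess.check_output(["hadoop", "classpath", "--glob"], text=True).strip()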
df.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")
You can use this, and if you don't want to give the CSV name every time, you can write a UDF (helper function) or build an array of CSV file names and pass each one to this call; it will work.

Reading all CSV files in current working directory into pandas with correct filenames

I'm trying to use a loop to read in multiple CSVs (for now; a mix of CSV and xls in the future).
I'd like each dataframe in pandas to have the same name as its file in my folder, excluding the file extension.
import os
import pandas as pd
files = filter(os.path.isfile, os.listdir(os.curdir))
files  # this shows a list of the files in my directory that I want to use - they are all CSVs, if that matters
# I want to load these into pandas dataframes with the corresponding filenames
# not sure if this is the right approach....
# the problem is that the variable ends up named 'weather_today.csv'... I need to drop the .csv or .xlsx or whatever it might be
for each_file in files:
    frame = pd.read_csv(each_file)
    each_file = frame
Bernie's approach seems great, but there's one problem:
for each_file in files:
    frame = pd.read_csv(each_file)
    filename_only = os.path.splitext(each_file)[0]
    # Right below I am assigning my looped dataframe to the literal variable name "filename_only"
    # rather than to the value that filename_only represents (what I see if I print(filename_only))
    filename_only = frame
For example, if the two files in my files list are weather_today.csv and earthquakes.csv (in that order), then neither 'weather_today' nor 'earthquakes' is created as a dataframe.
However, if I simply type 'filename_only' and press Enter in Python, I see the earthquakes dataframe. If I have 100 files, only the last dataframe survives under the name 'filename_only'; the other 99 names are never created because each iteration overwrites the previous assignment.
You can use os.path.splitext() for this to "split the pathname path into a pair (root, ext) such that root + ext == path, and ext is empty or begins with a period and contains at most one period."
for each_file in files:
    frame = pd.read_csv(each_file)
    filename_only = os.path.splitext(each_file)[0]
    filename_only = frame
As asked in a comment, if you would like a way to filter for just CSV files, you can do something like this:
files = [file for file in os.listdir( os.curdir ) if file.endswith(".csv")]
Use a dictionary to store your frames:
frames = {}
for each_file in files:
    frames[os.path.splitext(each_file)[0]] = pd.read_csv(each_file)
Now you can get the DataFrame of your choice with:
frames[filename_without_ext]
Simple, right? Be careful about RAM usage, though; reading a bunch of files can quickly fill up system memory and cause a crash.
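Since the question mentions mixing CSV and Excel files later on, here is a hedged sketch of how the dictionary approach could dispatch on the file extension (the extension handling is an assumption, not part of the original answer):
import os
import pandas as pd
# hypothetical extension of the dictionary approach to handle both CSV and Excel files
readers = {'.csv': pd.read_csv, '.xls': pd.read_excel, '.xlsx': pd.read_excel}
frames = {}
for each_file in os.listdir(os.curdir):
    name, ext = os.path.splitext(each_file)
    if ext.lower() in readers:
        frames[name] = readers[ext.lower()](each_file)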

Python 3 code to read CSV file, manipulate then create new file....works, but looking for improvements

This is my first ever post here. I am trying to learn a bit of Python, using Python 3 and numpy.
I did a few tutorials, then decided to dive in and try a little project I might find useful at work, as that's a good way for me to learn.
I have written a program that reads in data from a CSV file which has a few rows of headers. I then extract certain columns from that file based on the header names and output them to a new CSV file in a particular format.
The program works fine and does what I want, but as I'm a newbie I would like some tips on how to improve my code.
My main data file (CSV) is about 57 columns wide and about 36 rows deep, so not big.
It works fine, but I'm looking for advice and improvements.
import csv
import numpy as np
# make some arrays...at least I think that's what this does
A = []
B = []
keep_headers = []
# open the main data csv file 'map.csv'...need to check what 'r' means
input_file = open('map.csv', 'r')
# read the contents of the file into 'data'
data = csv.reader(input_file, delimiter=',')
# skip the first 2 header rows as they are junk
next(data)
next(data)
# read in the next line as the 'header'
headers = next(data)
# Now read in the numeric data (float) from the main csv file 'map.csv'
A = np.genfromtxt('map.csv', delimiter=',', dtype='float', skip_header=5)
# Get the length of a column in A
Alen = len(A[:, 0])
# now read the column header values I want to keep from 'keepheader.csv'
keep_headers = np.genfromtxt('keepheader.csv', delimiter=',', dtype='unicode_')
# Get the length of keep_headers....i.e. how many headers I'm keeping.
head_len = len(keep_headers)
# Now loop round extracting all the columns with the keep header titles and
# append them to array B
i = 0
while i < head_len:
    # use index to find the appropriate column number.
    item_num = headers.index(keep_headers[i])
    i = i + 1
    # append the selected column to array B
    B = np.append(B, A[:, item_num])
# now reshape the B array
B = np.reshape(B, (head_len, 36))
# now transpose it as that's the format I want.
B = np.transpose(B)
# save the array B back to a new csv file called 'cmap.csv'
np.savetxt('cmap.csv', B, fmt='%.3f', delimiter=",")
Thanks.
You can greatly simplify your code by using more of numpy's capabilities.
# load everything (headers + data) as strings, skipping the two junk rows
A = np.loadtxt('map.csv', skiprows=2, delimiter=',', dtype=str)
keep_headers = np.loadtxt('keepheader.csv', delimiter=',', dtype=str)
# the first remaining row holds the column headers
headers = A[0, :]
# boolean mask of columns whose header appears in keep_headers (original column order is preserved)
cols_to_keep = np.in1d(headers, keep_headers)
# convert the kept columns back to float and write them out
B = np.float_(A[1:, cols_to_keep])
np.savetxt('cmap.csv', B, fmt='%.3f', delimiter=",")