Write very big DataFrame into text file or split Dataframe - csv

I have a DataFrame whose shape is (4255300, 10). I need to open it as a CSV file, but due to Excel's size restrictions this is not possible.
I tried to split the DataFrame row-wise (Pandas: split dataframe into multiple dataframes by number of rows), but only index numbers ended up in the splits (I wrote those splits to CSV files).
I also tried to write the DataFrame to a text file with np.savetxt('desktop/s2.txt', z.values, fmt='%d', delimiter="\t"), but the wrong data ended up in the text file.
There is no issue with the width of the DataFrame; the only problem is its length, i.e. the number of rows.
Can anyone help me with this?

You could split the DataFrame into smaller chunks and then export like this:
import numpy as np
import pandas as pd

# Creating a DataFrame with some numbers
df = pd.DataFrame(np.random.randint(0, 100, size=(42000, 10)), index=np.arange(0, 42000)).reset_index()
# Setting my chunk size
chunk_size = 10000
# Assigning chunk numbers to rows
df['chunk'] = df['index'].apply(lambda x: int(x / chunk_size))
# We don't want the 'chunk' and 'index' columns in the output
cols = [col for col in df.columns if col not in ['chunk', 'index']]
# groupby chunk and export each chunk to a different csv
i = 0
for _, chunk in df.groupby('chunk'):
    chunk[cols].to_csv(f'chunk{i}.csv', index=False)
    i += 1
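If you don't need the helper 'chunk' column, you can also slice the frame directly. A minimal sketch (the chunk size and output file names are placeholders you would adapt; Excel tops out at 1,048,576 rows per sheet):
import pandas as pd

def export_in_chunks(df: pd.DataFrame, chunk_size: int = 1_000_000, prefix: str = 'part') -> None:
    # Write consecutive row slices of df to part0.csv, part1.csv, ...
    for i, start in enumerate(range(0, len(df), chunk_size)):
        df.iloc[start:start + chunk_size].to_csv(f'{prefix}{i}.csv', index=False)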

Related

Single CSV output with data in different columns

I have a number of CSV files with data in the first three columns only. I want to copy the data from each CSV file and paste it into one single CSV file in column order. For example, data from the first CSV file goes into columns 1, 2, and 3 of the output file; data from the second CSV goes into columns 4, 5, and 6 of the same output file, and so on. Any help would be highly appreciated. Thanks.
I have tried the following code, but it puts the output in the same three columns, one file after another.
import glob
import pandas as pd
import time
import numpy as np

start = time.time()
Filename = 'Combined_Data.csv'
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
for i in range(len(all_filenames)):
    data = pd.read_csv(all_filenames[i], skiprows=23)
    data = data.rename({'G1': 'CH1', 'G2': 'CH2', 'Dis': 'CH3'}, axis=1)
    data = data[['CH1', 'CH2', 'CH3']]
    data = data.apply(pd.to_numeric, errors='coerce')
    print(all_filenames[i])
    if i == 0:
        data.to_csv(Filename, sep=',', index=False, header=True, mode='a')
    else:
        data.to_csv(Filename, sep=',', index=False, header=False, mode='a')
end = time.time()
print((end - start), 'Seconds (Execution Time)')
If you don't need to write your own code for this, I'd recommend GoCSV's zip command; it can also handle the CSVs having different numbers of rows.
I have three CSV files:
file1.csv
Dig1,Dig2,Dig3
1,2,3
4,5,6
7,8,9
file2.csv
Letter1,Letter2,Letter3
a,b,c
d,e,f
and file3.csv
RomNum1,RomNum2,RomNum3
I,II,III
When I run gocsv zip file2.csv file1.csv file3.csv I get:
Letter1,Letter2,Letter3,Dig1,Dig2,Dig3,RomNum1,RomNum2,RomNum3
a,b,c,1,2,3,I,II,III
d,e,f,4,5,6,,,
,,,7,8,9,,,
GoCSV is pre-built for a number of different OSes.
Here's how to do it with Python's csv module, using the same file1.csv, file2.csv, and file3.csv as above.
The more-memory-intensive option
This accumulates the final CSV one file at a time, expanding a list that represents the final CSV with each new input CSV.
#!/usr/bin/env python3
import csv
import sys

csv_files = [
    'file2.csv',
    'file1.csv',
    'file3.csv',
]

all = []

for csv_file in csv_files:
    with open(csv_file) as f:
        reader = csv.reader(f)
        rows = list(reader)

    len_all = len(all)

    # First file, initialize all and continue (skip)
    if len_all == 0:
        all = rows
        continue

    # The number of columns in all so far
    len_cols = len(all[0])

    # Extend all with the new rows
    for i, row in enumerate(rows):
        # Check to make sure all has as many rows as this file
        if i >= len_all:
            all.append([''] * len_cols)
        all[i].extend(row)

# Finally, pad all rows on the right
len_cols = len(all[0])
for i in range(len(all)):
    len_row = len(all[i])
    if len_row < len_cols:
        col_diff = len_cols - len_row
        all[i].extend([''] * col_diff)

writer = csv.writer(sys.stdout)
writer.writerows(all)
The streaming option
This reads and writes one row at a time.
(It is basically a Python port of the Go code behind GoCSV's zip, mentioned above.)
import csv
import sys

fnames = [
    'file2.csv',
    'file1.csv',
    'file3.csv',
]

num_files = len(fnames)
readers = [csv.reader(open(x)) for x in fnames]

# Collect "header" lines; each header defines the number
# of columns for its file
headers = []
num_cols = 0
offsets = [0]
for reader in readers:
    header = next(reader)
    headers.append(header)
    num_cols += len(header)
    offsets.append(num_cols)

writer = csv.writer(sys.stdout)

# With all headers counted, every row must have this many columns
shell_row = [''] * num_cols

for i, header in enumerate(headers):
    start = offsets[i]
    end = offsets[i + 1]
    shell_row[start:end] = header

# Write headers
writer.writerow(shell_row)

# Expect that not all CSVs have the same number of rows; some will "finish" ahead of others
file_is_complete = [False] * num_files
num_complete = 0

# Loop a row at a time...
while True:
    # ... for each CSV
    for i, reader in enumerate(readers):
        if file_is_complete[i]:
            continue

        start = offsets[i]
        end = offsets[i + 1]

        try:
            row = next(reader)
            # Put this row in its place in the main row
            shell_row[start:end] = row
        except StopIteration:
            file_is_complete[i] = True
            num_complete += 1

    if num_complete == num_files:
        break

    # Done iterating CSVs (for this row), write it
    writer.writerow(shell_row)

    # Reset for next main row
    shell_row = [''] * num_cols
For either, I get:
Letter1,Letter2,Letter3,Dig1,Dig2,Dig3,RomNum1,RomNum2,RomNum3
a,b,c,1,2,3,I,II,III
d,e,f,4,5,6,,,
,,,7,8,9,,,
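If pandas is an option (the question already uses it), the same side-by-side merge can be sketched much more briefly, assuming all the files fit in memory; the glob pattern and output name below are placeholders:
import glob
import pandas as pd

# Read each CSV, align rows by position, and place the files side by side;
# shorter files are padded with empty cells (NaN) on the right.
frames = [pd.read_csv(f).reset_index(drop=True) for f in sorted(glob.glob('file*.csv'))]
pd.concat(frames, axis=1).to_csv('Combined_Data.csv', index=False)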

Reading a big JSON file with multiple objects in Python

I have a big GZ compressed JSON file where each line is a JSON object (i.e. a python dictionary).
Here is an example of the first two lines:
{"ID_CLIENTE":"o+AKj6GUgHxcFuaRk6/GSvzEWRYPXDLjtJDI79c7ccE=","ORIGEN":"oaDdZDrQCwqvi1YhNkjIJulA8C0a4mMZ7ESVlEWGwAs=","DESTINO":"OOcb8QTlctDfYOwjBI02hUJ1o3Bro/ir6IsmZRigja0=","PRECIO":0.0023907284768211919,"RESERVA":"2015-05-20","SALIDA":"2015-07-26","LLEGADA":"2015-07-27","DISTANCIA":0.48962542317352847,"EDAD":"19","sexo":"F"}{"ID_CLIENTE":"WHDhaR12zCTCVnNC/sLYmN3PPR3+f3ViaqkCt6NC3mI=","ORIGEN":"gwhY9rjoMzkD3wObU5Ito98WDN/9AN5Xd5DZDFeTgZw=","DESTINO":"OOcb8QTlctDfYOwjBI02hUJ1o3Bro/ir6IsmZRigja0=","PRECIO":0.001103046357615894,"RESERVA":"2015-04-08","SALIDA":"2015-07-24","LLEGADA":"2015-07-24","DISTANCIA":0.21382548869717155,"EDAD":"13","sexo":"M"}
So, I'm using the following code to read each line into a Pandas DataFrame:
import json
import gzip
import pandas as pd
import random
with gzip.GzipFile('data/000000000000.json.gz', 'r',) as fin:
    data_lan = pd.DataFrame()
    for line in fin:
        data_lan = pd.DataFrame([json.loads(line.decode('utf-8'))]).append(data_lan)
But it's taking ages.
Any suggestions for reading the data more quickly?
EDIT:
Finally what solved the problem:
import json
import gzip
import pandas as pd
with gzip.GzipFile('data/000000000000.json.gz', 'r',) as fin:
    data_lan = []
    for line in fin:
        data_lan.append(json.loads(line.decode('utf-8')))
data = pd.DataFrame(data_lan)
I've worked on a similar problem myself: append() is quite slow. I generally use a list of dicts to load the JSON file and then create the DataFrame all at once. This way you keep the flexibility lists give you, and you only convert to a DataFrame once you're sure about the data in the list. Below is an implementation of that idea:
import json
import gzip
import pandas as pd


def get_contents_from_json(file_path) -> dict:
    """
    Reads the contents of the json file into a dict
    :param file_path:
    :return: A dictionary of all contents in the file.
    """
    try:
        with gzip.open(file_path) as file:
            contents = file.read()
        return json.loads(contents.decode('UTF-8'))
    except json.JSONDecodeError:
        print('Error while reading json file')
    except FileNotFoundError:
        print(f'The JSON file was not found at the given path: \n{file_path}')


def main(file_path: str):
    file_contents = get_contents_from_json(file_path)
    if not isinstance(file_contents, list):
        # I've assumed you have a JSON array in your file;
        # if not, let me know in the comments
        raise TypeError("The file doesn't have a JSON Array!!!")
    all_columns = file_contents[0].keys()
    data_frame = pd.DataFrame(columns=all_columns, data=file_contents)
    print(f'Loaded {int(data_frame.size / len(all_columns))} Rows', 'Done!', sep='\n')


if __name__ == '__main__':
    main(r'C:\Users\carrot\Desktop\dummyData.json.gz')
A pandas DataFrame fits into a contiguous block of memory which means that pandas needs to know the size of the data set when the frame is created. Since append changes the size, new memory must be allocated and the original plus new data sets are copied in. As your set grows, the copy gets bigger and bigger.
You can use from_records to avoid this problem. First, you need to know the row count, and that means scanning the file. You could potentially cache that number if you do it often, but it's a relatively fast operation. Now you have the size, and pandas can allocate the memory efficiently.
# count rows
with gzip.GzipFile(file_to_test, 'r',) as fin:
    row_count = sum(1 for _ in fin)

# build dataframe from records
with gzip.GzipFile(file_to_test, 'r',) as fin:
    data_lan = pd.DataFrame.from_records(fin, nrows=row_count)
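For reference, recent pandas versions can also read newline-delimited JSON straight from the gzipped file; a minimal sketch (the path is taken from the question, the chunk size is an arbitrary placeholder):
import pandas as pd

# pandas handles gzip decompression and per-line JSON parsing itself
data = pd.read_json('data/000000000000.json.gz', lines=True, compression='gzip')

# If the file is too large for one pass, stream it in chunks and concatenate
chunks = pd.read_json('data/000000000000.json.gz', lines=True,
                      compression='gzip', chunksize=100_000)
data = pd.concat(chunks, ignore_index=True)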

Python Spark- How to output empty DataFrame to csv file (Only output header)?

I want to output an empty DataFrame to a CSV file. I use this code:
df.repartition(1).write.csv(path, sep='\t', header=True)
But because there is no data in the DataFrame, Spark won't write the header to the CSV file.
So I modified the code to:
if df.count() == 0:
    empty_data = [f.name for f in df.schema.fields]
    df = ss.createDataFrame([empty_data], df.schema)
    df.repartition(1).write.csv(path, sep='\t')
else:
    df.repartition(1).write.csv(path, sep='\t', header=True)
It works, but I want to ask whether there is a better way that avoids the count function.
df.count() == 0 makes your driver program retrieve and sum the counts of all the DataFrame's partitions across the executors.
In your case I would check emptiness with len(df.take(1)) == 0 (or df.rdd.isEmpty()). It is still not free, but it stops after the first row and is preferable to a full count().
Only header:
cols = '\t'.join(df.columns)
with open('./cols.csv', 'w') as f:
    f.write(cols)
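Putting the two ideas together, a minimal sketch of the whole pattern (df, path, and the local header-only output location are assumed to be as in the question):
if len(df.take(1)) == 0:
    # Empty DataFrame: write only the header line to a plain local file
    with open('./cols.csv', 'w') as f:
        f.write('\t'.join(df.columns) + '\n')
else:
    # Non-empty DataFrame: let Spark write the data together with the header
    df.repartition(1).write.csv(path, sep='\t', header=True)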

import chosen columns of json file in r

I can't find any solution for loading a huge JSON file. I'm trying with the well-known Yelp dataset. It's 3.2 GB and I want to analyse 9 out of its 10 columns. I need to skip importing the $text column, which would give me a much lighter file to load, probably about 70% smaller. I don't want to manipulate the file.
I tried many libraries and got stuck. I've found a solution for data.frame using the pipe function:
df <- read.table(pipe("cut -f1,5,28 myFile.txt"))
from this thread: Ways to read only select columns from a file into R? (A happy medium between `read.table` and `scan`?)
How do I do that for JSON? I'd like to do something like:
json <- read.table(pipe("cut -text yelp_academic_dataset_review.json"))
but this of course throws an error due to the wrong format. Is there any way to do this without parsing the whole file with regex?
EDIT
Structure of one row (I can't even count them all):
{"review_id":"vT2PALXWX794iUOoSnNXSA","user_id":"uKGWRd4fONB1cXXpU73urg","business_id":"D7FK-xpG4LFIxpMauvUStQ","stars":1,"date":"2016-10-31","text":"some long text here","useful":0,"funny":0,"cool":0,"type":"review"}
SECOND EDIT
Finally, I created a loop to convert all the data into a CSV file, omitting the unwanted column. It's slow, but I got a 150 MB (zipped) file from the 3.2 GB original.
# files to process
filepath <- jfile1
fcsv <- "c:/Users/cp/Documents/R DATA/Yelp/tests/reviews.csv"
write.table(x = paste(colns, collapse = ","), file = fcsv, quote = F, row.names = F, col.names = F)

con = file(filepath, "r")
while ( TRUE ) {
  line = readLines(con, n = 1)
  if ( length(line) == 0 ) {
    break
  }
  # regex process
  d <- NULL
  for (i in rcols) {
    pst <- paste(".*\"", colns[i], "\":*(.*?) *,\"", colns[i+1], "\":.*", sep = "")
    w <- sub(pst, "\\1", line)
    d <- cbind(d, noquote(w))
  }
  # save on the fly
  write.table(x = paste(d, collapse = ","), file = fcsv, append = T, quote = F, row.names = F, col.names = F)
}
close(con)
It could be saved as JSON as well. I wonder whether this is the most efficient way, but the other scripts I tested were slow and often had encoding issues.
Try this:
library(jsonlite)
df <- as.data.frame(fromJSON('yelp_academic_dataset_review.json', flatten=TRUE))
Then, once it is a data frame, delete the column(s) you don't need.
If you don't want to manipulate the file in advance of importing it, I'm not sure what options you have in R. Alternatively, you could make a copy of the file, then delete the text column with this script, then import the copy to R, then delete the copy.

pandas find constant variables in a huge csv file

I have a large CSV file that I cannot load into memory. I need to find which variables are constant. How can I do that?
I am reading the csv as
d = pd.read_csv(load_path, header=None, chunksize=10)
Is there an elegant way to solve the problem? The data contains string and numerical variables.
This is my current slow solution, which does not use pandas:
constant_variables = [True for i in range(number_of_columns)]
with open(load_path) as f:
    line0 = next(f).split(',')
    for num, line in enumerate(f):
        line = line.split(',')
        for i in range(number_of_columns):
            if line[i] != line0[i]:
                constant_variables[i] = False
        if num % 10000 == 0:
            print(num)
There are two methods I can think of. The first is to iterate over each column and check how many distinct values it has:
col_list = pd.read_csv(path, nrows=1).columns
for col in range(len(col_list)):
    df = pd.read_csv(path, usecols=[col])
    if len(df.drop_duplicates()) == 1:
        print("all values are constant for: ", df.columns[0])
The second is to iterate over the CSV in chunks and again check the number of distinct values:
for df in pd.read_csv(path, chunksize=1000):
    t = dict(zip(df, [len(df[col].value_counts()) for col in df]))
    print(t)
The latter reads in chunks and tells you how many distinct values each column has within that chunk; this is just rough code which you can modify for your needs.
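A minimal sketch of one way to extend the chunked idea into a full answer, tracking the distinct values seen so far per column and reporting the truly constant ones at the end (the chunk size is arbitrary; load_path and header=None follow the question):
import pandas as pd

seen = None
for chunk in pd.read_csv(load_path, header=None, chunksize=100_000):
    if seen is None:
        # One set of observed values per column
        seen = {col: set() for col in chunk.columns}
    for col in chunk.columns:
        seen[col].update(chunk[col].dropna().unique())

# A column is constant if at most one distinct (non-missing) value was ever seen
constant_cols = [col for col, values in seen.items() if len(values) <= 1]
print("Constant columns:", constant_cols)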