I have to work with a big CSV file, up to 2 GB. More specifically, I have to upload all of this data to a MySQL database, but before that I have to do a few calculations on it, so I need to do all of this in MATLAB (my supervisor also wants it done in MATLAB because he is only familiar with MATLAB :( ).
Any idea how I can handle these big files?
You should probably use textscan to read the data in chunks and then process each chunk. This will probably be more efficient than reading a single line at a time. For example, if you have 3 columns of data, you could do:
filename = 'fname.csv';
[fh, errMsg] = fopen( filename, 'rt' );
if fh == -1, error( 'couldn''t open file: %s: %s', filename, errMsg ); end
N = 100; % read 100 rows at a time
while ~feof( fh )
    c = textscan( fh, '%f %f %f', N, 'Delimiter', ',' );
    doStuff(c);  % placeholder for your calculations / upload step
end
fclose( fh );
EDIT
These days (R2014b and later), it's easier and probably more efficient to use a datastore.
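For instance, a minimal sketch of the datastore approach (assuming the same comma-separated, header-less three-column fname.csv as above, with doStuff again standing in for your own processing) might look like this:

ds = datastore( 'fname.csv', 'ReadVariableNames', false );  % TabularTextDatastore; this file has no header row
ds.ReadSize = 10000;   % number of rows returned by each read
while hasdata( ds )
    t = read( ds );    % one chunk of rows, as a table
    doStuff( t );      % placeholder for your calculations / upload step
end

Each call to read returns just the next chunk as a table, so the whole file never has to fit in memory at once.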
There is good advice on handling large datasets in MATLAB in this file exchange item.
Specific topics include:
* Understanding the maximum size of an array and the workspace in MATLAB
* Using undocumented features to show you the available memory in MATLAB
* Setting the 3GB switch under Windows XP to get 1GB more memory for MATLAB
* Using textscan to read large text files and the memory-mapping feature (memmapfile) to read large binary files (see the short sketch below)
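The textscan pattern is essentially the one shown above. For the binary case, a minimal memory-mapping sketch (assuming a hypothetical file data.bin that simply contains raw double values) could look like:

m = memmapfile( 'data.bin', 'Format', 'double' );  % maps the file without loading it into memory
firstChunk = m.Data(1:100000);                     % only the elements you index are actually read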
I have a CSV file which has nearly 22M records. I want to split this into multiple CSV files so that I can use it further.
I tried to open it using Excel (I tried the Transform Data option as well), Notepad++, and Notepad, but all of them give me an error.
When I explored the options, I found that the file can be split using a programming language like Java, Python, etc. I am not very familiar with coding and want to know if there is any way to split the file without any coding. Also, since the file has client-sensitive data, I don't want to download/use any external tools.
Any help would be much appreciated.
I know you're concerned about security of sensitive data, and that makes you want to avoid external tools (even a nominally trusted tool like Google Big Query... unless your data is medical in nature).
I know you don't want a custom solution with Python, but I don't understand why that is: this is a big problem, and CSVs can be tricky to handle.
Maybe your CSV is a "simple" one, with no embedded line breaks and minimal quoting. But if it isn't, you're going to want a tool that's meant for CSV.
And because the file is so big, I don't see how you can do it without code. Even if you could load it into a trusted tool, how would you process the 22M records?
I look forward to seeing what else the community has to offer you.
The best I can think of based on my experience is exactly what you said you don't want.
It's a small-ish Python script that uses the standard csv library to correctly read in your large file and write out several smaller files. If you don't trust this, or me, maybe find someone you do trust who can read it and assure you it won't compromise your sensitive data.
#!/usr/bin/env python3
import csv

# Maximum number of data rows to write to each new CSV
MAX_ROWS = 22_000

# The name of your input
INPUT_CSV = 'big.csv'

# The "base name" of all new sub-CSVs, a counter will be added after the '-':
# e.g., new-1.csv, new-2.csv, etc...
NEW_BASE = 'new-'

# This function will be called over-and-over to make a new CSV file
def make_new_csv(i, header=None):
    # New name
    new_name = f'{NEW_BASE}{i}.csv'
    # Create a new file with that name
    new_f = open(new_name, 'w', newline='')
    # Create a "writer", a dedicated object for writing "rows" of CSV data
    writer = csv.writer(new_f)
    if header:
        writer.writerow(header)
    return new_f, writer

# Open your input CSV
with open(INPUT_CSV, newline='') as in_f:
    # Like the "writer", dedicated to reading CSV data
    reader = csv.reader(in_f)
    # Grab the header row first, so it can be repeated in every new file
    your_header = next(reader)

    # Give your new files unique, sequential names: e.g., new-1.csv, new-2.csv, etc...
    new_i = 1

    # Make the first new file and writer
    new_f, writer = make_new_csv(new_i, your_header)

    # Loop over all input rows, and count how many
    # records have been written to each "new file"
    new_rows = 0
    for row in reader:
        if new_rows == MAX_ROWS:
            new_f.close()  # This file is full, close it and...
            new_i += 1
            new_f, writer = make_new_csv(new_i, your_header)  # ...get a new file and writer
            new_rows = 0  # Reset row counter
        writer.writerow(row)
        new_rows += 1

# All done reading input rows, close the last file
new_f.close()
There's also a fantastic tool I use daily for processing large CSVs, also with sensitive client contact and personally identifying information, GoCSV.
Its split command is exactly what you need:
Split a CSV into multiple files.
Usage:
gocsv split --max-rows N [--filename-base FILENAME] FILE
I'd recommend downloading it for your platform, unzipping it, putting a sample file with non-sensitive information in that folder and trying it out:
gocsv split --max-rows 1000 --filename-base New sample.csv
would end up creating a number of smaller CSVs, New-1.csv, New-2.csv, etc..., each with a header and no more than 1000 rows.
Pardon me but I'm a Q novice and couldn't find a solution. The code below appends a four-column CSV file to a KDB+ database. This code worked well but, now that my database is large, it throws a WSFULL error. Perhaps there is a more memory efficient way to write it. Please help:
// FILE_LOADER.q
\c 520 500
if [(count .z.x) < 1;
show `$"usage: q loadcsv.q inputfile destfile
where inputfile and destfile are absolute or relative paths to
the files. Inputfile has the following fields:
DATE, TICKER, FIELD, VALUE. DATE is of type date,
TICKER and FIELD are strings, and VALUE is converted to a float.
Any string VALUEs will show up as nulls.";
exit 1
]
f1: hsym `$.z.x[0]
f2: hsym `$.z.x[1]
columns: `DATE`TICKER`FIELD`VALUE
if [() ~ key f1; show ("Input file '",.z.x[0],"' not found");exit 1]
x: .Q.fsn[{f2 upsert flip columns!("DSSF";",")0:x};f1;4194000000]
show ("loaded ",(string x)," characters into the kdb database")
exit 0
First, just from trying this out, I assume your input CSV file never has a header? If it does, you'll need a slight code change so kdb is aware of it.
You are correct that it's a memory issue, so what you can do is simply decrease the chunk size. You are reading in 4194000000 bytes at a time right now; try lowering this in accordance with the available memory.
If you are still seeing issues, it may be your garbage-collection settings. You could force a garbage collection after each read/upsert:
.Q.fsn[{f2 upsert flip columns!("DSSF";",")0:x;.Q.gc[]};f1;4194000000]
I want to load a set of data that contains variables of type string and float. But when I use textscan in Octave, my data doesn't load. I get a 1x6 cell array (I have 6 features), but the cells contain nothing (they are 0x1).
My code:
filename='data1.txt';
fileID = fopen(filename,'r');
data = textscan(fileID,'%f %s %s %f %f %s','Delimiter',',');
fclose(fileID);
When, for example, I try data(1):
>> data(1)
ans =
{
[1,1] = [](0x1)
}
>>
Also, my fileID isn't -1.
I have been searching the internet for a problem like this, but I couldn't find anything.
I tried deleting the headers in the data and using a smaller training set, but it didn't work.
Please help.
Don't use textscan. textscan is horrible, and you should only use it to parse data when there's no better way available.
Your data is a standard csv file. Just use csv2cell from the io package.
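A minimal sketch of that approach, assuming the same data1.txt as above and that the io package is already installed (e.g. via pkg install -forge io):

pkg load io;                  % csv2cell is part of the io package
C = csv2cell('data1.txt');    % cell array: numeric fields become numbers, text stays as strings
col1 = cell2mat(C(:, 1));     % e.g. pull out the first (numeric) column; skip a header row with C(2:end, :) if there is one

Each cell holds one field, so mixed string/float columns are no problem.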
I have code to calculate the mean of the first five values of each column of a file, to then use these values as a reference point for the whole set. The problem is that now I need to do the same for many files, so I need to obtain the mean of each file and then use these values again with the original files. I have tried it in the following way, but I obtain an error. Thanks.
%%% - Loading the file of each experiment
myfiles = dir('*.lvm');        % To load every .lvm file
for i = 1:length(myfiles)      % Loop over the number of files
    files = myfiles(i).name;
    mydata(i).files = files;
    mydata(i).T = fileread(files);
    arraymean(i) = mean(mydata(i));
end
The files that I need to compute are more or less like this:
Delta_X 3.000000 3.000000 3.000000
***End_of_Header***
X_Value C_P1N1 C_P1N2 C_P1N3
0.000000 -0.044945 -0.045145 -0.045705
0.000000 -0.044939 -0.045135 -0.045711
3.000000 -0.044939 -0.045132 -0.045706
6.000000 -0.044938 -0.045135 -0.045702
Your first line results in 'myfiles' being a structure array whose fields you will find described when you type 'help dir'. In particular, the names of all the files are contained in the structure element myfiles(i).name. To display all the file names, type myfiles.name. So far so good.
In the for loop you use 'fileread', but fileread (see 'help fileread') returns the file contents as a character string rather than the actual values. I have named your prototype .lvm file DinaF.lvm and have written a very, very simple function to read the data in that file, by skipping the first three lines and then storing the following matrix, assumed to have 4 columns, in an array called T inside the function and arrayT in the main program.
Here is a modified script, where a function read_lvm has been included to read your 'model' lvm file.
The '1' in the first line tells Octave that there is more to the script than just the following function: the main program has to be interpreted as well.
1;
function T = read_lvm(filename)
  fid = fopen (filename, "r");
  %% Skip the first three header lines
  for lhead = 1:3
    junk = fgetl(fid);
  endfor
  %% Read rows of data, quit when the file is empty
  nrow = 0;
  while (! feof (fid))
    nrow = nrow + 1;
    thisline = fscanf(fid, '%f', 4);
    T(nrow, 1:4) = transpose(thisline);
  endwhile
  fclose (fid);
endfunction

## main program
myfiles = dir('*.lvm');       % To load every .lvm file
for i = 1:length(myfiles)     % Loop over the number of files
  files = myfiles(i).name;
  arrayT(i,:,:) = read_lvm(files);
  columnmean(i,1:4) = mean(squeeze(arrayT(i,:,:)))  % squeeze drops the singleton dimension, giving a 1x4 row of column means
end
Now the tabular values associated with each .lvm file are in the array arrayT, and the mean for that data set is in columnmean(i,1:4). If i>1 then columnmean is an array, with each row containing the column means for the corresponding .lvm file.
This discussion is getting to be too distant from the initial question. I am happy to continue to help. If you want more help, close this discussion by accepting my answer (click the swish), then ask a new question with a heading like 'How to read .lvm files in Octave'. That way you will get the insights from many more people.
I have a scenario where I am loading and processing 4 TB of data, which is about 15000 .csv files in a folder.
Since I have limited resources, I am planning to process them in two batches and then union them.
I am trying to understand if I can load only 50% (or the first n files in batch 1 and the rest in batch 2) using spark.read.csv.
I cannot use a regular expression, as these files are generated from multiple sources and their numbers are uneven (from some sources there are few and from others there are many). If I consider processing files in uneven batches using wildcards or a regex, I may not get optimized performance.
Is there a way to tell the spark.read.csv reader to pick the first n files, and then just load the remaining files?
I know this can be done by writing another program, but I would prefer not to, as I have more than 20000 files and I don't want to iterate over them.
It's easy if you use the Hadoop API to list the files first and then create DataFrames from chunks of that list. For example:
path = '/path/to/files/'

# Use the Hadoop FileSystem API (through the JVM gateway) to list everything under the path
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
list_status = fs.listStatus(spark._jvm.org.apache.hadoop.fs.Path(path))
paths = [file.getPath().toString() for file in list_status]

# Split the list of file paths roughly in half and read each batch into its own DataFrame
df1 = spark.read.csv(paths[:7500])
df2 = spark.read.csv(paths[7500:])