How do I accelerate reading large files in GNU Octave?

How do I accelerate reading large files in GNU Octave? - csv

I'm importing a large CSV file into GNU Octave, doing some simple data manipulation and creating some plots. The file has about 6.5 million rows. I expected the process of file reading to take about two to three hours, because that's how long it usually takes to create a file this size in my experience. Added a status counter when it wasn't finishing and found that it was slowing down as it read; after 12 hours, only at line 1.5 million and moving at a crawl. According to Resource Monitor, though, no memory issues. Is there a more efficient way to read the code than what I have below? Do I need to do something special to allocate memory to the process so it doesn't slow down? This is the loop that's reading in the CSV. It's a while loop that scans the csv one line at a time, extracts the columns I need and ends when it reaches the first blank line:
% Process File
F=1;
while 1
% Status Counter
printf ("Status: %d \r", F);
fflush (stdout);
F=F+1;
% Read first unread line
line = fgetl(fileID);
% Exit while loop if line is empty
if ~ischar(line)
break;
endif
% Translate Line
Bank = textscan (line, '%f', 'Delimiter', ',');
Bank = cell2mat (Bank);
Bank = transpose (Bank);
% Append Bank to Output
Output = [Output; Bank(1, 1:9), Bank(1, 13:14), Bank(1, 20:21)];
endwhile

This is the slow part:
Output = [Output; Bank(1, 1:9), Bank(1, 13:14), Bank(1, 20:21)];
What you do here is create a new matrix, copy Output and the new row into it, and assign it to Output. As Output becomes larger, the copy becomes increasingly expensive.
What you need to do is preallocate the output array. Always preallocate!

Related

GnuPlot :: Plotting 3D recorded in an unconventional format

I would like to prepare a script file to draw a 3D plot of some kinetic spectroscopy results. In the experiment the absorption spectrum of a solution is measured sequentially at increasing times from t0 to tf with a constant increase in time Δt.
The plot will show the variation of absorbamce (Z) with wavelength and time.
The data are recorded using a UV-VIS spectrometer and saved as a CSV text file.
The file contains a table in which the first column are the wavelengths of the spectra. Afterwards, a column is added for each the measured spectra, and a number of columns depends on the total time and the time interval between measuerments. The time for each spectra appears in the headers line.
I wonder if the data can be plotted directly witha minimum of preformatting and without the need to rewrite the data in a more estandar XYZ format.
The structure of the data file is something like this
Title; espectroscopia UV-Vis
Comment;
Date; 23/10/2018 16:41:12
Operator; laboratorios
System Name; Undefined
Wavelength (nm); 0 Min; 0,1 Min; 0,2 Min; 0,3 Min; ... 28,5 Min
400,5551; 1,491613E-03; 1,810312E-03; 2,01891E-03; ... 4,755786E-03
... ... ... ... ... ...
799,2119; -5,509266E-04; 3,26314E-04; -4,319865E-04; ... -5,087912E-04
(EOF)
A copy of a sample data is included in this file kinetic_spectroscopy.csv.
Thanks.

Your data is in an acceptable form for gnuplot, but persuading the program to plot this as one line per wavelength rather than a gridded surface is more difficult. First let's establish that the file can be read and plotted. The following commands should read in the x/y coordinates (x = first row, y = first column) and the z values to construct a surface.
DATA = 'espectros cinetica.csv'
set datafile separator ';' # csv file with semicolon
# Your data uses , as a decimal point.
set decimal locale # The program can handle this if your locale is correct.
show decimal # confirm this by inspecting the output from "show".
set title DATA
set ylabel "Wavelength"
set xlabel "Time (min)"
set xyplane 0
set style data lines
splot DATA matrix nonuniform using 1:2:3 lc palette
This actually looks OK with your data. For a smaller number of scans it is probably not what you would want. In order to plot separate lines, one per scan, we could break this up into a sequence of line plots rather than a single surface plot:
DATA = 'espectros cinetica.csv'
set datafile separator ";"
set decimal locale
unset key
set title DATA
set style data lines
set ylabel "Wavelength"
set xlabel "Time (min)"
set xtics offset 0,-1 # move labels away from axis
splot for [row=0:*] DATA matrix nonuniform every :::row::row using 1:2:3
This is what I get for the first 100 rows of your data file. The row data is colored sequentially by gnuplot linetypes. Other coloring schemes are possible.

json data appears in a single line

I have a badly formatted json file
R˜{"xData":{"x":[7872,7904,...4670]}} R˜{"xData":{"x":[7904,7904,...8000]}} ...
That is, there is only one line, whereas each new data record should start with a new line starting from the character R. Also, the character ~ after R is unwanted. Since the file is about 1GB, it is impossible to manually insert a new line just before each R. Each x is a vector of the same number of data points, say 5000. If the total number of lines are 100000, i.e. 1e5 occurences of the character R, then how to obtain a separate output file containing only the matrix of x values in that output file? This matrix will be 5000 columns by 100000 rows.

How to read 3x3xN coordinates string into matlab array efficently

I have a MATLAB script that takes a JSON that was created by myself in a remote server and contains a long list of 3x3xN coordinates e.g. for N=1:
str = '[1,2,3.14],[4,5.66,7.8],[0,0,0],';
I want to avoid string splitting it, is there any approach to use strread or similar to read this 3×3×N tensor?
It's a multi-particle system and N can be large, though I have enough memory to store it all at once in the memory.
Any suggestion of how to format the array string in the JSON is very welcome as well.

If you can guarantee the format is always the same, I think it's easiest, safest and fastest to use sscanf:
fmt = '[%f,%f,%f],[%f,%f,%f],[%f,%f,%f],';
data = reshape(sscanf(str, fmt), 3, 3).';
Depending on the rest of your data (how is that "N" represented?), you might need to adjust that reshape/transpose.
EDIT
Based on your comment, I think this will solve your problem quite efficiently:
% Strip unneeded concatenation characters
str(str == ',') = ' ';
str(str == ']' | str == '[') = [];
% Reshape into workable dimensions
data = permute( reshape(sscanf(str, '%f '), 3,3,[]), [2 1 3]);
As noted by rahnema1, you can avoid the permute and/or character removal by adjusting your JSON generators to spit out the data column-major and without brackets, but you'll have to ask yourself these questions:
whether that is really worth the effort, considering that this code right here is already quite tiny and pretty efficient
whether other applications are going to use the JSON interface, because in essence you're de-generalizing the JSON output just to fit your processing script on the other end. I think that's a pretty bad design practice, but oh well.
Just something to keep in mind:
emitting 500k values in binary is about 34 MB
doing the same in ASCII is about 110 MB
Now depending a bit on your connection speed, I'd be getting really annoyed really quickly because every little test run takes about 3 times as long as it should be taking :)
So if an API call straight to the raw data is not possible, I would at least base64 that data in the JSON.

You can use eval function:
str = '[1,2,3.14],[4,5.66,7.8],[0,0,0],';
result=permute(reshape(eval(['[' ,str, ']']),3,3,[]),[2 1 3])
result =
1.00000 2.00000 3.14000
4.00000 5.66000 7.80000
0.00000 0.00000 0.00000
Using eval all elements concatenated to create a row vector. Then row vector reshaped to a 3d array. Since in MATLAB elements are placed in matrix columnwise it is required to permute the array so each 3*3 matrix are trasposed.
note1: There is no need to place [] in jSON string so you can use str2num instead of eval :
result=permute(reshape(str2num(str),3,3,[]),[2 1 3])
note2:
if you save data columnwise there is no need to permute:
str='1 4 0 2 5.66 0 3.14 7.8 0';
result=reshape(str2num(str),3,3,[])
Update: As Ander Biguri and excaza noted about security an speed issues related to eval and str2num and after Rody Oldenhuis 's suggestion about using sscanf I tested 3 methods in Octave:
a=num2str(rand(1,60000));
disp('-----SSCANF---------')
tic
sscanf(a,'%f ');
toc
disp('-----STR2NUM---------')
tic
str2num(a);
toc
disp('-----STRREAD---------')
tic
strread(a,'%f ');
toc
and here is the result:
-----SSCANF---------
Elapsed time is 0.0344398 seconds.
-----STR2NUM---------
Elapsed time is 0.142491 seconds.
-----STRREAD---------
Elapsed time is 0.515257 seconds.
So it is more secure and faster to use sscanf, in your case:
str='1 4 0 2 5.66 0 3.14 7.8 0';
result=reshape(sscanf(str,'%f '),3,3,[])
or
str='1, 4, 0, 2, 5.66, 0, 3.14, 7.8, 0';
result=reshape(sscanf(str,'%f,'),3,3,[])

R: NaiveBayes incrementally on a large data set

I have a large data set in a MySQL database (at least 11 GB of data). I would like to train a NaiveBayes model on the entire set and then test is against a smaller but also quite large data set (~3 GB).
The second part seems feasible - I assume that I would run the following in a loop:
data_test <- sqlQuery(con, paste("select * from test_data LIMIT 10000", "OFFSET", (i*10000) ))
model_pred <- predict(model, data_test, type="raw")
...and then dump the predictions back to MySQL or a CSV.
How can I, however, train my model incrementally on such a large data set? I noticed in the R documentation of the function (http://www.inside-r.org/packages/cran/e1071/docs/naiveBayes) that there is an addtional argument in the predict function "newdata" which suggests that incremental learning is possible. The predict function however will return the predictions and not a new model.
Please provide me with an example of how to incrementally train my model.

Is there a fast way of adding or removing content in the middle of a very large file

Say I have a very large file (say > 1GB) and I want to add a single character in the middle of it. Is it possible to do this without reading and writing the whole file out? My current solution is this (in pseudocode):
x = 0
chunk = read 4KB chunk x of input file
if chunkToEdit = x, chunk = addCharacter(chunk)
append chunk to the output file
x = x + 1
repeat last 4 steps until input file is fully read
delete input file
move output file to input file
While that works, it results in 1GB of reading, and 1GB of writing to make a single character change. It also requires a spare 1GB of disk space. What I would rather do is modify the part of the file that needs to be changed in place, so I only have to read and write one part of the file (ie 4KB of reading, and 4KB of writing). Is this possible (or a solution better than my one)?
I thought a solution for this could be possible by the OS fragmenting the file and making a new fragment for the changed section, but I don't know if this capability has been written and exposed to developers.

No. Files don't work like that. If you need to change the size of the file then you need to operate from the modification point to the end.
Unless you're using a file format that can handle insertions/deletions cleanly, but it sounds like you aren't.

Adding a single character in the middle necessarily requires shifting everything after this one character by one character. This necessarily requires that you read and write everything from the point of insertion to the end of the file. A way that uses as little memory as possible to do so would be:
i = 0
read last (n byte * i) of file
write back to file shifted by 1 character
i++
repeat until reaching the point of insertion
write single character
In other words: shift everything in chunks of n bytes by one character starting from the end going backwards through the file to the point of insertion, then insert the character. The farther back in the file you want to insert the character, the faster this will be. If you often want to insert near the beginning of the file, this may not be the best solution.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

How do I accelerate reading large files in GNU Octave? - csv

Related

GnuPlot :: Plotting 3D recorded in an unconventional format

json data appears in a single line

How to read 3x3xN coordinates string into matlab array efficently

R: NaiveBayes incrementally on a large data set

Is there a fast way of adding or removing content in the middle of a very large file

Categories

Resources