Opening a file of varying row and column structure in Scilab - csv

I habitually use csvRead in Scilab to read my data files; however, I am now faced with one that contains blocks of 200 rows, each preceded by 3 lines of headers, all of which I would like to take into account.
I've tried specifying a range of data following the example on the scilab help website for csvRead (example is right at the bottom of the page) (https://help.scilab.org/doc/6.0.0/en_US/csvRead.html) but I always come out with the same error messages :
The line and/or colmun indices are outside of the limits
or
Error in the column structure.
My first three lines are headers which I know can cause a problem but even if I omit them from my block-range, I still have the same problem.
Otherwise, my data is ordered as follows: three lines of headers (two lines containing a header over just one or two columns, one line containing a header over all columns), 200 lines of data, and a blank line. This represents data from one image, and I have about 500 images in the file. I would like to be able to read and process all of them and keep track of the headers, because they state the image number, which I need to reference later. Example:
DTN-dist_Devissage-1_0006_0,,,,,,
L0,,,,,,
X [mm],Y [mm],W [mm],exx [1] - Lagrange,eyy [1] - Lagrange,exy [1] - Lagrange,Von Mises Strain [1] - Lagrange
-1.13307,-15.0362,-0.00137507,7.74679e-05,8.30045e-05,5.68249e-05,0.00012711
-1.10417,-14.9504,-0.00193334,7.66086e-05,8.02914e-05,5.43132e-05,0.000122655
-1.07528,-14.8647,-0.00249155,7.57493e-05,7.75786e-05,5.18017e-05,0.0001182
Does anyone have a solution to this?
My current code, following an adapted version of the Scilab-help example, looks like this (I have tried varying the blocksize and iblock values to include/omit headers):
blocksize=200;
C1=1;
C2=14;
iblock=1
while (%t)
R1=(iblock-1)*blocksize+4;
R2=blocksize+R1-1;
irange=[R1 C1 R2 C2];
V=csvRead(filepath+filename,",",".","",[],"",irange);
iblock=iblock+1
end

Errors
The CSV
A lot of your problem comes from the inconsistent number of commas in your CSV file. Opening it in LibreOffice Calc and saving it again puts the right number of commas on every line, even the empty ones.
R1
Your current code doesn't position R1 at the beginning of the values. The right formula, with headersize=3 header lines and blanksize=1 blank line per block, is
R1=(iblock-1)*(blocksize+blanksize+headersize)+1+headersize;
End of file
Currently your code raises an error at the end of the file because R1 becomes greater than the number of lines. To solve this, you can specify the maximum number of blocks or test the value of R1 against the number of lines.
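The row arithmetic above can be checked outside Scilab. The following is only a Python sketch, not part of the original answer; the function name block_ranges is hypothetical, and the block layout (3 header lines, 200 data lines, 1 blank line) is taken from the question.

```python
def block_ranges(nlines, headersize=3, blocksize=200, blanksize=1):
    """Yield (iblock, header_first, data_first, data_last) with 1-based rows.

    Assumes every block is headersize + blocksize + blanksize lines long.
    """
    stride = headersize + blocksize + blanksize
    iblock = 1
    while True:
        header_first = (iblock - 1) * stride + 1
        data_first = header_first + headersize
        data_last = data_first + blocksize - 1
        # Stop as soon as a block would run past the end of the file.
        if data_last > nlines:
            break
        yield iblock, header_first, data_first, data_last
        iblock += 1


ranges = list(block_ranges(nlines=36311))
print(ranges[0])   # first block
print(ranges[-1])  # last complete block
```

Testing data_last against the number of lines is what replaces the endless while loop: the generator simply stops when the next block would not fit.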
Improved solution for much bigger file.
When solving your problem with a big file, two issues came up:
We need to know the number of blocks or the number of lines
Each call of csvRead is really slow because it processes the whole file at each call (1s / block!)
My idea was to read the whole file and store it in a string matrix (since mgetl has been improved since 6.0.0), then use csvTextScan on a submatrix. Doing so also removes the need to enter the number of blocks/lines by hand.
The code follows :
clear all
clc
s = filesep()
filepath='.'+s;
filename='DTN_full.csv';
// the header is important as it has the image name
headersize=3;
blocksize=200;
C1=1;
C2=14;
iblock=1
// let's save everything. Good for the example.
bigstruct = struct();
// Read all the values in one pass;
// using csvTextScan afterwards is much more efficient
text = mgetl(filepath+filename);
nlines = size(text,'r');
while ( %t )
    mprintf("Block #%d",iblock);
    // Let's read the header
    R1=(iblock-1)*(headersize+blocksize+1)+1;
    R2=R1 + headersize-1;
    // if R1 or R2 is bigger than the number of lines, stop
    if sum([R1,R2] > nlines )
        mprintf('; End of file\n')
        break
    end
    // We use csvTextScan only on the lines that matter.
    // This speeds up the program, since csvRead reads the whole file
    // every time it is used.
    H=csvTextScan(text(R1:R2),",",".","string");
    mprintf("; %s",H(1,1))
    R1 = R1 + headersize;
    R2 = R1 + blocksize-1;
    if sum([R1,R2]> nlines )
        mprintf('; End of file\n')
        break
    end
    mprintf("; rows %d to %d\n",R1,R2)
    // Let's read the values
    V=csvTextScan(text(R1:R2),",",".","double");
    iblock=iblock+1
    // Let's save this data, keyed by the image name
    bigstruct(H(1,1)) = V;
end
and returns
Block #1; DTN-dist_0005_0; rows 4 to 203
....
Block #178; DTN-dist_0710_0; rows 36112 to 36311
Block #179; End of file
Time elapsed 1.827092s
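For readers who want to prototype the same read-once, slice-per-block idea outside Scilab, here is a minimal Python sketch. It is not the answer's code: parse_blocks and the tiny demo data (img_A, img_B, a block size of 2) are made up for illustration, and it assumes the same fixed layout of header lines, data lines and one blank line.

```python
def parse_blocks(lines, headersize=3, blocksize=200, blanksize=1):
    """Split already-loaded lines into {image name: rows of floats}.

    The image name is assumed to be the first field of the first header line.
    """
    blocks = {}
    stride = headersize + blocksize + blanksize
    start = 0
    # Only complete blocks are parsed; a trailing partial block is ignored.
    while start + headersize + blocksize <= len(lines):
        header = lines[start:start + headersize]
        name = header[0].split(",")[0]
        data = lines[start + headersize:start + headersize + blocksize]
        blocks[name] = [[float(x) for x in row.split(",")] for row in data]
        start += stride
    return blocks


# Hypothetical miniature file: two blocks of 2 data rows each.
# For a real file: lines = open(path).read().splitlines()
demo_lines = [
    "img_A,,", "L0,,", "X,Y,W", "1.0,2.0,3.0", "4.0,5.0,6.0", "",
    "img_B,,", "L0,,", "X,Y,W", "7.0,8.0,9.0", "1e-1,0,0",
]
demo_blocks = parse_blocks(demo_lines, blocksize=2)
print(sorted(demo_blocks))
```

As in the Scilab version, the file is read a single time and each block is then parsed from the in-memory lines, so the cost no longer grows with the number of blocks times the file size.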

Related

Extrapolate two columns from txt with commas and text

I got a problem reading some data from a txt file. I appreciate any suggestion and thank you in advance!
I have a txt file with text/number on top, followed by two tab-separated columns (additionally, they have commas instead of dots).
I want to extract the two columns without text, and replace the commas with dots in order to plot them.
I tried with importdata to be able to replace the commas, but it separates every single character, so I get 36k elements instead of 2048.
Tried dlmread but it ignores the second column...
I have no idea how to proceed without modifying every single file manually.
here is an example of the file:
Data from FLMS012901__118__10-30-26-589.txt Node
Date: Tue Jul 05 10:30:26 CEST 2022
User: Myself
Number of Pixels in Spectrum: 2048
>>>>>Begin Spectral Data<<<<<
338,147 -2183,94
338,527 -2183,94
338,906 -2183,94
339,286 -2251,25
Any suggestions?
EDIT:
Apparently, there was already a solution, even though a bit slow:
% Read file in as a series of strings
fid = fopen('data.txt', 'rb');
strings = textscan(fid, '%s', 'Delimiter', '');
fclose(fid);
% Replace all commas with decimal points
decimal_strings = regexprep(strings{1}, ',', '.');
% Convert to doubles and join all rows together
data = cellfun(@str2num, decimal_strings, 'uni', 0);
data = cat(1, data{:});
On the sample that you provide, the following works:
>> [a,b,c,d] = textread("SO_73502149.txt","%f,%f %f,%f", "headerlines", 6);
>> format free
>> [a+b/1000, c+sign(c).*d/100]
ans =
338.147 -2183.94
338.527 -2183.94
338.906 -2183.94
339.286 -2251.25
However, there are some possible traps, and depending on how decimal figures are written in your file you may need to adapt the post-processing. If 338.10 is printed as 338,1 instead of 338,10, the decoding becomes harder, since b would be read as 1 and b/1000 would no longer give the right fraction. Also, whenever c is zero, sign(c) kills the decimal part. In both cases a less trivial post-processing would be required.
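One way to sidestep both traps is to replace the decimal comma in each field before converting it, rather than reading the integer and fractional parts separately. The sketch below shows that idea in Python rather than MATLAB/Octave; parse_spectrum is a hypothetical helper, and it assumes the data starts right after the ">>>>>Begin Spectral Data<<<<<" line and that the two columns are separated by whitespace.

```python
def parse_spectrum(lines, marker=">>>>>Begin Spectral Data<<<<<"):
    """Return (x, y) float pairs from the lines that follow the marker line."""
    stripped = [line.strip() for line in lines]
    start = stripped.index(marker) + 1
    rows = []
    for line in stripped[start:]:
        fields = line.split()  # splits on tabs or spaces
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        # Convert each field as a whole, so "338,1" and "-0,5" are handled.
        rows.append(tuple(float(f.replace(",", ".")) for f in fields))
    return rows


sample = [
    "Data from FLMS012901__118__10-30-26-589.txt Node",
    "Date: Tue Jul 05 10:30:26 CEST 2022",
    "User: Myself",
    "Number of Pixels in Spectrum: 2048",
    ">>>>>Begin Spectral Data<<<<<",
    "338,147\t-2183,94",
    "338,527\t-2183,94",
    "338,1\t-0,5",  # made-up row exercising both traps
]
spectrum = parse_spectrum(sample)
print(spectrum)
```

Because the whole field is converted at once, the number of printed decimals and a zero integer part no longer matter.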

Q/KDB+ / CSV upload and WSFULL

Pardon me but I'm a Q novice and couldn't find a solution. The code below appends a four-column CSV file to a KDB+ database. This code worked well but, now that my database is large, it throws a WSFULL error. Perhaps there is a more memory efficient way to write it. Please help:
// FILE_LOADER.q
\c 520 500
if [(count .z.x) < 1;
show `$"usage: q loadcsv.q inputfile destfile
where inputfile and destfile are absolute or relative paths to
the files. Inputfile has the following fields:
DATE, TICKER, FIELD, VALUE. DATE is of type date,
TICKER and FIELD are strings, and VALUE is converted to a float.
Any string VALUEs will show up as nulls.";
exit 1
]
f1: hsym `$.z.x[0]
f2: hsym `$.z.x[1]
columns: `DATE`TICKER`FIELD`VALUE
if [() ~ key f1; show ("Input file '",.z.x[0],"' not found");exit 1]
x: .Q.fsn[{f2 upsert flip columns!("DSSF";",")0:x};f1;4194000000]
show ("loaded ",(string x)," characters into the kdb database")
exit 0
First just from trying this out I assume your input csv file never has a header? If it does you'll need a slight code change so kdb is aware.
You are correct that it's a memory issue so what you can do is just decrease the chunk size. You are reading in 4194000000 bytes at a time right now. Try lowering this in accordance with available memory.
If you are still seeing issues it may be your garbage collection setting. You could force a gc after each read/upsert.
.Q.fsn[{f2 upsert flip columns!("DSSF";",")0:x;.Q.gc[]};f1;4194000000]

How to replace MATLAB's timeseries and synchronize functions in Octave?

I have a MATLAB script that I would like to run in Octave. But it turns out that the timeseries and synchronize functions from MATLAB are not yet implemented in Octave. So my question is if there is a way to express or replace these functions in Octave.
For understanding, I have two text files with different row lengths, which I want to synchronize into one text file with the same row length over time. The content of the text files is:
Text file 1:
1st column contains the distance
2nd column contains the time
Text file 2:
1st column contains the angle
2nd column contains the time
Here is the part of my code that I use in MATLAB to synchronize the files.
ts1 = timeseries(distance,timed);
ts2 = timeseries(angle,timea);
[ts1 ts2] = synchronize(ts1,ts2,'union');
distance = ts1.Data;
angle = ts2.Data;
Thanks in advance for your help.
edit:
Here are some example files.
input distance
input rotation angle
output
The synchronize function seems to create a common timeseries from two separate ones (here, specifically via their union), and then use interpolation (here 'linear') to find interpolated values for both distance and angle at the common timepoints.
An example of how to achieve this to get the same output in octave as your provided output file is as follows.
Note: I had to preprocess your input files first to replace 'decimal commas' with dots, and then 'tabs' with commas, to make them valid csv files.
Distance_t = csvread('input_distance.txt', 1, 0); % skip header row
Rotation_t = csvread('input_rotation_angle.txt', 1, 0); % skip header row
Common_t = union( Distance_t(:,2), Rotation_t(:,2) );
InterpolatedDistance = interp1( Distance_t(:,2), Distance_t(:,1), Common_t );
InterpolatedRotation = interp1( Rotation_t(:,2), Rotation_t(:,1), Common_t );
Output = [ InterpolatedRotation, InterpolatedDistance ];
Output = sortrows( Output, -1 ); % sort according to column 1, in descending order
Output = Output(~isna(Output(:,2)), :); % remove NA entries
(Note: the step that removes NA entries is necessary because we did not ask for extrapolation during the interpolation step, so any common timepoint that lies outside the original time range of a series gets no interpolated value, which Octave labels as NA.)
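The union-then-interpolate idea is independent of Octave. Below is a small pure-Python sketch of it under the same assumptions as the answer (linear interpolation, no extrapolation); the names interp_linear and synchronize_union and the demo numbers are illustrative only, and None plays the role of NA.

```python
def interp_linear(xs, ys, x):
    """Linearly interpolate ys over ascending xs at x; None outside the range."""
    if x < xs[0] or x > xs[-1]:
        return None  # no extrapolation, like NA in the Octave version
    for i in range(len(xs) - 1):
        x0, x1 = xs[i], xs[i + 1]
        if x0 <= x <= x1:
            if x1 == x0:
                return ys[i]
            t = (x - x0) / (x1 - x0)
            return ys[i] + t * (ys[i + 1] - ys[i])
    return ys[-1]  # only reached when xs has a single point


def synchronize_union(t1, v1, t2, v2):
    """Resample both series on the sorted union of their timepoints."""
    common = sorted(set(t1) | set(t2))
    r1 = [interp_linear(t1, v1, t) for t in common]
    r2 = [interp_linear(t2, v2, t) for t in common]
    return common, r1, r2


# Made-up data: distance sampled at t = 0, 2, 4 and angle at t = 1, 3.
common_t, dist_sync, angle_sync = synchronize_union(
    [0.0, 2.0, 4.0], [0.0, 10.0, 20.0],
    [1.0, 3.0], [5.0, 15.0],
)
print(common_t, dist_sync, angle_sync)
```

Rows where either series is None correspond to the NA entries that the Octave script drops in its last line.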

It is possible to obtain the mean of different files to make some computations with it after?

I have code that calculates the mean of the first five values of each column of a file, and then uses these values as a reference point for the whole set. The problem is that I now need to do the same for many files, so I need to obtain the means for each file and then use them with the original files. I have tried it this way but I get an error. Thanks.
%%% - Loading the file of each experiment
myfiles = dir('*.lvm'); % To load every file of .lvm
for i = 1:length(myfiles) % Loop with the number of files
files=myfiles(i).name;
mydata(i).files = files;
mydata(i).T = fileread(files);
arraymean(i) = mean(mydata(i));
end
The files that I need to compute are more or less like this:
Delta_X 3.000000 3.000000 3.000000
***End_of_Header***
X_Value C_P1N1 C_P1N2 C_P1N3
0.000000 -0.044945 -0.045145 -0.045705
0.000000 -0.044939 -0.045135 -0.045711
3.000000 -0.044939 -0.045132 -0.045706
6.000000 -0.044938 -0.045135 -0.045702
Your first line makes 'myfiles' a structure array whose fields are described under 'help dir'. In particular, the name of each file is stored in myfiles(i).name; to display all the file names, type myfiles.name. So far so good. In the for loop you use 'fileread', but fileread (see help fileread) returns the file contents as a character string rather than as numeric values. I have named your prototype .lvm file DinaF.lvm and written a very simple function to read the data in that file: it skips the first three lines, then stores the following matrix, assumed to have 4 columns, in an array called T inside the function and arrayT in the main program.
Here is a modified script, where a function read_lvm has been included to read your 'model' lvm file.
The '1' in the first line tells Octave that there is more to the script than just the following function: the main program has to be interpreted as well.
1;
function T=read_lvm(filename)
    fid = fopen (filename, "r");
    %% Skip by first three lines
    for lhead=1:3
        junk=fgetl(fid);
    endfor
    %% Read nrow lines of data, quit when file is empty
    nrow=0;
    while (! feof (fid) )
        nrow=nrow + 1;
        thisline=fscanf(fid,'%f',4);
        T(nrow,1:4)=transpose(thisline);
    endwhile
    fclose (fid);
endfunction
## main program
myfiles = dir('*.lvm'); % To load every file of .lvm
for i = 1:length(myfiles) % Loop with the number of files
    files=myfiles(i).name;
    arrayT(i,:,:) = read_lvm(files);
    columnmean(i,1:4)=mean(arrayT(i,:,:))
end
Now the tabular values associated with each .lvm file are in the array arrayT and the column means for that data set are in columnmean(i,1:4). If there is more than one file, columnmean is a matrix, with each row containing the column means for one .lvm file.
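The same skip-the-header, then-average logic can be sketched in Python for comparison. This is not the Octave function above: read_lvm_lines and column_means are hypothetical names, the three header lines are assumed as in the sample file, and first_n is added only because the question mentions averaging the first five values.

```python
def read_lvm_lines(lines, n_header=3):
    """Parse whitespace-separated numeric rows after n_header header lines."""
    rows = []
    for line in lines[n_header:]:
        fields = line.split()
        if fields:  # ignore blank lines
            rows.append([float(f) for f in fields])
    return rows


def column_means(rows, first_n=None):
    """Mean of each column, optionally over the first first_n rows only."""
    subset = rows if first_n is None else rows[:first_n]
    ncols = len(subset[0])
    return [sum(r[c] for r in subset) / len(subset) for c in range(ncols)]


sample_lvm = [
    "Delta_X 3.000000 3.000000 3.000000",
    "***End_of_Header***",
    "X_Value C_P1N1 C_P1N2 C_P1N3",
    "0.000000 -0.044945 -0.045145 -0.045705",
    "0.000000 -0.044939 -0.045135 -0.045711",
    "3.000000 -0.044939 -0.045132 -0.045706",
    "6.000000 -0.044938 -0.045135 -0.045702",
]
table = read_lvm_lines(sample_lvm)
means_all = column_means(table)
means_first2 = column_means(table, first_n=2)
print(means_all)
```

To process many files, one would call these two functions once per file name and keep the resulting lists in a dictionary keyed by file name.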
This discussion is getting to be too distant from the initial question. I am happy to continue to help. If you want more help, close this discussion by accepting my answer (click the swish), then ask a new question with a heading like 'How to read .lvm files in Octave'. That way you will get the insights from many more people.

What does 'multiline strings are different' meant by from RIDE (Robot Framework) output?

I am trying to compare the data of two csv files and followed the process below in RIDE -
${csvA} = Get File ${filePathA}
${csvB} = Get File ${filePathB}
Should Be Equal As Strings ${csvA} ${csvB}
Here are my two csv contents -
csvA data
Harshil,45,8.03,DMJ
Divy,55,8,VVN
Parth,1,9,vvn
kjhjmb,44,0.5,bugg
csvB data
Harshil,45,8.03,DMJ
Divy,55,78,VVN
Parth,1,9,vvnbcb
acc,5,6,afafa
Since some of the data does not match, the result is FAIL when I run the code in RIDE. But the log shows the data below -
Multiline strings are different:
--- first
+++ second
@@ -1,4 +1,4 @@
Harshil,45,8.03,DMJ
-Divy,55,8,VVN
-Parth,1,9,vvn
-kjhjmb,44,0.5,bugg
+Divy,55,78,VVN
+Parth,1,9,vvnbcb
+acc,5,6,afafa
I would like to know the meaning of the --- first, +++ second and @@ -1,4 +1,4 @@ content.
Thanks in advance!
When Robot compares multiline strings (data that has newlines in it), it shows the differences in the format of the standard Unix tool diff. Those characters are all part of what's called a unified diff. Even though you pass in raw data, it's treating the data as two files and showing the differences between the two in a format familiar to most programmers.
Here are two references to read more about the format:
What does "@@ -1 +1 @@" mean in Git's diff output?. (stackoverflow)
the diff man page (gnu.org)
In short, the @@ line tells you which line ranges the hunk covers in each string (here, 4 lines starting at line 1 in both), and the + and - show you which lines are different.
In your specific example it's telling you that three lines differ between the two strings: the lines beginning with Divy and Parth, and the last line, which begins with kjhjmb in the first string and acc in the second. Since the line beginning with Harshil does not show a + or -, it was identical in both strings.
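To see where each part of that output comes from, the same unified diff can be reproduced with Python's standard difflib module. This is only an illustration using the two CSV contents from the question; the labels "first" and "second" are passed in explicitly to match the log.

```python
import difflib

csv_a = [
    "Harshil,45,8.03,DMJ",
    "Divy,55,8,VVN",
    "Parth,1,9,vvn",
    "kjhjmb,44,0.5,bugg",
]
csv_b = [
    "Harshil,45,8.03,DMJ",
    "Divy,55,78,VVN",
    "Parth,1,9,vvnbcb",
    "acc,5,6,afafa",
]

# lineterm="" because the input lines carry no trailing newlines.
diff_lines = list(
    difflib.unified_diff(csv_a, csv_b, fromfile="first", tofile="second", lineterm="")
)
for line in diff_lines:
    print(line)
```

The first two lines name the inputs, the @@ line gives the start line and length of the hunk in each input, and every following line is prefixed with a space (unchanged), - (only in the first) or + (only in the second).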