Q/KDB+ / CSV upload and WSFULL - csv

Pardon me, but I'm a q novice and couldn't find a solution. The code below appends a four-column CSV file to a kdb+ database. This code worked well, but now that my database is large, it throws a WSFULL error. Perhaps there is a more memory-efficient way to write it. Please help:
// FILE_LOADER.q
\c 520 500
if [(count .z.x) < 1;
  show `$"usage: q loadcsv.q inputfile destfile
  where inputfile and destfile are absolute or relative paths to
  the files. Inputfile has the following fields:
  DATE, TICKER, FIELD, VALUE. DATE is of type date,
  TICKER and FIELD are strings, and VALUE is converted to a float.
  Any string VALUEs will show up as nulls.";
  exit 1
  ]
f1: hsym `$.z.x[0]
f2: hsym `$.z.x[1]
columns: `DATE`TICKER`FIELD`VALUE
if [() ~ key f1; show ("Input file '",.z.x[0],"' not found");exit 1]
x: .Q.fsn[{f2 upsert flip columns!("DSSF";",")0:x};f1;4194000000]
show ("loaded ",(string x)," characters into the kdb database")
exit 0

First, just from trying this out: I assume your input CSV file never has a header? If it does, you'll need a slight code change so kdb+ is aware of it.
You are correct that it's a memory issue, so the first thing to try is decreasing the chunk size. You are currently reading 4194000000 bytes (about 4 GB) at a time; lower that in line with the memory you have available.
If you are still seeing issues after that, it may be your garbage collection behaviour; you could force a collection after each read/upsert:
.Q.fsn[{f2 upsert flip columns!("DSSF";",")0:x;.Q.gc[]};f1;4194000000]

Related

Convert 10000-element Vector{BitVector} into a matrix of 1 and 0 and save it in Julia

I have a "10000-element Vector{BitVector}", each of those vector has a fixed length of 100 and I just want to save it into a csv file of 0 and 1 that is all. When I type my variable I almost see the kind of output I want in my csv file.
Amongst the many things I have tried, the closest to success was:
CSV.write("\\Folder\\file.csv", Tables.table(variable), writeheader=false)
But my CSV file has 10000 rows and 1 column, where each row is something like Bool[0,1,0,0,1,1,0,1,0].
This is not the most efficient option, but I hope it will be good enough for you; it is relatively simple and does not require any packages:
open("out.csv", "w") do io
foreach(v -> println(io, join(Int8.(v), ',')), variable)
end
(the Int8 part is needed to make sure 1 and 0 are printed and not true and false)

Octave - dlmread and csvread convert the first value to zero

When I try to read a CSV file in Octave, I find that the very first value in it is converted to zero. I tried both csvread and dlmread and I receive no errors. I am able to open the file in a plain text editor and I can see the correct value there. From what I can tell, there are no funny hidden characters, spacings, or similar in the CSV file. The files also contain only numbers. The only thing that I feel might be important is that I have five columns/groups that each have a different number of values in them.
I went through the commands' documentation on Octave Forge and I do not know what may be causing this. Does anyone have an idea what I can troubleshoot?
To try to illustrate the issue, if I try to load a file with the contents:
1.1,2.1,3.1,4.1,5.1
,2.2,3.2,4.2,5.2
,2.3,3.3,4.3,
,,3.4,4.4
,,3.5,
Command window will return:
0.0,2.1,3.1,4.1,5.1
,2.2,3.2,4.2,5.2
,2.3,3.3,4.3,
,,3.4,4.4
,,3.5,
(with additional trailing zeros after the decimal point).
Command syntaxes I'm using are:
dt = csvread("FileName.csv")
and
dt = dlmread("FileName.csv",",")
and they both return the same.
Your CSV file contains a byte order mark right before the first number. You can confirm this by opening the file in a hex editor: you will see the sequence EF BB BF before the numbers start.
This causes the first entry to be interpreted as a 'string', and since strings are parsed based on whether there are numbers at the 'front' of the string sequence, this one is parsed as the number zero (see also this answer for more details on how CSV entries are parsed).
In my text editor, if I place the cursor at the top left of the file and press the right arrow key once, the cursor doesn't appear to move (it has just passed over the invisible byte order mark, which takes no visible space). Pressing backspace at that point to delete the byte order mark allows the CSV to be read properly. Alternatively, you may have to fix your file in a hex editor, or find some other way to convert it to a proper ASCII file (or UTF-8 without the byte order mark).
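If editing files by hand isn't practical (say you have many of them), a small script can strip the mark before Octave ever sees the file. A minimal sketch in Python (the input name follows your example; the output name is just a placeholder):
# Strip a UTF-8 byte order mark (EF BB BF) from the start of a CSV file, if present.
with open("FileName.csv", "rb") as f:
    data = f.read()

bom = b"\xef\xbb\xbf"
if data.startswith(bom):
    with open("FileName_clean.csv", "wb") as f:
        f.write(data[len(bom):])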
Also, it may be worth checking how this file was produced; if you have any control over that process, perhaps you can find out why this mark was placed there in the first place and prevent it. E.g., if the file was exported from Excel, you can choose the plain 'csv' format instead of 'utf-8 csv'.
UPDATE
In fact, this issue seems to have already been submitted as a bug and fixed in the development branch of Octave. See bug #58813 :)

Problem when trying to load data with textscan

I want to load a set of data that contains variables of type string and float, but when I use textscan in Octave my data doesn't load. I get a 1x6 cell array (I have 6 features), but the cells in it contain nothing (they are 0x1).
My code:
filename='data1.txt';
fileID = fopen(filename,'r');
data = textscan(fileID,'%f %s %s %f %f %s','Delimiter',',');
fclose(fileID);
When I, for example, try data(1):
>> data(1)
ans =
{
[1,1] = [](0x1)
}
>>
There it is, and here is my data set.
Also, my fileID isn't -1.
I have been searching the internet for a problem like this, but I couldn't find anything.
I tried deleting the headers in the data and using a smaller training set, but it doesn't work.
Please help.
Don't use textscan. textscan is horrible, and one should only use it to parse data when there's no better way available.
Your data is a standard csv file. Just use csv2cell from the io package.

Opening a file of varying row and column structure in Scilab

I habitually use csvRead in Scilab to read my data files; however, I am now faced with one which contains blocks of 200 rows, each preceded by 3 lines of headers, all of which I would like to take into account.
I've tried specifying a range of data following the example on the Scilab help page for csvRead (the example is right at the bottom of the page: https://help.scilab.org/doc/6.0.0/en_US/csvRead.html), but I always end up with the same error messages:
The line and/or column indices are outside of the limits
or
Error in the column structure.
My first three lines are headers, which I know can cause a problem, but even if I omit them from my block range, I still have the same problem.
Otherwise, my data is ordered such that I have my three lines of headers (two lines containing a header over just one or two columns, one line containing a header over all columns), 200 lines of data, and a blank line. This represents the data from one image, and I have about 500 images in the file. I would like to be able to read and process all of them and keep track of the headers, because they state the image number, which I need to reference later. Example:
DTN-dist_Devissage-1_0006_0,,,,,,
L0,,,,,,
X [mm],Y [mm],W [mm],exx [1] - Lagrange,eyy [1] - Lagrange,exy [1] - Lagrange,Von Mises Strain [1] - Lagrange
-1.13307,-15.0362,-0.00137507,7.74679e-05,8.30045e-05,5.68249e-05,0.00012711
-1.10417,-14.9504,-0.00193334,7.66086e-05,8.02914e-05,5.43132e-05,0.000122655
-1.07528,-14.8647,-0.00249155,7.57493e-05,7.75786e-05,5.18017e-05,0.0001182
Does anyone have a solution to this?
My current code, following an adapted version of the Scilab help example, looks like this (I have tried varying the blocksize and iblock values to include/omit the headers):
blocksize=200;
C1=1;
C2=14;
iblock=1
while (%t)
    R1=(iblock-1)*blocksize+4;
    R2=blocksize+R1-1;
    irange=[R1 C1 R2 C2];
    V=csvRead(filepath+filename,",",".","",[],"",irange);
    iblock=iblock+1
end
The CSV
A lot of your problem comes from the inconsistent number of commas in your CSV file. Opening it in LibreOffice Calc and saving it puts the right number of commas on every line, even empty ones.
R1
Your current code doesn't position R1 at the beginning of the values. The right formula is
R1=(iblock-1)*(blocksize+blanksize+headersize)+1+headersize;
where blanksize=1 accounts for the blank line after each block: with headersize=3 and blocksize=200, block 2 starts at row (2-1)*(200+1+3)+1+3 = 208, whereas your formula gives 204.
End of file
Currently your code raises an error at the end of the file, because R1 becomes greater than the number of lines. To solve this, you can specify the maximum number of blocks, or test the value of R1 against the number of lines.
Improved solution for a much bigger file
When solving your problem with a big file, two issues came up:
We need to know the number of blocks or the number of lines.
Each call of csvRead is really slow, because it processes the whole file on each call (about 1 s per block!).
My idea was to read the whole file and store it in a string matrix (mgetl has been improved since 6.0.0), then use csvTextScan on sub-matrices. Doing so also removes the need to hard-code the number of blocks/lines.
The code follows:
clear all
clc
s = filesep()
filepath='.'+s;
filename='DTN_full.csv';
// the header is important as it has the image name
headersize=3;
blocksize=200;
C1=1;
C2=14;
iblock=1
// let's save everything; good for the example
bigstruct = struct();
// Read all the values in one pass;
// using csvTextScan afterwards is much more efficient
text = mgetl(filepath+filename);
nlines = size(text,'r');
while ( %t )
    mprintf("Block #%d",iblock);
    // Let's read the header
    R1=(iblock-1)*(headersize+blocksize+1)+1;
    R2=R1 + headersize-1;
    // if R1 or R2 is bigger than the number of lines, stop
    if sum([R1,R2] > nlines )
        mprintf('; End of file\n')
        break
    end
    // We use csvTextScan only on the lines that matter;
    // this speeds up the program, since csvRead reads the whole file
    // every time it is used.
    H=csvTextScan(text(R1:R2),",",".","string");
    mprintf("; %s",H(1,1))
    R1 = R1 + headersize;
    R2 = R1 + blocksize-1;
    if sum([R1,R2]> nlines )
        mprintf('; End of file\n')
        break
    end
    mprintf("; rows %d to %d\n",R1,R2)
    // Let's read the values
    V=csvTextScan(text(R1:R2),",",".","double");
    iblock=iblock+1
    // Let's save these data
    bigstruct(H(1,1)) = V;
end
and returns
Block #1; DTN-dist_0005_0; rows 4 to 203
....
Block #178; DTN-dist_0710_0; rows 36112 to 36311
Block #179; End of file
Time elapsed 1.827092s

Entry delimiter of JSON files for Hive table

We are collecting JSON data (public social media posts in particular) via REST API invocations, which we plan to dump into HDFS and then abstract a Hive table on top of it using a SerDe. I wonder, though, what would be the appropriate delimiter per JSON entry in a file? Is it a newline ("\n")? So it would look like this:
{ id: entry1 ... post: }
{ id: entry2 ... post: }
...
{ id: entryn ... post: }
What if we encounter a newline character within the JSON data itself, for example in post?
The best way would be one record per line, separated by "\n" exactly as you guessed.
This also means that you should be careful to escape "\n" that may be inside the JSON elements.
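As an illustration, here is a minimal sketch in Python (assuming Python, or something similar, produces the dump; the id/post fields follow your example and posts.json is just a placeholder name). Any standard JSON serializer already does this escaping for you, as long as you write one serialized object per line:
import json

records = [
    {"id": "entry1", "post": "first line\nsecond line"},  # raw newline inside the post
    {"id": "entry2", "post": "a single-line post"},
]

with open("posts.json", "w") as out:
    for rec in records:
        # json.dumps emits a single line: the embedded newline is written as the
        # two characters \n, so the only literal newlines in the file are the
        # record separators added here.
        out.write(json.dumps(rec) + "\n")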
Indented JSON won't work well with Hadoop/Hive, since to distribute processing, Hadoop must be able to tell where a record ends, so that it can split a file of N bytes across W workers into W chunks of size roughly N/W.
The splitting is done by the particular InputFormat being used; in the case of text, that's TextInputFormat.
TextInputFormat will basically split the file at the first instance of "\n" found after byte i*N/W (for i from 1 to W-1).
For this reason, having other "\n" characters around would confuse Hadoop, and it will give you incomplete records.
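To see why, here is a rough simulation in Python (illustrative only, not Hadoop code; the record contents and the two-way split are invented for the example) of a reader that starts each split at the first "\n" at or after its byte offset:
def naive_splits(data: bytes, workers: int):
    # Start split i at the first newline found at or after byte offset i*N/W.
    n = len(data)
    starts = [0]
    for i in range(1, workers):
        nl = data.find(b"\n", i * n // workers)
        starts.append(n if nl == -1 else nl + 1)
    ends = starts[1:] + [n]
    return [data[a:b] for a, b in zip(starts, ends)]

# Two records, one per line, but the first record contains an unescaped newline:
blob = (
    b'{"id": "entry1", "post": "' + b"x" * 40 + b'\nmore"}\n'
    + b'{"id": "entry2", "post": "ok"}\n'
)

for chunk in naive_splits(blob, 2):
    print(repr(chunk))
# The second chunk begins in the middle of record 1 (b'more"}...'), which is
# exactly the incomplete-records problem described above.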
As an alternative (I wouldn't recommend it, but if you really wanted to), you could use a character other than "\n" by configuring the property "textinputformat.record.delimiter" when reading the file through Hadoop/Hive, using a character that won't appear in the JSON (for instance \001, i.e. CTRL-A, which is commonly used by Hive as a field delimiter). That can be tricky, though, since it also has to be supported by the SerDe.
Also, if you change the record delimiter, anybody who copies or uses the file on HDFS must be aware of it, or they won't be able to parse it correctly and will need special code to do so; if you keep "\n" as the delimiter, the files remain normal text files and can be used by other tools.
As for the SerDe, I'd recommend this one, with the disclaimer that I wrote it :)
https://github.com/rcongiu/Hive-JSON-Serde