Format requirements for reading csv files into q/kdb+


(I'm using 32 bit KDB+ 3.3 on OS X.)
If I copy and paste the iris dataset into Excel and save it as "MS-DOS Comma Separated (.csv)" and read it into kdb+, I get this:
q)("FFFFS";enlist ",")0:`iris.csv
5.1al Length Sepal Width Petal Length Petal Width Species
-------------------------------------------------------------
If I save it as "Windows Comma Separated (.csv)", I get this:
q)("FFFFS";enlist ",")0:`iris.csv
Sepal Length Sepal Width Petal Length Petal Width Species
---------------------------------------------------------
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
..
Obviously saving as a Windows csv is what I need to do, and this answer explains the differences, but why does this matter for kdb+? And is there an option I can add to the code to read in MS-DOS csv files?

I'm running on Windows rather than OS X, so I can only reproduce the opposite problem, but the cause is the same either way.
Use read0 to see the difference. In my case:
q)read0 `:macintosh.csv
"col1,col2\ra,1\rb,2\rc,3"
q)read0 `:msdos.csv
"col1,col2"
"a,1"
"b,2"
"c,3"
In order to use 0: to parse the file as a table, kdb+ expects a list of strings, one per line (as in my msdos file), rather than a single string in which the line breaks (bare \r characters here) weren't recognised.
So I get:
q)("SI";enlist ",")0:`:msdos.csv
col1 col2
---------
a 1
b 2
c 3
q)("SI";enlist ",")0:`:macintosh.csv
aol1 col2
-----------
You could put something in your code to recognise the situation and handle it accordingly, but it would be slower:
q)("SI";enlist ",")0:{$[1=count x;"\r" vs first x;x]}read0 `:msdos.csv
col1 col2
---------
a 1
b 2
c 3
q)("SI";enlist ",")0:{$[1=count x;"\r" vs first x;x]}read0 `:macintosh.csv
col1 col2
---------
a 1
b 2
c 3
This works either way.
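Another option is to normalise the line endings before kdb+ reads the file at all. The following is a minimal Python sketch of that idea, not part of the original answer; the function names and any file names you pass in are hypothetical, and it assumes the file is small enough to read into memory:

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CRLF and bare CR line endings to LF."""
    # Replace CRLF first so that it is not turned into two line feeds.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(src: str, dst: str) -> None:
    """Rewrite src with LF line endings and save the result as dst."""
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(normalize_newlines(data))
```

After running convert_file on the problem file, read0 should return one string per line, which is the shape 0: needs.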

Related

Extract CSV from plotly plot

I have a .html file which is a plot made with Plotly. Is there an easy/already implemented way of creating a CSV with the data from this plot?
For example consider this plot (Python):
import plotly.express as px
df = px.data.iris() # iris is a pandas DataFrame
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="petal_length", facet_col='species')
fig.write_html('plot.html',include_plotlyjs='cdn')
where df looks like this
sepal_length sepal_width petal_length petal_width species species_id
0 5.1 3.5 1.4 0.2 setosa 1
1 4.9 3.0 1.4 0.2 setosa 1
2 4.7 3.2 1.3 0.2 setosa 1
3 4.6 3.1 1.5 0.2 setosa 1
4 5.0 3.6 1.4 0.2 setosa 1
.. ... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica 3
146 6.3 2.5 5.0 1.9 virginica 3
147 6.5 3.0 5.2 2.0 virginica 3
148 6.2 3.4 5.4 2.3 virginica 3
149 5.9 3.0 5.1 1.8 virginica 3
[150 rows x 6 columns]
and plot.html contains this plot:
If you open the .html file with a text editor you will find all the data shown in the plot spread out in some kind of dictionary or so, probably JavaScript? How can one recover a CSV as close as possible to df from this?
Not looking for a Python exclusive answer, it can be anything. However Python is preferred.

The datasets used by Plotly are available in their dedicated repository: plotly/datasets.
The formats used are mainly CSV, JSON and GeoJSON. The Iris data happens to be stored there in CSV format (iris.csv), so if you need the whole set you can grab it from there.
Otherwise you can always use df.to_csv().
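If only the .html file is available, the plotted values can usually be recovered from it, because the file contains a call of the form Plotly.newPlot(id, traces, layout, config) with the traces written as JSON. The sketch below is an illustration under that assumption, using only the standard library; the function names are hypothetical. It also assumes each trace stores x and y as plain JSON lists, which may not hold for files written by newer Plotly versions that can encode numeric arrays in a binary form. It recovers only what was plotted, not the full df.

```python
import csv
import json


def extract_traces(html: str) -> list:
    """Return the list of trace dicts passed to Plotly.newPlot in the HTML."""
    start = html.index("Plotly.newPlot(")
    # The second argument of the call is the JSON array of traces,
    # so the first "[" after the call name marks where it begins.
    array_start = html.index("[", start)
    traces, _ = json.JSONDecoder().raw_decode(html, array_start)
    return traces


def traces_to_csv(traces: list, path: str) -> None:
    """Write one row per plotted point: trace name, x, y."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["trace", "x", "y"])
        for trace in traces:
            for x, y in zip(trace["x"], trace["y"]):
                writer.writerow([trace.get("name", ""), x, y])
```

Typical use would be to read plot.html into a string, call extract_traces on it, and pass the result to traces_to_csv.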

Grafana: Combination of variables (Athena Dataset)

Goal: I have an Athena dataset that is visualized with Grafana. I want to create several variables so that I can precisely select individual areas. The test data has a format similar to this:
Time SensorID Location Measurement
/ 1 Berlin 12.1
/ 2 London 14.0
/ 3 NewYork 23.3
/ 3 Sydney 45.1
/ 2 London 1.3
/ 1 NewYork 17.3
/ 2 Berlin 18.9
/ 3 Sydney 4.8
I now want two variables so that I can select SensorID and Location at the same time. For example, if I select SensorID = 1 and Location = Berlin, the measurement shown in my Grafana graph should be 12.1.
Is there a way to do this? The syntax of the Athena plugin is very new to me, even though it is similar to MySQL. I tried to write the query myself, but it doesn't work (see the pictures below):
Creation of the first variable
Creation of the panel function for the different variables
I would really appreciate possible solutions or help with the Athena syntax :)

Legend does not work for scatter plot in Octave

I am trying to draw a scatter plot in Octave 5.1.0 and the legend sometimes does not work.
I import data from external files, scatter some part of data and add a legend. Only a horizontal line is displayed instead of a full legend box. I don't understand why, since I created a similar plot several weeks ago with a different data set, and it worked correctly.
It works with fltk, but not with gnuplot. However, I specifically need gnuplot in order to use Russian characters.
clf
graphics_toolkit ("gnuplot")
set (0, "defaultaxesfontname", "Arial")
load cryo.dat
load hy2a.dat
load sent.dat
load saral.dat
load ja2.dat
load ja3.dat
subplot(3,1,1)
hold on
scatter(cryo(:,1),cryo(:,2),40,[0.6 0.6 0.6],'s','filled')
legend("CRYOSAT2","location","northeast")
The first several lines of the cryo.dat file:
57754.985 0.82
57755.999 0.96
57756.999 0.93
57757.999 1.04
57758.999 0.83
57759.999 0.97
57760.999 0.9
57761.999 0.93
57762.999 0.93
57763.999 0.96
57764.999 0.94
57765.999 0.95
57766.999 0.94
57767.999 0.86
57768.999 0.92
57769.999 0.97
57770.999 0.97
57771.999 0.98
57772.999 0.88
57773.999 0.84
57774.999 0.92
57775.999 0.85
57776.999 0.9
I am also able to reproduce it with the rand function:
test(:,1) = rand(100,1)
test(:,2) = rand(100,1)
subplot(3,1,1)
hold on
scatter(test(:,1),test(:,2),40,[0.6 0.6 0.6],'s','filled')
legend('test','location','northeastoutside')
grid on

Tesseract Training - new font with only digits

Hello, I am trying to train Tesseract for a new font based on the following digits:
All digits are provided in a PNG file with a transparent background. If I create a box file from it, train on it and so on, everything works fine!
Now the problem: same situation, but I want to train Tesseract based on the following image:
As you can see, the digits and their positions are exactly the same. The only difference from image 1 is that I used a yellow background, and now nothing works anymore. I created a box file and set the same positions as for the first image:
0 5 4 20 22 0
1 27 4 38 21 0
2 48 4 60 22 0
3 71 3 83 22 0
4 94 5 109 22 0
5 119 5 131 22 0
6 143 5 157 22 0
7 172 5 184 22 0
8 197 5 211 23 0
9 224 5 238 22 0
Then I trained on the box file, but the resulting .tr file is completely empty. I didn't stop there and completed all the other steps, but the resulting font is unusable!
So my question is: how do I train Tesseract to recognize these digits no matter which background is used?
Edit 2016-04-16:
I used ImageMagick to preprocess the images and found a command that works very well for all kinds of backgrounds. So I wanted to train Tesseract on the resulting images, but it doesn't work as I thought it would.
First I created box files, most of which came out empty. I then used a website to arrange the character positions and spent a lot of time making the cropping perfect. Afterwards I created the .tr files and did the remaining steps to train Tesseract.
Finally I got the "traineddata" file, moved it to the "tessdata" directory of Tesseract and used it as intended:
tesseract example.jpg output -l mg
(I called the new font "mg".)
However, it fails to recognize all or most of the digits! I opened this thread to find help, but so far nobody seems to know how to do this, sadly. Please help me out.
The complete set of Tesseract training files that I used and created can be found here:
Tesseract training directory (as no zip/not compressed -> view of all files of the directory)

You can convert any color image to a binary (black-and-white) image and then run Tesseract on it. That way the result no longer depends on the background color you use.
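As an illustration of what such a conversion does, here is a minimal pure-Python sketch of fixed-threshold binarization. It is a sketch under assumptions, not the answerer's method: the function name, the threshold of 128 and the representation of the image as rows of (R, G, B) tuples are all hypothetical choices, and it assumes the digits are darker than the background. In practice an imaging library or the ImageMagick preprocessing mentioned in the question would do this step.

```python
def to_binary(pixels, threshold=128):
    """Map rows of (R, G, B) pixels to rows of 0 (black) or 255 (white)."""
    binary = []
    for row in pixels:
        out_row = []
        for r, g, b in row:
            # ITU-R BT.601 luma weights give a perceptual grayscale value.
            gray = 0.299 * r + 0.587 * g + 0.114 * b
            # Bright pixels (e.g. a yellow background) become white,
            # dark pixels (assumed to be the digits) become black.
            out_row.append(255 if gray >= threshold else 0)
        binary.append(out_row)
    return binary
```

With this mapping a yellow pixel and a white pixel both end up as 255, so the background color stops mattering as long as it is brighter than the threshold.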

Weka Decision Tree

I am trying to use weka to analyze some data. I've got a dataset with 3 variables and 1000+ instances.
The dataset references movie remakes and
how similar they are (0.0-1.0)
the difference in years between the movie and the remake
and lastly if they were made by the same studio (yes or no)
I am trying to make a decision tree to analyze the data. Using J48 (because that's all I have ever used), I only get one leaf. I'm assuming I'm doing something wrong. Any help is appreciated.
Here is a snippet from the data set:
Similarity YearDifference STUDIO TYPE
0.5 36 No
0.5 9 No
0.85 18 No
0.4 10 No
0.5 15 No
0.7 6 No
0.8 11 No
0.8 0 Yes
...
If interested the data can be downloaded as a csv here http://s000.tinyupload.com/?file_id=77863432352576044943

Your data set is not balanced, because there are almost 5 times more "No" than "Yes" values for the class attribute. That's why the J48 tree is actually just one leaf that classifies everything as "No". You can do one of these things:
Sample your data set so that you have an equal number of No and Yes instances.
Try using a better classification algorithm, e.g. Random Forest (it's located a few entries below J48 in the Weka Explorer GUI).
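The first option can be done outside Weka before loading the CSV. Below is a minimal Python sketch of random undersampling, assuming the rows are already in memory as tuples or lists with the class label in a known column; the function name and the seed parameter are hypothetical. Note that it discards majority-class rows, so with few "Yes" instances the balanced set may become small.

```python
import random


def undersample(rows, label_index, seed=0):
    """Randomly drop rows so that every class has as many rows as the rarest one."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_index], []).append(row)
    smallest = min(len(group) for group in by_class.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced
```

For the data above, label_index would be 2 (the Yes/No column), and the result can be written back to a CSV file for Weka.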
