I have a .html file which is a plot made with Plotly. Is there an easy/already implemented way of creating a CSV with the data from this plot?
For example consider this plot (Python):
import plotly.express as px
df = px.data.iris() # iris is a pandas DataFrame
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="petal_length", facet_col='species')
fig.write_html('plot.html',include_plotlyjs='cdn')
where df looks like this
sepal_length sepal_width petal_length petal_width species species_id
0 5.1 3.5 1.4 0.2 setosa 1
1 4.9 3.0 1.4 0.2 setosa 1
2 4.7 3.2 1.3 0.2 setosa 1
3 4.6 3.1 1.5 0.2 setosa 1
4 5.0 3.6 1.4 0.2 setosa 1
.. ... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica 3
146 6.3 2.5 5.0 1.9 virginica 3
147 6.5 3.0 5.2 2.0 virginica 3
148 6.2 3.4 5.4 2.3 virginica 3
149 5.9 3.0 5.1 1.8 virginica 3
[150 rows x 6 columns]
and plot.html contains this plot:
If you open the .html file with a text editor you will find all the data shown in the plot spread out in some kind of dictionary, probably JavaScript. How can one recover a CSV as close as possible to df from this?
I'm not looking for a Python-exclusive answer; it can be anything, although Python is preferred.
Datasets used by Plotly are available in their dedicated repository: plotly/datasets.
The formats used are mainly CSV, JSON and GeoJSON. The Iris data happen to be in CSV format originally (iris.csv), so if you need the whole set you can grab it from there.
Otherwise you can always use df.to_csv().
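If you only have the .html file, note that fig.write_html embeds the figure as JSON inside a Plotly.newPlot(...) call, so you can pull the traces back out with the standard library. A sketch (the extract_traces helper and its name are mine, not part of Plotly):

```python
import json

def extract_traces(html_text):
    """Pull the trace data back out of a Plotly HTML export.

    fig.write_html embeds the figure as a
    Plotly.newPlot(<div-id>, <traces>, <layout>, ...) call;
    the list of traces is the first '[' after that call, and
    raw_decode parses exactly one JSON value and stops there.
    """
    start = html_text.index("Plotly.newPlot(")
    bracket = html_text.index("[", start)
    traces, _ = json.JSONDecoder().raw_decode(html_text[bracket:])
    return traces
```

Each trace is then a dict with the plotted arrays (x, y, marker colours, the facet/legend name), which you can zip into CSV rows with the csv module. Keep in mind you can only recover the columns that were actually plotted, not the whole original df.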
I have a partitioned dataset stored on an internal S3 cloud, which I am reading with the pyarrow dataset API:
import pyarrow.dataset as ds
my_dataset = ds.dataset(ds_name, format="parquet", filesystem=s3file, partitioning="hive")
fragments = list(my_dataset.get_fragments())
required_fragment = fragments.pop()
The metadata from the required fragment shows the following:
required_fragment.metadata
<pyarrow._parquet.FileMetaData object at 0x00000291798EDF48>
created_by: parquet-cpp-arrow version 9.0.0
num_columns: 22
num_rows: 949650
num_row_groups: 29
format_version: 1.0
serialized_size: 68750
Converting this fragment to a table, however, takes a long time:
%timeit required_fragment.to_table()
6min 29s ± 1min 15s per loop (mean ± std. dev. of 7 runs, 1 loop each)
The size of the table itself is about 272 MB:
required_fragment.to_table().nbytes
272850898
Any ideas how I can speed up converting the fragment to a table?
Updates
So instead of pyarrow.dataset, I tried using pyarrow.parquet. The only part of my code that changed is:
import pyarrow.parquet as pq
my_dataset = pq.ParquetDataset(ds_name, filesystem=s3file, partitioning="hive", use_legacy_dataset=False)
fragments = my_dataset.fragments
required_fragment = fragments.pop()
When I tried the code again, the performance was much better:
%timeit required_fragment.to_table()
12.4 s ± 1.56 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
While I am happy with the better performance, it is still confusing: with use_legacy_dataset=False, both approaches should behave similarly under the hood.
PC Information
Installed RAM: 21.0GB
Software: Windows 10 Enterprise
Internet speed: 10 Mbps / 156 Mbps (download / upload)
S3 location: Asia
Goal: I have an Athena dataset that is visualized with Grafana. I want to create several variables so I can precisely select individual areas. The test data has a format similar to this one:
Time SensorID Location Measurement
/ 1 Berlin 12.1
/ 2 London 14.0
/ 3 NewYork 23.3
/ 3 Sydney 45.1
/ 2 London 1.3
/ 1 NewYork 17.3
/ 2 Berlin 18.9
/ 3 Sydney 4.8
I now want two variables so I can select the SensorID and Location at the same time. For example, if I select SensorID = 1 and Location = Berlin, the Measurement in my Grafana graph should be 12.1.
Is there a solution for this? The syntax for the Athena plugin is very new to me, even though it is similar to MySQL. I tried to write the queries, but they won't work for me (see the pictures below):
Creation of the first variable
Creation of the panel function for the different variables
I look forward to hearing about possible solutions or help with the Athena syntax :)
I am trying to draw a scatter plot in Octave 5.1.0 and the legend sometimes does not work.
I import data from external files, scatter some part of data and add a legend. Only a horizontal line is displayed instead of a full legend box. I don't understand why, since I created a similar plot several weeks ago with a different data set, and it worked correctly.
It also works with fltk, but not with gnuplot. However, I specifically need gnuplot in order to use Russian characters.
clf
graphics_toolkit ("gnuplot")
set (0, "defaultaxesfontname", "Arial")
load cryo.dat
load hy2a.dat
load sent.dat
load saral.dat
load ja2.dat
load ja3.dat
subplot(3,1,1)
hold on
scatter(cryo(:,1),cryo(:,2),40,[0.6 0.6 0.6],'s','filled')
legend("CRYOSAT2","location","northeast")
The first several lines of the cryo.dat file:
57754.985 0.82
57755.999 0.96
57756.999 0.93
57757.999 1.04
57758.999 0.83
57759.999 0.97
57760.999 0.9
57761.999 0.93
57762.999 0.93
57763.999 0.96
57764.999 0.94
57765.999 0.95
57766.999 0.94
57767.999 0.86
57768.999 0.92
57769.999 0.97
57770.999 0.97
57771.999 0.98
57772.999 0.88
57773.999 0.84
57774.999 0.92
57775.999 0.85
57776.999 0.9
I can also reproduce it with the rand function:
test(:,1) = rand(100,1)
test(:,2) = rand(100,1)
subplot(3,1,1)
hold on
scatter(test(:,1),test(:,2),40,[0.6 0.6 0.6],'s','filled')
legend('test','location','northeastoutside')
grid on
Hello, I am trying to train Tesseract for a new font based on the following digits:
All digits are provided in a PNG file with a transparent background. If I create a box file from it, train it and so on, everything works fine!
Now the problem: same situation, but I want to train Tesseract based on the following image:
As you can see, the digits are exactly the same, as are their positions and so on. The only difference from the first image is that I used a yellow background, and from now on nothing works anymore. I created a box file with the same positions as for the first image:
0 5 4 20 22 0
1 27 4 38 21 0
2 48 4 60 22 0
3 71 3 83 22 0
4 94 5 109 22 0
5 119 5 131 22 0
6 143 5 157 22 0
7 172 5 184 22 0
8 197 5 211 23 0
9 224 5 238 22 0
Then I ran the training, but the resulting .tr file is completely empty. I didn't stop there and completed all the other steps, but the resulting traineddata is unusable!
So my question is: how can I train Tesseract to recognize these digits no matter which background is used for them?
Edit 2016-04-16:
I used ImageMagick to preprocess the images and found a command which works very well for all kinds of backgrounds. So I wanted to train Tesseract on these preprocessed images, but it doesn't work as I thought it would.
First of all, most of the box files I created were empty. I used a website to organize the character positions and spent a lot of time making the cropping perfect! Afterwards I created the resulting .tr files and did the other steps to train Tesseract.
Finally I got the "traineddata" file, moved it to Tesseract's "tessdata" directory and used it as it should be used:
tesseract example.jpg output -l mg
(I called the new font "mg".)
However, it doesn't recognize most of the digits! I opened this thread to find help; so far nobody really has a clue how to do this, sadly. Please help me out.
All the Tesseract training files I used and created can be found here:
Tesseract training directory (uncompressed, so you can view all files in the directory)
You can convert any color image to a binary image and then run Tesseract on it; that way, no matter what background color is used, you will always get the same result.
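A minimal sketch of that preprocessing step, assuming Pillow is available (the binarize helper and its default threshold are mine, chosen for illustration):

```python
from PIL import Image

def binarize(img, threshold=128):
    # Greyscale first, then map every pixel to pure black or white,
    # so the original background colour no longer matters to Tesseract.
    grey = img.convert("L")
    return grey.point(lambda p: 255 if p > threshold else 0)

# Example: a yellow background with one black "ink" pixel.
img = Image.new("RGB", (2, 2), (255, 255, 0))
img.putpixel((0, 0), (0, 0, 0))
bw = binarize(img)
print(bw.getpixel((0, 0)), bw.getpixel((1, 1)))  # 0 255
```

Saving the result with bw.save("binary.png") and feeding that file to Tesseract (both for training and recognition) keeps the input consistent across backgrounds.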
(I'm using 32 bit KDB+ 3.3 on OS X.)
If I copy and paste the iris dataset into Excel and save it as "MS-DOS Comma Separated (.csv)" and read it into kdb+, I get this:
q)("FFFFS";enlist ",")0:`iris.csv
5.1al Length Sepal Width Petal Length Petal Width Species
-------------------------------------------------------------
If I save it as "Windows Comma Separated (.csv)", I get this:
q)("FFFFS";enlist ",")0:`iris.csv
Sepal Length Sepal Width Petal Length Petal Width Species
---------------------------------------------------------
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
..
Obviously saving as a Windows csv is what I need to do, and this answer explains the differences, but why does this matter for kdb+? And is there an option I can add to the code to read in MS-DOS csv files?
I'm running on Windows rather than OS X, so I can only reproduce the opposite problem, but it'll be the same either way.
Use "read0" to see the difference. In my case:
q)read0 `:macintosh.csv
"col1,col2\ra,1\rb,2\rc,3"
q)read0 `:msdos.csv
"col1,col2"
"a,1"
"b,2"
"c,3"
In order to use 0: to parse the file as a table, kdb is expecting multiple strings (as in my msdos file) rather than that single string where the newlines weren't recognised.
So I get:
q)("SI";enlist ",")0:`:msdos.csv
col1 col2
---------
a 1
b 2
c 3
q)("SI";enlist ",")0:`:macintosh.csv
aol1 col2
-----------
You could put something in your code to recognise the situation and handle it accordingly but it would be slower and less efficient:
q)("SI";enlist ",")0:{$[1=count x;"\r" vs first x;x]}read0 `:msdos.csv
col1 col2
---------
a 1
b 2
c 3
q)("SI";enlist ",")0:{$[1=count x;"\r" vs first x;x]}read0 `:macintosh.csv
col1 col2
---------
a 1
b 2
c 3
Works either way
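If you'd rather repair the file once, outside q, than branch on every load, a small script can rewrite the classic-Mac \r line endings as \n. A sketch in Python (the function name and file paths are placeholders):

```python
def normalize_newlines(src, dst):
    # Read as raw bytes so no newline translation happens behind our back,
    # then rewrite Windows \r\n and classic-Mac \r endings as plain \n.
    with open(src, "rb") as f:
        data = f.read()
    data = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    with open(dst, "wb") as f:
        f.write(data)
```

After that, the plain ("FFFFS";enlist ",")0: call works on the file regardless of which CSV variant Excel produced.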