I'd like to convert a file of float vector data written in CSV format, consisting of 3 million rows and 150 columns like the following, into NetCDF format.
0.3,0.9,1.3,0.5,...,0.9
-5.1,0.1,1.0,8.4,...,6.7
...
First, I tried a cache-all-the-data-and-then-convert-it approach, but it didn't work because the memory for the cache could not be allocated.
So I need code that converts the data one row at a time.
Does anyone know of such a solution?
My machine has 8 GiB of memory, and any programming language such as C, Java, or Python is fine.
With Python you can read the file line by line:
with open("myfile.csv") as infile:
    for line in infile:
        appendtoNetcdf(line)
So you don't have to load all the file contents into memory.
Check out the netCDF4-python library: with it you can easily create one netCDF4 file, or many of them.
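For illustration, a minimal sketch of that approach with netCDF4 (untested; the output file name, variable name, and column count are placeholders for your data):

from netCDF4 import Dataset

ncols = 150
out = Dataset("output.nc", "w", format="NETCDF4")
out.createDimension("row", None)                 # unlimited, so rows can be appended
out.createDimension("col", ncols)
var = out.createVariable("data", "f4", ("row", "col"))

with open("myfile.csv") as infile:
    for i, line in enumerate(infile):
        var[i, :] = [float(x) for x in line.split(",")]   # write one row at a time

out.close()

Writing one row at a time keeps the memory footprint tiny; if it turns out to be slow, buffering a few thousand rows per assignment would cut the per-call overhead without changing the approach.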
I am trying to train an XGBoost model using its external-memory version, which takes a libsvm file as the training set. Right now all the data is stored in a bunch of CSV files which, combined, are much larger than the memory I have, say 70 GB (any single one of them can easily be read on its own). I just wonder how to create one large libsvm file for XGBoost, or whether there is any other workaround for this. Thank you.
If your csv files do not have headers, you can combine them with the Unix cat command.
Example:
> ls
file1.csv file2.csv
> cat *.csv > combined.csv
Now combined.csv is the concatenation of all the other files.
If all your csv files have headers you'll want to do something trickier, like dropping each file's header line with tail (see the sketch below).
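Here is a minimal Python sketch of that idea (the glob pattern and output file name are assumptions about your file names):

import glob

with open("combined.csv", "w") as out:
    for i, path in enumerate(sorted(glob.glob("file*.csv"))):
        with open(path) as f:
            header = f.readline()        # every file starts with the same header
            if i == 0:
                out.write(header)        # keep the header only once
            for line in f:               # stream the remaining lines
                out.write(line)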
XGBoost supports csv as an input.
If you want to convert that to libsvm regardless, you can use phraug's scripts.
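If you would rather roll your own conversion, a hedged sketch of a streaming CSV-to-libsvm pass could look like the following; it assumes headerless files with the label in the last column, which you would need to adjust to your layout:

import csv
import glob

with open("combined.libsvm", "w") as out:
    for path in sorted(glob.glob("*.csv")):
        with open(path) as f:
            for row in csv.reader(f):
                label, features = row[-1], row[:-1]
                # libsvm feature indices start at 1; zero features could be
                # skipped to keep the file sparse, but that is omitted here
                pairs = " ".join("%d:%s" % (i + 1, v) for i, v in enumerate(features))
                out.write("%s %s\n" % (label, pairs))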
We can load a large csv file in row-chunks with, for example, the following:
import pandas as pd

# gives a TextFileReader, which is iterable in chunks of 1000 rows
tp = pd.read_csv('large_dataset.csv', iterator=True, chunksize=1000)
One might argue that usecols is the solution; however, in my experience usecols is noticeably slower than chunksize, presumably because the entire file is still read into memory when usecols is used, whereas chunksize iterates through the file instead.
How could we load a large csv file as column-chunks?
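For illustration, one possible column-slice loop is sketched below; the total column count and slice width are assumptions, and note that the whole file is still parsed on every pass:

import pandas as pd

n_cols = 150          # assumed total number of columns
width = 10            # columns per slice

for start in range(0, n_cols, width):
    cols = list(range(start, min(start + width, n_cols)))
    # usecols keeps only these columns in memory; add header=None if the file has no header row
    block = pd.read_csv('large_dataset.csv', usecols=cols)
    # ... process block ...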
I am trying to import into Octave a file (e.g. data.txt) containing 2 columns of integers, such as:
101448,1077
96906,924
105704,1017
I use the following command:
data = load('data.txt')
However, the resulting "data" matrix has dimensions 1 x 1, with all the content of the data.txt file stored in a single cell. If I adjust the numbers to look like floats:
101448.0,1077.0
96906.0,924.0
105704.0,1017.0
the loading works as expected, and I obtain a matrix with 3 rows and 2 columns.
I looked at the various options that can be set for the load command but none of them seem to help. The data file has no headers, just plain integers, comma separated.
Any suggestions on how to load this type of data? How can I force Octave to cast the data as numeric?
The load function is not meant to read csv files. It is meant to load files saved from Octave itself, which define variables.
To read a csv file, use csvread ("data.txt"). Also, 3.2.4 is a very old version that is no longer supported; you should upgrade.
My data set contains 1,300,000 observations with 56 columns. It is a .csv file and I'm trying to import it using PROC IMPORT. After importing, I find that only 44 out of the 56 columns are imported.
I tried increasing GUESSINGROWS but it is not helping.
P.S.: I'm using SAS 9.3.
If (and, as far as I am aware, only in that case) you specify the file to load in a FILENAME statement, you have to set the LRECL option to a value that is large enough.
If you don't, the default is only 256, so if your csv has lines longer than 256 characters, SAS will not read the full line.
See this link for more information (just search for lrecl): https://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000308090.htm
If you have SAS Enterprise Guide (I think it's now included with all desktop licenses) try out the import wizard. It's excellent. And it will generate code you can reuse with a little editing.
It will take a while to run because it will read your entire file before writing the import logic.
I want to access every value (~10,000 per file) in ~1,000 .txt files stored across ~20 directories, in the most efficient manner possible. Once the data is grabbed I would like to place it in an HTML string, so that I can display an HTML page with a table for each file. Pseudo-code:
fh = open('MyHtmlFile.html', 'w')
fh.write('''<head>Lots of tables</head><body>''')
for eachDirectory in rootFolder:
    for eachFile in eachDirectory:
        concat = ''
        for eachData in eachFile:
            concat = concat + '<tr><td>%s</td></tr>' % eachData
        table = '''
        <table>%s</table>
        ''' % (concat)
        fh.write(table)
fh.write('''</body>''')
fh.close()
There must be a better way (I imagine this would take forever)! I've checked out set() and read a bit about hash tables, but I'd rather ask the experts before the hole is dug.
Thank you for your time!
/Karl
import os, os.path

# If you're on Python 2.5 or newer, use 'with'
# (needs 'from __future__ import with_statement' on 2.5)
fh = open('MyHtmlFile.html', 'w')
fh.write('<html>\r\n<head><title>Lots of tables</title></head>\r\n<body>\r\n')

# os.walk will recursively descend the tree
for dirpath, dirnames, filenames in os.walk(rootFolder):
    for filename in filenames:
        # again, use 'with' on Python 2.5 or newer
        infile = open(os.path.join(dirpath, filename))
        # format each line as a table row, join the rows, then wrap them in a table
        # (if you're on Python 2.6 or newer you could use 'str.format' instead)
        fh.write('<table>\r\n%s\r\n</table>' %
                 '\r\n'.join('<tr><td>%s</td></tr>' % line for line in infile))
        infile.close()

fh.write('\r\n</body></html>')
fh.close()
Why do you "imagine it would take forever"? You are reading the files and then printing them out; that is pretty much the only requirement you have stated, and that's all you're doing.
You could tweak the script in a couple of ways (read blocks instead of lines, adjust buffers, print directly instead of concatenating, etc.), but if you don't know how much time it takes now, how do you know which change is better or worse?
Profile first, then find out whether the script is too slow, then find the place where it's slow, and only then optimise (or ask about optimisation).
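For example, a quick profiling pass with the standard library might look like this (it assumes the script's work is wrapped in a main() function, which is a placeholder name):

import cProfile
import pstats

cProfile.run("main()", "profile.out")     # "main" stands in for your script's entry point
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(20)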