How can I import external prebuilt .LMDB files into NVIDIA DIGITS? - caffe

I have several databases and I need to run classification on them in NVIDIA DIGITS, but importing my big data into DIGITS takes a lot of time (2-4 days)!
Imagine I have converted two image sets into .lmdb form like this:
data1
--> folder train1_db: data.mdb, lock.mdb
--> folder val1_db: data.mdb, lock.mdb
--> mean.binaryproto
--> some other txt files...
data2
--> folder train2_db: data.mdb, lock.mdb
--> folder val2_db: data.mdb, lock.mdb
--> mean.binaryproto
--> some other txt files...
Now I need to concatenate these two .lmdb databases to save time. I have already done that separately in Python, following Merge two LMDB databases for feeding to the network (caffe),
and I now have a third dataset containing train_db and val_db folders, each holding data.mdb and lock.mdb files like above.
data3
--> folder train3_db: train1_db + train2_db
--> folder val3_db: val1_db + val2_db
I need to import these into DIGITS so that I can train a network on them.
My questions are:
1- Should I import the train_db and val_db folders in the image LMDB part?
2- I searched for label LMDB but I did not understand what I should do in this part. Could you please explain clearly what I should do?
Many thanks for your help.

You have to create the LMDBs in the same way DIGITS does. I read the databases DIGITS generates first, then created mine to match.
This works if you are changing an existing classification dataset with the same class structure. You do have to edit the pickle file to update the total number of images for both train and val in two places, and you have to generate the lmdb files just like DIGITS has them.
By the way, of course they don't recommend this.
Check out:
https://github.com/NVIDIA/DIGITS/issues/1035
Here is my code:
https://github.com/GemHunt/lmdb-testing/blob/master/create_lmdb_rotate_whole_image.py
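For reference, here is a minimal sketch of the kind of merge described above, using the Python lmdb package. The paths, the map_size, and the zero-padded re-keying are assumptions following the usual Caffe/DIGITS key convention, not code taken from the linked script:

import lmdb

def merge_lmdb(sources, destination, map_size=1 << 40):
    # Copy every record from each source LMDB into one new LMDB,
    # re-keying sequentially so the merged keys stay unique.
    out_env = lmdb.open(destination, map_size=map_size)
    index = 0
    with out_env.begin(write=True) as out_txn:
        for source in sources:
            in_env = lmdb.open(source, readonly=True, lock=False)
            with in_env.begin() as in_txn:
                for _, value in in_txn.cursor():
                    out_txn.put('{:08d}'.format(index).encode('ascii'), value)
                    index += 1
            in_env.close()
    out_env.close()

merge_lmdb(['data1/train1_db', 'data2/train2_db'], 'data3/train3_db')
merge_lmdb(['data1/val1_db', 'data2/val2_db'], 'data3/val3_db')

Remember that DIGITS also expects the mean.binaryproto and the pickle metadata mentioned above to be consistent with the merged databases.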

Related

How to see the compression used to create a parquet file with pyarrow?

If I have a parquet file I can do
pqfile=pq.ParquetFile("pathtofile.parquet")
pqfile.metadata
but exploring the pqfile object with dir, I can't find anything that would indicate the compression of the file. How can I get that info?
0x26res has a good point in the comments: converting the metadata to a dict will be easier than using dir.
Compression is stored at the column level. A parquet file consists of a number of row groups. Each row group has columns. So you would want something like...
import pyarrow as pa
import pyarrow.parquet as pq
table = pa.Table.from_pydict({'x': list(range(100000))})
pq.write_table(table, '/tmp/foo.parquet')
pq.ParquetFile('/tmp/foo.parquet').metadata.row_group(0).column(0).compression
# 'SNAPPY'
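If you want to see the compression of every column chunk rather than just the first one, you can walk the same metadata object; a minimal sketch building on the file written above:

md = pq.ParquetFile('/tmp/foo.parquet').metadata
for rg in range(md.num_row_groups):
    for col in range(md.num_columns):
        chunk = md.row_group(rg).column(col)
        print(rg, chunk.path_in_schema, chunk.compression)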

Create libsvm from multiple csv files for xgboost external memory training

I am trying to train an xgboost model using its external memory version, which takes a libsvm file as the training set. Right now, all the data is stored in a bunch of csv files which, combined, are way larger than the memory I have, say 70G (any single one of them can be read easily). I just wonder how to create one large libsvm file for xgboost, or whether there is any other workaround for this. Thank you.
If your csv files do not have headers you can combine them with the Unix cat command.
Example:
> ls
file1.csv file2.csv
> cat *.csv > combined.csv
Now combined.csv is the concatenation of all the other files.
If all your csv files have headers, you'll want to do something trickier, like using tail to drop the header and keep only the remaining n-1 lines of each file.
XGBoost supports csv as an input.
If you want to convert that to libsvm regardless, you can use phraug's scripts.
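If you do want a single libsvm file without holding everything in memory, a rough sketch in Python is to stream each csv in chunks and append it to one output file with scikit-learn. The folder, the chunk size, and the "label" column name below are assumptions about your data layout:

import glob
import pandas as pd
from sklearn.datasets import dump_svmlight_file

with open('combined.libsvm', 'wb') as out:
    for path in sorted(glob.glob('data/*.csv')):
        # Read each csv in manageable chunks so memory stays bounded
        for chunk in pd.read_csv(path, chunksize=100000):
            y = chunk.pop('label').values   # assumed target column
            dump_svmlight_file(chunk.values, y, out)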

How to load only first n files in pyspark spark.read.csv from a single directory

I have a scenario where I am loading and processing 4TB of data, which is about 15000 .csv files in a folder.
Since I have limited resources, I am planning to process them in two batches and then union them.
I am trying to understand if I can load only 50% of the files (or the first n files in batch 1 and the rest in batch 2) using spark.read.csv.
I cannot use a regular expression, as these files are generated from multiple sources and their counts are uneven (from some sources there are few and from others there are many). If I process the files in uneven batches using wildcards or a regex, I may not get optimized performance.
Is there a way I can tell the spark.read.csv reader to pick the first n files, and then separately load the remaining files?
I know this can be done by writing another program, but I would prefer not to, as I have more than 20000 files and I don't want to iterate over them.
It's easy if you use the Hadoop API to list the files first and then create DataFrames from chunks of that list. For example:
path = '/path/to/files/'

# List everything in the directory through the Hadoop FileSystem API
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
list_status = fs.listStatus(spark._jvm.org.apache.hadoop.fs.Path(path))
paths = [file.getPath().toString() for file in list_status]

# Read each half of the file list into its own DataFrame
df1 = spark.read.csv(paths[:7500])
df2 = spark.read.csv(paths[7500:])
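If the two batches eventually need to be combined after processing, sorting the listed paths keeps the split deterministic across runs, and the results can be unioned at the end; a small sketch along those lines (header=True is an assumption about the csv files):

paths = sorted(paths)                      # deterministic split across runs
half = len(paths) // 2
df1 = spark.read.csv(paths[:half], header=True)
df2 = spark.read.csv(paths[half:], header=True)
combined = df1.unionByName(df2)            # union the two processed batches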

How to merge multiple csv files into 1 SAS file

I just started using SAS 3 days ago and I need to merge ~50 csv files into 1 SAS dataset.
The 50 csv files have multiple variables, with only 1 variable in common, i.e. "region_id".
I've used the SAS Enterprise Guide drag-and-drop functionality to do this, but it was too manual and took me half a day to upload and merge 47 csv files into 1 SAS file.
I was wondering whether anyone has a more intelligent way of doing this using base SAS?
Any advice and tips appreciated!
Thank you!
Example filenames:
2011Census_B01_AUST_short
2011Census_B02A_AUST_short
2011Census_B02B_AUST_short
2011Census_B03_AUST_short
.
.
2011Census_xx_AUST_short
I have more than 50 csv files to upload and merge.
The number and type of variables varies from one csv file to another. However, all csv files have 1 common variable = "region_id".
Example variables:
region_id, Tot_P_M, Tot_P_F, Tot_P_P, Age_0_4_yr_F etc...
First, we'll need an automated way to import. The simple macro below takes the location of the file and the name of the file as inputs, and outputs a dataset to the work library. (I'd use the CONCATENATE function in Excel to create the SAS code 50 times; see also the generator sketch after the merge step below.) We also sort each dataset to make the merge easier later.
%macro importcsv(location=,filename=);
proc import datafile="&location./&filename..csv"
out=&filename.
dbms=csv
replace;
getnames=yes;
run;
proc sort data= &filename.; by region_id; run;
%mend;
%importcsv(location = C:/Desktop,filename = 2011Census_B01_AUST_short)
.
.
.
Then simply merge all of the data together again. I added ellipses simply because I didn't want to write it out 50 times.
data merged;
merge dataseta datasetb datasetc ... datasetax;
by region_id;
run;
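As an alternative to building the 50 macro calls in Excel, a short script can write them out from a directory listing; a rough sketch in Python that simply generates the same SAS code shown above (the folder path and output file name are placeholders):

import os

csv_dir = 'C:/Desktop'   # hypothetical folder holding the census csv files
names = [f[:-4] for f in sorted(os.listdir(csv_dir)) if f.endswith('.csv')]

with open('import_all.sas', 'w') as sas:
    for name in names:
        sas.write('%importcsv(location = {},filename = {})\n'.format(csv_dir, name))
    sas.write('data merged;\n  merge {};\n  by region_id;\nrun;\n'.format(' '.join(names)))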
Hope this helps.

all the columns of a csv file cannot be imported in sas dataset

My data set contains 1,300,000 observations with 56 columns. It is a .csv file and I'm trying to import it using proc import. After importing, I find that only 44 out of 56 columns are imported.
I tried increasing the guessing rows but it is not helping.
P.S.: I'm using SAS 9.3
If (and only in that case, as far as I am aware) you specify the file to load in a filename statement, you have to set the lrecl option to a value that is large enough.
If you don't, the default is only 256. Ergo, if your csv has lines longer than 256 characters, it will not read the full line.
See this link for more information (just search for lrecl): https://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000308090.htm
If you have SAS Enterprise Guide (I think it's now included with all desktop licenses), try out the import wizard. It's excellent, and it will generate code you can reuse with a little editing.
It will take a while to run because it will read your entire file before writing the import logic.