Shuffle data in LMDB file - caffe

I already have an existing LMDB (Symas Lightning Memory-Mapped Database) file which was created for Caffe. Is there any way to shuffle the data in the existing LMDB to create a new LMDB with the data shuffled? Any suggestions or ideas would be helpful.

LMDB traverses data according to the lexicographical order of the keys. You can prepend a random number to your current key and the data will be shuffled accordingly. I am also investigating an efficient way to rewrite the keys randomly between epochs, as I would like to use batch normalization on my data set.
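To turn that into a new, shuffled LMDB, a minimal sketch using the Python lmdb bindings could look like this (the paths and map_size are placeholders, and it assumes the full list of keys fits in memory; values are copied one at a time):

import random
import lmdb

# Placeholder paths and map_size; adjust to your data
src = lmdb.open('source_lmdb', readonly=True, lock=False)
dst = lmdb.open('shuffled_lmdb', map_size=1 << 40)

with src.begin() as src_txn:
    keys = [key for key, _ in src_txn.cursor()]   # collect all existing keys
    random.shuffle(keys)                          # new, random order

    with dst.begin(write=True) as dst_txn:
        for i, key in enumerate(keys):
            # Re-key with a zero-padded index so lexicographic order equals the shuffled order
            dst_txn.put('{:08}'.format(i).encode('ascii'), src_txn.get(key))

src.close()
dst.close()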

To add to the answer given by @Manolo: while creating the LMDB dataset as suggested here, I concatenated a random int to the beginning of the key as follows:
import random  # needed for the random key prefix
random.seed(i)  # i is the running index of the image being written
str_id = '{:05}'.format(random.randint(1, 70000)) + '{:05}'.format(i)
I chose 70000, since my LMDB had about 72000 images.

Related

How to index a 1 billion row CSV file with Elasticsearch?

Imagine you had a large CSV file - let's say 1 billion rows.
You want each row in the file to become a document in Elasticsearch.
You can't load the file into memory - it's too large, so it has to be streamed or chunked.
The time taken is not a problem. The priority is making sure ALL data gets indexed, with no missing data.
What do you think of this approach:
Part 1: Prepare the data
Loop over the CSV file in batches of 1k rows
For each batch, transform the rows into JSON and save them into a smaller file
You now have 1m files, each with 1000 lines of nice JSON
The filenames should be incrementing IDs. For example, running from 1.json to 1000000.json
Part 2: Upload the data
Start looping over each JSON file and reading it into memory
Use the bulk API to upload 1k documents at a time
Record the success/failure of the upload in a result array
Loop over the result array and if any upload failed, retry
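For illustration, a minimal sketch of Part 1 in Python (the CSV path is a placeholder; rows are streamed rather than loaded into memory):

import csv
import json

BATCH_SIZE = 1000
batch, file_no = [], 1

with open('big.csv', newline='') as f:               # placeholder path
    for row in csv.DictReader(f):
        batch.append(row)
        if len(batch) == BATCH_SIZE:
            with open('{}.json'.format(file_no), 'w') as out:
                out.write('\n'.join(json.dumps(doc) for doc in batch))
            batch, file_no = [], file_no + 1

if batch:                                             # flush the last partial batch
    with open('{}.json'.format(file_no), 'w') as out:
        out.write('\n'.join(json.dumps(doc) for doc in batch))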
The steps you've mentioned above look good. A couple of other things will make sure ES does not get overloaded:
From what I've experienced, you can increase the bulk request size to a greater value, say somewhere in the range 4k-7k documents (start with 7k and, if that causes problems, experiment with smaller batches, but going lower than 4k probably won't be needed).
Ensure the value of refresh_interval is set to a large value (or disable it with -1 during the load). This ensures that the index is not refreshed, i.e. new documents made searchable, too frequently while you are bulk loading. IMO the default value will also do. Read more here.
As the comment above suggests, it'd be better if you start with a smaller batch of data. Of course, if you use constants instead of hardcoding the values, your task just got easier.
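A hedged sketch of Part 2 with those suggestions applied, using the elasticsearch Python client (the host, index name and file count are placeholders, and the exact client signatures vary between client versions):

import json
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch('http://localhost:9200')          # placeholder host

# Make refreshes infrequent while bulk loading; restore the setting afterwards
es.indices.put_settings(index='my-index', body={'index': {'refresh_interval': '-1'}})

def actions(path):
    with open(path) as f:
        for line in f:
            if line.strip():
                yield {'_index': 'my-index', '_source': json.loads(line)}

failed = []
for file_no in range(1, 1000001):                    # 1.json .. 1000000.json
    path = '{}.json'.format(file_no)
    ok, errors = helpers.bulk(es, actions(path), chunk_size=5000, raise_on_error=False)
    if errors:
        failed.append(path)                          # record failures so they can be retried

es.indices.put_settings(index='my-index', body={'index': {'refresh_interval': '1s'}})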

How can I get the IDs of specific items in a Pytorch dataloader-based dataset with a query?

I have a large dataset (approx. 500 GB and 180k data points plus labels) in a PyTorch dataloader. Until now, I used torch.utils.data.random_split to split the dataset randomly into training and validation. However, this led to serious overfitting. Now, I would rather use a deterministic split, i.e. based on the paths stored in the dataloader I could work out a non-random split. However, I have no idea how to do so... The question is: how can I get the IDs of about 10% of the data points based on some query that looks at the information about the files stored in the data loader (e.g. the paths)?
Have you used a custom dataset along with the dataloader? If the underlying dataset has some variable that stores the filenames of the individual files, you can access it using dataloader.dataset.filename_variable.
If that's not available, you can create a custom dataset yourself, where you essentially wrap the original dataset.
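As an illustrative sketch (not the asker's code), assuming the underlying dataset exposes its file paths in a list attribute, hypothetically called paths here, a deterministic split driven by a path query could look like this:

from torch.utils.data import Subset

def split_by_path(dataset, is_validation):
    # `dataset.paths` is an assumed attribute holding one path per data point;
    # adjust the name to whatever your custom dataset actually uses.
    val_idx = [i for i, p in enumerate(dataset.paths) if is_validation(p)]
    val_set = set(val_idx)
    train_idx = [i for i in range(len(dataset)) if i not in val_set]
    return Subset(dataset, train_idx), Subset(dataset, val_idx)

# Example: put every file whose path contains "subject_09" into validation
# train_ds, val_ds = split_by_path(loader.dataset, lambda p: 'subject_09' in p)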

How to write one JSON file for each row of a dataframe in Scala/Spark and rename the files

I need to create one JSON file for each row of the dataframe. I'm using partitionBy, which creates subfolders for each file. Is there a way to avoid creating the subfolders and rename the JSON files with the unique key?
Or any other alternatives? It's a huge dataframe with thousands (~300K) of unique values, so repartitioning is eating up a lot of resources and taking time. Thanks.
df.select(Seq(col("UniqueField").as("UniqueField_Copy")) ++
          df.columns.map(col): _*)
  .write.partitionBy("UniqueField")
  .mode("overwrite").format("json").save("c:/temp/json/")
Putting all the output in one directory
Your example code is calling partitionBy on a DataFrameWriter object. The documentation tells us that this function:
Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like:
year=2016/month=01/
year=2016/month=02/
This is the reason you're getting subdirectories. Simply removing the call to partitionBy will get all your output in one directory.
Getting one row per file
Spark SQL
You had the right idea partitioning your data by UniqueField, since Spark writes one file per partition. Rather than using DataFrameWriter's partitionBy, you can use
df.repartitionByRange(numberOfJson, $"UniqueField")
to get the desired number of partitions, with one JSON per partition. Notice that this requires you to know the number of JSONs you will end up with in advance. You can compute it by
val numberOfJson = df.select(count($"UniqueField")).first.getAs[Long](0).toInt
However, this adds an additional action to your query, which will cause your entire dataset to be computed again. It sounds like your dataset is too big to fit in memory, so you'll need to carefully consider if caching (or checkpointing) with df.cache (or df.checkpoint) actually saves you computation time. (For large datasets that don't require intensive computation to create, recomputation can actually be faster)
RDD
An alternative to using the Spark SQL API is to drop down to the lower-level RDD. Partitioning by key (in PySpark) for RDDs was discussed thoroughly in the answer to this question. In Scala, you'd have to specify a custom Partitioner as described in this question.
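The question is in Scala, but as a rough illustration of the RDD approach that the linked PySpark discussion describes, a sketch could look like this (input/output paths are placeholders and the row fields are assumed to be JSON-serializable):

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("input/")                   # placeholder; use your existing dataframe

keys = [r[0] for r in df.select("UniqueField").distinct().collect()]
index = {k: i for i, k in enumerate(keys)}       # one partition id per unique key

(df.rdd
   .map(lambda row: (row["UniqueField"], row.asDict()))
   .partitionBy(len(keys), lambda k: index[k])   # custom partitioner: each key gets its own partition
   .values()
   .map(lambda d: json.dumps(d))
   .saveAsTextFile("output/"))                   # one part file per partition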
Renaming Spark's output files
This is a fairly common question, and AFAIK, the consensus is it's not possible.
Hope this helps, and welcome to Stack Overflow!

Loading json format data into google bigquery performance issue

I have loaded a JSON-format data structure into a Google BigQuery "nested" table (I have 2 levels of nested "repeated" records); the average length of a JSON line is 5000 characters.
The load time is much slower than loading a flat file (same total size) into Google BigQuery.
What are the "rules of thumb" when loading JSON into nested records?
How can I improve my performance?
In terms of query performance, is it also much slower to retrieve data from a nested table than from a flat table?
Please help, I have found it difficult to reach an experienced "DBA" in this area.
Regards
I don't know of any reason JSON imports should be slower, but we haven't benchmarked them.
If perf is slow, you may be better off breaking the import into chunks and passing multiple source files into the load job.
It shouldn't be any slower retrieving the data from the nested table (and might be faster). The columnar storage format should store your nested data more efficiently than a corresponding flat table.
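As a hedged sketch of the "multiple source files" suggestion using the google-cloud-bigquery Python client (bucket, dataset and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,                              # or provide an explicit nested schema
)

# A wildcard URI lets one load job read many smaller chunk files in parallel.
load_job = client.load_table_from_uri(
    'gs://my-bucket/chunks/part-*.json',          # placeholder bucket/path
    'my_dataset.my_nested_table',                 # placeholder destination
    job_config=job_config,
)
load_job.result()                                 # wait for completion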

MySQL Blob vs. Disk for "video frames"

I have a C++ app that generates 6 relatively small image-like arrays per second. The data is 64x48x2-dimensional (i.e., a grid of 64x48 two-dimensional vectors, with each vector consisting of two floats). That works out to ~26 KB per frame. The app also generates a timestamp and some features describing the data. I want to store the timestamp and the features in MySQL columns, one row per frame. I also need to store the original array as binary data, either in a file on disk or as a BLOB field in the database. Assume that the app will be running more or less nonstop, and that I'll come up with a way to archive data older than a certain age, so that storage does not become a problem.
What are the tradeoffs here between BLOBs, files on disk, or other methods I may not even be thinking of? I don't need to query against the binary data, but I do need to query against the other metadata/features in the table (I'll definitely have an index on timestamp) and then retrieve the binary data. Does the equation change if I store multiple frames in a single file on disk vs. one frame per file?
Yes, I've read MySQL Binary Storage using BLOB VS OS File System: large files, large quantities, large problems and To Do or Not to Do: Store Images in a Database, but I think my question differs because in this case there are going to be millions of identically-dimensioned binary files. I'm not sure how the performance hit of maintaining that many small files in a filesystem compares to storing that many files in db BLOB columns. Any perspective would be appreciated.
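For concreteness, a minimal sketch of the two options being weighed (the connection details and the frames table/columns are invented for illustration, not a recommendation):

import mysql.connector

# Invented connection parameters and schema: frames(ts, features, frame BLOB, frame_path VARCHAR)
conn = mysql.connector.connect(host="localhost", user="app", password="secret", database="frames_db")
cur = conn.cursor()

def store_as_blob(ts, features, frame_bytes):
    # Option A: the ~26 KB frame goes straight into a BLOB column
    cur.execute("INSERT INTO frames (ts, features, frame) VALUES (%s, %s, %s)",
                (ts, features, frame_bytes))
    conn.commit()

def store_on_disk(ts, features, frame_bytes, path):
    # Option B: the frame is written to disk and the row only keeps the path
    with open(path, "wb") as f:
        f.write(frame_bytes)
    cur.execute("INSERT INTO frames (ts, features, frame_path) VALUES (%s, %s, %s)",
                (ts, features, path))
    conn.commit()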
At a certain point, querying for many BLOBs becomes unbearably slow, and I suspect that this will be the case even with your identically-dimensioned binary files. Moreover, you will still need some code to access and process the BLOBs. And this doesn't take advantage of the file caching that might speed up image reads straight from the file system.
But! The link you provided did not mention object-based databases, which can store the data you described in a way that you can access extremely quickly, and possibly return it in native format. For a discussion, see the link below or just search Google; there are many discussions:
Storing images in NoSQL stores
I would also look into HBase.
I figured that since you were not sure what to use in the first place (and there were no answers), an alternative solution might be appropriate.