The speed between ImageDataLayer and LMDB data layer - deep-learning

Caffe supports both an LMDB data layer and ImageDataLayer.
Creating an LMDB database from a dataset takes some time and a lot of disk space.
In contrast, ImageDataLayer only needs a txt file listing the images, which is very convenient.
My question is: is there a big speed difference between these two kinds of layers?
Thank you very much!

LMDB is designed for fast retrieval of data by key. The data is also stored uncompressed, so the machine can simply read it and pass it on to the GPU for processing without a decoding step.
With ImageDataLayer, we have to read the image details from the text file and use OpenCV to load each image into memory. Decompressing the images this way is computationally expensive.
But the LMDB layer does not always give the best performance; it depends heavily on the configuration of the machine. Consider a batch size of 256 images, each of size 227x227x3, and assume you are using a very good GPU and a high-end i7 processor. Here a single image in LMDB format occupies about 151 KB, so a whole batch occupies about 37 MB. If the GPU can process 10 batches a second, the hard disk must sustain a read speed of about 370 MB/s. If you are using a normal SATA or external hard disk, reading such large chunks of data becomes a bottleneck because of the limits of the disk.
If Caffe cannot fetch data at the required speed, that bottleneck slows down the whole training process. In that situation, reading 256 image files and decoding them with a multi-core build of OpenCV may handle data prefetching more effectively than reading from an LMDB.
The above case will not occur if you have stored the LMDB data on an SSD, though!
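The back-of-the-envelope numbers in this answer can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, assuming one byte per channel value (uncompressed uint8 pixels); the function name `required_read_rate` is made up for illustration and is not part of Caffe or LMDB.

```python
def required_read_rate(height, width, channels, batch_size, batches_per_second):
    """Return (bytes per image, bytes per batch, bytes per second).

    Assumes uncompressed images with one byte per channel value.
    """
    image_bytes = height * width * channels
    batch_bytes = image_bytes * batch_size
    return image_bytes, batch_bytes, batch_bytes * batches_per_second


image_bytes, batch_bytes, rate = required_read_rate(227, 227, 3, 256, 10)
print(f"image:  {image_bytes / 1024:.0f} KB")
print(f"batch:  {batch_bytes / 1024**2:.1f} MB")
print(f"needed: {rate / 1024**2:.0f} MB/s")
```

Plugging in your own image size, batch size, and measured batches per second gives a rough target to compare against the sustained read speed of your disk.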

Yes, the speed difference is indeed big. LMDB is optimized for high speed batch processing.

Related

what happens to model weights and how does checkpointing work?

I have a basic question about model weights and checkpoints.
When training a model, each layer in the model graph launches kernels that are executed on the GPU. The weights remain on the GPU for the forward pass and the backward pass. Once the weights are updated during the backward pass, where are the updated weights stored? Are they moved back to CPU memory? When does this move happen?
When checkpointing is done, do we get the weights from CPU memory?
Can someone explain the whole execution flow?
In most cases, the updated weights from the backward pass remain in GPU memory. They are typically stored there as floating-point numbers, which allows fast matrix operations and helps optimize the training process. The weights are updated on each iteration of the training loop and remain on the GPU until the end of training.
When checkpointing is done, the weights are saved to disk, either on the local machine or on remote storage, so they are available if execution is stopped. They are usually loaded into CPU memory when needed for execution. This is the general process, but it can vary with architecture and hardware.
The weights stay on the GPU unless they are explicitly moved somewhere else.
When you save a checkpoint, the weights are serialized to disk using pickle, without first being moved to the CPU. That's why if, for example, you pickle a model's state_dict that is on the GPU and try to load it on a system without a GPU, it will fail.
Also note that pickle itself has to copy the data it dumps into system RAM and do its required processing first, but it doesn't change the object's underlying attributes when doing so, so your model's weights are stored in their original form with their attributes intact.

How can I reduce compute costs and waste in my Foundry transforms?

We have a lot of pretty complicated data pipelines and the amount of compute being consumed has been steadily rising every month. How can I figure out where compute is being wasted and make things more efficient?
So, this will turn into a little bit of an involved answer but hopefully I can point people to a useful set of resources to help them manage waste.
Let's start in the obvious place. Compute profiles:
Engineers will commonly increase executor memory to solve an executor OOM, but the cause of the OOM is often skew. Try to mitigate the skew first and increase memory second.
Memory is relatively cheap, but when you increase memory you do so on every executor, which can get expensive across a large number of executors. Usually only a single executor is OOMing and 90% of the time it is due to skew.
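One rough way to check for skew before reaching for a bigger profile is to look at how unevenly a join or group-by key is distributed. The sketch below is plain Python over a sample of key values, not a Foundry or Spark API; the `skew_ratio` helper and the sample keys are hypothetical.

```python
from collections import Counter
from statistics import median


def skew_ratio(keys):
    """Ratio of the largest key group to the median group size.

    A value near 1.0 means the keys are evenly spread; a large value
    means one key (and so one executor) gets far more rows than the rest.
    """
    counts = Counter(keys)
    return max(counts.values()) / median(counts.values())


# Hypothetical sample of a join key where one value dominates.
keys = ["a"] * 9000 + ["b"] * 100 + ["c"] * 100 + ["d"] * 100
print(skew_ratio(keys))
```

A high ratio suggests that the single OOMing executor is the one holding the dominant key, in which case fixing the skew is likely to help more than adding memory everywhere.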
Local Spark: You can use the compute profile KUBERNETES_NO_EXECUTORS on small transforms (a rule of thumb might be <50 MB of input and output data), which means your transform will run on the driver (reminder on drivers vs executors). This means 2 fewer modules are spun up, reducing the amount of resources consumed by about 66%. Often a job this small does not need executors, and using them just causes shuffles and other wasted compute. When you're dealing with small data, try to use local Spark: your jobs will spin up faster and cost less.
Views: Docs on views have not been added to public docs yet, but you can find them on your platform docs at documentation/product/views/overview.
Views are a really useful way to reduce compute usage by eliminating the need for a transform altogether. Anywhere you have an identity transform being used to move a dataset between projects, or a transform that exists only to union several other datasets together, this transform can be replaced by a view. Views work by containing the information on the backing datasets and files, rather than containing any files themselves. They therefore require no processing of their own.
Incremental Pipelines: Where you have data that does not need to be changed after it is processed you might be able to use an incremental pipeline. This way you only process the new data as it comes into your pipeline without having to reprocess the entire mass of data.
This is probably the most powerful tool to reduce compute consumption in large intensive pipelines with high data throughput.
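The idea behind incremental processing can be sketched in a few lines of plain Python. This is a conceptual illustration only, assuming an append-only input; it does not use Foundry's incremental transform API, and `process_incrementally` is a made-up helper.

```python
def process_incrementally(all_rows, processed_count, transform):
    """Transform only the rows appended since the last run.

    Returns the newly produced output rows and the updated row count,
    which the caller keeps as its checkpoint for the next run.
    """
    new_rows = all_rows[processed_count:]
    return [transform(row) for row in new_rows], len(all_rows)


rows = [1, 2, 3]
output, seen = process_incrementally(rows, 0, lambda x: x * 10)   # first run: everything
rows += [4, 5]                                                    # new data arrives
more, seen = process_incrementally(rows, seen, lambda x: x * 10)  # second run: new rows only
output += more
print(output)
```

The second run touches two rows instead of five; on a large pipeline with a long history, that difference is where the compute savings come from.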

Offline Data Augmentation in Google Colab - Problems with RAM

What would be most efficient way to perform OFFLINE data augmentation in Google Colab?
Since I am not in the US, I cannot purchase Colab Pro to get more RAM, so I am trying to be "smart" about it. For example, when I finish loading 11000 images, first as NumPy arrays and then creating a pandas DataFrame from them, they occupy around 7.5GB of RAM. The problem is that I tried to del every object (NumPy array, tf.data object, etc.) to check whether RAM usage changes, and it does not change.
Is it better to try to be smart about RAM, or to write to disk every time I augment an image and not keep anything in RAM? If the latter, is using TFRecords a smart approach for this?

How to speed up Gensim Word2vec model load time?

I'm building a chatbot so I need to vectorize the user's input using Word2Vec.
I'm using a pre-trained model with 3 million words by Google (GoogleNews-vectors-negative300).
So I load the model using Gensim:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
The problem is that it takes about 2 minutes to load the model. I can't let the user wait that long.
So what can I do to speed up the load time?
I thought about putting each of the 3 million words and their corresponding vector into a MongoDB database. That would certainly speed things up but intuition tells me it's not a good idea.
In recent gensim versions you can load a subset starting from the front of the file using the optional limit parameter to load_word2vec_format(). (The GoogleNews vectors seem to be in roughly most- to least- frequent order, so the first N are usually the N-sized subset you'd want. So use limit=500000 to get the most-frequent 500,000 words' vectors – still a fairly large vocabulary – saving 5/6ths of the memory/load-time.)
So that may help a bit. But if you're re-loading for every web-request, you'll still be hurting from loading's IO-bound speed, and the redundant memory overhead of storing each re-load.
There are some tricks you can use in combination to help.
Note that after loading such vectors in their original word2vec.c-originated format, you can re-save them using gensim's native save(). If you save them uncompressed, and the backing array is large enough (and the GoogleNews set is definitely large enough), the backing array gets dumped in a separate file in a raw binary format. That file can later be memory-mapped from disk, using gensim's native load(filename, mmap='r') option.
Initially, this will make the load seem snappy – rather than reading all the array from disk, the OS will just map virtual address regions to disk data, so that some time later, when code accesses those memory locations, the necessary ranges will be read-from-disk. So far so good!
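To make the memory-mapping behaviour concrete, here is a minimal sketch using only Python's standard `mmap` module rather than gensim. The file name and the 1,000-float array are stand-ins invented for the example; gensim handles the real mapping for you when you pass mmap='r'.

```python
import mmap
import os
import struct
import tempfile

FLOAT_SIZE = 4  # bytes per float32

# Write 1,000 float32 values to a raw binary file (stand-in for a vector array).
path = os.path.join(tempfile.mkdtemp(), "vectors.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<1000f", *range(1000)))

with open(path, "rb") as f:
    # Creating the mapping is cheap: the file contents are not read up front.
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Accessing a value is what causes the OS to bring that region into RAM.
    (value,) = struct.unpack_from("<f", mapped, 500 * FLOAT_SIZE)
    mapped.close()

print(value)
```

The cost of the read is paid at access time, not at mapping time, which is why the initial load appears instant.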
However, if you are doing typical operations like most_similar(), you'll still face big lags, just a little later. That's because this operation requires both an initial scan-and-calculation over all the vectors (on first call, to create unit-length-normalized vectors for every word), and then another scan-and-calculation over all the normed vectors (on every call, to find the N-most-similar vectors). Those full-scan accesses will page-into-RAM the whole array – again costing the couple-of-minutes of disk IO.
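The two full scans described here can be illustrated with a toy version in pure Python. This is a sketch of the idea, not gensim's implementation: the helper names and the three 2-dimensional vectors are made up, and real models use NumPy arrays with millions of rows.

```python
import math


def unit_normalize(vector):
    """Scale a vector to length 1 so that dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def most_similar(query, normed_vectors, topn=1):
    """Rank words by dot product against already-normalized vectors (a full scan)."""
    q = unit_normalize(query)
    scores = {
        word: sum(a * b for a, b in zip(q, vec))
        for word, vec in normed_vectors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:topn]


# Toy vectors, invented for illustration.
raw = {"cat": [3.0, 4.0], "dog": [4.0, 3.0], "car": [-1.0, 0.5]}
# Scan 1: normalize every vector once and keep the result for reuse.
normed = {word: unit_normalize(vec) for word, vec in raw.items()}
# Scan 2: every query still has to visit every normalized vector.
print(most_similar([3.0, 4.1], normed, topn=2))
```

The first scan only needs to happen once if its result is kept around; the second is unavoidable per query, so the goal is to make sure it reads from RAM rather than from disk.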
What you want is to avoid redundantly doing that unit-normalization, and to pay the IO cost just once. That requires keeping the vectors in memory for re-use by all subsequent web requests (or even multiple parallel web requests). Fortunately memory-mapping can also help here, albeit with a few extra prep steps.
First, load the word2vec.c-format vectors, with load_word2vec_format(). Then, use model.init_sims(replace=True) to force the unit-normalization, destructively in-place (clobbering the non-normalized vectors).
Then, save the model to a new filename-prefix: model.save('GoogleNews-vectors-gensim-normed.bin'). (Note that this actually creates multiple files on disk that need to be kept together for the model to be re-loaded.)
Now, we'll make a short Python program that serves to both memory-map load the vectors, and force the full array into memory. We also want this program to hang until externally terminated (keeping the mapping alive), and be careful not to re-calculate the already-normed vectors. This requires another trick because the loaded KeyedVectors actually don't know that the vectors are normed. (Usually only the raw vectors are saved, and normed versions re-calculated whenever needed.)
Roughly the following should work:
from gensim.models import KeyedVectors
from threading import Semaphore
model = KeyedVectors.load('GoogleNews-vectors-gensim-normed.bin', mmap='r')
model.syn0norm = model.syn0 # prevent recalc of normed vectors
model.most_similar('stuff') # any word will do: just to page all in
Semaphore(0).acquire() # just hang until process killed
This will still take a while, but only needs to be done once, before/outside any web requests. While the process is alive, the vectors stay mapped into memory. Further, unless/until there's other virtual-memory pressure, the vectors should stay loaded in memory. That's important for what's next.
Finally, in your web request-handling code, you can now just do the following:
model = KeyedVectors.load('GoogleNews-vectors-gensim-normed.bin', mmap='r')
model.syn0norm = model.syn0 # prevent recalc of normed vectors
# … plus whatever else you wanted to do with the model
Multiple processes can share read-only memory-mapped files. (That is, once the OS knows that file X is in RAM at a certain position, every other process that also wants a read-only mapped version of X will be directed to re-use that data, at that position.)
So this web-request load(), and any subsequent accesses, can all re-use the data that the prior process already brought into address-space and active-memory. Operations requiring similarity-calcs against every vector will still take the time to access multiple GB of RAM, and do the calculations/sorting, but will no longer require extra disk-IO and redundant re-normalization.
If the system is facing other memory pressure, ranges of the array may fall out of memory until the next read pages them back in. And if the machine lacks the RAM to ever fully load the vectors, then every scan will require a mix of paging-in-and-out, and performance will be frustratingly bad no matter what. (In such a case: get more RAM or work with a smaller vector set.)
But if you do have enough RAM, this winds up making the original/natural load-and-use-directly code "just work" in a quite fast manner, without an extra web service interface, because the machine's shared file-mapped memory functions as the service interface.
I really love vzhong's Embedding library. https://github.com/vzhong/embeddings
It stores word vectors in SQLite which means we don't need to load model but just fetch corresponding vectors from DB :D
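A method that worked for me:
from gensim.models import KeyedVectors
# load the word2vec-format vectors once, normalize them in place, and re-save in gensim's native format
model = KeyedVectors.load_word2vec_format('wikipedia-pubmed-and-PMC-w2v.bin', binary=True)
model.init_sims(replace=True)
model.save('bio_word')
Later, load the model memory-mapped:
model = KeyedVectors.load('bio_word', mmap='r')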
for more info: https://groups.google.com/forum/#!topic/gensim/OvWlxJOAsCo
I have that problem whenever I use the Google News dataset. The issue is that there are way more words in the dataset than you'll ever need, including a huge number of typos and whatnot. What I do is scan the data I'm working on, build a dictionary of, say, the 50k most common words, get their vectors with Gensim, and save the dictionary. Loading this dictionary takes half a second instead of 2 minutes.
If you have no specific dataset, you could use the 50 or 100k most common words from a big dataset, such as a news dataset from WMT to get you started.
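A minimal sketch of this approach, using toy data in place of the real model: `full_vectors` stands in for lookups against the loaded Gensim model, and `build_small_vocab` is a hypothetical helper, not a Gensim function.

```python
from collections import Counter
import pickle


def build_small_vocab(tokens, full_vectors, max_words):
    """Keep vectors only for the most common words that appear in your own data."""
    common = [word for word, _ in Counter(tokens).most_common(max_words)]
    return {word: full_vectors[word] for word in common if word in full_vectors}


# Toy stand-ins: in practice the vectors come from the large pre-trained model.
full_vectors = {"the": [0.1, 0.2], "pizza": [0.3, 0.4], "teh": [0.5, 0.6]}
tokens = "the pizza the pasta the pizza".split()

small = build_small_vocab(tokens, full_vectors, max_words=2)
blob = pickle.dumps(small)     # what you would write to disk once
restored = pickle.loads(blob)  # what each later start-up would load
print(sorted(restored))
```

Words that never occur in your data, such as the typo "teh" here, are simply dropped, which is what keeps the saved dictionary small.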
Other options are to always keep Gensim running. You can create a FIFO for a script running Gensim. The script acts like a "server" that can read a file to which a "client" writes, watching for vector requests.
I think the most elegant solution is to run a web service providing word embeddings. Check out the word2vec API as an example. After installing, getting the embedding for "restaurant" is as simple as:
curl http://127.0.0.1:5000/word2vec/model?word=restaurant

Speeding up puts operation to file in tcl

I have to puts large amounts of data to a file in Tcl, and it takes a very long time. I tried increasing the buffer size from 4KB to 1MB using fconfigure, but noticed no improvement whatsoever.
I am not sure whether I could flush my buffer at intervals, as I am guessing some of my data would be lost if I did so.
Is there some way I could increase the speed of puts without losing any data?
Generally the output speed is going to be limited by your disk drive's speed and computer system's i/o bandwidth.
Increasing the buffer size is probably the only thing you can do to help.
flush will slow down the write, as it will force-push the write buffer to the operating system.
If your incoming data stream ever pauses or comes in one big chunk that can be fit into memory, you can buffer the incoming data internally, and let the write catch up later.
If your data is coming from another channel (file, socket, whatever) then you can use fcopy to move it across. The fcopy command is careful to work as efficiently as possible, and if you configure both sides (incoming and outgoing) to use binary data transfer — so no encoding conversion or EOL/EOF character processing — then it can do it with minimal data copies; it's as efficient as a user-process level system can copy data (and you'd have to do hackery to move the copy into the OS kernel to do better). Obviously, having to process encoding conversion and transformation of end-of-line markers will slow things down.
Otherwise, the main bottleneck will still (probably) be the device to which the output is being written. If it is going to a file, moving to an SSD is the simplest option (but not necessarily the cheapest!). When writing over the network, better networking will make a gigantic difference. You really have to identify what the bottleneck is; if Tcl is spending most of its time waiting for the hardware, there's very little point in working hard to make Tcl faster, as you'll see virtually no results for that work. Fixing hardware bottlenecks is out of the scope of Stack Overflow, though some sister sites might be able to assist.
puts will not lose data unless you do something really evil like doing a force kill (kill -9) on the process, or doing a reset on the location of the file pointer from C code.