Offline Data Augmentation in Google Colab - Problems with RAM - deep-learning

What would be most efficient way to perform OFFLINE data augmentation in Google Colab?
Since I am not from the US I cannot purchase Google Colab Pro for more RAM, so I am trying to be "smart" about it. For example, after loading 11000 images, first as NumPy arrays and then as a pandas DataFrame built from them, around 7.5GB of RAM is occupied. The problem is that I tried to del every object (NumPy array, tf.data object, etc.) to check whether the RAM usage changes, and it does not change.
Is it better to try to be smart about RAM, or to write to disk every time I augment an image and not keep anything in RAM? If the latter, are TFRecords a smart approach for this?
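For illustration, a minimal sketch of the write-to-disk-as-you-go approach with TFRecords (assumes TF 2.x eager mode; the file locations and the single random-flip augmentation are placeholders for whatever pipeline is actually used):
import glob
import tensorflow as tf

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

image_paths = glob.glob('images/*.jpg')  # hypothetical location of the ~11000 source images

with tf.io.TFRecordWriter('augmented.tfrecord') as writer:
    for path in image_paths:
        image = tf.io.decode_jpeg(tf.io.read_file(path))
        augmented = tf.image.random_flip_left_right(image)  # stand-in for a real augmentation pipeline
        example = tf.train.Example(features=tf.train.Features(feature={
            'image': _bytes_feature(tf.io.encode_jpeg(augmented).numpy()),
        }))
        writer.write(example.SerializeToString())  # only one image is held in RAM at a time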

Related

How can I reduce compute costs and waste in my Foundry transforms?

We have a lot of pretty complicated data pipelines and the amount of compute being consumed has been steadily rising every month. How can I figure out where compute is being wasted and make things more efficient?
So, this will turn into a little bit of an involved answer but hopefully I can point people to a useful set of resources to help them manage waste.
Let's start in the obvious place. Compute profiles:
Engineers will commonly increase the executor memory to solve an executor OOM, but the cause of this OOM is often skew. Try to mitigate the skew first and increase memory second.
Memory is relatively cheap, but when you increase memory you do so on every executor, which can get expensive across a large number of executors. Usually only a single executor is OOMing and 90% of the time it is due to skew.
Local Spark: You can use the compute profile KUBERNETES_NO_EXECUTORS on small transforms (a rule of thumb might be <50MB of input and output data), which means your transform will run on the driver (see the reminder on drivers vs executors). This means 2 fewer modules are spun up, reducing the amount of resources consumed by 66%. Often a job this small does not need executors, and using them just causes shuffles and other wasted compute. When you're dealing with small data, try to use local Spark: your jobs will spin up faster and will cost less.
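As a rough sketch (assuming the Python transforms API, where Spark profiles are applied with the @configure decorator; the dataset paths are made up), a small transform pinned to local Spark might look like:
from transforms.api import configure, transform_df, Input, Output

@configure(profile=['KUBERNETES_NO_EXECUTORS'])  # run on the driver only, no executors
@transform_df(
    Output('/Project/folder/clean_lookup'),      # hypothetical dataset paths
    raw=Input('/Project/folder/raw_lookup'),
)
def compute(raw):
    # A tiny cleanup like this has no need for executors or shuffles.
    return raw.dropDuplicates()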
Views: Docs on views have not been added to public docs yet, but you can find them on your platform docs at documentation/product/views/overview.
Views are a really useful way to reduce compute usage by eliminating the need for a transform altogether. Anywhere you have an identity transform being used to move a dataset between projects, or a transform that exists only to union several other datasets together, this transform can be replaced by a view. Views work by containing the information on the backing datasets and files, rather than containing any files themselves. They therefore require no processing of their own.
Incremental Pipelines: Where you have data that does not need to be changed after it is processed you might be able to use an incremental pipeline. This way you only process the new data as it comes into your pipeline without having to reprocess the entire mass of data.
This is probably the most powerful tool to reduce compute consumption in large intensive pipelines with high data throughput.
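A hedged sketch of an incremental transform (again assuming the Python transforms API, where @incremental makes the input expose only newly added rows on incremental runs; the paths and column name are made up):
from transforms.api import incremental, transform, Input, Output

@incremental()
@transform(
    out=Output('/Project/folder/processed_events'),  # hypothetical dataset paths
    events=Input('/Project/folder/raw_events'),
)
def compute(events, out):
    new_rows = events.dataframe()  # only the unprocessed rows on incremental runs
    out.write_dataframe(new_rows.filter(new_rows['value'].isNotNull()))  # 'value' is a made-up column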

Storing objects/variables outside of volatile memory

OVERVIEW
I don't have a lot of experience programming, but I'm working on a hybrid mobile app using Cordova. This app is going to have a large amount of static (unchanging) data. Some of this data will be referenced about once every minute, some simple operations will be performed based on that reference, and the result will determine which object is referenced in the next iteration of the loop.
From what I understand, all an object or variable is, is a reserved space within memory identified by a name, which in hardware terms is synonymous with volatile storage, or RAM. Because I will be working with mobile devices, I am afraid that the massive number of objects I predict I will be working with (say close to 10,000) will max out the device's memory pretty fast.
My initial thought is to store this collection of static data in local storage instead of declaring these objects within the code itself. Then I would reference that file for the data when needed with each iteration of my loop, which processes once every minute. I don’t have experience with JSON but from what I know about it, this seems like it could be a good option.
BREAKDOWN
• I’m using typescript and Cordova.
• I will possibly be working with 10s of thousands of static objects.
• These objects will all be using one of a few interfaces as an outline.
• A few of these objects will be referenced for some information about once every minute.
• That information will be used to perform very simple operations.
• The Id of the object that was referenced may need to be saved permanently for future use.
• Those operations will determine what objects need to be referenced in the next iteration.
QUESTION(S)
So, my question is this. Am I correct in my understanding of how objects are stored? If so, will this number of objects be enough to max out a mobile device's RAM? Is my idea of storing all the static information in something like a JSON file and then referencing the individual objects in that file as needed plausible?
Not quite correct. Modern operating systems don't always map the application's memory to the hardware RAM.
Let's say you have a phone that only has 256MB of total RAM, but your application ends up loading 128MB of data into memory. Does that mean you can only use one more application that can load 128MB of memory? What about the OS itself using memory? The answer is that the OS will move some of the data from primary memory (i.e. RAM) into secondary storage (e.g. a solid-state drive), making room for your app and other apps to do their work as needed. If the data that was moved out of RAM is needed again, the OS can move it back into RAM from the SSD. This is called paging, and it's one of the many different pieces that make up the operating system's memory management. Most of it is done without your application code having to be aware of it.
Of course, even though the OS does a pretty great job of making memory available to your application, you still want to write code that's memory efficient, especially on mobile phones.
For your specific example, your suggestion of storing the static data in local storage is a good start. But it has some drawbacks you should be aware of, and some questions you should answer:
Can you divide up the data so that you can load only the portion you need at a time? Or do you need to have all of it loaded anyway?
Can you store your data in a more compressed data structure? (see for example Tries)
How frequently will you be loading the data from local storage?
Will loading the data from local storage take too long? (E.g. if your loop does a thousand iterations, and during each iteration loads a lot of static data from disk, it might end up being really slow.)
Good luck!

How to speed up Gensim Word2vec model load time?

I'm building a chatbot so I need to vectorize the user's input using Word2Vec.
I'm using a pre-trained model with 3 million words by Google (GoogleNews-vectors-negative300).
So I load the model using Gensim:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
The problem is that it takes about 2 minutes to load the model. I can't let the user wait that long.
So what can I do to speed up the load time?
I thought about putting each of the 3 million words and their corresponding vector into a MongoDB database. That would certainly speed things up but intuition tells me it's not a good idea.
In recent gensim versions you can load a subset starting from the front of the file using the optional limit parameter to load_word2vec_format(). (The GoogleNews vectors seem to be in roughly most- to least- frequent order, so the first N are usually the N-sized subset you'd want. So use limit=500000 to get the most-frequent 500,000 words' vectors – still a fairly large vocabulary – saving 5/6ths of the memory/load-time.)
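For example (same file name as in the question):
from gensim.models import KeyedVectors

# Load only the first 500,000 (roughly most-frequent) vectors instead of all 3 million.
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',
                                           binary=True, limit=500000)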
So that may help a bit. But if you're re-loading for every web-request, you'll still be hurting from loading's IO-bound speed, and the redundant memory overhead of storing each re-load.
There are some tricks you can use in combination to help.
Note that after loading such vectors in their original word2vec.c-originated format, you can re-save them using gensim's native save(). If you save them uncompressed, and the backing array is large enough (and the GoogleNews set is definitely large enough), the backing array gets dumped in a separate file in a raw binary format. That file can later be memory-mapped from disk, using gensim's native load(filename, mmap='r') option.
Initially, this will make the load seem snappy – rather than reading all the array from disk, the OS will just map virtual address regions to disk data, so that some time later, when code accesses those memory locations, the necessary ranges will be read-from-disk. So far so good!
However, if you are doing typical operations like most_similar(), you'll still face big lags, just a little later. That's because this operation requires both an initial scan-and-calculation over all the vectors (on first call, to create unit-length-normalized vectors for every word), and then another scan-and-calculation over all the normed vectors (on every call, to find the N-most-similar vectors). Those full-scan accesses will page-into-RAM the whole array – again costing the couple-of-minutes of disk IO.
What you want is to avoid redundantly doing that unit-normalization, and to pay the IO cost just once. That requires keeping the vectors in memory for re-use by all subsequent web requests (or even multiple parallel web requests). Fortunately memory-mapping can also help here, albeit with a few extra prep steps.
First, load the word2vec.c-format vectors, with load_word2vec_format(). Then, use model.init_sims(replace=True) to force the unit-normalization, destructively in-place (clobbering the non-normalized vectors).
Then, save the model to a new filename-prefix: model.save('GoogleNews-vectors-gensim-normed.bin'). (Note that this actually creates multiple files on disk that need to be kept together for the model to be re-loaded.)
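Putting those two prep steps together (this assumes an older gensim 3.x, where init_sims() is still available, matching the syn0/syn0norm attributes used below):
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model.init_sims(replace=True)                       # unit-normalize destructively, in place
model.save('GoogleNews-vectors-gensim-normed.bin')  # the big array is dumped to a separate raw file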
Now, we'll make a short Python program that serves to both memory-map load the vectors, and force the full array into memory. We also want this program to hang until externally terminated (keeping the mapping alive), and be careful not to re-calculate the already-normed vectors. This requires another trick because the loaded KeyedVectors actually don't know that the vectors are normed. (Usually only the raw vectors are saved, and normed versions re-calculated whenever needed.)
Roughly the following should work:
from gensim.models import KeyedVectors
from threading import Semaphore
model = KeyedVectors.load('GoogleNews-vectors-gensim-normed.bin', mmap='r')
model.syn0norm = model.syn0 # prevent recalc of normed vectors
model.most_similar('stuff') # any word will do: just to page all in
Semaphore(0).acquire() # just hang until process killed
This will still take a while, but only needs to be done once, before/outside any web requests. While the process is alive, the vectors stay mapped into memory. Further, unless/until there's other virtual-memory pressure, the vectors should stay loaded in memory. That's important for what's next.
Finally, in your web request-handling code, you can now just do the following:
model = KeyedVectors.load('GoogleNews-vectors-gensim-normed.bin', mmap='r')
model.syn0norm = model.syn0 # prevent recalc of normed vectors
# … plus whatever else you wanted to do with the model
Multiple processes can share read-only memory-mapped files. (That is, once the OS knows that file X is in RAM at a certain position, every other process that also wants a read-only mapped version of X will be directed to re-use that data, at that position.)
So this web-request load(), and any subsequent accesses, can all re-use the data that the prior process already brought into address-space and active-memory. Operations requiring similarity-calcs against every vector will still take the time to access multiple GB of RAM, and do the calculations/sorting, but will no longer require extra disk-IO and redundant re-normalization.
If the system is facing other memory pressure, ranges of the array may fall out of memory until the next read pages them back in. And if the machine lacks the RAM to ever fully load the vectors, then every scan will require a mix of paging-in-and-out, and performance will be frustratingly bad no matter what. (In such a case: get more RAM or work with a smaller vector set.)
But if you do have enough RAM, this winds up making the original/natural load-and-use-directly code "just work" in a quite fast manner, without an extra web service interface, because the machine's shared file-mapped memory functions as the service interface.
I really love vzhong's Embedding library. https://github.com/vzhong/embeddings
It stores word vectors in SQLite, which means we don't need to load the whole model into memory, just fetch the corresponding vectors from the DB :D
Success method:
from gensim.models import Word2Vec

model = Word2Vec.load_word2vec_format('wikipedia-pubmed-and-PMC-w2v.bin', binary=True)
model.init_sims(replace=True)  # unit-normalize the vectors in place
model.save('bio_word')         # native gensim format; the big array lands in a separate file
Later, load the model:
model = Word2Vec.load('bio_word', mmap='r')
for more info: https://groups.google.com/forum/#!topic/gensim/OvWlxJOAsCo
I have that problem whenever I use the Google News dataset. The issue is that there are way more words in the dataset than you'll ever need. There are a huge number of typos and whatnot. What I do is scan the data I'm working on, build a dictionary of, say, the 50k most common words, get the vectors with Gensim, and save the dictionary. Loading this dictionary takes half a second instead of 2 minutes.
If you have no specific dataset, you could use the 50 or 100k most common words from a big dataset, such as a news dataset from WMT to get you started.
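A rough sketch of that approach (file names are hypothetical; assumes your corpus fits in a plain text file):
import pickle
from collections import Counter
from gensim.models import KeyedVectors

# 1. Count words in your own data and keep the ~50k most common.
with open('my_corpus.txt') as f:
    counts = Counter(word for line in f for word in line.split())
vocab = [w for w, _ in counts.most_common(50000)]

# 2. Pull just those vectors out of the big pre-trained model (done once, offline).
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
small = {w: model[w] for w in vocab if w in model}

# 3. Save the small dictionary; loading it later takes a fraction of a second.
with open('small_vectors.pkl', 'wb') as f:
    pickle.dump(small, f)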
Another option is to keep Gensim running all the time. You can create a FIFO for a script running Gensim; the script acts like a "server" that reads a file to which a "client" writes, watching for vector requests.
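A bare-bones sketch of that FIFO idea (the pipe paths are made up; the point is only that the model is loaded once and stays resident):
import os
import json
from gensim.models import KeyedVectors

REQ, REP = '/tmp/w2v_requests', '/tmp/w2v_replies'
for path in (REQ, REP):
    if not os.path.exists(path):
        os.mkfifo(path)

model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

while True:
    # Opening a FIFO for reading blocks until a client opens it for writing.
    with open(REQ) as requests, open(REP, 'w') as replies:
        for word in requests:
            word = word.strip()
            vec = model[word].tolist() if word in model else None
            replies.write(json.dumps(vec) + '\n')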
I think the most elegant solution is to run a web service providing word embeddings. Check out the word2vec API as an example. After installing, getting the embedding for "restaurant" is as simple as:
curl http://127.0.0.1:5000/word2vec/model?word=restaurant

The speed between ImageDataLayer and LMDB data layer

Caffe supports an LMDB data layer and an ImageDataLayer.
Creating an LMDB database from a dataset requires some time and a lot of disk space.
In contrast, ImageDataLayer only uses a txt file listing the images, which is very convenient.
My question is, is there big speed difference between these two kinds of layers?
Thank you very much!
LMDB is designed for fast fetching of data for a given key. Also, the data is stored in an uncompressed format, which makes it easy for the machine to just read the data and pass it directly to the GPU for processing.
With ImageDataLayer, we have to read the image paths from the text file and use OpenCV to read each image into memory. This decompression of the images is computationally expensive.
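For concreteness, this is roughly how the two layers can be declared via Caffe's Python net specification (a sketch with made-up paths; the parameters that matter here are just source and backend):
import caffe
from caffe import layers as L, params as P

n = caffe.NetSpec()
# LMDB-backed Data layer: sequential reads of pre-serialized records.
n.data, n.label = L.Data(source='train_lmdb', backend=P.Data.LMDB,
                         batch_size=256, ntop=2)

m = caffe.NetSpec()
# ImageDataLayer: reads "path label" lines from a txt file and decodes each image with OpenCV.
m.data, m.label = L.ImageData(source='train_list.txt', batch_size=256,
                              new_height=227, new_width=227, ntop=2)
print(n.to_proto(), m.to_proto())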
But the best performance may not always come from the LMDB layer; it depends heavily on the configuration of the machine. Consider an example with a batch size of 256 and images of size 227x227x3, and assume you are using a very good GPU and a high-end multi-core CPU. Here a single image in LMDB format may occupy about 151KB, so a whole batch may occupy about 37MB. If the GPU is able to process 10 batches a second, the hard disk would need a read speed of 370MB/s. If you are using a normal SATA or external hard disk, reading such large chunks of data will be bottlenecked by the limits of the disk.
If Caffe cannot fetch data at the required speed, this bottleneck slows down the whole training process. At the same time, if you read the 256 images directly and use a multi-core build of OpenCV, the data prefetching may be handled more effectively than reading from an LMDB.
The above case will not occur if you have stored the LMDB data on an SSD, though!
Yes, the speed difference is indeed big. LMDB is optimized for high speed batch processing.

What are the advantages of memory-mapped files?

I've been researching memory mapped files for a project and would appreciate any thoughts from people who have either used them before, or decided against using them, and why?
In particular, I am concerned about the following, in order of importance:
concurrency
random access
performance
ease of use
portability
I think the advantage is really that you reduce the amount of data copying required over traditional methods of reading a file.
If your application can use the data "in place" in a memory-mapped file, it can come in without being copied; if you use a system call (e.g. Linux's pread()) then that typically involves the kernel copying the data from its own buffers into user space. This extra copying not only takes time, but decreases the effectiveness of the CPU's caches by accessing this extra copy of the data.
If the data actually has to be read from the disk (physical I/O), then the OS still has to read it in, and a page fault probably isn't any better performance-wise than a system call; but if it doesn't (i.e. the data is already in the OS cache), performance should in theory be much better.
On the downside, there's no asynchronous interface to memory-mapped files - if you attempt to access a page which isn't mapped in, it generates a page fault then makes the thread wait for the I/O.
The obvious disadvantage to memory mapped files is on a 32-bit OS - you can easily run out of address space.
I have used a memory mapped file to implement an 'auto complete' feature while the user is typing. I have well over 1 million product part numbers stored in a single index file. The file has some typical header information but the bulk of the file is a giant array of fixed size records sorted on the key field.
At runtime the file is memory mapped, cast to a C-style struct array, and we do a binary search to find matching part numbers as the user types. Only a few memory pages of the file are actually read from disk -- whichever pages are hit during the binary search.
Concurrency - I had an implementation problem where it would sometimes memory map the file multiple times in the same process space. This was a problem, as I recall, because sometimes the system couldn't find a large enough free block of virtual memory to map the file to. The solution was to only map the file once and thunk all calls to it. In retrospect, using a full-blown Windows service would have been cool.
Random Access - The binary search is certainly random access and lightning fast
Performance - The lookup is extremely fast. As users type a popup window displays a list of matching product part numbers, the list shrinks as they continue to type. There is no noticeable lag while typing.
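A minimal sketch of the same idea in Python (the original used C-style structs; the record layout here is hypothetical): memory-map a file of fixed-size records sorted by key, then binary-search it so only the touched pages are ever read from disk.
import mmap
import struct

RECORD_FMT = '32s64s'                        # hypothetical layout: 32-byte key, 64-byte payload
RECORD_SIZE = struct.calcsize(RECORD_FMT)

def find_record(path, key):
    with open(path, 'rb') as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        lo, hi = 0, len(mm) // RECORD_SIZE
        while lo < hi:                       # classic binary search over record indices
            mid = (lo + hi) // 2
            rec_key, payload = struct.unpack_from(RECORD_FMT, mm, mid * RECORD_SIZE)
            rec_key = rec_key.rstrip(b'\0')
            if rec_key == key:
                return payload.rstrip(b'\0')
            elif rec_key < key:
                lo = mid + 1
            else:
                hi = mid
    return None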
Memory mapped files can be used to either replace read/write access, or to support concurrent sharing. When you use them for one mechanism, you get the other as well.
Rather than lseeking and writing and reading around in a file, you map it into memory and simply access the bits where you expect them to be.
This can be very handy, and depending on the virtual memory interface can improve performance. The performance improvement can occur because the operating system now gets to manage this former "file I/O" along with all your other programmatic memory access, and can (in theory) leverage the paging algorithms and so forth that it is already using to support virtual memory for the rest of your program. It does, however, depend on the quality of your underlying virtual memory system. Anecdotes I have heard say that the Solaris and *BSD virtual memory systems may show better performance improvements than the VM system of Linux--but I have no empirical data to back this up. YMMV.
Concurrency comes into the picture when you consider the possibility of multiple processes using the same "file" through mapped memory. In the read/write model, if two processes wrote to the same area of the file, you could be pretty much assured that one process's data would arrive in the file, overwriting the other process's data. You'd get one, or the other--but not some weird intermingling. I have to admit I am not sure whether this is behavior mandated by any standard, but it is something you could pretty much rely on. (It's actually a good follow-up question!)
In the mapped world, in contrast, imagine two processes both "writing". They do so by doing "memory stores", which result in the O/S paging the data out to disk--eventually. But in the meantime, overlapping writes can be expected to occur.
Here's an example. Say I have two processes both writing 8 bytes at offset 1024. Process 1 is writing '11111111' and process 2 is writing '22222222'. If they use file I/O, then you can imagine, deep down in the O/S, there is a buffer full of 1s, and a buffer full of 2s, both headed to the same place on disk. One of them is going to get there first, and the other one second. In this case, the second one wins. However, if I am using the memory-mapped file approach, process 1 is going to do a memory store of 4 bytes, followed by another memory store of 4 bytes (let's assume that's the maximum memory store size). Process 2 will be doing the same thing. Based on when the processes run, you can expect to see any of the following:
11111111
22222222
11112222
22221111
The solution to this is to use explicit mutual exclusion--which is probably a good idea in any event. You were sort of relying on the O/S to do "the right thing" in the read/write file I/O case, anyway.
The classic mutual exclusion primitive is the mutex. For memory-mapped files, I'd suggest you look at a memory-mapped mutex, available using (e.g.) pthread_mutex_init().
Edit with one gotcha: When you are using mapped files, there is a temptation to embed pointers to the data in the file, in the file itself (think linked list stored in the mapped file). You don't want to do that, as the file may be mapped at different absolute addresses at different times, or in different processes. Instead, use offsets within the mapped file.
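A small sketch of that advice (hypothetical record layout): a linked list stored inside a memory-mapped file, where each node holds the byte offset of the next node (0 meaning end of list) rather than an in-memory pointer, so it works no matter where the file gets mapped.
import mmap
import struct

NODE_FMT = '<q16s'  # hypothetical node: 8-byte offset of the next node, 16-byte value

def walk(path, head_offset):
    # Follow offsets within the file rather than absolute addresses.
    with open(path, 'rb') as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        offset = head_offset
        while offset:
            next_offset, value = struct.unpack_from(NODE_FMT, mm, offset)
            yield value.rstrip(b'\0')
            offset = next_offset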
Concurrency would be an issue.
Random access is easier
Performance is good to great.
Ease of use. Not as good.
Portability - not so hot.
I've used them on a Sun system a long time ago, and those are my thoughts.