What does tf.contrib.slim.prefetch_queue do in TensorFlow?

I saw an example of training on the CIFAR-10 data using TensorFlow:
https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10
The code generates a batch of images from several single images using tf.train.batch,
and creates a queue of batches using prefetch_queue. I understand it is necessary to
use queues to pre-fetch data when the training data is large. I guess tf.train.batch maintains
a queue internally (because it has a capacity parameter). Since a queue is already maintained inside tf.train.batch, is it necessary to create
another queue with tf.contrib.slim.prefetch_queue? What does tf.contrib.slim.prefetch_queue do exactly?
The key parts of the CIFAR-10 example code are shown below:
import tensorflow as tf

images, labels = tf.train.batch(
    [image, label],
    batch_size=...,
    num_threads=...,
    capacity=...,
    min_after_dequeue=...)

batch_queue = tf.contrib.slim.prefetch_queue.prefetch_queue(
    [images, labels],
    capacity=...)

After a few hours of research, I think I can answer my own question.
tf.train.batch maintains a pool of single-frame images. When a new batch (with batch size n, for example) is needed, tf.train.batch fetches n items from the pool and creates a new batch.
prefetch_queue internally maintains another queue. It receives the batches created by tf.train.batch and puts them in that queue.
The implementation of prefetch_queue and tf.train.batch can be visualized in TensorBoard.
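For reference, here is a simplified sketch of roughly what slim's prefetch_queue builds under the hood, based on my reading of the contrib source (treat it as illustrative; the real implementation handles more cases, such as dictionaries of tensors):

import tensorflow as tf

def simple_prefetch_queue(tensors, capacity=8, num_threads=1):
    # A FIFO queue whose elements are whole batches, fed by its own QueueRunner.
    queue = tf.FIFOQueue(capacity=capacity,
                         dtypes=[t.dtype for t in tensors],
                         shapes=[t.get_shape() for t in tensors])
    enqueue_op = queue.enqueue(tensors)
    tf.train.add_queue_runner(
        tf.train.QueueRunner(queue, [enqueue_op] * num_threads))
    return queue

The training code then dequeues whole batches from this queue (as the CIFAR-10 example does with batch_queue.dequeue()), so a few fully-formed batches are always ready while the previous step is still running.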

I am not sure you are right. According to the documentation, tf.train.batch itself uses a QueueRunner to hold the data; that is to say, tf.train.batch can enqueue asynchronously, and you can fetch your data from its queue directly without delay when you need it. So why is a prefetch_queue essential if it only maintains another queue?
I am not very sure about what I said, just a little advice; I am also studying its mechanism these days. BTW, the new tf.data.Dataset API may be a better choice, with which we do not need to worry about these implementation details.
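For anyone on a newer TensorFlow, a rough tf.data equivalent of the batch-plus-prefetch pipeline might look like the following (dummy in-memory data is used here just to keep the sketch runnable; a real input pipeline would read and decode files instead):

import numpy as np
import tensorflow as tf

images = np.zeros((1000, 32, 32, 3), dtype=np.float32)   # placeholder data
labels = np.zeros((1000,), dtype=np.int64)

dataset = (tf.data.Dataset.from_tensor_slices((images, labels))
           .shuffle(buffer_size=1000)
           .batch(128)
           .prefetch(2))   # plays the role of prefetch_queue: keeps batches ready ahead of time
batch_images, batch_labels = dataset.make_one_shot_iterator().get_next()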

Related

What's a good mechanism to store device config information

I have 10k devices which have the same configuration attributes but different values. The config file includes the name, type, model, manufacture year, temperature, usage, and status. The first 4 values don't change, and the last 3 values keep changing every few seconds. Each device has a computer connected to it.
Between the following 2 ways of storing the configs, which is better?
Way 1: put the config information in a JSON file and store the JSON file on the computer that is connected to the device;
Way 2: put the config information in a database table.
The advantage of way 1 is that it has less latency, but the data is hard to maintain. Way 2 has more latency but is easier to maintain; we can just create an API to get the data from the table. Way 2 also has a TPS (transactions per second) issue as the number of devices grows: for example, if there are 80k devices and every device writes config data to the database table at the same time.
Update: As mentioned by Ciaran McHale, the three variables are dynamic information, so I added the following information to the question:
The 3 variables (temperature, usage, status) are dynamic information and can be kept in memory, but we also want to keep the latest values somewhere so that when we reboot the device or application we know those values. So my question is about a good mechanism to keep those latest values (database table vs. local JSON/XML/TXT file).
It seems to me that only the first 4 variables (type, model, manufacturer and year) belong in a configuration file. Since the other 3 variables (temperature, usage, status) change every few seconds, they are really dynamic state information that should be held in memory, and you should provide an API so this state information can be examined. The API might be, for example, via a client-server socket connection, or via shared memory. You are probably going to say, "I can't do that because [such-and-such]", so I suggest you update your question with such reasoning. Doing that might help you obtain more useful answers.
Edit due to extra information provided in updated question...
What I am about to suggest will work on Linux. I don't know about other operating systems. You can do man shm_open and man mmap to learn about shared memory. A shared-memory segment can survive across process reboots, and be backed by a file (on disk), so it can survive across machine reboots. My (possibly incorrect) understanding is that, most of the time, the file's contents will be cached in kernel buffers and virtual memory will map those kernel buffers into a process's address space, so reading/writing will be a memory-only operation; hence you won't suffer frequent disk I/O.
For simplicity, I am going to assume that each device needs to store the same sort of dynamic information, and this can be represented in a fixed-length struct, for example:
struct DeviceData {
double temperature;
SomeEnum usage;
AnotherEnum status;
};
You can have a shared-memory segment large enough to store an array of, say, 100,000 DeviceData structs. The static configuration file for each device will contain entries such as:
name="foo";
type="bar";
model="widget";
manufacture_year="2020";
shared_memory_id="/somename";
shared_memory_array_index="42";
The last two entries in the static configuration file specify the shared memory segment that the process should connect to, and the array index it should use to update the DeviceData associated with the process.
If the above seems suitable for your needs, then a challenge to deal with is efficient synchronization for reading/updating a DeviceData in shared memory. A good basic approach is discussed in a blog article called A scalable reader/writer scheme with optimistic retry. That blog article uses C# to illustrate the concept. If you are using C++, then I recommend you read Chapter 5 (The C++ memory model and operations on atomic types) of C++ Concurrency in Action by Anthony Williams. By the way, if you can use padding to ensure that DeviceData (complete with fields for m_version1 and m_version2 used in the blog article) is exactly the same size as one or more cache lines (a cache line is 64 bytes in most CPU architectures) then your implementation won't suffer from false sharing (which can needlessly reduce performance).
The final step is to avoid exposing the low-level shared-memory operations to developers. So write a simple wrapper API with, say, four operations to: connect() to and disconnect() from the shared memory segment, and readDeviceData() and updateDeviceData().
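To make the shape of that wrapper concrete, here is a minimal sketch in Python (a production version would be the C/C++ shm_open/mmap code described above, and the versioned-read synchronization from the blog article is omitted here). The file path, record layout, and function names are all illustrative:

import mmap
import os
import struct

RECORD_FMT = "=dii"                 # temperature (double), usage and status as ints
RECORD_SIZE = struct.calcsize(RECORD_FMT)
NUM_DEVICES = 100000

def connect(path="/tmp/device_data.bin"):
    # Open (or create) the backing file and map it into memory.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    os.ftruncate(fd, RECORD_SIZE * NUM_DEVICES)
    return mmap.mmap(fd, RECORD_SIZE * NUM_DEVICES)

def update_device_data(mm, index, temperature, usage, status):
    # Write one DeviceData record at the given array index.
    offset = index * RECORD_SIZE
    mm[offset:offset + RECORD_SIZE] = struct.pack(RECORD_FMT, temperature, usage, status)

def read_device_data(mm, index):
    # Read one DeviceData record back as a (temperature, usage, status) tuple.
    offset = index * RECORD_SIZE
    return struct.unpack(RECORD_FMT, mm[offset:offset + RECORD_SIZE])

def disconnect(mm):
    mm.close()

if __name__ == "__main__":
    mm = connect()
    update_device_data(mm, 42, 21.5, 1, 0)
    print(read_device_data(mm, 42))
    disconnect(mm)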

Most generally correct way of updating a vertex buffer in Vulkan

Assume a vertex buffer in device memory and a staging buffer that's host coherent and visible. Also assume a desktop system with a discrete GPU (so separate memories). And lastly, assume correct inter-frame synchronization.
I see two general possible ways of updating a vertex buffer:
Map + memcpy + unmap the staging buffer, then record a transient (single-use) command buffer containing a vkCmdCopyBuffer, submit it to the graphics queue, wait for the queue to idle, and free the transient command buffer. After that, submit the regular frame draw commands to the graphics queue as usual. This is the code used on https://vulkan-tutorial.com (for example, this .cpp file).
Similar to the above, only instead use additional semaphores to signal after the staging-buffer copy submit, and to wait on in the regular frame draw submit, thus skipping the wait-for-idle.
#2 sort of makes sense to me, and I've repeatedly read not to do any "wait-for-idle" operations in Vulkan because they synchronize the CPU with the GPU, but I've never seen the semaphore approach used in any tutorial or example online. What do the pros usually do if the vertex buffer has to be updated relatively often?
First, if you allocated coherent memory, then you almost certainly did so in order to access it from the CPU. Which requires mapping it. Vulkan is not OpenGL; there is no requirement that memory be unmapped before it can be used (and OpenGL doesn't even have that requirement anymore).
Unmapping memory should only ever be done when you are about to delete the memory allocation itself.
Second, if you think of an idea that involves having the CPU wait for a queue or device to idle before proceeding, then you have come up with a bad idea and should use a different one. The only time you should wait for a device to idle is when you want to destroy the device.
Tutorial code should not be trusted to give best practices. It is often intended to be simple, to make it easy to understand a concept. Simple Vulkan code often gets in the way of performance (and if you don't care about performance, you shouldn't be using Vulkan).
In any case, there is no "most generally correct way" to do most things in Vulkan. There are lots of definitely incorrect ways, but no "generally do this" advice. Vulkan is a low-level, explicit API, and the result of that is that you need to apply Vulkan's tools to your specific circumstances. And maybe profile on different hardware.
For example, if you're generating completely new vertex data every frame, it may be better to see if the implementation can read vertex data directly from coherent memory, so that there's no need for a staging buffer at all. Yes, the reads may be slower, but the overall process may be faster than a transfer followed by a read.
Then again, it may not. It may be faster on some hardware, and slower on others. And some hardware may not allow you to use coherent memory for any buffer that has the vertex input usage at all. And even if it's allowed, you may be able to do other work during the transfer, and thus the GPU spends minimal time waiting before reading the transferred data. And some hardware has a small pool of device-local memory which you can directly write to from the CPU; this memory is meant for these kinds of streaming applications.
If you are going to do staging however, then your choices are primarily about which queue you submit the transfer operation on (assuming the hardware has multiple queues). And this primarily relates to how much latency you're willing to endure.
For example, if you're streaming data for a large terrain system, then it's probably OK if it takes a frame or two for the vertex data to be usable on the GPU. In that case, you should look for an alternative, transfer-only queue on which to perform the copy from the staging buffer to the primary memory. If you do, then you'll need to make sure that later commands which use the eventual results synchronize with that queue, which will need to be done via a semaphore.
If you're in a low-latency scenario where the data being transferred needs to be used this frame, then it may be better to submit both to the same queue. You could use an event to synchronize them rather than a semaphore. But you should also endeavor to put some kind of unrelated work between the transfer and the rendering operation, so that you can take advantage of some degree of parallelism in operations.

Storing objects/variables outside of volatile memory

OVERVIEW
I don’t have a lot of experience programming, but I’m working on a hybrid mobile app using Cordova. This app is going to have a large amount of static (unchanging) data. Some of this data will be referenced about once every minute, some simple operations will be performed based on that reference, and the result will determine which object is referenced in the next iteration of the loop.
From what I understand, an object or variable is just a reserved space in memory identified by a name, which in hardware terms corresponds to volatile storage, i.e. RAM. Because I will be working with mobile devices, I am afraid that the massive number of objects I predict I will be working with (say, close to 10,000) will max out the device’s memory pretty fast.
My initial thought is to store this collection of static data in local storage instead of declaring these objects within the code itself. Then I would reference that file for the data as needed with each iteration of my loop, which runs once every minute. I don’t have experience with JSON, but from what I know about it, this seems like it could be a good option.
BREAKDOWN
• I’m using typescript and Cordova.
• I will possibly be working with 10s of thousands of static objects.
• These objects will all be using one of a few interfaces as an outline.
• A few of these objects will be referenced for some information about once every minute.
• That information will be used to perform very simple operations.
• The Id of the object that was referenced may need to be saved permanently for future use.
• Those operations will determine what objects need to be referenced in the next iteration.
QUESTION(S)
So, my question is this. Am I correct in my understanding of how objects are stored? If so, will this number of objects be enough to max out a mobile device's RAM? Is my idea of storing all the static information in something like a JSON file and then referencing the individual objects in that file as needed plausible?
Not quite correct. Modern operating systems don't always map the application's memory to the hardware RAM.
Let's say you have a phone that only has 256MB of total RAM, but your application ends up loading 128MB of data into memory. Does that mean you can only use one more application that loads 128MB of memory? What about the OS itself using memory? The answer is that the OS will move some of the data from primary memory (RAM) into secondary storage (e.g. a solid-state drive), making room for your app and other apps to do their work as needed. If the data that was moved out of RAM is needed again, the OS can move it back into RAM from the SSD. This is called paging, and it's one of the many pieces that make up the operating system's memory management. Most of it is done without your application code having to be aware of it.
Of course, even though the OS does a pretty great job of making memory available to your application, you still want to write memory-efficient code, especially on mobile phones.
For your specific example, your suggestion of storing the static data in local storage is a good start. But it has some drawbacks you should be aware of, and some questions you should answer:
Can you divide up the data so that you load only the portion you need at a time (see the sketch after this list), or do you need to have all of it loaded anyway?
Can you store your data in a more compressed data structure? (see for example Tries)
How frequently will you be loading the data from local storage?
Will loading the data from local storage take too long? (For example, if your loop does a thousand iterations and each iteration loads a lot of static data from disk, it might end up being really slow.)
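To illustrate the "load only the portion you need" idea from the first question above, here is a rough sketch (in Python for brevity; the same chunking pattern applies to the JSON files a Cordova/TypeScript app would read from local storage). The directory name, chunk size, and the assumption of sequential integer object IDs are all illustrative:

import json
import os

CHUNK_SIZE = 500                       # objects per file; tune for your device

def split_static_data(objects, out_dir="static_data"):
    # One-time step: split the big list of static objects into small chunk files.
    os.makedirs(out_dir, exist_ok=True)
    for i in range(0, len(objects), CHUNK_SIZE):
        chunk = objects[i:i + CHUNK_SIZE]
        with open(os.path.join(out_dir, "chunk_%d.json" % (i // CHUNK_SIZE)), "w") as f:
            json.dump(chunk, f)

def load_object(object_id, out_dir="static_data"):
    # Each loop iteration loads only the chunk that contains the object it needs.
    chunk_file = os.path.join(out_dir, "chunk_%d.json" % (object_id // CHUNK_SIZE))
    with open(chunk_file) as f:
        chunk = json.load(f)
    return chunk[object_id % CHUNK_SIZE]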
Good luck!

How to speed up Gensim Word2vec model load time?

I'm building a chatbot so I need to vectorize the user's input using Word2Vec.
I'm using a pre-trained model with 3 million words by Google (GoogleNews-vectors-negative300).
So I load the model using Gensim:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
The problem is that it takes about 2 minutes to load the model. I can't let the user wait that long.
So what can I do to speed up the load time?
I thought about putting each of the 3 million words and their corresponding vector into a MongoDB database. That would certainly speed things up but intuition tells me it's not a good idea.
In recent gensim versions you can load a subset starting from the front of the file using the optional limit parameter to load_word2vec_format(). (The GoogleNews vectors seem to be in roughly most- to least- frequent order, so the first N are usually the N-sized subset you'd want. So use limit=500000 to get the most-frequent 500,000 words' vectors – still a fairly large vocabulary – saving 5/6ths of the memory/load-time.)
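For example, with the file from the question:

from gensim.models import KeyedVectors

# Load only the first 500,000 (roughly most frequent) word vectors.
model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True, limit=500000)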
So that may help a bit. But if you're re-loading for every web-request, you'll still be hurting from loading's IO-bound speed, and the redundant memory overhead of storing each re-load.
There are some tricks you can use in combination to help.
Note that after loading such vectors in their original word2vec.c-originated format, you can re-save them using gensim's native save(). If you save them uncompressed, and the backing array is large enough (and the GoogleNews set is definitely large enough), the backing array gets dumped in a separate file in a raw binary format. That file can later be memory-mapped from disk, using gensim's native load(filename, mmap='r') option.
Initially, this will make the load seem snappy – rather than reading all the array from disk, the OS will just map virtual address regions to disk data, so that some time later, when code accesses those memory locations, the necessary ranges will be read-from-disk. So far so good!
However, if you are doing typical operations like most_similar(), you'll still face big lags, just a little later. That's because this operation requires both an initial scan-and-calculation over all the vectors (on first call, to create unit-length-normalized vectors for every word), and then another scan-and-calculation over all the normed vectors (on every call, to find the N-most-similar vectors). Those full-scan accesses will page-into-RAM the whole array – again costing the couple-of-minutes of disk IO.
What you want is to avoid redundantly doing that unit-normalization, and to pay the IO cost just once. That requires keeping the vectors in memory for re-use by all subsequent web requests (or even multiple parallel web requests). Fortunately memory-mapping can also help here, albeit with a few extra prep steps.
First, load the word2vec.c-format vectors, with load_word2vec_format(). Then, use model.init_sims(replace=True) to force the unit-normalization, destructively in-place (clobbering the non-normalized vectors).
Then, save the model to a new filename-prefix: model.save('GoogleNews-vectors-gensim-normed.bin'). (Note that this actually creates multiple files on disk that need to be kept together for the model to be re-loaded.)
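Putting those one-time preparation steps together:

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)
model.init_sims(replace=True)                        # unit-normalize, destructively in place
model.save('GoogleNews-vectors-gensim-normed.bin')   # native format; creates several files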
Now, we'll make a short Python program that serves to both memory-map load the vectors, and force the full array into memory. We also want this program to hang until externally terminated (keeping the mapping alive), and be careful not to re-calculate the already-normed vectors. This requires another trick because the loaded KeyedVectors actually don't know that the vectors are normed. (Usually only the raw vectors are saved, and normed versions re-calculated whenever needed.)
Roughly the following should work:
from gensim.models import KeyedVectors
from threading import Semaphore
model = KeyedVectors.load('GoogleNews-vectors-gensim-normed.bin', mmap='r')
model.syn0norm = model.syn0 # prevent recalc of normed vectors
model.most_similar('stuff') # any word will do: just to page all in
Semaphore(0).acquire() # just hang until process killed
This will still take a while, but only needs to be done once, before/outside any web requests. While the process is alive, the vectors stay mapped into memory. Further, unless/until there's other virtual-memory pressure, the vectors should stay loaded in memory. That's important for what's next.
Finally, in your web request-handling code, you can now just do the following:
model = KeyedVectors.load('GoogleNews-vectors-gensim-normed.bin', mmap='r')
model.syn0norm = model.syn0 # prevent recalc of normed vectors
# … plus whatever else you wanted to do with the model
Multiple processes can share read-only memory-mapped files. (That is, once the OS knows that file X is in RAM at a certain position, every other process that also wants a read-only mapped version of X will be directed to re-use that data at that position.)
So this web-request load(), and any subsequent accesses, can all re-use the data that the prior process already brought into address space and active memory. Operations requiring similarity calculations against every vector will still take the time to access multiple GB of RAM, and do the calculations/sorting, but will no longer require extra disk IO and redundant re-normalization.
If the system is facing other memory pressure, ranges of the array may fall out of memory until the next read pages them back in. And if the machine lacks the RAM to ever fully load the vectors, then every scan will require a mix of paging in and out, and performance will be frustratingly bad no matter what. (In such a case: get more RAM or work with a smaller vector set.)
But if you do have enough RAM, this winds up making the original/natural load-and-use-directly code "just work" in a quite fast manner, without an extra web service interface, because the machine's shared file-mapped memory functions as the service interface.
I really love vzhong's Embedding library. https://github.com/vzhong/embeddings
It stores the word vectors in SQLite, which means we don't need to load the whole model into memory; we just fetch the corresponding vectors from the DB :D
Success method:
from gensim.models import Word2Vec

model = Word2Vec.load_word2vec_format('wikipedia-pubmed-and-PMC-w2v.bin', binary=True)
model.init_sims(replace=True)
model.save('bio_word')
Later, load the model:
model = Word2Vec.load('bio_word', mmap='r')
for more info: https://groups.google.com/forum/#!topic/gensim/OvWlxJOAsCo
I have that problem whenever I use the Google News dataset. The issue is that there are way more words in the dataset than you'll ever need, including a huge number of typos and whatnot. What I do is scan the data I'm working on, build a dictionary of, say, the 50k most common words, get their vectors with gensim, and save the dictionary. Loading this dictionary takes half a second instead of 2 minutes.
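A rough sketch of that scan-and-save approach (the corpus file, the pickle filename, and the 50k cut-off are all illustrative):

import pickle
from collections import Counter
from gensim.models import KeyedVectors

def corpus_tokens(path="my_corpus.txt"):
    # Placeholder: yield the tokens of whatever data you're actually working on.
    with open(path) as f:
        for line in f:
            yield from line.split()

counts = Counter(corpus_tokens())
common = [word for word, _ in counts.most_common(50000)]

full = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)
small = {word: full[word] for word in common if word in full}   # keep only what you need

with open('vectors_50k.pkl', 'wb') as f:
    pickle.dump(small, f)

# Later, loading the reduced dictionary takes a fraction of a second:
with open('vectors_50k.pkl', 'rb') as f:
    vectors = pickle.load(f)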
If you have no specific dataset, you could use the 50 or 100k most common words from a big dataset, such as a news dataset from WMT to get you started.
Another option is to keep gensim running all the time. You can create a FIFO for a script running gensim: the script acts like a "server" that reads a file to which a "client" writes, watching for vector requests.
I think the most elegant solution is to run a web service providing word embeddings. Check out the word2vec API as an example. After installing, getting the embedding for "restaurant" is as simple as:
curl http://127.0.0.1:5000/word2vec/model?word=restaurant

Using eventlet to process large number of https requests with large data returns

I am trying to use eventlet to process a large number of data requests, approximately 100,000 requests at a time to a remote server, each of which should generate a 10-15 KB JSON response. I have to decode the JSON, then perform some data transformations (some field name changes, some simple transforms like English-to-metric, but a few requiring minor parsing), and send all 100,000 requests out the back end as XML in a couple of formats expected by a legacy system. I'm using the code from the eventlet example which uses imap() ("for body in pool.imap(fetch, urls): ..."), lightly modified. eventlet is working well so far on a small sample (5k URLs) for fetching the JSON data.
My question is whether I should add the non-I/O processing (JSON decode, field transforms, XML encode) to the fetch() function, so that all that transform processing happens in the greenthread, or do the bare minimum in the greenthread, return the raw response body, and do the main processing in the "for body in pool.imap():" loop. I'm concerned that if I do the latter, the amount of data from completed threads will start building up and bloat memory, whereas the former would essentially throttle the process so that the XML output keeps up. Suggestions on the preferred way to implement this are welcome. Oh, and this will eventually run hourly from cron, so it really has a time window it has to fit into. Thanks!
Ideally, you put each data processing operation into a separate green thread. Then, only when required, combine several operations into a batch or use a pool to throttle concurrency.
When you do non-IO-bound processing in one loop, you essentially throttle concurrency to one simultaneous task. But you can run those in parallel using the (OS) thread pool in the eventlet.tpool module.
Throttle concurrency only when you have too much parallel CPU-bound code running.
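A rough sketch of that split, with the I/O done in green threads and the CPU-bound transforms handed to the OS thread pool via tpool (the URL list, the transform details, and requests as the HTTP client are stand-ins for your actual code):

import json

import eventlet
eventlet.monkey_patch()      # make socket I/O cooperative for green threads
from eventlet import tpool
import requests              # stand-in HTTP client; non-blocking once sockets are patched

urls = ["https://example.com/data/%d" % i for i in range(100)]   # placeholder URLs

def fetch(url):
    # I/O-bound: fine to run directly in a green thread
    return requests.get(url).text

def transform(body):
    # CPU-bound: JSON decode + field renames + XML encode (greatly simplified here)
    record = json.loads(body)
    return "<record><name>%s</name></record>" % record.get("name", "")

def fetch_and_transform(url):
    body = fetch(url)
    # hand the CPU-bound work to the OS thread pool so it doesn't block the event hub
    return tpool.execute(transform, body)

pool = eventlet.GreenPool(200)
for xml in pool.imap(fetch_and_transform, urls):
    pass  # write the XML out to the legacy system here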