Best way to save/load 4-dimensional array in Octave - csv

I have Octave code that gathers data from thousands of .csv files and stores it in a 4-dimensional matrix (800x8x80x213) so I can access it with other code. The process of reading in the data takes about 10 minutes, so I thought it would be a good idea to save the matrix and then load it into the workspace whenever I wanted to work with the data, instead of waiting 10 minutes for the matrix to be created. I used save to save the matrix and load to load it into the workspace; however, loading the matrix took 30 minutes to complete. Is there a better/faster way to save/load this 4-D matrix? Seems ridiculous that it takes 3 times longer to load a matrix than to create it from 4000+ files...

The default 'format' option used by the save command is -text, which is human readable. For large datasets, this will take a long time to create (not to mention it will lead to a much larger file, since it needs to represent floating-point numbers via their text representations...), so it is indeed inappropriate for this kind of data. Loading from a large text-format file will also take quite a long time, especially on a slow computer, for the same reasons.
Octave also supports a -binary option, which is Octave's internal binary format. This is what you need. E.g.
save -binary outputfile.bin varname
In this particular case, the text file is 2.2 GB, whereas the binary file is the expected 872 MB (i.e. 800 × 8 × 80 × 213 = 109,056,000 elements × 8 bytes per element ≈ 872 MB). Saving and loading is near-instant.
Alternatively, there are other options too, corresponding to other common formats, e.g. -hdf5 (as a commenter has also mentioned here), or -v7, which is Matlab's .mat format.
Type help save in your Octave console for more details.
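If you also need to read the saved data from outside Octave, the -hdf5 option is handy, since HDF5 readers exist for most languages. As a rough sketch in Python (the varname/value group layout is what recent Octave versions write, but that's an assumption worth verifying with h5ls -r first; also note the dimensions may appear reversed, since Octave is column-major and NumPy is row-major by default):

import h5py

# After `save -hdf5 outputfile.h5 varname` in Octave:
with h5py.File('outputfile.h5', 'r') as f:
    f.visit(print)                    # list what Octave actually wrote
    data = f['varname/value'][...]    # assumed layout; verify against the listing
print(data.shape)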

Related

How to generate reports in Snakemake (that include runtime, cpu, memory) that mimic the nextflow reports

I am used to working with Nextflow, which automatically generates reports for each process, so that I know how much time, CPU, and memory was used in each part of my workflow. Is there an equivalent of this in Snakemake? If the authors of the Snakemake pipeline don't manually report this, is there a way to extract this information automatically?
You might be able to use the benchmark directive to extract what you want without too much work:
The benchmark directive takes a string that points to the file where benchmarking results shall be stored. Similar to output files, the path can contain wildcards (it must be the same wildcards as in the output files). When a job derived from the rule is executed, Snakemake will measure the wall clock time and memory usage (in MiB) and store it in the file in tab-delimited format. It is possible to repeat a benchmark multiple times in order to get a sense for the variability of the measurements. This can be done by annotating the benchmark file, e.g., with repeat("benchmarks/{sample}.bwa.benchmark.txt", 3), Snakemake can be told to run the job three times. The repeated measurements occur as subsequent lines in the tab-delimited benchmark file.
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#benchmark-rules
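For example, a minimal rule using the directive might look like this (the rule name, paths, and shell command are made up for illustration; the resulting tab-delimited file has columns such as s for wall time and max_rss for memory):

rule align:
    input:
        "data/{sample}.fastq"
    output:
        "mapped/{sample}.bam"
    benchmark:
        repeat("benchmarks/{sample}.align.benchmark.txt", 3)  # run the job 3 times
    shell:
        "bwa mem ref.fa {input} | samtools view -b - > {output}"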

How to speed up Gensim Word2vec model load time?

I'm building a chatbot so I need to vectorize the user's input using Word2Vec.
I'm using Google's pre-trained model with 3 million words (GoogleNews-vectors-negative300).
So I load the model using Gensim:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
The problem is that it takes about 2 minutes to load the model. I can't let the user wait that long.
So what can I do to speed up the load time?
I thought about putting each of the 3 million words and their corresponding vector into a MongoDB database. That would certainly speed things up but intuition tells me it's not a good idea.
In recent gensim versions you can load a subset starting from the front of the file using the optional limit parameter to load_word2vec_format(). (The GoogleNews vectors seem to be in roughly most-frequent to least-frequent order, so the first N are usually the N-sized subset you'd want. So use limit=500000 to get the most-frequent 500,000 words' vectors – still a fairly large vocabulary – saving 5/6ths of the memory/load time.)
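For instance, a minimal sketch (the file name is from the question above):

from gensim.models import KeyedVectors

# Load only the first 500k entries of the 3M-word file.
model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True, limit=500000)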
So that may help a bit. But if you're re-loading for every web-request, you'll still be hurting from loading's IO-bound speed, and the redundant memory overhead of storing each re-load.
There are some tricks you can use in combination to help.
Note that after loading such vectors in their original word2vec.c-originated format, you can re-save them using gensim's native save(). If you save them uncompressed, and the backing array is large enough (and the GoogleNews set is definitely large enough), the backing array gets dumped in a separate file in a raw binary format. That file can later be memory-mapped from disk, using gensim's native load(filename, mmap='r') option.
Initially, this will make the load seem snappy – rather than reading all the array from disk, the OS will just map virtual address regions to disk data, so that some time later, when code accesses those memory locations, the necessary ranges will be read-from-disk. So far so good!
However, if you are doing typical operations like most_similar(), you'll still face big lags, just a little later. That's because this operation requires both an initial scan-and-calculation over all the vectors (on first call, to create unit-length-normalized vectors for every word), and then another scan-and-calculation over all the normed vectors (on every call, to find the N-most-similar vectors). Those full-scan accesses will page-into-RAM the whole array – again costing the couple-of-minutes of disk IO.
What you want is to avoid redundantly doing that unit-normalization, and to pay the IO cost just once. That requires keeping the vectors in memory for re-use by all subsequent web requests (or even multiple parallel web requests). Fortunately memory-mapping can also help here, albeit with a few extra prep steps.
First, load the word2vec.c-format vectors, with load_word2vec_format(). Then, use model.init_sims(replace=True) to force the unit-normalization, destructively in-place (clobbering the non-normalized vectors).
Then, save the model to a new filename-prefix: model.save('GoogleNews-vectors-gensim-normed.bin'). (Note that this actually creates multiple files on disk that need to be kept together for the model to be re-loaded.)
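Putting those two steps together (this mirrors the description above; init_sims() and the syn0 attributes are the pre-4.0 gensim API):

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)
model.init_sims(replace=True)  # unit-normalize in place, clobbering the raw vectors
model.save('GoogleNews-vectors-gensim-normed.bin')  # writes several files; keep them together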
Now, we'll make a short Python program that serves to both memory-map load the vectors, and force the full array into memory. We also want this program to hang until externally terminated (keeping the mapping alive), and be careful not to re-calculate the already-normed vectors. This requires another trick because the loaded KeyedVectors actually don't know that the vectors are normed. (Usually only the raw vectors are saved, and normed versions re-calculated whenever needed.)
Roughly the following should work:
from gensim.models import KeyedVectors
from threading import Semaphore
model = KeyedVectors.load('GoogleNews-vectors-gensim-normed.bin', mmap='r')
model.syn0norm = model.syn0 # prevent recalc of normed vectors (syn0/syn0norm are pre-4.0 gensim attribute names)
model.most_similar('stuff') # any word will do: just to page all in
Semaphore(0).acquire() # just hang until process killed
This will still take a while, but only needs to be done once, before/outside any web requests. While the process is alive, the vectors stay mapped into memory. Further, unless/until there's other virtual-memory pressure, the vectors should stay loaded in memory. That's important for what's next.
Finally, in your web request-handling code, you can now just do the following:
model = KeyedVectors.load('GoogleNews-vectors-gensim-normed.bin', mmap='r')
model.syn0norm = model.syn0 # prevent recalc of normed vectors
# … plus whatever else you wanted to do with the model
Multiple processes can share read-only memory-mapped files. (That is, once the OS knows that file X is in RAM at a certain position, every other process that also wants a read-only mapped version of X will be directed to re-use that data, at that position.)
So this web-request load(), and any subsequent accesses, can all re-use the data that the prior process already brought into address-space and active-memory. Operations requiring similarity-calcs against every vector will still take the time to access multiple GB of RAM, and do the calculations/sorting, but will no longer require extra disk-IO and redundant re-normalization.
If the system is facing other memory pressure, ranges of the array may fall out of memory until the next read pages them back in. And if the machine lacks the RAM to ever fully load the vectors, then every scan will require a mix of paging in and out, and performance will be frustratingly bad no matter what. (In such a case: get more RAM or work with a smaller vector set.)
But if you do have enough RAM, this winds up making the original/natural load-and-use-directly code "just work" in a quite fast manner, without an extra web service interface, because the machine's shared file-mapped memory functions as the service interface.
I really love vzhong's Embedding library. https://github.com/vzhong/embeddings
It stores word vectors in SQLite, which means we don't need to load the whole model, just fetch the corresponding vectors from the DB :D
A method that worked for me (note these are older gensim API names; newer versions expose the same steps on KeyedVectors):
from gensim.models import Word2Vec
model = Word2Vec.load_word2vec_format('wikipedia-pubmed-and-PMC-w2v.bin', binary=True)
model.init_sims(replace=True)  # unit-normalize in place
model.save('bio_word')
Later, load the model:
model = Word2Vec.load('bio_word', mmap='r')
for more info: https://groups.google.com/forum/#!topic/gensim/OvWlxJOAsCo
I have that problem whenever I use the Google News dataset. The issue is there are way more words in the dataset than you'll ever need. There are a huge number of typos and whatnot. What I do is scan the data I'm working on, build a dictionary of, say, the 50k most common words, get the vectors with gensim and save the dictionary. Loading this dictionary takes half a second instead of 2 minutes.
If you have no specific dataset, you could use the 50k or 100k most common words from a big dataset, such as a news dataset from WMT, to get you started.
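A rough sketch of that approach (assuming a plain-text corpus; the file names and 50k cutoff are illustrative):

import pickle
from collections import Counter

from gensim.models import KeyedVectors

# Count words in your own data and keep the 50k most common.
with open('corpus.txt') as f:
    counts = Counter(word for line in f for word in line.split())
vocab = [word for word, _ in counts.most_common(50000)]

# Pull just those vectors out of the big model, then save the small dict.
full = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)
small = {word: full[word] for word in vocab if word in full}

with open('vectors_50k.pkl', 'wb') as f:
    pickle.dump(small, f)  # reloads in well under a second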
Other options are to always keep gensim running. You can create a FIFO for a script running gensim: the script acts like a "server", reading a file to which a "client" writes and watching for vector requests.
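A bare-bones sketch of that idea (the pipe path and the one-word-per-line protocol are invented for illustration):

import os

from gensim.models import KeyedVectors

FIFO_PATH = '/tmp/w2v_requests'  # hypothetical named pipe
if not os.path.exists(FIFO_PATH):
    os.mkfifo(FIFO_PATH)

# Pay the 2-minute load cost once, at server start.
model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

while True:
    with open(FIFO_PATH) as fifo:  # blocks until a client opens and writes
        for word in fifo:
            word = word.strip()
            if word in model:
                print(word, model[word][:5])  # respond however your client expects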
I think the most elegant solution is to run a web service providing word embeddings. Check out the word2vec API as an example. After installing, getting the embedding for "restaurant" is as simple as:
curl http://127.0.0.1:5000/word2vec/model?word=restaurant

Flink writes SingleOutputStreamOperator to two files instead of one

I am trying Flink for a project at work. I have reached the point where I process a stream by applying count windowing, etc. However, I noticed a peculiar behavior, which I cannot explain.
It seems that a stream is processed by two threads, and the output is also separated in two parts.
First I noticed the behavior when printing the stream to standard console by using stream.print().
Then, I printed to a file and it is actually printing in two files named 1 and 2, in the output folder.
SingleOutputStreamOperator<Tuple3<String, String,String>> c = stream_with_no_err.countWindow(4).apply(new CountPerWindowFunction());
// c.print() // this also prints two streams to the standard console
c.writeAsCsv("output");
Can someone please explain why Flink behaves this way? How can I configure it? Why is it necessary for the resulting stream to be split?
Parallelism I understand as being useful for speed (multiple threads), but why split the resulting stream?
Usually, I would like to have the resulting stream (after the processing) as a single file, a TCP stream, etc. Is the normal workflow to manually combine the two files and produce a single output?
Thanks!
Flink is a distributed and parallel stream processor. As you said correctly, parallelization is necessary to achieve high throughput. The throughput of an application is bounded by its slowest operator, so in many cases the sink needs to be parallelized as well.
Having said this, it is super simple to reduce the parallelism of your sink to 1:
c.writeAsCsv("output").setParallelism(1);
Now, the sink will run as a single thread and only produce a single file.

How does a stored image or video appear in binary on the hard drive?

In attempting to understand the concept of binary, my question is "How does a stored image or video look in binary on the hard drive?"
As for how it is physically stored, it depends on the technology of your storage device. For a hard disk drive you can read about it on Wikipedia.
The next layer is how the controller on the storage device sends the data to the motherboard.
Then how the motherboard sends the data to the operating system.
Then how the operating system stores the data on the disk (what file system it uses; NTFS is common in modern Windows installations.)
Finally, what you'll see when reading the data is groups of 8 bits (bytes), which are basically 8 on/off flags that together form 256 possible combinations. That is why most image formats store colors as values from 0-255 for each channel (red, green, blue). Most raw formats are stored linearly, so you can actually try reading them yourself. A raw image where the first pixel is red (assuming it stores the pixels left-to-right, top-to-bottom) would look like this in bits:
11111111 00000000 00000000
red      green    blue
For more information, you'll have to be more specific.
Every file on disk is basically a number of bits in a row.
The difference between "binary" and "something else" (often called ASCII, or text, or...) is that non-binary is basically human readable when opened in a text editor. In other words: the bytes in the file map to human-readable letters (and other characters) in some way a generic text editor knows how to handle.
So-called binary files can only be interpreted back into the data they actually contain when you know the format that was used to map the content (image, sound, movie, whatever) to a stream of zeros and ones. This mapping is called the file format and is usually part of the file name in the form of an extension. You need a piece of software that knows the mapping and can interpret the row of bits back into the original content.
Mind you: this is usually only a hint. Renaming a JPEG image file to have a .mp3 extension doesn't change it into an audio file; it is still just an image file, containing the image (=dimensions of the image in pixels + the color values for each pixel, basically) encoded into a stream of zeros and ones in the way described in the JPEG file format encoding description.
Check out the link: Binary File Format
The images are a sequential flow of colored dots... But it's not hardware dependent, i.e. your hard disk will store anything in whatever format the OS hands it. However, the OS maintains standards for saving file formats; otherwise a JPG image would not be a valid one across different platforms.
Similarly, videos are flows of images and audio data multiplexed into a sequential stream.
All data on commercial computer systems are stored in binary format (we'll ignore scientific studies into quantum and optical computing).
At the lowest level all files and processing by a computer are performed in binary. This is because our computing systems are powered by the flow of electrons. They either flow or don't. Electric current is on or off. 1 and 0.
The data stored on a hard disk is there due to pulsing of the hard disk write head coil which magnetises spots of hard disk material. These magnetised spots cause a current pulse in the read coil (in actual fact the read and write coils are the same) as the hard disk head passes over them. Hence the data is read as a stream of current pulses, 1s and 0s.
Now, processors are built to accept and process a finite number of binary "pulses" or data bits simultaneously (anything from 4 bits upwards). Hence a modern 64-bit PC can process 64 binary data bits, i.e. 64 1s and 0s, at any one time.
Now at a higher level: although all files are stored as binary and can be read in binary format, we help the processing of them by indicating what format to read them in. This is so the file data is processed in small chunks, e.g. 8 bits or 1 byte for ASCII text.
The operating system keeps a template for any given file, set up in an extension relation table. Based on the file extension, the operating system expects the data to be in a particular format and links it to code that can interpret it. Hence changing a file name extension confuses this lookup, and the data won't be interpreted correctly. That's why renaming a file from *.jpg to *.exe won't show the image: the system has been told to expect executable code, which the data within the file clearly isn't.
So back to your original question the image within the jpeg file has been encoded as series of 1s and 0s in a specific order.
I'm not sure how exactly they are arranged, but as an example:
A picture is captured and stored as a bitmap at a resolution of 800 x 600 in 24-bit colour. The first pixel is stored as 3 bytes (8-bit binary each) representing a red, green and blue value. The value of each byte dictates the intensity of that colour: 0-255, with 0 being none at all and 255 being the highest value. Unsigned 255 in binary is 11111111; I won't confuse you with 2's complement for signed values. So the full picture requires a file of at minimum 1,440,000 bytes, or about 1,406 kilobytes (a kilobyte being 1,024 bytes).
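The arithmetic, spelled out (the sizes are from the example above):

width, height = 800, 600
bytes_per_pixel = 3  # 24-bit colour = 3 bytes per pixel
size = width * height * bytes_per_pixel
print(size)          # 1440000 bytes
print(size / 1024)   # 1406.25 -> about 1,406 kilobytes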
Binary such as 000010101011010101101010101 is stored on a hard drive by changing the polarity of the metallic grains on the disk in specific microscopic regions. A binary number's place values are read from right to left, the opposite of how most people read text.
If your question is really "how does it look": See Figure 4 on this page; it shows high resolution measurements of a hard drive.
Although googletorp's answer does not look very helpful, it's not totally untrue. To store binary data, the only thing you need is the possibility to have two different states for each storage unit (be it an on/off switch, hole or no hole in a punchcard, or, as in the case of hard drives, the direction of ferromagnetic particles).
The Wikipedia page for the BMP File Format contains an example (including all hex values) of a 2x2 pixel bitmap image; it should be very good at explaining the basics of the binary representation of an image.
In general, if you're really curious how the binary of a file looks, you could always use a hex viewer and take a look yourself :) I normally use od on Linux to dump the binary information of a file. I'm sure you can google a good hex editor for Windows (or maybe someone can suggest one).
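If you'd rather not install anything, a few lines of Python do the same job (the file name is just an example):

# Peek at the first 16 bytes of any file, as bits and as hex.
with open('photo.jpg', 'rb') as f:
    chunk = f.read(16)
print(' '.join(f'{b:08b}' for b in chunk))  # raw bits
print(chunk.hex(' '))                       # same bytes as hex (Python 3.8+)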
Headers? Every file contains header information, which is also stored as binary bits along with the data. The header bits of a file hold information such as the header length, file type, file location and length. Each application is designed to read certain file types; if an application tries to open a file whose header describes a file format the application doesn't support, it fails to read the file. Thus a text file cannot be opened using a media player, because a media player expects a file whose header matches an audio file format's binary pattern. The same goes for picture files.

What are the advantages of memory-mapped files?

I've been researching memory mapped files for a project and would appreciate any thoughts from people who have either used them before, or decided against using them, and why?
In particular, I am concerned about the following, in order of importance:
concurrency
random access
performance
ease of use
portability
I think the advantage is really that you reduce the amount of data copying required over traditional methods of reading a file.
If your application can use the data "in place" in a memory-mapped file, it can come in without being copied; if you use a system call (e.g. Linux's pread()) then that typically involves the kernel copying the data from its own buffers into user space. This extra copying not only takes time, but decreases the effectiveness of the CPU's caches by accessing this extra copy of the data.
If the data actually has to be read from the disk (as in physical I/O), then the OS still has to read it in, and a page fault probably isn't any better performance-wise than a system call; but if it doesn't (i.e. it's already in the OS cache), performance should in theory be much better.
On the downside, there's no asynchronous interface to memory-mapped files - if you attempt to access a page which isn't mapped in, it generates a page fault, which makes the thread wait for the I/O.
The obvious disadvantage to memory mapped files is on a 32-bit OS - you can easily run out of address space.
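To make the copy-avoidance concrete, here's a minimal sketch in Python (the file name is illustrative; memoryview avoids the extra copy that plain slicing would make):

import mmap

with open('data.bin', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mm)     # zero-copy view of the mapping
    total = sum(view[:4096])  # touching the bytes faults those pages in
    view.release()
    mm.close()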
I have used a memory mapped file to implement an 'auto complete' feature while the user is typing. I have well over 1 million product part numbers stored in a single index file. The file has some typical header information but the bulk of the file is a giant array of fixed size records sorted on the key field.
At runtime the file is memory mapped, cast to a C-style struct array, and we do a binary search to find matching part numbers as the user types. Only a few memory pages of the file are actually read from disk -- whichever pages are hit during the binary search.
Concurrency - I had an implementation problem where it would sometimes memory-map the file multiple times in the same process space. This was a problem, as I recall, because sometimes the system couldn't find a large enough free block of virtual memory to map the file to. The solution was to only map the file once and thunk all calls to it. In retrospect, using a full-blown Windows service would have been cool.
Random Access - The binary search is certainly random access and lightning fast.
Performance - The lookup is extremely fast. As users type a popup window displays a list of matching product part numbers, the list shrinks as they continue to type. There is no noticeable lag while typing.
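In Python, the same pattern looks roughly like this (the record layout, header size, and file name are invented; the original was a C-style struct array, and bisect's key= parameter needs Python 3.10+):

import mmap
import struct
from bisect import bisect_left

REC = struct.Struct('16s48s')  # 16-byte part-number key + 48-byte payload
HEADER = 64                    # assumed fixed-size file header

f = open('parts.idx', 'rb')
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
n = (len(mm) - HEADER) // REC.size

def key(i):
    # Key of the i-th record; only the touched page is read from disk.
    off = HEADER + i * REC.size
    return mm[off:off + 16]

def matches(prefix):
    # Binary search to the first record >= prefix, then scan forward.
    i = bisect_left(range(n), prefix, key=key)
    while i < n and key(i).startswith(prefix):
        yield REC.unpack_from(mm, HEADER + i * REC.size)
        i += 1

for record in matches(b'PART-00'):
    print(record)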
Memory mapped files can be used to either replace read/write access, or to support concurrent sharing. When you use them for one mechanism, you get the other as well.
Rather than lseeking and writing and reading around in a file, you map it into memory and simply access the bits where you expect them to be.
This can be very handy, and depending on the virtual memory interface can improve performance. The performance improvement can occur because the operating system now gets to manage this former "file I/O" along with all your other programmatic memory access, and can (in theory) leverage the paging algorithms and so forth that it is already using to support virtual memory for the rest of your program. It does, however, depend on the quality of your underlying virtual memory system. Anecdotes I have heard say that the Solaris and *BSD virtual memory systems may show better performance improvements than the VM system of Linux--but I have no empirical data to back this up. YMMV.
Concurrency comes into the picture when you consider the possibility of multiple processes using the same "file" through mapped memory. In the read/write model, if two processes wrote to the same area of the file, you could be pretty much assured that one process's data would arrive in the file, overwriting the other's. You'd get one, or the other--but not some weird intermingling. I have to admit I am not sure whether this is behavior mandated by any standard, but it is something you could pretty much rely on. (It's actually a good follow-up question!)
In the mapped world, in contrast, imagine two processes both "writing". They do so by doing "memory stores", which result in the O/S paging the data out to disk--eventually. But in the meantime, overlapping writes can be expected to occur.
Here's an example. Say I have two processes both writing 8 bytes at offset 1024. Process 1 is writing '11111111' and process 2 is writing '22222222'. If they use file I/O, then you can imagine, deep down in the O/S, there is a buffer full of 1s, and a buffer full of 2s, both headed to the same place on disk. One of them is going to get there first, and the other one second. In this case, the second one wins. However, if I am using the memory-mapped file approach, process 1 is going to do a memory store of 4 bytes, followed by another memory store of 4 bytes (let's assume that's the maximum memory store size). Process 2 will be doing the same thing. Based on when the processes run, you can expect to see any of the following:
11111111
22222222
11112222
22221111
The solution to this is to use explicit mutual exclusion--which is probably a good idea in any event. You were sort of relying on the O/S to do "the right thing" in the read/write file I/O case, anyway.
The classic mutual exclusion primitive is the mutex. For memory-mapped files, I'd suggest you look at a mutex placed in the mapped region, initialized with (e.g.) pthread_mutex_init() and the PTHREAD_PROCESS_SHARED attribute.
Edit with one gotcha: When you are using mapped files, there is a temptation to embed pointers to the data in the file, in the file itself (think linked list stored in the mapped file). You don't want to do that, as the file may be mapped at different absolute addresses at different times, or in different processes. Instead, use offsets within the mapped file.
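A tiny sketch of the "offsets, not pointers" advice (the node layout is invented; each node stores the byte offset of the next node, so the list survives being mapped at a different address):

import mmap
import struct

NODE = struct.Struct('<q8s')  # next-node offset (int64) + 8-byte payload

with open('list.bin', 'w+b') as f:
    f.truncate(3 * NODE.size)  # reserve room for three nodes
    mm = mmap.mmap(f.fileno(), 0)
    # Build: node0 -> node1 -> node2 -> end (offset -1), using offsets only.
    NODE.pack_into(mm, 0 * NODE.size, 1 * NODE.size, b'first   ')
    NODE.pack_into(mm, 1 * NODE.size, 2 * NODE.size, b'second  ')
    NODE.pack_into(mm, 2 * NODE.size, -1, b'third   ')
    # Walk the list by following offsets, not absolute addresses.
    off = 0
    while off != -1:
        off, payload = NODE.unpack_from(mm, off)
        print(payload)
    mm.close()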
Concurrency would be an issue.
Random access is easier.
Performance is good to great.
Ease of use: not as good.
Portability: not so hot.
I've used them on a Sun system a long time ago, and those are my thoughts.