Strategies for writing expanding ordered files to disk - language-agnostic

I an a graduate student of nuclear physics currently working on a data analysis program. The data consists of billions of multidimensional points.
Anyways I am using space filling curves to map the multiple dimensions to a single dimension and I am using a B+ tree to index the pages of data. Each page will have some constant maximum number of points within it.
As I read the raw data (several hundred gigs) in from the original files and preprocess and index it I need to insert the individual points into pages. Obviously there will be far too many pages to simply store them in memory and then dump them to disk. So my question is this: What is a good strategy for writing the pages to the disk so that there is a minimum of reshuffling of data when a page hits it's maximum size and needs to be split.
Based on the comments let me reduce this a little.
I have a file that will contain ordered records. These records are being inserted into the file and there are too many of these records to simply do this in memory and then write to the file. What strategy should I use to minimize the amount of reshuffling needed when I insert a record.
If this is making any sense at all I would appreciate any solutions to this that you might have.
Edit:
The data are points in multidimensional spaces. Essentially lists of integers. Each of these integers is 2 bytes but each integer also has an additional 2 bytes of meta-data associated with it. So 4 bytes per coordinate and anywhere between 3 and 20 coordinate. So essentially the data consists of billions of chunks each chunk somewhere between 12 and 100 bytes. (obviously points with 4 dimensions will be located in a different file than points with 5 dimensions once they have been extracted).
I am using techniques similar to those discussed in this article:
http://www.ddj.com/184410998
Edit 2:
I kinda regret asking this question here so consider it officially rescinded; but here is my reason for not using off the shelf products. My data are points that range anywhere from 3 to 22 dimensions. If you think of each point as simply a list you can think of how I want to query the points as what are all the numbers that appeared in the same lists as these numbers. Below are some examples with low dimensionality (and many fewer data points than normal)
Example:
Data
237, 661, 511, 1021
1047, 661, 237
511, 237, 1021
511, 661, 1047, 1021
Queries:
511
1021
237, 661
1021, 1047
511, 237, 1047
Responses:
237, 661, 1021, 237, 1021, 661, 1047, 1021
237, 661, 511, 511, 237, 511, 661, 1047
511, 1021, 1047
511, 661
_
So that is a difficult little problem for most database programs, though I know of some that exist that can handle this well.
But the problem gets more complex. Not all the coordinates are the same. Many times we just run with gammasphere by itself and so each coordinate represents a gamma ray energy. But at other times we insert neutron detectors into gammasphere or a detector system called microball, or sometimes the nuclides produced in gammasphere are channeled into the fragment mass analyzer, all those and more detector systems can beused singly or in any combination with gammasphere. Unfortunately we almost always want to be able to select on this additional data in a manner similar to that described above. So now coordinates can have different meanings, if one just has microball in addition to gammasphere you make make up an n dimensional event in as many ways as there are positive solutions to the equation x + y = n. Additionally each coordinate has metadata associated with it. so each of the numbers I showed would have at least 2 additional numbers associated with them, the first, a detector number, for the detector that picked up the event, the second, an effeciency value, to describe how many times that particular gamma ray counts for (since the percentage of gamma rays entering the detector that are actually detected, varies with teh detector and with the energy).
I sincerely doubt that any off the shelf database solution can do all these things and perform well at the same time without an enourmous amount of customization. I believe that the time spent on that is better spent on writing my own, much less general, solution. Because of the loss of generality I do not need to implement a delete function for any of the databasing code, I do not need to build secondary indices to gate on different types of coordinates (just one set, effectively counting each point only once), etc.

I believe you should first look at what commercial and free databases have to offer. They are designed to perform fast range searches (given the right indexes) and efficiently manage memory and reading/writing pages to disk.
Failing that, have a look at one of the variants of Binary Space Partition (BSP) trees.

So the first aspect is to do this in a threaded application to get through it quicker. Break your chunks of data into workable sections. Which leads me to think...
I was going to initially suggest that you use Lucene...but in thinking of that this really sounds like something you should process with Hadoop. It was made for this sort of work (assuming you have the infrastructure for it).
I most certainly wouldn't do this in a database.
When you are speaking of indexing data and filling documents with data points...and you don't have the infrstructure, know how, or time to implement hadoop, you should then revert back to my original thought and use Lucene. You can actually index your data that way and store your datapoints directly into an index (by numeric range I would think) with a "document" (object) structure as you think is best.

I have come up with an answer myself. As events are inserted into pages when a page needs to split a new page is made at the end of the file. Half of the events of the original page are moved to that page. This leaves the pages unsorted which somewhat defeats the fast retrieval mechanisms.
However, since I only write to the db in one big initial rush (lasting several days probably) I can justify spending a little extra time after the writing to go through the pages and sort them after they have all been built. This part is quite easy in fact because of the nature of the B+ trees used to index the pages. I simply start in the leftmost leaf node of the B+tree and read the first page and put it first in a final file, then I read the second page and put it second, so on and so forth.
In this manner at the end of the insert all the pages will be sorted within their files, allowing the methods I am using to map multidimensional requests to single dimensional indexes to work efficiently and quickly when reading the data from disk.

Related

Saving Random Forest Classifiers (sklearn) with picke/joblib creates huge files

I am trying to save a bunch of trained random forest classifiers in order to reuse them later. For this, I am trying to use pickle or joblib. The problem I encounter is, that the saved files get huge. This seems to be correlated to the amount of data that I use for training (which is several 10-millions of samples per forest, leading to dumped files in the order of up to 20GB!).
Is the RF classifier itself saving the training data in its structure? If so, how could I take the structure apart and only save the necessary parameters for later predictions? Sadly, I could not find anything on the subject of size yet.
Thanks for your help!
Baradrist
Here's what I did in a nutshell:
I trained the (fairly standard) RF on a large dataset and saved the trained forest afterwards, trying both pickle and joblib (also with the compress-option set to 3).
X_train, y_train = ... some data
classifier = RandomForestClassifier(n_estimators=24, max_depth=10)
classifier.fit(X_train, y_train)
pickle.dump(classifier, open(path+'classifier.pickle', 'wb'))
or
joblib.dump(classifier, path+'classifier.joblib', compress=True)
Since the saved files got quite big (5GB to nearly 20GB, compressed aprox. 1/3 of this - and I will need >50 such forests!) and the training takes a while, I experimented with different subsets of the training data. Depending on the size of the train set, I found different sizes for the saved classifier, making me believe that information about the training is pickled/joblibed as well. This seems unintuitive to me, as for predictions, I only need the information of all the trained weak predictors (decision trees) which should be steady and since the number of trees and the max depth is not too high, they should also not take up that much space. And certainly not more due to a larger training set.
All in all, I suspect that the structure is containing more than I need. Yet, I couldn't find a good answer on how to exclude these parts from it and save only the necessary information for my future predictions.
I ran into a similar issue and I also thought in the beginning that the model was saving unnecessary information or that the serialization was introducing some redundancy. It turns out in fact that decision trees are indeed memory hungry structures that consists of multiple arrays of length given by the total number of nodes. Nodes in general grow with the size of data (and parameters like max_depth cannot effectively used to limit growth since the reasonable values still have room to generate huge number of nodes). See details in this answer but the gist is:
a single decision tree can easy grow to a few MBs (example above has a 5MB decision tree for 100K data and a 50MB decision tree for 1M data)
a random forest commonly contains at least 100 such decision tree and for the example above you would have models in the range of 0.5/5GB
compression is usually not enough to reduce to reasonable sizes (1/2, 1/3 are usual ranges)
Other notes:
using a different algorithm models might remain of a more manageable size (e.g. with xgboost I saw much smaller serialized models)
it is probably possible to "prune" some of the data used by decision trees if you only plan it to reuse it for prediction. In particular I imagine the array of impurity and possible those on n_samples might not be needed but I have not checked.
with respect to you hypothesis that the random forest is saving the data on which it is trained: not it is not and the data itself would likely be one or more order of magnitude smaller than the final model
so in principle another strategy if you have a reproducible training pipeline could be to save the data instead of the model and retrain on purpose, but this is only possible if you can spare the time to retrain (for example if in a use case where you have a long running service which has the model in memory and you serialize the model in order to have a backup for when the model goes down)
there are probably also other options to limit growth of random forest, the best one I have found until now is in this answer, where the suggestion is to work with min_samples_leaf to set it as a percentage of data

Cracking a binary file format if I have the contents of one of these files

I have about 300 measurements (each stored in a dat file) that I would like to read using MATLAB or Python. The files can be exported to text or csv using a proprietary program, but this has to be done one by one.
The question is: what would be the best approach to crack the format of the binary file using the known content from the exported file?
Not sure if this makes any difference to make the cracking easier, but the files are just two columns of (900k) numbers, and from the dat files' size (1,800,668 bytes), it appears as if each number is 16 bits (float) and there is some other information (possible the header).
I tried using HEX-Editor, but wasn't able to pick up any trends from there.
Lastly, I want to make sure to specify that these are measurements I made and the data in them belongs to me. I am not trying to obtain data that I am not supposed to.
Thanks for any help.
EDIT: Reading up a little more I realized that there my be some kind of compression going on. When you look at the data in StreamWare, it gives 7 decimal places, leading me to believe that it is a single precision value (4 bytes). However, the size of the files suggests that each value only takes 2 bytes.
After thinking about it a little more, I finally figured it out. This is very specific, but just in case another Dantec StreamWare user runs into the same problem, it could save him/her a little time.
First, the data is actually only a single vector. The time column is calculated from the length of the recorded signal and the sampling frequency. That information is probably in the header (but I wasn't able to crack that portion).
To obtain the values in MATLAB, I skipped the header bytes using fseek(fid, 668, 'bof'), then I read the data as uint16 using fread(fid, 900000, 'uint16'). This gives you integers.
To get the float value, all you have to do is divide by 2^16 (it's a 16 bit resolution system) and multiply by ten. I assume the factor of ten depends on the range of your data acquisition system.
I hope this helps.

Sorting a AS3 ByteArray

I am designing an Air application that needs to store thousands of records in memory, and needs to sort them efficiently, by various keys.
I thought of using a ByteArray, since that would avoid all the overhead of normal AS3 objects, and would allow me to use memory more efficiently.
However, the challenge is how to sort the records inside the ByteArray. I have thought of two possibilities:
1- Implement quick-sort or heap-sort in AS3, and sort the array this way. However, I am not sure this will be performant enough. For example, ByteArrays do not have methods to copy chunks of memory around; it has to be done byte-by-byte.
2- Create an Air Native Extension (ANE) that takes the ByteArray and sorts it, using C. the drawback of this is that it will be harder to implement for all the platforms it needs to run on.
What would you recommend? Do you have any previous experience doing something similar?
I'd say use Array or Vector objects, there's a possibility to sort Arrays on whatever key you want via sortOn(), and Vectors via sort(), so you can achieve whatever behavior you need, as the latter accepts a function as its parameter, check here. And I believe you won't get anywhere with ByteArrays, since what is actually done in sorting objects is sorting links in there, while a ByteArray will contain actual data.
You should never design anything that HAS to have hundreds of thousands of anything in the memory at once. Offload stuff while you don't need it. Do you know how much 100,000 is? Taking a single byte and multiplying by 100,000 gives you a MB. For every 1 byte of data in a record, you will generate 1MB of memory. Recording 100,000 ints takes 4MB.
If your records have 2 20 character strings (a first and last name), a String character is represented with 8 bytes, so you have just filled the memory with 640 MB of nothing more than first and last names. Most 'bad' computers have like, what... 2GB of memory? Good Job taking up 1/4 of that. Even if you managed to truncate this down to ByteArray level with superhuman uber bitshifing, you're still talking about reducing data by a factor of 8. So now you have 80MB for just first and last names and no other data. You could survive on that- except I suspect your records have more data then 2 strings. 20 strings? You're eating 800MB of data. Offload everything but 100 records at a time, and you're down to 640KB of memory for those names. And yes, you can load and offload while sorting.
Chunks of memory don't copy faster than bytes. It's all the same. The reasons Vectors of Objects are performant when switching is because they switch references/pointers/one single 32 bit/64 bit number instead of copying chunks of memory.
It's not clear what you're sorting. Bytes only go up to values of 256, so clearly you're using more bytes than 1 for each record. You want to evaluate each set of... like 2000+ bytes against each other set of 2000+ bytes? Like "Ah, last name is bytes 32-83, so extract those bytes, for every group of 4 bytes, bit shift them 0, 8, 16, 32 bits respectively, add them together, concatinate their integer values into a a String, do a comparison, now compare bytes 84-124 against the bytes in the next option, now transfer bytes 0-2000 to location 443530-441530 and....... Do these records have variable length strings or arrays in them? Oh lord.
Flash is not the place to write in assembly!
Use objects and test the speed & memory consumption. If either makes you cry, use more conventional methods of reducing load; like offloading materials temporarily into text files. The ugliest you should be getting is avoiding objects by storing each individual property in a different Vector. Vector., etc and having the same index refer to one record across the board.

Best practice for storing GPS data of a tracking app in mysql database

I have a datamodel question for a GPS tracking app. When someone uses our app it will save latitude, longitude, current speed, timestamp and burned_calories every 5 seconds. When a workout is completed, the average speed, total time/distance and burned calories of the workout will be stored in a database. So far so good..
What we want is to also store the data that is saved those every 5 seconds, so we can utilize this later on to plot graphs/charts of a workout for example.
How should we store this amount of data in a database? A single workout can contain 720 rows if someone runs for an hour. Perhaps a serialised/gzcompressed data array in a single row. I'm aware though that this is bad practice..
A relational one/many to many model would be undone? I know MySQL can easily handle large amounts of data, but we are talking about 720 * workouts
twice a week * 7000 users = over 10 million rows a week.
(Ofcourse we could only store the data of every 10 seconds to halve the no. of rows, or every 20 seconds, etc... but it would still be a large amount of data over time + the accuracy of the graphs would decrease)
How would you do this?
Thanks in advance for your input!
Just some ideas:
Quantize your lat/lon data. I believe that for technical reasons, the data most likely will be quantized already, so if you can detect that quantization, you might use it. The idea here is to turn double numbers into reasonable integers. In the worst case, you may quantize to the precision double numbers provide, which means using 64 bit integers, but I very much doubt your data is even close to that resolution. Perhaps a simple grid with about one meter edge length is enough for you?
Compute differences. Most numbers will be fairly large in terms of absolute values, but also very close together (unless your members run around half the world…). So this will result in rather small numbers. Furthermore, as long as people run with constant speed into a constant direction, you will quite often see the same differences. The coarser your spatial grid in step 1, the more likely you get exactly the same differences here.
Compute a Huffman code for these differences. You might try encoding lat and long movement separately, or computing a single code with 2d displacement vectors at its leaves. Try both and compare the results.
Store the result in a BLOB, together with the dictionary to decode your Huffman code, and the initial position so you can return data to absolute coordinates.
The result should be a fairly small set of data for each data set, which you can retrieve and decompress as a whole. Retrieving individual parts from the database is not possible, but it sounds like you wouldn't be needing that.
The benefit of Huffman coding over gzip is that you won't have to artificially introduce an intermediate byte stream. Directly encoding the actual differences you encounter, with their individual properties, should work much better.

How do I assess the hash collision probability?

I'm developing a back-end application for a search system. The search system copies files to a temporary directory and gives them random names. Then it passes the temporary files' names to my application. My application must process each file within a limited period of time, otherwise it is shut down - that's a watchdog-like security measure. Processing files is likely to take long so I need to design the application capable of handling this scenario. If my application gets shut down next time the search system wants to index the same file it will likely give it a different temporary name.
The obvious solution is to provide an intermediate layer between the search system and the backend. It will queue the request to the backend and wait for the result to arrive. If the request times out in the intermediate layer - no problem, the backend will continue working, only the intermediate layer is restarted and it can retrieve the result from the backend when the request is later repeated by the search system.
The problem is how to identify the files. Their names change randomly. I intend to use a hash function like MD5 to hash the file contents. I'm well aware of the birthday paradox and used an estimation from the linked article to compute the probability. If I assume I have no more than 100 000 files the probability of two files having the same MD5 (128 bit) is about 1,47x10-29.
Should I care of such collision probability or just assume that equal hash values mean equal file contents?
Equal hash means equal file, unless someone malicious is messing around with your files and injecting collisions. (this could be the case if they are downloading stuff from the internet) If that is the case go for a SHA2 based function.
There are no accidental MD5 collisions, 1,47x10-29 is a really really really small number.
To overcome the issue of rehashing big files I would have a 3 phased identity scheme.
Filesize alone
Filesize + a hash of 64K * 4 in different positions in the file
A full hash
So if you see a file with a new size you know for certain you do not have a duplicate. And so on.
Just because the probability is 1/X it does not mean that it won't happen to you until you have X records. It's like the lottery, you're not likely to win, but somebody out there will win.
With the speed and capacity of computers these days (not even talking about security, just reliability) there is really no reason not to just use a bigger/better hash function than MD5 for anything critical. Stepping up to SHA-1 should help you sleep better at night, but if you want to be extra cautious then go to SHA-265 and never think about it again.
If performance is truly an issue then use BLAKE2 which is actually faster than MD5 but supports 256+ bits to make collisions less likely while having same or better performance. However, while BLAKE2 has been well-adopted, it probably would require adding a new dependency to your project.
I think you shouldn't.
However, you should if you have the notion of two equal files having different (real names, not md5-based). Like, in search system two document might have exactly same content, but being distinct because they're located in different places.
I came up with a Monte Carlo approach to be able to sleep safely while using UUID for distributed systems that have to serialize without collisions.
from random import randint
from math import log
from collections import Counter
def colltest(exp):
uniques = []
while True:
r = randint(0,2**exp)
if r in uniques:
return log(len(uniques) + 1, 2)
uniques.append(r)
for k,v in Counter([colltest(20) for i in xrange(1000)]):
print k, "hash orders of magnitude events before collission:",v
would print something like:
5 hash orders of magnitude events before collission: 1
6 hash orders of magnitude events before collission: 5
7 hash orders of magnitude events before collission: 21
8 hash orders of magnitude events before collission: 91
9 hash orders of magnitude events before collission: 274
10 hash orders of magnitude events before collission: 469
11 hash orders of magnitude events before collission: 138
12 hash orders of magnitude events before collission: 1
I had heard the formula before: If you need to store log(x/2) keys, use a hashing function that has at least keyspace e**(x).
Repeated experiments show that for a population of 1000 log-20 spaces, you sometimes get a collision as early as log(x/4).
For uuid4 which is 122 bits that means I sleep safely while several computers pick random uuid's till I have about 2**31 items. Peak transactions in the system I am thinking about is roughly 10-20 events per second, I'm assuming an average of 7. That gives me an operating window of roughly 10 years, given that extreme paranoia.
Here's an interactive calculator that lets you estimate probability of collision for any hash size and number of objects - http://everydayinternetstuff.com/2015/04/hash-collision-probability-calculator/