I am designing an AIR application that needs to store thousands of records in memory and sort them efficiently by various keys.
I thought of using a ByteArray, since that would avoid all the overhead of normal AS3 objects, and would allow me to use memory more efficiently.
However, the challenge is how to sort the records inside the ByteArray. I have thought of two possibilities:
1- Implement quick-sort or heap-sort in AS3 and sort the array that way. However, I am not sure this will be performant enough. For example, ByteArray does not have methods to copy chunks of memory around; copying has to be done byte by byte.
2- Create an AIR Native Extension (ANE) that takes the ByteArray and sorts it in C. The drawback is that it will be harder to implement for all the platforms it needs to run on.
What would you recommend? Do you have any previous experience doing something similar?
I'd say use Array or Vector objects. Arrays can be sorted on whatever key you want via sortOn(), and Vectors via sort(), which accepts a compare function as its parameter, so you can get whatever ordering you need. And I don't believe you'll get anywhere with a ByteArray: sorting an Array of objects only reshuffles references, while a ByteArray contains the actual data, which would have to be moved around.
You should never design anything that HAS to have hundreds of thousands of anything in memory at once. Offload stuff while you don't need it. Do the arithmetic on what 100,000 records actually costs: every single byte per record, multiplied by 100,000 records, is roughly 100 KB, so 100,000 ints take about 400 KB.
If your records hold two 20-character strings (a first and last name), and you budget on the order of 8 bytes per String character once object overhead is counted, you have just filled memory with roughly 32 MB of nothing but first and last names. Even if you managed to squeeze that down to ByteArray level with superhuman uber bit-shifting, you are only reducing the data by a factor of 8 or so, to about 4 MB, still for nothing but names. You could survive on that, except I suspect your records hold more than 2 strings. 20 strings per record? Now you are eating hundreds of megabytes on a machine that may only have 2 GB. Offload everything but 100 records at a time and the resident names drop to a few tens of kilobytes. And yes, you can load and offload while sorting.
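A rough back-of-the-envelope check of those figures (a minimal sketch in Python; the 8-bytes-per-character figure is just the assumption used above, and the real per-object overhead in the AVM2 runtime will differ):

```python
# Order-of-magnitude memory estimates for keeping 100,000 records resident.
RECORDS = 100_000
BYTES_PER_CHAR = 8          # assumption from the text above, not a measured value

def mb(n_bytes):
    return n_bytes / (1024 * 1024)

one_int_field   = RECORDS * 4
two_names       = RECORDS * 2 * 20 * BYTES_PER_CHAR    # two 20-char strings
twenty_strings  = RECORDS * 20 * 20 * BYTES_PER_CHAR   # twenty 20-char strings
window_of_100   = 100 * 2 * 20 * BYTES_PER_CHAR        # only 100 records resident

print(f"one int field:          {mb(one_int_field):7.2f} MB")
print(f"two 20-char strings:    {mb(two_names):7.2f} MB")
print(f"twenty 20-char strings: {mb(twenty_strings):7.2f} MB")
print(f"100-record window:      {window_of_100 / 1024:7.2f} KB")
```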
Copying a chunk of memory still costs you for every byte in the chunk; it's all the same data being moved. The reason Vectors of objects are performant when reordering is that a swap moves a single reference/pointer (one 32- or 64-bit number) instead of copying a chunk of memory.
It's also not clear what you're sorting on. A single byte only covers 256 values, so clearly each record spans many bytes. Do you want to evaluate every group of 2000+ bytes against every other group of 2000+ bytes? Something like: "Ah, the last name is bytes 32-83, so extract those bytes, bit-shift each group of 4 bytes by 0, 8, 16 and 24 bits respectively, add them together, turn the integer values back into a String, do a comparison, now compare bytes 84-124 against the bytes of the next record, now move bytes 0-2000 to locations 441530-443530, and..." Do these records have variable-length strings or arrays in them? Oh lord.
Flash is not the place to write in assembly!
Use objects and test the speed and memory consumption. If either makes you cry, use more conventional methods of reducing the load, like temporarily offloading data to text files. The ugliest you should get is avoiding objects by storing each individual property in a different typed Vector (Vector.<Number>, Vector.<String>, etc.) and having the same index refer to one record across the board.
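That parallel-Vector idea is a structure-of-arrays layout, and sorting it comes down to sorting an index permutation. A minimal sketch of the pattern in Python (names and data made up; in AS3 you would use typed Vectors and sort an index Vector with a compare function):

```python
# One list per field; the same index i describes one record in every list.
first_names = ["Ada", "Grace", "Alan"]
last_names  = ["Lovelace", "Hopper", "Turing"]
ages        = [36, 85, 41]

# Sort the records by last name without moving any field data:
# only the list of indices is rearranged.
order = sorted(range(len(last_names)), key=lambda i: last_names[i])

for i in order:
    print(first_names[i], last_names[i], ages[i])
```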
I am creating an application that stores ECG data.
I want to eventually graph this data in React, but for now I need help storing it.
The biggest problem is storing the data points that will go along the x and y axes of the graph. The x axis is time and the y axis is the measured value. There are no hard limits, but as it's basically a heart rhythm most points will lie close to 0.
What is the best way to store the x and y data?
An example of the y data: [204.77, 216.86 … 3372.872]
The files that I will be getting this data from can contain millions of data points, depending on the sampling rate and the time the experiment took.
What is the best way to store this type of data in MySQL? I cannot use any other DB, as none are installed on the server this will be hosted on.
Thanks
Well, as you said, there are millions of points, so JSON is the best way to store these points. The space required to store a JSON document is roughly the same as for LONGBLOB or LONGTEXT.
Please have a look at this: https://dev.mysql.com/doc/refman/8.0/en/json.html
The JSON encoding of your sample data would take 7-8 bytes per reading. Multiply that by the number of readings you will get at a single time. There is a practical limit of 16MB for a string being fed to MySQL. That seems "too big".
A workaround is to break the list into, say, 1K points at a time. Then there would be several rows, each row being manageable. There would be virtually no limit on the number of points you could store.
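A sketch of that chunking in Python, assuming a hypothetical table with a JSON column (the table and column names are made up, and the actual execute call depends on which MySQL driver you use):

```python
import json

# Pretend these came from one ECG file; in practice there are millions of them.
readings = [204.77, 216.86, 230.12, 198.4, 3372.872]

CHUNK = 1000  # points per database row, as suggested above

# Hypothetical schema:
#   CREATE TABLE ecg_chunk (recording_id INT, start_sample INT, samples JSON,
#                           PRIMARY KEY (recording_id, start_sample));
sql = "INSERT INTO ecg_chunk (recording_id, start_sample, samples) VALUES (%s, %s, %s)"

params = []
for start in range(0, len(readings), CHUNK):
    doc = json.dumps(readings[start:start + CHUNK])   # one manageable JSON array
    params.append((1, start, doc))

# With a real connection (mysql-connector-python, PyMySQL, ...):
# cursor.executemany(sql, params)
print(sql)
print(params[0])
```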
FLOAT is 4 bytes, but you would need a separate row for each reading, so figure about 25 bytes per row including overhead. Size is not the problem there; two other problems are: 7 digits is about the limit of precision for FLOAT, and fetching a million rows will not be very fast.
DOUBLE is 8 bytes, 16 digits of precision.
DECIMAL(6,2) is 3 bytes and overflows above 9999.99.
Considering that a computer monitor has less than 4 digits of precision (4K pixels < 10^4), I would argue for FLOAT as "good enough".
Another option is to take the JSON string, compress it, then store that in a LONGBLOB. The compression will give you an average of about 2.5 bytes per reading, and the row for one complete recording will be a few megabytes.
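To get a feel for those numbers, here is a quick sketch with zlib (synthetic data; the ~2.5 bytes per reading quoted above will vary with how repetitive your real signal is):

```python
import json
import random
import zlib

random.seed(0)
# Synthetic "ECG-like" readings clustered near 0, two decimal places.
readings = [round(random.gauss(0, 50), 2) for _ in range(100_000)]

doc = json.dumps(readings).encode("utf-8")
blob = zlib.compress(doc, 6)        # this is what you would store in a LONGBLOB

print("JSON bytes per reading:      ", round(len(doc) / len(readings), 2))
print("compressed bytes per reading:", round(len(blob) / len(readings), 2))
# After fetching the LONGBLOB, zlib.decompress() and json.loads() get the list back.
```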
I have experienced difficulty INSERTing a row bigger than 1MB. Changing a setting let me go to 16MB; I have not tried any row bigger than that. If you run into trouble there, start a new question on just that topic. I will probably come along and explain how to chunk up the data, thereby even allowing a "row" spread over multiple database rows that could effectively be bigger than 4GB, which is the hard limit for JSON, LONGTEXT and LONGBLOB.
You did not mention the X values. Are you assuming that they are evenly spaced? If you need to provide X,Y pairs, the computations above get a bit messier, but they give you the numbers you need for the analysis.
I have about 300 measurements (each stored in a dat file) that I would like to read using MATLAB or Python. The files can be exported to text or csv using a proprietary program, but this has to be done one by one.
The question is: what would be the best approach to crack the format of the binary file using the known content from the exported file?
Not sure if this makes any difference to the cracking, but the files are just two columns of (900k) numbers, and from the dat files' size (1,800,668 bytes) it appears as if each number is 16 bits (float) and there is some other information (possibly a header).
I tried using HEX-Editor, but wasn't able to pick up any trends from there.
Lastly, I want to make sure to specify that these are measurements I made and the data in them belongs to me. I am not trying to obtain data that I am not supposed to.
Thanks for any help.
EDIT: Reading up a little more, I realized that there may be some kind of compression going on. When you look at the data in StreamWare, it gives 7 decimal places, leading me to believe that it is a single-precision value (4 bytes). However, the size of the files suggests that each value only takes 2 bytes.
After thinking about it a little more, I finally figured it out. This is very specific, but just in case another Dantec StreamWare user runs into the same problem, it could save him/her a little time.
First, the data is actually only a single vector. The time column is calculated from the length of the recorded signal and the sampling frequency. That information is probably in the header (but I wasn't able to crack that portion).
To obtain the values in MATLAB, I skipped the header bytes using fseek(fid, 668, 'bof'), then I read the data as uint16 using fread(fid, 900000, 'uint16'). This gives you integers.
To get the float value, all you have to do is divide by 2^16 (it's a 16 bit resolution system) and multiply by ten. I assume the factor of ten depends on the range of your data acquisition system.
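For anyone taking the Python route mentioned in the question, a minimal equivalent sketch (the 668-byte header offset, the uint16 type and the factor of ten are the values worked out above; the file name and sampling frequency are placeholders):

```python
import numpy as np

HEADER_BYTES = 668        # header size worked out above
N_SAMPLES = 900_000       # samples per file
FULL_SCALE = 10.0         # range factor of the acquisition system (see above)

with open("measurement.dat", "rb") as f:      # placeholder file name
    f.seek(HEADER_BYTES)
    raw = np.fromfile(f, dtype=np.uint16, count=N_SAMPLES)  # native byte order,
                                                            # like MATLAB's fread

# 16-bit resolution: map 0..65535 onto 0..FULL_SCALE, as described above.
signal = raw.astype(np.float64) / 2**16 * FULL_SCALE

# The time axis comes from the sampling frequency stored in the header;
# with a known fs you would just use: t = np.arange(N_SAMPLES) / fs
print(signal[:5])
```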
I hope this helps.
I hope this isn't too opinionated for SO; it may not have a good answer.
In a portion of a library I'm writing, I have a byte array that gets populated with values supplied by the user. These values might be of type Float, Double, Int (of different sizes), etc. with binary representations you might expect from C, say. This is all we can say about the values.
I have an opportunity for an optimization: I can initialize my byte array with the byte MAGIC, and then whenever no byte of the user-supplied value is equal to MAGIC I can take a fast path, otherwise I need to take the slow path.
So my question is: what is a principled way to go about choosing my magic byte, such that it will be reasonably likely not to appear in the (variously-encoded and distributed) data I receive?
Part of my question, I suppose, is whether there's something like a Benford's law that can tell me something about the distribution of bytes in many sorts of data.
Capture real-world data from a diverse set of inputs that would be used by applications of your library.
Write a quick and dirty program to analyze the dataset. It sounds like what you want to know is which bytes are most frequently totally excluded. So the output of the program would say, for each byte value, how many inputs do not contain it.
This is not the same as least frequent byte. In data analysis you need to be careful to mind exactly what you're measuring!
Use the analysis to define your architecture. If it turns out there is no byte value that is reliably absent from your inputs, you can abandon the optimization entirely.
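A quick-and-dirty version of that analysis in Python (point it at whatever directory of captured sample inputs you collected):

```python
import sys
from pathlib import Path

# For each byte value, count how many input files do NOT contain it at all.
# Note this is the "totally excluded" measure described above, not raw frequency.
files = [p for p in Path(sys.argv[1]).rglob("*") if p.is_file()]
excluded_in = [0] * 256

for path in files:
    present = set(path.read_bytes())
    for b in range(256):
        if b not in present:
            excluded_in[b] += 1

# The best magic-byte candidates are the values excluded from the most inputs.
for b in sorted(range(256), key=lambda v: -excluded_in[v])[:10]:
    print(f"0x{b:02X} absent from {excluded_in[b]} of {len(files)} files")
```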
I was inclined to use byte 255, but I discovered that it is also prevalent in MS Word files. So I now use byte 254 as the EOF code to terminate a file.
There are some scenarios where programmers need or want to find grossly large numbers. These are often so large that they defy the programmer's comprehension. I'm talking about things like the largest known prime number (with 12978189 digits) and the recently calculated 10 trillion digits of pi.
How can you create a program that handles these? This far exceeds an integer, a long, a double, a BigInteger, a BigDecimal, or anything of the sort. How do these kinds of programs for discovering these numbers get created? How can you even store them in memory when no appropriate datatypes exist, and they would likely consume gigabytes of data each?
To address your specific examples:
A 12 million digit integer isn't terribly large for a typical "large integer" class to handle. This should be able to be stored in memory.
To store 10 trillion digits of π, you could use a disk file and memory-map it. You'll need a 64 bit OS and application, but you can simply create a 10 terabyte file on disk (you'll probably need a few disks and a filesystem like ZFS that can store it across disks), and map it into CPU address space. The algorithms that calculate π (such as BBP) conveniently calculate one hex digit at a time which fits well into half a byte of memory.
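As a quick check on the first example, Python's built-in arbitrary-precision int (filling the same role GMP fills for C) holds that 12,978,189-digit prime, the Mersenne prime 2^43112609 − 1, comfortably in memory:

```python
import math
import sys

# The 12,978,189-digit prime referred to in the question is 2**43112609 - 1.
p = 2**43112609 - 1          # an ordinary Python int; this is essentially a bit shift

# Digit count without converting the whole number to a decimal string:
digits = math.floor(43112609 * math.log10(2)) + 1
print("decimal digits:", digits)             # 12978189

# Rough in-memory footprint of the value itself:
print("bytes in memory:", sys.getsizeof(p))  # on the order of 5-6 MB
```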
The (abstract) answer is to write algorithms using the machine's native types that produce the results you want. For instance, when you do addition by hand on paper of two very large integers, the biggest actual calculation you need is only 9+9+1 (nine plus nine plus one for the carry). Of course you need paper large enough to write the two numbers down in the first place, and the answer as well. So as long as the two numbers and the answer can be stored on a computer's hard disk (the paper), an algorithm can be written that does it with variables that only ever need to hold values up to 19; even a char variable is more than capable of that, let alone an int.
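A toy illustration of that pencil-and-paper idea (base-10 digits stored one per element for readability; real bignum libraries use whole machine words as their "digits", but the carry logic is the same):

```python
def add_decimal(a: str, b: str) -> str:
    """Add two non-negative integers given as decimal digit strings."""
    a, b = a[::-1], b[::-1]              # work from the least significant digit
    out, carry = [], 0
    for i in range(max(len(a), len(b))):
        da = int(a[i]) if i < len(a) else 0
        db = int(b[i]) if i < len(b) else 0
        s = da + db + carry              # never exceeds 9 + 9 + 1 = 19
        out.append(str(s % 10))
        carry = s // 10
    if carry:
        out.append(str(carry))
    return "".join(reversed(out))

print(add_decimal("98765432109876543210", "12345678901234567890"))
```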
The (concrete) answer is that really good programmers have already done this, and there are even FOSS libraries for it. One good one is the GNU Project's GMP library, which has loads of functions to handle arbitrary-size integer arithmetic and arbitrary-precision floating-point arithmetic. So as long as your computer can store the information needed during the calculation, it can be done. You'll need to invest the time to read the documentation, of course.
I need to keep track of around 10,000 elements of an array in my algorithm, so I need a boolean for each record. If I used a char array to keep track of the 10,000 elements (as 0/1), it would take up a lot of memory.
So can I create a bit array of 10,000 bits in CUDA, where each bit represents the corresponding array record?
As Roger said, the answer is yes, CUDA provides the same bitwise operations (i.e. >>, << and &) as normal C so you can implement your bit array essentially normally (almost, see thread synchronisation issues below).
However, for your situation it is almost certainly not a good idea.
There are problems with thread synchronisation. Imagine two of the threads on your GPU are inverting two bits of a single entry of your array. Each thread will read the same value out of memory and apply its operation to it, but the thread that writes its value back to memory last will overwrite the result of the other thread. (Note: if your bit array is not being modified by the GPU code then this isn't a problem.)
And, unless this is explicitly required, you shouldn't be optimising for memory use: an array with 10K elements does not take much memory at all. Even if you were storing each boolean in a 64-bit integer it's only 80 KB, and obviously you can store them in a smaller datatype. (You should only start worrying about compressing the array as much as possible when you are getting upwards of tens of millions, or even hundreds of millions, of elements.)
Also, the way GPUs work means that you might get best performance by using a reasonably large data type for each boolean (most likely a 32 bit one) so that, for example, memory coalescing works better. (I haven't tested this assertion, you would need to run some benchmarks to check it.)
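For reference, if you do go the packed route, the indexing arithmetic is the same everywhere; here it is as a host-side Python sketch (in a CUDA kernel you would use the same >>, & and | operators, and atomicOr()/atomicAnd() on the 32-bit words whenever several threads might touch the same word, to avoid the race described above):

```python
# Pack 10,000 flags into 32-bit words: word index = i >> 5, bit index = i & 31.
N = 10_000
words = [0] * ((N + 31) // 32)     # 313 words, i.e. roughly 1.25 KB of flags

def set_bit(i):
    # In a CUDA kernel: atomicOr(&words[i >> 5], 1u << (i & 31));
    words[i >> 5] |= 1 << (i & 31)

def get_bit(i):
    return (words[i >> 5] >> (i & 31)) & 1

set_bit(4242)
print(get_bit(4242), get_bit(4243))   # prints: 1 0
```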