Storing matrices in MySQL for fast full-matrix retrieval

In a project, there is a time-consuming computation whose result is represented as a numeric matrix and is commonly reused in follow-on tasks. So I want to store the calculation result in the database, so that new tasks can re-use the result.
The size of the matrices is not fixed, and there will be multiple matrices.
Which one is more suitable for my case?
Option 1: store a serialized matrix.
Option 2: create a table like the following:
Matrix ID
X-coord
Y-coord
Value
Or maybe there is a better way?

A short discussion of option 1:
If MySQL does not need to look at the cells of a matrix, then serialize it in any form and store it in a TEXT or BLOB column in the table.
JSON is a relatively simple serialization that is available in a lot of programming languages and is easily readable by humans. XML is, in my opinion, too clunky to consider.
Or you could do something ad hoc, such as numbers separated by commas: start the string with the length and width of the matrix, followed by the values in order (no need for x and y coordinates). When reading, your language will likely have a "split()" or "explode()" function to break the string apart on ",".
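A minimal sketch of that ad-hoc format in Python (the function names and the list-of-lists representation are just for illustration):

```python
def serialize_matrix(matrix):
    """Flatten a list-of-lists matrix into 'rows,cols,v1,v2,...'."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    values = [str(v) for row in matrix for v in row]
    return ",".join([str(rows), str(cols)] + values)

def deserialize_matrix(text):
    """Rebuild the matrix from the comma-separated string."""
    parts = text.split(",")
    rows, cols = int(parts[0]), int(parts[1])
    values = [float(v) for v in parts[2:]]
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]

# The resulting string can be stored in a TEXT/BLOB column and parsed on read.
blob = serialize_matrix([[1.5, 2.0], [3.0, 4.25]])   # "2,2,1.5,2.0,3.0,4.25"
assert deserialize_matrix(blob) == [[1.5, 2.0], [3.0, 4.25]]
```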

Related

Is it possible to append rows to an existing Arrow (PyArrow) Table?

I am aware that "Many Arrow objects are immutable: once constructed, their logical properties cannot change anymore" (docs). In this blog post by one of the Arrow creators, it is said:
Table columns in Arrow C++ can be chunked, so that appending to a table is a zero copy operation, requiring no non-trivial computation or memory allocation.
However, I am unable to find in the documentation how to append a row to a table. pyarrow.concat_tables(tables, promote=False) does something similar, but it is my understanding that it produces a new Table object, rather than, say, adding chunks to the existing one.
I am unsure whether this operation is at all possible/makes sense (in which case I'd like to know how) or whether it isn't (in which case pyarrow.concat_tables is exactly what I need).
Similar questions:
In PyArrow, how to append rows of a table to a memory mapped file? asks specifically about memory-mapped files. I am asking generally about any Table object. Could be coming from a read_csv operation or be manually constructed.
Using pyarrow how do you append to parquet file? talks about Parquet files. See above.
Pyarrow Write/Append Columns Arrow File talks about columns, but I'm talking about rows.
https://github.com/apache/arrow/issues/3622 asks this same question, but it doesn't have a satisfying answer (in my opinion).
Basically, a Table in PyArrow/Arrow C++ isn't really the data itself, but rather a container consisting of pointers to data. How it works is:
A Buffer represents an actual, singular allocation. In other words, Buffers are contiguous, full stop. They may be mutable or immutable.
An Array contains 0+ Buffers and imposes some sort of semantics on them. (For instance, an array of integers, or an array of strings.) Arrays are "contiguous" in the sense that each buffer is contiguous, and conceptually the "column" is not "split" across multiple buffers. (This gets really fuzzy with nested arrays: a struct array does split its data across multiple buffers, in some sense! I need to come up with a better wording of this, and will contribute this to upstream docs. But I hope what I mean here is reasonably clear.)
A ChunkedArray contains 0+ Arrays. A ChunkedArray is not logically contiguous. It's kinda like a linked list of chunks of data. Two ChunkedArrays can be concatenated "zero copy", i.e. the underlying buffers will not get copied.
A Table contains 0+ ChunkedArrays. A Table is a 2D data structure (both columns and rows).
A RecordBatch contains 0+ Arrays. A RecordBatch is also a 2D data structure.
Hence, you can concatenate two Tables "zero copy" with pyarrow.concat_tables, by just copying pointers. But you cannot concatenate two RecordBatches "zero copy", because you have to concatenate the Arrays, and then you have to copy data out of the buffers.
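A small illustration of this with pyarrow (the column name is made up): concat_tables returns a new Table object, but its columns are ChunkedArrays whose chunks reference the original data rather than copies.

```python
import pyarrow as pa

t1 = pa.table({"x": [1, 2, 3]})
t2 = pa.table({"x": [4, 5]})

# A new Table object is created, but the underlying buffers are shared:
combined = pa.concat_tables([t1, t2])
print(combined.column("x").num_chunks)   # 2 -> the column is a ChunkedArray made of the original chunks
print(combined.num_rows)                 # 5

# If a single contiguous chunk is needed later, that is where the copying happens:
flattened = combined.combine_chunks()
print(flattened.column("x").num_chunks)  # 1
```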

How to access the elements of a sparse matrix efficiently in Eigen library?

I have filled in a sparse matrix A using the Eigen library. Then I need to access the non-zero elements of the sparse matrix; if I do it as A(rowindex, colindex), it is very slow.
I also tried using std::unordered_map from the STL to solve this problem, but it is also very slow.
Is there any efficient way to solve this problem?
Compressed sparse matrices are stored in CSR or CSC format. Considering how a CSR matrix stores entries internally, there is an array storing the x nonzero values, a corresponding array of length x storing their respective column locations, and an array (usually much smaller than the other two) "pointing" to where rows change in those arrays.
There is no way to know where each nonzero element is, or whether a given row-column pair exists, without searching for it in the two ordered arrays of rows (outer index) and columns (inner index). This isn't very efficient for accessing elements in a random order.
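The answer above is about the storage layout rather than any particular API, so here is a conceptual sketch of the CSR arrays using scipy.sparse in Python (Eigen itself is C++, where the analogous pattern is to walk the stored entries with SparseMatrix::InnerIterator rather than looking coefficients up one by one):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Illustrative matrix with a few non-zeros
A = csr_matrix(np.array([[0, 3, 0],
                         [4, 0, 5],
                         [0, 0, 6]]))

print(A.data)     # [3 4 5 6]  -> the non-zero values
print(A.indices)  # [1 0 2 2]  -> column index of each value (inner index)
print(A.indptr)   # [0 1 3 4]  -> where each row starts/ends in the two arrays above

# Efficient traversal: walk the stored entries row by row instead of
# looking up coefficients by (row, col).
for row in range(A.shape[0]):
    for k in range(A.indptr[row], A.indptr[row + 1]):
        print(row, A.indices[k], A.data[k])
```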

Logic or lookup table: Best practices

Suppose you have a function/method that uses two metrics to return a value, essentially a 2D matrix of possible values. Is it better to use logic (nested if/switch statements) to choose the right value, or to just build that matrix (as an Array/Hash/Dictionary/whatever), so that returning the value becomes simply a matter of performing a lookup?
My gut feeling says that for an M⨉N matrix with relatively small values of both M and N (like ≤3) it would be OK to use logic, but for larger values it would be more efficient to just build the matrix.
What are general best practices for this? What about for an N-dimensional matrix?
The decision depends on multiple factors, including:
Which option makes the code more readable and hence easier to maintain
Which option performs faster, especially if the lookup happens squillions of times
How often do the values in the matrix change? If the answer is "often", then it is probably better to externalise the values out of the code and put them in a matrix stored in a way that can be edited simply.
Not only how big is the matrix but how sparse is it?
What I'd say is that about nine conditions is the limit for an if .. else ladder or a switch. So if you have a 2D cell you can reasonably hard-code the up, down, diagonals, and so on. If you go to three dimensions you have 27 cases, which is too much, but it's OK if you're restricted to the six cube faces.
Once you've got a lot of conditions, start coding via look-up tables.
But there's no real answer. For example Windows message loops need to deal with a lot of different messages, and you can't sensibly encode the handling code in look-up tables.
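As a concrete sketch of the two approaches discussed above, in Python (the metrics and values are entirely made up):

```python
# Hypothetical example: choose a shipping rate from two metrics.

# Nested-logic version:
def rate_logic(size, speed):
    if size == "small":
        return 1.0 if speed == "standard" else 2.5
    elif size == "medium":
        return 1.8 if speed == "standard" else 3.9
    else:  # "large"
        return 3.0 if speed == "standard" else 6.0

# Lookup-table version: the same 2D matrix of values, which can be
# externalised (e.g. loaded from a file) if the values change often.
RATES = {
    ("small", "standard"): 1.0, ("small", "express"): 2.5,
    ("medium", "standard"): 1.8, ("medium", "express"): 3.9,
    ("large", "standard"): 3.0, ("large", "express"): 6.0,
}

def rate_table(size, speed):
    return RATES[(size, speed)]

assert rate_logic("medium", "express") == rate_table("medium", "express")
```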

Choosing a magic byte least likely to appear in real data

I hope this isn't too opinionated for SO; it may not have a good answer.
In a portion of a library I'm writing, I have a byte array that gets populated with values supplied by the user. These values might be of type Float, Double, Int (of different sizes), etc. with binary representations you might expect from C, say. This is all we can say about the values.
I have an opportunity for an optimization: I can initialize my byte array with the byte MAGIC, and then whenever no byte of the user-supplied value is equal to MAGIC I can take a fast path, otherwise I need to take the slow path.
So my question is: what is a principled way to go about choosing my magic byte, such that it will be reasonably likely not to appear in the (variously-encoded and distributed) data I receive?
Part of my question, I suppose, is whether there's something like a Benford's law that can tell me something about the distribution of bytes in many sorts of data.
Capture real-world data from a diverse set of inputs that would be used by applications of your library.
Write a quick and dirty program to analyze the dataset. It sounds like what you want to know is which bytes are most frequently totally excluded. So the output of the program would say, for each byte value, how many inputs do not contain it.
This is not the same as least frequent byte. In data analysis you need to be careful to mind exactly what you're measuring!
Use the analysis to define your architecture. If every byte value appears somewhere in the data, you can abandon the optimization entirely.
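A quick-and-dirty sketch, in Python, of the analysis program described above, assuming the captured inputs are available as files on disk (the "captured_inputs" directory is a placeholder):

```python
from pathlib import Path

def absence_counts(paths):
    """For each byte value 0..255, count how many inputs do NOT contain it."""
    absent_in = [0] * 256
    for path in paths:
        present = set(Path(path).read_bytes())
        for b in range(256):
            if b not in present:
                absent_in[b] += 1
    return absent_in

# Rank candidate magic bytes by how often they are totally excluded.
samples = list(Path("captured_inputs").glob("*"))   # placeholder directory
counts = absence_counts(samples)
best = max(range(256), key=lambda b: counts[b])
print(f"Byte 0x{best:02x} is absent from {counts[best]} of {len(samples)} inputs")
```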
I was inclined to use byte 255, but I discovered that it is also prevalent in MS Word files. So I now use byte 254 as the EOF code to terminate a file.

Efficient way to Store Huge Number of Large Arrays of Integers in Database

I need to store an array of integers of length about 1000 against an integer ID and a string name. The number of such tuples is almost 160000.
I will pick one array and calculate the root mean square deviation (RMSD) elementwise with all others and store an (ID1,ID2,RMSD) tuple in another table.
Could you please suggest the best way to do this? I am currently using MySQL for other datatables in the same project but if necessary I will switch.
One possibility would be to store the arrays in a BINARY or a BLOB type column. Given that the base type of your arrays is an integer, you could step through four bytes at a time to extract values at each index.
If I understand the context correctly, the arrays must all be of the same fixed length, so a BINARY type column would be the most efficient, if it offers sufficient space to hold your arrays. You don't have to worry about database normalisation here, because your array is an atomic unit in this context (again, assuming I'm understanding the problem correctly).
If you did have a requirement to access only part of each array, then this may not be the most practical way to store the data.
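A sketch of that fixed-width encoding in Python, assuming 32-bit signed integers and the fixed array length from the question (the helper names are illustrative; the resulting blob would go into the BINARY column):

```python
import struct

ARRAY_LEN = 1000  # known, fixed length of each array

def pack_ints(values):
    """Pack a list of 32-bit signed integers into a fixed-size byte string
    suitable for a BINARY(4000) column."""
    assert len(values) == ARRAY_LEN
    return struct.pack(f"<{ARRAY_LEN}i", *values)

def unpack_ints(blob):
    """Step through the blob four bytes at a time to recover the values."""
    return list(struct.unpack(f"<{ARRAY_LEN}i", blob))

row_blob = pack_ints(list(range(ARRAY_LEN)))
assert len(row_blob) == ARRAY_LEN * 4           # 4000 bytes per row
assert unpack_ints(row_blob)[:3] == [0, 1, 2]
```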
The secondary consideration is whether to compute the RMSD value in the database itself, or in some external language on the server. As you've mentioned in your comments, this will be most efficient to do in the database. It sounds like queries are going to be fairly expensive anyway, though, and the execution time may not be a primary concern: simplicity of coding in another language may be more desirable. Also depending on the cost of computing the RMSD value relative to the cost of round-tripping a query to the database, it may not even make that much of a difference?
Alternatively, as you've alluded to in your question, using Postgres could be worth considering, because of its more expressive PL/pgSQL language.
Incidentally, if you want to search around for more information on good approaches, searching for database and time series would probably be fruitful. Your data is not necessarily time series data, but many of the same considerations would apply.
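If the RMSD does end up being computed outside the database, it is straightforward once the blobs are decoded back into arrays; a sketch using numpy (the IDs and data here are made up):

```python
import numpy as np

def rmsd(a, b):
    """Elementwise root mean square deviation between two equal-length arrays."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Compare one decoded array against the others and collect (ID1, ID2, RMSD) tuples.
reference_id, reference = 1, np.random.randint(0, 100, size=1000)
others = {2: np.random.randint(0, 100, size=1000),
          3: np.random.randint(0, 100, size=1000)}
results = [(reference_id, other_id, rmsd(reference, arr))
           for other_id, arr in others.items()]
```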