How to access the elements of a sparse matrix efficiently in the Eigen library?

I have filled in a sparse matrix A using the Eigen library. I then need to access the non-zero elements of the sparse matrix; if I do it as A(rowindex, colindex), it is very slow.
I also tried using std::unordered_map from the STL to solve this problem, but it is also very slow.
Is there any efficient way to solve this problem?

Compressed sparse matrices are stored in CSR or CSC format. Considering how a CSR matrix stores entries internally: there is an array storing the x non-zero values, a corresponding array of length x storing their respective column indices, and an array (usually much smaller than the other two) "pointing" to where each row starts in those arrays.
There is no way to know where a given non-zero element is, or whether a row-column pair exists at all, without searching for it in the two ordered index arrays: the outer index (rows) and the inner index (columns). This is not very efficient for accessing elements in a random order.
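For reference, the standard way to work with this in Eigen (not shown in the thread above, so treat it as a sketch) is to iterate over the stored non-zeros with InnerIterator instead of looking coefficients up one by one. The matrix contents below are made up for illustration; with a row-major matrix the outer loop runs over rows, matching the CSR description above.

    #include <Eigen/Sparse>
    #include <iostream>
    #include <vector>

    int main() {
        // Build a small sparse matrix from triplets (illustrative values).
        std::vector<Eigen::Triplet<double>> triplets = {
            {0, 1, 3.0}, {1, 0, 22.0}, {1, 2, 17.0}, {2, 2, 5.0}
        };
        Eigen::SparseMatrix<double, Eigen::RowMajor> A(3, 3);   // CSR-style storage
        A.setFromTriplets(triplets.begin(), triplets.end());
        A.makeCompressed();

        // Visit only the stored non-zeros, row by row: O(nnz) overall,
        // unlike repeated coefficient lookups, each of which must search
        // within a row's inner index array.
        for (int k = 0; k < A.outerSize(); ++k) {
            for (Eigen::SparseMatrix<double, Eigen::RowMajor>::InnerIterator it(A, k); it; ++it) {
                std::cout << "(" << it.row() << "," << it.col() << ") = "
                          << it.value() << "\n";
            }
        }

        // The raw compressed arrays are also exposed if needed:
        // A.valuePtr(), A.innerIndexPtr(), A.outerIndexPtr().
        return 0;
    }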

Related

Is it possible to append rows to an existing Arrow (PyArrow) Table?

I am aware that "Many Arrow objects are immutable: once constructed, their logical properties cannot change anymore" (docs). In this blog post by one of the Arrow creators it is said:
Table columns in Arrow C++ can be chunked, so that appending to a table is a zero copy operation, requiring no non-trivial computation or memory allocation.
However, I am unable to find in the documentation how to append a row to a table. pyarrow.concat_tables(tables, promote=False) does something similar, but it is my understanding that it produces a new Table object, rather than, say, adding chunks to the existing one.
I am unsure whether this operation is at all possible/makes sense (in which case I'd like to know how) or whether it isn't (in which case pyarrow.concat_tables is exactly what I need).
Similar questions:
In PyArrow, how to append rows of a table to a memory mapped file? asks specifically about memory-mapped files. I am asking generally about any Table object; it could come from a read_csv operation or be constructed manually.
Using pyarrow how do you append to parquet file? talks about Parquet files. See above.
Pyarrow Write/Append Columns Arrow File talks about columns, but I'm talking about rows.
https://github.com/apache/arrow/issues/3622 asks this same question, but it doesn't have a satisfying answer (in my opinion).
Basically, a Table in PyArrow/Arrow C++ isn't really the data itself, but rather a container consisting of pointers to data. How it works is:
A Buffer represents an actual, singular allocation. In other words, Buffers are contiguous, full stop. They may be mutable or immutable.
An Array contains 0+ Buffers and imposes some sort of semantics onto them. (For instance, an array of integers, or an array of strings.) Arrays are "contiguous" in the sense that each buffer is contiguous, and conceptually the "column" is not "split" across multiple buffers. (This gets really fuzzy with nested arrays: a struct array does split its data across multiple buffers, in some sense! I need to come up with a better wording of this, and will contribute it to the upstream docs. But I hope what I mean here is reasonably clear.)
A ChunkedArray contains 0+ Arrays. A ChunkedArray is not logically contiguous. It's kinda like a linked list of chunks of data. Two ChunkedArrays can be concatenated "zero copy", i.e. the underlying buffers will not get copied.
A Table contains 0+ ChunkedArrays. A Table is a 2D data structure (both columns and rows).
A RecordBatch contains 0+ Arrays. A RecordBatch is also a 2D data structure.
Hence, you can concatenate two Tables "zero copy" with pyarrow.concat_tables, by just copying pointers. But you cannot concatenate two RecordBatches "zero copy", because you have to concatenate the Arrays, which means copying data out of their buffers.
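To keep the examples in this compilation in one language, here is a minimal sketch of that idea against the Arrow C++ API rather than pyarrow (assuming a reasonably recent Arrow; the column name, the values, and the MakeTable helper are made up for illustration). Two single-chunk tables are concatenated: the result's column simply holds two chunks, and no buffers are copied.

    #include <arrow/api.h>
    #include <iostream>
    #include <memory>
    #include <vector>

    // Helper (illustrative): build a one-column int64 table from a few values.
    std::shared_ptr<arrow::Table> MakeTable(const std::vector<int64_t>& values) {
        arrow::Int64Builder builder;
        if (!builder.AppendValues(values).ok()) return nullptr;
        std::shared_ptr<arrow::Array> array;
        if (!builder.Finish(&array).ok()) return nullptr;
        auto schema = arrow::schema({arrow::field("x", arrow::int64())});
        std::vector<std::shared_ptr<arrow::Array>> columns = {array};
        return arrow::Table::Make(schema, columns);
    }

    int main() {
        auto t1 = MakeTable({1, 2, 3});
        auto t2 = MakeTable({4, 5});

        // "Appending rows" really means concatenating tables: the buffers of
        // t1 and t2 are not copied; the result's column simply has two chunks.
        auto result = arrow::ConcatenateTables({t1, t2});
        if (!result.ok()) return 1;
        std::shared_ptr<arrow::Table> combined = *result;

        std::cout << "rows: " << combined->num_rows()                    // 5
                  << ", chunks in column 0: "
                  << combined->column(0)->num_chunks() << "\n";          // 2
        return 0;
    }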

Storing matrices in MySQL for fast retrieval of full matrices

In a project, there is a time-consuming computation whose result is a numeric matrix that is commonly reused in follow-on tasks. I therefore want to store the calculation result in the database so that new tasks can re-use it.
The size of the matrix is not fixed, and there will be multiple matrices.
Which one is more suitable for my case?
Option 1: storing a serialized matrix.
Option 2: creating a table with the columns Matrix ID, X-coord, Y-coord, Value.
Or maybe there might be better ways?
(A short discussion of option 1.)
If MySQL does not need to look at the individual cells of a matrix, then serialize it in any form and store it in a TEXT or BLOB column of the table.
JSON is a relatively simple serialization that is available in a lot of programming languages and is easily readable by humans. XML is, in my opinion, too clunky to consider.
Or you could do something ad hoc, such as numbers separated by commas: start the string with the length and width of the matrix, followed by the values in order (no need for x and y coordinates). When reading, your language probably has a "split()" or "explode()" function to break the string apart on ",".
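As a rough C++ illustration of that ad-hoc format (the function names and exact layout are just assumptions for the example), the sketch below writes a matrix as "rows,cols,v1,v2,..." and reads it back; the resulting string is what would go into the TEXT or BLOB column.

    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    // Serialize as "rows,cols,v0,v1,..." in row-major order (no coordinates needed).
    std::string serialize(const std::vector<std::vector<double>>& m) {
        std::ostringstream out;
        out << m.size() << ',' << (m.empty() ? 0 : m[0].size());
        for (const auto& row : m)
            for (double v : row) out << ',' << v;
        return out.str();
    }

    // The inverse: split on ',' and rebuild the matrix.
    std::vector<std::vector<double>> deserialize(const std::string& s) {
        std::istringstream in(s);
        std::string tok;
        auto next = [&]() { std::getline(in, tok, ','); return std::stod(tok); };
        size_t rows = static_cast<size_t>(next());
        size_t cols = static_cast<size_t>(next());
        std::vector<std::vector<double>> m(rows, std::vector<double>(cols));
        for (auto& row : m)
            for (double& v : row) v = next();
        return m;
    }

    int main() {
        std::vector<std::vector<double>> m = {{1.5, 2.0}, {3.25, 4.0}};
        std::string blob = serialize(m);   // "2,2,1.5,2,3.25,4"
        std::cout << blob << "\n";
        auto back = deserialize(blob);
        std::cout << back[1][0] << "\n";   // 3.25
        return 0;
    }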

Dividing a vector of points into two spaces

I have a memory-mapped file of many millions of 3D points in an STL vector, using CGAL. Given an arbitrary plane that divides the dataset into approximately equal parts, I would like to reorder the dataset so that all inside points are contiguous in the vector, and likewise the outside points. This process then needs to be repeated to an arbitrary depth, creating a non-axis-aligned BSP tree.
Due to the size of the dataset, I would like to do this in place if possible. I have a predicate functor that I use to create a filtered_iterator, but of course that doesn't reorder the points, it just skips non-matching ones. I could create a second vector and copy the matching points into that, and then re-use the original vector round-robin style, but I would like to avoid that if possible, if only to keep the iterators that mark the start and end of each space.
Of course, by invoking the question gods, I received direct communication from them almost as soon as I posted!
I had simply been blind to the STL algorithm std::partition, which does exactly what I need.
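For reference, a minimal sketch of that approach with std::partition, using simple structs instead of the CGAL types (the plane test and the point type here are assumptions for illustration). The returned iterator marks the boundary between the two spaces, and each half can be partitioned again with a new plane to build the BSP tree; std::stable_partition could be used instead if the relative order of points must be preserved.

    #include <algorithm>
    #include <iostream>
    #include <vector>

    struct Point3 { double x, y, z; };
    struct Plane  { double a, b, c, d; };   // a*x + b*y + c*z + d = 0

    // Predicate: true if the point lies on the non-negative side of the plane.
    struct InsidePlane {
        Plane p;
        bool operator()(const Point3& q) const {
            return p.a * q.x + p.b * q.y + p.c * q.z + p.d >= 0.0;
        }
    };

    int main() {
        std::vector<Point3> pts = {{1, 0, 0}, {-2, 1, 3}, {0.5, 2, -1}, {-1, -1, -1}};
        Plane plane{1, 0, 0, 0};   // split on the x = 0 plane (illustrative)

        // In place: "inside" points end up before mid, "outside" points after it.
        auto mid = std::partition(pts.begin(), pts.end(), InsidePlane{plane});

        std::cout << "inside: " << (mid - pts.begin())
                  << ", outside: " << (pts.end() - mid) << "\n";
        // Recurse on [begin, mid) and [mid, end) with new planes to build the BSP tree.
        return 0;
    }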

Moving memory around on device in CUDA

What is the fastest way to move data that is already on the device around in CUDA?
What I need to do is basically copy contiguous sub-rows and sub-columns (of which I have the indexes on the device) from row-major matrices into new, smaller matrices, but from what I've observed, memory access in CUDA is not particularly efficient, as it seems the cores are optimized for computation rather than memory work.
The CPU, on the other hand, seems to be pretty good at sequential work such as moving rows of aligned memory from one place to another.
I see three options:
write a kernel that does the memory copying;
outside a kernel, call cudaMemcpy(..., cudaMemcpyDeviceToDevice) for each position (terribly slow for columns, I would guess);
move the memory to the host, create the new smaller matrix there, and send it back to the device.
I could test this on my specific GPU, but given its specs I don't think it would be representative. In general, what is recommended?
Edit:
I'm essentially multiplying two matrices A and B, but I'm only interested in multiplying the X elements:
A =[[XX XX]
[ XX XX ]
[XX XX ]]
with the corresponding elements in the columns of B. The XX are always of the same length and I know their positions (and there's a fixed number of them per row).
If you have a matrix storage pattern that involves varying spacing between corresponding row elements (or corresponding column elements), none of the input transformation or striding capabilities of CUBLAS will help, and none of the API strided-copy functions (such as cudaMemcpy2D) will help.
You'll need to write your own kernel to gather the data before feeding it to cublasXgemm. This should be fairly trivial to do if you have the locations of the incoming data elements listed in a vector or otherwise available.
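A rough sketch of such a gather kernel, assuming the positions of the wanted elements in each row of A have already been placed in a device-side index array (the names, the row-major layout, and the fixed count k of elements per row are assumptions for illustration):

    // Gathers selected elements of each row of a row-major matrix A (leading
    // dimension n) into a dense, smaller row-major matrix Apacked with k
    // elements per row. colIdx holds, for every row, the k source column
    // indices (row-major, size rows * k).
    __global__ void gatherRows(const float* __restrict__ A,
                               const int*   __restrict__ colIdx,
                               float*       __restrict__ Apacked,
                               int rows, int n, int k)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int total = rows * k;
        // Grid-stride loop: each thread copies one packed element per iteration.
        for (int i = tid; i < total; i += gridDim.x * blockDim.x) {
            int row = i / k;
            int j   = i % k;   // position within the packed row
            Apacked[row * k + j] = A[row * n + colIdx[row * k + j]];
        }
    }

    // Launch example (illustrative sizes):
    //   int threads = 256;
    //   int blocks  = (rows * k + threads - 1) / threads;
    //   gatherRows<<<blocks, threads>>>(dA, dColIdx, dApacked, rows, n, k);
    //   // ...then call cublasSgemm on Apacked.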

Calculate conditional mean

I'm new to CUDA programming and am interested in implementing an algorithm that, when coded serially, calculates two or more means from a vector in one pass. What would be an efficient scheme for doing something like this in CUDA?
There are two vectors of length N: the element values, and indicator values identifying which subset each element belongs to.
Is there an efficient way to do this in one pass, or should it be done in M passes, where M is the number of means to be calculated, using a vector of index keys for the element values of each subset?
You can achieve this with one pass over the data with a single call to thrust::reduce_by_key. In particular, look at the "summary statistics" example, which computes several statistical properties of a single vector at once. You can generalize this method to reduce_by_key, which computes reductions over many sub-vectors in parallel. Your "indicator values" would be the "keys" that reduce_by_key uses to determine which sub-vector each element belongs to.
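A minimal sketch of that idea (the data and the number of subsets are made up): sort_by_key first makes equal keys adjacent, then one reduce_by_key produces the per-key sums and a second one, fed a constant_iterator of ones, produces the per-key counts, from which the means follow. For simplicity this sketch uses two reduce_by_key calls; the summary-statistics approach would fuse them into a single call with a zip iterator and a custom reduction functor.

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/reduce.h>
    #include <thrust/iterator/constant_iterator.h>
    #include <iostream>

    int main() {
        // Element values and their subset indicators (illustrative data, 2 subsets).
        float h_vals[] = {1, 2, 3, 4, 5, 6};
        int   h_keys[] = {0, 1, 0, 1, 0, 1};
        thrust::device_vector<float> vals(h_vals, h_vals + 6);
        thrust::device_vector<int>   keys(h_keys, h_keys + 6);

        // reduce_by_key needs equal keys to be adjacent.
        thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());

        thrust::device_vector<int>   out_keys(2);
        thrust::device_vector<float> sums(2), counts(2);

        // One pass for the per-key sums...
        thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                              out_keys.begin(), sums.begin());
        // ...and one for the per-key counts (a constant_iterator of ones).
        thrust::reduce_by_key(keys.begin(), keys.end(),
                              thrust::constant_iterator<float>(1.0f),
                              out_keys.begin(), counts.begin());

        for (int i = 0; i < 2; ++i) {
            float mean = static_cast<float>(sums[i]) / static_cast<float>(counts[i]);
            std::cout << "mean of subset " << out_keys[i] << " = " << mean << "\n";  // 3 and 4
        }
        return 0;
    }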
Partition each vector into smaller vectors and use threads to sum the required elements of each sub-vector. Then combine the sums and compute the global means. I would try to generate the M means at the same time rather than doing M passes.