Is it possible to append rows to an existing Arrow (PyArrow) Table?

I am aware that "Many Arrow objects are immutable: once constructed, their logical properties cannot change anymore" (docs). In this blog post by one of the Arrow creators it is said:
"Table columns in Arrow C++ can be chunked, so that appending to a table is a zero copy operation, requiring no non-trivial computation or memory allocation."
However, I am unable to find in the documentation how to append a row to a table. pyarrow.concat_tables(tables, promote=False) does something similar, but it is my understanding that it produces a new Table object, rather than, say, adding chunks to the existing one.
I am unsure if this operation is at all possible/makes sense (in which case I'd like to know how) or if it doesn't (in which case, pyarrow.concat_tables is exactly what I need).
Similar questions:
In PyArrow, how to append rows of a table to a memory mapped file? asks specifically about memory-mapped files. I am asking generally about any Table object. Could be coming from a read_csv operation or be manually constructed.
Using pyarrow how do you append to parquet file? talks about Parquet files. See above.
Pyarrow Write/Append Columns Arrow File talks about columns, but I'm talking about rows.
https://github.com/apache/arrow/issues/3622 asks this same question, but it doesn't have a satisfying answer (in my opinion).

Basically, a Table in PyArrow/Arrow C++ isn't really the data itself, but rather a container consisting of pointers to data. How it works is:
A Buffer represents an actual, singular allocation. In other words, Buffers are contiguous, full stop. They may be mutable or immutable.
An Array contains 0+ Buffers and imposes some sort of semantics into them. (For instance, an array of integers, or an array of strings.) Arrays are "contiguous" in the sense that each buffer is contiguous, and conceptually the "column" is not "split" across multiple buffers. (This gets really fuzzy with nested arrays: a struct array does split its data across multiple buffers, in some sense! I need to come up with a better wording of this, and will contribute this to upstream docs. But I hope what I mean here is reasonably clear.)
A ChunkedArray contains 0+ Arrays. A ChunkedArray is not logically contiguous. It's kinda like a linked list of chunks of data. Two ChunkedArrays can be concatenated "zero copy", i.e. the underlying buffers will not get copied.
A Table contains 0+ ChunkedArrays. A Table is a 2D data structure (both columns and rows).
A RecordBatch contains 0+ Arrays. A RecordBatch is also a 2D data structure.
Hence, you can concatenate two Tables "zero copy" with pyarrow.concat_tables, by just copying pointers. But you cannot concatenate two RecordBatches "zero copy", because you have to concatenate the Arrays, and then you have to copy data out of buffers.

Will pre-padding Tcl dicts with empty values speed up runtime?

While solving one of the Advent of Code 2021 puzzles in Tcl, I wanted to speed up the runtime of my script.
My script uses a dictionary with keys as {x y} coordinates and a 0 or 1 as the value. The x-y area of interest for the puzzle increases for each iteration of a loop. As a result, additional key-value pairs are added to the dict with each iteration of the loop.
I think I once learned that Tcl dicts may become re-structured in memory if necessary, possibly due to adding more and more keys. If so, does this cause a runtime hit?
To speed up runtime, would it be a good idea to pre-pad a dict with keys set to empty strings matching the expected final size of the dict?
At the implementation level, yes, rebuilding the hash table has a cost that is linear in the number of entries; after all, each entry has to be placed in a new bucket of the enlarged hash table array. However, the entries themselves do not need to be reallocated; the only memory management changes are for the hash table arrays themselves (allocate new, dispose old) so the cost isn't crazy high. The rebuild triggers whenever the number of entries in the hash table exceeds a fixed multiplier of the size of the hash table; that loading factor is a compile time constant. (Dicts are wrappers around hash tables with Tcl_Obj keys, mostly to add value semantics and ensure that the iteration order is consistent; those aren't things that matter for the rebuild semantics.) There's no notion of pre-sizing a hash table; the implementation doesn't expose that in a useful way. It also doesn't shrink the array; once it has grown, it stays grown (and most of the time that's not a problem at all).
The complexities of rebuild semantics are part of why Tcl's associative arrays are said to have a random order of enumeration: it's not actually random, but the deterministic algorithm is sensitive to a lot of factors that people normally ignore. You don't need to care about that when working with dicts, where the order of iteration is exactly knowable from the way that the value was built, irrespective of the details of how the hashing is done.
If you're doing lookups using compact integer keys from 0 up, a list will be substantially faster, as hashing is currently always performed on string representations. Compound integer keys may become nested lists.
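The point that rebuilds are linear but not very expensive overall can be checked with a small counting model. This is a sketch in Python rather than Tcl, and the initial bucket count, load multiplier, and growth factor are assumed parameters chosen for illustration, not values stated in the answer above. It counts how many existing entries get re-bucketed while a table grows one entry at a time:

```python
def count_rehash_moves(n_entries, initial_buckets=4, max_load=3, growth=4):
    """Count rebuilds and re-bucketed entries while inserting n_entries keys.

    All default parameters are illustrative assumptions.
    """
    buckets = initial_buckets
    rebuilds = 0
    moves = 0
    for count in range(1, n_entries + 1):
        # A rebuild triggers once the entry count exceeds a fixed
        # multiple of the bucket array size.
        if count > buckets * max_load:
            moves += count - 1   # every existing entry goes to a new bucket
            buckets *= growth    # the bucket array is reallocated, entries are not
            rebuilds += 1
    return rebuilds, moves


rebuilds, moves = count_rehash_moves(100_000)
```

Under these assumptions, 100,000 insertions trigger only a handful of rebuilds, and the total number of re-bucketed entries stays below the number of insertions, so the amortized cost per insertion is constant.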

Storing matrices in MySQL for fast retrieval of full matrices

In a project, there is a time consuming computation result that is represented in a numeric matrix format, which can be commonly used in follow-on tasks. So, I want to store the calculation result into the database, so the new tasks can re-use the result.
The size of matrix is not fixed and there will be multiple matrices.
Which one is more suitable for my case?
1. Storing a serialized matrix.
2. Creating a table like the following, with one row per cell: Matrix ID, X-coord, Y-coord, Value.
Or maybe there might be better ways?
(A short discussion of option 1.)
If MySQL does not need to look at the cells of a matrix, then serialize it in any form and store it in a TEXT or BLOB column in the table.
JSON is a relatively simple serialization that is available in a lot of programming languages and is easily readable by humans. XML is, in my opinion, too clunky to consider.
Or you could do something ad-hoc such as numbers separated by commas. And start the string with the length and width of the matrix, followed by the values in order. (No need for x and y coordinates.) When reading, your language may have a "split()" or "explode()" function to break the string apart on ",".
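Both serializations can be sketched in a few lines. This example uses Python purely for illustration (the function names are hypothetical, and the ad-hoc format assumes a rectangular matrix of plain numbers); the resulting string would go into the TEXT or BLOB column:

```python
import json


def matrix_to_json(matrix):
    """Serialize a list-of-lists matrix as JSON text."""
    return json.dumps(matrix)


def matrix_to_flat(matrix):
    """Ad-hoc format: rows, columns, then the values in row order, comma-separated."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    values = [str(v) for row in matrix for v in row]
    return ",".join([str(rows), str(cols)] + values)


def matrix_from_flat(text):
    """Inverse of matrix_to_flat; values are read back as floats."""
    parts = text.split(",")
    rows, cols = int(parts[0]), int(parts[1])
    values = [float(p) for p in parts[2:]]
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]


matrix = [[1.5, 2.0, 3.0], [4.0, 5.0, 6.25]]
as_json = matrix_to_json(matrix)
as_flat = matrix_to_flat(matrix)
restored = matrix_from_flat(as_flat)
```

The dimensions at the front of the flat string are what make the x and y coordinates unnecessary: the position of each value in the list determines its cell.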

Why are hash table based data structures not the default when implementing adjacency lists?

I looked at some existing implementations of adjacency lists online, and most if not all of them have been implemented using dynamic arrays. But wouldn't hashtable based data structures be more suitable? (set and map)
There are very limited scenarios where we would access graph nodes by index. Even if that's the case, if some indices are missing from the graph, there will be wasted space. And if the nodes are not inserted in order, lookups are O(n).
However, if we use a hashtable based data structure, lookups will be O(1) whether the nodes are indexed or otherwise.
So why are maps and sets not the default data structures used when implementing adjacency lists?
The choice of the right container is not quite easy.
I will consider some of the most common:
a list (elements which contain a reference to the next and/or previous)
an array (with consecutive storage)
an associative array
a hash table.
Each of them has advantages and disadvantages.
Concerning a list, insertions and removals can be very fast (worst case O(1) if the insertion point / removal element is known) but a look-up has worst case time complexity of O(N).
The look-up in an array has a complexity of O(1) in worst case if the index is known (but insertion and removal can be slow if the order must be kept).
A hash table has a look-up of O(1) in the best case, but the worst case might be O(N) (even if that is unlikely to happen often unless the hash table is implemented badly).
An associative array (assuming a tree-based implementation) has a look-up time complexity of O(lg N) in the worst case.
So the choice always depends on the expected use cases: the goal is the best compromise, where the advantages pay off most while the disadvantages don't hurt too much.
For the management of node and edge lists in graphs, OP made the observation that arrays seem to be very common.
I recently had a look into the Boost Graph Library (for curiosity and inspiration). Concerning the data structures, it is mentioned:
The adjacency_list class is the general purpose “swiss army knife” of graph classes. It is highly parameterized so that it can be optimized for different situations: the graph is directed or undirected, allow or disallow parallel edges, efficient access to just the out-edges or also to the in-edges, fast vertex insertion and removal at the cost of extra space overhead, etc.
The configuration (for a specific use case) is covered on a separate page, BGL – adjacency_list.
However, the defaults for vertex (node) list and edge list are in fact vectors (aka. dynamic arrays). Assuming that the average use case is a non-mutable graph (loaded once and never modified) which is explored by algorithms to answer certain user questions, the worst case of O(1) for look-up in arrays is hard to beat and will very probably pay off.
To organize this, the nodes and edges have to be enumerated. If the input data doesn't provide this, it's easy to add this as a kind of internal ID to the in-memory representation of the graph.
In this case, "public" node references have to be mapped into the internal IDs, and answers have to be mapped back. For the mapping of the public node references, the most appropriate container should be used. This might in fact be an associative array or hash table.
Considering that a request such as finding the shortest route from A to B has to map A and B to the corresponding internal IDs only once, but may need many look-ups of nodes and edges to compute the answer, the choice of arrays for storing nodes and edges makes a lot of sense.
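A minimal sketch of that arrangement, in Python for brevity (the function names and the sample edge list are made up for illustration, and the graph is assumed to be undirected and unweighted): a hash table maps public node names to internal IDs once, while the adjacency list itself is a plain array of arrays indexed by ID.

```python
from collections import deque


def build_graph(edges):
    """Assign internal IDs to node names and build an array-based adjacency list."""
    ids = {}         # public name -> internal ID (hash table, used at the boundary)
    names = []       # internal ID -> public name
    adjacency = []   # internal ID -> list of neighbour IDs

    def intern(name):
        if name not in ids:
            ids[name] = len(names)
            names.append(name)
            adjacency.append([])
        return ids[name]

    for a, b in edges:
        u, v = intern(a), intern(b)
        adjacency[u].append(v)
        adjacency[v].append(u)
    return ids, names, adjacency


def shortest_path(ids, names, adjacency, start, goal):
    """Breadth-first search; names are mapped to IDs once, then only arrays are used."""
    s, g = ids[start], ids[goal]
    parent = [-1] * len(adjacency)
    seen = [False] * len(adjacency)
    seen[s] = True
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == g:
            break
        for v in adjacency[u]:   # the only operation needed: iterate neighbours
            if not seen[v]:
                seen[v] = True
                parent[v] = u
                queue.append(v)
    if not seen[g]:
        return None
    path = []
    node = g
    while node != -1:
        path.append(names[node])
        node = parent[node]
    return path[::-1]


edges = [("A", "B"), ("B", "C"), ("A", "D"), ("D", "C"), ("C", "E")]
ids, names, adjacency = build_graph(edges)
route = shortest_path(ids, names, adjacency, "A", "E")
```

The hash table is consulted twice per query, while the inner loop touches only integer-indexed lists.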
"There are very limited scenarios where we would access graph nodes by index."
This is true, and exactly what you should be thinking about: you want a data structure which can efficiently do whatever operations you actually want to use it for. So the question is, what operations do you want to be efficient?
Suppose you are implementing some kind of standard algorithm which uses an adjacency list, e.g. Dijkstra's algorithm, A* search, depth-first search, breadth-first search, topological sorting, or so on. For almost every algorithm like this, you will find that the only operation you need to use the adjacency list for is: for a given node, iterate over its neighbours.
That operation is more efficient for a dynamic array than for a hashtable, because a hashtable has to be sufficiently sparse to prevent too many collisions. Besides that, dynamic arrays will use less memory than hashtables, for the same reason; and the dynamic arrays are more efficient to build in the first place, because you don't have to compute any hashes.
Now, if you have a different algorithm where you need to be able to test for the existence of an edge in O(1) time, then an adjacency list implemented using hashtables may be a good choice; but you should also consider whether an adjacency matrix is more suitable.

Efficient way to Store Huge Number of Large Arrays of Integers in Database

I need to store an array of integers of length about 1000 against an integer ID and string name. The number of such tuples is almost 160000.
I will pick one array and calculate the root mean square deviation (RMSD) elementwise with all others and store an (ID1,ID2,RMSD) tuple in another table.
Could you please suggest the best way to do this? I am currently using MySQL for other datatables in the same project but if necessary I will switch.
One possibility would be to store the arrays in a BINARY or a BLOB type column. Given that the base type of your arrays is an integer, you could step through four bytes at a time to extract values at each index.
If I understand the context correctly, the arrays must all be of the same fixed length, so a BINARY type column would be the most efficient, if it offers sufficient space to hold your arrays. You don't have to worry about database normalisation here, because your array is an atomic unit in this context (again, assuming I'm understanding the problem correctly).
If you did have a requirement to access only part of each array, then this may not be the most practical way to store the data.
The secondary consideration is whether to compute the RMSD value in the database itself, or in some external language on the server. As you've mentioned in your comments, this will be most efficient to do in the database. It sounds like queries are going to be fairly expensive anyway, though, and the execution time may not be a primary concern: simplicity of coding in another language may be more desirable. Also depending on the cost of computing the RMSD value relative to the cost of round-tripping a query to the database, it may not even make that much of a difference?
Alternatively, as you've alluded to in your question, using Postgres could be worth considering, because of its more expressive PL/pgSQL language.
Incidentally, if you want to search around for more information on good approaches, searching for database and time series would probably be fruitful. Your data is not necessarily time series data, but many of the same considerations would apply.
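The BLOB approach and an external RMSD computation can be sketched as follows. This is an illustrative Python example, not a MySQL recipe: it assumes signed 32-bit little-endian integers, and the function names are hypothetical. The packed bytes are what would be written to the BINARY or BLOB column.

```python
import math
import struct


def pack_ints(values):
    """Pack integers as consecutive 4-byte little-endian values."""
    return struct.pack(f"<{len(values)}i", *values)


def unpack_ints(blob):
    """Step through the blob four bytes at a time to recover the integers."""
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}i", blob))


def rmsd(a, b):
    """Root mean square deviation between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


first = [1, 2, 3, 4]
second = [1, 2, 3, 8]
blob = pack_ints(first)
deviation = rmsd(unpack_ints(blob), second)
```

At four bytes per element, an array of length 1000 takes 4000 bytes per row.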

What are hashtables and hashmaps and their typical use cases?

I have recently run across these terms a few times, but I am quite confused about how they work and when they are usually implemented.
Well, think of it this way.
If you use an array, a simple index-based data structure, and fill it up with random stuff, finding a particular entry gets to be a more and more expensive operation as you fill it with data, since you basically have to start searching from one end toward the other, until you find the one you want.
If you want to get faster access to data, you typically resort to sorting the array and using a binary search. This, however, while increasing the speed of looking up an existing value, makes inserting new values slow, as you need to move existing elements around when you need to insert an element in the middle.
A hashtable, on the other hand, has an associated function that takes an entry, and reduces it to a number, a hash-key. This number is then used as an index into the array, and this is where you store the entry.
A hashtable revolves around an array, which initially starts out empty. Empty does not mean zero length: the array starts out with a size, but all the elements in the array contain nothing.
Each element has two properties, data, and a key that identifies the data. For instance, a list of zip-codes of the US would be a zip-code -> name type of association. The function reduces the key, but does not consider the data.
So when you insert something into the hashtable, the function reduces the key to a number, which is used as an index into this (empty) array, and this is where you store the data, both the key, and the associated data.
Then, later, you want to find a particular entry that you know the key for, so you run the key through the same function, get its hash-key, go to that particular place in the hashtable, and retrieve the data there.
The theory goes that the function that reduces your key to a hash-key, that number, is computationally much cheaper than the linear search.
A typical hashtable does not have an infinite number of elements available for storage, so the hash-key is typically reduced further to an index that fits the size of the array. One way to do this is to simply take the hash-key modulo the size of the array. For an array with a size of 10, hash-keys 0-9 map directly to an index, hash-keys 10-19 map down to 0-9 again, and so on.
Some keys will be reduced to the same index as an existing entry in the hashtable. At this point the actual keys are compared directly, with all the rules associated with comparing the data types of the key (i.e. normal string comparison for instance). If there is a complete match, you either disregard the new data (it already exists), or you overwrite (you replace the old data for that key), or you add it (multi-valued hashtable). If there is no match, which means that though the hash-keys were identical, the actual keys were not, you typically find a new location to store that key+data in.
Collision resolution has many implementations, and the simplest one is to just go to the next empty element in the array. This simple solution has other problems though, so finding the right resolution algorithm is also a good exercise for hashtables.
Hashtables can also grow, if they fill up completely (or close to), and this is usually done by creating a new array of the new size, and calculating all the indexes once more, and placing the items into the new array in their new locations.
The function that reduces the key to a number does not produce linear values (i.e. it is not the case that "AAA" becomes 1 and "AAB" then becomes 2), so the hashtable is not sorted by any typical value.
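The steps described above (hash the key, reduce it modulo the array size, compare keys on a collision, move to the next slot, and grow when nearly full) can be sketched as a small class. This is an illustrative Python toy, not how any particular library implements its hashtable; the class name, the 3/4 fill limit, and the doubling strategy are assumptions, and removal is left out because it needs extra care with this collision scheme.

```python
class LinearProbingTable:
    """Toy hashtable using modulo indexing and next-slot collision resolution."""

    def __init__(self, capacity=8):
        self._slots = [None] * capacity   # each slot is None or a (key, value) pair
        self._count = 0

    def _find_slot(self, key, slots):
        # Reduce the hash-key to an index that fits the array.
        index = hash(key) % len(slots)
        # On a collision, compare the actual keys; if they differ, try the next slot.
        while slots[index] is not None and slots[index][0] != key:
            index = (index + 1) % len(slots)
        return index

    def put(self, key, value):
        # Grow before the array gets more than three quarters full.
        if (self._count + 1) * 4 > len(self._slots) * 3:
            self._grow()
        index = self._find_slot(key, self._slots)
        if self._slots[index] is None:
            self._count += 1
        self._slots[index] = (key, value)   # overwrite if the key already exists

    def get(self, key, default=None):
        entry = self._slots[self._find_slot(key, self._slots)]
        return default if entry is None else entry[1]

    def _grow(self):
        # Create a larger array and recompute every index.
        new_slots = [None] * (len(self._slots) * 2)
        for entry in self._slots:
            if entry is not None:
                new_slots[self._find_slot(entry[0], new_slots)] = entry
        self._slots = new_slots

    def __len__(self):
        return self._count


table = LinearProbingTable(capacity=4)
for i in range(50):
    table.put(f"k{i}", i)
table.put("k7", 70)   # same key again: the old value is replaced
```

Keeping the array partly empty is what guarantees the probing loop always finds either the key or an empty slot.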
There is also a good Wikipedia article available on the subject.
lassevk's answer is very good, but might contain a little too much detail. Here is the executive summary. I am intentionally omitting certain relevant information which you can safely ignore 99% of the time.
There is no important difference between hash tables and hash maps 99% of the time.
Hash tables are magic
Seriously. It's a magic data structure which all but guarantees three things. (There are exceptions. You can largely ignore them, although learning them someday might be useful for you.)
1) Everything in the hash table is part of a pair -- there is a key and a value. You put in and get out data by specifying the key you are operating on.
2) If you are doing anything by a single key on a hash table, it is blazingly fast. This implies that put(key,value), get(key), contains(key), and remove(key) are all really fast.
3) Generic hash tables fail at doing anything not listed in #2! (By "fail", we mean they are blazingly slow.)
When do we use hash tables?
We use hash tables when their magic fits our problem.
For example, caching frequently ends up using a hash table -- for example, let's say we have 45,000 students in a university and some process needs to hold on to records for all of them. If you routinely refer to students by ID number, then an ID => student cache makes excellent sense. The operation you are optimizing for this cache is fast lookup.
Hashes are also extraordinarily useful for storing relationships between data when you don't want to go whole hog and alter the objects themselves. For example, during course registration, it might be a good idea to be able to relate students to the classes they are taking. However, for whatever reason you might not want the Student object itself to know about that. Use a studentToClassRegistration hash and keep it around while you do whatever it is you need to do.
They also make a fairly good first choice for a data structure except when you need to do one of the following:
When Not To Use Hash Tables
Iterate over the elements. Hash tables typically do not do iteration very well. (Generic ones, that is. Particular implementations sometimes contain linked lists which are used to make iterating over them suck less. For example, in Java, LinkedHashMap lets you iterate over keys or values quickly.)
Sorting. If you can't iterate, sorting is a royal pain, too.
Going from value to key. Use two hash tables. Trust me, I just saved you a lot of pain.
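The "use two hash tables" advice can be sketched like this. It is a minimal Python example with a hypothetical class name, and it assumes that values are hashable and that the mapping is meant to be one-to-one, so each value belongs to at most one key.

```python
class TwoWayMap:
    """Keeps a forward and a reverse hash table in sync."""

    def __init__(self):
        self._by_key = {}
        self._by_value = {}

    def put(self, key, value):
        # Drop any stale pairing so both directions stay consistent.
        if key in self._by_key:
            del self._by_value[self._by_key[key]]
        if value in self._by_value:
            del self._by_key[self._by_value[value]]
        self._by_key[key] = value
        self._by_value[value] = key

    def value_for(self, key):
        return self._by_key[key]

    def key_for(self, value):
        return self._by_value[value]

    def __len__(self):
        return len(self._by_key)


students = TwoWayMap()
students.put(1001, "alice")
students.put(1002, "bob")
students.put(1001, "carol")   # re-assigning an ID removes the old reverse entry
```

Both directions are then single-key lookups, which is the operation hash tables are good at.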
If you are talking in terms of Java, both are collections which allow adding, deleting and updating objects, and both use hashing algorithms internally.
The significant difference, however, if we talk in reference to Java, is that hashtables are inherently synchronized and hence thread safe, while hash maps are not thread-safe collections.
Apart from the synchronization, the internal mechanism to store and retrieve objects is hashing in both the cases.
If you need to see how hashing works, I would recommend a bit of googling on data structures and hashing techniques.
Hashtables/hashmaps associate a value (called 'key' for disambiguation purposes) with another value. You can think of them as a kind of dictionary (word: definition) or a database record (key: data).