Variable size fields seem like they could cause performance issues.
For the sake of being concrete, let's assume we're using a relational database. Suppose a relation has a variable length text field. What happens if an update to a tuple in the relation increases the variable length field's size? An in-line record edit (i.e. editing the file containing the record in-line) would require shuffling around the other tuples residing on the same physical page -- potentially kicking some out.
I understand that different DBMSs handle this differently, but I'm curious what some of the common practices are for this. It seems to me that the best way to do this would be to simply mark the existing tuple as deleted and create a whole new tuple.
"It depends". Each implementation is different and practically warrants its own small book. (I should really be close-voting this question not answering it, but I figure I'll try to help and I can't make this short enough for a comment).
For PostgreSQL, read the developer documentation about DB storage and VARLENA, storage classes and TOAST, as well as the manual section on MVCC and concurrency control. For more info, start reading the code; many of the key headers and source files have good, detailed comments that explain the low-level operation.
The condensed version, which you may have to read the above-mentioned resources to understand:
PostgreSQL never overwrites a tuple during an update. It always writes the new version to a new location. If that location is on the same physical page and no indexed columns changed, it can skip the index updates (a HOT update), but it will always do a heap write of a new tuple. It sets the xmax value of the old tuple and the xmin of the new one so that a transaction can only ever see one or the other. See the concurrency and MVCC docs for the gory details.
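You can watch this happen via PostgreSQL's system columns. A minimal sketch, assuming a throwaway table t (the table and values are made up; ctid, xmin and xmax are real system columns):

CREATE TABLE t (id int PRIMARY KEY, v text);
INSERT INTO t VALUES (1, 'hello');

SELECT ctid, xmin, xmax, * FROM t;   -- note the tuple's physical location (ctid) and its xmin

UPDATE t SET v = 'a considerably longer value than before' WHERE id = 1;

SELECT ctid, xmin, xmax, * FROM t;   -- a different ctid and xmin: a new tuple was written, nothing was edited in place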
Variable length values may be stored inline or out-of-line (TOAST). If a value is stored inline in the heap tuple, which is the default for small values, then when you update the record (whether you update that field or some other one) the data gets copied to a new tuple, just like fixed length data does. If it's stored out-of-line in a TOAST side-table and is unmodified, only the pointer to it is copied, not the value itself. If it's stored out-of-line and modified, a new record for the new value is written to the TOAST table and a pointer to it is stored in the newly written heap tuple.
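If you want to see where a particular value ended up, a rough sketch along these lines can help (the documents table and body column are hypothetical; pg_column_size and octet_length are real functions, and SET STORAGE is the real knob for the per-column TOAST strategy):

-- octet_length() is the raw length of the value, pg_column_size() the bytes
-- actually used to store it (after any compression).
SELECT id,
       octet_length(body)   AS raw_bytes,
       pg_column_size(body) AS stored_bytes
FROM documents;

-- Force out-of-line, uncompressed storage for the column if you want to experiment:
ALTER TABLE documents ALTER COLUMN body SET STORAGE EXTERNAL;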
Later on, VACUUM comes along and marks obsolete tuples, freeing space and allowing them to be overwritten.
Because PostgreSQL must retain the old data to be visible to old transactions it can never do an in-place modification.
In theory it'd be possible to put the old data somewhere else and then overwrite it - that's what Oracle does, with its undo and redo logs - but that's not what PostgreSQL does. Doing that introduces different complexities and trade-offs, solving problems and creating others.
(The only exception to the no-overwrite rule is pg_largeobject, which uses a sort of slice based copy-on-write to allow transactional updates to big file-like chunks of data without copying the whole file. Oh, and you could argue that SEQUENCEs get overwritten too. Also some full-table-lock operations.)
Other RDBMSes work in different ways. Some even support multiple modes. MySQL, for example, offers both MyISAM tables (in-place writes, AFAIK) and InnoDB (MVCC copy-on-write). Oracle has the undo and redo logs - it copies the old data to out-of-line storage, then does an in-place update. Other DBMSes are no doubt different again.
For PostgreSQL, there is some information about that in http://www.postgresql.org/docs/9.3/static/datatype-character.html:
Tip: There is no performance difference among these three types [varchar(n)/character varying(n), char(n)/character(n), text], apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.
I would assume that "an in-line record edit" will never occur, due to data integrity requirements and transaction processing (MVCC).
There is some (fairly old) information about transaction processing:
http://www.postgresql.org/files/developer/transactions.pdf
We must store multiple versions of every row. A tuple can be removed only after it’s been committed as deleted for long enough that no active transaction can see it anymore.
I just want to know whether the JSON type is also covered by transactions. For example, if I start a transaction that inserts data into both JSON-typed columns and other columns, and something goes wrong, will it roll back the JSON stuff as well?
Everything is transactional and crash-safe in PostgreSQL unless explicitly documented not to be.
PostgreSQL's transactions operate on tuples, not individual fields. The data type is irrelevant. It isn't really possible to implement a data type that is not transactional in PostgreSQL. (The SERIAL "data type" is just a wrapper for the integer type with a DEFAULT, and is a bit of a special case).
Only a few things have special behaviour regarding transactions - sequences, advisory locks, etc - and they're pretty clearly documented where that's the case.
Note that this imposes some limitations you may not immediately expect. Most importantly, because PostgreSQL relies on MVCC for concurrency control it must copy a value when that value is modified (or, sometimes, when other values in the same tuple are modified). It cannot change fields in-place. So if you have a 5MB json document in a field and you change a single integer value, the whole json document must be copied and written out with the changed value. PostgreSQL will then come along later and mark the old copy as free space that can be re-used.
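A minimal sketch of the transactional behaviour the question asks about, using a made-up docs table:

CREATE TABLE docs (id int PRIMARY KEY, meta json, counter int);
INSERT INTO docs VALUES (1, '{"status": "ok"}', 0);

BEGIN;
UPDATE docs SET meta = '{"status": "broken"}', counter = counter + 1 WHERE id = 1;
ROLLBACK;  -- the json value and the integer revert together

SELECT meta, counter FROM docs WHERE id = 1;  -- still {"status": "ok"} and 0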
ABSTRACT
Talking with some colleagues we came across the "extract random row from a big database table" issue. It's a classic one and we know the naive approach (also on SO) is usually something like:
SELECT * FROM mytable ORDER BY RAND() LIMIT 1
THE PROBLEM
We also know a query like that is utterly inefficient and really only usable with very few rows. There are some approaches that attain better efficiency, like these ones still on SO, but they won't work with arbitrary primary keys, and the randomness will be skewed as soon as you have holes in your numeric primary keys. An answer to the last cited question links to this article, which has a good explanation and some bright solutions involving an additional "equal distribution" table that must be maintained whenever the "master data" table changes. But then again, if you have frequent DELETEs on a big table, you'll probably be screwed by the constant updating of the added table. Also note that many solutions rely on COUNT(*), which is ridiculously fast on MyISAM but only "just fast" on InnoDB (I don't know how it performs on other platforms, but I suspect the InnoDB case is representative of other transactional database systems).
In addition to that, even the best solutions I was able to find are fast but not Ludicrous Speed fast.
THE IDEA
A separate service could be responsible for generating, buffering and distributing random row ids or even entire random rows:
it could choose the best method to extract random row ids depending on how the original PKs are structured. An ordered list of keys could be maintained in RAM by the service (it shouldn't take too many bytes per row in addition to the actual size of the PK; it's probably OK up to 100~1000M rows on standard PCs and up to 1~10 billion rows on a beefy server)
once the keys are in memory you have an implicit "row number" for each key, with no holes, so it's just a matter of choosing a random number and directly fetching the corresponding key
a buffer of random keys ready to be consumed could be maintained to quickly respond to spikes in the incoming requests
consumers of the service will connect and request N random rows from the buffer
rows are returned as simple keys or the service could maintain a (pool of) db connection(s) to fetch entire rows
if the buffer is empty the request could block or return an EOF-like response
if data is added to the master table the service must be signaled so it can add the same data to its copy too, flush the buffer of random picks and carry on from there
if data is deleted from the master table the service must be signaled to remove that data too from both the "all keys" list and "random picks" buffer
if data is updated in the master table the service must be signaled to update corresponding rows in the key list and in the random picks
WHY WE THINK IT'S COOL
does not touch the disk except for the initial load of keys at startup or when signaled to reload
works with any kind of primary key, numerical or not
if you know you're going to update a large batch of data you can just signal it when you're done (i.e. not at every single insert/update/delete on the original data), it's basically like having a fine grained lock that only blocks requests for random rows
really fast on updates of any kind in the original data
offloads some work from the relational db to another, memory only process: helps scalability
responds really fast from its buffers without waiting for any querying, scanning, sorting
could easily be extended to similar use cases beyond the SQL one
WHY WE THINK IT COULD BE A STUPID IDEA
because we had the idea without help from any third party
because nobody (that we've heard of) has ever bothered to do something similar
because it adds complexity in the mix to keep it updated whenever original data changes
AND THE QUESTION IS...
Does anything similar already exist? If not, would it be feasible? If not, why?
The biggest risk with your "cache of eligible primary keys" concept is keeping the cache up to date, when the origin data is changing continually. It could be just as costly to keep the cache in sync as it is to run the random queries against the original data.
How do you expect to signal the cache that a value has been added/deleted/updated? If you do it with triggers, keep in mind that a trigger can fire even if the transaction that spawned it is rolled back. This is a general problem with notifying external systems from triggers.
If you notify the cache from the application after the change has been committed in the database, then you have to worry about other apps that make changes without being fitted with the signaling code. Or ad hoc queries. Or queries from apps or tools for which you can't change the code.
In general, the added complexity is probably not worth it. Most apps can tolerate some compromise and they don't need an absolutely random selection all the time.
For example, the inequality lookup may be acceptable for some needs, even with the known weakness that numbers following gaps are chosen more often.
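For reference, that inequality lookup typically looks something like this in MySQL, assuming an integer primary key id (hence the bias: an id sitting just after a gap is selected for every random value that lands in the gap):

SELECT t.*
FROM mytable t
JOIN (SELECT FLOOR(MIN(id) + RAND() * (MAX(id) - MIN(id))) AS r FROM mytable) x
  ON t.id >= x.r
ORDER BY t.id
LIMIT 1;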
Or you could pre-select a small number of random values (e.g. 30) and cache them. Let app requests choose from these. Every 60 seconds or so, refresh the cache with another set of randomly chosen values.
Or choose a random value evenly distributed between MIN(id) and MAX(id). Try a lookup by equality, not inequality. If the value corresponds to a gap in the primary key, just loop and try again with a different random value. You can terminate the loop if it's not successful after a few tries. Then try another method instead. On average, the improved simplicity and speed of an equality lookup may make up for the occasional retries.
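A sketch of that retry approach, again assuming an integer primary key id; the application loops while the result is empty and falls back to another method after a few tries:

-- Returns at most one row; if the random value falls into a gap it returns
-- nothing and the caller simply retries with a fresh random value.
SELECT t.*
FROM mytable t
JOIN (SELECT FLOOR(MIN(id) + RAND() * (MAX(id) - MIN(id) + 1)) AS r FROM mytable) x
  ON t.id = x.r;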
It appears you are basically addressing a performance issue here. Most DB performance experts recommend having as much RAM as your DB size; then disk is no longer a bottleneck - your DB lives in RAM and flushes to disk as required.
You're basically proposing a custom-developed, in-RAM CDC/hashing system.
You could just build this as a standard database only application and lock your mapping table in RAM, if your DB supports this.
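For example, something like the following could stand in for the proposed service, entirely inside the database (table and column names are hypothetical; MySQL's MEMORY engine is one way to keep it RAM-resident):

-- Dense, gap-free numbering over the real primary keys; rebuilt or patched
-- whenever the master table changes.
CREATE TABLE random_map (
  rownum BIGINT UNSIGNED NOT NULL PRIMARY KEY,
  pk     VARCHAR(64)     NOT NULL
) ENGINE = MEMORY;

-- Picking a random row is then a single point lookup, no sorting or scanning:
SELECT m.pk
FROM random_map m
JOIN (SELECT FLOOR(1 + RAND() * (SELECT COUNT(*) FROM random_map)) AS r) x
  ON m.rownum = x.r;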
I guess I am saying that you can address performance issues without developing custom applications, just use already existing performance tuning methods.
I'm working on building a web application that consists of users doing the following:
Browse and search against a Solr server containing millions of entries. (This part of the app is working really well.)
Select a privileged piece of this data (the results of some particular search), and temporarily save it as a "dataset". (I'd like dataset size to be limited to something really large, say half a million results.)
Perform some sundry operations on that dataset.
(The frontend's built in Rails, though I doubt that's really relevant to how to solve this particular problem.)
Step two, and how to retrieve the data for step 3, are what's giving me trouble. I need to be able to temporarily save datasets, recover them when they're needed, and expire them after a while. The problem is, my results have SHA1 checksum IDs, so each ID is 48 characters. A 500,000 record dataset, even if I only store IDs, is 22 MB of data. So I can't just have a single database table and throw a row in it for each dataset that a user constructs.
Has anybody out there ever needed something like this before? What's the best way to approach this problem? Should I generate a separate table for each dataset that a user constructs? If so, what's the best way to expire/delete these tables after a while? I can deploy a MySQL server if needed (though I don't have one up yet, all the data's in Solr), and I'd be open to some crazier software as well if something else fits the bill.
EDIT: Some more detailed info, in response to Jeff Ferland below.
The data objects are immutable, static, and reside entirely within the Solr database. It might be more efficient as files, but I would much rather (for reasons of search and browse) keep them where they are. Neither the data nor the datasets need to be distributed across multiple systems, I don't expect we'll ever get that kind of load. For now, the whole damn thing runs inside a single VM (I can cross that bridge if I get there).
By "recovering when needed," what I mean is something like this: The user runs a really carefully crafted search query, which gives them some set of objects as a result. They then decide they want to manipulate that set. When they (as a random example) click the "graph these objects by year" button, I need to be able to retrieve the full set of object IDs so I can take them back to the Solr server and run more queries. I'd rather store the object IDs (and not the search query), because the result set may change underneath the user as we add more objects.
A "while" is roughly the length of a user session. There's a complication, though, that might matter: I may wind up needing to implement a job queue so that I can defer processing, in which case the "while" would need to be "as long as it takes to process your job."
Thanks to Jeff for prodding me to provide the right kind of further detail.
First trick: don't represent your SHA1 as text, but rather as the 20 bytes it takes up. The hex value you see is a way of showing bytes in human readable form. If you store them properly, you're at 9.5MB instead of 22.
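In MySQL terms that means a BINARY(20) column, converting between hex and raw bytes at the boundary. A quick sketch (the table name is made up):

CREATE TABLE items (
  item_id BINARY(20) PRIMARY KEY   -- the raw 20 bytes, not the 40-char hex string
);

INSERT INTO items (item_id) VALUES (UNHEX('356a192b7913b04c54574d18c28d46e6395428ab'));
SELECT HEX(item_id) FROM items;    -- back to the familiar hex form for display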
Second, you haven't really explained the nature of what you're doing. Are your saved datasets references to immutable objects in the existing database? What do you mean by recovering them when needed? How long is "a while" when you talk about expiration? Is the underlying data that you're referencing static or dynamic? Can you save the search pattern and an offset, or do you need to save the individual reference?
Does the data related to a session need to be inserted into a database? Might it be more efficient in files? Does that need to be distributed across multiple systems?
There are a lot of questions left in my answer. For that, you need to better express or even define the requirements beyond the technical overview you've given.
Update: There are many possible solutions for this. Here are two:
Write those to a single table (saved_searches or such) that has an incrementing search id. Bonus points for inserting your keys in sorted order. (search_id unsigned bigint, item_id char(20), primary key (search_id, item_id)). That will really limit fragmentation, keep each search clustered, and free up pages in roughly sequential order. It's almost a rolling table, and that's about the best case for doing great amounts of insertions and deletions. In that circumstance, you pay a cost for insertion, and double that cost for deletion. You must also iterate the entire search result.
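Spelled out in full, it might look like this (I've used BINARY(20) rather than CHAR(20) so the raw bytes from the first trick fit cleanly; the values are made up):

CREATE TABLE saved_searches (
  search_id BIGINT UNSIGNED NOT NULL,
  item_id   BINARY(20)      NOT NULL,
  PRIMARY KEY (search_id, item_id)
);

-- Save a result set under a fresh search_id, ideally inserting ids in sorted order:
INSERT INTO saved_searches (search_id, item_id)
VALUES (42, UNHEX('356a192b7913b04c54574d18c28d46e6395428ab'));

-- Recover it later:
SELECT item_id FROM saved_searches WHERE search_id = 42;

-- Expire it:
DELETE FROM saved_searches WHERE search_id = 42;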
If your search items have an incrementing primary id such that any new insertion to the database will have a higher value than anything that is already in the database, that is the most efficient. Alternately, inserting a datestamp would achieve the same effect with less efficiency (every row must actually be checked in a query instead of just the index entries). If you take note of that maximum id, and you don't delete records, then you can save searches that use zero space by always setting a maximum id on the saved query.
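For example (hypothetical table, column names and values): if the largest id at save time was 123456, the "saved search" is just the original query plus an upper bound, and nothing at all needs to be stored per item:

SELECT *
FROM objects
WHERE title LIKE '%volcano%'   -- the user's original search condition
  AND id <= 123456;            -- the maximum id noted when the search was saved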
I'm new to databases, and this question has to do with how smart I can expect databases to be. Here by "databases" I mean "something like" MySQL or H2 (I actually have no idea if these two are similar, just that they are popular). I'm actually using ScalaQuery, so it abstracts away from the underlying database.
Suppose I have a table with entries of type (String, Int), with lots of redundancy in the String entries. So my table might look like:
(Adam, 18)
(Adam, 24)
(Adam, 34)
... continued ...
(Adam, 3492)
(Bethany, 4)
(Bethany, 45)
... continued ...
(Bethany, 2842)
If I store this table with H2, is it going to be smart enough to realize "Adam" and "Bethany" are repeated lots of times, and can be replaced with enumerations pointing to lookup tables? Or is it going to waste lots of storage?
Related: If H2 is smart in this respect with strings, is it also smart in the same way with doubles? In my probably brain-dead initial table, I happen to have lots of repeated double fields.
Thanks!
The database engine is not built to recognize redundancies in data and fix them. That is the task of the designer / developer.
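What that looks like in practice is plain normalization: pull the repeated strings into their own table and reference them by key. A sketch with made-up names:

CREATE TABLE person (
  person_id INT PRIMARY KEY,
  name      VARCHAR(100) NOT NULL UNIQUE
);

CREATE TABLE measurement (
  person_id INT NOT NULL,
  value     INT NOT NULL,
  FOREIGN KEY (person_id) REFERENCES person (person_id)
);

-- (Adam, 18) and (Adam, 24) become one 'Adam' row plus two narrow rows:
INSERT INTO person VALUES (1, 'Adam');
INSERT INTO measurement VALUES (1, 18), (1, 24);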
Databases are designed to store information. There is no way the database will know whether (Adam, 44) and (Adam, 55) can be compressed, and I would be petrified if databases tried to do things like you propose, as this can lead to various performance and/or logical problems.
On the contrary, databases do not minimise storage: they add redundant information, like indexes and keys, plus other internal bookkeeping the DB requires.
DBs are built to retrieve information fast, not to store it space-effectively. Faced with that trade-off, a database would rather increase storage space than decrease the performance of a query.
There are some storage systems that compress pages, so the question is valid. I can't talk about MySQL, but I believe it is similar to H2. H2 isn't very smart in this regard. H2 does compress data, but only for the following cases:
LOB compression, if enabled.
The following does not affect the storage size of a closed database: H2 currently compresses the undo log using LZF when writing, so repeated data in a page will result in slightly improved write performance (but only after a checkpoint). This may change in the future, however.
Also, H2 uses an encoding similar to UTF-8 to store text, but I wouldn't call this compression.
MySQL and other SQL products based on contiguous storage are not smart at this kind of thing at all.
Consider two logical sets, one referencing the other (i.e. a foreign key). One possible implementation is to physically store the value common to both sets just once and for both tables to store a pointer to the value (think reference type variables in 3GL programming languages such as C#). However, most SQL products physically store the value in both tables; if you want pointers then the end user has to implement them themselves, typically using autoincrement integer 'surrogate' keys, which sadly get exposed into the logical model.
Either you are talking about data compression, which can be done by the database engine and shouldn't be your concern.
Or you are talking about data normalization. Then you should read up on database design.
Databases are meant to store data, so no need to worry about a bit of redundancy. If you are going into several million lines and gigabytes of data, then you can start considering options. But up to that level you will not have any problems with performance.
Given 2 large tables (imagine hundreds of millions of rows), each with a string column, how do you get the diff?
Check out the open-source Percona Toolkit -- specifically, the pt-table-sync utility.
Its primary purpose is to sync a MySQL table with its replica, but since its output is the set of MySQL commands necessary to reconcile the differences between two tables, it's a natural fit for comparing the two.
What it actually does under the hood is a bit complex, and it actually uses different approaches depending on what it can tell about your tables (indexes, etc.), but one of the basic ideas is that it does fast CRC32 checksums on chunks of the indexes, and if the checksums don't match, it examines those records more closely. Note that this method is much faster than walking both indexes linearly and comparing them.
It only gets you part of the way, though. Because the generated commands are intended to sync a replica with its master, they simply replace the current contents of the replica for all differing records. In other words, the commands generated modify all fields in the record (not just the ones that have changed). So once you use pt-table-sync to find the diffs, you'd need to wrap the results in something to examine the differing records by comparing each field in the record.
But pt-table-sync does what you already knew to be the hard part: detecting diffs, really fast. It's written in Perl; the source should provide good breadcrumbs.
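To give a flavour of the chunk-checksum idea (this is not the tool's actual query, just a hand-rolled sketch with assumed table and column names): run the same statement against both tables for the same key range, and only drill into individual rows when the pair of numbers differs.

SELECT COUNT(*)                                    AS row_count,
       BIT_XOR(CRC32(CONCAT_WS('#', id, str_col))) AS chunk_checksum
FROM table_a
WHERE id BETWEEN 1 AND 100000;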
I'd think about creating an index on that column in each DB, then using a program to process through each DB in parallel, using an ordering on that column. It would advance in both as long as the records are equal, and in one or the other as you find they are out of sync (keeping track of the out-of-sequence records). The creation of the index could be very costly in terms of both time and space (at least initially). Keeping it updated, though, if you are going to continue adding records, may not add too much overhead. Once you have the index in place you should be able to process the difference in linear time. Producing the index -- assuming you have enough space -- should be an O(n log n) operation.
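If the string column is what defines the diff, the same ordered comparison can also be expressed directly in SQL as a pair of anti-joins once the indexes exist (table and column names are assumptions):

CREATE INDEX idx_a_str ON table_a (str_col);
CREATE INDEX idx_b_str ON table_b (str_col);

-- Values present in table_a but missing from table_b ...
SELECT a.str_col FROM table_a a
LEFT JOIN table_b b ON b.str_col = a.str_col
WHERE b.str_col IS NULL
UNION ALL
-- ... and values present in table_b but missing from table_a.
SELECT b.str_col FROM table_b b
LEFT JOIN table_a a ON a.str_col = b.str_col
WHERE a.str_col IS NULL;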