Unique, numeric, incremental identifier - unique

I need to generate unique, incremental, numeric transaction id's for each request I make to a certain XML RPC. These numbers only need to be unique across my domain, but will be generated on multiple machines.
I really don't want to have to keep track of this number in a database and deal with row locking etc on every single transaction. I tried to hack this using a microsecond timestamp, but there were collisions with just a few threads - my application needs to support hundreds of threads.
Any ideas would be appreciated.
Edit: What if each transaction id just has to be larger than the previous request's?

If you're going to be using this from hundreds of threads, working on multiple machines, and require an incremental ID, you're going to need some centralized place to store and lock the last generated ID number. This doesn't necessarily have to be in a database, but that would be the most common option. A central server that did nothing but serve IDs could provide the same functionality, but that probably defeats the purpose of distributing this.
If they need to be incremental, any form of timestamp won't be guaranteed unique.
If you don't need them to be incremental, a GUID would work. Potentially doing some type of merge of the timestamp + a hardware ID on each system could give unique identifiers, but the ID number portion would not necessarily be unique.
Could you use a pair of Hardware IDs + incremental timestamps? This would make each specific machine's IDs incremental, but not necessarily be unique across the entire domain.
---- EDIT -----
I don't think using any form of timestamp is going to work for you, for 2 reasons.
First, you'll never be able to guarantee that 2 threads on different machines won't try to schedule at exactly the same time, no matter what resolution of timer you use. At a high enough resolution, it would be unlikely, but not guaranteed.
Second, to make this work, even if you could resolve the collision issue above, you'd have to get every system to have exactly the same clock with microsecond accuracy, which isn't really practical.

This is a very difficult problem, particularly if you don't want to create a performance bottleneck. You say that the IDs need to be 'incremental' and 'numeric' -- is that a concrete business constraint, or one that exists for some other purpose?
If these aren't necessary you can use UUIDs, which most common platforms have libraries for. They allow you to generate many (millions!) of IDs in very short timespans and be quite comfortable with no collisions. The relevant article on wikipedia claims:
In other words, only after generating
1 billion UUIDs every second for the
next 100 years, the probability of
creating just one duplicate would be
about 50%.

If you remove 'incremental' from your requirements, you could use a GUID.
I don't see how you can implement incremental across multiple processes without some sort of common data.

If you target a Windows platform, did you try Interlocked API ?

Google for GUID generators for whatever language you are looking for, and then convert that to a number if you really need it to be numeric. It isn't incremental though.
Or have each thread "reserve" a thousand (or million, or billion) transaction IDs and hand them out one at a time, and "reserve" the next bunch when it runs out. Still not really incremental.

I'm with the GUID crowd, but if that's not possible, could you consider using db4o or SQL Lite over a heavy-weight database?

If each client can keep track of its own "next id", then you could talk to a sentral server and get a range of id's, perhaps a 1000 at a time. Once a client runs out of id's, it will have to talk to the server again.
This would make your system have a central source of id's, and still avoid having to talk to the database for every id.

Related

Pros and cons of Flake ids and cryptographic Ids

A distributed system can generate unique ids either by Flake or cryptographic ids (e.g., 128 bit murmur3).
Wonder what are the pros and cons of each method.
I'm going to assume 128-bit ids, kind-a like UUIDs. Let's start at a baseline, though
TL;DR: Use random ids. If and only if you have database performance issues try flake ids.
Auto-increment ids
Auto-increment ids are when your backend system assigns a unique, densely-packed id to each new entity. This is usually done by a database, but not always.
The clear advantage is that the id is guaranteed unique to your system, though 128 bits is probably overkill.
The first disadvantage is that you leak information every time you expose your id. You leak what other ids there are (an attacker can easily guess what to look for). You also leak how busy your system is (your competition now knows how many ids you create in a time period and can infer, say financial information).
The second disadvantage is that your backend is no longer as scalable. You are tied to some slow, less scalable id generator that will always be a bottleneck in a large system.
Random ids
Random ids are when you just generate 128 random bytes. v4 UUIDs 122-bit random ids (e.g. 2bbfb5ba-f5a2-11e7-8c3f-9a214cf093ae). These are also practically unique.
Random ids get rid of both of the disadvantages of auto-increment ids: they leak no information and are infinitely scalable.
The disadvantage comes when storing ids in b-trees (à la databases) because they randomize the memory/disk pages that the tree accesses. This may be a source of slow-downs to your system.
To me this is still the ideal id scheme, and you should have a good reason to move off of it. (i.e. profiler data).
Flake ids
Flake ids are random ids with except that the high k bits are taken from the lower bits of a timestamp. For example, you may get the following three ids in a row, where the top bits are really close together.
2bbfb5baf5a211e78c3f9a214cf093ae
2bbf9d4ec10c41049fb1671d6616b213
2bc6bb66e5964fb59050fcf3beed51b1
While you may leak some information, it isn't much if your k and timestamp granularity are designed well.
But if you mal-design the ids they can be less-than-helpful, either too infrequently updated—leading the b-trees to rely on the top random bits negating the usefulness—or too frequently—where you thrash the database because your updates.
Note: By time granularity, I mean how frequently the low bits of a timestamp change. Depending on your data throughput, you probably want this to be hour, deca-minutes, or minutes. It's a balance.
If you see the ids otherwise semantic-less (i.e. never infer anything from the top bits) then you can change any of these parameters at any time without interruption—even going back to purely random where k = 0.
Cryptographic ids
I'm assuming by this you mean ids have some semantic information encrypted in them. Maybe like hashids?
Disadvantages abound:
You'll have different length ids for different data, unless you have a fixed-length protocol.
You'll be tempted to add more and more info to the ids.
Look random, but no mitigation to add flake-like timestamps to the front
Ids become tied to the system that made it. You may start asking that system for decrypted versions of the id instead of just asking for the data it points to.
Your system burns time decrypting ids to extract data.
You add encryption problems
what happens if the secret-key is leaked? (Better not have too sensitive of data in there, customer name, or heaven forbid a credit card number)
coordinating key rotation.
Small ids like hashid can be brute-forced attack.
As you can see, I am not a fan of semantic ids in general. There are a few places where I use them, though I call them tokens. These don't get stored as keys in a database (or likely not stored anywhere).
For example I use encryption for pagination tokens: encrypted {last-id / context} of a pagination API. I prefer this over having the client pass the last element of the prior page because we keep the database context hidden from the user. It's simpler for everyone, and the encryption is little more than obfuscation (no sensitive information).

I know a GUID is nearly unique. But is it acceptable practice to assume it is unique?

So I completely understand the mathematical unlikeliness of creating two GUID values with the same number. But is it acceptable practice to assume they are unique?
For example I am working with a system for dealing with medical files. When I began to layout the database structure the manager (Not very technically knowledgeable, but likes to think he is and delegates things that would be better left for the more technically minded to decide) says he wants to use GUID's to separate different medical records instead of INT because it is "More unique". I explained how an INT is always going to be unique because it is sequential. I suggested we use BigINT if it will make him feel more comfortable since there are more numbers in that then if the population of the planet increased to the point people would only fit standing next to one another across the planet, but he is insisting on using GUIDs.
My feeling is although it is NEARLY IMPOSSIBLE for there to be a mix up, when dealing with medical records, why take the chance? What is the advantage of using a GUID vs an INT in this scenario?
But is it acceptable practice to assume it is unique?
Yes. That is the entire purpose of UUID, to be used as a reliable unique identifier without centralized coordination. (A GUID is Microsoft’s variation of a UUID.)
Only you (or your appropriate management) can make the final judgement for your particular project.
But if you truly begin to appreciate the enormity of the numerical range of 12x bits (which is actually incomprehensible to the human mind), then you know you can remove the usage of a properly generated UUID from your list of worries.
By “properly generated” I mean things like using the date-time Versions, or for lower number of values use the random (Version 4) if backed by a cryptographically-strong random number generator. Nearly every modern operating system today includes a UUID generation library. Or you can use the OSSP UUID project. Improperly-generated would include roll-your-own implementations you may see bandied about the inter webs.
As for the suggestion to use a database’s auto-incrementing serial/sequence number, every database person I know with years of real-world experience has been burned by those. I’ve never heard of or read of anyone ever having a collision with properly-generated UUIDs. I'm not saying sequences are necessarily bad or don't have their place, I'm just saying that all I can do is laugh when I hear people turn away from a UUID because of some beyond-astrononomically incomprehensibly minute possibility of a UUID collision and choose a sequence instead.
when dealing with medical records, why take the chance?
Your medical system is far far more likely to fail because of faulty data-entry or other human error with handling records. But do you post 3 clerks on duty to independently triple-enter the same data to reduce that chance of error? No. And that risk is incomprehensibly mathematically more likely to happen than a UUID problem. Yet every medical facility I know of accepts that enormous risk without even thinking about it.
What is the advantage of using a GUID vs an INT
The advantages include:
No need to manage your sequences.Examples include: Resetting for development, test, and production environments. Or when restoring a backup. Or fixing the sequence after faults in the system’s serial generation library (my own experience).
Avoid users’ intuited assumptions being confused about missing numbers in the sequence. I've had that conversation far too often.
Federating data between distributed systems.This is the biggest advantage, each system can act independently yet easily share data back and forth with other systems. Without UUIDs, the administrative overhead and the risk of error are bothersome at first and only grow over time.
Downsides include:
Larger memory and storage usage.Serial numbers are usually 32-bit integers, sometimes 64-bit. A good database with native support for UUID as a data type will use 128 bits.
Less readable by humans.One workaround is to just read several of the first or last digits for casual work.
Possibly less efficient indexing, with very large number of entries.
using an incrementing integer ID ensures only uniqueness within its own domain/type, an advantage of UUIDs/GUIDs is that they uniquely identify the owning thing in the entire universe.
So if you have multiple objects, say MedicalRecord, ID = 5, VaccinationForm, ID = 5 then you need to specify both the type ("medicalRecord" or "vaccinationForm" with the ID value of 5) whereas with a GUID you only need to store a single quanta of information to uniquely identify it.
It can be argued that using GUIDs is a waste of space as they are 16 bytes long (a 128-bit value).
If your system is self-contained and not interfacing with others you might want to use SQL Server's "sequence" concept, where instead of each table storing its own identity sequence, the sequence is maintained for all tables, making it a Locally-Unique ID value. You can use any size integer too.
See here: https://msdn.microsoft.com/en-us/library/ff878091.aspx

Is there / would be feasible a service providing random elements from a given SQL table?

ABSTRACT
Talking with some colleagues we came accross the "extract random row from a big database table" issue. It's a classic one and we know the naive approach (also on SO) is usually something like:
SELECT * FROM mytable ORDER BY RAND() LIMIT 1
THE PROBLEM
We also know a query like that is utterly inefficient and actually usable only with very few rows. There are some approaches that could be taken to attain better efficiency, like these ones still on SO, but they won't work with arbitrary primary keys and the randomness will be skewed as soon as you have holes in your numeric primary keys. An answer to the last cited question links to this article which has a good explanation and some bright solutions involving an additional "equal distribution" table that must be maintained whenever the "master data" table changes. But then again if you have frequent DELETEs on a big table you'll probably be screwed up by the constant updating of the added table. Also note that many solutions rely on COUNT(*) which is ridiculously fast on MyISAM but "just fast" on InnoDB (I don't know how it performs on other platforms but I suspect the InnoDB case could be representative of other transactional database systems).
In addition to that, even the best solutions I was able to find are fast but not Ludicrous Speed fast.
THE IDEA
A separate service could be responsible to generate, buffer and distribute random row ids or even entire random rows:
it could choose the best method to extract random row ids depending on how the original PKs are structured. An ordered list of keys could be maintained in ram by the service (shouldn't take too many bytes per row in addition to the actual size of the PK, it's probably ok up to 100~1000M rows with standard PCs and up to 1~10 billion rows with a beefy server)
once the keys are in memory you have an implicit "row number" for each key and no holes in it so it's just a matter of choosing a random number and directly fetch the corresponding key
a buffer of random keys ready to be consumed could be maintained to quickly respond to spikes in the incoming requests
consumers of the service will connect and request N random rows from the buffer
rows are returned as simple keys or the service could maintain a (pool of) db connection(s) to fetch entire rows
if the buffer is empty the request could block or return EOF-like
if data is added to the master table the service must be signaled to add the same data to its copy too, flush the buffer of random picks and go on from that
if data is deleted from the master table the service must be signaled to remove that data too from both the "all keys" list and "random picks" buffer
if data is updated in the master table the service must be signaled to update corresponding rows in the key list and in the random picks
WHY WE THINK IT'S COOL
does not touch disks other than the initial load of keys at startup or when signaled to do so
works with any kind of primary key, numerical or not
if you know you're going to update a large batch of data you can just signal it when you're done (i.e. not at every single insert/update/delete on the original data), it's basically like having a fine grained lock that only blocks requests for random rows
really fast on updates of any kind in the original data
offloads some work from the relational db to another, memory only process: helps scalability
responds really fast from its buffers without waiting for any querying, scanning, sorting
could easily be extended to similar use cases beyond the SQL one
WHY WE THINK IT COULD BE A STUPID IDEA
because we had the idea without help from any third party
because nobody (we heard of) has ever bothered to do something similar
because it adds complexity in the mix to keep it updated whenever original data changes
AND THE QUESTION IS...
Does anything similar already exists? If not, would it be feasible? If not, why?
The biggest risk with your "cache of eligible primary keys" concept is keeping the cache up to date, when the origin data is changing continually. It could be just as costly to keep the cache in sync as it is to run the random queries against the original data.
How do you expect to signal the cache that a value has been added/deleted/updated? If you do it with triggers, keep in mind that a trigger can fire even if the transaction that spawned it is rolled back. This is a general problem with notifying external systems from triggers.
If you notify the cache from the application after the change has been committed in the database, then you have to worry about other apps that make changes without being fitted with the signaling code. Or ad hoc queries. Or queries from apps or tools for which you can't change the code.
In general, the added complexity is probably not worth it. Most apps can tolerate some compromise and they don't need an absolutely random selection all the time.
For example, the inequality lookup may be acceptable for some needs, even with the known weakness that numbers following gaps are chosen more often.
Or you could pre-select a small number of random values (e.g. 30) and cache them. Let app requests choose from these. Every 60 seconds or so, refresh the cache with another set of randomly chosen values.
Or choose a random value evenly distributed between MIN(id) and MAX(id). Try a lookup by equality, not inequality. If the value corresponds to a gap in the primary key, just loop and try again with a different random value. You can terminate the loop if it's not successful after a few tries. Then try another method instead. On average, the improved simplicity and speed of an equality lookup may make up for the occasional retries.
It appears you are basically addressing a performance issue here. Most DB performance experts recommend you have as much RAM as your DB size, then disk is no longer a bottleneck - your DB lives in RAM and flushes to disk as required.
You're basically proposing a custom developed in-RAM CDC Hashing system.
You could just build this as a standard database only application and lock your mapping table in RAM, if your DB supports this.
I guess I am saying that you can address performance issues without developing custom applications, just use already existing performance tuning methods.

Storing large, session-level datasets?

I'm working on building a web application that consists of users doing the following:
Browse and search against a Solr server containing millions of entries. (This part of the app is working really well.)
Select a privileged piece of this data (the results of some particular search), and temporarily save it as a "dataset". (I'd like dataset size to be limited to something really large, say half a million results.)
Perform some sundry operations on that dataset.
(The frontend's built in Rails, though I doubt that's really relevant to how to solve this particular problem.)
Step two, and how to retrieve the data for step 3, are what's giving me trouble. I need to be able to temporarily save datasets, recover them when they're needed, and expire them after a while. The problem is, my results have SHA1 checksum IDs, so each ID is 48 characters. A 500,000 record dataset, even if I only store IDs, is 22 MB of data. So I can't just have a single database table and throw a row in it for each dataset that a user constructs.
Has anybody out there ever needed something like this before? What's the best way to approach this problem? Should I generate a separate table for each dataset that a user constructs? If so, what's the best way to expire/delete these tables after a while? I can deploy a MySQL server if needed (though I don't have one up yet, all the data's in Solr), and I'd be open to some crazier software as well if something else fits the bill.
EDIT: Some more detailed info, in response to Jeff Ferland below.
The data objects are immutable, static, and reside entirely within the Solr database. It might be more efficient as files, but I would much rather (for reasons of search and browse) keep them where they are. Neither the data nor the datasets need to be distributed across multiple systems, I don't expect we'll ever get that kind of load. For now, the whole damn thing runs inside a single VM (I can cross that bridge if I get there).
By "recovering when needed," what I mean is something like this: The user runs a really carefully crafted search query, which gives them some set of objects as a result. They then decide they want to manipulate that set. When they (as a random example) click the "graph these objects by year" button, I need to be able to retrieve the full set of object IDs so I can take them back to the Solr server and run more queries. I'd rather store the object IDs (and not the search query), because the result set may change underneath the user as we add more objects.
A "while" is roughly the length of a user session. There's a complication, though, that might matter: I may wind up needing to implement a job queue so that I can defer processing, in which case the "while" would need to be "as long as it takes to process your job."
Thanks to Jeff for prodding me to provide the right kind of further detail.
First trick: don't represent your SHA1 as text, but rather as the 20 bytes it takes up. The hex value you see is a way of showing bytes in human readable form. If you store them properly, you're at 9.5MB instead of 22.
Second, you haven't really explained the nature of what you're doing. Are your saved datasets references to immutable objects in the existing database? What do you mean by recovering them when needed? How long is "a while" when you talk about expiration? Is the underlying data that you're referencing static or dynamic? Can you save the search pattern and an offset, or do you need to save the individual reference?
Does the data related to a session need to be inserted into a database? Might it be more efficient in files? Does that need to be distributed across multiple systems?
There are a lot of questions left in my answer. For that, you need to better express or even define the requirements beyond the technical overview you've given.
Update: There are many possible solutions for this. Here are two:
Write those to a single table (saved_searches or such) that has an incrementing search id. Bonus points for inserting your keys in sorted order. (search_id unsigned bigint, item_id char(20), primary key (search_id, item_id). That will really limit fragmentation, keep each search clustered, and free up pages in a roughly sequential order. It's almost a rolling table, and that's about the best case for doing great amounts of insertions and deletions. In that circumstance, you pay a cost for insertion, and double that cost for deletion. You must also iterate the entire search result.
If your search items have an incrementing primary id such that any new insertion to the database will have a higher value than anything that is already in the database, that is the most efficient. Alternately, inserting a datestamp would achieve the same effect with less efficiency (every row must actually be checked in a query instead of just the index entries). If you take note of that maximum id, and you don't delete records, then you can save searches that use zero space by always setting a maximum id on the saved query.

How can I fix this scaling issue with soft deleting items?

I have a database where most tables have a delete flag for the tables. So the system soft deletes items (so they are no longer accessible unless by admins for example)
What worries me is in a few years, when the tables are much larger, is that the overall speed of the system is going to be reduced.
What can I do to counteract effects like that.
Do I index the delete field?
Do I move the deleted data to an identical delete table and back when undeleted?
Do I spread out the data over a few MySQL servers over time? (based on growth)
I'd appreciate any and all suggestions or stories.
UPDATE:
So partitioning seems to be the key to this. But wouldn't partitioning just create two "tables", one with the deleted items and one without the deleted items.
So over time the deleted partition will grow large and the occasional fetches from it will be slow (and slower over time)
Would the speed difference be something I should worry about? Since I fetch most (if not all) data by some key value (some are searches but they can be slow for this setup)
I'd partition the table on the DELETE flag.
The deleted rows will be physically kept in other place, but from SQL's point of view the table remains the same.
Oh, hell yes, index the delete field. You're going to be querying against it all the time, right? Compound indexes with other fields you query against a lot, like parent IDs, might also be a good idea.
Arguably, this decision could be made later if and only if performance problems actually appear. It very much depends on how many rows are added at what rate, your box specs, etc. Obviously, the level of abstraction in your application (and the limitations of any libraries you are using) will help determine how difficult such a change will be.
If it becomes a problem, or you are certain that it will be, start by partitioning on the deleted flag between two tables, one that holds current data and one that holds historical/deleted data. IF, as you said, the "deleted" data will only be available to administrators, it is reasonable to suppose that (in most applications) the total number of users (here limited only to admins) will not be sufficient to cause a problem. This means that your admins might need to wait a little while longer when searching that particular table, but your user base (arguably more important in most applications) will experience far less latency. If performance becomes unacceptable for the admins, you will likely want to index the user_id (or transaction_id or whatever) field you access the deleted records by (I generally index every field by which I access the table, but at certain scale there can be trade-offs regarding which indexes are most worthwhile).
Depending on how the data is accessed, there are other simple tricks you can employ. If the admin is looking for a specific record most of the time (as opposed to, say, reading a "history" or "log" of user activity), one can often assume that more recent records will be looked at more often than old records. Some DBs include tuning options for making recent records easier to find than older records, but you'll have to look it up for your particular database. Failing that, you can manually do it. The easiest way would be to have an ancient_history table that contains all records older than n days, weeks or months, depending on your constraints and suspected usage patterns. Newer data then lives inside a much smaller table. Even if the admin is going to "browse" all the records rather than searching for a specific one, you can start by showing the first n days and have a link to see all days should they not find what they are looking for (eg, most online banking applications that lets you browse transactions but shows only the first 30 days of history unless you request otherwise.)
Hopefully you can avoid having to go a step further, and sharding on user_id or some such scheme. Depending on the scale of the rest of your application, you might have to do this anyway. Unless you are positive that you will need to, I strongly suggest using vertical partitioning first (eg, keeping your forum_posts on a separate machine than your sales_records), as it is FAR easier to setup and maintain. If you end up needing to shard on user_id, I suggest using google ;-]
Good luck. BTW, I'm not a DBA so take this with a grain of salt.