Is Increment Thread safe? - mysql

I'm implementing a sequence generator for my database, via Grails. I've defined my domain class, and I want to specify a sequence. At present, I'm using:
static mapping = {
    id generator: 'uuid'
    version false
}
But this generates long, 128-bit IDs, which users might have a hard time working with. To combat this, I decided it might be better to use normal, incrementing IDs, so I found this resource, informing me about the various options I had for preconfigured sequence generators.
I had a look at increment, and found this description:
increment -
generates identifiers of type long, short or int that are unique only when no other process is inserting data into the same table. Do not use in a cluster.
I have one Grails application inserting data, but several users may be inputting data at the same time. As I understand it, Grails (like a normal Servlet) will assign a new thread to each request made from the user. Does this mean then, that increment is not a good fit, because even though there is only one application, there will be multiple threads attempting to insert?
If increment is not a good fit, what other options do I have?

If increment is not a good fit, what other options do I have?
As IsidroGH stated in the comments, if you don't specify an id generator, GORM will by default choose the native strategy to generate ids, which in the case of MySQL means auto-incrementing columns.
They are definitely thread-safe, they also work in a cluster (unlike increment), and by default they start at 1 and increment by 1, so you won't get long 128-bit IDs.

All threads in a single-instance application run in the same process, so using increment is thread-safe.
In this case, the Hibernate session associated with the process (and with each thread running in it) manages these IDs and ensures uniqueness.
Anyway, the increment mode is normally (at least in my experience) used for testing and/or early development phases.
If your database supports sequences, I'd probably use the sequence mode.

Related

Preventing insertion of duplicates without using indices

I have a MariaDB table users that looks roughly like this:
id INT PRIMARY KEY AUTO_INCREMENT,
email_hash INT, -- indexed
encrypted_email TEXT,
other_stuff JSON
For privacy reasons, I cannot store actual emails in the database.
The encryption used for emails is not 1-to-1, i.e. one email can be encrypted to many different encrypted representations. This makes it pointless to just slap an index on the encrypted_email column, as it will never catch a duplicate.
There are already data in the database and changing the encryption method or the hashing method is out of question.
The email_hash column cannot have a unique index either, as it is supposed to be a short hash to just speed up duplicate checks. It cannot be too unique, as it would void all privacy guarantees.
How can I prevent two entries with the same email from appearing in the database?
Another limitation: I probably cannot use LOCK TABLES, since according to the documentation https://mariadb.com/kb/en/library/lock-tables/
LOCK TABLES doesn't work when using Galera cluster. You may experience crashes or locks when used with Galera.
LOCK TABLES implicitly commits the active transaction, if any. Also, starting a transaction always releases all table locks acquired with LOCK TABLES.
(I do use Galera and I do need transactions as inserting a new user is accompanied with several other inserts and updates)
Since the backend application server (a monolith) is allowed to handle personal information (for example for sending email messages, verifying logins etc.) as long as it doesn't store it, I do the duplicate check in the application.
Currently, I'm doing something like this (pseudocode):
perform "START TRANSACTION"
h := hash(new_user.email)
conflicts := perform "SELECT encrypted_email FROM users WHERE email_hash = ?", h
for conflict in conflicts:
    if decrypt(conflict) == new_user.email:
        perform "ROLLBACK"
        return DUPLICATE
e := encrypt(new_user.email)
s := new_user.other_stuff
perform "INSERT INTO users (email_hash, encrypted_email, other_stuff) VALUES (?,?,?)", h, e, s
perform some other inserts as part of the transaction
perform "COMMIT"
return OK
which works fine if two attempts are separated in time. However, when two threads try to add the same user simultaneously, both transactions run in parallel, do the select, see no conflicting duplicate, and both proceed to add the user. How can I prevent that, or at least recover gracefully and immediately?
This is what the race looks like, simplified:
Two threads start their transactions
Both threads do the select and the select returns zero rows in both cases.
Both threads assume there won't be a duplicate.
Both threads add the user.
Both threads commit the transactions.
There are now two users with the same email.
Tack FOR UPDATE on the end of the SELECT.
Also, since you are using Galera, you must check for errors after COMMIT. (That is when conflicts with the other nodes are reported.)
Your pseudocode risks race conditions unless you can force the code to run serially. That is, only one request at a time can attempt to insert an email. The whole block of code you show in your pseudocode has to be in a critical section.
If you can't use LOCK TABLES you could try MariaDB's GET_LOCK() function. I'm not sure if that's compatible with Galera, that's something for you to research.
If that's not possible, you'll have to find some other method of forcing that block of code to run serially. You haven't described your programming language or your application deployment architecture. Maybe you could use some kind of distributed lock server in Redis or something like that.
But even if you can accomplish this, making the code run serially, that will probably create a bottleneck in your app. Only one thread at a time will be able to insert a new email, and you'll probably find that they queue up waiting for the global lock.
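To illustrate why serializing the check-then-insert closes the race, here is a minimal Python sketch. The process-wide `threading.Lock` stands in for whatever distributed lock you end up with (GET_LOCK(), a Redis lock, etc.), and the in-memory `users` list, `hash_email`, and `add_user` are hypothetical stand-ins for the real table and crypto:

```python
import threading

# In-memory stand-ins for the database; purely illustrative.
users = []                       # rows: (email_hash, email)
insert_lock = threading.Lock()   # stand-in for GET_LOCK()/a distributed lock

def hash_email(email):
    return hash(email) % 1024    # coarse, deliberately non-unique, like the INT column

def add_user(email):
    """Check-then-insert, serialized by the lock so the race cannot occur."""
    with insert_lock:
        h = hash_email(email)
        for eh, stored in users:
            if eh == h and stored == email:   # decrypt() step omitted in this sketch
                return "DUPLICATE"
        users.append((h, email))
        return "OK"

results = []
threads = [threading.Thread(target=lambda: results.append(add_user("a@example.com")))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Exactly one thread wins; the other always sees the duplicate.
```

With the lock in place the outcome is deterministic regardless of scheduling, which is exactly the property the SELECT ... FOR UPDATE or GET_LOCK() approaches buy you at the database layer.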
Sorry, but that is the consequence of the constraints of this system, since you cannot implement it with a unique key, which would be the proper way to do it.
Good luck.
This is too long for a comment.
You can't. You have one field where one email gets multiple values. That's of no use for identifying duplicate values.
You have another field where multiple emails have the same value. That just raises false errors on duplicates.
If you want to prevent duplicates, then I would suggest a more robust hashing mechanism that greatly reduces collisions so you can use that. Otherwise, you need to do the validation behind a PII wall.
Also too long for a comment:
To prevent duplicate entries in a table you should use a unique index, so MariaDB will be able to detect duplicates.
A 4-byte hash/checksum (INT) is not unique enough and may have too many collisions. Instead of a checksum, you should store a deterministically encrypted email (e.g. encrypted with AES-256-CTR or another block cipher) in the table; the key and iv (initialization vector) should be stored on the client. Each encrypted value will then be unique, and for security the encrypted value and the key/iv are stored in different locations.
/* Don't send the plain email, e.g. by using MariaDB's aes_encrypt function;
   we encrypt it already on the client */
encrypted_unique_email = aes_256_ctr_encrypt(user.email);
encrypted_email = encrypt(user.email);
execute("INSERT INTO users VALUES (NULL, ?, ?, ?)",
        encrypted_unique_email, encrypted_email, other_stuff);
This solution, however, will only work with an empty table, since you likely will not be able to re-encrypt the existing records.
In that case your proposal is likely the best solution; however, you need to lock the users table with LOCK TABLES users WRITE and unlock it with UNLOCK TABLES to prevent inconsistency.
You need to add another column and use it to store some one-to-one, collision-free, unrecoverable projection of the email, producing comparable output. Take any asymmetric cryptographic algorithm, generate a public-private key pair, then destroy the private key and use the public key to encrypt the e-mail. The way asymmetric cryptography works, it'll be impossible to recover the private key even if the attacker gets their hands on the public key you are using to encrypt the emails.
Note, however, that this approach has the same vulnerability as storing unsalted hashes: if the attacker gets their hands on your entire database, public key and algorithm, they can run a brute-force attack using some known e-mail dictionary and successfully find the matching e-mails in their encrypted form, thus matching accounts from your system to the actual e-mails. Deciding whether that situation is an actual security risk is up to you and your ITSec department; but I think it shouldn't be, since you seem to have a decrypt function available, so if the attacker already has access to the database AND the system innards they can just decrypt the stored e-mails.
You can take it one step further and store these encrypted e-mails in a separate table without any relation to users. When a new row is inserted to users, make sure that a row is inserted into that table as well. Combined with unique index and a transaction, this will ensure no duplicates; however, managing changes and deletions will get more cumbersome. The potential attacker will get literally nothing besides knowing that yes, some of his known e-mails are registered in the system.
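A minimal sketch of this separate-table idea, using sqlite3 as a stand-in for MariaDB and `sha256` as a stand-in for the deterministic one-way projection (the real thing would be the destroyed-private-key encryption described above); `fingerprint` and `add_user` are hypothetical names:

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, encrypted_email TEXT, other_stuff TEXT);
-- fingerprints has no relation to users: it only proves existence
CREATE TABLE fingerprints (fp TEXT PRIMARY KEY);
""")

def fingerprint(email):
    # Stand-in for the deterministic one-way projection; sha256 here for brevity.
    return hashlib.sha256(email.encode()).hexdigest()

def add_user(email, stuff):
    try:
        with conn:   # one transaction: both inserts succeed or neither does
            conn.execute("INSERT INTO fingerprints VALUES (?)", (fingerprint(email),))
            conn.execute(
                "INSERT INTO users (encrypted_email, other_stuff) VALUES (?, ?)",
                ("<encrypted>", stuff))
        return "OK"
    except sqlite3.IntegrityError:
        return "DUPLICATE"   # the unique index rejected the fingerprint

add_user("alice@example.com", "{}")   # -> "OK"
add_user("alice@example.com", "{}")   # -> "DUPLICATE"
```

The unique index on the fingerprint table does the duplicate detection atomically, so no application-level locking is needed for this particular check.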
Otherwise, you just have to make sure the writes to users table are always serialized on the software layer before DB. Write a microservice that queues user storage requests and forbid modification of users by any other means.

How to handle "View count" in redis

Our DB is mostly reads, but we want to add a "View count" and "thumbs up/thumbs down" to our videos.
When we stress tested incrementing views in mysql, our database started deadlocking.
I was thinking about handling this problem by having a redis DB that holds the view count, and only writes to the DB once the key expires. But, I hear the notifications are not consistent, and I don't want to lose the view data.
Is there a better way of going about this? Or is the talk of Redis notifications being inconsistent not true?
Thanks,
Sammy
Redis' keyspace notifications are consistent, but delivery isn't guaranteed.
If you don't want to lose data, implement your own background process that manually expires the counters - i.e. copies them to MySQL and deletes them from Redis.
There are several approaches to implementing this lazy eviction pattern. For example, you can use a Redis Hash with two fields: a value field that you can HINCRBY and a timestamp field for expiry logic purposes. Your background process can then SCAN the keyspace to identify outdated keys.
Another way is to use Sorted Sets to manage the counters. In some cases you can use just one Sorted Set, encoding both TTL and count into each member's score (using the float's integer and fractional parts, respectively), but in most cases it is simpler to use two Sorted Sets - one for TTLs and the other for values.
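As a sketch of the first (Hash-based) pattern, here is Python with a plain dict standing in for Redis and another for MySQL; `incr_view` plays the role of HINCRBY/HSET on a two-field Hash, and `flush_expired` is the background job that SCANs for stale counters:

```python
import time

# In-memory stand-in for Redis: each key maps to a hash with a count and a
# last-touched timestamp, mirroring the two-field Hash approach.
counters = {}
flushed = {}   # stand-in for the MySQL table the background job writes to

def incr_view(video_id, now=None):
    now = time.time() if now is None else now
    h = counters.setdefault(video_id, {"value": 0, "ts": now})
    h["value"] += 1          # HINCRBY view:<id> value 1
    h["ts"] = now            # HSET    view:<id> ts <now>

def flush_expired(max_age, now=None):
    """Background job: find stale counters, copy to MySQL, delete from Redis."""
    now = time.time() if now is None else now
    for key in list(counters):
        if now - counters[key]["ts"] >= max_age:
            flushed[key] = flushed.get(key, 0) + counters[key]["value"]
            del counters[key]

incr_view("v1", now=0)
incr_view("v1", now=1)
incr_view("v2", now=50)
flush_expired(max_age=30, now=60)   # v1 is stale and gets flushed; v2 stays hot
```

Because the flush is driven by your own process rather than by expiry notifications, no increments are lost even if a notification would have been dropped.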

How to Prevent MySQL UUID V1 Collision

Our API is designed to generate UUIDs in MySQL for all records.
However, 99% of the records being generated in all tables share the same last 3 blocks of the UUID. I'm assuming this is because MySQL uses v1 UUIDs, which are based on the MAC address, which doesn't change on the same server. That doesn't seem like enough entropy to have a high level of confidence in uniqueness.
e.g. XXXXXXXX-XXXX-46fc-bb08-f9b12276ed01
This is validated per Wikipedia:
"given the speed of modern processors, successive invocations on the same machine of a naive implementation of a generator of version 1 UUIDs may produce the same UUID, violating the uniqueness property. (Non-naïve implementations can avoid this problem by, for example, remembering the most recently generated UUID, "pocketing" unused UUIDs, and using pocketed UUIDs in case a duplicate is about to be generated.)"
It sounds like, if enough API calls are made within a certain amount of time, collisions would be all but certain (just a matter of reaching transactional volume, e.g. 1000 transactions a second, i.e. close to 1 transaction per millisecond).
Assumption: UUID() is function of the MySQL binary which cannot be changed.
At what volume do I need to evaluate a change to prevent collisions and how would I make the wikipedia recommended change in MySQL to "pocket" UUIDs?
Put a unique constraint on your UUID column. That'll make the database check for duplicates before inserting (or updating) a record, so you can be sure there are no collisions in the table. The colliding record will just fail to insert.
If you find that you're actually getting errors due to violation of that constraint — i.e. if collisions are actually happening in the UUID generator and the database is keeping them out of the table — then you can look into more sophisticated methods to generate a new UUID and try again. But chances are, you won't have any problems.
The timestamp field in the UUID is measured in 100-nanosecond intervals, so you'd have to generate two UUIDs within a tenth of a microsecond to get a collision. That corresponds to a rate of ten million transactions per second. A thousand should be fine.
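A sketch of the constraint-plus-retry idea, using Python's sqlite3 as a stand-in for MySQL; `insert_record` and the injectable `make_uuid` parameter are hypothetical, and the forced collision at the end just demonstrates the constraint doing its job:

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id TEXT PRIMARY KEY, payload TEXT)")

def insert_record(payload, make_uuid=lambda: str(uuid.uuid1()), retries=3):
    """Insert with a fresh UUID; on a (rare) collision, generate a new one."""
    for _ in range(retries):
        try:
            conn.execute("INSERT INTO records VALUES (?, ?)",
                         (make_uuid(), payload))
            return True
        except sqlite3.IntegrityError:
            continue   # duplicate UUID: the constraint kept it out; try again
    return False

insert_record("first")
# Force a collision to show the constraint catching it:
dup = lambda: "fixed-uuid"
insert_record("a", make_uuid=dup)        # succeeds
ok = insert_record("b", make_uuid=dup)   # collides on every retry -> False
```

In practice the retry branch should essentially never fire with real UUIDs; the point is that the table can never contain a duplicate either way.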

Is there / would be feasible a service providing random elements from a given SQL table?

ABSTRACT
Talking with some colleagues we came across the "extract a random row from a big database table" issue. It's a classic one and we know the naive approach (also on SO) is usually something like:
SELECT * FROM mytable ORDER BY RAND() LIMIT 1
THE PROBLEM
We also know a query like that is utterly inefficient and actually usable only with very few rows. There are some approaches that could be taken to attain better efficiency, like these ones still on SO, but they won't work with arbitrary primary keys and the randomness will be skewed as soon as you have holes in your numeric primary keys. An answer to the last cited question links to this article which has a good explanation and some bright solutions involving an additional "equal distribution" table that must be maintained whenever the "master data" table changes. But then again if you have frequent DELETEs on a big table you'll probably be screwed up by the constant updating of the added table. Also note that many solutions rely on COUNT(*) which is ridiculously fast on MyISAM but "just fast" on InnoDB (I don't know how it performs on other platforms but I suspect the InnoDB case could be representative of other transactional database systems).
In addition to that, even the best solutions I was able to find are fast but not Ludicrous Speed fast.
THE IDEA
A separate service could be responsible to generate, buffer and distribute random row ids or even entire random rows:
it could choose the best method to extract random row ids depending on how the original PKs are structured. An ordered list of keys could be maintained in ram by the service (shouldn't take too many bytes per row in addition to the actual size of the PK, it's probably ok up to 100~1000M rows with standard PCs and up to 1~10 billion rows with a beefy server)
once the keys are in memory you have an implicit "row number" for each key and no holes in it so it's just a matter of choosing a random number and directly fetch the corresponding key
a buffer of random keys ready to be consumed could be maintained to quickly respond to spikes in the incoming requests
consumers of the service will connect and request N random rows from the buffer
rows are returned as simple keys or the service could maintain a (pool of) db connection(s) to fetch entire rows
if the buffer is empty the request could block or return EOF-like
if data is added to the master table the service must be signaled to add the same data to its copy too, flush the buffer of random picks and go on from that
if data is deleted from the master table the service must be signaled to remove that data too from both the "all keys" list and "random picks" buffer
if data is updated in the master table the service must be signaled to update corresponding rows in the key list and in the random picks
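The core of the service described above can be sketched in a few lines of Python; this is just the in-memory key list with O(1) random pick, insert, and delete (via swap-with-last), not the buffering or signaling parts, and `RandomKeyService` is a hypothetical name:

```python
import random

class RandomKeyService:
    """Keeps all primary keys in memory; picking a random row key is O(1)."""
    def __init__(self, keys):
        self.keys = list(keys)                       # implicit row numbers 0..n-1, no holes
        self.index = {k: i for i, k in enumerate(self.keys)}

    def random_key(self):
        return self.keys[random.randrange(len(self.keys))]

    def add(self, key):
        # Signaled on INSERT into the master table.
        self.index[key] = len(self.keys)
        self.keys.append(key)

    def remove(self, key):
        # Signaled on DELETE: swap the victim with the last element, then pop,
        # so the list stays dense and every key remains equally likely.
        i = self.index.pop(key)
        last = self.keys.pop()
        if last != key:
            self.keys[i] = last
            self.index[last] = i

svc = RandomKeyService(["a", "b", "c"])
svc.remove("b")
svc.add("d")
```

The swap-with-last trick is what keeps deletes O(1) while preserving the "no holes" property that makes the random pick uniform.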
WHY WE THINK IT'S COOL
does not touch disks other than the initial load of keys at startup or when signaled to do so
works with any kind of primary key, numerical or not
if you know you're going to update a large batch of data you can just signal it when you're done (i.e. not at every single insert/update/delete on the original data), it's basically like having a fine grained lock that only blocks requests for random rows
really fast on updates of any kind in the original data
offloads some work from the relational db to another, memory only process: helps scalability
responds really fast from its buffers without waiting for any querying, scanning, sorting
could easily be extended to similar use cases beyond the SQL one
WHY WE THINK IT COULD BE A STUPID IDEA
because we had the idea without help from any third party
because nobody (that we've heard of) has ever bothered to do something similar
because it adds complexity in the mix to keep it updated whenever original data changes
AND THE QUESTION IS...
Does anything similar already exist? If not, would it be feasible? If not, why?
The biggest risk with your "cache of eligible primary keys" concept is keeping the cache up to date, when the origin data is changing continually. It could be just as costly to keep the cache in sync as it is to run the random queries against the original data.
How do you expect to signal the cache that a value has been added/deleted/updated? If you do it with triggers, keep in mind that a trigger can fire even if the transaction that spawned it is rolled back. This is a general problem with notifying external systems from triggers.
If you notify the cache from the application after the change has been committed in the database, then you have to worry about other apps that make changes without being fitted with the signaling code. Or ad hoc queries. Or queries from apps or tools for which you can't change the code.
In general, the added complexity is probably not worth it. Most apps can tolerate some compromise and they don't need an absolutely random selection all the time.
For example, the inequality lookup may be acceptable for some needs, even with the known weakness that numbers following gaps are chosen more often.
Or you could pre-select a small number of random values (e.g. 30) and cache them. Let app requests choose from these. Every 60 seconds or so, refresh the cache with another set of randomly chosen values.
Or choose a random value evenly distributed between MIN(id) and MAX(id). Try a lookup by equality, not inequality. If the value corresponds to a gap in the primary key, just loop and try again with a different random value. You can terminate the loop if it's not successful after a few tries. Then try another method instead. On average, the improved simplicity and speed of an equality lookup may make up for the occasional retries.
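The equality-lookup-with-retry idea can be sketched like this in Python, with sqlite3 standing in for the real database and deliberate gaps left in the ids; `random_row` is a hypothetical helper:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, val TEXT)")
# Insert rows with gaps in the id sequence (3 and 7 are missing).
for i in [1, 2, 4, 5, 6, 8, 9, 10]:
    conn.execute("INSERT INTO mytable VALUES (?, ?)", (i, f"row{i}"))

def random_row(max_tries=10):
    lo, hi = conn.execute("SELECT MIN(id), MAX(id) FROM mytable").fetchone()
    for _ in range(max_tries):
        candidate = random.randint(lo, hi)
        row = conn.execute("SELECT * FROM mytable WHERE id = ?",
                           (candidate,)).fetchone()
        if row is not None:
            return row   # equality hit: a cheap point lookup on the PK
        # candidate fell into a gap; loop and try another value
    return None          # caller falls back to a slower method
```

Each attempt is a primary-key point lookup, so even a few retries are typically cheaper than an ORDER BY RAND() scan; the failure path only matters when the table is mostly gaps.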
It appears you are basically addressing a performance issue here. Most DB performance experts recommend you have as much RAM as your DB size, then disk is no longer a bottleneck - your DB lives in RAM and flushes to disk as required.
You're basically proposing a custom developed in-RAM CDC Hashing system.
You could just build this as a standard database only application and lock your mapping table in RAM, if your DB supports this.
I guess I am saying that you can address performance issues without developing custom applications, just use already existing performance tuning methods.

Unique, numeric, incremental identifier

I need to generate unique, incremental, numeric transaction id's for each request I make to a certain XML RPC. These numbers only need to be unique across my domain, but will be generated on multiple machines.
I really don't want to have to keep track of this number in a database and deal with row locking etc on every single transaction. I tried to hack this using a microsecond timestamp, but there were collisions with just a few threads - my application needs to support hundreds of threads.
Any ideas would be appreciated.
Edit: What if each transaction id just has to be larger than the previous request's?
If you're going to be using this from hundreds of threads, working on multiple machines, and require an incremental ID, you're going to need some centralized place to store and lock the last generated ID number. This doesn't necessarily have to be in a database, but that would be the most common option. A central server that did nothing but serve IDs could provide the same functionality, but that probably defeats the purpose of distributing this.
If they need to be incremental, any form of timestamp won't be guaranteed unique.
If you don't need them to be incremental, a GUID would work. Potentially doing some type of merge of the timestamp + a hardware ID on each system could give unique identifiers, but the ID number portion would not necessarily be unique.
Could you use a pair of Hardware IDs + incremental timestamps? This would make each specific machine's IDs incremental, but not necessarily be unique across the entire domain.
---- EDIT -----
I don't think using any form of timestamp is going to work for you, for 2 reasons.
First, you'll never be able to guarantee that 2 threads on different machines won't try to schedule at exactly the same time, no matter what resolution of timer you use. At a high enough resolution, it would be unlikely, but not guaranteed.
Second, to make this work, even if you could resolve the collision issue above, you'd have to get every system to have exactly the same clock with microsecond accuracy, which isn't really practical.
This is a very difficult problem, particularly if you don't want to create a performance bottleneck. You say that the IDs need to be 'incremental' and 'numeric' -- is that a concrete business constraint, or one that exists for some other purpose?
If these aren't necessary you can use UUIDs, which most common platforms have libraries for. They allow you to generate many (millions!) of IDs in very short timespans and be quite comfortable with no collisions. The relevant article on wikipedia claims:
In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%.
If you remove 'incremental' from your requirements, you could use a GUID.
I don't see how you can implement incremental across multiple processes without some sort of common data.
If you target a Windows platform, have you tried the Interlocked API?
Google for GUID generators for whatever language you are looking for, and then convert that to a number if you really need it to be numeric. It isn't incremental though.
Or have each thread "reserve" a thousand (or million, or billion) transaction IDs and hand them out one at a time, and "reserve" the next bunch when it runs out. Still not really incremental.
I'm with the GUID crowd, but if that's not possible, could you consider using db4o or SQLite over a heavy-weight database?
If each client can keep track of its own "next id", then you could talk to a central server and get a range of ids, perhaps 1000 at a time. Once a client runs out of ids, it will have to talk to the server again.
This would make your system have a central source of id's, and still avoid having to talk to the database for every id.
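The range-reservation scheme from the last two answers can be sketched in Python; `IdServer` stands in for the central source (which in practice would be a database row updated atomically or a small network service), and each `Client` burns through its reserved block locally:

```python
import threading

class IdServer:
    """Central allocator: hands out disjoint, increasing ranges of ids."""
    def __init__(self, block_size=1000):
        self.block_size = block_size
        self.next_start = 1
        self.lock = threading.Lock()   # stands in for the server-side atomic update

    def reserve(self):
        with self.lock:
            start = self.next_start
            self.next_start += self.block_size
            return start, start + self.block_size - 1

class Client:
    """Hands out ids locally; only talks to the server when its range is spent."""
    def __init__(self, server):
        self.server = server
        self.current, self.limit = 1, 0   # empty range forces a reserve() first

    def next_id(self):
        if self.current > self.limit:
            self.current, self.limit = self.server.reserve()
        nid = self.current
        self.current += 1
        return nid

server = IdServer(block_size=1000)
a, b = Client(server), Client(server)
# a gets 1..1000, b gets 1001..2000; ids never collide across clients
```

Note the trade-off the answers mention: ids are unique across the domain and incremental within each client, but not globally ordered, and unused ids in a block are lost if a client dies.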