I have a MariaDB table users that looks roughly like this:
id INT PRIMARY KEY AUTO_INCREMENT,
email_hash INT, -- indexed
encrypted_email TEXT,
other_stuff JSON
For privacy reasons, I cannot store actual emails in the database.
The encryption used for emails is not 1-to-1, i.e. one email can be encrypted to many different encrypted representations. This makes it pointless to just slap a unique index on the encrypted_email column, as it will never catch a duplicate.
There is already data in the database, and changing the encryption method or the hashing method is out of the question.
The email_hash column cannot have a unique index either, as it is supposed to be a short hash that just speeds up duplicate checks. It cannot be too unique, as that would void all privacy guarantees.
How can I prevent two entries with the same email from appearing in the database?
Another limitation: I probably cannot use LOCK TABLES, as according to the documentation https://mariadb.com/kb/en/library/lock-tables/
LOCK TABLES doesn't work when using Galera cluster. You may experience crashes or locks when used with Galera.
LOCK TABLES implicitly commits the active transaction, if any. Also, starting a transaction always releases all table locks acquired with LOCK TABLES.
(I do use Galera and I do need transactions as inserting a new user is accompanied with several other inserts and updates)
Since the backend application server (a monolith) is allowed to handle personal information (for example for sending email messages, verifying logins etc.) as long as it doesn't store it, I do the duplicate check in the application.
Currently, I'm doing something like this (pseudocode):
perform "START TRANSACTION"
h := hash(new_user.email)
conflicts := perform "SELECT encrypted_email FROM users WHERE email_hash = ?", h
for conflict in conflicts :
if decrypt(conflict) == new_user.email :
perform "ROLLBACK"
return DUPLICATE
e := encrypt(new_user.email)
s := new_user.other_stuff
perform "INSERT INTO users (email_hash, encrypted_email, other_stuff) VALUES (?,?,?)", h, e, s
perform some other inserts as part of the transaction
perform "COMMIT"
return OK
which works fine if two attempts are separated in time. However, when two threads try to add the same user simultaneously, both transactions run in parallel, do the select, see no conflicting duplicate, and then both proceed to add the user. How can I prevent that, or at least recover from it gracefully and immediately?
This is what the race looks like, simplified:
Two threads start their transactions
Both threads do the select and the select returns zero rows in both cases.
Both threads assume there won't be a duplicate.
Both threads add the user.
Both threads commit the transactions.
There are now two users with the same email.
Tack FOR UPDATE on the end of the SELECT.
Also, since you are using Galera, you must check for errors after COMMIT. (That is when conflicts with the other nodes are reported.)
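In the flow from the question, that looks roughly like the sketch below. This is a minimal sketch, assuming a Python DB-API driver such as mysql-connector or PyMySQL (hence the %s placeholders); encrypt, decrypt and email_hash stand in for the application's own helpers from the pseudocode.

# Minimal sketch: duplicate check with FOR UPDATE and a checked COMMIT.
# Assumes a DB-API connection `conn`; encrypt/decrypt/email_hash are the
# application's own helpers from the question's pseudocode.
def insert_user(conn, new_user):
    h = email_hash(new_user.email)
    cur = conn.cursor()
    try:
        cur.execute("START TRANSACTION")
        # FOR UPDATE locks the matching index entries (and, under InnoDB's
        # default isolation level, the gap), so two transactions checking the
        # same email_hash conflict instead of both sailing through.
        cur.execute(
            "SELECT encrypted_email FROM users WHERE email_hash = %s FOR UPDATE",
            (h,),
        )
        for (enc,) in cur.fetchall():
            if decrypt(enc) == new_user.email:
                conn.rollback()
                return "DUPLICATE"
        cur.execute(
            "INSERT INTO users (email_hash, encrypted_email, other_stuff) "
            "VALUES (%s, %s, %s)",
            (h, encrypt(new_user.email), new_user.other_stuff),
        )
        # ... the other inserts/updates that belong to the same transaction ...
        conn.commit()   # Galera reports certification conflicts here, so the
        return "OK"     # COMMIT itself must be error-checked.
    except Exception:
        conn.rollback()
        raise

A conflicting transaction will then typically fail with a deadlock or certification error rather than silently inserting a duplicate, so the caller should be prepared to retry.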
Your pseudocode risks race conditions unless you can force the code to run serially. That is, only one request at a time can attempt to insert an email. The whole block of code you show in your pseudocode has to be in a critical section.
If you can't use LOCK TABLES, you could try MariaDB's GET_LOCK() function. I'm not sure whether that's compatible with Galera; that's something for you to research.
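If it does pan out, GET_LOCK() usage is roughly the following; the lock name is arbitrary (the one below is made up), and the same Galera caveat applies, since the lock lives on a single node only.

# Rough sketch: serialize the insert path with a named lock.
# 'users_email_insert' is an arbitrary, made-up lock name.
def insert_user_serialized(conn, new_user, timeout=10):
    cur = conn.cursor()
    cur.execute("SELECT GET_LOCK('users_email_insert', %s)", (timeout,))
    (got_lock,) = cur.fetchone()
    if got_lock != 1:
        raise RuntimeError("could not acquire users_email_insert lock")
    try:
        return insert_user(conn, new_user)   # the transactional check sketched above
    finally:
        cur.execute("SELECT RELEASE_LOCK('users_email_insert')")
        cur.fetchone()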
If that's not possible, you'll have to find some other method of forcing that block of code to run serially. You haven't described your programming language or your application deployment architecture. Maybe you could use some kind of distributed lock server in Redis or something like that.
But even if you can accomplish this, making the code run serially, that will probably create a bottleneck in your app. Only one thread at a time will be able to insert a new email, and you'll probably find that they queue up waiting for the global lock.
Sorry, but that is the consequence of the constraints of this system, since you cannot implement it with a unique key, which would be the proper way to do it.
Good luck.
This is too long for a comment.
You can't. You have one field where one email gets multiple values. That's of no use for identifying duplicate values.
You have another field where multiple emails have the same value. That just raises false errors on duplicates.
If you want to prevent duplicates, then I would suggest a more robust hashing mechanism that greatly reduces collisions so you can use that. Otherwise, you need to do the validation behind a PII wall.
Also too long for a comment:
To prevent duplicate entries in a table you should use a unique index, so MariaDB will be able to detect duplicates.
A 4-byte hash/checksum (INT) is not unique enough and might have too many collisions. Instead of a checksum, you should store the encrypted email (e.g. encrypted using AES-256-CTR or another cipher) in the table; the key and IV (initialization vector) should be stored on the client. Each encrypted value will now be unique, and for security the encrypted value and the key/IV are stored in different locations.
/* Don't send the plain email, e.g. by using MariaDB's AES_ENCRYPT function;
   we encrypt it already on the client */
encrypted_unique_email = aes_256_ctr_encrypt(user.email);
encrypted_email = encrypt(user.email);
execute("INSERT INTO users VALUES (NULL, encrypted_unique_email, encrypted_email, other_stuff) ...
This solution, however, will only work with an empty table, since you likely will not be able to decrypt the existing records.
In that case your own proposal might be the best solution; however, you would need to lock the users table with LOCK TABLES users WRITE and unlock it with UNLOCK TABLES to prevent inconsistency.
You need to add another column and use it to store some one-to-one, collision-free, unrecoverable projection from the email to some comparable output. Take any asymmetric cryptographic algorithm, generate a public-private key pair, then destroy the private key and store the public key to encrypt the e-mail. The way asymmetric cryptography works, it will be impossible to recover the private key even if the attacker gets their hands on the public key you are using to encrypt the emails.
Note, however, that this approach has the same vulnerability as storing unsalted hashes: if the attacker gets their hands on your entire database, the public key and the algorithm, they can run a brute-force attack using some known e-mail dictionary and successfully find the matching e-mails in their encrypted form, thus matching accounts from your system to actual e-mails. Deciding whether that situation is an actual security risk is up to you and your ITSec department; but I think it shouldn't be, since you seem to have a decrypt function available, so if the attacker already has access to the database AND the system innards they can just decrypt the stored e-mails.
You can take it one step further and store these encrypted e-mails in a separate table without any relation to users. When a new row is inserted into users, make sure that a row is inserted into that table as well. Combined with a unique index and a transaction, this will ensure no duplicates; however, managing changes and deletions will get more cumbersome. The potential attacker will get literally nothing besides knowing that yes, some of their known e-mails are registered in the system.
Otherwise, you just have to make sure the writes to the users table are always serialized in the software layer before the DB. Write a microservice that queues user storage requests and forbid modification of users by any other means.
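A rough sketch of that separate-table variant: fingerprint() below stands in for the deterministic one-way projection described above, and the table and column names are made up for illustration.

# CREATE TABLE email_fingerprints (
#     fingerprint VARBINARY(512) NOT NULL,
#     UNIQUE KEY uq_fingerprint (fingerprint)
# );
def insert_user_with_fingerprint(conn, new_user):
    cur = conn.cursor()
    try:
        cur.execute("START TRANSACTION")
        # The unique index turns a duplicate email into a hard error here,
        # no matter how many clients race.
        cur.execute(
            "INSERT INTO email_fingerprints (fingerprint) VALUES (%s)",
            (fingerprint(new_user.email),),
        )
        cur.execute(
            "INSERT INTO users (email_hash, encrypted_email, other_stuff) "
            "VALUES (%s, %s, %s)",
            (email_hash(new_user.email), encrypt(new_user.email), new_user.other_stuff),
        )
        conn.commit()
        return "OK"
    except Exception as exc:
        conn.rollback()
        # Duplicate-key violations are MySQL error 1062; how the driver exposes
        # the code varies (e.g. exc.errno or exc.args[0]). On Galera, a conflict
        # with another node may instead surface as an error at COMMIT.
        code = getattr(exc, "errno", None) or (exc.args[0] if exc.args else None)
        if code == 1062:
            return "DUPLICATE"
        raise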
Related
I am working on a platform where unique user IDs are identity IDs from an Amazon Cognito identity pool, which look like this: "us-east-1:128d0a74-c82f-4553-916d-90053e4a8b0f"
The platform has a MySQL database that has a table of items that users can view. I need to add a favorites table that holds every favorited item of every user. This table could possibly grow to millions of rows.
The layout of the 'favorites' table would look like so:
userID, itemID, dateAdded
where userID and itemID together are a composite primary key.
My understanding is that this type of userID (practically an expanded UUID, which needs to be stored as a CHAR or VARCHAR) gives poor indexing performance. So using it as a key or index for millions of rows is discouraged.
My question is: Is my understanding correct, and should I be worried about performance later on due to this key? Are there any mitigations I can take to reduce performance risks?
My overall database knowledge isn't that great, so if this is a large problem: would moving the favorites list to a NoSQL table (where the userID as a key would allow constant access time), and retrieving an array of favorited item IDs to be used in a SELECT ... WHERE IN query, be an acceptable alternative?
Thanks so much!
OK, so here I want to explain why this is not good, give an alternative, and describe the read/write workflow of your application.
Why not: this is not a good architecture because if something happens to your Cognito user pool, you can't repopulate it with the same IDs for each individual user. Moreover, Cognito is being offered in more regions now compared to last year. Let's say your user base is in Indonesia and Cognito becomes available in Singapore; you want to move your user pools from Tokyo to Singapore because of latency. Not only do you have the problem of moving the users, you also have the issue of repopulating your database. So this approach lacks scalability and maintainability, and it breaks the single responsibility principle (updating Cognito requires you to update the DB and vice versa).
Alternative solution: leave the DB index to the DB domain, and use the username as the link between your DB and your Cognito user pool. So:
The read workflow will be:
User authentication: the user authenticates and gets the token.
Your app verifies the token and gets the username from its payload.
Your app contacts the DB and gets the user's information based on the username.
Your app brings the user to their page and provides the information stored in the database.
The write workflow will be:
Your app gets the write request from the user along with the token.
It verifies the token.
It writes to the database based on the unique username.
Regarding MySQL, if you use the UserID and CognitoID composite as the primary key, it has a negative impact on query performance and is therefore not recommended for a large dataset.
However, using this (or even just the UserID) with NoSQL DynamoDB is more suitable, unless you have complex queries. You can also enforce security with AWS DynamoDB fine-grained access control connected to Cognito Identity Pools.
While Cognito itself has some issues, which are discussed in this article (too many to list here), it's a terrible idea to use Cognito and then create a completely separate user ID to use as a PK. First of all, it is also going to be a CHAR or VARCHAR, so it doesn't actually help. Additionally, now you have extra complexity to deal with an imaginary problem. If you don't like what Cognito is giving you, then either pair it with another solution or replace it altogether.
Don't overengineer your solution to solve a trivial case that may never come up. Use the Cognito userId because you use Cognito. 99.9999% of the time this is all you need and will support your use case.
Specifically, this SO post explains that there are zero problems with your approach:
There's nothing wrong with using a CHAR or VARCHAR as a primary key.
Sure it'll take up a little more space than an INT in many cases, but there are many cases where it is the most logical choice and may even reduce the number of columns you need, improving efficiency, by avoiding the need to have a separate ID field.
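For what it's worth, the table as described in the question would then just be something like the following sketch (the column sizes and the extra itemID index are assumptions, not requirements):

# Illustrative DDL for the favorites table, keeping the Cognito identity ID
# as the string key. Sizes/types are assumptions.
FAVORITES_DDL = """
CREATE TABLE favorites (
    userID    VARCHAR(64)  NOT NULL,  -- Cognito identity ID, e.g. 'us-east-1:128d0a74-...'
    itemID    INT UNSIGNED NOT NULL,
    dateAdded DATETIME     NOT NULL,
    PRIMARY KEY (userID, itemID),     -- the composite key from the question
    KEY idx_item (itemID)             -- optional: find all users who favorited an item
) ENGINE=InnoDB
"""

def create_favorites_table(conn):
    conn.cursor().execute(FAVORITES_DDL)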
I have a table which stores my user's location very frequently. I want to query this table frequently and return the newest rows I haven't read yet.
What would be the best-practice way to do this? My ideas are:
Add a boolean read flag, query all results where this is false, return them, and then update them ALL. This might slow things down with the extra writes.
Save the ID of the last read row on the client side and query for rows greater than this. The only issue here is that my client could lose its place.
Some stream of data.
There will eventually be multiple users and readers of the locations, so this will need to scale somewhat.
If what you have is a SQL database storing rows of things, I'd suggest something like option 2.
What I would probably do is keep a timestamp rather than an ID, and an index on that (a clustered index on MSSQL, or a similar construct, so that new rows are physically sorted by time). Then just query for anything newer than that.
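In code that's something like the sketch below (table and column names are placeholders; it assumes the timestamp column is indexed as described, and a Python DB-API style connection):

# Sketch: remember the newest timestamp we've seen and only fetch rows after
# it. Table/column names are placeholders; recorded_at is assumed indexed.
def fetch_new_locations(conn, last_seen):
    cur = conn.cursor()
    cur.execute(
        "SELECT id, user_id, lat, lon, recorded_at "
        "FROM locations "
        "WHERE recorded_at > %s "
        "ORDER BY recorded_at",
        (last_seen,),
    )
    rows = cur.fetchall()
    if rows:
        last_seen = rows[-1][-1]   # advance the bookmark to the newest row read
    return rows, last_seen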
That does have the "losing their place" issue. If the client MUST read every row published, then I'd either delete them after processing, or have a flag in the database to indicate that they have been processed. If the client just needs to restart reading current data, then I would do as above, but initialize the time with the most recent existing row.
If you MUST process every record and aren't limited to a database, what you're really talking about is a message queue. If you need to be able to access the individual data points after processing, then one step of the message handling could be to insert into a database for later querying (in addition to whatever else is done with the data read).
Edit per comments:
If there's no processing that needs to be done when receiving, but you just want to periodically update data, then you'd be fine with the solution of keeping the last received time or ID and not deleting the data. In that case I would recommend not persisting a last known ID/timestamp across restarts/reconnects, since you might end up inadvertently loading a bunch of data. Just reset it to the max when you restart.
On another note, when I did stuff like this I had good success using MQTT to transmit the data and for the "live" updates. That is a pub/sub messaging protocol. You could have a process subscribing on the back end and forwarding data to the database, while the thing that wants the data frequently can subscribe directly to the stream for live updates. There's also a feature to hold onto the last published message and forward it to new subscribers so you don't start out completely empty.
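For reference, the retained-message behaviour mentioned above looks roughly like this with paho-mqtt (1.x-style API; the broker host and topic are placeholders):

# Sketch with paho-mqtt (1.x-style API). Broker host and topic are placeholders.
# retain=True asks the broker to hand the last published location to any new
# subscriber immediately, so it doesn't start out empty.
import paho.mqtt.client as mqtt
import paho.mqtt.publish as publish

# publisher side (whatever records the locations)
publish.single("users/42/location", '{"lat": 1.23, "lon": 4.56}',
               retain=True, hostname="broker.example.com")

# subscriber side (e.g. the process that forwards rows to the database,
# or a client that wants live updates)
def on_message(client, userdata, msg):
    print(msg.topic, msg.payload.decode())

sub = mqtt.Client()
sub.on_message = on_message
sub.connect("broker.example.com", 1883)
sub.subscribe("users/+/location")
sub.loop_forever()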
Suppose I have a couple of scripts sending (legitimate!) emails. Each script handles a part of a bigger list, and they run concurrently. Before sending, every address has to be checked to avoid sending to the same address twice.
To do this, I created a simple table (MySQL 5.1, InnoDB) with just the email address. If it's not in the table, then add it and send the mail. Now I need to avoid the race condition where multiple scripts test the same address at the same time and erroneously conclude it hasn't been sent to. I guess I could use locks for this, but I'd rather not, for performance reasons.
So I'd like to know if the following alternative is correct:
adding a unique index on the address column
just insert the address, without checking by selecting
trap the MySQL error code returned: if it's 1062, the address already existed.
In this setup, is there still a possibility for a race condition? I mean: is it still possible that two scripts that insert an address at almost the same time both conclude that the mail has not been sent? Or should I use locks for this?
Thanks,
Stijn
Firstly, I feel the database isn't the best place for this. While your bigger list is sending out email (I'm guessing on a very large scale, given your attempt at parallelisation) you must be using a temporary table, given that you wouldn't want to restrict sending a different email to a recipient of a previous mailing.
A cache maintaining a list of addresses would be the obvious choice here, or a server acting as a shared memory resource.
However, you could do it in the database, and from my understanding it isn't really vital if one email address exists more than once, as all you're doing is checking that it hasn't been sent to in the past. You can't really control the race condition of multiple scripts sending to the same address at the same time without a locking policy. You could, however, make it more efficient by using an index. I wouldn't index the actual address but create a new column with a CRC32 hash of the address (which can be a 32-bit unsigned integer that only takes 4 bytes of memory). Using the CRC32 approach you will also have to check the email address in the query, because of hash collisions (the birthday paradox).
For example:
SELECT COUNT(*) FROM email_addresses
WHERE email_address_crc = CRC32(?address)
AND email_address = ?address
Having something efficient should help against race conditions; however, as I've said before, the only way to guarantee it is to lock the database while each email is being sent so you can maintain an exact list. This unfortunately doesn't scale, and it would mean that having parallel tasks sending email would probably not help.
Edit in response to comments below:
As pointed out in the comments, I actually forgot to address svdr's alternative to a locking solution. It is true that a unique index containing the email address (or a composite index containing the campaign ID and address) would indeed throw a MySQL error if the address exists, resulting in a working solution even with parallel scripts sending to the same address at the same time. However, it is very hard to handle exceptions such as not sending the email due to SMTP errors / network issues when the address is entered before the script tries to send the email; this could result in a recipient not receiving an email. Also, provided this is a very simple INSERT and SELECT, it should be fine just to trap the MySQL exception; however, if there is anything more complex, such as wrapping commands in transactions or using SELECT FOR UPDATE etc., this can result in a deadlock situation.
A couple of other considerations: the email address field would need to be fully indexed for performance reasons, and if using InnoDB the index key limit is 767 bytes. Given that the maximum valid length of an email address is 254 characters (+1 byte for the length prefix if using VARCHAR), you should be fine provided you don't have some huge primary key.
Index performance should be addressed too, and CHAR vs VARCHAR should be evaluated. Index lookups on a CHAR field are usually between 15% and 25% faster than the equivalent VARCHAR lookup, and fixed-width rows can also help, depending on the table engine used.
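To make the non-locking variant concrete, here is a sketch of the insert-then-send flow (Python DB-API style; the compensating DELETE on a send failure is just one way of handling the caveat above, not the only one):

# Sketch of the non-locking approach: rely on the unique index on
# email_addresses.email_address and trap the duplicate-key error (MySQL 1062).
def claim_and_send(conn, address, send_mail):
    cur = conn.cursor()
    try:
        cur.execute(
            "INSERT INTO email_addresses (email_address) VALUES (%s)", (address,)
        )
        conn.commit()
    except Exception as exc:
        conn.rollback()
        # Drivers expose the error code differently (exc.errno or exc.args[0]).
        code = getattr(exc, "errno", None) or (exc.args[0] if exc.args else None)
        if code == 1062:
            return "already sent (or claimed) by another script"
        raise
    try:
        send_mail(address)
    except Exception:
        # SMTP/network failure after claiming the address: release the claim
        # so the recipient isn't silently skipped.
        cur.execute("DELETE FROM email_addresses WHERE email_address = %s", (address,))
        conn.commit()
        raise
    return "sent"

Note this doesn't touch the CRC32 lookup column; if you add one, insert the CRC alongside the address so the SELECT above keeps working.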
To summarise: yes, your non-locking solution would work, but it should be tested and evaluated carefully against your exact requirements (I cannot comment on specifics, as I would assume your real-life scenario is more complex than your SO question). As stated in the first line of the answer, I still believe the database isn't the best place for this, and a cache or shared memory space would be more efficient and easier to implement.
I have noticed that using something like delayed_job without a UNIQUE constraint on a table column would still create double entries in the DB. I had assumed delayed_job would run jobs one after another. The Rails app runs on Apache with Phusion Passenger. I am not sure if that is the reason why this happens, but I would like to make sure that every item in the queue is persisted to AR/DB one after another, in sequence, and that there is never more than one write to this DB table happening at the same time. Is this possible? What would be some of the issues that I would have to deal with?
Update:
The race conditions arise because an AJAX API is used to send data to the application. The application receives batches of data, and each batch is identified as belonging together by a session ID (SID). In the end, the final state of the database has to reflect the latest, most up-to-date AJAX PUT query to the API. Sometimes queries arrive at exactly the same time for the same SID -- so I need a way to make sure they aren't all persisted at the same time, but one after the other, or simply the last one sent by AJAX request to the API.
I hope that makes my particular use-case easier to understand...
You can lock a specific table (or tables) with the LOCK TABLES statement.
In general I would say that relying on this is poor design and will likely lead to scalability problems down the road, since you're creating a bottleneck in your application flow.
With your further explanations, I'd be tempted to add some extra columns to the table used by delayed_job, with a unique index on them. If (for example) you only ever wanted 1 job per user you'd add a user_id column and then do
something.delay(:user_id => user_id).some_method
You might need more attributes if the pattern is more sophisticated, e.g. there are lots of different types of jobs and you only wanted one per person, per type, but the principle is the same. You'd also want to be sure to rescue ActiveRecord::RecordNotUnique and deal with it gracefully.
For non-delayed_job stuff, optimistic locking is often a good compromise: it handles the concurrent cases well without slowing down the non-concurrent cases.
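For reference, the pattern behind optimistic locking is just a version column and a conditional UPDATE; ActiveRecord's built-in lock_version column does essentially this for you (raising ActiveRecord::StaleObjectError on a conflict). The table and column names below are placeholders.

# The general pattern behind optimistic locking: update only if the version we
# originally read is still current, and treat "0 rows updated" as a conflict.
def update_with_optimistic_lock(conn, row_id, new_state, seen_version):
    cur = conn.cursor()
    cur.execute(
        "UPDATE records SET state = %s, lock_version = lock_version + 1 "
        "WHERE id = %s AND lock_version = %s",
        (new_state, row_id, seen_version),
    )
    if cur.rowcount == 0:
        conn.rollback()
        raise RuntimeError("row changed by someone else; reload and retry")
    conn.commit()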
If you are worried about multiple processes writing to the 'same' rows - as in multiple users updating the same order_header row - I'd suggest you set some marker bound to the current_user.id on the row once /order_headers/:id/edit is called, and remove it again once the current_user releases the row, either by updating or by cancelling the edit.
Your use case (from your description) seems a bit different to me, so I'd suggest you leave it to the DB. With a fairly recent (as in post-5.1) MySQL, you'd add a trigger/function which would do the actual update, and there you could implement logic similar to the above: some marker bound to the sequenced job ID of sorts.
I'm building an application that requires extensive logging of the actions of users, payments, etc.
Am I better off with a monolithic logs table and just logging EVERYTHING into that, or is it better to have separate log tables for each type of action I'm logging (log_payment, log_logins, log_acc_changes)?
For example, currently I'm logging users' interactions with a payment gateway: when they sign up for a trial, when the trial becomes a subscription, when it gets rebilled or refunded, whether there was a failure, etc.
I'd also like to start logging actions or events that don't interact with the payment gateway (renewal cancellations, bans, payment failures that were intercepted before the data was even sent to the gateway for verification, logins, etc.).
EDIT:
The data will be regularly examined to verify its integrity, since people will need to be paid based on it, so accurate data is critical. Read queries will be done by myself and two other admins, so 99% of the time it's going to be writes/updates.
I just figured having multiple tables creates more points of failure during the critical MySQL transactions that deal with inserting and updating the payment data, etc.
All other things being equal, smaller disjoint tables can have a performance advantage, especially when they're write-heavy (as tables related to logs are liable to be) -- most DB mechanisms are better tuned for mostly-read, rarely-written tables. In terms of writing (and updating any indices you may have to maintain), small disjoint tables are a clear win, especially if there's any concurrency (depending on what engine you're using for your tables, of course -- that's a pretty important consideration in MySQL!).
In terms of reading, it all depends on your pattern of queries -- what queries will you need, and how often. In certain cases for a usage pattern such as you mention there might be some performance advantage in duplicating certain information -- e.g. if you often need an up-to-the-instant running total of a user's credits or debits, as well as detailed auditable logs of how the running total came to be, keeping a (logically redundant) table of running totals by users may be warranted (as well as the nicely-separated "log tables" about the various sources of credits and debits).
Transactional tables should never change and should not be editable, and they can serve as log files for that type of information. Design your "billing" tables to have timestamps, and that will be sufficient.
However, where data records are editable, you need to track who-changed-what-when. To do that, you have a couple of choices.
--
For a given table, you can have a table_history table that has a near-identical structure, with NULLable fields and a two-part primary key (the primary key of the original table + a sequence). If for every insert or update operation you write a record to this table, you have a complete log of everything that happened to the table.
The advantage of this method is you get to keep the same column types for all logged data, plus it is more efficient to query.
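As a concrete, hypothetical illustration of that first approach for an orders table (all names and column types here are made up):

# Hypothetical example of the table_history approach for an `orders` table:
# the same columns made NULLable, plus a two-part primary key (original PK + sequence).
ORDERS_HISTORY_DDL = """
CREATE TABLE orders_history (
    order_id   INT           NOT NULL,  -- PK of the original row
    revision   INT           NOT NULL,  -- sequence within that row's history
    status     VARCHAR(32)   NULL,      -- copies of the original columns...
    amount     DECIMAL(10,2) NULL,
    changed_by VARCHAR(64)   NULL,
    changed_at DATETIME      NULL,
    PRIMARY KEY (order_id, revision)
) ENGINE=InnoDB
"""

def log_order_change(conn, order_row, who):
    # Called alongside every INSERT/UPDATE on `orders`, inside the same
    # transaction. MAX(revision)+1 numbering is naive but keeps the sketch short.
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO orders_history "
        "(order_id, revision, status, amount, changed_by, changed_at) "
        "SELECT %s, COALESCE(MAX(revision), 0) + 1, %s, %s, %s, NOW() "
        "FROM orders_history WHERE order_id = %s",
        (order_row["id"], order_row["status"], order_row["amount"], who, order_row["id"]),
    )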
--
Alternatively, you can have a single log table that has fields like "table", "key", "date", "who", and a related table that stores the changed fields and values.
The advantage of this method is that you get to write one logging routine and use it everywhere.
--
I suggest you evaluate the number of tables, performance needs, change volume, and then pick one and go with it.
It depends on the purpose of logging. For debugging and general monitoring purpose, a single log table with dynamic log level would be helpful so you can chronologically look at what the system is going through.
On the other hand, for audit trail purposes, there's nothing like having a duplicate table for each table, capturing every CRUD action. This way, every piece of information captured in the payment table (or whatever) is also captured in your audit table.
So, the answer is both.