Suppose I have a couple of scripts sending (legitimate!) emails. Each script handles a part of a bigger list, and they run concurrently. Before sending, every address has to be checked to avoid sending to the same address twice.
To do this, I created a simple table (MySQL 5.1, InnoDB) with just the email address. If it's not in the table, then add it and send the mail. Now I need to avoid the race condition where multiple scripts test the same address at the same time and erroneously conclude it hasn't been sent to. I guess I could use locks for this, but I'd rather not, for performance reasons.
So I'd like to know if the following alternative is correct:
adding a unique index on the address column
just insert the address, without checking by selecting
trap the mysql error code returned: if it's 1062, the address already existed.
In this setup, is there still a possibility for a race condition? I mean: is it still possible that two scripts that insert an address at almost the same time both conclude that the mail has not been sent? Or should I use locks for this?
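To make the idea concrete, here's a rough sketch of what I mean, in Python with mysql-connector-python (table and column names are just placeholders):
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="mailer",
                               password="secret", database="mailing")
conn.autocommit = True
cur = conn.cursor()

def try_reserve(address):
    """Return True if we inserted the address (so we may send),
    False if another script already claimed it."""
    try:
        # sent_addresses.address has a UNIQUE index
        cur.execute("INSERT INTO sent_addresses (address) VALUES (%s)", (address,))
        return True
    except mysql.connector.Error as err:
        if err.errno == 1062:   # ER_DUP_ENTRY: another script got there first
            return False
        raise                   # any other error is a real problem

if try_reserve("someone@example.com"):
    pass  # send the email here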
Thanks,
Stijn
Firstly, I feel the database isn't the best place for this. While your bigger list is sending out email (I'm guessing on a very large scale, given your attempt at parallelisation), you must be using a temporary table, since you wouldn't want to prevent a recipient of a previous mailing from receiving a different email later.
A cache maintaining a list of addresses would be the obvious choice here, or a server acting as a shared memory resource.
However, you could do it in the database, and from my understanding it isn't really vital if one email address exists more than once, as all you're doing is checking that it hasn't been sent to in the past. You can't really control the race condition of multiple scripts sending to the same address at the same time without a locking policy. You could, however, make it more efficient by using an index. I wouldn't index the actual address but instead create a new column with a CRC32 hash of the address (which can be a 32-bit unsigned integer, taking only 4 bytes). With the CRC32 approach you will also have to check the email address itself in the query, because of the birthday paradox (different addresses can share the same CRC32 value).
For example:
SELECT COUNT(*) FROM email_addresses
WHERE email_address_crc = CRC32(?address)
AND email_address = ?address
Having something efficient should help against race conditions, but as I've said before, the only way to guarantee it is to lock the database while each email is being sent so you can maintain an exact list. Unfortunately that doesn't scale, and it would mean that having parallel tasks sending email probably wouldn't help.
Edit in response to comments below:
As pointed out in the comments, I forgot to address svdr's alternative to a locking solution. It is true that a unique index containing the email address (or a composite index containing the campaign ID and address) would throw a MySQL error if the address already exists, and thus gives a working solution when parallel scripts hit the same address at the same time. However, it is hard to handle failures such as SMTP or network errors when the address is recorded before the script tries to send the email; this could result in a recipient not receiving an email. Also, provided this is a very simple INSERT and SELECT, it should be fine just to trap the MySQL error; if there is anything more complex, such as wrapping commands in transactions or using SELECT ... FOR UPDATE, this can result in deadlocks.
A couple of other considerations: the email address field would need to be fully indexed for performance reasons. With InnoDB the index key limit is 767 bytes; given that the maximum valid length of an email address is 254 characters (+1 byte for the length if using VARCHAR), you should be fine provided you don't have some huge primary key.
Index performance should be addressed too, and CHAR vs VARCHAR should be evaluated. Index lookups on a CHAR field are usually 15%–25% faster than the equivalent VARCHAR lookup, and fixed-width rows can also help, depending on the table engine used.
To summarise: yes, your non-locking solution will work, but it should be tested and evaluated carefully against your exact requirements (I can't comment on specifics, as I assume your real-life scenario is more complex than your SO question). As stated in the first line of this answer, I still believe the database isn't the best place for this, and a cache or shared memory space would be more efficient and easier to implement.
I have a MariaDB table users that looks roughly like this:
id INT PRIMARY KEY AUTO_INCREMENT,
email_hash INT, -- indexed
encrypted_email TEXT,
other_stuff JSON
For privacy reasons, I cannot store actual emails in the database.
The encryption used for emails is not 1-to-1, i.e. one email can be encrypted to many different encrypted representations. This makes it pointless to just slap an index on the encrypted_email column, as it will never catch a duplicate.
There is already data in the database, and changing the encryption method or the hashing method is out of the question.
The email_hash column cannot have a unique index either, as it is supposed to be a short hash to just speed up duplicate checks. It cannot be too unique, as it would void all privacy guarantees.
How can I prevent two entries with the same email from appearing in the database?
Another limitation: I probably cannot use LOCK TABLE, as according to the documentation https://mariadb.com/kb/en/library/lock-tables/
LOCK TABLES doesn't work when using Galera cluster. You may experience crashes or locks when used with Galera.
LOCK TABLES implicitly commits the active transaction, if any. Also, starting a transaction always releases all table locks acquired with LOCK TABLES.
(I do use Galera and I do need transactions as inserting a new user is accompanied with several other inserts and updates)
Since the backend application server (a monolith) is allowed to handle personal information (for example for sending email messages, verifying logins etc.) as long as it doesn't store it, I do the duplicate check in the application.
Currently, I'm doing something like this (pseudocode):
perform "START TRANSACTION"
h := hash(new_user.email)
conflicts := perform "SELECT encrypted_email FROM users WHERE email_hash = ?", h
for conflict in conflicts:
    if decrypt(conflict) == new_user.email:
        perform "ROLLBACK"
        return DUPLICATE
e := encrypt(new_user.email)
s := new_user.other_stuff
perform "INSERT INTO users (email_hash, encrypted_email, other_stuff) VALUES (?,?,?)", h, e, s
perform some other inserts as part of the transaction
perform "COMMIT"
return OK
which works fine if the two attempts are separated in time. However, when two threads try to add the same user simultaneously, both transactions run in parallel, do the select, see no conflicting duplicate, and then both proceed to add the user. How do I prevent that, or at least gracefully and immediately recover?
This is what the race looks like, simplified:
Two threads start their transactions
Both threads do the select and the select returns zero rows in both cases.
Both threads assume there won't be a duplicate.
Both threads add the user.
Both threads commit the transactions.
There are now two users with the same email.
Tack FOR UPDATE on the end of the SELECT.
Also, since you are using Galera, you must check for errors after COMMIT. (That is when conflicts with the other nodes are reported.)
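Roughly, applied to your pseudocode, that looks like this (a sketch only, in Python with mysql-connector-python; your hash/encrypt/decrypt helpers are passed in as assumptions):
import mysql.connector

def add_user(conn, new_user, hash_fn, encrypt_fn, decrypt_fn):
    cur = conn.cursor()
    try:
        cur.execute("START TRANSACTION")
        h = hash_fn(new_user.email)
        # FOR UPDATE locks the matching index range on the local node
        cur.execute("SELECT encrypted_email FROM users "
                    "WHERE email_hash = %s FOR UPDATE", (h,))
        for (enc,) in cur.fetchall():
            if decrypt_fn(enc) == new_user.email:
                conn.rollback()
                return "DUPLICATE"
        cur.execute("INSERT INTO users (email_hash, encrypted_email, other_stuff) "
                    "VALUES (%s, %s, %s)",
                    (h, encrypt_fn(new_user.email), new_user.other_stuff))
        # ... the other inserts/updates that belong to the same transaction ...
        conn.commit()            # Galera certifies the writeset here
        return "OK"
    except mysql.connector.Error:
        # With Galera, a conflict with another node typically surfaces as an
        # error on COMMIT; treat it as "roll back and try again".
        conn.rollback()
        return "RETRY"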
Your pseudocode risks race conditions unless you can force the code to run serially. That is, only one request at a time can attempt to insert an email. The whole block of code you show in your pseudocode has to be in a critical section.
If you can't use LOCK TABLES you could try MariaDB's GET_LOCK() function. I'm not sure if that's compatible with Galera, that's something for you to research.
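If you do experiment with it, a sketch could look like this (note that, as far as I know, GET_LOCK() is a per-server lock and is not replicated across Galera nodes, so it only serializes requests that hit the same node):
def with_user_lock(conn, email_hash, work, timeout=10):
    """Run work() while holding a named MariaDB lock derived from the hash."""
    cur = conn.cursor()
    lock_name = "user_insert_%s" % email_hash
    cur.execute("SELECT GET_LOCK(%s, %s)", (lock_name, timeout))
    (got_it,) = cur.fetchone()
    if not got_it:
        raise RuntimeError("could not acquire lock " + lock_name)
    try:
        return work()                       # do the SELECT + INSERT in here
    finally:
        cur.execute("SELECT RELEASE_LOCK(%s)", (lock_name,))
        cur.fetchone()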
If that's not possible, you'll have to find some other method of forcing that block of code to run serially. You haven't described your programming language or your application deployment architecture. Maybe you could use some kind of distributed lock server in Redis or something like that.
But even if you can accomplish this, making the code run serially, that will probably create a bottleneck in your app. Only one thread at a time will be able to insert a new email, and you'll probably find that they queue up waiting for the global lock.
Sorry, but that is the consequence of the constraints of this system, since you cannot implement it with a unique key, which would be the proper way to do it.
Good luck.
This is too long for a comment.
You can't. You have one field where one email gets multiple values. That's of no use for identifying duplicate values.
You have another field where multiple emails have the same value. That just raises false errors on duplicates.
If you want to prevent duplicates, then I would suggest a more robust hashing mechanism that greatly reduces collisions so you can use that. Otherwise, you need to do the validation behind a PII wall.
Also too long for a comment:
To prevent duplicate entries in a table you should use a unique index, so MariaDB will be able to detect duplicates.
A 4-byte hash/checksum (INT) is not unique enough and may have too many collisions. Instead of a checksum, you should store a deterministically encrypted copy of the email address (e.g. encrypted with AES-256-CTR or another block cipher) in the table; the key and IV (initialization vector) should be stored on the client. Each encrypted value will then be unique, and for security the encrypted value and the key/IV are stored in different locations.
/* Don't send the plain email address (e.g. via MariaDB's AES_ENCRYPT function);
   we encrypt it already on the client */
encrypted_unique_email = aes_256_ctr_encrypt(user.email);  /* deterministic: same input, same output */
encrypted_email = encrypt(user.email);                     /* the existing, non-deterministic encryption */
execute("INSERT INTO users VALUES (NULL, encrypted_unique_email, encrypted_email, other_stuff) ...
This solution however will only work with an empty table, since you likely will not be able to decrypt existing records.
In that case your proposal is likely the best solution; however, you would need to lock the users table with LOCK TABLE users WRITE and unlock it with UNLOCK TABLES to prevent inconsistency.
You need to add another column and use it to store some one-to-one, collision-free, unrecoverable projection of the email to some comparable output. Take any asymmetric cryptographic algorithm, generate a public-private key pair, then destroy the private key and use the public key to encrypt the e-mail. The way asymmetric cryptography works, it'll be impossible to recover the private key even if an attacker gets their hands on the public key you are using to encrypt the emails.
Note, however, that this approach has the same vulnerability as storing unsalted hashes: if the attacker gets hold of your entire database, the public key and the algorithm, they can run a brute-force attack using a known e-mail dictionary and successfully find the matching e-mails in their encrypted form, thus matching accounts from your system to actual e-mails. Deciding whether that is an actual security risk is up to you and your ITSec department; but I think it shouldn't be, since you seem to have a decrypt function available, so if an attacker already has access to the database AND the system innards they can just decrypt the stored e-mails.
You can take it one step further and store these encrypted e-mails in a separate table without any relation to users. When a new row is inserted to users, make sure that a row is inserted into that table as well. Combined with unique index and a transaction, this will ensure no duplicates; however, managing changes and deletions will get more cumbersome. The potential attacker will get literally nothing besides knowing that yes, some of his known e-mails are registered in the system.
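As a rough sketch, assuming some deterministic projection function (the hypothetical project_email below, plus your existing hash/encrypt helpers), the insert path could look like this in Python with mysql-connector-python:
import mysql.connector

def register_user(conn, email, other_stuff, project_email, hash_fn, encrypt_fn):
    """project_email(): the hypothetical deterministic, one-way projection described
    above; email_registry(projected_email) has a UNIQUE index and no link back to users."""
    cur = conn.cursor()
    try:
        cur.execute("START TRANSACTION")
        cur.execute("INSERT INTO email_registry (projected_email) VALUES (%s)",
                    (project_email(email),))
        cur.execute("INSERT INTO users (email_hash, encrypted_email, other_stuff) "
                    "VALUES (%s, %s, %s)",
                    (hash_fn(email), encrypt_fn(email), other_stuff))
        conn.commit()
        return "OK"
    except mysql.connector.Error as err:
        conn.rollback()
        if err.errno == 1062:   # the unique index caught a duplicate projection
            return "DUPLICATE"
        raise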
Otherwise, you just have to make sure the writes to users table are always serialized on the software layer before DB. Write a microservice that queues user storage requests and forbid modification of users by any other means.
I am working on a platform where unique user IDs are identity IDs from an Amazon Cognito identity pool, which look like this: "us-east-1:128d0a74-c82f-4553-916d-90053e4a8b0f"
The platform has a MySQL database that has a table of items that users can view. I need to add a favorites table that holds every favorited item of every user. This table could possibly grow to millions of rows.
The layout of the 'favorites' table would look like so:
userID, itemID, dateAdded
where userID and itemID together are a composite primary key.
My understanding is that this type of userID (practically an expanded UUID, that needs to be stored as a char or varchar) gives poor indexing performance. So using it as a key or index for millions of rows is discouraged.
My question is: Is my understanding correct, and should I be worried about performance later on due to this key? Are there any mitigations I can take to reduce performance risks?
My overall database knowledge isn't that great, so if this is a large problem...Would moving the favorited list to a NoSQL table (where the userID as a key would allow constant access time), and retrieving an array of favorited item ID's, to be used in a SELECT...WHERE IN query, be an acceptable alternative?
Thanks so much!
OK, so here I want to explain why this is not a good idea, the alternative, and the read/write workflow of your application.
Why not: this is not a good architecture, because if something happens to your Cognito user pool, you can't repopulate it with the same IDs for each individual user. Moreover, Cognito is offered in more regions now compared to last year. Say your user base is in Indonesia, and now that Cognito is available in Singapore you want to move your user pool from Tokyo to Singapore because of latency: you not only have the problem of moving the users, you also have the issue of repopulating your database. So this approach hurts scalability and maintainability, and breaks the single responsibility principle (updating Cognito requires you to update the db and vice versa).
Alternative solution: leave the db index to the db domain, and use the username as the link between your db and your Cognito user pool (a small sketch follows the two workflows below). So:
Read work flow will be:
User authentication: the user authenticates and gets the token.
Your app verifies the token and gets the username from its payload.
Your app contacts the db and gets the user's information, based on the username.
Your app brings the user to their page and provides the information stored in the database.
Write work flow will be:
Your app gets the write request from the user, along with the token.
It verifies the token.
It writes to the database, keyed on the unique username.
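For example, once the token has been verified the lookup is just an ordinary query keyed on the username; a minimal Python sketch (verify_cognito_token stands in for whatever verification you already use, and mysql-connector-python is assumed for the db side):
def get_user(conn, token, verify_cognito_token):
    claims = verify_cognito_token(token)          # raises if the token is invalid/expired
    # ID tokens carry the name as "cognito:username"; access tokens use "username"
    username = claims.get("cognito:username") or claims.get("username")
    cur = conn.cursor(dictionary=True)
    cur.execute("SELECT * FROM users WHERE username = %s", (username,))
    return cur.fetchone()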
Regarding MySQL: if you use the UserID and CognitoID composite as the primary key, it has a negative impact on query performance and is therefore not recommended for a large dataset.
However, using this (or even just the UserID) with NoSQL DynamoDB is more suitable, unless you have complex queries. You can also enforce security with AWS DynamoDB fine-grained access control connected to Cognito Identity Pools.
While Cognito itself has some issues, which are discussed in this article, and there are too many to list...
It's a terrible idea to use Cognito and then create a completely separate user ID to use as a PK. First of all, it is also going to be a CHAR or VARCHAR, so it doesn't actually help. Additionally, now you have extra complexity to deal with an imaginary problem. If you don't like what Cognito is giving you, then either pair it with another solution or replace it altogether.
Don't overengineer your solution to solve a trivial case that may never come up. Use the Cognito userId because you use Cognito. 99.9999% of the time this is all you need and will support your use case.
Specifically, this SO post explains that there are zero problems with your approach:
There's nothing wrong with using a CHAR or VARCHAR as a primary key.
Sure it'll take up a little more space than an INT in many cases, but there are many cases where it is the most logical choice and may even reduce the number of columns you need, improving efficiency, by avoiding the need to have a separate ID field.
ABSTRACT
Talking with some colleagues we came across the "extract a random row from a big database table" issue. It's a classic one, and we know the naive approach (also on SO) is usually something like:
SELECT * FROM mytable ORDER BY RAND() LIMIT 1
THE PROBLEM
We also know a query like that is utterly inefficient and actually usable only with very few rows. There are some approaches that could be taken to attain better efficiency, like these ones still on SO, but they won't work with arbitrary primary keys and the randomness will be skewed as soon as you have holes in your numeric primary keys. An answer to the last cited question links to this article which has a good explanation and some bright solutions involving an additional "equal distribution" table that must be maintained whenever the "master data" table changes. But then again if you have frequent DELETEs on a big table you'll probably be screwed up by the constant updating of the added table. Also note that many solutions rely on COUNT(*) which is ridiculously fast on MyISAM but "just fast" on InnoDB (I don't know how it performs on other platforms but I suspect the InnoDB case could be representative of other transactional database systems).
In addition to that, even the best solutions I was able to find are fast but not Ludicrous Speed fast.
THE IDEA
A separate service could be responsible for generating, buffering and distributing random row ids, or even entire random rows (a rough sketch of the core follows the list below):
it could choose the best method to extract random row ids depending on how the original PKs are structured. An ordered list of keys could be maintained in ram by the service (shouldn't take too many bytes per row in addition to the actual size of the PK, it's probably ok up to 100~1000M rows with standard PCs and up to 1~10 billion rows with a beefy server)
once the keys are in memory you have an implicit "row number" for each key and no holes in it so it's just a matter of choosing a random number and directly fetch the corresponding key
a buffer of random keys ready to be consumed could be maintained to quickly respond to spikes in the incoming requests
consumers of the service will connect and request N random rows from the buffer
rows are returned as simple keys or the service could maintain a (pool of) db connection(s) to fetch entire rows
if the buffer is empty the request could block or return EOF-like
if data is added to the master table the service must be signaled to add the same data to its copy too, flush the buffer of random picks and go on from that
if data is deleted from the master table the service must be signaled to remove that data too from both the "all keys" list and "random picks" buffer
if data is updated in the master table the service must be signaled to update corresponding rows in the key list and in the random picks
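A bare-bones Python sketch of the core of such a service (networking, signaling and the db connection for fetching whole rows are left out; this is only meant to illustrate the buffering idea):
import random

class RandomKeyService:
    def __init__(self, keys, buffer_size=1000):
        self.keys = list(keys)          # ordered copy of all PKs, loaded once at startup
        self.buffer = []                # pre-picked random keys, ready to hand out
        self.buffer_size = buffer_size
        self._refill()

    def _refill(self):
        if self.keys:
            self.buffer = [random.choice(self.keys) for _ in range(self.buffer_size)]

    def get_random(self, n=1):
        if len(self.buffer) < n:
            self._refill()
        picked, self.buffer = self.buffer[:n], self.buffer[n:]
        return picked                   # or block / return EOF-like if still empty

    # called when the master table changes
    def add(self, key):
        self.keys.append(key)
        self._refill()                  # flush stale picks, as described above

    def delete(self, key):
        self.keys.remove(key)           # O(n); a dict/index would be needed at scale
        self.buffer = [k for k in self.buffer if k != key]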
WHY WE THINK IT'S COOL
does not touch disks other than the initial load of keys at startup or when signaled to do so
works with any kind of primary key, numerical or not
if you know you're going to update a large batch of data you can just signal it when you're done (i.e. not at every single insert/update/delete on the original data), it's basically like having a fine grained lock that only blocks requests for random rows
really fast on updates of any kind in the original data
offloads some work from the relational db to another, memory only process: helps scalability
responds really fast from its buffers without waiting for any querying, scanning, sorting
could easily be extended to similar use cases beyond the SQL one
WHY WE THINK IT COULD BE A STUPID IDEA
because we had the idea without help from any third party
because nobody (that we've heard of) has ever bothered to do something similar
because it adds complexity in the mix to keep it updated whenever original data changes
AND THE QUESTION IS...
Does anything similar already exist? If not, would it be feasible? If not, why?
The biggest risk with your "cache of eligible primary keys" concept is keeping the cache up to date, when the origin data is changing continually. It could be just as costly to keep the cache in sync as it is to run the random queries against the original data.
How do you expect to signal the cache that a value has been added/deleted/updated? If you do it with triggers, keep in mind that a trigger can fire even if the transaction that spawned it is rolled back. This is a general problem with notifying external systems from triggers.
If you notify the cache from the application after the change has been committed in the database, then you have to worry about other apps that make changes without being fitted with the signaling code. Or ad hoc queries. Or queries from apps or tools for which you can't change the code.
In general, the added complexity is probably not worth it. Most apps can tolerate some compromise and they don't need an absolutely random selection all the time.
For example, the inequality lookup may be acceptable for some needs, even with the known weakness that numbers following gaps are chosen more often.
Or you could pre-select a small number of random values (e.g. 30) and cache them. Let app requests choose from these. Every 60 seconds or so, refresh the cache with another set of randomly chosen values.
Or choose a random value evenly distributed between MIN(id) and MAX(id). Try a lookup by equality, not inequality. If the value corresponds to a gap in the primary key, just loop and try again with a different random value. You can terminate the loop if it's not successful after a few tries. Then try another method instead. On average, the improved simplicity and speed of an equality lookup may make up for the occasional retries.
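A sketch of that last option in Python (the MIN/MAX could of course be cached instead of queried every time):
import random

def random_row(conn, max_tries=5):
    cur = conn.cursor()
    cur.execute("SELECT MIN(id), MAX(id) FROM mytable")
    lo, hi = cur.fetchone()
    if lo is None:                       # empty table
        return None
    for _ in range(max_tries):
        candidate = random.randint(lo, hi)
        cur.execute("SELECT * FROM mytable WHERE id = %s", (candidate,))
        row = cur.fetchone()
        if row is not None:              # no gap at this id: done
            return row
    return None                          # hit too many gaps; fall back to another method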
It appears you are basically addressing a performance issue here. Most DB performance experts recommend you have as much RAM as your DB size, then disk is no longer a bottleneck - your DB lives in RAM and flushes to disk as required.
You're basically proposing a custom-developed, in-RAM CDC (change data capture) hashing system.
You could just build this as a standard database only application and lock your mapping table in RAM, if your DB supports this.
I guess I am saying that you can address performance issues without developing custom applications, just use already existing performance tuning methods.
I need to store user agent strings in a database for tracking and comparing customer behavior and sales performance between different browsers. A pretty plain user agent string is around 100 characters long. It was decided to use a varchar(1024) for holding the useragent data in the database. (I know this is overkill, but that's the idea; it's supposed to accommodate useragent data for years to come and some devices, toolbars, applications are already pushing 500 characters in length.) The table holding these strings will be normalized (each distinct user agent string will only be stored once) and treated like a cache so we don't have to interpret user agents over and over again.
The typical use case is:
User comes to our site, is detected as being a new visitor
New session information is created for this user
Determine if we need to analyze the user agent string or if we have a valid analysis on file for it
If we have it, great, if not, analyze it (currently, we plan on calling a 3rd party API)
Store the pertinent information (browser name, version, OS, etc.) in a join table tied to the existing user session information and pointing to the cache entry
Note: I have a tendency to say 'searching' for the user agent string in the database because it's not a simple look up. But to be clear, the queries are going to use '=' operators, not regexes or LIKE % syntax.
So the speed of looking up the user agent string is paramount. I've explored a few methods of making sure it will have good performance. Indexing the whole column is right out for size reasons. A partial index isn't such a good idea either because most user agents have the distinguishing information at the end; the partial index would have to be fairly long to make it worthwhile by which point its size is causing problems.
So it comes down to a hash function. My thought is to hash the user agent string in web server code and run the select looking for the hash value in the database. I feel like this would minimize the load on the database server (as opposed to having it compute the hash), especially since if the hash isn't found, the code would turn around and ask the database to compute the hash again on the insert.
Hashing to an integer value would offer the best performance at the risk of more collisions. I'm expecting to see thousands or tens of thousands of user agents at the most; even 100,000 user agents would fit reasonably well into a 2^32-sized integer space with very few collisions, which could be disambiguated by the web service with minimal performance impact. Even if you think an integer hash isn't such a good idea, using a hex digest (e.g. MD5 at 32 characters, SHA-1 at 40) should still be much faster for selects than the raw string, right?
My database is MySQL InnoDB engine. The web code will be coming from C# at first and php later (after we consolidate some hosting and authentication) (not that the web code should make a big difference).
Let me apologize at this point if you think this is a lame choose-my-hash-algorithm question. I'm really hoping to get some input from people who've done something similar before and hear their decision process. So, the questions:
Which hash would you use for this application?
Would you compute the hash in code or let the db handle it?
Is there a radically different approach for storing/searching long strings in a database?
Your idea of hashing long strings to create a token upon which to lookup within a store (cache, or database) is a good one. I have seen this done for extremely large strings, and within high volume environments, and it works great.
"Which hash would you use for this application?"
I don't think the encryption (hashing) algorithm really matters, as you are not hashing to encrypt data; you are hashing to create a token to use as a key for looking up longer values. So the choice of hashing algorithm should be based on speed.
"Would you compute the hash in code or let the db handle it?"
If it were my project, I would do the hashing at the app layer and then pass it through to look up within the store (cache, then database).
"Is there a radically different approach for storing/searching long strings in a database?"
As I mentioned, I think for your specific purpose, your proposed solution is a good one.
Table recommendations (demonstrative only):
user
    id int(11) unsigned not null
    name_first varchar(100) not null

user_agent_history
    user_id int(11) unsigned not null
    agent_hash varchar(255) not null

agent
    agent_hash varchar(255) not null
    browser varchar(100) not null
    agent text not null
Few notes on schema:
From your OP it sounds like you need an M:M relationship between user and agent, due to the fact that a user may be using Firefox at work but then switch to IE9 at home. Hence the need for the pivot table.
The varchar(255) used for agent_hash is up for debate. MySQL suggests using a varbinary column type for storing hashes, of which there are several types.
I would also suggest either making agent_hash a primary key, or at the very least, adding a UNIQUE constraint to the column.
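A sketch of the app-side flow, computing the hash in code and using it as the lookup key (Python with mysql-connector-python; MD5 is used purely as an example of a cheap digest, and the analyze callback stands in for the 3rd-party API):
import hashlib
import mysql.connector

def agent_info(conn, user_agent, analyze):
    """Return cached info for this user agent, calling analyze() (the 3rd-party API)
    only on a cache miss. The hash is computed here, in application code."""
    agent_hash = hashlib.md5(user_agent.encode("utf-8")).hexdigest()
    cur = conn.cursor()
    cur.execute("SELECT browser FROM agent WHERE agent_hash = %s", (agent_hash,))
    row = cur.fetchone()
    if row:
        return agent_hash, row[0]
    info = analyze(user_agent)                  # cache miss: analyze once
    try:
        cur.execute("INSERT INTO agent (agent_hash, browser, agent) "
                    "VALUES (%s, %s, %s)", (agent_hash, info["browser"], user_agent))
        conn.commit()
    except mysql.connector.Error as err:
        conn.rollback()
        if err.errno != 1062:                   # 1062: a parallel request beat us to it
            raise
    return agent_hash, info["browser"]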
Your hash idea is sound. I've actually used hashing to speed up some searches on millions of records. A hash index will be quicker since each entry is the same size. MD5 will likely be fine in your case and will probably give you the shortest hash length. If you are worried about hash collisions, you can also include the length of the agent string.
I'm building a web application where the front end is a highly-specialized search engine. Searching is handled at the main URL, and the user is passed off to a sub-directory when they click on a search result for a more detailed display. This hand-off is being done as a GET request with the primary key being passed in the query string. I seem to recall reading somewhere that exposing primary keys to the user was not a good idea, so I decided to implement reversible encryption.
I'm starting to wonder if I'm just being paranoid. The reversible encryption (base64) is probably easily broken by anybody who cares to try, makes the URLs very ugly, and also longer than they otherwise would be. Should I just drop the encryption and send my primary keys in the clear?
What you're doing is basically obfuscation. A reversible encrypted (and base64 doesn't really count as encryption) primary key is still a primary key.
What you were reading comes down to this: you generally don't want to have your primary keys have any kind of meaning outside the system. This is called a technical primary key rather than a natural primary key. That's why you might use an auto number field for Patient ID rather than SSN (which is called a natural primary key).
Technical primary keys are generally favoured over natural primary keys because things that seem constant do change and this can cause problems. Even countries can come into existence and cease to exist.
If you do have technical primary keys you don't want to make them de facto natural primary keys by giving them meaning they didn't otherwise have. I think it's fine to put a primary key in a URL but security is a separate topic. If someone can change that URL and get access to something they shouldn't have access to then it's a security problem and needs to be handled by authentication and authorization.
Some will argue they should never be seen by users. I don't think you need to go that far.
On the dangers of exposing your primary key, you'll want to read "autoincrement considered harmful", By Joshua Schachter.
URLs that include an identifier will let you down for three reasons.
The first is that given the URL for some object, you can figure out the URLs for objects that were created around it. This exposes the number of objects in your database to possible competitors or other people you might not want having this information (as famously demonstrated by the Allies guessing German tank production levels by looking at the serial numbers).
Secondly, at some point some jerk will get the idea to write a shell script with a for-loop and try to fetch every single object from your system; this is definitely no fun.
Finally, in the case of users, it allows people to derive some sort of social hierarchy. Witness the frequent hijacking and/or hacking of high-prestige low-digit ICQ ids.
If you're worried about someone altering the URL to try and look at other values, then perhaps you need to look at token generation.
For instance, instead of giving the user a 'SearchID' value, you give them a SearchToken, which is some long, unique, pseudo-random value (read: GUID), which you then map to the SearchID internally.
Of course, you'll also need to apply session security and so forth, because even a unique URL with a non-sequential ID isn't protected against sniffing by anything between your server and the user.
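A small sketch of that mapping in Python (the search_tokens table is hypothetical):
import uuid

def create_search_token(conn, search_id):
    """Hand out a random token instead of the internal SearchID."""
    token = str(uuid.uuid4())
    cur = conn.cursor()
    cur.execute("INSERT INTO search_tokens (token, search_id) VALUES (%s, %s)",
                (token, search_id))
    conn.commit()
    return token                                # this goes into the URL

def resolve_search_token(conn, token):
    cur = conn.cursor()
    cur.execute("SELECT search_id FROM search_tokens WHERE token = %s", (token,))
    row = cur.fetchone()
    return row[0] if row else None              # None -> unknown token, return 404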
If you're obscuring the primary keys for a security reason, don't do it. That's called security by obscurity, and there is a better way. Having said that, there is at least one valid reason to obscure primary keys, and that's to prevent someone from scraping all your content by simply examining a query string in a URL and realising that they can increment an id value and pull down every record. A determined scraper may still be able to discover your means of obscuring and do this despite your best efforts, but at least you haven't made it easy.
PostgreSQL provides multiple solutions for this problem, which could be adapted to other RDBMSs:
hashids : https://hashids.org/postgresql/
Hashids is a small open-source library that generates short, unique, non-sequential ids from numbers.
It converts numbers like 347 into strings like “yr8”, or array of numbers like [27, 986] into “3kTMd”.
You can also decode those ids back. This is useful in bundling several parameters into one or simply using them as short UIDs.
optimus is similar to hashids but provides only integers as output: https://github.com/jenssegers/optimus
skip32 at https://wiki.postgresql.org/wiki/Skip32_(crypt_32_bits):
It may be used to generate series of unique values that look random, or to obfuscate a SERIAL primary key without losing its uniqueness property.
pseudo_encrypt() at https://wiki.postgresql.org/wiki/Pseudo_encrypt:
pseudo_encrypt(int) can be used as a pseudo-random generator of unique values. It produces an integer output that is uniquely associated to its integer input (by a mathematical permutation), but looks random at the same time, with zero collision. This is useful to communicate numbers generated sequentially without revealing their ordinal position in the sequence (for ticket numbers, URLs shorteners, promo codes...)
this article gives details on how this is done at Instagram: https://instagram-engineering.com/sharding-ids-at-instagram-1cf5a71e5a5c and it boils down to the following (a small sketch of the composition follows the breakdown):
We’ve delegated ID creation to each table inside each shard, by using PL/PGSQL, Postgres’ internal programming language, and Postgres’ existing auto-increment functionality.
Each of our IDs consists of:
41 bits for time in milliseconds (gives us 41 years of IDs with a custom epoch)
13 bits that represent the logical shard ID
10 bits that represent an auto-incrementing sequence, modulus 1024. This means we can generate 1024 IDs, per shard, per millisecond
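Putting those three pieces together is plain bit-shifting; a small Python sketch (the custom epoch and shard id are made-up values, not Instagram's actual ones):
import time

CUSTOM_EPOCH_MS = 1546300800000          # e.g. 2019-01-01 UTC; any fixed epoch works

def make_id(shard_id, sequence):
    """41 bits of milliseconds | 13 bits of shard id | 10 bits of sequence."""
    ms = int(time.time() * 1000) - CUSTOM_EPOCH_MS
    return (ms << 23) | (shard_id << 10) | (sequence % 1024)

def split_id(id_):
    return id_ >> 23, (id_ >> 10) & 0x1FFF, id_ & 0x3FF   # (ms offset, shard, seq)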
Just send the primary keys. As long as your database operations are sealed off from the user interface, this is no problem.
For your purposes (building a search engine) the security benefit of encrypting database primary keys is negligible. Base64 encoding isn't encryption; it's security through obscurity and won't even be a speed bump to an attacker.
If you're trying to secure database query input just use parametrized queries. There's no reason at all to hide primary keys if they are manipulated by the public.
When you see base64 in the URL, you are pretty much guaranteed the developers of that site don't know what they are doing and the site is vulnerable.
URLs that include an identifier will let you down for three reasons.
Wrong, wrong, wrong.
First - every request has to be validated, regardless of it coming in the form of a HTTP GET with an id, or a POST, or a web service call.
Second - a properly made web-site needs protection against bots which relies on IP address tracking and request frequency analysis; hiding ids might stop some people from writing a shell script to get a sequence of objects, but there are other ways to exploit a web site by using a bruteforce attack of some sort.
Third - ICQ ids are valuable but only because they're related to users and are a user's primary means of identification; it's a one-of-a-kind approach to user authentication, not used by any other service, program or web-site.
So, to conclude.. Yes, you need to worry about scrapers and DDOS attacks and data protection and a whole bunch of other stuff, but hiding ids will not properly solve any of those problems.
When I need a query string parameter to be able to identify a single row in a table, I normally add a GUID column to that table, and then pass the GUID in the query string instead of the row's primary key value.
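For example, in Python (assuming a row_guid column with a UNIQUE index; the table name is a placeholder):
import uuid

def publish_row_guid(conn, row_id):
    """Give a row a public GUID; expose that in the URL, never the primary key."""
    guid = str(uuid.uuid4())
    cur = conn.cursor()
    cur.execute("UPDATE items SET row_guid = %s WHERE id = %s", (guid, row_id))
    conn.commit()
    return guid

def find_by_guid(conn, guid):
    cur = conn.cursor()
    cur.execute("SELECT * FROM items WHERE row_guid = %s", (guid,))
    return cur.fetchone()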