Suitability of AWS Cognito Identity ID for SQL primary key - mysql

I am working on a platform where the unique user IDs are identity IDs from an Amazon Cognito identity pool, which look like this: "us-east-1:128d0a74-c82f-4553-916d-90053e4a8b0f".
The platform has a MySQL database that has a table of items that users can view. I need to add a favorites table that holds every favorited item of every user. This table could possibly grow to millions of rows.
The layout of the 'favorites' table would look like so:
userID, itemID, dateAdded
where userID and itemID together are a composite primary key.
My understanding is that this type of userID (practically an expanded UUID, which needs to be stored as a CHAR or VARCHAR) gives poor indexing performance, so using it as a key or index for millions of rows is discouraged.
My question is: Is my understanding correct, and should I be worried about performance later on due to this key? Are there any mitigations I can take to reduce performance risks?
My overall database knowledge isn't that great, so if this is a large problem: would moving the favorites list to a NoSQL table (where the userID as a key would allow constant access time), and retrieving an array of favorited item IDs to be used in a SELECT...WHERE IN query, be an acceptable alternative?
Thanks so much!

OK, so here I want to explain why this is not good, describe an alternative, and walk through the read/write workflow of your application.
Why not: this is not a good architecture, because if something happens to your Cognito user pool, you can't repopulate it with the same ids for each individual user. Moreover, Cognito is offered in more regions now compared to last year. Say your user base is in Indonesia and Cognito becomes available in Singapore; you want to move your user pools from Tokyo to Singapore because of latency. Not only do you have the problem of moving the users, you also have the problem of repopulating your database. So this approach lacks scalability and maintainability, and it breaks the single-responsibility principle (updating Cognito requires you to update the db, and vice versa).
Alternative solution: leave the db index to the db domain, and use the username as the link between your db and your Cognito user pool (see the sketch after the workflows). So:
Read workflow:
User authentication: the user authenticates and gets a token.
Your app verifies the token and gets the username from its payload.
Your app queries the db and gets the user's information based on the username.
Your app brings the user to their page and provides the information that was stored in the database.
Write workflow:
Your app gets the write request from the user along with the token.
It verifies the token.
It writes to the database based on the unique username.
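As a rough sketch of that linkage (table and column names here are assumptions, not anything Cognito or MySQL prescribes), the database owns a numeric primary key and keeps the Cognito username as a unique lookup column:
CREATE TABLE users (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- internal key, owned by the db
    username VARCHAR(128) NOT NULL UNIQUE                 -- the link to the Cognito user pool
);
CREATE TABLE favorites (
    user_id INT UNSIGNED NOT NULL,
    item_id INT UNSIGNED NOT NULL,
    date_added DATETIME NOT NULL,
    PRIMARY KEY (user_id, item_id),                       -- compact composite key
    FOREIGN KEY (user_id) REFERENCES users (id)
);
This way the wide Cognito string is stored once per user, and the millions of favorites rows index a 4-byte integer instead.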

Regarding MySQL: if you use a composite primary key built on the Cognito userID, it has a negative impact on query performance and is therefore not recommended for a large dataset.
However, using this (or even just the userID) as a key is more suitable for NoSQL DynamoDB, unless you have complex queries. You can also enforce security with AWS DynamoDB fine-grained access control, connected with Cognito Identity Pools.

While Cognito itself has some issues, which are discussed in this article (there are too many to list here)...
It's a terrible idea to use Cognito and then create a completely separate user ID to use as a PK. First of all, it is also going to be a CHAR or VARCHAR, so it doesn't actually help. Additionally, you now have extra complexity to deal with an imaginary problem. If you don't like what Cognito is giving you, then either pair it with another solution or replace it altogether.
Don't overengineer your solution to solve a trivial case that may never come up. Use the Cognito userId because you use Cognito. 99.9999% of the time this is all you need, and it will support your use case.
Specifically, this SO post explains that there are zero problems with your approach:
There's nothing wrong with using a CHAR or VARCHAR as a primary key.
Sure it'll take up a little more space than an INT in many cases, but there are many cases where it is the most logical choice and may even reduce the number of columns you need, improving efficiency, by avoiding the need to have a separate ID field.
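Along those lines, a minimal sketch of the favorites table (the length and the ascii character set are assumptions based on the ID format shown in the question; a narrow character set keeps the index smaller than utf8 would):
CREATE TABLE favorites (
    user_id VARCHAR(64) CHARACTER SET ascii NOT NULL,  -- Cognito identity ID, e.g. 'us-east-1:128d0a74-...'
    item_id INT UNSIGNED NOT NULL,
    date_added DATETIME NOT NULL,
    PRIMARY KEY (user_id, item_id),
    KEY idx_item (item_id)                             -- for queries that start from an item
);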

Related

I have account_id on all of the tables in the database that belong to an account. Could this be why our database is using a lot of memory?

The application makes heavy use of data. It is a real-estate lead management system; each lead could make our users thousands of dollars, and accidentally returning other people's leads would be a sure-fire way to lose our customers' trust. We make heavy use of the entity pattern on the client, which requires us to get collections of data and store them as collections on the client.
My thinking when I was designing the database was that having the account id on each table would make getting all of the data easier, without returning data from other accounts, with fewer or more performant queries. I realize that there are other ways to handle this, but we were on a very short deadline and had to build a full app with ~50 tables for a beta launch in 3 months. That said, we also have many queries that rely heavily on joins and GROUP BY to prevent (n+1) trips to the database.
Is it simply that analyzing lots of data requires a larger database? The big problem is that we only have 45 active customers currently. The app is fast and feels great; we are just pushing the limits of the memory of the database. The current database server has 8 GB of RAM.
This is the pattern that I used. I know the keys are not normalized, but nothing I have read suggested this would cause the issue. I'm not a database specialist, though, and would appreciate any advice.
Account Table:
id
first_name
...etc
Record Table:
id
account_id
address
...etc
Analysis Table:
id
account_id
record_id
expected_return_on_investment
...etc
Comps Table:
id
account_id
record_id
analysis_id
cost
...etc
First of all, for better understanding, I suggest you post your database as an Entity-Relationship Diagram. Because the question is hard to understand and there is almost no information, it is impossible to know what exactly you are asking. Nonetheless, I will try to answer (maybe not what you expected).
The Account table's id should be the account_id on the other tables. When sorting the data to get (for example) all records of a client, you would search for all rows with that account_id and then differentiate between them by their own id. Also, you don't need account_id on Analysis or Comps, and record_id is not needed in Comps either. Why? Because since you have the id of the parent element, you can fetch the other ids through it, as the join below illustrates.
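For example, a sketch using the table names from the question (Analysis keeps record_id as its link to its parent): all comps belonging to one account can be fetched through the parent rows, without redundant account_id columns:
SELECT c.*
FROM Comps c
JOIN Analysis a ON a.id = c.analysis_id
JOIN Record r ON r.id = a.record_id
WHERE r.account_id = ?;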
By the way, if you want to make it safer, I would hash the Account ids and store them in another column, which I would call hashID, and then pass that hash to the other tables. If someone unauthorized got hold of that hash, they would not be able to recover the id from it. If you have access to the database, you can tell which user a hash refers to by comparing hashes; without that access, the hash gives no further information (whereas the raw id would).
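A rough illustration of that idea (the column name hashID comes from above; the salt string is a made-up placeholder, and SHA2 is MySQL's built-in hash function):
ALTER TABLE Account ADD COLUMN hashID CHAR(64);
UPDATE Account SET hashID = SHA2(CONCAT(id, 'app-secret-salt'), 256);  -- one-way projection of the id
ALTER TABLE Account ADD UNIQUE KEY uq_hash (hashID);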

Preventing insertion of duplicates without using indices

I have a MariaDB table users that looks roughly like this:
id INT PRIMARY KEY AUTO_INCREMENT,
email_hash INT, -- indexed
encrypted_email TEXT,
other_stuff JSON
For privacy reasons, I cannot store actual emails in the database.
The encryption used for emails is not 1-to-1, i.e. one email can be encrypted to many different encrypted representations. This makes it pointless to just slap an index on the encrypted_email column, as it will never catch a duplicate.
There are already data in the database and changing the encryption method or the hashing method is out of question.
The email_hash column cannot have a unique index either, as it is supposed to be a short hash to just speed up duplicate checks. It cannot be too unique, as it would void all privacy guarantees.
How can I prevent two entries with the same email from appearing in the database?
Another limitation: I probably cannot use LOCK TABLES, as according to the documentation (https://mariadb.com/kb/en/library/lock-tables/):
LOCK TABLES doesn't work when using Galera cluster. You may experience crashes or locks when used with Galera.
LOCK TABLES implicitly commits the active transaction, if any. Also, starting a transaction always releases all table locks acquired with LOCK TABLES.
(I do use Galera and I do need transactions as inserting a new user is accompanied with several other inserts and updates)
Since the backend application server (a monolith) is allowed to handle personal information (for example for sending email messages, verifying logins etc.) as long as it doesn't store it, I do the duplicate check in the application.
Currently, I'm doing something like this (pseudocode):
perform "START TRANSACTION"
h := hash(new_user.email)
conflicts := perform "SELECT encrypted_email FROM users WHERE email_hash = ?", h
for conflict in conflicts:
    if decrypt(conflict) == new_user.email:
        perform "ROLLBACK"
        return DUPLICATE
e := encrypt(new_user.email)
s := new_user.other_stuff
perform "INSERT INTO users (email_hash, encrypted_email, other_stuff) VALUES (?,?,?)", h, e, s
perform some other inserts as part of the transaction
perform "COMMIT"
return OK
which works fine if the two attempts are separated in time. However, when two threads try to add the same user simultaneously, both transactions run in parallel, do the SELECT, see no conflicting duplicate, and then both proceed to add the user. How can I prevent that, or at least recover gracefully and immediately?
This is how the race looks, simplified:
Two threads start their transactions
Both threads do the select and the select returns zero rows in both cases.
Both threads assume there won't be a duplicate.
Both threads add the user.
Both threads commit the transactions.
There are now two users with the same email.
Tack FOR UPDATE on the end of the SELECT.
Also, since you are using Galera, you must check for errors after COMMIT. (That is when conflicts with the other nodes are reported.)
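In terms of the question's pseudocode, only the locking read changes; a sketch (assuming InnoDB and the existing index on email_hash, so the locking SELECT blocks a concurrent insert of the same hash value until the transaction finishes):
conflicts := perform "SELECT encrypted_email FROM users WHERE email_hash = ? FOR UPDATE", h
The second transaction then waits on the first one's locks; once the first commits, the locking read sees the committed row (or the transaction fails with a deadlock/certification error and can simply be retried).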
Your pseudocode risks race conditions unless you can force the code to run serially. That is, only one request at a time can attempt to insert an email. The whole block of code you show in your pseudocode has to be in a critical section.
If you can't use LOCK TABLES you could try MariaDB's GET_LOCK() function. I'm not sure if that's compatible with Galera, that's something for you to research.
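A sketch of what that could look like, wrapped around the whole check-and-insert block (the lock name is arbitrary; note that GET_LOCK is local to a single node, which is exactly the Galera caveat to research):
ok := perform "SELECT GET_LOCK(CONCAT('email_', ?), 10)", h   -- 1 = acquired, 0 = timed out
... run the duplicate check, inserts and COMMIT from the pseudocode above ...
perform "SELECT RELEASE_LOCK(CONCAT('email_', ?))", h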
If that's not possible, you'll have to find some other method of forcing that block of code to run serially. You haven't described your programming language or your application deployment architecture. Maybe you could use some kind of distributed lock server in Redis or something like that.
But even if you can accomplish this, making the code run serially, that will probably create a bottleneck in your app. Only one thread at a time will be able to insert a new email, and you'll probably find that they queue up waiting for the global lock.
Sorry, but that is the consequence of the constraints of this system, since you cannot implement it with a unique key, which would be the proper way to do it.
Good luck.
This is too long for a comment.
You can't. You have one field where one email gets multiple values; that's of no use for identifying duplicates.
You have another field where multiple emails share the same value; that just raises false positives on duplicates.
If you want to prevent duplicates, then I would suggest a more robust hashing mechanism that greatly reduces collisions so you can use that. Otherwise, you need to do the validation behind a PII wall.
Also too long for a comment:
To prevent duplicate entries in a table, you should use a unique index so MariaDB can detect duplicates.
A 4-byte hash/checksum (INT) is not unique enough and may have too many collisions. Instead of the checksum, you should store a deterministically encrypted email (e.g. encrypted with AES-256-CTR or another cipher) in the table; the key and iv (initialization vector) should be stored on the client. Each encrypted value will then be unique, and for security the encrypted value and the key/iv are stored in different locations.
/* Don't send the plain email to the server (e.g. via MariaDB's aes_encrypt
   function); we already encrypt it on the client. */
encrypted_unique_email = aes_256_ctr_encrypt(user.email); /* deterministic: same email, same ciphertext */
encrypted_email = encrypt(user.email);
execute("INSERT INTO users VALUES (NULL, ?, ?, ?)",
        encrypted_unique_email, encrypted_email, other_stuff);
This solution, however, will only work with an empty table, since you likely will not be able to decrypt the existing records.
In that case your proposal might be the best solution; however, you would need to lock the users table with LOCK TABLE users WRITE and unlock it with UNLOCK TABLES to prevent inconsistency.
You need to add another column and use it to store a one-to-one, collision-free, unrecoverable projection from the email to some comparable output. Take any asymmetric cryptographic algorithm, generate a public/private key pair, then destroy the private key and use the public key to encrypt the e-mail. The way asymmetric cryptography works, it will be impossible to recover the private key even if an attacker gets their hands on the public key you are using to encrypt the emails.
Note, however, that this approach has the same vulnerability as storing unsalted hashes: if the attacker gets their hands on your entire database, the public key, and the algorithm, they can run a brute-force attack with a dictionary of known e-mails and find the matching e-mails in their encrypted form, thus matching accounts from your system to actual e-mail addresses. Deciding whether that is an actual security risk is up to you and your ITSec department; but I think it shouldn't be, since you seem to have a decrypt function available, so an attacker who already has access to the database AND the system innards can just decrypt the stored e-mails.
You can take it one step further and store these encrypted e-mails in a separate table without any relation to users. When a new row is inserted into users, make sure a row is inserted into that table as well. Combined with a unique index and a transaction, this will ensure no duplicates (see the sketch below); however, managing changes and deletions will get more cumbersome. The potential attacker will get literally nothing besides knowing that yes, some of the e-mails they already know are registered in the system.
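A minimal sketch of that separate table, assuming the application computes the projection described above (all names here are made up):
CREATE TABLE email_fingerprints (
    fingerprint VARBINARY(512) NOT NULL PRIMARY KEY  -- the unrecoverable projection of the e-mail
);
START TRANSACTION;
INSERT INTO email_fingerprints (fingerprint) VALUES (?);  -- duplicate-key error here = email already registered
INSERT INTO users (email_hash, encrypted_email, other_stuff) VALUES (?, ?, ?);
COMMIT;  -- on Galera, also check for certification errors at commit time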
Otherwise, you just have to make sure the writes to the users table are always serialized in the software layer before the DB. Write a microservice that queues user-storage requests and forbid modification of users by any other means.

Can I use user id as account number

I have fewer than a million users. I use MySQL, and the table auto-increments the id starting from 100000 (6 digits). Is there any problem if I use the user id as an account number for a small web application? What is the best practice?
It is fairly unspecified what an account number exactly means here. In general you can use it, but in my opinion the user id is technical information that should NOT get outside of the system to the customer (for security reasons). I suggest creating a GUID or another generated (unpredictable) id for each user and using that as the account number to give outside. With this approach, no user can predict the id of another user.
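A sketch of that in MySQL (the column name is assumed; UUID() is the built-in generator):
ALTER TABLE users ADD COLUMN account_number CHAR(36);
UPDATE users SET account_number = UUID();  -- backfill existing rows; UUID() is evaluated per row
ALTER TABLE users ADD UNIQUE KEY uq_account_number (account_number);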
Generally, I would decouple business logic from application logic even if the overlap is evident. The way key generation happens in databases can leave gaps, and/or your app may not produce keys at the time needed. A simple key generator (possibly synchronized) may be better. That said, there is more in this dialog for you to ponder.

Best way of relational database design in MySQL

I'm quite familiar with the MySQL DBMS. I'm interested to know the best way to design a more complicated relational database.
Let's say I have a "users" table with an auto-increment value as its primary key.
Using this PK as a foreign key, I can create a table called "user_details" where I store all confidential data of users.
Yes, this is a good way to do it, but I wanted to know if there is a more sophisticated way,
because if anybody on localhost gets access to the database, they can easily get at the "user_details" data through the users PK.
Also, is it a good idea to use application-generated unique codes as PK and FK in the database, or is an auto-increment value within the database more than enough?
This is a very vague question, so I'll just list a few points:
Your data model should not be concerned with server security. Build your data model accurately for your application and lock down access to the db and tables as much as possible. These are separate concerns.
Use encryption for data that only the end user is allowed to know. Passwords, for example, get one-way hashing.
MySQL's auto-increment is sufficient for most use cases. The only time I sometimes have the application generate ids is on multi-master replicated databases where I need more centralized control or have unique requirements. Even that isn't always necessary, since you can set the auto-increment starting offset separately for each server (see the sketch below) and not worry about the servers generating conflicting ids. There is sometimes a performance drawback to generating your own ids; e.g., generating a GUID takes longer than incrementing an integer.
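For reference, the per-server setting mentioned above is a pair of MySQL system variables; a sketch for a two-master setup:
-- on master 1:
SET GLOBAL auto_increment_increment = 2;  -- step by the number of masters
SET GLOBAL auto_increment_offset = 1;     -- this server generates 1, 3, 5, ...
-- on master 2:
SET GLOBAL auto_increment_increment = 2;
SET GLOBAL auto_increment_offset = 2;     -- this server generates 2, 4, 6, ...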

What is the most efficient method of keeping track of each user's "blocked users" in a MySQL Database?

What is the most efficient method of managing blocked users for each user so they don't appear in search results on a PHP/MySQL-run site?
This is the way I am currently doing it and I have a feeling this is not the most efficient way:
I create a BLOB on each user's row in the main user table that gets updated with the unique user IDs of each user they block. So if user IDs 313, 563, and 732 are blocked by a user, their BLOB simply contains "313,563,732". Then, whenever a search result is queried for that user, I include the BLOB contents like so: "AND UserID NOT IN (313,563,732)", so that the blocked user IDs don't show up for that user. When a user "unblocks" someone, I remove that user ID from their BLOB.
Is there a better way of doing this (I'm sure there is!)? If so, why is it better and what are the pros and cons of your suggestion?
Thanks, I appreciate it!
You are saving relationships in a relational database in a way that it does not understand. You will not have the benefit of foreign keys etc.
My recommended way to do this would be to have a separate table for the blocked users:
create table user_blocked_users (user_id int, blocked_user_id int);
Then when you want to filter the search result, you can simply do it with a subquery:
select * from user u where ?searcherId not in (select b.blocked_user_id from user_blocked_users b where b.user_id = u.id)
You may want to start out that way, and then optimize it with queries, caches or other things if necessary, but do that last. First, get a consistent and correct data model that you can work with.
Some of the pros of this approach:
You will have a correct data model of your block relations
With foreign keys, you will keep your data model consistent
The cons of this approach:
In your case, none that I can see
The cons of your approach:
It will be slow and not scalable, as BLOBs are scanned as raw bytes and not indexed
Your data model will be hard to maintain and you will not have the benefit of foreign keys
You are looking for a cross reference table.
You have a table containing user IDs and blocked user IDs. You then SELECT blockid FROM blocked WHERE uid=$user to get a list of the user IDs that are blocked, which you can filter out through a WHERE clause such as WHERE uid NOT IN (SELECT blockid FROM blocked WHERE uid=$user).
Now you can block multiple users per user, and the other way round, with all the speed of an actual database.
You are looking for a second table joined in a many-to-many relationship. Check this post:
Many-to-Many Relationships in MySQL
The "Pros" are numerous. You are handling your data with referential integrity, which has incalculable benefits down the road. The issue you described will be followed by others in your application, and some of those others will be more unmanageable than this one.
The "Cons" are that
You will have to learn how referential data works (but that's ahead anyway, as I say)
You will have more tables to deal with (ditto)
You will have to learn more about CRUD, which is difficult ... but, just part of the package.
What you are currently using is not regarded as good practice for relational database design; however, as with anything else, there are cases where that approach can be justified, albeit restrictive in terms of what you can accomplish.
What you could do is, as J V suggested, create a cross-reference table that contains mappings of user relationships. This allows you to, among other things, skip unnecessary queries, make use of table indexes and, possibly most importantly, gain far greater flexibility in the future.
For instance, you can add a field to the table that indicates the type/status of the relationship (e.g. blocked, friend, pending approval), which would allow a much more complex system to be developed easily.
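As a sketch of such a table, extending J V's cross-reference design (the table and column names here are assumptions):
CREATE TABLE user_relationship (
    user_id INT NOT NULL,
    other_user_id INT NOT NULL,
    status ENUM('blocked', 'friend', 'pending') NOT NULL,
    PRIMARY KEY (user_id, other_user_id),
    FOREIGN KEY (user_id) REFERENCES user (id),
    FOREIGN KEY (other_user_id) REFERENCES user (id)
);
A single row then describes the relationship from one user to another, and a new relationship type is just a new ENUM value (or a lookup table, if the types change often).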