I'm having the greatest nightmare deciding on a database schema! I recently signed up for my first freelance project.
It has user registration, and there are fairly substantial requirements for the user table, as follows:
- name
- password
- email
- phone
- is_active
- email_verified
- phone_verified
- is_admin
- is_worker
- is_verified
- has_payment
- last_login
- created_at
Now I am hugely confused about whether to put everything in a single table or split things up, since I still need to add a few more fields such as:
- token
- otp (maybe needed in future)
- otp_limit (maybe needed in future, for rate limiting)
And there may be more fields in the future when there is an update. I am afraid that if a future update needs a new field, how would I add it if everything is in a single table?
And if I split things up, will that cause a performance issue? Most of the fields are used moderately often by the webapp.
How can I decide?
Your initial aim should be to create a model that is in 3rd Normal Form (3NF). Once you have that, if you then need to move away from a strict 3NF model in order to handle some specific operational requirements/challenges effectively, that's fine - as long as you know what you're doing.
A working/simplified definition of whether a model is in 3NF is that all attributes that can be uniquely identified by the same key should be in the same table.
So all attributes of a user should be in the same table (as long as they have a 1:1 relationship with the User ID).
I'm not sure why adding new columns to a table in the future is worrying you - this should not affect a well-designed application. Obviously altering/dropping columns is a different matter.
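For illustration, a single users table covering the fields listed above might look like this - a rough sketch only; the column types, lengths, and InnoDB engine are my assumptions, not requirements:

CREATE TABLE users (
    user_id        INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name           VARCHAR(100) NOT NULL,
    password_hash  VARCHAR(255) NOT NULL,                -- store a hash, never the raw password
    email          VARCHAR(255) NOT NULL,
    phone          VARCHAR(32),
    is_active      TINYINT(1)   NOT NULL DEFAAULT 1,
    email_verified TINYINT(1)   NOT NULL DEFAULT 0,
    phone_verified TINYINT(1)   NOT NULL DEFAULT 0,
    is_admin       TINYINT(1)   NOT NULL DEFAULT 0,
    is_worker      TINYINT(1)   NOT NULL DEFAULT 0,
    is_verified    TINYINT(1)   NOT NULL DEFAULT 0,
    has_payment    TINYINT(1)   NOT NULL DEFAULT 0,
    last_login     DATETIME     NULL,
    created_at     DATETIME     NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (user_id),
    UNIQUE KEY uq_users_email (email)
) ENGINE=InnoDB;

A future field such as token then becomes a single ALTER TABLE users ADD COLUMN ..., which a well-designed application will not even notice.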
As commented, design database according to your business or project use case and narrative in mind. Essentially, you need a relational model of Users, Portfolios, and Stocks where Users can have many Portfolios and each Portfolio can contain many Stocks. If you need to track Registrations or Logins, add that to schema where Users can have multiple Registrations or Logins. Doing so, you simply add rows with corresponding UserID and not columns.
Also, consider best practices:
Use Lookup Tables: For static (or rarely changed) data shared across related entities, incorporate lookup tables into the relational model, like Tickers (with its ID referenced as a foreign key in Stocks). Anything that regularly changes at a specific level (i.e., user level) should be stored in that level's table. Remember, database tables should not resemble spreadsheets with repeated static data stored in them.
Avoid Data Elements in Columns: Avoid wide-formatted tables where you store data elements in the columns themselves. Tables with hundreds of suffixed or dated columns are indicative of this design. It fails to capture something like Logins data cleanly and forces a re-design (an ALTER command adding a new column) for every new instance. Always normalize data for storage, efficiency, and scaling needs. A normalized alternative is sketched after the example below.
UserID | Login1 | Login2 | Login3 | ...
10001  | ...    | ...    | ...    | ...
10002  | ...    | ...    | ...    | ...
10003  | ...    | ...    | ...    | ...
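Instead of the wide layout above, each login becomes a row in its own table. A minimal sketch, assuming a Users table keyed by UserID:

CREATE TABLE Logins (
    LoginID   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    UserID    INT UNSIGNED NOT NULL,
    LoginTime DATETIME NOT NULL,
    PRIMARY KEY (LoginID),
    KEY ix_logins_user (UserID),
    CONSTRAINT fk_logins_user FOREIGN KEY (UserID) REFERENCES Users (UserID)
) ENGINE=InnoDB;

-- Each new login is a new row, not a new column:
INSERT INTO Logins (UserID, LoginTime) VALUES (10001, NOW());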
Application vs Data Centric Design: Depending on your use case, try not to build the database with one specific application in mind but as a generalized solution for all users, from business personnel and CEOs to regular staff and maybe even data scientists. Therefore, avoid short names, abbreviations (like otp), industry jargon, etc. Everything should be as clear and straightforward as possible.
Additionally, avoid any application or script that makes structural changes to the database, like creating temp tables or schemas on the fly. There is a debate about whether business logic belongs in the database or in the application; usually the data handling should be shared between the two. Keep in mind, MySQL is a powerful (though free), enterprise, server-level RDBMS, not a throwaway file-level, small-scale system.
Maintain a Consistent Signature: Pick a naming convention and stick to it throughout the design (i.e., camelCase, snake_case, plurals). There is a big debate about whether you should prefix objects with tbl, vw, and sp. One strategy is to name data objects by their content and procedures/functions by their action. Always avoid reserved words, special characters, and spaces in names.
Always Document: While very tedious for developers, document every object, functionality, and extension, and annotate tables and fields with definitions. MySQL supports COMMENT clauses in CREATE statements for tables and fields, and # or -- for comments in stored procedures or triggers.
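For example, the COMMENT syntax looks like this (table and column names are only placeholders):

CREATE TABLE Users (
    UserID   INT UNSIGNED NOT NULL AUTO_INCREMENT COMMENT 'Surrogate primary key',
    UserName VARCHAR(100) NOT NULL COMMENT 'Display name chosen at registration',
    PRIMARY KEY (UserID)
) ENGINE=InnoDB
  COMMENT = 'Registered platform users';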
Once designed and in production, databases should rarely (if not ever) be restructured. So carefully think of all possibilities and scenarios beforehand with your use case. Do not dismiss the very important database design step. Good luck!
I am working on a platform where the unique user IDs are Identity IDs from an Amazon Cognito identity pool, which look like this: "us-east-1:128d0a74-c82f-4553-916d-90053e4a8b0f"
The platform has a MySQL database that has a table of items that users can view. I need to add a favorites table that holds every favorited item of every user. This table could possibly grow to millions of rows.
The layout of the 'favorites' table would look like so:
userID, itemID, dateAdded
where userID and itemID together are a composite primary key.
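For reference, a sketch of that layout, assuming InnoDB and the Cognito identity string stored as-is (the itemID type is an assumption):

CREATE TABLE favorites (
    userID    VARCHAR(64) NOT NULL,    -- e.g. 'us-east-1:128d0a74-c82f-4553-916d-90053e4a8b0f'
    itemID    INT UNSIGNED NOT NULL,
    dateAdded DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (userID, itemID)
) ENGINE=InnoDB;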
My understanding is that this type of userID (practically an expanded UUID, that needs to be stored as a char or varchar) gives poor indexing performance. So using it as a key or index for millions of rows is discouraged.
My question is: Is my understanding correct, and should I be worried about performance later on due to this key? Are there any mitigations I can take to reduce performance risks?
My overall database knowledge isn't that great, so if this is a large problem... would moving the favorited list to a NoSQL table (where the userID as a key would allow constant access time), and retrieving an array of favorited item IDs to be used in a SELECT...WHERE IN query, be an acceptable alternative?
Thanks so much!
OK, so here I want to explain why this is not a good idea, offer an alternative, and describe the read/write workflow of your application.
Why not: This is not a good architecture, because if something happens to your Cognito user pool, you can't repopulate it with the same IDs for each individual user. Moreover, Cognito is being offered in more regions now compared to last year. Say your user base is in Indonesia and Cognito becomes available in Singapore; you want to move your user pools from Tokyo to Singapore because of latency. Then not only do you have the problem of moving the users, you also have the problem of repopulating your database. So this approach lacks scalability and maintainability, and breaks the single responsibility principle (updating Cognito requires you to update the db and vice versa).
Alternative solution: Leave the db index in the db domain, and use the username as the link between your db and your Cognito user pool (a table sketch follows the two workflows below). So:
The read workflow will be:
User authentication: The user authenticates and gets the token.
Your app verifies the token and gets the username from its payload.
Your app contacts the db and gets the user's information based on the username.
Your app brings the user to their page and provides the information that was stored in the database.
The write workflow will be:
Your app gets the write request from the user along with the token.
It verifies the token.
It writes to the database based on the unique username.
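A minimal sketch of this alternative, with hypothetical names - the database keeps its own numeric primary key, and the Cognito username is only a unique link column:

CREATE TABLE app_users (
    id       INT UNSIGNED NOT NULL AUTO_INCREMENT,
    username VARCHAR(128) NOT NULL,    -- username taken from the verified Cognito token
    PRIMARY KEY (id),
    UNIQUE KEY uq_app_users_username (username)
) ENGINE=InnoDB;

-- Read workflow, step 3: look the user up by the username from the token payload.
SELECT id, username FROM app_users WHERE username = 'jane.doe';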
Regarding MySQL, if you use the UserID and CognitoID composite as the primary key, it has a negative impact on query performance and is therefore not recommended for a large dataset.
However, using this (or even UserID alone) with NoSQL DynamoDB is more suitable, unless you have complex queries. You can also enforce security with AWS DynamoDB fine-grained access control connected to Cognito Identity Pools.
While Cognito itself has some issues - they are discussed in this article, and there are too many to list here...
It's a terrible idea to use Cognito and then create a completely separate user ID to use as a PK. First of all, it is also going to be a CHAR or VARCHAR, so it doesn't actually help. Additionally, you now have extra complexity to deal with for an imaginary problem. If you don't like what Cognito is giving you, then either pair it with another solution or replace it altogether.
Don't overengineer your solution to solve a trivial case that may never come up. Use the Cognito userId because you use Cognito. 99.9999% of the time this is all you need and will support your use case.
Specifically, this SO post explains that there are zero problems with your approach:
There's nothing wrong with using a CHAR or VARCHAR as a primary key.
Sure it'll take up a little more space than an INT in many cases, but there are many cases where it is the most logical choice and may even reduce the number of columns you need, improving efficiency, by avoiding the need to have a separate ID field.
I have 40+ columns in my table and I have to add a few more fields like current city, hometown, school, work, uni, college...
This user data will be pulled for many matching users who are mutual friends (joining the friend table with the other users' friends to find mutual friends), who are not blocked, and who are not already friends with the user.
The above request is a little complex, so I thought it would be a good idea to put the extra data in the same user table for fast access, rather than adding more joins and slowing the query down further. But I wanted to get your suggestion on this.
My friend told me to add the extra fields, which won't be searched on, as serialized data in one field.
ERD Diagram:
My current table: http://i.stack.imgur.com/KMwxb.png
If i join into more tables: http://i.stack.imgur.com/xhAxE.png
Some Suggestions
Nothing is wrong with this table and its columns.
Follow the approach in MySQL: Optimize table with lots of columns - serialize the extra fields, which are not searched on, into one field.
Create another table and put most of the data there (this makes joins harder, since I already have 3 or more tables to join to pull the records for users, e.g. friends, user, mutual-friend checks).
As usual - it depends.
Firstly, there is a maximum number of columns MySQL can support, and you don't really want to get there.
Secondly, there is a performance impact when inserting or updating if you have lots of columns with an index (though I'm not sure if this matters on modern hardware).
Thirdly, large tables are often a dumping ground for all data that seems related to the core entity; this rapidly makes the design unclear. For instance, the design you present shows 3 different "status" type fields (status, is_admin, and fb_account_verified) - I suspect there's some business logic that should link those together (an admin must be a verified user, for instance), but your design doesn't support that.
This may or may not be a problem - it's more a conceptual, architecture/design question than a performance/will it work thing. However, in such cases, you may consider creating tables to reflect the related information about the account, even if it doesn't have a x-to-many relationship. So, you might create "user_profile", "user_credentials", "user_fb", "user_activity", all linked by user_id.
This makes it neater, and if you have to add more facebook-related fields, they won't dangle at the end of the table. It won't make your database faster or more scalable, though. The cost of the joins is likely to be negligible.
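As a rough sketch only (table and column names are illustrative, and a users table keyed by user_id is assumed), the split might look like:

CREATE TABLE user_credentials (
    user_id       INT UNSIGNED NOT NULL,
    password_hash VARCHAR(255) NOT NULL,
    status        TINYINT NOT NULL DEFAULT 0,
    PRIMARY KEY (user_id),
    FOREIGN KEY (user_id) REFERENCES users (user_id)
) ENGINE=InnoDB;

CREATE TABLE user_fb (
    user_id             INT UNSIGNED NOT NULL,
    fb_id               VARCHAR(64),
    fb_account_verified TINYINT(1) NOT NULL DEFAULT 0,
    PRIMARY KEY (user_id),
    FOREIGN KEY (user_id) REFERENCES users (user_id)
) ENGINE=InnoDB;

New facebook-related fields then land in user_fb rather than dangling at the end of the main table.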
Whatever you do, option 2 - serializing "rarely used fields" into a single text field - is a terrible idea. You can't validate the data (so dates could be invalid, numbers might be text, not-nulls might be missing), and any use in a "where" clause becomes very slow.
A popular alternative is "Entity/Attribute/Value" or "Key/Value" stores. This solution has some benefits - you can store your data in a relational database even if your schema changes or is unknown at design time. However, they also have drawbacks: it's hard to validate the data at the database level (data type and nullability), it's hard to make meaningful links to other tables using foreign key relationships, and querying the data can become very complicated - imagine finding all records where the status is 1 and the facebook_id is null and the registration date is greater than yesterday.
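To make that last point concrete, here is roughly what that query looks like against a hypothetical key/value table attrs (entity_id, attr_name, attr_value), where every condition needs its own join and the date is stored as text:

SELECT s.entity_id
FROM   attrs s
JOIN   attrs r ON r.entity_id = s.entity_id AND r.attr_name = 'registration_date'
LEFT JOIN attrs f ON f.entity_id = s.entity_id AND f.attr_name = 'facebook_id'
WHERE  s.attr_name = 'status' AND s.attr_value = '1'
  AND  f.entity_id IS NULL                                    -- "facebook_id is null" = no such attribute row
  AND  STR_TO_DATE(r.attr_value, '%Y-%m-%d') > DATE_SUB(CURDATE(), INTERVAL 1 DAY);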
Given that you appear to know the schema of your data, I'd say "key/value" is not a good choice.
I would advise you to run some tests. Try it both ways and benchmark it. Nobody will be able to give you a definitive answer because you have not shared your hardware configuration, sample data, sample queries, how you plan to use the data, etc. Here is some information that you may want to consider.
Use The Database as it was intended
A relational database is designed specifically to handle data. Use it as such. When written correctly, joining data across a well-written schema will perform well. You can use EXPLAIN to optimize queries. You can log SLOW queries and improve their performance. Databases have been around for years; if putting everything into a single table improved performance, don't you think that would be all the buzz on the internet and everyone would be doing it?
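For example, EXPLAIN and the slow query log are built in (the table names here are placeholders for your own schema):

EXPLAIN
SELECT u.user_id, p.city
FROM   users u
JOIN   user_profile p ON p.user_id = u.user_id
WHERE  u.status = 1;

-- Log anything slower than one second for later review:
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;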
Engine Types
How will inserts be affected as the row count grows? Are you using MyISAM or InnoDB? You will most likely want to use InnoDB so you get row-level locking rather than table-level locking. Make sure you are using the correct engine type for your tables. Get the information you need to understand the pros and cons of both. The wrong engine type can kill performance.
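A quick sketch of how to check and change the engine ('mydb' and users are placeholders):

SELECT TABLE_NAME, ENGINE
FROM   information_schema.TABLES
WHERE  TABLE_SCHEMA = 'mydb';

ALTER TABLE users ENGINE = InnoDB;   -- convert a MyISAM table to get row-level locking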
Enhancing Performance using Partitions
Find ways to enhance performance. For example, as your datasets grow you could partition the data. Data partitioning will improve the performance of a large dataset by keeping slices of the data in separate partitions, allowing you to run queries on parts of a large dataset instead of all of the information.
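As a sketch, a large log-style table (hypothetical names) could be range-partitioned by year so queries touch only the relevant slice:

CREATE TABLE user_logins (
    user_id    INT UNSIGNED NOT NULL,
    login_time DATETIME NOT NULL,
    PRIMARY KEY (user_id, login_time)
)
PARTITION BY RANGE ( YEAR(login_time) ) (
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION p2023 VALUES LESS THAN (2024),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);

Note that MySQL requires the partitioning column to be part of every unique key, which is why login_time is included in the primary key here.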
Use correct column types
Consider using UUID primary keys for portability and future growth. If you use proper column types, it will improve the performance of your queries.
Do not serialize data
Using serialized data is the worst way to go. When you use serialized fields, you are basically using the database as a file management system. It will save and retrieve the "file", but then your code is responsible for unserializing, searching, sorting, etc. I just spent a year trying to unravel a mess like that. It's not what a database is intended to be used for. Anyone advising you to do that is not only giving you bad advice, they do not know what they are doing. There are very few circumstances where you would use serialized data in a database.
Conclusion
In the end, you have to make the final decision. Just make sure you are well informed and educated on the pros and cons of how you store data. The last piece of advice I would give is to find out what heavy users of mysql are doing. Do you think they store data in a single table? Or do they build a relational model and use it the way it was designed to be used?
When you say "I am going to put everything into a single table", you are saying that you know more about performance and can make better choices for optimization in your code than the team of developers that constantly work on MySQL to make it what it is today. Consider weighing your knowledge against the cumulative knowledge of the MySQL team and the DBAs, companies, and members of the database community who use it every day.
At a certain point you should look at the "short row model", also known as the entity-key-value store, as well as the traditional "long row model".
If you look at the schema used by WordPress you will see that there is a table wp_posts with 23 columns and a related table wp_post_meta with 4 columns (meta_id, post_id, meta_key, meta_value). The meta table is a "short row model" table that allows WordPress to have an infinite collection of attributes for a post.
Neither the "long row model" or the "short row model" is the best model, often the best choice is a combination of the two. As #nevillek pointed out searching and validating "short row" is not easy, fetching data can involve pivoting which is annoyingly difficult in MySql and Oracle.
The "long row model" is easier to validate, relate and fetch, but it can be very inflexible and inefficient when the data is sparse. Some rows may have only a few of the values non-null. Also you can't add new columns without modifying the schema, which could force a system outage, depending on your architecture.
I recently worked on a financial services system that had over 700 possible facts for each instrument, though most instruments had fewer than 20. This could have been built with dozens of tables, each for a particular asset class, or as a table with 700 columns, but we chose a combination: a table with about 20 columns containing the most popular facts and a 4-column table containing the other facts. This design was efficient but difficult to access, so we built a few table functions in PL/SQL to assist with it.
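Sketched in MySQL terms (the names are illustrative; the original system used PL/SQL), the combination looks roughly like this - popular facts as ordinary columns, everything else as one row per fact, much like wp_post_meta:

CREATE TABLE instrument (
    instrument_id INT UNSIGNED NOT NULL,
    name          VARCHAR(100) NOT NULL,
    asset_class   VARCHAR(30)  NOT NULL,
    currency      CHAR(3),
    PRIMARY KEY (instrument_id)
) ENGINE=InnoDB;

CREATE TABLE instrument_fact (
    fact_id       BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    instrument_id INT UNSIGNED NOT NULL,
    fact_key      VARCHAR(64)  NOT NULL,
    fact_value    VARCHAR(255),
    PRIMARY KEY (fact_id),
    UNIQUE KEY uq_instrument_fact (instrument_id, fact_key),
    FOREIGN KEY (instrument_id) REFERENCES instrument (instrument_id)
) ENGINE=InnoDB;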
I have a general comment for you,
Think about it: if you put more than 10-12 columns in a table, even when it seems to make sense, I suspect you are going to pay the price in the short, medium, and long term.
Your 3-table approach seems better than the 1-table approach, but consider splitting into 5-6 tables rather than 3, because you still can.
Move currently, currently_position, and currently_link from the user table, and work from user-profile, into a new table with its own primary key, called USERWORKPROFILE.
Move the locale information from user-profile into a new USERPROFILELOCALE table, because it is generic in nature.
And yes, all your generic attributes in all the tables should be int and not varchar.
For instance, City needs to move out to a new table called LIST_OF_CITIES with cityid.
And your attribute city should change from varchar to int and point to cityid in LIST_OF_CITIES.
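A rough sketch of that change (user_table stands in for whatever your user table is actually called):

CREATE TABLE LIST_OF_CITIES (
    cityid   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    cityname VARCHAR(100) NOT NULL,
    PRIMARY KEY (cityid),
    UNIQUE KEY uq_cityname (cityname)
) ENGINE=InnoDB;

ALTER TABLE user_table
    ADD COLUMN cityid INT UNSIGNED,
    ADD CONSTRAINT fk_user_city FOREIGN KEY (cityid) REFERENCES LIST_OF_CITIES (cityid);

After migrating the existing varchar city values into LIST_OF_CITIES, the old city column can be dropped.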
Do not worry about performance issues; the more tables you have, the better the performance, because you are handing the performance work over to the database engine instead of taking it all into your own hands.
I'm trying to implement a key/value store with mysql
I have a user table that has 2 columns, one for the global ID and one for the serialized data.
Now the problem is that every time any bit of the user's data changes, I have to retrieve the serialized data from the db, alter it, then reserialize it and throw it back into the db. I have to repeat these steps even for a very, very small change to any of the user's data (since there's no way to update that cell within the db itself).
Basically, I'm looking for the solutions people normally use when faced with this problem.
Maybe you should preprocess your JSON data and insert it as proper MySQL rows separated into fields.
Since your input is JSON, you have various alternatives for converting data:
You mentioned many small changes happen in your case. Where do they occur? Do they happen in a member of a list? A top-level attribute?
If updates occur mainly in list members in a part of your JSON data, then perhaps every member should in fact be represented in a different table as separate rows.
If updates occur in an attribute, then represent it as a field.
I think cost of preprocessing won't hurt in your case.
When this is a problem, people do not use key/value stores, they design a normalized relational database schema to store the data in separate, single-valued columns which can be updated.
To be honest, your solution is using a database as a glorified file system - I would not recommend this approach for application data that is core to your application.
The best way to use a relational database, in my opinion, is to store relational data - tables, columns, primary and foreign keys, data types. There are situations where this doesn't work - for instance, if your data is really a document, or when the data structures aren't known in advance. For those situations, you can either extend the relational model, or migrate to a document or object database.
In your case, I'd see firstly if the serialized data could be modeled as relational data, and whether you even need a database. If so, move to a relational model. If you need a database but can't model the data as a relational set, you could go for a key/value model where you extract your serialized data into individual key/value pairs; this at least means that you can update/add the individual data field, rather than modify the entire document. Key/value is not a natural fit for RDBMSes, but it may be a smaller jump from your current architecture.
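A minimal sketch of that key/value shape (names are hypothetical) - each field of the formerly serialized blob becomes its own row, so a single value can be updated without rewriting the whole document:

CREATE TABLE user_data (
    user_id    INT UNSIGNED NOT NULL,
    data_key   VARCHAR(64)  NOT NULL,
    data_value TEXT,
    PRIMARY KEY (user_id, data_key)
) ENGINE=InnoDB;

-- Changing one field no longer means fetch, unserialize, edit, reserialize, write back:
UPDATE user_data
SET    data_value = 'Berlin'
WHERE  user_id = 42 AND data_key = 'address_city';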
When you have a key/value store, assuming your serialized data is JSON, it is effective only when you have memcached along with it, because you don't update the database on the fly every time; instead you update memcached and then push that to your database in the background. So yes, you have to update the entire value rather than an individual field in your JSON data (like the address alone) in the database. You can update and retrieve data quickly from memcached, and since there are no complex relations in the database it will be fast to push and pull data between the database and memcached.
I would continue with what you are doing and create separate tables for the indexable data. This allows you to treat your database as a single data store which is managed easily through most operation groups including updates, backups, restores, clustering, etc.
The only thing you may want to consider is adding Elasticsearch to the mix if you need to perform anything like a LIKE query, purely for improved search performance.
If space is not an issue for you, I would even make it an insert-only database, so any change adds a new record; that way you keep the history. Of course you may want to remove older records, but a background job can delete the superseded records in batches. (Mind you, what I just described is basically Kafka.)
There are many alternatives out there now that beat an RDBMS in terms of performance. However, they all add extra operational overhead, in that each is yet another piece of middleware to maintain.
The way around that, if you have a microservices architecture, is to keep the middleware as part of your microservice stack. However, you then have to deal with transmitting the data across the microservices, so you'd still end up switching to something like Kafka underneath it all.
For a project we have a bunch of data that always has the same structure and is not linked together.
There are two approaches to save the data:
Creating a new database for every pool (about 15-25 tables)
Creating all the tables in one database and differentiating the pools by table name.
Which one is easier and faster to handle for MySQL?
EDIT: I am not interested in issues of database design, I am just interested in which of the two possibilities is faster.
EDIT 2: I will try to make it clearer. As said, we will have data where the data in different pools rarely belongs together. Putting all the data of one type in one table and linking it with a pool id is not a good idea:
It is hard to back up/delete a specific pool (and we expect that we would run out of primary keys after a while, even when using BIGINT).
So the idea is to make a database for every pool or create a lot of tables in one database. 50% of the queries against the database will be simple inserts. 49% will be some simple selects on a primary key.
The question is, what is faster to handle for MySQL? Many tables or many databases?
There should be no significant performance difference between multiple tables in a single database versus multiple tables in separate databases.
In MySQL, databases (standard SQL uses the term "schema" for this) serve chiefly as a namespace for tables. A database has only a few attributes, e.g. the default character set and collation. And that usage of GRANT makes it convenient to control access privileges per database, but that has nothing to do with performance.
You can access tables in any database from a single connection (provided they are managed by the same instance of MySQL Server). You just have to qualify the table name:
SELECT * FROM database17.accounts_table;
This is purely a syntactical difference. It should have no effect on performance.
Regarding storage, you can't organize tables into a file per database as @Chris speculates. With the MyISAM storage engine, you always have a file per table. With the InnoDB storage engine, you either have a single set of storage files that amalgamate all tables, or else you have a file per table (this is configured for the whole MySQL server, not per database). In either case, there's no performance advantage or disadvantage to creating the tables in a single database versus many databases.
There aren't many MySQL configuration parameters that work per database. Most parameters that affect server performance are server-wide in scope.
Regarding backups, you can specify a subset of tables as arguments to the mysqldump command. It may be more convenient to back up logical sets of tables per database, without having to name all the tables on the command-line. But it should make no difference to performance, only convenience for you as you enter the backup command.
Why not create a single table to keep track of your pools (with PoolID and PoolName as your columns, plus whatever else you want to track), and then add a column to each of your 15-25 tables as a foreign key back to your pool table, so you know which pool a particular record belongs to?
If you don't want to mix the data like that, I would suggest making multiple databases. Creating multiple tables all for the same functionality makes my spider sense tingle.
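For reference, a sketch of the pool-table approach described above (all names are placeholders; measurements stands in for any of your 15-25 tables):

CREATE TABLE pools (
    PoolID   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    PoolName VARCHAR(100) NOT NULL,
    PRIMARY KEY (PoolID)
) ENGINE=InnoDB;

CREATE TABLE measurements (
    MeasurementID BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    PoolID        INT UNSIGNED NOT NULL,
    Value         DECIMAL(12,4),
    PRIMARY KEY (MeasurementID),
    KEY ix_measurements_pool (PoolID),
    FOREIGN KEY (PoolID) REFERENCES pools (PoolID)
) ENGINE=InnoDB;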
If you don't want one set of tables with poolID/poolName as TheTXI suggested, use separate databases rather than multiple tables that all do the same thing.
That way, you restrict the variation between accessing different pools to the initial "USE database" statement; you won't have to recode your SELECTs each time or resort to dynamic SQL.
The other advantages of this approach are:
Easy backup/restore
Easy start/stop of a database instance.
Disadvantages are:
a little bit more admin work, but not much.
I don't know what your application is, but really really think carefully before creating all of the tables in one database. That way madness lies.
Edit: If performance is the only thing that concerns you, you need to measure it. Take a representative set of queries and measure their performance.
Edit 2: The difference in performance for a single query between the many-tables and many-databases models will be negligible. If you have one database, you can tune the hell out of it. If you have many databases, you can tune the hell out of all of them.
My (our? - can't speak for anyone else) point is that, for well tuned database(s), there will be practically no difference in performance between the three options (poolid in table, multiple tables, multiple databases), so you can pick the option which is easiest for you, in the short AND long term.
For me, the best option is still one database with poolId, as TheTXI suggested, followed by multiple databases, depending upon your (mostly administrative) needs. If you need to know exactly what the difference in performance is between two options, we can't give you that answer. You need to set it up and test it.
With multiple databases, it becomes easy to throw hardware at it to improve performance.
In the situation you describe, experience has led me to believe that you'll find the separate databases to be faster when you have a large number of pools.
There's a really important general principle to observe here, though: Don't think about how fast it'll be, profile it.
I'm not too sure I completely understand your scenario. Do you want to have all the pools using the same tables, but just differing by a distinguishing key? Or do you want separate pools of tables within the one database, with a suffix on each table to distinguish the pools?
Either way though, you should have multiple databases for two major reasons. The first being if you have to change the schema on one pool, it won't affect the others.
The second, if your load goes up (or for any other reason), you may want to move the pools onto separate physical machines with new database servers.
Also, security access to a database server can be more tightly locked down.
All of these things can still be accomplished without requiring separate databases - but the separation will make all of this easier and reduce the complexity of having to mentally track which tables you want to operate on.
Differing the pools by table name or putting them in separate databases is about the same thing. However, if you have lots of tables in one database, MySQL has to load the table information and do a security check on all those tables when logging in/connecting.
As others mentioned, separate databases will allow you to shift things around and create optimizations specific to a certain pool (i.e. compressed tables). It is extra admin overhead, but there is considerably more flexibility.
Additionally, you can always "pool" the tables that are in separate databases by using federated or merge tables to simplify querying if needed.
As for running out of primary keys, you could always use a compound primary key if you are using MyISAM tables. For example, you could have a field called groupCode (any type) and another called sequenceId (auto increment), and make your primary key groupCode+sequenceId. The sequenceId will then increment based on the next unique ID within that group code set.
For example:
AAA 1
AAA 2
BBB 1
AAA 3
CCC 1
AAA 4
BBB 2
...
Although with large tables you have to be careful about caching and make sure the file system you are using handles large files.
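A sketch of that compound key - MyISAM allows AUTO_INCREMENT on the second column of a composite primary key, and the counter restarts for each groupCode prefix (the table name is made up):

CREATE TABLE pool_records (
    groupCode  CHAR(3) NOT NULL,
    sequenceId INT UNSIGNED NOT NULL AUTO_INCREMENT,
    PRIMARY KEY (groupCode, sequenceId)
) ENGINE=MyISAM;

INSERT INTO pool_records (groupCode) VALUES ('AAA'), ('AAA'), ('BBB'), ('AAA');
-- yields (AAA,1), (AAA,2), (BBB,1), (AAA,3), as in the example above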
I don't know mysql very well, but I think I'll have to give the standard performance answer -- "It depends".
Some thoughts (dealing only with performance/maintenance, not database design):
Creating a new database means a separate file (or files) in the file system. These files could then be put on different filesystems if performance of one needs to be separate from the others, etc.
A new database will probably handle caching differently; eg. All tables in one DB is going to mean a shared cache for the DB, whereas splitting the tables into separate databases means each database can have a separate cache [obviously all databases will share the same physical memory for cache, but there may be a limit per database, etc].
Related to the separate files, this means that if one of your datasets becomes more important than the others, it can easily be pulled off to a new server.
Separating the databases has an added benefit of allowing you to deploy updates one-at-a-time more easily than with the single database.
However, to contrast, having multiple databases means the server will probably be using more memory (since it has multiple caches). I'm sure there are more "cons" for the multi-database approach, but I am drawing a blank now.
So I suppose I would recommend the multi-database approach. Obviously this is only with the understanding that there may very well be a better "database-designy" way of handling whatever you are actually doing.
Given the restrictions you've placed on it, I'd rather spin up more tables in the existing database than have to connect to multiple databases. Managing connection strings TENDS to be harder, in addition to managing the different database optimizations you may have.
FTR, in normal circumstances I'd take the approach described by TheTXI.
In answer to your specific question though, I have found it to be dependent on usage. (Cop out, I know, but hear me out.)
A single database is probably easier. You'll have to worry about just one connection and would still have to specify tables. Multiple databases could, under certain conditions, be faster though.
If I were you I'd try both. There's no way we'll be able to give you a useful answer.