long many to many-database table: best performance practice - mysql

I have a question about the performance of my MySQL database design.
Table A has a lot of records, say a million, and table B also has a million. There is another table C in which every record id of A is connected to every row in B and this connection has an additional value 1 or 0. So functionally speaking, every record in A has a boolean vector where B contains the 'variables' of the vector and 1 or 0 is the value. It's explained more graphically in the image on bottom.
Table C will have a lot of write and read actions (select all values from a record of A), so the table is very actively used. And table C is really long: a million times a million rows.
My first question is, will the length of the table cause a performance issue? The database needs to be really fast.
My second question is, if this is badly designed, whether there is a better design to achieve what I want. For instance, I can think of storing the entire B vector of each A record inside each row of A. Then table C will not be necessary. But it will make selecting, reading, and writing much more difficult.

The table design is fine and shouldn't be a problem, because you access records via IDs which should be indexed. Depending on your typical queries you should also consider adding composite indexes (c(a_id,b_id), c(a_id,value), c(b_id,value), c(a_id,b_id,value)).
However, as there exist only two states, 0 and 1, you may decide to store only one of them. I.e. if you store only the state 1 records, all pairs not in the table then implicitly have state 0. This pays off especially when the states are unevenly distributed (say 90% of the records have state 0 and only 10% have state 1) or when you usually access only one of the states (e.g. you always look for 1s).
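A minimal sketch of the "store only the 1s" idea. It uses Python's built-in sqlite3 module as a stand-in for MySQL, purely so that it runs anywhere; that substitution is an assumption, not part of the question. The table and column names (c, a_id, b_id) follow the index notation above, and the helper functions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Only pairs whose value is 1 are stored; a missing pair implicitly means 0.
# The composite primary key doubles as the (a_id, b_id) index.
conn.execute(
    "CREATE TABLE c ("
    " a_id INTEGER NOT NULL,"
    " b_id INTEGER NOT NULL,"
    " PRIMARY KEY (a_id, b_id))"
)

def set_value(a_id, b_id, value):
    """Store a 1 by inserting the pair, a 0 by deleting it."""
    if value:
        # SQLite spelling; the MySQL equivalent would be INSERT IGNORE.
        conn.execute(
            "INSERT OR IGNORE INTO c (a_id, b_id) VALUES (?, ?)", (a_id, b_id)
        )
    else:
        conn.execute(
            "DELETE FROM c WHERE a_id = ? AND b_id = ?", (a_id, b_id)
        )

def get_value(a_id, b_id):
    """Return 1 if the pair is stored, otherwise the implicit 0."""
    row = conn.execute(
        "SELECT 1 FROM c WHERE a_id = ? AND b_id = ?", (a_id, b_id)
    ).fetchone()
    return 1 if row else 0

def ones_for(a_id):
    """All b_ids whose value is 1 for one record of A."""
    cursor = conn.execute(
        "SELECT b_id FROM c WHERE a_id = ? ORDER BY b_id", (a_id,)
    )
    return [b_id for (b_id,) in cursor]

set_value(1, 10, 1)
set_value(1, 20, 1)
set_value(1, 20, 0)  # flipping back to 0 removes the row again
```

With this layout the table only grows with the number of 1s, which is where the saving comes from when most pairs are 0.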

Answer to your first question
Millions of records in a table with multiple reads and writes won't be a bottleneck if you follow MySQL best practices:
Your engine should be InnoDB.
Your SELECT queries should not involve a full table scan.
Your table should have the indexes your queries need.
Answer to your second question
You should look at all your possible use cases, because either way is a good idea if a use case supports it.
If you split your data across multiple tables, then a join operation has to be performed whenever the data is needed together.

Related

When to add a column vs adding a related table?

I have a big table with over 100 million rows. I have been trimming it down for months getting rid of bad data (rows wise), trying to keep it small. I already had 9 columns on this table. I want to add a new boolean column to it. Below is the current state.
This table started off small, and now it's getting pretty wide. Yet again, I am tasked with adding more information per row. This time it's a new boolean field. I expect this field to be low volume, meaning less than 10% of rows will have it set to true. I know I can make it default to NULL, and it is a boolean column, which should be small.
However, I wanted to get some advice. This table cannot get infinitely wide, and I will need to work around this. Under these circumstances, does it make more sense to create another table and reference the record with a foreign key when I have additional data to add? How do the pros handle this in database design?
The best situation for usability is to have all data on the record so any form of a query can get or calculate on the table itself without joins. I just do not have confidence that it will scale to 1 BILLION rows (insert meme).
At my job I support MySQL instances that have multi-billion row tables. At that scale, care must be taken to optimize queries properly. You don't want to do a table-scan at that scale.
But that's about rows, not columns. You asked first about columns.
The way to choose between adding a column versus adding another table is to follow rules of database normalization. If the new column is for an attribute of the same entity as your current table, add the column to that table. If it's a multi-valued attribute or if it's really an attribute of some other entity, then add it to a different table.
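As a rough illustration of that rule, here is a sketch using Python's built-in sqlite3 module in place of MySQL (an assumption made only so the example is runnable). The item and item_tag tables and the is_flagged column are hypothetical names, not taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);

-- A single-valued attribute of the same entity: add a column.
ALTER TABLE item ADD COLUMN is_flagged INTEGER NOT NULL DEFAULT 0;

-- A multi-valued attribute: one row per value in a separate table.
CREATE TABLE item_tag (
    item_id INTEGER NOT NULL REFERENCES item(id),
    tag     TEXT NOT NULL,
    PRIMARY KEY (item_id, tag)
);
""")

conn.execute("INSERT INTO item (id, name) VALUES (1, 'widget')")
conn.execute("UPDATE item SET is_flagged = 1 WHERE id = 1")
conn.executemany(
    "INSERT INTO item_tag (item_id, tag) VALUES (?, ?)",
    [(1, "red"), (1, "small")],
)

# The boolean is read straight off the row; the tags need the second table.
flag = conn.execute("SELECT is_flagged FROM item WHERE id = 1").fetchone()[0]
tags = [
    tag
    for (tag,) in conn.execute(
        "SELECT tag FROM item_tag WHERE item_id = 1 ORDER BY tag"
    )
]
```

The new boolean in the question is one value per row, so under this rule it belongs on the existing table.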
Very, very rarely is it the right choice to make another table solely for the sake of having too many columns. A given MySQL table can have dozens of columns pretty easily, and hundreds if you're careful.
In theory, there is no limit to the number of columns that might be appropriate to put in the same table with respect to normalization. But there are limitations due to the code to store those columns in a given implementation (e.g. InnoDB storage engine in MySQL).
See https://www.percona.com/blog/2013/04/08/understanding-the-maximum-number-of-columns-in-a-mysql-table/
So the maximum number of columns for a table in MySQL is somewhere between 191 and 2829, depending on a number of factors.
In the comments on that blog, I was able to design a table that failed to be created at 59 columns. Read the blog for details.

Best database design to have efficient analysis on it with some millions records

I have a basic question about database designing.
I have a lot of files which I have to read and insert into a database. Each file has a few thousand lines and each line has about 30 fields (of these types: small int, int, big int, varchar, json). Of course I use multiple threads along with bulk inserts to increase insert speed (in the end I have 30-40 million records).
After inserting, I want to run some sophisticated analysis, and performance is important to me.
Now I have the fields of each line and I'm ready to insert, so I have 3 approaches:
1- One big table:
In this case I can create a big table with 30 columns and store all of the files' fields in it. So there is one table of huge size on which I want to run a lot of analysis.
2- A fairly large table (A) and some little tables (B)s
In this case I can create some little tables consisting of the columns whose values repeat a lot once they are separated from the other columns. So these little tables have just a few hundred or thousand records instead of 30 million. In the fairly large table (A), I omit the columns that I moved to another table and use a foreign key in their place. In the end I have a table (A) with 20 columns and 30 million records and some tables (B) with 2-3 columns and 100-50000 records each. So in order to analyze table A, I have to use some joins, for example in SELECT.
3- Just a fairly large table
In this case I can create a fairly large table like table A in the case above (with 20 columns), and instead of using foreign keys, I use a mapping between source columns and destination columns (this is something like foreign keys but slightly different). For example, I have 3 columns c1, c2, c3 that in case 2 I put in another table B and access through a foreign key, but now I assign a specific number to each distinct combination of c1, c2, c3 at insert time and store the relation between the combination and its assigned value in the program code. So this table is exactly like table A in case 2, but there is no need to use a join in SELECT.
While the insert time is important, the analysis time matters more to me, so I want to know your opinion about which of these cases is better, and I will also be glad to see other solutions.
From a design perspective 30 to 40 million is not that bad a number. Performance is fully dependent on how you would design your DB to be.
If you are using SQL Server then you could consider putting the large table on a separate database file group. I have worked on a similar case where we had around 1.8 billion records in a single table.
If your analysis will not look at the entire data set in one shot, you could consider partitioning the data. You could use a partition scheme based on your needs. For example, splitting the data into yearly partitions (a form of horizontal partitioning) will help if your analysis is limited to a year's worth of data.
The major thing would be denormalization or normalization based on your needs and, of course, clustered and non-clustered indexing of the data. Again, this will depend on what sort of analysis queries you would be using.
A single thread can INSERT one row at a time and finish 40M rows in a day or two. With LOAD DATA, you can do it in perhaps an hour or less.
But is loading the real question? For doing grouping, summing, etc, the question is about SELECT. For "analytics", the question is not one of table structure. Have a single table for the raw data, plus one or more "Summary tables" to make the selects really fast for your typical queries.
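A small sketch of the raw table plus summary table pattern, again using Python's sqlite3 module as a stand-in for MySQL (an assumption for runnability). The table names, columns, and sample rows are all made up for illustration, since the question gives no details about the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Raw data: one row per input line (hypothetical columns).
CREATE TABLE raw_event (
    id        INTEGER PRIMARY KEY,
    event_day TEXT NOT NULL,
    category  TEXT NOT NULL,
    amount    INTEGER NOT NULL
);

-- Summary table: one row per (day, category), much smaller than the raw data.
CREATE TABLE daily_summary (
    event_day    TEXT NOT NULL,
    category     TEXT NOT NULL,
    row_count    INTEGER NOT NULL,
    total_amount INTEGER NOT NULL,
    PRIMARY KEY (event_day, category)
);
""")

conn.executemany(
    "INSERT INTO raw_event (event_day, category, amount) VALUES (?, ?, ?)",
    [
        ("2024-01-01", "a", 5),
        ("2024-01-01", "a", 7),
        ("2024-01-01", "b", 1),
        ("2024-01-02", "a", 3),
    ],
)

def rebuild_summary():
    """Recompute the summary from scratch (simplest possible refresh)."""
    conn.execute("DELETE FROM daily_summary")
    conn.execute(
        "INSERT INTO daily_summary "
        "SELECT event_day, category, COUNT(*), SUM(amount) "
        "FROM raw_event GROUP BY event_day, category"
    )

rebuild_summary()

# Analytics queries read the small summary table, not the raw rows.
totals = conn.execute(
    "SELECT category, SUM(total_amount) FROM daily_summary "
    "GROUP BY category ORDER BY category"
).fetchall()
```

A full rebuild is only the simplest refresh strategy; with tens of millions of rows one would more likely update the summary incrementally as each batch is loaded.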
Until you give more details about the data, I cannot give more details about a custom solution.
Partitioning (vertical or horizontal) is unlikely to help much in MySQL. (Again, details needed.)
Normalization shrinks the data, which leads to faster processing. But, it sounds like the dataset is so small that it will all fit in RAM?? (I assume your #2 is 'normalization'?)
Beware of over-normalization.

Does MySQL table size matter when doing JOINs?

I'm currently trying to design a high-performance database for tracking clicks and then displaying analytics of these clicks.
I expect at least 10M clicks to be coming in per 2 weeks time.
There are a few variables (each of them would need its own column) that I'll allow people to use with the click tracking, but I don't want to limit the number of these variables to 5 or so. That's why I thought about creating Table B where I can store these variables for each click.
However, each click might have 5-15+ of these variables depending on how many the user is using. If I store them in a separate table, that will multiply the 10M rows per 2 weeks by the number of variables the user might use.
In order to display analytics for the variables, I'll need to JOIN the tables.
Looking at both writing and most importantly reading performance, is there any difference if I JOIN a 100M rows table to a:
500 rows table OR to a 100M rows table?
Would anyone recommend denormalizing it, like having 20 columns and storing NULL values if they're not in use?
is there any difference if I JOIN a 100M rows table to a...
Yes, there is. A JOIN's performance depends mainly on how long it takes to find matching rows for your ON condition. This means increasing the number of rows in a joined table will increase the JOIN time, since there are more rows to sift through for matches. In the worst case (no usable index), a JOIN can be thought of as taking A*B time, where A is the number of rows in the first table and B is the number of rows in the second. This is a very broad statement, as there are many strategies the optimizer may use to improve on this value, but it can be thought of as a general rule.
To increase a JOIN's efficiency, for reads specifically, you should look into indexing. Indexing lets you mark a column that the database should keep a running track of, to allow quicker evaluation of its values. This increases the cost of any write operation, since each write must also modify an additional data structure, usually a B-tree, but it decreases the time of read operations, since the data is kept sorted in this structure, allowing quick lookups.
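The effect of an index on the join column can be seen in a query plan. The sketch below uses Python's sqlite3 module and SQLite's EXPLAIN QUERY PLAN rather than MySQL's EXPLAIN (an assumption made so the example is self-contained); the clicks and click_vars tables and the index name are hypothetical, and the exact plan wording differs between engines and versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clicks (
    id         INTEGER PRIMARY KEY,
    clicked_at TEXT NOT NULL
);
CREATE TABLE click_vars (
    click_id INTEGER NOT NULL,
    name     TEXT NOT NULL,
    value    TEXT
);
""")
conn.executemany(
    "INSERT INTO clicks (id, clicked_at) VALUES (?, ?)",
    [(i, "2024-01-01") for i in range(1, 101)],
)
conn.executemany(
    "INSERT INTO click_vars (click_id, name, value) VALUES (?, ?, ?)",
    [(i, "source", f"s{i % 3}") for i in range(1, 101)],
)

QUERY = (
    "SELECT v.name, v.value "
    "FROM clicks AS c JOIN click_vars AS v ON v.click_id = c.id "
    "WHERE c.id = ?"
)

def plan(sql, params):
    """Return the human-readable steps of SQLite's query plan."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]

# Without an index on click_vars.click_id the engine has no direct way
# to find the matching rows for a given click.
plan_before = plan(QUERY, (42,))
result_before = conn.execute(QUERY, (42,)).fetchall()

conn.execute("CREATE INDEX idx_click_vars_click_id ON click_vars (click_id)")

# With the index, matching rows are located through the index instead.
plan_after = plan(QUERY, (42,))
result_after = conn.execute(QUERY, (42,)).fetchall()

for step in plan_after:
    print(step)
```

The query returns the same rows either way; only the amount of work needed to find them changes, which is the difference that grows as the joined table gets larger.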
Would anyone recommend denormalizing it, like having 20 columns and storing NULL values if they're not in use?
There are a lot of factors that go into saying yes or no here. Mainly: would storage space be an issue, and how likely is duplicate data to appear? If storage space is not an issue and duplicates are not likely to appear, then one large table may be the right decision. If you have limited storage space, then storing the excess NULLs may not be smart. If you have many duplicate values, then one large table may be less efficient than a JOIN.
Another factor to consider when denormalizing is if another table would ever want to access values from just one of the previous two tables. If yes, then the JOIN to obtain these values after denormalizing would be more inefficient than having the two tables separate. This question is really something you need to handle yourself when designing the database and seeing how it is used.
First: There is a huge difference between joining 10m to 500 or 10m to 10m entries!
But using a proper index and a structured table design will make this manageable for your goals, I think (at least depending on the hardware used to run the application).
I would definitely NOT recommend using denormalized tables, because adding more than your 20 values will be a mess once you have 20m entries in your table. So even if there are some good reasons that might speak for denormalized tables (performance, tablespace, ...), this is a bad idea for further changes. But in the end it's your decision ;)

Database Optimisation through denormalization and smaller rows

Do tables with many columns take more time than tables with fewer columns during a SELECT or UPDATE query? (The row count is the same, and I will update/select the same number of columns in both cases.)
Example: I have a database to store user details and their last active timestamp. On my website, I only need to show active users and their names.
Say one table named userinfo has the following columns: (id,f_name,l_name,email,mobile,verified_status). Is it a good idea to store the last active time in the same table as well? Or is it better to make a separate table (say, user_active) to store the last activity timestamp?
The reason I am asking: if I make two tables, the userinfo table will only be accessed during new signups (to INSERT a new user row), and I will use the user_active table (the table with fewer columns) to UPDATE the timestamp and SELECT active users frequently.
But the cost I have to pay for creating two tables is data duplication as user_active table columns will be (id, f_name, timestamp).
The answer to your question is that, to a close approximation, having more columns in a table does not really take more time than having fewer columns for accessing a single row. This may seem counter-intuitive, but you need to understand how data is stored in databases.
Rows of a table are stored on data pages. The cost of a query is highly dependent on the number of pages that need to be read and written during the course of the query. Parsing the row from the data page is usually not a significant performance issue.
Now, wider rows do have a very slight performance disadvantage, because more data would (presumably) be returned to the user. This is a very minor consideration for rows that fit on a single page.
On a more complicated query, wider rows have a larger performance disadvantage, because more data pages need to be read and written for a given number of rows. For a single row, though, one page is being read and written -- assuming you have an index to find that row (which seems very likely in this case).
As for the rest of your question: the structure of your second table is not correct. You would not (normally) include f_name in two tables. That is data redundancy and causes all sorts of other problems. There is a legitimate question whether you should store a table of all activity and use that table for display purposes, but that is not the question you are asking.
Finally, for the data volumes you are talking about, having a few extra columns would make no noticeable difference on any reasonable transaction volume. Use one table if you have one attribute per entity and no compelling reason to do otherwise.
When returning and parsing a single row, the number of columns is unlikely to make a noticeable difference. However, searching and scanning tables with smaller rows is faster than tables with larger rows.
When searching using an index, MySQL walks a B-tree, which behaves much like a binary search, so it would require significantly larger rows (and many rows) before any speed penalty is noticeable.
Scanning is a different matter. When scanning, it's reading through all of the data for all of the rows, so there's a 1-to-1 performance penalty for larger rows. Yet, with proper indexes, you shouldn't be doing much scanning.
However, in this case, keep the date together with the user info because they'll be queried together and there's a 1-to-1 relationship, and a table with larger rows is still going to be faster than a join.
Only denormalize for optimization when performance becomes an actual problem and you can't resolve it any other way (adding an index, improving hardware, etc.).
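A sketch of the single-table layout recommended above, using Python's sqlite3 module in place of MySQL (an assumption for runnability). The userinfo columns come from the question; the last_active column, the index name, the helper functions, and the sample users are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE userinfo (
    id              INTEGER PRIMARY KEY,
    f_name          TEXT NOT NULL,
    l_name          TEXT NOT NULL,
    email           TEXT NOT NULL,
    mobile          TEXT,
    verified_status INTEGER NOT NULL DEFAULT 0,
    last_active     INTEGER  -- Unix timestamp; NULL until first activity
);

-- The index keeps "who was active recently" from scanning every row.
CREATE INDEX idx_userinfo_last_active ON userinfo (last_active);
""")

conn.executemany(
    "INSERT INTO userinfo (id, f_name, l_name, email) VALUES (?, ?, ?, ?)",
    [
        (1, "Ann", "Lee", "ann@example.com"),
        (2, "Bob", "Ray", "bob@example.com"),
        (3, "Cy", "Poe", "cy@example.com"),
    ],
)

def touch(user_id, now):
    """Record activity: a single-row update located by primary key."""
    conn.execute(
        "UPDATE userinfo SET last_active = ? WHERE id = ?", (now, user_id)
    )

def active_since(cutoff):
    """First names of users active at or after the cutoff timestamp."""
    cursor = conn.execute(
        "SELECT f_name FROM userinfo WHERE last_active >= ? ORDER BY f_name",
        (cutoff,),
    )
    return [f_name for (f_name,) in cursor]

touch(1, 1000)
touch(2, 2000)
recent = active_since(1500)
```

No name is duplicated and no join is needed: the timestamp lives on the same row as the name it is displayed with.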

What is the best usage of columns?

I have one table called Table1, which has around 20 columns. Half of these columns are string values, and the rest are integers. My question is simple: what's better, having all the columns in only one table, or having them distributed across 2, 3 or even 4 tables? In the latter case, I'd have to join them using LEFT JOIN.
What's the best choice?
Thanks
The question of "best" depends on how the table is being used. So, there is no real answer to the question. I can say that 20 columns is not a lot and many very reasonable tables have more than 20 columns of mixed types.
First observation: If you are asking such a question, you have some knowledge of SQL but not in-depth knowledge. One table is almost certainly the way to go.
What might change this advice? If many of the integer columns are NULL -- say 90% of the records have all of them as NULL -- then those NULL values are probably just wasting space on the data page. By eliminating those rows and storing the values in another table, you would reduce the size of the data.
The same is true of the string values, but with a caveat. Whereas the integers occupy at least 4 bytes, variable length strings might be even smaller (depends on the exact way that the database stores them).
Another reason would be on how the data is typically used. If the queries are usually using just a handful of columns, then storing each column in a separate table could be beneficial. To be honest, the overhead of the key column generally overwhelms any savings. And, such a data structure is really lousy for updates, inserts, and deletes.
However, this becomes quite practical in a columnar database such as Paraccel, Amazon Redshift, or Vertica. Such databases have built-in support for this type of splitting and it can have some very remarkable effects on performance.
Answering this with an example for users table -
1) `users` - id, name, dob, city, zipcode etc.
2) `users_products` - id, user_id(FK), product_name, product_validity,...
3) `users_billing_details` - id, user_id(FK to `users`), billing_name, billing_address..
4) `users_friends` - id, user_id(FK to `users`), friend_id(FK to same table `users`)
Hence, if you have many relations, use a many-to-many relationship. If there are only a few relationships, keep them in the same table. It all depends on your structure and requirements.
SUGGESTION - Many-to-Many makes your data structure more flexible.
You can have 20 columns in 1 table. Nothing wrong with that. But then are you sure you are designing the structure properly?
Could some of these data change significantly in the future?
Is the table trying to encapsulate a single activity or entity?
Does the table have a singular meaning with respect to the domain or does it encapsulate multiple entities?
Could the structure be simplified into smaller tables having singular meaning for each table and then "Relationships" added via primary key/foreign keys?
These are some of the questions you take into consideration while designing a database.
If you find answers to these questions, you will know for yourself whether you should have a single table or multiple tables.