Database Optimisation through denormalization and smaller rows


Do tables with many columns take more time than tables with fewer columns during a SELECT or UPDATE query? (The row count is the same, and I will update/select the same number of columns in both cases.)
Example: I have a database that stores user details and each user's last-active timestamp. On my website, I only need to show active users and their names.
Say one table named userinfo has the following columns: (id, f_name, l_name, email, mobile, verified_status). Is it a good idea to store the last-active time in the same table? Or is it better to make a separate table (say, user_active) to store the last-activity timestamp?
The reason I am asking: if I make two tables, the userinfo table will only be accessed during new signups (to INSERT a new user row), and I will use the user_active table (the table with fewer columns) to UPDATE the timestamp and SELECT active users frequently.
But the cost of creating two tables is data duplication, as the user_active table's columns will be (id, f_name, timestamp).

The answer to your question is that, to a close approximation, having more columns in a table does not really take more time than having fewer columns for accessing a single row. This may seem counter-intuitive, but you need to understand how data is stored in databases.
Rows of a table are stored on data pages. The cost of a query is highly dependent on the number of pages that need to be read and written during the course of the query. Parsing the row from the data page is usually not a significant performance issue.
Now, wider rows do have a very slight performance disadvantage, because more data would (presumably) be returned to the user. This is a very minor consideration for rows that fit on a single page.
On a more complicated query, wider rows have a larger performance disadvantage, because more data pages need to be read and written for a given number of rows. For a single row, though, one page is being read and written -- assuming you have an index to find that row (which seems very likely in this case).
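To make the page argument concrete, here is a rough back-of-the-envelope sketch in Python. It assumes InnoDB's default 16 KB page size and ignores page headers, fill factor, and per-row overhead; the row widths and the row count are made-up numbers for illustration only.

```python
import math

PAGE_SIZE = 16 * 1024  # InnoDB's default page size in bytes (assumes default configuration)

def rows_per_page(row_bytes: int) -> int:
    """Rough number of rows that fit on one page, ignoring all overhead."""
    return max(1, PAGE_SIZE // row_bytes)

def pages_for_scan(row_count: int, row_bytes: int) -> int:
    """Pages that must be read to visit every row of the table."""
    return math.ceil(row_count / rows_per_page(row_bytes))

# Hypothetical row widths: a narrow table vs. the same table with extra columns.
narrow_row, wide_row = 100, 200
row_count = 1_000_000

scan_narrow = pages_for_scan(row_count, narrow_row)
scan_wide = pages_for_scan(row_count, wide_row)
print(scan_narrow, scan_wide)
```

Under these assumptions, doubling the row width roughly doubles the pages a full scan has to read, while a single-row lookup through an index still ends on one data page either way.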
As for the rest of your question: the structure of your second table is not correct. You would not (normally) include f_name in two tables -- that is data redundancy and causes all sorts of other problems. There is a legitimate question whether you should store a table of all activity and use that table for display purposes, but that is not the question you are asking.
Finally, for the data volumes you are talking about, having a few extra columns would make no noticeable difference on any reasonable transaction volume. Use one table if you have one attribute per entity and no compelling reason to do otherwise.
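A minimal sketch of the one-table design, using Python's built-in sqlite3 module as a stand-in for MySQL (types and syntax differ slightly between the two). The last_active column name, the index name, and the sample rows are assumptions made for illustration; they are not from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE userinfo (
    id              INTEGER PRIMARY KEY,
    f_name          TEXT NOT NULL,
    l_name          TEXT NOT NULL,
    email           TEXT NOT NULL,
    mobile          TEXT,
    verified_status INTEGER NOT NULL DEFAULT 0,
    last_active     INTEGER  -- Unix timestamp (hypothetical column name)
);
CREATE INDEX idx_userinfo_last_active ON userinfo (last_active);
""")

conn.executemany(
    "INSERT INTO userinfo (id, f_name, l_name, email, last_active) VALUES (?, ?, ?, ?, ?)",
    [(1, "Ann", "Lee", "ann@example.com", 1000),
     (2, "Bob", "Ray", "bob@example.com", 5000),
     (3, "Cy", "Poe", "cy@example.com", None)],
)

def touch(user_id: int, now: int) -> None:
    """Record activity: a single-row UPDATE located through the primary key."""
    conn.execute("UPDATE userinfo SET last_active = ? WHERE id = ?", (now, user_id))

def active_since(cutoff: int) -> list:
    """First names of users active at or after the cutoff."""
    cur = conn.execute(
        "SELECT f_name FROM userinfo WHERE last_active >= ? ORDER BY id", (cutoff,))
    return [row[0] for row in cur]

touch(3, 6000)
recent = active_since(4000)
print(recent)
```

The name lives in exactly one place, so there is nothing to keep in sync, and both the frequent UPDATE and the frequent SELECT can be served through an index.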

When returning and parsing a single row, the number of columns is unlikely to make a noticeable difference. However, searching and scanning tables with smaller rows is faster than tables with larger rows.
When searching using an index, MySQL walks a B-tree, which behaves much like a binary search: the number of steps grows with the logarithm of the row count. It would therefore take significantly larger rows (and many rows) before any speed penalty is noticeable.
Scanning is a different matter. When scanning, it's reading through all of the data for all of the rows, so there's a 1-to-1 performance penalty for larger rows. Yet, with proper indexes, you shouldn't be doing much scanning.
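The difference between the two access paths can be put into rough numbers. The sketch below is plain arithmetic, not a measurement: it approximates an index search by the comparison count of a binary search and a scan by the bytes it has to read. The row counts and row widths are hypothetical.

```python
import math

def index_lookup_steps(row_count: int) -> int:
    """Approximate key comparisons for a binary search over row_count sorted keys."""
    return max(1, math.ceil(math.log2(row_count)))

def scan_bytes(row_count: int, row_bytes: int) -> int:
    """Bytes read by a full scan: every row, in full."""
    return row_count * row_bytes

steps_1m = index_lookup_steps(1_000_000)
steps_100m = index_lookup_steps(100_000_000)
print(steps_1m, steps_100m)

# Doubling the row width doubles the scan cost but leaves the step count unchanged.
print(scan_bytes(1_000_000, 100), scan_bytes(1_000_000, 200))
```

Going from one million to one hundred million rows adds only a handful of steps to the indexed search, whereas the scan cost grows in direct proportion to both row count and row width.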
However, in this case, keep the date together with the user info because they'll be queried together and there's a 1-to-1 relationship, and a table with larger rows is still going to be faster than a join.
Only denormalize for optimization when performance becomes an actual problem and you can't resolve it any other way (adding an index, improving hardware, etc.).

Related

When to add a column vs adding a related table?

I have a big table with over 100 million rows. I have been trimming it down for months, getting rid of bad data (row-wise), trying to keep it small. I already had 9 columns on this table. I want to add a new boolean column to it. Below is the current state.
This table started off small, and now it's getting pretty wide. Yet again, I am tasked with adding more information per row. This time it's a new boolean field. I expect this field to be low volume, meaning less than 10% of rows will have it set to true. I know I can make it default to null, and it is a boolean column, which should be small.
However, I wanted to get some advice. This table cannot get infinitely wide, and I will need to work around this. Under these circumstances, does it make more sense to create another table and reference the record with a foreign key when I have additional data to add? How do the pros handle this in database design?
The best situation for usability is to have all data on the record so any form of a query can get or calculate on the table itself without joins. I just do not have confidence that it will scale to 1 BILLION rows (insert meme).

At my job I support MySQL instances that have multi-billion row tables. At that scale, care must be taken to optimize queries properly. You don't want to do a table-scan at that scale.
But that's about rows, not columns. You asked first about columns.
The way to choose between adding a column versus adding another table is to follow rules of database normalization. If the new column is for an attribute of the same entity as your current table, add the column to that table. If it's a multi-valued attribute or if it's really an attribute of some other entity, then add it to a different table.
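As an illustration of that rule, here is a small sketch using Python's sqlite3 module as a stand-in for MySQL. The table and column names (records, is_flagged, record_tags) are hypothetical. In MySQL the new column would typically be declared as BOOLEAN, which is an alias for TINYINT(1).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE records (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);

-- Single-valued attribute of the same entity: add a column to the existing table.
ALTER TABLE records ADD COLUMN is_flagged INTEGER NOT NULL DEFAULT 0;

-- Multi-valued attribute: one row per value in a separate child table.
CREATE TABLE record_tags (
    record_id INTEGER NOT NULL REFERENCES records (id),
    tag       TEXT NOT NULL,
    PRIMARY KEY (record_id, tag)
);
""")

conn.execute("INSERT INTO records (id, name) VALUES (1, 'a'), (2, 'b')")
conn.execute("UPDATE records SET is_flagged = 1 WHERE id = 2")
conn.executemany("INSERT INTO record_tags VALUES (?, ?)", [(1, "x"), (1, "y")])

flagged = [r[0] for r in conn.execute("SELECT id FROM records WHERE is_flagged = 1")]
tags = [r[0] for r in conn.execute(
    "SELECT tag FROM record_tags WHERE record_id = 1 ORDER BY tag")]
print(flagged, tags)
```

The boolean belongs to the record itself, so it is a column; the tags can occur several times per record, so they get their own table.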
Very, very rarely is it the right choice to make another table solely for the sake of having too many columns. A given MySQL table can have dozens of columns pretty easily, and hundreds if you're careful.
In theory, there is no limit to the number of columns that might be appropriate to put in the same table with respect to normalization. But there are limitations due to the code to store those columns in a given implementation (e.g. InnoDB storage engine in MySQL).
See https://www.percona.com/blog/2013/04/08/understanding-the-maximum-number-of-columns-in-a-mysql-table/
So the maximum number of columns for a table in MySQL is somewhere between 191 and 2829, depending on a number of factors.
In the comments on that blog, I was able to design a table that failed to be created at 59 columns. Read the blog for details.

Does MySQL table size matter when doing JOINs?

I'm currently trying to design a high-performance database for tracking clicks and then displaying analytics of these clicks.
I expect at least 10M clicks to be coming in per 2 weeks time.
There are a few variables (each of them would need its own column) that I'll allow people to use with the click tracking, but I don't want to limit them to 5 or so of these variables. That's why I thought about creating a Table B where I can store these variables for each click.
However, each click might have 5-15+ of these variables, depending on how many the user is using. If I store them in a separate table, that will multiply the 10M rows per 2 weeks by the number of variables the user might use.
In order to display analytics for the variables, I'll need to JOIN the tables.
Looking at both writing and most importantly reading performance, is there any difference if I JOIN a 100M rows table to a:
500 rows table OR to a 100M rows table?
Would anyone recommend denormalizing it, e.g. having 20 columns and storing NULL values where they're not in use?

"is there any difference if I JOIN a 100M rows table to a..."
Yes, there is. A JOIN's performance depends mainly on how long it takes to find matching rows based on your ON condition. This means increasing the row count of a joined table will increase the JOIN time, since there are more rows to sift through for matches. In general, a JOIN can be thought of as taking A*B time, where A is the number of rows in the first table and B is the number of rows in the second. This is a very broad statement, as there are many strategies the optimizer may use to change this value, but it can be thought of as a general rule.
To increase a JOIN's efficiency, for reads specifically, you should look into indexing. Indexing lets you mark a column for which the database maintains a sorted lookup structure, allowing quicker evaluation of its values. This increases the cost of every write operation, since each write must also modify that structure (usually a B-tree), but decreases the time of read operations, since the data is kept sorted in the structure, allowing quick lookups.
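The A*B rule of thumb and the effect of an index can be sketched in a few lines of Python. This is a toy model, not how MySQL executes joins: a Python dict stands in for the index (MySQL would typically use a B-tree), and the key lists are made up.

```python
def nested_loop_join(left, right):
    """Compare every left key with every right key: len(left) * len(right) checks."""
    comparisons, matches = 0, []
    for l in left:
        for r in right:
            comparisons += 1
            if l == r:
                matches.append((l, r))
    return matches, comparisons

def indexed_join(left, right):
    """Build a lookup structure on the right side once, then probe it per left row."""
    index = {}
    for r in right:
        index.setdefault(r, []).append(r)  # the extra work an index costs on writes
    probes, matches = 0, []
    for l in left:
        probes += 1
        for r in index.get(l, []):
            matches.append((l, r))
    return matches, probes

left_keys = list(range(1000))
right_keys = list(range(0, 1000, 2))

plain_matches, plain_cost = nested_loop_join(left_keys, right_keys)
indexed_matches, indexed_cost = indexed_join(left_keys, right_keys)
print(plain_cost, indexed_cost)
```

Both functions return the same matches; the difference is that the unindexed version does work proportional to A*B, while the indexed version does roughly one probe per row of the first table.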
"Anyone recommends denormalizing it, like having 20 columns and store NULL values if they're not in use?"
There are a lot of factors that would go into saying yes or no here. Mainly: would storage space be an issue, and how likely is duplicate data to appear? If storage space is not an issue and duplicates are not likely to appear, then one large table may be the right decision. If you have limited storage space, then storing the excess nulls may not be smart. If you have many duplicate values, then one large table may be less efficient than a JOIN.
Another factor to consider when denormalizing is if another table would ever want to access values from just one of the previous two tables. If yes, then the JOIN to obtain these values after denormalizing would be more inefficient than having the two tables separate. This question is really something you need to handle yourself when designing the database and seeing how it is used.

First: There is a huge difference between joining 10m to 500 or 10m to 10m entries!
But using a proper index and a structured table design will make this manageable for your goals, I think (at least depending on the hardware used to run the application).
I would definitely NOT recommend using denormalized tables, because adding more than your 20 values will be a mess once you have 20m entries in your table. So even if there are some good reasons that might speak for denormalized tables (performance, tablespace, ...), this is a bad idea for future changes - but in the end it's your decision ;)

Which is faster: a lookup on a large denormalized table or a join between three smaller tables?

I have a denormalized table with 100,000 records in it. I can normalize this down to a table of less than 50 records and a many-to-many of 20000 records between the aforementioned table and another table of 10000 records. Is it faster to do a lookup in the 100,000 records or join one of the 10000 records to its relations in the many-to-many? Citations are more than welcome because I don't believe I can test both conditions.

Generally, if the proper indices are in place, the denormalized table will be faster for select statements, but there are circumstances where the denormalized table will perform worse.
It depends on the relative row widths. If you factor out columns that take up a large percentage of the denormalized table's row width, and the resulting table has a much smaller row count, then the normalized structure could be faster due to better caching (The tables will have a much smaller memory footprint).
In your case, you should know that 100K records is a pretty small database and you probably shouldn't let performance be the driving factor behind the change. There are many benefits to normalization besides performance.

It all depends on the particulars of the situation. How big is the result set? Do you have a covering index or indices on the columns required by the query?
The "advantage" of the denormalized model is that all your columns are in one place; the disadvantages are many, but from a performance perspective, it means you have wide rows and therefore fewer rows per page. This means that the query has to fetch more pages from disk to find what it needs.
In general, a properly normalized data model (e.g. 3rd Normal Form) will perform quite well. Yes, your queries will be more complex, but what it brings to the table are narrow rows (more rows per page, meaning fewer reads for a given query). Further, the join criteria the queries will be using are more likely to have covering indices, meaning the joins are likely to perform well.
But without knowing the details, it's impossible to say. The only way to find out is to examine the query plan for your particular query.
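As a rough illustration of examining a query plan, the sketch below uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module. This is a stand-in: in MySQL you would prefix the query with EXPLAIN instead, and the output format is different. The items table, its columns, and the index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, category TEXT, payload TEXT);
CREATE INDEX idx_items_category ON items (category);
""")

def plan(sql: str) -> str:
    """Return the human-readable detail text of each plan row (its last column)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(str(row[-1]) for row in rows)

# Filter on an indexed column: the plan should mention the index.
indexed_plan = plan("SELECT payload FROM items WHERE category = 'a'")

# Filter on a column with no index: the plan falls back to scanning the table.
scan_plan = plan("SELECT id FROM items WHERE payload = 'a'")

print(indexed_plan)
print(scan_plan)
```

The exact wording of the plan text varies between SQLite versions, so the useful habit is to look for whether an index is named or a full scan is reported, and to do the same with the real query on the real data.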
It's very easy to denormalize data. It's much more difficult to normalize data, since all the repeated, duplicated data is likely to have...discrepancies that will need to be resolved. Get your data model right: applications are transient, but [good] data lasts forever.
Denormalizing before you have a problem is a case of premature optimization.

Does this de-normalization make sense?

I have 2 tables which I join very often. To simplify this, the join gives back a range of IDs that I use in another (complex) query as part of an IN.
So I do this join all the time to get back specific IDs.
To be clear, the query is not horribly slow. It takes around 2 mins. But since I call this query over a web page, the delay is noticeable.
As a concrete example, let's say that the tables I am joining are a Supplier table and a table that contains the warehouses the supplier equipped on specific dates. Essentially, I get the IDs of suppliers that serviced specific warehouses on specific dates.
The query itself cannot be improved, since it is a simple join between 2 indexed tables, but the date range complicates things.
I had the following idea which, I am not sure if it makes sense.
Since the data I am querying (especially for previous dates) does not change, what if I created another table that has the columns in my WHERE clause as its primary key and, as its value, the list of IDs (comma separated)?
This way it is a simple SELECT of 1 row.
I.e. this way I "pre-store" the supplier ids I need.
I understand that this is not even 1st normal form, but does it make sense? Is there another approach?

It makes sense as a denormalized design to speed up that specific type of query you have.
Though if your date range changes, couldn't it result in a different set of IDs?
The other approach would be to really treat the denormalized entries like entries in a key/value cache like memcached or redis. Store the real data in normalized tables, and periodically update the cached, denormalized form.
Re your comments:
Yes, generally storing a list of IDs in a string is against relational database design. See my answer to "Is storing a delimited list in a database column really that bad?"
But on the other hand, denormalization is justified in certain cases, for example as an optimization for a query you run frequently.
Just be aware of the downsides of denormalization: risk of data integrity failure, poor performance for other queries, limiting the ability to update data easily, etc.

In the absence of knowing a lot more about your application it's impossible to say whether this is the right approach - but to collect and consider that volume of information goes way beyond the scope of a question here.
"Essentially I get the IDs of suppliers that serviced specific warehouses at specific dates."
While it's far from clear why you actually need 2 tables here, or whether denormalizing the data would make the resulting query faster, one thing of note is that your data is unlikely to change after capture, hence maintaining the current structure along with a materialized view would have minimal overhead. You first need to test the query performance by putting the sub-query results into a properly indexed table. If you get a significant performance benefit, then you need to think about how you maintain the new table: can you substitute one of the existing tables with a view on the new table, or do you keep both your original tables and populate data into the new table by batch or by triggers?
It's not hard to try it out and see what works - and you'll get a far better answer than anyone here can give you.
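One way to try this out is sketched below, using Python's sqlite3 module as a stand-in for MySQL. All table and column names, the active filter on suppliers, and the sample rows are assumptions for illustration, since the real schema is not shown. The precomputed table stores one row per (warehouse, date, supplier) rather than a comma-separated list, and is filled by a batch step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE supplier (id INTEGER PRIMARY KEY, name TEXT NOT NULL, active INTEGER NOT NULL);
CREATE TABLE service (
    supplier_id  INTEGER NOT NULL REFERENCES supplier (id),
    warehouse_id INTEGER NOT NULL,
    service_date TEXT NOT NULL  -- ISO date string, YYYY-MM-DD
);
-- Precomputed join result; the primary key doubles as the lookup index.
CREATE TABLE warehouse_supplier_by_date (
    warehouse_id INTEGER NOT NULL,
    service_date TEXT NOT NULL,
    supplier_id  INTEGER NOT NULL,
    PRIMARY KEY (warehouse_id, service_date, supplier_id)
);
""")
conn.executemany("INSERT INTO supplier VALUES (?, ?, ?)",
                 [(1, "A", 1), (2, "B", 0), (3, "C", 1)])
conn.executemany("INSERT INTO service VALUES (?, ?, ?)",
                 [(1, 10, "2020-01-01"), (2, 10, "2020-01-02"), (3, 10, "2020-01-03"),
                  (3, 11, "2020-01-03"), (1, 10, "2020-02-01")])

def refresh(up_to: str) -> None:
    """Batch step: rebuild the precomputed rows for all dates up to the given day."""
    with conn:  # commit on success, roll back on error
        conn.execute(
            "DELETE FROM warehouse_supplier_by_date WHERE service_date <= ?", (up_to,))
        conn.execute("""
            INSERT INTO warehouse_supplier_by_date (warehouse_id, service_date, supplier_id)
            SELECT DISTINCT sv.warehouse_id, sv.service_date, sv.supplier_id
            FROM service AS sv
            JOIN supplier AS s ON s.id = sv.supplier_id
            WHERE s.active = 1 AND sv.service_date <= ?
        """, (up_to,))

def suppliers_for(warehouse_id: int, start: str, end: str) -> list:
    """Supplier IDs for a warehouse and date range, read from the precomputed table."""
    cur = conn.execute("""
        SELECT DISTINCT supplier_id FROM warehouse_supplier_by_date
        WHERE warehouse_id = ? AND service_date BETWEEN ? AND ?
        ORDER BY supplier_id
    """, (warehouse_id, start, end))
    return [r[0] for r in cur]

refresh("2020-01-31")
january = suppliers_for(10, "2020-01-01", "2020-01-31")
print(january)
```

Because the precomputed rows are keyed per date, any date range can still be answered with a range condition, and only dates that have not been refreshed yet need the original join.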

MySQL indexing - optional search criteria

"How many indexes should I use?" This question has been asked generally multiple times, I know. But I'm asking for an answer specific to my table structure and querying purposes.
I have a table with about 60 columns. I'm writing an SDK which has a function to fetch data based on optional search criteria. There are 10 columns for which the user can optionally pass in values (so the user might want all entries for a certain username and clientTimestamp, or all entries for a certain userID, etc). So potentially, we could be looking up data based on up to 10 columns.
This table will run INSERTS almost as often as SELECTS, and the table will usually have somewhere around 200-300K rows. Each row contains a significant amount of data (probably close to 0.5 MB).
Would it be a good or bad idea to have 10 indexes on this table?

Simple guide that may help you make a decision.
1. Index columns that have high selectivity.
2. Try normalizing your table (you mentioned username and userid columns; if it's not the user table, there is no need to store the name here).
3. If your system is not abstract, there should be a number of parameters that are used more often than others. First of all, make sure you have indexes that support fast retrieval for those parameters.
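For the first point, selectivity can be estimated as the number of distinct values divided by the number of rows. Below is a minimal sketch using Python's sqlite3 module as a stand-in for MySQL; the events table, its columns, and the generated data are hypothetical. The same COUNT(DISTINCT ...) / COUNT(*) query works in MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, username TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(i, f"user{i}", "ok" if i % 2 else "error") for i in range(100)],
)

ALLOWED_COLUMNS = {"user_id", "username", "status"}

def selectivity(column: str) -> float:
    """Distinct values divided by total rows; closer to 1.0 means more selective."""
    # Column names cannot be bound as parameters, so check against a whitelist.
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unknown column: {column}")
    distinct, total = conn.execute(
        f"SELECT COUNT(DISTINCT {column}), COUNT(*) FROM events").fetchone()
    return distinct / total

for name in sorted(ALLOWED_COLUMNS):
    print(name, selectivity(name))
```

In this made-up data, user_id is unique per row and is a good index candidate, while status has only two values, so an index on it alone would filter out little.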
