Does MySQL table size matter when doing JOINs?

I'm currently trying to design a high-performance database for tracking clicks and then displaying analytics of these clicks.
I expect at least 10M clicks to come in every two weeks.
There are a few variables (each of which would need its own column) that I'll allow people to use with the click tracking, but I don't want to limit them to 5 or so of these variables. That's why I thought about creating Table B, where I can store these variables for each click.
However, each click might have 5-15+ of these variables, depending on how many are being used. If I store them in a separate table, that will multiply the 10M per 2 weeks by the number of variables each user might use.
In order to display analytics for the variables, I'll need to JOIN the tables.
Looking at both write and, most importantly, read performance, is there any difference if I JOIN a 100M-row table to a 500-row table versus a 100M-row table?
Would anyone recommend denormalizing it, e.g. having 20 columns and storing NULL values when they're not in use?

is there any difference if I JOIN a 100M-row table to a...
Yes, there is. A JOIN's performance depends mainly on how long it takes to find matching rows based on your ON condition. This means increasing the row count of a joined table will increase the JOIN time, since there are more rows to sift through for matches. In general, without a helpful index, a JOIN can be thought of as taking A*B time, where A is the number of rows in the first table and B is the number of rows in the second. This is a very broad statement, as there are many strategies the optimizer may use to change this, but it works as a general rule.
To increase a JOIN's efficiency, for reads specifically, you should look into indexing. An index marks a column (or set of columns) that the database keeps a running, sorted record of, so that values can be evaluated quickly. This slows down write operations, since each write also has to update an accompanying data structure, usually a B-Tree, but it speeds up read operations because the data is presorted in that structure, allowing quick lookups.
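As a rough sketch of what that could look like for the click-tracking design in the question (the table and column names here are my own assumptions, not from the original post):

-- Hypothetical two-table design: one row per click, plus one row per variable per click.
CREATE TABLE clicks (
    click_id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    clicked_at DATETIME NOT NULL
) ENGINE=InnoDB;

CREATE TABLE click_variables (
    click_id BIGINT UNSIGNED NOT NULL,
    name     VARCHAR(64)  NOT NULL,
    value    VARCHAR(255) NOT NULL,
    PRIMARY KEY (click_id, name),      -- lets the JOIN find a click's variables without scanning
    KEY idx_name_value (name, value)   -- helps analytics that filter on a variable's value
) ENGINE=InnoDB;

-- With those indexes, the JOIN walks a B-Tree instead of sifting through all 100M rows:
SELECT c.click_id, v.name, v.value
FROM clicks c
JOIN click_variables v ON v.click_id = c.click_id
WHERE c.clicked_at >= '2024-01-01';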
Would anyone recommend denormalizing it, e.g. having 20 columns and storing NULL values when they're not in use?
There are a lot of factors that go into saying yes or no here. Mainly: would storage space be an issue, and how likely is duplicate data to appear? If storage space is not an issue and duplicates are not likely, then one large table may be the right decision. If you have limited storage space, then storing the excess NULLs may not be smart. If you have many duplicate values, then one large table may be less efficient than a JOIN.
Another factor to consider when denormalizing is whether another table would ever need to access values from just one of the previous two tables. If so, obtaining those values after denormalizing would be less efficient than keeping the two tables separate. This is really something you need to weigh yourself when designing the database and seeing how it is used.
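For contrast, a minimal sketch of the denormalized version the question mentions, with NULL-padded columns (the slot count and column names are assumptions):

-- One wide row per click; unused variable slots stay NULL.
CREATE TABLE clicks_wide (
    click_id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    clicked_at DATETIME NOT NULL,
    var1  VARCHAR(255) NULL,
    var2  VARCHAR(255) NULL,
    var3  VARCHAR(255) NULL,
    -- ... and so on up to the chosen limit, e.g. var20
    var20 VARCHAR(255) NULL
) ENGINE=InnoDB;

No JOIN is needed to read a click's variables here, but filtering on "any variable equals X" becomes a long OR over 20 columns, which is where the separate table tends to win.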

First: There is a huge difference between joining 10m to 500 or 10m to 10m entries!
But a proper index and a well-structured table design will make this manageable for your goals, I think (at least depending on the hardware used to run the application).
I would absolutely NOT recommend using denormalized tables, because adding more than your 20 values will be a mess once you have 20M entries in the table. So even if there are some good reasons for using denormalized tables (performance, tablespace, ...), it is a bad idea with respect to future changes - but in the end it's your decision ;)

Related

Long many-to-many database table: best performance practice

I have a question about the performance of my MySQL database design.
Table A has a lot of records, say a million, and table B also has a million. There is another table C in which every record ID of A is connected to every row in B, and this connection has an additional value of 1 or 0. So functionally speaking, every record in A has a boolean vector, where B contains the 'variables' of the vector and 1 or 0 is the value.
Table C will have a lot of write and read actions (select all values for a record of A), so the table is very actively used. And table C is really long: a million times a million rows.
My first question is: will the length of the table cause a performance issue? The database needs to be really fast.
My second question is, if this is badly designed, whether there is a better design to achieve what I want. For instance, I could store the entire B vector of each A record inside each row of A; then table C would not be necessary. But that would make selecting, reading, and writing much more difficult.
The table design is fine and shouldn't be a problem, because you access records via IDs which should be indexed. Depending on your typical queries you should also consider adding composite indexes (c(a_id,b_id), c(a_id,value), c(b_id,value), c(a_id,b_id,value)).
However, as there are only two states, 0 and 1, you may decide to store only one of them. I.e. if you store only the state-1 records, all pairs not in the table implicitly have state 0. This pays off especially when the states are unevenly distributed (say 90% of the records have state 0 and only 10% have state 1) or when you usually access only one of the states (e.g. you always look for 1s).
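A minimal sketch of both suggestions, assuming the tables are named a, b, and c as in the question (the index names and types are mine):

-- Table C with a composite primary key plus one of the composite indexes suggested above.
CREATE TABLE c (
    a_id  INT NOT NULL,
    b_id  INT NOT NULL,
    value TINYINT NOT NULL,           -- 0 or 1
    PRIMARY KEY (a_id, b_id),
    KEY idx_b_value (b_id, value)
) ENGINE=InnoDB;

-- Variant that stores only the state-1 pairs: drop the value column entirely and treat
-- a missing (a_id, b_id) row as state 0, e.g.:
-- SELECT EXISTS(SELECT 1 FROM c WHERE a_id = 42 AND b_id = 7) AS state;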
Answer to your first question
Millions of records in a table with many reads and writes won't be a bottleneck if you follow MySQL best practices:
Your engine should be InnoDB.
Your SELECT queries should not involve a full table scan (you can verify this with EXPLAIN, as sketched below).
Your table should have the desired indexes.
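As a quick illustration of the second point, EXPLAIN shows whether a query will use an index or fall back to a full table scan; the query below assumes the table c(a_id, b_id, value) from the question.

EXPLAIN SELECT b_id, value FROM c WHERE a_id = 42;
-- "type: ALL" in the output means a full table scan;
-- "ref", "range" or "eq_ref" mean an index is being used.

-- If the scan is caused by a missing index, adding one usually changes the plan:
ALTER TABLE c ADD INDEX idx_b (b_id);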
Answer to your second question
You should look at all your possible use cases, because either way can be a good idea if a use case supports it.
If you split your data across multiple tables, then a JOIN has to be performed when needed.

Database Optimisation through denormalization and smaller rows

Do tables with many columns take more time than tables with fewer columns during a SELECT or UPDATE query? (The row count is the same, and I will update/select the same number of columns in both cases.)
Example: I have a database to store user details and their last-active timestamp. On my website, I only need to show active users and their names.
Say one table named userinfo has the following columns: (id, f_name, l_name, email, mobile, verified_status). Is it a good idea to store the last-active time in the same table, or is it better to make a separate table (say, user_active) to store the last-activity timestamp?
The reason I am asking: if I make two tables, the userinfo table will only be accessed during new signups (to INSERT a new user row), and I will use the user_active table (the table with fewer columns) to UPDATE the timestamp and SELECT active users frequently.
But the cost I have to pay for creating two tables is data duplication, as the user_active table's columns would be (id, f_name, timestamp).
The answer to your question is that, to a close approximation, having more columns in a table does not really take more time than having fewer columns for accessing a single row. This may seem counter-intuitive, but you need to understand how data is stored in databases.
Rows of a table are stored on data pages. The cost of a query is highly dependent on the number of pages that need to be read and written during the course of the query. Parsing the row from the data page is usually not a significant performance issue.
Now, wider rows do have a very slight performance disadvantage, because more data would (presumably) be returned to the user. This is a very minor consideration for rows that fit on a single page.
On a more complicated query, wider rows have a larger performance disadvantage, because more data pages need to be read and written for a given number of rows. For a single row, though, one page is being read and written -- assuming you have an index to find that row (which seems very likely in this case).
As for the rest of your question: the structure of your second table is not correct. You would not (normally) include f_name in two tables -- that is data redundancy and causes all sorts of other problems. There is a legitimate question of whether you should store a table of all activity and use that table for display purposes, but that is not the question you are asking.
Finally, for the data volumes you are talking about, having a few extra columns would make no noticeable difference on any reasonable transaction volume. Use one table if you have one attribute per entity and no compelling reason to do otherwise.
When returning and parsing a single row, the number of columns is unlikely to make a noticeable difference. However, searching and scanning tables with smaller rows is faster than tables with larger rows.
When searching using an index, MySQL utilizes a binary search so it would require significantly larger rows (and many rows) before any speed penalty is noticeable.
Scanning is a different matter. When scanning, it's reading through all of the data for all of the rows, so there's a 1-to-1 performance penalty for larger rows. Yet, with proper indexes, you shouldn't be doing much scanning.
However, in this case, keep the date together with the user info because they'll be queried together and there's a 1-to-1 relationship, and a table with larger rows is still going to be faster than a join.
Only denormalize for optimization when performance becomes an actual problem and you can't resolve it any other way (adding an index, improving hardware, etc.).
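A minimal sketch of that single-table approach, reusing the column names from the question (the index, the column types, and the "active" cutoff are assumptions):

CREATE TABLE userinfo (
    id              INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    f_name          VARCHAR(50)  NOT NULL,
    l_name          VARCHAR(50)  NOT NULL,
    email           VARCHAR(255) NOT NULL,
    mobile          VARCHAR(20)  NULL,
    verified_status TINYINT      NOT NULL DEFAULT 0,
    last_active     DATETIME     NULL,
    KEY idx_last_active (last_active)   -- keeps the "active users" query an index range scan
) ENGINE=InnoDB;

-- Active users and their names, straight from the one table:
SELECT id, f_name, l_name
FROM userinfo
WHERE last_active >= NOW() - INTERVAL 15 MINUTE;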

Multiple SQL WHERE clauses in one table with lots of records

Recently I was asked to develop an app which is basically going to use one main table in the whole database for its operations.
It has to have around 20 columns with various types - decimals, int, varchar, date, float. At some point the table will have thousands of rows (3-5k).
The app must have the ability to SELECT records by combining each of the columns criteria - e.g. BETWEEN dates, greater than something, smaller than something, equal to something etc. Basically combining a lot of where clauses in order to see the desired result.
So my question is, since I know how to combine the WHEREs and build the app, what is the best approach? I mean, is MySQL good enough not to slow down when I have 3k records and make a SELECT query with 15 WHERE clauses? I've never worked with a database larger than 1k records, so I'm not sure if I should use MySQL for this. Also, I'm going to use PHP as the server language, if that matters at all.
You are talking about conditions in ONE WHERE clause.
3,000 rows is very small for a relational database; these typically grow far larger (3 million+ or even much more).
I am concerned that you have 20 columns in one table; this sounds like a normalization problem.
With a well-defined structure for your database, including appropriate indexes, 3k records is nothing, even with 15 conditions. Even without indexes, it is doubtful that with so few records, you will see any performance hit.
I would however plan for the future and perhaps look at your queries and see if there is any table optimisation you can do at this stage, to save pain in the future. Who knows, 3k records today, 30m next year.
3,000 records in a database is nothing. You won't have any performance issues, even with your 15 WHERE conditions.
MySQL and PHP will do the job just fine.
I'd be more concerned about your large number of columns. Maybe you should take a look at the database normal forms to make sure you respect them.
Good luck with your project.
I don't think querying a single table of 3-5k rows is going to be particularly intensive; MySQL should be able to cope with something like this easily enough. You could add lots of indexes to speed up your SELECTs if this is the choke point, but that will slow down inserts, edits, etc., and if you're querying in lots of different ways it probably isn't a good idea either.
Seeing as the number of rows is very small, I guess it should not cause any performance issue. Still, you can be careful with the OR operator and look at adding indexes on the columns in the WHERE clause.
Indices, indices, indices!
If you need to check a lot of different columns, try to flatten the logic you use. In any case, make sure you have set an appropriate index on the checked columns: not an index per column, but one index over all of the columns that are used together regularly.
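A sketch of that idea with made-up column names: one composite index over the columns that appear together in the most common filters, rather than an index per column.

CREATE TABLE records (
    id         INT UNSIGNED  NOT NULL AUTO_INCREMENT PRIMARY KEY,
    status     INT           NOT NULL,
    created_at DATE          NOT NULL,
    amount     DECIMAL(10,2) NOT NULL,
    KEY idx_status_created_amount (status, created_at, amount)
) ENGINE=InnoDB;

-- A query combining several WHERE conditions can then be resolved through that one index:
SELECT *
FROM records
WHERE status = 2
  AND created_at BETWEEN '2024-01-01' AND '2024-06-30'
  AND amount > 100.00;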

MySQL - why not index every field?

Recently I've learned the wonder of indexes, and performance has improved dramatically. However, with all I've learned, I can't seem to find the answer to this question.
Indexes are great, but why couldn't someone just index all fields to make the table incredibly fast? I'm sure there's a good reason not to do this, but how about three fields in a thirty-field table? Ten in a thirty-field table? Where should one draw the line, and why?
Indexes take up space in memory (RAM); with too many or too-large indexes, the DB will have to swap them to and from disk. They also increase insert and delete time (each index must be updated for every piece of data inserted/deleted/updated).
You don't have infinite memory. Making it so all indexes fit in RAM = good.
You don't have infinite time. Indexing only the columns you need indexed minimizes the insert/delete/update performance hit.
Keep in mind that every index must be updated any time a row is updated, inserted, or deleted. So the more indexes you have, the slower performance you'll have for write operations.
Also, every index takes up further disk space and memory space (when called), so it could potentially slow read operations as well (for large tables).
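One way to see this trade-off on a real server is to compare data size with index size per table; the columns below are standard information_schema fields, and only the schema name is a placeholder.

SELECT TABLE_NAME,
       ROUND(DATA_LENGTH  / 1024 / 1024, 1) AS data_mb,
       ROUND(INDEX_LENGTH / 1024 / 1024, 1) AS index_mb
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'your_database'
ORDER BY INDEX_LENGTH DESC;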
You have to balance CRUD needs. Writing to tables becomes slow. As for where to draw the line, that depends on how the data is being accessed (sorting, filtering, etc.).
Indexing takes up more space, both on disk and in RAM, but it also improves performance a lot. Unfortunately, when it reaches the memory limit, the system falls back on disk space and performance is at risk. Practically, you shouldn't index any field that you think isn't involved in any kind of data traversal, neither inserting nor searching (WHERE clause), but you should index it otherwise. By default, treat all fields as candidates for indexing; the fields to consider leaving unindexed are those whose queries are used only by a moderator, unless those queries need the speed too.
It is not a good idea to index all the columns in a table. While this will make the table very fast to read from, it also becomes much slower to write to. Writing to a table that has every column indexed involves putting the new record in the table and then putting each column's value into its own index structure.
This answer is my personal opinion; I'm using my own mathematical reasoning to answer.
The second question was about where to draw the line. First, let's do some rough math. Suppose we have N rows with L fields in a table. If we index all the fields, we get L new index structures, each of which keeps the data of its indexed field sorted in a meaningful way. At first glance, if your table weighs W, it becomes roughly 2W (1 terabyte becomes 2 terabytes); if you have 100 big tables (I have worked on a project where the table count was around 1,800), you waste that space 100 times over (100 terabytes), which is far from wise.
If we apply indexes to all fields, we also have to think about index maintenance, where one update triggers an update of every index; in time this is equivalent to an unordered select over everything.
From this I conclude that if you are going to lose that time, it is preferable to lose it in a SELECT rather than an UPDATE, because selecting on a field that is not indexed does not trigger extra work on all the other fields that are not indexed.
What to index?
Foreign keys: a must.
Primary key: I'm not yet sure about this; maybe someone reading this can help with that case.
Other fields: the first natural answer is half of the remaining fields. Why? If you should have indexed more, you are not far from the best answer; if you should have indexed less, you are also not far, because we know that no indexes at all is bad and everything indexed is also bad.
From these three points I conclude that if we have L fields, K of which are keys, the limit should be somewhere near (L - K)/2 + K, give or take L/10.
This answer is based on my own logic and personal practice.
First of all, at least in SAP ABAP (and the database tables behind it), you can create one index table for all the required index fields, holding only their addresses. So other SQL-based database systems could likewise use one table for all the fields to be indexed.
Secondly, what is write performance, really? A company records, say, 50 sales orders in a day. Assume there is a sales order header table, VBAK, with 30 fields, each 20 characters long.
I can write to the real table in seconds while the index table is maintained in the background. If a report is run at the same time, the database logic can make the report's search of the index table wait until an in-progress index write finishes (say 5 sales orders are being recorded at the same time and take maybe 5 seconds), so a running report can wait 5 seconds and then run 5 seconds, 10 seconds in total.
Without the index, a running report does not wait those 5 seconds for the writes, but it might run 40 seconds.
So what does write performance really mean? Nobody writes thousands of records at the same time, but plenty of people read them.
And reading from a second table means the fields there are already sorted. If I have 3 fields selected, I can find which sorted sets I need to search and then fetch the data. What RAM, what memory? It is just a copied index table with one piece of data per field: the address. What memory cost?
I think this is one of the secrets software companies hide from customers, so as not to wake them up; otherwise they would not need another, expensive system in the future.

What is the optimal amount of data for a table?

How much data should be in a table so that reading is optimal? Assuming that I have 3 fields varchar(25). This is in MySQL.
I would suggest that you consider the following in optimizing your database design:
Consider what you want to accomplish with the database. Will you be performing a lot of inserts to a single table at very high rates? Or will you be performing reporting and analytical functions with the data?
Once you've determined the purpose of the database, define what data you need to store to perform whatever functions are necessary.
Normalize till it hurts. If you're performing transaction processing (the most common function for a database) then you'll want a highly normalized database structure. If you're performing analytical functions, then you'll want a more denormalized structure that doesn't have to rely on joins to generate report results.
Typically, if you've really normalized the structure till it hurts then you need to take your normalization back a step or two to have a data structure that will be both normalized and functional.
A normalized database is mostly pointless if you fail to use keys. Make certain that each table has a primary key defined. Don't use surrogate keys just because it's what you always see. Consider what natural keys might exist in any given table. Once you are certain that you have the right primary key for each table, then you need to define your foreign key references. Establishing explicit foreign key relationships rather than relying on implicit definition will give you a performance boost, provide integrity for your data, and self-document the database structure.
Look for other indexes that exist within your tables. Do you have a column or set of columns that you will search against frequently like a username and password field? Indexes can be on a single column or multiple columns so think about how you'll be querying for data and create indexes as necessary for values you'll query against.
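For example, a frequent lookup on a pair of columns could be served by one composite index rather than two separate ones (the table and column names here are illustrative, not from the question):

ALTER TABLE users ADD INDEX idx_username_password (username, password_hash);

SELECT id FROM users WHERE username = 'alice' AND password_hash = '...';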
Number of rows should not matter. Make sure the fields you're searching on are indexed properly. If you only have 3 varchar(25) fields, then you probably need to add a primary key that is not a varchar.
Agree that you should ensure that your data is properly indexed.
Apart from that, if you are worried about table size, you can always implement some type of data archival strategy later down the line.
Don't worry too much about this until you see problems cropping up, and don't optimise prematurely.
For optimal reading you should have an index. A table exists to hold the rows it was designed to contain. As the number of rows increases, the value of the index comes into play and reading remains brisk.
Phrased as such, I don't know how to answer this question. An indexed table of 100,000 records is faster than an unindexed table of 1,000.
What are your requirements? How much data do you have? Once you know the answer to these questions you can make decisions about indexing and/or partitioning.
This is a very loose question, so a very loose answer :-)
In general if you do the basics - reasonable normalization, a sensible primary key and run-of-the-mill queries - then on today's hardware you'll get away with most things on a small to medium sized database - i.e. one with the largest table having less than 50,000 records.
However, once you get past 50k-100k rows, which roughly corresponds to the point when the RDBMS is likely to become memory constrained, then unless you have your access paths set up correctly (i.e. indexes), performance will start to fall off catastrophically. That is catastrophic in the mathematical sense: in such scenarios it's not unusual to see performance deteriorate by an order of magnitude or two for a doubling in table size.
Obviously therefore the critical table size at which you need to pay attention will vary depending upon row size, machine memory, activity and other environmental issues, so there is no single answer, but it is well to be aware that performance generally does not degrade gracefully with table size and plan accordingly.
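On MySQL with InnoDB, the memory limit in question is mostly the buffer pool, so a quick way to see where you stand is to compare its size against the total size of data plus indexes (standard commands; only the schema name is a placeholder):

-- How much memory InnoDB has for caching data and indexes:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Rough total of data plus indexes for one database, in MB:
SELECT ROUND(SUM(DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024) AS total_mb
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'your_database';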
I have to disagree with Cruachan about "50k - 100k rows .... roughly correspond(ing) to the point when the rdbms is likely to be memory constrained". This blanket statement is misleading without two additional data points: the approximate row size and the available memory. I'm currently developing a database to find the longest common subsequence (a la bio-informatics) of lines within source code files, and reached millions of rows in one table, even with a VARCHAR field of close to 1000, before it became memory constrained. So, with proper indexing and sufficient RAM (a gig or two), as regards the original question, with rows of at most 75 bytes, there is no reason why the proposed table couldn't hold tens of millions of records.
The proper amount of data is a function of your application, not of the database. There are very few cases where a MySQL problem is solved by breaking a table into multiple subtables, if that's the intent of your question.
If you have a particular situation where queries are slow, it would probably be more useful to discuss how to improve that situation by modifying the query or the table design.