MySQL: Long table vs wide table

What is the more efficient (in terms of query performance) database table design - long or wide?
I.e., this
id size price
1 S 12.4
1 M 23.1
1 L 33.3
2 S 3.3
2 M 5.3
2 L 11.0
versus this
id S M L
1 12.4 23.1 33.3
2 3.3 5.3 11.0
Generally (I reckon) it comes down to the comparison of performance between GROUP BY and selecting the columns directly:
SELECT AVG(price) FROM table GROUP BY size
or
SELECT AVG(S), AVG(M), AVG(L) FROM table
The second one is a bit longer to write (especially with many columns), but what about the performance of the two? If possible, what are the general advantages/disadvantages of each of these table formats?

First of all, these are two different data models suitable for different purposes.
That being said, I'd expect [1] the second model will be faster for aggregation, simply because the data is packed more compactly, therefore needing less I/O:
The GROUP BY in the first model can be satisfied by a full scan of the index {size, price}. The alternative to using the index (a full table scan followed by a sort) is too slow when the data is too large to fit in RAM.
The query in the second model can be satisfied by a full table scan. No index needed [2].
Since the first approach requires table + index and the second one just the table, the cache utilization is better in the second case. Even if we disregard caching and compare the index (without table) in the first model with the table in the second model, I suspect the index will be larger than the table, simply because it physically records the size and has unused "holes" typical for B-Trees (though the same is true for the table if it is clustered).
And finally, the second model does not have the index maintenance overhead, which could impact the INSERT/UPDATE/DELETE performance.
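To make the index-only point concrete, here is a minimal sketch of the first (long) model with the {size, price} index; the table and column names/types are illustrative, not taken from the question:
-- Long model: one row per (item, size) combination.
CREATE TABLE item_price (
    id    INT NOT NULL,
    size  CHAR(1) NOT NULL,
    price DECIMAL(8,2) NOT NULL,
    PRIMARY KEY (id, size),
    KEY idx_size_price (size, price)  -- covering index for the aggregate below
);
-- Can be answered from idx_size_price alone (an index-only scan):
SELECT size, AVG(price) FROM item_price GROUP BY size;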
Other than that, you can consider caching the SUM and COUNT in a separate table containing just one row. Update both the SUM and COUNT via triggers whenever a row is inserted, updated or deleted in the main table. You can then easily get the current AVG, simply by dividing SUM and COUNT.
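A minimal sketch of that trigger idea, assuming the illustrative item_price table from above and extending it slightly to one summary row per size so it matches the GROUP BY; only the INSERT trigger is shown, the UPDATE/DELETE triggers follow the same pattern:
-- Hypothetical summary table, maintained by triggers.
CREATE TABLE price_summary (
    size      CHAR(1) NOT NULL PRIMARY KEY,
    price_sum DECIMAL(14,2) NOT NULL DEFAULT 0,
    price_cnt BIGINT NOT NULL DEFAULT 0
);
CREATE TRIGGER item_price_ai AFTER INSERT ON item_price
FOR EACH ROW
    INSERT INTO price_summary (size, price_sum, price_cnt)
    VALUES (NEW.size, NEW.price, 1)
    ON DUPLICATE KEY UPDATE
        price_sum = price_sum + NEW.price,
        price_cnt = price_cnt + 1;
-- The current average is then a cheap read:
SELECT size, price_sum / price_cnt AS avg_price FROM price_summary;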
[1] But you should really measure on representative amounts of data to be sure.
[2] Since there is no WHERE clause in your query, all rows will be scanned. Indexes are only useful for retrieving a relatively small subset of a table's rows (and sometimes for index-only scans). As a rough rule of thumb, if more than 10% of the rows in the table are needed, indexes won't help and the DBMS will often opt for a full table scan even when indexes are available.

The first option results in more rows and will generally be slower than the second option.
However, as Deltalima also indicated, the first option is more flexible. Not only when it comes to different query options, but also if/when you one day need to extend the table with other sizes, colors etc.
Unless you have a very large dataset or need ultra-fast lookup time, you'll probably be better off with the first option.
If you do have or need a very large dataset, you may be better off creating a table with pre-calculated summary values.

The long table is more flexible in use. It allows you to filter on size, for example:
SELECT MAX(price) FROM table WHERE size = 'L'
It also allows indexing on size and on id. This speeds up the GROUP BY and any queries where other tables are joined on id and/or size, such as a product stock table.
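Using the same illustrative item_price table as above, the kind of query and join this answer has in mind might look like this (the stock table is a hypothetical example, not something from the question):
-- The idx_size_price index from the earlier sketch already serves this filter:
SELECT MAX(price) FROM item_price WHERE size = 'L';
-- The primary key (id, size) supports joins on id and size, e.g. against a
-- hypothetical product stock table:
SELECT p.id, p.size, p.price, s.quantity
FROM item_price AS p
JOIN stock AS s ON s.id = p.id AND s.size = p.size;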

Related

Does MySQL table size matter when doing JOINs?

I'm currently trying to design a high-performance database for tracking clicks and then displaying analytics of these clicks.
I expect at least 10M clicks to be coming in per 2 weeks time.
There are a few variables (each of them would need its own column) that I'll allow people to use with the click tracking - but I don't want to limit them to a fixed number of these variables, say 5 or so. That's why I thought about creating Table B, where I can store these variables for each click.
However, each click might have 5-15+ of these variables, depending on how many they are using. If I store them in a separate table, that will multiply the 10M/2 weeks by the number of variables the user might use.
In order to display analytics for the variables, I'll need to JOIN the tables.
Looking at both writing and most importantly reading performance, is there any difference if I JOIN a 100M rows table to a:
500 rows table OR to a 100M rows table?
Does anyone recommend denormalizing it, like having 20 columns and storing NULL values if they're not in use?
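For reference, a rough sketch of the two-table design being described; every table name, column name and type here is an assumption for illustration, not taken from the question:
CREATE TABLE clicks (
    click_id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    clicked_at DATETIME NOT NULL
);
-- "Table B": one row per (click, variable) pair, so a click using 15
-- variables produces 15 rows here.
CREATE TABLE click_variables (
    click_id BIGINT UNSIGNED NOT NULL,
    name     VARCHAR(64)  NOT NULL,
    value    VARCHAR(255) NOT NULL,
    PRIMARY KEY (click_id, name)
);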
is there any difference if I JOIN a 100M rows table to a...
Yes, there is. A JOIN's performance depends chiefly on how long it takes to find matching rows based on your ON condition. This means increasing the row count of a joined table will increase the JOIN time, since there are more rows to sift through for matches. In general, a JOIN can be thought of as taking A*B time, where A is the number of rows in the first table and B is the number of rows in the second. This is a very broad statement, as there are many optimization strategies the optimizer may use to change this value, but it serves as a general rule.
To increase a JOIN's efficiency, for reads specifically, you should look into indexing. An index marks a column the database should keep a running, sorted track of to allow quicker evaluation of its values. This slows down write operations, since each write must also modify the encompassing data structure (usually a B-Tree), but it speeds up read operations, since the data is presorted in that structure, allowing for quick lookups.
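Applied to the hypothetical schema sketched above, that could look roughly like this (the date filter and the index name are made up for illustration):
-- The composite primary key (click_id, name) already lets the JOIN's
-- ON condition use an index:
SELECT cv.name, cv.value, COUNT(*) AS hits
FROM clicks AS c
JOIN click_variables AS cv ON cv.click_id = c.click_id
WHERE c.clicked_at >= '2024-01-01'
GROUP BY cv.name, cv.value;
-- For analytics filtered by variable name/value, an extra index helps:
ALTER TABLE click_variables ADD INDEX idx_name_value (name, value);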
Does anyone recommend denormalizing it, like having 20 columns and storing NULL values if they're not in use?
There are a lot of factors that go into saying yes or no here. Mainly: would storage space be an issue, and how likely is duplicate data to appear? If storage space is not an issue and duplicates are not likely to appear, then one large table may be the right decision. If you have limited storage space, then storing the excess NULLs may not be smart. If you have many duplicate values, then one large table may be more inefficient than a JOIN.
Another factor to consider when denormalizing is if another table would ever want to access values from just one of the previous two tables. If yes, then the JOIN to obtain these values after denormalizing would be more inefficient than having the two tables separate. This question is really something you need to handle yourself when designing the database and seeing how it is used.
First: There is a huge difference between joining 10m to 500 or 10m to 10m entries!
But using a proper index and structured table design will make this manageable for your goals, I think (at least depending on the hardware used to run the application).
I would totally NOT recommend using denormalized tables, because adding more than your 20 values will be a mess once you have 20m entries in your table. So even if there are some good reasons in favor of denormalized tables (performance, tablespace, ...), it is a bad idea for future changes - but in the end it's your decision ;)

Which SQL will be faster and why?

Suppose I have a student table containing id, class, school_id, with 1000 records.
There are 3 schools and 12 classes.
Which of these 2 queries would be faster (if there is a difference)?
Query 1:
SELECT * FROM student WHERE school = 2 and class = 5;
Query 2:
SELECT * FROM student WHERE class = 5 and school = 2;
Note: I just changed the places of the 2 conditions in WHERE.
Then which will be faster, and is the following true?
-> probable number of records in query 1 is 333
-> probable number of records in query 2 is 80
It seriously doesn't matter one little bit. 1000 records is a truly tiny database table and, if there's a difference at all, you need to upgrade from such a brain-dead DBMS.
A decent DBMS would have already collected the stats from tables (or the DBA would have done it as part of periodic tuning) and the order of the where clauses would be irrelevant.
The execution engine would choose the one which reduced the cardinality (ie, reduced the candidate group of rows) the fastest. That means that (assuming classes and schools are roughly equally distributed) the class = 5 filter would happen first, no matter the order in the select statement.
Explaining the cardinality issue in a little more depth, for a roughly evenly distributed spread of those 1000 records, there would be 333 for each school and 83 for each class.
What a DBMS would do would be to filter first on what gives you the smallest result set. So it would tend to prefer using the class filter. That would immediately drop the candidate list of rows to about 83. Then, it's a simple matter of tossing out those which have a school other than 2.
In both cases, you end up with the same eventual row set but the initial filter is often faster since it can use an index to only select desired rows. The second filter, on the other hand, most likely goes through those rows in a less efficient manner so the quicker you can reduce the number of rows, the better.
If you really want to know, you need to measure rather than guess. That's one of the primary responsibilities of a DBA, tuning the database for optimal execution of queries.
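One way to see this for yourself on the student table from the question is to compare the optimizer's plans; a minimal sketch:
EXPLAIN SELECT * FROM student WHERE school = 2 AND class = 5;
EXPLAIN SELECT * FROM student WHERE class = 5 AND school = 2;
-- Both statements should produce the same plan (same key, rows and filtered
-- estimates): the optimizer reorders the predicates itself, based on the
-- statistics it has gathered.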
These 2 queries are strictly the same :)
Hypothetical; to teach a DB concept:
"How your DB uses cardinality to optimize your queries"
So, it's basically true that they are identical, but I will mention one thought hinting at the "why" which will actually introduce a good RDBMS concept.
Let's just say hypothetically that your RDBMS used the WHERE clauses strictly in the order you specified them.
In that case, the optimal query would be the one in which the column with maximum cardinality was specified first. What that means is that specifying class=5 first would be faster, as it more quickly eliminates rows from consideration: if the row's "class" column does not contain 5 (which is statistically more likely than its "school" column not containing 2), then the "school" column doesn't even need to be evaluated.
Coming back to reality, however, you should know that almost all modern relational database management systems do what is called "building a query plan" and "compiling the query". This involves, among other things, evaluating the cardinality of columns specified in the WHERE clause (and what indexes are available, etc). So essentially, it is probably true to say they are identical, and the number of results will be, too.
The number of rows affected will not change simply because you reorder the conditions in the WHERE clause of the SQL statement.
The execution time will also not be affected, since the SQL server will look for a matching index first.
The first query executes faster than the second because the WHERE clause filters on school first, so it is easier to get the class details later.

Indexing on column with few fixed values but values constitue to less than 25% of total rows

I have a field table_name in a table which can have only 20 different values. The total records in the table is about few tens of thousands of rows. If I do a query like this:
SELECT * FROM table WHERE table_name = 'adasd';
at most the returned records are 25% of the total rows; mostly I get only 10% of the total records. Is there any scope to index the field table_name here? I hear that for indexes to work well, the values in that field should be unique or close to it. In my case, it's not at all close to unique. But I also heard that if the returned rows are few compared to the total number of rows, it makes a good case for indexing.
How should I go about this?
No, they don't have to be unique to get a benefit from indexes. However, take some time to think about what the DBMS does when processing a query:
Full table scan - a sequential read through the data (i.e. very few seek operations)
Index lookup - a few seeks on the index to find the start of the selected data, then a sequential read (few seeks) to identify rows in the underlying table, then LOTS AND LOTS of seeks to fetch the rows from the table
Seeks are expensive.
(there is a secondary effect of full table scans in that they are more prone to flushing hot data out of the cache - but you should address the primary concern first).
In this case, it's unlikely that the DBMS would use the index if it were present, and even if it did, it would probably be slower than a full table scan. As a (very) rough rule of thumb, you're only going to get a benefit from an index if a predicate identifies less than around 5% of the rows (but it will vary depending on the relative size of the index and the data).
i.e. don't bother adding an index on this field alone.
I think you may benefit from spending some time thinking about why you need to run queries which return so many rows?
Revised Answer
I just learned that creating an index does not mean that MySQL will use it. Keeping that in mind, I will re-phrase my answer:
You should create an index on that column if (general or your own) practices suggest you do so. MySQL will use heuristics, which include looking at the available indexes and their respective cardinality, to determine the best index to use, or whether to use an index at all.
Interesting reading about this topic here.
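A sketch of how you might verify that for the column in question; the table and index names are assumptions for illustration, only the column table_name and the value come from the question:
ALTER TABLE my_table ADD INDEX idx_table_name (table_name);
EXPLAIN SELECT * FROM my_table WHERE table_name = 'adasd';
-- If the key column of the EXPLAIN output is NULL (and type is ALL),
-- MySQL has decided a full table scan is cheaper and ignores the index.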

MySQL - why not index every field?

Recently I've learned the wonder of indexes, and performance has improved dramatically. However, with all I've learned, I can't seem to find the answer to this question.
Indexes are great, but why couldn't someone just index all fields to make the table incredibly fast? I'm sure there's a good reason not to do this, but how about 3 fields in a 30-field table? 10 in a 30-field table? Where should one draw the line, and why?
Indexes take up space in memory (RAM); with too many or too large indexes, the DB will have to swap them to and from disk. They also increase insert and delete time (each index must be updated for every piece of data inserted/deleted/updated).
You don't have infinite memory. Making it so all indexes fit in RAM = good.
You don't have infinite time. Indexing only the columns you need indexed minimizes the insert/delete/update performance hit.
Keep in mind that every index must be updated any time a row is updated, inserted, or deleted. So the more indexes you have, the slower performance you'll have for write operations.
Also, every index takes up further disk space and memory space (when called), so it could potentially slow read operations as well (for large tables).
Check this out
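If you want to see what your existing indexes actually cost in space, one way is to compare data and index sizes in information_schema; a sketch, substitute your own schema name:
SELECT table_name,
       ROUND(data_length  / 1024 / 1024) AS data_mb,
       ROUND(index_length / 1024 / 1024) AS index_mb
FROM information_schema.tables
WHERE table_schema = 'your_database'
ORDER BY index_length DESC;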
You have to balance CRUD needs. Writing to tables becomes slow. As for where to draw the line, that depends on how the data is being accessed (sorting, filtering, etc.).
Indexing will take up more allocated space, both on disk and in RAM, but it also improves performance a lot. Unfortunately, when the memory limit is reached, the system falls back to disk space and performance is at risk. Practically, you shouldn't index any field that you think isn't involved in any kind of data traversal, neither inserting nor searching (WHERE clause), but you should if it is. By default you would index all fields. The fields to consider leaving unindexed are those only queried by a moderator, unless those queries need the speed too.
It is not a good idea to index all the columns in a table. While this will make the table very fast to read from, it also becomes much slower to write to. Writing to a table that has every column indexed involves putting the new record in the table and then putting each column's value into its own index.
This answer is my personal opinion; I'm using mathematical reasoning to answer.
The second question was about where to draw the line. First, let's do some math. Suppose we have N rows with L fields in a table; if we index all the fields, we get L new index structures, each storing the data of its field in a meaningful sort order. At first glance, if your table weighs W, it will roughly become 2W (1 terabyte becomes 2 terabytes). If you have 100 big tables (I have already worked on a project where the table count was around 1,800), you waste 100 times that space (100 terabytes), which is far from wise.
If we apply indexes to all fields, we also have to think about index maintenance, where one update triggers an update of every index; in time this is roughly equivalent to an unordered "select all".
From this I conclude that, in this scenario, if you are going to lose that time, it is preferable to lose it on a SELECT rather than on an UPDATE, because selecting a field that is not indexed does not trigger extra work on all the other fields that are not indexed.
What to index?
Foreign keys: a must.
Primary key: I'm not yet sure about it; maybe someone reading this can help with that case.
Other fields: the first natural answer is half of the remaining fields. Why? If you should have indexed more, you are not far from the best answer; if you should have indexed less, you are also not far, because we know that no indexes is bad and everything indexed is also bad.
From these 3 points I conclude that if we have L fields, of which K are keys, the limit should be somewhere near ((L-K)/2)+K, give or take L/10.
This answer is based on my own logic and personal practice.
First of all, at least in SAP ABAP and its underlying database tables, we can create one index table for all required index fields, holding only their addresses. So other SQL-based database systems could also use one table for all fields to be indexed.
Secondly, what about write performance? A company records, say, 50 sales orders in one day. Let's assume there is a sales order header table VBAK with 30 fields, each 20 characters long.
I can write to the real table in seconds, while the index table is maintained in the background. If a report is run at the same time, the database logic can make the report wait while an index write is still in progress (say 5 sales orders being recorded at the same time take about 5 seconds), so a running report might wait 5 seconds and then run for 5 seconds, 10 seconds in total.
Without the index, a running report does not wait those 5 seconds for the write, but it runs for maybe 40 seconds.
So what does write performance really mean? No one writes thousands of records at the same time, but they do read them.
And reading a second table means the fields are already sorted. If I have 3 fields selected, I can find in which sorted sets I need to search for this data and then fetch it. What RAM, what memory? It is just a copied index table with only one piece of data per field, the address. What memory?
I think this is one of the secrets software companies hide from customers, so as not to wake them up; otherwise they would not need to buy another, expensive system in the future.

MySQL: low cardinality/selectivity columns = how to index?

I need to add indexes to my table (columns) and stumbled across this post:
How many database indexes is too many?
Quote:
“Having said that, you can clearly add a lot of pointless indexes to a table that won't do anything. Adding B-Tree indexes to a column with 2 distinct values will be pointless since it doesn't add anything in terms of looking the data up. The more unique the values in a column, the more it will benefit from an index.”
Is an Index really pointless if there are only two distinct values? Given a table as follows (MySQL Database, InnoDB)
Id (BIGINT)
fullname (VARCHAR)
address (VARCHAR)
status (VARCHAR)
Further conditions:
The Database contains 300 Million records
Status can only be “enabled” and “disabled”
150 Million records have status = enabled and 150 Million records have status = disabled
My understanding is that, without an index on status, a select with WHERE status='enabled' would result in a full table scan with 300 Million records to process?
How efficient is the lookup when I use a BTREE index on status?
Should I index this column or not?
What alternatives (maybe other kinds of indexes) does MySQL InnoDB provide to efficiently look records up by the where status='enabled' clause in the given example, with such a very low cardinality/selectivity of the values?
The index that you describe is pretty much pointless. An index is best used when you need to select a small number of rows in comparison to the total rows.
The reason for this is related to how a database accesses a table. Tables can be accessed either by a full table scan, where each block is read and processed in turn, or by a rowid or key lookup, where the database has a key/rowid and reads the exact row it requires.
In the case where you use a where clause based on the primary key or another unique index, eg. where id = 1, the database can use the index to get an exact reference to where the row's data is stored. This is clearly more efficient than doing a full table scan and processing every block.
Now back to your example, you have a where clause of where status = 'enabled', the index will return 150m rows and the database will have to read each row in turn using separate small reads. Whereas accessing the table with a full table scan allows the database to make use of more efficient larger reads.
There is a point at which it is better to just do a full table scan rather than use the index. With mysql you can use FORCE INDEX (idx_name) as part of your query to allow comparisons between each table access method.
Reference:
http://dev.mysql.com/doc/refman/5.5/en/how-to-avoid-table-scan.html
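A sketch of that comparison; the table name users and the index name idx_status are assumptions for illustration:
-- Let the optimizer choose; for ~50% of the rows it will most likely
-- pick a full table scan:
SELECT * FROM users WHERE status = 'enabled';
-- Force the low-cardinality index to time the other access path:
SELECT * FROM users FORCE INDEX (idx_status) WHERE status = 'enabled';
-- Or explicitly forbid it:
SELECT * FROM users IGNORE INDEX (idx_status) WHERE status = 'enabled';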
I'm sorry to say that I do not agree with Mike. Adding an index is meant to limit the number of full record scans MySQL has to perform, thereby limiting IO, which usually is the bottleneck.
This indexing is not free; you pay for it on inserts/updates, when the index has to be updated, and in the search itself, as it now needs to load the index file (a full index for 300M records is probably not in memory). So it might well be that you get extra IO instead of limiting it.
I do agree with the statement that a binary variable is best stored as one, a bool or tinyint, as that decreases the length of a row and can thereby limit disk IO, also comparisons on numbers are faster.
If you need speed and you seldom use the disabled records, you may wish to have 2 tables, one for enabled and one for disabled records and move the records when the status changes. As it increases complexity and risk this would be my very last choice of course. Definitely do the move in 1 transaction if you happen to go for it.
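If you did go down that road, the move could look roughly like this (the table names and the id value are made up; the point is only that the INSERT and DELETE happen in one transaction):
START TRANSACTION;
INSERT INTO users_disabled (id, fullname, address)
    SELECT id, fullname, address FROM users_enabled WHERE id = 42;
DELETE FROM users_enabled WHERE id = 42;
COMMIT;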
It just popped into my head that you can check whether an index is actually used by using the EXPLAIN statement. That should show you how MySQL is optimizing the query. I don't really know how MySQL optimizes queries, but from PostgreSQL I do know that you should explain a query on a database approximately the same (in size and data) as the real database. So if you have a copy of the database, create an index on the table and see whether it's actually used. As I said, I doubt it, but I most definitely don't know everything :)
If the data is distributed 50:50, then a query like where status="enabled" only avoids scanning half of the table.
Whether an index on such a column helps depends completely on the distribution of the data. I.e.: if 90% of the entries have status enabled and the other 10% disabled, then for the query where status="disabled" it scans only 10% of the table.
So having an index on such columns depends on the distribution of the data.
a'r's answer is correct; however, it needs to be pointed out that the usefulness of an index is determined not only by its cardinality but also by the distribution of the data and the queries run on the database.
In OP's case, with 150M records having status='enabled' and 150M having status='disabled', the index is unnecessary and a waste of resource.
In case of 299M records having status='enabled' and 1M having status='disabled', the index is useful (and will be used) in queries of type SELECT ... where status='disabled'.
Queries of type SELECT ... where status='enabled' will still run with a full table scan.
You will hardly need all 150 mln records at once, so I guess "status" will always be used in conjunction with other columns. Perhaps it'd make more sense to use a compound index like (status, fullname)
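A sketch of that compound index on the hypothetical users table from the earlier example, together with a query it could serve:
ALTER TABLE users ADD INDEX idx_status_fullname (status, fullname);
-- The index narrows by status first, then by the second column:
SELECT id, fullname
FROM users
WHERE status = 'disabled' AND fullname LIKE 'Smith%'
ORDER BY fullname
LIMIT 100;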
Jan, you should definitely index that column. I'm not sure of the context of the quote, but everything you said above is correct. Without an index on that column, you are most certainly doing a table scan on 300M rows, which is about the worst you can do for that data.
Jan, as asked, where your query involves simply "where status=enabled" without some other limiting factor, an index on that column apparently won't help (glad the SO community showed me what's up). If, however, there is a limiting factor, such as "limit 10", an index may help. Also, remember that indexes are also used in GROUP BY and ORDER BY optimizations. If you are doing "select count(*), status from table group by status", an index would be helpful.
You should also consider converting status to a tinyint where 0 would represent disabled and 1 would be enabled. You're wasting tons of space storing that string vs. a tinyint which only requires 1 byte per row!
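A hedged sketch of that conversion, again on the hypothetical users table; on 300M rows the UPDATE touches the whole table, so it belongs in a maintenance window:
-- 1 = enabled, 0 = disabled
ALTER TABLE users ADD COLUMN status_flag TINYINT NOT NULL DEFAULT 1;
UPDATE users SET status_flag = IF(status = 'enabled', 1, 0);
-- Once the application reads status_flag, the old VARCHAR column can go:
-- ALTER TABLE users DROP COLUMN status;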
I have a similar column in my MySQL database. Approximately 4 million rows, with the distribution of 90% 1 and 10% 0.
I've just discovered today that my queries (where column = 1) actually run significantly faster WITHOUT the index.
Foolishly I deleted the index. I say foolishly, because I now suspect the queries (where column = 0) may have still benefited from it. So, instead I should explicitly tell MySQL to ignore the index when I'm searching for 1, and to use it when I'm searching for 0. Maybe.