My question is simple: let's say that I hypothetically have 18446744073709551615 records in one table (the max number), but I want to select only one of those records, something like this:
SELECT * FROM TABLE1 WHERE ID = 5
1. Will the result be slow to appear?
Or, if I have another table with only five records and I run the same query:
SELECT * FROM TABLE2 WHERE ID = 5
2. Will the result appear at the same speed as in the first select, or will this one be much faster?
thanks.
Let's assume for simplicity that the ID column is a fixed-width primary key. A binary search would find it in roughly 64 index lookups (log2 of 2^64 is 64). Since MySQL / InnoDB uses B-trees, the number of disk seeks will be somewhat less than that.
Searching for 1 row in a million would take you roughly 20 index lookups. Seeking among 5 values will take 3 index lookups, and the whole index will probably fit into one block.
Most of the speed difference will come from data that is being read from disk. The index branching should be a relatively fast operation, and functionally you would not notice the difference once the values were cached in RAM. That is to say, the first time you select from your 2^64 rows, it will take a little while to read from a spinning disk, but the speed will be essentially the same for the 5 and 2^64 rows if you were to repeat the query (even ignoring the query cache).
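As a rough check on the lookup counts above, here is a minimal Python sketch. The function name btree_levels and the fan-out values are assumptions for illustration only: a fan-out of 2 models a plain binary search, and 100 is just a rule-of-thumb figure for a B-tree node, not a measurement of InnoDB.

```python
def btree_levels(rows: int, fanout: int) -> int:
    """Smallest number of levels such that fanout ** levels >= rows."""
    levels = 1
    capacity = fanout
    while capacity < rows:
        capacity *= fanout
        levels += 1
    return levels


MAX_ROWS = 2**64 - 1  # 18446744073709551615, the row count from the question

for rows in (5, 1_000_000, MAX_ROWS):
    # Second column: binary search steps; third column: levels at fan-out 100.
    print(rows, btree_levels(rows, 2), btree_levels(rows, 100))
```

Under these assumptions the gap between 5 rows and 2^64 rows is a handful of extra levels, which is why the difference mostly disappears once the relevant blocks are cached.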
No, the first one will almost certainly be slower than the second, but probably not that much slower, provided you have an index on the ID column.
With an index, you can efficiently find the first record meeting the condition and then all the other records will be close by (in the index structures anyway, not necessarily the data area).
I'd say you're more likely to run out of disk storage with the first one before you run out of database processing power :-)
I have a table with 32 columns, of which 6 columns form the primary key and 2 more columns are indexed.
Explain statement provides the below output
I have observed that every time the number of rows in the explain output increases, the select query takes seconds to retrieve data from the DB. The above select query returned only 310 rows, but it had to scan 382546 rows.
Time taken was calculated by enabling mariadb's slow query log.
Create table query
I would like to understand what is wrong with the table or the query that is considerably slowing down the select query execution.
Your row is relatively large (around 300 bytes, depending on the content of your varchar columns). Using the primary key means (for InnoDB) that MySQL will read the whole row. Assuming the estimate of 400k rows is right (which it probably isn't, but you can check by removing the condition and country_code = 1506 from your query to get a better count), MySQL may end up reading more than 100 MB from disk, which can reasonably take several seconds.
Adding a proper index should fix this. In your case I would suggest (country_code, lcr_run_id, tier_type) (which would, with your primary key, actually be the same as just (country_code)).
If most of your queries have that form (i.e. use at least these three columns for lookup), you could think about changing the order of your primary key to start with those three columns; it should give you another speed boost. That operation will take some time, though.
Hash partitioning is useless for performance, get rid of it. Ditto for subpartitioning.
Specifying which partition to use defeats the purpose of letting the Optimizer do it for you.
You simply need INDEX(tier_type, lcr_run_id, country_code) with the columns in any desired order.
Plan A: Have the PRIMARY KEY start with those 3 columns (again, the order is not important)
Plan B: Have a "secondary" index with those 3 columns, but not being the same as the start of the PK. (This index could have more columns on the end; let's see some more queries to advise further.)
Either way, it will scan only 310 rows if you also get rid of all partitioning. (Hence, resolving your "returned only 310 rows but it had to scan 382546 rows". Anyway, the '382546' may have been a poor estimate by Explain.)
The important issue here is that indexing works with the leftmost columns in the INDEX. (The PK is an index.) Your SELECT had a match on the first 2 columns, but country_code came later in the list, and the intervening columns were not tested with =.
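The leftmost-prefix rule described above can be sketched in a few lines of Python. This is a simplified illustration, not how the optimizer is implemented: it only considers equality tests, and the primary key column order below is hypothetical, since the actual CREATE TABLE is not shown in the question.

```python
def usable_prefix(index_columns, equality_columns):
    """Return the leading index columns that an equality lookup can use."""
    prefix = []
    for column in index_columns:
        if column not in equality_columns:
            break  # the first column not tested with = ends the usable prefix
        prefix.append(column)
    return prefix


# Hypothetical column order; the real PRIMARY KEY is not shown in the question.
primary_key = ["lcr_run_id", "tier_type", "zone", "country_code"]
where_equals = {"lcr_run_id", "tier_type", "country_code"}

print(usable_prefix(primary_key, where_equals))
# ['lcr_run_id', 'tier_type']

# An index on exactly the three tested columns is usable in full.
print(usable_prefix(["tier_type", "lcr_run_id", "country_code"], where_equals))
```

With the hypothetical key, the lookup stops at zone, so the test on country_code has to be applied row by row to everything that matched the first two columns.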
The three 35M values make me wonder if the PK is over-specified. For example, if a "zone" is comprised of several "countries", then "zone" is irrelevant in specifying the PK.
The table has only 382K rows, but it is much fatter than it needs to be. Partitioning has a lot of overhead. Also, most columns have (I think) much bigger datatypes than needed. BIGINT takes 8 bytes; INT takes 4 bytes. For example, if there are only a small number of "zones", use TINYINT UNSIGNED, which takes only 1 byte (and allows values 0..255). (See also other 'int' variants.)
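To make the datatype advice concrete, here is a small Python sketch that picks the narrowest unsigned MySQL integer type for a given maximum value. The helper name smallest_unsigned_type is an assumption for illustration; the byte sizes and ranges are the standard ones for MySQL integer types.

```python
# (type name, storage bytes, maximum unsigned value) for MySQL integer types
UNSIGNED_INT_TYPES = [
    ("TINYINT UNSIGNED", 1, 2**8 - 1),
    ("SMALLINT UNSIGNED", 2, 2**16 - 1),
    ("MEDIUMINT UNSIGNED", 3, 2**24 - 1),
    ("INT UNSIGNED", 4, 2**32 - 1),
    ("BIGINT UNSIGNED", 8, 2**64 - 1),
]


def smallest_unsigned_type(max_value):
    """Return (type name, bytes) of the narrowest type that holds max_value."""
    for name, size, limit in UNSIGNED_INT_TYPES:
        if max_value <= limit:
            return name, size
    raise ValueError("value does not fit in BIGINT UNSIGNED")


print(smallest_unsigned_type(200))         # a small number of "zones"
print(smallest_unsigned_type(35_000_000))  # an id in the tens of millions
```

For a value around 35M, INT UNSIGNED is enough, which halves the storage of that column compared with BIGINT.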
Oops, I missed something else. Since zone is not in the WHERE, it can't even get past the primary partitioning.
We have a MySQL master-slave architecture with around 1000 tables. 5 or 6 tables in our DB are around 30 to 40 GB each. We cannot join one 30 GB table to another 30 GB table, as the query never returns a result.
What we do: select the required data from one table and then find the matching data in the other table in chunks. This gives us a result, but it is slow.
After joining the two tables in chunks, we process the data further. We also use a few more joins, depending on the use case.
Current DB architecture: 5 master servers, 100 slave servers.
1. How can we make it faster? Indexing is not an issue here; we are already using it.
2. Do we need some big data approach to get faster results?
EDIT: Query Details Below
Query: select count(*) from A, B where A.id = B.uid;
Table A: 30 GB, 51 columns. id is the primary key, an auto-increment integer.
Table B: 27 GB, 48 columns. uid (int(11)) has a non-unique index.
MySql ISAM is used.
That's an awful query. It will either
1. Scan all of A
2. For each id, look up (randomly) the uid in B's index.
or
1. Scan all of B's index on uid
2. For each uid, look up (randomly) the id in A (in the PK, hence in the data).
In either case,
the 30GB of A will all be touched;
much of the uid index of B will be touched;
Step 1 will be a linear scan;
Step 2 will be a random probe, presumably involving lots of I/O.
Please explain the intent of the query; maybe we can help you reformulate it to achieve the same or a similar purpose.
Meanwhile, how much RAM do you have? What is the setting of innodb_buffer_pool_size? And are the tables InnoDB?
The query will eventually return a result, unless some "timeout" kills it.
Is id an AUTO_INCREMENT? Or is uid a "UUID"? (UUIDs make performance worse, but there are some minor tips to help.)
Let's say there is a table of people, with an age column which is indexed. How fast would a query to count people older than 20 be: SELECT COUNT(*) FROM people WHERE age > 20? Is a full table scan required? The database is MySQL.
If the column age is not indexed, then yes, a full table scan is required.
Even if it is indexed, a table scan is required anyway when the distribution of age values is such that more than a certain threshold percentage of the records have age > 20. It works this way: for each row that would be returned by the query, the processor must execute n disk I/O operations, where n is the number of levels in the index. Say there are a million rows in the table and the index on age is 5 levels deep. If more than 200k rows have age > 20, then the processor has to execute 5 I/Os for each of those rows, for a total of 200k * 5 = 1 million I/Os. So the optimizer says: if my statistics indicate that more than 200k rows would be returned, I might as well do a complete table scan, which will require less than 1 million I/Os.
The only exception to this is if the entire table is clustered on the age column; then you only need to traverse the index for the boundaries of the age range you want to filter on.
There are some errors in the Accepted Answer. Rather than dissect that Answer, I will start fresh:
Given SELECT COUNT(*) FROM people WHERE age > 20, here is the performance for InnoDB, fastest first:
1. `INDEX(age)` -- Range scan within the index
2. `INDEX(age, ...)` -- Range scan within the index
3. `INDEX(foo, age)` -- Full Index scan
4. `PRIMARY KEY(age, ...)` -- Range scan within the table
5. No indexes -- Table scan needed
6. `PRIMARY KEY(foo, ...)` -- Table scan needed (same as "No index")
Notes and caveats:
INDEX(age, ...) is a little slower than INDEX(age) only because the index is bulkier.
Any secondary index containing all the columns mentioned anywhere in the SELECT (just age, in this example) is called a "covering" index. EXPLAIN will say Using index (not to be confused with Using index condition). A covering index is faster than other secondary indexes. (If we had another column in the select, I could say more.)
Note "Range scan" vs "scan" -- This is where the processing can drill down the BTree index (primary or secondary) to where age = 20 and scan forward. That is, it does not need to scan the entire table or index, hence "Range scan" is faster than "scan".
Items 3 and 4 may not be in the correct order. Item 3 may be faster when the index is significantly less bulky than the table. Item 4 may be faster when the range is a small fraction of the table. Because of this "maybe", I can't say "a covering index is always faster than using the PRIMARY KEY". Instead I can only say "usually faster".
A million rows is likely to have only 3 levels of BTree. However this part of the computation is almost never worth pursuing. (Rule of Thumb: each level of an index or table BTree fans out by a factor of 100.)
If the necessary part of the data or index is not already cached in RAM, then there will be I/O -- this can drastically slow down any of the cases. It can even turn the fastest case into slower than all the rest.
If the data/index is too big to be cached, then there will always be I/O. In this case the ordering will stay roughly the same, but the differences will be more pronounced. (For example, "bulkier" becomes a significant factor.)
SELECT name FROM t WHERE age>20 is a different can of worms. Some of what I have said does not carry over to it. (Ask another Question if you want me to spell that out. It will have more cases.)
MyISAM and MEMORY have differences relative to InnoDB.
I have 30 million records and one field (updated) is a tinyint(1) with an index.
If I run:
SELECT * FROM `table` WHERE `updated` = 1
It takes an increasingly long time depending on how many rows are set to 1.
If it is, say, 10,000, it will be quite fast, about 1 second. However, if there are, say, 1 million, it takes a couple of minutes.
Isn't the index supposed to make this fast?
When I run the same query on a similar non-indexed field, except that it is int(1), it performs the same as the indexed field.
Any ideas as to why this would be? Is tinyint bad?
In general, using a binary column for an index is not considered a good idea. There are some cases where it is okay, but this is generally suspect.
The main purpose of an index is to reduce the I/O of a query. The way your query is expressed, it requires both the original data (to satisfy "select *") and the index (for the where clause).
So, the database engine will go through the index. Each time it finds a matching record, it brings the page into memory. You have an I/O hit. Because your table is so large, the page probably was not seen already, so there is a real I/O hit.
Interestingly, your experience supports this. 10,000 rows is about one second. 100 times as many rows (one million) is about 100 seconds. You are witnessing linearity in performance.
By the way, the query would be faster if you did "select updated" instead of "select *". This query could be satisfied from the index alone. If you have an id column, you could create the index on (updated, id), and then do "select id" for performance.
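The index-only idea can be observed with a small experiment. The sketch below uses Python's built-in sqlite3 module purely as a stand-in, because it runs without a server; SQLite is not MySQL, and the table name t, the payload column, and the index name are all assumptions for illustration. In MySQL the equivalent check is to run EXPLAIN and look for "Using index".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER PRIMARY KEY, updated INTEGER, payload TEXT)"
)
conn.execute("CREATE INDEX idx_updated_id ON t (updated, id)")


def query_plan(sql):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the detail text


# SELECT * needs payload, which lives only in the table rows.
plan_star = query_plan("SELECT * FROM t WHERE updated = 1")
# SELECT id needs only columns that are stored in the index itself.
plan_id = query_plan("SELECT id FROM t WHERE updated = 1")

print(plan_star)
print(plan_id)
```

The plan for "select id" should mention a covering index, while the plan for "select *" should not, because every match forces a visit to the table row.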
Say I have a large table, about 2 million rows and 50 columns. Using MySQL, how efficient would it be to search an entire column for one particular value, and then return the row number of said value? (Assume random distribution of values throughout the entire column)
If an operation like this takes an extended amount of time, what can I do to speed it up?
If the column in question is indexed, then it's pretty fast.
Don't be cavalier with indexes, though. The more indexes you have, the more expensive your writes will be (inserts/updates/deletes). Also, they take up disk space and RAM (and can easily be larger than the table itself). Indexes are good for querying, bad for writing. Choose wisely.
Exactly how fast are we talking here? This depends on the configuration of your DB machine. If it doesn't have enough RAM to hold the indexes and data, the operation may become disk-bound and performance will be reduced. An operation without an index will be slowed down in the same way. Assuming the machine is fine, it further depends on how selective your index is. If you have a table with 10M rows and you index a column with boolean values, you will get only a slight increase in performance. If, on the other hand, you index a column with many different values (user emails), the query will be orders of magnitude faster.
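One rough way to think about "how selective" a column is: the fraction of distinct values among all rows. The Python sketch below is only an illustration with made-up data; the function name selectivity and the sample values are assumptions, and a real database estimates this from its own index statistics.

```python
def selectivity(values):
    """Fraction of distinct values: close to 0 is poor, 1.0 means unique."""
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    return len(set(values)) / len(values)


# Made-up sample data: a boolean flag versus unique email addresses.
flags = [i % 2 for i in range(10_000)]
emails = [f"user{i}@example.com" for i in range(10_000)]

print(selectivity(flags))
print(selectivity(emails))
```

A boolean column matches about half the table for either value, so an index on it narrows the search very little; a unique column narrows it to a single row.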
Also, by modern standards, table with 2M rows is rather small :-)
The structure of the data makes a big difference here, because it will affect your ability to index. Have a look at MySQL indexing options (fulltext, etc).
There is no easy answer to that question; it depends on more details about your data. As many others have advised you already, creating an index on the column you have to search (for an exact match, or for a string prefix) will be quite efficient.
As an example, I have a MyISAM table with 27,000,000 records (6.7 GB in size) which holds an index on a VARCHAR(128) field.
Here are two sample queries (real data) to give you an idea:
mysql> SELECT COUNT(*) FROM Books WHERE Publisher = "Hachette";
+----------+
| COUNT(*) |
+----------+
|    15072 |
+----------+
1 row in set (0.12 sec)
mysql> SELECT Name FROM Books WHERE Publisher = "Scholastic" LIMIT 100;
...
100 rows in set (0.17 sec)
So yes, I think MySQL is definitely fast enough to do what you're planning to do :)
Create an index on that column.
Create an index on the column in question and performance should not be a problem.
In general - add an index on the column