Is MySQL an Efficient System for Searching Large Tables?

Say I have a large table, about 2 million rows and 50 columns. Using MySQL, how efficient would it be to search an entire column for one particular value, and then return the row number of said value? (Assume random distribution of values throughout the entire column)
If an operation like this takes an extended amount of time, what can I do to speed it up?

If the column in question is indexed, then it's pretty fast.
Don't be cavalier with indexes, though. The more indexes you have, the more expensive your writes will be (inserts/updates/deletes). Also, they take up disk space and RAM (and can easily be larger than the table itself). Indexes are good for querying, bad for writing. Choose wisely.
Exactly how fast are we talking here? That depends on the configuration of your DB machine. If it doesn't have enough RAM to hold the indexes and data, the operation may become disk-bound and performance will drop. The same applies to a query that runs without an index. Assuming the machine is fine, it further depends on how selective your index is. If you have a table with 10M rows and you index a column with boolean values, you will get only a slight increase in performance. If, on the other hand, you index a column with many different values (user emails), the query will be orders of magnitude faster.
Also, by modern standards, a table with 2M rows is rather small :-)
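To put rough numbers on the selectivity point above, here is a minimal Python sketch. It assumes values are evenly distributed across the column, which real data rarely is, and the function name is made up for illustration.

```python
# Rough selectivity arithmetic; assumes an even distribution of values.
def expected_matches(total_rows: int, distinct_values: int) -> float:
    """Average number of rows that one equality lookup on the column returns."""
    return total_rows / distinct_values

rows = 10_000_000
boolean_flag = expected_matches(rows, 2)     # only two possible values
user_email = expected_matches(rows, rows)    # every value is unique

print(f"boolean flag: ~{boolean_flag:,.0f} rows per lookup")
print(f"unique email: ~{user_email:,.0f} row per lookup")
```

With the boolean column the index still leaves about half the table to read, which is why it helps so little; with unique emails it narrows the search to a single row.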

The structure of the data makes a big difference here, because it will affect your ability to index. Have a look at MySQL's indexing options (FULLTEXT, etc.).

There is no easy answer to that question; it depends on more details about your data. As many others have advised you already, creating an index on the column you have to search (for an exact match, or for a string prefix) will be quite efficient.
As an example, I have a MyISAM table with 27,000,000 records (6.7 GB in size) which holds an index on a VARCHAR(128) field.
Here are two sample queries (real data) to give you an idea:
mysql> SELECT COUNT(*) FROM Books WHERE Publisher = "Hachette";
+----------+
| COUNT(*) |
+----------+
| 15072 |
+----------+
1 row in set (0.12 sec)
mysql> SELECT Name FROM Books WHERE Publisher = "Scholastic" LIMIT 100;
...
100 rows in set (0.17 sec)
So yes, I think MySQL is definitely fast enough to do what you're planning to do :)

Create an index on that column.

Create an index on the column in question and performance should not be a problem.

In general - add an index on the column
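The advice repeated in these answers is easy to try out. The sketch below uses Python's built-in sqlite3 module as a stand-in, because it needs no server. That is an assumption made for convenience: MySQL's optimizer and EXPLAIN output are different, and the table and index names are invented. The principle it shows is the same, though: without an index the engine scans the table, and with one it searches the index.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE big_table (id INTEGER PRIMARY KEY, val INTEGER, payload TEXT)"
)
conn.executemany(
    "INSERT INTO big_table (val, payload) VALUES (?, ?)",
    ((i * 7919 % 100_003, "x") for i in range(50_000)),
)

def plan(sql: str) -> str:
    """Return the query plan text SQLite reports for the statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

query = "SELECT id FROM big_table WHERE val = 4242"

plan_without_index = plan(query)   # full scan of the table
conn.execute("CREATE INDEX idx_val ON big_table (val)")
plan_with_index = plan(query)      # lookup through idx_val

print("before:", plan_without_index)
print("after: ", plan_with_index)
```

On MySQL the equivalent check would be to run EXPLAIN on the SELECT before and after CREATE INDEX and compare the reported access type and row estimate.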

Related

Mariadb Explain statement estimating high number of rows that would be found during lookup

I have a table with 32 columns, of which 6 columns make up the primary key and 2 more columns are indexed.
Explain statement provides the below output
I have observed that every time the number of rows in the EXPLAIN output increases, the select query takes seconds to retrieve data from the DB. The above select query returned only 310 rows, but it had to scan 382546 rows.
Time taken was calculated by enabling mariadb's slow query log.
Create table query
I would like to understand what is wrong with the table or the query that is slowing down the select query so considerably.

Your row is relatively large (around 300 bytes, depending on the content of your varchar columns). Using the primary key means (for InnoDB) that MySQL will read the whole row. Assuming the estimate of 400k rows is right (which it probably isn't, but you can check by removing the and country_code = 1506 from your query to get a better count), MySQL may end up reading more than 100 MB from disk, which can reasonably take several seconds.
Adding a proper index should fix this. In your case I would suggest (country_code, lcr_run_id, tier_type) (which would, with your primary key, actually be the same as just (country_code)).
If most of your queries have that form (i.e. use at least these three columns for lookup), you could think about changing the order of your primary key to start with those three columns. It should give you another speed boost. That operation will take some time, though.

Hash partitioning is useless for performance, get rid of it. Ditto for subpartitioning.
Specifying which partition to use defeats the purpose of letting the Optimizer do it for you.
You simply need INDEX(tier_type, lcr_run_id, country_code) with the columns in any desired order.
Plan A: Have the PRIMARY KEY start with those 3 columns (again, the order is not important)
Plan B: Have a "secondary" index with those 3 columns, but not being the same as the start of the PK. (This index could have more columns on the end; let's see some more queries to advise further.)
Either way, it will scan only 310 rows if you also get rid of all partitioning. (Hence, resolving your "returned only 310 rows but it had to scan 382546 rows". Anyway, the '382546' may have been a poor estimate by Explain.)
The important issue here is that indexing works with the leftmost columns in the INDEX. (The PK is an index.) Your SELECT had a match on the first 2 columns, but country_code came later in the list, and the intervening columns were not tested with =.
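The leftmost-columns rule can be seen in a small experiment. This is only a sketch under loud assumptions: it uses Python's sqlite3 module rather than MariaDB, and the table, the index names, and the position of zone in the wide index are hypothetical, because the original CREATE TABLE is not shown. The idea it illustrates is the one described above: a gap in the index column list stops the lookup from using the columns after the gap.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in, not MariaDB (assumption)
conn.execute(
    "CREATE TABLE rates (lcr_run_id INT, tier_type INT, zone INT, "
    "country_code INT, price REAL)"
)
# Hypothetical wide index with a column (zone) between the tested ones.
conn.execute(
    "CREATE INDEX idx_wide ON rates (lcr_run_id, tier_type, zone, country_code)"
)

def plan(sql: str) -> str:
    """Return the query plan text SQLite reports for the statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

query = (
    "SELECT price FROM rates "
    "WHERE lcr_run_id = 1 AND tier_type = 2 AND country_code = 1506"
)

# Only the two leading columns can narrow the search; zone is not tested,
# so country_code cannot be used to seek within idx_wide.
plan_before = plan(query)

# An index whose leading columns are exactly the three tested columns.
conn.execute(
    "CREATE INDEX idx_lookup ON rates (tier_type, lcr_run_id, country_code)"
)
plan_after = plan(query)

print("before:", plan_before)
print("after: ", plan_after)
```

The second plan uses all three equality tests, which is the effect Plan A and Plan B aim for.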
The three 35M values make me wonder if the PK is over-specified. For example, if a "zone" is comprised of several "countries", then "zone" is irrelevant in specifying the PK.
The table has only 382K rows, but it is much fatter than it needs to be. Partitioning has a lot of overhead. Also, most columns have (I think) much bigger datatypes than needed. BIGINT takes 8 bytes; INT takes 4 bytes. For example, if there are only a small number of "zones", use TINYINT UNSIGNED, which takes only 1 byte (and allows values 0..255). (See also other 'int' variants.)
Oops, I missed something else. Since zone is not in the WHERE, it can't even get past the primary partitioning.

MySQL - select a record from 18446744073709551615 records

My question is simple: let's say that I hypothetically have 18446744073709551615 records in one table (the max number), but I want to select only one of those records, something like this:
SELECT * FROM TABLE1 WHERE ID = 5
1.- will the result be so slow to appear?
or if I have another table with only five records and I do the same query
SELECT * FROM TABLE2 WHERE ID = 5
2.- will the result appear at the same speed as in the first select or will be much faster in this other one?
thanks.

Let's assume for simplicity that the ID column is a fixed-width primary key. It will be found in roughly 64 index lookups (Wolfram Alpha on that). Since MySQL / InnoDB uses BTrees, it will be somewhat less than that for disk seeks.
Searching for 1 row in a million would take you roughly 20 index lookups. Seeking among 5 values will take 3 index lookups, and the whole page will probably fit into one block.
Most of the speed difference will come from data that is being read from disk. The index branching should be a relatively fast operation, and functionally you would not notice the difference once the values were cached in RAM. That is to say, the first time you select from your 2^64 rows, it will take a little while to read from a spinning disk, but the speed will be essentially the same for the 5 and 2^64 rows if you repeat the query (even ignoring the query cache).
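The lookup counts above are base-2 logarithms, and a B-tree needs fewer steps because each page branches many ways. Here is a small Python sketch of that arithmetic. The fanout of 100 keys per page is an assumed round number for illustration, not a measured InnoDB value.

```python
def levels_needed(row_count: int, fanout: int) -> int:
    """Smallest number of levels so that fanout ** levels >= row_count."""
    levels, capacity = 0, 1
    while capacity < row_count:
        capacity *= fanout
        levels += 1
    return levels

MAX_ROWS = 18446744073709551615  # 2**64 - 1, the figure from the question

for rows in (5, 1_000_000, MAX_ROWS):
    binary_steps = levels_needed(rows, 2)     # plain binary search
    btree_levels = levels_needed(rows, 100)   # assumed 100 keys per page
    print(f"{rows:>20} rows: {binary_steps:>2} binary steps, "
          f"{btree_levels:>2} B-tree levels")
```

Going from 5 rows to 2^64 rows adds only a handful of page reads per lookup under this assumption, which is why the difference mostly disappears once the pages are cached.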

No, the first one will almost certainly be slower than the second, but probably not that much slower, provided you have an index on the ID column.
With an index, you can efficiently find the first record meeting the condition and then all the other records will be close by (in the index structures anyway, not necessarily the data area).
I'd say you're more likely to run out of disk storage with the first one before you run out of database processing power :-)

Two non-primary/unique indexes in a three column table

I've got a three-column table. It has a unique index, and another two (for two different columns) for faster queries.
+-------------+-------------+----------+
| category_id | related_id | position |
+-------------+-------------+----------+
Sometimes the query is
SELECT * FROM table WHERE category_id = foo
and sometimes it's
SELECT * FROM table WHERE related_id = foo
So I decided to make both category_id and related_id an index for better performance. Is this bad practice? What are the downsides of this approach?
If I already have 100,000 rows in that table and am inserting another 100,000, will it be overkill to have to refresh the index with every new insert? Would that operation then take too long? Thanks

There are no downsides if it's doing exactly what you want. You query on a specific column a lot, so you index that column; that's the whole point. If, on the other hand, you have a 60-column table and you're adding indexes to columns you never query on, then you are wasting resources, because those indexes need to be maintained on INSERT/UPDATE/DELETE operations.

If you have created an index for each column, then you will definitely benefit from it.
Don't go for composite indexes (multiple-column indexes).
You can see the advantage of the index in your query yourself by using EXPLAIN (the statement provides information about how MySQL executes statements).
EXAMPLE:
EXPLAIN SELECT * FROM table WHERE category_id = foo;
Hope this will help.
~K

It's good to have indexes. Just understand that indexes take more disk space in exchange for faster searches.
It is in your best interest to index those fields which have few repeated values. For example, indexing a field that contains a Boolean flag might not be a good idea.
Since in your case you are indexing an ID, I think you won't have any problem keeping the indexes that you have created.
Also, the inserts will be slower, but since you are saving IDs there won't be much of a difference in the time required to insert. Go ahead and do the insert.
My personal advice:
When you are inserting a large number of rows into a single table in one go, don't insert them using a single query unless you have to. This prevents your table from being locked and inaccessible for a long time.
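One way to follow that advice is to insert in fixed-size batches and commit after each one. The sketch below is only an illustration: it uses Python's sqlite3 module as a stand-in for MySQL, and the table name category_links, the batch size, and the generated rows are all assumptions.

```python
import sqlite3
from itertools import islice

conn = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL (assumption)
conn.execute(
    "CREATE TABLE category_links (category_id INT, related_id INT, position INT)"
)
conn.execute("CREATE INDEX idx_category ON category_links (category_id)")
conn.execute("CREATE INDEX idx_related ON category_links (related_id)")

def insert_in_batches(rows, batch_size=1_000):
    """Insert rows in chunks, committing after each chunk; return chunk count."""
    iterator = iter(rows)
    batches = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return batches
        with conn:  # commits on success, rolls back this chunk on error
            conn.executemany(
                "INSERT INTO category_links VALUES (?, ?, ?)", batch
            )
        batches += 1

# 100,000 made-up rows, matching the size mentioned in the question.
new_rows = ((i % 500, i, i % 10) for i in range(100_000))
batches = insert_in_batches(new_rows)
total = conn.execute("SELECT COUNT(*) FROM category_links").fetchone()[0]
print(f"{total} rows inserted in {batches} batches")
```

Each batch holds its locks only briefly, so other queries get a chance to run between batches. Both indexes are updated as the rows arrive.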

MySQL performance: multiple tables vs. index on single table and partitions

I am wondering what is more efficient and faster in performance:
Having an index on one big table or multiple smaller tables without indexes?
Since this is a pretty abstract problem let me make it more practical:
I have one table with statistics about users (20,000 users and about 30 million rows overall). The table has about 10 columns including the user_id, actions, timestamps, etc.
Most common applications are: Inserting data by user_id and retrieving data by user_id (SELECT statements never include multiple user_id's).
Now so far I have an INDEX on the user_id and the query looks something like this
SELECT * FROM statistics WHERE user_id = 1
Now, with more and more rows the table gets slower and slower. INSERT statements slow down because the INDEX gets bigger and bigger; SELECT statements slow down, well, because there are more rows to search through.
Now I was wondering why not have one statistics table for each user and change the query syntax to something like this instead:
SELECT * FROM statistics_1
where 1 represents the user_id obviously.
This way, no INDEX is needed and there is far less data in each table, so INSERT and SELECT statements should be much faster.
Now my questions again:
Are there any real-world disadvantages to handling so many tables (in my case 20,000) instead of using one table with an INDEX?
Would my approach actually speed things up or might the lookup for the table eventually slow down things more than everything?

Creating 20,000 tables is a bad idea. You'll need 40,000 tables before long, and then more.
I called this syndrome Metadata Tribbles in my book SQL Antipatterns Volume 1. You see this happen every time you plan to create a "table per X" or a "column per X".
This does cause real performance problems when you have tens of thousands of tables. Each table requires MySQL to maintain internal data structures, file descriptors, a data dictionary, etc.
There are also practical operational consequences. Do you really want to create a system that requires you to create a new table every time a new user signs up?
Instead, I'd recommend you use MySQL Partitioning.
Here's an example of partitioning the table:
CREATE TABLE statistics (
id INT AUTO_INCREMENT NOT NULL,
user_id INT NOT NULL,
PRIMARY KEY (id, user_id)
) PARTITION BY HASH(user_id) PARTITIONS 101;
This gives you the benefit of defining one logical table, while also dividing the table into many physical tables for faster access when you query for a specific value of the partition key.
For example, when you run a query like your example, MySQL accesses only the correct partition containing the specific user_id:
mysql> EXPLAIN PARTITIONS SELECT * FROM statistics WHERE user_id = 1\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: statistics
partitions: p1 <--- this shows it touches only one partition
type: index
possible_keys: NULL
key: PRIMARY
key_len: 8
ref: NULL
rows: 2
Extra: Using where; Using index
The HASH method of partitioning means that the rows are placed in a partition by a modulus of the integer partition key. This does mean that many user_id's map to the same partition, but each partition would have only 1/Nth as many rows on average (where N is the number of partitions). And you define the table with a constant number of partitions, so you don't have to expand it every time you get a new user.
You can choose any number of partitions up to 1024 (or 8192 in MySQL 5.6), but some people have reported performance problems when they go that high.
It is recommended to use a prime number of partitions. In case your user_id values follow a pattern (like using only even numbers), using a prime number of partitions helps distribute the data more evenly.
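The effect of a prime partition count is easy to check with the modulus rule described above. This Python sketch assumes 20,000 hypothetical user_id values that are all even, and compares 100 partitions with 101.

```python
from collections import Counter

def rows_per_partition(user_ids, num_partitions):
    """Count how many ids land in each partition under id MOD num_partitions."""
    return Counter(uid % num_partitions for uid in user_ids)

# Hypothetical pattern: 20,000 user ids, all even numbers.
even_ids = range(2, 40_002, 2)

with_100 = rows_per_partition(even_ids, 100)
with_101 = rows_per_partition(even_ids, 101)

print(f"100 partitions: {len(with_100)} used, "
      f"largest holds {max(with_100.values())} ids")
print(f"101 partitions: {len(with_101)} used, "
      f"largest holds {max(with_101.values())} ids")
```

With 100 partitions, half of them stay empty and the others hold twice their share. With 101 partitions, every partition is used and the counts differ by at most one.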
Re your questions in comment:
How could I determine a reasonable number of partitions?
For HASH partitioning, if you use 101 partitions like I show in the example above, then any given partition has about 1% of your rows on average. You said your statistics table has 30 million rows, so if you use this partitioning, you would have only 300k rows per partition. That is much easier for MySQL to read through. You can (and should) use indexes as well -- each partition will have its own index, and it will be only 1% as large as the index on the whole unpartitioned table would be.
So the answer to how can you determine a reasonable number of partitions is: how big is your whole table, and how big do you want the partitions to be on average?
Shouldn't the amount of partitions grow over time? If so: How can I automate that?
The number of partitions doesn't necessarily need to grow if you use HASH partitioning. Eventually you may have 30 billion rows total, but I have found that when your data volume grows by orders of magnitude, that demands a new architecture anyway. If your data grow that large, you probably need sharding over multiple servers as well as partitioning into multiple tables.
That said, you can re-partition a table with ALTER TABLE:
ALTER TABLE statistics PARTITION BY HASH(user_id) PARTITIONS 401;
This has to restructure the table (like most ALTER TABLE changes), so expect it to take a while.
You may want to monitor the size of data and indexes in partitions:
SELECT table_schema, table_name, table_rows, data_length, index_length
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE partition_method IS NOT NULL;
Like with any table, you want the total size of active indexes to fit in your buffer pool, because if MySQL has to swap parts of indexes in and out of the buffer pool during SELECT queries, performance suffers.
If you use RANGE or LIST partitioning, then adding, dropping, merging, and splitting partitions is much more common. See http://dev.mysql.com/doc/refman/5.6/en/partitioning-management-range-list.html
I encourage you to read the manual section on partitioning, and also check out this nice presentation: Boost Performance With MySQL 5.1 Partitions.

It probably depends on the type of queries you plan on making often, and the best way to know for sure is to just implement a prototype of both and do some performance tests.
With that said, I would expect that a single (large) table with an index will do better overall, because most DBMSs are heavily optimized to deal with the exact situation of finding and inserting data in large tables. If you try to make many little tables in hopes of improving performance, you're kind of fighting the optimizer (which is usually better).
Also, keep in mind that one table is probably more practical for the future. What if you want to get some aggregate statistics over all users? Having 20 000 tables would make this very hard and inefficient to execute. It's worth considering the flexibility of these schemas as well. If you partition your tables like that, you might be designing yourself into a corner for the future.

Concrete example:
I have one table with statistics about users (20,000 users and about 30 million rows overall). The table has about 10 columns including the user_id, actions, timestamps, etc.
Most common applications are: Inserting data by user_id and retrieving data by user_id (SELECT statements never include multiple user_id's).
Do this:
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
...
PRIMARY KEY(user_id, id),
INDEX(id)
Having user_id at the start of the PK gives you "locality of reference". That is, all the rows for one user are clustered together thereby minimizing I/O.
The id on the end of the PK is because the PK must be unique.
The strange-looking INDEX(id) is to keep AUTO_INCREMENT happy.
Abstract question:
Never have multiple identical tables.
Use PARTITIONing only if it meets one of the use-cases listed in http://mysql.rjweb.org/doc.php/partitionmaint
A PARTITIONed table needs a different set of indexes than the non-partitioned equivalent table.
In most cases a single, non-partitioned, table is optimal.
Use the queries to design indexes.

There is little to add to Bill Karwin's answer. But one hint is: check whether all the data for a user is really needed in complete detail for all time.
If you want to show usage statistics, number of visits, or similar figures, you usually will not need the granularity of single actions and seconds for, say, the year 2009 when viewed from today. So you could build aggregation tables and an archive table (not the ARCHIVE engine, of course) to keep the recent data at action level and an overview of the older actions.
Old actions don't change, I think.
And you still can go into detail from the aggregation with a week_id in the archive-table for example.

Instead of going from 1 table to 1 table per user, you can use partitioning to hit a number of tables/table size ratio somewhere in the middle.
You can also keep stats on users to try to move 'active' users into 1 table to reduce the number of tables that you have to access over time.
The bottom line is that there is a lot you can do, but largely you have to build prototypes and tests and just evaluate the performance impacts of various changes you are making.

Should I avoid COUNT altogether in InnoDB?

Right now, I'm debating whether or not to use COUNT(id) or "count" columns. I heard that InnoDB COUNT is very slow without a WHERE clause because it needs to lock the table and do a full index scan. Is that the same behavior when using a WHERE clause?
For example, if I have a table with 1 million records. Doing a COUNT without a WHERE clause will require looking up 1 million records using an index. Will the query become significantly faster if adding a WHERE clause decreases the number of rows that match the criteria from 1 million to 500,000?
Consider the "Badges" page on SO, would adding a column in the badges table called count and incrementing it whenever a user earned that particular badge be faster than doing a SELECT COUNT(id) FROM user_badges WHERE user_id = 111?
Using MyISAM is not an option because I need the features of InnoDB to maintain data integrity.

SELECT COUNT(*) FROM tablename seems to do a full table scan.
SELECT COUNT(*) FROM tablename USE INDEX (colname) seems to be quite fast if the index available is NOT NULL, UNIQUE, and fixed-length. A non-UNIQUE index doesn't help much, if at all. Variable length indices (VARCHAR) seem to be slower, but that may just be because the index is physically larger. Integer UNIQUE NOT NULL indices can be counted quickly. Which makes sense.
MySQL really should perform this optimization automatically.

Performance of COUNT() is fine as long as you have an index that's used.
If you have a million records and the column in question is NOT NULL then a COUNT() will be a million quite easily. If NULL values are allowed, those aren't indexed so the number of records is easily obtained by looking at the index size.
If you're not specifying a WHERE clause, then the worst case is the primary key index will be used.
If you specify a WHERE clause, just make sure the column(s) are indexed.

I wouldn't say avoid, but it depends on what you are trying to do:
If you only need to provide an estimate, you could do SELECT MAX(id) FROM table. This is much cheaper, since it just needs to read the max value in the index.
If we consider the badges example you gave, InnoDB only needs to count up the number of badges that user has (assuming an index on user_id). I'd say in most cases that's not going to be more than 10-20, so it does not do much harm at all.
It really depends on the situation. I probably would keep the count of the number of badges someone has on the main user table as a column (count_badges_awarded) simply because every time an avatar is shown, so is that number. It saves me having to do 2 queries.
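The two approaches discussed here can be compared side by side. The following is a minimal sketch, assuming Python's sqlite3 module as a stand-in for InnoDB and a made-up schema built around the user_badges and count_badges_awarded names from this thread. The counter is updated in the same transaction as the insert so the two cannot drift apart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for InnoDB (assumption)
conn.executescript("""
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    count_badges_awarded INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE user_badges (user_id INTEGER NOT NULL, badge_id INTEGER NOT NULL);
CREATE INDEX idx_user_badges_user ON user_badges (user_id);
INSERT INTO users (id) VALUES (111);
""")

def award_badge(user_id: int, badge_id: int) -> None:
    """Record the badge and bump the cached counter in one transaction."""
    with conn:  # both statements commit together or not at all
        conn.execute(
            "INSERT INTO user_badges (user_id, badge_id) VALUES (?, ?)",
            (user_id, badge_id),
        )
        conn.execute(
            "UPDATE users SET count_badges_awarded = count_badges_awarded + 1 "
            "WHERE id = ?",
            (user_id,),
        )

for badge_id in range(1, 13):
    award_badge(111, badge_id)

# Option 1: count on demand, using the index on user_id.
counted = conn.execute(
    "SELECT COUNT(*) FROM user_badges WHERE user_id = ?", (111,)
).fetchone()[0]
# Option 2: read the cached counter from the row that is loaded anyway.
cached = conn.execute(
    "SELECT count_badges_awarded FROM users WHERE id = ?", (111,)
).fetchone()[0]
print(counted, cached)
```

The cached column saves a query per page view at the cost of one extra write per badge, which is the trade-off described above.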
