How can I make this join of two huge MySQL tables finish? - mysql

I have two tables
table1:
column1: varchar(20)
column2: varchar(20)
column3: varchar(20)
table2:
column1: varchar(20)
column2: varchar(20)
column3: varchar(20) <- empty
column1 and column2 both have a separate Fulltext index in table1
both tables hold 20 million rows
I need to fill column3 of table2 by matching column1 & column2 from table2 to column1 & column2 from table1, then take the value in column3 from table1 and put it into column3 of table2. column1 & column2 might not match exactly, so the query I use for this is:
UPDATE table1, table2
SET table2.column3 = table1.column3
WHERE table2.column1 LIKE table1.'%column1%' AND
table2.column2 LIKE table1.'%column2%';
This query never finishes. I let it run for 2 weeks and it still didn't produce any result. It utilized one CPU core at 100%, had little SSD IO, and apparently needs to be optimized somehow.
I am open to any suggestions regarding query optimization, index optimization or even DBMS optimization (or even migration, if it helps) since I need to do queries like this more often in the future.
EDIT1
There are plenty of optimization guides; please use Google for that. You can increase the threads in the config (InnoDB). For the UPDATE itself I recommend first creating a temp_table and then copying it to db2
I know that but couldn't quite solve my scenario with those guides. I also know that questions of all possible permutations of combinations for this problem (huge databases, performance, bottlenecks, query design) are all around, also on stackoverflow. However, to this day I couldn't figure out what the best way to proceed would be for this specific combination of problems and hoped for getting help here. That being said:
- more threads would require sharding or partitioning in order to utilize more than one CPU core, which I would like to avoid if I can solve the problem with other means
- how would you propose to create such temporary table here?
Why do you use the LIKE operator if you do not use wildcard characters? Replace them with =. Also, do you have a multi-column index on the 3 columns in the where criteria in each of the tables? Please share the output of EXPLAIN as well, along with any existing indexes in the 2 tables.
I left those characters out in the example but want to use them once the basic query works, sorry for the confusion. I am not entirely sure how to put those wildcards into a column comparison though.
I have two separate indexes; should I create a 2-column index instead? (there are only 2 columns in the where criteria)
Would you rather have the EXPLAIN of the structure I have now, or the EXPLAIN of the structure with a 2-column index?
I guess you say databases but you are talking about tables, right?
Exactly, sorry for the confusion.
The query you wrote will perform 20M x 20M lookups (for each row in table1, look up all rows in table2). You can't write whatever you want and expect it to work just because you have an SSD or a good CPU. If you arrived at this point, it's time to think before you start writing SQL. What is it that you need to do, what are the tools you have at your disposal, and what's the middle part that you don't know - those are the questions you need to answer every time before you issue a 400 trillion lookup query.
That is the scenario I am facing though. I don't expect it to work at all like it is at the moment, to be honest, so I am looking for pointers which might make this a solvable scenario. The basic "update this, where that matches" query apparently doesn't apply here. So I am trying to figure out a way to a more advanced solution. Any criticism is very welcome, so thank you for this input. How would you suggest to proceed here?
EDIT2
Give us some sample values and non-exact comparisons.
table1:
+---------+---------+-------------+---------+---------+---------+
| column1 | column2 | column3 | column4 | column5 | columnN |
+---------+---------+-------------+---------+---------+---------+
| John | Doe_ | employee001 | xyz | 12345 | ... |
| Jim | Doe | employee002 | abc | 67890 | ... |
+---------+---------+-------------+---------+---------+---------+
table2:
+---------+---------+---------+
| column1 | column2 | column3 |
+---------+---------+---------+
| John | Doe | |
| Jim | Doe | |
+---------+---------+---------+
Here, a LIKE query would fill both rows of table2 if it matched "Doe_" against "Doe". But by writing this down, I just realized that a LIKE query is no option here, because the variations aren't constrained to a suffix of column2 in table1; various possible LIKE patterns would be required (leading AND trailing variants for both columns in both tables). This in turn would multiply the number of required matches.
So let's forget about the LIKE and concentrate on exact matching only.
FULLTEXT and LIKE have nothing to do with each other.
"Might not match exactly" -- You will need more limitations on this non-restriction. Else, any attempt at a query will continue to take weeks.
t2.c1 LIKE CONCAT('%', t1.c1, '%') requires checking every row of t1 against every row of t2; that's 400 trillion tests. No hardware can do that in a reasonable length of time.
FULLTEXT works with "words". If your c1 and c2 are strings of words, then there is some hope to use FULLTEXT. FULLTEXT is much faster than LIKE because it has an index structure based on words.
However, even FULLTEXT is nowhere near the speed of t2.c1 = t1.c1. Still, that would need a composite INDEX(c1, c2). Then it would be a full table scan (20M rows) of one table, plus 20M probes via a BTree index into the other table. This is about 40M operations -- a lot better than 400T for LIKE.
In order to proceed, please think through your definition of "Might not match exactly" and present the best you can live with.
Ok, since I decided to drop the LIKE requirement, what exactly do you propose to use as index?
I read your post like this:
ALTER TABLE `table1` ADD FULLTEXT INDEX `indexname1` (`column1`, `column2`);
ALTER TABLE `table2` ADD FULLTEXT INDEX `indexname2` (`column1`, `column2`);
UPDATE `table1`, `table2`
SET `table2`.`column3` = `table1`.`column3`
WHERE CONCAT(`table1`.`column1`, `table1`.`column2`) = CONCAT(`table2`.`column1`, `table2`.`column2`);
Is this correct?
Two followup questions though:
1) Is the update, in your opinion, as fast as, faster than, or slower than creating a new table, i.e.:
CREATE TABLE `merged` AS
SELECT `table1`.`column1`, `table1`.`column2`, `table1`.`column3`
FROM `table1`, `table2`
WHERE CONCAT(`table1`.`column1`, `table1`.`column2`) = CONCAT(`table2`.`column1`, `table2`.`column2`);
2) Would the indexes and/or the matching be case sensitive? If yes, can I adapt the query without having to change column1 & column2 to all upper case (or all lower case)?

Edit
WHERE CONCAT(t1.c1, t1.c2) = CONCAT(t2.c1, t2.c2) is a lot worse than saying WHERE t1.c1 = t2.c1 AND t1.c2 = t2.c2. The latter will run fast with INDEX(c1,c2).
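Putting that Edit together, a minimal sketch of the index-plus-equality-join approach could look like this (the index name idx_t1_c1_c2 is an assumption, not from the post):
-- composite index so each table2 row can probe table1 via the BTree
ALTER TABLE table1 ADD INDEX idx_t1_c1_c2 (column1, column2);
-- multi-table UPDATE using plain equality on both columns
UPDATE table2
JOIN table1
  ON  table1.column1 = table2.column1
  AND table1.column2 = table2.column2
SET table2.column3 = table1.column3;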

Try this:
1. Add a new column to db1 and db2 that holds column1 and column2 joined by a character that never appears in either of them, for example #
ALTER TABLE `db1` ADD `column4` VARCHAR(41) NOT NULL;
UPDATE db1 SET column4 = CONCAT(column1, '#', column2);
2. Do the same for db2. Then create an index (BTREE) on column4 (in db1 and db2).
ALTER TABLE `db1` ADD INDEX ( `column4` ) ;
ALTER TABLE `db2` ADD INDEX ( `column4` ) ;
3. Then run next query:
UPDATE db1, db2 SET db2.column3 = db1.column3 WHERE db1.column4 = db2.column4;
It should run fast enough.
When it's done, just drop column4 and its index.

Related

MySQL fastest technique for insert, replace, on duplicate of mass records

I know there are a lot of related questions with many answers, but I have a bit of a more nuanced question. I have been reading on different insert techniques for mass records, but are there limits on how big an insert query can be? Can the same technique be used for REPLACE and INSERT ... ON DUPLICATE KEY UPDATE ...? Is there a faster method?
Table:
+-----------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+-------------+------+-----+---------+----------------+
| a | int(11) | NO | PRI | NULL | auto_increment |
| b | int(11) | YES | | NULL | |
| c         | int(11)     | YES  |     | NULL    |                |
+-----------+-------------+------+-----+---------+----------------+
#1
1) "INSERT INTO TABLE COLUMNS (a,b,c) values (1,2,3);"
2) "INSERT INTO TABLE COLUMNS (a,b,c) values (5,6,7);"
3) "INSERT INTO TABLE COLUMNS (a,b,c) values (8,9,10);"
...
10,000) "INSERT INTO TABLE COLUMNS (a,b,c) values (30001,30002,30003);"
or
#2 - should be faster, but is there a limit?
"INSERT INTO TABLE COLUMNS (a,b,c) values (1,2,3),(4,5,6),(8,9,10)....(30001,30002,30003)" ;
From a scripting perspective (PHP), using #2, is it better to loop through and queue up 100 entries (1000 times)...or a 1000 entries (100 times), or just all 10,000 at once? Could this be done with 100,000 entries?
Can the same be used with REPLACE:
"REPLACE INTO TABLE (a, b, c) VALUES(1,2,3),(4,5,6)(7,8,9),...(30001,30002,30003);"
Can it also be used with INSERT ON DUPLICATE?
INSERT INTO TABLE (a, b, c) VALUES(1,2,3),(4,5,6),(7,8,9),....(30001,30002,30003) ON DUPLICATE KEY UPDATE (b=2,c=3)(b=5,c=6),(b=8,c=9),....(b=30002,c=30003) ?
For any and all of the above (assuming the replace/on duplicate are valid), are there faster methods to achieve the inserts?
The length of any SQL statement is limited by a MySQL option called max_allowed_packet.
The syntax of INSERT allows you to add an unlimited number of tuples after the VALUES clause, but the total length of the statement from INSERT to the last tuple must still be no more than the number of bytes equal to max_allowed_packet.
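For example, you can check and raise that limit like this (the 64 MB value is only an illustrative assumption; SET GLOBAL needs the SUPER or SYSTEM_VARIABLES_ADMIN privilege, and the change should also be persisted in my.cnf to survive restarts):
SHOW VARIABLES LIKE 'max_allowed_packet';
SET GLOBAL max_allowed_packet = 64 * 1024 * 1024;  -- 64 MB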
Regardless of that, I have found that LOAD DATA INFILE is usually significantly faster than any INSERT syntax. It's so much faster, that you might even find it faster to write your tuples to a temporary CSV file and then use LOAD DATA INFILE on that CSV file.
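As a rough sketch of that CSV-then-load idea (the file path, table name, and column list are assumptions; depending on server settings you may need LOAD DATA LOCAL INFILE or a secure_file_priv adjustment):
-- bulk-load a CSV written by the application; adjust path and delimiters to your data
LOAD DATA INFILE '/tmp/rows.csv'
INTO TABLE my_table
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(b, c);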
You might like my presentation comparing different bulk-loading solutions in MySQL: Load Data Fast!
#1 (single-row inserts) -- Slow. A variant is INSERT IGNORE -- beware: it burns AUTO_INCREMENT ids.
#2 (batch insert) -- Faster than #1 by a factor of 10. But do the inserts in batches of no more than 1000. (After that, you are into "diminishing returns" and may conflict with other activities.)
#3 REPLACE -- Bad. It is essentially a DELETE plus an INSERT. Once IODKU was added to MySQL, I don't think there is any use for REPLACE. All the old AUTO_INCREMENT ids will be tossed and new ones created.
#4 IODKU (Upsert) -- [If you need to test before Insert.] It can be batched, but not the way you presented it. (There is no need to repeat the b and c values.)
INSERT INTO TABLE (a, b, c)
VALUES(1,2,3),(4,5,6),(7,8,9),....(30001,30002,30003)
ON DUPLICATE KEY UPDATE
b = VALUES(b),
c = VALUES(c);
Or, in MySQL 8.0.19+, where the inserted rows are given a row alias (... VALUES (...) AS new ...), the last 2 lines are:
b = new.b,
c = new.c;
IODKU also burns ids.
MySQL LOAD DATA INFILE with ON DUPLICATE KEY UPDATE discusses a 2-step process of LOAD + IODKU. Depending on how complex the "updates" are, 2+ steps may be your best answer.
#5 LOAD DATA -- as Bill mentions, this is a good way if the data comes from a file. (I am dubious about its speed if you also have to write the data to a file first.) Be aware of the usefulness of @variables to make minor tweaks as you do the load. (E.g., STR_TO_DATE(..) to fix a DATE format.)
#6 INSERT ... SELECT ...; -- If the data is already in some other table(s), you may as well combine the Insert and Select. This works for IODKU, too.
As a side note, if you need to get AUTO_INCREMENT ids of each batched row, I recommend some variant on the following. It is aimed at batch-normalization of id-name pairs that might already exist in the mapping table. Normalization
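A rough sketch of that id-name normalization pattern, with hypothetical table and column names (the linked article's exact recipe may differ, and note the caveat under #1 that INSERT IGNORE burns AUTO_INCREMENT ids):
-- assumes name_map(id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(100) UNIQUE)
INSERT IGNORE INTO name_map (name)
VALUES ('alice'), ('bob'), ('carol');
-- read back the ids for the whole batch in one query
SELECT id, name
FROM name_map
WHERE name IN ('alice', 'bob', 'carol');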

How to get Count for large tables?

Sample Table:
+----+-------+-------+-------+-------+-------+---------------+
| id | col1 | col2 | col3 | col4 | col5 | modifiedTime |
+----+-------+-------+-------+-------+-------+---------------+
| 1 | temp1 | temp2 | temp3 | temp4 | temp5 | 1554459626708 |
+----+-------+-------+-------+-------+-------+---------------+
The above table has 50 million records.
(col1, col2, col3, col4, col5 these are VARCHAR columns)
(id is PK)
(modifiedTime)
Every column is indexed
For example: I have two tabs on my website.
FirstTab - I print the count of the above table with the following criteria [col1 like "value1%" and col2 like "value2%"]
SecondTab - I print the count of the above table with the following criteria [col3 like "value3%"]
As I have 50 million records, the count with those criteria takes too much time to get the result.
Note: I would change the records (rows in the table) sometimes: insert new rows, delete records that are no longer needed.
I need a feasible solution instead of querying the whole table, e.g. caching the older count. Is anything like this possible?
While I'm sure it's possible for MySQL, here's a solution for Postgres, using triggers.
The count is stored in another table, and there's a trigger on each insert/update/delete that checks if the new row meets the condition(s), and if it does, adds 1 to the count. Another part of the trigger checks if the old row meets the condition(s), and if it does, subtracts 1.
Here's the basic code for the trigger that counts the rows with temp2 = '5':
CREATE OR REPLACE FUNCTION updateCount() RETURNS TRIGGER AS
$func$
BEGIN
    IF TG_OP = 'INSERT' OR TG_OP = 'UPDATE' THEN
        EXECUTE 'UPDATE someTableCount SET cnt = cnt + 1 WHERE 1 = (SELECT 1 FROM (VALUES($1.*)) x(id, temp1, temp2, temp3) WHERE x.temp2 = ''5'')'
        USING NEW;
    END IF;
    IF TG_OP = 'DELETE' OR TG_OP = 'UPDATE' THEN
        EXECUTE 'UPDATE someTableCount SET cnt = cnt - 1 WHERE 1 = (SELECT 1 FROM (VALUES($1.*)) x(id, temp1, temp2, temp3) WHERE x.temp2 = ''5'')'
        USING OLD;
    END IF;
    RETURN NEW;
END
$func$ LANGUAGE plpgsql;
Here's a working example on dbfiddle.
You could of course modify the trigger code to have dynamic where expressions and store counts for each in the table like:
CREATE TABLE someTableCount
(
whereExpr text,
cnt INT
);
INSERT INTO someTableCount VALUES ('temp2 = ''5''', 0);
In the trigger you'd then loop through the conditions and update accordingly.
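The application would then read the maintained counter instead of counting the 50 million rows each time, for example (a sketch against the someTableCount table above):
-- one cheap row lookup instead of COUNT(*) over the big table
SELECT cnt
FROM someTableCount
WHERE whereExpr = 'temp2 = ''5''';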
FirstTab - I print the count of the above table with the following criteria [col1 like "value1%" and col2 like "value2%"]
That would benefit from a 'composite' index:
INDEX(col1, col2)
because it would be "covering". (That is, all the columns needed in the query are found in a single index.)
SecondTab - I print the count of the above table with the following criteria [col3 like "value3%"]
You apparently already have the optimal (covering) index:
INDEX(col3)
Now, let's look at it from a different point of view. Have you noticed that search engines no longer give you an exact count of rows that match? You are finding out why -- it takes too long to do the tally no matter what technique is used.
Since "col1" gives me no clue of your app, nor any idea of what is being counted, I can only throw out some generic recommendations:
Don't give the counts.
Precompute the counts, save them somewhere and deliver 'stale' values. This can be handy if there are only a few different "values" being counted. It is probably not practical for arbitrary strings.
Say "about nnnn" in the output.
Play some tricks to decide whether it is practical to compute the exact value or just say "about".
Say "more than 1000".
etc
If you would like to describe the app and the columns, perhaps I can provide some clever tricks.
You expressed concern about "insert speed". This is usually not an issue, and the benefit of having the 'right' index for SELECTs outweighs the slight performance hit for INSERTs.
It sounds like you're trying to use a hammer when a screwdriver is needed. If you don't want to run batch computations, I'd suggest using a streaming framework such as Flink or Samza to add and subtract from your counts when records are added or deleted. This is precisely what those frameworks are built for.
If you're committed to using SQL, you can set up a job that performs the desired count operations every given time window, and stores the values to a second table. That way you don't have to perform repeated counts across the same rows.
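A minimal sketch of that scheduled-count idea in MySQL, with hypothetical table and event names (the event scheduler must be enabled with event_scheduler=ON, and each tab would get its own event or statement):
-- side table holding one precomputed count per tab
CREATE TABLE tab_counts (
  tab_name     VARCHAR(32) NOT NULL PRIMARY KEY,
  row_count    BIGINT NOT NULL,
  refreshed_at DATETIME NOT NULL
);
-- refresh the FirstTab count every 5 minutes; the tabs then read tab_counts instead of the 50M-row table
CREATE EVENT refresh_first_tab_count
ON SCHEDULE EVERY 5 MINUTE
DO
  REPLACE INTO tab_counts
  SELECT 'first_tab', COUNT(*), NOW()
  FROM the_table
  WHERE col1 LIKE 'value1%' AND col2 LIKE 'value2%';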
As a general rule of thumb when it comes to optimisation (and yes, one SQL server node with 50 million entries per table needs it!), here is a list of a few possible optimisation techniques, some fairly easy to implement, others needing more serious modifications:
optimize your MySQL field types and sizes, e.g. use INT instead of VARCHAR if the data can be represented with numbers, use SMALLINT instead of BIGINT, etc. In case you really need VARCHAR, use the smallest possible length for each field,
look at your dataset; are there any repeating values? Say one of your fields has only 5 unique values across 50 million rows; then save those values to a separate table and just link its PK to this Sample Table (see the sketch after this list),
MySQL partitioning, a basic understanding of which is shown at this link; the general idea is to implement some kind of partitioning scheme, e.g. a new partition is created by a CRON job every day at "night" when server utilization is at a minimum, or when you reach another 50k INSERTs or so (btw some extra effort will also be needed for UPDATE/DELETE operations on different partitions),
caching is another very simple and effective approach, since you are requesting (almost) the same data (I am assuming your value1%, value2%, value3% are always the same?) over and over again. So do a SELECT COUNT() once in a while, and then use a differential index count to get the actual number of selected rows,
an in-memory database can be used alongside traditional SQL DBs to serve often-needed data; a simple key-value-pair style could be enough: Redis, Memcached, VoltDB, MemSQL are just some of them. MySQL also has an in-memory (MEMORY) engine,
use other types of DBs, e.g. a NoSQL DB like MongoDB, if your dataset/system can utilize a different concept.
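For instance, the low-cardinality idea from the list could look roughly like this (hypothetical names, assuming col5 has only a handful of distinct values):
-- move the repeated values of col5 into a small lookup table
CREATE TABLE col5_values (
  id  TINYINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  val VARCHAR(255) NOT NULL UNIQUE
);
INSERT INTO col5_values (val)
SELECT DISTINCT col5 FROM the_table;
-- keep only a small foreign-key column in the big table
ALTER TABLE the_table ADD COLUMN col5_id TINYINT UNSIGNED;
UPDATE the_table t
JOIN col5_values v ON v.val = t.col5
SET t.col5_id = v.id;
-- afterwards, drop the old col5 column and index/query col5_id instead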
If you are looking for aggregation performance and don't really care about insert times, I would consider changing your Row DBMS for a Column DBMS.
A column RDBMS stores data as columns, meaning each column is indexed independently of the others. This allows much faster aggregations. I switched from Postgres to MonetDB (an open-source column DBMS), and summing one field of a 6-million-row table dropped from ~60s to 50ms. I chose MonetDB as it supports SQL querying and ODBC connections, which were a plus for my use case, but you will experience similar performance improvements with other column DBMSs.
There is a downside to column storage: you lose performance on insert, update and delete queries, but from what you said, I believe it won't affect you that much.
In Postgres, you can get an estimated row count from the internal statistics that are managed by the query planner:
SELECT reltuples AS approximate_row_count FROM pg_class WHERE relname = 'mytable';
Here you have more details: https://wiki.postgresql.org/wiki/Count_estimate
You could create a materialized view first. Something like this:
CREATE MATERIALIZED VIEW mytable AS SELECT * FROM the_table WHERE col1 like 'value1%' and col2 like 'value2%';
You can also directly materialize the count queries. If you have 10 tabs, then you would have to materialize 10 views:
CREATE MATERIALIZED VIEW count_tab1 AS SELECT count(*) FROM the_table WHERE col1 like 'value1%' and col2 like 'value2%';
CREATE MATERIALIZED VIEW count_tab2 AS SELECT count(*) FROM the_table WHERE col2 like 'value2%' and col3 like 'value3%';
...
After each insert, you should refresh views (asynchronously):
REFRESH MATERIALIZED VIEW count_tab1
REFRESH MATERIALIZED VIEW count_tab2
...
As noted in the critique, you have not posted what you have tried. So I would assume that the limit of the question is exactly what you posted. So kindly report the results of exactly that much.
What is the current time you are spending for the subsets of the problem, i.e. the count of [col1 like "value1%" and col2 like "value2%"] and, 2nd, [col3 like "value3%"]?
The trick would be to scan the data source once and make the data source smaller by creating an index. So first create an index on (col1, col2, col3, id). The purpose of col3 and id is so that the database scans just the index. And I would get both counts in the same SQL:
select sum(case when col1 like 'value1%' and col2 like 'value2%' then 1 else 0 end) as cnt_condition_1,
       sum(case when col3 like 'value3%' then 1 else 0 end) as cnt_condition_2
from the_table
where (col1 like 'value1%' and col2 like 'value2%')
   or (col3 like 'value3%');
So the 50M-row table is probably very wide right now. This should trim it down - on a reasonable server I would expect the above to return in a few seconds. If it does not, and each condition returns < 10% of the table, a second option would be to create a separate index for each scenario and do a count for each, so that an index is used in each case.
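In index terms, that boils down to something like this (index and table names are assumptions):
-- one covering index so both counts are answered from the index alone
CREATE INDEX idx_count_cover ON the_table (col1, col2, col3, id);
-- or, if each condition is selective enough, one index per scenario
CREATE INDEX idx_tab1 ON the_table (col1, col2);
CREATE INDEX idx_tab2 ON the_table (col3);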
If there are no bulk inserts / bulk updates happening in your system, can you try vertical partitioning of your table? By vertical partitioning, you can separate the data block of col1, col2 from the other data of the table, so your search space will be reduced.
Also, indexing every column doesn't seem to be the best approach. Index wherever it is absolutely needed. In this case, I would say INDEX(col1, col2) and INDEX(col3).
Even after indexing, you need to look into the fragmentation of those indexes and address it accordingly to get the best results, because sometimes the 50-million-entry index of one column can sit as one huge chunk, which will restrict the multiprocessing capabilities of your SQL server.
Each database has its own peculiarities in how to "enhance" its RDBMS. I can't speak for MySQL or SQL Server, but for PostgreSQL you should consider making the indexes that you search GIN (Generalized Inverted Index)-based indexes.
CREATE EXTENSION IF NOT EXISTS pg_trgm;  -- trigram operator classes, needed for GIN on plain text columns
CREATE INDEX idx_col1_gin ON the_table USING gin (col1 gin_trgm_ops);
CREATE INDEX idx_col2_gin ON the_table USING gin (col2 gin_trgm_ops);
CREATE INDEX idx_col3_gin ON the_table USING gin (col3 gin_trgm_ops);
More information can be found here.
-HTH
this will work:
select count(*) from (
    select * from tablename
    where col1 like 'value1%' and col2 like 'value2%' and col3 like 'value3%'
) t
where REGEXP_LIKE(col1, '^value1(.*)$') and REGEXP_LIKE(col2, '^value2(.*)$') and
      REGEXP_LIKE(col3, '^value3(.*)$');
Try not to apply an index on all the columns, as it slows down the processing of an SQL query; have it on the required columns only.

What is the "Default order by" for a mysql Innodb query that omits the Order by clause?

So I understand, and found posts that indicate, that it is not recommended to omit the ORDER BY clause in an SQL query when you are retrieving data from the DBMS.
Resources & posts consulted (will be updated):
SQL Server UNION - What is the default ORDER BY Behaviour
When no 'Order by' is specified, what order does a query choose for your record set?
https://dba.stackexchange.com/questions/6051/what-is-the-default-order-of-records-for-a-select-statement-in-mysql
Questions :
See logic of the question below if you want to know more.
My question is: under MySQL with the InnoDB engine, does anyone know how the DBMS effectively gives us the results?
I read that it is implementation-dependent, OK, but is there a way to know it for my current implementation?
Where is this defined exactly?
Is it from MySQL, InnoDB, OS-dependent?
Isn't there some kind of list out there?
Most importantly, if I omit the ORDER BY clause and get my result, I can't be sure that this code will still work with newer database versions and that the DBMS will always give me the same result, can I?
Use case & Logic :
I'm currently writing a CRUD API, and I have a table in my DB that doesn't contain an "id" field (there is a PK though), so when I'm showing the results of that table without any search criteria, I don't really have a clue what I should use to order the results. I mean, I could use the PK or any field that is never null, but it wouldn't make the order relevant. So I was wondering, as my CRUD is supposed to work for any table and I don't want to solve this problem by adding an exception for this specific table, whether I could also simply omit the ORDER BY clause.
Final Note :
As I'm reading other posts, examples and code samples, I'm feeling like I may be going too far. I understand that it is common knowledge that it's just bad practice to omit the ORDER BY clause in a query and that there is no reliable default order, not to say that there is no order at all unless you specify it.
I'd just love to know where this is defined, and would love to learn how this works internally or at least where it's defined (DBMS / storage engine / OS-dependent / other / multiple criteria). I think it would also benefit other people to know it, and to understand the inner mechanisms in place here.
Thanks for taking the time to read anyway! Have a nice day.
Without a clear ORDER BY, current versions of InnoDB return rows in the order of the index it reads from. Which index varies, but it always reads from some index. Even reading from the "table" is really an index—it's the primary key index.
As in the comments above, there's no guarantee this will remain the same in the next version of InnoDB. You should treat it as a coincidental behavior, it is not documented and the makers of MySQL don't promise not to change it.
Even if their implementation doesn't change, reading in index order can cause some strange effects that you might not expect, and which won't give you query result sets that makes sense to you.
For example, the default index is the clustered index, PRIMARY. It means index order is the same as the order of values in the primary key (not the order in which you insert them).
mysql> create table mytable ( id int primary key, name varchar(20));
mysql> insert into mytable values (3, 'Hermione'), (2, 'Ron'), (1, 'Harry');
mysql> select * from mytable;
+----+----------+
| id | name |
+----+----------+
| 1 | Harry |
| 2 | Ron |
| 3 | Hermione |
+----+----------+
But if your query uses another index to read the table, like if you only access column(s) of a secondary index, you'll get rows in that order:
mysql> alter table mytable add key (name);
mysql> select name from mytable;
+----------+
| name |
+----------+
| Harry |
| Hermione |
| Ron |
+----------+
This shows it's reading the table by using an index-scan of that secondary index on name:
mysql> explain select name from mytable;
+----+-------------+---------+-------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+-------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | mytable | index | NULL | name | 83 | NULL | 3 | Using index |
+----+-------------+---------+-------+---------------+------+---------+------+------+-------------+
In a more complex query, it can become very tricky to predict which index InnoDB will use for a given query. The choice can even change from day to day, as your data changes.
All this goes to show: You should just use ORDER BY if you care about the order of your query result set!
Bill's answer is good. But not complete.
If the query is a UNION, it will (I think) deliver first the results of the first SELECT (according to the rules), then the results of the second. Also, if the table is PARTITIONed, it is likely to do a similar thing.
GROUP BY may sort by the grouping expressions, thereby leading to a predictable order, or it may use a hashing technique, which scrambles the rows. I don't know how to predict which.
A derived table used to be an ordered list that propagates into the parent query's ordering. But recently, the ORDER BY is being thrown away in that subquery! (Unless there is a LIMIT.)
Bottom Line: If you care about the order, add an ORDER BY, even if it seems unnecessary based on this Q & A.
MyISAM, in contrast, starts with this premise: The default order is the order in the .MYD file. But DELETEs leave gaps, UPDATEs mess with the gaps, and INSERTs prefer to fill in gaps over appending to the file. So, the row order is rather unpredictable. ALTER TABLE x ORDER BY y temporarily sets the .MYD order; this 'feature' does not work for InnoDB.

Why is the performance of MySQL queries so bad when using a CHAR/VARCHAR index?

First, I will describe a simplified version of the problem domain.
There is a table strings:
CREATE TABLE strings (
value CHAR(3) COLLATE utf8_unicode_ci NOT NULL,
INDEX(value)
) ENGINE=InnoDB;
As you can see, it has a non-unique index on a CHAR(3) column.
The table is populated using the following script:
CREATE TABLE a_variants (
letter CHAR(1) COLLATE utf8_unicode_ci NOT NULL
) ENGINE=MEMORY;
INSERT INTO a_variants VALUES -- 60 variants of letter 'A'
('A'),('a'),('À'),('Á'),('Â'),('Ã'),('Ä'),('Å'),('à'),('á'),('â'),('ã'),
('ä'),('å'),('Ā'),('ā'),('Ă'),('ă'),('Ą'),('ą'),('Ǎ'),('ǎ'),('Ǟ'),('ǟ'),
('Ǡ'),('ǡ'),('Ǻ'),('ǻ'),('Ȁ'),('ȁ'),('Ȃ'),('ȃ'),('Ȧ'),('ȧ'),('Ḁ'),('ḁ'),
('Ạ'),('ạ'),('Ả'),('ả'),('Ấ'),('ấ'),('Ầ'),('ầ'),('Ẩ'),('ẩ'),('Ẫ'),('ẫ'),
('Ậ'),('ậ'),('Ắ'),('ắ'),('Ằ'),('ằ'),('Ẳ'),('ẳ'),('Ẵ'),('ẵ'),('Ặ'),('ặ');
INSERT INTO strings
SELECT CONCAT(a.letter, b.letter, c.letter) -- 60^3 variants of string 'AAA'
FROM a_variants a, a_variants b, a_variants c
UNION ALL SELECT 'BBB'; -- one variant of string 'BBB'
So, it contains 216000 indistinguishable (in terms of the utf8_unicode_ci collation) variants of string "AAA" and one variant of string "BBB":
SELECT value, COUNT(*) FROM strings GROUP BY value;
+-------+----------+
| value | COUNT(*) |
+-------+----------+
| AAA | 216000 |
| BBB | 1 |
+-------+----------+
As value is indexed, I expect the following two queries to have similar performance:
SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'AAA';
SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'BBB';
But in practice the first one is more than 300x slower than the second! See:
+----------+------------+---------------------------------------------------------------+
| Query_ID | Duration | Query |
+----------+------------+---------------------------------------------------------------+
| 1 | 0.11749275 | SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'AAA' |
| 2 | 0.00033325 | SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'BBB' |
| 3 | 0.11718050 | SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'AAA' |
+----------+------------+---------------------------------------------------------------+
-- I ran the 'AAA' query twice here just to be sure.
If I change the size of the indexed column or change its type to VARCHAR, the performance problem still manifests itself. Meanwhile, in analogous situations where the non-unique index is not on a CHAR/VARCHAR column (e.g. INT), queries are as fast as expected.
So, the question is: why is the performance of MySQL queries so bad when using a CHAR/VARCHAR index?
I have a strong feeling that MySQL performs a full linear scan of all the values matched by the index key. But why does it do so when it could just return the count of the matched rows? Am I missing something, and is that really needed? Or is that a sad shortcoming of the MySQL optimizer?
Clearly, the issue is that the query is doing an index scan. The alternative approach would be to do two index lookups, for the first and last values that are the same, and then use meta information in the index for the calculation. Based on your observations, MySQL does both.
The rest of this answer is speculation.
The reason the performance is "only" 300 times slower, rather than 200,000 times slower, is because of overhead in reading the index. Actually scanning the entries is quite fast compared to other operations that are needed.
There is a fundamental difference between numbers and strings when it comes to comparisons. The engine can just look at the bit representations of two numbers and recognize whether they are the same or different. Unfortunately, for strings, you need to take encoding/collation into account. I think that is why it needs to look at the values.
It is possible that if you had 216,000 copies of exactly the same string, then MySQL would be able to do the count using metadata in the index. In other words, the indexer is smart enough to use metadata for exact equality comparisons. But, it is not smart enough to take encoding into account.
One of things you may want to check on is the logical I/O of each query. I'm sure you'll see quite a difference. To count the number of 'BBB's in the table, probably only 3 or 4 LIOs are needed (depending on things like bucket size). To count the number of 'AAA's, essentially the entire table must be scanned, index or not. With 216k rows, that can add up to significantly more LIOs -- not to mention physical I/Os. Logical I/Os are faster than physical I/Os, but any I/O is a performance killer.
As for text vs numbers, it is always easier and faster for software (any software, not just database engines) to compare numbers than text.

MySQL performance boost after create & drop index

I have a large MySQL MyISAM table of around 4 million rows, running on a Core 2 Duo, 8 GB RAM laptop.
This table has 30 columns including varchar, decimal and int types.
I have an index on a varchar(16). Let's call this column: "indexed_varchar_column".
My query is
SELECT 9 columns FROM the_table WHERE indexed_varchar_column = 'something';
It always returns around 5000 rows for every 'something' I query against.
An EXPLAIN to the query returns this:
+----+-------------+-------------+------+----------------------------------------------------+--------------------------------------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+----------------------------------------------------+--------------------------------------------+---------+-------+------+-------------+
| 1 | SIMPLE | the_table | ref | many indexes including indexed_varchar_column | another_index NOT: indexed_varchar_column! | 19 | const | 5247 | Using where |
+----+-------------+-------------+------+----------------------------------------------------+--------------------------------------------+---------+-------+------+-------------+
First thing is I'm not sure why another_index is chosen. In fact it chooses an index which is a composite index of indexed_varchar_column and another 2 columns (which form part of the selected ones). Perhaps this makes sense, since it may make things a bit faster by not having to read 2 of the columns in the query. The real QUESTION is the following one:
The query takes 5 seconds for every 'something' I match. On the 2nd time I query against 'something' it takes 0.15 secs (I guess because the query is being cached). When I run another query against 'something_new' it takes again 5 seconds. So, it is consistent.
THE PROBLEM IS: I discovered that creating an index (another composite index including my indexed_varchar_column) and dropping it again results in all further queries against a new 'something_other' taking only 0.15 secs. Please note that 1) I create an index, 2) I drop it again. So everything is in the same state.
I guess all the operations needed for building and dropping indices make the SQL engine cache something that is then reused. When I run EXPLAIN on a query after all this, I get exactly the same as before.
How can I proceed to understand what is cached in the create-drop index procedure so that I can cache it without manipulating indices?
UPDATE:
Following a comment from Marc B suggesting that when MySQL creates an index it internally does a SELECT... I tried the following:
SELECT * FROM my_table;
It took 30 secs and returned 4 million rows. The good thing is that all further queries are very fast again (until I reboot the system). Please note that after rebooting the queries are slow again. I guess this is because MySQL is using some sort of OS caching.
Any idea? How can I explicitly cache the table?
UPDATE 2:
Perhaps I should have mentioned that this table may be severely fragmented. It has 4 million rows, but I remove lots of old rows regularly. I also add new ones. Since I had large gaps in IDs (for the deleted rows), every day I drop the primary index (ID) and create it again with consecutive numbers. The table may then be very fragmented and therefore IO must be an issue... Not sure what to do.
Thanks everybody for your help.
Finally I discovered (thanks to the hint of Marc B) that my table was severely fragmented after many INSERTs and DELETEs. I updated the question with this info some hours ago. There are two things that help:
1)
ALTER TABLE my_table ORDER BY indexed_varchar_column;
2) Running:
myisamchk --sort-records=4 my_table.MYI (where 4 corresponds to my index)
I believe both commands are equivalent. Queries are fast even after a system reboot.
I've put this ALTER TABLE ORDER BY command on a cron that is run every day. It takes 2 minutes but it's worth it.
How many indexes do you have that contain the indexed_varchar_column? Do you have a single index for just the indexed_varchar_column?
Have you tried:
SELECT 9 columns FROM the_table USE INDEX (name_of_index) WHERE indexed_varchar_column = 'something';?
What is the order of the columns in your composite index?
You must use (at least) a leftmost prefix of the index's columns in your query.
If you have an index on foo, bar, and baz, it will not be usable as an index against bar or baz by themselves. Only (foo), (foo,bar), and (foo,bar,baz).
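As a quick illustration of that leftmost-prefix rule (hypothetical table and column names):
CREATE INDEX idx_foo_bar_baz ON my_table (foo, bar, baz);
SELECT * FROM my_table WHERE foo = 1;              -- can use the index (prefix: foo)
SELECT * FROM my_table WHERE foo = 1 AND bar = 2;  -- can use the index (prefix: foo, bar)
SELECT * FROM my_table WHERE bar = 2;              -- cannot use it; bar is not a leftmost prefix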
EXPLAIN is your friend here. It will tell you which index, if any, is being used by a query.
EDIT Here's a Postgres EXPLAIN of a simple left join query for comparison.
Nested Loop Left Join (cost=0.00..16.97 rows=13 width=103)
Join Filter: (pagesets.id = pages.pageset_id)
-> Index Scan using ix_pages_pageset_id on pages (cost=0.00..8.51 rows=13 width=80)
Index Cond: (pageset_id = 515)
-> Materialize (cost=0.00..8.27 rows=1 width=23)
-> Index Scan using pagesets_pkey on pagesets (cost=0.00..8.27 rows=1 width=23)
Index Cond: (id = 515)