SQL query: Speed up for huge tables

SQL query: Speed up for huge tables - mysql

We have a table with about 25,000,000 rows called 'events' having the following schema:
TABLE events
- campaign_id : int(10)
- city : varchar(60)
- country_code : varchar(2)
The following query takes VERY long (> 2000 seconds):
SELECT COUNT(*) AS counted_events, country_code
FROM events
WHERE campaign_id` in (597)
GROUPY BY city, country_code
ORDER BY counted_events
We found out that it's because of the GROUP BY part.
There is already an index idx_campaign_id_city_country_code on (campaign_id, city, country_code) which is used.
Maybe someone can suggest a good solution to speed it up?
Update:
'Explain' shows that out of many possible index MySql uses this one: 'idx_campaign_id_city_country_code', for rows it shows: '471304' and for 'Extra' it shows: 'Using where; Using temporary; Using filesort' –
Here is the whole result of EXPLAIN:
id: '1'
select_type: 'SIMPLE'
table: 'events'
type: 'ref'
possible_keys: 'index_campaign,idx_campaignid_paid,idx_city_country_code,idx_city_country_code_campaign_id,idx_cid,idx_campaign_id_city_country_code'
key: 'idx_campaign_id_city_country_code'
key_len: '4'
ref: 'const'
rows: '471304'
Extra: 'Using where; Using temporary; Using filesort'
UPDATE:
Ok, I think it has been solved:
Looking at the pasted query here again I realized that I forget to mention here that there was one more column in the SELECT called 'country_name'. So the query was very slow then (including country_name), but I'll just leave it out and now the performance of the query is absolutely ok.
Sorry for that mistake!
So thank you for all your helpful comments, I'll upvote all the good answers! There were some really helpful additions, that I probably also we apply (like changing types etc).

without seeing what EXPLAIN says it's a long distance shot, anyway:
make an index on (city,country_code)
see if there's a way to use partitioning, your table is getting rather huge
if country code is always 2 chars change it to char
change numeric indexes to unsigned int
post entire EXPLAIN output

don't use IN() - better use:
WHERE campaign_id = 597
OR campaign_id = 231
OR ....
afaik IN() is very slow.
update: like nik0lias commented - IN() is faster than concatenating OR conditions.

Some ideas:
Given the nature and size of the table it would be a great candidate for partitioned tables by country. This way the events of every country would be stored in a different physical table even if it behaves as a virtual big table
Is country code an string? May be you have a country_id that could be easier to sort. (It may force you to create or change indexes)
Are you really using the city in the group by?

partitioning - especially by country will not help
column IN (const-list) is not slow, it is in fact a case with special optimization
The problem is, that MySQL doesn't use the index for sorting. I cannot say why, because it should. Could be a bug.
The best strategy to execute this query is to scan that sub-tree of the index where event_id=597. Since the index is then sorted by city_id, country_code no extra sorting is needed and rows can be counted while scanning.
So the indexes are already optimal for this query. MySQL is just not using them correctly.
I'm getting more information off line. It seems this is not a database problem at all, but
the schema is not normalized. The table contains not only country_code, but also country_name (this should be in an extra table).
the real query contains country_name in the select list. But since that column is not indexed, MySQL cannot use an index scan.
As soon as country_name is dropped from the select list, the query reverts to an index-only scan ("using index" in EXPLAIN output) and is blazingly fast.

Related

Simple MySQL query with ORDER BY too slow

Edit:The time spent for querying a normal word is actually 1.78 seconds. The 4.5 seconds mentioned in the original post below was when querying special words like '.vnet'. (I know REGEXP '\\b.vnet\\b' won't find the whole word match for '.vnet'. I might use a more complex regex to fix this later, or drop the support for '.vnet' if it's too time-consuming.) Also I added solution 5 below.
I have the following MySQL query to achieve whole word matching.
SELECT source, target
FROM tm
WHERE source REGEXP '\\bword\\b'
AND customer = 'COMPANY X'
AND language = 'YYY'
ORDER BY CHAR_LENGTH(source)
LIMIT 5;
There are 2 customers and 2 languages currently.
My goal is to find the top 5 closest matches of a phrase among hundreds of thousands of English sentences. The reason the fetched records are ordered by CHAR_LENGTH is because the shorter the length, the higher the match ratio, since REGEXP '\\bword\\b' makes sure source has word already.
The tm table:
CREATE TABLE tm(
id INT AUTO_INCREMENT PRIMARY KEY,
source TEXT(7000) NOT NULL,
target TEXT(6000) NOT NULL,
language CHAR(3),
customer VARCHAR(10),
INDEX src_cus_lang (source(755), customer, language)
The query above took about 4.5 seconds to finish, which is very slow for me and my PC that has an Intel Core i5-10400F, 16GB RAM and an SSD.
The EXPLAIN command showed the below result:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: tm
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 1117154
filtered: 1.00
Extra: Using where; Using filesort
1 row in set, 1 warning (0.00 sec)
I tried to delelte the src_cus_lang index and created a new one (customer, language, source(755)), but no improvement at all.
I can think of a few solutions:
Recreate the tm table, ordering by CHAR_LENGTH(source) in the process. This is not ideal for me as I'd like to keep the original order of the table.
Create a new column named src_len, i.e. the length of the source. However, ORDER BY src_len is still very slow.
Split the tm table into 4 separate ones by customer and language. Not ideal for me.
Index the source column. Still very slow.
Use INDEX(customer, language). Took 1.4 seconds longer for both normal words and special words like '.vnet'.
Is there a way to cut the execution time down to less than 0.5 seconds?

This is essentially useless:
INDEX src_cus_lang (source(755), customer, language)
The prefixing keeps the rest of the columns from being very useful. REGEXP requires checking all 1.1M rows.
This would be better:
INDEX(customer, language)
It will at least filter on those two columns, then apply the REGEXP fewer times.
Since it usually wants to finish with the WHERE before considering the ORDER BY, your attempts at src_len, etc, did not help.
If there are only 4 different combinations of customer and language, not much can be done.
However, you should consider a FULLTEXT(source) index. With such,
MATCH(source) AGAINST('+word' IN BOOLEAN MODE)
AND ...
will work much faster.
Also try IN NATURAL LANGUAGE MODE.

MySQL query index

I am using MySQL 5.6 and try to optimize next query:
SELECT t1.field1,
...
t1.field30,
t2.field1
FROM Table1 t1
JOIN Table2 t2 ON t1.fk_int = t2.pk_int
WHERE t1.int_field = ?
AND t1.enum_filed != 'value'
ORDER BY t1.created_datetime desc;
A response can contain millions of records and every row consists of 31 columns.
Now EXPLAIN says in Extra that planner uses 'Using where'.
I tried to add next index:
create index test_idx ON Table1 (int_field, enum_filed, created_datetime, fk_int);
After that EXPLAIN says in Extra that planner uses "Using index condition; Using filesort"
"rows" value from EXPLAIN with index is less than without it. But in practice time of execution is longer.
So, the questions are next:
What is the best index for this query?
Why EXPLAIN says that 'key_len' of query with index is 5. Shouldn't it be 4+1+8+4=17?
Should the fields from ORDER BY be in index?
Should the fields from JOIN be in index?

try refactor your index this way
avoid (o move to the right after fk_int) the created_datetime column.. and move fk_int before the enum_filed column .. the in this wahy the 3 more colums used for filter shold be use better )
create index test_idx ON Table1 (int_field, fk_int, enum_filed);
be sure you have also an specific index on table2 column pk_int. if you have not add
create index test_idx ON Table2 (int_field, fk_int, enum_filed);

What is the best index for this query?
Maybe (int_field, created_datetime) (See next Q&A for reason.)
Why EXPLAIN says that 'key_len' of query with index is 5. Shouldn't it be 4+1+8+4=17?
enum_filed != defeats the optimizer. If there is only one other value for that enum (and it is NOT NULL), then use = and the other value. And try INDEX(int_field, enum_field, created_datetime) The Optimizer is much happier with = than with any inequality.
"5" could be indicating 2 columns, or it could be indicating one INT that is Nullable. If int_field can be NULL, consider changing it to NOT NULL; then the "5" would drop to "4".
Should the fields from ORDER BY be in index?
Only if the index can completely handle the WHERE. This usually occurs only if all the WHERE tests are =. (Hence, my previous answer.)
Another case for including those columns is "covering"; see next Q&A.
Should the fields from JOIN be in index?
It depends. One thing that gives some performance benefit is to include all columns mentioned anywhere in the SELECT. This is called a "covering" index and is indicated in EXPLAIN by Using index (not Using index condition). There are too many columns in t1 to add a "covering" index. I think the practical limit is about 5 columns.

My guess for your question № 1:
create index my_idx on Table1(int_field, created_datetime desc, fk_int)
or one of these (but neither will probably be worthwhile):
create index my_idx on Table1(int_field, created_datetime desc, enum_filed, fk_int)
create index my_idx on Table1(int_field, created_datetime desc, fk_int, enum_filed)
I'm supposing 3 things:
Table2.pk_int is already a primary key, judging by the name
The where condition on Table1.int_field is only satisfied by a small subset of Table1
The inequality on Table1.enum_filed (I would fix the typo, if I were you) only excludes a small subset of Table1
Question № 2: the key_len refers to the keys used. Don't forget that there is one extra byte for nullable keys. In your case, if int_field is nullable, it means that this is the only key used, otherwise both int_field and enum_filed are used.
As for questions 3 and 4: If, as I suppose, it's more efficient to start the query plan from the where condition on Table1.int_field, the composite index, in this case also with the correct sort order (desc), enables a scan of the index to get the output rows in the correct order, without an extra sort step. Furthermore, adding also fk_int to the index makes the retrieval of any record of Table1 unnecessary unless a corresponding record is present in Table2. For a similar reason you could also add enum_filed to the index, but, if this doesn't considerably reduce the output record count, the increase in index size will make things worse instead of better. In the end, you will have to try it out (with realistic data!).
Note that if you put another column between int_field and created_datetime in the index, the index won't provide the created_datetime (for a given int_field) in the desired output order.

The issue was fixed by adding more filters (to where clause) to the query.
Regarding to indexes 2 proposed indexes were helpful:
From #WalterTross with next index for initial query:
(int_field, created_datetime desc, enum_filed, fk_int)
With my short comment: desc indexes is not supported at MySQL 5.6 - this key word just reserved.
From #RickJames with next index for modified query:
(int_field, created_datetime)
Thanks everyone who tried to help. I really appreciate it.

MySQL date range idex being ignored when a second table added to query

Long time lurker, first time questioner. ;-)
Using PHP 5.6 and MySQL Ver 14.14 Distrib 5.6.41, for Win64 (x86_64) Yeah, I know a little behind the times and we're working on updating. But that's where we are now. ;-)
Updates for questions asked:
The index is on the CreateDate. I thought there might be an issue with that column being a DateTime so I created another column that was just a date, set an index on that and retried, but it didn't have any effect.
ulc has 8965 rows total. With index searches 3787
et has 9530 rows. In the query that doesn't use the index, it searches just one row as it's searching on the primary key from the first query.
The formatting of the comparison date doesn't seem to matter. I've tried all sorts of formats, including just straight "2018-01-01 {00:00:00}'. No change.
I've got what I consider a weird one, but I suspect for someone here it's going to be a "duh!" one. I've got a query that includes a date range for the primary table and then goes to get other bits of data from other tables based on a set of unique ids from the first table. Don't worry, I'll have examples below. When I do the search on just the primary table, the range index works as expected and only searches the relevant rows. However, when I add in the next table with the ON clause, it ignores the index and searches all of the rows of the primary table. If I leave off the on clause, it goes back to using the index correctly. I tried using the FORCE INDEX (USE is ignored) and while that makes it use the index, it slows the query way down. Anyway, here are the queries:
Works:
select CreateDate
from ulc
Inner Join et
WHERE ulc.CreateDate >= STR_TO_DATE("01/01/2018", "%m/%d/%Y")
AND ulc.CreateDate <= STR_TO_DATE("08/02/2018", "%m/%d/%Y")
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE ulc range index_CreateDate index_CreateDate 5 NULL 3787 Using where; Using index
1 SIMPLE et index NULL index_BankProcessorProfile 5 NULL 9530 Using index; Using join buffer (Block Nested Loop)
Doesn't work:
select CreateDate
from ulc
Inner Join et on et.TranID = ulc.TranID
WHERE ulc.CreateDate >= STR_TO_DATE("01/01/2018", "%m/%d/%Y")
AND ulc.CreateDate <= STR_TO_DATE("08/02/2018", "%m/%d/%Y")
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE ulc ALL TranID,index_CreateDate NULL NULL NULL 8965 Using where
1 SIMPLE et eq_ref PRIMARY PRIMARY 8 showpro.ulc.TranID 1 Using index
For the second one I just added the on et.TranID = ulc.TranID
Additionally, if I change it from a range to a specific date, the index works as well.

Just guessing here without more data,but adding a new table to the JOIN changes the data distribution.
So if in the first case the WHERE condition return probably a small(relatively) percentage of the data,in you second case the optimizer decides you`ll get faster results without using the index since the same conditions might not be quite so selective for the new batch of data.
Add the table definitions and a COUNT for both queries,both total and based on your queries,for a better answer.

if you are using DateTime in your query its suggested to use "YYYY-MM-DD HH:MM:SS" in where class
if you are using Date in your query its suggested to use the format "YYYY-MM-DD" in your where class.you have used STR_TO_DATE("01/01/2018", "%m/%d/%Y") which will typecast to '2018-01-01' seems to be fine
you try to find the complexity of the query using EXPLAIN
explain select CreateDate
from ulc
Inner Join et on et.TranID = ulc.TranID
WHERE ulc.CreateDate >= STR_TO_DATE("01/01/2018", "%m/%d/%Y")
AND ulc.CreateDate <= STR_TO_DATE("08/02/2018", "%m/%d/%Y")
you can check if et.TranID and ulc.TranID have proper index or not

(I'm going to have to guess at some things, since you have not provided SHOW CREATE TABLE. As a 'long time lurker', you should have realized this.)
First guess is that TranID is not the PRIMARY KEY of ulc?
The solution is to add a "composite" INDEX(CreateDate, TranID) to ulc. (Actually, you should replace the existing INDEX(CreateDate) (Second guess is that you have that index now.)
Now I will try to explain why the first query was happy with INDEX(CreateDate) but the second was not.
In the first query, INDEX(CreateDate) is a "covering" index. That is, this index contains all the columns of ulc that are needed by the SELECT. So, it is almost guaranteed that using the index would be better than scanning the table. It will be a "range index scan" of that index.
The second query needs both CreateDate and TranID, so your index won't be "covering". There are two ways to perform the first part of the query. But first, note that (in InnoDB) a secondary index has all the columns of the PRIMARY KEY (third guess: it is (id)).
Range scan of the index. But, in order to get TranID, it first gets id, then does a lookup in the PRIMARY KEY/data to get TranID`. This process is more costly than simply staying in the index, so the Optimizer does not want to do it unless the estimated number of rows is 'small'.
Since 3787/8965 is not "small", the Optimizer decides that it is probably faster to scan ALL 8965 rows, filtering out the ones not needed.
My proposed index is 'covering', thereby avoiding the bounding back and forth between index and data. So, a range index scan is efficient.
Your observation that switching to a single date made use of the index -- Well, 1 row out of 8965 is 'small', so the index (and the bouncing) is deemed to be the faster way.
As for formatting of the date -- True, it does not matter. This is because the parser notices that STR_TO_DATE("01/01/2018", "%m/%d/%Y") is a constant that can be evaluated once, and does so.
My cookbook should take you directly to the composite index without having to scratch your head over this Question.
Your first query is a "cross join" since it has no ON clause to relate the tables together, and it will return about 35 million rows (9530*3787). The second query will have about 3787 rows, maybe fewer (if some of the joins fail to find a match).
"how little changes between the two queries" -- Never think that! The Optimizer will latch on to seemingly insignificant differences. SELECT CreateDate versus SELECT * -- a huge difference. Most of what I said about the 'first query' would be thrown out. Even changing to SELECT ChangeDate, x would be enough to make a big wrinkle. If the datatypes of TranID in the two tables differed enough, the indexes become useless. Etc, etc.

MySQL not using Index (despite FORCE INDEX)

I need to do a live search using PHP and jQuery to select cities and countries from the two tables cities (almost 3M rows) and countries (few hundred rows).
For a short moment I was thinking of using a MyISAM table for cities as InnoDB does not support FULLTEXT search, however decided it is not a way to go (frequent table crashes, all other tables are InnoDB etc and with MySQL 5.6+ InnoDB also starts to support FULLTEXT index).
So, right now I still use MySQL 5.1, and as most cities consist of one-word only or max 2-3 Words, but e.g. "New York" - most people will not search for "York" if they mean "New York". So, I just put an index on the city_real column (which is a varchar).
The following query (I tried it in different versions, without any JOIN and without ORDER BY, with USE INDEX and even with FORCE INDEX, I have tried LIKE instead equal (=) but another post said = was faster and if the wildcard is only at the end, it is OK to use it), in EXPLAIN it always says "using where, using filesort". The average time for the query is about 4sec, which you have to admit is a little bit to slow for a live search (user typing in text-box and seeing suggestions of cities and countries)...
Live search (jQuery ajax) searches if the user typed at least 3 characters...
SELECT ci.id, ci.city_real, co.country_name FROM cities ci LEFT JOIN countries co ON(ci.country_id=co.country_id) WHERE city_real='cit%' ORDER BY population DESC LIMIT 5
There is a PRIMARY on ci.id and an INDEX on ci.city_real. Any ideas why MySQL does not use the index? Or how I could speed up the query? Or where else I should/should not set an INDEX?
Thank you very much in advance for your help!
Here's the explain output
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE ci range city_real city_real 768 NULL 1250 Using where; Using filesort
1 SIMPLE co eq_ref PRIMARY PRIMARY 6 fibsi_1.ci.country_id 1

You should use WHERE city_real LIKE 'cit%', not WHERE city_real='cit%'.
I have tried LIKE instead equal (=) but another post said = was faster and if the wildcard is only at the end, it is OK to use it
This is wrong. = doesn't support wildcards so it will give you the wrong results.
Or how I could speed up the query?
Make sure you have an index on country_id in both tables. Post the output of EXPLAIN SELECT ... if you need further help.

The query do use index as seen in the key field of explain output. The reason it uses filesort is the order by, and the reason it uses where is probably one of the fields (city_real, population) allows null values.

Removing 'using filesort' from query

I have the following query:
SELECT *
FROM shop_user_member_spots
WHERE delete_flag = 0
ORDER BY date_spotted desc
LIMIT 10
When run, it takes a few minutes. The table is around 2.5 million rows.
Here is the table (not designed by me but I am able to make some changes):
And finally, here are the indexes (again, not made by me!):
I've been attempting to get this query running fast for hours now, to no avail.
Here is the output of EXPLAIN:
Any help / articles are much appreciated.

Based on your query, it seems the index you would want would be on (delete_flag, date_spotted). You have an index that has the two columns, but the id column is in between them, which would make the index unhelpful in sorting based on date_spotted. Now whether mysql will use the index based on Zohaib's answer I can't say (sorry, I work most often with SQL Server).

The problem that I see in the explain plan is that the index on spotted date is not being used, insted filesort mechanism is being used to sort (as far as index on delete flag is concerned, we actually gain performance benefit of index if the column on which index is being created contains unique values)
the mysql documentation says
Index will not used for order by clause if
The key used to fetch the rows is not the same as the one used in the ORDER BY:
SELECT * FROM t1 WHERE key2=constant ORDER BY key1;
http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html
I guess same is the case here. Although you can try using Force Index

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008