I have a table that I do fulltext searching on. It's already getting big with a relatively small number of users - 20 million rows.
Searches will only ever need to be on rows that belong to the PKs relevant to the search, i.e. rows that belong to that user, and at most that's about 200,000 per user. I figured that if the fulltext search was only done on a subquery that first selects that user's rows, it should be super fast, e.g.
SELECT * FROM
(SELECT * FROM table1 WHERE userID = 2 ) AS r
WHERE MATCH (r.fullTextCol1) AGAINST ('+monkey* ' IN BOOLEAN MODE)
ORDER BY r.fullTextCol1, r.fullTextCol2 ASC LIMIT 0,50
However, this query takes 4 seconds.
EXPLAIN says...
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 185927 Using where; Using filesort
2 DERIVED table1 ref PRIMARY,unique unique 4 193082
My indexes are:
PRIMARY (userID, userSubList, userItemID)
FULLTEXT fullTextCol1
FULLTEXT fullTextCol2
The subquery seems to not use the userID index at all.
Is my thinking right in approaching it like this - sub-selecting the relevant user's rows to search on?
Thanks for your time and help.
Have you tried it like this?
SELECT *
FROM table1
WHERE userID = 2
AND MATCH (fullTextCol1) AGAINST ('+monkey* ' IN BOOLEAN MODE)
ORDER BY fullTextCol1, fullTextCol2 ASC LIMIT 0,50;
Or run it without the ORDER BY to check whether the JOIN is slow or the ORDERing is slow (or both).
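If the sort is the suspect, a quick test is a sketch like this (your query, minus the ORDER BY):

SELECT *
FROM table1
WHERE userID = 2
  AND MATCH (fullTextCol1) AGAINST ('+monkey*' IN BOOLEAN MODE)
LIMIT 0, 50;  -- no ORDER BY: if this is fast, the sort is the bottleneck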
EDIT
In your case, a composite index on (userID, fullTextCol1) would be needed, but MySQL doesn't support mixing a regular column and a FULLTEXT column in one index. Someone has already answered about this; see Compound FULLTEXT index in MySQL.
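One common workaround is to embed the userID as a synthetic keyword inside a FULLTEXT-indexed text column, so the fulltext search itself filters by user. A sketch (the searchBlob column and the uid prefix are made-up names for illustration):

ALTER TABLE table1 ADD COLUMN searchBlob TEXT;
UPDATE table1 SET searchBlob = CONCAT('uid', userID, ' ', fullTextCol1);
ALTER TABLE table1 ADD FULLTEXT (searchBlob);

SELECT *
FROM table1
WHERE MATCH (searchBlob) AGAINST ('+uid2 +monkey*' IN BOOLEAN MODE)
ORDER BY fullTextCol1, fullTextCol2
LIMIT 0, 50;

Mind the minimum token length (ft_min_word_len / innodb_ft_min_token_size): 'uid2' is 4 characters, which passes the defaults.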
Please let me know whether the above answer makes sense, and what the result is.
Related
I'm having trouble understanding my options for optimizing this specific query. Looking online, I find various resources, but they all cover queries that don't match my particular one. From what I could gather, it's very hard to optimize a query when you have an ORDER BY combined with a LIMIT.
My use case is a paginated datatable that displays the latest records first.
The query in question is the following (to fetch 10 latest records):
select
`xyz`.*
from
xyz
where
`xyz`.`fk_campaign_id` = 95870
and `xyz`.`voided` = 0
order by
`registration_id` desc
limit 10 offset 0
& table DDL:
CREATE TABLE `xyz` (
`registration_id` int NOT NULL AUTO_INCREMENT,
`fk_campaign_id` int DEFAULT NULL,
`fk_customer_id` int DEFAULT NULL,
... other fields ...
`voided` tinyint unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`registration_id`),
.... ~12 other indexes ...
KEY `activityOverview` (`fk_campaign_id`,`voided`,`registration_id` DESC)
) ENGINE=InnoDB AUTO_INCREMENT=280614594 DEFAULT CHARSET=utf8 COLLATE=utf8_danish_ci;
The explain on the query mentioned gives me the following:
"id","select_type","table","partitions","type","possible_keys","key","key_len","ref","rows","filtered","Extra"
1,SIMPLE,db_campaign_registration,,index,"getTop5,winners,findByPage,foreignKeyExistingCheck,limitReachedIp,byCampaign,emailExistingCheck,getAll,getAllDated,activityOverview",PRIMARY,"4",,1626,0.65,Using where; Backward index scan
As you can see, it says it only hits 1626 rows. But when I execute it, it takes 200+ seconds to run.
I'm doing this to fetch data for a datatable that displays the latest 10 records. I also have pagination that allows one to navigate pages (only to the next page, not to the last, and no big jumps).
To further help with getting the full picture, I've put together a dbfiddle: https://dbfiddle.uk/Jc_K68rj - this fiddle does not show the same results as my table, but I suspect that is because of the data size I'm dealing with.
The table in question has 120GB of data and 39,000,000 active records. I already have an index in place that should cover the query and allow it to fetch the data fast. Am I completely missing something here?
Another solution goes something like this:
SELECT b.*
FROM ( SELECT registration_id
FROM xyz
where `xyz`.`fk_campaign_id` = 95870
and `xyz`.`voided` = 0
order by `registration_id` desc
limit 10 offset 0 ) AS a
JOIN xyz AS b USING (registration_id)
order by `registration_id` desc;
Explanation:
The derived table (subquery) will use the 'best' index without any extra prompting -- since it is "covering".
That will deliver 10 ids
Then 10 JOINs to the table to get xyz.*
A derived table is unordered, so the ORDER BY does need repeating.
That's tricking the Optimizer into doing what it should have done anyway.
(Again, I encourage getting rid of any indexes that are prefixes of the 3-column, optimal, index discussed.)
KEY `activityOverview` (`fk_campaign_id`,`voided`,`registration_id` DESC)
is optimal. (Nearly as good is the same index, but without the DESC).
Let's see the other indexes. I strongly suspect that there is at least one index that is a prefix of that index. Remove it/them. The Optimizer sometimes gets confused and picks the "smaller" index instead of the "better" index.
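For example, if one of those indexes turns out to be something like KEY byCampaign (fk_campaign_id) - a guess based on the index names in your EXPLAIN's possible_keys - it is a prefix of activityOverview and can go:

ALTER TABLE xyz DROP INDEX byCampaign;  -- hypothetical name; check SHOW CREATE TABLE first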
Here's a technique for seeing whether it manages to read only 10 rows instead of most of the table: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#handler_counts
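A sketch of that technique applied here; if the Handler_read_* counters stay near 10, only about 10 rows were actually touched:

FLUSH STATUS;                         -- reset the session counters
SELECT b.*
FROM ( SELECT registration_id
       FROM xyz
       WHERE fk_campaign_id = 95870 AND voided = 0
       ORDER BY registration_id DESC
       LIMIT 10 OFFSET 0 ) AS a
JOIN xyz AS b USING (registration_id)
ORDER BY registration_id DESC;
SHOW SESSION STATUS LIKE 'Handler%';  -- Handler_read_* counts rows the engine really read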
So this might be a bit silly, but the alternative I was using is worse. I am trying to write an Excel sheet using data from my database and a PHP tool called Box/Spout. The thing is that Box/Spout reads rows one at a time; they are not retrieved via index (e.g. rows[10], rows[42], rows[156]).
I need to retrieve data from the database in the order the rows come out. I have a database with a list of customers that came in via Import, and I have to write them into the Excel spreadsheet. They have phone numbers, emails, and an address. Sorry for the confusion... :/ So I compiled this fairly complex query:
SELECT
`Import`.`UniqueID`,
`Import`.`RowNum`,
`People`.`PeopleID`,
`People`.`First`,
`People`.`Last`,
GROUP_CONCAT(
DISTINCT CONCAT_WS(',', `PhonesTable`.`Phone`, `PhonesTable`.`Type`)
ORDER BY `PhonesTable`.`PhoneID` DESC
SEPARATOR ';'
) AS `Phones`,
GROUP_CONCAT(
DISTINCT CONCAT_WS(',', `EmailsTable`.`Email`)
ORDER BY `EmailsTable`.`EmailID` DESC
SEPARATOR ';'
) AS `Emails`,
`Properties`.`Address1`,
`Properties`.`city`,
`Properties`.`state`,
`Properties`.`PostalCode5`,
...(17 more `People` Columns)...,
FROM `T_Import` AS `Import`
LEFT JOIN `T_CustomerStorageJoin` AS `CustomerJoin`
ON `Import`.`UniqueID` = `CustomerJoin`.`ImportID`
LEFT JOIN `T_People` AS `People`
ON `CustomerJoin`.`PersID`=`People`.`PeopleID`
LEFT JOIN `T_JoinPeopleIDPhoneID` AS `PeIDPhID`
ON `People`.`PeopleID` = `PeIDPhID`.`PeopleID`
LEFT JOIN `T_Phone` AS `PhonesTable`
ON `PeIDPhID`.`PhoneID`=`PhonesTable`.`PhoneID`
LEFT JOIN `T_JoinPeopleIDEmailID` AS `PeIDEmID`
ON `People`.`PeopleID` = `PeIDEmID`.`PeopleID`
LEFT JOIN `T_Email` AS `EmailsTable`
ON `PeIDEmID`.`EmailID`=`EmailsTable`.`EmailID`
LEFT JOIN `T_JoinPeopleIDPropertyID` AS `PeIDPrID`
ON `People`.`PeopleID` = `PeIDPrID`.`PeopleID`
AND `PeIDPrID`.`PropertyCP`='CurrentImported'
LEFT JOIN `T_Property` AS `Properties`
ON `PeIDPrID`.`PropertyID`=`Properties`.`PropertyID`
WHERE `Import`.`CustomerCollectionID`=$ccID
AND `RowNum` >= $rnOffset
AND `RowNum` < $rnLimit
GROUP BY `RowNum`;
So I have indexes on every ON segment and on the WHERE segment. When RowNum is around 0-2500 in value, the query runs great and executes within a couple of seconds. But the query execution time seems to multiply exponentially the larger RowNum gets.
I have an EXPLAIN here: and at pastebin( https://pastebin.com/PksYB4n2 )
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE Import NULL ref CustomerCollectionID,RowNumIndex CustomerCollectionID 4 const 48108 8.74 Using index condition; Using where; Using filesort;
1 SIMPLE CustomerJoin NULL ref ImportID ImportID 4 MyDatabase.Import.UniqueID 1 100 NULL
1 SIMPLE People NULL eq_ref PRIMARY,PeopleID PRIMARY 4 MyDatabase.CustomerJoin.PersID 1 100 NULL
1 SIMPLE PeIDPhID NULL ref PeopleID PeopleID 5 MyDatabase.People.PeopleID 8 100 NULL
1 SIMPLE PhonesTable NULL eq_ref PRIMARY,PhoneID,PhoneID_2 PRIMARY 4 MyDatabase.PeIDPhID.PhoneID 1 100 NULL
1 SIMPLE PeIDEmID NULL ref PeopleID PeopleID 5 MyDatabase.People.PeopleID 5 100 NULL
1 SIMPLE EmailsTable NULL eq_ref PRIMARY,EmailID,DupeDeleteSelect PRIMARY 4 MyDatabase.PeIDEmID.EmailID 1 100 NULL
1 SIMPLE PeIDPrID NULL ref PeopleMSCP,PeopleID,PropertyCP PeopleMSCP 5 MyDatabase.People.PeopleID 4 100 Using where
1 SIMPLE Properties NULL eq_ref PRIMARY,PropertyID PRIMARY 4 MyDatabase.PeIDPrID.PropertyID 1 100 NULL
I apologize if the formatting is absolutely terrible. I'm not sure what good formatting looks like, so I may have jumbled it a bit by accident, plus the tabs got messed up.
What I want to know is how to speed up the query. The databases are very large, in the tens of millions of rows. They aren't always like this, as our tables are constantly changing, but I would like to be able to handle it when they are.
I tried using LIMIT 2000, 1000 for example, but I know that it's less efficient than using an indexed column, so I switched over to RowNum. I feel like this was a good decision, but it seems like MySQL is still looping over every single row before the offset, which kind of defeats the purpose of my index... I think? I'm not sure. I also tried splitting this particular query into about 10 single queries and running them one by one, for each row of the Excel file, and that takes a LONG time... TOO LONG. The combined query is faster, but obviously I'm still having a problem.
Any help would be greatly appreciated, and thank you ahead of time. I'm sorry again for my lack of post organization.
The order of the columns in an index matters. The order of the clauses in WHERE does not matter (usually).
INDEX(a), INDEX(b) is not the same as the "composite" INDEX(a,b). I deliberately made composite indexes where they seemed useful.
INDEX(a,b) and INDEX(b,a) are not interchangeable unless both a and b are tested with =. (Plus a few exceptions.)
A "covering" index is one where all the columns for the one table are found in the one index. This sometimes provides an extra performance boost. Some of my recommended indexes are "covering". It implies that only the index BTree need be accessed, not also the data BTree; this is where it picks up some speed.
In EXPLAIN SELECT ... a "covering" index is indicated by "Using index" (which is not the same as "Using index condition"). (Your Explain shows no covering indexes currently.)
An index 'should not' have more than 5 columns. (This is not a hard and fast rule.) Some of the indexes below need several columns in order to be covering; for some tables a covering index was not practical.
When JOINing, the order of the tables does not matter; the Optimizer is free to shuffle them around. However, these "rules" apply:
A LEFT JOIN may force ordering of the tables. (I think it does in this case.) (I ordered the columns based on what I think the Optimizer wants; there may be some flexibility.)
The WHERE clause usually determines which table to "start with". (You test on `Import` only, so obviously it will start with `Import`.)
The "next table" to be referenced (via NLJ - Nested Loop Join) is determined by a variety of things. (In your case it is pretty obvious -- namely the ON column(s).)
More on indexing: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
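A tiny, hypothetical illustration of the composite/covering points above (table t and index ix_ab are made up):

CREATE TABLE t (id INT PRIMARY KEY, a INT, b INT);
CREATE INDEX ix_ab ON t (a, b);        -- composite: not the same as INDEX(a) plus INDEX(b)
-- WHERE a = 1 AND b = 2 can seek straight down ix_ab;
-- SELECT b FROM t WHERE a = 1 is "covered" by ix_ab:
EXPLAIN SELECT b FROM t WHERE a = 1;   -- Extra shows "Using index"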
Revised Query
1. Import: (CustomerCollectionID,  -- '=' comes first
            RowNum,                -- 'range'
            UniqueID)              -- 'covering'
   (Import shows up in WHERE, so it is first in the Explain; also due to the LEFTs.)
2. CustomerJoin: (ImportID,  -- coming from `Import` (see ON...)
                  PersID)    -- covering
3. People: (PeopleID)  -- I assume that is the PRIMARY KEY? (Too many columns for "covering".)
   (Since `People` leads to 3 other tables, I won't number the rest.)
   PeIDPhID: (PeopleID, PhoneID)
   PhonesTable: (PhoneID, Type, Phone)
   PeIDEmID: (PeopleID,  -- JOIN from People
              EmailID)   -- covering
   EmailsTable: (EmailID, Email)
   PeIDPrID: (PropertyCP, PeopleID, PropertyID)
   Properties: (PropertyID)  -- is that the PK?
After adding those, I expect most lines of EXPLAIN to say Using index.
The lack of at least a composite index on Import is the main problem leading to your performance complaint.
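As DDL, the Import index above might look like this (a sketch; the index name is arbitrary):

ALTER TABLE T_Import
  ADD INDEX ix_ccid_rownum_uid   -- '=' first, range second, UniqueID to make it covering
      (CustomerCollectionID, RowNum, UniqueID);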
Bad GROUP BY
When a GROUP BY does not include all the non-aggregated columns (other than those directly dependent on the GROUP BY column(s)), you get arbitrary values for the extras. I see from the EXPLAIN ("Rows") that several tables probably have multiple rows per group. You really ought to think about the garbage being generated by this query.
Curiously, Phones and Emails are fed into GROUP_CONCAT(), thereby avoiding the above issue, but the "Rows" for those tables is only 1.
(Read about ONLY_FULL_GROUP_BY; it might explain the issue better.)
(I'm listing this as a separate Answer since it is orthogonal to my other Answer.)
I call this the "explode-implode" syndrome. The JOINs first "explode" the row count, putting multiple rows per customer into an intermediate table; then the GROUP BY "implodes" them back down to the original set of rows.
Let me focus on a portion of the query that could be reformulated to provide a performance improvement:
SELECT ...
GROUP_CONCAT(
DISTINCT CONCAT_WS(',', `EmailsTable`.`Email`)
ORDER BY `EmailsTable`.`EmailID` DESC
SEPARATOR ';'
) AS `Emails`,
...
FROM ...
LEFT JOIN `T_Email` AS `EmailsTable`
ON `PeIDEmID`.`EmailID`=`EmailsTable`.`EmailID`
...
GROUP BY `RowNum`;
Instead, move the table and the aggregation function into a correlated subquery:
SELECT ...
( SELECT GROUP_CONCAT(
DISTINCT CONCAT_WS(',', `Email`)
ORDER BY `EmailID` DESC
SEPARATOR ';' )
FROM T_Email
WHERE `PeIDEmID`.`EmailID` = `EmailID`
) AS `Emails`,
...
FROM ...
-- and Remove: LEFT JOIN `T_Email` ON ...
...
-- and possibly Remove: GROUP BY ...;
Ditto for PhonesTable (sketched below).
(It is unclear whether the GROUP BY can be removed; other things may need it.)
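For reference, the PhonesTable portion would follow the same pattern (a sketch mirroring the email subquery above):

SELECT ...
    ( SELECT GROUP_CONCAT(
                 DISTINCT CONCAT_WS(',', `Phone`, `Type`)
                 ORDER BY `PhoneID` DESC
                 SEPARATOR ';' )
      FROM T_Phone
      WHERE `PeIDPhID`.`PhoneID` = `PhoneID`
    ) AS `Phones`,
    ...
-- and Remove: LEFT JOIN `T_Phone` ON ...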
I'm looking for a reason and suggestions.
My table has about 1.4 million rows, and when I run the following query it takes over 3 minutes. I added COUNT just for showing the result; my real query is without COUNT.
MariaDB [ams]> SELECT count(asin) FROM asins where asins.is_active = 1
and asins.title is null and asins.updated < '2018-10-28' order by sortorder,id;
+-------------+
| count(asin) |
+-------------+
| 187930 |
+-------------+
1 row in set (3 min 34.34 sec)
Structure
id int(9) Primary
asin varchar(25) UNIQUE
is_active int(1) Index
sortorder int(9) Index
Please let me know if you need more information.
Thanks in advance.
EDIT
Query with EXPLAIN
MariaDB [ams]> EXPLAIN SELECT asin FROM asins where asins.is_active = 1 and asins.title is null and asins.updated < '2018-10-28' order by sortorder,id;
The database is scanning all the rows to answer the query. I imagine you have a really big table.
For this query, the ORDER BY is unnecessary (but it should have no impact on performance):
SELECT count(asin)
FROM asins
WHERE asins.is_active = 1 AND
asins.title is null AND
asins.updated < '2018-10-28' ;
Then you want an index on (is_active, title, updated).
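As DDL, that might look like this (the index name is arbitrary; if title is a long column, a prefix length may be required):

ALTER TABLE asins
  ADD INDEX ix_active_title_updated (is_active, title, updated);
  -- '=' and IS NULL tests first, the range on `updated` last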
Looks like you have an index on is_active and updated. So that index is going to be scanned (like a table scan, every record in the index read), but since title is not in the index, there is going to be a second operation which looks up title in the table. You can think of this as a join between the index and the table. If most of the records in the index match your conditions, then the join is going to involve most of the data in the table. Large joins are slow.
You might be better off with a full table scan if the conditions against the index are going to result in a large number of records returned.
See https://dba.stackexchange.com/questions/110707/how-can-i-force-mysql-to-ignore-all-indexes for a way to force the full table scan. Give it a try and see if your query is faster.
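A sketch of that (this assumes the index is named is_active; check SHOW CREATE TABLE for the real name):

SELECT COUNT(asin)
FROM asins IGNORE INDEX (is_active)   -- hypothetical index name
WHERE is_active = 1
  AND title IS NULL
  AND updated < '2018-10-28';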
Try these:
INDEX(is_active, updated),
INDEX(is_active, sortorder, id)
And please provide SHOW CREATE TABLE.
With the first of these indexes, some of the filtering will be done, but then it will still have to sort the results.
With the second index, the Optimizer may choose to filter on the only '=' column, then avoid the sort by launching into the ORDER BY. The risk is that it will still have to hit so many rows that avoiding the sort is not worth it.
What percentage of the table has is_active = 1? What percentage has a null title? What percentage is in that date range?
When you create a compound index and part of it is range based, you want the range-based part last, after the columns tested with = (or IS NULL).
So try the index (is_active, title, updated).
This way the equality tests narrow the seek first, and the trailing `updated` column can still be used for the range.
I wish to fetch the last 10 rows from a table of 1M rows.
CREATE TABLE `test` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`updated_date` datetime NOT NULL,
PRIMARY KEY (`id`)
)
One way of doing this is -
select * from test order by -id limit 10;
**10 rows in set (0.14 sec)**
Another way of doing this is -
select * from test order by id desc limit 10;
**10 rows in set (0.00 sec)**
So I did an 'EXPLAIN' on these queries -
Here is the result for the query where I use 'order by desc'
EXPLAIN select * from test order by id desc limit 10;
And here is the result for the query where I use 'order by -id'
EXPLAIN select * from test order by -id limit 10;
I thought these would be the same, but it seems there are differences in the execution plan.
RDBMSs use heuristics to calculate the execution plan; they cannot always determine the semantic equivalence of two statements, as that is too difficult a problem (in terms of theoretical and practical complexity).
So MySQL is not able to use the index, as you do not have an index on "-id", which is an expression applied to the field "id". It seems trivial, but an RDBMS must minimize the time needed to compute a plan, so it gets stuck on simple problems.
When no optimization can be found for a query (i.e. using an index), the system falls back to the implementation that works in any case: a scan of the full table.
As you can see in the EXPLAIN results:
1 : order by id desc
MySQL is using the index on id. So it needs to read only 10 rows, as they are already in index order. In this case MySQL also doesn't need the filesort algorithm.
2 : order by -id
MySQL is not using the index on id. So it needs to iterate over all the rows (455952 of them) to get your expected results, and it needs the filesort algorithm, as -id is not indexed. So it will obviously take more time :)
Per the MySQL documentation (linked below), the index also cannot be used to resolve the ORDER BY in cases such as these:
You use ORDER BY with an expression that includes terms other than the key column name:
SELECT * FROM t1 ORDER BY ABS(key);
SELECT * FROM t1 ORDER BY -key;
You index only a prefix of a column named in the ORDER BY clause. In this case, the index cannot be used to fully resolve the sort order. For example, if you have a CHAR(20) column, but index only the first 10 bytes, the index cannot distinguish values past the 10th byte and a filesort will be needed.
The type of table index used does not store rows in order. For example, this is true for a HASH index in a MEMORY table.
Please follow this link: http://dev.mysql.com/doc/refman/5.7/en/order-by-optimization.html
I have a problem with this slow query that runs for 10+ seconds:
SELECT DISTINCT siteid,
storyid,
added,
title,
subscore1,
subscore2,
subscore3,
( 1 * subscore1 + 0.8 * subscore2 + 0.1 * subscore3 ) AS score
FROM articles
WHERE added > '2011-10-23 09:10:19'
AND ( articles.feedid IN (SELECT userfeeds.siteid
FROM userfeeds
WHERE userfeeds.userid = '1234')
OR ( articles.title REGEXP '[[:<:]]keyword1[[:>:]]' = 1
OR articles.title REGEXP '[[:<:]]keyword2[[:>:]]' = 1 ) )
ORDER BY score DESC
LIMIT 0, 25
This outputs a list of stories based on the sites that a user added to his account. The ranking is determined by score, which is made up out of the subscore columns.
The query uses filesort and uses indices on PRIMARY and feedid.
Results of an EXPLAIN:
id  select_type         table      type            possible_keys                  key      ref   rows    Extra
1   PRIMARY             articles   range           PRIMARY,added,storyid          PRIMARY        729263  Using where; Using filesort
2   DEPENDENT SUBQUERY  userfeeds  index_subquery  storyid,userid,siteid_storyid  siteid   func  1       Using where
Any suggestions to improve this query? Thank you.
I would move the calculation logic to the client and only load the fields from the database. That makes both the query and the calculation itself faster. It's not good style to do such things in SQL code.
The regex is also very slow; another search mode such as LIKE may be faster.
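For example, a sketch of the LIKE variant (note: '%keyword1%' is a substring match, not a word-boundary match like the REGEXP, so results can differ; the leading wildcard still prevents index use, it is just cheaper per row):

WHERE added > '2011-10-23 09:10:19'
  AND ( articles.feedid IN (SELECT userfeeds.siteid
                            FROM userfeeds
                            WHERE userfeeds.userid = '1234')
     OR articles.title LIKE '%keyword1%'
     OR articles.title LIKE '%keyword2%' )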
Looking at your EXPLAIN, it doesn't appear your query can use any index for the sort (thus the filesort). This is caused by ordering on the calculated column (score).
Another barrier is the size of the table (729263 rows). You don't want to create an index that is too wide, as it will take much more space and impact the performance of your CUD operations. What we want to do is target the columns being selected; however, in this situation we can't, since score is a calculated column. You can try creating a VIEW, removing the sort, or doing the sort at the application layer.