This is a theoretical question. Sorry, but I don't have working table data to show; I'll improvise with a theoretical example.
Using MySQL/MariaDB. I have indexes on all relevant fields.
I have a system whose historical design includes a ProductType table, something like:
ID=1, Description="Milk"
ID=2, Description="Bread"
ID=3, Description="Salt"
ID=4, Description="Sugar"
and so on.
Some features in the system rely on the ProductType ID, and the Description is also used in different places, such as for defining different properties of the product type.
There is also a Product table, with fields such as:
ID, ProductTypeID, Name
Product:Name doesn't contain the product type description, so a "Milk bottle 1l" will have an entry such as:
ID=101, ProductTypeID=1, Name="bottle 1l"
and "Sugar pack 1kg" will be:
ID=102, ProductTypeID=4, Name="pack 1kg"
You get the idea...
The system combines ProductType:Description and Product:Name to show full product names to the users. This creates systematic naming for all the products, so there is no way to define a product with a name such as "1l bottle of milk". I know that in English this might be hard to swallow, but it works great in my local language.
Years passed, and the database grew to millions of products.
Since a full-text index needs all searched data in one table, I had to store the ProductType:Description inside the Product table, in a string field I added that holds various keywords related to the product, so the full-text search can find anything related to it (type, name, barcode, SKU, etc.).
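For reference, a minimal sketch of that setup (SearchKeywords is a made-up name; I'm not showing the real field):

```sql
-- Hypothetical keyword field holding type description, barcode, SKU, etc.
ALTER TABLE Product ADD FULLTEXT INDEX ft_keywords (SearchKeywords);

-- Full-text search then finds products by any of the stored keywords:
SELECT ID, Name
FROM Product
WHERE MATCH(SearchKeywords) AGAINST('milk bottle' IN NATURAL LANGUAGE MODE);
```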
Now I'm trying to solve the full table scans, and it makes me think the current design might not be optimal and I'll have to redesign and store the full product name (type + name) in the same table...
In order to show the proper order of the products there's an ORDER BY TypeDescription ASC, ProductName ASC after the ProductType table is joined to Product select queries.
From my research I see that the database can't use an index when the ordering is done on fields from different tables, so it does a full table scan to get to the right entries.
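If I do redesign, a minimal sketch of what I have in mind (column and index names are invented for illustration):

```sql
-- Copy the type description into Product so one composite index can serve the ORDER BY.
ALTER TABLE Product
    ADD COLUMN TypeDescription VARCHAR(100) NOT NULL DEFAULT '',
    ADD INDEX idx_type_name (TypeDescription, Name);

-- The sort now stays within one table and can walk the index in order:
SELECT ID, TypeDescription, Name
FROM Product
ORDER BY TypeDescription ASC, Name ASC
LIMIT 100;
```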
During pagination there's an ORDER BY plus LIMIT 50000,100 in the query, which takes a lot of time.
There are sections with lots of products, so the ordering and limiting cause very long full table scans.
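From what I've read, one alternative to a huge OFFSET is keyset ("seek") pagination: remember where the previous page ended and seek past it, so the index can jump straight to the next page instead of counting rows from the start. A rough sketch (the literal values are placeholders):

```sql
-- Instead of ORDER BY ... LIMIT 50000,100, remember the last (name, id) shown:
SELECT ID, ProductName
FROM Product
WHERE (ProductName, ID) > ('last product name shown', 12345)  -- placeholders
ORDER BY ProductName ASC, ID ASC
LIMIT 100;
```

On older MySQL versions the row-constructor comparison may not use the index well; it can be expanded to `ProductName > ? OR (ProductName = ? AND ID > ?)`.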
How would you handle that situation?
Change the design and store all query-related data in the Product table? That feels like duplication and not a natural solution.
Or maybe there's another way to solve it?
Will an index on a VARCHAR field (product name) be efficient for the ORDER BY? Or will the database still do a full table scan?
My first question here. Couldn't find answers on similar cases.
Thanks!
I've tried playing with the queries to see if ordering by an indexed VARCHAR field will work, but EXPLAIN SELECT still shows that the query didn't use the index for the sort and ran with "Using where" :(
UPDATE
Trying to add some more data...
The situation is a bit more complicated and after digging a bit more it looks like the initial question was not in the right direction.
I removed the product type from the queries and still have the slow query.
I feel like it's a chicken and egg situation...
I have a table that maps product IDs to section IDs:
CREATE TABLE `Product2Section` (
`SectionId` int(10) unsigned NOT NULL,
`ProductId` int(10) unsigned NOT NULL,
KEY `idx_ProductId` (`ProductId`),
KEY `idx_SectionId` (`SectionId`),
KEY `idx_ProductId_SectionId` (`ProductId`,`SectionId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=DYNAMIC
The query (after stripping all fields not relevant to the question):
SELECT DISTINCT
DRIVER.ProductId AS ID,
p.*
FROM
Product2Section AS DRIVER
LEFT JOIN Product p ON
(p.ID = DRIVER.ProductId)
WHERE
DRIVER.SectionId IN(
544,545,546,548,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,566,567,568,570,571,572,573,574,575,1337,1343,1353,1358,1369,1385,1956,1957,1964,1973,1979,1980,1987,1988,1994,1999,2016,2020,576,577,578,579,580,582,586,587,589,590,591,593,596,597,598,604,605,606,608,609,612,613,614,615,617,619,620,621,622,624,625,626,627,628,629,630,632,634,635,637,639,640,642,643,644,645,647,648,651,656,659,660,661,662,663,665,667,669,670,672,674,675,677,683,684,689,690,691,695,726,728,729,730,731,734,736,741,742,743,745,746,749,752,758,761,762,763,764,768,769,771,772,773,774,775,776,777
)
ORDER BY
p.ProductName ASC
LIMIT 500900,100;
explain shows:

| id | select_type | table  | type   | possible_keys  | key                     | key_len | ref                       | rows   | Extra                                                     |
|----|-------------|--------|--------|----------------|-------------------------|---------|---------------------------|--------|-----------------------------------------------------------|
| 1  | SIMPLE      | DRIVER | index  | idx_SectionId  | idx_ProductId_SectionId | 8       | NULL                      | 589966 | Using where; Using index; Using temporary; Using filesort |
| 1  | SIMPLE      | p      | eq_ref | PRIMARY,idx_ID | PRIMARY                 | 4       | 4project.DRIVER.ProductId | 1      | Using where                                               |
I've tried selecting from the Product table and joining Product2Section to filter the results, but I get the same result:
SELECT DISTINCT
p.ID,
p.ProductName
FROM
Product p
LEFT JOIN
Product2Section p2s ON (p.ID=p2s.ProductId)
WHERE
p2s.SectionId IN(
544,545,546,548,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,566,567,568,570,571,572,573,574,575,1337,1343,1353,1358,1369,1385,1956,1957,1964,1973,1979,1980,1987,1988,1994,1999,2016,2020,576,577,578,579,580,582,586,587,589,590,591,593,596,597,598,604,605,606,608,609,612,613,614,615,617,619,620,621,622,624,625,626,627,628,629,630,632,634,635,637,639,640,642,643,644,645,647,648,651,656,659,660,661,662,663,665,667,669,670,672,674,675,677,683,684,689,690,691,695,726,728,729,730,731,734,736,741,742,743,745,746,749,752,758,761,762,763,764,768,769,771,772,773,774,775,776,777
)
ORDER BY
p.ProductName ASC
LIMIT 500900,100;
explain:

| id | select_type | table | type   | possible_keys                                       | key                     | key_len | ref                    | rows   | Extra                                                     |
|----|-------------|-------|--------|-----------------------------------------------------|-------------------------|---------|------------------------|--------|-----------------------------------------------------------|
| 1  | SIMPLE      | p2s   | index  | idx_ProductId,idx_SectionId,idx_ProductId_SectionId | idx_ProductId_SectionId | 8       | NULL                   | 589966 | Using where; Using index; Using temporary; Using filesort |
| 1  | SIMPLE      | p     | eq_ref | PRIMARY,idx_ID                                      | PRIMARY                 | 4       | 4project.p2s.ProductId | 1      | Using where                                               |
Don't see a way out of that situation.
The two single column indices on Product2Section serve no purpose. You should change your junction table to:
CREATE TABLE `Product2Section` (
`SectionId` int unsigned NOT NULL,
`ProductId` int unsigned NOT NULL,
PRIMARY KEY (`SectionId`, `ProductId`),
KEY `idx_ProductId_SectionId` (`ProductId`, `SectionId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
There are other queries in the system that probably use the single field indexes
The single column indices cannot be used for anything that the two composite indices cannot be used for. They are just wasting space and cause unnecessary overhead on insert and for the optimizer. Setting one of the composite indices as PRIMARY stops InnoDB from having to create its own internal rowid, which just wastes space. It also adds the uniqueness constraint which is currently missing from your table.
From the docs:
Accessing a row through the clustered index is fast because the index search leads directly to the page that contains the row data. If a table is large, the clustered index architecture often saves a disk I/O operation when compared to storage organizations that store row data using a different page from the index record.
This is not significant for a "simple" junction table, as both columns are stored in both indices, therefore no further read is required.
You said:
that didn't really bother me since there was no real performance hit
You may not see the difference when running an individual query with no contention but the difference in a highly contended production environment can be huge, due to the amount of effort required.
Do you really need to accommodate 4,294,967,295 (int unsigned) sections? Perhaps the 65,535 provided by smallint unsigned would be enough?
You said:
Might change it in the future. Don't think it will change the performance somehow
Changing SectionId to smallint will reduce each index entry from 8 to 6 bytes. That's a 25% reduction in size. Smaller is faster.
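A sketch of that change (assuming nothing else, such as a foreign key, constrains the column's type):

```sql
-- 2-byte SectionId + 4-byte ProductId = 6 bytes of key data per entry instead of 8.
ALTER TABLE Product2Section
    MODIFY SectionId SMALLINT UNSIGNED NOT NULL;
```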
Why are you using LEFT JOIN? The fact that you are happy to reverse the order of the tables in the query suggests it should be an INNER JOIN.
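A sketch of the rewritten query (IN list truncated for brevity; DISTINCT is kept because a product can appear in several of the listed sections):

```sql
SELECT DISTINCT p.ID, p.ProductName
FROM Product2Section AS p2s
INNER JOIN Product AS p ON p.ID = p2s.ProductId
WHERE p2s.SectionId IN (544, 545, 546 /* ... */)
ORDER BY p.ProductName ASC
LIMIT 500900,100;
```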
Do you have your buffer pool configured appropriately, or is it set to defaults? Please run ANALYZE TABLE Product2Section; and then provide the output from:
SELECT TABLE_ROWS, AVG_ROW_LENGTH, DATA_LENGTH + INDEX_LENGTH
FROM information_schema.TABLES
WHERE TABLE_NAME = 'Product2Section';
And:
SELECT ROUND(SUM(DATA_LENGTH + INDEX_LENGTH)/POW(1024, 3), 2)
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'your_database_name';
And:
SHOW VARIABLES LIKE 'innodb_buffer%';
I have the following query that runs forever, and I am looking to see if there is any way I can optimise it. It runs on a table with 1,406,480 rows in total; apart from the Filename and Ref_No columns, ID and End_Date have both been indexed.
My Query:
INSERT INTO UniqueIDs
(
SELECT
T1.ID
FROM
master_table T1
LEFT JOIN
master_table T2
ON
(
T1.Ref_No = T2.Ref_No
AND
T1.End_Date = T2.End_Date
AND
T1.Filename = T2.Filename
AND
T1.ID > T2.ID
)
WHERE T2.ID IS NULL
AND
LENGTH(T1.Ref_No) BETWEEN 5 AND 10
)
;
Explain Results:
The reason for not indexing Ref_No is that it is a TEXT column, and I get a BLOB/TEXT error when I try to index it.
Would really appreciate if somebody could advise on how I can quicken this query.
Thanks
Thanks to Bill's advice on multi-column indexes, I have managed to make some headway. I first ran this code:
CREATE INDEX I_DELETE_DUPS ON master_table(id, End_Date);
I then added a new column to hold the length of Ref_No, but had to change it from the query Bill mentioned, as my version of MySQL is 5.5. So I ran it in 3 steps:
ALTER TABLE master_table
ADD COLUMN Ref_No_length SMALLINT UNSIGNED;
UPDATE master_table SET Ref_No_length = LENGTH(Ref_No);
ALTER TABLE master_table ADD INDEX (Ref_No_length);
The last step was to change the WHERE clause of my insert query to use the new length column:
AND t1.Ref_No_length between 5 and 10;
I then ran this query, and within 15 mins I had 280k IDs inserted into my UniqueIDs table. I then changed my insert script to see if I could cover more lengths:
AND t1.Ref_No_length IN (5,6,7,8,9,10,13);
This was to also bring in the values where the length equals 13. This query took a lot longer, 2hr 50mins to be precise, but the additional work of looking for all rows with a length of 13 gave me an extra 700k unique IDs.
I am still looking at ways to optimise the query with the IN clause, but this is already a big improvement over the original query, which kept running for 24 hours. So thank you so much, Bill.
For the JOIN, you should have a multi-column index on (Ref_No, End_Date, Filename).
You can create a prefix index on a TEXT column like this:
ALTER TABLE master_table ADD INDEX (Ref_No(10));
But that won't help you search based on the LENGTH(). Indexing only helps search by value indexed, not by functions on the column.
In MySQL 5.7 or later, you can create a virtual column like this, with an index on the values calculated for the virtual column:
ALTER TABLE master_table
ADD COLUMN Ref_No_length SMALLINT UNSIGNED AS (LENGTH(Ref_No)),
ADD INDEX (Ref_No_length);
Then MySQL will recognize that your condition in your query is the same as the expression for the virtual column, and it will automatically use the index (exception: in my experience, this doesn't work for expressions using JSON functions).
But this is no guarantee that the index will help. If most of the rows match the condition of the length being between 5 and 10, the optimizer will not bother with the index. It may be more work to use the index than to do a table-scan.
the ID and End_Date have both been indexed.
You have PRIMARY KEY(id) and redundantly INDEX(id)? A PK is a unique key.
"have both been indexed" -- INDEX(a), INDEX(b) is not the same as INDEX(a,b) -- they have different uses. Read about "composite" indexes.
That query smells a lot like "group-wise" max done in a very slow way. (Alas, that may have come from the online docs.)
I have compiled the fastest ways to do that task here: http://mysql.rjweb.org/doc.php/groupwise_max (There are multiple versions, based on MySQL version and what issues your code can/cannot tolerate.)
Please provide SHOW CREATE TABLE. One important question: Is id the PRIMARY KEY?
This composite index may be useful:
(Filename, End_Date, Ref_No, -- first, in any order
ID) -- last
This, as others have noted, is unlikely to be helped by any index, hence T1 will need a full-table-scan:
AND LENGTH(T1.Ref_No) BETWEEN 5 AND 10
If Ref_No cannot be bigger than 191 characters, change it to a VARCHAR so that it can be used in an index. Oh, did I ask for SHOW CREATE TABLE? If you can't make it VARCHAR, then my recommended composite index is
INDEX(Filename, End_Date, ID)