SLOW QUERY / IN HAVING Clause - mysql

I have a many-to-many relationship database in MySQL
And this Query:
SELECT main_id FROM posts_tag
WHERE post_id IN ('134','140','187')
GROUP BY main_id
HAVING COUNT(DISTINCT post_id) = 3
There are ~5,300,000 rows in this table and that query takes about 5 seconds (and longer if I add more ids to the search)
I want to ask if there is any way to make it faster?
EXPLAIN shows this:
By the way, I want to add more conditions like NOT IN and possibly JOIN new tables which have the same structure but different data. Not too many of those, but first I want to know if there is any way to make this simple query faster?
Any advice would be helpful, even another method, or structure etc.
PS: Hardware is an Intel Core i9 3.6GHz, 64GB RAM, 480GB SSD. So I think the server specs are not the problem.

Use a "composite" and "covering" index:
INDEX(post_id, main_id)
And get rid of INDEX(post_id) since it will then be redundant.
"Covering" helps speed up a query.
Assuming this is a normal "many-to-many" table, then:
CREATE TABLE post_main (
    post_id INT UNSIGNED NOT NULL,   -- similar to `id` in table `posts`; match its datatype
    main_id INT UNSIGNED NOT NULL,   -- similar to `id` in table `main`; match its datatype
    PRIMARY KEY(post_id, main_id),
    INDEX(main_id, post_id)
) ENGINE=InnoDB;
There is no need for AUTO_INCREMENT anywhere in a many-to-many table.
(You could add FK constraints, but I say 'why bother'.)
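If you do decide to add them, a sketch might be (assuming parent tables `posts` and `main` with `id` primary keys, as the column comments above suggest):
ALTER TABLE post_main
    ADD FOREIGN KEY (post_id) REFERENCES posts(id),
    ADD FOREIGN KEY (main_id) REFERENCES main(id);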
More discussion: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table
And NOT IN
This gets a bit tricky. I think this is one way; there may be others.
SELECT main_id
FROM post_main AS x
WHERE post_id IN (244,229,193,93,61)
GROUP BY main_id
HAVING COUNT(*) = 5
AND NOT EXISTS ( SELECT 1
FROM post_main
WHERE main_id = x.main_id
AND post_id IN (92,10,234) );

Alexfsk, your query on the second line has the IN values surrounded by single quotes. When your column is defined with an INT or MEDIUMINT (or any integer) datatype, wrapping the values in single quotes forces a datatype conversion on every row considered and delays completion of your query.
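For illustration, using the values from the question:
WHERE post_id IN ('134','140','187')   -- quoted values, as in the question
WHERE post_id IN (134, 140, 187)       -- unquoted integer values, avoiding the conversion described above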


Order by fields from different tables creates a full table scan. Should I combine the data into one table?

This is a theoretical question. Sorry, but I don't have working table data to show; I'll try to improvise with a theoretical example.
Using MySql/MariaDB. Have indexes for all relevant fields.
I have a system, which historical design had a ProductType table, something like:
ID=1, Description="Milk"
ID=2, Description="Bread"
ID=3, Description="Salt"
ID=4, Description="Sugar"
and so on.
There are some features in the system that rely on the ProductType ID and the Description is also used in different places, such as for defining different properties of the product type.
There is also a Product table, with fields such as:
ID, ProductTypeID, Name
The Product:Name doesn't have the product type description in it, so a "Milk bottle 1l" will have an entry such as:
ID=101, ProductTypeID=1, Name="bottle 1l"
and "Sugar pack 1kg" will be:
ID=102, ProductTypeID=4, Name="pack 1kg"
You get the idea...
The system combines the ProductType:Description and Product:Name to show full product names to the users. This creates a systematic naming for all the products, so there is no way to define a product with a name such as "1l bottle of milk". I know that in English that might be hard to swallow, but that way works great with my local language.
Years passed and the database grew to millions of products.
Since a full-text index needs all the searched data in one table, I had to store the ProductType:Description inside the Product table, in a string field I added that holds different keywords related to the product, so the full-text search can find anything related to the product (type, name, barcode, SKU, etc.)
Now I'm trying to solve the full table scans, and it makes me think that the current design might not be optimal and I'll have to redesign and store the full product name (type + name) in the same table...
In order to show the products in the proper order, there's an ORDER BY TypeDescription ASC, ProductName ASC after the ProductType table is joined in the Product SELECT queries.
From my research I see that the database can't use indexes when the ordering is done on fields from different tables, so it's doing a full table scan to get to the right entries.
During pagination, there's an ORDER BY and LIMIT 50000,100 in the query that takes a lot of time.
There are sections with lots of products, so that ordering and limiting causes very long full table scans.
How would you handle that situation?
Change the design and store all query-related data in the Product table? It feels like duplication and not a natural solution.
Or maybe there's another way to solve it?
Will an index on a VARCHAR column (product name) be efficient for the ORDER BY speed, or will the database still do a full table scan?
My first question here. Couldn't find answers on similar cases.
Thanks!
I've tried to play with the queries to see if ordering by a VARCHAR field that has an index will work, but the EXPLAIN SELECT still shows that the query didn't use the index and did a WHERE run :(
UPDATE
Trying to add some more data...
The situation is a bit more complicated and after digging a bit more it looks like the initial question was not in the right direction.
I removed the product type from the queries and still have the slow query.
I feel like it's a chicken and egg situation...
I have a table that maps product IDs to section IDs:
CREATE TABLE `Product2Section` (
`SectionId` int(10) unsigned NOT NULL,
`ProductId` int(10) unsigned NOT NULL,
KEY `idx_ProductId` (`ProductId`),
KEY `idx_SectionId` (`SectionId`),
KEY `idx_ProductId_SectionId` (`ProductId`,`SectionId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=DYNAMIC
The query (after stripping all fields not relevant to the question):
SELECT DISTINCT
DRIVER.ProductId AS ID,
p.*
FROM
Product2Section AS DRIVER
LEFT JOIN Product p ON
(p.ID = DRIVER.ProductId)
WHERE
DRIVER.SectionId IN(
544,545,546,548,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,566,567,568,570,571,572,573,574,575,1337,1343,1353,1358,1369,1385,1956,1957,1964,1973,1979,1980,1987,1988,1994,1999,2016,2020,576,577,578,579,580,582,586,587,589,590,591,593,596,597,598,604,605,606,608,609,612,613,614,615,617,619,620,621,622,624,625,626,627,628,629,630,632,634,635,637,639,640,642,643,644,645,647,648,651,656,659,660,661,662,663,665,667,669,670,672,674,675,677,683,684,689,690,691,695,726,728,729,730,731,734,736,741,742,743,745,746,749,752,758,761,762,763,764,768,769,771,772,773,774,775,776,777
)
ORDER BY
p.ProductName ASC
LIMIT 500900,100;
explain shows:
id | select_type | table  | type   | possible_keys  | key                     | key_len | ref                       | rows   | Extra
1  | SIMPLE      | DRIVER | index  | idx_SectionId  | idx_ProductId_SectionId | 8       | NULL                      | 589966 | Using where; Using index; Using temporary; Using filesort
1  | SIMPLE      | p      | eq_ref | PRIMARY,idx_ID | PRIMARY                 | 4       | 4project.DRIVER.ProductId | 1      | Using where
I've tried to select from the products table and join the Product2Section table in order to filter the results, but got the same outcome:
SELECT DISTINCT
p.ID,
p.ProductName
FROM
Product p
LEFT JOIN
Product2Section p2s ON (p.ID=p2s.ProductId)
WHERE
p2s.SectionId IN(
544,545,546,548,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,566,567,568,570,571,572,573,574,575,1337,1343,1353,1358,1369,1385,1956,1957,1964,1973,1979,1980,1987,1988,1994,1999,2016,2020,576,577,578,579,580,582,586,587,589,590,591,593,596,597,598,604,605,606,608,609,612,613,614,615,617,619,620,621,622,624,625,626,627,628,629,630,632,634,635,637,639,640,642,643,644,645,647,648,651,656,659,660,661,662,663,665,667,669,670,672,674,675,677,683,684,689,690,691,695,726,728,729,730,731,734,736,741,742,743,745,746,749,752,758,761,762,763,764,768,769,771,772,773,774,775,776,777
)
ORDER BY
p.ProductName ASC
LIMIT 500900, 100;
explain:
id | select_type | table | type   | possible_keys                                        | key                     | key_len | ref                    | rows   | Extra
1  | SIMPLE      | p2s   | index  | idx_ProductId,idx_SectionId,idx_ProductId_SectionId | idx_ProductId_SectionId | 8       | NULL                   | 589966 | Using where; Using index; Using temporary; Using filesort
1  | SIMPLE      | p     | eq_ref | PRIMARY,idx_ID                                       | PRIMARY                 | 4       | 4project.p2s.ProductId | 1      | Using where
Don't see a way out of that situation.
The two single column indices on Product2Section serve no purpose. You should change your junction table to:
CREATE TABLE `Product2Section` (
`SectionId` int unsigned NOT NULL,
`ProductId` int unsigned NOT NULL,
PRIMARY KEY (`SectionId`, `ProductId`),
KEY `idx_ProductId_SectionId` (`ProductId`, `SectionId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
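If you would rather convert the existing table in place than recreate it, a sketch (this assumes there are no duplicate (SectionId, ProductId) pairs; otherwise adding the primary key will fail):
ALTER TABLE Product2Section
    DROP INDEX idx_ProductId,
    DROP INDEX idx_SectionId,
    ADD PRIMARY KEY (SectionId, ProductId);
-- idx_ProductId_SectionId stays, for lookups that start from ProductId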
There are other queries in the system that probably use the single field indexes
The single column indices cannot be used for anything that the two composite indices cannot be used for. They are just wasting space and cause unnecessary overhead on insert and for the optimizer. Setting one of the composite indices as PRIMARY stops InnoDB from having to create its own internal rowid, which just wastes space. It also adds the uniqueness constraint which is currently missing from your table.
From the docs:
Accessing a row through the clustered index is fast because the index search leads directly to the page that contains the row data. If a table is large, the clustered index architecture often saves a disk I/O operation when compared to storage organizations that store row data using a different page from the index record.
This is not significant for a "simple" junction table, as both columns should be stored in both indices, therefore no further read is required.
You said:
that didn't really bother me since there was no real performance hit
You may not see the difference when running an individual query with no contention but the difference in a highly contended production environment can be huge, due to the amount of effort required.
Do you really need to accommodate 4,294,967,295 (int unsigned) sections? Perhaps the 65,535 provided by smallint unsigned would be enough?
You said:
Might change it in the future. Don't think it will change the performance somehow
Changing SectionId to smallint will reduce each index entry from 8 to 6 bytes. That's a 25% reduction in size. Smaller is faster.
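A sketch of that change (it rebuilds the table and requires that all existing SectionId values fit within the smallint unsigned range):
ALTER TABLE Product2Section
    MODIFY SectionId smallint unsigned NOT NULL;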
Why are you using LEFT JOIN? The fact that you are happy to reverse the order of the tables in the query suggests it should be an INNER JOIN.
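For example, a sketch of your second query with the join tightened (section list abbreviated; the rest is unchanged):
SELECT DISTINCT
    p.ID,
    p.ProductName
FROM Product2Section AS p2s
INNER JOIN Product AS p ON p.ID = p2s.ProductId
WHERE p2s.SectionId IN (544, 545, 546 /* ... same list as above ... */)
ORDER BY p.ProductName ASC
LIMIT 500900, 100;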
Do you have your buffer pool configured appropriately, or is it set to defaults? Please run ANALYZE TABLE Product2Section; and then provide the output from:
SELECT TABLE_ROWS, AVG_ROW_LENGTH, DATA_LENGTH + INDEX_LENGTH
FROM information_schema.TABLES
WHERE TABLE_NAME = 'Product2Section';
And:
SELECT ROUND(SUM(DATA_LENGTH + INDEX_LENGTH)/POW(1024, 3), 2)
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'your_database_name';
And:
SHOW VARIABLES LIKE 'innodb_buffer%';

Do composite key indices improve performance of OR clauses?

I have a table in MySQL with two columns
id int(11) unsigned NOT NULL AUTO_INCREMENT,
B varchar(191) CHARACTER SET utf8mb4 DEFAULT NULL,
The id being the PK.
I need to do a lookup in a query using either one of these: id IN (:idList) or B IN (:bList)
Would this query perform better if there were a composite index with these two columns in it?
No, it will not.
Indexes can be used to look up values from the leftmost columns in an index:
MySQL can use multiple-column indexes for queries that test all the columns in the index, or queries that test just the first column, the first two columns, the first three columns, and so on. If you specify the columns in the right order in the index definition, a single composite index can speed up several kinds of queries on the same table.
So, if you have a composite index on the id, B fields (in this order), then the index can be used to look up values based on their id, or on a combination of id and B values. But it cannot be used to look up values based on B only. However, in the case of an OR condition, that is exactly what you need to do: look up values based on B only.
If both fields in the or condition are leftmost fields in an index, then MySQL attempts to do an index merge optimisation, so you may actually be better off having separate indexes for these two fields.
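For example (the table name t and index name idx_b are placeholders for your actual table):
CREATE INDEX idx_b ON t (B);
-- With PRIMARY KEY(id) and idx_b in place, EXPLAIN may show type = index_merge
-- with Extra = Using union(PRIMARY, idx_b) for: ... WHERE id IN (:idList) OR B IN (:bList)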
Note: if you use the InnoDB table engine, then there is no point in adding the primary key to any multi-column index because InnoDB silently appends the PK to every secondary index.
For OR, I don't think so.
The optimizer will try to find a match on the first side; if that fails, it will try the second side. So an individual index for each search will be better.
For AND, a composite index will help.
MySQL index TIPS
Of course you can always add the index and compare the explain plan.
MySQL Explain Plan
The trick for optimizing OR is to use UNION. (At least, it works well in some cases.)
( SELECT ... FROM ... WHERE id IN (...) )
UNION DISTINCT
( SELECT ... FROM ... WHERE B IN (...) )
Notes:
Need separate indexes on id and B.
No benefit from any composite index (unless it is also "covering").
Change DISTINCT to ALL if you know that there won't be any rows found by both the id and B tests. (This avoids a de-dup pass.)
If you need ORDER BY, add it after the SQL above.
If you need LIMIT, it gets messier. (This is probably not relevant for IN, but it often is with ORDER BY.)
If the rows are 'wide' and the resultset has very few rows, it may be further beneficial to do something like this:
SELECT t.*
FROM t
JOIN (
( SELECT id FROM t WHERE id IN (...) )
UNION DISTINCT
( SELECT id FROM t WHERE B IN (...) )
) AS u USING(id);
Notes:
This needs PRIMARY KEY(id) and INDEX(B, id). (Actually there is no diff, as Michael pointed out.)
The UNION is cheaper here because of collecting only id, not the bulky columns.
The SELECTs in the UNION are faster because you should be able to provide "covering" indexes.
ORDER BY would go at the very end.
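Putting the pieces together, with the ordering and paging applied to the outer query (the column names and LIMIT here are placeholders):
SELECT t.*
FROM t
JOIN (
    ( SELECT id FROM t WHERE id IN (...) )
    UNION DISTINCT
    ( SELECT id FROM t WHERE B IN (...) )
) AS u USING(id)
ORDER BY t.B
LIMIT 10;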

JSONB Index using GIN doesn't work on postgres [duplicate]

I have the following table in PostgreSQL:
CREATE TABLE index_test
(
id int PRIMARY KEY NOT NULL,
text varchar(2048) NOT NULL,
last_modified timestamp NOT NULL,
value int,
item_type varchar(2046)
);
CREATE INDEX idx_index_type ON index_test ( item_type );
CREATE INDEX idx_index_value ON index_test ( value );
I make the following selects:
explain select * from index_test r where r.item_type='B';
explain select r.value from index_test r where r.value=56;
The explanation of execution plan looks like this:
Seq Scan on index_test r (cost=0.00..1.04 rows=1 width=1576)
Filter: ((item_type)::text = 'B'::text)
As far as I understand, this is a full table scan. The question is: why are my indexes not used?
Maybe the reason is that I have too few rows in my table? I have only 20 of them. Could you please provide me with a SQL statement to easily populate my table with random data to check the indexes issue?
I have found this article: http://it.toolbox.com/blogs/db2luw/how-to-easily-populate-a-table-with-random-data-7888, but it doesn't work for me. The efficiency of the statement does not matter, only the simplicity.
Maybe the reason is that I have too few rows in my table?
Yes. For a total of 20 rows in a table a seq scan is always going to be faster than an index scan. Chances are that those rows are located in a single database block anyway, so the seq scan would only need a single I/O operation.
If you use
explain (analyze true, verbose true, buffers true) select ....
you can see a bit more details about what is really going on.
Btw: you shouldn't use text as a column name, as that is also a datatype in Postgres (and thus a reserved word).
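If you want to follow that advice on the existing table, the rename is a one-liner (the new name data is just an example):
ALTER TABLE index_test RENAME COLUMN text TO data;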
The example you have found is for DB2; in Postgres you can use generate_series to do it.
For example like this:
INSERT INTO index_test (id, data, last_modified, value, item_type)  -- assumes the text column has been renamed to data, as above
SELECT
    i, md5(random()::text), now(), floor(random()*100), md5(random()::text)
FROM generate_series(1,1000) AS i;
SELECT max(value) from index_test;
http://sqlfiddle.com/#!12/52641/3
The second query in above fiddle should use index only scan.

Fastest way to retrieve records from multiple tables

I need to retrieve columns from two tables and I have used an INNER JOIN, but it's consuming a lot of time when loading the page. Is there any better and faster way to achieve the same?
Select P.Col1, P.Col2, P.Col3, P.Col4, P.Col5, C.Col1, C.Col2, C.Col3 from Pyalers P inner join Customers C on C.Col1 = P.Col1 where P.Col2 = 5
Thanks in Advance.
Without knowing your DDL, there's no way to say.
But conceptually this is OK; just be sure you have the proper indexes set.
For instance: (is your table name really 'Pyalers'? Assuming 'players')
CREATE INDEX idx_players ON `players` (col1);
CREATE INDEX idx_customers ON `customers` (col1);
Use the columns you need for joining the 2 tables.
http://dev.mysql.com/doc/refman/5.0/en/create-index.html
You're doing it the right way, but if you don't have indexes on your tables on the correct columns, it's not going to be very fast for tables of any size. Do Pyalers.col1 and Customers.col1 both have indexes on them?
Show us how the tables are defined.
Be sure your table has the needed indexes... as a rule of thumb, every field which is used for searching (WHERE) or joining data (INNER JOIN, LEFT JOIN, RIGHT JOIN) should be indexed.
Example: If you are creating a table, you can add your indexes at that time (notice that your tables should always have a primary key):
CREATE TABLE myTable (
myId int unsigned not null,
someField varchar(50),
primary key (myId),
index someIdx(someField)
);
If your table already exists, and you want to add indexes, you need to use the ALTER statement:
ALTER TABLE myTable
ADD INDEX someIdx(someField),
ADD PRIMARY KEY (myId);
Rules:
To define an index you must provide a unique name for it, and specify the fields included in the index: INDEX myIndex(field1, field2, ...)
There are different types of indexes: PRIMARY KEY is used for primary keys (that's obvious, huh?); INDEX is an 'ordinary index', just used to speed up search and join operations; UNIQUE INDEX is an index that prevents duplicate values.
Recommendations:
Whenever you can, index all numeric and date fields that are relevant (ids, birth date, etc.). Avoid creating indexes on fields that contain 'double' values.
Don't abuse indexes, because abuse can create very large index files.
Tips:
If you want to see how your query will be executed, you can use the EXPLAIN statement:
EXPLAIN SELECT a.*, b.* FROM a INNER JOIN b ON a.myId = b.otherId
This instruction will show you the execution plan of the query. If in the last column you see 'filesort' or 'using temporary', you may (just may) need additional indexes (notice that if you use GROUP BY you will almost always get the 'using temporary' message).
Hope this helps you.

Can I optimize my database by splitting one big table into many small ones?

Assume that I have one big table with three columns: "user_name", "user_property", "value_of_property". Let's also assume that I have a lot of users (let's say 100,000) and a lot of properties (let's say 10,000). Then the table is going to be huge (1 billion rows).
When I extract information from the table I always need information about a particular user. So I use, for example, where user_name='Albert Gates'. So every time the MySQL server needs to analyze 1 billion rows to find those that contain "Albert Gates" as user_name.
Would it not be wise to split the big table into many small ones corresponding to fixed users?
No, I don't think that is a good idea. A better approach is to add an index on the user_name column - and perhaps another index on (user_name, user_property) for looking up a single property. Then the database does not need to scan all the rows - it just needs to find the appropriate entry in the index, which is stored in a B-Tree, making it easy to find a record in a very small amount of time.
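A minimal sketch of that index (the table name user_properties is a placeholder for your actual table; thanks to the leftmost-prefix rule, the composite index also serves lookups on user_name alone):
CREATE INDEX idx_user_prop ON user_properties (user_name, user_property);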
If your application is still slow even after correctly indexing it can sometimes be a good idea to partition your largest tables.
One other thing you could consider is normalizing your database so that the user_name is stored in a separate table and an integer foreign key is used in its place. This can reduce storage requirements and can increase performance. The same may apply to user_property.
you should normalise your design as follows:
drop table if exists users;
create table users
(
user_id int unsigned not null auto_increment primary key,
username varbinary(32) unique not null
)
engine=innodb;
drop table if exists properties;
create table properties
(
property_id smallint unsigned not null auto_increment primary key,
name varchar(255) unique not null
)
engine=innodb;
drop table if exists user_property_values;
create table user_property_values
(
user_id int unsigned not null,
property_id smallint unsigned not null,
value varchar(255) not null,
primary key (user_id, property_id),
key (property_id)
)
engine=innodb;
insert into users (username) values ('f00'),('bar'),('alpha'),('beta');
insert into properties (name) values ('age'),('gender');
insert into user_property_values values
(1,1,'30'),(1,2,'Male'),
(2,1,'24'),(2,2,'Female'),
(3,1,'18'),
(4,1,'26'),(4,2,'Male');
From a performance perspective the innodb clustered index works wonders in this similar example (COLD run):
select count(*) from product
count(*)
========
1,000,000 (1M)
select count(*) from category
count(*)
========
250,000 (250K)
select count(*) from product_category
count(*)
========
125,431,192 (125M)
select
c.*,
p.*
from
product_category pc
inner join category c on pc.cat_id = c.cat_id
inner join product p on pc.prod_id = p.prod_id
where
pc.cat_id = 1001;
0:00:00.030: Query OK (0.03 secs)
Properly indexing your database will be the number 1 way of improving performance. I once had a query take half an hour (on a large dataset, but nonetheless). Then we came to find out that the tables had no index. Once indexed, the query took less than 10 seconds.
Why do you need to have this table structure? My fundamental problem is that you are going to have to cast the data in value_of_property every time you want to use it. That is bad in my opinion - also, storing numbers as text is crazy given that it's all binary anyway. For instance, how are you going to have required fields? Or fields that need to have constraints based on other fields? E.g. start and end dates?
Why not simply have the properties as fields rather than some many to many relationship?
Have 1 flat table. When your business rules begin to show that properties should be grouped, then you can consider moving them out into other tables and having several 1:0-1 relationships with the users table. But this is not normalization and it will degrade performance slightly due to the extra join (however, the self-documenting nature of the table names will greatly aid any developers).
One way I regularly see database performance get totally castrated is by having a generic Id, Property Type, Property Name, Property Value table.
This is really lazy but exceptionally flexible, and it totally kills performance. In fact, on a new job where performance is bad I actually ask if they have a table with this structure - it invariably becomes the center point of the database and is slow. The whole point of relational database design is that the relations are determined ahead of time. This is simply a technique that aims to speed up development at a huge cost to application speed. It also puts a huge reliance on business logic in the application layer to behave - which is not defensive at all. Eventually you find that you want to use properties in a key relationship, which leads to all kinds of casting on the join and further degrades performance.
If data has a 1:1 relationship with an entity then it should be a field on the same table. If your table gets to more than 30 fields wide then consider moving them into another table, but don't call it normalisation because it isn't. It is a technique to help developers group fields together at the cost of performance in an attempt to aid understanding.
I don't know if MySQL has an equivalent, but SQL Server 2008 has sparse columns - null values take no space.
Sparse column datatypes
I'm not saying an EAV approach is always wrong, but I think using a relational database for this approach is probably not the best choice.