I store a set of nodes in one MySQL table (table1) and the edges in another (table2). Nodes have primary keys, and the edges reference them as foreign keys.
**table1**
id label
1 node1
2 node2
3 node3
**table2**
FK_first FK_second rel
1 3 guardian
2 1 guardian
1 3 times
I know the DB design is not perfect, but it's simple...
Now I want the number of 'rel' entries for every node, so I do a query like:
SELECT
label,
COUNT( rel ) as freq
FROM
`table1`
LEFT JOIN table2 ON (id=FK_first OR id=FK_second)
GROUP BY label
ORDER BY freq DESC
I have about 1000 nodes and 2000 edges. A query joining on only one key, e.g. ON (id=FK_first), is way faster (<1 sec), while the query with the OR condition needs about 6 sec, which is very slow.
I would appreciate some comments to speed this up a bit :-)
LEFT JOIN table2 ON (id=FK_first OR id=FK_second) ~6 sec
LEFT JOIN table2 ON (id=FK_first) ~0.16 sec
LEFT JOIN table2 ON (id=FK_second) ~0.16 sec
LEFT JOIN table2 ON id IN (FK_first,FK_second) ~6 sec
EXPLAIN 1:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE table1 ALL NULL NULL NULL NULL 2571 Using temporary; Using filesort
1 SIMPLE table2 ALL FK_first,FK_second,FK_first_2 NULL NULL NULL 3858
EXPLAIN 2:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE table1 index NULL PRIMARY 2 NULL 2571 Using index; Using temporary; Using filesort
1 SIMPLE table2 ref FK_first,FK_first_2 FK_first_2 4 table1.id 1
Try doing two joins and moving the "OR" logic into the COUNT() function:
For every row, this joins table2 once on FK_first, then again on FK_second (if it is not already joined to that row via FK_first). The COUNT then counts only the rows where at least one of the two joins produced a non-null rel value.
SELECT
label,
COUNT( COALESCE(table2A.rel, table2B.rel) ) as freq -- || is logical OR in MySQL by default, so COALESCE is used to pick either join's rel
FROM
`table1`
LEFT JOIN
table2 as table2A
ON id=table2A.FK_first
LEFT JOIN
table2 as table2B
ON id=table2B.FK_second
AND table2A.FK_first != table2B.FK_first
GROUP BY label
ORDER BY freq DESC
I have an 80,000-row table with result numbers between 130000000 and 168000000; the results are paired using the field pid. I need to change the status of rows from 'G' to 'X' where the result pair has a difference of 4300000.
I have come up with the query below, which works but is very slow. Can it be improved for speed?
UPDATE table1 SET status = 'X'
WHERE id IN (
SELECT id FROM (
SELECT a.id AS id FROM table1 a, table1 b
WHERE a.result = b.result + 4300000
AND a.pid = b.pid
AND a.result between 130000000 and 168000000
AND a.status = 'G'
) AS c
);
The indexes are (SHOW INDEX output):
Table Non_unique Key_name Seq_in_index Column_name Collation Cardinality Sub_part Packed Null Index_type
table1 0 PRIMARY 1 id A 80233 NULL NULL BTREE
table1 1 id 1 id A 80233 NULL NULL BTREE
table1 1 id 2 result A 80233 NULL NULL BTREE
table1 1 id 3 status A 80233 4 NULL YES BTREE
table1 1 id 4 name A 80233 32 NULL BTREE
table1 1 id 5 pid A 80233 16 NULL BTREE
Using a subquery inside the IN (...) clause is generally inefficient in MySQL. Instead, you can rewrite the UPDATE with the multi-table UPDATE ... JOIN syntax and a self-join:
UPDATE table1 AS a
JOIN table1 AS b
ON b.pid = a.pid
AND b.result = a.result - 4300000
SET a.status = 'X'
WHERE a.result between 130000000 and 168000000
AND a.status = 'G'
For good performance (and if I understand NLJ, the nested-loop join, correctly), you need two indexes: (status, result) and (pid).
The first (composite) index will be used to select rows for the table alias a. Since there is a range condition on result, it is better to put status first; if result came first, MySQL would stop using the index at the result column because of the range condition.
The second index will be used for lookups into the joined table alias b by the NLJ algorithm.
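A minimal sketch of creating the suggested indexes (the index names are illustrative):
ALTER TABLE table1
  ADD INDEX idx_status_result (status, result), -- status first: equality column before the range column
  ADD INDEX idx_pid (pid); -- lookup index for the self-joined alias b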
Between these 2 queries, is one faster than the other in MySQL 5.7.19?
select {some columns}
from
table1 t1
join table2 t2 using ({some columns})
where t1.col1=1
vs
select {some columns}
from
(select {some columns} from table1 where col1=1) t1
join table2 t2 using ({some columns})
assuming that all indexes are correctly set
I've created a SQL Fiddle so we can experiment.
Your first query translates to:
select *
from
table1 t1
join table2 t2 on t2.table1_id = t1.id
where t1.col1=1
And the execution plan is:
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE t1 ref id,col1 col1 4 const 2 100.00 Using index
1 SIMPLE t2 ref table1_id table1_id 4 db_9_0005cd.t1.id 1 100.00 Using index
This is pretty much as fast as it can possibly be.
Your second query becomes
select *
from
(select * from table1 where col1=1) as t1
join table2 t2 on t2.table1_id = t1.id
And the execution plan is:
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY t2 index table1_id table1_id 4 NULL 3 100.00 Using index
1 PRIMARY <derived2> ref <auto_key0> <auto_key0> 4 db_9_0005cd.t2.table1_id 2 100.00 NULL
2 DERIVED table1 ref col1 col1 4 const 2 100.00 Using index
The difference here is that you're using a derived table, but it's still using the index. My expectation is that this would perform as quickly as version 1, as long as the database is not resource constrained -- if you're bumping up against memory or CPU limits, the second query may behave slightly less predictably.
However...
The theoretical approach is no substitute for having a test environment with test data, and tuning this thing in representative conditions. I doubt that the real query you're building will be as simple as the examples...
For that simple pair of queries, the one involving the "derived table" (subquery) will definitely be no faster.
There are other cases where a derived table can be faster, such as when a GROUP BY or LIMIT inside it decreases the number of rows before the JOIN is done.
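As a sketch of that situation (reusing the fiddle schema above; the cnt alias is illustrative): aggregating table2 in a derived table first means the outer join handles one row per table1_id rather than every table2 row:
SELECT t1.id, d.cnt
FROM (
    SELECT table1_id, COUNT(*) AS cnt
    FROM table2
    GROUP BY table1_id -- collapse table2 to one row per id before joining
) AS d
JOIN table1 t1 ON t1.id = d.table1_id
WHERE t1.col1 = 1;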
I've got a composite key table CUSTOMER_PRODUCT_XREF
__________________________________________________________________
|CUSTOMER_ID (PK NN VARCHAR(191)) | PRODUCT_ID(PK NN VARCHAR(191))|
-------------------------------------------------------------------
In my batch program I need to select 500 updated customers, get the PRODUCT_IDs purchased by each CUSTOMER as a comma-separated list, and update our SOLR index. In my query I select 500 customers and do a left join to CUSTOMER_PRODUCT_XREF:
SELECT
customer.*, group_concat(xref.PRODUCT_ID separator ', ')
FROM
CUSTOMER customer
LEFT JOIN CUSTOMER_PRODUCT_XREF xref ON customer.CUSTOMER_ID=xref.CUSTOMER_ID
group by customer.CUSTOMER_ID
LIMIT 500;
EDIT: EXPLAIN QUERY
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE customer ALL PRIMARY NULL NULL NULL 74236 Using where; Using temporary; Using filesort
1 SIMPLE xref index NULL PRIMARY 1532 NULL 121627 Using where; Using index; Using join buffer (Block Nested Loop)
I got a 'lost connection' exception after the above query had been running for 20 minutes.
I tried the following (correlated subquery) instead and it took 1.7 seconds to get the result, which is still slow.
SELECT
customer.*, (SELECT group_concat(PRODUCT_ID separator ', ')
FROM CUSTOMER_PRODUCT_XREF xref
WHERE customer.CUSTOMER_ID=xref.CUSTOMER_ID
GROUP BY customer.CUSTOMER_ID)
FROM
CUSTOMER customer
LIMIT 500;
EDIT: EXPLAIN QUERY produces
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY customer ALL NULL NULL NULL NULL 74236 NULL
2 DEPENDENT SUBQUERY xref index NULL PRIMARY 1532 NULL 121627 Using where; Using index; Using temporary; Using filesort
Question
CUSTOMER_PRODUCT_XREF already has both columns set as PRIMARY KEY and NOT NULL, so why is my query still very slow? I thought having a primary key on a column was enough to build an index for it. Do I need further indexing?
DATABASE INFO:
All the IDs in my database are VARCHAR(191) because the IDs can contain letters.
I'm using the utf8mb4_unicode_ci character encoding.
I'm using SET group_concat_max_len := @@max_allowed_packet to get the maximum number of product_ids for each customer. I prefer using group_concat in one main query so that I don't have to run multiple separate queries to get the products for each customer.
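For reference, that setting as a runnable statement (a sketch; session scope assumed):
SET SESSION group_concat_max_len = @@max_allowed_packet; -- allow GROUP_CONCAT results up to the packet limit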
Your original version of the query does the join first and then sorts all the resulting data -- which is probably pretty big given how large the fields are.
You can "fix" that version by selecting the 500 customers first and then doing the join:
SELECT c.*, group_concat(xref.PRODUCT_ID separator ', ')
FROM (select c.*
from CUSTOMER c
order by c.customer_id
limit 500
) c LEFT JOIN
CUSTOMER_PRODUCT_XREF xref
ON c.CUSTOMER_ID=xref.CUSTOMER_ID
group by c.CUSTOMER_ID ;
An alternative that might or might not have a big impact would be to do the aggregation per customer in a subquery and join on that, as in:
SELECT c.*, xref.products
FROM (select c.*
from CUSTOMER c
order by c.customer_id
limit 500
) c LEFT JOIN
(select customer_id, group_concat(xref.PRODUCT_ID separator ', ') as products
from CUSTOMER_PRODUCT_XREF xref
group by customer_id
) xref
ON c.CUSTOMER_ID=xref.CUSTOMER_ID;
What you have discovered is that the MySQL optimizer does not recognize this situation (where the limit has a big impact on performance). Some other database engines do a better job of optimization in this case.
Alright, the speed of the queries in my question shot up when I created an index just on CUSTOMER_ID in the CUSTOMER_PRODUCT_XREF table. (Apparently the composite primary key leads with PRODUCT_ID, so it could not be used for lookups on CUSTOMER_ID alone.)
So I've got two indexes now:
PRIMARY_KEY_INDEX on PRODUCT_ID and CUSTOMER_ID
CUSTOMER_ID_INDEX on CUSTOMER_ID
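A minimal sketch of adding that secondary index, using the name from the list above:
ALTER TABLE CUSTOMER_PRODUCT_XREF
  ADD INDEX CUSTOMER_ID_INDEX (CUSTOMER_ID); -- lets the join seek by CUSTOMER_ID instead of scanning the PK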
I want to optimize this query; statistics for the tables involved are given below.
The products and products_categories tables have around 500,000 records each, but the category mentioned below has about 1,600 records, for which I have created slots. Every product has at least 1 and at most 10 slots, and the slot table has around 300,000 records, including already-expired slots. I want the products that are going to expire soon to come first, with the rest of the products after them.
I have created an index on the end_time column, but because of the conditional operator the index is not used in this query. Kindly tell me the best way to optimize it.
EXPLAIN
SELECT
xcart_products.*
FROM xcart_products
INNER JOIN xcart_products_categories
ON xcart_products_categories.productid = xcart_products.productid
LEFT JOIN (SELECT
t1.*
FROM bvira_megahour_time_slot t1
LEFT OUTER JOIN bvira_megahour_time_slot t2
ON (t1.product_id = t2.product_id
AND t1.end_time > t2.end_time
AND t1.end_time > NOW())
WHERE t2.product_id IS NULL) as bvira_megahour_time_slot
ON bvira_megahour_time_slot.product_id = xcart_products.productid
WHERE xcart_products_categories.categoryid = '4410'
AND xcart_products.saleid = 2
GROUP BY xcart_products.productid
Below is the result of explain query.
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY xcart_products_categories ref PRIMARY,cpm,productid,orderby,pm cpm 4 const 1523 Using index; Using temporary; Using filesort
1 PRIMARY xcart_products eq_ref PRIMARY,saleid PRIMARY 4 wwwbvira_xcart.xcart_products_categories.productid 1 Using where
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 77215
2 DERIVED t1 ALL NULL NULL NULL NULL 398907
2 DERIVED t2 ref i_product_id,i_end_time i_product_id 4 wwwbvira_xcart.t1.product_id 4 Using where; Not exists
I have rewritten your query as follows:
EXPLAIN
SELECT p.*
FROM xcart_products p
INNER JOIN xcart_products_categories c
ON c.productid = p.productid
LEFT JOIN (
SELECT t1.*
FROM bvira_megahour_time_slot t1
LEFT JOIN bvira_megahour_time_slot t2
ON ( t1.product_id = t2.product_id
AND t1.end_time > t2.end_time
AND t1.end_time > NOW()
)
WHERE t2.product_id IS NULL
) AS bvira_megahour_time_slot
ON bvira_megahour_time_slot.product_id = p.productid
WHERE c.categoryid = '4410'
AND p.saleid = 2
GROUP BY p.productid
Please make sure that you have the following compound (multi-column) indexes:
bvira_megahour_time_slot: (product_id, end_time)
xcart_products: (productid, saleid)
xcart_products_categories: (productid, categoryid) or (categoryid, productid)
With these indexes, it should work much better.
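A minimal sketch of creating them (the index names are illustrative):
ALTER TABLE bvira_megahour_time_slot ADD INDEX idx_slot_product_end (product_id, end_time);
ALTER TABLE xcart_products ADD INDEX idx_product_sale (productid, saleid);
ALTER TABLE xcart_products_categories ADD INDEX idx_category_product (categoryid, productid);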
I have a Link table with from_uid and to_uid (both indexed) and I want to filter out certain ids. So I do:
SELECT l.uid
FROM Link l
JOIN filter_ids t1 ON l.from_uid = t1.id
JOIN filter_ids t2 ON l.to_uid = t2.id
Now for some reason this is unexpectedly slow :( whereas each individual join is very fast. Can it not use the indexes properly?
EXPLAIN tells me:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t1 index Null PRIMARY 34 Null 12205 Using index
1 SIMPLE l ref from_uid,to_uid from_uid 96 func 6 Using where
1 SIMPLE t2 index Null PRIMARY 34 Null 12205 Using where; Using index; Using join buffer
No idea if it'll help but try:
select l.uid
from Link l
where l.from_uid in (select id from filter_ids)
and l.to_uid in (select id from filter_ids)
Maybe it will make better use of the indexes.
The EXPLAIN tells you that the JOIN actually starts from the t1 table. That is, you need to add a new index on Link (or, better, extend the current from_uid index):
(from_uid, to_uid, uid)
or if uid is the primary key, just:
(from_uid, to_uid)
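A minimal sketch, assuming the existing index is named from_uid as in the EXPLAIN above and that uid is not the primary key:
ALTER TABLE Link
  DROP INDEX from_uid,
  ADD INDEX idx_from_to_uid (from_uid, to_uid, uid); -- covers the whole query, so no row lookups are needed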
UPD
What you are describing is strange. You can try forcing the join order with STRAIGHT_JOIN, which tells MySQL to join the tables in the order they are listed:
SELECT STRAIGHT_JOIN l.uid
FROM Link l
JOIN filter_ids t1 ON l.from_uid = t1.id
JOIN filter_ids t2 ON l.to_uid = t2.id