Why mysql cannot join 2 big / large table? - mysql

Here's my problem all :
I have 2 big table call it A n B.
If I join that's 2 table with a very simple query like this example :
SELECT COUNT(*) FROM lib_judul, lib_buku
Then mysql process is not over yet, I don't know why. Table A have 158,670 records (33,6 MB) and Table B have 130,028 records (34,6 MB). I think myquery is right, cause I've try before to join table A with table C (the very smaller table one) and it's run well.
What should I do to do this?

You have implicit CROSS JOIN in your code which creates full Cartesian Product of the two tables. It creates a new table with 158,670 times 130,028 rows. This is more than 20 billion (20,631,542,760) records.

It's because there is no common field for both of the tables. Try using Explicit Join just like below:
SELECT
COUNT(*)
FROM lib_judul A
JOIN lib_buku B ON A.id=B.id

The cost of your query maybe is too large. Your query have cost = 158,670 x 130,028 = 20,631,542,760 I/O.
The query execution plan will execute join first, then select the column.
Know your need. May be you can add some "where condition" before you join it. Example:
this query: SELECT
COUNT(*)
FROM lib_judul A, lib_buku B
WHERE B.id = 1 AND B.id = A.id
can be optimized like this:
SELECT * FROM
(SELECT * FROM lib_judul) A
JOIN
(SELECT * FROM lib_buku WHERE lib_buku.id = 1) B
ON B.id = A.id

Related

Selecting single max values

Say I need to pull data from several tables like so:
item 1 - from table 1
item 2 - from table 1
item 3 - from table 1 - but select only max value of item 3 from table 1
item 4 - from table 2 - but select only max value of item 4 from table 2
My query is pretty simple:
select
a.item 1,
a.item 2,
b.item 3,
c.item 4
from table 1 a
left join (select b.key_item, max(item 3) from table 1, group by key_item) b on a.key_item = b.key_item
left join (select c.key_item, max(item 4) from table 2, group by key_item) c on c.key_item = a.key_item
I am not sure if my methodology of pulling just a single max item from a table is the most efficient. Assume both tables are over a million rows. my actual sql run forever using this sql setup.
EDIT: I changed the group by clause to reflect comments made. I hope it makes a bit of sense now?
Your best bet is to add an index on table1 and table2, as follows:
ALTER TABLE table1
ADD INDEX `GoodIndexName1` (`key_item`,`item3`)
ALTER TABLE table2
ADD INDEX `GoodIndexName2` (`key_item`,`item4`)
This will allow you to use queries as described in the MySQL documentation for finding the rows holding the group-wise maximum, which appears to be what you are looking for.
Your original (edited) query should work:
select
a.item1,
a.item2,
b.item3,
c.item4
from table1 a
LEFT OUTER JOIN (
SELECT
b.key_item,
MAX(item3) AS item3
FROM table1
GROUP BY key_item
) b
ON a.key_item = b.key_item
LEFT OUTER JOIN (
SELECT
c.key_item,
MAX(item4)
FROM table2
GROUP BY key_item
) c
ON c.key_item = a.key_item
and if that performs slowly after adding the indexes, try the following too:
SELECT
a.item1,
a.item2,
b.item3,
c.item4
FROM table1 a
LEFT OUTER JOIN table1 b
ON b.key_item = a.key_item
LEFT OUTER JOIN table1 larger_b
ON larger_b.key_item = b.key_item
AND larger_b.item3 > b.item_3
LEFT OUTER JOIN table2 c
ON c.key_item = a.key_item
LEFT OUTER JOIN table2 larger_c
ON larger_c.key_item = c.key_item
AND larger_c.item4 > c.item4
WHERE larger_b.key_item IS NULL
AND larger_c.key_item IS NULL
(I have modified the table and column names only slightly, so that they conform to correct MySQL syntax. )
I work with queries that use the above structure all the time, and they perform very efficiently with indexes like the one I provided.
That said, usually I am using INNER JOINs on the b and c tables, but I don't see why your query should have any issues.
If you do experience performance problems still, report the data types of the key_item columns for each table, as if you try to join on different data types, you will generally get poor performance.

Query for self join in MySQL

I have a query which gets the correct result but it is taking 5.5 sec to get the output.. Is there any other way to write a query for this -
SELECT metricName, metricValue
FROM Table sm
WHERE createdtime = (
SELECT MAX(createdtime)
FROM Table b
WHERE sm.metricName = b.metricName
AND b.sinkName='xx'
)
AND sm.sinkName='xx'
In your code, the subselect has to be run for every result row of the outer query, which should be quite expensive. Instead, you could select your filter data in a separate query and join both accordingly:
SELECT `metricName`, `metricValue` FROM Table sm
INNER JOIN (SELECT max(`createdtime`) AS `maxTime, `metricName` from Table b WHERE b.sinkName='xx' GROUP BY `metricName` ) filter
ON (sm.`createdtime` = filter.`maxTime`) AND ( sm.`metricName` = filter.`metricName`)
WHERE sm.sinkName='xx'

How to optimize this DB operation?

I'm quite sloppy with databases, can't get this working with joins, and I'm not even sure that would be faster...
DELETE FROM atable
WHERE btable_id IN (SELECT id
FROM btable
WHERE param > 2)
AND ctable_id IN (SELECT id
FROM ctable
WHERE ( someblob LIKE '%_ID1_%'
OR someblob LIKE '%_ID2_%' ))
Atable contains ~19M rows, this would delete ~3M of that. At the moment, I can only run the query with LIMIT 100000, and I don't want to sit here with phpmyadmin all day, because each deletion (of 100.000 rows) runs for about 1.5 mins.
Any ways to speed this up / automate it?
MySQL 5.5
(do you think it's already bad DB design if any table contains 20M rows?)
Use EXISTS or JOIN instead of IN to improve perfromance
Using EXISTS:
DELETE FROM Atable A
WHERE EXISTS (SELECT 1 FROM Btable B WHERE A.Btable_id = B.id AND B.param > 2) AND
EXISTS (SELECT 1 FROM Ctable C WHERE A.Ctable_id = C.id AND (C.someblob LIKE '%_ID1_%' OR C.someblob LIKE '%_ID2_%'))
Using JOIN:
DELETE A
FROM Atable A
INNER JOIN Btable B ON A.Btable_id = B.id AND B.param > 2
INNER JOIN Ctable C WHERE A.Ctable_id = C.id AND (C.someblob LIKE '%_ID1_%' OR C.someblob LIKE '%_ID2_%')
first you should try with exist instead of in. it's faster in many many case.
Then you could try to do inner join instead of in and exists.
Example :
delete a
from a
inner join b on b.id = a.tablebid
And finally if it could be possible (i don't know if you have id3, ids) to change the or by something else. Sometimes strange and complicated change helps the optimizer. case when, subquery...
I don't see where a simple index would help much. I'd do:
delete from atable where id in (
select
id
from
atable a
join btable b on a.btable_id = b.id
join ctable c on a.ctable_id = c.id
where
b.param > 2
and (
c.someblob LIKE '%_ID1_%'
OR c.someblob LIKE '%_ID2_%'
)
)
Correction: I'm assuming you've got indexes on btable and ctable's id's (probably, if they're primary keys...) and on b.param (if it's numeric).
Beside optimizing the query you could also take a look at a good use of indexes, since they might prevent a full table scan.
For BTable for example create an index on id and param.
To explain why this helps:
If the database has to look up the id and param values in the table in a unsorted manner, the database has to read ALL rows. If the database reads the index, SORTED, it can look up the id and param with reduced costs.

Searching on multi (1-n) relation tables

I learned the hard way that i shouldn't store serialized data in a table when i need to make it searchable .
So i made 3 tables the base & two 1-n relation tables .
So here is the query i get if i want to select a specific activity .
SELECT
jdc_organizations_activities.id
FROM
jdc_activity_sector ,
jdc_activity_type
INNER JOIN jdc_organizations_activities ON jdc_activity_type.activityId = jdc_organizations_activities.id
AND
jdc_activity_sector.activityId = jdc_organizations_activities.id
WHERE
jdc_activity_sector.activitySector = 5 AND
jdc_activity_type.activityType = 3
Questions :
1- What kind of indexes can i add on a 1-n relation table , i already have a unique combination of (activityId - activitySector) & (activityId - activityType)
2- Is there a better way to write the query to have a better performance ?
Thank you !
I would re-organise the query to avoid the cross product caused by using , notation.
Also, you are effectively only using the sector and type tables as filters. So put activity table first, and then join on your other tables.
Some may suggest that; the first join should ideally be the join which is most likely to restrict your results the most, leaving the minimal amount of work to do in the second join. In reality, the sql engine can actually re-arrange your query when generateing a plan, but it does help to think this way to help you think about the efforts the sql engine are having to go to.
Finally, there are the indexes on each table. I would actually suggest reversing the Indexes...
- ActivitySector THEN ActivityId
- ActivityType THEN ActivityId
This is specifically because the sql engine is manipulating your query. It can take the WHERE clause and say "only include records from the Sector table where ActivitySector = 5", and similarly for the Type table. By having the Sector and Type identifies FIRST in the index, this filtering of the tables can be done much faster, and then the joins will have much less work to do.
SELECT
[activity].id
FROM
jdc_organizations_activities AS [activity]
INNER JOIN
jdc_activity_sector AS [sector]
ON [activity].id = [sector].activityId
INNER JOIN
jdc_activity_type AS [type]
ON [activity].id = [type].activityId
WHERE
[sector].activitySector = 5
AND [type].activityType = 3
Or, because you don't actually use the content of the Activity table...
SELECT
[sector].activityId
FROM
jdc_activity_sector AS [sector]
INNER JOIN
jdc_activity_type AS [type]
ON [sector].activityId = [type].activityId
WHERE
[sector].activitySector = 5
AND [type].activityType = 3
Or...
SELECT
[activity].id
FROM
jdc_organizations_activities AS [activity]
WHERE
EXISTS (SELECT * FROM jdc_activity_sector WHERE activityId = [activity].id AND activitySector = 5)
AND EXISTS (SELECT * FROM jdc_activity_type WHERE activityId = [activity].id AND activityType = 3)
I would advise against mixing old style from table1, table2 and new style from table1 inner join table2 ... in a single query. And you can alias tables using table1 as t1, shortening long table names to an easy to remember mnenomic:
select a.id
from jdc_organizations_activities a
join jdc_activity_sector as
on as.activityId = a.Id
join jdc_activity_type as at
on at.activityId = a.Id
where as.activitySector = 5
and at.activityType = 3
Or even more readable using IN:
select a.id
from jdc_organizations_activities a
where a.id in
(
select activityId
from jdc_activity_sector
where activitySector = 5
)
and a.id in
(
select activityId
from jdc_activity_type
where activityType = 3
)

Is there a way to force MySQL execution order?

I know I can change the way MySQL executes a query by using the FORCE INDEX (abc) keyword. But is there a way to change the execution order?
My query looks like this:
SELECT c.*
FROM table1 a
INNER JOIN table2 b ON a.id = b.table1_id
INNER JOIN table3 c ON b.itemid = c.itemid
WHERE a.itemtype = 1
AND a.busy = 1
AND b.something = 0
AND b.acolumn = 2
AND c.itemid = 123456
I have a key for every relation/constraint that I use. If I run explain on this statement I see that mysql starts querying c first.
id select_type table type
1 SIMPLE c ref
2 SIMPLE b ref
3 SIMPLE a eq_ref
However, I know that querying in the order a -> b -> c would be faster (I have proven that)
Is there a way to tell mysql to use a specific order?
Update: That's how I know that a -> b -> c is faster.
The above query takes 1.9 seconds to complete and returns 7 rows. If I change the query to
SELECT c.*
FROM table1 a
INNER JOIN table2 b ON a.id = b.table1_id
INNER JOIN table3 c ON b.itemid = c.itemid
WHERE a.itemtype = 1
AND a.busy = 1
AND b.something = 0
AND b.acolumn = 2
HAVING c.itemid = 123456
the query completes in 0.01 seconds (Without using having I get 10.000 rows).
However that is not a elegant solution because this query is a simplified example. In the real world I have joins from c to other tables. Since HAVING is a filter that is executed on the entire result it would mean that I would pull some magnitues more records from the db than nescessary.
Edit2: Just some information:
The variable part in this query is c.itemid. Everything else are fixed values that don't change.
Indexes are setup fine and mysql chooses the right ones for me
between a and b there is a 1:n relation (index PRIMARY is used)
between b and c there is a many to many relation (index IDX_ITEMID is used)
the point is that mysql should start querying table a and work it's way down to c and not the other way round. Any change to achive that.
Solution: Not exactly what I wanted but this seems to work:
SELECT c.*
FROM table1 a
INNER JOIN table2 b ON a.id = b.table1_id
INNER JOIN table3 c ON b.itemid = c.itemid
WHERE a.itemtype = 1
AND a.busy = 1
AND b.something = 0
AND b.acolumn = 2
AND c.itemid = 123456
AND f.id IN (
SELECT DISTINCT table2.id FROM table1
INNER JOIN table2 ON table1.id = table2.table1_id
WHERE table1.itemtype = 1 AND table1.busy = 1)
Perhaps you need to use STRAIGHT_JOIN.
http://dev.mysql.com/doc/refman/5.0/en/join.html
STRAIGHT_JOIN is similar to JOIN, except that the left table is always read before the right table. This can be used for those (few) cases for which the join optimizer puts the tables in the wrong order.
You can use FORCE INDEX to force the execution order, and I've done that before.
If you think about it, there's usually only one order you could query tables in for any index you pick.
In this case, if you want MySQL to start querying a first, make sure the index you force on b is one that contains b.table1_id. MySQL will only be able to use that index if it's already queried a first.
You can try rewriting in two ways
bring some of the WHERE condition into JOIN
introduce subqueries even though they are not necessary
Both things might impact the planner.
First thing to check, though, would be if your stats are up to date.