How to optimize this Levenshtein distance calculation - mysql

Table a has around 8,000 rows and table b has around 250,000 rows. Without the levenshtein function the query takes just under 2 seconds. With the function included it is taking about 25 minutes.
SELECT
*
FROM
library a,
classifications b
WHERE
a.`release_year` = b.`year`
AND a.`id` IS NULL
AND levenshtein_ratio(a.title, b.title) > 82

I'm assuming that levenshtein_ratio is a function that you wrote (or maybe included from somewhere else). If so, the database server would not be able to optimize that in the normal sense of using an index. So it means that it simply needs to call it for each record that results from the other join conditions. With an inner join, that could be an extremely large number with those table sizes (a maximum of 8000*250000 = 2 billion). You can check the total number of times it would need to be called with this:
SELECT
count(*)
FROM
library a,
classifications b
WHERE
a.`release_year` = b.`year`
AND a.`id` IS NULL
That is an explanation of why it is slow (not really an answer to the question of how to optimize it). To optimize it, you likely need to add additional limiting factors to the join condition to reduce the number of calls to the user-defined function.
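One cheap limiting factor is the title length: an edit distance can never be smaller than the difference in length between the two strings, so many pairs can be rejected without running the expensive function at all. The sketch below illustrates this in Python rather than SQL. The `ratio` formula is an assumption (the question does not show how `levenshtein_ratio` is defined), and `can_reach` and `matches` are hypothetical helper names.

```python
def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]


def ratio(s: str, t: str) -> float:
    """ASSUMED definition: 100 * (1 - distance / length of the longer string)."""
    longest = max(len(s), len(t))
    return 100.0 if longest == 0 else 100.0 * (1 - levenshtein(s, t) / longest)


def can_reach(s: str, t: str, threshold: float) -> bool:
    """Cheap pre-check: distance >= |length difference|, so the ratio
    cannot exceed 100 * (1 - |length difference| / longer length)."""
    longest = max(len(s), len(t))
    if longest == 0:
        return True
    return 100.0 * (1 - abs(len(s) - len(t)) / longest) > threshold


def matches(titles_a, titles_b, threshold=82):
    """Return matching pairs and how many expensive calls were needed."""
    calls, found = 0, []
    for a in titles_a:
        for b in titles_b:
            if not can_reach(a, b, threshold):
                continue  # rejected by the length bound, no distance computed
            calls += 1
            if ratio(a, b) > threshold:
                found.append((a, b))
    return found, calls


found, calls = matches(
    ["The Matrix"],
    ["The Matrix", "The Matrlx", "Up", "The Matrix Reloaded"],
)
print(found, calls)
```

Here only two of the four candidate pairs reach the distance computation. In MySQL a similar bound could presumably be written as an extra join condition on `CHAR_LENGTH(a.title)` and `CHAR_LENGTH(b.title)`, but the exact cut-off depends on how the real `levenshtein_ratio` is defined.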

You are giving too little information to actually help you.
1) My first guess would be to try to create other WHERE conditions that reduce the amount of rows to be scanned.
2) If that is not possible...Given that the titles from table library and classifications are known, one idea would be to create a table where all the data is already calculated like this:
TABLE levenshtein_ratio
id_table_library
id_table_classifications
precalculated_levenshtein_ratio
so you would populate the table using this query:
insert into levenshtein_ratio select a.id, b.id, levenshtein_ratio(a.title, b.title) from library a, classifications b
and then your query would be:
SELECT
*
FROM
library a LEFT JOIN
classifications b ON a.`release_year` = b.`year`
LEFT JOIN levenshtein_ratio c ON c.id_table_library = a.id AND c.id_table_classifications = b.id
WHERE
a.`id` IS NULL
AND precalculated_levenshtein_ratio > 82
This query will then probably take no more than the original 2 seconds.
The problem with this solution is the fact that the data in tables a and b can change, so you will need to create a trigger to keep it updated.
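The precalculation idea can be tried out on a small scale with SQLite from Python, as a rough sketch only. The table name `title_ratio`, the sample rows and the ratio formula are all assumptions made for illustration, and the `a.id IS NULL` condition from the question is left out. Unlike the insert shown above, this version only precomputes pairs whose years match, which keeps the helper table much smaller than a full cross product.

```python
import sqlite3


def levenshtein(s, t):
    """Dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def levenshtein_ratio(s, t):
    """ASSUMED formula; the real user-defined function may differ."""
    longest = max(len(s), len(t))
    return 100.0 if longest == 0 else 100.0 * (1 - levenshtein(s, t) / longest)


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the name used in the question.
conn.create_function("levenshtein_ratio", 2, levenshtein_ratio)
conn.executescript("""
CREATE TABLE library (id INTEGER PRIMARY KEY, title TEXT, release_year INT);
CREATE TABLE classifications (id INTEGER PRIMARY KEY, title TEXT, year INT);
-- Hypothetical helper table holding one row per precomputed pair.
CREATE TABLE title_ratio (
    id_table_library INT,
    id_table_classifications INT,
    precalculated_levenshtein_ratio REAL,
    PRIMARY KEY (id_table_library, id_table_classifications)
);
INSERT INTO library VALUES (1, 'The Matrix', 1999), (2, 'Up', 2009);
INSERT INTO classifications VALUES
    (10, 'The Matrlx', 1999), (11, 'Fight Club', 1999), (12, 'Up!', 2009);
""")

# Pay the expensive cost once, and only for pairs that share a year.
conn.execute("""
INSERT INTO title_ratio
SELECT a.id, b.id, levenshtein_ratio(a.title, b.title)
FROM library a
JOIN classifications b ON a.release_year = b.year
""")

# Later lookups compare a stored number instead of calling the function.
rows = conn.execute("""
SELECT a.title, b.title
FROM title_ratio c
JOIN library a ON a.id = c.id_table_library
JOIN classifications b ON b.id = c.id_table_classifications
WHERE c.precalculated_levenshtein_ratio > 82
ORDER BY a.id, b.id
""").fetchall()
print(rows)
```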

Change your query to use proper joins (the explicit JOIN syntax has been standard since SQL-92).
Also, your entire Levenshtein condition can be moved into the join condition, which should give you a performance benefit:
SELECT *
FROM library a
JOIN classifications b
ON a.`release_year` = b.`year`
AND levenshtein_ratio(a.title, b.title) > 82
WHERE a.`id` IS NULL
Also, make sure there's an index on b.year:
create index b_year on classifications(year);

Related

SQL query takes too much time (3 joins)

I'm facing an issue with an SQL query. I'm developing a PHP website, and to avoid making too many queries, I prefer to make one big query like this:
select m.*, cj.*, cjb.*, me.pseudo as pseudo_acheteur
from mercato m
JOIN cartes_joueur cj
ON m.ID_carte = cj.ID_carte_joueur
JOIN cartes_joueur_base cjb
ON cj.ID_carte_joueur_base = cjb.ID_carte_joueur_base
JOIN membres me
ON me.ID_membre = cj.ID_membre
where not exists (select * from mercato_encheres me where me.ID_mercato = m.ID_mercato)
and cj.ID_membre = 2
and m.status <> 'cancelled'
ORDER BY total_carac desc, cj.level desc, cjb.nom_carte asc
This should return all cards sold by the member without any bet on it. In the result, I need all the information to display them.
Here is the approximate rows in each table :
mercato : 1200
cartes_joueur : 800 000
carte_joueur_base : 62
membres : 2000
mercato_enchere : 15 000
I tried to reduce them (in dev environment) by deleting old data; but the query still needs 10~15 seconds to execute (which is way too long on a website )
Thanks for your help.
Let's take a look.
1) The use of * in SELECT clauses is harmful to query performance. Why? It's wasteful. It needlessly adds to the volume of data the server must process, and in the case of JOINs, can force the processing of columns with duplicate values. If you possibly can do so, try to enumerate the columns you need.
2) You may not have useful indexes on your tables for accelerating this. We can't tell. Please notice that MySQL usually uses at most one index per table in a query, so to make a query fast you often need a well-chosen compound index. I suggest you try defining the index (ID_membre, ID_carte_joueur, ID_carte_joueur_base) on your cartes_joueur table. Why? Your query matches for equality on the first of those columns, and then uses the second and third columns in ON conditions.
3) I have often found that writing a query with the largest table (most rows) first helps me think clearly about optimizing. In your case your largest table is cartes_joueur and you are choosing just one ID_membre value from that table. Your clearest path to optimization is the knowledge that you only need to examine approximately 400 rows from that table, not 800 000. An appropriate compound index will make that possible, and it's easiest to imagine that index's columns if the table comes first in your query.
4) You have a correlated subquery -- this one.
where not exists (select *
from mercato_encheres me
where me.ID_mercato = m.ID_mercato)
MySQL's query planner can be stupidly literal-minded when it sees this, running it thousands of times. In your case it's even worse: it's got SELECT * in it: see point 1 above.
It should be refactored to use the LEFT JOIN ... IS NULL pattern. Here's how that goes.
select whatever
from mercato m
JOIN ...
JOIN ...
LEFT JOIN mercato_encheres mench ON mench.ID_mercato = m.ID_mercato
WHERE mench.ID_mercato IS NULL
and ...
ORDER BY ...
Explanation: The use of LEFT JOIN rather than an ordinary inner JOIN allows rows from the mercato table to be preserved in the output even when the ON condition does not match them to any rows in the mercato_encheres table. The mismatching rows get NULL values for the second table's columns. The mench.ID_mercato IS NULL condition in the WHERE clause then selects only the mismatching rows.
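To check that the two forms really select the same rows, here is a small sketch using SQLite from Python. The tables are cut down to the columns that matter, and the `ID_enchere` column and the sample data are made up for illustration. It only demonstrates that the results agree, not how MySQL will plan either query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mercato (ID_mercato INTEGER PRIMARY KEY, status TEXT);
-- ID_enchere is a hypothetical key column for the bids table.
CREATE TABLE mercato_encheres (ID_enchere INTEGER PRIMARY KEY, ID_mercato INT);
INSERT INTO mercato VALUES
    (1, 'open'), (2, 'open'), (3, 'cancelled'), (4, 'open');
-- Sale 2 has two bids, sale 3 has one, sales 1 and 4 have none.
INSERT INTO mercato_encheres VALUES (100, 2), (101, 2), (102, 3);
""")

# Original form: correlated NOT EXISTS subquery.
not_exists = conn.execute("""
SELECT m.ID_mercato
FROM mercato m
WHERE NOT EXISTS (SELECT 1 FROM mercato_encheres e
                  WHERE e.ID_mercato = m.ID_mercato)
  AND m.status <> 'cancelled'
ORDER BY m.ID_mercato
""").fetchall()

# Refactored form: LEFT JOIN, then keep only the rows that found no bid.
left_join = conn.execute("""
SELECT m.ID_mercato
FROM mercato m
LEFT JOIN mercato_encheres mench ON mench.ID_mercato = m.ID_mercato
WHERE mench.ID_mercato IS NULL
  AND m.status <> 'cancelled'
ORDER BY m.ID_mercato
""").fetchall()

print(not_exists, left_join)
```

Both queries return sales 1 and 4. Note that sale 2 has two bids, yet it cannot appear twice in the LEFT JOIN result, because every joined row for it has a non-NULL `mench.ID_mercato` and is filtered out.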

Mysql: Why is WHERE IN much faster than JOIN in this case?

I have a query with a long list (> 2000 ids) in a WHERE IN clause in mysql (InnoDB):
SELECT id
FROM table
WHERE user_id IN ('list of >2000 ids')
I tried to optimize this by using an INNER JOIN instead of the wherein like this (both ids and the user_id use an index):
SELECT table.id
FROM table
INNER JOIN users ON table.user_id = users.id WHERE users.type = 1
Surprisingly, however, the first query is much faster (by a factor of 5 to 6). Why is this the case? Could it be that the second query outperforms the first one when the number of ids in the WHERE IN clause becomes much larger?
This is not an answer to your question, but as an alternative to your first query you may get better performance by replacing the IN clause with EXISTS, since EXISTS often performs better than IN:
SELECT id
FROM table t
WHERE EXISTS (SELECT 1 FROM users WHERE t.user_id = users.id)
This is an unfair comparison between the 2 queries.
In the 1st query you provide a list of constants as the search criteria, so MySQL only has to open and search one table and/or one index.
In the 2nd query you instruct MySQL to obtain the list dynamically from another table and join that list back to the main table. It is also not clear whether indexes were used for the join or a full table scan was needed.
To have a fair comparison, time the query that you used to obtain the list in the 1st query along with the query itself. Or try
SELECT table.id FROM table WHERE user_id IN (SELECT users.id FROM users WHERE users.type = 1)
The above fetches the list of ids dynamically in a subquery.

MySQL creating temp table then join faster than left join

I have a LEFT JOIN that is very expensive:
    select X.c1, COUNT(Y.c3) from X LEFT JOIN Y on X.c1=Y.c2 group by X.c1;
After several minutes (20+), it still does not finish. But I want all rows in X. So I really do need a LEFT JOIN at some point.
It appears that I can hack my way around this to return the result set I am looking for by using a temp table in less than two minutes. I first trim down table Y so that it only contains rows in the join.
CREATE TEMPORARY TABLE IF NOT EXISTS table2 AS 
(select X.c1 as t, COUNT(Y.c2) as c from X
INNER JOIN Y where X.c1=Y.c2 group by X.c1);
select X.c1, table2.c from X 
LEFT JOIN table2 on X.c1 = table2.t; 
This finishes in under two minutes.
My questions are:
1) Are they equivalent?
2) Why is the second so much faster (why doesn't MySQL do this kind of optimization)? Meaning, do I need to do these kinds of rewrites myself in MySQL?
EDIT: additional info: C1, C2 are BIGINTS. C1 is unique but there can be many C2s that all point to the same C1. As far as I know, I have not indexed any tables. X.C1 is an _id column that Y.c2 refers to.
Try indexing X.c1 and Y.c2 and running your original query.
It's hard to tell why your 1st query runs slower without the indexes without comparing the query plans from both queries (you can get the query plan by running your queries with explain at the beginning) but I suspect it's because the 2nd table contains many rows that do not have a corresponding row in the 1st table.
If x.c1 is unique, then I would suggest writing the query as:
select X.c1,
(select COUNT(Y.c3)
from Y
where X.c1 = Y.c2
)
from X;
For this query, you want an index on Y(c2, c3).
The reason why a left join might take longer is if many rows do not match. In that case, the group by is aggregating by many more rows than it really needs to. And no, MySQL does not attempt this type of optimization.
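On the question of whether the two approaches are equivalent, a small experiment suggests they are close but not identical. The sketch below uses SQLite from Python with made-up rows, and the index name is hypothetical. It shows two differences: the temp-table version counts `Y.c2` instead of `Y.c3`, which matters if `c3` can be NULL, and it yields NULL rather than 0 for rows of X without a match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE X (c1 INTEGER PRIMARY KEY);
CREATE TABLE Y (c2 INT, c3 TEXT);
CREATE INDEX ix_y_c2_c3 ON Y (c2, c3);  -- index suggested in the answer
INSERT INTO X VALUES (1), (2), (3);
-- X.c1 = 1 has one row with a NULL c3; X.c1 = 3 has no rows in Y at all.
INSERT INTO Y VALUES (1, 'a'), (1, NULL), (2, 'b');
""")

# The original query from the question.
original = conn.execute("""
SELECT X.c1, COUNT(Y.c3) FROM X
LEFT JOIN Y ON X.c1 = Y.c2
GROUP BY X.c1 ORDER BY X.c1
""").fetchall()

# The correlated-subquery form suggested in the answer.
via_subquery = conn.execute("""
SELECT X.c1, (SELECT COUNT(Y.c3) FROM Y WHERE X.c1 = Y.c2)
FROM X ORDER BY X.c1
""").fetchall()

# The temp-table workaround from the question.
conn.execute("""
CREATE TEMPORARY TABLE IF NOT EXISTS table2 AS
SELECT X.c1 AS t, COUNT(Y.c2) AS c FROM X
INNER JOIN Y ON X.c1 = Y.c2 GROUP BY X.c1
""")
via_temp = conn.execute("""
SELECT X.c1, table2.c FROM X
LEFT JOIN table2 ON X.c1 = table2.t ORDER BY X.c1
""").fetchall()

# COALESCE repairs the NULL, but not the COUNT(c2) vs COUNT(c3) difference.
via_temp_fixed = conn.execute("""
SELECT X.c1, COALESCE(table2.c, 0) FROM X
LEFT JOIN table2 ON X.c1 = table2.t ORDER BY X.c1
""").fetchall()

print(original, via_subquery, via_temp, via_temp_fixed)
```

If `c3` is never NULL, then counting `c3` in the temp table and wrapping the result in `COALESCE(..., 0)` should make the workaround return the same rows as the original query.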

How to improve a MySQL query that have an internal join

I'm using a MySQL DB and, as a result of a 3rd party client, we have some queries that take a long time. The 'problem' is that there is an outer select using an internal join without filtering results with 'where', and the 'where' is only on the "outer" section, which causes a join of 2 very big tables instead of joining 2 much smaller subsets of the tables (I can't control it, this is the way it is done... I must define the join for them and they just add where clauses to it using this structure). Note that if the 'where' clauses were inside the internal join, the join would be much smaller and the whole query would be faster.
I've considered implementing the internal join using a view, but it results in the same performance. All fields compared by the join are indexed.
I was told that it can be improved with some DB's configuration tweaking, but no one could say what exactly.
Here is a paraphrase of the query (it takes from many seconds up to a minute to execute):
SELECT a.*,
SUM(b.p1) p1
FROM
(SELECT a.*,
b.p1
FROM a
LEFT OUTER JOIN b ON a.some_value = b.some_value)
WHERE a.some_value = 'x'
Just to explain, if I could write the query myself I would have written it like this (takes ~200ms to execute):
SELECT a.*,
SUM(b.p1) p1
FROM a
LEFT OUTER JOIN b ON a.some_value = b.some_value
WHERE a.some_value = 'x'
Any idea how can I improve that?
Your personal rewrite would be OK; however, adding the AND b.y to the WHERE clause turns your LEFT JOIN into an INNER JOIN. The AND b.y should be part of the join's ON clause to retain left-join qualification.
For indexes, table A should have index on (x, b_id) and table B have a covering index on (id, y, p1)
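The point about WHERE versus ON can be seen on a tiny data set. The sketch below uses SQLite from Python; the `y` column and the condition `b.y = 1` are assumptions standing in for whatever the real condition on table b is, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER PRIMARY KEY, some_value TEXT);
CREATE TABLE b (some_value TEXT, p1 INT, y INT);
INSERT INTO a VALUES (1, 'x'), (2, 'z');
-- The b row matching a.id = 1 fails the y = 1 test.
INSERT INTO b VALUES ('x', 5, 0), ('z', 7, 1);
""")

# Condition on b in WHERE: NULL-extended rows fail it and disappear,
# so the LEFT JOIN behaves like an INNER JOIN.
in_where = conn.execute("""
SELECT a.id, b.p1
FROM a
LEFT OUTER JOIN b ON a.some_value = b.some_value
WHERE b.y = 1
ORDER BY a.id
""").fetchall()

# Condition on b in ON: every row of a survives, with NULL where b
# has no qualifying row.
in_on = conn.execute("""
SELECT a.id, b.p1
FROM a
LEFT OUTER JOIN b ON a.some_value = b.some_value AND b.y = 1
ORDER BY a.id
""").fetchall()

print(in_where, in_on)
```

With the condition in WHERE, the row for `a.id = 1` is lost; with the condition in ON, it is kept with a NULL `p1`.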

Which MySQL Query is faster?

Which query will execute faster, and which is the better query?
SELECT
COUNT(*) AS count
FROM
students
WHERE
status = 1
AND
classes_id IN(
SELECT
id
FROM
classes
WHERE
departments_id = 1
);
Or
SELECT
COUNT(*) AS count
FROM
students s
LEFT JOIN
classes c
ON
c.id = s.classes_id
WHERE
status = 1
AND
c.departments_id = 1
I have posted two queries; both will output the same result. Now I want to know which method will execute faster and which is the correct way.
You should always use EXPLAIN to determine how your query will run.
Unfortunately, MySQL will execute your subquery as a DEPENDENT SUBQUERY, which means that the subquery will be run for each row in the outer query. You'd think MySQL would be smart enough to detect that the subquery isn't a correlated subquery and would run it just once; alas, it's not yet that smart.
So, MySQL will scan through all of the rows in students, running the subquery for each row, and not utilizing any indexes on the outer query whatsoever.
Writing the query as a JOIN would allow MySQL to utilize indexes, and the following query would be the optimum way to write it:
SELECT COUNT(*) AS count
FROM students s
JOIN classes c
ON c.id = s.classes_id
AND c.departments_id = 1
WHERE s.status = 1
This would utilize the following indexes:
students(`status`)
classes(`id`, `departments_id`) : multi-column index
From a design and clarity standpoint I'd avoid inner selects like the first one. It is true that to be 100% sure whether or how each query will be optimized and which will run 'better' requires seeing how the SQL server you're using will interpret it and what plan it produces. In MySQL, use EXPLAIN.
However.... Even without seeing this, my money is still on the join-only version... The inner select version has to perform the inner select in its entirety before determining the values to use inside the IN clause. I know this to be true when you wrap stuff in functions, and I'm pretty sure it's true when sticking a select in as the IN arguments. I also know that that's a good way to totally neutralize any benefit you might have from indexes on the tables inside the inner select.
I'm generally of the opinion that Inner selects are only really needed for very rare query situations. Usually, those who use them often are thinking like traditional iterative flow programmers not really thinking in relational DB result set terms...
Run EXPLAIN on both queries individually.
The difference between the two queries is subqueries vs joins.
Joins are mostly faster than subqueries. With a join, the optimizer can build one execution plan and predict what data is going to be processed, which saves time. Subqueries, on the other hand, may run in full until all their data is loaded. Most developers use subqueries because they are more readable than joins, but where performance matters, a JOIN is the better solution.
The best way to find out is to measure it:
Without index
Query 1: 0.9s
Query 2: 0.9s
With index
Query 1: 0.4s
Query 2: 0.2s
The conclusion is:
If you don't have indexes then it makes no difference which query you use.
The join is faster if you have the right index.
The effect of adding the correct index is greater than the effect of choosing the right query. If performance matters, make sure you have the correct indexes.
Of course, your results may vary depending on the MySQL version and the distribution of data you have.
Here's how I tested it:
1,000,000 students (25% with status 1).
50,000 classes.
10 departments.
Here's the SQL I used to create the test data:
CREATE TABLE students
(id INT PRIMARY KEY AUTO_INCREMENT,
status int NOT NULL,
classes_id int NOT NULL);
CREATE TABLE classes
(id INT PRIMARY KEY AUTO_INCREMENT,
departments_id INT NOT NULL);
CREATE TABLE numbers(id INT PRIMARY KEY AUTO_INCREMENT);
INSERT INTO numbers VALUES (),(),(),(),(),(),(),(),(),();
INSERT INTO numbers
SELECT NULL
FROM numbers AS n1
CROSS JOIN numbers AS n2
CROSS JOIN numbers AS n3
CROSS JOIN numbers AS n4
CROSS JOIN numbers AS n5
CROSS JOIN numbers AS n6;
INSERT INTO classes (departments_id)
SELECT id % 10 FROM numbers WHERE id <= 50000;
INSERT INTO students (status, classes_id)
SELECT id % 4 = 0, id % 50000 + 1 FROM numbers WHERE id <= 1000000;
SELECT COUNT(*) AS count
FROM students
WHERE status = 1
AND classes_id IN (SELECT id FROM classes WHERE departments_id = 1);
SELECT COUNT(*) AS count
FROM students s
LEFT JOIN classes c
ON c.id = s.classes_id
WHERE status = 1
AND c.departments_id = 1;
CREATE INDEX ix_students ON students(status, classes_id);
The two queries won't produce the same results:
SELECT
COUNT(*) AS count
FROM
students
WHERE
status = 1
AND
classes_id IN(
SELECT
id
FROM
classes
WHERE
departments_id = 1
);
...will return the number of rows in the students table that have a classes_id field that is also in the classes table with a departments_id of 1.
SELECT
COUNT(*) AS count
FROM
students s
LEFT JOIN
classes c
ON
c.id = s.classes_id
WHERE
status = 1
AND
c.departments_id = 1
...will return the total number of rows in the students table where the status field is 1 and possibly more than that depending on how your data is organised.
If you want the queries to return the same thing, you need to change the LEFT JOIN to an INNER JOIN so it will match only the rows that suit both conditions.
Run EXPLAIN SELECT ... on both queries and check which one does what ;)