MySQL string search and WHERE clause condition order

I'm working with a large number of rows in a database (MySQL, InnoDB engine, approx. 20 million rows) and I need to perform fuzzy searching quite a lot. For various reasons I decided to use the Jaro-Winkler algorithm, and for performance reasons I implemented it as a SQL function. The application is written in Python, and I came across a weird situation today.
Comparing these two queries (run from the mysql shell, not via an ORM):
SELECT * FROM products WHERE jaro_winkler(code, '78-1747') > 0.7 AND code LIKE '%78%';
and
SELECT * FROM products WHERE code LIKE '%78%' AND jaro_winkler(code, '78-1747') > 0.7;
I noticed that the first one is at least 10 times slower than the second one. It seems logical at first, but from what I've read, the order of conditions in a WHERE clause should not matter.
So my question: is this normal behaviour?
Also, can someone recommend, from practical experience, a good algorithm or function for fuzzy searching? I know about the Damerau-Levenshtein metric, but it turned out to be slower than my current solution.
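One way to make the cheap LIKE filter run before the expensive UDF call is to stage it in a derived table. This is only a sketch: MySQL does not guarantee predicate evaluation order, and newer versions may merge the derived table back into the outer query, so verify the plan with EXPLAIN:
SELECT *
FROM (SELECT * FROM products WHERE code LIKE '%78%') AS prefiltered
WHERE jaro_winkler(code, '78-1747') > 0.7;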
EDIT:
After using EXPLAIN:
I quickly created a sample database and ran both queries.
EXPLAIN for the first query:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE products ALL NULL NULL NULL NULL 4166 Using where
query time: ~ 2 seconds
EXPLAIN for the second query:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE products ALL NULL NULL NULL NULL 4332 Using where
query time: ~ 0.1 second

Related

SQL query with subquery takes longer than both queries separately

Problem
I have two queries where one needs the result of the other one. My first guess was to use an independent subquery:
SELECT P2.*
FROM ExampleTable P2
WHERE P2.delivery_start >= (
    SELECT MIN(P1.delivery_start)
    FROM ExampleTable P1
    WHERE 1641288602 < P1.delivery_end
);
The entire query takes 5-6 seconds, which is way too long for my application. Running the queries one after another takes only around 800 ms for both:
SELECT MIN(P1.delivery_start)
FROM ExampleTable P1
WHERE 1641288602 < P1.delivery_end;
SELECT P2.*
FROM ExampleTable P2
WHERE P2.delivery_start >= 1641286800;
I am using MariaDB 10.2 and have indexes on both delivery_start and delivery_end.
What I have tried
I have used a CTE instead of a subquery, which resulted in the same performance. Using a variable with SET yields results similar to running both queries separately, so that's what I will use for the time being.
I ran EXPLAIN on all 3 queries:
1. Query with subquery
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY P2 ALL delivery_start NULL NULL NULL 6388282 Using where
2 SUBQUERY P1 range delivery_end delivery_end 4 NULL 36378 Using index condition
2. Separate Queries
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE P1 range delivery_end delivery_end 4 NULL 36432 Using index condition
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE P2 range delivery_start delivery_start 4 NULL 35944 Using index condition
Question
I think the issue is shown in the first EXPLAIN table, as it has type ALL, which means the database performs a full table scan. My question is simply: why? Is the optimizer not able to figure out that the subquery produces a number with which we only need a range-type query? And why does it not use any index?
The problem is described in the MariaDB docs:
In all remaining cases when NULL cannot be substituted with FALSE, it is not possible to use index lookups. This is not a limitation in the server, but a consequence of the NULL semantics in the ANSI SQL standard.
There is a full examination here:
https://mariadb.com/kb/en/non-semi-join-subquery-optimizations/
The result of your subquery can potentially return a NULL in the case no rows were found. Hence, MariaDB cannot use the index for the parent query.
You must rewrite your subquery so that it always returns a row with a non-NULL scalar, or stick with two separate queries. However, what should happen if your first query returns NULL? With a compound statement you could put an IF around the second query and not execute it at all if the first returns NULL.
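A minimal sketch of the SET-variable approach the question already mentions, with the NULL case called out (table and column names taken from the question):
SET @min_start = (SELECT MIN(P1.delivery_start)
                  FROM ExampleTable P1
                  WHERE 1641288602 < P1.delivery_end);
-- @min_start is NULL when no row matches; the comparison below then
-- returns nothing anyway, but application code (or an IF in a compound
-- statement) can skip the second query entirely in that case.
SELECT P2.*
FROM ExampleTable P2
WHERE P2.delivery_start >= @min_start;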
Replace these
INDEX(delivery_start)
INDEX(delivery_end)
with these:
INDEX(delivery_start, delivery_end)
INDEX(delivery_end, delivery_start)
The second one will help significantly with the subquery. Then the first may help with the outer query.
(If those don't help, please add SHOW CREATE TABLE, EXPLAIN SELECT ... and table sizes.)
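As a concrete sketch, the change could look like this (the names of the existing single-column indexes are assumptions; check SHOW CREATE TABLE for the real ones):
ALTER TABLE ExampleTable
    DROP INDEX delivery_start,
    DROP INDEX delivery_end,
    ADD INDEX (delivery_start, delivery_end),
    ADD INDEX (delivery_end, delivery_start);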

Optimizing query with fts + composite index

I have the following query:
SELECT *
FROM table
WHERE
    structural_type = 1
    AND parent_id = '167F2-F'
    AND points_to_id = ''
    # AND match(search) against ('donotmatch124213123123')
The search takes about 10 ms, running on the composite index (structural_type, parent_id, points_to_id). However, when I add the fts MATCH condition back in, the query balloons to ~1 s, regardless of what is in the match criteria. Basically it seems like it 'skips the index' whenever an fts search is applied.
What would be the best way to optimize this query?
Update: a few explains:
EXPLAIN SELECT... # without fts
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE table NULL ref structural_type structural_type 209 const,const,const 2 100.00 NULL
With fts (also adding 'force index'):
explain SELECT ... force INDEX (structural_type) AND match...
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE table NULL fulltext structural_type,search search 0 const 1 5.00 Using where; Ft_hints: sorted
The only thing I can think of, which would be incredibly hack-ish, would be to add an additional term to the fts index so it does the filtering 'within' the fulltext search. For example:
fts_term = fts_term += " StructuralType1ParentID167F2FPointsToID"
The MySQL optimizer can only use one index for your WHERE clause, so it has to choose between the composite one and the FULLTEXT one.
Since it can't run both queries to benchmark which one is faster, it estimates how fast different execution plans will be.
To do so, MySQL uses internal statistics it keeps about each table. But those stats can differ a lot from reality if they aren't updated while the data in the table changes.
Running an OPTIMIZE TABLE query lets MySQL refresh its table stats, so it can make better estimates and choose the better index.
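A sketch, using the placeholder table name from the question (TABLE is a reserved word, hence the backticks; on InnoDB, OPTIMIZE TABLE maps to a rebuild plus analyze):
OPTIMIZE TABLE `table`;
-- or, to refresh key distribution statistics without rebuilding:
ANALYZE TABLE `table`;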
Try expressing this without the full text logic, using LIKE:
SELECT *
FROM table
WHERE structural_type = 1 AND
      parent_id = '167F2-F' AND
      points_to_id = '' AND
      search NOT LIKE '%donotmatch124213123123%';
The index should still be used for the first three columns. LIKE might be slow, but if not many rows match the first three conditions, this might not be as bad as using the full text index.

InnoDB has index problems when using COUNT() + WHERE

Recently, we switched from MyISAM to InnoDB. I tested the whole application and there are generally no problems, except for one thing: using COUNT(*) in combination with 2 or more WHERE conditions.
So, here's the problem. The query below takes half a second, which is not acceptable. After all, InnoDB shouldn't be slower than MyISAM when using COUNT() with a WHERE clause, but that's exactly what is happening here.
Both project_id and status_id are indexed columns. The table has 350K records.
SELECT COUNT(*) FROM respondents WHERE project_id='366' AND status_id='42'
And here is what EXPLAIN says:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE respondents index_merge project_id,status_id project_id,status_id 4,1 NULL 8343 Using intersect(project_id,status_id); Using where...
When I use only one condition after WHERE (either project_id='366' or status_id='42'), it works fine.
I'm thinking, this whole intersecting thing could be the root of the problem. But then what can I do about it? What do you think?
The index merge can be fixed by a compound index
ALTER TABLE respondents ADD KEY(project_id,status_id)
Assuming the data distribution is not very skewed, this index will be useful (i.e. project_id='366' AND status_id='42' will not match more than 50% of the rows).
Also make sure that your column types match the search. Are project_id and status_id really VARCHAR? If not, remove the quotes:
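For instance, if both columns turn out to be integer types (an assumption; check with SHOW CREATE TABLE), the query becomes:
SELECT COUNT(*) FROM respondents WHERE project_id = 366 AND status_id = 42;
-- Comparing an indexed integer column against a quoted string still allows
-- index use, but the reverse (a string column compared to a bare number)
-- forces a per-row conversion and prevents it.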

Stop MySQL after first match

I noticed that adding LIMIT 1 at the end of a query does not decrease the execution time. I have a few thousand records and a simple query. How do I make MySQL stop after the first match?
For example, these two queries both take approximately half a second:
SELECT id,content FROM links WHERE LENGTH(content)<500 ORDER BY likes
SELECT id,content FROM links WHERE LENGTH(content)<500 ORDER BY likes LIMIT 1
Edit: And here are the EXPLAIN results:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | links | ALL | NULL | NULL | NULL | NULL | 38556 | Using where; Using filesort
The difference between the two queries' run times depends on the actual data.
There are several possible scenarios:
There are many records with LENGTH(content)<500
In this case, MySQL will start scanning all table rows (according to primary key order since you didn't provide any ORDER BY).
There is no index use since your WHERE condition can't be indexed.
Since there are relatively many rows with LENGTH(content)<500, the LIMIT query will return faster than the other one.
There are no records with LENGTH(content)<500
Again, MySQL will start scanning all table rows, but will have to go through all records to figure out none of them satisfies the condition.
Again no index can be used for the same reason.
In this case - the two queries will have exactly the same run time.
Anything between those two scenarios will have different run times, which will be farther apart as you have more valid records in the table.
Edit
Now that you added the ORDER BY, the answer is a bit different:
If there is an index on the likes column, ORDER BY will use it, and the run time will be the time it takes to reach the first record that satisfies the WHERE condition (if 66% of the records do, then this should be faster than without the LIMIT). A sketch of adding such an index follows below.
If there is no index on the likes column, the ORDER BY will take most of the time: MySQL must scan the whole table to get all records that satisfy the WHERE, then order them by likes, and then take the first one.
In that case both queries will have a similar run time (scanning and sorting the results takes much longer than returning 1 record or many records).
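The index mentioned above could be added like this (the index name is a hypothetical choice):
ALTER TABLE links ADD INDEX idx_likes (likes);
-- With this index MySQL can read rows in likes order and stop as soon as
-- one row passes the WHERE, instead of sorting the whole result set.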
Calling functions on column data forces a full table scan, since such expressions can't be indexed. If performance is a concern here, what you can do is create a derived column where you've saved this value in advance:
ALTER TABLE links ADD COLUMN content_length INT;
UPDATE links SET content_length = LENGTH(content);
ALTER TABLE links ADD INDEX idx_content_length (content_length);
Once denormalized and indexed like this, you'll be able to run the query much faster. Keep in mind you'll have to populate content_length each time you add a record; a trigger can handle that, as sketched below.
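A minimal sketch of such a trigger (the trigger name is hypothetical; on MySQL 5.7+ a stored generated column would be an alternative):
CREATE TRIGGER links_set_content_length
BEFORE INSERT ON links
FOR EACH ROW
SET NEW.content_length = LENGTH(NEW.content);
-- A matching BEFORE UPDATE trigger keeps the value correct when content changes.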

mysql query optimization

I need some help optimizing this query:
select * from transaction where id < 7500001 order by id desc limit 16
When I run EXPLAIN on this, the type is "range" and rows is "7500000".
According to some online references, this means the query scanned 7,500,000 rows to get the data.
Is there any way I can optimize this so it scans fewer rows to get the data? Also, id is the primary key column.
According to some online references, this means the query scanned 7,500,000 rows to get the data
Not actually: it's the approximate number of rows that could potentially be scanned (in many cases the optimizer cannot determine the exact number). But you specified LIMIT, so only the first 16 rows are actually read while the query executes.
PS: I hope the key used in EXPLAIN is the primary key on id?
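One way to confirm this (a sketch using session status counters, which count the rows the storage engine actually handed back):
FLUSH STATUS;  -- reset the session counters
SELECT * FROM transaction WHERE id < 7500001 ORDER BY id DESC LIMIT 16;
SHOW SESSION STATUS LIKE 'Handler_read%';
-- Handler_read_key plus Handler_read_prev should add up to roughly 16,
-- despite the multi-million row estimate in EXPLAIN.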
I performed an EXPLAIN with your query on an 8 million row table:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE transaction range PRIMARY PRIMARY 8 NULL 4079100 Using where
The actual execution was fast; execution time: 00:00:00:044.