Stop MySQL after first match

I noticed that adding LIMIT 1 at the end of a query does not decrease the execution time. I have a few thousand records and a simple query. How do I make MySQL stop after the first match?
For example, these two queries both take approximately half a second:
SELECT id,content FROM links WHERE LENGTH(content)<500 ORDER BY likes
SELECT id,content FROM links WHERE LENGTH(content)<500 ORDER BY likes LIMIT 1
Edit: and here are the EXPLAIN results:
id | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | Extra
 1 | SIMPLE      | links | ALL  | NULL          | NULL | NULL    | NULL | 38556 | Using where; Using filesort

The difference in run time between the two queries depends on the actual data.
There are several possible scenarios:
There are many records with LENGTH(content)<500
In this case, MySQL will start scanning all table rows (according to primary key order since you didn't provide any ORDER BY).
No index is used, since your WHERE condition can't be served by an index.
Since there are relatively many rows with LENGTH(content)<500, the LIMIT query will return faster than the other one.
There are no records with LENGTH(content)<500
Again, MySQL will start scanning all table rows, but will have to go through all records to figure out none of them satisfies the condition.
Again no index can be used for the same reason.
In this case - the two queries will have exactly the same run time.
Anything in between those two scenarios will produce different run times, which drift further apart as more rows in the table satisfy the condition.
Edit
Now that you added the ORDER BY, the answer is a bit different:
If there is an index on the likes column, ORDER BY will use it, and the run time is the time it takes to reach the first record that satisfies the WHERE condition (if 66% of the records do, then this should be faster than without the LIMIT).
If there is no index on the likes column, the ORDER BY will take most of the time: MySQL must scan the whole table to find all records that satisfy the WHERE, sort them by likes, and then take the first one.
In this case both queries will have a similar run time (scanning and sorting dominates; whether you return 1 record or many barely matters)!
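To see the indexed case in action, you can index the sort column and compare plans. A minimal sketch (idx_likes is a hypothetical name):
-- Index the sort column, then re-check the plan:
ALTER TABLE links ADD INDEX idx_likes (likes);
EXPLAIN SELECT id, content FROM links WHERE LENGTH(content) < 500 ORDER BY likes LIMIT 1;
-- If the optimizer walks idx_likes, Extra should lose "Using filesort", and the
-- LIMIT 1 query can stop at the first row that satisfies the WHERE.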

Calling a function on a column forces a full table scan, because the function's results can't be indexed. If performance is a concern here, what you can do instead is create a derived column in which you save this value in advance:
ALTER TABLE links ADD COLUMN content_length INT;
UPDATE links SET content_length = LENGTH(content);
ALTER TABLE links ADD INDEX idx_content_length (content_length);
Once denormalized and properly indexed like this, you'll be able to run the query much faster. Keep in mind that you'll have to populate content_length each time you add a record.
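If you'd rather not populate it by hand, a pair of triggers can maintain the column automatically; a sketch, assuming MySQL 5.0+ (trigger names are hypothetical):
-- Keep content_length in sync on insert and update:
CREATE TRIGGER trg_links_bi BEFORE INSERT ON links
FOR EACH ROW SET NEW.content_length = LENGTH(NEW.content);
CREATE TRIGGER trg_links_bu BEFORE UPDATE ON links
FOR EACH ROW SET NEW.content_length = LENGTH(NEW.content);
On MySQL 5.7+, a stored generated column would achieve the same thing without triggers.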


What is the "Default order by" for a mysql Innodb query that omits the Order by clause?

I understand, and have found posts indicating, that it is not recommended to omit the ORDER BY clause in a SQL query when you are retrieving data from the DBMS.
Resources & posts consulted (will be updated):
SQL Server UNION - What is the default ORDER BY Behaviour
When no 'Order by' is specified, what order does a query choose for your record set?
https://dba.stackexchange.com/questions/6051/what-is-the-default-order-of-records-for-a-select-statement-in-mysql
Questions:
See the logic of the question below if you want to know more.
My question is: under MySQL with the InnoDB engine, does anyone know how the DBMS actually decides the order of the results?
I read that it is implementation-dependent; OK, but is there a way to know it for my current implementation?
Where is this defined, exactly?
Is it determined by MySQL, by InnoDB, or is it OS-dependent?
Isn't there some kind of list out there?
Most importantly, if I omit the ORDER BY clause and get my result, I can't be sure that this code will still work with newer database versions and that the DBMS will always give me the same result, can I?
Use case & logic:
I'm currently writing a CRUD API, and I have a table in my DB that doesn't contain an "id" field (there is a PK, though), so when I'm showing the results of that table without any search criteria, I don't really have a clue what I should use to order the results. I mean, I could use the PK or any field that is never null, but that wouldn't make the order meaningful. So I was wondering: since my CRUD is supposed to work for any table, and I don't want to solve this problem by adding an exception for this specific table, I could also simply omit the ORDER BY clause.
Final note:
As I read other posts, examples, and code samples, I feel like I'm taking this too far. I understand that it is common knowledge that omitting the ORDER BY clause in a query is simply bad practice, that there is no reliable default order, and indeed that there is no order at all unless you specify one.
I'd just love to know where this is defined, and to learn how it works internally, or at least where the behavior is determined (DBMS / storage engine / OS-dependent / other / multiple criteria). I think it would also benefit other people to know this and to understand the inner mechanisms at play here.
Thanks for taking the time to read anyway! Have a nice day.
Without a clear ORDER BY, current versions of InnoDB return rows in the order of the index it reads from. Which index varies, but it always reads from some index. Even reading from the "table" is really an index—it's the primary key index.
As in the comments above, there's no guarantee this will remain the same in the next version of InnoDB. You should treat it as coincidental behavior: it is not documented, and the makers of MySQL don't promise not to change it.
Even if their implementation doesn't change, reading in index order can cause some strange effects that you might not expect, and which won't give you query result sets that make sense to you.
For example, the default index is the clustered index, PRIMARY. It means index order is the same as the order of values in the primary key (not the order in which you insert them).
mysql> create table mytable ( id int primary key, name varchar(20));
mysql> insert into mytable values (3, 'Hermione'), (2, 'Ron'), (1, 'Harry');
mysql> select * from mytable;
+----+----------+
| id | name     |
+----+----------+
|  1 | Harry    |
|  2 | Ron      |
|  3 | Hermione |
+----+----------+
But if your query uses another index to read the table, like if you only access column(s) of a secondary index, you'll get rows in that order:
mysql> alter table mytable add key (name);
mysql> select name from mytable;
+----------+
| name     |
+----------+
| Harry    |
| Hermione |
| Ron      |
+----------+
This shows it's reading the table by using an index-scan of that secondary index on name:
mysql> explain select name from mytable;
+----+-------------+---------+-------+---------------+------+---------+------+------+-------------+
| id | select_type | table   | type  | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+---------+-------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | mytable | index | NULL          | name | 83      | NULL |    3 | Using index |
+----+-------------+---------+-------+---------------+------+---------+------+------+-------------+
In a more complex query, it can become very tricky to predict which index InnoDB will use for a given query. The choice can even change from day to day, as your data changes.
All this goes to show: You should just use ORDER BY if you care about the order of your query result set!
Bill's answer is good. But not complete.
If the query is a UNION, it will (I think) deliver first the results of the first SELECT (according to the rules), then the results of the second. Also, if the table is PARTITIONed, it is likely to do a similar thing.
GROUP BY may sort by the grouping expressions, thereby leading to a predictable order, or it may use a hashing technique, which scrambles the rows. I don't know how to predict which.
A derived table used to be an ordered list that propagates into the parent query's ordering. But recently, the ORDER BY is being thrown away in that subquery! (Unless there is a LIMIT.)
Bottom Line: If you care about the order, add an ORDER BY, even if it seems unnecessary based on this Q & A.
MyISAM, in contrast, starts with this premise: The default order is the order in the .MYD file. But DELETEs leave gaps, UPDATEs mess with the gaps, and INSERTs prefer to fill in gaps over appending to the file. So, the row order is rather unpredictable. ALTER TABLE x ORDER BY y temporarily sets the .MYD order; this 'feature' does not work for InnoDB.

MySQL (MyISAM) SELECT query takes too long with join

I have a pretty long INSERT query that inserts data from a SELECT query into a table. The problem is that the SELECT takes too long to execute. The table is MyISAM, and the SELECT locks it, which affects other users of the table. I have found that the problem with the query is a join.
When I remove this part of the query, it takes less than a second to execute, but when I leave it in, the query takes more than 15 minutes:
LEFT JOIN enq_217 Pex_217
ON e.survey_panelId = Pex_217.survey_panelId
AND e.survey_respondentId = Pex_217.survey_respondentId
AND Pex_217.survey_respondentId != 0
db.table_1 contains 590,145 rows and e contains 4,703 rows.
Explain Output:
id | select_type        | table            | type   | possible_keys                   | key         | key_len | ref                                    | rows | Extra
 1 | PRIMARY            | e                | ALL    | survey_endTime,survey_type      | NULL        | NULL    | NULL                                   | 4703 | Using where
 1 | PRIMARY            | Pex_217          | ref    | survey_respondentId,idx_table_1 | idx_table_1 | 8       | e.survey_panelId,e.survey_respondentId | 2    | Using index
 2 | DEPENDENT SUBQUERY | enq_11525_timing | eq_ref | code                            | code        | 80      | e.code                                 | 1    |
How can I edit this part of the query to be faster?
I suggest creating an index on the table db.table_1 for the fields panelId and respondentId
You want an index on the table. The best index for this logic is:
create index idx_table_1 on table_1(panelId, respondentId)
The order of these two columns in the index should not matter.
You might want to include other columns in the index, depending on what the rest of the query is doing.
Note: a single index with both columns is different from two indexes with each column.
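To make that distinction concrete, here is a hedged sketch using the join columns from the question (index names are hypothetical):
-- One composite index: usable for the two-column join condition above.
CREATE INDEX idx_panel_respondent ON table_1 (survey_panelId, survey_respondentId);
-- Two single-column indexes: NOT equivalent. MySQL would have to pick one of
-- them (or attempt an index_merge) rather than resolve both columns in one lookup.
CREATE INDEX idx_panel ON table_1 (survey_panelId);
CREATE INDEX idx_respondent ON table_1 (survey_respondentId);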
Why is it a LEFT join?
How many rows in Pex_217?
Run ANALYZE TABLE on each table used. (This sometimes helps MyISAM; it is rarely needed for InnoDB.)
Since the 'real problem' seems to be that the query "holds up other users", switch to InnoDB.
Tips on conversion
The JOIN is not that bad (with the new index -- note Using index): 4703 rows are scanned, and for each the query reaches into the other table's index about 2 times.
Perhaps the "Dependent subquery" is the costly part. Let's see that.
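For reference, the engine switch suggested above is a single statement; note that it rewrites the whole table, so it can take a while on a big one:
ALTER TABLE table_1 ENGINE=InnoDB;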

MySQL performance boost after create & drop index

I have a large MySQL MyISAM table of around 4 million rows, running on a Core 2 Duo laptop with 8 GB of RAM.
This table has 30 columns including varchar, decimal and int types.
I have an index on a varchar(16). Let's call this column: "indexed_varchar_column".
My query is
SELECT 9 columns FROM the_table WHERE indexed_varchar_column = 'something';
It always returns around 5000 rows for every 'something' I query against.
An EXPLAIN of the query returns this:
+----+-------------+-----------+------+-----------------------------------------------+--------------------------------------------+---------+-------+------+-------------+
| id | select_type | table     | type | possible_keys                                 | key                                        | key_len | ref   | rows | Extra       |
+----+-------------+-----------+------+-----------------------------------------------+--------------------------------------------+---------+-------+------+-------------+
|  1 | SIMPLE      | the_table | ref  | many indexes including indexed_varchar_column | another_index NOT: indexed_varchar_column! | 19      | const | 5247 | Using where |
+----+-------------+-----------+------+-----------------------------------------------+--------------------------------------------+---------+-------+------+-------------+
First thing: I'm not sure why another_index is chosen. In fact, it chooses an index which is a composite of indexed_varchar_column and another 2 columns (which form part of the selected ones). Perhaps this makes sense, since it may make things a bit faster by not having to read 2 of the columns from the table. The real QUESTION is the following one:
The query takes 5 seconds for every 'something' I match. The 2nd time I query against the same 'something' it takes 0.15 secs (I guess because the result is being cached). When I run another query against 'something_new' it again takes 5 seconds. So it is consistent.
THE PROBLEM IS: I discovered that creating an index (another composite index including my indexed_varchar_column) and dropping it again makes all further queries against a new 'something_other' take only 0.15 secs. Please note that 1) I create an index and 2) drop it again, so everything ends up in the same state.
I guess all the operations needed for building and dropping indices make the SQL engine cache something that is then reused. When I run EXPLAIN on a query after all this, I get exactly the same plan as before.
How can I proceed to understand what is cached in the create-drop index procedure so that I can cache it without manipulating indices?
UPDATE:
Following a comment from Marc B that suggested that when MySQL creates an index it internally does a SELECT... I tried the following:
SELECT * FROM my_table;
It took 30 secs and returned 4 million rows. The good thing is that all further queries are very fast again (until I reboot the system). Please note that after rebooting, the queries are slow again. I guess this is because MySQL is relying on some sort of OS-level caching.
Any ideas? How can I explicitly cache the table?
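One documented option worth testing here: for MyISAM, MySQL can preload a table's index blocks into its key cache (the data rows, by contrast, are cached only by the OS):
-- Warms the MyISAM key cache for this table's indexes; it does NOT preload row data.
LOAD INDEX INTO CACHE my_table;
Whether that covers enough of the work to matter would need measuring.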
UPDATE 2:
Perhaps I should have mentioned that this table may be severely fragmented. It's 4 million rows, but I remove lots of old rows regularly, and I also add new ones. Since I had large gaps in the IDs (from the deleted rows), every day I drop the primary index (ID) and create it again with consecutive numbers. The table may therefore be very fragmented, so I/O must be an issue... Not sure what to do.
Thanks everybody for your help.
Finally I discovered (thanks to the hint of Marc B) that my table was severely fragmented after many INSERTs and DELETEs. I updated the question with this info some hours ago. There are two things that help:
1)
ALTER TABLE my_table ORDER BY indexed_varchar_column;
2) Running:
myisamchk --sort-records=4 my_table.MYI (where 4 corresponds to my index)
I believe both commands are equivalent. Queries are fast even after a system reboot.
I've put this ALTER TABLE ORDER BY command in a cron job that runs every day. It takes 2 minutes, but it's worth it.
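If you'd prefer to keep that scheduled job inside MySQL rather than in cron, the event scheduler can run the same statement; a sketch, assuming events are available and enabled (the event name is hypothetical):
-- Requires the scheduler to be on: SET GLOBAL event_scheduler = ON;
CREATE EVENT ev_nightly_reorder
ON SCHEDULE EVERY 1 DAY
DO ALTER TABLE my_table ORDER BY indexed_varchar_column;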
How many indexes do you have that contain the indexed_varchar_column? Do you have a single index for just the indexed_varchar_column?
Have you tried:
SELECT 9 columns FROM the_table USE INDEX (name_of_index) WHERE indexed_varchar_column = 'something';?
What is the order of the columns in your composite index?
You must use (at least) a leftmost prefix of the index's columns in your query.
If you have an index on foo, bar, and baz, it will not be usable as an index against bar or baz by themselves; only (foo), (foo, bar), and (foo, bar, baz) can use it.
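A short sketch of that leftmost-prefix rule (the table and column names are made up):
CREATE INDEX idx_fbb ON t (foo, bar, baz);
SELECT * FROM t WHERE foo = 1;              -- can use idx_fbb
SELECT * FROM t WHERE foo = 1 AND bar = 2;  -- can use idx_fbb (prefix foo, bar)
SELECT * FROM t WHERE bar = 2 AND baz = 3;  -- cannot use idx_fbb (no foo)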
EXPLAIN is your friend here. It will tell you which index, if any, is being used by a query.
EDIT: Here's a Postgres EXPLAIN of a simple left join query, for comparison.
Nested Loop Left Join  (cost=0.00..16.97 rows=13 width=103)
  Join Filter: (pagesets.id = pages.pageset_id)
  ->  Index Scan using ix_pages_pageset_id on pages  (cost=0.00..8.51 rows=13 width=80)
        Index Cond: (pageset_id = 515)
  ->  Materialize  (cost=0.00..8.27 rows=1 width=23)
        ->  Index Scan using pagesets_pkey on pagesets  (cost=0.00..8.27 rows=1 width=23)
              Index Cond: (id = 515)

What indexes can be used to improve this query?

This query selects all the unique visitor sessions in a certain date range:
select distinct(accessid) from accesslog where date > '2009-09-01'
I have indexes on the following fields:
accessid
date
some other fields
Here's what explain looks like:
mysql> explain select distinct(accessid) from accesslog where date > '2009-09-01';
+----+-------------+-----------+-------+----------------------+------+---------+------+-------+------------------------------+
| id | select_type | table     | type  | possible_keys        | key  | key_len | ref  | rows  | Extra                        |
+----+-------------+-----------+-------+----------------------+------+---------+------+-------+------------------------------+
|  1 | SIMPLE      | accesslog | range | date,dateurl,dateaff | date | 3       | NULL | 64623 | Using where; Using temporary |
+----+-------------+-----------+-------+----------------------+------+---------+------+-------+------------------------------+
mysql> explain select distinct(accessid) from accesslog;
+----+-------------+-----------+-------+---------------+----------+---------+------+---------+-------------+
| id | select_type | table     | type  | possible_keys | key      | key_len | ref  | rows    | Extra       |
+----+-------------+-----------+-------+---------------+----------+---------+------+---------+-------------+
|  1 | SIMPLE      | accesslog | index | NULL          | accessid | 257     | NULL | 1460253 | Using index |
+----+-------------+-----------+-------+---------------+----------+---------+------+---------+-------------+
Why doesn't the query with the date clause use the accessid index?
Are there any other indexes I can use to speed up queries for distinct accessid's in certain date spans?
Edit - Resolution
Reducing the accessid column width from VARCHAR(255) to CHAR(32) improved query time by ~75%.
Adding a (date, accessid) index had no effect on query time.
An index on (date, accessid) could help. However, before tweaking indices I'd recommend checking the type of your accessid column. EXPLAIN says the key is 257 bytes long, which sounds like a lot for an ID column. Are you using a VARCHAR(255) for accessid? If so, can't you use a more compact type? If it's a number, it should be an INT (SMALLINT, BIGINT, whatever fits your needs), and if it's an alphanumeric ID, can it really be 255 chars long? If its length is fixed, can't you use CHAR (CHAR(32), for example) instead?
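The narrowing suggested here is a one-line ALTER, and it matches the OP's eventual fix noted in the edit above; a sketch of the CHAR(32) variant (verify first that no existing value exceeds 32 characters):
ALTER TABLE accesslog MODIFY accessid CHAR(32) NOT NULL;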
Your problem is that your condition is a range clause (on the date column).
A multi-column index of date->accessid likely won't help the situation, because MySQL can't use index columns after a range condition. In theory it should be able to use such an index to cover the computation in this case, but it appears to be a shortcoming in MySQL; I've never gotten it to use a multi-column index in this situation successfully.
You can try creating an index on (date, accessid), hoping that the optimizer will use it to cover the query (so you won't need to hit any tables), but I don't hold out much hope. There's not a great deal you can do.
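Spelled out, that attempt looks like this; the thing to check in EXPLAIN afterwards is whether key is the new index and Extra says "Using index" (an index-only scan):
CREATE INDEX idx_date_accessid ON accesslog (date, accessid);
EXPLAIN SELECT DISTINCT accessid FROM accesslog WHERE date > '2009-09-01';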
Edit:
My answer is courtesy of High Performance MySQL, Second Edition; worth its weight in gold if you have to do serious MySQL development.
Why doesn't the query with the date clause use the accessid index?
Because using the date index is more efficient. That's because it's likely to pare the search space down faster.
At least one DBMS (DB2/z; I don't know much about MySQL) would benefit from an index on date+accessid, since the access IDs would be sorted within dates in that index. That DBMS would use the date+accessid key to efficiently apply the WHERE clause, whittle down the search space, and return the distinct values of accessid within that space.
Whether MySQL is that smart, I have no idea. My suggestion would be to try it and see (which is the best answer to most DB optimization questions).
The query uses the 'date' index because that's what you use in the WHERE clause.
This is the only sensible option: if it used the accessid index, it would need to read every accessid entry, check the date for each, and only then decide whether it was distinct.
If this is a really big table, a compound index on date and accessid might help.
Why doesn't the query with the date clause use the accessid index?
Because using the date index allows it to ignore a large part of the data in the table. The chances are that the table holds mostly historical data, and a lot of it refers to dates a lot longer ago than the beginning of the current month, so the date criterion is selective and reduces the workload for the optimizer by allowing it to ignore most of the data.
If it used the accessid index, it would also have to read each row (as well as each index entry) to see whether the date meets the search criterion. This means reading the whole of the index and the whole of the table; in fact, it would do better in that case to ignore the index, but I started off with "if it used the accessid index".
Are there any other indexes I can use to speed up queries for distinct accessid's in certain date spans?
Depending on the sophistication of the optimizer, an index on (date, accessid) might improve things. It can do range searches on the leading column of the index, and the trailing column means that it does not have to refer to the data in the table to establish the accessid; the information is in the index. So this might convert a query that accesses an index and a table into one that only accesses the index, which will reduce the amount of I/O needed and therefore improve the performance of the query.
If you have other criteria that need data from other columns, or you need to return more than just the unique accessid values, then you end up reading part of the table data; this is probably still a win compared with scanning the whole of the table.
I have no way of testing it, but I would definitely try to add an index spanning both accessid and date.
Index optimization is often like alchemy. Different DBMSs behave differently, and sometimes you simply need to try (and fail with) various combinations. I'm not saying it's impossible to reason about it; it is possible in many cases, but only up to a point. Often it's simply faster and easier to follow your instinct.

Is it possible to optimize a query that uses the '<>' operator?

This a follow-up to a previous question.
How can I optimize this query so that it does not perform a full table scan?
SELECT Employee.name FROM Employee WHERE Employee.id <> 1000;
explain SELECT Employee.name FROM Employee WHERE Employee.id <> 1000;
+----+-------------+-------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table       | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | Employee    | ALL  | PRIMARY       | NULL | NULL    | NULL | 5000 | Using where |
+----+-------------+-------------+------+---------------+------+---------+------+------+-------------+
(Employee.id is the primary key, in case that isn't clear.)
Have a covering index for name and id, and it should be able to fulfill the query using the index alone. This might be faster, because there's a good chance the entire index is already in memory, while a table scan is more likely to have to go to disk.
Because of the low (practically non-existent) selectivity of your WHERE clause, you may need to provide a hint to get the database to use the index. I'm a SQL Server guy, so I'm not sure of the syntax MySQL uses to hint an index, or even whether MySQL can take advantage of a covering index in this manner.
That said, I doubt you can get much improvement: you're returning every row but one. You should expect that to need to scan the table.
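A sketch of the covering-index idea above, with MySQL's hint syntax since the selectivity is so poor (the index name is hypothetical):
-- Covers both columns the query touches, so the scan can stay in the index:
CREATE INDEX idx_id_name ON Employee (id, name);
SELECT name FROM Employee FORCE INDEX (idx_id_name) WHERE id <> 1000;
Whether this beats the table scan depends on how much narrower the index rows are than the table rows.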
There are a lot of things to try, it depends on how the database engine chooses to parse it, really. Some options:
select employee.name from employee where employee.id not in (1000);
You could also try a union of a less-than and a greater-than query, as spelled out below.
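Spelled out, that rewrite looks like this; UNION ALL is safe here because the two ranges cannot overlap:
SELECT Employee.name FROM Employee WHERE Employee.id < 1000
UNION ALL
SELECT Employee.name FROM Employee WHERE Employee.id > 1000;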
But in the specific example you are giving (which may just be too simple for your real case) a table scan isn't necessarily a bad thing. If all the records have to be returned except one, using an index may in fact be slower.
In traditional databases, you can't!
Of course, you could just skip all employees with the given id (when it is a key or has an index) -- but normally you will still have the vast majority of the table under your feet. So using an index might complicate things, and thus a full table scan is normally the faster option.
When you have specialized databases, you could store the names of all employees adjacent to each other.
Edit: I now see Joel's answer. Yes, this could be a way, since your special index is in effect a specialized form of storing part of the content. Good databases can use the index content alone when it covers the columns needed, which is rather nice. Of course, you will end up with a so-called "full index scan" (normally much faster than a full table scan).
Nothing you can do will increase performance. In this case the database must do a complete table scan, as you are asking for every record save one. Reading every page in an index on top of that would only reduce performance. Fortunately, even if you added an index, the database would be smart enough to ignore it...
EDIT, to address Juergen's comment:
Juergen, you are right about a covering index, but there are conflicting effects here. Any use of an index in a scenario like this has a bad effect in one sense: the query engine may have to perform one I/O operation for each level of the index, for each row it needs to examine. If there are, say, 5 levels in the index and 1M rows, that would be 5 million I/O operations, compared to only 1M I/Os for a complete table scan. This is why, in this scenario, most query optimizers would ignore any available index and do the table scan anyway (unless you force the index with a hint). The only mitigating factor is if EVERY attribute required by the query is in the index (a covering index) and the number of index rows per page on disk is sufficiently larger than the number of table rows per page to counteract the cost of traversing each level of the index for each row returned by the query.