mysql query slow when limit goes to last records - mysql

I have a java application and I would like to get some data from a table and display in the application.
I have millions of records, and the query gets really slow when I am going to the last records. it takes few good minutes to get the results.
select Id from Table1x where description like '%error%' and Id between 0 and 1329999 limit 0, 1000
The above query returns a fast result. That is first pages returns fast. But when I am moving the last pages, it becomes slow.
select Id from Table1x where description like '%error%' and Id between 0 and 1329999 limit 644000, 1000.
This query is slow and taking 17 secs.
Any ideas on how to make this faster? Id is the primary key of table1x.

The problem is in the like. To get the first 1000 records, the database only needs to filter the database until it finds 1000 records that match the search. For the other query, the database needs to match records until it has 645000 records, which makes it much slower. There is no sorting or other filtering, so the index on ID doesn't help at all.
An index on description would help, but not if you start the search with a wildcard, like you do now.
I see two solutions.
First option is to add a FULLTEXT index on the description field. It allows to to look for the word error using MATCH rather than LIKE. I think it will be a lot faster, but the index will become larger too, and I'm not sure about the optimizations on the long run.
Second solution: Since you're obviously looking for errors (I think you're building a report on a log table?), you may add a column with a record type. You can give each record a type (just an integer) which indicates where that record holds an error or not. You will need to update your table once, and insert the type along with new records, but it will make your query faster.
I must admit that this second solution is based on assumptions about the data and your goal. If I'm wrong about that, please provide additional information and I may find a solution that suits you better.

Related

Indexes Show No Improvement In Speed

I have a table with about 22 million rows and about 20 columns containing property data. Currently a query like:
SELECT * FROM fulldataset WHERE county = 'MIDDLESBROUGH'
takes an average of 42 seconds to run. To try and improve this, I created an index on the county column like this:
ALTER TABLE fulldataset ADD INDEX county (county)
There has been no improvement at all in the speed of the same query.
So I used EXPLAIN SELECT to try and find out what was happening. If I SELECT * from countyA, it returns around 85k entries, after ~42 seconds. If I EXPLAIN SELECT the same query it says it's using the county Index I created and that the number of rows is around 167k, which is wrong but better than searching all 22 million.
Likewise, if I SELECT * for countyB I get around 48k results and EXPLAIN SELECT tells me there are around 91k rows. The EXPLAIN SELECT statement returns the result instantly, so it's able to instantly tell that there are around half as many entries for countyB as there are for countyA. The problem is the queries don't execute any faster. If it's only checking 91k rows shouldn't it be very quick?
Here's a screenshot of what I'm doing: image
EDIT: As pointed out, the query itself is not what is taking time. In answer to my own question in the comments, a multiple column index worked wonders.
The query is not the problem. If you look closely at the output of your program you will see that the query execution took less than 1s, but fetching all the rows took 42s.
If you have to wait 42s before you see anything then I recommend to use another querying tool which only fetches the first X rows and displays them in pages.
EXPLAIN is designed to be fast. In doing so, the calculation of "Rows" is only a crude estimate. If can often be off by a factor of 2. So, don't read too much into 85K vs 167K.
Since EXPLAIN is delivering only a single row (or a small number of rows), the "fetch" time is very low.
If you are selecting the AVG() of some column, it has to first read all the relevant rows, doing the computation as it goes. It cannot even start to deliver data until it has finished all the reading.
If you are reading all the rows, it can (but I am not sure that it does) start delivering rows starting with the first row.
If you do something like SELECT * FROM tbl ORDER BY x (and x is not indexed), then you get the worst or both worlds. First it has to read all the rows and write them to a temp table, then it sorts that temp table; only then can it begin to fetch the rows.
I think "duration" and "fetch" are not very useful; the sum of the two is more useful. Here's another example of it: Mysql same querys one with index second without getting 10000xFetch time?
Notice how the sum is consistent, but the separation is not.

Whether or not SQL query (SELECT) continues or stops reading data from table when find the value

Greeting,
My question; Whether or no sql query (SELECT) continues or stops reading data (records) from table when find the value that I was looking for?
referance: "In order to return data for this query, mysql must start at the beginning of the disk data file, read in enough of the record to know where the category field data starts (because long_text is variable length), read this value, see if it satisfies the where condition (and so decide whether to add to the return record set), then figure out where the next record set is, then repeat."
link for referance: http://www.verynoisy.com/sql-indexing-dummies/#how_the_database_finds_records_normally
In general you don't know and you don't care, but you have to adapt when queries take too long to execute. When you do something like
select a,b,c from mytable where a=3 and b=5
then the database engine has a couple of options to optimize. When all these options fail, then it will do a "full table scan" - which means, it will have to examine the entire table to see which rows are eligible. When you have indices on e.g. column a then the database engine can optimize the search because it can pre-select rows where a has value 3. So, in general, make sure that you have indices for the columns that are most searched. (Perversely, some database engines get confused when you have too many indices and will fall back to a full table scan because they've lost their way...)
As to whether or not the scanning stops: In general, the database engine has to examine all data in the table (hopefully aided by indices) and won't stop after having found just one hit. If you want just the first hit, use a limit 1 clause to make sure that your result set has only one outcome. But then again, if you have a sort by clause, the database engine cannot stop after the first hit, there might be next ones that should get priority given the sorting.
Summarizing, how the db engine does its scan depends on how smart it is, what indices are available etc.. If your select queries take too long then consider re-organizing your indices, writing your select statements differently, or rebuilding the table.
The RDBMS reading data from disk is something you cannot know, you should not care and you must not rely on.
The issue is too broad to get a precise answer. The engine reads data from storage in blocks, a block can contain records that are not needed by the query at hand. If all the columns needed by the query is available in an index, the RDBMS won't even read the data file, it will only use the index. The data it needs could already be cached in memory (because it was read during the execution of a previous query). The underlying OS and the storage media also keep their own caches.
On a busy system, all these factors could lead to very different storage access patterns while running the same query several times on a couple of minutes apart.
Yes it scans the entire file. Unless you put something like
select * from user where id=100 limit 1
This of course will still search entire rows if id 100 is the last record.
If id is a primary key it will automatically be indexed and searching would be optimized
I'm sorry... I thought the table.
I will change question and I will explain it in the following image;
I understand that in CASE 1 all columns must be read with each iteration.
My question is: If it's the same in the CASE 2 or columns that are not selected in the query are excluded from reading in each iteration.
Also, are the both queries are the some in performance perspective?
Clarify:
CASE: 1 In first CASE select print all data
CASE: 2 In second CASE select print columns first_name and last_name
Whether in CASE 2 mysql server (SQL query) reads only columns first_name, last_name or read the entire table to get that data(rows)=(first_name, last_name)?
An interest of me how the server reads table row in CASE 1 and CASE 2?

Fast mysql query to randomly select N usernames

In my jsp application I have a search box that lets user to search for user names in the database. I send an ajax call on each keystroke and fetch 5 random names starting with the entered string.
I am using the below query:
select userid,name,pic from tbl_mst_users where name like 'queryStr%' order by rand() limit 5
But this is very slow as I have more than 2000 records in my table.
Is there any better approach which takes less time and let me achieve the same..? I need random values.
How slow is "very slow", in seconds?
The reason why your query could be slow is most likely that you didn't place an index on name. 2000 rows should be a piece of cake for MySQL to handle.
The other possible reason is that you have many columns in the SELECT clause. I assume in this case the MySQL engine first copies all this data to a temp table before sorting this large result set.
I advise the following, so that you work only with indexes, for as long as possible:
SELECT userid, name, pic
FROM tbl_mst_users
JOIN (
-- here, MySQL works on indexes only
SELECT userid
FROM tbl_mst_users
WHERE name LIKE 'queryStr%'
ORDER BY RAND() LIMIT 5
) AS sub USING(userid); -- join other columns only after picking the rows in the sub-query.
This method is a bit better, but still does not scale well. However, it should be sufficient for small tables (2000 rows is, indeed, small).
The link provided by #user1461434 is quite interesting. It describes a solution with almost constant performance. Only drawback is that it returns only one random row at a time.
does table has indexing on name?
if not apply it
2.MediaWiki uses an interesting trick (for Wikipedia's Special:Random feature): the table with the articles has an extra column with a random number (generated when the article is created). To get a random article, generate a random number and get the article with the next larger or smaller (don't recall which) value in the random number column. With an index, this can be very fast. (And MediaWiki is written in PHP and developed for MySQL.)
This approach can cause a problem if the resulting numbers are badly distributed; IIRC, this has been fixed on MediaWiki, so if you decide to do it this way you should take a look at the code to see how it's currently done (probably they periodically regenerate the random number column).
3.http://jan.kneschke.de/projects/mysql/order-by-rand/

MySQL : Does a query searches inside the whole table?

1. So if I search for the word ball inside the toys table where I have 5.000.000 entries does it search for all the 5 millions?
I think the answer is yes because how should it know else, but let me know please.
2. If yes: If I need more informations from that table isn't more logic to query just once and work with the results?
An example
I have this table structure for example:
id | toy_name | state
Now I should query like this
mysql_query("SELECT * FROM toys WHERE STATE = 1");
But isn't more logical to query for all the table
mysql_query("SELECT * FROM toys"); and then do this if($query['state'] == 1)?
3. And something else, if I put an ORDER BY id LIMIT 5 in the mysql_query will it search for the 5 million entries or just the last 5?
Thanks for the answers.
Yes, unless you have a LIMIT clause it will look through all the rows. It will do a table scan unless it can use an index.
You should use a query with a WHERE clause here, not filter the results in PHP. Your RDBMS is designed to be able to do this kind of thing efficiently. Only when you need to do complex processing of the data is it more appropriate to load a resultset into PHP and do it there.
With the LIMIT 5, the RDBMS will look through the table until it has found you your five rows, and then it will stop looking. So, all I can say for sure is, it will look at between 5 and 5 million rows!
Read this about indexes :-)
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
It makes it uber-fast :-)
Full table scan is here only if there are no matching indexes and indeed very slow operation.
Sorting is also accelerated by indexes.
And for the #2 - this is slow because transfer rate from MySQL -> PHP is slow, and MySQL is MUCH faster at doing filtering.
For your #1 question: Depends on how you're searching for 'ball'. If there's no index on the column(s) where you're searching, then the entire table has to be read. If there is an index, then...
WHERE field LIKE 'ball%' will use an index
WHERE field LIKE '%ball%' will NOT use an index
For your #2, think of it this way: Doing SELECT * FROM table and then perusing the results in your application is exactly the same as going to the local super walmart, loading the store's complete inventory into your car, driving it home, picking through every box/package, and throwing out everything except the pack of gum from the impulse buy rack by the front till that you'd wanted in the first place. The whole point of a database is to make it easy to search for data and filter by any kind of clause you could think of. By slurping everything across to your application and doing the filtering there, you've reduced that shiny database to a very expensive disk interface, and would probably be better off storing things in flat files. That's why there's WHERE clauses. "SELECT something FROM store WHERE type=pack_of_gum" gets you just the gum, and doesn't force you to truck home a few thousand bottles of shampoo and bags of kitty litter.
For your #3, yes. If you have an ORDER BY clause in a LIMIT query, the result set has to be sorted before the database can figure out what those 5 records should be. While it's not quite as bad as actually transferring the entire record set to your app and only picking out the first five records, it still involves a bit more work than just retrieving the first 5 records that match your WHERE clause.

What is running optimize on a table doing to make such a huge difference?

Simple situation, two column table [ID, TEXT]. The Text column has 1-10 word phrases. 300,000 rows.
Running the query:
SELECT * FROM row
WHERE text LIKE '%word%'
...took 0.1 seconds. Ok.
So I created a 2nd column, the table now has: [ID, TEXT2, TEXT2]
I made TEXT2 = TEXT (using an UPDATE table SET TEXT2 = TEXT]
Then I run the query for '%word%' again, and it takes 2.4 seconds.
This leaves me very very stumped but after quite a lot of blind alleys, I run OPTIMIZE on the table, and it goes to about 0.2 seconds.
Two questions:
Does anyone know how the data structure get's itself in such a mess whereby doubling the data increases the search time for this query by a factor of 24?
Is it standard for an un-indexed search like this to increase at the rate of the underlying table data structure as opposed to the data in the actual column being searched?
Thanks!
Sounds to me like you are the victim of Query caching. The second time your run the query (after the optimize), it already has the answer cached, and therefore the result is returned instantly. Have you tried searching for different search terms. Try running the query with caching turned off as so:
SELECT SQL_NO_CACHE * FROM row WHERE text LIKE '%word%'
To see if this changes the results, or try searching for different words, but with similar number of results to ensure that your server isn't just returned a cached value.
The first time it does a table scan which sounds about right for the timing - no index involved.
Then you added the index and the mysql optimizer doesn't notice you've got a wildcard on the front, so it scans the entire index to find the records, then needs two more reads (one to the PK, then one into the table from there) to get the data record on top of that.
OPTIMIZE probably just updates the optimizer statistics so it knows it should scan the table again.
I would think that the difference is caused by the increased row length causing the table to be fragmented on the disk. Optimize will sort that problem out, leading to the search time returning to normal (give or take a bit).