I have a table with about 22 million rows and about 20 columns containing property data. Currently a query like:
SELECT * FROM fulldataset WHERE county = 'MIDDLESBROUGH'
takes an average of 42 seconds to run. To try and improve this, I created an index on the county column like this:
ALTER TABLE fulldataset ADD INDEX county (county)
There has been no improvement at all in the speed of the same query.
So I used EXPLAIN SELECT to try to find out what was happening. If I SELECT * for countyA, it returns around 85k entries after ~42 seconds. If I EXPLAIN SELECT the same query, it says it's using the county index I created and estimates around 167k rows, which is wrong but better than searching all 22 million.
Likewise, if I SELECT * for countyB I get around 48k results, and EXPLAIN SELECT estimates around 91k rows. The EXPLAIN SELECT statement returns its result instantly, so it can tell immediately that there are around half as many entries for countyB as for countyA. The problem is the queries don't execute any faster. If it's only checking 91k rows, shouldn't it be very quick?
EDIT: As pointed out, the query itself is not what is taking the time. In answer to my own question in the comments: a multi-column index worked wonders.
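For anyone hitting the same wall, the fix looked something like the sketch below. The extra columns here are hypothetical; the point is that an index covering every column the query reads lets MySQL answer from the index alone and skip the slow fetch of wide rows:

ALTER TABLE fulldataset ADD INDEX idx_county_cover (county, postcode, price);
-- A query that touches only indexed columns is served from the index BTree:
SELECT county, postcode, price FROM fulldataset WHERE county = 'MIDDLESBROUGH';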
The query is not the problem. If you look closely at the output of your program you will see that the query execution took less than 1s, but fetching all the rows took 42s.
If you have to wait 42s before you see anything, then I recommend using another querying tool, one which fetches only the first X rows and displays them in pages.
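If your tool can't do that, you can page by hand with LIMIT; a minimal sketch:

SELECT * FROM fulldataset WHERE county = 'MIDDLESBROUGH' LIMIT 50 OFFSET 0;
-- next page: LIMIT 50 OFFSET 50, and so on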
EXPLAIN is designed to be fast. To achieve that, the calculation of "Rows" is only a crude estimate; it can often be off by a factor of 2. So, don't read too much into 85K vs 167K.
Since EXPLAIN is delivering only a single row (or a small number of rows), the "fetch" time is very low.
If you are selecting the AVG() of some column, it has to first read all the relevant rows, doing the computation as it goes. It cannot even start to deliver data until it has finished all the reading.
If you are reading all the rows, it can (but I am not sure that it does) start delivering rows starting with the first row.
If you do something like SELECT * FROM tbl ORDER BY x (and x is not indexed), then you get the worst of both worlds. First it has to read all the rows and write them to a temp table, then it sorts that temp table; only then can it begin to fetch the rows.
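A sketch of the contrast, using the tbl/x names from above:

SELECT * FROM tbl ORDER BY x;   -- x unindexed: full read + filesort before the first row is delivered
ALTER TABLE tbl ADD INDEX (x);  -- with this, rows can stream out in index order, no sort step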
I think "duration" and "fetch" are not very useful; the sum of the two is more useful. Here's another example of it: Mysql same querys one with index second without getting 10000xFetch time?
Notice how the sum is consistent, but the separation is not.
I am looking for a way to execute a SELECT query on a large table without having to add any new indexes.
SELECT id FROM table_name WHERE column_1 = '' LIMIT 100
There are about 800,000 of these empty rows and about 5 million filled ones.
In my mind there has to be a way where the database engine just starts reading the table from one end, collects the first 100 matching rows (regardless of the order) and stops. However, with the above query it checks all 5M rows.
I did search the internet with no answer. Could someone help me out? Thanks.
"it checks all the 5M rows" -- If you are using EXPLAIN to say that, don't trust it. EXPLAIN rarely adjusts its "Rows" column to account for LIMIT.
OTOH, if only the last 100 rows were blank, it would read all 5M rows. If the first 100 rows are blank, only 100 would be read. The Optimizer is not smart enough to know which of those will happen.
With INDEX(column_1), it will touch only 100 index rows and get the ids (which are in the index's BTree). If you want more than just id, there is an extra step (performed 100 times) to reach into the data's BTree to get the rest of the columns.
If you want to discuss further, please provide SHOW CREATE TABLE; we need to see the engine, PRIMARY KEY, datatypes, etc.
Are you first fetching 100 ids, then fetching something based on them? That is almost always less efficient than combining the two queries.
One way or the other, I would add an index to "column_1".
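A sketch of that index and the query it serves (index name invented; table and column names taken from the question):

ALTER TABLE table_name ADD INDEX idx_column_1 (column_1);
-- id lives in the secondary index's BTree, so only ~100 index rows are touched:
SELECT id FROM table_name WHERE column_1 = '' LIMIT 100;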
I am pretty sure MySQL does not give you any possibility to influence this.
What you can try is a stored procedure which does a SELECT id, column_1 FROM table_name, filters on column_1 = '' as it reads, and stops after counting 100 positives.
If there are any better methods, I'll be happy to hear.
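For what it's worth, a rough sketch of such a cursor-based procedure (the procedure and temp-table names are made up, and id is assumed to be an integer):

DELIMITER //
CREATE PROCEDURE first_100_empty()
BEGIN
  DECLARE done INT DEFAULT 0;
  DECLARE v_id INT;
  DECLARE v_col TEXT;
  DECLARE matches INT DEFAULT 0;
  DECLARE cur CURSOR FOR SELECT id, column_1 FROM table_name;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

  CREATE TEMPORARY TABLE found_ids (id INT);

  OPEN cur;
  read_loop: LOOP
    FETCH cur INTO v_id, v_col;
    IF done = 1 OR matches >= 100 THEN
      LEAVE read_loop;
    END IF;
    IF v_col = '' THEN
      INSERT INTO found_ids VALUES (v_id);
      SET matches = matches + 1;   -- stop as soon as 100 positives are counted
    END IF;
  END LOOP;
  CLOSE cur;

  SELECT id FROM found_ids;        -- return the collected ids
  DROP TEMPORARY TABLE found_ids;
END //
DELIMITER ;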
When I run this query on phpmyadmin, it takes about 30 seconds to fetch the results, but once the results have successfully loaded, it says "Query took 0.5029 seconds".
Why does it say 'Query took 0.50 seconds' if the results take 30 seconds to load?
My Query:
SELECT * FROM `documents` WHERE disable=0 AND author=7 AND MATCH(text) AGAINST('"chocolate"')
The field I am searching (named "text") has a field type of "mediumtext", and each text row contains about 200kb of text. The total size of the table is 15,000 rows and 1.5GB of text.
Does anyone know what causes this to happen?
I am going to expand on my comment.
When a database reports on the time to complete a query, that is generally the time only within the database. It might or might not include the time to compile the query. It does not include the time to return the results.
Your data rows are quite wide because of the text column. So you have a situation where running the query in the database is quite fast, but the resulting rows are very big, so it takes lots of time to return them to the user.
Perhaps further complicating the timing is that you might be measuring when all the rows have been returned rather than when the first row returns (that is also a common confusion with timings).
In any case, if you don't need the wide columns, just select the columns that you do need. That has little effect on the query processing time, but it could have a big impact on the time to return the results.
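For example, something along these lines (id is a guess at your schema; author, disable, and the MATCH come from your query):

SELECT id, author FROM documents
WHERE disable=0 AND author=7 AND MATCH(text) AGAINST('"chocolate"');
-- fetch the ~200kb `text` column later, only for the single document the user opens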
Rule #1: SELECT * FROM anytable is a BAD idea.
Just ask #supercoolville for clarification.
I have a java application and I would like to get some data from a table and display in the application.
I have millions of records, and the query gets really slow when I am going toward the last records. It takes a good few minutes to get the results.
select Id from Table1x where description like '%error%' and Id between 0 and 1329999 limit 0, 1000
The above query returns a fast result; that is, the first pages return fast. But when I move to the last pages, it becomes slow.
select Id from Table1x where description like '%error%' and Id between 0 and 1329999 limit 644000, 1000
This query is slow and taking 17 secs.
Any ideas on how to make this faster? Id is the primary key of table1x.
The problem is in the LIKE. To get the first 1000 records, the database only needs to scan the table until it finds 1000 records that match the search. For the other query, the database needs to match records until it has 645,000 of them, which makes it much slower. There is no sorting or other filtering, so the index on Id doesn't help at all.
An index on description would help, but not if you start the search with a wildcard, like you do now.
I see two solutions.
First option is to add a FULLTEXT index on the description field. It allows you to look for the word error using MATCH rather than LIKE. I think it will be a lot faster, but the index will become larger too, and I'm not sure how well it holds up in the long run.
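A sketch of that first option (index name invented; note that MATCH finds whole words, so it is not an exact substitute for a %error% substring match):

ALTER TABLE Table1x ADD FULLTEXT INDEX ft_description (description);
SELECT Id FROM Table1x
WHERE MATCH(description) AGAINST('error')
AND Id BETWEEN 0 AND 1329999
LIMIT 644000, 1000;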
Second solution: since you're obviously looking for errors (I think you're building a report on a log table?), you may add a column with a record type. You can give each record a type (just an integer) which indicates whether that record holds an error or not. You will need to update your table once, and insert the type along with new records, but it will make your query faster.
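Sketched out, with an invented column name:

ALTER TABLE Table1x
  ADD COLUMN is_error TINYINT NOT NULL DEFAULT 0,
  ADD INDEX idx_is_error (is_error);
-- one-off backfill:
UPDATE Table1x SET is_error = 1 WHERE description LIKE '%error%';
-- the paged query then walks a small index instead of scanning megabytes of text:
SELECT Id FROM Table1x WHERE is_error = 1 LIMIT 644000, 1000;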
I must admit that this second solution is based on assumptions about the data and your goal. If I'm wrong about that, please provide additional information and I may find a solution that suits you better.
Do you think it's a good idea to count entries from a really big table (like 50K rows) on each page load?
SELECT COUNT(*) FROM table
Right now I have about 2,000 rows and it seems pretty fast; I don't see any delays in page load :)
But the table should reach up to 50K entries... And I'm curious how it will load then
(ps: the page which shows the row count is private, in an admin interface, not public)
COUNT(*) is optimized to return very quickly if the SELECT retrieves from one table, no other columns are retrieved, and there is no WHERE clause. For example:
mysql> SELECT COUNT(*) FROM student;
This optimization applies only to MyISAM tables, because an exact row count is stored for this storage engine and can be accessed very quickly.
Source
As you said, you use MyISAM and your query is for the whole table, so it doesn't matter if it's 1 or 100,000 rows.
As you have said this page is private and not public, I don't see any problem with that query and 50k records; it shouldn't have any real impact on page load times or server load.
The MyISAM engine stores the row count internally, so a query like SELECT COUNT(*) FROM table will be fast. With InnoDB, on the other hand, it will take some time because it counts the actual rows, which means: the more rows, the slower it gets. But there's a trick where you use a small covering index to count all the rows in the table, and then it's fast. Another trick is to simply store the row count in a corresponding summary table.
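The summary-table trick sketched out (names are illustrative; you have to keep the counter in sync yourself, e.g. with triggers or in application code):

CREATE TABLE row_counts (tbl VARCHAR(64) PRIMARY KEY, cnt BIGINT NOT NULL DEFAULT 0);
-- bump on every insert (and decrement on every delete) to the big table:
UPDATE row_counts SET cnt = cnt + 1 WHERE tbl = 'big_table';
-- the admin page then reads one row instead of counting millions:
SELECT cnt FROM row_counts WHERE tbl = 'big_table';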
COUNT(*) isn't an expensive operation; it doesn't actually return the data, it just looks at the indexes. You should be fine even on a 50k table.
If you experience issues in loading, it would be simple to refactor and optimise at that point.
In MyISAM the count(*) is optimized away WHEN THERE ISN'T ANY 'WHERE' CONDITION, so the query is very fast even with billions of lines.
In the case of partitioned tables, we might expect it to behave the same way when there is a simple condition on the column that defines the partition (e.g. counting all the lines in a few physical tables of the logical table). But this is not the case: it loops over all the lines of the physical tables considered, even when we want to count them all. For instance, here, on a 98-million-line table partitioned into 40 tables, it takes over 5 minutes to count the number of lines in the last 32 physical tables.
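If an estimate is good enough, the partition metadata can be read instantly instead; note that TABLE_ROWS is exact for MyISAM but only an approximation for InnoDB (the table name here is illustrative):

SELECT PARTITION_NAME, TABLE_ROWS
FROM information_schema.PARTITIONS
WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = 'my_partitioned_table';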
It can be. According to this forum, PostgreSQL will do an entire scan of the table to figure out the count.
COUNT(*) is O(n), so its performance is related to the number of records in the table. 50k is not a lot at all, so I think it is fine on an admin page. When you get into the millions, COUNT(*) certainly does become expensive.
Simple situation, two column table [ID, TEXT]. The Text column has 1-10 word phrases. 300,000 rows.
Running the query:
SELECT * FROM row
WHERE text LIKE '%word%'
...took 0.1 seconds. Ok.
So I created a 2nd column; the table now has: [ID, TEXT, TEXT2]
I made TEXT2 = TEXT (using UPDATE table SET TEXT2 = TEXT)
Then I run the query for '%word%' again, and it takes 2.4 seconds.
This leaves me very stumped, but after quite a lot of blind alleys, I run OPTIMIZE on the table and it goes to about 0.2 seconds.
Two questions:
Does anyone know how the data structure gets itself into such a mess that doubling the data increases the search time for this query by a factor of 24?
Is it normal for an un-indexed search like this to scale with the size of the underlying table rather than with the data in the column actually being searched?
Thanks!
Sounds to me like you are the victim of query caching. The second time you run the query (after the OPTIMIZE), it already has the answer cached, and therefore the result is returned instantly. Have you tried searching for different search terms? Try running the query with caching turned off, like so:
SELECT SQL_NO_CACHE * FROM row WHERE text LIKE '%word%'
To see if this changes the results, or try searching for different words with a similar number of results, to ensure that your server isn't just returning a cached value.
The first time it does a table scan which sounds about right for the timing - no index involved.
Then you added the index, and the MySQL optimizer doesn't notice you've got a wildcard on the front, so it scans the entire index to find the records, then needs two more reads (one to the PK, then one into the table from there) to get the data record on top of that.
OPTIMIZE probably just updates the optimizer statistics so it knows it should scan the table again.
I would think that the difference is caused by the increased row length causing the table to be fragmented on the disk. Optimize will sort that problem out, leading to the search time returning to normal (give or take a bit).
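For reference, the defragmentation step mentioned above is a one-liner (`row` being the table name from the question):

OPTIMIZE TABLE `row`;  -- rebuilds the table and indexes, compacting rows enlarged by the UPDATE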