MySQL SELECT first X rows without an index - mysql

I am looking for a way to execute a SELECT query on a large table without having to add any new indexes.
SELECT id FROM table_name WHERE column_1 = "" limit 100
There are about 800,000 of these empty rows and about 5 million filled ones.
In my mind there has to be a way where the database engine just starts reading the table from one end, collects the first 100 rows (regardless of the order) and stops. However with the above query it checks all the 5M rows.
I searched the internet but found no answer. Could someone help me out? Thanks.

"it checks all the 5M rows" -- If you are using EXPLAIN to say that, don't trust it. EXPLAIN rarely adjusts its "Rows" column to account for LIMIT.
OTOH, if only the last 100 rows were blank, it would read all 5M rows. If the first 100 rows are blank, then only 100 rows would be read. The Optimizer is not smart enough to know which of those will happen.
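A rough way to see the two extremes is to model the table scan in plain Python. This is only a sketch of the idea, not MySQL itself; the toy table sizes and the function name scan_with_limit are made up for illustration.

```python
from typing import Iterable, List, Tuple


def scan_with_limit(rows: Iterable[Tuple[int, str]], limit: int) -> Tuple[List[int], int]:
    """Sequential scan that stops as soon as `limit` blank rows are found.

    Returns (matching ids, number of rows examined).
    """
    found: List[int] = []
    examined = 0
    for row_id, column_1 in rows:
        examined += 1
        if column_1 == "":
            found.append(row_id)
            if len(found) == limit:
                break  # LIMIT satisfied, no need to read further
    return found, examined


# Hypothetical toy tables: 1,000 rows each, 100 of them blank.
blanks_first = [(i, "" if i < 100 else "x") for i in range(1000)]
blanks_last = [(i, "" if i >= 900 else "x") for i in range(1000)]

ids_first, examined_first = scan_with_limit(blanks_first, 100)
ids_last, examined_last = scan_with_limit(blanks_last, 100)

print(examined_first)  # 100: the blanks are at the start, the scan stops early
print(examined_last)   # 1000: the blanks are at the end, the whole table is read
```

Same query, same LIMIT, but the work depends entirely on where the matching rows happen to sit.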
With INDEX(column_1), it will touch only 100 index rows and get the ids (which are in the index's BTree). If you want more than just id, there is an extra step (performed 100 times) to reach into the data's BTree to get the rest of the columns.
If you want to discuss further, please provide SHOW CREATE TABLE; we need to see the engine, PRIMARY KEY, datatypes, etc.
Are you first fetching 100 ids, then fetching something based on them? That is almost always less efficient than combining the two queries.

One way or the other, I would add an index to "column_1".
I am pretty sure MySQL does not give you any way to influence this.
What you can try is a stored procedure which runs
"SELECT id, column_1 FROM table_name", then filters on column_1 = "" and stops after counting 100 matches.
If there are any better methods, I'll be happy to hear.

Related

Indexes Show No Improvement In Speed

I have a table with about 22 million rows and about 20 columns containing property data. Currently a query like:
SELECT * FROM fulldataset WHERE county = 'MIDDLESBROUGH'
takes an average of 42 seconds to run. To try and improve this, I created an index on the county column like this:
ALTER TABLE fulldataset ADD INDEX county (county)
There has been no improvement at all in the speed of the same query.
So I used EXPLAIN SELECT to try and find out what was happening. If I SELECT * from countyA, it returns around 85k entries, after ~42 seconds. If I EXPLAIN SELECT the same query it says it's using the county Index I created and that the number of rows is around 167k, which is wrong but better than searching all 22 million.
Likewise, if I SELECT * for countyB I get around 48k results and EXPLAIN SELECT tells me there are around 91k rows. The EXPLAIN SELECT statement returns the result instantly, so it's able to instantly tell that there are around half as many entries for countyB as there are for countyA. The problem is the queries don't execute any faster. If it's only checking 91k rows shouldn't it be very quick?
Here's a screenshot of what I'm doing: image
EDIT: As pointed out, the query itself is not what is taking time. In answer to my own question in the comments, a multiple column index worked wonders.
The query is not the problem. If you look closely at the output of your program you will see that the query execution took less than 1s, but fetching all the rows took 42s.
If you have to wait 42s before you see anything, then I recommend using another querying tool which fetches only the first X rows and displays them in pages.
EXPLAIN is designed to be fast. As a result, its "Rows" value is only a crude estimate. It can often be off by a factor of 2. So, don't read too much into 85K vs 167K.
Since EXPLAIN is delivering only a single row (or a small number of rows), the "fetch" time is very low.
If you are selecting the AVG() of some column, it has to first read all the relevant rows, doing the computation as it goes. It cannot even start to deliver data until it has finished all the reading.
If you are reading all the rows, it can (but I am not sure that it does) start delivering rows starting with the first row.
If you do something like SELECT * FROM tbl ORDER BY x (and x is not indexed), then you get the worst of both worlds. First it has to read all the rows and write them to a temp table, then it sorts that temp table; only then can it begin to fetch the rows.
I think "duration" and "fetch" are not very useful; the sum of the two is more useful. Here's another example of it: Mysql same querys one with index second without getting 10000xFetch time?
Notice how the sum is consistent, but the separation is not.

DATABASE optimization insert and search

I was having an argument with a friend of mine. Suppose we have a db table with a userid and some other fields. This table might have a lot of rows. Let's suppose also that by design we limit the records for each userid in the table to about 50. My friend suggested that if I store the rows for each userid one after another, the lookup would be faster, e.g.
userid otherfield
1 .........
1 .........
.....until 50...
2 ........
etc. So when user id 1 is created, I pre-populate that user's 50 rows with null values, etc. The idea is that if I know the number of rows and find the first row with userid = 1, I just have to look at the next 49 and voilà, I don't have to search the whole table. Is this correct? Can this be done without indexing? Is the pre-population an expensive process? Is there a performance difference if I just inserted in the old-fashioned way like
1 ........
2 ........
2 ........
1 ........
etc?
To answer a performance question like this, you should run performance tests on the different configurations.
But, let me make a few points.
First, although you might know that the records for a given id are located next to each other, the database does not know this. So, if you are searching for one user -- without an index -- then the engine needs to search through all the records (unless you have a limit clause in the query).
Second, if the data is fixed length (numerics and dates), then populating it with values after populating it with NULL values will occupy the same space on the page. But if the data is variable length, then a given page will be filled with empty records. When you modify the records with real values, you will get page splits.
What you are trying to do is to outsmart the database engine. This isn't necessary, because MySQL provides indexes, which provide almost all the benefits that you are describing.
Now, having said that, there is some performance benefit from having all the records for a user co-located. If a user has 50 records, then reading the records with an index would typically require loading 50 pages into memory. If the records are co-located, then only one or two pages would need to be read. Typically, this would be a very small performance gain, because most frequently accessed tables fit into memory. There might be some circumstances where the performance gain is worth it.
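To make the page arithmetic concrete, here is a small Python sketch that counts how many distinct pages hold one user's 50 rows under each layout. The page capacity (100 rows per page), the number of users, and the choice of user 7 are assumptions picked for illustration; real page capacity depends on row size and engine.

```python
import random

ROWS_PER_PAGE = 100   # assumption: hypothetical page capacity
ROWS_PER_USER = 50
NUM_USERS = 1000
TOTAL_ROWS = ROWS_PER_USER * NUM_USERS


def pages_touched(positions):
    """Number of distinct pages that hold the given row positions."""
    return len({pos // ROWS_PER_PAGE for pos in positions})


# Co-located layout: user 7 owns one contiguous run of 50 positions.
colocated = list(range(7 * ROWS_PER_USER, 8 * ROWS_PER_USER))

# Scattered layout: the same user's 50 rows land at random positions.
rng = random.Random(0)
scattered = rng.sample(range(TOTAL_ROWS), ROWS_PER_USER)

print(pages_touched(colocated))  # 1
print(pages_touched(scattered))  # close to 50 for a table this large
```

Whether those extra page reads matter in practice depends on how much of the table is already cached in memory, as noted above.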

mysql query slow when limit goes to last records

I have a java application and I would like to get some data from a table and display in the application.
I have millions of records, and the query gets really slow when I go to the last records. It takes a good few minutes to get the results.
select Id from Table1x where description like '%error%' and Id between 0 and 1329999 limit 0, 1000
The above query returns a result quickly; that is, the first pages come back fast. But when I move to the last pages, it becomes slow.
select Id from Table1x where description like '%error%' and Id between 0 and 1329999 limit 644000, 1000
This query is slow and takes 17 secs.
Any ideas on how to make this faster? Id is the primary key of table1x.
The problem is in the LIKE. To get the first 1000 records, the database only needs to scan the table until it finds 1000 records that match the search. For the other query, the database needs to match records until it has 645000 records, which makes it much slower. There is no sorting or other filtering, so the index on Id doesn't help at all.
An index on description would help, but not if you start the search with a wildcard, like you do now.
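The cost of a large offset can be sketched in Python by counting how many rows a sequential filter has to read before it can return a page. This is a model of the reasoning only, not of MySQL internals; the table contents and the name rows_examined are hypothetical.

```python
def rows_examined(descriptions, offset, count):
    """Rows a sequential filter must read to serve LIMIT offset, count."""
    needed = offset + count   # skipped matches still have to be found first
    matches = 0
    examined = 0
    for text in descriptions:
        examined += 1
        if "error" in text:   # stands in for LIKE '%error%'
            matches += 1
            if matches == needed:
                break
    return examined


# Hypothetical data: every second row mentions "error".
table = ["disk error" if i % 2 == 0 else "ok" for i in range(100_000)]

first_page = rows_examined(table, 0, 10)
late_page = rows_examined(table, 40_000, 10)

print(first_page)  # 19
print(late_page)   # 80019
```

Both calls return 10 rows to the caller, but the late page has to find and discard 40,000 matches before it gets there.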
I see two solutions.
The first option is to add a FULLTEXT index on the description field. It allows you to look for the word error using MATCH rather than LIKE. I think it will be a lot faster, but the index will become larger too, and I'm not sure about the optimizations in the long run.
Second solution: Since you're obviously looking for errors (I think you're building a report on a log table?), you may add a column with a record type. You can give each record a type (just an integer) which indicates whether that record holds an error or not. You will need to update your table once, and insert the type along with new records, but it will make your query faster.
I must admit that this second solution is based on assumptions about the data and your goal. If I'm wrong about that, please provide additional information and I may find a solution that suits you better.

mySQL (and MSSQL), using both indexed and non-indexed columns in where clause

The database I use is currently mySQL but maybe later MSSQL.
My question is about how mySQL and MSSQL handle indexed and non-indexed columns.
Let's say I have a simple table like this:
*table_ID -auto-increment, just an ID, indexed.
*table_user_ID -every user has a unique ID, indexed
*table_someOtherID -some data..
*....
Let's say that I have A LOT!! of rows in this table, but the number of rows that every user adds to this table is very small (10-100).
And I want to find one or a few specific rows in this table: a row or rows from a specific user (indexed column).
If I use the following WHERE clause:
..... WHERE table_user_ID= 'someID' AND table_someOtherID='anotherValue'.
Will the database first search for the indexed columns, and then search for the "anotherValue" inside of those rows, or how does the database handle this?
I guess the database will grow a lot if I have to index every column in all tables..
But what do you think, is it enough to index those columns that will decrease the number of rows to just ten maybe hundred?
Database optimizers generally work on a cost basis, looking at all the possible indexes they could use for the query. In your specific case it will see 2 columns - table_user_ID with an index and someOtherID without an index. If you really only have 10-100 rows per userID then the cost of this index will be very low and it will be used. This is because the cardinality is high and the DB can read only the few rows it needs, without touching the rows of every other user it's not interested in. However, if the cost to use the index is very high (very few unique userIDs and many entries per user) it might actually be more efficient to not use the index and scan the whole table, to avoid random seeks as it jumps around the table grabbing rows based on the index.
Once it picks the index, the DB just grabs the rows that match that index (10 to 100 in your case) and tries to match them against your other criteria, searching for rows where someOtherID='anotherValue'.
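The "use the index, then filter the handful of rows it returns" step can be sketched in Python with a dictionary standing in for the index on table_user_ID. The row data and the lookup function are hypothetical, and a real optimizer makes a cost-based choice that this sketch does not model.

```python
from collections import defaultdict

# Hypothetical rows: (table_ID, table_user_ID, table_someOtherID)
rows = [(i, i % 1000, i % 7) for i in range(50_000)]

# Stand-in for the index on table_user_ID: user id -> positions of that user's rows.
user_index = defaultdict(list)
for pos, (_, user_id, _) in enumerate(rows):
    user_index[user_id].append(pos)


def lookup(user_id, other_id):
    """Use the index for user_id, then filter those candidates on other_id."""
    candidates = [rows[pos] for pos in user_index.get(user_id, [])]
    matches = [r for r in candidates if r[2] == other_id]
    return matches, len(candidates)


matches, examined = lookup(42, 3)
print(examined)      # 50 rows fetched via the index, out of 50,000 in the table
print(len(matches))  # 7 of those also satisfy the non-indexed condition
```

The non-indexed condition is only ever checked against the 50 rows the index already narrowed things down to.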
"But the number of rows that every user add to this table is very small (10-100)"
You only need to index the user_id. It should give you good performance regardless of your query, as long as it includes the user_id in the filter. Until you have identified other use cases, it will pretty much work as you state.
"Will the database first search for the indexed columns, and then search for the "anotherValue" inside of those rows, or how does the database handle this?"
Yes (in layman terms that is close).
In regards to SQL Server:
The order of the columns in an index is important, depending on how you query and how the indexes are structured. If you create an index on the columns
-table_user_id
-table_someotherID
The index is ordered by the table_user_id first. Example:
1-2
1-5
1-6
2-3
2-5
2-6
In the first entry of the index, 1 is the table_user_id and 2 is the table_someotherID value.
If you run a query with a where on table_user_id = blah, it will be very fast to use this index, since the table_user_id are indexed in order.
But if you run a query that only uses table_someotherID in the WHERE clause, it might not even use this index, as instead of doing a quick seek in the index for the matching value, it will do a rough scan of the index (which is less efficient than a seek).
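The difference between a seek and a scan on that example index can be sketched in Python with a sorted list of tuples and the standard bisect module. This is an illustration of the ordering argument only, assuming integer ids; the function names are made up and SQL Server's actual plan choice is cost-based.

```python
import bisect

# The composite index from the example, sorted by (table_user_id, table_someotherID).
index = [(1, 2), (1, 5), (1, 6), (2, 3), (2, 5), (2, 6)]


def seek_by_user(user_id):
    """Seek: binary search on the leading column jumps straight to the matching run."""
    lo = bisect.bisect_left(index, (user_id,))
    hi = bisect.bisect_left(index, (user_id + 1,))
    return index[lo:hi]


def scan_by_other(other_id):
    """Scan: entries are not ordered by the second column alone, so all are checked."""
    return [entry for entry in index if entry[1] == other_id]


print(seek_by_user(2))   # [(2, 3), (2, 5), (2, 6)]
print(scan_by_other(5))  # [(1, 5), (2, 5)]
```

The matches for table_someotherID = 5 are spread across the index, which is why only the leading column supports a direct seek.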
Also, SQL Server has an INCLUDE feature that attaches the columns you want in the SELECT clause to the index you create on the WHERE or JOIN columns.
So to answer your question, it all depends on how you create the indexes and how you query them. You're right not to think about indexing every column, as indexes take up storage and incur a performance hit when you do inserts and updates on the table.

MySQL : Does a query search inside the whole table?

1. So if I search for the word ball inside the toys table where I have 5.000.000 entries, does it search through all 5 million?
I think the answer is yes, because how else would it know, but let me know please.
2. If yes: If I need more information from that table, isn't it more logical to query just once and work with the results?
An example
I have this table structure for example:
id | toy_name | state
Now I should query like this
mysql_query("SELECT * FROM toys WHERE STATE = 1");
But isn't it more logical to query the whole table
mysql_query("SELECT * FROM toys"); and then do this if($query['state'] == 1)?
3. And something else, if I put an ORDER BY id LIMIT 5 in the mysql_query will it search for the 5 million entries or just the last 5?
Thanks for the answers.
Yes, unless you have a LIMIT clause it will look through all the rows. It will do a table scan unless it can use an index.
You should use a query with a WHERE clause here, not filter the results in PHP. Your RDBMS is designed to be able to do this kind of thing efficiently. Only when you need to do complex processing of the data is it more appropriate to load a resultset into PHP and do it there.
With the LIMIT 5, the RDBMS will look through the table until it has found you your five rows, and then it will stop looking. So, all I can say for sure is, it will look at between 5 and 5 million rows!
Read this about indexes :-)
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
It makes it uber-fast :-)
A full table scan happens only if there is no matching index, and it is indeed a very slow operation.
Sorting is also accelerated by indexes.
And for the #2 - this is slow because transfer rate from MySQL -> PHP is slow, and MySQL is MUCH faster at doing filtering.
For your #1 question: Depends on how you're searching for 'ball'. If there's no index on the column(s) where you're searching, then the entire table has to be read. If there is an index, then...
WHERE field LIKE 'ball%' will use an index
WHERE field LIKE '%ball%' will NOT use an index
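Why the leading wildcard matters can be sketched in Python with a sorted list standing in for the index. This is a model of the ordering argument, not of MySQL's B-tree code; the sample values and function names are hypothetical.

```python
import bisect

# Hypothetical indexed column values, kept sorted the way an index keeps them.
names = sorted(["ball", "ballista", "balloon", "baseball", "bat", "football", "kite"])


def like_prefix(prefix):
    """LIKE 'prefix%': binary search to the start, then read until the prefix stops matching."""
    start = bisect.bisect_left(names, prefix)
    out = []
    for name in names[start:]:
        if not name.startswith(prefix):
            break  # sorted order guarantees no later value can match
        out.append(name)
    return out


def like_contains(fragment):
    """LIKE '%fragment%': sort order gives no starting point, so every value is checked."""
    return [name for name in names if fragment in name]


print(like_prefix("ball"))    # ['ball', 'ballista', 'balloon']
print(like_contains("ball"))  # ['ball', 'ballista', 'balloon', 'baseball', 'football']
```

The prefix matches sit next to each other in sorted order, while the substring matches are scattered, so the ordering cannot help find them.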
For your #2, think of it this way: Doing SELECT * FROM table and then perusing the results in your application is exactly the same as going to the local super walmart, loading the store's complete inventory into your car, driving it home, picking through every box/package, and throwing out everything except the pack of gum from the impulse buy rack by the front till that you'd wanted in the first place. The whole point of a database is to make it easy to search for data and filter by any kind of clause you could think of. By slurping everything across to your application and doing the filtering there, you've reduced that shiny database to a very expensive disk interface, and would probably be better off storing things in flat files. That's why there's WHERE clauses. "SELECT something FROM store WHERE type=pack_of_gum" gets you just the gum, and doesn't force you to truck home a few thousand bottles of shampoo and bags of kitty litter.
For your #3, yes. If you have an ORDER BY clause in a LIMIT query, the result set has to be sorted before the database can figure out what those 5 records should be, unless an index on the ORDER BY column (such as a primary key on id) already supplies the rows in order. While it's not quite as bad as actually transferring the entire record set to your app and only picking out the first five records, it still involves a bit more work than just retrieving the first 5 records that match your WHERE clause.