Why does greater-than versus equals make a difference in MySQL SELECT?

I have a large MyISAM table. It's approaching 1 million rows. It's basically a list of items and some information about them.
There are two indexes:
primary: the item ID
secondary: a composite index on date (DATE) and col (INT)
I run two queries:
SELECT * FROM table WHERE date = '2011-02-01' AND col < 5 LIMIT 10
SELECT * FROM table WHERE date < '2011-02-01' AND col < 5 LIMIT 10
The first one finishes in ~0.0005 seconds and the second in ~0.05 seconds. That is a 100x difference. Is it wrong for me to expect both of these to run at roughly the same speed? I must not be understanding the indexes very well. How can I speed up the second query?

Regardless of MySQL, it boils down to basic algorithm theory.
Greater-than and less-than operations on a large set are slower than equality operations.
With a large data set, an ideal data structure for determining less-than or greater-than is a self-balancing tree (binary or n-ary).
On a self-balancing tree, the worst case to locate the boundary of a less/greater range is O(log n).
The ideal data structure for an equality lookup is a hash table. The performance of hash tables is generally O(1), i.e. fixed time. A hash table, however, is not good for greater/less comparisons.
Generally, a well-balanced tree performs only slightly worse than a hash table (which is how Haskell gets away with using trees where other languages use hash tables).
Thus, regardless of what MySQL does, it is no surprise that < and > are slower than =.
Old answer below:
Because the first one is like a hash table lookup, since it uses '=' (particularly if your index is a hash table), it will be faster than the second one, which works better with a tree-like index.
Since MySQL lets you configure the index type, you can try changing it, but I'm rather sure the first will always run faster than the second.
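To make the hash-vs-tree trade-off concrete: MySQL's MEMORY engine lets you pick the index type explicitly. A minimal sketch (table and column names are mine, not from the question):

-- HASH index: fast equality lookups, but useless for < or > ranges
CREATE TABLE items_hash (
  id INT PRIMARY KEY,
  dt DATE,
  INDEX idx_dt (dt) USING HASH
) ENGINE=MEMORY;

-- BTREE index: O(log n) lookups, supports both = and range scans
CREATE TABLE items_btree (
  id INT PRIMARY KEY,
  dt DATE,
  INDEX idx_dt (dt) USING BTREE
) ENGINE=MEMORY;

With the HASH variant, WHERE dt = '2011-02-01' can use the index, but WHERE dt < '2011-02-01' cannot, and falls back to a scan.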

I'm assuming you have an index on the date column.
The first query uses the index for a direct lookup; the second probably does a range scan over at least part of the data. A direct fetch is always faster than a scan.

MySQL stores its indexes in a B-tree by default; there is no hashing in general.
The short answer for the performance difference is that the < form evaluates more nodes than the = form.
The index that you've got on there (date, col) stores the values roughly like a phone book:
2011-01-01, col=1, row_ptr
2011-01-01, col=2, row_ptr
2011-01-01, col=3, row_ptr
etc...
2011-02-01, col=1, row_ptr
2011-02-01, col=2, row_ptr
2011-02-01, col=3, row_ptr
etc...
2011-02-02, col=1, row_ptr
2011-02-02, col=2, row_ptr
etc...
...in ascending sorted order, in tree nodes of size B: (2011-01-01, col=1) < (2011-01-01, col=2) < (2011-01-02, col=1).
Your question is essentially asking the difference between:
Find all phone numbers with last name 'Smith' and first name starting with 'A'.
Find all phone numbers that come before 'Smith' and have a first name starting with 'A'.
It should be obvious why #1 is so much faster than #2.
There are also considerations of memory/disk transfer efficiency and heap allocations (= does WAY fewer transfers than <) that account for a not-insignificant amount of time, but those depend largely on the distribution of the data and the specific location of the (2011-02-01, col=min(col)) key record.
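One way to see this for yourself is to compare the EXPLAIN output of the two queries from the question:

EXPLAIN SELECT * FROM table WHERE date = '2011-02-01' AND col < 5 LIMIT 10;
EXPLAIN SELECT * FROM table WHERE date < '2011-02-01' AND col < 5 LIMIT 10;

The first should show a tight range on the (date, col) index with a small rows estimate; the second typically shows a much larger estimate, since every key before 2011-02-01 is a candidate.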

The first one performs a seek over the data, whereas the second one does a scan. Scans are generally costlier than seeks, hence the time difference.
It's like a book: a scan means running through all the pages, whereas a seek jumps directly to a page number.
Hope this helps.

Related

Indexed column and not indexed column research

I generated separate MySQL InnoDB tables with 2000, 5000, 10000, 20000, 50000, 100000, and 200000 elements (with the help of a PHP loop and INSERT queries).
Each table has two columns: id (primary key, INT, auto-increment) and number (INT, UNIQUE KEY). Then I did the same, but this time I generated similar tables where the number column doesn't have an index. I generated the tables in such a way that the value of column number equals the value of the index + 2: the first element is 3, the 1000th element is 1002, and so on. I wanted to test a query like this, because it will be used in my application:
SELECT count(number) FROM number_two_hundred_I WHERE number=200002;
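For reference, the indexed tables described above would look roughly like this (my reconstruction of the poster's schema, not their exact DDL):

CREATE TABLE number_two_hundred_I (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  number INT,
  UNIQUE KEY (number)
) ENGINE=InnoDB;
-- the non-indexed variants omit the UNIQUE KEY on number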
After generating data for these tables, I wanted to time the worst-case queries. I used SHOW PROFILES for this. I assumed the worst-case query would be for the element whose number column equals 1002, 2002, and so on. Here are all the queries I tested and their times (measured with SHOW PROFILES):
SELECT count(number) FROM number_two_thousand_I WHERE number=2002;
-- for tables with an indexed number column I used the suffix _I at
-- the end of the table name. Time for this one: 0.00099250
SELECT count(number) FROM number_two_thousand WHERE number=2002;
-- the number column is not indexed when there is no _I suffix
-- time for this one: 0.00226275
SELECT count(number) FROM number_five_thousand_I WHERE number=5002;
-- 0.00095600
SELECT count(number) FROM number_five_thousand WHERE number=5002;
-- 0.00404125
So here are the results:
2000 el   - indexed 0.00099250, not indexed 0.00226275
5000 el   - indexed 0.00095600, not indexed 0.00404125
10000 el  - indexed 0.00156900, not indexed 0.00761750
20000 el  - indexed 0.00155850, not indexed 0.01452820
50000 el  - indexed 0.00051100, not indexed 0.04127450
100000 el - indexed 0.00121750, not indexed 0.07120075
200000 el - indexed 0.00095025, not indexed 0.11406950
Here is an infographic for that. It shows how the worst-case query time depends on the number of elements, for the indexed and non-indexed column; indexed is in red. When I tested speed, I typed the same query in the MySQL console twice, because I found that on the first run a query against the non-indexed column can sometimes be even a bit faster than against the indexed one.
The question is: why does this type of query for 200000 elements sometimes take less time than the same query for 100000 elements when the number column is indexed? You can see that there are other results that are unpredictable to me. I ask this because when the number column is not indexed the results are quite predictable: the time for 200000 elements is always bigger than for 100000. Please tell me what I'm doing wrong when trying to research the UNIQUE indexed column.
In the non-indexed case it is always a full-table scan, so the time correlates well with the row count. If the column is indexed, you are measuring the index lookup time, which is effectively constant in your case (small numbers, with small deviation).
It is not the "worst" case.
Make the UNIQUE key random instead of being in lock step with the PK; UUID() is one example (see the sketch after this list).
Generate enough rows that the table and index(es) cannot fit in the buffer_pool.
If you do both of those, you will eventually see the performance slow down significantly.
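A minimal sketch of such a randomized test table (names are hypothetical):

CREATE TABLE t_random (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  u CHAR(36) NOT NULL,
  UNIQUE KEY (u)
) ENGINE=InnoDB;

-- each insert lands at a random spot in the UNIQUE index,
-- unlike a value in lock step with the auto-increment PK
INSERT INTO t_random (u) VALUES (UUID());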
UNIQUE keys have the following impact on INSERTs: the uniqueness constraint is checked before returning to the client. For a non-UNIQUE index, the work of inserting into the index's BTree can be (and is) delayed (cf. the "Change buffer"). With no index on the second column, there is even less work to do.
WHERE number=2002 --
With UNIQUE(number) -- Drill down the BTree. Very fast, very efficient.
With INDEX(number) -- Drill down the BTree. Very fast, very efficient. However it is slightly slower since it can't assume there is only one such row. That is, after finding the right spot in the BTree, it will scan forward (very efficient) until it finds a value other than 2002.
With no index on number -- Scan the entire table. So the cost depends on table size, not the value of number. It has no clue if 2002 exists anywhere in the table, or how many times. If you plot the times you got, you will see that it is rather linear.
I suggest you use log-log 'paper' for your graph. Anyway, note how linear the non-indexed case is. And the indexed case is essentially constant. Finding number=200002 is just as cheap as finding number=2002. This applies for UNIQUE and INDEX. (Actually, there is a very slight rise in the line because a BTree is really O(log n), not O(1). For 2K rows, there are probably 2 levels in the BTree; for 200K, 3 levels.)
The Query cache can trip you up in timings (if it is turned on). When timing, do SELECT SQL_NO_CACHE ... to avoid the QC. If the QC is on and applies, then the second and subsequent runs of the identical query will take very close to 0.000 seconds.
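For example, adapting one of the timing queries above to bypass the query cache:

SELECT SQL_NO_CACHE count(number) FROM number_two_thousand_I WHERE number=2002;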
Those timings that varied between 0.5ms and 1.2ms -- chalk it up to the phase of the moon. Seriously, any timing below 10ms should not be trusted. This is because of all the other things that may be happening on the computer at the same time. You can temper it somewhat by averaging multiple runs -- being sure to avoid (1) the Query cache, and (2) I/O.
As for I/O... This gets back to my earlier comment about what may happen when the table (and/or index) is bigger than can be cached in RAM.
When smaller than RAM, the first run is likely to fetch stuff from disk. The second and subsequent runs are likely to be faster and consistent.
When bigger than RAM, all runs may need to hit the disk. Hence, all may be slow, and perhaps flakier than the variations you found.
Your tags are, technically, incorrect. Most of MySQL's indexes are BTrees (actually B+Trees), not Binary Trees. (Sure, there is a lot of similarity, and many of the principles are shared.)
Back to your research goal.
Assume there is 'background noise' that is messing with your figures. Either:
Make your tests non-trivial (e.g. the non-indexed case) so that the real work overwhelms the noise, or
Repeat the timings to mask the issue, and be sure to ignore the first run.
The main cost in performing any SELECT is how many rows it touches.
With your UNIQUE index, it is touching 1 row. So expect fast and O(1) (plus noise).
Without an index, it is touching N rows for an N-row table. So expect O(N).
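One way to confirm which plan each query is getting, using the table names from the question (the exact EXPLAIN columns vary by version, so take the expectations as approximate):

EXPLAIN SELECT count(number) FROM number_two_thousand_I WHERE number=2002;
-- expect a const/ref lookup on the UNIQUE index, rows ~ 1
EXPLAIN SELECT count(number) FROM number_two_thousand WHERE number=2002;
-- expect a full table scan (type ALL), rows ~ the table size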

MySQL: How to correctly generate a unique 10-digit random number that does not exist in the current set of numbers

I have this query:
select t1.newval from
(select LPAD(FLOOR(10000000000*RAND()), 10, '0') as 'newval') as t1
where t1.newval not in (select unqiue_no from table)
Note: the query works fine; however, the test is not extensive, since my data is limited and the length is 10 digits.
I'm not sure whether this query can ever collide with the existing numbers, since it does not loop. Is there any possibility that it will return an empty result if the newval number collides with an existing unique number?
Thanks in advance for your insights.
Of course there is a chance.
Initially that chance would be extremely low: 1 in 10 billion. With every allocated key the odds of a collision increase, but they remain extremely low: roughly (number of previously allocated keys) in 10 billion.
Regardless, there will always be a chance because that is how you designed the query -- requiring it to query against the table of allocated id's, just in case.
I am going to assume here that what you need is a key. If that is the case, you would be best off generating the number and attempting the INSERT (assuming you are using this number as a primary key), building in recovery from a duplicate-key violation, rather than selecting against an ever-increasing data set of allocated keys.
The more immediate concern is not whether you will get an empty result set, but the speed of the query, which requires querying the table of existing keys; as that table becomes large, it will slow the query down in a predictable manner. Since you are already banking on collisions being very-low-probability events, I don't see the value in that subquery.
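A minimal sketch of that insert-and-retry approach, keeping the poster's column name (as in the question, `table` stands in for the real table name):

-- attempt the insert with a freshly generated number
INSERT INTO table (unqiue_no)
VALUES (LPAD(FLOOR(10000000000 * RAND()), 10, '0'));
-- if MySQL rejects this with error 1062 (ER_DUP_ENTRY),
-- generate a fresh number in the application and retry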

Big-O of MySQL Fuzzy Search

What is the Big-O of MySQL Fuzzy Search? Does it vary by index type, if so, what performs the best?
e.g. SELECT * FROM foo WHERE field1 LIKE '%ello Wo%';
I'm unsure of the underlying data type and what kind of magic it possesses. Something like a trie (https://en.wikipedia.org/wiki/Trie) would be nice for searches that are fuzzy only at the end, e.g. LIKE 'Hello Wo%'.
I'm guessing the Big-O is O(n) but wish to confirm. There may even be differences between fuzzy searches too, e.g. %ello Wo% vs. Hello W% vs. %lo World vs. %ell%o%W%or%
Are there different ways to index that give better performance? If yes, for particular cases, can you please share?
With a leading wildcard
MySQL will
Scan all the rows in the table (not an index). This is called a "table scan". (This assumes no other filtering is going on.)
For each row, scan the column in question for the LIKE;
Deliver the rows not filtered out.
Most of the time is spent in Step 1, which is O(N) where N is the number of rows. Much less time is spent in steps 2 and 3.
Without a leading wildcard
Use an index on that column, if you have one, to limit the number of rows to search. If you have an index on the column and are saying WHERE col LIKE 'Hello W%', it will find all the rows in the index starting with Hello W. They will be consecutive in the index, making this step faster.
For each of those, reach into the Data for the row and do whatever else is required.
There are a number of variables (caching, number of rows, randomness of rows, etc.) that determine whether step 1 is more or less costly than step 2. But this is likely to be much faster than the leading-wildcard case: O(n), where n is the number of rows starting with 'Hello W'.
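To illustrate the two cases with an ordinary B-tree index on the column (the index name is mine):

ALTER TABLE foo ADD INDEX idx_field1 (field1);

-- no leading wildcard: can range-scan the index on the 'Hello Wo' prefix
SELECT * FROM foo WHERE field1 LIKE 'Hello Wo%';

-- leading wildcard: the prefix is unknown, so MySQL falls back to a full scan
SELECT * FROM foo WHERE field1 LIKE '%ello Wo%';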

How does MySQL handle huge databases?

For the sake of example:
Let's say my database has 1 table where the fields are
id, first_name (VARCHAR 100 chars), last_name (VARCHAR 100 chars), about (VARCHAR 10,000 chars)
Now let's say the database is 100 Gigs large.
What will random access look like on a machine that only has 4 GB of RAM?
Will the query take constant time every time it's made?
If you search on first name and it is not indexed, the server will read each row in the table and compare it to the WHERE clause. This query will probably vary hugely in time, since the time taken to retrieve the result depends on the position of the row. For example, first name 'a' will be quick to find and first name 'z' will take much longer. Essentially you are doing a linear/sequential scan of the database.
If there were an index on first_name, MySQL would build a tree on the column. Trees are highly efficient when used for searching. Basically, finding the values 'a' and 'z' should take the same number of operations, since you are doing a binary search. Note that I am saying operations.
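For illustration, adding such an index (assuming the table is named people, matching the example fields):

ALTER TABLE people ADD INDEX idx_first_name (first_name);

-- now this walks the tree instead of scanning every row
SELECT id, first_name, last_name FROM people WHERE first_name = 'z';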
There is no way you can guarantee that a query will always execute in the same amount of time. Just remember that while a database is memory-intensive, most people overlook the fact that a database is really bound by disk I/O. These factors make it highly unlikely that you can guarantee that execution time is always predictable and constant. However, you can ensure that the number of operations used remains optimised.
Just one other thing: while indexes speed up reads, they slow down writes. So indexing is a double-edged sword. Index only what you really need to.

Efficiency of finding nearest X locations to decimal degree coordinates using MySQL

This sounds like a problem many others have posted about, but there's a nuance to it that I can't figure out: I don't want to limit my bounds when asking for the nearest X data points, and the query needs to be fast.
Right now I'm using SQL as follows:
SELECT * FROM myTable WHERE col1 = 'String' AND col2 = 1
ORDER BY (latCol - <suppliedLat>) + (longCol - <suppliedLong>)
LIMIT X; //X is usually lower than 100
With Lat and Long stored as doubles, and the table containing about a million rows, this query takes ~6 seconds on our server - not fast enough. EXPLAIN SELECT shows me that it's using no indexes (as expected - there's only one index and it's not location-related), performing a filesort, and hitting all ~1 million rows.
Removing the two WHERE clauses doesn't improve performance at all, and the one index we applied to col1, col2 and a third column actually DECREASED the performance of this query, despite greatly increasing the speed of others.
My reading up on how to solve this leads me to believe that spatial indexing is the way to go, but we never intend to use any of the more advanced spatial features like polygons and circular bounds; we just need the speed. Is there an easy way to apply a spatial (or other sort of) index to an already-existing decimal-degrees table to improve the speed of the above query? Is there a better way of writing the query to make it more efficient?
The big killer is that most things I read about implementing a spatial index in MySQL seem to require changes in how you INSERT the data, and modifying our INSERT statements to use geo/spatial data types would greatly lengthen our dev cycle.
The idea is to use a quadkey. It can look like this: 12131212. Each character in the key represents one level of a quadtree. If you want to find nearby locations, you can simply use a substring match in the WHERE clause: WHERE SUBSTRING(Field, 1, 4) = '1213' (note that MySQL string positions start at 1, so a start position of 0 would return an empty string). For the above data it would return the first location, 12131212, and any other location starting with 1213. Of course, you can replace the characters 1, 2, 3, 4 with other, more meaningful characters. You can download my PHP class hilbert-curve at phpclasses.org.
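A minimal sketch of that scheme against the poster's table (the quadkey column and index are my additions; a prefix LIKE is used instead of SUBSTRING so the index can actually be range-scanned, echoing the fuzzy-search answer above):

ALTER TABLE myTable ADD COLUMN quadkey VARCHAR(16);
ALTER TABLE myTable ADD INDEX idx_quadkey (quadkey);

-- all locations in the same level-4 quadtree cell as 12131212
SELECT * FROM myTable
WHERE quadkey LIKE '1213%'
LIMIT 100;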