I have a table with tuples where timestamps (time) are not consecutive but (we can assume for simplicity) unique.
time | value
------------
0 |4
3 |2
5 |6
8 |10
9 |5
13 |-1
15 |-3
... |...
I am faced with the problem of finding the "next tuple given some time T" ( <- next(T);), e.g. next(4) -> <5,6>, or next(5) -> <8,10>. Further, since this data is hold in a MySQL database I would prefer to realize this with SQL. However, time constraints require to find the respective tuple in O (log n).
At first glance, I tried the following SQL statement (I hope my Pseudo-code is understandable):
<time, value> = next(T) {
return (select * from table
where time = (select min(time) from table
where time > T))
}
However, this does not give the result in reasonable time. I guess that "select min(time) from table where time > find" takes O(n) time. Of course, I know performing a search in an ordered list takes only O(log n) time but I have no clue how to do that in SQL. Is this even possible? If so, how does it work?
Thanks!
For your information:
(1) At the moment my solution caches the respective data in memory and orders it initially. This way I can then find the next tuple in O(log n) time. However, this consumes lots of memory and I would prefer to do it kind of "in-line" in the DBMS which is surely highly optimized regarding caching etc.
(2) I could imagine a solution where data is hold ordered by time in the database, but I don't know how to ensure ordering or to implement a respective search algorithm in SQL. :-/
(3) I am aware of indexing etc. and that it improves performance if I declare time as primary key but I don't know how it could help to find next in O(log n).
You need to make sure that an index exists for the time column. You can check if an index exists by examining the results of this command:
show index from table;
If the time column is the primary key of the table, then the index almost certainly exists. The index is necessary for an efficient search in the time column. You will get O(log n) performance with the right index, if not constant time lookups (just read more about btrees).
MySQL uses B-tree indexes, which allow lookup and sequential traversing, both in logarithmic time. That means that finding the next higher time for a given time is done in logarithmic time, provided that MySQL utilizes the index correctly. This is not always the case and you have to try this. If it does not work, you have to give MySQL execution hints to make it utilizing the index correctly.
Order the results by time and then use the limit keyword for taking only the first result from the result set:
select * from table
where time > T
order by time
limit 1
Related
I have a table with 161886415 rows. When i run:
SELECT * FROM table
It takes 0.0083 seconds.
But when i try to run:
SELECT A, SUM(B)
FROM table
GROUP BY A
It takes an infinite time
I already have an A and B index, AB and BA composite index
A is date and B is an int.
Your comparison is misleading. When you have a query like this:
select a.*
from table;
You are seeing the first rows returned, not all of them. MySQL can start to return rows as quickly as it reads them. By contrast, the aggregation query needs to read the entire table before returning a single row.
You might find that the aggregation query is faster if you have an index on (A, B). But you seem to already have this index.
Pretty much your best option is to filter down to a subset of the dates.
First select is quite easy to process. Database engine can use table scan over heap stored data which is used when you try to retrieve bigger percentage of data stored in table.
You should look at your query plan which aggregate operator is being used. Also, you can edit your original post.
Index could be helpful. MariaDB offers column store for example. Depends on query and your speed expectations.
Similar issue dealing with SUM() performance Is it possible to speed up a sum() in MySQL?
Your first query returned all 181 million rows in 8.3 milliseconds. I think not.
The second query, as you will see from EXPLAIN SELECT ... efficiently uses INDEX(A, B). Still it needs to read all 181 'rows' in that index, so it takes a long time.
Often in Data Warehouse applications, it is beneficial to build and maintain "Summary Tables" to speed things up -- significantly. You might have a daily subtotal of SUM(B) for each A, then sum up the subtotals when you need it.
If you want to discuss thing further, provide more specifics on the table and the query.
I have a very simple table with three columns:
- A BigINT,
- Another BigINT,
- A string.
The first two columns are defined as INDEX and there are no repetitions. Moreover, both columns have values in a growing order.
The table has nearly 400K records.
I need to select the string when a value is within those of column 1 and two, in order words:
SELECT MyString
FROM MyTable
WHERE Col_1 <= Test_Value
AND Test_Value <= Col_2 ;
The result may be either a NOT FOUND or a single value.
The query takes nearly a whole second while, intuitively (imagining a binary search throughout an array), it should take just a small fraction of a second.
I checked the index type and it is BTREE for both columns (1 and 2).
Any idea how to improve performance?
Thanks in advance.
EDIT:
The explain reads:
Select type: Simple,
Type: Range,
Possible Keys: PRIMARY
Key: Primary,
Key Length: 8,
Rows: 441,
Filtered: 33.33,
Extra: Using where.
If I understand your obfuscation correctly, you have a start and end value such as a datetime or an ip address in a pair of columns? And you want to see if your given datetime/ip is in the given range?
Well, there is no way to generically optimize such a query on such a table. The optimizer does not know whether a given value could be in multiple ranges. Or, put another way, whether the ranges are disjoint.
So, the optimizer will, at best, use an index starting with either start or end and scan half the table. Not efficient.
Are the ranges non-overlapping? IP Addresses
What can you say about the result? Perhaps a kludge like this will work: SELECT ... WHERE Col_1 <= Test_Value ORDER BY Col_1 DESC LIMIT 1.
Your query, rewritten with shorter identifiers, is this
SELECT s FROM t WHERE t.low <= v AND v <= t.high
To satisfy this query using indexes would go like this: First we must search a table or index for all rows matching the first of these criteria
t.low <= v
We can think of that as a half-scan of a BTREE index. It starts at the beginning and stops when it gets to v.
It requires another half-scan in another index to satisfy v <= t.high. It then requires a merge of the two resultsets to identify the rows matching both criteria. The problem is, the two resultsets to merge are large, and they're almost entirely non-overlapping.
So, the query planner probably should just choose a full table scan instead to satisfy your criteria. That's especially true in the case of MySQL, where the query planner isn't very good at using more than one index.
You may, or may not, be able to speed up this exact query with a compound index on (low, high, s) -- with your original column names (Col_1, Col_2, MyString). This is called a covering index and allows MySQL to satisfy the query completely from the index. It sometimes helps performance. (It would be easier to guess whether this will help if the exact definition of your table were available; the efficiency of covering indexes depends on stuff like other indexes, primary keys, column size, and so forth. But you've chosen minimal disclosure for that information.)
What will really help here? Rethinking your algorithm could do you a lot of good. It seems you're trying to retrieve rows where a test point v lies in the range [t.low, t.high]. Does your application offer an a-priori limit on the width of the range? That is, is there a known maximum value of t.high - t.low? If so, let's call that value maxrange. Then you can rewrite your query like this:
SELECT s
FROM t
WHERE t.low BETWEEN v-maxrange AND v
AND t.low <= v AND v <= t.high
When maxrange is available we can add the col BETWEEN const1 AND const2 clause. That turns into an efficient range scan on an index on low. In that case, the covering index I mentioned above will certainly accelerate this query.
Read this. http://use-the-index-luke.com/
Well... I found a suitable solution for me (not sure your guys will like it but, as stated, it works for me).
I simply partitioned my 400K records into a number of tables and created a simple table that serves as a selector:
The selector table holds the minimal value of the first column for each partition along with a simple index (i.e. 1, 2, ,...).
I then user the following to get the index of the table that is supposed to contain the searched for range like:
SELECT Table_Index
FROM tbl_selector
WHERE start_range <= Test_Val
ORDER BY start_range DESC LIMIT 1 ;
This will give me the Index of the table I wish to select from.
I then have a CASE on the retrieved Index to select the correct partition table from perform the actual search.
(I guess that more elegant would be to use Dynamic SQL, but will take care of that later; for now just wanted to test the approach).
The result is that I get the response well below a second (~0.08) and it is uniform regardless of the number being used for test. This, by the way, was not the case with the previous approach: There, if the number was "close" to the beginning of the table, the result was produced quite fast; if, on the other hand, the record was near the end of the table, it would take several seconds to complete).
[By the way, I assume you understand what I mean by beginning and end of the table]
Again, I'm sure people might dislike this, but it does the job for me.
Thank you all for the effort to assist!!
I am reading High performance MySQL and I am a little confused about deferred join.
The book says that the following operation cannot be optimized by index(sex, rating) because the high offset requires them to spend most of their time scanning a lot of data that they will then throw away.
mysql> SELECT <cols> FROM profiles WHERE sex='M' ORDER BY rating LIMIT 100000, 10;
While a deferred join helps minimize the amount of work MySQL must do gathering data that it will only throw away.
SELECT <cols> FROM profiles INNER JOIN (
SELECT <primary key cols> FROM profiles
WHERE x.sex='M' ORDER BY rating LIMIT 100000, 10
) AS x USING(<primary key cols>);
Why a deferred join will minimize the amount of gathered data.
The example you presented assumes that InnoDB is used. Let's say that the PRIMARY KEY is just id.
INDEX(sex, rating)
is a "secondary key". Every secondary key (in InnoDB) includes the PK implicitly, so it is really an ordered list of (sex, rating, id) values. To get to the "data" (<cols>), it uses id to drill down the PK BTree (which contains the data, too) to find the record.
Fast Case: Hence,
SELECT id FROM profiles
WHERE x.sex='M' ORDER BY rating LIMIT 100000, 10
will do a "range scan" of 100010 'rows' in the index. This will be quite efficient for I/O, since all the information is consecutive, and nothing is wasted. (No, it is not smart enough to jump over 100000 rows; that would be quite messy, especially when you factor in the transaction_isolation_mode.) Those 100010 rows probably fit in about 1000 blocks of the index. Then it gets the 10 values of id.
With those 10 ids, it can do 10 joins ("NLJ" = "Nested Loop Join"). It is rather likely that the 10 rows are scattered around the table, possibly requiring 10 hits to the disk.
Let's "count the disk hits" (ignoring non-leaf nodes in the BTrees, which are likely to be cached anyway): 1000 + 10 = 1010. On ordinary disks, this might take 10 seconds.
Slow Case: Now let's look at the original query (SELECT <cols> FROM profiles WHERE sex='M' ORDER BY rating LIMIT 100000, 10;). Let's continue to assume INDEX(sex, rating) plus the implicit id on the end.
As before, it will index scan through the 100010 rows (est. 1000 disk hits). But as it goes, it is too dumb to do what was done above. It will reach over into the data to get the <cols>. This often (depending on caching) requires a random disk hit. This could be upwards of 100010 disk hits (if the table is huge and caching is not very useful).
Again, 100000 are tossed and 10 are delivered. Total 'cost': 100010 disk hits (worst case), which might take 17 minutes.
Keep in mind that there are 3 editions of High performance MySQL; they were written over the past 13 or so years. You are probably using a much newer version of MySQL than they covered. I do not happen to know if the optimizer has gotten any smarter in this area. These, if available to you, may give clues:
EXPLAIN FORMAT=JSON SELECT ...;
OPTIMIZER TRACE...
My favorite "Handler" trick for studying how things work may be helpful:
FLUSH STATUS;
SELECT ...
SHOW SESSION STATUS LIKE 'Handler%'.
You are likely to see numbers like 100000 and 10, or small multiples of such. But, keep in mind that a fast range scan of the index counts as 1 per row, and so does a slow random disk hit for a big set of <cols>.
Overview: To make this technique work, the subquery need a "covering" index, with the columns correctly ordered.
"Covering" means that (sex, rating, id) contains all the columns touched. (We are assuming that <cols> contains other columns, perhaps bulky ones that won't work in an INDEX.)
"Correct" ordering of the columns: The columns are in just the right order to get all the way through the query. (See also my cookbook.)
First come any WHERE columns compared with = to constants. (sex)
Then comes the entire ORDER BY, in order. (rating)
Finally it is 'covering'. (id)
From the description below from official (https://dev.mysql.com/doc/refman/5.7/en/limit-optimization.html):
If you combine LIMIT row_count with ORDER BY, MySQL stops sorting as soon as it has found the first row_count rows of the sorted result, rather than sorting the entire result. If ordering is done by using an index, this is very fast. If a filesort must be done, all rows that match the query without the LIMIT clause are selected, and most or all of them are sorted, before the first row_count are found. After the initial rows have been found, MySQL does not sort any remainder of the result set.
We can see that they should have no difference.
But the percona suggest this, and give test data. But give no reason, I think there maybe exist some "bug" in mysql when deal with this kind of case. So we just regard this as a useful experience.
I have a table with 10 columns, Now I want to give the users an option to sort the data with any column they want. For example suppose a combo box with 7 items that each of them is a column of the table, now the user choose one item and get the data sorted by the chosen column.
Now what is the problem?
My table has 3M records, and if I sort the data with indexed column I have no problem but with a non index column it takes 3.5mins to sort!!!
What is the solution I am thinking about?
Add index to every column of table that is needed to be sort by! In my case I will have index on 8 columns!!!!
What is the problem of my solution?
Having a lot of index on columns may decrease the speed of INSERT/UPDATE queries! In my case the table is updated frequently (every second!!!!!)
What is your solution for this case?!
Read this for more details on optimization: http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html
In some cases, MySQL cannot use indexes to resolve the ORDER BY, although it still uses indexes to find the rows that match the WHERE clause. Using index for sorting often comes together with using index to find rows, however it can also be used just for sort for example if you’re just using ORDER BY without and where clauses on the table. In such case you would see “Index” type in EXPLAIN which correspond to scanning (potentially) complete table in the index order. It is very important to understand in which conditions index can be used to sort data together with restricting amount of rows.
Looking at the same index (A,B) things like ORDER BY A ; ORDER BY A,B ; ORDER BY A DESC, B DESC will be able to use full index for sorting (note MySQL may not select to use index for sort if you sort full table without a limit). However ORDER BY B or ORDER BY A, B DESC will not be able to use index because requested order does not line up with the order of data in BTREE. If you have both restriction and sorting things like this would work A=5 ORDER BY B ; A=5 ORDER BY B DESC; A>5 ORDER BY A ; A>5 ORDER BY A,B ; A>5 ORDER BY A DESC which again can be easily visualized as scanning a range in BTREE. Things like this however would not work A>5 ORDER BY B , A>5 ORDER BY A,B DESC or A IN (3,4) ORDER BY B – in these cases getting data in sorting form would require a bit more than simple range scan in the BTREE and MySQL decides to pass it on.
Option #1: If you are limited to MySQL there's no better option but create 8 indexes for the possible order columns. You're insert/update are going to suffer it for sure but no real visitor will wait for 3.5 minutes for a list to be sorted.
Tune #1: To make it a little faster you can create partial indexes instead of standard indexes which will use much less space (I assume some of these columns are varchar) and this means less writes, smaller footprint in memory. You just need to check the entropy for each column with the substring and make sure you still have distinction over 90%.
For example with a query like this:
> select count(distinct(substring(COLUMN, 1, 5))) as part_5, count(distinct(substring(COLUMN, 1, 10))) as part_10, count(distinct(substring(COLUMN, 1, 20))) as part_20, count(distinct(COLUMN)) as sum from TABLE;
+--------+---------+---------+---------+
| part_5 | part_10 | part_20 | sum |
+--------+---------+---------+---------+
| 892183 | 1996053 | 1996058 | 1996058 |
+--------+---------+---------+---------+
Tune #2: You can make you insert/update statements to execute in the background. The application won't be faster but the user experience is going to be much better.
Tune #3: Use bigger transactions if you can for the inserts/updates.
Option #2: You can try to use one of the search engines which have been built for this usage pattern (too). I would recommend Solr as I'm using it for a while with great satisfaction but I heard good about elastic search as well.
I know it's generally a bad idea to do queries like this:
SELECT * FROM `group_relations`
But when I just want the count, should I go for this query since that allows the table to change but still yields the same results.
SELECT COUNT(*) FROM `group_relations`
Or the more specfic
SELECT COUNT(`group_id`) FROM `group_relations`
I have a feeling the latter could potentially be faster, but are there any other things to consider?
Update: I am using InnoDB in this case, sorry for not being more specific.
If the column in question is NOT NULL, both of your queries are equivalent. When group_id contains null values,
select count(*)
will count all rows, whereas
select count(group_id)
will only count the rows where group_id is not null.
Also, some database systems, like MySQL employ an optimization when you ask for count(*) which makes such queries a bit faster than the specific one.
Personally, when just counting, I'm doing count(*) to be on the safe side with the nulls.
If I remember it right, in MYSQL COUNT(*) counts all rows, whereas COUNT(column_name) counts only the rows that have a non-NULL value in the given column.
COUNT(*) count all rows while COUNT(column_name) will count only rows without NULL values in the specified column.
Important to note in MySQL:
COUNT() is very fast on MyISAM tables for * or not-null columns, since the row count is cached. InnoDB has no row count caching, so there is no difference in performance for COUNT(*) or COUNT(column_name), regardless if the column can be null or not. You can read more on the differences on this post at the MySQL performance blog.
if you try SELECT COUNT(1) FROMgroup_relations it will be a bit faster because it will not try to retrieve information from your columns.
Edit: I just did some research and found out that this only happens in some db. In sqlserver it's the same to use 1 or *, but on oracle it's faster to use 1.
http://social.msdn.microsoft.com/forums/en-US/transactsql/thread/9367c580-087a-4fc1-bf88-91a51a4ee018/
Apparently there is no difference between them in mysql, like sqlserver the parser appears to change the query to select(1). Sorry if I mislead you in some way.
I was curious about this myself. It's all fine to read documentation and theoretical answers, but I like to balance those with empirical evidence.
I have a MySQL table (InnoDB) that has 5,607,997 records in it. The table is in my own private sandbox, so I know the contents are static and nobody else is using the server. I think this effectively removes all outside affects on performance. I have a table with an auto_increment Primary Key field (Id) that I know will never be null that I will use for my where clause test (WHERE Id IS NOT NULL).
The only other possible glitch I see in running tests is the cache. The first time a query is run will always be slower than subsequent queries that use the same indexes. I'll refer to that below as the cache Seeding call. Just to mix it up a little I ran it with a where clause I know will always evaluate to true regardless of any data (TRUE = TRUE).
That said here are my results:
QueryType
| w/o WHERE | where id is not null | where true=true
COUNT()
| 9 min 30.13 sec ++ | 6 min 16.68 sec ++ | 2 min 21.80 sec ++
| 6 min 13.34 sec | 1 min 36.02 sec | 2 min 0.11 sec
| 6 min 10.06 se | 1 min 33.47 sec | 1 min 50.54 sec
COUNT(Id)
| 5 min 59.87 sec | 1 min 34.47 sec | 2 min 3.96 sec
| 5 min 44.95 sec | 1 min 13.09 sec | 2 min 6.48 sec
COUNT(1)
| 6 min 49.64 sec | 2 min 0.80 sec | 2 min 11.64 sec
| 6 min 31.64 sec | 1 min 41.19 sec | 1 min 43.51 sec
++This is considered the cache Seeding call. It is expected to be slower than the rest.
I'd say the results speak for themselves. COUNT(Id) usually edges out the others. Adding a Where clause dramatically decreases the access time even if it's a clause you know will evaluate to true. The sweet spot appears to be COUNT(Id)... WHERE Id IS NOT NULL.
I would love to see other peoples' results, perhaps with smaller tables or with where clauses against different fields than the field you're counting. I'm sure there are other variations I haven't taken into account.
Seek Alternatives
As you've seen, when tables grow large, COUNT queries get slow. I think the most important thing is to consider the nature of the problem you're trying to solve. For example, many developers use COUNT queries when generating pagination for large sets of records in order to determine the total number of pages in the result set.
Knowing that COUNT queries will grow slow, you could consider an alternative way to display pagination controls that simply allows you to side-step the slow query. Google's pagination is an excellent example.
Denormalize
If you absolutely must know the number of records matching a specific count, consider the classic technique of data denormalization. Instead of counting the number of rows at lookup time, consider incrementing a counter on record insertion, and decrementing that counter on record deletion.
If you decide to do this, consider using idempotent, transactional operations to keep those denormalized values in synch.
BEGIN TRANSACTION;
INSERT INTO `group_relations` (`group_id`) VALUES (1);
UPDATE `group_relations_count` SET `count` = `count` + 1;
COMMIT;
Alternatively, you could use database triggers if your RDBMS supports them.
Depending on your architecture, it might make sense to use a caching layer like memcached to store, increment and decrement the denormalized value, and simply fall through to the slow COUNT query when the cache key is missing. This can reduce overall write-contention if you have very volatile data, though in cases like this, you'll want to consider solutions to the dog-pile effect.
MySQL ISAM tables should have optimisation for COUNT(*), skipping full table scan.
An asterisk in COUNT has no bearing with asterisk for selecting all fields of table. It's pure rubbish to say that COUNT(*) is slower than COUNT(field)
I intuit that select COUNT(*) is faster than select COUNT(field). If the RDBMS detected that you specify "*" on COUNT instead of field, it doesn't need to evaluate anything to increment count. Whereas if you specify field on COUNT, the RDBMS will always evaluate if your field is null or not to count it.
But if your field is nullable, specify the field in COUNT.
COUNT(*) facts and myths:
MYTH: "InnoDB doesn't handle count(*) queries well":
Most count(*) queries are executed same way by all storage engines if you have a WHERE clause, otherwise you InnoDB will have to perform a full table scan.
FACT: InnoDB doesn't optimize count(*) queries without the where clause
It is best to count by an indexed column such as a primary key.
SELECT COUNT(`group_id`) FROM `group_relations`
It should depend on what you are actually trying to achieve as Sebastian has already said, i.e. make your intentions clear! If you are just counting the rows then go for the COUNT(*), or counting a single column go for the COUNT(column).
It might be worth checking out your DB vendor too. Back when I used to use Informix it had an optimisation for COUNT(*) which had a query plan execution cost of 1 compared to counting single or mutliple columns which would result in a higher figure
if you try SELECT COUNT(1) FROM group_relations it will be a bit faster because it will not try to retrieve information from your columns.
COUNT(1) used to be faster than COUNT(*), but that's not true anymore, since modern DBMS are smart enough to know that you don't wanna know about columns
The advice I got from MySQL about things like this is that, in general, trying to optimize a query based on tricks like this can be a curse in the long run. There are examples over MySQL's history where somebody's high-performance technique that relies on how the optimizer works ends up being the bottleneck in the next release.
Write the query that answers the question you're asking -- if you want a count of all rows, use COUNT(*). If you want a count of non-null columns, use COUNT(col) WHERE col IS NOT NULL. Index appropriately, and leave the optimization to the optimizer. Trying to make your own query-level optimizations can sometimes make the built-in optimizer less effective.
That said, there are things you can do in a query to make it easier for the optimizer to speed it up, but I don't believe COUNT is one of them.
Edit: The statistics in the answer above are interesting, though. I'm not sure whether there is actually something at work in the optimizer in this case. I'm just talking about query-level optimizations in general.
I know it's generally a bad idea to do
queries like this:
SELECT * FROM `group_relations`
But when I just want the count, should
I go for this query since that allows
the table to change but still yields
the same results.
SELECT COUNT(*) FROM `group_relations`
As your question implies, the reason SELECT * is ill-advised is that changes to the table could require changes in your code. That doesn't apply to COUNT(*). It's pretty rare to want the specialized behavior that SELECT COUNT('group_id') gives you - typically you want to know the number of records. That's what COUNT(*) is for, so use it.