I have around 420 million records in my table. There is only one index, on column colC of user_table. The query below returns around 1.5 million records based on colC, but somehow the index is not used and it takes 20 to 25 minutes to return the records.
select colA ,ColB , count(*) as count
from user_table
where colC >='2019-09-01 00:00:00'
and colC<'2019-09-30 23:59:59'
and colA in ("some static value")
and ColB in (17)
group by colA ,ColB;
But when I force the index, it starts getting used and returns the records in only 2 minutes (see the sketch below). My question is: why is MySQL not using the index by default when the fetch time is so much lower with it? I have recreated the index and run a repair, but nothing makes the index get used by default.
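A minimal sketch of the forced-index variant described above, using the colC index name from the table definition further down:
select colA, ColB, count(*) as count
from user_table force index (colC)
where colC >= '2019-09-01 00:00:00'
and colC < '2019-09-30 23:59:59'
and colA in ('some static value')
and ColB in (17)
group by colA, ColB;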
Another observation, for information: the same query (without force index) works fine for previous months, which have the same volume of data.
Update: the details asked for by Evert:
CREATE TABLE USER_TABLE (
id bigint(20) unsigned NOT NULL AUTO_INCREMENT,
COLA varchar(10) DEFAULT NULL,
COLB int(11) DEFAULT NULL,
COLC datetime DEFAULT NULL,
....
PRIMARY KEY (id),
KEY colA (COLA),
KEY colB (COLB),
KEY colC (COLC)
) ENGINE=MyISAM AUTO_INCREMENT=2328036072 DEFAULT CHARSET=latin1
For better performance you could try using a composite index based on the columns involved in your WHERE clause,
and try to change the IN clause into an inner join.
Assuming the content of your IN clause is a set of fixed values, you could use a UNION (or a new table containing the values you need).
E.g. using a UNION (you can do something similar if the IN clause is a subquery):
select user_table.colA ,ColB , count(*) as count
from user_table
INNER JOIN (
select 'FIXED1' colA
union
select 'FIXED2'
....
union
select 'FIXEDX'
) t on t.colA = user_table.colA
where colC >='2019-09-01 00:00:00'
and ColB = 17
group by colA ,ColB;
You could also add a composite index on table user_table on the columns
(colA, colB, colC), as sketched below.
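A minimal DDL sketch for that composite index (the index name is an assumption):
ALTER TABLE user_table ADD INDEX idx_colA_colB_colC (colA, colB, colC);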
Regarding the elements the MySQL query optimizer uses to decide which index to use: there are several aspects, and for each of them the query optimizer assigns a cost.
Among what you should take into consideration:
the columns involved in the WHERE clause
the size of the tables (and, not in your case, the size of the tables in a join)
an estimation of how many rows will be fetched (to decide whether to use an index, or simply scan the table)
whether the data types match between the columns in the join and WHERE clauses
the use of functions or data type conversions, including collation mismatches
the size of the index
the cardinality of the index
For each of these a cost is evaluated, and this leads to the choice of index.
In your case colC being a datetime could imply a data conversion (with respect to the literal values given as strings), and because of this the index is not chosen.
It is also for this reason that I suggested a composite index with the leftmost columns related to non-converted values.
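A quick way to see which index the optimizer actually picks, and its row estimate, is EXPLAIN (a sketch using the original query from the question):
EXPLAIN
select colA, ColB, count(*) as count
from user_table
where colC >= '2019-09-01 00:00:00'
and colC < '2019-09-30 23:59:59'
and colA in ('some static value')
and ColB in (17)
group by colA, ColB;
The key and rows columns of the output show the chosen index and the estimated number of rows it expects to examine.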
Indexes try to get used as best as possible. I can't guarantee it, but it SOUNDS like the engine is building a temporary index based on A and B to qualify the static values in your query. For 420+ million rows, that is just the time it takes to build such a temporary index. By forcing an index you are helping optimize the time it would otherwise spend.
Now, if you (and others) don't quite understand indexes, an index is a way of pre-grouping data to help the optimizer. When you have GROUP BY conditions, those components, where practical, should be part of the index, and TYPICALLY would be part of the criteria, as you have in your query.
select colA ,ColB , count(*) as count
from user_table
where colC >='2019-09-01 00:00:00'
and colC<'2019-09-30 23:59:59'
and colA in ("some static value")
and ColB in (17)
group by colA ,ColB;
Now, let's look at your index, which is only available on ColC. For scenario purposes, assume all records are grouped by day. Pretend each INDEX (single or compound) is stored in its own room. You have an index on just the date column C. In the room, you have 30 boxes (representing Sept 1 to Sept 30), not counting all the other boxes for other days. Now, you have to go through each day's box and look for all entries that have the ColA and ColB values you want. The stuff in each box is not sorted, so you have to look at every record. Now, do this for all 30 days of September.
Now, simulate the NEXT index, with boxes stored in another room. This room is a compound index based on (and in this order, to help optimize your query) columns A, B and C.
So now, you could have 100 entries for "A". You only care about ColA = "some static value", so you grab that one box.
Now, you open that box and see a bunch of smaller boxes... Oh... these are all the individual "Column B" records. The top of each smaller box shows its individual "B" value, so you find the one box with the value 17.
Finally, you open box B and look inside. Wow... the entries are all nicely sorted for you by date. So now you scroll quickly to find Sept 1 and pull all the entries up to Sept 30 that you are looking for.
Quickly getting to the source via an optimized index will help you in the long run. Having an index on
(colA, colB, colC)
will significantly help your query performance.
One final note. Since you are only querying for a single "A" and single "B" value, you would only get a single row back and would not need a group by clause (in this case).
Hope this helps you and others better understand how indexes work, individual vs compound (multi-column).
One additional advantage of a multi-column index: in a case like this, where all the queried columns are part of the index, the database does not have to go to the raw data pages to confirm the other columns. You are looking only at the values A, B and C, and all of these fields are part of the index, so it never has to go back to the raw data pages, where the actual rows are stored, to confirm that a row qualifies to be returned.
With a single-column index such as yours, it uses the index to find which records qualify (by date in this case). Then, for each record, it has to go to the raw data page holding the entire row (which could have 50 columns) just to confirm whether the A and B columns qualify, and discard it if not applicable. Then back to the index by date, then back to the raw data page to confirm its A and B... You can probably see how much more time all that going back and forth takes.
The second index already has "A", "B" and the pre-sorted date range of "C". Done without having to go to the raw data pages.
Related
I am using a MySQL database.
I have a table daily_price_history of stock values stored with the following fields. It has 11 million+ rows:
id
symbolName
symbolId
volume
high
low
open
datetime
close
So for each stock symbolName there are various daily stock values, and the data is now more than 11 million rows.
The following SQL tries to get the last 100 days of daily data for a set of 1500 symbols:
SELECT `daily_price_history`.`id`,
`daily_price_history`.`symbolId_id`,
`daily_price_history`.`volume`,
`daily_price_history`.`close`
FROM `daily_price_history`
WHERE (`daily_price_history`.`id` IN
(SELECT U0.`id`
FROM `daily_price_history` U0
WHERE (U0.`symbolName` = `daily_price_history`.`symbolName`
AND U0.`datetime` >= 1598471533546))
AND `daily_price_history`.`symbolName` IN (A, AA, ...... 1500 symbol names))
I have the table indexed on symbolName and also on datetime.
Getting ~150K rows of data (i.e. 1500 x 100) takes 20 secs.
I also have weekly_price_history and monthly_price_history tables, and when I run similar SQL against them, they take less time for the same number (~150K) of rows because they hold less data than the daily table.
weekly_price_history: getting 150K rows takes 3s. The total number of rows in it is 2.5 million.
monthly_price_history: getting 150K rows takes 1s. The total number of rows in it is 800K.
So how do I speed things up when the table is this large?
As a starter: I don't see the point of the subquery at all. Presumably, your query could filter directly in the where clause:
select id, symbolid_id, volume, close
from daily_price_history
where datetime >= 1598471533546 and symbolname in ('A', 'AA', ...)
Then, you want an index on (datetime, symbolname):
create index idx_daily_price_history
on daily_price_history(datetime, symbolname)
;
The first column of the index matches the predicate on datetime. It is not very likely, however, that the database will be able to use the index to filter symbolname against a large list of values.
An alternative would be to put the list of values in a table, say symbolnames.
create table symbolnames (
symbolname varchar(50) primary key
);
insert into symbolnames values ('A'), ('AA'), ...;
Then you can do:
select p.id, p.symbolid_id, p.volume, p.close
from daily_price_history p
inner join symbolnames s on s.symbolname = p.symbolname
where p.datetime >= 1598471533546
That should allow the database to use the above index. We can go one step further and add the 4 columns of the select clause to the index:
create index idx_daily_price_history_2
on daily_price_history(datetime, symbolname, id, symbolid_id, volume, close)
;
When you add INDEX(a,b), remove INDEX(a) as being no longer necessary.
Your dataset and query may be a case for using PARTITIONing.
PRIMARY KEY(symbolname, datetime)
PARTITION BY RANGE(datetime) ...
This will do "partition pruning": datetime >= 1598471533546. Then the PRIMARY KEY will do most of the rest of the work for symbolname in ('A', 'AA', ...).
Aim for about 50 partitions; the exact number does not matter. Too many partitions may hurt performance; too few won't provide effective pruning.
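A minimal sketch of what that could look like (the column types are assumptions based on the question, id is appended to the PRIMARY KEY only to keep it unique, and the partition boundaries are placeholder epoch-millisecond values):
CREATE TABLE daily_price_history (
  id          BIGINT NOT NULL,
  symbolName  VARCHAR(50) NOT NULL,
  symbolId    BIGINT,
  volume      BIGINT,
  high        DECIMAL(12,4),
  low         DECIMAL(12,4),
  `open`      DECIMAL(12,4),
  `close`     DECIMAL(12,4),
  `datetime`  BIGINT NOT NULL,  -- epoch milliseconds, as used in the query
  PRIMARY KEY (symbolName, `datetime`, id)
)
PARTITION BY RANGE (`datetime`) (
  PARTITION p2020_08 VALUES LESS THAN (1598918400000),
  PARTITION p2020_09 VALUES LESS THAN (1601510400000),
  PARTITION pmax     VALUES LESS THAN MAXVALUE
);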
Yes, get rid of the subquery as GMB suggests.
Meanwhile, it sounds like Django is getting in the way.
Some discussion of partitioning: http://mysql.rjweb.org/doc.php/partitionmaint
I have a very large table (20-30 million rows) that is completely overwritten each time it is updated by the system supplying the data, over which I have no control.
The table is not sorted in a particular order.
The rows in the table are unique, there is no subset of columns that I can be assured to have unique values.
Is there a way I can run a SELECT query followed by a DELETE query on this table, with a fixed limit, without triggering any expensive sorting/indexing/partitioning/comparison, while being certain that I do not delete a row not covered by the previous SELECT?
I think you're asking for:
SELECT * FROM MyTable WHERE x = 1 AND y = 3;
DELETE FROM MyTable WHERE NOT (x = 1 AND y = 3);
In other words, use NOT against the same search expression you used in the first query to get the complement of the set of rows. This should work for most expressions, unless some of your terms return NULL.
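If NULLs can occur in those columns, a hedged variant is to use MySQL's NULL-safe equality operator <=> so the two conditions stay exact complements of each other:
SELECT * FROM MyTable WHERE x <=> 1 AND y <=> 3;
DELETE FROM MyTable WHERE NOT (x <=> 1 AND y <=> 3);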
If there are no indexes, then both the SELECT and DELETE will incur a table-scan, but no sorting or temp tables.
Re your comment:
Right, unless you use ORDER BY, you aren't guaranteed anything about the order of the rows returned. Technically, the storage engine is free to return the rows in any arbitrary order.
In practice, you will find that InnoDB at least returns rows in a somewhat predictable order: it reads rows in some index order. Even if your table has no keys or indexes defined, every InnoDB table is stored as a clustered index, even if it has to generate an internal one called GEN_CLUST_INDEX behind the scenes. That will be the order in which InnoDB returns rows.
But you shouldn't rely on that. The internal implementation is not a contract, and it could change tomorrow.
Another suggestion I could offer:
CREATE TABLE MyTableBase (
id INT AUTO_INCREMENT PRIMARY KEY,
A INT,
B DATE,
C VARCHAR(10)
);
CREATE VIEW MyTable AS SELECT A, B, C FROM MyTableBase;
With a table and a view like above, your external process can believe it's overwriting the data in MyTable, but it will actually be stored in a base table that has an additional primary key column. This is what you can use to do your SELECT and DELETE statements, and order by the primary key column so you can control it properly.
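A hedged usage sketch under that setup: because MySQL allows ORDER BY and LIMIT on a single-table DELETE, you can select and then delete the same fixed-size slice by the hidden primary key (assuming no other writes happen between the two statements):
SELECT id, A, B, C FROM MyTableBase ORDER BY id LIMIT 1000;
DELETE FROM MyTableBase ORDER BY id LIMIT 1000;
Since id is the primary key, the ORDER BY walks the clustered index and does not trigger any extra sorting.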
I have a simple history table that I am developing a new lookup for. I am wondering what is the best index (if any) to add to this table so that the lookups are as fast as possible.
The history table is a simple set of records of actions taken. Each action has a type and an action date (and some other attributes). Every day a new set of action records is generated by the system.
The relevant pseudo-schema is:
TABLE history
id int,
type int,
action_date date
...
INDEX
id
...
Note: the table is not indexed on type or action_date.
The new lookup function is intended to retrieve all the records of a specific type that occurred on a specific action date.
My initial inclination is to define a compound key consisting of both the type and the action_date.
However in my case there will be many actions with the same type and date. Further, the actions will be roughly evenly distributed in number each day.
Given all of the above: (a) is an index worthwhile; and (b) if so, what is the preferred index(es)?
I am using MySQL, but I think my question is not specific to this RDBMS.
The first field of the index should be the one giving you the smallest dataset for the majority of queries after the condition is applied.
Depending on your business requirements, you may request a specific date or a specific date range (most likely a date range). So the date should be the last field in the index. Most likely you will always have a date condition.
A common answer is to have a (type, date) index, but you should consider just a date index if you ever query more than one type value in a query, or if you have just a few types (say fewer than 5) and they are not evenly distributed.
For example, if type 1 makes up 70% of the table, types 2, 3, 4, ... each make up less than a few percent, and you often query type 1, you are better off with a separate date index plus a type index (for the cases when you query types 2, 3, 4, ...), not a compound (type, date) index; see the sketch below.
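A minimal DDL sketch of that separate-index layout (index names are assumptions):
ALTER TABLE history
  ADD INDEX idx_action_date (action_date),
  ADD INDEX idx_type (type);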
INDEX(type, action_date), regardless of cardinality or distribution of either column. Doing so will minimize the number of 'rows' of the index's BTree that need to be looked at. (Yes, I am disagreeing with Sergiy's answer.)
Even WHERE type IN (2,3) AND action_date ... can use that index.
For checking against a date range of, say 2 weeks, I recommend this pattern:
AND action_date >= '2016-10-16'
AND action_date <  '2016-10-16' + INTERVAL 2 WEEK
A way to see how much "work" is needed for a query:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
The numbers presented will give you a feel for how many index (or data) rows need to be touched. This makes it easy to see which of two possible queries/indexes works better, even when the table is too small to get reliable timings.
Yes, an index is worthwhile. Especially if you search for a small subset of the table.
If your search would match 20% or more of the table (approximately), the MySQL optimizer decides that the index is more trouble than it's worth, and it'll do a table-scan even if the index is available.
If you search for one specific type value and one specific date value, an index on (type, date) or an index on (date, type) is a good choice. It doesn't matter much which column you list first.
If you search for multiple values of type or multiple dates, then the order of columns matters. Follow this guide:
The leftmost columns of the index should be the ones on which you do equality comparisons. An equality comparison is one that matches exactly one value (even if that value is found on many rows).
WHERE type = 2 AND date = '2016-10-19' -- both equality
The next column of the index can be part of a range comparison. A range comparison matches multiple values. For example, > or IN( ) or BETWEEN or !=.
WHERE type = 2 AND date > '2016-10-19' -- one equality, one range
Only one such column benefits from the index. If you have range comparisons on multiple columns, only the first of those columns in the index will use the index to support lookups; the subsequent column(s) will have to be checked against the matching rows "the hard way".
WHERE type IN (2, 3, 4) AND date > '2016-10-19' -- multiple range
If you sometimes search using a range condition on type and equality on date, you'll need to create a second index.
WHERE type IN (2, 3, 4) AND date = '2016-10-19' -- make index on (date, type)
The order of terms in your WHERE clause doesn't matter. The SQL query optimizer will figure that out and reorder them to match the right columns defined in an index.
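A hedged DDL sketch of the two indexes discussed in this answer (index names are assumptions):
ALTER TABLE history
  ADD INDEX idx_type_date (type, action_date),
  ADD INDEX idx_date_type (action_date, type);
The first serves equality on type combined with equality or a range on action_date; the second serves the case of equality on action_date combined with a range (such as IN) on type.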
I'm trying to optimize my MySQL DB so I can query it as quickly as possible.
It goes like this:
My DB consists of 1 table that has (for now) about 18 million rows - and growing rapidly.
This table has the following columns - idx, time, tag_id, x, y, z.
No column has any null values.
'idx' is an INT(11) column, AI and PK. Right now it's in ascending order.
'time' is a date-time column. it's also ascending. 50% of the 'time' values in the table are distinct (and the rest of the values will appear probably twice or 3 times at most).
'tag_id' is an INT(11) column. it's not ordered in any way, and there are between 30-100 different possible tag_id values that spread over the whole DB. It's also a foreign key with another table.
INSERT -
A new row is inserted into the table every 2-3 seconds. 'idx' is calculated by the server (AI). Since the 'time' column represents the time the row was inserted, every new 'time' value will be higher than or equal to the previous row's. All the other column values don't have any order.
SELECT -
here is an example of a typical query:
"select x, y, z, time from table where date(time) between '2014-08-01' and '2014-10-01' and tag_id = 123456"
So, 'time' and 'tag_id' are the only columns that appear in the WHERE part, and both of them will ALWAYS appear in the WHERE part of every query. 'x', 'y', 'z' and 'time' will always appear in the SELECT part; 'tag_id' might also appear in the SELECT part sometimes.
The queries will usually seek higher (more recent) times rather than older times, meaning that later rows in the table will be searched more.
INDEXES -
Right now, 'idx', being the PK, is the clustered ASC index. 'time' also has a non-clustered ASC index.
That's it. Considering all this data, a typical query returns results in around 30 seconds. I'm trying to lower this time. Any advice?
I'm thinking about changing one or both of the indexes from ASC to DESC (since the higher values are more popular in the search). If I change 'idx' to DESC it will physically reverse the whole table; if I change 'time' to DESC it will reverse the 'time' index tree. But since this is an 18 million row table, changes like this might take a long time for the server, so I want to be sure it's a good idea. The question is: if I reverse the order and a new row is inserted, will the server know to put it at the beginning of the table quickly, or will it search the table for the right place every time? And will putting a new row at the beginning of the table mean that some kind of data shifting will need to be done to the whole table every time?
Or maybe I just need a different indexing technique??
Any ideas you have are very welcome.. thanks!!
select x, y, z, time from table
where date(time) between '2014-08-01' and '2014-10-01' and tag_id = 123456
Putting a column inside a function call like date(time) spoils any chance of using an index for that column. You must use only a bare column for comparison, if you want to use an index.
So if you want to compare it to dates, you should store a DATE column. If you have a DATETIME column, you may have to use a search term like this:
WHERE `time` >= '2014-08-01 00:00:00' AND `time` < '2014-10-02 00:00:00' ...
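A sketch of the fully rewritten query under that approach (using the this_table name from the index example below, and the tag_id value from your example):
SELECT x, y, z, `time`
FROM `this_table`
WHERE tag_id = 123456
  AND `time` >= '2014-08-01 00:00:00'
  AND `time` <  '2014-10-02 00:00:00';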
Also, you should use multi-column indexes where you can. Put columns used in equality conditions first, then one column used in range conditions. For more on this rule, see my presentation How to Design Indexes, Really.
You may also benefit from adding columns that are not used for searching, so that the query can retrieve the columns from the index entry alone. Put these columns following the columns used for searching or sorting. This is called an index-only query.
So for this query, your index should be:
ALTER TABLE `this_table` ADD INDEX (tag_id, `time`, x, y, z);
Regarding ASC versus DESC, the syntax supports the option for different direction indexes, but in the two most popular storage engines used in MySQL, InnoDB and MyISAM, there is no difference. Either direction of sorting can use either type of index more or less equally well.
I have a table 'tbl' something like that:
ID bigint(20) - primary key, autoincrement
field1
field2
field3
That table has 600k+ rows.
Query:
SELECT * from tbl ORDER by ID LIMIT 600000, 1 takes 1.68 second
Query:
SELECT ID, field1 from tbl ORDER by ID LIMIT 600000, 1 takes 1.69 second
Query:
SELECT ID from tbl ORDER by ID LIMIT 600000, 1 takes 0.16 second
Query:
SELECT * from tbl WHERE ID = xxx takes 0.005 second
Those queries are tested in phpmyadmin.
And the result is that query 3 and query 4 together return the necessary data.
Query 1 does the same job but much slower...
This doesn't look right for me.
Could anyone give any advice?
P.S. I'm sorry for formatting.. I'm new to this site.
New test:
Q5 : CREATE TEMPORARY TABLE tmptable AS (SELECT ID FROM tbl WHERE ID LIMIT 600030, 30);
SELECT * FROM tbl WHERE ID IN (SELECT ID FROM tmptable); takes 0.38 sec
I still don't understand how it's possible. I recreated all indexes.. what else can I do with that table? Delete and refill it manually? :)
Query 1 looks at the table's primary key index, finds the correct 600,000 ids and their corresponding locations within the table, then goes to the table and fetches everything from those 600k locations.
Query 2 looks at the table's primary key index, finds the correct 600k ids and their corresponding locations within the table, then goes to the table and fetches whichever subset of fields are asked for from those 600k rows.
Query 3 looks at the table's primary key index, finds the correct 600k ids, and returns them. It doesn't need to look at the table at all.
Query 4 looks at the table's primary key index, finds the single entry requested, goes to the table, reads that single entry, and returns it.
Time-wise, let's build backwards:
(Q4) The table index allows lookup of a key (id) in O(log n) time, meaning every time the table doubles in size it only takes one extra step to find the key in the index*. If you have 1 million rows, then, it would only take ~20 steps to find it. A billion rows? 30 steps. The index entry includes data on where in the table to go to find the data for that row, so MySQL jumps to that spot in the table and reads the row. The time reported for this is almost entirely overhead.
(Q3) As I mentioned, the table index is very fast; this query finds the first entry and just traverses the tree until it has the requested number of rows. I'm sure I could calculate the precise number of steps it would take, but as a maximum we'll say 20 steps x 600k rows = 12M steps; since it's traversing a tree it would likely be more like 1M steps, but the precise number is largely irrelevant. The most important thing to realize here is that once MySQL has walked the index to pull the ids it needs, it has everything you asked for. There's no need to go look at the table. The time reported for this one is essentially the time it takes MySQL to walk the index.
(Q2) This begins with the same tree-walking as discussed for query 3, but while pulling the IDs it needs, MySQL also pulls their location within the table files. It then has to go to the table file (probably already cached/mmapped in memory), and for every entry it pulled, seek to the proper place in the table and get the fields requested out of those rows. The time reported for this query is the time it takes to walk the index (as in Q3) plus the time to visit every row specified in the index.
(Q1) This is identical to Q2 when all fields are specified. As the time is essentially identical to Q2, we can see that it doesn't really take measurably more time to pull more fields out of the database, any time there is dwarfed by crawling the index and seeking to the rows.
*: Most databases use an indexing data structure (B-trees for MySQL) that has a log base much higher than 2, meaning that instead of an extra step every time the table doubles, it's more like an extra step every time the table size goes up by a factor of hundreds to thousands. This means that instead of the 20-30 steps I stated in the example, it's more like 2-5.
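Putting this together: a hedged sketch of the "deferred join" pattern (similar in spirit to the Q5 test above), which pays the 600k-row offset only against the primary key index and then fetches the full rows for just the final IDs (the alias names are arbitrary):
SELECT t.*
FROM tbl AS t
JOIN (SELECT ID FROM tbl ORDER BY ID LIMIT 600000, 1) AS picked
  ON picked.ID = t.ID;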