I have created the table measurements as listed below.
This table is written to periodically and will rapidly grow to contain millions of rows after a few days.
On read: I only need the precise time of the measurement and its value (unix_epoch and value).
To improve performance, I have added the column date_from_epoch, which is the day extracted from unix_epoch (the measurement's precise time) in the format yyyymmdd. It should have good selectivity (once multiple days of measurements have been written to the table), and I am using it as the key for an index. On read, I am hoping to scan only the days for which I want measurements, not all the days present in the table (example: after 10 days, with 1,000,000 rows added each day, I am hoping to scan only 1,000,000 rows when I need data contained within one day, not 10,000,000).
I have also:
used InnoDB for the engine
partitioned the table by hash into 10 partitions to help with I/O
made sure the type used in my query is the same as the column type (or did I get this verification wrong?).
Question:
I ran a test after measurements had trickled into the measurements table for 2 days.
Using EXPLAIN, I see my read query does not use the index. Why is the query not using the index?
Table is created with:
CREATE TABLE measurements(
date_from_epoch INT UNSIGNED,
unix_epoch INT UNSIGNED,
application_name varchar(255),
environment varchar(255),
metric_name varchar(255),
host_name varchar(1024),
value FLOAT(38,3)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
PARTITION BY HASH(unix_epoch)
PARTITIONS 10;
CREATE TRIGGER write_epoch_day
BEFORE INSERT ON measurements
FOR EACH ROW
SET NEW.date_from_epoch = FROM_UNIXTIME(NEW.unix_epoch, '%Y%m%d');
ALTER TABLE measurements ADD INDEX (date_from_epoch);
The query is:
EXPLAIN SELECT * FROM measurements
WHERE date_from_epoch >= 20150615 AND date_from_epoch <= 20150615
AND unix_epoch >= 1434423478 AND unix_epoch <= 1434430678
AND BINARY application_name = 'all'
AND BINARY environment = 'prod'
AND BINARY metric_name = 'Internet availability'
AND (BINARY host_name = 'kitkat' )
ORDER BY unix_epoch ASC;
Explain gives:
id  select_type  table         type  possible_keys    key   key_len  ref   rows    Extra
--  -----------  ------------  ----  ---------------  ----  -------  ----  ------  ---------------------------
1   SIMPLE       measurements  ALL   date_from_epoch  NULL  NULL     NULL  118011  Using where; Using filesort
Thanks for reading and head-scratching!
There is an option to use FORCE INDEX in MySQL.
Refer to this for a better understanding.
Thanks Sashi!
I have modified the query to
EXPLAIN SELECT * FROM measurements FORCE INDEX (date_from_epoch)
WHERE date_from_epoch >= 20150615 AND date_from_epoch <= 20150615
AND unix_epoch >= 1434423478 AND unix_epoch <= 1434430678
AND BINARY application_name = 'all'
AND BINARY environment = 'prod'
AND BINARY metric_name = 'Internet availability'
AND BINARY host_name = 'kitkat'
ORDER BY unix_epoch ASC;
EXPLAIN still says "Using where; Using filesort", but the number of rows scanned is now down to 67,906 vs the 118,011 initially scanned (which is great).
However, the number of rows for date_from_epoch = 20150615 is 113,182, so I am now wondering why the number of rows scanned is not 113,182 (not that I want it to go up, but I would like to understand what MySQL did to optimize the execution even further).
A lot of things need fixing:
Don't use PARTITION BY HASH; it does not help.
Since you have a range across the partition key, it must touch all partitions. See EXPLAIN PARTITIONS SELECT ....
Don't bother with the extra date_from_epoch column and Trigger; just do comparisons on unix_epoch. (See the manual on the conversion routines needed.)
Don't use BINARY. Instead, specify the columns as COLLATION utf8_bin. Performance will be much better.
Normalize (or turn into an ENUM) these fields: application_name, environment, metric_name, host_name. What you have is unnecessarily bulky for millions of rows. (I am assuming there are only a few distinct values for those fields.) The space savings will make the SELECT run much faster.
FLOAT(38, 3) has an extra (unnecessary?) rounding. Simply use FLOAT.
(After making the above changes) INDEX(application_name, environment, metric_name, host_name, unix_epoch) would be quite helpful, at least for that one query. And it will be significantly better than the INDEX you are asking about.
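Putting those points together, a reworked table and query might look roughly like this (no partitioning, no date_from_epoch, no trigger, no BINARY); the VARCHAR sizes and ENUM values are only guesses and would have to match the real data:
CREATE TABLE measurements (
    unix_epoch INT UNSIGNED NOT NULL,
    application_name VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    environment ENUM('dev','test','prod') CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    metric_name VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    host_name VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
    value FLOAT,
    INDEX (application_name, environment, metric_name, host_name, unix_epoch)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- The read query then filters on the composite index and reads only what it needs:
SELECT unix_epoch, value
FROM measurements
WHERE application_name = 'all'
  AND environment = 'prod'
  AND metric_name = 'Internet availability'
  AND host_name = 'kitkat'
  AND unix_epoch BETWEEN 1434423478 AND 1434430678
ORDER BY unix_epoch;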
Hello
I want to configure monthly partitions with day-by-day subpartitions.
If the total number of subpartitions exceeds 64, the table is not created and I get this error:
(errno: 168 "Unknown (generic) error from engine")
Creating fewer than 64 succeeds.
I know that the maximum number of partitions (including subpartitions) that can be created is 8,192, so is there anything I missed?
Below is the log table.
create table detection_log
(
id bigint auto_increment,
detected_time datetime default '1970-01-01' not null,
malware_title varchar(255) null,
malware_category varchar(30) null,
user_name varchar(30) null,
department_path varchar(255) null,
PRIMARY KEY (detected_time, id),
INDEX `detection_log_id_uindex` (id),
INDEX `detection_log_malware_title_index` (malware_title),
INDEX `detection_log_malware_category_index` (malware_category),
INDEX `detection_log_user_name_index` (user_name),
INDEX `detection_log_department_path_index` (department_path)
);
SUBPARTITIONs provide no benefit that I know of.
HASH partitioning either provides no benefit or hurts performance.
So... Explain what you hoped to gain by partitioning; then we can discuss whether any type of partitioning is worth doing. Also, provide the likely SELECTs so we can discuss the optimal INDEXes. If you need a "two-dimensional" index, that might indicate a need for partitioning (but still not subpartitioning).
More
I see PRIMARY KEY(detected_time,id). This provides a very fast way to do
SELECT ...
WHERE detected_time BETWEEN ... AND ...
ORDER BY detected_time, id
In fact, it will probably be faster than if you also partition the table. (As a general rule it is useless to partition on the first part of the PK.)
If you need to do
SELECT ...
WHERE user_name = 'param'
AND detected_time BETWEEN ... AND ...
ORDER BY detected_time, id
Then this is optimal:
INDEX(user_name, detected_time, id)
Again, probably faster than any form of partitioning on any column(s).
And
A "point query" (WHERE key = 123) takes a few milliseconds more in a 1-billion-row table compared to a 1000-row table. Rarely is the difference important. The depth of the BTree (perhaps 5 levels vs 2 levels) is the main difference. If you PARTITION the table, you are removing perhaps 1 or 2 levels of the BTree, but replacing them with code to "prune" down to the desired partition. I claim that this tradeoff does not provide a performance benefit.
A "range query" is very nearly the same speed regardless of the table size. This is because the structure is actually a B+Tree, so it is very efficient to fetch the 'next' row.
Hence, the main goal in optimizing queries on a huge table is to take advantage of the characteristics of the B+Tree.
Pagination
SELECT log.detected_time, log.user_name, log.department_path,
log.malware_category, log.malware_title
FROM detection_log as log
JOIN
(
SELECT id
FROM detection_log
WHERE user_name = 'param'
ORDER BY detected_time DESC
LIMIT 25 OFFSET 1000
) as temp ON temp.id = log.id;
The good part: Finding ids, then fetching the data.
The slow part: Using OFFSET.
Have this composite index: INDEX(user_name, detected_time, id) in that order. Make another index for when you use department_path.
Instead of OFFSET, "remember where you left off". A blog specifically about that: http://mysql.rjweb.org/doc.php/pagination
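A minimal sketch of "remember where you left off", assuming the INDEX(user_name, detected_time, id) above; the placeholders are the detected_time and id of the last row on the previous page:
SELECT detected_time, user_name, department_path,
       malware_category, malware_title
FROM detection_log
WHERE user_name = 'param'
  AND (detected_time < ?                      -- last detected_time already shown
       OR (detected_time = ? AND id < ?))     -- tie-breaker on id
ORDER BY detected_time DESC, id DESC
LIMIT 25;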
Purging
Deleting after a year is an excellent use of PARTITIONing. Use PARTITION BY RANGE(TO_DAYS(detected_time)) and have either ~55 weekly or 15 monthly partitions. See http://mysql.rjweb.org/doc.php/partitionmaint for details. DROP PARTITION is immensely faster than DELETE. (This partitioning will not speed up SELECT.)
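Roughly, the purge setup could look like this (partition names and boundary dates are illustrative only; the PRIMARY KEY already contains detected_time, as MySQL requires):
ALTER TABLE detection_log
PARTITION BY RANGE (TO_DAYS(detected_time)) (
    PARTITION p2023_01 VALUES LESS THAN (TO_DAYS('2023-02-01')),
    PARTITION p2023_02 VALUES LESS THAN (TO_DAYS('2023-03-01')),
    -- ... one partition per month, plus a catch-all for future rows:
    PARTITION pfuture VALUES LESS THAN MAXVALUE
);

-- Each month, drop the oldest partition instead of running a huge DELETE:
ALTER TABLE detection_log DROP PARTITION p2023_01;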
I have the following query that runs forever, and I am looking to see if there is any way that I can optimise it. This runs on a table that has 1,406,480 rows of data in total; apart from the Filename and Ref_No columns, the ID and End_Date have both been indexed.
My Query:
INSERT INTO UniqueIDs
(
SELECT
T1.ID
FROM
master_table T1
LEFT JOIN
master_table T2
ON
(
T1.Ref_No = T2.Ref_No
AND
T1.End_Date = T2.End_Date
AND
T1.Filename = T2.Filename
AND
T1.ID > T2.ID
)
WHERE T2.ID IS NULL
AND
LENGTH(T1.Ref_No) BETWEEN 5 AND 10
)
;
Explain Results:
The reason for not indexing the Ref_No is that this is a text column and therefore I get a BLOB/TEXT error when I try and index this column.
Would really appreciate if somebody could advise on how I can quicken this query.
Thanks
Thanks to Bill in regards to multi column indexes I have managed to make some headway. I first ran this code:
CREATE INDEX I_DELETE_DUPS ON master_table(id, End_Date);
I then added a new column to show the length of the Ref_No but had to change it from the query Bill mentioned as my version of MySQL is 5.5. So I ran it in 3 steps:
ALTER TABLE master_table
ADD COLUMN Ref_No_length SMALLINT UNSIGNED;
UPDATE master_table SET Ref_No_length = LENGTH(Ref_No);
ALTER TABLE master_table ADD INDEX (Ref_No_length);
Last step was to change my insert query with the where clause for the length. This was changed to:
AND t1.Ref_No_length between 5 and 10;
I then ran this query and within 15 minutes I had 280k IDs inserted into my UniqueIDs table. I then changed my insert script to see if I could include more lengths by doing the following:
AND t1.Ref_No_length IN (5,6,7,8,9,10,13);
This was to bring in the values where the length was also equal to 13. This query took a lot longer, 2 hours 50 minutes to be precise, but the additional ask of looking for all rows with a length of 13 gave me an extra 700k unique IDs.
I am still looking at ways to optimise the query with the IN clause, but this is a big improvement over the original query, which kept running for 24 hours. So thank you so much, Bill.
For the JOIN, you should have a multi-column index on (Ref_No, End_Date, Filename).
You can create a prefix index on a TEXT column like this:
ALTER TABLE master_table ADD INDEX (Ref_No(10));
But that won't help you search based on the LENGTH(). Indexing only helps searches on the indexed value itself, not on functions of the column.
In MySQL 5.7 or later, you can create a virtual column like this, with an index on the values calculated for the virtual column:
ALTER TABLE master_table
ADD COLUMN Ref_No_length SMALLINT UNSIGNED AS (LENGTH(Ref_No)),
ADD INDEX (Ref_No_length);
Then MySQL will recognize that your condition in your query is the same as the expression for the virtual column, and it will automatically use the index (exception: in my experience, this doesn't work for expressions using JSON functions).
But this is no guarantee that the index will help. If most of the rows match the condition of the length being between 5 and 10, the optimizer will not bother with the index. It may be more work to use the index than to do a table-scan.
the ID and End_Date have both been indexed.
You have PRIMARY KEY(id) and redundantly INDEX(id)? A PK is a unique key.
"have both been indexed" -- INDEX(a), INDEX(b) is not the same as INDEX(a,b) -- they have different uses. Read about "composite" indexes.
That query smells a lot like "group-wise" max done in a very slow way. (Alas, that may have come from the online docs.)
I have compiled the fastest ways to do that task here: http://mysql.rjweb.org/doc.php/groupwise_max (There are multiple versions, based on MySQL version and what issues your code can/cannot tolerate.)
Please provide SHOW CREATE TABLE. One important question: Is id the PRIMARY KEY?
This composite index may be useful:
(Filename, End_Date, Ref_No, -- first, in any order
ID) -- last
This, as others have noted, is unlikely to be helped by any index, hence T1 will need a full-table-scan:
AND LENGTH(T1.Ref_No) BETWEEN 5 AND 10
If Ref_No cannot be bigger than 191 characters, change it to a VARCHAR so that it can be used in an index. Oh, did I ask for SHOW CREATE TABLE? If you can't make it VARCHAR, then my recommended composite index is
INDEX(Filename, End_Date, ID)
I need to fetch the last 24 hours of data frequently, and this query runs frequently.
Since this scans many rows, running it frequently affects database performance.
MySQL's execution strategy picks the index on created_at, which returns approximately 100,000 rows; these rows are then scanned one by one to filter customer_id = 10, and my final result has 20,000 rows.
How can I optimize this query?
explain SELECT *
FROM `order`
WHERE customer_id = 10
and `created_at` >= NOW() - INTERVAL 1 DAY;
id : 1
select_type : SIMPLE
table : order
partitions : NULL
type : range
possible_keys : idx_customer_id, idx_order_created_at
key : idx_order_created_at
key_len : 5
ref : NULL
rows : 103357
filtered : 1.22
Extra : Using index condition; Using where
The first optimization I would do is on the access to the table:
create index ix1 on `order` (customer_id, created_at);
Then, if the query is still slow I would try appending the columns you are selecting to the index. If, for example, you are selecting the columns order_id, amount, and status:
create index ix1 on `order` (customer_id, created_at,
order_id, amount, status);
This second strategy could be beneficial, but you'll need to test it to find out what performance improvement it produces in your particular case.
The big improvement of this second strategy is that it walks the secondary index only, avoiding the walk back to the table's primary clustered index (which can be time-consuming).
Instead of two single-column indexes on customer_id and created_at, create a single composite index on (customer_id, created_at). This way the index engine can use BOTH parts of the WHERE clause instead of just one: jump right to the customer ID, then directly to the desired date range, then return the results. It SHOULD be very fast.
Additional Follow-up.
I hear your comment about having multiple indexes, but consider folding those columns into the main one, just after, such as
( customer_id, created_at, updated_at, completion_time )
Then your queries could always include some extra help for the index in the WHERE clause. For example (and I don't know your specific data): a record is created at some given point, and the updated and completion times will always be AFTER that. How long does it take (worst-case scenario) from creation to completion time... 2 days, 10 days, 90 days?
WHERE customer_id = ?
  AND created_at >= NOW() - INTERVAL 10 DAY
  AND updated_at >= NOW() - INTERVAL 1 DAY
Again, just an example, but if a customer has thousands of orders and a relatively quick turnaround time, you could jump to the most recent ones and then find those updated within the time period. Again, just an option: a single index vs 3, 4, or more indexes.
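If those extra columns really exist on the table, the single wide index sketched above would be created like this (the index name is arbitrary, and updated_at/completion_time are assumed, not confirmed, columns):
ALTER TABLE `order`
    ADD INDEX idx_cust_dates (customer_id, created_at, updated_at, completion_time);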
It seems you are dealing with a very quickly growing table; I would consider moving this frequent query to a cold table or a replica.
One more point: did you consider partitioning by customer_id? I don't quite understand the business logic behind querying customer_id = 10; if it's a multi-tenancy application, try partitioning.
For this query:
SELECT o.*
FROM `order` o
WHERE o.customer_id = 10 AND
created_at >= NOW() - INTERVAL 1 DAY;
My first inclination would be a composite index on (customer_id, created_at) -- as others have suggested.
But, you appear to have a lot of data and many inserts per day. That suggests partitioning plus an index. The appropriate partition would be on created_at, probably on a daily basis, along with an index on customer_id.
A typical query would access the two most recent partitions. Because your queries are focused on recent data, this also reduces the memory occupied by the index, which might be an overall benefit.
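A rough sketch of that layout, assuming created_at is a DATETIME (a TIMESTAMP column would need RANGE(UNIX_TIMESTAMP(created_at)) instead), and noting that MySQL requires the partitioning column to be part of every unique key, including the primary key:
ALTER TABLE `order` ADD INDEX idx_customer (customer_id);

-- Daily partitions; names and dates are illustrative, and new partitions
-- must be added (and old ones dropped) on a schedule.
ALTER TABLE `order`
PARTITION BY RANGE COLUMNS (created_at) (
    PARTITION p20230101 VALUES LESS THAN ('2023-01-02'),
    PARTITION p20230102 VALUES LESS THAN ('2023-01-03'),
    PARTITION pmax      VALUES LESS THAN (MAXVALUE)
);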
This technique should be better than all the other answers, though perhaps by only a small amount:
Instead of orders being indexed thus:
PRIMARY KEY(order_id) -- AUTO_INCREMENT
INDEX(customer_id, ...) -- created_at, and possibly others
do this to "cluster" the rows together:
PRIMARY KEY(customer_id, order_id)
INDEX (order_id) -- to keep AUTO_INCREMENT happy
Then you can optionally have more indexes starting with customer_id as needed. Or not.
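On an existing table, that change is a single (table-rebuilding) ALTER, something like the following; table and column names are assumed from the question:
-- Re-cluster the rows by customer; INDEX(order_id) keeps AUTO_INCREMENT valid.
ALTER TABLE `order`
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (customer_id, order_id),
    ADD INDEX (order_id);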
Another issue -- What will you do with 20K rows? That is a lot to feed to a client, especially of the human type. If you then munch on it, can't you make a more complex query that does more work, and returns fewer rows? That will probably be faster.
For example, I have some GPS devices that send info to my database every second,
so 1 device creates 1 row in the MySQL database with these columns (8):
id=12341 date=22.02.2018 time=22:40
latitude=22.236558789 longitude=78.9654582 deviceID=24 name=device-name someinfo=asdadadasd
So for 1 minute it creates 60 rows, for 24 hours it creates 86,400 rows,
and for 1 month (31 days), 2,678,400 rows.
So 1 device creates about 2.6 million rows per month in my db table (records are deleted every month),
and if there are more devices, it will be 2.6 million * the number of devices.
So my questions are:
Question 1: if I make a search like this from PHP (just for the current day and for 1 device),
SELECT * FROM TABLE WHERE date='22.02.2018' AND deviceID= '24'
the maximum possible result is 86,400 rows.
Will it overload my server too much?
Question 2: if I limit to 5 hours (18,000 rows), will that be a problem for the database, or will it load the server like the first example, or less?
SELECT * FROM TABLE WHERE date='22.02.2018' AND deviceID= '24' LIMIT 18000
Question 3: if I show just 1 result from the db, will it overload the server?
SELECT * FROM TABLE WHERE date='22.02.2018' AND deviceID= '24' LIMIT 1
Does it mean that whether I have millions of rows or 1,000 rows, the load on the server is the same if I show just 1 result?
Millions of rows is not a problem, this is what SQL databases are designed to handle, if you have a well designed schema and good indexes.
Use proper types
Instead of storing your dates and times as separate strings, store them either as a single datetime or as separate date and time types. See the indexing section below for more about which one to use. This is more compact, allows indexing, sorts faster, and makes date and time functions available without conversions.
Similarly, be sure to use the appropriate numeric type for latitude and longitude. You'll probably want to use numeric to ensure precision.
Since you're going to be storing billions of rows, be sure to use a bigint for your primary key. A regular int can only go up to about 2 billion.
Move repeated data into another table.
Instead of storing information about the device in every row, store that in a separate table. Then only store the device's ID in your log. This will cut down on your storage size, and eliminate mistakes due to data duplication. Be sure to declare the device ID as a foreign key, this will provide referential integrity and an index.
Add indexes
Indexes are what allow a database to search through millions or billions of rows very, very efficiently. Be sure there are indexes on the columns you search on frequently, such as your timestamp.
A lack of indexes on date and deviceID is likely why your queries are so slow. Without an index, MySQL has to look at every row in the table, which is known as a full table scan.
You can discover whether your queries are using indexes with explain.
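For example (using the gps_logs naming from later in this answer and the question's date/deviceID columns; the exact output columns depend on the MySQL version):
-- type = ALL means a full table scan; with an index covering (deviceID, date)
-- you should see ref/range and a much smaller "rows" estimate.
EXPLAIN SELECT * FROM gps_logs WHERE `date` = '22.02.2018' AND deviceID = '24';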
datetime or time + date?
Normally it's best to store your date and time in a single column, conventionally called created_at. Then you can use date to get just the date part like so.
select *
from gps_logs
where date(created_at) = '2018-07-14'
There's a problem. The problem is how indexes work... or don't. Because of the function call, where date(created_at) = '2018-07-14' will not use an index. MySQL will run date(created_at) on every single row. This means a performance-killing full table scan.
You can work around this by working with just the datetime column. This will use an index and be efficient.
select *
from gps_logs
where '2018-07-14 00:00:00' <= created_at and created_at < '2018-07-15 00:00:00'
Or you can split your single datetime column into date and time columns, but this introduces new problems. Querying ranges which cross a day boundary becomes difficult. Like maybe you want a day in a different time zone. It's easy with a single column.
select *
from gps_logs
where '2018-07-12 10:00:00' <= created_at and created_at < '2018-07-13 10:00:00'
But it's more involved with a separate date and time.
select *
from gps_logs
where (created_date = '2018-07-12' and created_time >= '10:00:00')
or (created_date = '2018-07-13' and created_time < '10:00:00');
Or you can switch to a database with expression indexes, like PostgreSQL. An expression index allows you to index the result of a function or expression. And PostgreSQL does a lot of things better than MySQL. This is what I recommend.
Do as much work in SQL as possible.
For example, if you want to know how many log entries there are per device per day, rather than pulling all the rows out and calculating them yourself, you'd use group by to group them by device and day.
select gps_device_id, count(id) as num_entries, created_at::date as day
from gps_logs
group by gps_device_id, day;
gps_device_id | num_entries | day
---------------+-------------+------------
1 | 29310 | 2018-07-12
2 | 23923 | 2018-07-11
2 | 23988 | 2018-07-12
With this much data, you will want to rely heavily on group by and the associated aggregate functions like sum, count, max, min and so on.
Avoid select *
If you must retrieve 86,400 rows, simply fetching all that data from the database can be costly. You can speed this up significantly by only fetching the columns you need. This means using select only, the, specific, columns, you, need rather than select *.
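For example, if only the time and position are needed (column names as in the schema below):
-- Fetch only the needed columns instead of SELECT *.
SELECT created_at, latitude, longitude
FROM gps_logs
WHERE gps_device_id = 42
  AND '2018-07-14 00:00:00' <= created_at AND created_at < '2018-07-15 00:00:00';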
Putting it all together.
In PostgreSQL
Your schema in PostgreSQL should look something like this.
create table gps_devices (
id serial primary key,
name text not null
-- any other columns about the devices
);
create table gps_logs (
id bigserial primary key,
gps_device_id int references gps_devices(id),
created_at timestamp not null default current_timestamp,
latitude numeric(12,9) not null,
longitude numeric(12,9) not null
);
create index timestamp_and_device on gps_logs(created_at, gps_device_id);
create index date_and_device on gps_logs((created_at::date), gps_device_id);
A query can generally only use one index per table. Since you'll be searching on the timestamp and device ID together a lot, timestamp_and_device indexes both of them.
date_and_device is the same thing, but it is an expression index on just the date part of the timestamp. This will make where created_at::date = '2018-07-12' and gps_device_id = 42 very efficient.
In MySQL
create table gps_devices (
id int primary key auto_increment,
name text not null
-- any other columns about the devices
);
create table gps_logs (
id bigint primary key auto_increment,
gps_device_id int references gps_devices(id),
foreign key (gps_device_id) references gps_devices(id),
created_at timestamp not null default current_timestamp,
latitude numeric(12,9) not null,
longitude numeric(12,9) not null
);
create index timestamp_and_device on gps_logs(created_at, gps_device_id);
Very similar, but no expression index. So you'll either need to always use a bare created_at in your where clauses, or switch to separate date and time types.
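One hedged MySQL alternative (5.7+, so not available everywhere): index a generated column that holds just the date, which gets close to the PostgreSQL expression index above:
-- MySQL 5.7+ sketch: a date column derived from created_at, indexed with
-- the device, so WHERE created_date = '2018-07-12' AND gps_device_id = 42
-- can use the index.
ALTER TABLE gps_logs
    ADD COLUMN created_date DATE AS (DATE(created_at)) STORED,
    ADD INDEX date_and_device (created_date, gps_device_id);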
I just read your question; for me the answer is:
Just create a separate table for latitude and longitude, make your ID a foreign key, and save them there.
Without knowing the exact queries you want to run, I can only guess at the best structure. Having said that, you should aim for the optimal types that use the minimum number of bytes per row. This should make your queries faster.
For example, you could use the structure below:
create table device (
id int primary key not null,
name varchar(20),
someinfo varchar(100)
);
create table location (
device_id int not null,
recorded_at timestamp not null,
latitude double not null, -- instead of varchar; maybe float?
longitude double not null, -- instead of varchar; maybe float?
foreign key (device_id) references device (id)
);
create index ix_loc_dev on location (device_id, recorded_at);
If you include the exact queries (naming the columns) we can create better indexes for them.
Since your query selectivity is probably bad, your queries may run full table scans. For this case I took it a step further and used the smallest possible data types for the columns, so the scans will be faster:
create table location (
device_id tinyint not null, -- note: device.id must also be tinyint for this foreign key to be accepted
recorded_at timestamp not null,
latitude float not null,
longitude float not null,
foreign key (device_id) references device (id)
);
Can't really think of anything smaller than this.
The best I can recommend is to use a time-series database for storing and accessing time-series data. You can host any kind of time-series database engine locally and put a little more resources into developing its access methods, or use a specialized database for telematics data like this.
I use following query frequently:
SELECT * FROM table WHERE Timestamp > [SomeTime] AND Timestamp < [SomeOtherTime] and publish = 1 and type = 2 order by Timestamp
I would like to optimize this query, and I am thinking about making Timestamp part of the primary key for the clustered index. I think that if Timestamp is part of the primary key, data inserted into the table will be written to disk sequentially by the Timestamp field. I also think this would improve my query a lot, but I am not sure if it would help.
The table has 3-4 million+ rows.
The Timestamp field never changes.
I use MySQL 5.6.11.
Another point: if this does improve my query, is it better to use TIMESTAMP (4 bytes in MySQL 5.6) or DATETIME (5 bytes in MySQL 5.6)?
Four million rows isn't huge.
A one-byte difference between the data types datetime and timestamp is the last thing you should consider in choosing between those two data types. Review their specs.
Making a timestamp part of your primary key is a bad, bad idea. Think about reviewing what primary key means in a SQL database.
Put an index on your timestamp column. Get an execution plan, and paste that into your question. Determine your median query performance, and paste that into your question, too.
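As a sketch of that advice (the index name is arbitrary; substitute your real bounds for the example datetimes):
ALTER TABLE `table` ADD INDEX idx_timestamp (`Timestamp`);

-- Re-check the plan; the range on Timestamp should now use the index.
EXPLAIN SELECT * FROM `table`
WHERE `Timestamp` > '2015-06-15 00:00:00'    -- [SomeTime]
  AND `Timestamp` < '2015-06-16 00:00:00'    -- [SomeOtherTime]
  AND publish = 1 AND type = 2
ORDER BY `Timestamp`;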
Returning a single day's rows from an indexed, 4 million row table on my desktop computer takes 2ms. (It returns around 8000 rows.)
1) If the values of Timestamp are unique, you can make it the primary key. If not, create an index on the Timestamp column anyway, since you use it frequently in WHERE.
2) Using a BETWEEN clause looks more natural here. I suggest you use a BTREE index (the default index type), not HASH.
3) When the Timestamp column is indexed, you may not need the ORDER BY; the data already comes back sorted
(of course, only if your index is BTREE, not HASH).
4) An integer Unix timestamp is better than DATETIME both on the memory-usage side and on the performance side: comparing dates is a more complex operation than comparing integers.
Searching data on an indexed field takes O(log(rows)) tree lookups. Comparison of integers is O(1) and comparison of dates is O(date_string_length). So the difference is (number of tree lookups) * (comparison-cost difference) = (O(date_string_length) / O(1)) * O(log(rows)) = O(date_string_length) * O(log(rows)).