In a MySQL database I have the following table:
id int Primary key
timestamp timestamp
tpid varchar
tpidno int
serialnumber int
command varchar
sequence int
startTime varchar
endTime varchar
PosData varchar
...
I also have 3 secondary indices:
tpid,tpidno
serialnumber
command
The table contains ~2.5M rows and is about 500MB.
Although I have complex queries that run fast, I see a long delay on these two simple queries:
Select id, sequence, PosData
From myTable
Where serialNumber = 130541
and command = "myCommand"
and startTime = "20140106194300"
and endtime = "20140106200000"
(~4.4sec)
Select id
From myTable
Where serialNumber = 130541
and command = 'myCommand'
and sequence = 128
(~4.5sec)
Would adding more indexes, such as
serialnumber, command
command, sequence
or
serialnumber, command, sequence
speed up the queries?
For the first query, could the data types of startTime and endTime be the problem? Would it be better if they were int instead of varchar?
Any other suggestions?
A single index on (serialnumber, command) will definitely improve performance for those two queries. You could also add further columns to it to make the queries even faster. However, the best choice of additional columns depends on the data distribution and on which of the two statements is executed more often. It might not even be worth adding them if the first two columns are already very selective.
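A minimal sketch of the suggested index (using the table and column names from the question; the index names are made up):
ALTER TABLE myTable ADD INDEX idx_serial_command (serialnumber, command);
-- optional wider variant, depending on which of the two queries runs more often:
-- ALTER TABLE myTable ADD INDEX idx_serial_command_seq (serialnumber, command, sequence);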
The data types for startTime and endTime are unfortunate, to say the least. Proper types will improve performance anywhere from "a little" to "a lot", depending on your SQL. The SQL above is in the "a little" range.
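A sketch of the conversion (assuming the columns always hold values in the yyyymmddhhmmss format shown in the query, which MySQL can convert to DATETIME in place):
ALTER TABLE myTable
  MODIFY startTime DATETIME,
  MODIFY endTime DATETIME;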
Some refs:
How multi-column indexes work
Possible problems when using improper types (ex: varchar instead of numeric types)
For example: I have some GPS devices that send info to my database every second,
so 1 device creates 1 row per second in my MySQL database with these columns (8):
id=12341 date=22.02.2018 time=22:40
latitude=22.236558789 longitude=78.9654582 deviceID=24 name=device-name someinfo=asdadadasd
So for 1 minute it creates 60 rows, and for 24 hours it creates 86400 rows,
and for 1 month (31 days) 2678400 rows.
So 1 device creates about 2.6 million rows per month in my db table (records are deleted every month),
and if there are more devices it will be 2.6 million * number of devices.
So my questions are:
Question 1: If I make a search like this from PHP (just for the current day and for 1 device),
SELECT * FROM TABLE WHERE date='22.02.2018' AND deviceID= '24'
the maximum possible result is 86400 rows. Will that overload my server too much?
Question 2: If I limit it to 5 hours (18000 rows), will that be a problem for the database, or will it load the server like the first example, or less?
SELECT * FROM TABLE WHERE date='22.02.2018' AND deviceID= '24' LIMIT 18000
Question 3: If I show just 1 result from the db, will it overload the server?
SELECT * FROM TABLE WHERE date='22.02.2018' AND deviceID= '24' LIMIT 1
Does that mean that if I have millions of rows, returning 1000 rows loads the server the same as showing just 1 result?
Millions of rows is not a problem; this is what SQL databases are designed to handle, provided you have a well-designed schema and good indexes.
Use proper types
Instead of storing your dates and times as separate strings, store them either as a single datetime or as separate date and time types. See the indexing section below for more about which one to use. This is more compact, allows indexing and faster sorting, and makes the date and time functions available without conversions.
Similarly, be sure to use an appropriate numeric type for latitude and longitude. You'll probably want to use numeric to ensure precision.
Since you're going to be storing billions of rows, be sure to use a bigint for your primary key. A regular int can only go up to about 2 billion.
Move repeated data into another table.
Instead of storing information about the device in every row, store that in a separate table and keep only the device's ID in your log. This cuts down on storage size and eliminates mistakes due to data duplication. Be sure to declare the device ID as a foreign key; this provides referential integrity and an index.
Add indexes
Indexes are what allow a database to search through millions or billions of rows very, very efficiently. Be sure there are indexes on the columns you query frequently, such as your timestamp.
A lack of indexes on date and deviceID is likely why your queries are so slow. Without an index, MySQL has to look at every row in the table, known as a full table scan.
You can discover whether your queries are using indexes with explain.
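For example (a sketch; the gps_logs table and column names match the schema suggested further down):
explain select *
from gps_logs
where gps_device_id = 24
  and '2018-02-22 00:00:00' <= created_at and created_at < '2018-02-23 00:00:00';
-- type: ALL with key: NULL means a full table scan; ref/range with a key means an index is used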
datetime or time + date?
Normally it's best to store your date and time in a single column, conventionally called created_at. Then you can use date to get just the date part like so.
select *
from gps_logs
where date(created_at) = '2018-07-14'
There's a problem, though, and the problem is how indexes work... or don't. Because of the function call, where date(created_at) = '2018-07-14' will not use an index. MySQL has to run date(created_at) on every single row, which means a performance-killing full table scan.
You can work around this by working with just the datetime column. This will use an index and be efficient.
select *
from gps_logs
where '2018-07-14 00:00:00' <= created_at and created_at < '2018-07-15 00:00:00'
Or you can split your single datetime column into date and time columns, but this introduces new problems. Querying ranges which cross a day boundary becomes difficult. Like maybe you want a day in a different time zone. It's easy with a single column.
select *
from gps_logs
where '2018-07-12 10:00:00' <= created_at and created_at < '2018-07-13 10:00:00'
But it's more involved with a separate date and time.
select *
from gps_logs
where (created_date = '2018-07-12' and created_time >= '10:00:00')
or (created_date = '2018-07-13' and created_time < '10:00:00');
Or you can switch to a database with expression indexes, like PostgreSQL. An expression index lets you index the result of an expression or function. And PostgreSQL does a lot of things better than MySQL. This is what I recommend.
Do as much work in SQL as possible.
For example, if you want to know how many log entries there are per device per day, rather than pulling all the rows out and calculating them yourself, you'd use group by to group them by device and day.
select gps_device_id, count(id) as num_entries, created_at::date as day
from gps_logs
group by gps_device_id, day;
 gps_device_id | num_entries |    day
---------------+-------------+------------
             1 |       29310 | 2018-07-12
             2 |       23923 | 2018-07-11
             2 |       23988 | 2018-07-12
With this much data, you will want to rely heavily on group by and the associated aggregate functions like sum, count, max, min and so on.
Avoid select *
If you must retrieve 86400 rows, fetching all that data from the database can be costly. You can speed this up significantly by fetching only the columns you need. This means using select only, the, specific, columns, you, need rather than select *.
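For example, if you only need the position and time for a device (a sketch using the schema below):
select created_at, latitude, longitude
from gps_logs
where gps_device_id = 24
  and '2018-02-22 00:00:00' <= created_at and created_at < '2018-02-23 00:00:00';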
Putting it all together.
In PostgreSQL
Your schema in PostgreSQL should look something like this.
create table gps_devices (
id serial primary key,
name text not null
-- any other columns about the devices
);
create table gps_logs (
id bigserial primary key,
gps_device_id int references gps_devices(id),
created_at timestamp not null default current_timestamp,
latitude numeric(12,9) not null,
longitude numeric(12,9) not null
);
create index timestamp_and_device on gps_logs(created_at, gps_device_id);
create index date_and_device on gps_logs((created_at::date), gps_device_id);
A query can generally only use one index per table. Since you'll be searching on the timestamp and device ID together a lot, timestamp_and_device indexes both the timestamp and the device ID.
date_and_device is the same idea, but it's an expression index on just the date part of the timestamp. This will make where created_at::date = '2018-07-12' and gps_device_id = 42 very efficient.
In MySQL
create table gps_devices (
id int primary key auto_increment,
name text not null
-- any other columns about the devices
);
create table gps_logs (
id bigint primary key auto_increment,
gps_device_id int references gps_devices(id),
foreign key (gps_device_id) references gps_devices(id),
created_at timestamp not null default current_timestamp,
latitude numeric(12,9) not null,
longitude numeric(12,9) not null
);
create index timestamp_and_device on gps_logs(created_at, gps_device_id);
Very similar, but with no expression index. So you'll either need to always use a bare created_at in your where clauses, or switch to separate date and time types.
Just read your question; for me the answer is:
Create a separate table for latitude and longitude, make your device ID a foreign key in it, and save the coordinates there.
Without knowing the exact queries you want to run, I can only guess at the best structure. That said, you should aim for optimal types that use the minimum number of bytes per row. This should make your queries faster.
For example, you could use the structure below:
create table device (
id int primary key not null,
name varchar(20),
someinfo varchar(100)
);
create table location (
device_id int not null,
recorded_at timestamp not null,
latitude double not null, -- instead of varchar; maybe float?
longitude double not null, -- instead of varchar; maybe float?
foreign key (device_id) references device (id)
);
create index ix_loc_dev on location (device_id, recorded_at);
If you include the exact queries (naming the columns) we can create better indexes for them.
Since your query selectivity is probably poor, your queries may end up running full table scans. For that case I took it a step further and used the smallest possible data types for the columns, so the scans will be faster:
create table location (
device_id tinyint not null,
recorded_at timestamp not null,
latitude float not null,
longitude float not null,
foreign key (device_id) references device (id)
);
Can't really think of anything smaller than this.
The best thing I can recommend is to use a time-series database for storing and accessing time-series data. You can host any kind of time-series database engine locally; just put a little more resources into developing its access methods, or use a database specialized for telematics data like this.
Lately I discovered a performance issue in the following use case.
Originally I had a table "MyTable" with an indexed INT column "MyCode".
After a while I needed to change the table structure, converting the "MyCode" column to VARCHAR (the index on the column was preserved):
ALTER TABLE MyTable CHANGE MyCode MyCode VARCHAR(250) DEFAULT NULL
Then I experienced unexpected latency; queries were being run like:
SELECT * FROM MyTable where MyCode = 1234
This query completely ignored the VARCHAR index on MyCode; my impression was that it was full scanning the table.
Converting the query to
SELECT * FROM MyTable where MyCode = "1234"
brought performance back to optimal, leveraging the VARCHAR index.
So the question is: how do you explain this, and how does MySQL actually treat indexing here? Or is there some DB setting to change to avoid this?
int_col = 1234 -- no problem; same type
char_col = "1234" -- no problem; same type
int_col = "1234" -- string is converted to number, then no problem
char_col = 1234 -- converting all the strings to numbers -- tedious
In the 4th case, the index is useless, so the Optimizer looks for some other way to perform the query. This is likely to lead to a "full table scan".
The main exception involves a "covering index", which is only slightly faster -- involving a "full index scan".
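A quick way to see the difference (a sketch against the MyTable / MyCode names from the question):
EXPLAIN SELECT * FROM MyTable WHERE MyCode = 1234;    -- typically type: ALL, key: NULL (full table scan)
EXPLAIN SELECT * FROM MyTable WHERE MyCode = '1234';  -- typically type: ref, using the index on MyCode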
I accepted Rick James's answer because he got the point, but I'd like to add more info after doing some testing.
The case in the question is: how does MySQL actually compare two values when the filtered column is of VARCHAR type and the value provided to filter by is not a string?
In that case you lose the opportunity to leverage the index on the VARCHAR column, with a dramatic loss of performance in a query that should otherwise be immediate and simple.
The explanation is that, when given a value whose type differs from VARCHAR, MySQL performs a full table scan and, for every record, performs a CAST(varcharcol as providedvaluetype) on the field and compares the result with the provided value.
E.g.
having a VARCHAR column named "code" and filtering
SELECT * FROM table WHERE code=1234
will full scan every record, just as if you were doing
SELECT * FROM table WHERE CAST(code as UNSIGNED)=1234
Notice that if you test it against 0
SELECT * FROM table WHERE CAST(code as UNSIGNED)=0
you'll get back ALL the records whose string has no unsigned meaning for MySQL's CAST function (they all cast to 0).
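A quick illustration of why that happens (sketch):
SELECT CAST('abc' AS UNSIGNED);  -- returns 0 (with a truncation warning), so code = 0 matches every such row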
Currently, I have a MySQL table with columns that look something like this:
run_date DATE
name VARCHAR(10)
load INTEGER
sys_time TIME
rec_time TIME
valid TINYINT
The column valid is essentially a valid bit: 1 if this row is the latest value for this (run_date, name) pair, and 0 if not. To make insertions simpler, I wrote a stored procedure that first runs an UPDATE table_name SET valid = 0 WHERE run_date = X AND name = Y command and then inserts the new row.
The table reads are in such a way that I usually use only the valid = 1 rows, but I can't discard the invalid rows. Obviously, this schema also has no primary key.
Is there a better way to structure this data or the valid bit, so that I can speed up both inserts and searches? A bunch of indexes on different orders of columns gets large.
In all of the suggestions below, get rid of valid and the UPDATE of it. That is not scalable.
Plan A: At SELECT time, use 'groupwise max' code to locate the latest run_date, hence the "valid" entry.
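A sketch of the groupwise-max idea (assuming the question's columns, a hypothetical table name my_table, and that rec_time marks which row within a (run_date, name) group is the most recent; adjust to whatever actually marks recency):
SELECT t.*
FROM my_table AS t
JOIN (
    SELECT run_date, name, MAX(rec_time) AS rec_time
    FROM my_table
    GROUP BY run_date, name
) AS latest USING (run_date, name, rec_time);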
Plan B: Have two tables and change both when inserting: history, with PRIMARY KEY(name, run_date) and a simple INSERT statement; current, with PRIMARY KEY(name) and INSERT ... ON DUPLICATE KEY UPDATE. The "usual" SELECTs need only touch current.
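A sketch of Plan B's insert path (hypothetical table and column names based on the question's schema; `load` is backticked because it is a reserved word):
INSERT INTO history (name, run_date, `load`, sys_time, rec_time)
VALUES (?, ?, ?, ?, ?);

INSERT INTO current (name, run_date, `load`, sys_time, rec_time)
VALUES (?, ?, ?, ?, ?)
ON DUPLICATE KEY UPDATE
    run_date = VALUES(run_date),
    `load`   = VALUES(`load`),
    sys_time = VALUES(sys_time),
    rec_time = VALUES(rec_time);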
Another issue: TIME is limited to 838:59:59 and is intended to mean 'time of day', not 'elapsed time'. For the latter, use INT UNSIGNED (or some variant of INT). For formatting, you can use sec_to_time(). For example, sec_to_time(3601) -> 01:00:01.
I have created the table measurements as listed below.
This table is written to periodically and will rapidly grow to contain millions of rows after a few days.
On read: I only need the precise time of the measurement and its value (unix_epoch and value).
To improve performance, I have added the column date_from_epoch, which is the day extracted from unix_epoch (the measurement's precise time) in the format yyyymmdd. It should have good selectivity (after multiple days of measurements have been written to the table) and I am using it as the key for an index. On reads I am hoping to scan only the days for which I want measurements, and not all the days present in the table (example: after 10 days, if 1,000,000 rows are added each day, I am hoping to scan only 1,000,000 rows when I need data contained within one day, not 10,000,000).
I have also:
used innoDB for the engine
partitioned the table by hash into 10 files to help with I/O
made sure the type used in my query is the same as the column type (or did I get this verification wrong?).
Question:
I ran a test after measurements had been trickling into the measurements table for 2 days.
Using EXPLAIN, I see my read query does not use the index. Why is the query not using the index?
Table is created with:
CREATE TABLE measurements(
date_from_epoch INT UNSIGNED,
unix_epoch INT UNSIGNED,
application_name varchar(255),
environment varchar(255),
metric_name varchar(255),
host_name varchar(1024),
value FLOAT(38,3)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
PARTITION BY HASH(unix_epoch)
PARTITIONS 10;
CREATE TRIGGER write_epoch_day
BEFORE INSERT ON measurements
FOR EACH ROW
SET NEW.date_from_epoch = FROM_UNIXTIME(NEW.unix_epoch, '%Y%m%d');
ALTER TABLE measurements ADD INDEX (date_from_epoch);
The query is:
EXPLAIN SELECT * FROM measurements
WHERE date_from_epoch >= 20150615 AND date_from_epoch <= 20150615
AND unix_epoch >= 1434423478 AND unix_epoch <= 1434430678
AND BINARY application_name = 'all'
AND BINARY environment = 'prod'
AND BINARY metric_name = 'Internet availability'
AND (BINARY host_name = 'kitkat' )
ORDER BY unix_epoch ASC;
Explain gives:
id  select_type  table         type  possible_keys    key   key_len  ref   rows    Extra
--  -----------  ------------  ----  ---------------  ----  -------  ----  ------  ---------------------------
1   SIMPLE       measurements  ALL   date_from_epoch  NULL  NULL     NULL  118011  Using where; Using filesort
Thanks for reading and head-scratching!
There is an option to use FORCE INDEX in MySQL.
Refer to this for a better understanding.
Thanks Sashi!
I have modified the query to
EXPLAIN SELECT * FROM measurements FORCE INDEX (date_from_epoch)
WHERE date_from_epoch >= 20150615 AND date_from_epoch <= 20150615
AND unix_epoch >= 1434423478 AND unix_epoch <= 1434430678
AND BINARY application_name = 'all'
AND BINARY environment = 'prod'
AND BINARY metric_name = 'Internet availability'
AND BINARY host_name = 'kitkat'
ORDER BY unix_epoch ASC;
EXPLAIN still says "Using where; Using filesort", but the number of rows scanned is now down to 67,906 vs. the 118,011 scanned initially (which is great).
However, the number of rows for date_from_epoch = 20150615 is 113,182. I am now wondering why the number of rows scanned is not 113,182 (not that I want it to go up, but I would like to understand what MySQL did to optimize the execution even further).
A lot of things need fixing:
Don't use PARTITION BY HASH; it does not help.
Since you have a range across the partition key, it must touch all partitions. See EXPLAIN PARTITIONS SELECT ....
Don't bother with the extra date_from_epoch column and the Trigger; just do the comparisons on unix_epoch. (See the manual on the conversion routines needed, e.g. UNIX_TIMESTAMP(); there is a sketch after this list.)
Don't use BINARY. Instead, specify the columns as COLLATION utf8_bin. Performance will be much better.
Normalize (or turn into an ENUM) these fields: application_name, environment, metric_name, host_name. What you have is unnecessarily bulky for millions of rows. (I am assuming there are only a few distinct values for those fields.) The space savings will make the SELECT run much faster.
FLOAT(38, 3) has an extra (unnecessary?) rounding. Simply use FLOAT.
(After making the above changes) INDEX(application_name, environment, metric_name, host_name, unix_epoch) would be quite helpful, at least for that one query. And it will be significantly better than the INDEX you are asking about.
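A sketch of that index and the rewritten query (column names from the question; this assumes the name columns have already been shrunk or normalized as suggested above, since indexing four full-length utf8 varchar columns can exceed InnoDB's index key length limit; the index name is made up):
ALTER TABLE measurements
  ADD INDEX idx_lookup (application_name, environment, metric_name, host_name, unix_epoch);

SELECT * FROM measurements
WHERE application_name = 'all'
  AND environment = 'prod'
  AND metric_name = 'Internet availability'
  AND host_name = 'kitkat'
  AND unix_epoch >= 1434423478
  AND unix_epoch <= 1434430678
ORDER BY unix_epoch ASC;
-- the unix_epoch bounds can be produced with UNIX_TIMESTAMP('yyyy-mm-dd hh:mm:ss') if needed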
I use the following query frequently:
SELECT * FROM table WHERE Timestamp > [SomeTime] AND Timestamp < [SomeOtherTime] and publish = 1 and type = 2 order by Timestamp
I would like to optimize this query, and I am thinking about making timestamp part of the primary key for the clustered index. I think that if timestamp is part of the primary key, data inserted into the table will be written to disk sequentially by the timestamp field. I also think this would improve my query a lot, but I am not sure whether it would actually help.
The table has 3-4 million+ rows.
The timestamp field never changes.
I use MySQL 5.6.11.
Another point: if this does improve my query, is it better to use timestamp (4 bytes in MySQL 5.6) or datetime (5 bytes in MySQL 5.6)?
Four million rows isn't huge.
A one-byte difference between the data types datetime and timestamp is the last thing you should consider in choosing between those two data types. Review their specs.
Making a timestamp part of your primary key is a bad, bad idea. Think about reviewing what primary key means in a SQL database.
Put an index on your timestamp column. Get an execution plan, and paste that into your question. Determine your median query performance, and paste that into your question, too.
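A minimal sketch of that suggestion (`table` is the question's placeholder name, backticked because it is a reserved word; the index name is made up, and [SomeTime] / [SomeOtherTime] are the question's placeholders):
CREATE INDEX idx_timestamp ON `table` (`Timestamp`);

EXPLAIN SELECT * FROM `table`
WHERE `Timestamp` > [SomeTime] AND `Timestamp` < [SomeOtherTime]
  AND publish = 1 AND type = 2
ORDER BY `Timestamp`;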
Returning a single day's rows from an indexed, 4 million row table on my desktop computer takes 2ms. (It returns around 8000 rows.)
1) If the values of timestamp are unique, you can make it the primary key. If not, create an index on the timestamp column anyway, since you frequently use it in WHERE.
2) Using a BETWEEN clause looks more natural here. I suggest you use a BTREE index (the default index type), not HASH.
3) When the timestamp column is indexed, you don't need the ORDER BY - it is already sorted
(of course, only if your index is a BTREE, not a HASH).
4) An integer unix_timestamp is better than datetime from both the memory-usage side and the performance side: comparing dates is a more complex operation than comparing integers (see the sketch below).
Searching data on an indexed field takes O(log(rows)) tree lookups. Comparison of integers is O(1) and comparison of dates is O(date_string_length). So the difference is (number of tree lookups) * (comparison cost ratio) = (O(date_string_length) / O(1)) * O(log(rows)) = O(date_string_length) * O(log(rows)).
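A sketch of the integer-timestamp approach from point 4 (a hypothetical ts column holding epoch seconds; `table` and [SomeTime] / [SomeOtherTime] are the question's placeholders):
ALTER TABLE `table` ADD COLUMN ts INT UNSIGNED, ADD INDEX idx_ts (ts);

SELECT * FROM `table`
WHERE ts BETWEEN UNIX_TIMESTAMP([SomeTime]) AND UNIX_TIMESTAMP([SomeOtherTime])
  AND publish = 1 AND type = 2;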