What is the best thing for my scenario
I have a table with nearly 20,000,000 records that stores what users have done on the site:
id -> primary int 11 auto increment
user_id -> index int 11 not null
create_date -> ( no index yet ) date-time not null
It has other columns, but they seem irrelevant here.
I know I must put an index on create_date, but should I use a single-column index or a two-column index? And which column should come first in the two-column index, given the large number of records?
By the way, the query I'm using now looks like this:
select max(id) -- in here I'm selecting actions that users have done, after this date, since date is today
from table t
where
t.create_date >= '2014-12-29 00:00:00'
group by t.user_id
Could you edit your question to include the EXPLAIN output for your SELECT? Meanwhile, you can try the following:
Partition the table on your date column, create_date.
Build your index with the most restrictive column first. In your case, I think (create_date, user_id) will work better:
CREATE INDEX index_name
ON table_name ( create_date , user_id );
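A minimal sketch of the suggested two-column index, using Python's built-in sqlite3 as a stand-in for MySQL. The table name `actions`, the index name `idx_create_date_user`, and the sample rows are hypothetical. It only shows that the query returns the latest id per user on or after the cutoff; on the real table, MySQL's EXPLAIN is the way to confirm that the index is actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE actions ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " user_id INTEGER NOT NULL,"
    " create_date TEXT NOT NULL)"
)
# Date column first, then user_id, as suggested in the answer.
conn.execute("CREATE INDEX idx_create_date_user ON actions (create_date, user_id)")
conn.executemany(
    "INSERT INTO actions (user_id, create_date) VALUES (?, ?)",
    [
        (1, "2014-12-28 23:59:59"),  # before the cutoff, ignored
        (1, "2014-12-29 08:00:00"),
        (2, "2014-12-29 09:30:00"),
        (1, "2014-12-29 10:15:00"),
        (3, "2014-12-27 12:00:00"),  # before the cutoff, ignored
    ],
)

# Latest action id per user since the cutoff date.
latest = conn.execute(
    "SELECT user_id, MAX(id) FROM actions"
    " WHERE create_date >= ? GROUP BY user_id ORDER BY user_id",
    ("2014-12-29 00:00:00",),
).fetchall()
print(latest)
```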
I want to delete data that is more than 2 years old. The field name is Date and its type is varchar(255).
delete from <table_name> where <Field> like '%2022';
The query runs for a very long time, but no data is deleted.
I have checked and tried the query. You can try:
DELETE From <datatable> WHERE <date> LIKE '%2022';
DELETE From post WHERE date LIKE '%2022'; #Example
Could you provide the database or a screenshot? I tried the query and had no issue: https://www.db-fiddle.com/f/syhtgVyEcSPcHRXBXHLtor/0
If the primary key (probably id) and the date column are correlated, meaning a bigger id corresponds to a later date (the date column here is of type varchar; thanks to P.Salmon for pointing this out), then
I think you can delete by primary key (normally the id column). For example:
select id from table where date > '2020' order by id asc limit 1;
-- assume this id = 123456789; delete the rows that were created before this id was created
delete from table where id < 123456789;
If there is no correlation, I have some ideas:
Create a new column called created_at of type year, date, datetime, or timestamp (date or year will probably do) that stores the actual year or date. Use it to replace the varchar date column, probably create an index on created_at, and delete using the new column.
If there is an index on date (varchar), the leading % in the LIKE pattern prevents the server from using it, so the query is certainly a full table scan. Instead, you could enumerate every date, such as '01-01-2020', '01-02-2020', and delete the rows one date at a time with a script. That way, at least, you get to use the index.
If there are too many rows, say 10 years or more, is it possible to migrate only the data from the last 2 years to a new table and then drop the old table?
Write a script that fetches 10000 rows at a time, starting from the beginning of the primary key, deletes those that are more than 2 years old, and then fetches the next 10000:
last_id = 0
select * from table where id > last_id order by id asc limit 10000;
last_id = [last id of the query]
delete from table where id in (xxx);
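The batching idea above could be sketched roughly as follows. This uses Python's built-in sqlite3 as a stand-in for MySQL. The table name `post` is borrowed from the earlier example, and the 'DD-MM-YYYY' text format, the function name `delete_older_than`, and the sample rows are assumptions, not details from the question.

```python
import sqlite3
from datetime import date, datetime

def delete_older_than(conn, cutoff, batch_size=10000):
    """Walk the table in primary-key order and delete rows dated before cutoff."""
    last_id = 0
    deleted = 0
    while True:
        # Keyset pagination: always continue after the last id already seen.
        rows = conn.execute(
            "SELECT id, date FROM post WHERE id > ? ORDER BY id ASC LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        last_id = rows[-1][0]
        # Assumed text format: 'DD-MM-YYYY'.
        old_ids = [
            (row_id,)
            for row_id, text in rows
            if datetime.strptime(text, "%d-%m-%Y").date() < cutoff
        ]
        conn.executemany("DELETE FROM post WHERE id = ?", old_ids)
        conn.commit()  # one short transaction per batch
        deleted += len(old_ids)
    return deleted

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, date TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO post (date) VALUES (?)",
    [("15-03-2019",), ("01-01-2022",), ("30-06-2021",), ("02-02-2024",)],
)
removed = delete_older_than(conn, date(2022, 1, 1), batch_size=2)
remaining = [r[0] for r in conn.execute("SELECT date FROM post ORDER BY id")]
print(removed, remaining)
```

Committing after each batch keeps every transaction small, which is the main reason to delete in chunks instead of with one large statement.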
For example, I have some GPS devices that send data to my database every second,
so each device creates one row per second in a MySQL table with these 8 columns:
id=12341 date=22.02.2018 time=22:40
latitude=22.236558789 longitude=78.9654582 deviceID=24 name=device-name someinfo=asdadadasd
So in 1 minute it creates 60 rows, in 24 hours it creates 86400 rows,
and in 1 month (31 days) 2678400 rows.
So 1 device creates about 2.6 million rows per month in my table (records are deleted every month).
With more devices, that becomes 2.6 million * the number of devices.
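The volume estimate can be checked with a few lines of arithmetic, assuming one row per second per device as described. The function name `rows_per_month` and the fleet size of 50 devices are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # one row per second means one row per elapsed second

def rows_per_month(devices, days=31, rows_per_second=1):
    """Rows written in one month by a given number of devices."""
    return devices * days * SECONDS_PER_DAY * rows_per_second

print(SECONDS_PER_DAY)     # rows per device per day
print(rows_per_month(1))   # rows per device per 31-day month
print(rows_per_month(50))  # hypothetical fleet of 50 devices
```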
My questions are:
Question 1: If I run a search like this from PHP (just for the current day and for 1 device):
SELECT * FROM TABLE WHERE date='22.02.2018' AND deviceID= '24'
the maximum possible result is 86400 rows.
Will that overload my server?
Question 2: If I limit it to 5 hours (18000 rows), will that be a problem for the database? Will it load the server as much as the first example, or less?
SELECT * FROM TABLE WHERE date='22.02.2018' AND deviceID= '24' LIMIT 18000
Question 3: If I fetch just 1 result from the database, will it overload the server?
SELECT * FROM TABLE WHERE date='22.02.2018' AND deviceID= '24' LIMIT 1
Does it mean that, with millions of rows in the table, returning 1000 rows loads the server as much as returning just 1?
Millions of rows are not a problem. This is what SQL databases are designed to handle, provided you have a well-designed schema and good indexes.
Use proper types
Instead of storing your dates and times as separate strings, store them either as a single datetime or as separate date and time types. See indexing below for more about which one to use. Proper types are more compact, allow indexing and faster sorting, and make the date and time functions available without conversions.
Similarly, be sure to use the appropriate numeric type for latitude, and longitude. You'll probably want to use numeric to ensure precision.
Since you're going to be storing billions of rows, be sure to use a bigint for your primary key. A regular int can only go up to about 2 billion.
Move repeated data into another table.
Instead of storing information about the device in every row, store that in a separate table. Then only store the device's ID in your log. This will cut down on your storage size and eliminate mistakes due to data duplication. Be sure to declare the device ID as a foreign key; this provides referential integrity and an index.
Add indexes
Indexes are what allow a database to search through millions or billions of rows very, very efficiently. Be sure there are indexes on the columns you use frequently, such as your timestamp.
A lack of indexes on date and deviceID is likely why your queries are so slow. Without an index, MySQL has to look at every row in the table, known as a full table scan.
You can discover whether your queries are using indexes with explain.
datetime or time + date?
Normally it's best to store your date and time in a single column, conventionally called created_at. Then you can use date to get just the date part like so.
select *
from gps_logs
where date(created_at) = '2018-07-14'
There's a problem. The problem is how indexes work... or don't. Because of the function call, where date(created_at) = '2018-07-14' will not use an index. MySQL will run date(created_at) on every single row. This means a performance killing full table scan.
You can work around this by working with just the datetime column. This will use an index and be efficient.
select *
from gps_logs
where '2018-07-14 00:00:00' <= created_at and created_at < '2018-07-15 00:00:00'
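The difference between the two where clauses can be illustrated with Python's built-in sqlite3 as a stand-in. SQLite's planner is not MySQL's, so this is only a sketch of the general principle, and the index name `idx_created_at`, the helper `day_bounds`, and the sample rows are hypothetical. Both queries return the same rows, but only the bare-column range can be matched against the index.

```python
import sqlite3
from datetime import date, datetime, time, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

def day_bounds(day):
    """Return (start, end) strings for the half-open range covering one day."""
    start = datetime.combine(day, time.min)
    return start.strftime(FMT), (start + timedelta(days=1)).strftime(FMT)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gps_logs ("
    " id INTEGER PRIMARY KEY,"
    " gps_device_id INTEGER NOT NULL,"
    " created_at TEXT NOT NULL)"
)
conn.execute("CREATE INDEX idx_created_at ON gps_logs (created_at)")
conn.executemany(
    "INSERT INTO gps_logs (gps_device_id, created_at) VALUES (?, ?)",
    [
        (24, "2018-07-13 23:59:59"),
        (24, "2018-07-14 00:00:00"),
        (24, "2018-07-14 22:40:00"),
        (24, "2018-07-15 00:00:00"),
    ],
)

by_function = "SELECT id, gps_device_id FROM gps_logs WHERE date(created_at) = ?"
by_range = (
    "SELECT id, gps_device_id FROM gps_logs"
    " WHERE ? <= created_at AND created_at < ?"
)

def plan(sql, params):
    """Return the human-readable steps of SQLite's query plan."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]

start, end = day_bounds(date(2018, 7, 14))
ids_function = sorted(r[0] for r in conn.execute(by_function, ("2018-07-14",)))
ids_range = sorted(r[0] for r in conn.execute(by_range, (start, end)))

print(ids_function, ids_range)
print(plan(by_function, ("2018-07-14",)))  # full scan: the function hides the column
print(plan(by_range, (start, end)))        # index search on created_at
```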
Or you can split your single datetime column into date and time columns, but this introduces new problems. Querying ranges which cross a day boundary becomes difficult. Like maybe you want a day in a different time zone. It's easy with a single column.
select *
from gps_logs
where '2018-07-12 10:00:00' <= created_at and created_at < '2018-07-13 10:00:00'
But it's more involved with a separate date and time.
select *
from gps_logs
where (created_date = '2018-07-12' and created_time >= '10:00:00')
or (created_date = '2018-07-13' and created_time < '10:00:00');
Or you can switch to a database with expression indexes, like PostgreSQL. An expression index lets you index the result of a function or expression instead of the raw column value. And PostgreSQL does a lot of things better than MySQL. This is what I recommend.
Do as much work in SQL as possible.
For example, if you want to know how many log entries there are per device per day, rather than pulling all the rows out and calculating them yourself, you'd use group by to group them by device and day.
select gps_device_id, count(id) as num_entries, created_at::date as day
from gps_logs
group by gps_device_id, day;
 gps_device_id | num_entries |    day
---------------+-------------+------------
             1 |       29310 | 2018-07-12
             2 |       23923 | 2018-07-11
             2 |       23988 | 2018-07-12
With this much data, you will want to rely heavily on group by and the associated aggregate functions like sum, count, max, min and so on.
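As a small runnable illustration of letting the database do the counting, here is the same kind of per-device, per-day aggregation in Python's built-in sqlite3. This is a stand-in only: it uses SQLite's `date()` in place of the `::date` cast, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gps_logs ("
    " id INTEGER PRIMARY KEY,"
    " gps_device_id INTEGER NOT NULL,"
    " created_at TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO gps_logs (gps_device_id, created_at) VALUES (?, ?)",
    [
        (1, "2018-07-12 10:00:00"),
        (1, "2018-07-12 10:00:01"),
        (2, "2018-07-11 23:59:59"),
        (2, "2018-07-12 00:00:00"),
        (2, "2018-07-12 00:00:01"),
        (2, "2018-07-12 00:00:02"),
    ],
)

# The database returns one row per (device, day), not one row per log entry.
per_day = conn.execute(
    "SELECT gps_device_id, COUNT(id) AS num_entries, date(created_at) AS day"
    " FROM gps_logs"
    " GROUP BY gps_device_id, day"
    " ORDER BY gps_device_id, day"
).fetchall()
print(per_day)
```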
Avoid select *
If you must retrieve 86400 rows, simply fetching all that data from the database can be costly. You can speed this up significantly by only fetching the columns you need. This means using select only, the, specific, columns, you, need rather than select *.
Putting it all together.
In PostgreSQL
Your schema in PostgreSQL should look something like this.
create table gps_devices (
id serial primary key,
name text not null
-- any other columns about the devices
);
create table gps_logs (
id bigserial primary key,
gps_device_id int references gps_devices(id),
created_at timestamp not null default current_timestamp,
latitude numeric(12,9) not null,
longitude numeric(12,9) not null
);
create index timestamp_and_device on gps_logs(created_at, gps_device_id);
create index date_and_device on gps_logs((created_at::date), gps_device_id);
A query can generally only use one index per table. Since you'll be searching on the timestamp and device ID together a lot, timestamp_and_device combines both the timestamp and device ID in one index.
date_and_device is the same thing, but it's an expression index on just the date part of the timestamp. This will make where created_at::date = '2018-07-12' and gps_device_id = 42 very efficient.
In MySQL
create table gps_devices (
id int primary key auto_increment,
name text not null
-- any other columns about the devices
);
create table gps_logs (
id bigint primary key auto_increment,
gps_device_id int, -- MySQL parses but ignores an inline references clause
foreign key (gps_device_id) references gps_devices(id),
created_at timestamp not null default current_timestamp,
latitude numeric(12,9) not null,
longitude numeric(12,9) not null
);
create index timestamp_and_device on gps_logs(created_at, gps_device_id);
Very similar, but without the expression index on the date. So you'll either need to always use a bare created_at in your where clauses, or switch to separate date and time types.
I just read your question. For me, the answer is:
Create a separate table for latitude and longitude, use your ID as a foreign key, and save them there.
Without knowing the exact queries you want to run I can just guess the best structure. Having said that, you should aim for the optimal types that use the minimum number of bytes per row. This should make your queries faster.
For example, you could use the structure below:
create table device (
id int primary key not null,
name varchar(20),
someinfo varchar(100)
);
create table location (
device_id int not null,
recorded_at timestamp not null,
latitude double not null, -- instead of varchar; maybe float?
longitude double not null, -- instead of varchar; maybe float?
foreign key (device_id) references device (id)
);
create index ix_loc_dev on location (device_id, recorded_at);
If you include the exact queries (naming the columns) we can create better indexes for them.
Since your query selectivity is probably bad, your queries may run full table scans. For this case I took it a step further and used the smallest possible data types for the columns, so it will be faster:
create table location (
device_id tinyint not null, -- device.id must then also be tinyint, or the foreign key is rejected
recorded_at timestamp not null,
latitude float not null,
longitude float not null,
foreign key (device_id) references device (id)
);
Can't really think of anything smaller than this.
The best I can recommend is to use a time-series database for storing and accessing time-series data. You can host any kind of time-series database engine locally and put a little more effort into developing its access methods, or use a specialized database for telematics data like this.
I have a table with the following structure
CREATE TABLE rel_score (
user_id bigint(20) NOT NULL DEFAULT '0',
score_date date NOT NULL,
rel_score decimal(4,2) DEFAULT NULL,
doc_count int(8) NOT NULL,
total_doc_count int(8) NOT NULL,
PRIMARY KEY (user_id,score_date),
KEY SCORE_DT_IDX (score_date)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 PACK_KEYS=1
The table will store the rel_score value for every user in the application for every day from 1st Jan 2000 to date. I estimate the total number of records will be over 700 million. I populated the table with 6 months of data (~30 million rows), and the query response time is about 8 minutes. Here is my query:
select
user_id, max(rel_score) as max_rel_score
from
rel_score
where score_date between '2012-01-01' and '2012-06-30'
group by user_id
order by max_rel_score desc;
I tried optimizing the query using the following techniques,
Partitioning on the score_date column
Adding an index on the score_date column
The query response time improved marginally to a little less than 8 mins.
How can I improve the response time? Is the design of the table appropriate?
Also, I cannot move the old data to an archive, as a user is allowed to query the entire date range.
If you partition your table at the same granularity as score_date, you will not reduce the query response time.
Try creating another attribute that contains only the year of the date, cast to an INTEGER. Partition your table on this attribute (you will get 13 partitions), then re-execute your query and compare.
Your primary index should do a good job of covering the table. If you didn't have it, I would suggest building an index on rel_score(user_id, score_date, rel_score). For your query, this is a "covering" index, meaning that the index has all the columns in the query, so the engine never has to access the data pages (only the index).
The following version might also make good use of this index (although I much prefer your version of the query):
select u.user_id,
(select max(rel_score)
from rel_score r2
where r2.user_id = u.user_id and
r2.score_date between '2012-01-01' and '2012-06-30'
) as rel_score
from (select distinct user_id
from rel_score
where score_date between '2012-01-01' and '2012-06-30'
) u
order by rel_score desc;
The idea behind this query is to replace the aggregation with a simple index lookup. Aggregation in MySQL is a slow operation -- it works much better in other databases so such tricks shouldn't be necessary.
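To check that the rewritten query really is equivalent to the original aggregation, both can be run side by side on a small sample. The sketch below uses Python's built-in sqlite3 as a stand-in. The sample rows are made up, the result column is named max_rel_score in both queries for easy comparison, and it says nothing about which form is faster in MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rel_score ("
    " user_id INTEGER NOT NULL,"
    " score_date TEXT NOT NULL,"
    " rel_score REAL,"
    " PRIMARY KEY (user_id, score_date))"
)
conn.executemany(
    "INSERT INTO rel_score VALUES (?, ?, ?)",
    [
        (1, "2012-01-10", 0.50),
        (1, "2012-03-01", 0.90),
        (1, "2012-08-01", 0.99),  # outside the date range
        (2, "2012-02-02", 0.70),
        (3, "2011-12-31", 0.95),  # outside the date range
    ],
)
date_range = ("2012-01-01", "2012-06-30")

# The original aggregation from the question.
grouped = conn.execute(
    "SELECT user_id, MAX(rel_score) AS max_rel_score FROM rel_score"
    " WHERE score_date BETWEEN ? AND ?"
    " GROUP BY user_id ORDER BY max_rel_score DESC",
    date_range,
).fetchall()

# The rewrite: distinct users, then one correlated lookup per user.
correlated = conn.execute(
    "SELECT u.user_id,"
    "  (SELECT MAX(r2.rel_score) FROM rel_score r2"
    "    WHERE r2.user_id = u.user_id"
    "      AND r2.score_date BETWEEN ? AND ?) AS max_rel_score"
    " FROM (SELECT DISTINCT user_id FROM rel_score"
    "        WHERE score_date BETWEEN ? AND ?) u"
    " ORDER BY max_rel_score DESC",
    date_range + date_range,
).fetchall()

print(grouped)
print(correlated)
```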
I'm having a MySQL-Table like this:
CREATE TABLE `dates` (
`id` int UNSIGNED NOT NULL AUTO_INCREMENT ,
`object_id` int UNSIGNED NOT NULL ,
`date_from` date NOT NULL ,
`date_to` date NULL ,
`time_from` time NULL ,
`time_to` time NULL ,
PRIMARY KEY (`id`)
);
which is queried mostly this way:
SELECT object_id FROM `dates`
WHERE NOW() BETWEEN date_from AND date_to
How do I index the table best? Should I create two indexes, one for date_from and one for date_to or is a combined index on both columns better?
For the query:
WHERE NOW() >= date_from
AND NOW() <= date_to
A compound index (date_from, date_to) is useless.
Create both indices: (date_from) and (date_to) and let the SQL optimizer decide each time which one to use. Depending on the values and the selectivity, the optimizer may choose one or the other index. Or none of them. There is no easy way to create an index that will take both conditions into consideration.
(A spatial index could be used to optimize such a condition, if you could translate the dates to latitude and longitude).
Update
My mistake. An index on (date_from, date_to, object_id) can be, and indeed is, used in some situations for this query. If the selectivity of NOW() >= date_from is high enough, the optimizer chooses this index rather than doing a full table scan or using another index. This is because it's a covering index: no data needs to be fetched from the table, only from the index.
Minor note (not related to performance, only correctness of the query). Your condition is equivalent to:
WHERE CURRENT_DATE() >= date_from
AND ( CURRENT_DATE() + INTERVAL 1 DAY <= date_to
OR ( CURRENT_DATE() = NOW()
AND CURRENT_DATE() = date_to
)
)
Are you sure you want that or do you want this:
WHERE CURRENT_DATE() >= date_from
AND CURRENT_DATE() <= date_to
The NOW() function returns a DATETIME, while CURRENT_DATE() returns a DATE, without the time part.
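The difference shows up on the last day of a range. A minimal sketch in plain Python, treating a DATE as midnight when it is compared with a DATETIME, as the equivalent condition above implies. The function names and sample dates are made up for illustration.

```python
from datetime import date, datetime, time

def active_with_now(now, date_from, date_to):
    """Mimic NOW() BETWEEN date_from AND date_to: each date counts as midnight."""
    lower = datetime.combine(date_from, time.min)
    upper = datetime.combine(date_to, time.min)
    return lower <= now <= upper

def active_with_current_date(now, date_from, date_to):
    """Mimic CURRENT_DATE() BETWEEN date_from AND date_to: the time part is dropped."""
    return date_from <= now.date() <= date_to

now = datetime(2024, 5, 10, 15, 0)           # 3 pm on the last day of the range
period = (date(2024, 5, 1), date(2024, 5, 10))

print(active_with_now(now, *period))           # False: 15:00 is after midnight of date_to
print(active_with_current_date(now, *period))  # True: the whole last day is included
```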
You should create an index covering date_from, date_to and object_id, as explained by ypercube. The order of the fields in the index depends on whether you will have more data for the past or the future. As pointed out by Erwin in response to Sanjay's comment, the date_to field will be more selective if you have more dates in the past, and vice versa.
CREATE INDEX idx_dates_range ON dates (date_to, date_from, object_id); -- the index name is illustrative
How many rows does your query return relative to your table size? If it's more than 10 percent, I would not bother creating an index; in that case you're quite close to a table scan anyway. If it's well below 10 percent, I would use an index containing
(date_from, date_to, object_id), so that the query result can be constructed entirely from the information in the index, without the database having to go back to the table data to get the value for object_id.
Depending on the size of your table, this can use up a lot of space. If you can spare that, give it a try.
Create an index with (date_from,date_to) as that single index would be usable for the WHERE criteria
If you create separate indexes then MySQL will have to use one or the other instead of both
I have a table Cars with datetime (DATE) and bit (PUBLIC).
Now i would like to take rows ordered by DATE and with PUBLIC = 1 so i use:
select
c.*
from
Cars c
WHERE
c.PUBLIC = 1
ORDER BY
DATE DESC
But unfortunately when I use explain to see what is going on I have this:
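id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | a | ALL | IDX_PUBLIC,DATE | NULL | NULL | NULL | 103 | Using where; Using filesort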
It takes 0.3 ms to fetch this data even though I have only 100 rows. Is there any way to avoid the filesort?
When I look at the indexes, I have a non-unique index on (PUBLIC, DATE).
Table def:
CREATE TABLE IF NOT EXISTS `Cars` (
`ID` int(11) NOT NULL auto_increment,
`DATE` datetime NOT NULL,
`PUBLIC` binary(1) NOT NULL default '0',
PRIMARY KEY (`ID`),
KEY `IDX_PUBLIC` (`PUBLIC`),
KEY `DATE` (`PUBLIC`,`DATE`)
) ENGINE=MyISAM AUTO_INCREMENT=186 ;
You need to have a composite index on (public, date)
This way, MySQL will filter on public and sort on date.
From your EXPLAIN I see that you don't have a composite index on (public, date).
Instead you have two different indexes on public and on date. At least, that's what their names IDX_PUBLIC and DATE tell.
Update:
Your public column is not a BIT, it's a BINARY(1). It's a character type and uses character comparison.
When comparing integers to characters, MySQL converts the latter to the former, not vice versa.
These queries return different results:
CREATE TABLE t_binary (val BINARY(2) NOT NULL);
INSERT
INTO t_binary
VALUES
(1),
(2),
(3),
(10);
SELECT *
FROM t_binary
WHERE val <= 10;
---
1
2
3
10
SELECT *
FROM t_binary
WHERE val <= '10';
---
1
10
Either change your public column to be a bit or rewrite your query as this:
SELECT c.*
FROM Cars c
WHERE c.PUBLIC = '1'
ORDER BY
DATE DESC
, i.e. compare characters with characters, not integers.
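The two result sets from the t_binary example can be reproduced in plain Python, which makes the difference between numeric and byte-wise comparison easy to see. This is only a model of the behaviour: the helpers `binary2` and `as_number` are hypothetical and assume that BINARY(2) stores the number as text, right-padded with 0x00 bytes.

```python
def binary2(value):
    """Model a BINARY(2) value: the number as text, right-padded with 0x00."""
    return str(value).encode("ascii").ljust(2, b"\x00")

def as_number(raw):
    """Model the conversion used when the column is compared with an integer."""
    return int(raw.rstrip(b"\x00").decode("ascii"))

stored = [binary2(v) for v in (1, 2, 3, 10)]

# WHERE val <= 10: the stored bytes are converted to numbers first.
numeric = [as_number(raw) for raw in stored if as_number(raw) <= 10]

# WHERE val <= '10': the bytes are compared left to right, so '2' and '3' sort after '10'.
textual = [as_number(raw) for raw in stored if raw <= b"10"]

print(numeric)
print(textual)
```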
If you are ordering by date, a sort will be required. If there isn't an index by date, then a filesort will be used. The only way to get rid of that would be to either add an index on date or not do the order by.
Also, a filesort does not always imply that the file will be sorted on disk. It could be sorting it in memory if the table is small enough or the sort buffer is large enough. It just means that the table itself has to be sorted.
Looks like you have an index on date already, and since you are using PUBLIC in your where clause, MySQL should be able to use that index. However, the optimizer may have decided that since you have so few rows it isn't worth bothering with the index. Try adding 10,000 or so rows to the table, re-analyze it, and see if that changes the plan.