Limit large table according to date - mysql

I have a table with 5 million rows, and I want to get only the rows whose date field falls between two dates (date1 and date2). I tried
select column from table where date > date1 and date < date2
but the processing time is really long. Is there a smarter way to do this? Maybe jump directly to a row and run the query only from that row onward? My point is: is there a way to discard the large part of my table that doesn't match the date period, or do I have to read row by row and compare the dates?

Usually you apply some kind of condition before retrieving the results. If you don't have anything further to filter on, you might want to use LIMIT and OFFSET:
SELECT * FROM table_name WHERE date BETWEEN ? AND ? LIMIT 1000 OFFSET 1000
Generally you will LIMIT to however many records you'd like to show on a particular page.
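For instance, fetching the second page of 1,000 rows for a made-up date range might look like this (the ORDER BY keeps the page boundaries stable between requests):
SELECT * FROM table_name
WHERE date BETWEEN '2012-01-01' AND '2012-03-31'
ORDER BY date
LIMIT 1000 OFFSET 1000;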

You can try/do a couple of things:
1.) If you don't already have one, index your date column
2.) Range partition your table on the date field
When you partition a table, the query optimizer can eliminate partitions that are not able to satisfy the query without actually processing any data.
For example, let's say you partitioned your table monthly on the date field and had 6 months of data in it. If you query for a date range covering a week in October 2012, the query optimizer can throw out 5 of the 6 partitions and scan only the partition that holds records for October 2012.
For more details, check the MySQL Partitioning page. It gives you all the necessary information and a more thorough example of what I described above in the "Partition Pruning" section.
Note: I would recommend cloning your table into a new partitioned table and running the query there, to test the results and see whether they satisfy your requirements. If you haven't already indexed the date column, that should be your first step; test, and if need be, look into partitioning. A sketch of both steps follows below.
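For illustration only (the table name and partition boundaries are assumptions, and note that MySQL requires the partitioning column to be part of every unique key, including the primary key), the two steps might look like:
-- 1.) index the date column, if it isn't indexed already
ALTER TABLE my_table ADD INDEX idx_date (`date`);
-- 2.) monthly range partitions; boundaries are placeholders
ALTER TABLE my_table
PARTITION BY RANGE (TO_DAYS(`date`)) (
    PARTITION p201209 VALUES LESS THAN (TO_DAYS('2012-10-01')),
    PARTITION p201210 VALUES LESS THAN (TO_DAYS('2012-11-01')),
    PARTITION pmax VALUES LESS THAN MAXVALUE
);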

Related

How to optimize a MySQL query which is taking 10 sec to fetch results

I have a MySQL database. I have a table in it which has around 200000 rows.
I am querying this table to fetch the latest data. Query:
select *
from `db`.`Data`
where
floor = "floor_value" and
date = "date_value" and
timestamp > "time_value"
order by
timestamp DESC
limit 1
It is taking about 9 seconds to fetch the data; when the table had fewer rows, it did not take this long. Can anyone help me reduce the time taken by the query?
Try adding the following compound index:
CREATE INDEX idx ON Data (floor, date, timestamp);
This index should cover the entire WHERE clause and ideally should also be usable for the ORDER BY clause. The reason timestamp appears last in the index is that the two equality columns come first, which lets MySQL generate the final set of matching timestamp values with a single index scan. Had we put timestamp first, MySQL might have to seek back to the clustered index to find the set of matching rows.
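To check that the optimizer actually picks the index up, you can prefix the original query with EXPLAIN (a sketch using the names from the question):
EXPLAIN
select *
from `db`.`Data`
where
floor = "floor_value" and
date = "date_value" and
timestamp > "time_value"
order by
timestamp DESC
limit 1;
-- key should show idx, and Extra should not contain "Using filesort"
-- if the index also satisfies the ORDER BY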

Count daily rows by tables in MySQL

I have a database with ~200 tables that I want to audit to ensure the tables won't grow too large.
I know I can easily get an idea of a lot of the table attributes I want (size in MB, rows, row length, data length, etc) with:
SHOW TABLE STATUS FROM myDatabaseName;
But it's missing one key piece of information I'm after: how many rows are added to each table in a given time period?
My records each contain a datestamp column in matching formats, if it helps.
Edit: Essentially, I want something like:
SELECT COUNT(*)
FROM *
WHERE datestamp BETWEEN [begindate] AND [enddate]
GROUP BY tablename
The following should work to get the number of rows entered into a given table for a given time period:
select count(*) from [tablename] where datestamp between [begindate] and [enddate]
After a bit of research, it looks like this isn't possible in MySQL, since it would require massive table reads (after all, the number of rows can differ between users).
Instead, I grabbed the transaction logs for all the jobs that write into the tables and I'll parse them. A bit hacky, but it works.
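If parsing logs ever becomes too brittle, another (equally hacky) option is to let information_schema write the per-table query for you, then run the generated statement as a second step. A sketch, assuming every table really has a datestamp column:
SET SESSION group_concat_max_len = 1000000; -- the 1024-byte default would truncate ~200 branches
SELECT GROUP_CONCAT(
         CONCAT('SELECT ''', table_name, ''' AS tablename, COUNT(*) AS added_rows',
                ' FROM `', table_name, '`',
                ' WHERE datestamp BETWEEN [begindate] AND [enddate]')
         SEPARATOR ' UNION ALL ') AS generated_sql
FROM information_schema.tables
WHERE table_schema = 'myDatabaseName';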

"The total number of locks exceeds the lock table size" when using the MySQL DELETE command

I am using MySQL server v5.1.73 on a CentOS 6.4 64-bit operating system. I have a table with about 17M records whose size is about 10 GB. The MySQL engine for this table is InnoDB.
This table has 10 columns, one of which is 'date', of type DATETIME. I want to delete the records of a specific date with the MySQL DELETE command.
delete
from table
where date(date) = '2015-06-01'
limit 1000
But when I run this command, I get the error 'The total number of locks exceeds the lock table size'. I had this problem before, and changing innodb_buffer_pool_size would fix it, but this time the problem persists even after increasing that value.
I tried many tricks, like lowering the LIMIT value to 100 or even 1 record, but it doesn't work. I even increased innodb_buffer_pool_size to 20 GB, but nothing changed.
I also read these links: "The total number of locks exceeds the lock table size" Deleting 267 Records and The total number of locks exceeds the lock table size,
but they didn't solve my problem. My server has 64 GB of RAM.
On the other hand, I can delete records when not filtering on a specific date:
delete
from table
limit 1000
and I can also select the records of that day without any problem. Can anyone help me with this?
I would appreciate any help to fix the problem.
Don't use date(date); it cannot use INDEX(date). Instead, compare the bare date column and have an index beginning with date.
More ways to do a chunking delete.
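Putting both points together, a sketch of a chunked delete that can use INDEX(`date`) directly; re-run it until it affects zero rows:
DELETE FROM `table`
WHERE `date` >= '2015-06-01'
  AND `date` < '2015-06-02'
LIMIT 1000;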
date(date) in the WHERE clause requires the database to compute that expression for all 17M rows at run time: the DB evaluates date(date) row by row, then table-scans those 17M rows (no index can be used) to work out your result set. This is where you are running out of resources.
You need to remove the usage of the calculated column, which can be solved a couple of different ways.
Rather than doing date(date), change your comparison to be:
WHERE date >= '2015-06-01 00:00:00' AND date <= '2015-06-01 23:59:59'
This will now hit the index on the date column directly (I'm assuming you have an index on this column; it just isn't used by your original query).
The other solution would be to add a column of type DATE to the table and permanently store the DATE of each DATETIME in it (and, obviously, add an index for the new DATE column). That would allow you to run any query that only needs to examine the DATE portion, without having to spell out the time range. If you've got other queries currently using date(date), a column holding just the date may be preferable to adding the time range to each query: the time range works fine for a straight index comparison in a SELECT or DELETE like this one, but might not be usable for other queries involving JOIN, GROUP BY, etc.
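A minimal sketch of that second approach (the column name date_only is made up; on MySQL 5.7+ a stored generated column could replace the manual backfill):
ALTER TABLE `table` ADD COLUMN date_only DATE;
UPDATE `table` SET date_only = DATE(`date`); -- one-off backfill; heavy on 17M rows, worth chunking too
ALTER TABLE `table` ADD INDEX idx_date_only (date_only);
-- the original delete, now able to use an index
DELETE FROM `table`
WHERE date_only = '2015-06-01'
LIMIT 1000;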

Need advice on how to index and optimize a specific MySQL database

I'm trying to optimize my MySQL DB so I can query it as quickly as possible.
It goes like this:
My DB consists of 1 table that has (for now) about 18 million rows - and growing rapidly.
This table has the following columns - idx, time, tag_id, x, y, z.
No column has any null values.
'idx' is an INT(11) index column, AI and PK. Right now it's in ascending order.
'time' is a date-time column. It's also ascending. About 50% of the 'time' values in the table are distinct (the rest appear twice or three times at most).
'tag_id' is an INT(11) column. It's not ordered in any way, and there are between 30 and 100 different possible tag_id values spread over the whole DB. It's also a foreign key to another table.
INSERT -
A new row is inserted into the table every 2-3 seconds. 'idx' is calculated by the server (AI). Since the 'time' column represents the time the row was inserted, every new 'time' value will be higher than or equal to the previous row's. All the other column values don't have any order.
SELECT -
Here is an example of a typical query:
"select x, y, z, time from table where date(time) between '2014-08-01' and '2014-10-01' and tag_id = 123456"
So 'time' and 'tag_id' are the only columns that appear in the WHERE part, and both of them will ALWAYS appear in the WHERE part of every query. 'x', 'y', 'z' and 'time' will always appear in the SELECT part; 'tag_id' might also appear there sometimes.
The queries will usually seek higher (more recent) times rather than older ones, meaning later rows in the table will be searched more often.
INDEXES-
Right now 'idx', being the PK, is the clustered ASC index. 'time' also has a non-clustered ASC index.
That's it. Considering all this, a typical query returns results in around 30 seconds. I'm trying to lower this time. Any advice?
I'm thinking about changing one or both of the indexes from ASC to DESC (since the higher values are more popular in searches). If I change 'idx' to DESC, it will physically reverse the whole table; if I change 'time' to DESC, it will reverse the 'time' index tree. But since this is an 18-million-row table, changes like this might take the server a long time, so I want to be sure it's a good idea. The question is: if I reverse the order and a new row is inserted, will the server know to put it at the beginning of the table quickly, or will it search the table for the place every time? And will putting a new row at the beginning of the table mean some kind of data shifting has to be done to the whole table every time?
Or maybe I just need a different indexing technique??
Any ideas you have are very welcome.. thanks!!
select x, y, z, time from table
where date(time) between '2014-08-01' and '2014-10-01' and tag_id = 123456
Putting a column inside a function call like date(time) spoils any chance of using an index for that column. You must use only a bare column for comparison, if you want to use an index.
So if you want to compare it to dates, you should store a DATE column. If you have a DATETIME column, you may have to use a search term like this:
WHERE `time` >= '2014-08-01 00:00:00' AND `time` < '2014-10-02 00:00:00' ...
Also, you should use multi-column indexes where you can. Put columns used in equality conditions first, then one column used in range conditions. For more on this rule, see my presentation How to Design Indexes, Really.
You may also benefit from adding columns that are not used for searching, so that the query can retrieve the columns from the index entry alone. Put these columns following the columns used for searching or sorting. This is called an index-only query.
So for this query, your index should be:
ALTER TABLE `this_table` ADD INDEX (tag_id, `time`, x, y, z);
Regarding ASC versus DESC, the syntax supports the option for different direction indexes, but in the two most popular storage engines used in MySQL, InnoDB and MyISAM, there is no difference. Either direction of sorting can use either type of index more or less equally well.
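With that index in place, a sketch of the rewritten query; if the index truly covers it, EXPLAIN should report "Using index" in the Extra column:
EXPLAIN
SELECT x, y, z, `time`
FROM `this_table`
WHERE tag_id = 123456
  AND `time` >= '2014-08-01 00:00:00'
  AND `time` < '2014-10-02 00:00:00';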

What is the best way to "roll up" aggregate data in MySQL?

I have a large table containing hourly statistical data broken down across a number of dimensions. It's now large enough that I need to start aggregating the data to make queries faster. The table looks something like:
customer INT
campaign INT
start_time TIMESTAMP
end_time TIMESTAMP
time_period ENUM('hour', 'day', 'week')
clicks INT
I was thinking that I could, for example, insert a row into the table where campaign is null, and the clicks value would be the sum of all clicks for that customer and time period. Similarly, I could set the time period to "day" and this would be the sum of all of the hours in that day.
I'm sure this is a fairly common thing to do, so I'm wondering what the best way to achieve this in MySql? I'm assuming an INSERT INTO combined with a SELECT statement (like with a materialized view) - however since new data is constantly being added to this table, how do I avoid re-calculating aggregate data that I've previously calculated?
I've done something similar; here are the problems I had to deal with:
You can use floor(start_time/86400)*86400 in the GROUP BY part to get a summary of all entries from the same day (for weeks it's almost the same). Note this assumes start_time holds Unix epoch seconds; with a native TIMESTAMP column, wrap it in UNIX_TIMESTAMP() first.
The SQL will look like:
insert into the_table
  (customer, campaign, start_time, end_time, time_period, clicks)
select
  customer,
  NULL,
  floor(start_time/86400)*86400,
  floor(start_time/86400)*86400 + 86400,
  'day',
  sum(clicks)
from the_table
where time_period = 'hour' and start_time between <A> and <B>
group by customer, floor(start_time/86400)*86400;
delete from the_table
where time_period = 'hour' and start_time between <A> and <B>;
If you are going to insert a summary from a table into itself, MySQL will use a temporary table (which means part of the data is copied aside and then dropped, for each transaction). So you must be very careful with the indexes and the size of the data returned by the inner select.
When you are constantly inserting and deleting rows, you will get fragmentation issues sooner or later, and it will slow you down dramatically. The solution is to use partitioning and drop old partitions from time to time. Alternatively you can run an OPTIMIZE TABLE statement, but it will block your work for a relatively long time (maybe minutes).
To avoid a mess with duplicate data, you may want to use a separate table for each aggregation period (hour_table, day_table, ...).
If you're trying to make the table smaller, you'll be deleting the detailed rows after you make the summary row, right? Transactions are your friend. Start one, compute the rollup, insert the rollup, delete the detailed rows, end the transaction.
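A bare-bones sketch of that sequence, reusing the statements and <A>/<B> placeholders from the previous answer:
START TRANSACTION;
INSERT INTO the_table (customer, campaign, start_time, end_time, time_period, clicks)
SELECT customer, NULL,
       FLOOR(start_time/86400)*86400,
       FLOOR(start_time/86400)*86400 + 86400,
       'day', SUM(clicks)
FROM the_table
WHERE time_period = 'hour' AND start_time BETWEEN <A> AND <B>
GROUP BY customer, FLOOR(start_time/86400)*86400;
DELETE FROM the_table
WHERE time_period = 'hour' AND start_time BETWEEN <A> AND <B>;
COMMIT; -- if anything fails before this point, ROLLBACK leaves the detail rows intact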
If you happen to add more rows for an older time period (who does that??), you can run the rollup again: it will combine your previous rollup entry with your extra data into a new, more powerful rollup entry.