MySQL index for only maximum and minimum values

I have a huge table with millions of rows which stores the values obtained from some weather stations. Every row contains the station that gathered the value, the metric (for example, temperature, humidity, noise level, etc.), the date and the value itself.
This is its structure:
station: int(8)
metric: int(8)
date: datetime
value: float
And these are the indices I've defined:
PRIMARY KEY: station+metric+date
KEY: metrica (for the foreign key)
Sometimes, I'm interested in retrieving the last time every station has sent some value. Then I use this query:
SELECT station, MAX(date)
FROM MyTable
GROUP BY station
This query is very slow, as it has to read the entire table. If I add an index on station+date, the query can use it and becomes very fast. But the table storage also increases a lot, and for me indexing all date values is not useful, given I'm only interested in the max value.
So my question is whether it's possible to create an index that covers only some range of values, ideally one that only keeps track of the maximum.

Not that I know of. But you have alternative solutions.
In other databases I'd suggest a materialized view, but MySQL does not support materialized views (SO#3991912), so you have to create and manage your own aggregate table yourself.
If your source table is not updated too frequently, a statement like
CREATE TABLE last_observation AS
SELECT station, MAX(date) AS date
FROM observations
GROUP BY station;
will do the job. Simply issue the statement before any relevant request.
If your server has enough resources, you can keep the table in the MEMORY engine to get super-fast responses. In that case you need to name the columns explicitly:
CREATE TABLE last_observation (station VARCHAR(x), lastDate DATE) ENGINE=MEMORY
AS SELECT station, MAX(date) AS lastDate FROM observations GROUP BY station;
Of course, this statement must be re-issued every time the MySQL server is restarted, since MEMORY tables lose their contents on restart.
If your table is updated frequently, you can manage the content with triggers on the source table (Full tutorial here).
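A minimal sketch of what such a trigger could look like, assuming the source table is called observations and has the columns from the question (this is not taken from the linked tutorial):
-- one row per station, holding its most recent date
CREATE TABLE last_observation (
    station INT PRIMARY KEY,
    last_date DATETIME
);

CREATE TRIGGER observations_after_insert
AFTER INSERT ON observations
FOR EACH ROW
    INSERT INTO last_observation (station, last_date)
    VALUES (NEW.station, NEW.date)
    ON DUPLICATE KEY UPDATE last_date = GREATEST(last_date, NEW.date);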
Another solution, on a totally different path, is to use a column-oriented database. We used Infobright a few years ago; it has a free community edition and is totally transparent for you (just install it and use MySQL as before).

INDEX(station, date)
will handle that query efficiently. Alternatively, you could rearrange the PRIMARY KEY to (station, date, metric).
If you also want the temperature on that date, then you are into the more complex groupwise-max problem.
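For reference, a minimal sketch of one common groupwise-max pattern, using the table and column names from the question (if several metrics share a station's latest date, this returns one row per metric):
SELECT t.station, t.metric, t.date, t.value
FROM MyTable AS t
JOIN (
    -- latest reading per station
    SELECT station, MAX(date) AS max_date
    FROM MyTable
    GROUP BY station
) AS latest
  ON latest.station = t.station
 AND latest.max_date = t.date;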

Related

MySQL performance for aggregate functions -- 80Million records

I am currently stuck improving the performance of a MySQL query. It takes 30 seconds to execute, and we don't want users waiting that long for the backend response.
My Query:
select count(case_id), sum(net_value), sum(total_time_spent), events
from event_log
group by events
order by count(case_id) desc
Indexes:
Created a composite index on (events, case_id, net_value, total_time_spent).
Time taken: 30 seconds
Number of records in the event_log table: 80 million
Table structure:
Create table event_log (
  case_id varchar(100) primary key,
  events varchar(200),
  creation_date timestamp,
  total_time_spent bigint
)
Composite unique key: (case_id, events, creation_date).
Infrastructure: 
AWS RDS instance type: r5d.2xlarge (8 CPUs, 64 GB RAM)
Tried partitioning the data on the basis of key case_id but could see no improvement.
Tried upgrading the server size, but no improvement there either.
If you can give us some hints, or something we can try, that would be really helpful.
Build and maintain a Summary Table of events by day (or week) and subtotals of the counts and sums you need.
Then run the query against the summary table, summing up the sums, etc.
That may run 10 times as fast.
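A minimal sketch of what such a summary table could look like, assuming daily granularity and reusing the column names from the question (types and names are illustrative):
CREATE TABLE event_log_daily (
    day DATE NOT NULL,
    events VARCHAR(200) NOT NULL,
    cnt INT UNSIGNED NOT NULL,
    net_value_sum DOUBLE NOT NULL,
    time_spent_sum BIGINT NOT NULL,
    PRIMARY KEY (events, day)
);

-- nightly job: add yesterday's subtotals
INSERT INTO event_log_daily
SELECT DATE(creation_date), events, COUNT(*), SUM(net_value), SUM(total_time_spent)
FROM event_log
WHERE creation_date >= CURDATE() - INTERVAL 1 DAY
  AND creation_date <  CURDATE()
GROUP BY DATE(creation_date), events;

-- the report then scans the much smaller summary table
SELECT events, SUM(cnt), SUM(net_value_sum), SUM(time_spent_sum)
FROM event_log_daily
GROUP BY events
ORDER BY SUM(cnt) DESC;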
If practical, normalize case_id and/or events; that may shrink the table size by a significant amount. Consider using a smaller datatype for the total_time_spent; BIGINT consumes 8 bytes.
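For the normalization point, a minimal sketch (the lookup table and its names are hypothetical):
-- store each distinct event name once...
CREATE TABLE event_names (
    event_id SMALLINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    event_name VARCHAR(200) NOT NULL,
    UNIQUE KEY (event_name)
);
-- ...and keep only the 2-byte event_id in the 80-million-row event_log,
-- instead of repeating a VARCHAR(200) value on every row.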
With a summary table, the original table needs few, if any, extra indexes; the summary table itself will likely have indexes. I would try to have its PRIMARY KEY start with events.
Be aware that COUNT(x) checks x for being NOT NULL. If this is not necessary, then simply do COUNT(*).

How to effectively save last 7 days statistics in SQL database? [duplicate]

I have to collect statistics by day, week, month and year on user activity for a site. I am at the DB design stage and I want to do this stage properly, since it will make my coding life easier.
What I have to do is simply increment the values in the fields by 1 in the DB each time an activity happens. Then I can pull up the data by each day, each week, each month and year. How should my DB be structured? Apologies if this is a simple question for most. It would also be great if this structure could be extended so that it can be broken down by other categories.
The bit I am having trouble with is that each month is made up of a different number of days, and these change each calendar year.
Thanks all for any help or direction.
Other info: Linux Machine, making use of PHP and MySQL
Instead of updating counts per day, week etc. just INSERT a row into a table each time an activity happens like this:
insert into activities (activity_date, activity_info)
values (CURRENT_TIMESTAMP, 'whatever');
Now your reports are very simple like:
select count(*) from activities
where activity_date between '2008-01-01' and '2008-01-07';
or
select YEARWEEK(`activity_date`) as theweek, count(*)
from activities
group by theweek
You may just add records into the table and SELECT them using aggregate functions.
If for some reason you need to keep aggregated statistics, you may use:
CREATE TABLE aggregates (type VARCHAR(20), part VARCHAR(10) NOT NULL PRIMARY KEY, activity INT)
INSERT INTO aggregates (type, part, activity)
VALUES ('year', SUBSTRING(SYSDATE(), 1, 4), 1)
ON DUPLICATE KEY UPDATE activity = activity + 1
INSERT INTO aggregates (type, part, activity)
VALUES ('month', SUBSTRING(SYSDATE(), 1, 7), 1)
ON DUPLICATE KEY UPDATE activity = activity + 1
INSERT INTO aggregates (type, part, activity)
VALUES ('day', SUBSTRING(SYSDATE(), 1, 10), 1)
ON DUPLICATE KEY UPDATE activity = activity + 1
This will automatically update existing rows and insert non-existing when needed.
table of events : id, activity id, datetime, userid.
table of users : id, username etc
table of activities : id, activity name, etc
Just enter a new row into events when an event happens. Then you can analyse the events by manipulating time, date, user, activity, etc.
To start with, you would probably imagine a single table, as this would be the most normalized form. The table would simply have an entry for each hit you receive, with each row containing the date/time of that hit.
Now, this way, in order to get statistics for each hour, day, week etc, the queries are simple but your database will have to do some pretty heavy query work. In particular, queries that do sums, counts or averages will need to fetch all the relevant rows.
You could get around this by precalculating the required counts in a second table, and making sure you sync that table to the first regularly. Problem is, you will be responsible for keeping that cache in sync yourself.
This would probably involve making a row for each hour. It will still be a lot quicker to do a query for a day, or a month, if you are only fetching a maximum of 24 rows per day.
Your other suggestion was to aggregate it from the start, never storing every single hit as a row. You would probably do that, as before, with a row for each hour. Every hit would increment the relevant hour's row by one. You would only have the data in one location, and it would already be pretty well summarised.
The reason I suggest by hour instead of by day, is that this still gives you the option to support multiple time zones. If your granularity is only to the day, you don't have that option.
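A minimal sketch of that hourly counter approach (table and column names are illustrative):
CREATE TABLE activity_hourly (
    hour_start DATETIME NOT NULL PRIMARY KEY,
    hits INT UNSIGNED NOT NULL DEFAULT 0
);

-- on every hit, bump the current hour's counter (UTC keeps time zones flexible)
INSERT INTO activity_hourly (hour_start, hits)
VALUES (DATE_FORMAT(UTC_TIMESTAMP(), '%Y-%m-%d %H:00:00'), 1)
ON DUPLICATE KEY UPDATE hits = hits + 1;

-- a day's total is then at most 24 rows
SELECT SUM(hits) FROM activity_hourly
WHERE hour_start >= '2008-01-01' AND hour_start < '2008-01-02';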
Tony Andrews' answer is the simplest, however a snowflake structure is sometimes used in data warehouse applications: a table that counts all the activities, another for activities per day, another for activities per month, and a third for activities per year.
With this kind of structure, the activity between any two dates can be computed very efficiently.
https://en.wikipedia.org/wiki/Snowflake_schema
Use a star schema design (or perhaps a snowflake design).
Star-Schema Design
You will end up doing an insert into a fact table for each new activity. See Tony's suggestion.
You will need at least two dimension tables, one for users and one for time frames. There will probably be dimensions for activity type, and maybe even for location. It depends on what you want to do with the data.
Your question relates to the time frames dimension table. Let's call it "Almanac". Choose a granularity. Let's say the day. The almanac will have one row per day. The primary key can be the date. Your fact table should include this primary key as a foreign key, in order to make joins easier. (It doesn't matter whether or not you declare it as a foreign key. That only affects referential integrity during your update process.)
Include columns in the Almanac for each reporting period you can think of. Week, Month, Quarter, Year, etc. You can even include reporting periods that relate to your company's own calendar.
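A minimal sketch of what such an Almanac table could look like (the column set is illustrative; add whatever reporting periods you need):
CREATE TABLE almanac (
    day DATE NOT NULL PRIMARY KEY,
    week SMALLINT NOT NULL,
    month TINYINT NOT NULL,
    quarter TINYINT NOT NULL,
    year SMALLINT NOT NULL,
    fiscal_period VARCHAR(10) NULL   -- company-specific calendar, if any
);

-- a monthly report is then a join plus a group by
SELECT a.year, a.month, COUNT(*) AS hits
FROM activities AS f
JOIN almanac AS a ON a.day = DATE(f.activity_date)
GROUP BY a.year, a.month;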
Here's an article comparing ER and DM. I'm unusual in that I like both methods, choosing the appropriate method for the appropriate task.
http://www.dbmsmag.com/9510d05.html

What is the best way to "roll up" aggregate data in MySql?

I have a large table containing hourly statistical data broken down across a number of dimensions. It's now large enough that I need to start aggregating the data to make queries faster. The table looks something like:
customer INT
campaign INT
start_time TIMESTAMP
end_time TIMESTAMP
time_period ENUM('hour', 'day', 'week')
clicks INT
I was thinking that I could, for example, insert a row into the table where campaign is null, and the clicks value would be the sum of all clicks for that customer and time period. Similarly, I could set the time period to "day" and this would be the sum of all of the hours in that day.
I'm sure this is a fairly common thing to do, so I'm wondering what the best way to achieve this in MySql? I'm assuming an INSERT INTO combined with a SELECT statement (like with a materialized view) - however since new data is constantly being added to this table, how do I avoid re-calculating aggregate data that I've previously calculated?
I have done something similar, and here are the problems I had to deal with:
You can use round(start_time/86400)*86400 in the GROUP BY part to get a summary of all entries from the same day. (For weeks it is almost the same.)
The SQL will look like:
insert into the_table
select
    customer,
    NULL,
    round(start_time/86400)*86400,
    round(start_time/86400)*86400 + 86400,
    'day',
    sum(clicks)
from the_table
where time_period = 'hour' and start_time between <A> and <B>
group by customer, round(start_time/86400)*86400;
delete from the_table
where time_period = 'hour' and start_time between <A> and <B>;
If you are going to insert a summary from the same table into itself, MySQL will use a temporary table (which means part of the data is copied aside and then dropped, for each transaction), so you must be very careful with the indexes and the size of the data returned by the inner select.
When you are constantly inserting and deleting rows, you will get fragmentation issues sooner or later, which will slow you down dramatically. The solution is to use partitioning and drop old partitions from time to time. Alternatively you can run an OPTIMIZE TABLE statement, but it will block your work for a relatively long time (possibly minutes).
To avoid a mess with duplicate data, you may want to use a separate table for each aggregation period (hour_table, day_table, ...).
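For the partition-rotation point, a minimal illustration (partition names assumed; this presumes the table is partitioned by month on start_time):
-- discard a whole month of old detail rows almost instantly
ALTER TABLE the_table DROP PARTITION p201201;

-- or, without partitioning, reclaim space after heavy delete activity
-- (this blocks work on the table while it runs)
OPTIMIZE TABLE the_table;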
If you're trying to make the table smaller, you'll be deleting the detailed rows after you make the summary row, right? Transactions are your friend. Start one, compute the rollup, insert the rollup, delete the detailed rows, end the transaction.
If you happen to add more rows for an older time period (who does that??), you can run the rollup again - it will combine your previous rollup entry with your extra data into a new, more powerful, rollup entry.
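A hedged sketch of that transaction, using the columns from the question and DATE() to bucket the TIMESTAMP values (the date window is a placeholder):
START TRANSACTION;

-- 1. write one 'day' rollup row per customer for the chosen window
INSERT INTO the_table (customer, campaign, start_time, end_time, time_period, clicks)
SELECT customer,
       NULL,
       DATE(start_time),
       DATE(start_time) + INTERVAL 1 DAY,
       'day',
       SUM(clicks)
FROM the_table
WHERE time_period = 'hour'
  AND start_time >= '2012-01-01' AND start_time < '2012-01-02'
GROUP BY customer, DATE(start_time);

-- 2. remove the detail rows that were just summarised
DELETE FROM the_table
WHERE time_period = 'hour'
  AND start_time >= '2012-01-01' AND start_time < '2012-01-02';

COMMIT;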

Index design for queries using 2 ranges

I am trying to find out how to design the indexes for my data when my query uses ranges on 2 fields.
expenses_tbl:
idx       auto-increment, PRIMARY KEY
date      INT
category  TINYINT
amount    DECIMAL(7,2)
The column category defines the type of expense. Like, entertainment, clothes, education, etc. The other columns are obvious.
One of my queries on this table is to find all instances where, for a given date range, the expense has been more than $50. The query looks like:
SELECT date, category, amount
FROM expenses_tbl
WHERE date > 120101 AND date < 120811
AND amount > 50.00;
How do I design the index / secondary index on this table for this particular query?
Assumption: The table is very large (It's not currently, but that gives me a scope to learn).
MySQL generally doesn't support ranges on multiple parts of a compound index. Either it will use the index for the date, or an index for the amount, but not both. It might do an index merge if you had two indexes, one on each, but I'm not sure.
I'd check the EXPLAIN before and after adding these indexes:
CREATE INDEX date_idx ON expenses_tbl (date);
CREATE INDEX amount_idx ON expenses_tbl (amount);
Compound index ranges - http://dev.mysql.com/doc/refman/5.5/en/range-access-multi-part.html
Index Merge - http://dev.mysql.com/doc/refman/5.0/en/index-merge-optimization.html
A couple more points that have not been mentioned yet:
The order of the columns in the index can make a difference. You may want to try both of these indexes:
(date, amount)
(amount, date)
Which one to pick? Generally you want the most selective condition to be the first column in the index.
If your date ranges are large but few expenses are over $50 then you want amount first in the index.
If you have narrow date ranges and most of the expenses are over $50 then you should put date first.
If both indexes are present then MySQL will choose the index with the lowest estimated cost.
You can try adding both indexes and then look at the output of EXPLAIN SELECT ... to see which index MySQL chooses for your query.
You may also want to consider a covering index. By including the column category in the index (as the last column) it means that all the data required for your query is available in the index, so MySQL does not need to look at the base table at all to get the results for your query.
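For example, a covering index for the query above could look like this (the index name is arbitrary; column order follows the date-first variant discussed earlier):
-- all three referenced columns are in the index, so EXPLAIN should show
-- "Using index": the query is answered without touching the base table
CREATE INDEX date_amount_category_idx ON expenses_tbl (date, amount, category);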
The general answer to your question is that you want a composite index with two columns, the first being date and the second being amount.
Note that this index will work for queries with restrictions on the date or on the date and on the expense. It will not work for queries with restrictions on the expense only. If you have both types, you might want a second index on expense.
If the table is really, really large, then you might want to partition it by date and build indexes on expense within each partition.
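A hedged sketch of what that could look like (partition boundaries are illustrative; note that MySQL requires the partitioning column to be part of every unique key, so the primary key must be extended to include date first):
ALTER TABLE expenses_tbl
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (idx, date);

ALTER TABLE expenses_tbl
    PARTITION BY RANGE (date) (
        PARTITION p2011 VALUES LESS THAN (120101),
        PARTITION p2012 VALUES LESS THAN (130101),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );

-- each partition gets its own (local) index on amount
CREATE INDEX amount_idx ON expenses_tbl (amount);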
