MySQL queries to generate table statistics

I have a MySQL database table that gets around 10-15k inserts daily, and that volume will certainly increase over the coming months.
- Table Example (reservations): *important fields*
+----+--------+----------+---------+-----+
| ID | people | modified | created | ... |
+----+--------+----------+---------+-----+
I need to provide daily statistics showing how many entries there were (in total, and broken down by number of people), based on a single DATE or a date RANGE that the user selects.
Today I execute two queries per request. It's working fine, with acceptable latency, but I'm wondering whether it will remain stable as the data grows.
- Single Date:
SELECT COUNT(*) from reservations WHERE created='DATE USER SELECTED'
SELECT COUNT(*), people from reservations WHERE created='DATE USER SELECTED' GROUP BY people
- Date Range:
SELECT COUNT(*) from reservations WHERE created BETWEEN 'DATE USER SELECTED' AND 'DATE USER SELECTED';
SELECT COUNT(*), people from reservations WHERE created BETWEEN 'DATE USER SELECTED' AND 'DATE USER SELECTED' GROUP BY people
IN MY VIEW
Pros: Real time statistics.
Cons: Can overload the database with many similar, slow queries.
I have thought about creating a secondary table, named 'statistics', and running a cron job on my server each morning to calculate all the statistics.
- Table Example (statistics):
+----+------+--------------------+---------------------------+---------------------------+-----+
| ID | date | numberReservations | numberReservations2People | numberReservations3People | ... |
+----+------+--------------------+---------------------------+---------------------------+-----+
- IN MY VIEW
Pros: Faster queries; no need to count on every request.
Cons: Statistics are not real time.
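For illustration only (a rough sketch; assuming created is a DATE column and using hypothetical pivoted column names mirroring the table above), the morning cron job could populate the statistics table with something like:

INSERT INTO statistics (date, numberReservations, numberReservations2People, numberReservations3People)
SELECT created,
       COUNT(*),
       SUM(people = 2),   -- boolean expressions sum to the per-group count
       SUM(people = 3)
FROM reservations
WHERE created = CURDATE() - INTERVAL 1 DAY
GROUP BY created;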
What do you think about this? Is there a better approach?

The aggregate queries you've shown can be satisfied efficiently if you have the correct compound index on your table. If you're not sure about compound indexes, it's worth reading up on them.
The index (created, people) on your reservations table is the right one for both of those queries. Both can be satisfied with an efficient index scan known as a loose index scan. You'll find that they are fast enough that you won't need to bother with a secondary table for the foreseeable future in your system.
That's good, because secondary tables like you propose are a common source of confusion and errors.
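For example (a sketch only; a sample date stands in for the user-selected value), the index could be added like this, and both of your queries then become covering-index reads:

ALTER TABLE reservations ADD INDEX idx_created_people (created, people);

-- Both statements can be answered from the index alone:
SELECT COUNT(*) FROM reservations WHERE created = '2015-06-01';
SELECT people, COUNT(*) FROM reservations WHERE created = '2015-06-01' GROUP BY people;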

Related

best database solution for multiple date time column indexing

I am creating a duty management system for workers; the database will contain the following columns:
ID | STAFF_ID | DUTY_ID | DUTY_START | DUTY_END
The system needs to know how many staff are working between the given times.
I am currently using mysqli, which seems to be getting slow as the table data increases.
I am looking for a suitable setup that can handle 500,000 record inserts daily and efficient searches on indexed DUTY_START and DUTY_END columns.
Start-End ranges are nasty to optimize in large tables.
Perhaps the best solution for this situation is to do something really bad -- split off the date. (Yeah 2 wrongs may make a right this time.)
ID | STAFF_ID | DUTY_ID | date | START_time | END_time
Some issues:
If a shift spans midnight, make 2 rows.
Have an index on date, or at least starting with date. Then even though the query must scan all entries for the day in question, at least it is much less than scanning the entire table. (All other 'obvious solutions' get no better than scanning half the table.)
If you want to discuss and debate this further, please provide SELECT statement(s) based on either your 'schema' or mine. Don't worry about performance; we can discuss how to improve them.
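For illustration (hypothetical names following the split-date layout above, not the asker's actual schema), the table and a typical "how many staff are on duty at a given moment" query might look like this:

-- Shifts split per calendar day; a shift crossing midnight becomes two rows
CREATE TABLE duties (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    staff_id INT UNSIGNED NOT NULL,
    duty_id INT UNSIGNED NOT NULL,
    duty_date DATE NOT NULL,
    start_time TIME NOT NULL,
    end_time TIME NOT NULL,
    KEY idx_date_start (duty_date, start_time)   -- index starting with the date
);

-- Staff on duty at 14:30 on a given day: only that day's rows are scanned
SELECT COUNT(DISTINCT staff_id)
FROM duties
WHERE duty_date = '2017-03-01'
  AND start_time <= '14:30:00'
  AND end_time   >= '14:30:00';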

Efficient way to get last record from the database

I have a database with the following table structure:
id | entry | log_type | user_id | created_on    |
---|-------|----------|---------|---------------|
1  | a     | error    | 1       | 1433752884000 |
2  | b     | warn     | 2       | 1433752884001 |
3  | c     | error    | 2       | 1433752884002 |
4  | d     | warn     | 4       | 1433752884003 |
I want to obtain the last record from the table based on the created_on field. Currently I am using the following query to obtain the result list and then pick the last record from it in Java:
select * from log_table l where l.user_id=2 and l.log_type = 'error' order by l.created_on desc;
I am using JPA and I execute the query using .getResultList() on the Query interface. Once I get the result list I do a get(0) to obtain the desired last record.
I have a large table with a lot of data, and the above query takes too long to execute and stalls the application. I cannot add an additional index on the existing data for now. Apart from adding an index, is there an alternative approach to avoid this query stalling?
I was thinking of executing the following query,
select * from log_table l where l.user_id=2 and l.log_type = 'error' order by l.created_on desc limit 1;
Currently I cannot run my second query against that database, as it might cause my application to stall. Will execution of the second query be faster than the first query?
I don't have a sufficiently large dataset available to reproduce the stalling problem on my local system. I tried executing the queries on my local database, but due to the lack of a large dataset I was unable to determine whether the second query would be faster with the addition of "limit".
If the second query above isn't expected to give a better result, what approach should I take to get an optimized query?
In case the second query is good enough to avoid stalling, is that because the DB fetches only one record instead of the entire set of records? Does the database handle looking up/fetching a single record differently from fetching many records (as in the first query), which would improve query timings?
The performance depends...
ORDER BY x LIMIT 1
is a common pattern. It may or may not be very efficient -- It depends on the query and the indexes.
In your case:
where l.user_id=2 and l.log_type = 'error' order by l.created_on desc
this would be optimal:
INDEX(user_id, log_type, created_on)
With that index, it will essentially do one probe to find the row you need. Without that index, it will scan much or all of the table, sort it descending (ORDER BY ... DESC) and deliver the first row (LIMIT 1).
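Concretely (a sketch; assuming the table really is named log_table as in the query above):

-- Compound index: equality columns first, then the ORDER BY column
ALTER TABLE log_table ADD INDEX idx_user_type_created (user_id, log_type, created_on);

-- With that index, MySQL can read just the newest matching row
SELECT *
FROM log_table
WHERE user_id = 2 AND log_type = 'error'
ORDER BY created_on DESC
LIMIT 1;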
Before you call query.getResultList(), you need to call query.setMaxResults(1). This is the equivalent of LIMIT 1.
But be aware that if your Entity has a Collection of related sub-objects JOINed to it in the query, the Entity Manager may still have to do an unbounded select to get all the data it needs to build the first Entity. See this question and answer for more information about that.
In your case, as you only need one Entity, I would recommend lazy-loading any attached Entities after you have done the initial query.

Aggregate data tables

I am building a front-end to a largish DB (tens of millions of rows). The data is water usage for loads of different companies, and the table looks something like:
id | company_id | datetime | reading | used | cost
=============================================================
1 | 1 | 2012-01-01 00:00:00 | 5000 | 5 | 0.50
2 | 1 | 2012-01-01 00:01:00 | 5015 | 15 | 1.50
....
On the frontend users can select how they want to view the data, e.g. 6-hourly increments, daily increments, monthly, etc. What would be the best way to do this quickly? Given how much the data changes and how rarely any one set of data will be seen, caching the query data in memcache or something similar is almost pointless, and there is no way to build the data beforehand as there are too many variables.
I figured using some kind of aggregate table would work, having tables such as readings, readings_6h, readings_1d with exactly the same structure, just already aggregated.
If this is a viable solution, what is the best way to keep the aggregate tables up to date and accurate? Besides the data coming in from meters, the table is read-only; users never have to update or write to it.
A number of possible solutions include:
1) stick to doing queries with group / aggregate functions on the fly
2) doing a basic select and save
SELECT `company_id`, CONCAT_WS(' ', date(`datetime`), '23:59:59') AS datetime,
MAX(`reading`) AS reading, SUM(`used`) AS used, SUM(`cost`) AS cost
FROM `readings`
WHERE `datetime` > '$lastUpdateDateTime'
GROUP BY `company_id`, DATE(`datetime`)
3) duplicate key update (not sure how the aggregation would be done here; I'd also need to make sure the data stays accurate, with nothing counted twice and no missing rows):
INSERT INTO `readings_6h` ...
SELECT FROM `readings` ....
ON DUPLICATE KEY UPDATE .. calculate...
4) other ideas / recommendations?
I am currently doing option 2, which takes around 15 minutes to aggregate ~100k rows into ~30k rows across the aggregate tables (_6h, _1d, _7d, _1m, _1y).
TL;DR What is the best way to view/store aggregate data for numerous reports that can't be cached effectively?
This functionality would be best served by a feature called materialized view, which MySQL unfortunately lacks. You could consider migrating to a different database system, such as PostgreSQL.
There are ways to emulate materialized views in MySQL using stored procedures, triggers, and events. You create a stored procedure that updates the aggregate data. If the aggregate data has to be updated on every insert you could define a trigger to call the procedure. If the data has to be updated every few hours you could define a MySQL scheduler event or a cron job to do it.
There is a combined approach, similar to your option 3, that does not depend on the dates of the input data; imagine what would happen if some new data arrives a moment too late and does not make it into the aggregation. (You might not have this problem, I don't know.) You could define a trigger that inserts new data into a "backlog," and have the procedure update the aggregate table from the backlog only.
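As a rough sketch of the event-based variant (hypothetical table and column names based on the readings_1d idea above; the event scheduler must be enabled):

-- Daily aggregate keyed by company and day, so re-running the job is safe
CREATE TABLE readings_1d (
    company_id INT NOT NULL,
    day DATE NOT NULL,
    reading BIGINT NOT NULL,
    used DECIMAL(12,2) NOT NULL,
    cost DECIMAL(12,2) NOT NULL,
    PRIMARY KEY (company_id, day)
);

-- Re-aggregate the most recent raw rows every hour (requires event_scheduler=ON)
CREATE EVENT refresh_readings_1d
ON SCHEDULE EVERY 1 HOUR
DO
  INSERT INTO readings_1d (company_id, day, reading, used, cost)
  SELECT company_id, DATE(`datetime`), MAX(reading), SUM(`used`), SUM(cost)
  FROM readings
  WHERE `datetime` >= CURDATE() - INTERVAL 1 DAY
  GROUP BY company_id, DATE(`datetime`)
  ON DUPLICATE KEY UPDATE
    reading = VALUES(reading), used = VALUES(used), cost = VALUES(cost);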
All these methods are described in detail in this article: http://www.fromdual.com/mysql-materialized-views

mysql query - optimizing existing MAX-MIN query for a huge table

I have a query that works more or less well (as far as the result is concerned), but it takes about 45 seconds to process. That's definitely too long for presenting the data in a GUI.
So my goal is to find a much faster/more efficient query (something around a few milliseconds would be nice).
My data table has around 2,619,395 entries and is still growing.
Schema:
num | station | fetchDate | exportValue | error
1 | PS1 | 2010-10-01 07:05:17 | 300 | 0
2 | PS2 | 2010-10-01 07:05:19 | 297 | 0
923 | PS1 | 2011-11-13 14:45:47 | 82771 | 0
Explanation
the exportValue is always incrementing
the exportValue represents the actual absolute value
in my case there are 10 stations
every ~15 minutes 10 new entries are written to the table
error is just an indicator for a proper working station
Working query:
select
    YEAR(fetchDate), station, MAX(exportValue) - MIN(exportValue)
from
    registros
where
    exportValue > 0 and error = 0
group by
    station, YEAR(fetchDate)
order by
    YEAR(fetchDate), station
Output:
Year | station | Max-Min
2008 | PS1 | 24012
2008 | PS2 | 23709
2009 | PS1 | 28102
2009 | PS2 | 25098
My thoughts on it:
writing several queries with BETWEEN clauses, like BETWEEN '2008-01-01' AND '2008-01-02' to fetch MIN(exportValue) and BETWEEN '2008-12-30' AND '2008-12-31' to grab MAX(exportValue). Problem: a lot of queries, plus the problem of having no data in a specified time range (it's not guaranteed that there will be data).
limiting the result set to my 10 stations only, using ORDER BY MIN(fetchDate). Problem: this also takes a long time to process.
Additional Info:
I'm using the query in a Java application (JPA 2.0). That means it would be possible to do some post-processing on the result set if necessary.
Any help/approaches/ideas are very appreciated. Thanks in advance.
Adding suitable indexes will help.
Two compound indexes will speed things up significantly:
ALTER TABLE tbl_name ADD INDEX (error, exportValue);
ALTER TABLE tbl_name ADD INDEX (station, fetchDate);
This query running on 3000 records should be extremely fast.
Suggestions:
Do you have a PK set on this table? Perhaps (station, fetchDate)?
Add indexes; you should experiment with indexes as rich.okelly suggested in his answer.
Depending on the results of those experiments, try breaking your query into multiple statements inside one stored procedure; this way you will not lose time to network traffic between multiple queries sent from the client to MySQL.
You mentioned that you tried separate queries and that there is a problem when there is no data for a particular month; that is a regular case in business applications, and you should handle it in a "master query" (stored procedure or application code).
I guess fetchDate is the current date and time at the moment of record insertion; consider keeping previous months' data in a summary table with fields year, month, station, max(exportValue), min(exportValue) (a sketch follows below). This means inserting summary records into the summary table at the end of each month; deleting, keeping, or moving the detail records to a separate table is your choice.
Since your table is growing rapidly (new rows every 15 minutes), you should take that last suggestion into account. There is probably no need to keep the full detailed history in one place; archiving data is a process that should be done as part of regular maintenance.
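As a rough sketch of that summary-table idea (hypothetical names; the detail table is called registros as in the query above):

-- One row per station per month, filled at month end
CREATE TABLE registros_monthly (
    year SMALLINT NOT NULL,
    month TINYINT NOT NULL,
    station VARCHAR(16) NOT NULL,
    maxExportValue INT NOT NULL,
    minExportValue INT NOT NULL,
    PRIMARY KEY (year, month, station)
);

-- Run at the start of each month for the month that just ended
INSERT INTO registros_monthly (year, month, station, maxExportValue, minExportValue)
SELECT YEAR(fetchDate), MONTH(fetchDate), station, MAX(exportValue), MIN(exportValue)
FROM registros
WHERE error = 0 AND exportValue > 0
  AND fetchDate >= DATE_FORMAT(CURDATE() - INTERVAL 1 MONTH, '%Y-%m-01')
  AND fetchDate <  DATE_FORMAT(CURDATE(), '%Y-%m-01')
GROUP BY YEAR(fetchDate), MONTH(fetchDate), station;

-- The yearly Max-Min report then reads the small summary table instead of the detail table
SELECT year, station, MAX(maxExportValue) - MIN(minExportValue)
FROM registros_monthly
GROUP BY year, station
ORDER BY year, station;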

Best practice question for MySQL: order by id or date?

This is kind of a noobish question, but it's one that I've never been given a straight answer on.
Suppose I have a DB table with the following fields and values:
| id | date_added | balance |
+------------------------------------+
| 1 | 2009-12-01 19:43:22 | 1237.50 |
| 2 | 2010-01-12 03:19:54 | 473.00 |
| 3 | 2010-01-12 03:19:54 | 2131.20 |
| 4 | 2010-01-20 11:27:31 | 3238.10 |
| 5 | 2010-01-25 22:52:07 | 569.40 |
+------------------------------------+
This is for a very basic 'accounting' sub-system. I want to get the most recent balance. The id field is set to auto_increment. Typically, I would use:
SELECT balance FROM my_table ORDER BY date_added DESC LIMIT 1;
But I need to make absolutely sure that the value returned is the most recent... (see id# 2 & 3 above)
1) Would I be better off using:
SELECT balance FROM my_table ORDER BY id DESC LIMIT 1;
2) Or would this be a better solution?:
SELECT balance FROM my_table ORDER BY date_added,id DESC LIMIT 1;
AFAIK, auto_increment works pretty well, but is it reliable enough to sort something this crucial by? That's why I'm thinking sorting by both fields is a better idea, but I've seen some really quirky behavior in MySQL when I've done that in the past. Or if there's an even better solution, I'd appreciate your input.
Thanks in advance!
Brian
If there is a chance you'll get two added with the same date, you'll probably need:
SELECT balance FROM my_table ORDER BY date_added DESC,id DESC LIMIT 1;
(note the 'descending' clause on both fields).
However, you will need to take into account what you want to happen when someone adds an adjusting entry on the 2nd of February which is given the date 31st of January to ensure the month of January is complete. It will have an ID greater than those of entries made on the 1st of February.
Generally, accounting systems just work on the date. Perhaps if you could tell us why the order is important, we could make other suggestions.
In response to your comment:
I would love to hear any other ideas or advice you might have, even if they're off-topic since I have zero knowledge of accounting-type database models.
I would offer a few pieces of advice - this is all I could think of immediately; I usually spew forth much more "advice" with even less encouragement :-) The first two, more database-related than accounting-related, are:
First, do everything in third normal form and only revert if and when you have performance problems. This will save you a lot of angst with duplicate data which may get out of step. Even if you do revert, use triggers and other DBMS capabilities to ensure that data doesn't get out of step.
As an example, if you want to speed up your searches on a last_name column, you can create an upper_last_name column (indexed) and then use that to locate records matching your already upper-cased search term. This will almost always be faster than the per-row function upper(last_name). You can use an insert/update trigger to ensure upper_last_name is always set correctly, and this incurs the cost only when the name changes, not every time you search.
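A minimal sketch of that trigger trick (hypothetical table and column names, not from the question):

-- Searchable, indexed copy of last_name kept in sync by triggers
CREATE TABLE customers (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    last_name VARCHAR(100) NOT NULL,
    upper_last_name VARCHAR(100) NOT NULL,
    KEY idx_upper_last_name (upper_last_name)
);

CREATE TRIGGER customers_bi BEFORE INSERT ON customers
FOR EACH ROW SET NEW.upper_last_name = UPPER(NEW.last_name);

CREATE TRIGGER customers_bu BEFORE UPDATE ON customers
FOR EACH ROW SET NEW.upper_last_name = UPPER(NEW.last_name);

-- The search now uses the index instead of evaluating UPPER() per row
SELECT id, last_name FROM customers WHERE upper_last_name = UPPER('smith');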
Secondly, don't duplicate data even across tables (like your current schema) unless you can use those same trigger-type tricks to guarantee the data won't get out of step. What will your customer do when you send them an invoice where the final balance doesn't match the starting balance plus purchases? That's not going to make your company look very professional :-)
Thirdly (and this is more accounting-related), you generally don't need to worry about the number of transactions when calculating balances on the fly. That's because accounting systems usually have a roll-over function at year end which resets the opening balances.
So you're usually never having to process more than a year's worth of data at once which, unless you're the US government or Microsoft, is not that onerous.
Sorting by id may be faster, but sorting by the datetime is safer; use the latter, and if you have performance issues, add an index.
Personally I'd never trust an autoincrement in that way. I'd sort by the date.
I'm pretty sure that the ID is guaranteed to be unique, but not necessarily sequential and increasing.