I am building a front-end to a largish db (10's of millions of rows). The data is water usage for loads of different companies and the table looks something like:
id | company_id | datetime | reading | used | cost
=============================================================
1 | 1 | 2012-01-01 00:00:00 | 5000 | 5 | 0.50
2 | 1 | 2012-01-01 00:01:00 | 5015 | 15 | 1.50
....
On the frontend users can select how they want to view the data, e.g. 6-hourly increments, daily increments, monthly, etc. What would be the best way to do this quickly? Given how often the data changes and how rarely any one set of results will be viewed, caching the query results in memcache or something similar is almost pointless, and there is no way to build the data beforehand because there are too many variables.
I figured some kind of aggregate table would work, having tables such as readings, readings_6h, readings_1d with exactly the same structure, just already aggregated.
If this is a viable solution, what is the best way to keep the aggregate tables up to date and accurate? Besides the data coming in from meters, the table is read only. Users never have to update or write to it.
A number of possible solutions include:
1) stick to doing queries with group / aggregate functions on the fly
2) doing a basic select and save
SELECT `company_id`, CONCAT_WS(' ', DATE(`datetime`), '23:59:59') AS datetime,
MAX(`reading`) AS reading, SUM(`used`) AS used, SUM(`cost`) AS cost
FROM `readings`
WHERE `datetime` > '$lastUpdateDateTime'
GROUP BY `company_id`, DATE(`datetime`)
3) insert with on-duplicate-key update (not sure how the aggregation would be done here, or how to make sure the data stays accurate, with nothing counted twice and no rows missed)
INSERT INTO `readings_6h` ...
SELECT FROM `readings` ....
ON DUPLICATE KEY UPDATE .. calculate...
4) other ideas / recommendations?
I am currently doing option 2, which takes around 15 minutes to aggregate roughly 100k rows into roughly 30k rows across the aggregate tables (_6h, _1d, _7d, _1m, _1y).
TL;DR: What is the best way to store and view aggregate data for numerous reports that can't be cached effectively?
This functionality would be best served by materialized views, a feature that MySQL unfortunately lacks. You could consider migrating to a different database system, such as PostgreSQL.
There are ways to emulate materialized views in MySQL using stored procedures, triggers, and events. You create a stored procedure that updates the aggregate data. If the aggregate data has to be updated on every insert you could define a trigger to call the procedure. If the data has to be updated every few hours you could define a MySQL scheduler event or a cron job to do it.
There is a combined approach, similar to your option 3, that does not depend on the dates of the input data; imagine what would happen if some new data arrives a moment too late and does not make it into the aggregation. (You might not have this problem, I don't know.) You could define a trigger that inserts new data into a "backlog," and have the procedure update the aggregate table from the backlog only.
All these methods are described in detail in this article: http://www.fromdual.com/mysql-materialized-views
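Pulling those pieces together, a minimal sketch of the trigger + backlog + event approach could look like the following. The names readings_backlog, refresh_readings_6h and ev_refresh_readings_6h are illustrative, the 6-hour bucket is computed from the Unix timestamp, and it assumes readings_6h has a unique key on (company_id, datetime); adjust to your schema.

-- backlog table receives a copy of every new reading
CREATE TABLE readings_backlog LIKE readings;

DELIMITER //

CREATE TRIGGER readings_after_insert AFTER INSERT ON readings
FOR EACH ROW
BEGIN
  INSERT INTO readings_backlog (company_id, `datetime`, reading, used, cost)
  VALUES (NEW.company_id, NEW.`datetime`, NEW.reading, NEW.used, NEW.cost);
END //

CREATE PROCEDURE refresh_readings_6h()
BEGIN
  INSERT INTO readings_6h (company_id, `datetime`, reading, used, cost)
  SELECT company_id,
         -- round each reading down to its 6-hour (21600 s) slot
         FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(`datetime`) / 21600) * 21600) AS bucket,
         MAX(reading), SUM(used), SUM(cost)
  FROM readings_backlog
  GROUP BY company_id, bucket
  ON DUPLICATE KEY UPDATE
    reading = GREATEST(readings_6h.reading, VALUES(reading)),
    used    = readings_6h.used + VALUES(used),
    cost    = readings_6h.cost + VALUES(cost);
  -- a real implementation would only delete the rows it just processed (e.g. by id range)
  DELETE FROM readings_backlog;
END //

DELIMITER ;

-- refresh every 15 minutes (requires event_scheduler=ON)
CREATE EVENT ev_refresh_readings_6h
  ON SCHEDULE EVERY 15 MINUTE
  DO CALL refresh_readings_6h();

Because the procedure reads only the backlog, it never re-aggregates old rows, and late-arriving data is still picked up on the next run.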
I am creating a duty management system for workers; the database will contain the following columns:
ID | STAFF_ID | DUTY_ID | DUTY_START | DUTY_END
The system needs to know how many staff are working between a given start and end time.
I am currently using mysqli, which seems to be getting slow as the table grows.
I am looking for a suitable service that can handle 500,000 record inserts per day and search on the DUTY_START and DUTY_END indexes.
Start-End ranges are nasty to optimize in large tables.
Perhaps the best solution for this situation is to do something really bad -- split off the date. (Yeah 2 wrongs may make a right this time.)
ID | STAFF_ID | DUTY_ID | date | START_time | END_time
Some issues:
If a shift spans midnight, make 2 rows.
Have an index on date, or at least starting with date. Then even though the query must scan all entries for the day in question, at least it is much less than scanning the entire table. (All other 'obvious solutions' get no better than scanning half the table.)
If you want to discuss and debate this further, please provide SELECT statement(s) based on either your 'schema' or mine. Don't worry about performance; we can discuss how to improve them.
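For example, on the split schema a query like the following could count staff on duty in a given window (a sketch only: duties is a placeholder table name, and the date is arbitrary):

-- index starting with date, so only one day's rows are scanned
ALTER TABLE duties ADD INDEX idx_date_start (date, START_time);

-- staff on duty at any point between 14:00 and 15:00 on a given day
SELECT COUNT(DISTINCT STAFF_ID)
FROM duties
WHERE date = '2024-03-01'
  AND START_time < '15:00:00'
  AND END_time   > '14:00:00';

The overlap test (start before the window ends, end after the window starts) also handles duties that only partially fall inside the window.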
I am going to have data relating the pull force of a block magnet to its three dimensions, in an Excel table in this form:
a/mm | b/mm | c/mm | force/N
---------------------------------
1 | 1 | 1 | 0.11
1 | 1 | 2 | 0.19
1 | 1 | 3 | 0.26
...
100 | 80 | 59 | 7425
100 | 80 | 60 | 7542
[diagram showing what a, b and c mean]
There is a row for each block magnet whose a, b and c in mm are whole numbers and the ranges are 1-100 for a, 1-80 for b and 1-60 for c. So in total there are 100*80*60=480,000 rows.
I want to make an online calculator where you enter a, b and c and it gives you the force. For this, I want to use a query something like this:
SELECT `force` FROM blocks WHERE a=$a AND b=$b AND c=$c LIMIT 1
I want to make this query as fast as possible. I would like to know what measures I can take to optimise this search. How should I arrange the data in the SQL table? Should I keep the structure of the table the same as in my Excel sheet? Should I keep the order of the rows as it is? What indexes should I use if any? Should I add a unique ID column to the table? I am open to any suggestions to speed this up.
Note that:
The data is already nicely sorted by a, b and c
The table already contains all the data and nothing else will be done to it except displaying it, so we don't have to worry about the speed of UPDATE queries
a and b are interchangeable, so I could delete all the rows where b>a
Increasing a, b or c will always result in a greater pull force
I want this calculator to be a part of a website. I use PHP and MySQL.
If possible, minimising the memory needed to store the table would also be desirable, though speed is the priority
Please don't suggest answers that use a formula instead of my table of data. It is a requirement that the data are extracted from the database rather than calculated
Finally, can you estimate:
How long would such a SELECT query take, with and without optimization?
How much memory would such a table require?
I would create your table using a, b, c as the primary key (since I assume for each triplet of a, b, c there will be no more than one record).
The time this select will take depends on the RDBMS you use, but with the primary key it should be very quick. How many queries per minute do you expect at peak?
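A minimal sketch of that layout, with the lookup it enables (the column types are assumptions; `force` is backtick-quoted because FORCE is a reserved word in MySQL):

CREATE TABLE blocks (
  a TINYINT UNSIGNED NOT NULL,     -- 1-100 mm
  b TINYINT UNSIGNED NOT NULL,     -- 1-80 mm
  c TINYINT UNSIGNED NOT NULL,     -- 1-60 mm
  `force` DECIMAL(7,2) NOT NULL,   -- pull force in N
  PRIMARY KEY (a, b, c)
) ENGINE=InnoDB;

SELECT `force` FROM blocks WHERE a = 10 AND b = 20 AND c = 30;

With the primary key doing the lookup, each query is a single B-tree descent over 480,000 rows, which should stay well under a millisecond of database time.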
If you want to make the app as fast as possible, store the data in a file and load it into memory in the app or app server (your overall architecture is unclear). Whatever language you are using to develop the app probably supports a hash-table lookup data structure.
There are good reasons for storing data in a database: transactional integrity, security mechanisms, backup/restore functionality, replication, complex queries, and more. Your question doesn't actually suggest the need for any database functionality. You just want a lookup table for a fixed set of data.
If you really want to store the data in a database, then follow the above procedure. That is, load it into memory for users to query.
If you have some requirement to use a database (say, your data is changing), then follow my version of USeptim's advice: create a table with a primary key on all four columns (or alternatively use a secondary index on all four columns). The database will then do something similar to the first solution. The difference is that the database will (in general) use B-trees to search the data instead of hash functions.
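If the table ends up with a surrogate id as its primary key instead, the four-column secondary index mentioned above could be added like this (a sketch; the index name is made up):

ALTER TABLE blocks ADD INDEX idx_abc_force (a, b, c, `force`);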
I have a MySQL database table that gets around 10-15k inserts daily, and this will certainly increase in the coming months.
- Table Example (reservations): *important fields*
+----+--------+----------+---------+-----+
| ID | people | modified | created | ... |
+----+--------+----------+---------+-----+
I need to provide daily statistics showing how many entries there were (in total, and broken down by the number of people), based on a DATE or a date RANGE that the user selects.
Today I'm executing two queries per request. It's working fine, with acceptable delay, but I'm wondering whether it will remain stable with more data.
- Single Date:
SELECT COUNT(*) from reservations WHERE created='DATE USER SELECTED'
SELECT COUNT(*), people from reservations WHERE created='DATE USER SELECTED' GROUP BY people
- Date Range:
SELECT COUNT(*) from reservations WHERE created BETWEEN 'DATE USER SELECTED' AND 'DATE USER SELECTED';
SELECT COUNT(*), people from reservations WHERE created BETWEEN 'DATE USER SELECTED' AND 'DATE USER SELECTED' GROUP BY people
IN MY VIEW
Pros: Real time statistics.
Cons: Can overload the database, with similar and slow queries.
I thought about creating a secondary table named 'statistics' and running a cron job on my server each morning to calculate all the statistics.
- Table Example (statistics):
+----+------+--------------------+---------------------------+---------------------------+-----+
| ID | date | numberReservations | numberReservations2People | numberReservations3People | ... |
+----+------+--------------------+---------------------------+---------------------------+-----+
- IN MY VIEW
Pros: Faster queries, do not need to count every request.
Cons: Not real time statistics.
What do you think about it? Is there a better approach?
The aggregate queries you've shown can efficiently be satisfied if you have the correct compound index in your table. If you're not sure about compound indexes, you can read about them.
The index (created, people) on your reservations table is the right one for both those queries. They can both be satisfied by an efficient range scan of that index. You'll find that they are fast enough that you don't need to bother with a secondary table for the foreseeable future in your system.
That's good, because secondary tables like you propose are a common source of confusion and errors.
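Creating that index is a one-liner (the index name here is just illustrative):

ALTER TABLE reservations ADD INDEX idx_created_people (created, people);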
I am parsing a collection of monthly lists of bulletin board systems from 1993-2000 in a city. The goal is to make visualizations from this data. For example, a line chart that shows month by month the total number of BBSes using various kinds of BBS software.
I have assembled the data from all these lists into one large table of around 17,000 rows. Each row represents a single BBS during a single month in time. I know this is probably not the optimal table scheme, but that's a question for a different day. The structure is something like this:
date | name | phone | codes | sysop | speed | software
1990-12 | Aviary | xxx-xxx-xxxx | null | Birdman | 2400 | WWIV
Google Fusion Tables offers a function called "summarize" (or "aggregation" in the older version). If I make a view summarizing by the "date" and "software" columns, then FT produces a table of around 500 rows with three columns: date, software, count. Each row lists the number of BBSes using a given type of software in a given month. With this data, I can make the graph I described above.
So, now to my question. Rather than FT, I'd like to work on this data in MySQL. I have imported the same 17,000-row table into a MySQL database, and have been trying various queries with COUNT and DISTINCT, hoping to return a list equivalent to what I get from FT's Summarize function. But nothing I've tried has worked.
Can anyone suggest how to structure such a query?
Kirkman, you can do this using a COUNT function and the GROUP BY statement (which is used in conjunction with aggregate SQL functions)
select date, software, count(*) as cnt
from your_table
group by date, software
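If you want the rows to come back in chart-ready order (month by month for each software), you can simply add an ORDER BY:

select date, software, count(*) as cnt
from your_table
group by date, software
order by date, software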
I have a more or less well-working query (as far as the result is concerned), but it takes about 45 seconds to be processed. That's definitely too long for presenting the data in a GUI.
So my goal is to find a much faster/more efficient query (something around a few milliseconds would be nice).
My data table has around 2,619,395 entries (up from about 3,000 when I first asked) and is still growing.
Schema:
num | station | fetchDate | exportValue | error
1 | PS1 | 2010-10-01 07:05:17 | 300 | 0
2 | PS2 | 2010-10-01 07:05:19 | 297 | 0
923 | PS1 | 2011-11-13 14:45:47 | 82771 | 0
Explanation
the exportValue is always incrementing
the exportValue represents the actual absolute value
in my case there are 10 stations
every ~15 minutes 10 new entries are written to the table
error is just an indicator for a proper working station
Working query:
SELECT YEAR(fetchDate), station, MAX(exportValue) - MIN(exportValue)
FROM registros
WHERE exportValue > 0 AND error = 0
GROUP BY station, YEAR(fetchDate)
ORDER BY YEAR(fetchDate), station
Output:
Year | station | Max-Min
2008 | PS1 | 24012
2008 | PS2 | 23709
2009 | PS1 | 28102
2009 | PS2 | 25098
My thoughts on it:
writing several queries with BETWEEN statements, like 'between 2008-01-01 and 2008-01-02' to fetch the MIN(exportValue) and 'between 2008-12-30 and 2008-12-31' to grab the MAX(exportValue) - problem: a lot of queries, plus the problem of having no data in a specified time range (it's not guaranteed that there will be data)
limiting the result set to my 10 stations only, using ORDER BY MIN(fetchDate) - problem: this also takes a long time to process
Additional Info:
I'm using the query in a JAVA Application. That means, it would be possible to do some post-processing on the resultset if necessary. (JPA 2.0)
Any help/approaches/ideas are very appreciated. Thanks in advance.
Adding suitable indexes will help.
2 compound indexes will speed things up significantly:
ALTER TABLE tbl_name ADD INDEX (error, exportValue);
ALTER TABLE tbl_name ADD INDEX (station, fetchDate);
This query running on 3000 records should be extremely fast.
Suggestions:
do you have a PK set on this table? (station, fetchDate)?
add indexes; you should experiment with indexes as rich.okelly suggested in his answer
depending on the results of those experiments, try breaking your query into multiple statements inside one stored procedure; this way you will not lose time on network traffic between multiple queries sent from the client to MySQL
you mentioned that you tried separate queries and that there is a problem when there is no data for a particular month; that is a regular case in business applications, and you should handle it in a "master query" (stored procedure or application code)
I guess fetchDate is the current date and time at the moment of record insertion; consider keeping previous months' data in a sort of summary table with the fields year, month, station, max(exportValue), min(exportValue) - this means you would insert summary records into the summary table at the end of each month; deleting, keeping or moving the detail records to a separate table is your choice (see the sketch at the end of this answer)
Since your table is growing rapidly (new rows every 15 minutes), you should take the last suggestion into account. There is probably no need to keep the full detailed history in one place; archiving data is a process that should be done as part of regular maintenance.
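A minimal sketch of that summary table and the end-of-month rollup, reusing the registros table from the query above (the summary table name, the column types and the placeholder dates are assumptions):

CREATE TABLE registros_summary (
  year    SMALLINT NOT NULL,
  month   TINYINT  NOT NULL,
  station VARCHAR(10) NOT NULL,   -- type assumed; match your existing column
  maxExportValue INT NOT NULL,
  minExportValue INT NOT NULL,
  PRIMARY KEY (year, month, station)
);

-- run once a month (cron job or MySQL event) for the month that just ended
INSERT INTO registros_summary (year, month, station, maxExportValue, minExportValue)
SELECT YEAR(fetchDate), MONTH(fetchDate), station,
       MAX(exportValue), MIN(exportValue)
FROM registros
WHERE exportValue > 0 AND error = 0
  AND fetchDate >= '2011-10-01' AND fetchDate < '2011-11-01'   -- placeholder month boundaries
GROUP BY YEAR(fetchDate), MONTH(fetchDate), station;

Since exportValue only ever increases, the yearly report can then be computed from the much smaller summary table as MAX(maxExportValue) - MIN(minExportValue) per station and year.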