best database solution for multiple date time column indexing - mysql

I am creating a duty management system for workers. The table will contain the following columns:
ID | STAFF_ID | DUTY_ID | DUTY_START | DUTY_END
The system needs to know how many staff are working during a given time range.
I am currently using mysqli, which seems to be getting slower as the table data increases.
I am looking for a suitable solution that can handle around 500,000 inserts daily and search efficiently on the DUTY_START and DUTY_END columns.

Start-End ranges are nasty to optimize in large tables.
Perhaps the best solution for this situation is to do something really bad -- split off the date. (Yeah 2 wrongs may make a right this time.)
ID | STAFF_ID | DUTY_ID | date | START_time | END_time
Some issues:
If a shift spans midnight, make 2 rows.
Have an index on date, or at least starting with date. Then even though the query must scan all entries for the day in question, at least it is much less than scanning the entire table. (All other 'obvious solutions' get no better than scanning half the table.)
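Here is a minimal sketch of that layout and of a query against it; the table and index names are illustrative, and the timestamp is just an example:

-- Split-date schema: one row per shift per day (a shift spanning midnight
-- becomes two rows), with an index that starts with the date.
CREATE TABLE duty (
  id         INT UNSIGNED NOT NULL AUTO_INCREMENT,
  staff_id   INT UNSIGNED NOT NULL,
  duty_id    INT UNSIGNED NOT NULL,
  duty_date  DATE NOT NULL,
  start_time TIME NOT NULL,
  end_time   TIME NOT NULL,
  PRIMARY KEY (id),
  INDEX idx_date (duty_date)
) ENGINE=InnoDB;

-- How many staff are on duty at 14:30 on 2024-03-01? Only that day's rows
-- are scanned, thanks to the index on duty_date.
SELECT COUNT(DISTINCT staff_id)
FROM duty
WHERE duty_date  = '2024-03-01'
  AND start_time <= '14:30:00'
  AND end_time   >= '14:30:00';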
If you want to discuss and debate this further, please provide SELECT statement(s) based on either your 'schema' or mine. Don't worry about performance; we can discuss how to improve them.

Related

On the Efficiency of Data Infrastructure, Storage and Retrieval with SQL

I'm curious about which is the most efficient way to store and retrieve data in and from a database.
The table:
+----+--------+--------+   +-------+   +----------+
| id | height | weight | ← | bmi   | ← | category |
+----+--------+--------+   +-------+   +----------+
|  1 |    184 |     64 |   | 18.90 |   |        2 |
|  2 |    147 |     80 |   | 37.02 |   |        4 |
|  … |      … |      … |   |     … |   |        … |
+----+--------+--------+   +-------+   +----------+
From a storage perspective
If we want to be more efficient in terms of storing the data, the bmi and category columns would be redundant, adding data we could otherwise have derived from the two columns height and weight.
From a retrieval perspective
Leaving out the category column we could ask
SELECT *
FROM bmi_entry
WHERE bmi >= 18.50 AND bmi < 25.00
and leaving out the bmi column as well, that becomes
SELECT *
FROM bmi_entry
WHERE weight / ((height / 100) * (height / 100)) >= 18.50
AND weight / ((height / 100) * (height / 100)) < 25.00
However, the calculation could hypothetically take much longer than simply comparing a column to a value, in which case
SELECT *
FROM bmi_entry
WHERE category = 2
would be the far superior query in terms of retrieval time.
Best practice?
At first, I was about to go with method one, thinking why store "useless" data and take up storage space… but then I thought about the implementation and how potentially having to recalculate those "obsolete" fields for every single row every time I want to sort and retrieve specific sets of BMI entries within specific ranges or categories could dramatically slow down the time it takes to collect the data.
Ultimately:
Wouldn't the arithmetic functions of division and multiplication take more time and thus slow down the user experience?
Would there ever be a case in which you would prioritise storage space over retrieval time?
If the answer to (1.) is a simple "yup", you can comment that below. :-)
If you have a more in depth elaboration on either (1.) or (2.), however, feel free to post that or those as well, as I, and others, would be very interested in reading more!
Wouldn't the arithmetic functions of division and multiplication take more time and thus slow down the user experience?
You might have assumed "yup" would be the answer, but in fact the complexity of the arithmetic is not the issue. The issue is that you shouldn't need to evaluate the expression at all to decide whether a row belongs in your query result.
When you search on an expression instead of an indexed column, MySQL is forced to visit every single row and evaluate the expression. This is a table scan. The cost of the query, even disregarding the possible slowness of the arithmetic, grows in linear proportion to the number of rows.
In terms of algorithmic complexity, we say this is an "Order N" cost. Even if it's actually "N times a fixed multiplier due to the cost of the arithmetic," it's still the N we're worried about, especially if N is ever-increasing.
You showed the example where you stored an extra column for the pre-calculated bmi or category, but that alone wouldn't avoid the table-scan. Searching for category=2 is still going to cause a table-scan unless category is an indexed column.
Indexing a column is fine, but it's a little more tricky to index an expression. Recent versions of MySQL have given us that ability for most types of expressions, but if you're using an older version of MySQL you may be out of luck.
With MySQL 8.0 you can index the expression without having to store a calculated column. The index is built from the result of the expression. The index itself takes storage space, but so would an index on a stored column. Read more about this here: https://dev.mysql.com/doc/refman/8.0/en/create-index.html in the section on "Functional Key Parts".
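As a rough sketch of that feature (assuming MySQL 8.0.13 or later, height stored in centimetres, weight in kilograms, and an illustrative index name):

-- Functional key part: the index is built from the expression itself.
ALTER TABLE bmi_entry
  ADD INDEX idx_bmi ((weight / ((height / 100) * (height / 100))));

-- The optimizer can use the index when the WHERE clause repeats the
-- identical expression:
SELECT *
FROM bmi_entry
WHERE (weight / ((height / 100) * (height / 100))) >= 18.50
  AND (weight / ((height / 100) * (height / 100))) < 25.00;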
Would there ever be a case in which you would prioritise storage space over retrieval time?
Sure. Suppose you have a very large amount of data, but you don't need to run queries especially frequently or quickly.
Example: I managed a database of bulk statistics that we added to throughout the month, but we only needed to query it about once at the end of the month to make a report. It didn't matter that this report took a couple of hours to prepare, because the managers who read the report would be viewing it in a document, not by running the query themselves. Meanwhile, the storage space for the indexes would have been too much for the server the data was on, so they were dropped.
Once a month I would kick off the task of running the query for the report, and then switch windows and go do some of my other work for a few hours. As long as I got the result by the time the people who needed to read it were expecting it (e.g. the next day) I didn't care how long it took to do the query.
Ultimately the best practice you're looking for varies, based on your needs and the resources you can utilize for the task.
There is no best practice. It depends on the considerations of what you are trying to do. Here are some considerations:
Consistency
Storing the data in separate columns means that the values can get out of sync.
Using a computed column or view means that the values are always consistent.
Updatability (the inverse of consistency)
Storing the data in separate columns means that the values can be updated.
Storing the data as computed columns means that the values cannot be separately updated.
Read Performance
Storing the data in separate columns increases the size of the rows, which tends to increase the size of the table. This can decrease performance because more data must be read -- for any query on the table.
This is not an issue for computed columns, unless they are persisted in some way.
Indexing
Either method supports indexing.
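To make the trade-off concrete, here is a sketch using a generated column (available in MySQL 5.7+); the column and index names are illustrative, and height/weight are assumed to be in centimetres and kilograms:

-- A VIRTUAL generated column adds no row storage; STORED would persist it.
-- Either way the value is always derived, so it cannot get out of sync,
-- and it can be indexed.
ALTER TABLE bmi_entry
  ADD COLUMN bmi DECIMAL(5,2)
    GENERATED ALWAYS AS (weight / ((height / 100) * (height / 100))) VIRTUAL,
  ADD INDEX idx_bmi_col (bmi);

SELECT * FROM bmi_entry WHERE bmi >= 18.50 AND bmi < 25.00;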

Sorting query performance in PHP+MYSQL

I have a table with a huge number of records. When I query that table, especially when using ORDER BY, the query takes too much execution time.
How can I optimize this table for Sorting & Searching?
Here is an example scheme of my table (jobs):
+----+-----------------+---------------------+
| id | title           | created_at          |
+----+-----------------+---------------------+
|  1 | Web Developer   | 2018-04-12 10:38:00 |
|  2 | QA Engineer     | 2018-04-15 11:10:00 |
|  3 | Network Admin   | 2018-04-17 11:15:00 |
|  4 | System Analyst  | 2018-04-19 11:19:00 |
|  5 | UI/UX Developer | 2018-04-20 12:54:00 |
+----+-----------------+---------------------+
I have been searching for a while and learned that creating an INDEX can help improve performance. Can someone please elaborate on how the performance can be increased?
Add "explain" word before ur query, and check result
explain select ....
There u can see what u need to improve, then add index on ur search and/or sorting field and run explain query again
If you want to improve the performance of your query, one approach is to paginate it. You can apply a LIMIT (as large as you want) and specify the page you want to display.
For example: SELECT * FROM your_table LIMIT 50 OFFSET 0.
I don't know if this will solve your problem, but you can try it ;)
Indexes are the database's way of creating lookup trees (B-Trees in most cases) to more efficiently sort, filter, and find rows.
Indexes are used to find rows with specific column values quickly.
Without an index, MySQL must begin with the first row and then read
through the entire table to find the relevant rows. The larger the
table, the more this costs. If the table has an index for the columns
in question, MySQL can quickly determine the position to seek to in
the middle of the data file without having to look at all the data.
This is much faster than reading every row sequentially.
https://dev.mysql.com/doc/refman/5.5/en/mysql-indexes.html
You can use EXPLAIN to help identify how the query is currently running, and identify areas of improvement. It's important to not over-index a table, for reasons probably beyond the scope of this question, so it'd be good to do some research on efficient uses of indexes.
ALTER TABLE jobs
ADD INDEX(created_at);
(Yes, there is a CREATE INDEX syntax that does the equivalent.)
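For reference, a CREATE INDEX equivalent of the ALTER TABLE above (the index name is illustrative):

CREATE INDEX idx_created_at ON jobs (created_at);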
Then, in the query, do
ORDER BY created_at DESC
However, with 15M rows, it may still take a long time. Will you be filtering (WHERE)? LIMITing?
If you really want to return to the user 15M rows -- well, that is a lot of network traffic, and that will take a long time.
MySQL details
Regardless of the index declaration or version, the ASC/DESC in ORDER BY will be honored. However it may require a "filesort" instead of taking advantage of the ordering built into the index's BTree.
In some cases, the WHERE or GROUP BY is too messy for the Optimizer to make use of any index. But if it can, then...
(Before MySQL 8.0) While it is possible to declare an index DESC, the attribute is ignored. However, ORDER BY .. DESC is honored; it scans the data backwards. This also works for ORDER BY a DESC, b DESC, but not if you have a mixture of ASC and DESC in the ORDER BY.
MySQL 8.0 does create DESC indexes; that is, the BTree containing the index is stored 'backwards'.
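A small sketch of the difference, assuming MySQL 8.0 and the jobs table above (the index name is illustrative):

-- In 8.0 this really builds a descending BTree; in 5.x the DESC is ignored,
-- but ORDER BY ... DESC can still read an ascending index backwards.
ALTER TABLE jobs ADD INDEX idx_created_desc (created_at DESC);

SELECT id, title, created_at
FROM jobs
ORDER BY created_at DESC
LIMIT 50;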

MySQL queries to generate table statistics

I have a MySQL database table that gets around 10-15k inserts daily, and this will certainly increase in the coming months.
- Table Example (reservations): *important fields*
+----+--------+----------+---------+-----+
| ID | people | modified | created | ... |
+----+--------+----------+---------+-----+
I need to provide daily statistics showing how many entries there were (in total, and broken down by number of people), based on a DATE or a date RANGE that the user selects.
Today I'm executing two queries per request. It's working fine, with acceptable delay, but I'm wondering whether it will stay that way as the data grows.
- Single Date:
SELECT COUNT(*) from reservations WHERE created='DATE USER SELECTED'
SELECT COUNT(*), people from reservations WHERE created='DATE USER SELECTED' GROUP BY people
- Date Range:
SELECT COUNT(*) from reservations WHERE created BETWEEN 'DATE USER SELECTED' AND 'DATE USER SELECTED';
SELECT COUNT(*), people from reservations WHERE created BETWEEN 'DATE USER SELECTED' AND 'DATE USER SELECTED' GROUP BY people
IN MY VIEW
Pros: Real time statistics.
Cons: Can overload the database, with similar and slow queries.
I thought about creating a secondary table, named 'statistics', and running a cronjob on my server each morning to calculate all the statistics.
- Table Example (statistics):
+----+------+--------------------+---------------------------+---------------------------+-----+
| ID | date | numberReservations | numberReservations2People | numberReservations3People | ... |
+----+------+--------------------+---------------------------+---------------------------+-----+
- IN MY VIEW
Pros: Faster queries, do not need to count every request.
Cons: Not real time statistics.
What do you think about it? Is there a better approach?
The aggregate queries you've shown can efficiently be satisfied if you have the correct compound index in your table. If you're not sure about compound indexes, you can read about them.
The index (created, people) on your reservations table is the right one for both those queries. Both can be satisfied by scanning just the relevant slice of that index, without touching the table rows. You'll find that they are fast enough that you don't need to bother with a secondary table for the foreseeable future in your system.
That's good, because secondary tables like you propose are a common source of confusion and errors.
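A quick sketch of that index and how the queries would use it (the index name is illustrative and the dates are placeholders):

ALTER TABLE reservations ADD INDEX idx_created_people (created, people);

-- Both aggregates can then be answered from the index alone:
SELECT COUNT(*)
FROM reservations
WHERE created BETWEEN '2018-04-01' AND '2018-04-30';

SELECT people, COUNT(*)
FROM reservations
WHERE created BETWEEN '2018-04-01' AND '2018-04-30'
GROUP BY people;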

Aggregate data tables

I am building a front-end to a largish db (10's of millions of rows). The data is water usage for loads of different companies and the table looks something like:
id | company_id | datetime            | reading | used | cost
==============================================================
 1 | 1          | 2012-01-01 00:00:00 |    5000 |    5 | 0.50
 2 | 1          | 2012-01-01 00:01:00 |    5015 |   15 | 1.50
....
On the frontend users can select how they want to view the data, e.g. 6-hourly increments, daily increments, monthly, etc. What would be the best way to do this quickly? Given how much the data changes and how few times any one set of data will be viewed, caching the query results in memcache or something similar is almost pointless, and there is no way to build the data beforehand as there are too many variables.
I figured using some kind of aggregate table would work, having tables such as readings, readings_6h, readings_1d with exactly the same structure, just already aggregated.
If this is a viable solution, what is the best way to keep the aggregate tables up to date and accurate? Apart from the data coming in from meters, the table is read-only; users never have to update or write to it.
A number of possible solutions include:
1) stick to doing queries with group / aggregate functions on the fly
2) doing a basic select and save
SELECT `company_id`, CONCAT_WS(' ', DATE(`datetime`), '23:59:59') AS datetime,
MAX(`reading`) AS reading, SUM(`used`) AS used, SUM(`cost`) AS cost
FROM `readings`
WHERE `datetime` > '$lastUpdateDateTime'
GROUP BY `company_id`, DATE(`datetime`)
3) doing an INSERT ... ON DUPLICATE KEY UPDATE (not sure how the aggregation would be done here, or how to make sure the data is accurate and not counted twice or missing rows):
INSERT INTO `readings_6h` ...
SELECT FROM `readings` ....
ON DUPLICATE KEY UPDATE .. calculate...
4) other ideas / recommendations?
I am currently doing option 2 which is taking around 15 minutes to aggregate +- 100k rows into +- 30k rows over 4 tables (_6h, _1d, _7d, _1m, _1y)
TL;DR What is the best way to view / store aggregate data for numerous reports that can't be cached effectively.
This functionality would be best served by a feature called materialized view, which MySQL unfortunately lacks. You could consider migrating to a different database system, such as PostgreSQL.
There are ways to emulate materialized views in MySQL using stored procedures, triggers, and events. You create a stored procedure that updates the aggregate data. If the aggregate data has to be updated on every insert you could define a trigger to call the procedure. If the data has to be updated every few hours you could define a MySQL scheduler event or a cron job to do it.
There is a combined approach, similar to your option 3, that does not depend on the dates of the input data; imagine what would happen if some new data arrives a moment too late and does not make it into the aggregation. (You might not have this problem, I don't know.) You could define a trigger that inserts new data into a "backlog," and have the procedure update the aggregate table from the backlog only.
All these methods are described in detail in this article: http://www.fromdual.com/mysql-materialized-views
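For what it's worth, here is a rough sketch of how option 3 (or the scheduled-event variant described above) might look for a daily roll-up, assuming a readings_1d table with a UNIQUE key on (company_id, day); all names are illustrative:

-- Re-aggregate yesterday's readings into the daily table; re-running it is
-- safe because the day's totals are recomputed from scratch each time.
INSERT INTO readings_1d (company_id, day, reading, used, cost)
SELECT company_id, DATE(`datetime`), MAX(reading), SUM(used), SUM(cost)
FROM readings
WHERE `datetime` >= CURDATE() - INTERVAL 1 DAY
  AND `datetime` <  CURDATE()
GROUP BY company_id, DATE(`datetime`)
ON DUPLICATE KEY UPDATE
  reading = VALUES(reading),
  used    = VALUES(used),
  cost    = VALUES(cost);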

Best practice question for MySQL: order by id or date?

This is kind of a noobish question, but it's one that I've never been given a straight answer on.
Suppose I have a DB table with the following fields and values:
+----+---------------------+---------+
| id | date_added          | balance |
+----+---------------------+---------+
|  1 | 2009-12-01 19:43:22 | 1237.50 |
|  2 | 2010-01-12 03:19:54 |  473.00 |
|  3 | 2010-01-12 03:19:54 | 2131.20 |
|  4 | 2010-01-20 11:27:31 | 3238.10 |
|  5 | 2010-01-25 22:52:07 |  569.40 |
+----+---------------------+---------+
This is for a very basic 'accounting' sub-system. I want to get the most recent balance. The id field is set to auto_increment. Typically, I would use:
SELECT balance FROM my_table ORDER BY date_added DESC LIMIT 1;
But I need to make absolutely sure that the value returned is the most recent... (see id# 2 & 3 above)
1) Would I be better off using:
SELECT balance FROM my_table ORDER BY id DESC LIMIT 1;
2) Or would this be a better solution?:
SELECT balance FROM my_table ORDER BY date_added,id DESC LIMIT 1;
AFAIK, auto_increment works pretty well, but is it reliable enough to sort something this crucial by? That's why I'm thinking sorting by both fields is a better idea, but I've seen some really quirky behavior in MySQL when I've done that in the past. Or if there's an even better solution, I'd appreciate your input.
Thanks in advance!
Brian
If there is a chance you'll get two added with the same date, you'll probably need:
SELECT balance FROM my_table ORDER BY date_added DESC,id DESC LIMIT 1;
(note the 'descending' clause on both fields).
However, you will need to take into account what you want to happen when someone adds an adjusting entry on the 2nd of February which is given the date of 31st January to ensure the month of January is complete. It will have an ID greater than those of entries made on the 1st of February.
Generally, accounting systems just work on the date. Perhaps if you could tell us why the order is important, we could make other suggestions.
In response to your comment:
I would love to hear any other ideas or advice you might have, even if they're off-topic since I have zero knowledge of accounting-type database models.
I would provide a few pieces of advice - this is all I could think of immediately, I usually spew forth much more "advice" with even less encouragement :-) The first two, more database-related than accounting-related, are:
First, do everything in third normal form and only revert if and when you have performance problems. This will save you a lot of angst with duplicate data which may get out of step. Even if you do revert, use triggers and other DBMS capabilities to ensure that data doesn't get out of step.
An example: if you want to speed up your searches on a last_name column, you can create an upper_last_name column (indexed), then use that to locate records matching your already upper-cased search term. This will almost always be faster than the per-row function upper(last_name). You can use an insert/update trigger to ensure upper_last_name is always set correctly, and this incurs the cost only when the name changes, not every time you search.
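In case it helps, here is a sketch of the same idea using a generated column (MySQL 5.7+) instead of hand-written triggers; the table, column, and index names are illustrative:

-- MySQL keeps the derived column in step automatically, so it cannot get
-- out of sync with last_name, and the index makes the lookup fast.
ALTER TABLE customers
  ADD COLUMN upper_last_name VARCHAR(100)
    GENERATED ALWAYS AS (UPPER(last_name)) STORED,
  ADD INDEX idx_upper_last_name (upper_last_name);

SELECT * FROM customers WHERE upper_last_name = UPPER('smith');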
Secondly, don't duplicate data even across tables (like your current schema) unless you can use those same trigger-type tricks to guarantee the data won't get out of step. What will your customer do when you send them an invoice where the final balance doesn't match the starting balance plus purchases? That's not going to make your company look very professional :-)
Thirdly (and this is more accounting-related), you generally don't need to worry about the number of transactions when calculating balances on the fly. That's because accounting systems usually have a roll-over function at year end which resets the opening balances.
So you're usually never having to process more than a year's worth of data at once which, unless you're the US government or Microsoft, is not that onerous.
Sorting by id may be faster, but sorting by the datetime is safer; use the latter, and if you run into performance issues, add an index.
Personally I'd never trust an autoincrement in that way. I'd sort by the date.
I'm pretty sure that the ID is guaranteed to be unique, but not necessarily sequential and increasing.