On the Efficiency of Data Infrastructure, Storage and Retrieval with SQL - mysql

I'm curious about which is the most efficient way to store and retrieve data in and from a database.
The table:
+----+--------+--------+   +-------+   +----------+
| id | height | weight | ← |  bmi  | ← | category |
+----+--------+--------+   +-------+   +----------+
|  1 |    184 |     64 |   | 18.90 |   |        2 |
|  2 |    147 |     80 |   | 37.02 |   |        4 |
|  … |      … |      … |   |     … |   |        … |
+----+--------+--------+   +-------+   +----------+
From a storage perspective
If we want to be more efficient in terms of storing the data, the columns bmi and category would be redundant, adding data we could otherwise have derived from the first two columns, height and weight.
From a retrieval perspective
Leaving out the category column, we could ask
SELECT *
FROM bmi_entry
WHERE bmi >= 18.50 AND bmi < 25.00
and leaving out the bmi column as well, that becomes
SELECT *
FROM bmi_entry
WHERE weight / ((height / 100) * (height / 100)) >= 18.50
AND weight / ((height / 100) * (height / 100)) < 25.00
However, the calculation could hypothetically take much longer than simply comparing a column to a value, in which case
SELECT *
FROM bmi_entry
WHERE category = 2
would be the far superior query in terms of retrieval time.
Best practice?
At first, I was about to go with method one, thinking: why store "useless" data and take up storage space? But then I thought about the implementation: having to recalculate those "redundant" fields for every single row, every time I want to sort and retrieve sets of BMI entries within specific ranges or categories, could dramatically slow down the time it takes to collect the data.
Ultimately:
Wouldn't the arithmetic functions of division and multiplication take more time and thus slow down the user experience?
Would there ever be a case in which you would prioritise storage space over retrieval time?
If the answer to (1.) is a simple "yup", you can comment that below. :-)
If you have a more in depth elaboration on either (1.) or (2.), however, feel free to post that or those as well, as I, and others, would be very interested in reading more!

Wouldn't the arithmetic functions of division and multiplication take more time and thus slow down the user experience?
You might have assumed "yup" would be the answer, but in fact the complexity of the arithmetic is not the issue. The issue is that you shouldn't need to evaluate the expression at all to check whether a row should be included in your query result.
When you search on an expression instead of an indexed column, MySQL is forced to visit every single row and evaluate the expression. This is a table scan. The cost of the query, even disregarding the possible slowness of the arithmetic, grows in linear proportion to the number of rows.
In terms of algorithmic complexity, we say this algorithm has "order N" cost. Even if it's actually "N times a fixed multiplier due to the cost of the arithmetic," it's still the N we're worried about, especially if N is ever-increasing.
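As a hedged illustration (table and column names taken from the question), EXPLAIN makes the table scan visible when you filter on the raw expression:

EXPLAIN SELECT *
FROM bmi_entry
WHERE weight / ((height / 100) * (height / 100)) < 25.00;
-- With no usable index, the plan typically reports type: ALL (a full
-- table scan) and an estimated row count close to the size of the table.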
You showed the example where you stored an extra column for the pre-calculated bmi or category, but that alone wouldn't avoid the table-scan. Searching for category=2 is still going to cause a table-scan unless category is an indexed column.
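Here is a minimal sketch of that combination, assuming MySQL 5.7+ (which added generated columns); a plain column maintained by application code would be indexed the same way:

ALTER TABLE bmi_entry
  ADD COLUMN bmi DECIMAL(5,2)
    AS (weight / ((height / 100) * (height / 100))) STORED,
  ADD INDEX (bmi);

-- The range condition can now use the index instead of scanning:
SELECT * FROM bmi_entry WHERE bmi >= 18.50 AND bmi < 25.00;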
Indexing a column is fine, but it's a little more tricky to index an expression. Recent versions of MySQL have given us that ability for most types of expressions, but if you're using an older version of MySQL you may be out of luck.
With MySQL 8.0, you can index the expression without having to store the calculated columns. The index is prepared based on the result of the expression. The index itself takes storage space, but so would an index on a stored column. Read more about this here: https://dev.mysql.com/doc/refman/8.0/en/create-index.html in the section on "Functional Key Parts".
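A minimal sketch, assuming MySQL 8.0.13+ (which introduced functional key parts); note the extra parentheses around the expression, which the syntax requires, and that the query must use the exact same expression for the optimizer to pick the index:

CREATE INDEX idx_bmi ON bmi_entry
  ((weight / ((height / 100) * (height / 100))));

SELECT *
FROM bmi_entry
WHERE weight / ((height / 100) * (height / 100)) >= 18.50
  AND weight / ((height / 100) * (height / 100)) < 25.00;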
Would there ever be a case in which you would prioritise storage space over retrieval time?
Sure. Suppose you have a very large amount of data, but you don't need to run queries especially frequently or quickly.
Example: I managed a database of bulk statistics that we added to throughout the month, but we only needed to query it about once at the end of the month to make a report. It didn't matter that this report took a couple of hours to prepare, because the managers who read the report would be viewing it in a document, not by running the query themselves. Meanwhile, the storage space for the indexes would have been too much for the server the data was on, so they were dropped.
Once a month I would kick off the task of running the query for the report, and then switch windows and go do some of my other work for a few hours. As long as I got the result by the time the people who needed to read it were expecting it (e.g. the next day) I didn't care how long it took to do the query.
Ultimately the best practice you're looking for varies, based on your needs and the resources you can utilize for the task.

There is no best practice. It depends on the considerations of what you are trying to do. Here are some considerations:
Consistency
Storing the values in separate columns means that they can get out of sync.
Using a computed column or view means that the values are always consistent.
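For instance, a view keeps the computed value consistent by construction (a sketch using the names from the question):

CREATE VIEW bmi_view AS
SELECT id, height, weight,
       weight / ((height / 100) * (height / 100)) AS bmi
FROM bmi_entry;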
Updatability (the inverse of consistency)
Storing the data in separate columns means that the values can be updated.
Storing the data as computed columns means that the values cannot be separately updated.
Read Performance
Storing the data in separate columns increases the size of the rows, which tends to increase the size of the table. This can decrease performance because more data must be read -- for any query on the table.
This is not an issue for computed columns, unless they are persisted in some way.
Indexing
Either method supports indexing.

Related

MYSQL performance for multiple row write vs a single row with more columns

Is it better for write performance to write a row with multiple columns?
id1 | field1Name | field1Value | field2Name | field2Value | field3Name | field3Value
or multiple rows with less columns
id1 | field1Name | field1Value
id1 | field2Name | field2Value
id1 | field3Name | field3Value
In terms of query requirements, we can achieve what we wanted with both structures. I am wondering how write performance would be impacted between these 2 approaches.
It depends in a complex way on the type of the columns and the binary size of their actual values. For example, if the column is of type TEXT and you write 50 KB of plain text, it will not fit into a single page and MySQL will have to use several data pages to fit the data. Obviously, more pages means more I/O.
On the other hand, if you write multiple rows (one value per row) you will have 2 headaches:
MySQL will have to update the primary index multiple times - again suboptimal I/O
If you want to SELECT several values - you will have to JOIN multiple rows which is very inconvenient (and you will soon realize it)
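As a hedged sketch of that second headache (hypothetical table and field names), here is one common way to fold the key/value rows back into a single result row, using conditional aggregation instead of repeated self-joins:

SELECT id,
       MAX(CASE WHEN fieldName = 'field1Name' THEN fieldValue END) AS field1,
       MAX(CASE WHEN fieldName = 'field2Name' THEN fieldValue END) AS field2,
       MAX(CASE WHEN fieldName = 'field3Name' THEN fieldValue END) AS field3
FROM entity_fields   -- hypothetical name for the multi-row table
WHERE id = 1
GROUP BY id;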
If you desperately need the best write performance, you should consider using a column-oriented database. Otherwise, stick with the multi-column option.

best database solution for multiple date time column indexing

I am creating a duty management system for workers, the database will contain the following columns
ID | STAFF_ID | DUTY_ID | DUTY_START | DUTY_END
This system needs to know how many staff members are working within a given time range.
I am currently using mysqli, which seems to be getting slow as the table data increases.
I am looking for a suitable solution which can handle 500,000 record inserts daily, plus searches on the DUTY_START and DUTY_END indexes.
Start-End ranges are nasty to optimize in large tables.
Perhaps the best solution for this situation is to do something really bad -- split off the date. (Yeah 2 wrongs may make a right this time.)
ID | STAFF_ID | DUTY_ID | date | START_time | END_time
Some issues:
If a shift spans midnight, make 2 rows.
Have an index on date, or at least starting with date. Then even though the query must scan all entries for the day in question, at least it is much less than scanning the entire table. (All other 'obvious solutions' get no better than scanning half the table.)
If you want to discuss and debate this further, please provide SELECT statement(s) based on either your 'schema' or mine. Don't worry about performance; we can discuss how to improve them.
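A hedged sketch of the split-date layout and the kind of per-day query it enables (column names adapted from the question; types are assumptions):

CREATE TABLE duty (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  staff_id INT UNSIGNED NOT NULL,
  duty_id INT UNSIGNED NOT NULL,
  duty_date DATE NOT NULL,
  start_time TIME NOT NULL,
  end_time TIME NOT NULL,
  INDEX (duty_date, start_time)
);

-- Staff on duty at 14:30 on a given day; the index confines the scan
-- to that day's rows instead of the whole table:
SELECT COUNT(DISTINCT staff_id)
FROM duty
WHERE duty_date = '2018-04-17'
  AND start_time <= '14:30:00'
  AND end_time > '14:30:00';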

Sorting query performance in PHP+MYSQL

I have a table with a huge number of records. When I query that table, especially using ORDER BY, it takes too much execution time.
How can I optimize this table for Sorting & Searching?
Here is an example scheme of my table (jobs):
+----+-----------------+---------------------+
| id | title           | created_at          |
+----+-----------------+---------------------+
|  1 | Web Developer   | 2018-04-12 10:38:00 |
|  2 | QA Engineer     | 2018-04-15 11:10:00 |
|  3 | Network Admin   | 2018-04-17 11:15:00 |
|  4 | System Analyst  | 2018-04-19 11:19:00 |
|  5 | UI/UX Developer | 2018-04-20 12:54:00 |
+----+-----------------+---------------------+
I have been searching for a while, and I learned that creating an INDEX can help improve performance. Can someone please elaborate on how the performance can be increased?
Add "explain" word before ur query, and check result
explain select ....
There u can see what u need to improve, then add index on ur search and/or sorting field and run explain query again
If you want to improve the performance of your query, one way is to paginate it. You can apply a LIMIT (as large as you want) and specify the page you want to display.
For example: SELECT * FROM your_table LIMIT 50 OFFSET 0.
I don't know if this will solve your problem, but you can try it ;)
Indexes are the database's way of creating lookup trees (B-Trees in most cases) to more efficiently sort, filter, and find rows.
Indexes are used to find rows with specific column values quickly.
Without an index, MySQL must begin with the first row and then read
through the entire table to find the relevant rows. The larger the
table, the more this costs. If the table has an index for the columns
in question, MySQL can quickly determine the position to seek to in
the middle of the data file without having to look at all the data.
This is much faster than reading every row sequentially.
https://dev.mysql.com/doc/refman/5.5/en/mysql-indexes.html
You can use EXPLAIN to help identify how the query is currently running, and identify areas of improvement. It's important to not over-index a table, for reasons probably beyond the scope of this question, so it'd be good to do some research on efficient uses of indexes.
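A hedged example of such a check (table name from the question):

EXPLAIN SELECT * FROM jobs ORDER BY created_at DESC LIMIT 20;
-- Inspect the "type", "key", and "Extra" columns of the output;
-- "Using filesort" under Extra means the sort could not use an index.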
ALTER TABLE jobs
ADD INDEX(created_at);
(Yes, there is a CREATE INDEX syntax that does the equivalent.)
Then, in the query, do
ORDER BY created_at DESC
However, with 15M rows, it may still take a long time. Will you be filtering (WHERE)? LIMITing?
If you really want to return to the user 15M rows -- well, that is a lot of network traffic, and that will take a long time.
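A hedged sketch combining the index with a filter and a limit (the date value is illustrative):

-- The index on created_at can serve both the WHERE range and the
-- ORDER BY (scanned backwards), so only the requested rows are read:
SELECT id, title, created_at
FROM jobs
WHERE created_at >= '2018-04-01'
ORDER BY created_at DESC
LIMIT 50;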
MySQL details
Regardless of the index declaration or version, the ASC/DESC in ORDER BY will be honored. However it may require a "filesort" instead of taking advantage of the ordering built into the index's BTree.
In some cases, the WHERE or GROUP BY is too messy for the Optimizer to make use of any index. But if it can, then...
(Before MySQL 8.0) While it is possible to declare an index DESC, the attribute is ignored. However, ORDER BY .. DESC is honored; it scans the data backwards. This also works for ORDER BY a DESC, b DESC, but not if you have a mixture of ASC and DESC in the ORDER BY.
MySQL 8.0 does create DESC indexes; that is, the BTree containing the index is stored 'backwards'.
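A minimal sketch, assuming MySQL 8.0 (the index name is illustrative):

-- MySQL 8.0+ honors the DESC attribute and stores the BTree 'backwards':
CREATE INDEX idx_created_desc ON jobs (created_at DESC);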

Speed up SQL SELECT in a table with just numbers

I am going to have data relating the pull force of a block magnet to its three dimensions in an Excel table of this form:
a/mm | b/mm | c/mm | force/N
-----+------+------+--------
   1 |    1 |    1 |    0.11
   1 |    1 |    2 |    0.19
   1 |    1 |    3 |    0.26
 ... |  ... |  ... |     ...
 100 |   80 |   59 |    7425
 100 |   80 |   60 |    7542
[diagram showing what a, b and c mean]
There is a row for each block magnet whose a, b and c in mm are whole numbers and the ranges are 1-100 for a, 1-80 for b and 1-60 for c. So in total there are 100*80*60=480,000 rows.
I want to make an online calculator where you enter a, b and c and it gives you the force. For this, I want to use a query something like this:
SELECT `force` FROM blocks WHERE a=$a AND b=$b AND c=$c LIMIT 1
I want to make this query as fast as possible. I would like to know what measures I can take to optimise this search. How should I arrange the data in the SQL table? Should I keep the structure of the table the same as in my Excel sheet? Should I keep the order of the rows as it is? What indexes should I use if any? Should I add a unique ID column to the table? I am open to any suggestions to speed this up.
Note that:
The data is already nicely sorted by a, b and c
The table already contains all the data and nothing else will be done to it except displaying it, so we don't have to worry about the speed of UPDATE queries
a and b are interchangeable, so I could delete all the rows where b>a
Increasing a, b or c will always result in a greater pull force
I want this calculator to be a part of a website. I use PHP and MySQL.
If possible, minimising the memory needed to store the table would also be desirable, speed is the priority though
Please don't suggest answers involving using a formula instead of my table of data. It is a requirement that the data are extracted from the database rather than calculated
Finally, can you estimate:
How long such SELECT a query would take with and without optimization?
How much memory would such a table require?
I would create your table using a, b, c as the primary key (since I assume for each triplet of a, b, c there will be no more than one record).
The time this SELECT will take depends on the RDBMS you use, but with the primary key it should be very quick. How many queries per minute do you expect at peak?
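A minimal sketch of that suggestion (the types are assumptions; note that FORCE is a reserved word in MySQL, so the column name needs backticks):

CREATE TABLE blocks (
  a SMALLINT UNSIGNED NOT NULL,
  b SMALLINT UNSIGNED NOT NULL,
  c SMALLINT UNSIGNED NOT NULL,
  `force` DECIMAL(8,2) NOT NULL,
  PRIMARY KEY (a, b, c)
);

-- A point lookup on the clustered primary key: a handful of page reads,
-- regardless of the 480,000 rows:
SELECT `force` FROM blocks WHERE a = 10 AND b = 5 AND c = 3;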
If you want to make the app as fast as possible, store the data in a file and load it into memory into the app or app server (your overall architecture is unclear). Whatever language you are using to develop the app probably supports a hash-table lookup data structure.
There are good reasons for storing data in a database: transactional integrity, security mechanisms, backup/restore functionality, replication, complex queries, and more. Your question doesn't actually suggest the need for any database functionality. You just want a lookup table for a fixed set of data.
If you really want to store the data in a database, then follow the above procedure. That is, load it into memory for users to query.
If you have some requirement to use a database (say, your data is changing), then follow my version of USeptim's advice: create a table with all four columns as the primary key (or alternatively use a secondary index on all four columns). The database will then do something similar to the first solution. The difference is that the database will (in general) use B-trees to search the data instead of hash functions.

Best practice question for MySQL: order by id or date?

This is kind of a noobish question, but it's one that I've never been given a straight answer on.
Suppose I have a DB table with the following fields and values:
+----+---------------------+---------+
| id | date_added          | balance |
+----+---------------------+---------+
|  1 | 2009-12-01 19:43:22 | 1237.50 |
|  2 | 2010-01-12 03:19:54 |  473.00 |
|  3 | 2010-01-12 03:19:54 | 2131.20 |
|  4 | 2010-01-20 11:27:31 | 3238.10 |
|  5 | 2010-01-25 22:52:07 |  569.40 |
+----+---------------------+---------+
This is for a very basic 'accounting' sub-system. I want to get the most recent balance. The id field is set to auto_increment. Typically, I would use:
SELECT balance FROM my_table ORDER BY date_added DESC LIMIT 1;
But I need to make absolutely sure that the value returned is the most recent... (see id# 2 & 3 above)
1) Would I be better off using:
SELECT balance FROM my_table ORDER BY id DESC LIMIT 1;
2) Or would this be a better solution?:
SELECT balance FROM my_table ORDER BY date_added,id DESC LIMIT 1;
AFAIK, auto_increment works pretty well, but is it reliable enough to sort something this crucial by? That's why I'm thinking sorting by both fields is a better idea, but I've seen some really quirky behavior in MySQL when I've done that in the past. Or if there's an even better solution, I'd appreciate your input.
Thanks in advance!
Brian
If there is a chance you'll get two added with the same date, you'll probably need:
SELECT balance FROM my_table ORDER BY date_added DESC,id DESC LIMIT 1;
(note the 'descending' clause on both fields).
However, you will need to take into account what you want to happen when someone adds an adjusting entry on the 2nd of February which is given the date 31st January to ensure the month of January is complete. It will have an ID greater than those of entries made on the 1st of February.
Generally, accounting systems just work on the date. Perhaps if you could tell us why the order is important, we could make other suggestions.
In response to your comment:
I would love to hear any other ideas or advice you might have, even if they're off-topic since I have zero knowledge of accounting-type database models.
I would provide a few pieces of advice - this is all I could think of immediately; I usually spew forth much more "advice" with even less encouragement :-) The first two, more database-related than accounting-related, are:
First, do everything in third normal form and only revert if and when you have performance problems. This will save you a lot of angst with duplicate data which may get out of step. Even if you do revert, use triggers and other DBMS capabilities to ensure that data doesn't get out of step.
For example, if you want to speed up your searches on a last_name column, you can create an upper_last_name column (indexed), then use that to locate records matching your already upper-cased search term. This will almost always be faster than the per-row function upper(last_name). You can use an insert/update trigger to ensure the upper_last_name is always set correctly, and this incurs the cost only when the name changes, not every time you search (see the sketch after this list).
Secondly, don't duplicate data even across tables (like your current schema) unless you can use those same trigger-type tricks to guarantee the data won't get out of step. What will your customer do when you send them an invoice where the final balance doesn't match the starting balance plus purchases? That's not going to make your company look very professional :-)
Thirdly (and this is more accounting-related), you generally don't need to worry about the number of transactions when calculating balances on the fly. That's because accounting systems usually have a roll-over function at year end which resets the opening balances.
So you're usually never having to process more than a year's worth of data at once which, unless you're the US government or Microsoft, is not that onerous.
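Here is a minimal sketch of that trigger trick (hypothetical table name; MySQL syntax):

CREATE TABLE customers (
  id INT AUTO_INCREMENT PRIMARY KEY,
  last_name VARCHAR(64) NOT NULL,
  upper_last_name VARCHAR(64) NOT NULL,
  INDEX (upper_last_name)
);

-- Keep the shadow column in step on every write:
CREATE TRIGGER customers_bi BEFORE INSERT ON customers
FOR EACH ROW SET NEW.upper_last_name = UPPER(NEW.last_name);

CREATE TRIGGER customers_bu BEFORE UPDATE ON customers
FOR EACH ROW SET NEW.upper_last_name = UPPER(NEW.last_name);

-- Search against the indexed, pre-uppercased column:
SELECT * FROM customers WHERE upper_last_name = UPPER('smith');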
Sorting by id may be faster, but sorting by datetime is safer; use the latter, and if you have performance issues, add an index.
Personally I'd never trust an autoincrement in that way. I'd sort by the date.
I'm pretty sure that the ID is guaranteed to be unique, but not necessarily sequential and increasing.