I am going to have data relating the pull force of a block magnet to its three dimensions in an Excel table in this form:
a/mm | b/mm | c/mm | force/N
---------------------------------
1 | 1 | 1 | 0.11
1 | 1 | 2 | 0.19
1 | 1 | 3 | 0.26
...
100 | 80 | 59 | 7425
100 | 80 | 60 | 7542
[Diagram showing what a, b and c mean]
There is a row for each block magnet whose a, b and c in mm are whole numbers and the ranges are 1-100 for a, 1-80 for b and 1-60 for c. So in total there are 100*80*60=480,000 rows.
I want to make an online calculator where you enter a, b and c and it gives you the force. For this, I want to use a query something like this:
SELECT `force` FROM blocks WHERE a=$a AND b=$b AND c=$c LIMIT 1
I want to make this query as fast as possible. I would like to know what measures I can take to optimise this search. How should I arrange the data in the SQL table? Should I keep the structure of the table the same as in my Excel sheet? Should I keep the order of the rows as it is? What indexes should I use if any? Should I add a unique ID column to the table? I am open to any suggestions to speed this up.
Note that:
The data is already nicely sorted by a, b and c
The table already contains all the data and nothing else will be done to it except displaying it, so we don't have to worry about the speed of UPDATE queries
a and b are interchangeable, so I could delete all the rows where b>a
Increasing a, b or c will always result in a greater pull force
I want this calculator to be a part of a website. I use PHP and MySQL.
If possible, minimising the memory needed to store the table would also be desirable, though speed is the priority
Please don't suggest answers involving using a formula instead of my table of data. It is a requirement that the data are extracted from the database rather than calculated
Finally, can you estimate:
How long would such a SELECT query take, with and without optimization?
How much memory would such a table require?
I would create your table using a, b, c as the primary key (since I assume that for each triplet of a, b, c there will be no more than one record).
How long this SELECT takes will depend on the RDBMS you use, but with the primary key it should be very quick. How many queries per minute do you expect at peak?
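A minimal sketch of such a table (assuming InnoDB; the TINYINT and DECIMAL types are choices based on the ranges stated in the question, and the force column is quoted because FORCE is a reserved word in MySQL):

-- a: 1-100, b: 1-80, c: 1-60 all fit in TINYINT UNSIGNED (0-255)
CREATE TABLE blocks (
    a       TINYINT UNSIGNED NOT NULL,
    b       TINYINT UNSIGNED NOT NULL,
    c       TINYINT UNSIGNED NOT NULL,
    `force` DECIMAL(7,2)     NOT NULL,
    PRIMARY KEY (a, b, c)
) ENGINE=InnoDB;

-- Point lookup: resolved by a single seek on the composite primary key
SELECT `force` FROM blocks WHERE a = 12 AND b = 7 AND c = 30;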
If you want to make the app as fast as possible, store the data in a file and load it into memory in the app or app server (your overall architecture is unclear). Whatever language you are using to develop the app probably supports a hash-table lookup data structure.
There are good reasons for storing data in a database: transactional integrity, security mechanisms, backup/restore functionality, replication, complex queries, and more. Your question doesn't actually suggest the need for any database functionality. You just want a lookup table for a fixed set of data.
If you really want to store the data in a database, then follow the above procedure. That is, load it into memory for users to query.
If you have some requirement to use a database (say, your data is changing), then follow my version of USeptim's advice: create a table with all four columns as primary keys (or alternatively use a secondary index on all four columns). The database will then do something similar to the first solution. The difference is the database will (in general) use b-trees to search the data instead of hash functions.
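If you would rather keep even the in-memory copy inside MySQL than in application code, one hedged option is a MEMORY-engine table with a HASH index, repopulated from the on-disk table after each server restart (blocks_mem is a hypothetical name; MEMORY tables are capped by max_heap_table_size, which may need raising):

-- In-memory copy of the on-disk table; MEMORY tables are emptied on
-- server restart, so this has to be repopulated at startup
CREATE TABLE blocks_mem (
    a       TINYINT UNSIGNED NOT NULL,
    b       TINYINT UNSIGNED NOT NULL,
    c       TINYINT UNSIGNED NOT NULL,
    `force` DECIMAL(7,2)     NOT NULL,
    PRIMARY KEY (a, b, c) USING HASH
) ENGINE=MEMORY;

INSERT INTO blocks_mem SELECT a, b, c, `force` FROM blocks;

-- Equality lookups on the full key are exactly what HASH indexes are good at
SELECT `force` FROM blocks_mem WHERE a = 12 AND b = 7 AND c = 30;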
Related
I'm curious about which is the most efficient way to store and retrieve data in and from a database.
The table:
+----+--------+--------+   +-------+   +----------+
| id | height | weight | ← |  bmi  | ← | category |
+----+--------+--------+   +-------+   +----------+
|  1 |    184 |     64 |   | 18.90 |   |        2 |
|  2 |    147 |     80 |   | 37.02 |   |        4 |
|  … |      … |      … | ← |     … | ← |        … |
+----+--------+--------+   +-------+   +----------+
From a storage perspective
If we want to be more efficient in terms of storing the data, the columns bmi and category would be redundant, since they add data we could have figured out from the two columns height and weight.
From a retrieval perspective
Leaving out the category column we could ask
SELECT *
FROM bmi_entry
WHERE bmi >= 18.50 AND bmi < 25.00
and leaving out the bmi column as well, that becomes
SELECT *
FROM bmi_entry
WHERE weight / ((height / 100) * (height / 100)) >= 18.50
AND weight / ((height / 100) * (height / 100)) < 25
However, the calculation could hypothetically take much longer than simply comparing a column to a value, in which case
SELECT *
FROM bmi_entry
WHERE category = 2
would be the far superior query in terms of retrieval time.
Best practice?
At first, I was about to go with method one, thinking: why store "useless" data and take up storage space? But then I thought about the implementation, and how potentially having to recalculate those "obsolete" fields for every single row, every time I want to sort and retrieve specific sets of BMI entries within specific ranges or categories, could dramatically slow down the time it takes to collect the data.
Ultimately:
Wouldn't the arithmetic functions of division and multiplication take more time and thus slow down the user experience?
Would there ever be a case in which you would prioritise storage space over retrieval time?
If the answer to (1.) is a simple "yup", you can comment that below. :-)
If you have a more in depth elaboration on either (1.) or (2.), however, feel free to post that or those as well, as I, and others, would be very interested in reading more!
Wouldn't the arithmetic functions of division and multiplication take more time and thus slow down the user experience?
You might have assumed "yup" would be the answer, but in fact the complexity of the arithmetic is not the issue. The issue is that you shouldn't need to evaluate the expression at all to decide whether a row belongs in your query result.
When you search on an expression instead of an indexed column, MySQL is forced to visit every single row and evaluate the expression. This is a table scan. The cost of the query, even disregarding the possible slowness of the arithmetic, grows in linear proportion to the number of rows.
In the analysis of algorithms, we say this is "order N" cost. Even if it's really "N times a fixed multiplier for the cost of the arithmetic," it's the N we're worried about, especially if N is ever-increasing.
You showed the example where you stored an extra column for the pre-calculated bmi or category, but that alone wouldn't avoid the table-scan. Searching for category=2 is still going to cause a table-scan unless category is an indexed column.
Indexing a column is fine, but it's a little more tricky to index an expression. Recent versions of MySQL have given us that ability for most types of expressions, but if you're using an older version of MySQL you may be out of luck.
With MySQL 8.0, you can index the expression without having to store a calculated column. The index is built from the result of the expression. The index itself takes storage space, but so would an index on a stored column. Read more about this here: https://dev.mysql.com/doc/refman/8.0/en/create-index.html in the section on "Functional Key Parts".
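As a sketch (assuming MySQL 8.0.13 or later, the bmi_entry table from the question, and heights stored in centimetres), a functional index on the BMI expression could look like this; the WHERE clause has to repeat the indexed expression exactly for the index to be considered:

-- Functional index on the BMI expression (note the extra parentheses
-- MySQL requires around a functional key part)
CREATE INDEX idx_bmi ON bmi_entry
    ((weight / ((height / 100) * (height / 100))));

-- Can use idx_bmi because the WHERE expression matches the indexed one
SELECT *
FROM bmi_entry
WHERE weight / ((height / 100) * (height / 100)) >= 18.50
  AND weight / ((height / 100) * (height / 100)) < 25.00;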
Would there ever be a case in which you would prioritise storage space over retrieval time?
Sure. Suppose you have a very large amount of data, but you don't need to run queries especially frequently or quickly.
Example: I managed a database of bulk statistics that we added to throughout the month, but we only needed to query it about once at the end of the month to make a report. It didn't matter that this report took a couple of hours to prepare, because the managers who read the report would be viewing it in a document, not by running the query themselves. Meanwhile, the storage space for the indexes would have been too much for the server the data was on, so they were dropped.
Once a month I would kick off the task of running the query for the report, and then switch windows and go do some of my other work for a few hours. As long as I got the result by the time the people who needed to read it were expecting it (e.g. the next day) I didn't care how long it took to do the query.
Ultimately the best practice you're looking for varies, based on your needs and the resources you can utilize for the task.
There is no single best practice; it depends on what you are trying to do. Here are some considerations:
Consistency
Storing the values in separate columns means that they can get out of sync.
Using a computed column or view means that the values are always consistent.
Updatability (the inverse of consistency)
Storing the data in separate columns means that the values can be updated.
Storing the data as computed columns means that the values cannot be separately updated.
Read Performance
Storing the data in separate columns increases the size of the rows, which tends to increase the size of the table. This can decrease performance because more data must be read -- for any query on the table.
This is not an issue for computed columns, unless they are persisted in some way.
Indexing
Either method supports indexing.
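For example, a generated column (MySQL 5.7+) gives you the consistency of a computed value together with indexability; this sketch assumes bmi is not already present as a plain stored column in bmi_entry:

-- bmi as a virtual generated column: always consistent with height and
-- weight, never stored in the row itself, but still indexable
ALTER TABLE bmi_entry
    ADD COLUMN bmi DECIMAL(5,2)
        AS (weight / ((height / 100) * (height / 100))) VIRTUAL;

ALTER TABLE bmi_entry ADD INDEX idx_bmi (bmi);

SELECT * FROM bmi_entry WHERE bmi >= 18.50 AND bmi < 25.00;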
Is it better for write performance to write a row with multiple columns?
id1 | field1Name | field1Value | field2Name | field2Value | field3Name | field3Value
or multiple rows with fewer columns
id1 | field1Name | field1Value
id1 | field2Name | field2Value
id1 | field3Name | field3Value
In terms of query requirements, we can achieve what we want with both structures. I am wondering how write performance would differ between these two approaches.
It depends in a complex way on the type of the columns and the binary size of their actual values. For example, if the column is of type TEXT and you write 50 KB of plain text, it will not fit into a single page and MySQL will have to use several data pages to store it. Obviously, more pages mean more I/O.
On the other hand, if you write multiple rows (one value per row) you will have 2 headaches:
MySQL will have to update the primary index multiple times - again suboptimal I/O
If you want to SELECT several values - you will have to JOIN multiple rows which is very inconvenient (and you will soon realize it)
If you desperately need the best write performance, you should consider using a column-oriented database. Otherwise, stick with the multi-column option.
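To make the comparison concrete, the two layouts might look like this (table and column names and types are illustrative, not taken from your schema):

-- Layout 1: one wide row per entity
CREATE TABLE entity_wide (
    id           INT PRIMARY KEY,
    field1_name  VARCHAR(50),
    field1_value VARCHAR(255),
    field2_name  VARCHAR(50),
    field2_value VARCHAR(255),
    field3_name  VARCHAR(50),
    field3_value VARCHAR(255)
);

-- Layout 2: one narrow row per field; the composite primary key is
-- updated once per inserted row, i.e. three times per entity here
CREATE TABLE entity_kv (
    id          INT          NOT NULL,
    field_name  VARCHAR(50)  NOT NULL,
    field_value VARCHAR(255) NOT NULL,
    PRIMARY KEY (id, field_name)
);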
We have a table for which we have to present many many counts for different combinations of fields.
This takes quite a while to do on the fly and doesn't provide historical data, so I'm thinking about the best way to store those counts in another table, with a timestamp, so we can query them quickly and get historical trends.
For each count we need 4 pieces of information to identify it, and there are about 1000 different metrics we would like to store.
I'm considering three different strategies, each storing a count and a timestamp but varying in how the count is identified for retrieval:
1) One table with 4 fields to identify the count; the 4 fields wouldn't be normalized, as they contain data from different external tables.
2) One table with a single "tag" field, which contains the 4 pieces of information as a tag. These tags could be enriched and kept in another table, perhaps with a field for each tag part, linking them to the external tables.
3) Different tables for the different groups of counts, to be able to normalize on one or more fields, but this would need anywhere from 6 to tens of tables.
I'm going with the first one, not normalized at all, but I'm wondering if anyone has a better or simpler way to store all these counts.
Sample of a value:
status,installed,all,virtual,1234,01/05/2015
The first field, status, can have up to 10 values.
The second field, installed, can have up to 10 values per value of the first field.
The third field, all, can have up to 10 different values, which are the same for all categories.
The fourth field, virtual, can have up to 30 values, which are also the same for all previous categories.
The last two fields are a number and a timestamp.
Thanks,
Isaac
When you have a lot of metrics and you don't need to use them for intra-metric calculations, you can go for the first solution.
I would probably build a table like this:
Status_id | Installed_id | All_id | Virtual_id | Date | Value
Or, if the combination of the first four columns has a proper name, I would probably create two tables (I think this is what you refer to as your second solution):
Metric Table
Status_id | Installed_id | All_id | Virtual_id | Metric_id | Metric_Name
Values Table
Metric_id | Date | Value
This is good if you have names for your metrics, or other details which you would otherwise need to duplicate for each combination with the first approach.
In both cases it will be a bit complicated to do intra-row operations across different metrics; for this reason this approach is suggested only for high-level KPIs.
Finally, because all possible combinations of the last two fields are always present in your table, you could convert them to columns:
Status_id | Installed_id | Date | All1_Virtual1 | All1_Virtual2 | ... | All10_Virtual30
With 10 values for All and 30 for Virtual you will have 300 columns; not very easy to handle, but worth having if you need to do something like:
(All1_Virtual2 - All5_Virtual23) * All6_Virtual12
But in this case I would prefer (if possible) to do the calculation in advance to reduce the number of columns.
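As a sketch of the two-table variant (column types and names are assumptions, not taken from your schema):

-- Metric definitions: one row per distinct combination of the four tags
CREATE TABLE metric (
    metric_id    INT AUTO_INCREMENT PRIMARY KEY,
    status_id    TINYINT NOT NULL,
    installed_id TINYINT NOT NULL,
    all_id       TINYINT NOT NULL,
    virtual_id   TINYINT NOT NULL,
    metric_name  VARCHAR(100),
    UNIQUE KEY uq_metric_combo (status_id, installed_id, all_id, virtual_id)
);

-- One row per metric per timestamp
CREATE TABLE metric_value (
    metric_id INT      NOT NULL,
    ts        DATETIME NOT NULL,
    value     BIGINT   NOT NULL,
    PRIMARY KEY (metric_id, ts),
    FOREIGN KEY (metric_id) REFERENCES metric (metric_id)
);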
'customer_data' table:
id - int auto increment
user_id - int
json - TEXT field containing json object
tags - varchar 200
* id + user_id are set as index.
Each customer (user_id) may have multiple lines.
"json" is text because it may be very large with many keys or or not so big with few keys containing short values.
I usually look up the json by user_id.
Problem: with over 100,000 rows, queries take forever to complete. I understand that TEXT fields are very wasteful and MySQL does not index them well.
Fix 1:
Convert the "json" field to multiple columns in the same table where some columns may be blank.
Fix 2:
Create another table with user_id | key | value, but then I may end up doing huge joins; won't that be much slower? Also, the key is a string, but the value may be an int or text of various lengths. How do I reconcile that?
I know this is a pretty common use case; what are the "industry standards" for it?
UPDATE
So I guess Fix 2 is the best option. How would I query this table to get a one-row result, efficiently?
id | key | value
-------------------
1 | key_1 | A
2 | key_1 | D
1 | key_2 | B
1 | key_3 | C
2 | key_3 | E
result:
id | key_1 | key_2 | key_3
---------------------------
1 | A | B | C
2 | D | | E
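One way to produce that pivoted result from the Fix 2 table is conditional aggregation; this is only a sketch (customer_kv is a hypothetical name for the key/value table, and the list of keys has to be known when writing the query):

-- One result row per id; each MAX(CASE ...) picks out the value stored
-- under one particular key, or NULL if that key is absent for the id
SELECT id,
       MAX(CASE WHEN `key` = 'key_1' THEN `value` END) AS key_1,
       MAX(CASE WHEN `key` = 'key_2' THEN `value` END) AS key_2,
       MAX(CASE WHEN `key` = 'key_3' THEN `value` END) AS key_3
FROM customer_kv
GROUP BY id;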
This answer is a bit outside the box defined in your question, but I'd suggest:
Fix 3: Use MongoDB instead of MySQL.
This is not to criticize MySQL at all -- MySQL is a great structured relational database implementation. However, you don't seem interested in using either the structured aspects or the relational aspects (either because of the specific use case and requirements or because of your own programming preferences, I'm not sure which). Using MySQL because relational architecture suits your use case (if it does) would make sense; using relational architecture as a workaround to make MySQL efficient for your use case (as seems to be the path you're considering) seems unwise.
MongoDB is another great database implementation, which is less structured and not relational, and is designed for exactly the sort of use case you describe: flexibly storing big blobs of json data with various identifiers, and storing/retrieving them efficiently, without having to worry about structural consistency between different records. JSON is Mongo's native document representation.
I am building a front-end to a largish db (tens of millions of rows). The data is water usage for loads of different companies and the table looks something like:
id | company_id | datetime | reading | used | cost
=============================================================
1 | 1 | 2012-01-01 00:00:00 | 5000 | 5 | 0.50
2 | 1 | 2012-01-01 00:01:00 | 5015 | 15 | 1.50
....
On the frontend users can select how they want to view the data, e.g. 6-hourly increments, daily increments, monthly, etc. What would be the best way to do this quickly? Given how much the data changes and how rarely any one set of data will be viewed, caching the query results in memcache or something similar is almost pointless, and there is no way to build the data beforehand as there are too many variables.
I figured using some kind of aggregate table would work, having tables such as readings, readings_6h, readings_1d with exactly the same structure, just already aggregated.
If this is a viable solution, what is the best way to keep the aggregate tables up to date and accurate? Besides the data coming in from meters, the table is read-only; users never update or write to it.
A number of possible solutions include:
1) stick to doing queries with group / aggregate functions on the fly
2) doing a basic select and save
SELECT `company_id`, CONCAT_WS(' ', DATE(`datetime`), '23:59:59') AS datetime,
       MAX(`reading`) AS reading, SUM(`used`) AS used, SUM(`cost`) AS cost
FROM `readings`
WHERE `datetime` > '$lastUpdateDateTime'
GROUP BY `company_id`, DATE(`datetime`)
3) duplicate key update (not sure how the aggregation would be done here, or how to make sure the data is accurate, not counted twice or missing rows)
INSERT INTO `readings_6h` ...
SELECT FROM `readings` ....
ON DUPLICATE KEY UPDATE .. calculate...
4) other ideas / recommendations?
I am currently doing option 2, which takes around 15 minutes to aggregate roughly 100k rows into roughly 30k rows across the aggregate tables (_6h, _1d, _7d, _1m, _1y)
TL;DR: What is the best way to view / store aggregate data for numerous reports that can't be cached effectively?
This functionality would be best served by a feature called materialized view, which MySQL unfortunately lacks. You could consider migrating to a different database system, such as PostgreSQL.
There are ways to emulate materialized views in MySQL using stored procedures, triggers, and events. You create a stored procedure that updates the aggregate data. If the aggregate data has to be updated on every insert you could define a trigger to call the procedure. If the data has to be updated every few hours you could define a MySQL scheduler event or a cron job to do it.
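For instance, a rough sketch of the scheduled-refresh variant (readings_1d with columns company_id, `day`, reading, used, cost is an assumed layout, and the event scheduler must be enabled with event_scheduler=ON):

-- Refresh procedure: rebuilds the daily aggregate in one go
-- (fine for moderate sizes; the incremental approach below scales better)
DELIMITER //
CREATE PROCEDURE refresh_readings_1d()
BEGIN
    TRUNCATE TABLE readings_1d;
    INSERT INTO readings_1d (company_id, `day`, reading, used, cost)
    SELECT company_id,
           DATE(`datetime`) AS `day`,
           MAX(reading)     AS reading,
           SUM(used)        AS used,
           SUM(cost)        AS cost
    FROM readings
    GROUP BY company_id, DATE(`datetime`);
END //
DELIMITER ;

-- Run it every hour via the event scheduler
CREATE EVENT ev_refresh_readings_1d
    ON SCHEDULE EVERY 1 HOUR
    DO CALL refresh_readings_1d();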
There is a combined approach, similar to your option 3, that does not depend on the dates of the input data; imagine what would happen if some new data arrives a moment too late and does not make it into the aggregation. (You might not have this problem, I don't know.) You could define a trigger that inserts new data into a "backlog," and have the procedure update the aggregate table from the backlog only.
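A sketch of that backlog variant, under the same assumptions about readings_1d; a production version would delete only the rows it has just aggregated (or otherwise guard against rows arriving between the aggregation and the cleanup):

-- Backlog table with the same structure as readings
CREATE TABLE readings_backlog LIKE readings;

-- Copy every new raw row into the backlog as it arrives
CREATE TRIGGER trg_readings_backlog
AFTER INSERT ON readings
FOR EACH ROW
INSERT INTO readings_backlog (id, company_id, `datetime`, reading, used, cost)
VALUES (NEW.id, NEW.company_id, NEW.`datetime`, NEW.reading, NEW.used, NEW.cost);

-- Periodically (event, cron, or stored procedure): fold the backlog into
-- the daily aggregate, then clear it; assumes readings_1d has a unique key
-- on (company_id, `day`)
INSERT INTO readings_1d (company_id, `day`, reading, used, cost)
SELECT company_id, DATE(`datetime`), MAX(reading), SUM(used), SUM(cost)
FROM readings_backlog
GROUP BY company_id, DATE(`datetime`)
ON DUPLICATE KEY UPDATE
    reading = GREATEST(readings_1d.reading, VALUES(reading)),
    used    = readings_1d.used + VALUES(used),
    cost    = readings_1d.cost + VALUES(cost);

TRUNCATE TABLE readings_backlog;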
All these methods are described in detail in this article: http://www.fromdual.com/mysql-materialized-views