Relational database (RDBMS) denormalized data - mysql

I believe this question is not specific to MySQL (which is the database I'm using); it's really about best practices.
Up until now, my problems could be solved by creating tables and querying them (sometimes JOINing here and there). But there is something I keep doing that doesn't feel right, and it nags at me whenever I need denormalized data alongside my "common" queries.
Example Use-case
So that I can express myself better, let's create a superficial scenario where:
a user can buy a product, generating a purchase (let's ignore the fact that the purchase can only have a single product);
and we need to query the products along with the total number of times each has been purchased;
To solve our use-case, we could define a simple structure made by:
product table:
product_id [INT PK]
user table:
user_id [INT PK]
purchase table:
purchase_id [INT PK]
product_id [INT FK NOT NULL]
user_id [INT FK NOT NULL]
Here is where it doesn't feel right: when we need to retrieve a list of products with the total number of times each has been purchased, I would write the query:
# There are probably faster queries than this to reach the same output
SELECT
    product.product_id,
    (SELECT COUNT(*) FROM purchase
     WHERE purchase.product_id = product.product_id) AS purchase_count
FROM
    product
The origin of my concern is that I've read that COUNT does a full table scan, and it scares me to run the query above once it scales to thousands of products being purchased - even though I've created an INDEX on the product_id FK in purchase (MySQL does this by default for foreign keys).
Possible solutions
My knowledge of relational databases is pretty shallow, so I'm somewhat lost when comparing the plausible alternatives for these kinds of problems. To show that I've done my homework (searching before asking), the options I've found plausible are:
Create Transactions:
When INSERTing a new purchase, it must always happen inside a transaction that also updates a counter on the product row matching purchase.product_id (a sketch follows below).
Possible Problems: human error. Someone might manually insert a purchase without doing the transaction and BAM - we have an inconsistency.
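A minimal sketch of that transaction, assuming purchase.purchase_id is AUTO_INCREMENT and product carries the bought_amount counter column discussed below (both are assumptions, not part of the original schema):

START TRANSACTION;

-- Record the purchase itself.
INSERT INTO purchase (product_id, user_id)
VALUES (42, 7);

-- Keep the denormalized counter in sync within the same transaction.
UPDATE product
SET bought_amount = bought_amount + 1
WHERE product_id = 42;

COMMIT;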
Create Triggers:
Whenever I insert, delete or update some row in some specific table, I would update my products table with a new value (bought_amount). So the table would become:
product table:
product_id [INT PK]
bought_amount [INT NOT NULL];
Possible problems: Are triggers expensive? Is there a way that the insertion succeeds but the trigger doesn't fire - thus leaving me with an inconsistency?
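A rough sketch of such a trigger, assuming the bought_amount column above (the trigger name is illustrative):

DELIMITER //

CREATE TRIGGER purchase_after_insert
AFTER INSERT ON purchase
FOR EACH ROW
BEGIN
    -- Increment the counter on the product that was just purchased.
    UPDATE product
    SET bought_amount = bought_amount + 1
    WHERE product_id = NEW.product_id;
END//

DELIMITER ;

A matching AFTER DELETE trigger would be needed if purchases can be removed.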
Question
Is updating certain tables to store data that constantly changes a plausible approach with RDBMSs? Or is it safer and - in the long run - more beneficial to just keep joining and counting/summing the other occurrences?
I've found a couple of useful questions/answers regarding this matter, but none of them addressed this subject in a wide perspective.
Please take into consideration my ignorance about RDBMSs, as I may be suggesting nonsense in the possible solutions above.

The usual way to get a count per key is
SELECT product_id, COUNT(*)
FROM purchase
GROUP BY product_id
You don't need to mention the product table, because all it contains is the key column. Now although that uses COUNT(*), it doesn't need a full table scan for every product_id because the SQL engine is smart enough to see the GROUP BY.
But this produces a different result from your query: products that have never been purchased simply won't show up in my query, whereas your query will show the product_id with a count of zero.
So before you start worrying about implementation and efficiency, ask what question(s) you are trying to answer. If you want to see all products, whether purchased or not, then you must scan the whole product table and look up from it into purchase. I would go:
SELECT product.product_id, COALESCE(purch.count, 0) AS count
FROM product
LEFT OUTER JOIN (SELECT product_id, COUNT(*) AS count
                 FROM purchase
                 GROUP BY product_id) AS purch
    ON product.product_id = purch.product_id
As regards your wider questions (I'm not sure I fully understand them): in the early days SQL was quite inefficient at this sort of joining and aggregating, and schemas were often denormalised, with columns repeated across multiple tables. SQL engines are now much smarter, so that's no longer necessary. You might still see that old-fashioned practice in older textbooks. I would ignore it and design your schema as normalised as possible.

This query:
SELECT p.product_id,
(SELECT COUNT(*)
FROM purchase pu
WHERE pu.product_id = p.product_id
)
FROM product p;
has to scan both product and purchase. I'm not sure why you are emotional about one table scan but not the other.
As for performance, this can take advantage of an index on purchase(product_id). In MySQL, this will probably be faster than the equivalent (left) join version.
You should not worry about performance of such queries until that becomes an issue. If you need to increase performance of such a query, first I would ask: Why? That is a lot of information being returned -- about all products over all time. More typically, I would expect someone to care about one product or a period of time or both. And, those concerns would suggest the development of a datamart.
If performance is an issue, you have many alternatives, such as:
Defining a data mart to periodically summarize the data into more efficient structures for such queries (sketched after this list).
Adding triggers to the database to summarize the data, if the results are needed in real-time.
Developing a methodology for maintaining the data that also maintains the summaries, either at the application-level or using stored procedures.
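As an illustration of the first alternative, a minimal sketch of a summary table that a scheduled job (cron or a MySQL EVENT) could rebuild periodically; the table name is illustrative:

CREATE TABLE product_purchase_summary (
    product_id     INT NOT NULL PRIMARY KEY,
    purchase_count INT NOT NULL
);

-- Periodic refresh: recompute the counts and upsert them into the summary.
INSERT INTO product_purchase_summary (product_id, purchase_count)
SELECT product_id, COUNT(*)
FROM purchase
GROUP BY product_id
ON DUPLICATE KEY UPDATE purchase_count = VALUES(purchase_count);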
What doesn't "feel right" to you is actually the tremendous strength of a relational database (with a reasonable data model). You can keep it up-to-date. And you can query it using a pretty concise language that meets business needs.

Possible Problems: human error. Someone might manually insert a purchase without doing the transaction and BAM - we have an inconsistency.
--> Build a Stored Procedure that does both steps in a transaction, then force users to go through that.
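A sketch of such a procedure, assuming purchase_id is AUTO_INCREMENT and product has a bought_amount counter column (the procedure name is illustrative):

DELIMITER //

CREATE PROCEDURE record_purchase(IN p_product_id INT, IN p_user_id INT)
BEGIN
    -- Both statements succeed or fail together.
    START TRANSACTION;

    INSERT INTO purchase (product_id, user_id)
    VALUES (p_product_id, p_user_id);

    UPDATE product
    SET bought_amount = bought_amount + 1
    WHERE product_id = p_product_id;

    COMMIT;
END//

DELIMITER ;

-- Usage: CALL record_purchase(42, 7);

You would then grant users EXECUTE on the procedure and withhold direct INSERT privileges on purchase.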
Possible problems: Are triggers expensive? Is there a way that the insertion succeeds but the trigger doesn't fire - thus leaving me with an inconsistency?
Triggers are not too bad. But, again, I would recommend forcing users through a Stored Procedure that does all the desired steps.
Note: Instead of Stored Procedures, you could have an application that does the necessary steps; then force users to go through the app and give them no direct access to the database.
A database is the "source of truth" on the data. It is the "persistent" repository for such. It should not be considered the entire engine for building an application.
As for performance:
Summing over a million rows may take a noticeable amount of time.
You can easily do a hundred single-row queries (select/insert/update) per second.
Please think through numbers like that.

Related

Adding extra fields to prevent needing joins

In consideration of schema design, is it appropriate to add extra table fields I wouldn't otherwise need in order to prevent having to do a join? Example:
products_table
| id | name | seller_id |
users_table
| id | username |
reviews_table
| id | product_id | seller_id |
For the reviews table, I could use a join on the products table to get the user id of the seller. If I leave it out of the reviews table, I have to use a join to get it. There are often tables where several joins are needed to get at some information where I could just have my app add redundant data to the table instead. Which is more correct in terms of schema design?
You seem overly concerned about the performance of JOINs. With proper indexing, performance is not usually an issue. In fact, there are situations where JOINs are faster -- because the data is more compact in two tables than storing the fields over and over and over again (this applies more to strings than to integers, though).
If you are going to have multiple tables, then use JOINs to access the "lookup" information. There may be some situations where you want to denormalize the information. But in general, you don't. And premature optimization is the root of a lot of bad design.
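For instance, a sketch of the lookup JOIN being recommended, using the tables from the question (products_table.id is the PK, so the lookup is already indexed):

-- Fetch each review together with the seller looked up from the product.
SELECT r.id AS review_id, r.product_id, p.seller_id
FROM reviews_table AS r
JOIN products_table AS p ON p.id = r.product_id;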
Suppose you add a column reviews.seller_id and you populate it with values, and then some weeks later you find that the values aren't always the same as the seller in the products_table.
In other words, the following query should always return a count of 0, but what if one day it returns a count of 6?
SELECT COUNT(*)
FROM products_table AS p
JOIN reviews_table AS r ON r.product_id = p.id
WHERE p.seller_id <> r.seller_id
Meaning there was some update of one table, but not the other. They weren't both updated to keep the seller_id in sync.
How did this happen? Which table was updated, and which one still has the original seller_id? Which one is correct? Was the update intentional?
You start researching each of the 6 cases, verify who is the correct seller, and update the data to make them match.
Then the next week, the count of mismatched sellers is 1477. You must have a bug in your code somewhere that allows an update to one table without updating the other to match. Now you have a much larger data cleanup project, and a bug-hunt to go find out how this could happen.
And how many other times have you done the same thing for other columns -- copied them into a related table to avoid a join? Are those creating mismatched data too? How would you check them all? Do you need to check them every night? Can they be corrected?
This is the kind of trouble you get into when you use denormalization, in other words storing columns redundantly to avoid joins, avoid aggregations, or avoid expensive calculations, to speed up certain queries.
In fact, you don't avoid those operations, you just move the work of those operations to an earlier time.
It's possible to make it all work seamlessly, but it's a lot more work for the coder to develop and test the perfect code, and fix the subsequent code bugs and inevitable data cleanup chores.
This depends on the specific case. Purely in terms of schema design, you should not have any redundant columns (see database normalization). However, in a real-world scenario it sometimes makes sense to have redundant data; for example, when you have performance issues, you can sacrifice some storage in order to make SELECT queries faster.
Adding a redundant column today will make you curse tomorrow. If you handle keys in the database properly, performance will not penalize you.

How to optimize MySQL queries with many combinations of where conditions?

I have a MySQL table like this, and I want to create indexes that make all queries to the table run fast. The difficult thing is that there are many possible combinations of WHERE conditions, and that the table is large (about 6M rows).
Table name: items
id: PKEY
item_id: int (the id of items)
category_1: int
category_2: int
.
.
.
category_10: int
release_date: date
sort_score: decimal
item_id is not unique because an item can have several category_x values.
An example of queries to this table is:
SELECT DISTINCT(item_id) FROM items WHERE category_1 IN (1, 2) AND category_5 IN (3, 4) AND release_date > '2019-01-01' ORDER BY sort_score
And another query maybe:
SELECT DISTINCT(item_id) FROM items WHERE category_3 IN (1, 2) AND category_4 IN (3, 4) AND category_8 IN (5) ORDER BY sort_score
If I want to optimize all the combinations of WHERE conditions, do I have to create a huge number of composite indexes covering the column combinations (like ADD INDEX idx1_3_5 (category_1, category_3, category_5))?
Or is it good to create 10 tables which have data of category_1~10, and execute many INNER JOIN in the queries?
Or is it difficult to optimize this kind of query in MySQL, and should I use other middleware, such as Elasticsearch?
Well, the file (it is not a table) is not at all Normalised. Therefore no amount of indices on combinations of fields will help the queries.
Second, MySQL is (a) not compliant with the SQL requirement, and (b) it does not have a Server Architecture or the features of one.
Such as Statistics, which is used by a genuine Query Optimiser, and which the commercial SQL platforms have. The "single index" issue you raise in the comments does not apply.
Therefore, while we can fix up the table, etc, you may never obtain the performance that you seek from the freeware.
Eg. in the commercial world, 6M rows is nothing, we worry when we get to a billion rows.
Eg. Statistics is automatic, we have to tweak it only when necessary: an un-normalised table or billions of rows.
Or ... should I use other middleware, such as Elasticsearch?
It depends on the use of genuine SQL vs MySQL, and the middleware.
If you fix up the file and make a set of Relational tables, the queries are then quite simple, and fast. It does not justify a middleware search engine (that builds a data cube on the client system).
If they are not fast on MySQL, then the first recommendation would be to get a commercial SQL platform instead of the freeware.
The last option, the very last, is to stick to the freeware and add a big fat middleware search engine to compensate.
Or is it good to create 10 tables which have data of category_1~10, and execute many INNER JOIN in the queries?
Yes. JOINs are quite ordinary in SQL. Contrary to popular mythology, a normalised database, which means many more tables than an un-normalised one, causes fewer JOINs, not more JOINs.
So, yes, Normalise that beast. Ten tables is the starting perception, still not at all Normalised. One table for each of the following would be a step in the direction of Normalised:
Item
Item_id will be unique.
Category
This is not category-1, etc, but each of the values that are in category_1, etc. You must not have multiple values in a single column, it breaks 1NF. Such values will be (a) Atomic, and (b) unique. The Relational Model demands that the rows are unique.
The meaning of category_1, etc in Item is not given. (If you provide some example data, I can improve the accuracy of the data model.) Obviously it is not [2].
If it is a Priority (1..10), or something similar, that the users have chosen or voted on, this table will be a table that supplies the many-to-many relationship between Item and Category, with a Priority for each row.
Let's call it Poll. The relevant Predicates would be something like:
Each Poll is 1 Item
Each Poll is 1 Priority
Each Poll is 1 Category
Likewise, sort_score is not explained. If it is even remotely what it appears to be, you will not need it, because it is a Derived Value that you should compute on the fly: once the tables are Normalised, the SQL required to compute it is straight-forward. It is not something that you compute-and-store every 5 minutes or every 10 seconds.
The Relational Model
The above maintains the scope of just answering your question, without pointing out the difficulties in your file. Noting the Relational Database tag, this section deals with the Relational errors.
The Record ID field (item_id or category_id in yours) is prohibited in the Relational Model. It is a physical pointer to a record, which is explicitly the very thing that the RM overcomes, and that is required to be overcome if one wishes to obtain the benefits of the RM, such as ease of queries, and simple, straight-forward SQL code.
Conversely, the Record ID is always one additional column and one additional index, and the SQL code required for navigation becomes complex (and buggy) very quickly. You will have enough difficulty with the code as it is, I doubt you would want the added complexity.
Therefore, get rid of the Record ID fields.
The Relational Model requires that the Keys are "made up from the data". That means something from the logical row, that the users use. Usually they know precisely what identifies their data, such as a short name.
It is not manufactured by the system, such as a RecordID field which is a GUID or AUTOINCREMENT, which the user does not see. Such fields are physical pointers to records, not Keys to logical rows. Such fields are pre-Relational, pre-DBMS, 1960s Record Filing Systems, the very thing that the RM superseded. But they are heavily promoted and marketed as "relational".
Relational Data Model • Initial
All my data models are rendered in IDEF1X, the Standard for modelling Relational databases since 1993
My IDEF1X Introduction is essential reading for beginners.
Relational Data Model • Improved
Ternary relations (aka three-way JOINs) are known to be a problem, indicating that further Normalisation is required. Codd teaches that every ternary relation can be reduced to two binary relations.
In your case, perhaps an Item has certain, not all, Categories. The above implements Polls of Items allowing all Categories for each Item, which is a typical error in a ternary relation, and which is why it requires further Normalisation. It is also the classic error in every RFS file.
The corrected model is therefore to establish the Categories for each Item first, as ItemCategory - your "item can have several numbers of category_x" - and then to allow Polls on that constrained ItemCategory. Note, this level of constraining the data is not possible in 1960s Record Filing Systems, in which the "key" is a fabricated id field:
Each ItemCategory is 1 Item
Each ItemCategory is 1 Category
Each Poll is 1 Priority
Each Poll is 1 ItemCategory
Your indices are now simple and straight-forward, no additional indices are required.
Likewise your query code will now be simple and straight-forward, and far less prone to bugs.
Please make sure that you learn about Subqueries. The Poll table supports any type of pivoting that may be required.
It is messy to optimize such queries against such a table. Moving the categories off to other tables would only make it slower.
Here's a partial solution... Identify the categories that are likely to be tested with:
=
IN
a range, such as your example release_date > '2019-01-01'
Then devise a few indexes (perhaps no more than a dozen) that have, say, 3-4 columns. Those columns should be ones that are often tested together. Order the columns in the indexes based on the list above. It is quite fine to have multiple = columns (first), but don't include more than one 'range' (last).
Keep in mind that the order of tests in WHERE does not matter, but the order of the columns in an INDEX does.
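Illustrative only - one such composite index for queries that test category_1 and category_5 with = or IN and then a release_date range (the index name is made up):

-- Equality/IN columns first, the single range column last.
ALTER TABLE items
    ADD INDEX idx_cat1_cat5_reldate (category_1, category_5, release_date);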

keeping a record of the number of rows

Say we have a table "posts" in a MySQL database, that as its name suggests stores users' posts on some social media platform. Now I want to display the number of posts each user has created. A potential solution would be:
SELECT COUNT(*) FROM posts WHERE ....etc;
But to me - at least - this looks like an expensive query. Wouldn't it be better to keep a record in some table, say statistics, using a column named number_of_posts? I'm aware that in the latter scenario I would have to update both tables (posts and statistics) once a post is created. What do you think is the best way to tackle this?
Queries like
SELECT COUNT(*), user_id
FROM posts
GROUP BY user_id
are capable of doing an index scan if you create an index on the user_id column. Index scans are fast. So the query you propose is just fine. SQL, and MySQL, are made for such queries.
And, queries like
SELECT COUNT(*)
FROM posts
WHERE user_id = 123456
are very fast if you have the user_id index. You may save a few dozen microseconds if you keep a separate table, or you may not. The savings will be hard to measure. But, you'll incur a cost maintaining that table, both in server performance and software-maintenance complexity.
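For reference, the index both of those counts rely on could be created like this, assuming it doesn't already exist:

-- A secondary index on user_id lets both the GROUP BY and the
-- WHERE user_id = ... queries be answered by an index scan.
CREATE INDEX idx_posts_user_id ON posts (user_id);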
For people just learning to use database software, intuition about performance often is grossly pessimistic. Database software packages have many thousands of programmer-years of work in them to improve performance. Truly. And, you probably can't outdo them with your own stuff.
Why did the developers of MySQL optimize this kind of thing? So developers using MySQL can depend on it for stuff like your problem, without having to do a lot of extra optimization work. They did it for you. Spend that time getting other parts of your application working.

MySQL performance on storing and returning ids

I have an API where I need to log which ids from a table were returned in a query, and in another query, return results sorted based on that log of ids.
For example:
The table products has a PK called id, and users has a PK called id. I can create a log table with one insert/update per returned id. I'm wondering about the performance and the design of this.
Essentially, for each returned ID in the API, I would:
INSERT INTO log (product_id, user_id, counter)
VALUES (#the_product_id, #the_user_id, 1)
ON DUPLICATE KEY UPDATE counter=counter+1;
... I'd either have an id column as the PK, or a combination of product_id and user_id (alternatively, having those two as a UNIQUE index).
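One possible definition matching the statement above - the composite primary key on (product_id, user_id) is what makes the ON DUPLICATE KEY UPDATE clause fire for repeat views (names are illustrative):

CREATE TABLE log (
    product_id INT NOT NULL,
    user_id    INT NOT NULL,
    counter    INT NOT NULL DEFAULT 1,
    PRIMARY KEY (product_id, user_id)
);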
So the first issue is the performance of this (20 insert/updates and the effect on my select calls in the API) - is there a better/smarter way to log these IDs? Extracting from the webserver log?
Second is the performance of the select statements to include the logged data, to allow a user to see new products every request (a simplified example, I'd specify the table fields instead of * in real life):
SELECT p.*, IFNULL(
    (SELECT log.counter
     FROM log
     WHERE log.product_id = p.id
       AND log.user_id = #the_user_id
    ), 0) AS seen_by_user
FROM products AS p
ORDER BY seen_by_user ASC
In our database, the products table has millions of rows, and the users table is growing rapidly. Am I right in my thinking to do it this way, or are there better ways? How do I optimize the process, and are there tools I can use?
Callie, I just wanted to flag a different perspective from keymone's, and it doesn't fit into a comment, hence this answer.
Performance is sensitive to the infrastructure environment: are you running on a shared hosting service (SHS), a private virtual server (PVS) or dedicated server, or even a multi-server config with separate web and database servers?
What are your transaction rates and volumetrics? How many inserts/updates are you doing per minute in your two peak trading hours of the day? What are your integrity requirements vis-à-vis the staleness of the log counters?
Yes, keymone's points are appropriate if you are doing, say, 3-10 updates per second, and as you move into this domain, some form of collection process to batch up inserts and allow bulk inserts becomes essential. But just as important here are questions such as the choice of storage engine, the transactional vs batch split, and the choice of infrastructure architecture itself (in-server DB instance vs separate DB server, master/slave configurations ...).
However, if you are averaging <1/sec then INSERT ON DUPLICATE KEY UPDATE has comparable performance to the equivalent UPDATE statements and it is the better approach if doing single row insert/updates as it ensures ACID integrity of the counts.
Any form of PHP process start-up will typically take ~100 ms on your web server, so even thinking of using one to do an asynchronous update is just plain crazy, as the performance hit is significantly larger than the update itself.
Your SQL statement just doesn't square with your comment that you have "millions of rows" in the products table, as it will do a full fetch of the products table, executing a correlated subquery on every row. I would have used a LEFT OUTER JOIN myself, with some sort of strong constraint to filter which product items are appropriate to this result set. However it runs, it will take materially longer to execute than any count update.
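An untested sketch of that LEFT OUTER JOIN form, keeping the question's #the_user_id placeholder and omitting the extra filtering recommended above:

-- Unmatched products get a counter of 0 via COALESCE.
SELECT p.*, COALESCE(l.counter, 0) AS seen_by_user
FROM products AS p
LEFT JOIN log AS l
    ON l.product_id = p.id
   AND l.user_id = #the_user_id
ORDER BY seen_by_user ASC;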
You will have really bad performance with such an approach.
MySQL is not exactly well suited for logging, so here are a few steps you might take to achieve good performance:
Instead of maintaining the stats table on the fly (the ON DUPLICATE KEY UPDATE bit, which will absolutely destroy your performance), have a single raw logs table where you just do inserts, and every now and then (say daily) run a script that aggregates data from that table into the real statistics table.
Instead of having a single statistics table, have daily stats, monthly stats, etc. Aggregate jobs would then build up data from already-aggregated stuff - awesome for performance. It also allows you to drop (or archive) stats granularity over time - who the hell cares about daily stats in 2 years' time? Or at least about "real-time" access to those stats.
Instead of inserting into a log table, use something like syslog-ng to gather such information into log files (much less load on the MySQL server[s]) and then aggregate the data into MySQL from the raw text files (many choices here; you can even import the raw files back into MySQL if your aggregation routine really needs some SQL flexibility).
That's about it.
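A rough sketch of the first point (raw log plus periodic aggregation), with illustrative table names:

-- Append-only raw log: inserts only, no updates.
CREATE TABLE log_raw (
    product_id INT NOT NULL,
    user_id    INT NOT NULL,
    seen_at    DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Daily job: roll yesterday's raw rows into the statistics table.
INSERT INTO log (product_id, user_id, counter)
SELECT product_id, user_id, COUNT(*)
FROM log_raw
WHERE seen_at >= CURDATE() - INTERVAL 1 DAY
  AND seen_at <  CURDATE()
GROUP BY product_id, user_id
ON DUPLICATE KEY UPDATE counter = counter + VALUES(counter);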

mysql multiple table/multiple schema performance

A quick bit of background - we have a table "orders" which has about 10k records written into it per day. This is the most queried table in the database. To keep the table small, we plan to move records written more than a week or so ago into a different table. This will be done by an automated job. While we understand it would make sense to move the history off to a separate server, we currently just have a single DB server.
The orders table is in databaseA. Following are the approaches we are considering:
Create a new schema databaseB and create an orders table in it that contains the history.
Create a table ordershistory in databaseA.
It would be great if we could get pointers as to which design would give better performance.
EDIT:
Better performance for:
Querying the current orders - since it's not weighed down by the past data
Querying the history
You could either:
Have a separate archival table, possibly in another database. This can potentially complicate querying.
Use partitioning.
I'm not sure how effective the MySQL partitioning is. For alternatives, you may take a look at PostgreSQL partitioning. Most commercial databases support it too.
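A rough sketch of MySQL RANGE partitioning on an order date; MySQL requires the partitioning column to appear in every unique key, hence the composite primary key (all names and columns here are illustrative, not your actual schema):

-- Each partition holds a date range; old partitions can be dropped or
-- archived cheaply without touching the current data.
CREATE TABLE orders_partitioned (
    order_id   INT NOT NULL,
    created_at DATE NOT NULL,
    PRIMARY KEY (order_id, created_at)
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p2023   VALUES LESS THAN (TO_DAYS('2024-01-01')),
    PARTITION p2024   VALUES LESS THAN (TO_DAYS('2025-01-01')),
    PARTITION pfuture VALUES LESS THAN MAXVALUE
);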
I take it from your question that you only want to deal with current orders.
In the past, I have used 3 tables on busy sites:
new orders,
processing orders,
filled orders,
and a main orders table
orders
All these tables have a relation to the orders table and a primary key.
eg new_orders_id, orders_id
processing_orders_id, orders_id ....
Using a LEFT JOIN to find new and processing orders should be relatively efficient.
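A hypothetical sketch of that LEFT JOIN; the table and column names follow the description above but are assumptions:

SELECT o.orders_id, n.new_orders_id, pr.processing_orders_id
FROM orders AS o
LEFT JOIN new_orders        AS n  ON n.orders_id  = o.orders_id
LEFT JOIN processing_orders AS pr ON pr.orders_id = o.orders_id
-- Keep only orders that are new or still processing.
WHERE n.new_orders_id IS NOT NULL
   OR pr.processing_orders_id IS NOT NULL;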