Say we have a table "posts" in a MySQL database which, as its name suggests, stores users' posts on some social media platform. Now I want to display the number of posts each user has created. A potential solution would be:
SELECT COUNT(*) FROM posts WHERE ....etc;
But to me - at least - this looks like an expensive query. Wouldn't it be better to keep a record in some table, say (statistics), using a column named (number_of_posts)? I'm aware that in the latter scenario I would have to update both tables, (posts) and (statistics), once a post is created. What do you think is the best way to tackle this?
Queries like
SELECT COUNT(*), user_id
FROM posts
GROUP BY user_id
are capable of doing an index scan if you create an index on the user_id column. Index scans are fast, so the query you propose is just fine. SQL in general, and MySQL in particular, are made for such queries.
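For example, a plain secondary index on user_id (the index name below is just illustrative) is typically all that is needed for the query above to be resolved as an index scan:

CREATE INDEX idx_posts_user_id ON posts (user_id);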
And, queries like
SELECT COUNT(*)
FROM posts
WHERE user_id = 123456
are very fast if you have the user_id index. You may save a few dozen microseconds if you keep a separate table, or you may not. The savings will be hard to measure. But, you'll incur a cost maintaining that table, both in server performance and software-maintenance complexity.
For people just learning to use database software, intuition about performance often is grossly pessimistic. Database software packages have many thousands of programmer-years of work in them to improve performance. Truly. And, you probably can't outdo them with your own stuff.
Why did the developers of MySQL optimize this kind of thing? So developers using MySQL can depend on it for stuff like your problem, without having to do a lot of extra optimization work. They did it for you. Spend that time getting other parts of your application working.
Related
I believe this question does not specifically concern MySQL - which is the database I'm using - and it's more about best practices.
Up until now, my problems could be solved by creating tables and querying them (sometimes JOINing here and there). But there is something I'm doing that doesn't feel right, and it bothers me whenever I need denormalized data alongside my "common" queries.
Example Use-case
So that I can express myself better, let's set up a simplified scenario where:
a user can buy a product, generating a purchase (let's ignore the fact that the purchase can only have a single product);
and we need to query the products with the total number of times each has been purchased;
To solve our use-case, we could define a simple structure made by:
product table:
product_id [INT PK]
user table:
user_id [INT PK]
purchase table:
purchase_id [INT PK]
product_id [INT FK NOT NULL]
user_id [INT FK NOT NULL]
Here is where it doesn't feel right: when we need to retrieve a list of products with the total number of times each has been purchased, I would write the query:
# There are probably faster queries than this to reach the same output
SELECT
product.product_id,
(SELECT COUNT(*) FROM purchase
WHERE purchase.product_id = product.product_id)
FROM
product
My concern stems from having read that COUNT does a full table scan, and it scares me to run the query above when scaled to thousands of products being purchased - even though I've created an INDEX on the product_id FK on purchase (MySQL does this by default).
Possible solutions
My knowledge of relational databases is pretty shallow, so I'm somewhat lost when comparing the (plausible) alternatives for these kinds of problems. So that it isn't said that I haven't done my homework (searching before asking), the approaches I've found plausible are:
Create Transactions:
When INSERTing a new purchase, it must always happen inside a transaction that also updates the corresponding row in the product table (matched by purchase.product_id).
Possible Problems: human error. Someone might manually insert a purchase without doing the transaction and BAM - we have an inconsistency.
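For what it's worth, a minimal sketch of that transactional approach, assuming a counter column such as bought_amount on product (as in the trigger option below) and made-up id values:

START TRANSACTION;
INSERT INTO purchase (product_id, user_id) VALUES (42, 7);
UPDATE product SET bought_amount = bought_amount + 1 WHERE product_id = 42;
COMMIT;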
Create Triggers:
Whenever I insert, delete or update some row in some specific table, I would update my product table with a new value (bought_amount). So the table would become:
product table:
product_id [INT PK]
bought_amount [INT NOT NULL];
Possible problems: are triggers expensive? is there a way that the insertion succeeds but the trigger won't - thus leaving me with an inconsistency?
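For reference, a hedged sketch of such a trigger against the schema above (the trigger name is illustrative):

CREATE TRIGGER purchase_after_insert
AFTER INSERT ON purchase
FOR EACH ROW
UPDATE product
SET bought_amount = bought_amount + 1
WHERE product_id = NEW.product_id;

A matching AFTER DELETE trigger would be needed to keep bought_amount correct when purchases are removed.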
Question
Is updating certain tables to store data that constantly changes a plausible approach with RDBMSs? Or is it safer and - in the long term - more beneficial to just keep joining and counting/summing occurrences?
I've found a couple of useful questions/answers regarding this matter, but none of them addressed this subject in a wide perspective.
Please take my ignorance about RDBMSs into consideration, as I may be suggesting nonsensical possible solutions.
The usual way to get a count per key is
SELECT product_id, COUNT(*)
FROM purchase
GROUP BY product_id
You don't need to mention the product table, because all it contains is the key column. Now although that uses COUNT(*), it doesn't need a full table scan for every product_id because the SQL engine is smart enough to see the GROUP BY.
But this produces a different result to your query: for products that have never been purchased, my query simply won't show them; your query will show the product_id with count zero.
Then before you start worrying about implementation and efficiency, what question(s) are you trying to answer? If you want to see all products whether purchased or not, then you must scan the whole product table and look up from that to purchase. I would go
SELECT product.product_id, COALESCE(purch.purchase_count, 0) AS purchase_count
FROM product
LEFT JOIN (SELECT product_id, COUNT(*) AS purchase_count
           FROM purchase
           GROUP BY product_id) AS purch
ON product.product_id = purch.product_id
As regards your wider questions (I'm not sure I fully understand them), in the early days SQL was quite inefficient at this sort of joining and aggregating, and schemas were often denormalised with repeated columns in multiple tables. SQL engines are now much smarter, so that's not necessary. You might see that old-fashioned practice in older textbooks. I would ignore it and design your schema as normalised as possible.
This query:
SELECT p.product_id,
(SELECT COUNT(*)
FROM purchase pu
WHERE pu.product_id = p.product_id
)
FROM product p;
has to scan both product and purchase. I'm not sure why you are emotional about one table scan but not the other.
As for performance, this can take advantage of an index on purchase(product_id). In MySQL, this will probably be faster than the equivalent (left) join version.
You should not worry about performance of such queries until that becomes an issue. If you need to increase performance of such a query, first I would ask: Why? That is a lot of information being returned -- about all products over all time. More typically, I would expect someone to care about one product or a period of time or both. And, those concerns would suggest the development of a datamart.
If performance is an issue, you have many alternatives, such as:
Defining a data mart to periodically summarize the data into more efficient structures for such queries (a rough sketch follows this list).
Adding triggers to the database to summarize the data, if the results are needed in real-time.
Developing a methodology for maintaining the data that also maintains the summaries, either at the application-level or using stored procedures.
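As a rough illustration of the first alternative above, a summary table refreshed by a scheduled job might look like this (the table and column names are assumptions):

CREATE TABLE product_purchase_summary (
  product_id INT PRIMARY KEY,
  purchase_count INT NOT NULL
);

-- Refreshed periodically, e.g. nightly:
REPLACE INTO product_purchase_summary
SELECT product_id, COUNT(*) FROM purchase GROUP BY product_id;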
What doesn't "feel right" to you is actually the tremendous strength of a relational database (with a reasonable data model). You can keep it up-to-date. And you can query it using a pretty concise language that meets business needs.
Possible Problems: human error. Someone might manually insert a purchase without doing the transaction and BAM - we have an inconsistency.
--> Build a Stored Procedure that does both steps in a transaction, then force users to go through that.
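A minimal sketch of such a procedure, reusing the bought_amount column from the question (procedure and parameter names are illustrative):

DELIMITER //
CREATE PROCEDURE add_purchase(IN p_product_id INT, IN p_user_id INT)
BEGIN
  START TRANSACTION;
  INSERT INTO purchase (product_id, user_id) VALUES (p_product_id, p_user_id);
  UPDATE product SET bought_amount = bought_amount + 1
  WHERE product_id = p_product_id;
  COMMIT;
END //
DELIMITER ;

Application code would then only call add_purchase rather than writing to the tables directly.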
Possible problems: are triggers expensive? is there a way that the insertion succeeds but the trigger won't - thus leaving me with an inconsistency?
Triggers are not too bad. But, again, I would recommend forcing users through a Stored Procedure that does all the desired steps.
Note: Instead of Stored Procedures, you could have an application that does the necessary steps; then force users to go through the app and give them no direct access to the database.
A database is the "source of truth" on the data. It is the "persistent" repository for such. It should not be considered the entire engine for building an application.
As for performance:
Summing over a million rows may take a noticeable amount of time.
You can easily do a hundred single-row queries (select/insert/update) per second.
Please think through numbers like that.
I've read a number of posts here and elsewhere about people wrestling to improve the performance of the MySQL/MariaDB COUNT function, but I haven't found a solution that quite fits what I am trying to do. I'm trying to produce a live-updating list of read counts for a list of articles. Each time a visitor visits a page, a log table in the SQL database records the usual access log-type data (IP, browser, etc.). Of particular interest, I record the user's ID (uid) and I process the user agent tag to classify known spiders (uaType). The article itself is identified by the "paid" column. The goal is to produce a statistic that doesn't count the poster's own views of the page and doesn't include known spiders, either.
Here's the query I have:
"COUNT(*) FROM uninet_log WHERE paid='1942' AND uid != '1' AND uaType != 'Spider'"
This works nicely enough, but very slowly (approximately 1 sec.) when querying against a database with 4.2 million log entries. If I run the query several times during a particular run, it increases the runtime by about another second for each query. I know I could group by paid and then run a single query, but even then (which would require some reworking of my code, but could be done) I feel like 1 second for the query is still really slow and I'm worried about the implications when the server is under a load.
I've tried switching out COUNT(*) for COUNT(1) or COUNT(id) but that doesn't seem to make a difference.
Does anyone have a suggestion on how I might create a better, faster query that would accomplish this same goal? I've thought about having a background process regularly calculate the statistics and cache them, but I'd love to stick to live updating information if possible.
Thanks,
Tim
Add a boolean "summarized" column to your statistics table and make it part of a multicolumn index with paid.
Then have a background process that produces/updates rows containing the read count in a summary table (by article) and marks the statistics table rows as summarized. (Though the summary table could just be your article table.)
Then your live query reports the sum of the already summarized results and the as-yet-unsummarized statistics rows.
This also allows you to expire old statistics table rows without losing your read counts.
(All this assumes you already have an index on paid; if you don't, definitely add one and that will likely solve your problem for now, though in the long run likely you still want to be able to delete old statistics records.)
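A rough sketch of that layout, with illustrative names (the summary table and the summarized flag are assumptions, not part of the original schema):

ALTER TABLE uninet_log
  ADD COLUMN summarized TINYINT(1) NOT NULL DEFAULT 0,
  ADD INDEX idx_paid_summarized (paid, summarized);

CREATE TABLE article_read_counts (
  paid INT PRIMARY KEY,
  read_count INT NOT NULL DEFAULT 0
);

-- Live count = already-summarized total + not-yet-summarized rows
SELECT c.read_count
  + (SELECT COUNT(*)
     FROM uninet_log l
     WHERE l.paid = c.paid
       AND l.summarized = 0
       AND l.uid != '1'
       AND l.uaType != 'Spider') AS total_reads
FROM article_read_counts c
WHERE c.paid = 1942;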
I am developing a website using ASP.NET and my DB is MySQL.
Users can post ads in each category, and I want to display how many ads there are in each category, in front of the category name.
Like this.
To achieve this, I am currently using a query similar to this:
SELECT b.name, COUNT(*) AS count
FROM `vehicle_cat` a
INNER JOIN `vehicle_type` b
ON a.`type_id_ref` = b.`vehicle_type_id`
GROUP BY b.name
This is my explain result
So assume I have 200,000 records for each category.
So am I doing the right thing, considering performance and efficiency?
What if I maintain a separate table to store the count for each category? Whenever a user saves a record in a category, I would increment the value for the corresponding type. Assume 100,000 users will post records at once. Will it crash my DB?
Or are there any other solutions?
Start by developing the application using the query. If performance is a problem, then create indexes to optimize the query. If indexes are not sufficient, then think about partitioning.
Things not to do:
Don't create a separate table for each category.
Don't focus on performance before you have a performance problem. Do reasonable things, but get the functionality to work first.
If you do need to maintain counts in a separate table for performance reasons, you will probably have to maintain them using triggers.
You can use any caching solution, probably an in-memory cache like Redis or Memcached, and store your counters there. On cache initialization, populate them with your SQL script; later, change the counters when adding or deleting ads. It will be faster than storing them in SQL.
But you should probably check whether COUNT(*) is really a heavy operation in your SQL database. The SQL engine is clever, and this SELECT may already work fast enough, or you may be able to optimize it well. If it works, you'd better do nothing until you have performance problems!
I have a table in which I store ratings from users for each product.
The table consists of the following fields: productid, userid, rating (out of 5). This table might contain a million rows in the future.
So in order to select the top 5 products, I am using the following query:
SELECT productid, avg( rating ) as avg_rating
from product_ratingstblx245v
GROUP BY productid
ORDER BY avg( rating ) DESC
LIMIT 5
My question is: since I will be showing this result on a few pages of my site, would it be better to maintain a separate table for the average ratings with the fields productid, avgrating, totalvotes?
You don't need the answer to that question yet. You can start with a VIEW, that is the result of executing the above query. If, after performing load tests (e.g. with JMeter), you see that your site runs slow indeed, you can replace the VIEW with a TEMPORARY TABLE (stored in memory). Since the view and the temporary table will look the same from the outside, you will not have to change your business logic.
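For example, such a view might look like this (the view name is illustrative):

CREATE VIEW product_avg_ratings AS
SELECT productid, AVG(rating) AS avg_rating, COUNT(*) AS totalvotes
FROM product_ratingstblx245v
GROUP BY productid;

SELECT productid, avg_rating
FROM product_avg_ratings
ORDER BY avg_rating DESC
LIMIT 5;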
Tbh, if MySql wasn't able to handle queries on a simple table schema such as yours above for over a million records in (sub)millisecond speeds, I really would wonder why companies use it for LOB applications.
As it is, I'm an MS SQL developer, so I don't really know that much about MySQL's abilities. However, assuming that its database engine is as good as SQL Server's (I've heard good things about MySQL), you don't need to worry about your performance issue. If you do want to tweak it, then why not cache the results for 10 minutes (or longer) at your application layer? Triggers are generally pure (albeit sometimes necessary) evil. SQL servers are designed specifically for the type of query you wish to execute; trust in the SQL.
Personally, I do not like the idea of running totals like this but if it does become necessary then I would not store the average, I would store the TOTAL VOTES and TOTAL RATING. That way it's a very simple UPDATE query (add 1 to TOTAL VOTES and add rating to TOTAL RATING). You can then calculate the average on the fly in minimal time.
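A hedged sketch of that idea, assuming a product_rating_totals table (the table name and the example values are made up):

-- On each new vote (here a rating of 4 for product 123):
UPDATE product_rating_totals
SET total_votes = total_votes + 1,
    total_rating = total_rating + 4
WHERE productid = 123;

-- Average computed on the fly:
SELECT productid, total_rating / total_votes AS avg_rating
FROM product_rating_totals
ORDER BY avg_rating DESC
LIMIT 5;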
As for how you might handle this, I would use a trigger as someone already suggested. But only after trying the VIEW thing that someone else suggested.
I have a large database of normalized order data that is becoming very slow to query for reporting. Many of the queries that I use in reports join five or six tables and are having to examine tens or hundreds of thousands of lines.
There are lots of queries and most have been optimized as much as possible to reduce server load and increase speed. I think it's time to start keeping a copy of the data in a denormalized format.
Any ideas on an approach? Should I start with a couple of my worst queries and go from there?
I know more about mssql than mysql, but I don't think the number of joins or number of rows you are talking about should cause you too many problems with the correct indexes in place. Have you analyzed the query plan to see if you are missing any?
http://dev.mysql.com/doc/refman/5.0/en/explain.html
That being said, once you are satisfied with your indexes and have exhausted all other avenues, de-normalization might be the right answer. If you just have one or two queries that are problems, a manual approach is probably appropriate, whereas some sort of data warehousing tool might be better for creating a platform to develop data cubes.
Here's a site I found that touches on the subject:
http://www.meansandends.com/mysql-data-warehouse/?link_body%2Fbody=%7Bincl%3AAggregation%7D
Here's a simple technique that you can use to keep denormalizing queries simple, if you're just doing a few at a time (and I'm not replacing your OLTP tables, just creating a new one for reporting purposes). Let's say you have this query in your application:
select a.name, b.address from tbla a
join tblb b on b.fk_a_id = a.id where a.id=1
You could create a denormalized table and populate with almost the same query:
create table tbl_ab (a_id, a_name, b_address);
-- (types elided)
-- Notice the underscores match the table aliases you use
insert tbl_ab select a.id, a.name, b.address from tbla a
join tblb b on b.fk_a_id = a.id
-- no where clause because you want everything
Then to fix your app to use the new denormalized table, switch the dots for underscores.
select a_name as name, b_address as address
from tbl_ab where a_id = 1;
For huge queries this can save a lot of time and makes it clear where the data came from, and you can re-use the queries you already have.
Remember, I'm only advocating this as the last resort. I bet there's a few indexes that would help you. And when you de-normalize, don't forget to account for the extra space on your disks, and figure out when you will run the query to populate the new tables. This should probably be at night, or whenever activity is low. And the data in that table, of course, will never exactly be up to date.
[Yet another edit] Don't forget that the new tables you create need to be indexed too! The good part is that you can index to your heart's content and not worry about update lock contention, since aside from your bulk insert the table will only see selects.
MySQL 5 does support views, which may be helpful in this scenario. It sounds like you've already done a lot of optimizing, but if not you can use MySQL's EXPLAIN syntax to see what indexes are actually being used and what is slowing down your queries.
As far as going about normalizing data (whether you're using views or just duplicating data in a more efficient manner), I think starting with the slowest queries and working your way through is a good approach to take.
I know this is a bit tangential, but have you tried seeing if there are more indexes you can add?
I don't have a lot of DB background, but I am working with databases a lot recently, and I've been finding that a lot of the queries can be improved just by adding indexes.
We are using DB2, and there is a command called db2expln and db2advis, the first will indicate whether table scans vs index scans are being used, and the second will recommend indexes you can add to improve performance. I'm sure MySQL has similar tools...
Anyways, if this is something you haven't considered yet, it has been helping a lot with me... but if you've already gone this route, then I guess it's not what you are looking for.
Another possibility is a "materialized view" (or as they call it in DB2), which lets you specify a table that is essentially built of parts from multiple tables. Thus, rather than normalizing the actual columns, you could provide this view to access the data... but I don't know if this has severe performance impacts on inserts/updates/deletes (but if it is "materialized", then it should help with selects since the values are physically stored separately).
In line with some of the other comments, I would definitely have a look at your indexing.
One thing I discovered earlier this year on our MySQL databases was the power of composite indexes. For example, if you are reporting on order numbers over date ranges, a composite index on the order number and order date columns could help. I believe MySQL can generally only use one index per table in a query, so if you just had separate indexes on the order number and order date it would have to decide on just one of them to use. Using the EXPLAIN command can help determine this.
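For instance, something along these lines, where the table and column names are assumptions based on the example above:

CREATE INDEX idx_orders_number_date ON orders (order_number, order_date);

-- EXPLAIN shows whether the composite index is actually chosen:
EXPLAIN SELECT *
FROM orders
WHERE order_number = 12345
  AND order_date BETWEEN '2024-01-01' AND '2024-03-31';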
To give an indication of the performance with good indexes (including numerous composite indexes), I can run queries joining 3 tables in our database and get almost instant results in most cases. For more complex reporting, most of the queries run in under 10 seconds. These 3 tables have 33 million, 110 million and 140 million rows respectively. Note that we had also already normalised these slightly to speed up our most common query on the database.
More information regarding your tables and the types of reporting queries may allow further suggestions.
For MySQL I like this talk: Real World Web: Performance & Scalability, MySQL Edition. This contains a lot of different pieces of advice for getting more speed out of MySQL.
You might also want to consider selecting into a temporary table and then performing queries on that temporary table. This would avoid the need to rejoin your tables for every single query you issue (assuming that you can use the temporary table for numerous queries, of course). This basically gives you denormalized data, but if you are only doing select calls, there's no concern about data consistency.
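A hedged sketch of that approach, with made-up table and column names:

CREATE TEMPORARY TABLE order_report AS
SELECT o.order_id, o.order_date, c.customer_name, i.product_id, i.quantity
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
JOIN order_items i ON i.order_id = o.order_id;

-- Several reporting queries can then reuse the joined data without re-joining:
SELECT product_id, SUM(quantity) AS units_sold
FROM order_report
GROUP BY product_id;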
Further to my previous answer, another approach we have taken in some situations is to store key reporting data in separate summary tables. There are certain reporting queries which are just going to be slow even after denormalising and optimising, and we found that creating a table and storing running totals or summary information throughout the month as it came in made the end-of-month reporting much quicker as well.
We found this approach easy to implement as it didn't break anything that was already working - it's just additional database inserts at certain points.
I've been toying with composite indexes and have seen some real benefits... maybe I'll set up some tests to see if that can save me here... at least for a little longer.