MySQL database design?

I have a table in which I store ratings from users for each product.
The table consists of the following fields: productid, userid, rating (out of 5). This table might contain a million rows in the future.
So in order to select the top 5 products, I am using the following query:
SELECT productid, AVG(rating) AS avg_rating
FROM product_ratingstblx245v
GROUP BY productid
ORDER BY avg_rating DESC
LIMIT 5;
My question is: since I will be showing this result on a few pages of my site, would it be better to maintain a separate table for the average ratings, with the fields productid, avgrating, totalvotes?

You don't need the answer to that question yet. You can start with a VIEW that is the result of executing the above query. If, after performing load tests (e.g. with JMeter), you see that your site really is slow, you can replace the VIEW with a TEMPORARY TABLE (stored in memory). Since the view and the temporary table look the same from the outside, you will not have to change your business logic.
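For illustration, a minimal sketch of that view, using the table from the question (the view name is my own invention):

CREATE VIEW top_rated_products AS
SELECT productid, AVG(rating) AS avg_rating, COUNT(*) AS totalvotes
FROM product_ratingstblx245v
GROUP BY productid
ORDER BY avg_rating DESC
LIMIT 5;

The pages of the site then simply run SELECT * FROM top_rated_products; and never need to know how the numbers are produced.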

Tbh, if MySQL weren't able to handle queries on a simple table schema such as yours above, for over a million records, at (sub)millisecond speeds, I really would wonder why companies use it for LOB applications.
As it is, I'm an MS SQL developer, so I don't really know that much about MySQL's abilities. However, assuming that its database engine is as good as SQL Server's (I've heard good things about MySQL), you don't need to worry about your performance issue. If you do want to tweak it, then why not cache the results for 10 minutes (or longer) at your application layer? Triggers are generally pure (albeit sometimes necessary) evil. SQL engines are designed specifically for the type of query you wish to execute, so trust in the SQL.

Personally, I do not like the idea of running totals like this, but if it does become necessary, then I would not store the average; I would store the TOTAL VOTES and TOTAL RATING. That way it's a very simple UPDATE query (add 1 to TOTAL VOTES and add the rating to TOTAL RATING). You can then calculate the average on the fly in minimal time.
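A rough sketch of what that looks like; the ratings_summary table and its column names are assumptions, not part of the original schema:

UPDATE ratings_summary
SET total_votes = total_votes + 1,
    total_rating = total_rating + 4  -- 4 being the rating just submitted
WHERE productid = 42;

SELECT productid, total_rating / total_votes AS avg_rating
FROM ratings_summary
ORDER BY avg_rating DESC
LIMIT 5;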
As for how you might handle this, I would use a trigger, as someone already suggested, but only after trying the VIEW approach that someone else suggested.

Keeping a record of the number of rows

Say we have a table "posts" in a MySQL database that, as its name suggests, stores users' posts on some social media platform. Now I want to display the number of posts each user has created. A potential solution would be:
SELECT COUNT(*) FROM posts WHERE ....etc;
But to me, at least, this looks like an expensive query. Wouldn't it be better to keep a record in some table, say statistics, using a column named number_of_posts? I'm aware that in the latter scenario I would have to update both tables (posts and statistics) once a post is created. What do you think is the best way to tackle this?
Queries like
SELECT COUNT(*), user_id
FROM posts
GROUP BY user_id
are capable of doing an index scan if you create an index on the user_id column. Index scans are fast. So the query you propose is just fine. SQL, and MySQL, are made for such queries.
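Creating that index is a one-liner; the index name here is arbitrary:

CREATE INDEX idx_posts_user_id ON posts (user_id);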
And, queries like
SELECT COUNT(*)
FROM posts
WHERE user_id = 123456
are very fast if you have the user_id index. You may save a few dozen microseconds if you keep a separate table, or you may not. The savings will be hard to measure. But, you'll incur a cost maintaining that table, both in server performance and software-maintenance complexity.
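If you want to see this for yourself, prefix the query with EXPLAIN and check that the user_id index shows up in the key column (the literal id below is just an example):

EXPLAIN SELECT COUNT(*) FROM posts WHERE user_id = 123456;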
For people just learning to use database software, intuition about performance often is grossly pessimistic. Database software packages have many thousands of programmer-years of work in them to improve performance. Truly. And, you probably can't outdo them with your own stuff.
Why did the developers of MySQL optimize this kind of thing? So developers using MySQL can depend on it for stuff like your problem, without having to do a lot of extra optimization work. They did it for you. Spend that time getting other parts of your application working.

Optimizing COUNT() on MariaDB for a Statistics Table

I've read a number of posts here and elsewhere about people wrestling to improve the performance of the MySQL/MariaDB COUNT function, but I haven't found a solution that quite fits what I am trying to do. I'm trying to produce a live-updating list of read counts for a list of articles. Each time a visitor visits a page, a log table in the SQL database records the usual access-log data (IP, browser, etc.). Of particular interest, I record the user's ID (uid), and I process the user-agent string to classify known spiders (uaType). The article itself is identified by the paid column. The goal is to produce a statistic that doesn't count the poster's own views of the page and doesn't include known spiders either.
Here's the query I have:
"COUNT(*) FROM uninet_log WHERE paid='1942' AND uid != '1' AND uaType != 'Spider'"
This works nicely enough, but very slowly (approximately 1 second) when querying against a database with 4.2 million log entries. If I run the query several times during a particular run, each query adds about another second to the runtime. I know I could group by paid and then run a single query, but even then (which would require some reworking of my code, but could be done) I feel like 1 second for the query is still really slow, and I'm worried about the implications when the server is under load.
I've tried switching out COUNT(*) for COUNT(1) or COUNT(id) but that doesn't seem to make a difference.
Does anyone have a suggestion on how I might create a better, faster query that would accomplish this same goal? I've thought about having a background process regularly calculate the statistics and cache them, but I'd love to stick to live updating information if possible.
Thanks,
Tim
Add a boolean summarized column to your statistics table and make it part of a multicolumn index with paid.
Then have a background process that produces/updates rows containing the read count in a summary table (by article) and marks the statistics table rows as summarized. (Though the summary table could just be your article table.)
Then your live query reports the sum of the already summarized results and the as-yet-unsummarized statistics rows.
This also allows you to expire old statistics table rows without losing your read counts.
(All this assumes you already have an index on paid; if you don't, definitely add one, and that alone will likely solve your problem for now, though in the long run you likely still want to be able to delete old statistics records.)
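To make the shape of this concrete, here is a sketch under stated assumptions: the article_read_counts summary table and the index names are mine, and in production the two background-job statements should run in one transaction (or be bounded by a max id) so no row gets marked without being counted:

ALTER TABLE uninet_log
    ADD COLUMN summarized TINYINT(1) NOT NULL DEFAULT 0,
    ADD INDEX idx_paid_summarized (paid, summarized);

CREATE TABLE article_read_counts (
    paid INT PRIMARY KEY,
    read_count INT NOT NULL DEFAULT 0
);

-- Background job: fold countable, unsummarized rows into the summary...
INSERT INTO article_read_counts (paid, read_count)
SELECT paid, COUNT(*)
FROM uninet_log
WHERE summarized = 0 AND uid != '1' AND uaType != 'Spider'
GROUP BY paid
ON DUPLICATE KEY UPDATE read_count = read_count + VALUES(read_count);

-- ...then mark everything processed (spider and own-view rows are never counted anyway).
UPDATE uninet_log SET summarized = 1 WHERE summarized = 0;

-- Live query: the summarized total plus the small unsummarized tail.
SELECT COALESCE((SELECT read_count FROM article_read_counts WHERE paid = '1942'), 0)
     + (SELECT COUNT(*) FROM uninet_log
        WHERE paid = '1942' AND summarized = 0
          AND uid != '1' AND uaType != 'Spider') AS read_count;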

SQL Optimization: how to JOIN a table with itself

I'm trying to optimize a SQL query and I am not sure if further optimization is possible.
Here's my query:
SELECT someColumns
FROM (((smaller_table))) AS a
INNER JOIN (((smaller_table))) AS b
ON a.someColumnProperty = b.someColumnProperty
...the problem with this approach is that my table has half a trillion records in it. In my query, you'll notice (((smaller_table))). I wrote that as an abbreviation for a SELECT statement being run on MY_VERY_LARGE_TABLE to reduce its size.
(((smaller_table))) appears twice, and the code within is exactly the same both times. There's no reason for me to run the same sub-query twice. This table is several TB and I shouldn't scan through it twice just to get the same results.
Do you have any suggestions on how I can NOT run the exact same reduction twice? I tried replacing the INNER JOIN line with INNER JOIN a AS b but got an "unrecognized table a" warning. Is there any way to store the value of a so I can reuse it?
Thoughts:
Make sure there is an index on userid and dayid.
I would ask you to define better what it is you are trying to find out.
Examples:
What is the busiest time of the week?
Who are the top 25 people who come to the gym the most often?
Who are the top 25 people who utilize the gym the most? (This is different from the one above because maybe I have a user that comes 5 times a month but stays 5 hours per session, vs. a user that comes 30 times a month and stays 0.5 hours per session.)
Maybe doing all days in a horizontal layout (day1, day2, day3) would be better visually for finding out what you are looking for. You could easily put this into Excel or LibreOffice and color the days that are populated to get a visual "picture" of people who come consecutively.
It might be interesting to run this for multiple months to see what the seasonality looks like.
Alas, CTEs are not available in MySQL (before 8.0). The rough equivalent is
CREATE TABLE tmp (
    INDEX(someColumnProperty)
)
SELECT ...;
But...
You can't use CREATE TEMPORARY TABLE because such a table can't be used twice in the same query. (No, I don't know why.)
Adding the INDEX (or PK or ...) during the CREATE (or afterwards) provides the very necessary key for doing the self join.
You still need to worry about DROPping the table (or otherwise dealing with it).
The choice of ENGINE for tmp depends on a number of factors. If you are sure it will be "small" and has no TEXT/BLOB, then MEMORY may be optimal.
In a Replication topology, there are additional considerations.
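Putting the pieces together, a sketch of the full round trip; the WHERE clause stands in for the question's (((smaller_table))) reduction and is deliberately left elided:

CREATE TABLE tmp (
    INDEX(someColumnProperty)
)
SELECT someColumns, someColumnProperty
FROM MY_VERY_LARGE_TABLE
WHERE ...;

SELECT a.someColumns
FROM tmp AS a
INNER JOIN tmp AS b ON a.someColumnProperty = b.someColumnProperty;

DROP TABLE tmp;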

Shall I get the count of records for each category by using COUNT(*) or by using a separate count table? Or any other way?

I am developing a website using ASP.NET, and my DB is MySQL.
Users can post ads in each category, and I want to display how many ads each category contains next to the category name.
To achieve this, I am currently using a query similar to this:
SELECT b.name, COUNT(*) AS count
FROM `vehicle_cat` a
INNER JOIN `vehicle_type` b
ON a.`type_id_ref` = b.`vehicle_type_id`
GROUP BY b.name
So assume I have 200,000 records for each category.
Am I doing the right thing, considering performance and efficiency?
What if I manage a separate table that stores the count for each category? When a user saves a record in a category, I increment the value for the corresponding type. Assume 100,000 users post records at once. Would that crash my DB?
Or are there any other solutions?
Start by developing the application using the query. If performance is a problem, then create indexes to optimize the query. If indexes are not sufficient, then think about partitioning.
Things not to do:
Don't create a separate table for each category.
Don't focus on performance before you have a performance problem. Do reasonable things, but get the functionality to work first.
If you do need to maintain counts in a separate table for performance reasons, you will probably have to maintain them using triggers.
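A hedged sketch of such a trigger, reusing the vehicle_cat and type_id_ref names from the question; the category_counts table is an assumption:

CREATE TABLE category_counts (
    type_id INT PRIMARY KEY,
    ad_count INT NOT NULL DEFAULT 0
);

CREATE TRIGGER vehicle_cat_after_insert
AFTER INSERT ON vehicle_cat
FOR EACH ROW
    INSERT INTO category_counts (type_id, ad_count)
    VALUES (NEW.type_id_ref, 1)
    ON DUPLICATE KEY UPDATE ad_count = ad_count + 1;

(You would need a matching AFTER DELETE trigger to decrement the count if ads can be removed.)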
You can use any caching solution, probably in-memory caching like Redis or Memcached, and store your counters there. On cache initialization, populate them with your SQL script; later, change these counters when adding or deleting ads. It will be faster than storing them in SQL.
But you should probably check whether COUNT(*) is really a heavy operation in your SQL database. SQL engines are clever, and this SELECT may work fast enough, or you may be able to optimize it well. If it works, you'd better do nothing until you have performance problems!

Should totals be denormalized?

I am working on a website with a simple normalized database.
There is a table called Pages and a table called Views. Each time a Page is viewed, a unique record of that View is recorded in the Views table.
When displaying a Page on the site, I use a simple MySQL COUNT() to total up the number of Views for display.
Database design seems fine, except for this problem: I am at a loss for how to retrieve the top 10 most viewed pages among thousands.
Should I denormalize the Pages table by adding a Pages.views column to hold the total number of views for each page? Or is there an efficient way to query for the top 10 most viewed pages?
SELECT p.pageid, COUNT(*) AS viewcount
FROM pages p
INNER JOIN views v ON p.pageid = v.pageid
GROUP BY p.pageid
ORDER BY viewcount DESC
LIMIT 10 OFFSET 0;
I can't test this, but it should be something along those lines. I would not store the value unless I have to due to performance constraints (I just learned the term "premature optimization", and it seems to apply if you do).
It depends on the level of information you are trying to maintain. If you want to record who viewed what and when, then the separate table is fine. Otherwise, a column for views is the way to go. Also, if you keep a separate column, you'll find that the table will be locked more often, since each page view will try to update the column for its corresponding row.
SELECT pageid, COUNT(*) AS countCol
FROM Views
GROUP BY pageid
ORDER BY countCol DESC
LIMIT 10 OFFSET 0;
Database normalization is all about the most efficient / least redundant way to store data. This is good for transaction processing, but often directly conflicts with the need to efficiently get the data out again. The problem is usually addressed by having derived tables (indexes, materialized views, rollup tables...) with more accessible, pre-processed data. The (slightly dated) buzzword here is Data Warehousing.
I think you want to keep your Pages table normalized, but have an extra table with the totals. Depending on how recent those counts need to be, you can update the table when you update the original table, or you can have a background job to periodically recalculate the totals.
You also want to do this only if you really run into a performance problem, which you will not unless you have a very large number of records, or a very large number of concurrent accesses. Keep your code flexible to be able to switch between having the table and not having it.
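For example, a periodically refreshed totals table might look like this; the table and column names are mine, the schema is the question's pages/views:

CREATE TABLE page_view_totals (
    pageid INT PRIMARY KEY,
    viewcount INT NOT NULL DEFAULT 0
);

-- Background job, run as often as freshness requires:
REPLACE INTO page_view_totals (pageid, viewcount)
SELECT pageid, COUNT(*) FROM views GROUP BY pageid;

-- The top-10 query then reads only the small table:
SELECT pageid, viewcount
FROM page_view_totals
ORDER BY viewcount DESC
LIMIT 10;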
I would probably include the views column in the Pages table.
It seems like a perfectly reasonable breaking of normalization to me, especially since I can't imagine you deleting views, so you wouldn't expect the count to get out of whack. Referential integrity doesn't seem super-critical in this case.
Denormalizing would definitely work in this case. Your only loss is the extra storage used up by the extra column.
Alternatively, you could set up a scheduled job to populate this information on a nightly basis, or every x period of time while your traffic is low.
In this case you would lose the ability to instantly know your page counts unless you run this query manually.
Denormalization can definitely be employed to increase performance.
--Kris
While this is an old question, I'd like to add my answer because I find the accepted one to be misguided.
It is one thing to compute the COUNT for a single selected row; it is quite another to sort by the COUNT across ALL rows.
Even if you have just 1,000 rows, each counted with some join, you can easily end up reading tens of thousands, if not millions, of rows.
It can be OK if you only run this occasionally, but it is very costly otherwise.
What you can do is add a TRIGGER. A concrete version, using the pages/views schema from this question (with the denormalized views column on pages):

CREATE TRIGGER ins AFTER INSERT ON views
FOR EACH ROW
    UPDATE pages
    SET views = views + 1
    WHERE pageid = NEW.pageid;