I have a DB with a lot of records (of articles), and currently I keep track of how many times each record has been viewed by counting the views, so I can sort on something like "see the top 5 most viewed articles".
This is done with a column of integers, and whenever the record is retrieved, the integer count increases by 1.
This works fine but since the counting system is very simple, I can only see views of "all time".
I would like to have something like "see the top 5 most viewed articles this week".
The only way I can think of is to have a separate table that records the article ID and date whenever an article is viewed, and then run a SELECT statement over a limited time period.
This could easily work, but the table would grow very large in no time.
Is there any better way of accomplishing the same thing? I've seen this sorting criterion on many websites, but I don't know how it is achieved.
Any thoughts or comments?
Thanks in advance :)
Instead of a row for each view of each article, you could have a row per day. When an article is viewed, you would do the following (this relies on a unique key over (article_id, date)):
INSERT INTO article_views (article_id, date, views)
VALUES (#article, CURRENT_DATE(), 1)
ON DUPLICATE KEY UPDATE views = views + 1;
Then to get the top 5 articles viewed in the past week:
SELECT article_id, SUM(views) total_views
FROM article_views
WHERE date > NOW() - INTERVAL 7 DAY
GROUP BY article_id
ORDER BY total_views DESC
LIMIT 5
To keep the table from growing too large, you can delete old records periodically.
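The whole pattern can be sketched end to end. The snippet below uses Python's stdlib sqlite3 as a runnable stand-in for MySQL: MySQL's ON DUPLICATE KEY UPDATE becomes SQLite's ON CONFLICT ... DO UPDATE, and the dates are hard-coded sample values rather than CURRENT_DATE().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE article_views (
        article_id INTEGER,
        date TEXT,
        views INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (article_id, date)  -- the upsert relies on this key
    )
""")

def record_view(article_id, date):
    # SQLite's equivalent of MySQL's ON DUPLICATE KEY UPDATE
    conn.execute("""
        INSERT INTO article_views (article_id, date, views)
        VALUES (?, ?, 1)
        ON CONFLICT (article_id, date) DO UPDATE SET views = views + 1
    """, (article_id, date))

# simulate some traffic
for _ in range(3):
    record_view(1, "2024-01-10")
record_view(2, "2024-01-10")
record_view(1, "2024-01-11")

# top 5 articles over a date range (the past week, say)
top = conn.execute("""
    SELECT article_id, SUM(views) AS total_views
    FROM article_views
    WHERE date >= ?
    GROUP BY article_id
    ORDER BY total_views DESC
    LIMIT 5
""", ("2024-01-08",)).fetchall()
print(top)  # [(1, 4), (2, 1)]

# periodic cleanup of rows older than some cutoff
conn.execute("DELETE FROM article_views WHERE date < ?", ("2023-10-01",))
```

One row per article per day keeps the table roughly 365× smaller than one row per view, and the cleanup DELETE bounds it further.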
How would I solve this question in SQL (taken from Glassdoor)?
You have a table with date, user_id, song_id and count. At the end of each day, it shows how many times in her history a user has listened to a given song, so count is a cumulative sum.
You have to update this on a daily basis based on a second table that records in real time when a user listens to a given song. Basically, at the end of each day, you go to this second table, pull a count of each user/song combination, and add this count to the first table that holds the lifetime count.
In particular, I do not know how to update a table in such a bulk/massive way, and would appreciate the MySQL code to achieve something like that. I haven't written any code because I do not know how to do such a large-scale addition efficiently.
It sounds like you don't add anything to the existing counts; you insert a new record for each day with the total play count as of that day. Old records in the history table are not updated.
At the end of each day, you run this:
INSERT INTO playhistory
SELECT CURDATE(), user_id, song_id, count(*)
FROM individualplays
GROUP BY user_id, song_id
The individualplays table holds the user and song IDs for all time. If a new user plays the same song 10 times today, the count(*) will be 10. Tomorrow, if she plays that song another 5 times, the count will be 15.
If you cannot guarantee to run the query right at the end of the day, your individualplays table needs the date and time that a song was played; then, at any time the day after, you can update your history table thus:
INSERT INTO playhistory
SELECT DATE_SUB(CURDATE(), INTERVAL 1 DAY), user_id, song_id, count(*)
FROM individualplays
WHERE playdate < CURDATE()
GROUP BY user_id, song_id
It's a shame you're using MySQL, actually, because more powerful RDBMSs can build the history entirely out of the individualplays table, dynamically, through analytic/window functions: devices that can do things like counting all the rows from the start of time to the current row, per user/song. (As of MySQL 8.0, window functions are supported there too.) You can simulate them in older MySQL, but it's pretty nasty; it basically involves joining the individualplays table to itself on userid = userid, songid = songid and playdate.
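For reference, here is what that windowed running count looks like. The sketch uses Python's stdlib sqlite3 (SQLite 3.25+ supports window functions); the table and column names follow the answer above, and the sample play data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE individualplays (playdate TEXT, user_id INTEGER, song_id INTEGER)")
conn.executemany("INSERT INTO individualplays VALUES (?, ?, ?)", [
    ("2024-01-01", 1, 100),
    ("2024-01-01", 1, 100),  # user 1 plays song 100 twice on day 1
    ("2024-01-02", 1, 100),  # and once more on day 2
    ("2024-01-02", 2, 100),  # user 2's only play
])

# Lifetime count per user/song as of each day, computed on the fly:
# the frame counts every play from the start of time through the current day.
rows = conn.execute("""
    SELECT DISTINCT playdate, user_id, song_id,
           COUNT(*) OVER (
               PARTITION BY user_id, song_id
               ORDER BY playdate
               RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS lifetime_count
    FROM individualplays
    ORDER BY user_id, playdate
""").fetchall()
print(rows)
# [('2024-01-01', 1, 100, 2), ('2024-01-02', 1, 100, 3), ('2024-01-02', 2, 100, 1)]
```

The same COUNT(*) OVER (PARTITION BY ... ORDER BY ...) syntax works directly in MySQL 8.0, so no history table or self-join is needed there at all.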
I have a basic table which holds a record of which user viewed which list.
Each time a list is viewed, a record is stored into the "views" table, storing user_ID and list_ID, along with the time at which it is stored.
I want to know, for each user, when they last viewed a list, and which list it was.
I'm kinda stuck here. It gives me all the latest times, in order, but for some users I get multiple records.
How can I sort this out?
http://sqlfiddle.com/#!2/39f41/4
SELECT user_ID, list_ID, max(times) as times
FROM views
GROUP BY user_ID, list_ID;
or, if you want the view_ID returned as well, and assuming that a larger view_ID always corresponds to a later time, you can use:
SELECT max(view_ID) as view_ID,user_ID, list_ID, max(times) as times
FROM views
GROUP BY user_ID, list_ID;
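Note that grouping by (user_ID, list_ID) still returns one row per user/list pair. If you want exactly one row per user (their single most recent view, whichever list it was), you can join the table back to each user's maximum time. A runnable sketch with Python's stdlib sqlite3, using invented sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE views (
        view_ID INTEGER PRIMARY KEY,
        user_ID INTEGER,
        list_ID INTEGER,
        times   TEXT
    )
""")
conn.executemany("INSERT INTO views VALUES (?, ?, ?, ?)", [
    (1, 1, 10, "2024-01-01 09:00:00"),
    (2, 1, 20, "2024-01-02 09:00:00"),  # user 1's latest view is list 20
    (3, 2, 10, "2024-01-01 12:00:00"),  # user 2's only view
])

# Join each user's rows to that user's MAX(times),
# so only the single latest row per user survives.
rows = conn.execute("""
    SELECT v.user_ID, v.list_ID, v.times
    FROM views v
    JOIN (SELECT user_ID, MAX(times) AS times
          FROM views
          GROUP BY user_ID) latest
      ON v.user_ID = latest.user_ID AND v.times = latest.times
    ORDER BY v.user_ID
""").fetchall()
print(rows)
# [(1, 20, '2024-01-02 09:00:00'), (2, 10, '2024-01-01 12:00:00')]
```

The same derived-table join works unchanged in MySQL.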
I would like to run some queries in MySQL that I know will be really slow:
I have 3 tables:
Users:
id, username, email
Question:
id, date, question
Answer
id_question, id_user, response, score
I would like to compute some statistics, like the top X users with the best score (the sum of all their scores), either for all time or for a given period (last month, for example). Or it could be the users ranked 100th to 110th.
I will have thousands of users and hundreds of questions, so the queries could be very slow, since I'll need to order by the sum of scores, limit to a given range, and sometimes select only some questions depending on the date.
I would like to know if there are methods to optimize these queries!
If you have a lot of data, there is no other choice: you can only optimize by creating a new table that summarizes the data every day, week, or month. Maybe summarize scores per user by week, stamped with that week's date, or by month. The longer the period each summary row covers, the faster your query runs.
For archived statistics, you can create tables that store rankings that won't move (last year, last month, last day). Try to precalculate as many statistics as possible in such tables, and put indexes on id_user, date, type_of_ranking...
Try to limit subqueries as much as possible.
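Before reaching for summary tables, it's worth writing the direct query, since thousands of users and hundreds of questions is not actually much data. A sketch of the ranking query over the three tables from the question, using Python's stdlib sqlite3 with invented sample data (the same SQL runs in MySQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT, email TEXT);
    CREATE TABLE question (id INTEGER PRIMARY KEY, date TEXT, question TEXT);
    CREATE TABLE answer (id_question INTEGER, id_user INTEGER,
                         response TEXT, score INTEGER);
""")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                 [(1, "ann", "a@x"), (2, "bob", "b@x")])
conn.executemany("INSERT INTO question VALUES (?, ?, ?)",
                 [(1, "2024-01-05", "q1"), (2, "2024-02-05", "q2")])
conn.executemany("INSERT INTO answer VALUES (?, ?, ?, ?)",
                 [(1, 1, "a", 5), (2, 1, "b", 2), (2, 2, "c", 4)])

# Top users by total score for questions in a date window.
# OFFSET gives ranges like "users 100 to 110" (LIMIT 10 OFFSET 99).
top = conn.execute("""
    SELECT u.username, SUM(a.score) AS total
    FROM answer a
    JOIN question q ON q.id = a.id_question
    JOIN users u ON u.id = a.id_user
    WHERE q.date >= ? AND q.date < ?
    GROUP BY u.id
    ORDER BY total DESC
    LIMIT 10 OFFSET 0
""", ("2024-02-01", "2024-03-01")).fetchall()
print(top)  # [('bob', 4), ('ann', 2)]
```

With indexes on answer(id_user) and answer(id_question), plus question(date), this stays fast well beyond the data sizes in the question; summary tables only become necessary at much larger scale.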
I'm creating a site where all of the users have a score that is updated everyday. I can easily create rankings from this score, however I'd like to be able to create a "Hot" list of the week or month, etc..
My brute-force design would be: each day, for every user, calculate their score and insert it into a "Scores" table. So every day the Scores table would grow by the number of users. I could then rank users by their score deltas over whatever time period.
While I believe this would technically work, I feel like there has to be a more sophisticated way of doing this, right? Or not? I feel like a Scores table that grows every day by the number of users can't be how other sites are doing it.
You get the most flexibility by not storing any snapshots of score at all. Instead, record incremental scores, as they happen.
If you have tables like this:
USER
user_id
name
personal_high_score
{anything else that you store once per user}
SCORE_LOG
score_log_id
user_id (FK to USER)
date_time
scored_points
Now you can get a cumulative score for a user as of any point in time with a simple query like:
select sum(scored_points)
from SCORE_LOG
where user_id = #UserID
and date_time <= #PointInTime
You can also easily get top ranking scorers for a time period with something like:
select
user_id
, sum(scored_points)
from SCORE_LOG
where date_time >= #StartOfPeriod
and date_time <= #EndOfPeriod
group by
user_id
order by
sum(scored_points) desc
limit 5
If you get to production and find that you're having performance issues in practice, then you could consider denormalizing a snapshot of whatever statistics make sense. The problem with these snapshot statistics is that they can get out of sync with your source data, so you'll need a strategy for recalculating the snapshots periodically.
It's pretty much a truism (consider it a corollary of Murphy's Law) that if you have two sources of truth you'll eventually end up with two "truths".
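Both queries above can be exercised against a tiny SCORE_LOG. A runnable sketch using Python's stdlib sqlite3 with invented timestamps and point values (lowercase table name; otherwise the schema follows the answer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE score_log (
        score_log_id  INTEGER PRIMARY KEY,
        user_id       INTEGER,
        date_time     TEXT,
        scored_points INTEGER
    )
""")
conn.executemany(
    "INSERT INTO score_log (user_id, date_time, scored_points) VALUES (?, ?, ?)",
    [(1, "2024-01-01 10:00", 10),
     (1, "2024-01-15 10:00", 5),
     (2, "2024-01-16 10:00", 8)])

# cumulative score for user 1 as of mid-January (only the first event counts)
total = conn.execute("""
    SELECT COALESCE(SUM(scored_points), 0)
    FROM score_log
    WHERE user_id = ? AND date_time <= ?
""", (1, "2024-01-10 00:00")).fetchone()[0]
print(total)  # 10

# top scorers within a period (second half of January here)
top = conn.execute("""
    SELECT user_id, SUM(scored_points) AS pts
    FROM score_log
    WHERE date_time >= ? AND date_time <= ?
    GROUP BY user_id
    ORDER BY pts DESC
    LIMIT 5
""", ("2024-01-10 00:00", "2024-01-31 00:00")).fetchall()
print(top)  # [(2, 8), (1, 5)]
```

Because every point award is an immutable row, any period (day, week, month, arbitrary range) is just a different WHERE clause; there is a single source of truth.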
Barranka was on the right track with his comment, you need to make sure you are not duplicating any of the data wherever possible.
However, if you want to be able to revert to an old user score, or to pick out a day and see who was on top at a certain point (i.e. dynamic reporting), then you will need to record each score separately alongside a date. A separate table would be useful here, as you could derive the daily score from the existing user data via SQL and insert it into the table whenever you want.
The decision you have to make is how many user records you want to keep in the history, and for how long. I have written the below with the idea that the "hot list" is the top 5 users; you could have a CRON job or scheduled task running each day/month to run the inserts and also clean out very old data.
Users
id
username
score
score_ranking
id
user_id (we normalise by using the id rather than all the user info)
score_at_the_time
date_of_ranking
So to generate a single day's ranking you could insert into this table, something like:
INSERT INTO
`score_ranking` (`user_id`, `score_at_the_time`, `date_of_ranking`)
SELECT
`id`, `score`, CURDATE()
FROM
`users`
ORDER BY
`score` DESC
LIMIT
5
To read the data for a specific date (or date range) you could then do:
SELECT * FROM score_ranking
WHERE date_of_ranking = 'somedate'
ORDER BY score_at_the_time DESC
I have a comments table with almost 2 million rows. We receive roughly 500 new comments per day. Each comment is assigned to a specific ID. I want to grab the most popular "discussions" based on that ID.
I have an index on the ID column.
What is best practice? Do I just group by this ID and then sort by the ID who has the most comments? Is this most efficient for a table this size?
Do I just group by this ID and then sort by the ID who has the most comments?
That's pretty much how I would do it. Let's assume you want to retrieve the top 50:
SELECT id
FROM comments
GROUP BY id
ORDER BY COUNT(1) DESC
LIMIT 50
If your users execute this query frequently and you find it's not running as fast as you'd like, one way to optimize it is to store the result of the above query in a separate table (topdiscussions), and have a script or cron job that runs every five minutes or so to update that table.
Then in your application, just have your users select from the topdiscussions table so that they only need to select from 50 rows rather than 2 million.
The downside, of course, is that the selection will no longer be real-time, but rather out of sync by up to five minutes, or however often you update the table. How real-time you actually need it to be depends on the requirements of your system.
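The cache-table refresh is a short script. A sketch using Python's stdlib sqlite3 in place of MySQL; the topdiscussions schema here (id, num_comments) is an assumption, as the answer doesn't specify one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE comments (id INTEGER, body TEXT);
    CREATE TABLE topdiscussions (id INTEGER, num_comments INTEGER);
""")
conn.executemany("INSERT INTO comments VALUES (?, ?)",
                 [(1, "a"), (1, "b"), (2, "c"), (1, "d"), (3, "e")])

def refresh_topdiscussions():
    # What the cron job would run every few minutes:
    # rebuild the small cache table from the big comments table.
    conn.execute("DELETE FROM topdiscussions")
    conn.execute("""
        INSERT INTO topdiscussions (id, num_comments)
        SELECT id, COUNT(1)
        FROM comments
        GROUP BY id
        ORDER BY COUNT(1) DESC
        LIMIT 50
    """)

refresh_topdiscussions()
rows = conn.execute(
    "SELECT id, num_comments FROM topdiscussions "
    "ORDER BY num_comments DESC").fetchall()
print(rows[0])  # (1, 3)
```

The application then reads at most 50 rows from topdiscussions instead of aggregating 2 million comments on every page load.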
Edit: As per your comments to this answer, I know a little more about your schema and requirements. The following query retrieves the discussions that are the most active within the past day:
SELECT a.id, etc...
FROM discussions a
INNER JOIN comments b ON
a.id = b.discussion_id AND
b.date_posted > NOW() - INTERVAL 1 DAY
GROUP BY a.id
ORDER BY COUNT(1) DESC
LIMIT 50
I don't know your field names, but that's the general idea.
If I understand your question, the ID indicates the discussion to which a comment is attached. So, first you would need some notion of most popular.
1) Initialize a "Comment total" table by counting up comments by ID and setting a column called 'delta' to 0.
2) Periodically
2.1) Count the comments by ID
2.2) Subtract the old count from the new count and store the value into the delta column.
2.3) Replace the count of comments with the new count.
3) Select the 10 'hottest' discussions by selecting 10 rows from the comment total table in descending order of delta.
Now the rest is trivial. That's just the comments whose discussion ID matches the ones you found in step 3.
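The steps above can be sketched as a single periodic upsert. This uses Python's stdlib sqlite3 as a stand-in for MySQL (the equivalent MySQL statement would use INSERT ... ON DUPLICATE KEY UPDATE); the comment_total column names are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE comments (discussion_id INTEGER);
    CREATE TABLE comment_total (
        discussion_id INTEGER PRIMARY KEY,
        cnt           INTEGER,
        delta         INTEGER
    );
""")

def refresh_totals():
    # Steps 2.1-2.3: take the new count per discussion, set
    # delta = new count - old count, then store the new count.
    # (SQLite needs the "WHERE true" when an upsert follows a SELECT.)
    conn.execute("""
        INSERT INTO comment_total (discussion_id, cnt, delta)
        SELECT discussion_id, COUNT(*), 0
        FROM comments
        WHERE true
        GROUP BY discussion_id
        ON CONFLICT (discussion_id) DO UPDATE SET
            delta = excluded.cnt - comment_total.cnt,
            cnt = excluded.cnt
    """)

# first run: initialize the totals (step 1, deltas are 0)
conn.executemany("INSERT INTO comments VALUES (?)", [(1,), (1,), (2,)])
refresh_totals()

# later: discussion 2 gains more comments than discussion 1
conn.executemany("INSERT INTO comments VALUES (?)", [(2,), (2,), (1,)])
refresh_totals()

# step 3: hottest discussions = largest delta since the last refresh
hottest = conn.execute("""
    SELECT discussion_id, delta FROM comment_total
    ORDER BY delta DESC LIMIT 10
""").fetchall()
print(hottest)  # [(2, 2), (1, 1)]
```

Discussion 2 gained two comments between refreshes versus one for discussion 1, so it ranks as hotter even though both have three comments in total, which is exactly the point of tracking deltas rather than raw counts.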