I have a comments table with almost 2 million rows. We receive roughly 500 new comments per day. Each comment is assigned to a specific ID. I want to grab the most popular "discussions" based on that ID.
I have an index on the ID column.
What is best practice? Do I just group by this ID and then sort by which ID has the most comments? Is this the most efficient approach for a table this size?
Do I just group by this ID and then sort by the ID who has the most comments?
That's pretty much how I would do it. Let's assume you want to retrieve the top 50:
SELECT id
FROM comments
GROUP BY id
ORDER BY COUNT(1) DESC
LIMIT 50
If your users are executing this query frequently and it's not running as fast as you'd like, one way to optimize it is to store the result of the above query in a separate table (topdiscussions), and have a script or cron job run every five minutes or so to update that table.
Then in your application, just have your users select from the topdiscussions table so that they only need to select from 50 rows rather than 2 million.
The downside, of course, is that the selection will no longer be in real time, but rather out of sync by up to five minutes, or however often you update the table. How real-time you actually need it to be depends on the requirements of your system.
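If you go that route, the cron job itself can be a couple of plain SQL statements. A minimal sketch, assuming a topdiscussions table that just stores the winning IDs and their counts:

-- one-time setup
CREATE TABLE topdiscussions (
    id INT NOT NULL PRIMARY KEY,
    comment_count INT NOT NULL
);

-- run from cron every five minutes or so
TRUNCATE TABLE topdiscussions;
INSERT INTO topdiscussions (id, comment_count)
SELECT id, COUNT(1)
FROM comments
GROUP BY id
ORDER BY COUNT(1) DESC
LIMIT 50;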
Edit: As per your comments to this answer, I know a little more about your schema and requirements. The following query retrieves the discussions that are the most active within the past day:
SELECT a.id, etc...
FROM discussions a
INNER JOIN comments b ON
a.id = b.discussion_id AND
b.date_posted > NOW() - INTERVAL 1 DAY
GROUP BY a.id
ORDER BY COUNT(1) DESC
LIMIT 50
I don't know your field names, but that's the general idea.
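If that query turns out to be slow, a composite index should let MySQL apply the date filter cheaply within each discussion. A sketch, assuming the column names used above:

ALTER TABLE comments ADD INDEX idx_discussion_date (discussion_id, date_posted);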
If I understand your question, the ID indicates the discussion to which a comment is attached. So, first you would need some notion of most popular.
1) Initialize a "Comment total" table by counting up comments by ID and setting a column called 'delta' to 0.
2) Periodically
2.1) Count the comments by ID
2.2) Subtract the old count from the new count and store the value into the delta column.
2.3) Replace the count of comments with the new count.
3) Select the 10 'hottest' discussions by selecting 10 rows from the comment total table in descending order of delta.
Now the rest is trivial. That's just the comments whose discussion ID matches the ones you found in step 3.
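A rough sketch of those steps in SQL; the comment_totals table and column names are assumptions, and step 2 would be run from cron:

-- Step 1: initialize totals, with delta starting at 0
CREATE TABLE comment_totals (
    id INT NOT NULL PRIMARY KEY,  -- the discussion ID
    total INT NOT NULL DEFAULT 0,
    delta INT NOT NULL DEFAULT 0
);

-- Step 2: periodically recount; delta is assigned before total is
-- overwritten (MySQL evaluates these assignments left to right)
INSERT INTO comment_totals (id, total)
SELECT id, COUNT(1) FROM comments GROUP BY id
ON DUPLICATE KEY UPDATE
    delta = VALUES(total) - total,
    total = VALUES(total);

-- Step 3: the 10 'hottest' discussions, with their comments
SELECT c.*
FROM comments c
JOIN (SELECT id FROM comment_totals ORDER BY delta DESC LIMIT 10) hot
  ON c.id = hot.id;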
Related
Imagine I would like to develop something similar to Tinder. I have a database with roughly 170k rows (=persons) and I would like to present them on my website. After the user's response, the next person is shown etc.
Once a person has been shown, this is marked in the column 'seen' with a 1. The order in which the persons are shown should be random and only persons that have not been seen yet should be shown.
At the moment, I have this solution. However, this is rather slow and takes too much time for a smooth experience. What would be a more efficient approach to this problem? What is the gold standard for such problems?
SELECT * FROM data WHERE (seen = 0) ORDER BY RAND() LIMIT 1
Add a non-clustered index covering the seen column and the PK column; this will speed up that query.
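That index might look like this (table and column names taken from the question):

ALTER TABLE data ADD INDEX idx_seen_id (seen, id);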
If the primary id is sequential and you know the minimum and maximum values of the records, you can pick a random number in that range and query like:
SELECT *
FROM data
WHERE seen = 0 AND id >= random_id  -- random_id computed beforehand, between MIN(id) and MAX(id)
ORDER BY id
LIMIT 1
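A self-contained sketch of that idea, using a session variable so the random id is computed only once; note that rows following long gaps of already-seen ids will be picked slightly more often:

SELECT FLOOR(MIN(id) + RAND() * (MAX(id) - MIN(id)))
INTO @random_id
FROM data
WHERE seen = 0;

SELECT *
FROM data
WHERE seen = 0 AND id >= @random_id
ORDER BY id
LIMIT 1;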
Say you have a table with n rows, what is the most efficient way to get the first row ever recorded on that table without sorting?
This is guaranteed to work, but becomes slower as the number of records increases:
SELECT * FROM posts ORDER BY created_at ASC LIMIT 1;
UPDATE:
This is even better in case there are multiple records with the same created_at value, but still needs sorting:
SELECT * FROM posts ORDER BY id ASC LIMIT 1;
Imagine a ledger book with 1 million pages and 1 billion lines of records, to get the first ever record, you'd simply turn to the first page and get the one on the top most, right? Regardless of the size of the ledger, you should get the first ever record with the same efficiency. I was hoping I could do the same in MySQL without doing any kind of sorting or ordering. For research purposes. I mean, why not? Why can't MySQL? Is it impossible by design?
This is possible in typical array structures in programming:
array = [1,2,3,4,5]
The first element is in array[0], the second in array[1] and so on. There is no sorting necessary. The last element is array[array_count(array)-1].
I can offer the following two queries to find the earliest record:
SELECT * FROM posts ORDER BY created_at ASC LIMIT 1
and
SELECT *
FROM posts
WHERE created_at = (SELECT MIN(created_at) FROM posts)
Both queries would suffer from performance degradation as the table gets larger, because the sorting operation needed to find the earliest created date would take more time.
But in both cases, adding the following index should improve the performance of the query:
ALTER TABLE posts ADD INDEX created_idx (created_at)
MySQL can use an index both for the ORDER BY clause and when finding the min. See the documentation for more information.
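You can verify this with EXPLAIN; with the index in place, neither query should need a filesort:

EXPLAIN SELECT * FROM posts ORDER BY created_at ASC LIMIT 1;
EXPLAIN SELECT MIN(created_at) FROM posts;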
I have a DB with a lot of records (of articles), and currently I keep track of how many times each record has been viewed by counting the views, so I can sort on something like "see the top 5 most viewed articles".
This is done with a column of integers, and whenever the record is retrieved, the integer count increases by 1.
This works fine but since the counting system is very simple, I can only see views of "all time".
I would like to have something like "see the top 5 most viewed articles this week".
The only way I can think of is to have a separate table which makes a record with the article Id and Date whenever an article is viewed, and then make a SELECT statement for a limited time period.
This could easily work, but at the same time the table would grow very large in no time.
Is there any better way of accomplishing the same thing? I've seen this sorting criterion on many websites, but I don't know how it is achieved.
Any thoughts or comments?
Thanks in advance :)
Instead of a row for each view of each article, you could have a row per day. When an article is viewed, you would do:
INSERT INTO article_views (article_id, date, views)
VALUES (#article, CURRENT_DATE(), 1)
ON DUPLICATE KEY UPDATE views = views + 1;
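Note that ON DUPLICATE KEY UPDATE needs a unique key covering the article/date pair to detect the duplicate. A minimal table sketch, with names assumed from the query above:

CREATE TABLE article_views (
    article_id INT NOT NULL,
    date DATE NOT NULL,
    views INT NOT NULL DEFAULT 0,
    PRIMARY KEY (article_id, date)
);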
Then to get the top 5 articles viewed in the past week:
SELECT article_id, SUM(views) total_views
FROM article_views
WHERE date > NOW() - INTERVAL 7 day
GROUP BY article_id
ORDER BY total_views DESC
LIMIT 5
To keep the table from growing too large, you can delete old records periodically.
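For example, a cron job could prune anything older than the longest window you report on. A sketch, assuming a 30-day horizon:

DELETE FROM article_views WHERE date < CURRENT_DATE() - INTERVAL 30 DAY;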
I am trying to extract a random article that has a picture from a database.
SELECT FLOOR(MAX(id) * RAND()) FROM `table` WHERE `picture` IS NOT NULL
My table is 33 MB and has 1,006,394 articles, but just 816 with pictures.
My problem is that this query takes 0.4640 sec.
I need this to be much, much faster.
Any idea is welcome.
P.S.
1. Of course I have an index on id.
2. There is no index on the picture field. Should I add one?
3. The product name is unique, as is the product number, but that's out of the question.
RESULTS OF THE TESTING SESSION:
@cHao's solution is faster when I use it to select one of the random entries with a picture (less than 0.1 sec).
But it's slower if I try to do the opposite, selecting a random article without a picture: 2-3 sec.
@Kickstart's solution is a bit slower when trying to find an entry with a picture, but is almost the same speed when trying to find an entry without a picture: 0.149 sec on average.
@bob-kruithof's solution doesn't work for me:
when trying to find an entry with a picture, it selects an entry without a picture.
And @ganesh-bora, yes, you are right; in my case the speed difference is about 5-15 times.
I want to thank you all for your help, and I decided on @Kickstart's solution.
You need to get a range of values with matching records and then find a matching record within that range.
Something like this:
SELECT r1.id
FROM `table` AS r1
INNER JOIN (
SELECT RAND( ) * ( MAX( id ) - MIN( id ) ) + MIN( id ) AS id
FROM `table`
WHERE `picture` IS NOT NULL
) AS r2
ON r1.id >= r2.id
WHERE `picture` IS NOT NULL
ORDER BY r1.id ASC
LIMIT 1
However, for any hope of efficiency you need an index on the field being checked (i.e., picture in your example).
Just an explanation of how this works.
The sub select finds a random id from the table which is between the min and max ids for records for a picture. This random id may or may not be for a picture.
The resulting id from this sub select is joined back against the main table, but using >= and with a WHERE clause specifying that the record is a picture record. Hence it joins against all picture records where the id is greater than or equal to the random id. The highest random id will be the one for the picture record with the highest id, so it will always find a record (if there are any picture records). The ORDER BY / LIMIT is then used to bring back that single id.
Note that there is an obvious flaw to this, but most of the time it will be irrelevant. The record retrieved may not be entirely random. The picture with the lowest id is unlikely to be returned (will only be returned if the RAND() returns exactly 0), but if this is important this is easy enough to fix by rounding the resulting random id. The other flaw is that if the ids are not vaguely equally distributed in the full range of ids then some will be returned more often than others. For example, take the situation where the first 1000 ids were pictures, then no more until the last (33 millionth) record. The random id could be any of those 33 million, but unless it is less than or equal to 1000 then it will be the 33 millionth record that will be returned.
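The rounding fix mentioned above would only change the sub-select, for example:

SELECT FLOOR(RAND() * (MAX(id) - MIN(id) + 1)) + MIN(id) AS id
FROM `table`
WHERE `picture` IS NOT NULL

This makes the random id a whole number in the full [min, max] range, so the picture with the lowest id can now also be returned.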
You might try attaching a random number to each row, then sorting by that. The row with the lowest number will be at the top.
SELECT `table`.`id`, RAND() as `order`
FROM `table`
WHERE `picture` IS NOT NULL
ORDER BY `order`
LIMIT 1;
This is of course slower than just magicking up an ID with RAND(), but (1) it'll always give you a valid ID (as long as there's a record with a non-null picture field in the table, anyway), and (2) the WTF ratio is pretty low; most people can tell what's going on here. :) Its performance rivals Kickstart's solution with a decently indexed table, when the number of items to select from is relatively small (around 1%). Definitely don't try to select from a whole huge table like this; limit it first with a WHERE clause on some indexed field(s).
Performance-wise, if you have a long-running app (i.e., not PHP; I'm talking about Java, .NET, etc., where the app stays alive even between requests), you might try to keep a list of all the IDs of items with pictures, select a random ID from that list, and load the article. You could do that in PHP too, if you wanted. It might not work as well when you have to query all the IDs each time, but it could be very useful if you can cache the list of IDs in APC or something.
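Building that cached list only takes one query; the random pick then happens in application code:

-- run once at startup (or on a cache miss) and keep the result in
-- APC/memcached; choose a random element of the list application-side
SELECT id FROM `table` WHERE `picture` IS NOT NULL;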
For performance, you can first add an index on the picture column so that the ~816 records with pictures can be isolated without scanning the whole table, and then fire your query.
How has someone else solved the problem?
I would suggest looking at this article about different possible ways of selecting random rows in MySQL.
Modified example from the article
SELECT name
FROM random JOIN
( SELECT CEIL( RAND() * (
SELECT MAX( id ) FROM random WHERE picture IS NOT NULL
) ) AS id ) AS r2 USING ( id );
This might work in your case.
Efficiency
As user Kickstart mentioned: do you have an index on the column picture? This might help get you the results a bit faster.
Are your tables optimized?
I'm working on a political application for a client, and I've got two database tables. One has all the registered voters in a certain city, and the other has all the votes for each registered voter. Combined, these two tables number well over 7 million records. This site is coded on CakePHP.
I'm already narrowing the results by a number of criteria, but I need to filter it also based on the percentage of elections that a given voter has voted in since they registered. I have all the votes, the year they registered, and that there are 3 elections every 4 years. I've tried doing a subquery to filter the results, but it took far too long. It took 10 minutes to return 10 records. I have to do this in a join some way, but I'm not at all versed in joins.
This is basically what I need to do:
SELECT * FROM voters
WHERE (number of votes voter has) >= (((year(now())-(registration_year) ) * 3/4)
* (percentage needed))
All of that is pretty straight-forward. The trick is counting the votes the voter has from the votes database. Any ideas?
Either create another table, or extend your first table (the one containing voter information, but not their votes) with two columns, #votes and registrationAge. Then you can update this table by scanning the 'votes' table once; every time you encounter a vote, just increase the count.
I wouldn't try to calculate this as part of your query
In a case where this info will only change 3 times in four years, I'd add the voted % field to the voter table and calculate it once after each election. Then you can simply filter by the field.
You can add a vote_count field to the voters table and do an update count on that. You might want to do it in a straight SQL query: Aggregate function in an SQL update query?
Also, I'm not sure if MySQL is smart enough to optimize this, but don't use year(now()): you can either get that value in PHP, or just hard-code it each time you run it (you probably don't need to run it too often).
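A sketch of that counting update, assuming a vote_count column on voters and a voterid column on votes:

-- populate the denormalized counter in one pass over votes
UPDATE voters
JOIN (SELECT voterid, COUNT(*) AS cnt
      FROM votes
      GROUP BY voterid) v ON voters.id = v.voterid
SET voters.vote_count = v.cnt;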
How about this:
SELECT voters.* FROM voters
LEFT JOIN (SELECT COUNT(voterid) AS votes, voterid AS id
           FROM votes
           GROUP BY voterid) AS a
ON voters.id = a.id
WHERE a.votes >= ((year(now()) - voters.registration_year) * 3/4 * percentage)  -- percentage supplied by the application
I would recommend creating a view, then modeling your view to fetch the data.
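A sketch of such a view, assuming the column names from the answer above; CakePHP can then be pointed at it like an ordinary table:

CREATE VIEW voter_vote_counts AS
SELECT voters.id, voters.registration_year, COUNT(votes.voterid) AS votes
FROM voters
LEFT JOIN votes ON voters.id = votes.voterid
GROUP BY voters.id, voters.registration_year;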