Database design for hourly, daily, weekly ranking? - mysql

I've been trying to figure out a good way to handle a ranking system of this sort. As a rough example, I would like to query a facebook page and grab the likes and comments of each post. Then, there would be three rankings based on a time interval. To give a simplified example:
Hourly
- I pull all the posts updated within the last hour and compare the # of likes/comments to my previous entry (the last pull being an hour prior).
Daily
- I pull down all posts within a 24-hour date range and compare the # of likes/comments to the previous entry. "Post X had 12 more likes and 40 more comments today compared to yesterday"
Weekly
- I pull down all posts within a week's range and do the same as above. "Post X had no new likes, but 10 more comments added this week compared to last week"
In terms of the DB tables, what would be a good way to handle this? Would it make sense to have one giant table with the posts (title, comments_previous, comments_current, likes_previous, likes_current, etc)?
Thank you!

Columns: (PK)timestamp, (index)pageid, count. Set a new timestamp every hour on the hour for pages that are liked. Timestamp is the PK so that you don't get horrible fragmentation from your clustered index / page layout in the database.
If you feel for performance reasons that you need to de-normalize, you can make additional daily and monthly tables that are rolled-up summations. Likely, you'll be able to efficiently generate what you need without the rollup tables by using where clauses on the time / pageid combination, thereby giving you what you need with just one table.
Purge old data as you see fit, or keep it.
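As a rough sketch (the table and column names here are mine, not prescribed above), with the hour leading a composite primary key so each page gets one row per hour:
-- Sketch only: one row per (hour, page); timestamp first so inserts append in clustered-index order.
CREATE TABLE hourly_page_likes (
    ts      DATETIME     NOT NULL,   -- the hour, e.g. '2015-06-01 13:00:00'
    pageid  INT UNSIGNED NOT NULL,
    `count` INT UNSIGNED NOT NULL DEFAULT 0,
    PRIMARY KEY (ts, pageid),
    KEY idx_pageid (pageid)
) ENGINE=InnoDB;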
Clarification
When a comment receives a like, do the following:
INSERT INTO likeRanking (hour_ts, commentid, score)   -- column names assumed; hour_ts is the hour bucket
VALUES (CONCAT(LEFT(NOW(), 13), ':00:00'), ?, 1)       -- LEFT(NOW(), 13) yields 'YYYY-MM-DD HH'
ON DUPLICATE KEY UPDATE score = score + 1;

I would do this as follows:
Create a table that captures the time now, the comment count now, and the like count now.
Then, after an hour, create another table that captures the time, comments, and likes at that point, and subtract the previously created table from it. Then drop the old table and keep the new values as the baseline. After another hour, create another table and repeat.
Do the same for the monthly and yearly intervals.
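A rough sketch of that snapshot-and-diff idea, with hypothetical table names (stats_prev, stats_now) and assuming the source posts table exposes comments and likes counts:
-- Take the new snapshot.
CREATE TABLE stats_now AS
    SELECT post_id, NOW() AS snap_time, comments, likes FROM posts;
-- Compare against the snapshot taken an hour earlier.
SELECT n.post_id,
       n.comments - p.comments AS new_comments,
       n.likes    - p.likes    AS new_likes
FROM stats_now AS n
JOIN stats_prev AS p USING (post_id);
-- Keep the new snapshot as the baseline for the next run.
DROP TABLE stats_prev;
RENAME TABLE stats_now TO stats_prev;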
Let me know if you need anything else.

Related

Should a counter column with frequent update be stored in a separate table?

I have a MySQL/MariaDB database where posts are stored. Each post has some statistical counters such as the number of times the post has been viewed for the current day, the total number of views, number of likes and dislikes.
For now, I plan to have all of the counter columns updated in real-time every time an action happens - a post gets a view, a like or a dislike. That means that the post_stats table will get updated all the time while the posts table will rarely be updated and will only be read most of the time.
The table schema is as follows:
posts(post_id, author_id, title, slug, content, created_at, updated_at)
post_stats(post_id, total_views, total_views_today, total_likes, total_dislikes)
The two tables are connected with a post_id foreign key. Currently, both tables use InnoDB. The data from both tables will be always queried together to be able to show a post with its counters, so this means there will be an INNER JOIN used all the time. The stats are updated right after reading them (every page view).
My questions are:
For best performance when the tables grow, should I combine the two tables into one, since the columns in post_stats are directly related to the post entries, or should I keep the counter/summary table separate from the main posts table?
For best performance when the tables grow, should I use MyISAM for the posts table as I can imagine that MyISAM can be more efficient at reads while InnoDB at inserts?
This problem is general for this database and also applies to other tables in the same database, such as users (counters such as the total number of views of their posts, the total number of comments written by them, the total number of posts written by them, etc.) and categories (the number of posts in that category, etc.).
Edit 1: The views per day counters are reset once daily at midnight with a cron job.
Edit 2: One reason for having posts and post_stats as two tables is concerns about caching.
For low traffic, KISS -- Keep the counters in the main post table. (I assume you have ruled this out.)
For high traffic, keep the counters in a separate table. But let's do the "today's" counters differently. (This is what you want to discuss.)
For very high traffic, gather up counts so that you can do less than 1 Update per click/view/like. ("Summary Tables" is beyond the scope of this question.)
Let's study total_views_today. Do you have to do a big "reset" every midnight? That is (or will become) too costly, so let's try to avoid it.
Have only total_views in the table.
At midnight copy the table into another table. (SELECT is faster and less-invasive than the UPDATE needed to reset the values.) Do this copy by building a new table, then RENAME TABLE to move it into place.
Compute total_views_today by subtracting the corresponding values in the two tables.
That leaves you with
post_stats(post_id, total_views, total_likes, total_dislikes)
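A sketch of that midnight copy and the subtraction, assuming a snapshot table named post_stats_midnight (the name is mine, not prescribed above):
-- At midnight: build a fresh copy, swap it into place, discard the old one.
CREATE TABLE post_stats_new AS
    SELECT post_id, total_views FROM post_stats;
RENAME TABLE post_stats_midnight TO post_stats_old,
             post_stats_new TO post_stats_midnight;
DROP TABLE post_stats_old;
-- Whenever needed: today's views = current total minus the midnight snapshot.
SELECT s.post_id, s.total_views - m.total_views AS total_views_today
FROM post_stats AS s
JOIN post_stats_midnight AS m USING (post_id);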
For "high traffic, it is fine to do
UPDATE post_stats SET ... = ... + 1 WHERE post_id = ...;
at the moment needed (for each counter).
But there is a potential problem. You can't increment a counter if the row does not exist. That would be best solved by creating a row with zeros at the same time the post is created. (Otherwise, see IODKU.)
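For reference, the IODKU form for a view would look roughly like this (assuming post_id is the primary key of post_stats):
INSERT INTO post_stats (post_id, total_views, total_likes, total_dislikes)
VALUES (?, 1, 0, 0)
ON DUPLICATE KEY UPDATE total_views = total_views + 1;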
(I may come back if I think of more.)

How should I setup the structure of my MySQL database to work for my needs?

I am working on an application that awards the top person of each category for being first. The way you become first in a category is by having the most votes in the past 30 (or so) days. So even if you had a total of 2,000 votes but only 2 of them came within the past 30 days, someone with 10 votes who got all 10 within the past 30 days would be ranked above you. I am just seeking advice on the best way to create this type of system with a MySQL database and how to structure the database.
I am pretty unsure of the best way to go about this, any advice would be greatly appreciated!
The first decision you have to make is whether you want to keep a record for every vote cast: this has the potential for a huge table, but it lets you keep a lot of information, so you trade storage and performance against information. This must be answered by business logic, not implementation.
Assuming you DO want to keep every vote, store it with a timestamp, and the only thing you have to do is join the person table with the vote table, use a WHERE clause to select only the last N days, and a COUNT() aggregate to count your votes.
If you do NOT want to keep every vote, you should have an accumulation table with person, day and votecount - an analogous query with SUM() instead of COUNT() will do what you want.
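Hedged sketches of both queries, with assumed table and column names (person, vote, daily_votes and their columns are illustrative):
-- Per-vote design: count each person's votes cast in the last 30 days.
SELECT p.person_id, p.name, COUNT(v.vote_id) AS votes_30d
FROM person AS p
JOIN vote AS v ON v.person_id = p.person_id
WHERE v.cast_at >= NOW() - INTERVAL 30 DAY
GROUP BY p.person_id, p.name
ORDER BY votes_30d DESC;
-- Accumulation design: one row per person per day, summed over the window.
SELECT person_id, SUM(votecount) AS votes_30d
FROM daily_votes
WHERE vote_date >= CURRENT_DATE - INTERVAL 30 DAY
GROUP BY person_id
ORDER BY votes_30d DESC;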

MYSQL Database Schema Question

I need opinions on the best way to go about creating a table or collection of tables to handle this unique problem. Basically, I'm designing this site with business profiles. The profile table contains all your usual things such as name, uniqueID, address, etc. Now, the whole idea of the site is that it's going to be collecting a small string of informative text. I want to allow the clients to store one per date, up to 30 days in advance. The program is only going to show the information from the current date onward, with expired dates not being shown.
The only way I can really see this being done is a table consisting of the uniqueID, date, and the informative block of text, but this creates pretty extensive queries. Eventually this table is going to be at least 20 times larger than the table of businesses in the first place as these businesses are going to be able to post up to 30 items in this table using their uniqueID.
Now, imagine the search page brings up a list of businesses in the area; it then has to query the new table for all of those IDs to get the block of information I want to show based on the date. I'm pretty sure it would be a rather intensive couple of queries just to show a rather simple block of text, but I imagine this is how status updates work for social networking sites in general? Does Facebook store updates in a table of updates tied to a user's ID number, or have they come up with a better way?
I'm just trying to gain a little more insight into DB design, so throw out any ideas you might have.
The only way I can really see this being done is a table consisting of the uniqueID, date, and the informative block of text...
Assuming you mean the profile uniqueID, and not a unique ID for the text table, you're correct.
As pascal said in his comment, you'd need a primary index on uniqueID and date. A person could only enter one row of text for a given date.
If you want to retrieve the next text row for a person, your SQL query would have the following clauses:
WHERE UNIQUE_ID = PROFILE.UNIQUE_ID
AND DATE >= CURRENT_DATE
ORDER BY DATE
LIMIT 1
Since you have an index on uniqueID and date, this should be a fast query.
If you want to retrieve the next 5 texts for a particular person, you'd just have to make one change:
WHERE UNIQUE_ID = PROFILE.UNIQUE_ID
AND DATE >= CURRENT_DATE
ORDER BY DATE
LIMIT 5
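Put together as full statements, the queries might look like this (profile, profile_text, and the column names are assumptions for illustration):
-- Next upcoming entry for a single profile.
SELECT date, info_text
FROM profile_text
WHERE unique_id = ?
  AND date >= CURRENT_DATE
ORDER BY date
LIMIT 1;
-- Upcoming entries for every business matched by a search.
SELECT p.unique_id, p.name, t.date, t.info_text
FROM profile AS p
JOIN profile_text AS t ON t.unique_id = p.unique_id
WHERE p.city = ?
  AND t.date >= CURRENT_DATE;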

Need help with a database design for Top 10

I am trying to come up with a database design to hold the "Top 10" results for some calculations that are being done. Basically, when all is said and done, there will be 3 "Top 10" categories, which I am fine with all being separate tables. However, I need to be able to go back later and pull historical data about what was in the Top 10 at certain times, hence the need for a database; a flat file would work, but this has the potential to hold years' worth of data.
Now, it's been a while since I have done anything serious with a database, other than something that had a couple of simple tables, so I am having some issues thinking through this design. If someone could help me with the design of it, I know enough MySQL to get the rest done.
So, in essence, I need to store: A group of 10 names, a % of the total points each name had, the rank they held in the Top 10 and a time associated with that Top 10 (So I can later query for that time)
I would think I need a table for the Top 10 with 11 columns, one for the ID and 10 for the foreign key into the 'Names' table, which holds every name ever used with a PK, Name, %, and Rank. This seems clunky to me; anyone else have a suggestion?
edit: The 'Top 10' is associated with a specific set of data for 5-minute intervals, and each interval is completely independent from previous or future intervals.
I don't recommend your solution, because then if you want to ask the database "How often has Joe been in the top 10," you have to write 10 queries of the form
SELECT Date FROM Top10 WHERE FirstPlace = 'joe'
SELECT Date FROM Top10 WHERE SecondPlace = 'joe'
...
Instead, how about a Rankings table, with fields:
id
Date
Person
Rank
Then if you want the Top 10 list for a certain date, the query is
SELECT * FROM Rankings WHERE Date = ...
and if you want to know someone's historical ranking, the query is
SELECT * FROM Rankings WHERE Person = ...
and if you want to know all the historical leaders, the query is
SELECT * FROM Rankings WHERE Rank = 1
The downside to this is that you might accidentally make two different people 8th place, and your database would allow the anomaly. But I have good news for you -- people might actually tie for 8th place, so you might actually want that to be possible!
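A minimal sketch of that Rankings table (the types are assumptions, the % column comes from the question, and Rank needs backticks on MySQL 8.0+ because RANK is a reserved word there):
CREATE TABLE Rankings (
    id      INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    Date    DATETIME NOT NULL,          -- the 5-minute snapshot time
    Person  VARCHAR(100) NOT NULL,
    `Rank`  TINYINT UNSIGNED NOT NULL,
    Percent DECIMAL(5,2) NOT NULL,      -- the "% of the total points" from the question
    KEY idx_date (Date),
    KEY idx_person (Person)
) ENGINE=InnoDB;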
I assume that your "Top 10" is snapshot data at a certain time, and your business logic is "every 5 minutes", so the time is the parent entity in the table design.
top_10_history
th_id - the primary key
th_time - the time point when taking the snapshot data of "Top 10"
top_10_detail
td_th_id - the FK to top_10_history
td_name_id - the FK to name
td_percentage - the "%"
td_rank - the rank
If the ordering of the "Top 10" can be calculated from the columns in "top_10_detail", you don't need an extra column to keep the sequence; otherwise, you need a column to persist it.
If you need a more complicated query, such as "the top 10 at 12:00 AM in the last 30 days", using individual columns for "day", "hour", and "minute" would be a better idea for performance (with suitable indexes).
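For example, pulling one stored snapshot back out might look like this (the name table's columns, name_id and name, are assumptions):
SELECT d.td_rank, n.name, d.td_percentage
FROM top_10_history AS h
JOIN top_10_detail  AS d ON d.td_th_id = h.th_id
JOIN name           AS n ON n.name_id  = d.td_name_id
WHERE h.th_time = ?
ORDER BY d.td_rank;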

Where to store users visited pages?

I have a project where I have posts, for example.
The task is this: I must show the user his last post visits.
This is my solution: every time a user visits a new (for him) topic, I create a new record in the visits table.
The visits table has the following structure: id, user_id, post_id, last_visit.
My visits table now has ~14,000,000 records, and it's still growing every day.
Maybe my solution isn't optimal and there is another way to store user visits?
It's important to save every visit as a standalone record, because I also have a feature that selects and uses users' visits. And I can't purge this table, because the data could be needed a month or a year later. How can I optimize this situation?
Nope, you don't really have much choice other than to store your visit data in a table with columns for (at a bare minimum) user id, post id, and timestamp if you need to track the last time that each user visited each post.
I question whether you need an id field in that table, rather than using a composite key on (user_id, post_id), but I'd expect that to have a minor effect, provided that you already have a unique index on (user_id, post_id). (If you don't have an index on that pair of fields, adding one should improve query performance considerably and making it a unique index or composite key will protect against accidentally inserting duplicate records.)
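Concretely, that could be either of these (a sketch; note that on a table this size either one rewrites the whole table):
-- Option 1: keep the surrogate id and enforce uniqueness on the pair.
ALTER TABLE visits ADD UNIQUE KEY uq_user_post (user_id, post_id);
-- Option 2: drop the surrogate id and make the pair the composite primary key.
ALTER TABLE visits
    DROP COLUMN id,
    ADD PRIMARY KEY (user_id, post_id);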
If performance is still an issue despite proper indexing, you should be able to improve it a bit by segmenting the table into a collection of smaller tables, but segment it by user_id or post_id (rather than by date as previous answers have suggested). If you break it up by user or post id, then you will still be able to determine whether a given user has previously viewed a given post and, if so, on what date with only a single query. If you segment it by date, then that information will be spread across all tables and, in the worst-case scenario of a user who has never previously viewed a post (which I expect to be fairly common), you'll need to separately query each and every table before having a definitive answer.
As for whether to segment it by user id or by post id, that depends on whether you will more often be looking for all posts viewed by a user (segment by user_id to get them all in one query) or all users who have viewed a post (segment by post_id).
If it doesn't need to be long lasting, you could store it in session instead. If it does, you could either break the records apart by table, like say 1 per month, or you could only store the last 5-10 pages visited, and delete old ones as new ones come in. You could also change it to pages visited today, this week, etc.
If you do need all 14 million records, I would create another historical table to archive the visits that are not the most relevant for the day-to-day site operation.
At the end of the month (or week, or quarter, etc...) have some scheduled logic to archive records beyond a certain cutoff point to the historical table and reduce the number of records in the "live" table. This should help increase the query speed on the "live" table since you would have less records in it.
If you do need to query all of the data, you can use both tables and have all of the data available to you.
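A sketch of that scheduled archive job, assuming a visits_archive table with the same columns (the name is illustrative):
-- Compute the cutoff once so both statements agree on it.
SET @cutoff = NOW() - INTERVAL 1 MONTH;
INSERT INTO visits_archive (id, user_id, post_id, last_visit)
    SELECT id, user_id, post_id, last_visit
    FROM visits
    WHERE last_visit < @cutoff;
DELETE FROM visits WHERE last_visit < @cutoff;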
You could delete the ones you don't need. If you only want to show the last 10 visited posts, then run the following after inserting/updating each new row, or every few days if you like:
DELETE FROM visits WHERE user_id = ? AND id NOT IN
    (SELECT id FROM (SELECT id FROM visits WHERE user_id = ? ORDER BY last_visit DESC LIMIT 10) AS keep_rows);
(I think that's the best way to do that query; any MySQL guru can tell me otherwise? The inner SELECT is wrapped in a derived table because MySQL doesn't accept LIMIT directly inside an IN subquery, nor a subquery that reads the same table the DELETE targets. You can ORDER BY in a DELETE, but its LIMIT only takes one parameter, so you can't do LIMIT 10, 100 there.)
Having a structure like (id, user_id, post_id, last_visit) for your visits table makes it appear as though you are saving all posts, not just the last post per topic. Don't you need a topic ID in there somewhere so that you can determine what their last post PER TOPIC was, and so you know which row to replace when they post in the same topic more than once?
Store the post_ids in $_SESSION, and then, using MySQL's IN with one SELECT query, you will be able to show his visited posts. All those IDs will be destroyed after the member closes his browser, but anyway, this is much faster than using the database.
edit: sorry, I didn't notice that you must store those records in the database and use them months later. Then I have no idea how to optimize it, but with 14 million records you should definitely use indexes.