Where to store users' visited pages? - mysql

I have a project where I have posts, for example.
The task is this: I must show each user the posts he last visited.
This is my solution: every time a user visits a topic that is new for him, I create a new record in the visits table.
The visits table has the following structure: id, user_id, post_id, last_visit.
My visits table now has ~14,000,000 records, and it's still growing every day.
Maybe my solution isn't optimal and there is another way to store users' visits?
It's important to save every visit as a standalone record, because I also have a feature that selects and uses users' visits. And I can't purge this table, because the data could be needed a month or a year later. How can I optimize this situation?

Nope, you don't really have much choice other than to store your visit data in a table with columns for (at a bare minimum) user id, post id, and timestamp if you need to track the last time that each user visited each post.
I question whether you need an id field in that table, rather than using a composite key on (user_id, post_id), but I'd expect that to have a minor effect, provided that you already have a unique index on (user_id, post_id). (If you don't have an index on that pair of fields, adding one should improve query performance considerably and making it a unique index or composite key will protect against accidentally inserting duplicate records.)
If performance is still an issue despite proper indexing, you should be able to improve it a bit by segmenting the table into a collection of smaller tables, but segment it by user_id or post_id (rather than by date as previous answers have suggested). If you break it up by user or post id, then you will still be able to determine whether a given user has previously viewed a given post and, if so, on what date with only a single query. If you segment it by date, then that information will be spread across all tables and, in the worst-case scenario of a user who has never previously viewed a post (which I expect to be fairly common), you'll need to separately query each and every table before having a definitive answer.
As for whether to segment it by user id or by post id, that depends on whether you will more often be looking for all posts viewed by a user (segment by user_id to get them all in one query) or all users who have viewed a post (segment by post_id).
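The composite-key scheme this answer describes (one row per (user_id, post_id), refreshed on each revisit) can be sketched as follows. This uses SQLite in memory so it is self-contained; in MySQL you would use INSERT ... ON DUPLICATE KEY UPDATE instead of ON CONFLICT, and the `record_visit` helper is my own name for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE visits (
        user_id    INTEGER NOT NULL,
        post_id    INTEGER NOT NULL,
        last_visit TEXT    NOT NULL,
        PRIMARY KEY (user_id, post_id)   -- composite key instead of a surrogate id
    )
""")

def record_visit(user_id, post_id, when):
    # Insert a new row, or refresh last_visit if this (user, post) pair exists.
    # MySQL equivalent: INSERT ... ON DUPLICATE KEY UPDATE last_visit = VALUES(last_visit)
    conn.execute(
        """INSERT INTO visits (user_id, post_id, last_visit)
           VALUES (?, ?, ?)
           ON CONFLICT(user_id, post_id) DO UPDATE SET last_visit = excluded.last_visit""",
        (user_id, post_id, when),
    )

record_visit(1, 42, "2013-05-01 10:00:00")
record_visit(1, 42, "2013-05-02 09:30:00")  # revisit: same row, newer timestamp

row = conn.execute(
    "SELECT last_visit FROM visits WHERE user_id = ? AND post_id = ?", (1, 42)
).fetchone()
print(row[0])  # '2013-05-02 09:30:00' - still only one row per pair
```

The composite primary key both prevents duplicate (user, post) rows and serves as the index that makes the "has this user seen this post?" lookup fast.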

If it doesn't need to be long-lasting, you could store it in the session instead. If it does, you could either break the records apart by table (say, one per month), or only store the last 5-10 pages visited, deleting old ones as new ones come in. You could also change it to pages visited today, this week, etc.

If you do need all 14 million records, I would create another historical table to archive the visits that are not the most relevant for the day-to-day site operation.
At the end of the month (or week, or quarter, etc...) have some scheduled logic archive records beyond a certain cutoff point to the historical table, reducing the number of records in the "live" table. This should help increase query speed on the "live" table, since it would contain fewer records.
If you do need to query all of the data, you can use both tables and have all of the data available to you.

you could delete the ones you don't need - if you only want to show the last 10 visited posts then
DELETE FROM visits WHERE user_id = ? AND id NOT IN (SELECT id FROM (SELECT id FROM visits WHERE user_id = ? ORDER BY last_visit DESC LIMIT 10) AS keep);
(Note the extra derived table: MySQL won't let you select from the same table you're deleting from directly in a subquery - it raises error 1093 - so the inner SELECT has to be wrapped. You can ORDER BY in DELETE, but its LIMIT only takes one parameter, so you can't do LIMIT 10, 100 there.)
after inserting/updating each new row, or every few days if you like
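The "keep only the latest N visits per user" trim above can be sketched like this. SQLite (used here so the example is runnable anywhere) happens to allow the same-table subquery directly; as noted, real MySQL needs the derived-table wrapper:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visits (id INTEGER PRIMARY KEY, user_id INTEGER,"
    " post_id INTEGER, last_visit TEXT)"
)

# 15 visits for user 1, one minute apart
for i in range(15):
    conn.execute(
        "INSERT INTO visits (user_id, post_id, last_visit) VALUES (1, ?, ?)",
        (i, f"2013-05-01 10:{i:02d}:00"),
    )

KEEP = 10
# Delete everything for this user except the KEEP most recent visits.
conn.execute(
    """DELETE FROM visits
       WHERE user_id = ?
         AND id NOT IN (SELECT id FROM visits WHERE user_id = ?
                        ORDER BY last_visit DESC LIMIT ?)""",
    (1, 1, KEEP),
)

count = conn.execute("SELECT COUNT(*) FROM visits WHERE user_id = 1").fetchone()[0]
print(count)  # 10 - only the newest ten rows survive
```

Running this after each insert keeps the trim cheap; running it from a periodic job (the "every few days" option) amortizes the cost instead.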

Having a structure like (id, user_id, post_id, last_visit) for your visits table makes it appear as though you are saving all posts, not just the last post per topic. Don't you need a topic ID in there somewhere so that you can determine what their last post PER TOPIC was, and so you know which row to replace when they visit the same topic more than once?

Store the post_ids in $_SESSION, and then with one SELECT query using MySQL's IN you will be able to show the user his visited posts. All those ids will be destroyed when the member closes his browser, but this is much faster than using the database.
edit: sorry, I didn't notice that you must store those records in the database and use them months later. Then I have no idea how to optimize it, but with 14 million records you should definitely use indexes.

Related

Should a counter column with frequent update be stored in a separate table?

I have a MySQL/MariaDB database where posts are stored. Each post has some statistical counters such as the number of times the post has been viewed for the current day, the total number of views, number of likes and dislikes.
For now, I plan to have all of the counter columns updated in real-time every time an action happens - a post gets a view, a like or a dislike. That means that the post_stats table will get updated all the time while the posts table will rarely be updated and will only be read most of the time.
The table schema is as follows:
posts(post_id, author_id, title, slug, content, created_at, updated_at)
post_stats(post_id, total_views, total_views_today, total_likes, total_dislikes)
The two tables are connected with a post_id foreign key. Currently, both tables use InnoDB. The data from both tables will be always queried together to be able to show a post with its counters, so this means there will be an INNER JOIN used all the time. The stats are updated right after reading them (every page view).
My questions are:
For best performance when the tables grow, should I combine the two tables into one since the columns in post_stats are directly related to the post entries, or should I keep the counter/summary table separate from the main posts table?
For best performance when the tables grow, should I use MyISAM for the posts table as I can imagine that MyISAM can be more efficient at reads while InnoDB at inserts?
This problem is general for this database and also applies to other tables in the same database such as users (counters such as the total number views of their posts, the total number of comments written by them, the total number of posts written by them, etc.) and categories (the number of posts in that category, etc.).
Edit 1: The views per day counters are reset once daily at midnight with a cron job.
Edit 2: One reason for having posts and post_stats as two tables is concerns about caching.
For low traffic, KISS -- Keep the counters in the main post table. (I assume you have ruled this out.)
For high traffic, keep the counters in a separate table. But let's do the "today's" counters differently. (This is what you want to discuss.)
For very high traffic, gather up counts so that you can do less than 1 Update per click/view/like. ("Summary Tables" is beyond the scope of this question.)
Let's study total_views_today. Do you have to do a big "reset" every midnight? That is (or will become) too costly, so let's try to avoid it.
Have only total_views in the table.
At midnight copy the table into another table. (SELECT is faster and less-invasive than the UPDATE needed to reset the values.) Do this copy by building a new table, then RENAME TABLE to move it into place.
Compute total_views_today by subtracting the corresponding values in the two tables.
That leaves you with
post_stats(post_id, total_views, total_likes, total_dislikes)
For "high" traffic, it is fine to do
UPDATE post_stats SET ... = ... + 1 WHERE post_id = ...;
at the moment needed (for each counter).
But there is a potential problem. You can't increment a counter if the row does not exist. That would be best solved by creating a row with zeros at the same time the post is created. (Otherwise, see IODKU - INSERT ... ON DUPLICATE KEY UPDATE.)
(I may come back if I think of more.)
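The midnight-snapshot idea above can be sketched end to end. SQLite stands in for MySQL (where the copy would be done by building a new table and then RENAME TABLE), and the table names follow the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE post_stats (post_id INTEGER PRIMARY KEY,"
    " total_views INTEGER NOT NULL)"
)
conn.execute("INSERT INTO post_stats VALUES (1, 100), (2, 50)")

# At midnight: copy the live table instead of mass-resetting a
# total_views_today column.
conn.execute("CREATE TABLE post_stats_midnight AS SELECT * FROM post_stats")

# During the day: plain cheap increments on the live table only.
conn.execute("UPDATE post_stats SET total_views = total_views + 1 WHERE post_id = 1")
conn.execute("UPDATE post_stats SET total_views = total_views + 1 WHERE post_id = 1")

# total_views_today = live value minus the midnight snapshot.
views_today = dict(conn.execute(
    """SELECT s.post_id, s.total_views - m.total_views
       FROM post_stats s JOIN post_stats_midnight m USING (post_id)"""
).fetchall())
print(views_today)  # {1: 2, 2: 0}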

Storing count of records in SQL table

Let's say I have a table with posts, and each post has the id of the topic it belongs to. And I have a table with topics, with an integer field representing the number of posts in that topic. When I create a new post, I increase this value by 1, and when I delete a post, I decrease the value by 1.
I do this so I don't have to query the database each time I need to count the number of posts in certain topics.
But I heard that this approach may not be safe to use, and the actual number of posts in the table may not match the stored value.
Is there any definite info about how safe this is?
Without transactions, the primary issue is timing. Consider a delete and two users:
Time | User 1            | User 2
-----+-------------------+-----------------
  1  | count = count - 1 |
  2  | update finishes   | How many posts?
  3  | delete post       | Count returned
  4  | delete finishes   |
Remember that actions such as updates and deletes take a finite amount of time -- even if they take effect all at once. Because of this, User 2 will get the wrong number of posts. This is a race condition; and it may or may not be an issue in your application.
Transactions fix this particular problem, by ensuring that resetting the count and deleting the post both take effect "at the same time".
A secondary issue is data quality. Your data consistency checks are outside the database. Someone can come directly into the database and say "Oh, these posts from user X should be removed". That user might then delete those posts en masse -- but "forget" or not know to change the associated values.
This can be a big issue. Triggers should solve this problem.
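The trigger approach mentioned above can be sketched like this: the database itself maintains the counter, so even a direct mass delete keeps it consistent. SQLite's trigger syntax stands in for MySQL's here, and the table names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE topics (id INTEGER PRIMARY KEY, post_count INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE posts  (id INTEGER PRIMARY KEY, topic_id INTEGER NOT NULL);

    -- The counter is updated by the database, not the application.
    CREATE TRIGGER posts_ai AFTER INSERT ON posts BEGIN
        UPDATE topics SET post_count = post_count + 1 WHERE id = NEW.topic_id;
    END;
    CREATE TRIGGER posts_ad AFTER DELETE ON posts BEGIN
        UPDATE topics SET post_count = post_count - 1 WHERE id = OLD.topic_id;
    END;
""")

conn.execute("INSERT INTO topics (id) VALUES (1)")
for _ in range(3):
    conn.execute("INSERT INTO posts (topic_id) VALUES (1)")

# Even an "out of band" delete (e.g. someone working directly in the
# database) fires the trigger and keeps the counter right.
conn.execute("DELETE FROM posts WHERE id = 1")

count = conn.execute("SELECT post_count FROM topics WHERE id = 1").fetchone()[0]
print(count)  # 2
```

The counter now always matches `SELECT COUNT(*) FROM posts WHERE topic_id = 1`, regardless of which client did the insert or delete.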

Storing like count for a post in MySQL

Is it a good idea to store like count in the following format?
like table:
u_id | post_id | user_id
And get the like count of a post with COUNT(u_id)?
What if there were thousands of likes for each post? The like table is going to be filled with billions of rows after a few months.
What are other efficient ways to do so?
In two words, the answer is: yes, it is OK (to store data about each like any user gives to any post).
But I want just to separate or transform it to several questions:
Q. Is there another way than COUNT(u_id)? Or, written out:
SELECT COUNT(u_id) FROM likes WHERE post_id = ?
A. Why not? You can save the count in your posts table and increase/decrease it every time a user likes/dislikes the post. You can set up a trigger (or stored procedure) to automate this action. And then to get the counter you need just:
SELECT counter FROM posts WHERE post_id = ?
If you like previous Q/A and think that it is good idea I have next question:
Q. Why do we need likes table then?
A. That depends on your application design and requirements. According to the column set you posted, u_id, post_id, user_id (I would even add a timestamp column), your requirement is to store info about the user as well as about the post when it is liked. That means you can recognize whether a user has already liked a post and refuse multi-likes. If you don't care about multi-likes or a historical timeline and stats, you can delete your likes table.
Last question I see here:
Q. The like table is going to be filled with billions of rows after a few months, isn't it?
A. I wish you that success, but IMHO you are 99% wrong. To get just 1M records you need 1,000 active users (which is a very, very good number for a personal startup - and you are building the whole app with no architect or designer involved?), and EVERY one of those users would have to like EVERY one of 1,000 posts, if you even have that many.
My point here is: fortunately, you have plenty of time before your database becomes big enough to hurt your application. Until your table reaches 10-20M records, you don't need to worry about size and performance.
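The two-part design from this answer (a likes table whose (post_id, user_id) key refuses multi-likes, plus a denormalized counter on posts) can be sketched like this, using SQLite for a self-contained example; the `like` helper is my own name:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (post_id INTEGER PRIMARY KEY, counter INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE likes (post_id INTEGER, user_id INTEGER, liked_at TEXT,
                        PRIMARY KEY (post_id, user_id));  -- one like per user per post
""")
conn.execute("INSERT INTO posts (post_id) VALUES (1)")

def like(post_id, user_id, when):
    # INSERT OR IGNORE: a duplicate like hits the primary key and is dropped.
    cur = conn.execute(
        "INSERT OR IGNORE INTO likes VALUES (?, ?, ?)", (post_id, user_id, when))
    if cur.rowcount == 1:  # bump the counter only if a row was really inserted
        conn.execute(
            "UPDATE posts SET counter = counter + 1 WHERE post_id = ?", (post_id,))

like(1, 7, "2013-05-01")
like(1, 7, "2013-05-02")  # same user again: ignored, counter unchanged
like(1, 8, "2013-05-01")

counter = conn.execute("SELECT counter FROM posts WHERE post_id = 1").fetchone()[0]
print(counter)  # 2
```

Reading the counter is then the cheap `SELECT counter FROM posts WHERE post_id = ?` from the answer, while the likes table keeps the per-user history.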

Database design for hourly, weekly, ranking?

I've been trying to figure out a good way to handle a ranking system of this sort. As a rough example, I would like to query a facebook page and grab the likes and comments of each post. Then, there would be three rankings based on a time interval. To give a simplified example:
Hourly
- I pull all the posts updated within the last hour and compare the # of likes/comments with my previous entry (the last pull being an hour prior).
Daily
- I pull down all posts within a 24-hour date range and compare the # of likes/comments with the previous entry. "Post X had 12 more likes and 40 more comments today compared to yesterday"
Weekly
- I pull down all posts within a week's range and do the same as above. "Post X had no new likes, but 10 more comments added this week compared to last week"
In terms of the DB tables, what would be a good way to handle this? Would it make sense to have one giant table with the posts (title, comments_previous, comments_current, likes_previous, likes_current, etc)?
Thank you!
Columns: (PK)timestamp, (index)pageid, count. Set a new timestamp every hour on the hour for pages that are liked. Timestamp is the PK so that you don't get horrible fragmentation from your clustered index / page layout in the database.
If you feel for performance reasons that you need to de-normalize, you can make additional daily and monthly tables that are rolled-up summations. Likely, you'll be able to efficiently generate what you need without the rollup tables by using where clauses on the time / pageid combination, thereby giving you what you need with just one table.
Purge old data as you see fit, or keep it.
Clarification
When a comment receives a like, do the following:
INSERT INTO likeRanking (ts, commentid, score)
VALUES (CONCAT(LEFT(NOW(), 13), ':00:00'), ?, 1)
ON DUPLICATE KEY UPDATE score = score + 1;
(LEFT(NOW(), 13) yields 'YYYY-MM-DD HH', so concatenating ':00:00' buckets the like into the current hour.)
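The hourly-bucket upsert idea can be sketched in a runnable form. SQLite's ON CONFLICT stands in for MySQL's ON DUPLICATE KEY UPDATE, and the column names follow the answer's (PK)timestamp / pageid / count layout:

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE likeRanking (
        ts     TEXT    NOT NULL,   -- hour bucket, e.g. '2013-05-01 10:00:00'
        pageid INTEGER NOT NULL,
        count  INTEGER NOT NULL,
        PRIMARY KEY (ts, pageid)
    )
""")

def record_like(pageid, when):
    bucket = when.strftime("%Y-%m-%d %H:00:00")  # truncate event time to the hour
    conn.execute(
        """INSERT INTO likeRanking VALUES (?, ?, 1)
           ON CONFLICT(ts, pageid) DO UPDATE SET count = count + 1""",
        (bucket, pageid),
    )

record_like(5, datetime(2013, 5, 1, 10, 15))
record_like(5, datetime(2013, 5, 1, 10, 45))  # same hour: same bucket, count goes up
record_like(5, datetime(2013, 5, 1, 11, 5))   # next hour: a new bucket row

rows = conn.execute("SELECT ts, count FROM likeRanking ORDER BY ts").fetchall()
print(rows)  # [('2013-05-01 10:00:00', 2), ('2013-05-01 11:00:00', 1)]
```

Hourly, daily, and weekly rankings are then range queries on `ts`, with no per-interval reset job required.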
I would do this as follows:
Create a table that gets the time now, comments now, and likes now.
Then, an hour later, create another table with the time now, comments now, and likes now, and subtract the previous table's values from it to get the hourly changes. Then drop the old table, keep the new one as the baseline, and repeat after another hour.
Same with monthly and yearly.
Let me know if you need anything else.

MYSQL Database Schema Question

I need opinions on the best way to go about creating a table or collection of tables to handle this unique problem. Basically, I'm designing this site with business profiles. The profile table contains all your usual things such as name, uniqueID, address, etc. Now, the whole idea of the site is that it's going to be collecting a small string of informative text. I want to allow the clients to store one per date, with as many as 30 days in advance. The program is only going to show the information from the current date on forward, with expired dates not being shown.
The only way I can really see this being done is a table consisting of the uniqueID, date, and the informative block of text, but this creates pretty extensive queries. Eventually this table is going to be at least 20 times larger than the table of businesses in the first place as these businesses are going to be able to post up to 30 items in this table using their uniqueID.
Now, imagine the search page brings up a list of businesses in the area, it's then got to query the new table for all of those ids to get that block of information I want to show based on the date. I'm pretty sure it would be a rather intensive couple of queries just to show a rather simple block of text, but I imagine this is how status updates work for social networking sites in general? Does facebook store updates in a table of updates tied to a users ID number or have they come up with a better way?
I'm just trying to gain a little more insight into DB design, so throw out any ideas you might have.
The only way I can really see this being done is a table consisting of the uniqueID, date, and the informative block of text...
Assuming you mean the profile uniqueID, and not a unique ID for the text table, you're correct.
As pascal said in his comment, you'd need a primary index on uniqueID and date. A person could only enter one row of text for a given date.
If you want to retrieve the next text row for a person, your SQL query would have the following clauses:
WHERE UNIQUE_ID = PROFILE.UNIQUE_ID
AND DATE >= CURRENT_DATE
ORDER BY DATE
LIMIT 1
(Without the ORDER BY, LIMIT 1 would return an arbitrary matching row rather than the next one.)
Since you have an index on uniqueID and date, this should be a fast query.
If you want to retrieve the next 5 texts for a particular person, you'd just have to make one change:
WHERE UNIQUE_ID = PROFILE.UNIQUE_ID
AND DATE >= CURRENT_DATE
ORDER BY DATE
LIMIT 5
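The schema and the "next N texts" query can be sketched end to end. SQLite stands in for MySQL here; the column names follow the answer, and `body` is an assumed name for the text column:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE texts (
        unique_id INTEGER NOT NULL,   -- the profile's id
        date      TEXT    NOT NULL,   -- one row of text per date
        body      TEXT    NOT NULL,
        PRIMARY KEY (unique_id, date) -- enforces one entry per profile per date
    )
""")
for d in ("2013-04-28", "2013-05-01", "2013-05-02", "2013-05-03"):
    conn.execute("INSERT INTO texts VALUES (1, ?, ?)", (d, "text for " + d))

current_date = "2013-05-01"
rows = conn.execute(
    """SELECT date, body FROM texts
       WHERE unique_id = ? AND date >= ?
       ORDER BY date
       LIMIT 5""",
    (1, current_date),
).fetchall()
print([r[0] for r in rows])  # ['2013-05-01', '2013-05-02', '2013-05-03']
```

The expired 2013-04-28 row is filtered out by the date predicate, and the composite primary key doubles as the index that keeps this lookup fast.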