Using Redis as a key/value store for an activity stream - MySQL

I am in the process of creating a simple activity stream for my app.
The current technology layer and logic is as follows:
**All data relating to an activity is stored in MySQL, and an array of all activity IDs is kept in Redis for every user.**
A user performs an action; the activity is stored directly in an 'activities' table in MySQL and a unique 'activity_id' is returned.
An array of this user's followers is retrieved from the database, and for each follower I push the new activity_id onto their list in Redis.
When a user views their stream, I retrieve the array of activity IDs from Redis based on their user ID. I then perform a simple MySQL WHERE IN($ids) query to get the actual activity data for all of those IDs.
This setup should, I believe, be quite scalable, as the queries will always be simple IN queries. However, it presents several problems.
Removing a follower - if a user stops following someone, we need to remove all activity IDs that correspond to that user from the follower's Redis list. This requires looping through every ID in the Redis list and removing the ones that belong to the unfollowed user. That strikes me as quite inelegant - is there a better way of managing this?
'Archiving' - I would like to cap the Redis lists at, say, 1000 activity IDs, and frequently prune old data from the MySQL activities table to prevent it from growing to an unmanageable size. Obviously this can be achieved by removing old IDs from a user's stream list whenever we add a new one. However, I am unsure how to archive this data so that users can still view very old activity if they choose to. What would be the best way to do this? Or am I better off enforcing the limit outright and preventing users from viewing very old activity data?
To summarise: what I would really like to know is whether my current setup/logic is a good or bad idea. Do I need a total rethink? If so, what models would you recommend? If you feel all is okay, how should I go about addressing the two issues above? I realise this question is quite broad and answers will be opinion-based, but that is exactly what I am looking for: well-formed opinions.
Many thanks in advance.

1) doesn't seem so difficult to perform (no looping). Fetch the unfollowed user's activity IDs with a single query:

SELECT id FROM activities WHERE user_id = 2;

then remove those IDs from the follower's Redis list (one pipelined LREM per ID). And if the stream were mirrored in a MySQL table - say stream(user_id, activity_id) - the cleanup would be a single multi-table DELETE:

DELETE stream FROM stream
JOIN activities ON stream.activity_id = activities.id
WHERE activities.user_id = 2
  AND stream.user_id = 1;
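A minimal sketch of the Redis-side cleanup logic, simulated with plain Python lists rather than a live Redis/MySQL connection (the helper name and all IDs are invented): removing every occurrence of each doomed ID mirrors what LREM with count 0 does.

```python
def unfollow_cleanup(stream, unfollowed_activity_ids):
    """Remove every activity belonging to the unfollowed user from one
    follower's stream (mirrors LREM semantics: drop all occurrences)."""
    doomed = set(unfollowed_activity_ids)
    return [aid for aid in stream if aid not in doomed]

# Follower 1's stream holds activity IDs from several users.
stream = [101, 205, 102, 206, 103]
# Result of: SELECT id FROM activities WHERE user_id = 2
unfollowed_ids = [205, 206]
print(unfollow_cleanup(stream, unfollowed_ids))  # [101, 102, 103]
```

In a real deployment the LREM calls would go out in one Redis pipeline, so there is still no per-item round trip.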
2) I'm not really sure about archiving. You could create an archive table per period and move old activities from the main table into it periodically. A single, properly normalized activity table ought to be able to get pretty big, though. (Make sure any "large" activity stores its payload in a separate table - the main activity table should stay "narrow", since it is expected to have a lot of rows.)
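The periodic move can be sketched as an INSERT ... SELECT plus a DELETE inside one transaction. This uses Python's stdlib sqlite3 just to illustrate the pattern (in MySQL the statements are the same shape); the table, column names, and cutoff are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE activities (id INTEGER PRIMARY KEY, user_id INT, created_at TEXT);
    CREATE TABLE activities_archive (id INTEGER PRIMARY KEY, user_id INT, created_at TEXT);
    INSERT INTO activities VALUES
        (1, 7, '2020-01-01'), (2, 7, '2024-06-01'), (3, 9, '2019-12-31');
""")

def archive_older_than(conn, cutoff):
    # Move rows atomically: copy them to the archive, then delete the originals.
    with conn:  # one transaction
        conn.execute(
            "INSERT INTO activities_archive "
            "SELECT * FROM activities WHERE created_at < ?", (cutoff,))
        conn.execute("DELETE FROM activities WHERE created_at < ?", (cutoff,))

archive_older_than(conn, '2021-01-01')
print(conn.execute("SELECT COUNT(*) FROM activities").fetchone()[0])          # 1
print(conn.execute("SELECT COUNT(*) FROM activities_archive").fetchone()[0])  # 2
```

Old stream entries then stay queryable from the archive table when a user pages far enough back.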

Related

How to scale mysql table to keep historic data without increasing table size

I have a questionnaire in my app, with which I create data recording which user submitted it and at what time (I have to apply further processing to the last object/questionnaire per user). This data is saved in my server's MySQL DB. The questionnaire is open to all my users and will be submitted multiple times, so I do not want a new row created every time for the same user, because that would grow the table (the user count could be around 10M). But I also want to keep the old data as history for later processing.
Now I have this option in mind:
Create two tables. One main table to keep new objects and one history table to keep history objects. Whenever a questionnaire is submitted it will create a new entry in the history table, but update the existing entry in the main table.
So, is there any better approach to this and how do other companies tackle such situations?
I think you should read up on the SCD (Slowly Changing Dimension) concepts and decide which type is the better approach for you - Type 1 (overwrite in place) and Type 2 (keep a row per version) are the usual candidates here.
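For reference, the Type 2 flavour keeps one row per version with validity timestamps, so the "current" row is simply the one whose valid_to is still open. A minimal sketch using stdlib sqlite3 (table and column names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE questionnaire_scd (
        user_id INT, answers TEXT,
        valid_from TEXT, valid_to TEXT  -- NULL valid_to marks the current row
    )
""")

def submit(conn, user_id, answers, now):
    with conn:
        # Close out the previous current row for this user, if any...
        conn.execute(
            "UPDATE questionnaire_scd SET valid_to = ? "
            "WHERE user_id = ? AND valid_to IS NULL", (now, user_id))
        # ...then insert the new current row.
        conn.execute(
            "INSERT INTO questionnaire_scd VALUES (?, ?, ?, NULL)",
            (user_id, answers, now))

submit(conn, 42, "first answers", "2024-01-01")
submit(conn, 42, "revised answers", "2024-02-01")

current = conn.execute(
    "SELECT answers FROM questionnaire_scd "
    "WHERE user_id = 42 AND valid_to IS NULL").fetchone()
print(current[0])  # revised answers
```

All history stays in the one table, which avoids the two-table write path at the cost of filtering on valid_to in every "current" query.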

Database Design: Difference between using boolean fields and duplicate tables

I have to design a database schema for an application I'm building; I will be using MySQL. In this application, users enter data and it gets saved in the database. However, this data is not accessible to the public until the user publishes it. Currently, I have one table for storing all the data. I was wondering whether a boolean field in this table indicating whether the data has been published is a good idea. Or is it much better design to create one table for saved data and one table for published data, and move the saved data to the published table when the user presses Publish?
What are the advantages and disadvantages of using each one and is one of them considered better design than the other?
Case: Binary
They are about equal. Use this as a learning exercise -- Implement it one way; watch it for a while, then switch to the other way.
(same) Space: Since a row exists exactly once, neither option is 'better'.
(favor 1 table) When "publishing" it takes a transaction to atomically delete from one table and insert into the other.
(favor 2 tables) Certain SELECTs will spend time filtering out records with the other value for published. (This applies to deleted, embargoed, approved, and a host of other possible boolean flags.)
Case: Revision history
If there are many revisions of a record, then two tables, Current data and History, are better. That is because the 'important' queries involve fetching only the Current data.
(PARTITIONs are unlikely to help in either case.)
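To make the trade-off concrete: with one table, publishing is a single UPDATE of the flag, whereas with two tables it has to be a DELETE plus INSERT wrapped in one transaction. A sketch of the two-table move using stdlib sqlite3 (table names and data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE drafts (id INTEGER PRIMARY KEY, body TEXT);
    CREATE TABLE published (id INTEGER PRIMARY KEY, body TEXT);
    INSERT INTO drafts VALUES (1, 'hello world');
""")

def publish(conn, draft_id):
    # Two-table design: both statements must run inside one transaction
    # so a crash can't leave the row in both tables (or neither).
    with conn:
        conn.execute(
            "INSERT INTO published SELECT * FROM drafts WHERE id = ?", (draft_id,))
        conn.execute("DELETE FROM drafts WHERE id = ?", (draft_id,))

publish(conn, 1)
print(conn.execute("SELECT COUNT(*) FROM drafts").fetchone()[0])              # 0
print(conn.execute("SELECT body FROM published WHERE id = 1").fetchone()[0])  # hello world
```

The one-table equivalent is just `UPDATE posts SET published = 1 WHERE id = ?`, at the cost of every public-facing SELECT filtering on that flag.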

I came up with this SQL structure to allow rolling back and auditing user information, will this be adequate?

So, I came up with an idea for storing my user information and the updates users make to their own profiles in a way that always allows rollback (as an option to offer the user, for auditing and support purposes, etc.) while at the same time improving (?) security and preventing malicious activity.
My idea is to store the user's info in rows but never allow the API backend to delete or update those rows, only to insert new ones that should be marked as the "current" data row. I created a graphical explanation:
Schema image
The potential issue I see with this model is that users may update their information very frequently, bloating the database (1 million users at an average of 5 updates per user is 5 million rows). For this, however, I came up with the idea of separating out the rows with "false" in the "current" column via partitioning, where they should not hurt performance and can be cleaned up periodically.
Am I right to choose this model? Is there any other way to do such a thing?
I'd also use a second table user_settings_history.
When a setting is created, INSERT it in the user_settings_history table, along with a timestamp of when it was created. Then also UPDATE the same settings in the user_settings table. There will be one row per user in user_settings, and it will always be the current settings.
So the user_settings would always have the current settings, and the history table would have all prior sets of settings, associated with the date they were created.
This simplifies your queries against the user_settings table. You don't have to modify your queries to filter for the current flag column you described. You just know that the way your app works, the values in user_settings are defined as current.
If you're concerned about the user_settings_history table getting too large, the timestamp column makes it fairly easy to periodically DELETE rows over 180 days old, or whatever number of days seems appropriate to you.
By the way, 5 million rows isn't so large for a MySQL database. You'd want your queries to use an index where appropriate, but the size alone isn't a disadvantage.
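The write path described above can be sketched as follows, using stdlib sqlite3 (the two table names follow the answer; columns and data are invented). Every save appends to the history table and upserts the single current row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_settings (user_id INTEGER PRIMARY KEY, settings TEXT);
    CREATE TABLE user_settings_history (user_id INT, settings TEXT, created_at TEXT);
""")

def save_settings(conn, user_id, settings, now):
    with conn:
        # Every version goes into the history table with its timestamp...
        conn.execute(
            "INSERT INTO user_settings_history VALUES (?, ?, ?)",
            (user_id, settings, now))
        # ...while user_settings always holds exactly one current row per user.
        conn.execute(
            "INSERT OR REPLACE INTO user_settings VALUES (?, ?)",
            (user_id, settings))

save_settings(conn, 1, "theme=dark", "2024-01-01")
save_settings(conn, 1, "theme=light", "2024-03-01")

print(conn.execute(
    "SELECT settings FROM user_settings WHERE user_id = 1").fetchone()[0])
# theme=light
print(conn.execute("SELECT COUNT(*) FROM user_settings_history").fetchone()[0])  # 2
```

Reads against user_settings then need no "current" filter at all, and pruning history is a simple DELETE on created_at.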

SQL query to check that many interests are matched

So I am building a swingers site. Users can search other users by their interests; this is only one of a number of parameters used to search for a user. The thing is, there are around 100 different interests. When searching for another user, they can select all the interests the other user must share. While I can think of ways to do this, I know it is important that the search be as efficient as possible.
The backend uses jdbc to connect to a mysql database. Java is the backend programming language.
I have debated using multiple columns for interests, generating the SQL so the query need not check columns that are not addressed in the JSON object sent to the server with the search criteria. I also worry I may have to make painful modifications to the table later if I add new interests.
Another thing I thought about was some kind of long byte array, or a number used like a bit array, stored in a single column. I could AND this with another number corresponding to the interests being searched for, but I have read that this is actually quite inefficient, despite making good sense to my mind :/
And all of this has to be part of one big sql query with multiple tables joined into it.
One of the issues with using multiple columns would be the computing power used to run statement.setBoolean on what could be 40 columns.
I thought about generating an xml string in the client then processing that in the sql query.
Any suggestions?
I think the correct term is a bitmask. I could maybe have one table mapping the user ID to the bitmask for querying users' interests, and another with one row per interest per user ID for looking up which user has which interests efficiently, should I need that later.
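To illustrate the bitmask idea itself in pure Python (the mapping of interests to bit positions is invented): a user matches a search when ANDing the two masks leaves every searched bit intact.

```python
# Each interest gets a bit position (assumed mapping, for illustration only).
INTERESTS = {"hiking": 0, "cooking": 1, "travel": 2, "music": 3}

def to_mask(interests):
    """Fold a list of interest names into a single integer bitmask."""
    mask = 0
    for name in interests:
        mask |= 1 << INTERESTS[name]
    return mask

user_mask = to_mask(["hiking", "travel", "music"])
search_mask = to_mask(["hiking", "travel"])

# Match test: the user must share ALL searched interests.
print((user_mask & search_mask) == search_mask)              # True
print((to_mask(["cooking"]) & search_mask) == search_mask)   # False
```

The catch noted above stands: a `WHERE mask & ? = ?` predicate can't use a normal B-tree index, so the database ends up scanning candidate rows.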
Basically, it would be great to have a separate table with all the interests, with two columns: id and interest.
Then have a table linking users to interests, user_interests, with the columns id, user_id, interest_id. Some knowledge of many-to-many relations helps a lot here.
Hope it helps!
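With that junction table, "must share all selected interests" becomes a GROUP BY/HAVING query (sometimes called relational division). A sketch with stdlib sqlite3, using the schema from the answer with invented data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE interest (id INTEGER PRIMARY KEY, interest TEXT);
    CREATE TABLE user_interests (id INTEGER PRIMARY KEY, user_id INT, interest_id INT);
    INSERT INTO interest VALUES (1, 'hiking'), (2, 'cooking'), (3, 'travel');
    INSERT INTO user_interests (user_id, interest_id) VALUES
        (10, 1), (10, 2), (10, 3),   -- user 10 has all three interests
        (20, 1),                     -- user 20 only hikes
        (30, 1), (30, 3);            -- user 30 hikes and travels
""")

wanted = [1, 3]  # interest_ids the searcher selected
placeholders = ",".join("?" * len(wanted))
rows = conn.execute(
    f"""SELECT user_id FROM user_interests
        WHERE interest_id IN ({placeholders})
        GROUP BY user_id
        HAVING COUNT(DISTINCT interest_id) = ?""",
    (*wanted, len(wanted))).fetchall()
print(sorted(r[0] for r in rows))  # [10, 30]
```

The HAVING count equals the number of selected interests, so only users matching every one survive; an index on (interest_id, user_id) keeps it fast, and adding a new interest is just a new row, not a schema change.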

member action table data model suggestion

I'm trying to add an action table, but I'm currently at odds as to how to approach the problem.
Before I go into more detail:
We have members who can do different actions on our website
add an image
update an image
rate an image
post a comment on image
add a blog post
update a blog post
comment on a blog post
etc, etc
The action table allows our users to "watch" other members' activities if they want to add them to their watch list.
I have created a table called member_actions with the following columns:
[UserID] [actionDate] [actionType] [refID]
[refID] can be a reference either to the image ID in the DB or blogpost ID, or an id column of another actionable table (eg. event)
[actionType] is an Enum column with action names such as (imgAdd,imgUpdate,blogAdd,blogUpdate, etc...)
[actionDate] will decide which records get deleted every 90 days... so we won't be keeping the actions forever
The current MySQL query I came up with is:
SELECT act.*,
img.Title, img.FileName, img.Rating, img.isSafe, img.allowComment AS allowimgComment,
blog.postTitle, blog.firstImageSRC AS blogImg, blog.allowComments AS allowBlogComment,
event.Subject, event.image AS eventImg, event.stimgs, event.ends,
imgrate.Rating
FROM member_action act
LEFT JOIN member_img img ON (act.actionType="imgAdd" OR act.actionType="imgUpdate")
AND img.imgID=act.refID AND img.isActive AND img.isReady
LEFT JOIN member_blogpost blog ON (act.actionType="blogAdd" OR act.actionType="blogUpdate")
AND blog.id=act.refID AND blog.isPublished AND blog.isPublic
LEFT JOIN member_event event ON (act.actionType="eventAdd" OR act.actionType="eventUpdate")
AND event.id=act.refID AND event.isPublished
LEFT JOIN img_rating imgrate ON act.actionType="imgRate" AND imgrate.UserID=act.UserID AND imgrate.imgID=act.refID
LEFT JOIN member_favorite imgfav ON act.actionType="imgFavorite" AND imgfav.UserID=act.UserID AND imgfav.imgID=act.refID
LEFT JOIN img_comment imgcomm ON (act.actionType="imgComment" OR act.actionType="imgCommentReply") AND imgcomm.imgID=act.refID
LEFT JOIN blogpost_comment blogcomm ON (act.actionType="blogComment" OR act.actionType="blogCommentReply") AND blogcomm.blogPostID=act.refID
ORDER BY act.actionDate DESC
LIMIT XXXXX,20
OK, so basically, given that I'll be deleting actions older than 90 days every week or so... would it make sense to go with this query for displaying the member action history?
Or should I add a new text column to the member_actions table called [actionData], where I can store a few details in JSON or XML format for fast querying of the member_actions table?
It adds to the table size but reduces query complexity, and the table will be purged of old entries periodically.
The assumption is that eventually we'll have no more than a few 100k members, so should I be concerned about the size of the member_actions table with its text [actionData] column containing action-specific details?
I'm leaning towards the [actionData] model, but any recommendations or considerations would be appreciated.
Another consideration is that the table entries for an img or blog could get deleted... so I could have an action but no reference record... which certainly adds to the problem.
thanks in advance
Because you are dealing with user-interface latency, performance is key. All those joins take time, even with indexes. And querying the database is likely to lock records in all the tables (or indexes), which can slow down inserts.
So, I lean towards denormalizing the data, by maintaining the text in the record.
However, a key consideration is whether the text can be updated after the fact. That is, you will load the data when it is created. Can it then change? The problem of maintaining the data in light of changes (which could involve triggers and stored procedures) could introduce a lot of additional complexity.
If the data is static, this is not an issue. As for table size, I don't think you should worry about that too much. Databases are designed to manage memory: the table is maintained in a page cache, which should contain the pages for currently active members. You can always increase memory, and 100,000 users is well within the realm of today's servers.
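A sketch of that denormalized write path (stdlib sqlite3 and json; the snapshot fields are invented): capture the display details as a JSON snapshot when the action happens, so the feed query never has to join the content tables.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE member_actions (
        UserID INT, actionDate TEXT, actionType TEXT, refID INT,
        actionData TEXT  -- JSON snapshot captured at write time
    )
""")

def record_action(conn, user_id, when, action_type, ref_id, snapshot):
    # Store everything needed to render the feed item alongside the action.
    with conn:
        conn.execute(
            "INSERT INTO member_actions VALUES (?, ?, ?, ?, ?)",
            (user_id, when, action_type, ref_id, json.dumps(snapshot)))

record_action(conn, 7, "2024-05-01", "imgAdd", 99,
              {"Title": "Sunset", "FileName": "sunset.jpg"})

# Rendering the feed needs no joins - everything is in the snapshot,
# and it keeps working even if the referenced image is later deleted.
row = conn.execute(
    "SELECT actionType, actionData FROM member_actions "
    "ORDER BY actionDate DESC").fetchone()
print(row[0], json.loads(row[1])["Title"])  # imgAdd Sunset
```

This also sidesteps the dangling-reference problem raised in the question, at the cost of snapshots going stale if the source record is edited afterwards.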
I'd be wary of this approach - as you add kinds of actions you want to monitor, the join will keep growing (as will the sparse extra columns in the SELECT list).
I don't think it would be that scary to have a couple of extra columns in this table - and this query sounds like it would run fairly frequently, so making it efficient seems like a good idea.