How to store / retrieve large amounts of data within XPages? - mysql

Currently we're working on a solution where we want to track (for analysis) the articles a user clicks on/opens and 'likes' from a given list of articles. Subsequently, the user needs to be able to see and re-open the articles (searching is not needed) in a section of his/her personal user profile. Around 100 new articles are posted every day. The (increasing!) number of daily visitors (users) is around 2,000. The articles are currently stored and maintained in a MySQL DB.
We could create a new record in the MySQL DB for every article read / 'liked'. Worst case, this would create (2,500 * 100 =) 250,000 records a day. That won't hold up for long, of course… So how would you store (process) this within XPages, given the scenario?
My thoughts after reading "the article" :) about MIME/Beans: what about keeping 'read articleObjects' in a scope and (periodically) storing/saving them as MIME on the user profile document? This only creates 100 articleObjects a day (or 36,500 a year). In addition, one could come up with a mechanism where articleObjects are shifted from one field to another as time passes, so the active scope would only contain the 'read articleObjects' from the last month or so.

I would say that this is exactly what a relational database is for. My first approach would be to have a managed bean (session scope) to read/access the user's data in MySQL (JDBC). If you want, you can build an internal cache inside the bean.
For the presented use case, I would not bother with the JDBC datasources in the ExtLib. Perhaps even the @Jdbc functions would suffice.
Also, you did not say how you are doing the analysis. If you store the information in Domino, you'll probably have to write an export tool.
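To make that concrete, here is a minimal sketch of such a session-scoped bean in Java. The user_articles table, its columns and the JDBC URL/credentials are assumptions for illustration only; in an XPages application the bean would be registered with session scope in faces-config.xml.

import java.io.Serializable;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Sketch only: session-scoped managed bean that records and reads a user's
// article activity over JDBC, with a simple in-bean cache.
public class UserArticlesBean implements Serializable {
    private static final long serialVersionUID = 1L;
    private static final String JDBC_URL = "jdbc:mysql://dbhost/articles"; // assumption
    private List<Integer> cachedIds; // cached article ids for this session

    // Record a read/like as one row; one small insert per click.
    public void track(String userId, int articleId, String action) throws SQLException {
        Connection con = DriverManager.getConnection(JDBC_URL, "appuser", "secret");
        try {
            PreparedStatement ps = con.prepareStatement(
                "INSERT INTO user_articles (user_id, article_id, action) VALUES (?, ?, ?)");
            ps.setString(1, userId);
            ps.setInt(2, articleId);
            ps.setString(3, action);
            ps.executeUpdate();
        } finally {
            con.close();
        }
        cachedIds = null; // invalidate the cache after a write
    }

    // Return the user's article ids, hitting MySQL only on the first call per session.
    public List<Integer> getArticleIds(String userId) throws SQLException {
        if (cachedIds != null) {
            return cachedIds;
        }
        List<Integer> ids = new ArrayList<Integer>();
        Connection con = DriverManager.getConnection(JDBC_URL, "appuser", "secret");
        try {
            PreparedStatement ps = con.prepareStatement(
                "SELECT article_id FROM user_articles WHERE user_id = ? ORDER BY created_at DESC");
            ps.setString(1, userId);
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                ids.add(rs.getInt(1));
            }
        } finally {
            con.close();
        }
        cachedIds = ids;
        return ids;
    }
}

With an index on user_id, a table growing by a few hundred thousand rows a day is well within what MySQL/InnoDB handles routinely.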

Related

Concurrent inserts mysql - calling same insert stored proc before the first set of inserts is completed

I am working on a social networking site, which includes the creation of media content and also records the interaction of users with the created content.
Background of issue - approach used currently
There is a page called news-feed, which displays the content, and the activity done with that content, by the users they are following on the site.
The display order of the content changes with more and more user interaction (e.g. if there are more comments on a post, it is likely to be shown above one with fewer comments; however, the number of comments is just one of the attributes used to rank the post).
I am using a MySQL (InnoDB) database to store the data, as follows:
activity_master: the activities allowed to be part of the news feed (post, comment, etc.)
activity_set: for aggregation of activities on the same object
activity_feed: details of the actual activity
A detailed ER diagram is at the end of the question.
Scenario
A user (with 1,000 followers) posts something, which initiates an async call to the procedure to insert the relevant entries (1,000 rows, one per follower) into the above-mentioned tables.
Some followers start commenting (an activity allowed to be part of the news feed) before the above call is completed, which initiates another call to the same procedure to insert entries (x rows, one for each of their own followers) of this activity for their particular set of followers (e.g. user B commented on this post).
All the insert requests (which seem way too many) will have to be processed in a queue by the InnoDB engine.
Questions
Is there a better and more efficient way to do this? (I definitely think there would be one.)
How many insert requests can InnoDB handle in its default configuration?
How can deadlocks (or resource congestion at the database end) be avoided in this case?
Or is there any other type of database better suited to this case?
Thanks for showing your interest by reading the description; any sort of help in this regard is much appreciated. Let me know if any further details are required. Thanks in advance!
ER diagram of the tables (not enough reputation to embed the image directly :( )
A rule of thumb: "Don't queue it, just do it".
Inserting 1000 rows is likely to be untenable. Tomorrow, it will be 10000.
Can't you do the processing on the select side instead of the insert side?
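To illustrate the select side: store each activity exactly once and assemble a follower's feed with a query at read time. The follows table and the actor_id/created_at columns below are assumptions (they do not appear in the question's schema), so treat this purely as a sketch.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Sketch: build a follower's feed at read time instead of inserting one
// activity_feed row per follower at write time.
public class FeedOnRead {
    public static List<Long> latestFeed(Connection con, long followerId) throws SQLException {
        // follows(follower_id, followed_id) is a hypothetical follower table.
        PreparedStatement ps = con.prepareStatement(
            "SELECT a.id " +
            "FROM activity_feed a " +
            "JOIN follows f ON f.followed_id = a.actor_id " +
            "WHERE f.follower_id = ? " +
            "ORDER BY a.created_at DESC " +
            "LIMIT 50");
        ps.setLong(1, followerId);
        ResultSet rs = ps.executeQuery();
        List<Long> activityIds = new ArrayList<Long>();
        while (rs.next()) {
            activityIds.add(rs.getLong(1));
        }
        return activityIds;
    }
}

The write path then shrinks to a single insert per post or comment, and ranking (e.g. by comment count) can be layered on with an extra join or a denormalised counter.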

MYSQL - best Data Structure

I'm currently developing an application for Windows, Linux and Mac. The purpose of the application is that multiple users are able to create projects based on a single article. Every article has up to 15 different fields/options (could also be more in the future). The fields of the article should be changeable, so I should be able to add, edit or remove them.
Fields I want to store:
Numbers
Texts (mostly options [one word], sometimes comments [a few sentences])
Path/Links to Files
What I want to do with the DB:
load all projects of a user at login
add, edit, remove, delete single projects
set a lock on projects (because multiple people are operating one user account at the same time, they may not be allowed to edit a project simultaneously; if one starts editing, it should be locked until he saves, cancels or times out)
What is the best way to manage this kind of Data?
Should I create a table for each user, with only an ID column and one column in which all the values of all the fields are merged into one big string?
Should I create tables for every project and make columns for every field/option, and also one for the user/owner?
Or are there any other possibilities?
If you don't know what you are going to store, then I doubt whether a relational database is the best option for you. Maybe a document store / NoSQL database is a better choice, because you can just store documents (usually in the form of JSON objects) that can have all kinds of additional fields.
A couple of such databases to look at are MongoDB, Cassandra and Elasticsearch, but you can find a big list on Wikipedia.
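For a feel of what that looks like in practice, here is a hedged sketch with the MongoDB Java driver: each project is one document, and fields can be added or dropped per document without schema changes. The database, collection and field names are made up for the example.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

// Sketch: one document per project; the set of fields can vary freely between documents.
public class ProjectStoreSketch {
    public static void main(String[] args) {
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        MongoCollection<Document> projects =
            client.getDatabase("appdb").getCollection("projects");

        Document project = new Document("owner", "user42")
            .append("article", "article-123")
            .append("fields", new Document("color", "red")
                .append("width", 15)
                .append("datasheet", "/files/article-123.pdf"));
        projects.insertOne(project);

        client.close();
    }
}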

Using Redis as a Key/Value store for activity stream

I am in the process of creating a simple activity stream for my app.
The current technology layer and logic is as follows:
All data relating to an activity is stored in MySQL, and an array of all activity IDs is kept in Redis for every user.
A user performs an action; the activity is stored directly in an 'activities' table in MySQL, and a unique 'activity_id' is returned.
An array of this user's 'followers' is retrieved from the database, and for each follower I push this new activity_id into their list in Redis.
When a user views their stream, I retrieve the array of activity IDs from Redis based on their user ID. I then perform a simple MySQL WHERE IN($ids) query to get the actual activity data for all these activity IDs.
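A minimal sketch of the fan-out and the stream read described above, using Jedis and JDBC; the "stream:<userId>" key scheme and the created_at column are assumptions made for the example.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Collections;
import java.util.List;
import redis.clients.jedis.Jedis;

// Sketch: push a new activity id to each follower's Redis list (fan-out on write),
// then read a user's stream by fetching the ids and resolving them in MySQL.
public class ActivityStreamSketch {

    // Fan the new activity id out to every follower's list.
    public static void fanOut(Jedis redis, long activityId, List<Long> followerIds) {
        for (long followerId : followerIds) {
            String key = "stream:" + followerId;      // assumed key scheme
            redis.lpush(key, String.valueOf(activityId));
            redis.ltrim(key, 0, 999);                 // cap the list at 1000 entries
        }
    }

    // Read the stream: ids from Redis, rows from MySQL via WHERE IN.
    public static ResultSet readStream(Jedis redis, Connection con, long userId) throws SQLException {
        List<String> ids = redis.lrange("stream:" + userId, 0, 49);
        if (ids.isEmpty()) {
            return null;
        }
        // Build "?,?,?" placeholders; the ids are numeric strings we wrote ourselves.
        String placeholders = String.join(",", Collections.nCopies(ids.size(), "?"));
        PreparedStatement ps = con.prepareStatement(
            "SELECT * FROM activities WHERE id IN (" + placeholders + ") ORDER BY created_at DESC");
        for (int i = 0; i < ids.size(); i++) {
            ps.setLong(i + 1, Long.parseLong(ids.get(i)));
        }
        return ps.executeQuery();
    }
}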
This kind of setup should, I believe, be quite scalable, as the queries will always be very simple IN queries. However, it presents several problems.
Removing a follower - If a user stops following someone, we need to remove all activity_ids that correspond to that user from their Redis list. This requires looping through all IDs in the Redis list and removing the ones that correspond to the removed user. This strikes me as quite inelegant; is there a better way of managing this?
'Archiving' - I would like to keep the Redis lists to a maximum length of, say, 1000 activity_ids, as well as frequently prune old data from the MySQL activities table to prevent it from growing to an unmanageable size. Obviously this can be achieved by removing old IDs from the user's stream list when we add a new one. However, I am unsure how to go about archiving this data so that users can view very old activity data should they choose to. What would be the best way to do this? Or am I simply better off enforcing this limit completely and preventing users from viewing very old activity data?
To summarise: what I would really like to know is whether my current setup/logic is a good or bad idea. Do I need a total rethink? If so, what are your recommended models? If you feel all is okay, how should I go about addressing the two issues above? I realise this question is quite broad and all answers will be opinion-based, but that is exactly what I am looking for: well-formed opinions.
Many thanks in advance.
Issue 1 doesn't seem so difficult to handle (no looping):
DELETE Redis FROM Redis
JOIN activities ON Redis.activity_id = activities.id
  AND activities.user_id = 2
  AND Redis.user_id = 1;
For issue 2, I'm not really sure about archiving. You could create archive tables per period and move old activities from the main table into them periodically. A single, properly normalized activity table ought to be able to get pretty big, though. (Make sure any "large" activity stores its payload in a separate table; the main activity table should be "narrow", since it's expected to have a lot of entries.)
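A rough sketch of that periodic move, assuming an activities_archive table with the same structure as activities and a created_at column (both assumptions); the copy and the delete run in one transaction so rows are never lost or duplicated:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch: move activities older than 90 days from the main table to an archive table.
public class ArchiveOldActivities {
    public static void archive(Connection con) throws SQLException {
        boolean oldAutoCommit = con.getAutoCommit();
        con.setAutoCommit(false);
        try {
            PreparedStatement copy = con.prepareStatement(
                "INSERT INTO activities_archive " +
                "SELECT * FROM activities WHERE created_at < NOW() - INTERVAL 90 DAY");
            copy.executeUpdate();

            PreparedStatement purge = con.prepareStatement(
                "DELETE FROM activities WHERE created_at < NOW() - INTERVAL 90 DAY");
            purge.executeUpdate();

            con.commit();   // both statements succeed or neither does
        } catch (SQLException e) {
            con.rollback();
            throw e;
        } finally {
            con.setAutoCommit(oldAutoCommit);
        }
    }
}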

MySQL and Scheduled Updates by User Preference?

I'm developing an application that
stores an e-mail address (for a user) in a table.
stores the number of days the user would like to stay in the table.
takes the user off the table when the number of days is up.
I don't really know how to approach this, so here are my questions:
Each second, do I have the application check through every table entry for the time that's currently stored in, let's say, the time_left column?
Wouldn't (1) be inefficient if I'm expecting a significant number (10,000+) users?
If not (2), what's the best algorithm to implement for such a task?
What's the name of what I'm trying to do here? I'd like to do some more research on it before and while I'm writing the script, so I need a good search query to start with.
I plan on writing this script in Perl, although I'm open to suggestions with regards to language choice, frameworks, etc... I'm actually new to web development (both on the back-end and front-end), so I'd appreciate it if you could advise me precisely.
Thank you!
After posting, Topener asked a valid question:
Why would you store users if they won't get requested?
Assume the user is just sitting in the database.
Let's say I'm using the user's e-mail address every 5 minutes from the time the user was added to the database (so if the user's entry was born at 2:00PM-October 18, the user would be accessed at 2:05, 2:10, etc...).
If the user decides that they want out of the database in 10 days, that means their entry is being accessed normally (every 5 minutes from 2:00PM-October 18) until 2:00PM-October 28.
So to clarify, based on this situation:
The system would have to constantly compare the current time with the user's expiration date, wouldn't it?
You should not store the time_left variable; you should store validTo instead. This way, whenever the user is requested from the database, you can check whether it is still valid.
If not, then do whatever you want with it.
This approach won't require any cron jobs, which would otherwise cost you extra load.
Hey Mr_spock, I like the above answer from Topener. Instead of storing the number of days the user would like to stay valid, store the day the user would like to be removed.
By adding a field like validToDate, which would be of the DATETIME type, you can run a query like
DELETE FROM tablename WHERE validToDate <= NOW()
where
the line above is a SQL query
tablename is the name of the table in question
NOW() is a valid SQL function that returns the current DATETIME
validToDate is a field of type DATETIME
This has whatever efficiency the SQL server promises; I think it is fairly good.
You could write a separate program/script which makes the delete query on a set interval. If you are on a Linux machine you can create a cron job to do it. Doing it every second may become very resource intensive for slower machines and larger tables, but I don't believe that will become an issue for a simple delete query.
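If you go the separate-script route, here is a hedged sketch (in Java rather than Perl, purely for illustration) of a task that runs the delete once a minute; the JDBC URL and credentials are placeholders. MySQL's own event scheduler (CREATE EVENT) can run the same DELETE inside the database if you prefer not to run an external job.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch: run the expiry delete on a fixed interval instead of checking every second.
public class ExpiryJob {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost/appdb", "appuser", "secret");
                 PreparedStatement ps = con.prepareStatement(
                     "DELETE FROM tablename WHERE validToDate <= NOW()")) {
                int removed = ps.executeUpdate();
                System.out.println("Expired rows removed: " + removed);
            } catch (Exception e) {
                e.printStackTrace();  // keep the scheduler alive even if one run fails
            }
        }, 0, 1, TimeUnit.MINUTES);
    }
}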

Data Model for Profile Page Views

Say I have a site with user profiles that have publicly accessible pages (each profile has several pages each). I'd like to show the users page view statistics (e.g. per page, for a certain time period, etc.). What's a good way to store page views?
Here's what I was thinking:
Table Page Views
================
- Id (PK)
- Profile Id (FK)
- Page Id (FK)
- Timestamp
I'm afraid this solution won't scale. Suggestions?
Your intuition is correct: writing to a database doesn't scale particularly well. You want to avoid a database transaction for each page request.
That noted, is scaling really your concern? If so, and assuming an Internet site (as opposed to an intranet), skip rolling your own and collect the hit data with Google Analytics or something similar. Then take that data and process it to generate totals per profile.
However, if you're really hellbent on doing it yourself, consider log parsing instead. If you can enumerate the URLs per profile, use that information, and your web server logs, to generate hit totals. Tools such as Microsoft's Log Parser, which can process A LOT of different formats, or *nix command line tools like sed and grep are your friends here.
If enumeration's not possible, change the code to log the information you need and process that log file.
With logs in place, generate results using a batch process and insert those results into a database using MySQL's LOAD DATA.
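For instance, once the log-parsing batch job has written per-page totals to a CSV file, the bulk load might look like the sketch below. The file path, table and column names are assumptions, and LOAD DATA INFILE reads a file that lives on the database server and requires the FILE privilege.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch: bulk-load pre-aggregated hit totals produced by the log-parsing batch job.
public class LoadHitTotals {
    public static void main(String[] args) throws Exception {
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/stats", "statsuser", "secret");
        Statement st = con.createStatement();
        // /var/stats/hits.csv is written by the batch job as: profile_id,page_id,view_date,views
        st.execute(
            "LOAD DATA INFILE '/var/stats/hits.csv' " +
            "INTO TABLE page_view_totals " +
            "FIELDS TERMINATED BY ',' " +
            "LINES TERMINATED BY '\\n' " +
            "(profile_id, page_id, view_date, views)");
        con.close();
    }
}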
Final note on the roll-your-own approach I've recommended: this will scale a lot better in a clustered environment than a database transaction per request would.
It depends on what kind of reports you want to make available.
If you want to be able to say "this is the list of people that viewed your page between these two dates", then you must store all the data you proposed.
If you only need to be able to say "your page was viewed X times between these two dates", then you only need a table with a page ID, date, and counter. Update the counter column on each page view with a single UPDATE query.
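That single query can be an upsert, so the first view on a given day creates the row and every later view just increments the counter. The page_view_counts table and its unique key on (page_id, view_date) are assumptions for this sketch.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch: one row per page per day, updated in place on every view.
public class PageViewCounter {
    public static void recordView(Connection con, long pageId) throws SQLException {
        PreparedStatement ps = con.prepareStatement(
            "INSERT INTO page_view_counts (page_id, view_date, views) " +
            "VALUES (?, CURDATE(), 1) " +
            "ON DUPLICATE KEY UPDATE views = views + 1");
        ps.setLong(1, pageId);
        ps.executeUpdate();
    }
}

Answering "your page was viewed X times between these two dates" is then a SUM over the matching date range.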
I suppose you can have
tblPerson
personid(pk)
activeProfileID(fk) -- the active profile to use.
timestamp
tblPage
pageid(pk)
data
tblPersonProfile
profileID(pk)
timestamp
tblProfilePages
profilePageID(pk)
profileid(pk)
pageid(pk)
isActive