coldfusion/jquery like dislike functionality mysql DB size optimization concern - mysql

We are only a few days away from launching our new website, so we have started putting the finishing touches on the development process. Although everything works great thanks to our effort to optimize every part of the application, my partner has raised a question about MySQL database size and a possible issue we could run into over time. More specifically, we built like/dislike functionality (CFC, jQuery) which works flawlessly but will significantly increase the DB size if we manage to attract a lot of visitors.
Here is our logic:
- every user can vote only once on one article (vote up or vote down), as we store the IP in the DB
- say 10,000 users come to the website and vote on 10 posts; that's 100,000 inserts (via stored procedure) and 100,000 rows in our database. Multiply that by 10, 100, or 1,000 and you get the picture.
The votes table has 4 columns (sketched below):
- typeID (voteUp = 1 and voteDown = 2)
- articleID
- IP
- vCount (we use it to SUM how many votes each article has)
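For reference, a declaration along these lines (the column types are approximate; the unique key is one way to enforce the one-vote-per-article-per-IP rule):

CREATE TABLE votes (
    typeID    TINYINT NOT NULL,      -- 1 = vote up, 2 = vote down
    articleID INT NOT NULL,
    IP        VARCHAR(45) NOT NULL,  -- long enough for IPv6
    vCount    INT NOT NULL DEFAULT 1,
    UNIQUE KEY one_vote_per_article (articleID, IP)
);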
Are we missing the point here? In your experience, what's the best approach to handle this type of functionality?

I'd say there's nothing wrong with your approach. Assuming your data storage capacity is not too limited, you should not run out of space for a long time.
You could of course use just one record per article, but that might create a bottleneck, since the record needs to be locked for updating every time a user votes.
What you might consider is adding a date/time field to your votes table to store when each vote was recorded. By creating an additional table with one row per article to keep track of the overall votes, you could query all votes that are older than, say, 12 months, update the new table accordingly, and delete the old votes from the votes table. Put that functionality into a scheduled task and you're done. That way you lose the IP information (after 12 months or whatever timespan you choose) but gain back some storage.
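A minimal MySQL sketch of that scheduled task, assuming a votedAt datetime column on votes and a totals table article_vote_totals with articleID as its primary key (both names are made up; adjust to your schema):

-- fix the cutoff once so the roll-up and the delete agree
SET @cutoff := NOW() - INTERVAL 12 MONTH;

-- fold old votes into the per-article totals
INSERT INTO article_vote_totals (articleID, upVotes, downVotes)
SELECT articleID, SUM(typeID = 1), SUM(typeID = 2)
FROM votes
WHERE votedAt < @cutoff
GROUP BY articleID
ON DUPLICATE KEY UPDATE
    upVotes   = upVotes   + VALUES(upVotes),
    downVotes = downVotes + VALUES(downVotes);

-- then drop the detail rows that were just rolled up
DELETE FROM votes WHERE votedAt < @cutoff;

When you need a total for display, you would add the live SUM from votes to the archived figure in article_vote_totals.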

Related

COUNT(*) WHERE vs. SELECT(*) WHERE performance

I am building a forum and I am trying to count all of the posts submitted by each user. Should I use COUNT(*) WHERE user_id = $user_id, or would it be faster if I kept a record of how many posts each user has each time he made a post and used a SELECT query to find it?
How much of a performance difference would this make? Would there be any difference between using InnoDB and MyISAM storage engines for this?
If you keep a record of how many posts a user has made, it will definitely be faster.
If you have an index on the user field of the posts table, you will also get decent query speeds, but it will start to hurt once your posts table is big enough. If you are planning to scale, then I would definitely recommend keeping a record of each user's post count in a dedicated field.
Storing precalculated values is a common, simple, and very effective kind of optimization.
So just add a column holding the number of posts the user has made, and maintain it with triggers or in your application (see the sketch below).
The performance difference is:
With COUNT(*) you always pay for an index lookup plus counting the results.
With the additional field you pay for an index lookup plus returning a single number that already is the answer.
And there will be no significant difference between MyISAM and InnoDB in this case.
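A minimal sketch of the trigger approach, assuming hypothetical tables posts(user_id) and users(id, post_count):

CREATE TRIGGER posts_after_insert AFTER INSERT ON posts
FOR EACH ROW
    UPDATE users SET post_count = post_count + 1 WHERE id = NEW.user_id;

CREATE TRIGGER posts_after_delete AFTER DELETE ON posts
FOR EACH ROW
    UPDATE users SET post_count = post_count - 1 WHERE id = OLD.user_id;

If posts can also change owner via UPDATE, you would need an AFTER UPDATE trigger as well.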
Store the post count. It seems that this is a scalability question, regardless of the storage engine. Would you recalculate the count each time the user submitted a post, or would you run a job to take care of this load somewhere outside of the webserver sphere? What is your post volume? What kind of load can your server(s) handle? I really don't think the storage engine will be the point of failure. I say store the value.
If you have the proper index on user_id, then COUNT(user_id) is trivial.
It's also the correct approach, semantically.
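For illustration, with names that are mine rather than from the question, that looks like:

CREATE INDEX idx_posts_user_id ON posts (user_id);

SELECT COUNT(user_id)
FROM posts
WHERE user_id = 42;   -- answered from the index range alone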
This is really one of those 'trade-off' questions.
Realistically, if your 'Posts' table has an index on the 'UserID' column and you only want to return the number of posts per user, then a query based on this column should perform perfectly well.
If you had another table, 'UserPosts' for example, yes, it would be quicker to query that table, but the real question is whether your 'Posts' table is really so large that you can't just query it for this count. The trade-off between the two approaches is obviously this:
1) with a separate audit table, there is overhead when adding or updating a post
2) without a separate audit table, there is overhead in querying the table directly
My gut instinct is always to design a system that records the data in a sensibly normalised fashion. I never create tables just because it might be quicker to GET some data for reporting purposes. I would only create them if the need arose and it was essential; then I would incorporate them.
At the end of the day, I think that unless your 'Posts' table is ridiculously large (i.e. more than a few million records), there should be no problem in querying it for a per-user count, presuming it is indexed correctly, i.e. with an index on the 'UserID' column.
If you're using this information purely for display purposes (i.e. user jonny has posted 73 times), then it's easy enough to get the info out of the DB once, cache it, and then update the cache when or if a change is detected.
Performance on post or performance on count? From a data-purist perspective a recorded count is not the same as an actual count. You can watch the front door of an auditorium, add the people that come in and subtract those that leave, but what if some sneak in the back door? What if you bulk delete a problem topic? If you record the count, then every post is slowed down by having to calculate and record the count.
For me data integrity is everything and I will COUNT(*) every time. I just did a test on a table with 31 million rows: a COUNT(*) on an indexed column where the value matched 424,887 rows took 1.4 seconds (on my P4 2 GB development machine; I intentionally underpower my development server so I get punished for slow queries. On the production 8-core 16 GB server that count takes less than 0.1 second).
You can never guard your data from unexpected changes or errors in your program logic. COUNT(*) is the count, and it is fast. If COUNT(*) is slow, you are going to have performance issues in other queries too.
There are a whole pile of trade-offs here, so no one can give you the right answer. But here's an approach no one else has mentioned:
You could use the 'select where' query, but cache the result in a higher layer (memcached, for example). So your code would look something like:
count = memcache.get('article-count-' + str(user_id))
if count is None:
    # cache miss: fall back to the database, then fill the cache
    count = database.execute('select ..... where user_id = %s', [user_id])
    memcache.put('article-count-' + str(user_id), count)
and you would also need, when a user makes a new post:
memcache.delete('article-count-' + str(user_id))
This will work best when the article count is read often but updated rarely. It combines the advantage of efficient caching with the advantage of a normalized database. But it is not a good solution if the article count is needed only rarely (in which case, is optimisation necessary at all?). Another unsuitable case is when someone's article count is needed often, but it is almost always a different person.
A further advantage of an approach like this is that you don't need to add the caching now. You can use the simplest database design and, if it turns out to be important to cache this data, add the caching later (without needing to change your schema).
More generally: you don't need to cache in your database. You could also put a cache 'around' your database. Something I have done with Java is to use caching at the iBATIS level, for example.

What is a more efficient way to keep a daily ranking log for each user with MySQL?

I have a database called RankHistory that is populated daily with each user's username and rank for the day (rank as in 1,2,3,...). I keep logs going back 90 days for every user, but my user base has grown to the point that the MySQL database holding these logs is now in excess of 20 million rows.
This data is recorded solely for the use of generating a graph showing how a user's rank has changed for the past 90 days. Is there a better way of doing this than having this massive database that will keep growing forever?
How great is the need for historic data in this case? My first thought would be to truncate data older than a certain threshold, or move it to an archive table that doesn't require as frequent or fast access as your current data.
You also mention keeping 90 days of data per user, but the data is only used to show a graph of changes to rank over the past 30 days. Is the extra 60 days' data used to look at changes over previous periods? If it isn't strictly necessary to keep that data (or at least not keep it in your primary data store, as per my first suggestion), you'd neatly cut the quantity of your data by two-thirds.
Do we have the full picture, though? If you have a daily record per user, and keep 90 days on hand, you must have on the order of a quarter-million users if you've generated over twenty million records. Is that so?
Update:
Based on the comments below, here are my thoughts: If you have hundreds of thousands of users, and must keep a piece of data for each of them, every day for 90 days, then you will eventually have millions of pieces of data - there's no simple way around that. What you can look into is minimizing that data. If all you need to present is a calculated rank per user per day, and assuming that rank is simply a numeric position for the given user among all users (an integer between 1 - 200000, for example), storing twenty million such records should not put unreasonable strain on your database resources.
So, what precisely is your concern? Sheer data size (i.e. hard-disk space consumed) should be relatively manageable under the scenario above. You should be able to handle performance via indexes, to a certain point, beyond which the data truncation and partitioning concepts mentioned can come into play (keep blocks of users in different tables or databases, for example, though that's not an ideal design...)
Another possibility is, though the specifics are somewhat beyond my realm of expertise, you seem to have an ideal candidate for an OLAP cube, here: you have a fact (rank) that you want to view in the context of two dimensions (user and date). There are tools out there for managing this sort of scenario efficiently, even on very large datasets.
Could you run an automated task like a cron job that checks the database every day or week and deletes entries that are more than 90 days old?
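For example, a nightly cron job (or a MySQL event) could run something along these lines, assuming the rank rows live in a table also called RankHistory with a rank_date column (the column name is my guess):

DELETE FROM RankHistory
WHERE rank_date < CURDATE() - INTERVAL 90 DAY;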
Another option: you can create a "roll-up" aggregate per user based on whatever the criteria is (counts, sales, whatever), all stored by user and date of activity. Then you could keep your pre-aggregated roll-ups in a much smaller table for however long a history you need. Triggers or a nightly procedure can run a query for the day and append the results to the daily summary. Then your queries and graphs can go against that without running into performance issues. This would also make it easier to move such records to a historical archive database.
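One way such a roll-up might look, for example compressing each week of daily ranks into one summary row per user (all of the table and column names here are hypothetical):

INSERT INTO weekly_rank_summary (user_id, week_start, best_rank, avg_rank)
SELECT user_id,
       CURDATE() - INTERVAL 7 DAY,
       MIN(user_rank),
       AVG(user_rank)
FROM rank_history
WHERE rank_date >= CURDATE() - INTERVAL 7 DAY
  AND rank_date <  CURDATE()
GROUP BY user_id;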
-- uh... oops... that's what it sounded like you WERE doing and STILL had 20 million+ records... is that correct? That would mean you're dealing with about 220,000+ users???
20,000,000 records / 90 days = about 222,222 users
EDIT -- from feedback.
Having 222k+ users, I would seriously question how important "Ranking" is when someone is in 222,222nd place. I would pare the daily ranking down to, say, the top 1,000. Again, I don't know the importance, but if someone doesn't make the top 1,000, does it really matter???

Strategy to handle large datasets in a heavily inserted into table

I have a web application that has a MySql database with a device_status table that looks something like this...
deviceid | ... various status cols ... | created
This table gets inserted into many times a day: 2,000+ rows per device per day, and we estimate having 100+ devices by the end of the year.
Basically this table gets a record when just about anything happens on the device.
My question is how should I deal with a table that is going to grow very large very quickly?
Should I just relax and hope the database will be fine in a few months, when this table has over 10 million rows, and then in a year, when it has 100 million rows? This is the simplest approach, but it seems like a table that large would have terrible performance.
Should I archive older data after some time period (a month, a week) and then have the web app query the live table for recent reports, and query both the live and archive tables for reports covering a larger time span?
Should I have an hourly and/or daily aggregate table that sums up the various statuses for a device? If I do this, what's the best way to trigger the aggregation: cron or a DB trigger? Also, I would probably still need to archive.
There must be a more elegant solution to handling this type of data.
I had a similar issue in tracking the number of views seen for advertisers on my site. Initially I was inserting a new row for each view, and as you predict here, that quickly led to the table growing unreasonably large (to the point that it was indeed causing performance issues which ultimately led to my hosting company shutting down the site for a few hours until I had addressed the issue).
The solution I went with is similar to your #3 solution. Instead of inserting a new record when a new view occurs, I update the existing record for the timeframe in question. In my case, I went with daily records for each ad. What timeframe to use for your app would depend entirely on the specifics of your data and your needs.
Unless you specifically need to track each occurrence over the last hour, you might be overdoing it by storing them all and aggregating later. Instead of bothering with a cron job to perform regular aggregation, you could simply check for an existing entry with matching specs; if you find one, you update a count field on that row instead of inserting a new row.
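In MySQL that check-and-update can be collapsed into a single upsert, assuming a hypothetical aggregate table device_status_daily with a unique key on (deviceid, status_code, status_date):

INSERT INTO device_status_daily (deviceid, status_code, status_date, occurrences)
VALUES (?, ?, CURDATE(), 1)
ON DUPLICATE KEY UPDATE occurrences = occurrences + 1;

That keeps the table to at most one row per device, status, and day, no matter how many events arrive.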

Event feed implementation - will it scale?

Situation:
I am currently designing a feed system for a social website whereby each user has a feed of their friends' activities. I have two possible methods for generating the feeds and I would like to ask which is better in terms of ability to scale.
Events from all users are collected in one central database table, event_log. Users are paired as friends in the table friends. The RDBMS we are using is MySQL.
Standard method:
When a user requests their feed page, the system generates the feed by inner joining event_log with friends. The result is then cached and set to time out after 5 minutes. Scaling is achieved by varying this timeout.
Hypothesised method:
A task runs in the background and, for each new, unprocessed item in event_log, creates entries in the database table user_feed, pairing that event with all of the users who are friends with the user who initiated the event. One table row pairs one event with one user.
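For concreteness, the fan-out for a single event might look something like this (the friends(user_id, friend_id) layout and the placeholder variables are just how I imagine it):

-- @event_id / @actor_id stand in for the unprocessed event being handled
INSERT INTO user_feed (user_id, event_id)
SELECT f.friend_id, @event_id
FROM friends f
WHERE f.user_id = @actor_id;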
The problems with the standard method are well known: what if a lot of people's caches expire at the same time? The solution also does not scale well, and the brief is for feeds to update as close to real time as possible.
The hypothesised solution seems much better to me; all processing is done offline, so no user waits for a page to be generated, and there are no joins, so the database tables can be sharded across physical machines. However, if a user has 100,000 friends and creates 20 events in one session, that results in 2,000,000 rows being inserted into the database.
Question:
The question boils down to two points:
Is this worst-case scenario mentioned above problematic, i.e. does table size have an impact on MySQL performance and are there any issues with this mass inserting of data for each event?
Is there anything else I have missed?
I think your hypothesised system generates too much data. Firstly, on the global scale, the storage and indexing requirements on user_feed escalate rapidly as your user base becomes larger and more interconnected (both presumably desirable for a social network). Secondly, consider what happens if, in the course of a minute, 1,000 users each post a new message and each has 100 friends: your background thread then has 100,000 inserts to do and might quickly fall behind.
I wonder if a compromise might be made between your two proposed solutions: a background thread updates a table last_user_feed_update which contains a single row per user and a timestamp for the last time that user's feed changed.
Then, although the full join and query would still be required to refresh the feed, a quick query against last_user_feed_update will tell you whether a refresh is needed at all. This seems to mitigate the biggest problems with your standard method as well as avoid the storage size difficulties, but that background thread still has a lot of work to do.
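A rough sketch of that check, assuming last_user_feed_update(user_id, last_changed) and a cached_at timestamp stored alongside the cached feed (the column names are mine):

SELECT last_changed > @cached_at AS needs_refresh
FROM last_user_feed_update
WHERE user_id = @user_id;

Only when needs_refresh comes back as 1 do you pay for the full event_log/friends join.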
The hypothesised method works better when you limit the maximum number of friends. A lot of sites set a safe upper bound, Facebook included IIRC. It limits the 'hiccups' that occur when your 100K-friend user generates activity.
Another problem with the hypothesised model is that some of the friends you are essentially pre-generating cache for may sign up and then hardly ever log in. This is a pretty common situation for free sites, and you may want to limit the cost these inactive users impose on you.
I've thought about this problem many times; it's not a problem MySQL is going to be good at solving. I've thought of ways I could use memcached, where each user pushes their latest few status items to "their key" (and when reading a feed you fetch and aggregate all your friends' keys)... but I haven't tested this. I'm not sure of all the pros and cons yet.

mysql amount of data per table

I'm designing a system, and by digging deep into the numbers, I realize that it could reach a point where there would be a table with 54,240,211,584 records/year (approximately). WOW!!!!
So I broke it down, and down again, to 73,271,952 records/year (approximately).
I got the numbers by building a spreadsheet that models what would happen if:
a) no success = 87 users,
b) low moderated success = 4300 users,
c) high moderated success = 13199 users,
d) success = 55100 users
e) incredible success = nah
Taking into account that the table is used for SELECT, INSERT, UPDATE & JOIN statements and that these statements would be executed by any user logged into the system hourly/daily/weekly (historical data is not an option):
Question 1: is the second quantity manageable for the MySQL engine, such that performance would suffer only a little?
Question 2: I set the table to InnoDB but, given the fact that I handle all of the statements with JOINs and that I'm likely to run into the 4GB limit problem, is InnoDB the right choice?
Quick overview of the tables:
table #1: user/event purchase. Up to 15 columns, some of them VARCHAR.
table #2: tickets per purchase. Up to 8 columns, only TINYINT, plus an INT primary key. 4 to 15 rows inserted for each table #1 insertion.
table #3: items per ticket. 4 columns, only TINYINT, plus an INT primary key. 3 rows inserted for each table #2 insertion. I want to keep it as a separate table, but if someone has to die...
Table #3 is the target of the question. The way I got down to the second quantity was by turning each table #3 row into a table #2 column.
Something that I dont want to do, but I would if necessary, is to partition the tables by week and add more logic to application.
Every answer helps, but it would be most helpful to get something like:
i) 33,754,240,211,584: No, so let's drop the last digit.
ii) 3,375,424,021,158: No, so let's drop the last digit.
iii) 337,542,402,115: No, so let's drop the last digit. And so on until we get something like "well, it depends on many factors..."
What would I consider "little performance impact"? Up to 1,000,000 records, it takes no more than 3 seconds to execute the queries. If 33,754,240,211,584 records take around 10 seconds, that's excellent to me.
Why don't I just test it myself? I don't think I'm capable of doing such a test properly; all I would do is insert that quantity of rows and see what happens. I would rather hear first from someone who has already dealt with something similar. Remember, I'm still in the design stage.
Thanks in advance.
54,240,211,584 is a lot. I only have experience with MySQL tables up to 300 million rows, and it handles those with little problem. I'm not sure what you're actually asking, but here are some notes:
Use InnoDB if you need transaction support, or are doing a lot of inserts/updates.
MyISAM tables are bad for transactional data, but OK if you're very read-heavy and only do bulk inserts/updates every now and then.
There's no 4GB limit with MySQL if you're using recent releases and a recent operating system. My biggest table is 211GB now.
Purging data in large tables is very slow, e.g. deleting all records for a month takes me a few hours. (Deleting single records is fast, though.)
Don't use INT/TINYINT keys if you're expecting many billions of records; they'll run out of range.
Get something working, and fix the scaling after the first release. An unrealized idea is pretty much useless; something that works (for now) might be very useful.
Test. There's no real substitute: your app and DB usage might be wildly different from someone else's huge database.
Look into partitioned tables, a recent feature in MySQL that can help you scale in many ways (a sketch follows these notes).
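For instance, a date-partitioned version of the ticket-items table might look like this (the created_at column and the BIGINT keys are my assumptions, not from the question):

CREATE TABLE ticket_items (
    id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    ticket_id   BIGINT UNSIGNED NOT NULL,
    item_type   TINYINT NOT NULL,
    created_at  DATE NOT NULL,
    PRIMARY KEY (id, created_at)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p2012_01 VALUES LESS THAN (TO_DAYS('2012-02-01')),
    PARTITION p2012_02 VALUES LESS THAN (TO_DAYS('2012-03-01')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
);

Dropping a whole partition (ALTER TABLE ... DROP PARTITION) is then a near-instant way to purge old data, which also addresses the slow-delete note above.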
Start at the level you're at. Build from there.
There are plenty of people out there who will sell you services you don't need right now.
If $10/month shared hosting isn't working anymore, then upgrade, and eventually hire someone to help you get around the record limitations of your DB.
There is no 4GB limit, but of course there are limits. Don't plan too far ahead. If you're just starting up and you plan to be the next Facebook, that's great, but you have no resources.
Get something working so you can show your investors :)