What's the best way to count views on a high-traffic site? - mysql

The way I currently do it in MySQL is:
UPDATE table SET hits=hits+1 WHERE id = 1;
This keeps live stats on the site, but as I understand it, this isn't the best way to go about it.
Edit:
Let me clarify: this is for counting hits on specific item pages. I have a listing of movies, and I want to count how many views each movie page has gotten. After it +1s, it adds the movie ID to a session var, which stores the IDs of all the pages the user has viewed. If the ID of the page is in that array, it won't +1 it.

If your traffic is high enough, you shouldn't hit the database on every request. Try keeping the count in memory and sync the database on a schedule (for example, update the database on every 1000 requests or every minute.)
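As a rough sketch of the idea above (not a drop-in implementation): hits are accumulated in memory and handed to a flush callback once a request threshold or a time limit is reached. The class name BufferedHitCounter, its parameters, and the flush callback are all made up for illustration; in a real app the callback would run something like UPDATE table SET hits = hits + ? WHERE id = ? for each entry. Note that pending counts are lost if the process dies before a flush.

```python
import threading
import time
from collections import Counter


class BufferedHitCounter:
    """Accumulates hits in memory and flushes them in batches (illustrative only)."""

    def __init__(self, flush, max_pending=1000, max_age_seconds=60.0):
        self._flush = flush                  # callable taking {item_id: hit_delta}
        self._max_pending = max_pending      # flush after this many buffered hits...
        self._max_age = max_age_seconds      # ...or once this much time has passed
        self._pending = Counter()
        self._count = 0
        self._last_flush = time.monotonic()
        self._lock = threading.Lock()

    def hit(self, item_id):
        with self._lock:
            self._pending[item_id] += 1
            self._count += 1
            due = (self._count >= self._max_pending
                   or time.monotonic() - self._last_flush >= self._max_age)
            if not due:
                return
            batch = dict(self._pending)
            self._pending.clear()
            self._count = 0
            self._last_flush = time.monotonic()
        # Flush outside the lock so a slow database write does not block other hits.
        self._flush(batch)


# Demo: the "database" is just a list that records each flushed batch.
flushed = []
counter = BufferedHitCounter(flushed.append, max_pending=3, max_age_seconds=3600)
for movie_id in (1, 1, 2, 1):
    counter.hit(movie_id)
```

With max_pending=3, the first three hits are flushed as one batch and the fourth stays buffered until the next flush.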

You could take an approach similar to Stack Overflow's view count. This basically increments the counter when the image is loaded. This has three useful aspects:
Robots often don't download images, so these don't increment the view.
Browsers cache images, so when you go back to a page, you're not causing work for the server.
The potentially slower code is run async from the rest of the page. This doesn't slow down the page from being visible.
To optimize the updates:
* Keep the counter in a single, narrow table, with a clustered index on the key.
* Have the table served up by a different database server / host.
* Use memcached and/or a queue to allow the write to either be delayed or run async.
If you don't need to display the view count in real time, then your best bet is to include the movie id in your URL somewhere, and use log scraping to populate the database at the end of the day.
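A minimal sketch of the log scraping idea, assuming a common access-log format and a hypothetical URL scheme of /movie/<id> (neither is given in the question, so adjust the pattern to your real routes and log format):

```python
import re
from collections import Counter

# Assumption: movie pages live at /movie/<id>. The trailing character class
# stops /movie/42 from also matching paths like /movie/42abc.
MOVIE_REQUEST = re.compile(r'"GET /movie/(\d+)[ /?]')


def count_movie_views(log_lines):
    """Return a Counter mapping movie id -> number of matching requests."""
    views = Counter()
    for line in log_lines:
        match = MOVIE_REQUEST.search(line)
        if match:
            views[int(match.group(1))] += 1
    return views


# Made-up sample lines; in practice you would iterate over the log file itself.
sample = [
    '203.0.113.5 - - [29/May/2009:10:00:01 +0000] "GET /movie/42 HTTP/1.1" 200 512',
    '203.0.113.9 - - [29/May/2009:10:00:02 +0000] "GET /movie/42?ref=home HTTP/1.1" 200 512',
    '203.0.113.5 - - [29/May/2009:10:00:03 +0000] "GET /movie/7 HTTP/1.1" 200 498',
    '203.0.113.5 - - [29/May/2009:10:00:04 +0000] "GET /style.css HTTP/1.1" 200 90',
]
daily_views = count_movie_views(sample)
```

The resulting counts can then be written to the database in one batch at the end of the day.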

Not sure which web server you are using.
If your web server logs the requests to the site, say one line per request in a text file, then you could just count the lines in your log files.
Your solution has a major problem: each UPDATE locks the row (or the whole table, on MyISAM), so concurrent requests for the same movie are processed one at a time.

It really depends on whether you want hits or views:
1 view from 1 IP = 1 person looking at a page
1 person refreshing the same page = multiple hits but only one view
I always prefer Google Analytics etc. for something like this. You need to make sure that this DB update is only done once, or you could quite easily be flooded.
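To make the hits/views distinction concrete, here is a small sketch that counts both from a list of (ip, page_id) pairs. The function name and the sample data are made up, and treating one IP as one person is the simplification used above, not a reliable way to identify visitors.

```python
def summarize(requests):
    """requests: iterable of (ip, page_id). Returns {page_id: (hits, views)}."""
    hits = {}
    viewers = {}
    for ip, page_id in requests:
        hits[page_id] = hits.get(page_id, 0) + 1          # every request is a hit
        viewers.setdefault(page_id, set()).add(ip)        # each IP counts once per page
    return {page: (hits[page], len(viewers[page])) for page in hits}


traffic = [
    ("198.51.100.1", 1),
    ("198.51.100.1", 1),   # same person refreshing: another hit, same view
    ("198.51.100.2", 1),
    ("198.51.100.1", 2),
]
stats = summarize(traffic)
```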

I'm not sure what you're using, but on Google App Engine you could set a cron job to automatically update the count every x minutes. I think you'd use memcache to hold the counts until your cron job runs. GAE does have some stat reporting, but you'd probably want to have your own data as well. I think you can use memcache and set up cron jobs on other systems too.

Use logging software. Google Analytics is pretty and feature-filled (and generates zero load on your servers), but it'll miss non-JavaScript hits. If every single hit is important, use a server log analyzer like webalizer or awstats.

In general with MySQL:
If you use a MyISAM table: there is a lock on the table, so you are better off doing an INSERT into a separate table. Then, with a cron job, you UPDATE the values in your movie table.
If you use an InnoDB table: there is a lock on the row, so you can UPDATE the value directly.
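The MyISAM variant could look roughly like the sketch below. It uses Python's built-in sqlite3 purely as a stand-in so the example is runnable; the table names movies and hit_log and both function names are assumptions. The roll-up only touches rows up to the highest id it saw, so hits inserted while it runs are kept for the next run even without transactions.

```python
import sqlite3

# sqlite3 stands in for MySQL here; table and column names are hypothetical.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE movies (id INTEGER PRIMARY KEY, hits INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE hit_log (id INTEGER PRIMARY KEY AUTOINCREMENT, movie_id INTEGER NOT NULL);
    INSERT INTO movies (id) VALUES (1);
    INSERT INTO movies (id) VALUES (2);
""")


def record_hit(movie_id):
    # Request path: append-only INSERT, the movies table is not touched.
    db.execute("INSERT INTO hit_log (movie_id) VALUES (?)", (movie_id,))
    db.commit()


def roll_up_hits():
    # Cron path: fold logged hits into the counters, then drop the processed rows.
    (last_id,) = db.execute("SELECT MAX(id) FROM hit_log").fetchone()
    if last_id is None:
        return
    rows = db.execute(
        "SELECT movie_id, COUNT(*) FROM hit_log WHERE id <= ? GROUP BY movie_id",
        (last_id,)).fetchall()
    db.executemany("UPDATE movies SET hits = hits + ? WHERE id = ?",
                   [(count, movie_id) for movie_id, count in rows])
    db.execute("DELETE FROM hit_log WHERE id <= ?", (last_id,))
    db.commit()


for movie_id in (1, 1, 2, 1):
    record_hit(movie_id)
roll_up_hits()
totals = dict(db.execute("SELECT id, hits FROM movies"))
```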
That said, depending on the "maturity" and success of your project, you may need to implement a different solution, so:
1st advice: benchmark, benchmark, benchmark.
2nd advice: Using the data from the 1st advice, identify the bottleneck and select a solution for the issue you actually face, not for a future issue you think you might have.
Here is a great video on this: http://www.youtube.com/watch?v=ZW5_eEKEC28
Hope this helps. :)

Related

Implementing a quota system to limit requests in a web based app

I want to limit my users to 25k/requests per hour/day/whatever.
My first idea was to simply use MySQL and have a column in the users table where I would store the request count, incrementing it each time the user makes a request. The problem with this approach is that sometimes you end up writing to the column at the same time and get a deadlock from MySQL, so this isn't really a good way to go about it, is it?
Another way would be, instead of incrementing a counter column, to insert log records in a separate table and then count these records for a given timespan. But this way you can easily end up with a table of millions of records, and the query can be too slow.
When using a RDBMS another aspect to take into consideration is that at each request you'd have to count the user quota from database and this can take time depending on either of the above mentioned methods.
My second idea was to use something like Redis/memcached (not sure of alternatives or which one of them is faster) and store the request counters there. This would be fast enough to query and increment the counters, for sure faster than an RDBMS, but I haven't tried it with huge amounts of data, so I am not sure how it will perform just yet.
My third idea was to keep the quota data in memory in a map, something like map[int]int where the key would be the user_id and the value would be the quota usage, and protect the map access with a mutex. This would be the fastest solution of all, but what do you do if for some reason your app crashes? You lose all the data about the number of requests each user made. One way would be to catch the crash, loop through the map, and update the database. Is this feasible?
Not sure if any of the above is the right approach, but I am open to suggestions.
I'm not sure what you mean by "get a deadlock from mysql" when you try to update a row at the same time. But a simple UPDATE rate_limit SET count = count + 1 WHERE user_id = ? should do what you want.
Personally I have had great success with Redis for doing rate limiting. There are lots of resources out there to help you understand the appropriate approach for your use case. Here is one I just glanced at that seems to handle things correctly: https://www.binpress.com/tutorial/introduction-to-rate-limiting-with-redis/155. Using pipelines (MULTI) or Lua scripts may make things even nicer.
You can persist your map[int]int to an RDBMS or just to the file system from time to time and in a deferred function. You can even use it as a cache instead of Redis. It will surely be faster than connecting to a third-party service on every request. You can also store counters on the user's side, simply in cookies. A smart user can of course clear cookies, but is that really so dangerous? You can also put some identification info in the cookies to make clearing them inconvenient.
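For illustration, here is a sketch of the in-memory map idea with a mutex, written in Python rather than Go. It uses a simple fixed window (the counter resets when a new window starts), and snapshot() returns a copy that could be written to disk or a database from time to time. The class name, method names, and the injectable clock are all made up for this example; the Redis approach mentioned above follows the same counting logic with the state kept in Redis instead.

```python
import threading
import time


class QuotaTracker:
    """Fixed-window request quota kept in memory (illustrative sketch)."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self._limit = limit
        self._window = window_seconds
        self._clock = clock                 # injectable so the demo can control time
        self._usage = {}                    # user_id -> (window_index, count)
        self._lock = threading.Lock()

    def allow(self, user_id):
        window_index = int(self._clock() // self._window)
        with self._lock:
            index, count = self._usage.get(user_id, (window_index, 0))
            if index != window_index:
                count = 0                   # a new window started: reset the counter
            if count >= self._limit:
                return False
            self._usage[user_id] = (window_index, count + 1)
            return True

    def snapshot(self):
        # Copy of the current state, suitable for periodic persistence.
        with self._lock:
            return dict(self._usage)


# Demo with a fake clock: 2 requests per hour for user 7.
now = [0.0]
quota = QuotaTracker(limit=2, window_seconds=3600, clock=lambda: now[0])
results = [quota.allow(7) for _ in range(3)]
now[0] = 3600.0                             # one hour later
results.append(quota.allow(7))
```

Persisting the snapshot every few seconds bounds how much usage data a crash can lose, at the cost of occasionally letting a user slightly exceed the quota after a restart.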

Should I fetch mysql data in bulks, or fetch as few as possible?

I'm not so sure about the correct approach when using MySQL. When starting with a big website, I used to load articles with all their info using one function, load_articles. I loaded all articles that were supposed to be somehow displayed.
However, often only some of the article info was used, for example only the title or only the image icon. So I created an object-oriented model, where Article uses MySQL to fetch the properties when needed, using __get and ArrayAccess. This results in a higher number of queries in general, but reduces the amount of data fetched from MySQL.
Of course, the ideal approach would be to buffer the "data needed" and then send one query. But if this is too complicated for me, where should I aim?
Bulk fetch all data that may be needed and discard the unnecessary data - reducing the number of queries
Lazy-load the individual properties as they're needed when generating the page - fetching little data with many queries
If the second is better, should I go as far as not using SELECT * and instead have multiple selects for individual properties, as they are needed?
First of all, the answer depends entirely on how your web page is loaded, what your users require, and what your SLAs are.
Suppose your page has 5 elements on it; then your two options behave as follows.
Fetch bulk data, store it locally, and load it
This is a good approach when your user needs to see all the data at once, or when something very computational is required at the user's end. Even in this case, fetch only the required attributes. Avoid SELECT *, which fetches columns you do not need.
Check your network bandwidth while transferring data and, if possible, use a CDN if you have many images or static data.
Fetch only base data first, then fetch more data according to what the user needs
This is a good approach when your user generally wants to see only the first section of the web page, or will be happy to see at least the first section on screen within 1 second.
You can then gradually load more data as the user scrolls down and performs some activity.
This way you save memory on the app server and the CPU cycles spent processing bulk data. This approach also retains the user by showing something very quickly while the rest continues to load.
All of this concerned page-loading SLAs. Both options suit different conditions (nowadays the second is generally preferred).
Coming to slow SQL queries, you need to normalize the database and use proper indexes wherever required. Write SQL queries that fetch only the required data, and fetch it efficiently.
If you have something that cannot be normalized further and is getting complex, you can look at NoSQL options.
Applying these techniques efficiently will help you achieve your desired performance.
I hope I have cleared up your confusion a bit.
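One middle ground between the two options is a single query that fetches only the columns the page needs, for all the articles it will show. The sketch below shows the idea in Python with sqlite3 as a stand-in database; the articles table layout is an assumption, and load_articles here only borrows the name from the question. Column names cannot be bound as query parameters, so they are checked against a whitelist before being put into the SQL.

```python
import sqlite3

# Stand-in database; the schema is made up for the example.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, icon TEXT, body TEXT);
    INSERT INTO articles VALUES (1, 'First', 'a.png', 'long text 1');
    INSERT INTO articles VALUES (2, 'Second', 'b.png', 'long text 2');
    INSERT INTO articles VALUES (3, 'Third', 'c.png', 'long text 3');
""")

ALLOWED_COLUMNS = {"id", "title", "icon", "body"}


def load_articles(ids, columns):
    """Fetch only the requested columns for the given ids, in one query."""
    unknown = set(columns) - ALLOWED_COLUMNS
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    if not ids:
        return []
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in ids)
    query = (f"SELECT {column_list} FROM articles "
             f"WHERE id IN ({placeholders}) ORDER BY id")
    return [dict(zip(columns, row)) for row in db.execute(query, list(ids))]


# A listing page that only needs ids and titles never loads the article bodies.
listing = load_articles([1, 3], ["id", "title"])
```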

What is the best way (in Rails/AR) to ensure writes to a database table are performed synchronously, one after another, one at a time?

I have noticed that using something like delayed_job without a UNIQUE constraint on a table column still creates duplicate entries in the DB. I had assumed delayed_job would run jobs one after another. The Rails app runs on Apache with Phusion Passenger. I am not sure if that is the reason why this happens, but I would like to make sure that every item in the queue is persisted to AR/DB one after another, in sequence, and to never have more than one write to this DB table happen at the same time. Is this possible? What would be some of the issues that I would have to deal with?
update
The race conditions arise because an AJAX API is used to send data to the application. The application receives batches of data; each batch is identified as belonging together by a Session ID (SID). In the end, the final state of the database has to reflect the latest, most up-to-date AJAX PUT request to the API. Sometimes requests arrive at exactly the same time for the same SID, so I need a way to make sure they are not all persisted at the same time, but one after the other, or that only the last one sent to the API is persisted.
I hope that makes my particular use-case easier to understand...
You can lock a specific table (or tables) with the LOCK TABLES statement.
In general I would say that relying on this is poor design and will likely lead to scalability problems down the road, since you're creating a bottleneck in your application flow.
With your further explanations, I'd be tempted to add some extra columns to the table used by delayed_job, with a unique index on them. If (for example) you only ever wanted 1 job per user you'd add a user_id column and then do
something.delay(:user_id => user_id).some_method
You might need more attributes if the pattern is more sophisticated, e.g. there are lots of different types of jobs and you only wanted one per person, per type, but the principle is the same. You'd also want to be sure to rescue ActiveRecord::RecordNotUnique and deal with it gracefully.
For non delayed_job stuff, optimistic locking is often a good compromise: it handles the concurrent cases well without slowing down the non-concurrent cases.
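The core of optimistic locking is a version column that every writer checks and bumps in the same UPDATE. ActiveRecord does this for you when the table has a lock_version column; the sketch below shows the underlying idea in Python with sqlite3 as a stand-in, using a made-up sessions table and function name.

```python
import sqlite3

# Stand-in database; table and column names are assumptions for the example.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE sessions (sid TEXT PRIMARY KEY, payload TEXT, version INTEGER NOT NULL)")
db.execute("INSERT INTO sessions VALUES ('abc', 'initial', 0)")
db.commit()


def save_if_unchanged(sid, new_payload, expected_version):
    """Write only if nobody else has updated the row since we read it."""
    cursor = db.execute(
        "UPDATE sessions SET payload = ?, version = version + 1 "
        "WHERE sid = ? AND version = ?",
        (new_payload, sid, expected_version))
    db.commit()
    return cursor.rowcount == 1   # 0 rows means we lost the race


# Two requests both read version 0, then both try to write.
first = save_if_unchanged("abc", "from request A", 0)
second = save_if_unchanged("abc", "from request B", 0)
final_payload = db.execute(
    "SELECT payload FROM sessions WHERE sid = 'abc'").fetchone()[0]
```

The losing writer gets False back and has to decide what to do: re-read and retry, or drop the write. For the "latest request wins" requirement in the question, comparing a client-supplied sequence number or timestamp instead of a plain version counter may be a better fit.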
If you are worried about multiple processes writing to the same rows, as in several users updating the same order_header row, I'd suggest you set a marker bound to current_user.id on the row once /order_headers/:id/edit is called, and remove it again once the current_user releases the row, either by updating or by canceling the edit.
Your use case (from your description) seems a bit different to me, so I'd suggest you leave it to the DB. With a fairly recent MySQL (post 5.1), you'd add a trigger or function that does the actual update, and there you could implement logic similar to that suggested above: some marker bound to the sequenced job ID of sorts.

Event feed implementation - will it scale?

Situation:
I am currently designing a feed system for a social website whereby each user has a feed of their friends' activities. I have two possible methods how to generate the feeds and I would like to ask which is best in terms of ability to scale.
Events from all users are collected in one central database table, event_log. Users are paired as friends in the table friends. The RDBMS we are using is MySQL.
Standard method:
When a user requests their feed page, the system generates the feed by inner joining event_log with friends. The result is then cached and set to timeout after 5 minutes. Scaling is achieved by varying this timeout.
Hypothesised method:
A task runs in the background and for each new, unprocessed item in event_log, it creates entries in the database table user_feed pairing that event with all of the users who are friends with the user who initiated the event. One table row pairs one event with one user.
The problems with the standard method are well known – what if a lot of people's caches expire at the same time? The solution also does not scale well – the brief is for feeds to update as close to real-time as possible.
The hypothesised solution in my eyes seems much better; all processing is done offline so no user waits for a page to generate and there are no joins so database tables can be sharded across physical machines. However, if a user has 100,000 friends and creates 20 events in one session, then that results in inserting 2,000,000 rows into the database.
Question:
The question boils down to two points:
Is this worst-case scenario mentioned above problematic, i.e. does table size have an impact on MySQL performance and are there any issues with this mass inserting of data for each event?
Is there anything else I have missed?
I think your hypothesised system generates too much data. Firstly, on the global scale, the storage and indexing requirements on user_feed seem to escalate exponentially as your user base becomes larger and more interconnected (both presumably desirable for a social network). Secondly, consider what happens if, in the course of a minute, 1000 users each enter a new message and each has 100 friends: your background thread then has 100,000 inserts to do and might quickly fall behind.
I wonder if a compromise might be made between your two proposed solutions, where a background thread updates a table last_user_feed_update which contains a single row for each user and a timestamp for the last time that user's feed was changed.
Then, although the full join and query would still be required to refresh the feed, a quick query to the last_user_feed_update table will tell whether a refresh is required or not. This seems to mitigate the biggest problems with your standard method as well as avoid the storage size difficulties, but that background thread still has a lot of work to do.
The hypothesized method works better when you limit the maximum number of friends. A lot of sites set a safe upper bound, including Facebook, iirc. It limits 'hiccups' when your user with 100K friends generates activity.
Another problem with the hypothesized model is that some of the friends you are essentially pre-generating cache for may sign up and hardly ever log in. This is a pretty common situation for free sites, and you may want to limit the burden that these inactive users will cost you.
I've thought about this problem many times - it's not a problem MySQL is going to be good at solving. I've thought of ways I could use memcached, where each user pushes their latest few status items to "their key" (and in a feed reading activity you fetch and aggregate all your friends' keys)... but I haven't tested this. I'm not sure of all the pros/cons yet.
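The per-user key idea could be sketched like this. It is untested against a real memcached: a plain dict stands in for the cache, and the key format, RECENT_LIMIT, and both function names are made up. Each user keeps only their latest few items under their own key, and a feed is built at read time by merging the friends' lists.

```python
import heapq
from itertools import islice

# Stand-in for memcached: key "recent:<user_id>" -> list of (timestamp, text),
# newest first. A real cache client would replace the dict reads and writes.
cache = {}
RECENT_LIMIT = 3   # how many recent items each user keeps under their key


def publish(user_id, timestamp, text):
    """Push a new item to the front of the user's own key, trimming old ones."""
    key = f"recent:{user_id}"
    items = [(timestamp, text)] + cache.get(key, [])
    cache[key] = items[:RECENT_LIMIT]


def read_feed(friend_ids, limit=10):
    """Merge the friends' newest-first lists into one newest-first feed."""
    lists = [cache.get(f"recent:{friend_id}", []) for friend_id in friend_ids]
    merged = heapq.merge(*lists, key=lambda item: item[0], reverse=True)
    return list(islice(merged, limit))


publish(1, 100, "user 1 joined a group")
publish(2, 105, "user 2 posted a photo")
publish(1, 110, "user 1 wrote a comment")
feed = read_feed([1, 2], limit=2)
```

Writes stay cheap (one key per event, regardless of friend count), while reads cost one cache lookup per friend, so this trades the 2,000,000-row insert problem for more work at read time.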

Gathering pageviews MySQL layout

Hey, does anyone know the proper way to set up a MySQL database to gather pageviews? I want to gather these pageviews to display in a graph later. I have a couple ways mapped out below.
Option A:
Would it be better to count pageviews each time someone visits the site and create a new row for every pageview, with a timestamp? So, 50,000 views = 50,000 rows of data.
Option B:
Count the pageviews per day and have one row that holds the count. Every time someone visits the site, the count goes up. So, 50,000 views = 1 row of data per day. Every day a new row is created.
Is either of the options above the correct way of doing what I want? Or is there a better, more efficient way?
Thanks.
Option C would be to parse access logs from the web server. No extra storage needed, all sorts of extra information is stored, and even requests to images and JavaScript files are stored.
However, if you just want to track visits to pages where you run your own code, I'd definitely go for Option A, unless you're expecting extreme amounts of traffic on your site.
That way you can create overviews per hour of the day, and store more information than just the timestamp (like the visited page, the user's browser, etc.). You might not need that now, but later on you might thank yourself for not losing that information.
If at some point the table grows too large, you can always think of ways to deal with that.
If you care about how your pageviews vary with time in a day, option A keeps that info (though you might still do some bucketing, say per-hour, to reduce overall data size -- but you might do that "later, off-line" while archiving all details). Option B takes much less space because it throws away a lot of info... which you might or might not care about. If you don't know whether you care, I think that, in doubt, you should keep more data rather than less -- it's reasonably easy to "summarize and archive" overabundant data, but it's NOT at all easy to recover data you've aggregated away;-). So, aggregating is riskier...
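As a small illustration of the per-hour bucketing mentioned above, assuming the raw option A rows have already been read into datetime values (the function name and sample data are made up):

```python
from collections import Counter
from datetime import datetime


def views_per_hour(view_times):
    """Collapse individual pageview timestamps into per-hour counts."""
    return Counter(
        t.replace(minute=0, second=0, microsecond=0) for t in view_times)


# Made-up raw rows, one timestamp per pageview.
raw_rows = [
    datetime(2009, 5, 29, 10, 5),
    datetime(2009, 5, 29, 10, 59, 30),
    datetime(2009, 5, 29, 11, 0),
]
hourly = views_per_hour(raw_rows)
```

Going this direction is always possible; going from per-hour counts back to individual pageviews is not, which is the asymmetry described above.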
If you do decide to keep abundant per-day data, one strategy is to use multiple tables, say one per day; this will make it easiest to work with old data (summarize it, archive it, remove it from the live DB) without slowing down current "logging". So, say, pageviews for May 29 would be in PV20090529 -- a different table than the ones for the previous and next days (this does require dynamic generation of the table name, or creative uses of ALTER VIEW e.g. in cron-jobs, etc -- no big deal!). I've often found such "sharding approaches" to have excellent (and sometimes unexpected) returns on investment, as a DB scales up beyond initial assumptions, compared to monolithic ones...
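The dynamic table name generation could be as small as the sketch below. The function name is made up; the PV prefix and date format follow the PV20090529 example above. Since table names cannot be bound as query parameters, building the name only from a date object keeps arbitrary input out of the SQL.

```python
from datetime import date


def pageview_table(day):
    """Name of the per-day pageview table, e.g. PV20090529 for 29 May 2009."""
    return f"PV{day:%Y%m%d}"


table_name = pageview_table(date(2009, 5, 29))
insert_sql = f"INSERT INTO {table_name} (viewed_at, page) VALUES (?, ?)"  # hypothetical columns
```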