Implementing a quota system to limit requests in a web based app - mysql

I want to limit my users to 25k/requests per hour/day/whatever.
My first idea was to simply use mysql and have a column in the users table where i would store the requests and i would increment this counter each time the user makes a request. Problem with this approach is that sometimes you end up writing at the same time in the column and you get a deadlock from mysql, so this isn't really a good way to go about it, is it ?
Another way would be, instead of incrementing the counters of a column, to insert log records in a separate table and then count these records for a given timespan, but this way you can easily end up with million records table and the query can be too slow.
When using a RDBMS another aspect to take into consideration is that at each request you'd have to count the user quota from database and this can take time depending on either of the above mentioned methods.
My second idea, i thought using something like redis/memcached (not sure of alternatives or which one of them is faster) and store the request counters there. This would be fast enough to query and increment the counters, for sure faster than a RDBMS, but i haven't tried it with huge amounts of data, so i am not sure how it will perform just yet.
My third idea, i would keep the quota data in memory in a map, something like map[int]int where the key would be the user_id and the value would be the quota usage and i'd protect the map access with a mutex. This would be the fastest solution of all but then what do you do if for some reason your app crashes, you lose all that data related to the number of requests certain user did. One way would be to catch the app when crashing and loop through the map and update the database. Is this feasible?
Not sure if either of the above is the right approach, but i am open to suggestions.

I'm not sure what you mean by "get a deadlock from mysql" when you try to update a row at the same time. But a simple update rate_limit set count = count + 1 where user_id = ? should do what you want.
Personally I have had great success with Redis for doing rate limiting. There are lots of resources out there to help you understand the appropriate approach for your use case. Here is one I just glanced at that seems to handle things correctly: https://www.binpress.com/tutorial/introduction-to-rate-limiting-with-redis/155. Using pipelines (MULTI) or Lua scripts may make things even nicer.

You can persist your map[int]int in RDBMS or just file system time to time and in defer function. You even can use it as cache instead of redis. Surely it will be anyway faster than connection to third-party service every request. Also you can store counters on user side simply in cookies. Smart user can clear cookies of-douse but is it so dangerous at all and you can also provide some identification info in cookies to make clearing uncomfortable.

Related

Keeping mysql query results and variables set from the queries for use across pages and sessions

I have never used apc_store() before, and I'm also not sure about whether to free query results or not. So I have these questions...
In a MySQL Query Cache article here, it says "The MySQL query cache is a global one shared among the sessions. It caches the select query along with the result set, which enables the identical selects to execute faster as the data fetches from the in memory."
Does using free_result() after a select query negate the caching spoken of above?
Also, if I want to set variables and arrays obtained from the select query for use across pages, should I save the variables in memory via apc_store() for example? (I know that can save arrays too.) And if I do that, does it matter if I free the result of the query? Right now, I am setting these variables and arrays in an included file on most pages, since they are used often. This doesn't seem very efficient, which is why I'm looking for an alternative.
Thanks for any help/advice on the most efficient way to do the above.
MySQL's "Query cache" is internal to MySQL. You still have to perform the SELECT; the result may come back faster if the QC is enabled and usable in the situation.
I don't think the QC is what you are looking for.
The QC is going away in newer versions. Do not plan to use it.
In PHP, consider $_SESSION. I don't know whether it is better than apc_store for your use.
Note also, anything that is directly available in PHP constrains you to a single webserver. (This is fine for small to medium apps, but is not viable for very active apps.)
For scaling, consider storing a small key in a cookie, then looking up that key in a table in the database. This provides for storing arbitrary amounts of data in the database with only a few milliseconds of overhead. The "key" might be something as simple as a "user id" or "session number" or "cart number", etc.

Best practice reading newest rows from database

I have a table which stores the location of my user very frequently. I want to query this table frequently and return the newest rows I haven't read from.
What would be the best practice way to do this. My ideas are:
Add a boolean read flag, query all results where this is false, return them and then update them ALL. This might slow things down with the extra writes
Save the id of the last read row on the client side, and query for rows greater than this. Only issue here is that my client could lose their place
Some stream of data
there will eventually be multiple users and readers of the locations so this will not to scale somewhat
If what you have is a SQL database storing rows of things. I'd suggest something like option 2.
What I would probably do is keep a timestamp rather than in ID, and an index on that (a clustered index on MSSQL, or similar construct so that new rows are physically sorted by time). Then just query by anything newer than that.
That does have the "losing their place" issue. If the client MUST read every row published, then I'd either delete them after processing, or have a flag in the database to indicate that they have been processed. If the client just needs to restart reading current data, then I would do as above, but initialize the time with the most recent existing row.
If you MUST process every record, aren't limited to a database, what you're really talking about is a message queue. If you need to be able to access the individual data points after processing, then one step of the message handling could be to insert into a database for later querying(in addition to whatever this is doing with the data read).
Edit per comments:
If there's no processing that needs be done when receiving, but you just want to periodically update data then you'd be fine with solution of keeping the last received time or ID and not deleting the data. In that case I would recommend not persisting a last known id/timestamp across restarts/reconnects since you might end up inadvertently loading a bunch of data. Just reset it max when you restart.
On another note, when I did stuff like this I had good success using MQTT to transmit the data, and for the "live" updates. That is a pub/sub messaging protocol. You could have a process subscribing on the back end and forwarding data to the database, while the thing that wants the data frequently can subscribe directly to the stream of data for live updates. There's also a feature to hold onto the last published message and forward that to new subscribers so you don't start out completely empty.

redis as write-back view count cache for mysql

I have a very high throughput site for which I'm trying to store "view counts" for each page in a mySQL database (for legacy reasons they must ultimately end up in mySQL).
The sheer number of views is making it impractical to do SQL "UPDATE ITEM SET VIEW_COUNT=VIEW_COUNT+1" type of statements. There are millions of items but most are only viewed a small number of times, others are viewed many times.
So I'm considering using Redis to gather the view counts, with a background thread that writes the counts to mySQL. What is the recommended method for doing this? There are some issues with the approach:
how often does the background thread run?
how does it determine what to write back to mySQL?
should I store a Redis KEY for every ITEM that gets hit?
what TTL should I use?
is there already some pre-built solution or powerpoint presentation that gets me halfway there, etc.
I have seen very similar questions on StackOverflow but none with a great answer...yet! Hoping there's more Redis knowledge out there at this point.
I think you need to step back and look at some of your questions from a different angle to get to your answers.
"how often does the background thread run?"
To answer this you need to answer these questions: How much data can you lose? What is the reason for the data being in MySQL, and how often is that data accessed? For example, if the DB is only needed to be consulted once per day for a report, you might only need it to be updated once per day. On the other hand, what if the Redis instance dies? How many increments can you lose and still be "ok"? These will provide the answers to the question of how often to update your MySQL instance and aren't something we can answer for you.
I would use a very different strategy for storing this in redis. For the sake of the discussion let us assume you decide you need to "flush to db" every hour.
Store each hit in hashes with a key name structure along these lines:
interval_counter:DD:HH
interval_counter:total
Use the page id (such as MD5 sum of the URI, the URI itself, or whatever ID you currently use) as the hash key and do two increments on a page view; one for each hash. This provides you with a current total for each page and a subset of pages to be updated.
You would then have your cron job run a minute or so after the start of the hour to pull down all pages with updated view counts by grabbing the previous hour's hash. This provides you with a very fast means of getting the data to update the MySQL DB with while avoiding any need to do math or play tricks with timestamps etc.. By pulling data from a key which is no longer bing incremented you avoid race conditions due to clock skew.
You could set an expiration on the daily key, but I'd rather use the cron job to delete it when it has successfully updated the DB. This means your data is still there if the cron job fails or fails to be executed. It also provides the front-end with a full set of known hit counter data via keys that do not change. If you wanted, you could even keep the daily data around to be able to do window views of how popular a page is. For example if you kept the daily hash around for 7 days by setting an expire via the cron job instead of a delete, you could display how much traffic each page has had per day for the last week.
Executing two hincr operations can be done either solo or pipelined still performs quite well and is more efficient than doing calculations and munging data in code.
Now for the question of expiring the low traffic pages vs memory use. First, your data set doesn't sound like one which will require huge amounts of memory. Of course, much of that depends on how you identify each page. If you have a numerical ID the memory requirements will be rather small. If you still wind up with too much memory, you can tune it via the config, and if needs be could even use a 32 bit compile of redis for a significant memory use reduction. For example, the data I describe in this answer I used to manage for one of the ten busiest forums on the Internet and it consumed less than 3GB of data. I also stored the counters in far more "temporal window" keys than I am describing here.
That said, in this use case Redis is the cache. If you are still using too much memory after the above options you could set an expiration on keys and add an expire command to each ht. More specifically, if you follow the above pattern you will be doing the following per hit:
hincr -> total
hincr -> daily
expire -> total
This lets you keep anything that is actively used fresh by extending it's expiration every time it is accessed. Of course, to do this you'd need to wrap your display call to catch the null answer for hget on the totals hash and populate it from the MySQL DB, then increment. You could even do both as an increment. This would preserve the above structure and would likely be the same codebase needed to update the Redis server from the MySQL Db if you the Redis node needed repopulation. For that you'll need to consider and decide which data source will be considered authoritative.
You can tune the cron job's performance by modifying your interval in accordance with the parameters of data integrity you determine from the earlier questions. To get a faster running cron nob you decrease the window. With this method decreasing the window means you should have a smaller collection of pages to update. A big advantage here is you don't need to figure out what keys you need to update and then go fetch them. you can do an hgetall and iterate over the hash's keys to do updates. This also saves many round trips by retrieving all the data at once. In either case if you will likely want to consider a second Redis instance slaved to the first to do your reads from. You would still do deletes against the master but those operations are much quicker and less likely to introduce delays in your write-heavy instance.
If you need disk persistence of the Redis DB, then certainly put that on a slave instance. Otherwise if you do have a lot of data being changed often your RDB dumps will be constantly running.
I hope that helps. There are no "canned" answers because to use Redis properly you need to think first about how you will access the data, and that differs greatly from user to user and project to project. Here I based the route taken on this description: two consumers accessing the data, one to display only and the other to determine updating another datasource.
Consolidation of my other answer:
Define a time-interval in which the transfer from redis to mysql should happen, i.e. minute, hour or day. Define it in a way so that fast and easyly an identifying key can be obtained. This key must be ordered, i.e. a smaller time should give a smaller key.
Let it be hourly and the key be YYYYMMDD_HH for readability.
Define a prefix like "hitcount_".
Then for every time-interval you set a hash hitcount_<timekey> in redis which contains all requested items of that interval in the form ITEM => count.
There exists two parts of the solution:
The actual page that has to count:
a) get the current $timekey, i.e. by date- functions
b) get the value of $ITEM
b) send the redis-command HINCRBY hitcount_$timekey $ITEM 1
A cronjob which runs in that given interval, not too close to the limit of that intervals (in example: not at the full hour). This cronjob:
a) Extracts the current time-key (for now it would be 20130527_08)
b) Requests all matching keys from redis with KEYS hitcount_* (those should be a small number)
c) compares every such hash against the current hitcount_<timekey>
d) if that key is smaller than current key, then process it as $processing_key:
read all pairs ITEM => counter by HGETALL $processing_key as $item, $cnt
update the database with `UPDATE ITEM SET VIEW_COUNT=VIEW_COUNT+$cnt where ITEM=$item"
delete that key from the hash by HDEL $processing_key $item
no need to del the hash itself - there are no empty hashes in redis as far as I tried
If you want to have a TTL involved, say if the cleanup-cronjob may be not reliable (as might not run for many hours), then you could create the future hashes by the cronjob with an appropriate TTL, that means for now we could create a hash 20130527_09 with ttl 10 hours, 20130527_10 with TTL 11 hours, 20130527_11 with TTL 12 hours. Problem is that you would need a pseudokey, because empty hashes seem to be deleted automatically.
See EDIT3 for current state of the A...nswer.
I would write a key for every ITEM. A few tenthousand keys are definitely no problem at all.
Do the pages change very much? I mean do you get a lot of pages that will never be called again? Otherwise I would simply:
add the value for an ITEM on page request.
every minute or 5 minutes call a cronjob that reads the redis-keys, read the value (say 7) and reduce it by decrby ITEM 7. In MySQL you could increment the value for that ITEM by 7.
If you have a lot of pages/ITEMS which will never be called again you could make a cleanup-job once a day to delete keys with value 0. This should be locked against incrementing that key again from the website.
I would set no TTL at all, so the values should live forever. You could check the memory usage, but I see a lot of different possible pages with current GB of memory.
EDIT: incr is very nice for that, because it sets the key if not set before.
EDIT2: Given the large amount of different pages, instead of the slow "keys *" command you could use HASHES with incrby (http://redis.io/commands/hincrby). Still I am not sure if HGETALL is much faster then KEYS *, and a HASH does not allow a TTL for single keys.
EDIT3: Oh well, sometimes the good ideas come late. It is so simple: Just prefix the key with a timeslot (say day-hour) or make a HASH with name "requests_". Then no overlapping of delete and increment may happen! Every hour you take the possible keys with older "day_hour_*" - values, update the MySQL and delete those old keys. The only condition is that your servers are not too different on their clock, so use UTC and synchronized servers, and don't start the cron at x:01 but x:20 or so.
That means: a called page converts a call of ITEM1 at 23:37, May 26 2013 to Hash 20130526_23, ITEM1. HINCRBY count_20130526_23 ITEM1 1
One hour later the list of keys count_* is checked, and all up to count_20130523 are processed (read key-value by hgetall, update mysql), and deleted one by one after processing (hdel). After finishing that you check if hlen is 0 and del count_...
So you only have a small amount of keys (one per unprocessed hour), that makes keys count_* fast, and then process the actions of that hour. You can give a TTL of a few hours, if your cron is delayed or timejumped or down for a while or something like that.

What is the best way (in Rails/AR) to ensure writes to a database table are performed synchronously, one after another, one at a time?

I have noticed that using something like delayed_job without a UNIQUE constraint on a table column would still create double entries in the DB. I have assumed delayed_job would run jobs one after another. The Rails app runs on Apache with Passenger Phusion. I am not sure if that is the reason why this would happen, but I would like to make sure that every item in the queue is persisted to AR/DB one after another, in sequence, and to never have more than one write to this DB table happen at the same time. Is this possible? What would be some of the issues that I would have to deal with?
update
The race conditions arise because an AJAX API is used to send data to the application. The application received a bunch of data, each batch of data is identified as belonging together by a Session ID (SID), in the end, the final state of the database has to include the latest most up-to date AJAX PUT query to the API. Sometimes queries arrive at the exact same time for the same SID -- so I need a way to make sure they don't all try to be persisted at the same time, but one after the other, or simply the last to be sent by AJAX request to the API.
I hope that makes my particular use-case easier to understand...
You can lock a specific table (or tables) with the LOCK TABLES statement.
In general I would say that relying on this is poor design and will likely lead to with scalability problems down the road since you're creating an bottleneck in your application flow.
With your further explanations, I'd be tempted to add some extra columns to the table used by delayed_job, with a unique index on them. If (for example) you only ever wanted 1 job per user you'd add a user_id column and then do
something.delay(:user_id => user_id).some_method
You might need more attributes if the pattern is more sophisticated, e.g. there are lots of different types of jobs and you only wanted one per person, per type, but the principle is the same. You'd also want to be sure to rescue ActiveRecord::RecordNotUnique and deal with it gracefully.
For non delayed_job stuff, optimistic locking is often a good compromise between handling the concurrent cases well without slowing down the non concurrent cases.
If you are worried/troubled about/with multiple processes writing to the 'same' rows - as in more users updating the same order_header row - I'd suggest you set some marker bound to the current_user.id on the row once /order_headers/:id/edit was called, and removing it again, once the current_user releases the row either by updating or canceling the edit.
Your use-case (from your description) seems a bit different to me, so I'd suggest you leave it to the DB (in case of a fairly recent - as in post 5.1 - MySQL, you'd add a trigger/function which would do the actual update, and here - you could implement similar logic to the above suggested; some marker bound to the sequenced job id of sorts)

Whats the best way to count views on a high traffic site?

The way I currently do it in mysql is is
UPDATE table SET hits=hits+1 WHERE id = 1;
This keeps live stats on the site, but as I understand, isnt the best way to go about doing this.
Edit:
let me clarify... this is for counting hits on specific item pages. I have a listing of movies, and I want to count how many views each movie page has gotten. After it +1s, it adds the movie ID to a session var, which stories the ids of all the pages the user viewed. If the ID of the page is in that array, it wont +1 it.
If your traffic is high enough, you shouldn't hit the database on every request. Try keeping the count in memory and sync the database on a schedule (for example, update the database on every 1000 requests or every minute.)
You could take an approach similar to Stack Overflow's view count. This basically increments the counter when the image is loaded. This has two useful aspects:
Robots often don't download images, so these don't increment the view.
Browsers cache images, so when you go back to a page, you're not causing work for the server.
The potentially slower code is run async from the rest of the page. This doesn't slow down the page from being visible.
To optimize the updates:
* Keep the counter in a single, narrow table, with a clustered index on the key.
* Have the table served up by a different database server / host.
* Use memcached and/or a queue to allow the write to either be delayed or run async.
If you don't need to display the view count in real time, then your best bet is to include the movie id in your URL somewhere, and use log scrapping to populate the database at the end of the day.
Not sure which web server you are using.
If your web server logs the requests to the site, say one line per request in a text file. Then you could just count the lines in your log files.
Your solution has a major problem in that it will lock the row in the database, therefore your site can only serve one request at a time.
it depends really on if you want hits or views
1 view from 1 ip = 1 person looking at a page
1 person refreshing the same page = multiple hits but only one view
i always prefer google analytics etc for something like this, you need to make sure that
this db update is only done once, or you could quite easily be flooded.
I'm not sure about what you're using, but you could set a cron job to automatically update the count every x minutes in Google App Engine. I think you'd use memcache to save the counts until your cron job is run. Although... GAE does have some stat reporting but you'd probably want to have your own data also. I think you can use memcache on other systems, and set cron jobs on them tool
Use logging software. Google Analytics is pretty and feature-filled (and generates zero load on your servers), but it'll miss non-JavaScript hits. If every single hit is important, use a server log analyzer like webalizer or awstats.
In general with MySQL:
If you use MyISAM table: there is a lock on the table so you better have to do an INSERT in a separate table. Then with a cron job, you UPDATE the values in your movie table.
If you use InnoDB table: there is a lock on the row so you can UPDATE the value directly.
That say, depending of the "maturity" and success of your project, you may need to implement a different solution, so:
1st advice: benchmark, benchmark, benchmark.
2nd advice: Using the data from the 1st advice, identify the bottleneck and select the solution for the issue you face but not a future issue you think you might have.
Here is a great video on this: http://www.youtube.com/watch?v=ZW5_eEKEC28
Hope this helps. :)