Tracking last image request using nginx and redis - mysql

I use nginx to serve static files. For each file I would like to save the timestamp of when it was last retrieved by a browser request. Each file has a "unique ID" consisting of 1. server name, 2. path and 3. filename. The filename itself is not unique.
I would like to use a key-value store like redis to store this information, and afterwards a cron job which pushes the timestamp information to a MySQL database. I need to put redis in between since the system needs to handle a lot of concurrent requests.
Ultimate goal is to automatically delete all files which have not been requested in the last 6 months or so.
How would you configure/set up nginx/redis to make this happen?
Best
Kilian

There are two components to this: 1) how to structure the data in Redis, and 2) how to configure Nginx to update it.
Unless you have an external requirement for MySQL I don't see a reason to use it in this chain.
First: Redis Structure
I am assuming you will be running your cleanup job on a frequent basis, such as daily. If you are doing it based on a fixed interval such as "every month", you might structure your data differently.
I think your best structure may be a sorted set. The key name would be "SERVER:PATH", the member name would be the file name, and the score would be a UNIX timestamp.
With this setup you can pull members without needing to know their file names, selecting them by score instead. That allows you to pull "any member with score <= TIMESTAMP", where TIMESTAMP is the UNIX timestamp for "six months ago" in your example, using ZRANGEBYSCORE or ZREVRANGEBYSCORE.
The job you run to clean unused files would use these commands to pull the list. Once the files are removed you can use ZREMRANGEBYSCORE to clean their entries out of Redis. If your writes are frequent enough, you could run the clean-up reads against a read-only slave.
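For illustration, here is a minimal Python sketch of that structure using the redis-py client; the helper names and the six-month cutoff are assumptions, not something from your setup:

import time
import redis

r = redis.Redis(host="localhost", port=6379)

def track_request(server, path, filename):
    # Record "filename was just requested" in the sorted set for SERVER:PATH;
    # ZADD overwrites the score, so only the latest timestamp per file is kept.
    r.zadd(f"{server}:{path}", {filename: time.time()})

def stale_files(server, path, max_age=6 * 30 * 24 * 3600):
    # File names under SERVER:PATH not requested within max_age seconds.
    cutoff = time.time() - max_age
    return r.zrangebyscore(f"{server}:{path}", "-inf", cutoff)

def purge_stale(server, path, max_age=6 * 30 * 24 * 3600):
    # Drop the stale members from Redis once their files have been deleted.
    cutoff = time.time() - max_age
    r.zremrangebyscore(f"{server}:{path}", "-inf", cutoff)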
If you expect a large number of such entries, a long retention window will mean a larger Redis database. If so, you would likely need to reduce how long you keep the file cache, from six months to something more manageable; six months is a long time to keep a cache.
Second: Configuring Nginx to update the sorted sets
This depends very heavily on how comfortable you are with mucking about with Nginx modules. Nginx doesn't do this natively, but you could use the lua-resty-redis module to add the ability directly into Nginx; I've used it for similar tasks.
Hopefully this will get you started. The key portion is really the data structure in Redis, as the rest is simply configuring and testing the Nginx portion in and for your setup.

Related

Porting from MySql to Redis

This is probably more of a two-part question, but the main focus is really how to port data from MySQL into Redis. I've read over [the process here][1], but it's a bit over my head in terms of how that would work for my multi-column set of data. The end goal is that I would like to move my activity feed from SQL to Redis. From what I can tell, the best move is to store this data in a hash and then sort it using sorted sets. The most common sort to start with would just be created_at.
So the question is:
1) What is a fast way to port this over from SQL via the command line?
2) How could this be structured to sort via created_at?
Warning: moving to Redis is not necessarily a good idea for you. Most of the time, moving an entire MySQL database to Redis is a bad idea, because MySQL lets you query much more loosely and makes complicated queries much easier than Redis does. I'll tell you how to move things over, but there's a good chance you only want to do this for some of your data (or maybe even not at all), for functionality that requires speed above all else.
So, assuming you need to make the switch while data is coming in live, you'll need to do a few things:
You'll want to start sending data to both MySQL and Redis and mark the rows that have been sent to both in some way (probably your updated_at column works fine for this; just note the time when you first started sending data to both).
Move all older data over to Redis. You can do this by writing a quick script in a language like Python that will just grab all data with updated_at <= redis_start_date and insert it into Redis. Now your Redis database should mirror your MySQL database exactly, and you can expect it to continue mirroring it into the future.
Make sure all appropriate APIs that used to rely on MySQL have new versions that work with Redis.
Test everything heavily and make sure the data in Redis matches what's in MySQL and is in the format you want, also make sure all appropriate APIs work with it, etc...
Once you're confident all these are set, just switch your live APIs to be the ones that use Redis and not MySQL. You can stagger this process switching them one at a time and verifying all is as it should be. Since data is being written to both MySQL and Redis, you can rollback at anytime in case something went wrong.
Once you've switched over everything and are confident it all works with Redis the way you want, turn off the code that's writing to MySQL and enjoy your fast new Redis database.
In terms of structuring this, it really depends on how you use it. If you just need to look up individual items by id and have them sorted by created_at, you can use a sorted set of (member = id, score = unix_time(created_at)) along with one hash per id holding every other column. Again, this isn't something you necessarily want to do, and it depends on your use case.
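A rough Python sketch of the copy step and the structure described above, using pymysql and redis-py; the feed_items table, its columns, and the key names are made up for illustration:

import pymysql
import redis

REDIS_START_DATE = "2013-01-01 00:00:00"   # when dual writes began (made up)

db = pymysql.connect(host="localhost", user="app", password="secret",
                     database="app", cursorclass=pymysql.cursors.DictCursor)
r = redis.Redis()

with db.cursor() as cur:
    # Step 2: copy everything written before dual writes started.
    cur.execute("SELECT * FROM feed_items WHERE updated_at <= %s",
                (REDIS_START_DATE,))
    for row in cur.fetchall():
        item_id = row["id"]
        # One hash per row: feed_item:<id> -> {column: value, ...}
        r.hset(f"feed_item:{item_id}",
               mapping={k: str(v) for k, v in row.items()})
        # Sorted set for ordering: member = id, score = unix_time(created_at).
        r.zadd("feed_items:by_created_at",
               {str(item_id): row["created_at"].timestamp()})
db.close()

# Reading the 20 newest feed items is then a reverse range over the sorted set:
newest_ids = r.zrevrange("feed_items:by_created_at", 0, 19)
items = [r.hgetall(f"feed_item:{i.decode()}") for i in newest_ids]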

Convert Legacy Text Databases to SQL

At my office we have a legacy accounting system that stores all of its data in plaintext files (TXT extension) with fixed-width records. Each data file is named e.g., FILESALE.TXT. My goal is to bring this data into our MySQL server for read-only usage by many other programs that can't interface with the legacy software. Each file is essentially one table.
There are about 20 files in total that I need to access, roughly 1 GB of total data. Each line might be 350-400 characters wide and have 30-40 columns. After pulling the data in, no MySQL table is much bigger than 100 MB.
The legacy accounting system can modify any row in the text file, delete old rows (it marks deleted records with a 0x7F byte), and add new rows at any time.
For several years now I have been running a cron job every 5 minutes that:
Checks each data file for last modification time.
If the file is not modified, skip it. Otherwise:
Parses the data file, cleans up any issues (very simple checks only), and spits out a tab-delimited file of the columns I need (some of the columns I just ignore); a rough sketch of this step is included after the SQL below.
TRUNCATEs the table and imports the new data into our MySQL server like this:
START TRANSACTION;
TRUNCATE legacy_sales;
LOAD DATA INFILE '/tmp/filesale.data' INTO TABLE legacy_sales;
COMMIT;
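For reference, a minimal Python sketch of that parsing step; the record layout, column offsets, and the position of the deleted-record marker are placeholders, since the real file format isn't given here:

import sys

# Hypothetical fixed-width layout: (column name, start offset, end offset).
COLUMNS = [("invoice_no", 0, 10), ("customer", 10, 40), ("amount", 40, 52)]
DELETED_MARKER = b"\x7f"    # assumed to be the first byte of deleted records

def convert(src_path, dst_path):
    # Turn one fixed-width legacy file into a tab-delimited file for LOAD DATA INFILE.
    with open(src_path, "rb") as src, open(dst_path, "w", encoding="utf-8") as dst:
        for raw in src:
            if raw.startswith(DELETED_MARKER):
                continue    # skip rows the accounting system marked as deleted
            line = raw.decode("ascii", errors="replace")
            fields = [line[start:end].strip() for _name, start, end in COLUMNS]
            dst.write("\t".join(fields) + "\n")

if __name__ == "__main__":
    convert(sys.argv[1], sys.argv[2])    # e.g. FILESALE.TXT /tmp/filesale.data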
The cron script runs each file check and parse in parallel, so the whole updating process doesn't really take very long. The biggest table (changed infrequently) takes ~30 seconds to update, but most of the tables take less than 5 seconds.
This has been working OK, but there are some issues. I guess it messes with database caching, so each time I have to TRUNCATE and LOAD a table, other programs that use the MySQL database are slow at first. Additionally, since I switched to running the updates in parallel, the database can be in a slightly inconsistent state for a few seconds.
This whole process seems horribly inefficient! Is there a better way to approach this problem? Any thoughts on optimizations or procedures that might be worth investigating? Any neat tricks from anyone who faced a similar situation?
Thanks!
Couple of ideas:
If the rows in the text files have a modification timestamp, you could update your script to keep track of when it runs, and then only process the records that have been modified since the last run.
If the rows in the text files have a field that can act as a primary key, you could maintain a fingerprint cache for each row, keyed by that id. Use this to detect when a row changes and skip unchanged rows: in the loop that reads the text file, calculate the SHA1 (or whatever) hash of the whole row, and compare it to the hash from your cache. If they match, the row hasn't changed, so skip it. Otherwise, update/insert the MySQL record and store the new hash value in the cache. The cache could be a GDBM file, a memcached server, a fingerprint field in your MySQL tables, whatever. This leaves unchanged rows untouched (and thus still cached) in MySQL; a sketch of the idea follows this list.
Perform updates inside a transaction to avoid inconsistencies.
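Here is a short Python sketch of the fingerprint idea, using a GDBM-style cache through the standard dbm module; the key field position and the target table/columns are assumptions:

import dbm
import hashlib
import pymysql

db = pymysql.connect(host="localhost", user="app", password="secret", database="legacy")

def sync_file(path, table, cache_path):
    # Upsert only the rows whose content hash changed since the last run.
    with dbm.open(cache_path, "c") as cache, db.cursor() as cur, \
            open(path, "r", encoding="ascii", errors="replace") as src:
        for line in src:
            row_id = line[0:10].strip()               # hypothetical key field position
            digest = hashlib.sha1(line.encode()).hexdigest()
            try:
                stored = cache[row_id]
            except KeyError:
                stored = b""
            if stored == digest.encode():
                continue                              # unchanged row: leave MySQL alone
            # Hypothetical table holding an id plus the raw record; in practice you
            # would update/insert the individual parsed columns instead.
            cur.execute(
                f"REPLACE INTO {table} (id, raw_record) VALUES (%s, %s)",
                (row_id, line.rstrip("\n")),
            )
            cache[row_id] = digest                    # remember the new fingerprint
        db.commit()                                   # one transaction per file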
Two things come to mind and I won't go into too much detail but feel free to ask questions:
A service that offloads the processing of the file to an application server and then just populates the MySQL table. You could even build in intelligence by checking for duplicate records, rather than truncating the entire table.
Offload the processing to another MySQL server and replicate/transfer it over.
I agree with alex's tips. If you can, update only the modified fields, and batch the updates using transactions and grouped multi-row inserts. An additional benefit of transactions is faster updates.
If you are concerned about downtime, instead of truncating the table, insert into a new table and then rename it (see the sketch after this list).
For improved performance, make sure you have proper indexing on the fields.
Look at database-specific performance tips, such as:
- delayed inserts in MySQL can improve performance
- caches can be optimized
- even if you do not have unique rows, you may (or may not) be able to MD5 the rows
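A small Python sketch of the load-then-rename idea from the list above; the table and file names follow the question, but the exact statements are an assumption about the setup:

import pymysql

db = pymysql.connect(host="localhost", user="app", password="secret", database="legacy")

def reload_table(table="legacy_sales", datafile="/tmp/filesale.data"):
    # Load the fresh data into a shadow table, then swap it in atomically,
    # so readers always see either the old or the new data, never an empty table.
    with db.cursor() as cur:
        cur.execute(f"DROP TABLE IF EXISTS {table}_new, {table}_old")
        cur.execute(f"CREATE TABLE {table}_new LIKE {table}")
        cur.execute(f"LOAD DATA INFILE %s INTO TABLE {table}_new", (datafile,))
        cur.execute(f"RENAME TABLE {table} TO {table}_old, {table}_new TO {table}")
        cur.execute(f"DROP TABLE {table}_old")
    db.commit()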

Set eventual consistency (late commit) in MySQL

Consider the following situation: you want to update the number of page views of each profile in your system. This action is very frequent, as almost every visit to your website results in a page view increment.
The basic way is update Users set page_views=page_views+1. But this is far from optimal, because we don't really need an instant update (an hour late is OK). Is there any other way in MySQL to postpone a sequence of updates and apply them cumulatively at a later time?
I myself tried another method: storing a counter (number of increments) for each profile in a file. But this results in handling a few thousand small files, and I think the disk I/O cost (even with a deep tree structure for the files) would probably exceed that of the database.
What is your suggestion for this problem (other than MySQL)?
To improve performance you could store your page view data in a MEMORY table. This is very fast but temporary: the table only persists while the server is running, so on restart it will be empty.
You could then create an EVENT to update a table that will persist the data on a timed basis. This would help improve performance a little with the risk that, should the server go down, only the number of visits since the last run of the event would be lost.
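A minimal Python sketch of that buffer-and-flush idea; the Users and page_view_buffer names are assumptions, and the EVENT suggested above is replaced here by a flush function you could call from cron, purely for illustration:

import pymysql

db = pymysql.connect(host="localhost", user="app", password="secret", database="app")

def setup():
    # Assumed buffer table; the MEMORY engine keeps it in RAM (contents lost on restart).
    with db.cursor() as cur:
        cur.execute(
            "CREATE TABLE IF NOT EXISTS page_view_buffer ("
            "  profile_id INT PRIMARY KEY,"
            "  views INT NOT NULL"
            ") ENGINE=MEMORY"
        )
    db.commit()

def record_view(profile_id):
    # Cheap per-request write: bump the in-memory counter only.
    with db.cursor() as cur:
        cur.execute(
            "INSERT INTO page_view_buffer (profile_id, views) VALUES (%s, 1) "
            "ON DUPLICATE KEY UPDATE views = views + 1",
            (profile_id,),
        )
    db.commit()

def flush_views():
    # Periodic job (cron, or a MySQL EVENT doing the same statements): fold the
    # buffered counts into Users, then empty the buffer. Views recorded between
    # the UPDATE and the DELETE could be lost, which the question says is acceptable.
    with db.cursor() as cur:
        cur.execute(
            "UPDATE Users u JOIN page_view_buffer b ON u.id = b.profile_id "
            "SET u.page_views = u.page_views + b.views"
        )
        cur.execute("DELETE FROM page_view_buffer")
    db.commit()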
The link James posted in a comment on your question, which leads to an accepted answer with a further comment about memcached, was my first thought as well. Store the profile IDs in memcached, then set up a cron job to run every 15 minutes, grab all the entries, and issue the updates to MySQL in a batch. But there are a few things to consider:
1) When you run the batch script to grab the ids out of memcached, you will have to ensure you remove all entries which have been parsed; otherwise you run the risk of counting the same profile views multiple times.
2) Since memcache doesn't support wildcard searching on keys, and you will have to purge existing keys for the reason stated in #1, you will probably have to set up a separate memcache server pool dedicated solely to tracking profile ids, so you don't end up purging cached values that have no relation to profile view tracking. However, you could avoid this by storing the profile id and a timestamp within the value payload; your batch script then steps through each entry, checks the timestamp, adds it to the update queue if it falls within the time range you specified, and stops once it hits the upper limit of that range.
Another option may be to parse your access logs. If user profiles are in a known location like /myapp/profile/1234, you could parse for this pattern and add profile views this way. I ended up having to go this route for advertiser tracking, as it ended up being the only repeatable way to generate billing numbers. If they had any billing disputes we would offer to send them the access logs and parse for themselves.

How to update database of ~25,000 music files?

Update:
I wrote a working script that finishes this job in a reasonable length of time, and seems to be quite reliable. It's coded entirely in PHP and is built around the array_diff() idea suggested by saccharine (so, thanks saccharine!).
You can access the source code here: http://pastebin.com/ddeiiEET
I have a MySQL database that is an index of mp3 files in a certain directory, together with their attributes (ie. title/artist/album).
New files are often being added to the music directory. At the moment it contains about 25,000 MP3 files, but I need to create a cron job that goes through it each day or so, adding any files that it doesn't find in the database.
The problem is that I don't know what is the best / least taxing way of doing this. I'm assuming a MySQL query would have to be run for each file on each cron run (to check if it's already indexed), so the script would unavoidably take a little while to run (which is okay; it's an automated process). However, because of this, my usual language of choice (PHP) would probably not suffice, as it is not designed to run long-running scripts like this (or is it...?).
It would obviously be nice, but I'm not fussed about deleting index entries for deleted files (if files actually get deleted, it's always manual cleaning up, and I don't mind just going into the database by hand to fix the index).
By the way, it would be recursive; the files are mostly situated in an Artist/Album/Title.mp3 structure, however they aren't religiously ordered like this and the script would certainly have to be able to fetch ID3 tags for new files. In fact, ideally, I would like the script to fetch ID3 tags for each file on every run, and either add a new row to the database or update the existing one if it had changed.
Anyway, I'm starting from the ground up with this, so the most basic advice first I guess (such as which programming language to use - I'm willing to learn a new one if necessary). Thanks a lot!
First, a dumb question: would it not be possible to simply order the files by date added and only iterate through the files added in the last day? I'm not very familiar with working with files, but it seems like it should be possible.
If all you want to do is improve the speed of your current code, I would recommend that you check that your data is properly indexed. Queries are a lot faster if they can search through a table's index. If you're searching through columns that aren't the key, you might want to change your setup. You should also avoid "SELECT *" and instead use "SELECT COUNT", as MySQL will then return ints instead of objects.
You can also do everything in a few MySQL queries, at the cost of more complex PHP code. Call the array with information about all the files $files. Select the data from the db where the files in the db match a file in $files, something like this:
"SELECT id FROM MUSIC WHERE id IN ($files)"
Read the returned array and label it $db_files. Then find all files in $files array that don't appear in $db_files array using array_diff(). Label the missing files $missing_files. Then insert the files in $missing_files into the db.
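The asker's final script is PHP, but as a rough sketch of this diff-then-insert idea, here is the same flow in Python; the directory, table, and column names are made up, and the ID3 step is left as a comment:

import os
import pymysql

MUSIC_ROOT = "/var/music"   # assumed location of the music directory
db = pymysql.connect(host="localhost", user="app", password="secret", database="music")

def disk_files():
    # Walk the directory tree and collect relative paths of all mp3 files.
    found = set()
    for dirpath, _dirs, files in os.walk(MUSIC_ROOT):
        for name in files:
            if name.lower().endswith(".mp3"):
                found.add(os.path.relpath(os.path.join(dirpath, name), MUSIC_ROOT))
    return found

def indexed_files():
    # Paths already present in the index table.
    with db.cursor() as cur:
        cur.execute("SELECT path FROM music_index")
        return {row[0] for row in cur.fetchall()}

missing = disk_files() - indexed_files()    # the array_diff() step
with db.cursor() as cur:
    for path in missing:
        # A real script would read the ID3 tags here (title/artist/album)
        # before inserting; that part is omitted.
        cur.execute("INSERT INTO music_index (path) VALUES (%s)", (path,))
db.commit()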
What kind of engine are you using? If you're using MyISAM, the whole table will be locked while you update it. But still, 25k rows are not that many, so it should basically update in a few minutes at most. If it is InnoDB, just update it, since it's row-level locked and you should still be able to use your table while updating it.
By the way, if you're not using any fulltext search on that table, I believe you should convert it to InnoDB, as you can then use foreign keys, which would help you a lot when joining tables. Also, it scales better, AFAIK.

What is the most efficient way to keep track of all user traffic inside a database

Currently I am using MySQL to log all traffic from all users coming into a website that I manage. The database has grown to almost 11 million rows in a month, and queries are getting quite slow. Is there a more efficient way to log user information? All we are storing is their request, user agent, and IP, and associating it with a certain website.
Why not try Google Analytics? Even if you might not think it would be sufficient for you, I bet you it can track 99% of what you want to be tracked.
The answer depends completely on what you expect to retrieve in the query side. Are you looking for aggregate information, are you looking for all of history or only a portion? Often, if you need to look at every row to find out what you need, storing in basic text files is quickest.
What kind of queries do you want to run on the data? I assume most of your queries are over data in the current or a recent time window. I would suggest using time-based partitioning of the table. This makes such queries faster, as they hit only the partitions holding the data, so fewer disk seeks. Also, regularly purge old data and put it in summary tables. Some useful links are:
http://forge.mysql.com/w/images/a/a2/FOSDEM_2009-Giuseppe_Maxia-Partitions_Performance.pdf
http://www.slideshare.net/bluesmoon/scaling-mysql-writes-through-partitioning-3397422
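As a hedged sketch of what month-based RANGE partitioning could look like here; the table and column names are assumptions, and note that MySQL requires the partitioning column to be part of every unique key on the table:

import pymysql

db = pymysql.connect(host="localhost", user="app", password="secret", database="traffic")

PARTITION_DDL = """
ALTER TABLE traffic_log
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p2012_01 VALUES LESS THAN (TO_DAYS('2012-02-01')),
    PARTITION p2012_02 VALUES LESS THAN (TO_DAYS('2012-03-01')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
)
"""

with db.cursor() as cur:
    cur.execute(PARTITION_DDL)
    # Queries filtered on created_at now touch only the matching partitions,
    # and an old month can be purged cheaply, e.g.:
    # cur.execute("ALTER TABLE traffic_log DROP PARTITION p2012_01")
db.commit()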
The most efficient way is probably to have Apache (assuming that's what the site is running on) simply use its built-in logging to text logs, and configure something like AWStats. This removes the need to log the information yourself, and should provide you with the information you are looking for, probably already configured in existing reports. The benefit of this over something like Google Analytics would be the server-side tracking, among other things.
Maybe stating the obvious, but have you got a good index in relation to the queries that you are making?
1) Look at using Piwik to perform Google Analytic type tracking, while retaining control of the MySQL data.
2) If you must continue to use your own system, look at using the InnoDB Plugin in order to support compressed table types. In addition, convert the IP to an unsigned integer, and convert both useragent and request to unsigned integers referencing lookup tables that are compressed using either InnoDB compression or the ARCHIVE engine.
3) Skip partitioning and shard the DB by month.
This is what "Data Warehousing" is for. Consider buying a good book on warehousing.
Collect the raw data in some "current activity" schema.
Periodically, move it into a "warehouse" (or "datamart") star schema that's (a) separate from the current activity schema and (b) optimized for count/sum/group-by queries.
Move, BTW, means insert into warehouse schema and delete from current activity schema.
Separate your ongoing transactional processing from your query/analytical processing.
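As a loose illustration of the "move" step above; all schema, table, and column names are invented, and the cutoff is assumed to exclude rows still being written:

import pymysql

db = pymysql.connect(host="localhost", user="app", password="secret")

def move_to_warehouse(cutoff):
    # Copy completed activity rows into the warehouse star schema, then remove
    # them from the current-activity schema, inside one transaction.
    with db.cursor() as cur:
        cur.execute(
            "INSERT INTO warehouse.request_fact "
            "  (site_id, ip, useragent, request, requested_at) "
            "SELECT site_id, ip, useragent, request, created_at "
            "FROM activity.traffic_log WHERE created_at < %s",
            (cutoff,),
        )
        cur.execute(
            "DELETE FROM activity.traffic_log WHERE created_at < %s",
            (cutoff,),
        )
    db.commit()

move_to_warehouse("2012-06-01")   # e.g. move everything older than this date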