Configure Redis to keep keys only for the past day - configuration

I have an application that writes keys into Redis without specifying any expiration for them. It is not possible to change the application, but I want to configure Redis to keep only the keys from the past 24 hours and delete the older ones.
HOW?

AFAIK there is no way to configure Redis to keep only the keys from the past 24 hours and delete the old ones, as you said, unless you set up a TTL, but there is a trick you could use.
I'm assuming you cannot change anything in the application you are telling us about... so you will need to create a script/command/application which connects to the Redis server at some short interval, let's say, 1 minute. The interval will depend on how many keys you expect to have in Redis on average.
The application is simple: you only have to iterate over all keys and use three commands:
KEYS * -> to get the full list of keys
TTL keyName -> to check whether the key already has an expiration time assigned. It will return -1 if it does not
EXPIRE keyName 86400 -> if the key has no TTL assigned, set a TTL of 24 hours
So the first time you run the command, all the current keys in Redis will get a TTL of 24 hours, and after that time they will be removed. The second execution of the command will assign a 24-hour TTL only to the new keys which didn't exist during the first execution, and so on.
You have to take into account that if the number of keys is huge, in the order of several million, you can run into memory and performance problems, so in that case I suggest retrieving the keys with wildcards to get them in groups, like KEYS a* or KEYS 1*, depending on the pattern you use for key names. Here you could create a daemon which never stops and iterates constantly over each group.
Using the KEYS command with a huge number of keys is not recommended in production environments, but you can use the kind of workaround suggested above.
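Below is a minimal sketch of such a script in Python, using the redis-py client. It follows the same logic as the commands above, but iterates with SCAN (scan_iter) instead of KEYS so the server is not blocked while walking a large keyspace; the connection details are placeholders.

    import redis

    ONE_DAY = 86400  # 24 hours, in seconds

    # Placeholder connection details; adjust for your setup.
    r = redis.Redis(host="localhost", port=6379)

    def assign_missing_ttls(pattern="*"):
        """Give a 24h TTL to every key that does not have one yet."""
        updated = 0
        # scan_iter walks the keyspace incrementally instead of KEYS *.
        for key in r.scan_iter(match=pattern, count=1000):
            # TTL returns -1 when the key exists but has no expiration set.
            if r.ttl(key) == -1:
                r.expire(key, ONE_DAY)
                updated += 1
        return updated

    if __name__ == "__main__":
        print("Keys given a 24h TTL:", assign_missing_ttls())

Run it from cron at whatever interval suits your write volume; keys that already have a TTL are left untouched, so repeated runs only affect keys written since the previous run.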

To expire keys after 24 hours, you'll need to explicitly set the TTL for each of them.

Related

Couchbase document expiration performance

I have a 6 nodes couchbase cluster with about 200 million documents in one bucket. I need to delete about 100 million documents within that bucket. I'm planning to have a view that gives me an index of the documents I need to delete and then do a touch operation to set the expiry of those documents to the next day.
I understand that couchbase will run a background expiry pager operation on regular intervals to delete the documents. Will this expiry pager process on 100 million documents have an impact on couchbase cluster performance?
If you set them to expire all around the same time, maybe. Whether it will affect performance depends on your cluster's sizing. If it were me, unless you have some compelling reason to get rid of them all right this moment, I would play it safe and just set a random TTL for a time between now and a few days from now. Then the server will take care of it for you and you do not have to worry about it.
Document expiration in Couchbase is given in seconds or UNIX epoch time. If it is more than 30 days, it has to be UNIX epoch time.
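As a rough illustration of the "random TTL" suggestion, here is a Python sketch. It assumes the Couchbase Python SDK (the import paths and the touch signature shown are based on the 4.x SDK and may differ in your version), and the connection details, bucket name and document-id source are all placeholders.

    import random
    from datetime import timedelta

    # Import paths are from the 4.x Python SDK; they may differ in older SDKs.
    from couchbase.auth import PasswordAuthenticator
    from couchbase.cluster import Cluster
    from couchbase.options import ClusterOptions

    cluster = Cluster("couchbase://localhost",
                      ClusterOptions(PasswordAuthenticator("user", "password")))
    collection = cluster.bucket("my-bucket").default_collection()

    def expire_documents(doc_ids, spread_days=3):
        # Spread the expirations over the next few days so the expiry pager
        # does not have to reap all 100 million documents in one pass.
        for doc_id in doc_ids:
            ttl = timedelta(seconds=random.randint(3600, spread_days * 86400))
            collection.touch(doc_id, ttl)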

Tracking last image request using nginx and redis

I use nginx to serve static files. For each file I would like to save the timestamp of when it was last retrieved by a browser request. Each file has a "unique ID" consisting of 1. servername, 2. path and 3. filename. The filename itself is not unique.
I would like to use a key value store like redis to store this information and a cron job afterwards which pushes this timestamp information to a mySQL database. I need to put redis in between since the system needs to handle a lot of concurrent requests.
Ultimate goal is to automatically delete all files which have not been requested in the last 6 months or so.
How would you configure/set up nginx/redis to make this happen?
Best
Kilian
There are two components to this: 1) how to structure the data in Redis and 2) How to configure Nginx to update it.
Unless you have an external requirement for MySQL I don't see a reason to use it in this chain.
First: Redis Structure
I am assuming you will be running your cleanup job on a frequent basis, such as daily. If you are doing it based on a fixed schedule such as "every month" you might structure your data differently.
I think your best structure may be to use a sorted set. The key name would be "SERVER:PATH", the membername would be the file name, and the score would be a UNIX timestamp.
With this setup you can pull members without needing to know their filenames, and do so based on their score. This lets you pull "any member with SCORE <= TIMESTAMP", where TIMESTAMP is also a UNIX timestamp (in your example, "six months ago"), using ZRANGEBYSCORE or ZREVRANGEBYSCORE.
The job you run to clean up unused files would use these commands to pull the list. When the files are removed you can use ZREMRANGEBYSCORE to clean the entries out of Redis. If your writes are frequent enough you could run a read-only slave to do the clean-up pulls from.
If you expect a large number of such entries, a long retention period will lead to a larger database. In that case you'd likely need to reduce the amount of time you keep the file cache from six months to something more manageable. Six months is a long time to keep a cache.
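Here is a short Python sketch of that layout using redis-py, just to make the commands concrete; the key naming, the six-month threshold and the connection details are examples only.

    import time
    import redis

    r = redis.Redis()  # placeholder connection

    def record_hit(server, path, filename):
        # Key = "SERVER:PATH", member = filename, score = UNIX timestamp
        # of the most recent request for that file.
        r.zadd(f"{server}:{path}", {filename: time.time()})

    def stale_files(server, path, max_age_days=180):
        # Everything with a score <= cutoff has not been requested recently.
        cutoff = time.time() - max_age_days * 86400
        return [m.decode() for m in r.zrangebyscore(f"{server}:{path}", "-inf", cutoff)]

    def purge_tracking(server, path, max_age_days=180):
        # Remove the tracking entries once the files themselves are deleted.
        cutoff = time.time() - max_age_days * 86400
        return r.zremrangebyscore(f"{server}:{path}", "-inf", cutoff)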
Second: Configuring Nginx to update the sorted sets
This depends very heavily on how comfortable you are with mucking about with Nginx modules. Nginx doesn't do it natively, but you could use the lua-resty-redis module to add the ability directly into Nginx. I've used it for similar tasks.
Hopefully this will get you started. The key portion is really the data structure in Redis, as the rest is simply configuring and testing the Nginx portion in and for your setup.

SQL: insert row with expiration

Is there a way to insert a row into SQL with an expiration (c.f. you can insert a new key that expires in a minute with Memcached)?
The context is that I want an integration test to insert rows into a database, but I'd prefer not to delete them myself, as the database is shared by many. Those delete queries would have to be manual, so they might not be run, or they might have disastrous typos, etc. I'd prefer the system to do it for me if it can (i.e. automatically, efficiently, and well tested).
(I assume this is not part of the SQL standard and the answer is no.)
related: SQL entries that expire after 24 hours
related: What is the best way to delete old rows from MySQL on a rolling basis?
CONTEXT: I can't make any changes to the database schema, or any of the associated infrastructure.
If you were doing unit testing, I would suggest wrapping each unit test in a BEGIN TRAN / ROLLBACK.
Since you are doing integration testing, you probably need the data to live outside the scope of a single transaction. SQL Agent would work fine here, except that it would not distinguish between test data and real data. However, you could get around this by INSERTing some identifier into the specific records to be deleted upon expiration. That could be done in a single stored proc.
You might be able to accomplish this by using SQL Server Service Broker. I have not worked with the service broker, but maybe there is a way to delay message processing until a specific time has passed.
Add an expiration date column to your table(s), and create a job that deletes data that is past its expiration on some schedule (say nightly).
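A minimal sketch of that last suggestion, as a Python script you could schedule nightly (via SQL Agent, cron, or Task Scheduler). The connection string, table name (TestData) and column name (expires_at) are placeholders.

    import pyodbc

    # Placeholder connection string; point it at the shared test database.
    CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
                "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes")

    def delete_expired_rows():
        conn = pyodbc.connect(CONN_STR)
        cursor = conn.cursor()
        # Remove any test rows whose expiration has passed.
        cursor.execute("DELETE FROM TestData WHERE expires_at < SYSUTCDATETIME()")
        deleted = cursor.rowcount
        conn.commit()
        conn.close()
        return deleted

    if __name__ == "__main__":
        print("Expired rows deleted:", delete_expired_rows())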

Blocking query on a given foreign key until all inserts for that foreign key complete

I have two workers that do concurrent insert/select on a relatively large (circa 60 million rows) MySQL table.
Worker 1 inserts a new row and enqueues a message that includes the new row's foreign key. This occurs roughly once every day for a given foreign key value.
Worker 2 dequeues the message, queries the most recent record for the foreign key in the table and processes it further.
It seems that, more often than not, Worker 2 does not get the latest record from the table. Is there a way to block Worker 2's queries for a given foreign key until the insert on that foreign key has completed?
Thanks in advance
I probably lack some key information to answer precisely, but...
... my guess is that the "message" arrives before the changes made to the DB in one transaction are visible to the other one.
Please remember that, depending on your transaction isolation level, (1) the changes are not propagated to the DB before you have successfully committed the transaction, and (2) the new state of the DB is not visible until you start a new transaction.
And even so, as far as I know, MySQL does not guarantee any "propagation delay". And I don't think there is any mechanism available in MySQL to signal to other connections that a transaction has successfully committed and that its changes are now visible.
I don't know your system, but just as a piece of advice, I don't think that using a message queue to signal a change while using MySQL to store the actual change data is a good solution. If I were you, I would take a look at a proper messaging system such as RabbitMQ, which will not only allow you to signal a change but will also carry the new state -- with almost, if not all, the same guarantees as an RDBMS. Maybe you could refactor your application to use a message queue to process changes, and only once that is done, use MySQL to permanently store the data.
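To make the ordering point concrete, here is a Python sketch of what Worker 1 could look like if the commit happens strictly before the message is published, so the row is already visible when Worker 2 picks the message up. It uses mysql-connector-python and pika (RabbitMQ); the table, queue and connection details are placeholders.

    import json

    import mysql.connector
    import pika

    def insert_and_notify(fk_id, payload):
        # 1) Insert and COMMIT first, so the new row is visible to other
        #    connections before anyone is told about it.
        db = mysql.connector.connect(host="localhost", user="app",
                                     password="secret", database="mydb")
        cursor = db.cursor()
        cursor.execute("INSERT INTO records (fk_id, payload) VALUES (%s, %s)",
                       (fk_id, json.dumps(payload)))
        db.commit()
        db.close()

        # 2) Only after the commit, publish the notification for Worker 2.
        conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
        channel = conn.channel()
        channel.queue_declare(queue="new_records", durable=True)
        channel.basic_publish(exchange="", routing_key="new_records",
                              body=json.dumps({"fk_id": fk_id}))
        conn.close()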

Set eventual consistency (late commit) in MySQL

Consider the following situation: you want to update the number of page views of each profile in your system. This action is very frequent, as almost all visits to your website result in a page view increment.
The basic way is update Users set page_views=page_views+1. But this is far from optimal, because we don't really need an instant update (being an hour late is OK). Is there any other way in MySQL to postpone a sequence of updates and apply them cumulatively at a later time?
I tried another method myself: storing a counter (number of increments) for each profile in a file. But this results in handling a few thousand small files, and I think the disk I/O cost (even with a deep tree structure for the files) would probably exceed that of the database.
What is your suggestion for this problem (other than MySQL)?
To improve performance you could store your page view data in a MEMORY table - this is super fast but temporary: the data only persists while the server is running, and on restart the table will be empty...
You could then create an EVENT to update a table that persists the data on a timed basis. This would improve performance, with the risk that, should the server go down, only the visits recorded since the last run of the event would be lost.
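A sketch of that setup, created once via mysql-connector-python (you could equally run the same statements from the mysql client). Table, column and event names are placeholders, and the event scheduler must be enabled (event_scheduler=ON).

    import mysql.connector

    db = mysql.connector.connect(host="localhost", user="app",
                                 password="secret", database="mydb")
    cursor = db.cursor()

    # In-memory staging table for the raw counts (contents are lost on restart).
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS page_view_buffer (
            user_id INT NOT NULL PRIMARY KEY,
            views   INT NOT NULL DEFAULT 0
        ) ENGINE=MEMORY
    """)

    # The application would then record a view with:
    #   INSERT INTO page_view_buffer (user_id, views) VALUES (%s, 1)
    #   ON DUPLICATE KEY UPDATE views = views + 1

    # Every hour, fold the buffered counts into the persistent Users table
    # and reset the buffer.
    cursor.execute("""
        CREATE EVENT IF NOT EXISTS flush_page_views
        ON SCHEDULE EVERY 1 HOUR
        DO BEGIN
            UPDATE Users u
            JOIN page_view_buffer b ON b.user_id = u.id
            SET u.page_views = u.page_views + b.views;
            DELETE FROM page_view_buffer;
        END
    """)
    db.commit()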
The link posted by James in the comment on your question (which leads to an accepted answer, with another comment about memcached) was my first thought as well. Just store the profile IDs in memcached, then set up a cron job to run every 15 minutes, grab all the entries, and issue the updates to MySQL in a batch. But there are a few things to consider:
1. When you run the batch script to grab the IDs out of memcached, you will have to ensure you remove all entries which have been parsed, otherwise you run the risk of counting the same profile views multiple times.
2. Because memcached doesn't support wildcard searching by key, and because you will have to purge existing keys for the reason stated in #1, you will probably have to set up a separate memcached server pool dedicated to tracking profile IDs, so you don't end up purging cached values which have no relation to profile view tracking. However, you could avoid this by storing the profile ID and a timestamp within the value payload; your batch script would then step through each entry and check the timestamp, add it to the update queue if it's within the time range you specified, and stop once it hits the upper limit of that range.
Another option may be to parse your access logs. If user profiles are at a known location like /myapp/profile/1234, you could parse for this pattern and add profile views that way. I ended up having to go this route for advertiser tracking, as it turned out to be the only repeatable way to generate billing numbers. If there was ever a billing dispute, we would offer to send them the access logs so they could parse them themselves.
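For what it's worth, here is a Python sketch of that log-parsing route: count profile hits from the access log and apply them to MySQL in a single batch. The log path, URL pattern and table/column names are assumptions.

    import re
    from collections import Counter

    import mysql.connector

    # Matches request lines like: "GET /myapp/profile/1234 HTTP/1.1"
    PROFILE_RE = re.compile(r'"GET /myapp/profile/(\d+)')

    def count_profile_views(log_path="/var/log/nginx/access.log"):
        counts = Counter()
        with open(log_path) as log:
            for line in log:
                match = PROFILE_RE.search(line)
                if match:
                    counts[int(match.group(1))] += 1
        return counts

    def flush_to_mysql(counts):
        db = mysql.connector.connect(host="localhost", user="app",
                                     password="secret", database="mydb")
        cursor = db.cursor()
        cursor.executemany(
            "UPDATE Users SET page_views = page_views + %s WHERE id = %s",
            [(views, profile_id) for profile_id, views in counts.items()])
        db.commit()
        db.close()

    if __name__ == "__main__":
        flush_to_mysql(count_profile_views())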