Determine TTL of stored documents - Couchbase

I have a bunch of documents stored in couchbase bucket. I want to find the TTL for each document. Is there a way to do it?
EDIT:
I created a view to display the metadata. I assume below is the unix time for expiration.
{
"id": "beer-Milk-New",
"rev": "9-145c05b4021400005771779102000000",
"expiration": 1467053969,
"flags": 33554432
}

You didn't specify which version of Couchbase Server, but this answer assumes you are using 4.x and can run a N1QL query.
I think you're on the right track. Check out this blog post on detecting expirations before they happen: http://blog.couchbase.com/2016/april/detecting-expirations-of-documents-with-couchbase-server-and-n1ql
Basically you need two steps. First, create an index on the expiration time.
CREATE INDEX iExpiration ON default(exp_datetime) USING GSI;
This isn't strictly necessary, I don't think, but it will improve the performance.
Second, just run a N1QL query something like below (which will list documents that will expire in the next 30 seconds).
SELECT META(default).id, *
FROM default
WHERE DATE_DIFF_STR(STR_TO_UTC(exp_datetime),MILLIS_TO_UTC(DATE_ADD_MILLIS(NOW_MILLIS(),30,"second")),"second") < 30
AND STR_TO_UTC(exp_datetime) IS NOT MISSING;
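So, if you want to find the TTL for each document, you would need to subtract the current time from exp_datetime (not the other way round, or the result is negative for documents that have not yet expired). The current time is available as NOW_MILLIS(), as in the query above; I believe CLOCK_MILLIS() works as well.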
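As a rough illustration of that subtraction outside N1QL, here is a minimal Python sketch. It assumes the expiration value shown in the view output is an absolute Unix timestamp in seconds (as the question guesses) and that 0 means the document never expires; remaining_ttl_seconds is a hypothetical helper name, not part of any Couchbase SDK.

```python
import time
from typing import Optional


def remaining_ttl_seconds(expiration: int, now: Optional[float] = None) -> Optional[int]:
    """Return the seconds left before a document expires.

    Assumes `expiration` is an absolute Unix timestamp in seconds.
    Returns None when expiration is 0 (no expiry set) and 0 when the
    expiration time has already passed.
    """
    if expiration == 0:
        return None
    if now is None:
        now = time.time()
    return max(0, int(expiration - now))


# Metadata as emitted by the view in the question.
meta = {"id": "beer-Milk-New", "expiration": 1467053969}

# A fixed "now" keeps the example reproducible: 30 seconds before expiry.
ttl = remaining_ttl_seconds(meta["expiration"], now=1467053939)
print(meta["id"], ttl)
```

Passing now explicitly is only for the example; in real use you would omit it and let the helper read the system clock.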

Related

Keeping mysql query results and variables set from the queries for use across pages and sessions

I have never used apc_store() before, and I'm also not sure about whether to free query results or not. So I have these questions...
In a MySQL Query Cache article here, it says "The MySQL query cache is a global one shared among the sessions. It caches the select query along with the result set, which enables the identical selects to execute faster as the data fetches from the in memory."
Does using free_result() after a select query negate the caching spoken of above?
Also, if I want to set variables and arrays obtained from the select query for use across pages, should I save the variables in memory via apc_store() for example? (I know that can save arrays too.) And if I do that, does it matter if I free the result of the query? Right now, I am setting these variables and arrays in an included file on most pages, since they are used often. This doesn't seem very efficient, which is why I'm looking for an alternative.
Thanks for any help/advice on the most efficient way to do the above.

MySQL's "Query cache" is internal to MySQL. You still have to perform the SELECT; the result may come back faster if the QC is enabled and usable in the situation.
I don't think the QC is what you are looking for.
The QC was deprecated in MySQL 5.7.20 and removed in 8.0. Do not plan to use it.
In PHP, consider $_SESSION. I don't know whether it is better than apc_store for your use.
Note also, anything that is directly available in PHP constrains you to a single webserver. (This is fine for small to medium apps, but is not viable for very active apps.)
For scaling, consider storing a small key in a cookie, then looking up that key in a table in the database. This provides for storing arbitrary amounts of data in the database with only a few milliseconds of overhead. The "key" might be something as simple as a "user id" or "session number" or "cart number", etc.
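The cookie-key-plus-table idea could look roughly like the sketch below. It is written in Python with an in-memory SQLite database standing in for MySQL, purely so that it runs on its own; the table name session_data, its columns, and the helper names are all assumptions made for illustration.

```python
import json
import secrets
import sqlite3

# SQLite stands in for MySQL here; the table layout is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE session_data (session_key TEXT PRIMARY KEY, payload TEXT NOT NULL)"
)


def save_session(conn, data):
    """Store arbitrary JSON-serialisable data and return the small key
    that would be placed in the cookie."""
    key = secrets.token_hex(16)
    conn.execute(
        "INSERT INTO session_data (session_key, payload) VALUES (?, ?)",
        (key, json.dumps(data)),
    )
    conn.commit()
    return key


def load_session(conn, key):
    """Look the cookie key up in the table; None if it is unknown."""
    row = conn.execute(
        "SELECT payload FROM session_data WHERE session_key = ?", (key,)
    ).fetchone()
    return json.loads(row[0]) if row else None


cookie_key = save_session(conn, {"user_id": 42, "cart": [1001, 1002]})
restored = load_session(conn, cookie_key)
print(restored)
```

Because the data lives in the database rather than in one web server's memory, any web server that receives the cookie can restore it.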

How to paginate more than 10M of records efficiently

I need to paginate more than 30M users which are hosted on MySQL. I'm displaying 15 users per page, but it's quite slow. My goal is to access any random page and load it in a few ms.
At the beginning, I was using the offset method for MySQL, but as I said, it is quite slow (and a bad idea for sure). Then I moved to ElasticSearch, but it still has a result window limit, so you are limited. After that, I have been checking different approaches like the "cursor" method, but with it I cannot access an arbitrary page. For example, we start at the first page, and we have 100000 pages; I would like to access the 4782th page, and load it in a few ms. With the cursor method, I'm only able to access the next and previous page, and the "scroll" method doesn't fit what I really need.
My users are not sorted just by ID, so I cannot use the ID as a delimiter. I have already considered late row lookups.
I don't mind moving all my data to a new DB (but would be happy to find different solutions). Here Amazon does it really well (https://www.amazon.com/review/top-reviewers)
Query using offset:
SELECT users.* from users
WHERE users.country = 'DE'
ORDER BY users.posts_count DESC, users.id DESC
LIMIT 15 OFFSET 473
PS: My user list is almost real-time, so it's changing every hour.
Any ideas? Thanks a lot!

"access the 4782th page" -- What is the use case for this? "Pagination" is useful for a few pages, maybe a few dozen pages, but not thousands.
[Next], [Prev], [First], [Last] are useful. But if you want a random probe, then call it a [Random] probe, not "page 4782".
OFFSET is inefficient. Here is a discussion of an alternative: http://mysql.rjweb.org/doc.php/pagination
Meanwhile add INDEX(country, posts_count, id)
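For reference, a common alternative to OFFSET is the "remember where you left off" (keyset) approach: instead of skipping rows, the query restarts from the last row of the previous page. The Python sketch below is only an illustration under stated assumptions: SQLite stands in for MySQL, the data is made up, and fetch_page is a hypothetical helper. It uses the same ordering and the same composite index as suggested above. Note that it only provides next-page access, not a jump to an arbitrary page.

```python
import sqlite3

# SQLite stands in for MySQL; the rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, country TEXT, posts_count INTEGER)"
)
conn.execute("CREATE INDEX idx_country_posts_id ON users (country, posts_count, id)")
conn.executemany(
    "INSERT INTO users (id, country, posts_count) VALUES (?, ?, ?)",
    [(i, "DE", i % 7) for i in range(1, 101)],
)


def fetch_page(conn, country, page_size, last_seen=None):
    """Return one page ordered by posts_count DESC, id DESC.

    last_seen is the (posts_count, id) pair of the final row on the
    previous page; None means "start from the top".
    """
    if last_seen is None:
        sql = (
            "SELECT id, posts_count FROM users WHERE country = ? "
            "ORDER BY posts_count DESC, id DESC LIMIT ?"
        )
        params = (country, page_size)
    else:
        last_posts, last_id = last_seen
        # Continue strictly after the last row seen, in the same order.
        sql = (
            "SELECT id, posts_count FROM users WHERE country = ? "
            "AND (posts_count < ? OR (posts_count = ? AND id < ?)) "
            "ORDER BY posts_count DESC, id DESC LIMIT ?"
        )
        params = (country, last_posts, last_posts, last_id, page_size)
    return conn.execute(sql, params).fetchall()


page1 = fetch_page(conn, "DE", 15)
last_id, last_posts = page1[-1]
page2 = fetch_page(conn, "DE", 15, last_seen=(last_posts, last_id))
print(page2[:3])
```

The cost of each page is then independent of how deep into the list it is, because the index can seek straight to the starting row.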

One way to achieve this with Elasticsearch is to add a linearly increasing field (e.g. sort_field) to each of your records (or use your ID field if it's linearly increasing). The first record's field has value 1, the second 2, the third 3, etc...
Then, if you sort by that field in ascending mode, you can use the search_after feature in order to access any record directly.
For instance, if you need to access the 4782th page (i.e. record 71730 and following), you can achieve it like this:
POST my-index/_search
{
"size": 15, <--- the page size
"sort": [
{
"sort_field": "asc" <--- properly ordering the records
}
],
"search_after": [ 71730 ] <--- direct access to the desired record/page
}
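The search_after value in the request above follows from simple arithmetic on the page number. The Python sketch below shows that calculation and builds the same request body; it assumes sort_field is gapless and starts at 1, as described, and search_after_body is a hypothetical helper. The value 71730 corresponds to counting pages from 0; if your first page is numbered 1, the start of page 4782 is one page earlier.

```python
import json


def search_after_body(page, page_size, first_page=0):
    """Build a search body that jumps straight to `page`.

    Assumes every record carries a gapless, 1-based `sort_field`, so
    the last record before the wanted page has the value
    (page - first_page) * page_size.
    """
    last_before_page = (page - first_page) * page_size
    return {
        "size": page_size,
        "sort": [{"sort_field": "asc"}],
        "search_after": [last_before_page],
    }


zero_based = search_after_body(4782, 15)
one_based = search_after_body(4782, 15, first_page=1)
print(json.dumps(zero_based))
print(one_based["search_after"])
```

Any gap in sort_field (for example after deletions) breaks the arithmetic, so the field would have to be renumbered whenever the ordering changes.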
Under certain circumstances, it is also possible to make the sorting even faster by leveraging the index sorting capability.
Note: deep pagination is not something Elasticsearch has been built for. The solution above works, but can have some shortcomings (see comments) depending on your context. It might not be the best available technology for what you need to do.

Get an accurate count of items in a bucket

The couchbase admin console (I'm using version 5.0, community) shows a count of items in each bucket. I'm wondering if that count is just a rough estimate and not an exact count of the number of items in the bucket. Here's the behavior I'm seeing that leads me to this reasoning:
When I use XDCR to replicate a bucket to a backup node, the count in the backup bucket after the XDCR has finished will be significantly higher than the count of documents in the source bucket, sometimes by tens of thousands (in a bucket that contains hundreds of millions of documents).
When I use the Java DCP client to clone a bucket to a table in a different database, the other database shows numbers of records that are close, but off by possibly even a few million (again, in a bucket with hundreds of millions of documents).
How can I get an accurate count of the exact number of items in a bucket, so that I can be sure, after my DCP or XDCR processes have completed, that all documents have made it to the new location?

There can be a number of different reasons why the counts differ; without more details it is hard to say which one applies. The common cases are:
"The couchbase admin console (I'm using version 5.0, community) shows a count of items in each bucket."
The Admin console is accurate but does not auto-update, so a refresh is required.
"When I use the Java DCP client to clone a bucket to a table in a different database, the other database shows numbers of records that are close, but off by possibly even a few million (again, in a bucket with hundreds of millions of documents)."
DCP will include tombstones (deleted documents) and possibly multiple mutations for the same document, which could explain why the DCP count is off.
With regard to N1QL, if the query is a simple SELECT COUNT(*) FROM bucketName then, depending on the Couchbase Server version, it will use the bucket stats directly.
In other words, as mentioned previously, the bucket stats obtained via the REST interface or by asking the Data service directly will be accurate.

The most accurate answer would be to go directly to the bucket info, with something like:
curl http://hostname:8091/pools/default/buckets/beer-sample/ -u user:password | jq '.basicStats | {itemCount: .itemCount }'
the result would be immediate, no need for indexing:
{
"itemCount": 7303
}
Or, as a bare number instead of a JSON object:
curl http://centos:8091/pools/default/buckets/beer-sample/ -u roi:password | jq '.basicStats.itemCount'
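If you would rather do the same lookup from a script than with curl and jq, a minimal Python sketch could look like this. It relies only on what the commands above show (the /pools/default/buckets/<bucket> endpoint on port 8091 and the basicStats.itemCount field); the helper names and the sample response are assumptions, and the sample is trimmed to the one field of interest.

```python
import base64
import json
import urllib.request


def item_count_from_bucket_info(body: str) -> int:
    """Extract basicStats.itemCount from the bucket info JSON."""
    return json.loads(body)["basicStats"]["itemCount"]


def fetch_item_count(host: str, bucket: str, user: str, password: str) -> int:
    """Query the bucket info endpoint with HTTP basic auth.

    Not called in this example because it needs a reachable cluster.
    """
    url = f"http://{host}:8091/pools/default/buckets/{bucket}"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    with urllib.request.urlopen(request) as response:
        return item_count_from_bucket_info(response.read().decode())


# Trimmed, made-up response body used to exercise the parser offline.
sample_body = '{"name": "beer-sample", "basicStats": {"itemCount": 7303}}'
count = item_count_from_bucket_info(sample_body)
print(count)
```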

Alright, here I am to answer my own question over a year later :). We did a lot of experimentation today when trying to migrate items out of a bucket containing roughly 2.6 million items into an SQL database. We wanted to make sure the row count matched between Couchbase and the new database before going live.
Unfortunately when we tried the normal select count(*) from <bucket>; the document count we received was over what we expected by just 1, so we broke down the query and did a count over all documents in the bucket while grouping by an attribute, hoping to find what kind of document was missing in the target DB. The total for the counts for each group should have added up to the same total that we got from the count query. Unfortunately, they did not. The total added up to 1 fewer than we expected (so that's off by two from the original count query).
We found the category of document that was off by 1, expecting to have an extra doc in Couchbase that didn't make it to the target DB, but found instead that the totals indicated the reverse, that the target DB had one extra doc. This all seemed very fishy, so we did a query to pull all of the IDs in that group out into a single JSON file, and we counted them. Alas, the actual count of documents in that group matched up with the target DB, meaning that Couchbase's counting was incorrect in both cases.
I'm not sure what implementation details caused this to happen, but it seems like at least the over-counting might have been a caching issue. I was able to finally get a correct document count by using a query like this:
select count(*) from <bucket> where meta(<bucket>).id;
This query ran for much longer than the original count did, indicating that whatever cache is used for counts was being skipped, and it did come up with the correct number.
We were doing these tests on a relatively small number of documents, half a million or so. With the full volume of the bucket, counts had been off by as much as 15 in the past, apparently becoming less accurate as the document count increased.
We just did a re-sync of the full bucket. The bucket totals as reported by the dashboard and by the original N1QL query are over the expected count by 7. We ran the modified query, waited for the result, and got the expected count.
In case you're wondering, we did turn off traffic to the bucket, so document counts were not likely to be fluctuating during this process, except when a document reached its expiry date in Couchbase, and was automatically deleted.

To get an accurate count, you can run a N1QL query. That will get you as accurate a number as Couchbase is capable of producing.
SELECT COUNT(*) FROM bucketName
Use REQUEST_PLUS consistency to make sure the indexes have received the very latest updates.
https://developer.couchbase.com/documentation/server/current/indexes/performance-consistency.html
You'll need a query node for this, though.

Whats faster on mysql? Column per column handling or via index?

I'm currently working on a cloud project. It's hosted on Amazon AWS and the data is stored in RDS (MySQL). I have many devices making many small requests: the devices ask the server for new commands to execute. The devices have some parameters like "power"=1 or 0, etc., so the commands are used to tell the devices what to do. Now there are two scenarios:
Every command is a column in the table "commands"; the devices ask and the server searches for commands with device=ID ("the classic style"). It gives back the column and deletes it (2 queries).
There is a table called "parameters", where all the "power", ... status values are stored; every line also has a timestamp and the device. So on every request the server says: OK, the timestamp of the device is xxx, so which parameters were updated after xxx?
The description is a bit complicated, sorry for that. The point is: in the first idea there are not as many columns as in the second. But in the second the server has to check every column for WHERE device=ID AND timestampx > 'device_time_stamp'. Every device asks every 5 seconds and there will be a lot of devices, so it's a question of performance.
Thanks folks

On the limited information available, you might have something like this:
device_command
(device_command_id -- PK establishing a sequence for the commands
,device_id
,command_id
,completed -- either a simple 1/0 flag, or a timestamp
)
I'm not sure I understand how the 'parameters' fit into all this so I'll leave it at that for now.
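To make the device_command outline above concrete, here is a small Python sketch of how a device poll could be served from such a table. SQLite stands in for MySQL so that it runs on its own; the column types, the extra index, and the helper names are assumptions, and completed is modelled as the simple 1/0 flag variant.

```python
import sqlite3

# SQLite stands in for MySQL; column types and the index are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE device_command (
           device_command_id INTEGER PRIMARY KEY AUTOINCREMENT,
           device_id INTEGER NOT NULL,
           command_id INTEGER NOT NULL,
           completed INTEGER NOT NULL DEFAULT 0
       )"""
)
conn.execute("CREATE INDEX idx_device_pending ON device_command (device_id, completed)")


def queue_command(conn, device_id, command_id):
    """Append a command for a device; the PK fixes the execution order."""
    conn.execute(
        "INSERT INTO device_command (device_id, command_id) VALUES (?, ?)",
        (device_id, command_id),
    )
    conn.commit()


def fetch_pending(conn, device_id):
    """Return the device's pending command ids in order and mark them completed."""
    rows = conn.execute(
        "SELECT device_command_id, command_id FROM device_command "
        "WHERE device_id = ? AND completed = 0 ORDER BY device_command_id",
        (device_id,),
    ).fetchall()
    conn.executemany(
        "UPDATE device_command SET completed = 1 WHERE device_command_id = ?",
        [(row[0],) for row in rows],
    )
    conn.commit()
    return [row[1] for row in rows]


queue_command(conn, 1, 10)
queue_command(conn, 2, 99)
queue_command(conn, 1, 11)
first_poll = fetch_pending(conn, 1)
second_poll = fetch_pending(conn, 1)
print(first_poll, second_poll)
```

Flagging rows instead of deleting them keeps a history of what each device was told to do; a periodic cleanup can purge old completed rows.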

redis as write-back view count cache for mysql

I have a very high throughput site for which I'm trying to store "view counts" for each page in a mySQL database (for legacy reasons they must ultimately end up in mySQL).
The sheer number of views is making it impractical to do SQL "UPDATE ITEM SET VIEW_COUNT=VIEW_COUNT+1" type of statements. There are millions of items but most are only viewed a small number of times, others are viewed many times.
So I'm considering using Redis to gather the view counts, with a background thread that writes the counts to mySQL. What is the recommended method for doing this? There are some issues with the approach:
how often does the background thread run?
how does it determine what to write back to mySQL?
should I store a Redis KEY for every ITEM that gets hit?
what TTL should I use?
is there already some pre-built solution or powerpoint presentation that gets me halfway there, etc.
I have seen very similar questions on StackOverflow but none with a great answer...yet! Hoping there's more Redis knowledge out there at this point.

I think you need to step back and look at some of your questions from a different angle to get to your answers.
"how often does the background thread run?"
To answer this you need to answer these questions: How much data can you lose? What is the reason for the data being in MySQL, and how often is that data accessed? For example, if the DB is only needed to be consulted once per day for a report, you might only need it to be updated once per day. On the other hand, what if the Redis instance dies? How many increments can you lose and still be "ok"? These will provide the answers to the question of how often to update your MySQL instance and aren't something we can answer for you.
I would use a very different strategy for storing this in redis. For the sake of the discussion let us assume you decide you need to "flush to db" every hour.
Store each hit in hashes with a key name structure along these lines:
interval_counter:DD:HH
interval_counter:total
Use the page id (such as MD5 sum of the URI, the URI itself, or whatever ID you currently use) as the hash key and do two increments on a page view; one for each hash. This provides you with a current total for each page and a subset of pages to be updated.
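A minimal Python sketch of that double increment follows. To stay runnable without a server it uses a tiny in-memory stand-in that mimics only the hincrby and hgetall methods of a Redis client; the class, the helper names, and the choice of UTC for the DD:HH part are assumptions. A real client object with the same two methods could be passed to record_hit instead.

```python
from datetime import datetime, timezone


class InMemoryHashes:
    """Stand-in for a Redis client; mimics only hincrby and hgetall."""

    def __init__(self):
        self._hashes = {}

    def hincrby(self, name, key, amount=1):
        fields = self._hashes.setdefault(name, {})
        fields[key] = fields.get(key, 0) + amount
        return fields[key]

    def hgetall(self, name):
        return dict(self._hashes.get(name, {}))


def interval_key(now):
    """Hash name for the current interval, e.g. interval_counter:27:08."""
    return now.strftime("interval_counter:%d:%H")


def record_hit(client, page_id, now):
    """Two increments per page view: the interval hash and the running total."""
    client.hincrby(interval_key(now), page_id, 1)
    client.hincrby("interval_counter:total", page_id, 1)


client = InMemoryHashes()
record_hit(client, "page-a", datetime(2013, 5, 27, 8, 15, tzinfo=timezone.utc))
record_hit(client, "page-a", datetime(2013, 5, 27, 8, 40, tzinfo=timezone.utc))
record_hit(client, "page-a", datetime(2013, 5, 27, 9, 5, tzinfo=timezone.utc))

print(client.hgetall("interval_counter:27:08"))
print(client.hgetall("interval_counter:total"))
```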
You would then have your cron job run a minute or so after the start of the hour to pull down all pages with updated view counts by grabbing the previous hour's hash. This provides you with a very fast means of getting the data to update the MySQL DB with while avoiding any need to do math or play tricks with timestamps, etc. By pulling data from a key which is no longer being incremented you avoid race conditions due to clock skew.
You could set an expiration on the daily key, but I'd rather use the cron job to delete it when it has successfully updated the DB. This means your data is still there if the cron job fails or fails to be executed. It also provides the front-end with a full set of known hit counter data via keys that do not change. If you wanted, you could even keep the daily data around to be able to do window views of how popular a page is. For example if you kept the daily hash around for 7 days by setting an expire via the cron job instead of a delete, you could display how much traffic each page has had per day for the last week.
The two HINCRBY operations can be executed individually or pipelined; either way this still performs quite well and is more efficient than doing calculations and munging data in code.
Now for the question of expiring the low traffic pages vs memory use. First, your data set doesn't sound like one which will require huge amounts of memory. Of course, much of that depends on how you identify each page. If you have a numerical ID the memory requirements will be rather small. If you still wind up with too much memory, you can tune it via the config, and if needs be could even use a 32 bit compile of redis for a significant memory use reduction. For example, the data I describe in this answer I used to manage for one of the ten busiest forums on the Internet and it consumed less than 3GB of data. I also stored the counters in far more "temporal window" keys than I am describing here.
That said, in this use case Redis is the cache. If you are still using too much memory after the above options you could set an expiration on keys and add an EXPIRE command to each hit. More specifically, if you follow the above pattern you will be doing the following per hit:
HINCRBY -> total
HINCRBY -> daily
EXPIRE -> total
This lets you keep anything that is actively used fresh by extending its expiration every time it is accessed. Of course, to do this you'd need to wrap your display call to catch the null answer for HGET on the totals hash and populate it from the MySQL DB, then increment. You could even do both as an increment. This would preserve the above structure and would likely be the same codebase needed to update the Redis server from the MySQL DB if the Redis node needed repopulation. For that you'll need to consider and decide which data source will be considered authoritative.
You can tune the cron job's performance by modifying your interval in accordance with the parameters of data integrity you determine from the earlier questions. To get a faster running cron job you decrease the window. With this method decreasing the window means you should have a smaller collection of pages to update. A big advantage here is you don't need to figure out what keys you need to update and then go fetch them. You can do an HGETALL and iterate over the hash's keys to do updates. This also saves many round trips by retrieving all the data at once. In either case you will likely want to consider a second Redis instance slaved to the first to do your reads from. You would still do deletes against the master but those operations are much quicker and less likely to introduce delays in your write-heavy instance.
If you need disk persistence of the Redis DB, then certainly put that on a slave instance. Otherwise if you do have a lot of data being changed often your RDB dumps will be constantly running.
I hope that helps. There are no "canned" answers because to use Redis properly you need to think first about how you will access the data, and that differs greatly from user to user and project to project. Here I based the route taken on this description: two consumers accessing the data, one to display only and the other to determine updating another datasource.

Consolidation of my other answer:
Define a time interval at which the transfer from Redis to MySQL should happen, e.g. minute, hour or day. Define it so that an identifying key can be obtained quickly and easily. This key must be ordered, i.e. an earlier time should give a smaller key.
Let it be hourly and the key be YYYYMMDD_HH for readability.
Define a prefix like "hitcount_".
Then for every time-interval you set a hash hitcount_<timekey> in redis which contains all requested items of that interval in the form ITEM => count.
There are two parts to the solution:
1) The actual page that has to count:
a) get the current $timekey, e.g. via date functions
b) get the value of $ITEM
c) send the Redis command HINCRBY hitcount_$timekey $ITEM 1
2) A cronjob which runs in that given interval, not too close to the boundary of the interval (in the example: not on the full hour). This cronjob:
a) extracts the current time key (for now it would be 20130527_08)
b) requests all matching keys from Redis with KEYS hitcount_* (those should be a small number)
c) compares every such key against the current hitcount_<timekey>
d) if that key is smaller than the current key, processes it as $processing_key:
read all pairs ITEM => counter by HGETALL $processing_key as $item, $cnt
update the database with "UPDATE ITEM SET VIEW_COUNT=VIEW_COUNT+$cnt WHERE ITEM=$item"
delete that field from the hash by HDEL $processing_key $item
no need to DEL the hash itself: as far as I have tried, Redis keeps no empty hashes
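The cronjob part of the steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not a drop-in job: an in-memory stand-in mimics the few Redis client methods used, SQLite stands in for MySQL, and all names are hypothetical. It also deviates from the steps in one place: it deletes the whole hash after the database commit instead of issuing HDEL per field.

```python
import sqlite3


class InMemoryRedis:
    """Stand-in for a Redis client; mimics hincrby, keys, hgetall, delete."""

    def __init__(self):
        self._hashes = {}

    def hincrby(self, name, key, amount=1):
        fields = self._hashes.setdefault(name, {})
        fields[key] = fields.get(key, 0) + amount
        return fields[key]

    def keys(self, pattern):
        # Only a trailing-* prefix pattern is supported in this stand-in.
        prefix = pattern.rstrip("*")
        return [name for name in self._hashes if name.startswith(prefix)]

    def hgetall(self, name):
        return dict(self._hashes.get(name, {}))

    def delete(self, *names):
        for name in names:
            self._hashes.pop(name, None)


def flush_old_intervals(client, db, current_timekey, prefix="hitcount_"):
    """Write every hash older than the current interval to the database."""
    current_key = prefix + current_timekey
    for key in sorted(client.keys(prefix + "*")):
        if key >= current_key:
            continue  # still being incremented by page views
        for item, cnt in client.hgetall(key).items():
            db.execute(
                "UPDATE item SET view_count = view_count + ? WHERE item = ?",
                (int(cnt), item),
            )
        db.commit()
        client.delete(key)  # whole hash at once, after the commit


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE item (item TEXT PRIMARY KEY, view_count INTEGER NOT NULL)")
db.executemany("INSERT INTO item VALUES (?, 0)", [("a",), ("b",)])

client = InMemoryRedis()
client.hincrby("hitcount_20130527_07", "a", 3)
client.hincrby("hitcount_20130527_07", "b", 1)
client.hincrby("hitcount_20130527_08", "a", 5)  # current hour, left alone

flush_old_intervals(client, db, "20130527_08")
counts = dict(db.execute("SELECT item, view_count FROM item"))
print(counts, client.keys("hitcount_*"))
```

With a real Redis client, field names and values may come back as bytes unless the client is configured to decode responses, so the item lookup would need the same encoding on both sides.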
If you want to have a TTL involved, say because the cleanup cronjob may not be reliable (it might not run for many hours), then you could have the cronjob create the future hashes with an appropriate TTL. For example, right now we could create a hash 20130527_09 with a TTL of 10 hours, 20130527_10 with a TTL of 11 hours, and 20130527_11 with a TTL of 12 hours. The problem is that you would need a pseudo-key in each, because empty hashes seem to be deleted automatically.

See EDIT3 for the current state of the answer.
I would write a key for every ITEM. A few tens of thousands of keys are definitely no problem at all.
Do the pages change very much? I mean do you get a lot of pages that will never be called again? Otherwise I would simply:
add the value for an ITEM on page request.
every minute or 5 minutes call a cronjob that reads the redis-keys, read the value (say 7) and reduce it by decrby ITEM 7. In MySQL you could increment the value for that ITEM by 7.
If you have a lot of pages/ITEMS which will never be called again you could make a cleanup-job once a day to delete keys with value 0. This should be locked against incrementing that key again from the website.
I would set no TTL at all, so the values should live forever. You could check the memory usage, but with the gigabytes of memory available today I expect a lot of different pages to fit.
EDIT: incr is very nice for that, because it sets the key if not set before.
EDIT2: Given the large number of different pages, instead of the slow "keys *" command you could use HASHES with HINCRBY (http://redis.io/commands/hincrby). Still, I am not sure if HGETALL is much faster than KEYS *, and a HASH does not allow a TTL for single fields.
EDIT3: Oh well, sometimes the good ideas come late. It is so simple: Just prefix the key with a timeslot (say day-hour) or make a HASH with name "requests_". Then no overlapping of delete and increment may happen! Every hour you take the possible keys with older "day_hour_*" - values, update the MySQL and delete those old keys. The only condition is that your servers are not too different on their clock, so use UTC and synchronized servers, and don't start the cron at x:01 but x:20 or so.
That means: a called page converts a call of ITEM1 at 23:37, May 26 2013 to Hash 20130526_23, ITEM1. HINCRBY count_20130526_23 ITEM1 1
One hour later the list of keys count_* is checked, and all up to count_20130526_23 are processed (read key-value by HGETALL, update MySQL), and deleted one by one after processing (HDEL). After finishing that you check if HLEN is 0 and DEL count_...
So you only have a small amount of keys (one per unprocessed hour), that makes keys count_* fast, and then process the actions of that hour. You can give a TTL of a few hours, if your cron is delayed or timejumped or down for a while or something like that.
