Get an accurate count of items in a bucket - couchbase

The couchbase admin console (I'm using version 5.0, community) shows a count of items in each bucket. I'm wondering if that count is just a rough estimate and not an exact count of the number of items in the bucket. Here's the behavior I'm seeing that leads me to this reasoning:
When I use XDCR to replicate a bucket to a backup node, the count in the backup bucket after the XDCR has finished will be significantly higher than the count of documents in the source bucket, sometimes by tens of thousands (in a bucket that contains hundreds of millions of documents).
When I use the Java DCP client to clone a bucket to a table in a different database, the other database shows numbers of records that are close, but off by possibly even a few million (again, in a bucket with hundreds of millions of documents).
How can I get an accurate count of the exact number of items in a bucket, so that I can be sure, after my DCP or XDCR processes have completed, that all documents have made it to the new location?

There can be a number of different reasons why the count could be different; without more details it would be hard to say. The common cases are:
The couchbase admin console (I'm using version 5.0, community) shows a count of items in each bucket.
The Admin console is accurate but does not auto-update, so a refresh is required.
When I use the Java DCP client to clone a bucket to a table in a different database, the other database shows numbers of records that are close, but off by possibly even a few million (again, in a bucket with hundreds of millions of documents).
DCP will include tombstones (deleted documents) and possibly multiple mutations for the same document, which could explain why the DCP count is off.
With regards to using N1QL, if the query is a simple SELECT COUNT(*) FROM bucketName then depending on the Couchbase Server version it will use the bucket stats directly.
In other words, as mentioned previously, the bucket stats (via the REST interface or by asking the Data service directly) will be accurate.

The most accurate answer would be to go directly to the bucket info, with something like:
curl http://hostname:8091/pools/default/buckets/beer-sample/ -u user:password | jq '.basicStats | {itemCount: .itemCount }'
The result is immediate; no indexing is needed:
{
"itemCount": 7303
}
Or, to get just the raw number rather than a JSON object:
curl http://centos:8091/pools/default/buckets/beer-sample/ -u roi:password | jq '.basicStats.itemCount'

Alright, here I am to answer my own question over a year later :). We did a lot of experimentation today when trying to migrate items out of a bucket containing roughly 2.6 million items into an SQL database. We wanted to make sure the row count matched between Couchbase and the new database before going live.
Unfortunately, when we tried the normal select count(*) from <bucket>;, the document count we received was over what we expected by just 1, so we broke down the query and did a count over all documents in the bucket while grouping by an attribute, hoping to find which kind of document was missing in the target DB. The totals for the counts of each group should have added up to the same total that we got from the count query. Unfortunately, they did not: they added up to 1 fewer than we expected (so that's off by two from the original count query).
We found the category of document that was off by 1, expecting to have an extra doc in Couchbase that didn't make it to the target DB, but found instead that the totals indicated the reverse, that the target DB had one extra doc. This all seemed very fishy, so we did a query to pull all of the IDs in that group out into a single JSON file, and we counted them. Alas, the actual count of documents in that group matched up with the target DB, meaning that Couchbase's counting was incorrect in both cases.
I'm not sure what implementation details caused this to happen, but it seems like at least the over-counting might have been a caching issue. I was able to finally get a correct document count by using a query like this:
select count(*) from <bucket> where meta(<bucket>).id;
This query ran for much longer than the original count did, indicating that whatever cache is used for counts was being skipped, and it did come up with the correct number.
We were doing these tests on a relatively small number of documents, half a million or so. With the full volume of the bucket, counts had been off by as much as 15 in the past, apparently becoming less accurate as the document count increased.
We just did a re-sync of the full bucket. The bucket total as reported by the dashboard and by the original N1QL query is over the expected count by 7. We ran the modified query, waited for the result, and got the expected count.
In case you're wondering, we did turn off traffic to the bucket, so document counts were not likely to be fluctuating during this process, except when a document reached its expiry date in Couchbase, and was automatically deleted.

To get an accurate count, you can run a N1QL query. That will get you as accurate a number as Couchbase is capable of producing.
SELECT COUNT(*) FROM bucketName
Use REQUEST_PLUS consistency to make sure the indexes have received the very latest updates.
https://developer.couchbase.com/documentation/server/current/indexes/performance-consistency.html
You'll need a query node for this, though.
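If you are running the query outside an SDK, a minimal sketch of the same idea against the query service's REST endpoint might look like the following (Python with requests; the hostname, credentials, and bucket name are placeholders, and the query node is assumed to be listening on the default port 8093):
import requests

resp = requests.post(
    "http://localhost:8093/query/service",
    auth=("Administrator", "password"),
    data={
        "statement": "SELECT COUNT(*) AS cnt FROM `beer-sample`",
        # request_plus makes the query wait until the index has caught up
        # with all mutations received before the request.
        "scan_consistency": "request_plus",
    },
)
resp.raise_for_status()
print(resp.json()["results"][0]["cnt"])
This is just the REST equivalent of setting REQUEST_PLUS in an SDK; adjust the names to your environment.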

Related

Analyzing multiple JSON files with Tableau

I'm beginning to use Tableau and I have a project involving multiple website logs stored as JSON. I have one log for each day for about a month, each weighing about 500-600 MB.
Is it possible to open (and join) multiple JSON files in Tableau? If yes, how? I can load them in parallel, but not join them.
EDIT: I can load multiple JSON files and define their relationships, so this is OK. I still have the memory issue:
I am worried that by joining them all, I will not have enough memory to make it work. Are the loaded files stored in RAM or in an internal DB?
What would be the best way to do this ? Should I merge all the JSON first, or load them in a database and use a connector to Tableau? If so, what could be a good choice of DB?
I'm aware some of these questions are opinion-based, but I have no clue about this and I really need some guideline to get started.
For this volume of data, you probably want to preprocess, filter, aggregate and index it ahead of time - either using a database, something like Parquet and Spark and/or Tableau extracts.
If you use extracts, you probably want to filter and aggregate them for specific purposes. Just be aware that if you aggregate the data when you make the extract, you need to be careful that any further aggregations you perform in the visualization are well defined. Additive functions like SUM(), MIN() and MAX() are safe. Sums of partial sums are still correct sums. But averages of averages and count distincts of count distincts often are not.
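As a rough sketch of that kind of up-front aggregation (Python with pandas; the file pattern, the "timestamp" and "page" fields, and the daily hit count are all assumptions, not something taken from your logs):
import glob
import pandas as pd

# Read a month of newline-delimited JSON logs (assumed format) into one frame.
frames = [pd.read_json(path, lines=True) for path in glob.glob("logs/2017-01-*.json")]
logs = pd.concat(frames, ignore_index=True)

# Aggregate ahead of time: one row per day and page instead of one row per hit.
logs["day"] = pd.to_datetime(logs["timestamp"]).dt.date
daily = logs.groupby(["day", "page"]).size().reset_index(name="hits")

# Point Tableau (or an extract) at this small summary instead of the raw logs.
daily.to_csv("daily_summary.csv", index=False)
The same idea applies if you go the database or Parquet/Spark route: reduce the raw hits to the grain you actually visualize before Tableau ever sees them.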
Tableau sends a query to the database and then renders a visualization based on the query result set. The volume of data returned depends on the query which depends on what you specify in Tableau. Tableau caches results, and you can also create an extract which serves as a persistent, potentially filtered and aggregated, cache. See this related stack overflow answer
For text files and extracts, Tableau loads them into memory via its Data Engine process today -- to be replaced by a new in-memory database called Hyper in the future. The concept is the same though: Tableau sends the data source a query which returns a result set. For data of the size you are talking about, you might want to test using some sort of database if the volume exceeds what comfortably fits in memory.
The JSON driver is very convenient for exploring JSON data, and I would definitely start there. You can avoid an entire ETL step if that serves your needs. But at high volume of data, you might need to move to some sort of external data source to handle production loads. FYI, the UNION feature with Tableau's JSON driver is not (yet) available as of version 10.1.
I think the answer which nobody gave is that no, you cannot join two JSON files in Tableau. Please correct me if I'm wrong.
I believe we can join 2 JSON tables in Tableau.
First, extract the columns from the JSON data as below:
select
get_json_object(JSON_column, '$.Attribute1') as Attribute1,
get_json_object(JSON_column, '$.Attribute2') as Attribute2
from table_name;
Perform the above for the required tables and join them in Tableau.

Effectively fetching a large number of tuples from Solr

I am stuck in a rather tricky problem. I am implementing a feature on my website wherein a person gets all the results matching a particular criterion. The matching criterion can be anything. However, for the sake of simplicity, let's call the matching criterion "age". This means the feature will return all the student names from the database (which has hundreds of thousands of them), with the student whose age matches the supplied parameter most closely on top.
My approaches:
1- I have a Solr server. Since I need to implement this in a paginated way, I would need to query Solr several times (since my Solr page size is 10) to find the "near-absolute" matching student in real time. This is computationally very intensive. The problem boils down to effectively fetching this large number of tuples from Solr.
2- I tried processing it in a batch (and by increasing the Solr page size to 100). The data received is not guaranteed to be real-time when somebody uses my feature. Also, to make it optimal, I would need learning algorithms to figure out which users are "most likely" to use my feature today, and then batch process them by priority. Please remember that the number of users is so high that I cannot run this batch for all the users every day.
On the one hand, to show results in real time I have to compromise on performance (hitting Solr multiple times, which is somewhat unfeasible), while on the other, my result set won't be real-time if I do batch processing, and I can't run the batch every day for all the users.
Can someone correct my seemingly faulty approaches?
Solr indexing is done on MySQL db contents.
As I understand it, your users are not interested in 100K results. They only want the top-10 (or top-100 or a similar low number) results, where the person's age is closest to a number you supply.
This sounds like a case for Solr function queries: https://cwiki.apache.org/confluence/display/solr/Function+Queries. For the age example, that would be something like sort=abs(sub(37, age)) asc, score desc, which would return the persons with age closest to 37 first and prioritize by score in case of ties.
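A hedged sketch of such a request in Python (the collection name "students", the "age" field, and the field list are assumptions):
import requests

params = {
    "q": "*:*",
    # Smallest absolute difference from the target age first; ties broken by score.
    "sort": "abs(sub(37,age)) asc, score desc",
    "rows": 10,
    "fl": "id,name,age",
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/students/select", params=params)
for doc in resp.json()["response"]["docs"]:
    print(doc)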
I think what you need is Solr cursors, which will enable you to paginate effectively through large result sets: see Solr cursors or deep paging.
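A minimal cursorMark sketch in Python (same assumed collection and fields as above; cursor paging requires the uniqueKey, here assumed to be "id", as the final sort tiebreaker):
import requests

cursor = "*"
while True:
    params = {
        "q": "*:*",
        "sort": "abs(sub(37,age)) asc, id asc",
        "rows": 100,
        "cursorMark": cursor,
        "wt": "json",
    }
    data = requests.get("http://localhost:8983/solr/students/select", params=params).json()
    for doc in data["response"]["docs"]:
        print(doc)                    # replace with real per-document handling
    next_cursor = data["nextCursorMark"]
    if next_cursor == cursor:         # the cursor stops advancing once results are exhausted
        break
    cursor = next_cursor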

MySQL and LabVIEW

I have a table with 27 columns and 300,000 rows of data, out of which 8 columns are filled with 0, 1, or null. Using LabVIEW I get the total count of each of these columns using the following query:
select
d_1_result,
d_2_value_1_result,
de_2_value_2_result,
d_3_result,
d_4_value_1_result,
d_4_value_2_result,
d_5_result
from Table_name_vp
where ( insp_time between
"15-02-02 06:00:00" and "15-02-02 23:59:59" or
inspection_time between "15-02-03 00:00:00" and "15-02-03 06:00:00")
and partname = "AbvQuene";
This query runs for the number of days the user inputs, for example 120 days.
I found that the total time taken by the query is 8 seconds, which is not good.
I want to reduce the time to 8 milliseconds.
I have also changed the engine to MyISAM.
Any suggestions to reduce the time consumed by the query? (LabVIEW processing is not taking time.)
It depends on the data, and how many rows out of the 300,000 are actually selected by your WHERE clause. Obviously if all 300,000 are included, the whole table will need to be read. If it's a smaller number of rows, an index on insp_time or inspection_time (is this just a typo, are these actually the same field?) and/or partname might help. The exact index will depend on your data.
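For illustration, a hedged sketch of adding such an index and checking the plan (Python with mysql-connector-python; the connection details and index name are placeholders, and it assumes partname plus insp_time are what the WHERE clause actually filters on):
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="user",
                               password="password", database="mydb")
cur = conn.cursor()

# Composite index matching the WHERE clause: equality column first, then the range column.
cur.execute("ALTER TABLE Table_name_vp ADD INDEX idx_part_time (partname, insp_time)")

# EXPLAIN should now show the index being used for the partname + date-range filter.
cur.execute("EXPLAIN SELECT d_1_result FROM Table_name_vp "
            "WHERE partname = 'AbvQuene' "
            "AND insp_time BETWEEN '2015-02-02 06:00:00' AND '2015-02-02 23:59:59'")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()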
Update 2:
I can't see any reason why you wouldn't be able to load your whole DB into memory, because it should be less than 60MB. Do you agree with this?
Please post your answers to the following questions (you can edit a question after you have asked it - that's easier than commenting).
Next steps:
I should have mentioned this before: before you run a query in LabVIEW, I would always test it first using your DB admin tool (e.g. MySQL Workbench). Please post whether that worked or not.
Post your LabVIEW code.
You can try running your query with fewer than 300K rows - say 50K - and see how much your memory increases. If there's some limitation on how many rows you can query at one time, then you can break your giant query into smaller ones pretty easily and just add up the result sets. I can post an example if needed.
Update:
It sounds like there's something wrong with your schema.
For example, if you had 27 columns of doubles and datetimes (both are 8 bytes each), your total DB size would only be about 60MB (300K * 27 * 8 / 1048576).
Please post your schema for further help (you can use SHOW CREATE TABLE tablename).
8 milliseconds is an extremely low time - I assume that's being driven by some type of hardware timing issue? If not, please explain that requirement, as a typical user requirement is around 1 second.
To get the response time that low you will need to do the following:
Query the DB at the start of your app and load all 300,000 rows into memory (e.g. a LabVIEW array)
Update the array with new values (e.g. array append)
Run the "query" against he array (e.g. using a for loop with a case select)
On a separate thread (i.e. LabVIEW "loop") insert the new records into to the database or do it write before the app closes
This approach assumes that only one instance of the app is running at a time because synchronizing database changes across multiple instances will be very hard with that timing requirement.
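The steps above are LabVIEW-specific, but as a rough, language-neutral illustration of the same pattern (load once, then answer each request from memory), here is a sketch in Python with pandas; the connection string is a placeholder and the column names follow the question (using insp_time):
import pandas as pd
import sqlalchemy

# 1. Load the whole table into memory once, at application start.
engine = sqlalchemy.create_engine("mysql+mysqlconnector://user:password@localhost/mydb")
table = pd.read_sql("SELECT * FROM Table_name_vp", engine)

# 2. Answer each request from the in-memory copy instead of hitting MySQL.
def count_results(start, end, part):
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    rows = table[(table["partname"] == part) & (table["insp_time"].between(start, end))]
    return rows[["d_1_result", "d_3_result", "d_5_result"]].sum()

print(count_results("2015-02-02 06:00:00", "2015-02-02 23:59:59", "AbvQuene"))

# 3. New records are appended to the in-memory table immediately and written back
#    to MySQL from a separate thread, or just before the app closes.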

How to make a paged Select query and get aggregated results from many shards

In a sharded environment, data will be split across various machines/shards. I want to know how I can create a query that returns paged results (e.g. 2nd page, 10 results, or 10th page, 20 results)?
I know that it has to do with the primary key. With a single RDBMS it's easy because you have an auto-increment column, so it's easy to get the last 10 items and return paged data.
I work for ScaleBase, which makes a complete scale-out solution, an "automatic sharding machine" if you like: it analyzes the data and SQL stream, splits the data across DB nodes, load-balances reads, and aggregates results at runtime – so you won't have to!
You can see my answer to this thread about auto increment: Sharding and ID generation as instagram
Also, take a look on my post in http://database-scalability.blogspot.com/ about Pinterest, then and now...
Specifically, merging results from several shards into one result is HELL. Many edge cases: GROUP BY, ORDER BY, JOINs, LIMIT, HAVING. I must say that in ScaleBase we support most combinations; it took us ages. True, we need to do it generically, while you can "bend" to something proprietary... but still...
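For what it's worth, the generic scatter-gather approach (not ScaleBase's implementation) looks roughly like this: ask every shard for everything up to the end of the requested page, in the same sort order, then merge the sorted shard results and keep only the slice you need. A sketch in Python, where the shard connections, the fetch function, and the row's "sort_key" field are all assumptions:
import heapq

def paged_query(shards, page, size, fetch):
    # page is zero-based: page=1, size=10 -> overall rows 10..19.
    # fetch(shard, limit) is assumed to return up to `limit` rows from one shard,
    # already sorted ascending by row["sort_key"]
    # (e.g. "SELECT ... ORDER BY created_at, id LIMIT %s").
    limit = (page + 1) * size                      # every shard must over-fetch to the page end
    per_shard = [fetch(shard, limit) for shard in shards]

    # Merge the already-sorted shard results, then slice out the requested page.
    merged = heapq.merge(*per_shard, key=lambda row: row["sort_key"])
    return list(merged)[page * size:(page + 1) * size]
The over-fetching is also why deep pages get expensive in a sharded setup, which is part of what makes the generic merging so painful.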

Would using Redis with Rails provide any performance benefit for this specific kind of query

I don't know if this is the right place to ask question like this, but here it goes:
I have an intranet-like Rails 3 application managing about 20k users which are in nested-set (preordered tree - http://en.wikipedia.org/wiki/Nested_set_model).
Those users enter stats (data, just plain numeric values). Entered stats are assigned to a category (we call it a Pointer) and a week number.
Those data are further processed and computed to Results.
Some are computed from users activity + result from some other category... etc.
What the user enters isn't always the same as what he sees in reports.
Those computations can be very tricky, some categories have very specific formulae.
But the rest is just "give me sum of all entered values for this category for this user for this week/month/year".
The problem is that those stats also need to be summed for the subset of users under a selected user (so it will basically return the sum of all values for all users under that user, including the user himself).
This app has been in production for 2 years and it is doing its job pretty well... but with more and more users it's also pretty slow when it comes to server-expensive reports, like "give me a list of all users under myself and their statistics: one line summed by their sub-group and one line for their personal stats". Of course, users want (and need) their reports to be as current as possible; 5 minutes to reflect newly entered data is too much for them. And this specific report is their favorite :/
To stay realtime, we cannot run the heavy SQL queries directly... that would kill the server. So I'm computing them only once via a background process and the frontend just reads the results.
Those SQL queries are hard to optimize and I'm glad I've moved away from this approach... (Caching is not an option; see below.)
The current app goes like this:
Frontend: when a user enters new data, it is saved to a simple MySQL table, like [user_id, pointer_id, date, value], and there is also an insert into the queue.
Backend: there is a calc_daemon process which checks the queue every 5 seconds for new "recompute requests". We pop the requests and determine what else needs to be recomputed along with them (pointers have dependencies... the simplest case is: when you change week stats, we must recompute the month and year stats...). It does this recomputation the easy way: we select the data with customized, per-pointer SQL queries generated by their classes.
The computed results are then written back to MySQL, but to partitioned tables (one table per year). One line in such a table looks like [user_id, pointer_id, month_value, w1_value, w2_value, w3_value, w4_value]. This way, the tables have ~500k records (I've basically reduced the number of records 5x).
When the frontend needs those results, it does simple sums on the partitioned data, with 2 joins (because of the nested-set conditions).
The problem is that those simple SQL queries with sums, group by, and join-on-the-subtree can take about 200ms each... just for a few records... and we need to run a lot of them... I think they are optimized as well as they can be, according to EXPLAIN... but they are just too hard for it.
So... The QUESTION:
Can I rewrite this to use Redis (or another fast key-value store) and see any benefit from it when I'm using Ruby and Rails? As I see it, if I rewrite it to use Redis, I'll have to run many more queries against it than I do against MySQL, and then perform the sums in Ruby manually... so the performance could be hurt considerably... I'm not really sure if I could express all the queries I have now with Redis... Loading the users in Rails and then doing something like "Redis, give me the sum for users 1,2,3,4,5..." doesn't seem like the right idea... But maybe there is some feature in Redis that could make this simpler?
Also, the tree structure needs to stay a nested set, i.e. it cannot have one entry in Redis with a list of all child IDs for some user (something like children_for_user_10: [1,2,3]), because the tree structure changes frequently... That's also the reason why I can't have those sums in the partitioned tables: when the tree changes, I would have to recompute everything. That's why I perform those sums in realtime.
Or would you suggest rewriting this app in a different language (Java?) and computing the results in memory instead? :) (I've tried doing it the SOA way, but it failed because I end up, one way or another, with XXX megabytes of data in Ruby... especially when generating the reports... and the GC just kills it...) (A side effect is that generating one report blocks the whole Rails app :/ )
Suggestions are welcome.
Redis would be faster since it is an in-memory database, but can you fit all of that data in memory? Iterating over Redis keys is not recommended, as noted in the comments, so I wouldn't use it to store the raw data. However, Redis is often used for storing the results of sums (e.g. logging counts of events); for example, it has a fast INCR command.
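A small sketch of that counter pattern with redis-py (the key naming and the user/pointer/week scheme are assumptions, not your schema):
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def record_stat(user_id, pointer_id, week, value):
    # Keep a running sum per user/pointer/week; INCRBY is atomic and O(1).
    r.incrby("sum:%d:%d:%d" % (user_id, pointer_id, week), value)

def subtree_sum(user_ids, pointer_id, week):
    # One MGET for the whole subtree instead of a join over the nested set.
    keys = ["sum:%d:%d:%d" % (uid, pointer_id, week) for uid in user_ids]
    return sum(int(v) for v in r.mget(keys) if v is not None)

record_stat(10, 3, 42, 17)
print(subtree_sum([10, 11, 12], 3, 42))
You would still need to resolve the subtree's user IDs from your nested set (in MySQL or Rails) before asking Redis for the sums.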
I'm guessing that you would get a sufficient speed improvement by using a stored procedure or a faster language than Ruby (e.g. inline C or Go) to do the recalculation. Are you doing GROUP BY in the recalculation? Is it possible to change the GROUP BYs to code that orders the result set and then manually checks when the "group" changes? For example, if you are looping by user and grouping by week inside the loop, change that to ordering by user and week, and keep variables for the current and previous values of user and week, as well as variables for the sums.
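To illustrate the ordered-scan idea in a neutral way, here is a sketch in Python over rows assumed to be (user, week, value) tuples already sorted by user and week:
def grouped_sums(rows):
    # rows must be sorted by (user, week); yields (user, week, sum_of_values).
    current_key, running = None, 0
    for user, week, value in rows:
        key = (user, week)
        if key != current_key:
            if current_key is not None:
                yield current_key + (running,)
            current_key, running = key, 0
        running += value
    if current_key is not None:
        yield current_key + (running,)

rows = [(1, 5, 10), (1, 5, 4), (1, 6, 7), (2, 5, 3)]   # hypothetical, pre-sorted data
for user, week, total in grouped_sums(rows):
    print(user, week, total)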
This is assuming the bottleneck is the recalculation; you don't really mention which part is too slow.