Is there a Socrata API method to get the row count of a dataset? - socrata

Is there any fast way to get the number of rows in a dataset?
The best idea I can come up with is binary probing using $limit and $offset, or perhaps a hybrid: binary probing until the size is pinned down to within, say, 100 rows, followed by a final fetch of a single column over that $limit/$offset window.
(I checked the HTTP headers... no joy.)
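For reference, a minimal Python sketch of the probing approach described above (a hypothetical helper; it assumes the endpoint returns an empty JSON array once $offset is past the last row, and the dataset URL is just an example):

import requests

BASE = "https://data.seattle.gov/resource/y6ef-jf2w.json"  # example dataset

def row_exists(offset):
    # Probe for a single row at the given offset; an empty array means
    # the offset is at or past the end of the dataset.
    r = requests.get(BASE, params={"$limit": 1, "$offset": offset})
    r.raise_for_status()
    return len(r.json()) > 0

def probe_row_count():
    # Grow an upper bound exponentially, then binary-search for the first
    # offset that has no row; that offset equals the row count.
    hi = 1
    while row_exists(hi):
        hi *= 2
    lo = hi // 2
    while lo < hi:
        mid = (lo + hi) // 2
        if row_exists(mid):
            lo = mid + 1
        else:
            hi = mid
    return lo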

One way you can achieve this is by doing a COUNT(*) operation on the dataset. For example, to get the total row count of this Socrata dataset:
https://data.seattle.gov/City-Business/Sold-Fleet-Equipment/y6ef-jf2w
You could issue this SODA query:
https://data.seattle.gov/resource/y6ef-jf2w.json?$select=count(*)
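In Python, the same query might look like this (a sketch; the count field in the response is read generically, since its exact name can vary across SODA API versions):

import requests

def socrata_row_count(domain, dataset_id):
    # Ask the SODA endpoint for count(*); the response is a one-element
    # list such as [{"count": "7303"}].
    url = f"https://{domain}/resource/{dataset_id}.json"
    r = requests.get(url, params={"$select": "count(*)"})
    r.raise_for_status()
    row = r.json()[0]
    return int(next(iter(row.values())))  # don't rely on the exact field name

print(socrata_row_count("data.seattle.gov", "y6ef-jf2w"))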


Get an accurate count of items in a bucket

The couchbase admin console (I'm using version 5.0, community) shows a count of items in each bucket. I'm wondering if that count is just a rough estimate and not an exact count of the number of items in the bucket. Here's the behavior I'm seeing that leads me to this reasoning:
When I use XDCR to replicate a bucket to a backup node, the count in the backup bucket after the XDCR has finished will be significantly higher than the count of documents in the source bucket, sometimes by tens of thousands (in a bucket that contains hundreds of millions of documents).
When I use the Java DCP client to clone a bucket to a table in a different database, the other database shows numbers of records that are close, but off by possibly even a few million (again, in a bucket with hundreds of millions of documents).
How can I get an accurate count of the exact number of items in a bucket, so that I can be sure, after my DCP or XDCR processes have completed, that all documents have made it to the new location?
There can be a number of different reasons why the count could be different; without more details it would be hard to say. The common cases are:
The couchbase admin console (I'm using version 5.0, community) shows a count of items in each bucket.
The Admin console is accurate, but it does not auto-update, so a refresh is required.
When I use the Java DCP client to clone a bucket to a table in a different database, the other database shows numbers of records that are close, but off by possibly even a few million (again, in a bucket with hundreds of millions of documents).
DCP will include tombstones (deleted documents) and possibly multiple mutations for the same document, which could explain why the DCP count is off.
With regards to using N1QL: if the query is a simple SELECT COUNT(*) FROM bucketName then, depending on the Couchbase Server version, it will use the bucket stats directly.
In other words, as mentioned previously, the bucket stats obtained via the REST interface or by asking the Data service directly will be accurate.
The most accurate answer would be to go directly to the bucket info, with something like:
curl http://hostname:8091/pools/default/buckets/beer-sample/ -u user:password | jq '.basicStats | {itemCount: .itemCount }'
the result would be immediate, no need for indexing:
{
"itemCount": 7303
}
Or, without the JSON wrapping:
curl http://hostname:8091/pools/default/buckets/beer-sample/ -u user:password | jq '.basicStats.itemCount'
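The same REST call from Python, as a sketch (host, bucket, and credentials are placeholders):

import requests

resp = requests.get(
    "http://hostname:8091/pools/default/buckets/beer-sample/",
    auth=("user", "password"),  # cluster credentials
)
resp.raise_for_status()
print(resp.json()["basicStats"]["itemCount"])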
Alright, here I am to answer my own question over a year later :). We did a lot of experimentation today when trying to migrate items out of a bucket containing roughly 2.6 million items into an SQL database. We wanted to make sure the row count matched between Couchbase and the new database before going live.
Unfortunately, when we tried the normal select count(*) from <bucket>;, the document count we received was over what we expected by just 1, so we broke the query down and counted all documents in the bucket grouped by an attribute, hoping to find which kind of document was missing from the target DB. The counts for each group should have added up to the same total we got from the count query. Unfortunately, they did not: they added up to 1 fewer than we expected (so that's off by two from the original count query).
We found the category of document that was off by 1, expecting to find an extra doc in Couchbase that hadn't made it to the target DB, but the totals indicated the reverse: the target DB had one extra doc. This all seemed very fishy, so we ran a query to pull all of the IDs in that group out into a single JSON file and counted them. Alas, the actual count of documents in that group matched the target DB, meaning that Couchbase's counting was incorrect in both cases.
I'm not sure what implementation details caused this to happen, but it seems like at least the over-counting might have been a caching issue. I was able to finally get a correct document count by using a query like this:
select count(*) from <bucket> where meta(<bucket>).id is not missing;
This query ran for much longer than the original count did, indicating that whatever cache is used for counts was being skipped, and it did come up with the correct number.
We were doing these tests on a relatively small number of documents, half a million or so. With the full volume of the bucket, counts had been off by as much as 15 in the past, apparently becoming less accurate as the document count increased.
We just did a re-sync of the full bucket. The bucket total, as reported by both the dashboard and the original N1QL query, is over the expected count by 7. We ran the modified query, waited for the result, and got the expected count.
In case you're wondering, we did turn off traffic to the bucket, so document counts were not likely to be fluctuating during this process, except when a document reached its expiry date in Couchbase and was automatically deleted.
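For the ID-level comparison described above, a sketch like this (hypothetical host, bucket name, and group filter) pulls the document IDs for one group out of the query service so they can be counted client-side:

import requests

resp = requests.post(
    "http://hostname:8093/query/service",
    auth=("user", "password"),
    json={
        # SELECT RAW returns bare ID strings rather than wrapped objects;
        # the WHERE clause stands in for whatever group attribute you use.
        "statement": "SELECT RAW meta(b).id FROM `bucket-name` b "
                     "WHERE b.type = 'some-group'",
    },
)
resp.raise_for_status()
ids = resp.json()["results"]
print(len(ids), len(set(ids)))  # total vs. distinct IDs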
To get an accurate count, you can run an N1QL query. That will get you as accurate a number as Couchbase is capable of producing.
SELECT COUNT(*) FROM bucketName
Use REQUEST_PLUS consistency to make sure the indexes have received the very latest updates.
https://developer.couchbase.com/documentation/server/current/indexes/performance-consistency.html
You'll need a query node for this, though.
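As a sketch, the same query issued directly against the query service REST API with request_plus consistency (host and credentials are placeholders):

import requests

resp = requests.post(
    "http://hostname:8093/query/service",
    auth=("user", "password"),
    json={
        "statement": "SELECT COUNT(*) AS cnt FROM `bucketName`",
        # Wait for the index to catch up with all mutations made
        # before the query was issued.
        "scan_consistency": "request_plus",
    },
)
resp.raise_for_status()
print(resp.json()["results"][0]["cnt"])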

Web Development - SQL: creating a paginated recent games list

I'm creating a web application where one of the features will be a list of the recent games played. I want the 25 most recent games to appear on the first page, with a page selection at the bottom. (Games data will be fetched from MySQL.)
I understand this is a concept many sites already have, but after extensive googling I'm uncertain how to do it efficiently. The only thing I can currently think of is querying every one of their games and then splitting the result into pages, but wouldn't that become extremely inefficient once it gets into the thousands?
Any help, or link to outside sources that explain this topic would be greatly appreciated, thank you!
For pagination, use LIMIT:
The LIMIT clause can be used to constrain the number of rows returned by the SELECT statement.
... the first argument specifies the [zero origin] offset of the first row to return, and the second specifies the maximum number of rows to return. ...
SELECT * FROM tbl LIMIT 5,10; # Retrieve rows 6-15
So for the first page it's enough to request the games played, ordered by time_played descending, with LIMIT 25. Just be sure to put an index on the time_played column.
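A minimal sketch of that pattern in Python (table and column names are assumptions): page 1 maps to offset 0, page 2 to offset 25, and so on.

import mysql.connector

PAGE_SIZE = 25

def recent_games(conn, page):
    # Fetch one page of the most recent games, newest first.
    offset = (page - 1) * PAGE_SIZE
    cur = conn.cursor(dictionary=True)
    cur.execute(
        "SELECT * FROM games ORDER BY time_played DESC LIMIT %s OFFSET %s",
        (PAGE_SIZE, offset),
    )
    return cur.fetchall()

conn = mysql.connector.connect(
    host="localhost", user="user", password="password", database="mydb"
)
print(recent_games(conn, page=1))

Note that large OFFSET values force MySQL to scan and discard all of the skipped rows, so for very deep pages a keyset approach (WHERE time_played < last_seen_value) scales better.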

Effectively fetching large number of tuples from Solr

I am stuck on a rather tricky problem. I am implementing a feature on my website whereby a person gets all the results matching a particular criterion. The matching criterion can be anything; however, for the sake of simplicity, let's call it 'age'. The feature will return all the student names from the database (which holds hundreds of thousands), with the student whose age matches the supplied parameter 'most' closely on top.
My approaches:
1- I have a Solr server. Since I need to implement this in a paginated way, I would need to query Solr several times (my Solr page size is 10) to find the 'near-absolute' matching student in real time. This is computationally very intensive, and the problem boils down to effectively fetching a large number of tuples from Solr.
2- I tried processing it in a batch (and by increasing the Solr page size to 100). The data received is not guaranteed to be real-time when somebody uses the feature. Also, to make it optimal, I would need learning algorithms to work out which users are 'most likely' to use the feature today, and then batch-process those users on priority. Remember that the number of users is so high that I cannot run this batch for all of them every day.
On the one hand, if I want to show results in real time, I have to compromise on performance (hitting Solr multiple times, which is somewhat infeasible); on the other, my result set wouldn't be real-time if I do batch processing, and I can't run it every day for all users.
Can someone correct my seemingly faulty approaches?
Solr indexing is done on MySQL db contents.
As I understand it, your users are not interested in 100K results. They only want the top-10 (or top-100 or a similar low number) results, where the person's age is closest to a number you supply.
This sounds like a case for Solr function queries: https://cwiki.apache.org/confluence/display/solr/Function+Queries. For the age example, that would be something like sort=abs(sub(37, age)) asc, score desc, which returns the persons with age closest to 37 first and breaks ties by relevance score. (Note the ascending sort: the smaller the difference from 37, the earlier the result.)
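A sketch of that sort issued over HTTP from Python (core name and field are assumptions):

import requests

params = {
    "q": "*:*",
    "sort": "abs(sub(37,age)) asc, score desc",  # closest age first
    "rows": 10,
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/students/select", params=params)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc.get("name"), doc.get("age"))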
I think what you need is Solr cursors, which will enable you to paginate effectively through large result sets; see the Solr documentation on cursors ('deep paging').
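A sketch of cursor-based deep paging (core and field names are assumptions; note that the sort must include the uniqueKey field, here id, as a tiebreaker):

import requests

url = "http://localhost:8983/solr/students/select"
params = {
    "q": "*:*",
    "sort": "abs(sub(37,age)) asc, id asc",
    "rows": 100,
    "wt": "json",
    "cursorMark": "*",  # start a new cursor
}
while True:
    resp = requests.get(url, params=params)
    resp.raise_for_status()
    data = resp.json()
    for doc in data["response"]["docs"]:
        print(doc.get("id"))  # process each document here
    next_cursor = data["nextCursorMark"]
    if next_cursor == params["cursorMark"]:
        break  # cursor did not advance: all results consumed
    params["cursorMark"] = next_cursor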

Partial Data Set in WEBI 4.0

When I run a query in Web Intelligence, I only get a part of the data.
But I want to get all the data.
The resulting data set I am retrieving from the database is quite large (10 million rows). However, I do not want 10 million rows in my report; I want to summarize it so that the report has at most 50 rows.
Why am I getting only a partial data set as a result of WEBI query?
(I also noticed an exclamation mark in the bottom right corner indicating that I am working with a partial data set, and when I click on refresh I still get the partial data set.)
By the way, I know I can see the SQL query when I build it using the query editor, but can I also see the corresponding query when I make a certain report? If so, how?
UPDATE: I have tried editing the 'Limit size of result set to:' setting in the Query Options of the Business Layer, first setting the value to 9,999,999 and then unchecking the option altogether. However, I am still getting the partial result.
UPDATE: I have checked the number of rows in the resulting set: it is 9.6 million. Now it's even more confusing why I'm not getting all the rows (the max number of rows was set to 9,999,999).
SELECT
I_ATA_MV_FinanceTreasury.VWD_Segment_Value_A.Description_TXT,
count(I_ATA_MV_FinanceTreasury.VWD_Party_A.Party_KEY)
FROM
I_ATA_MV_FinanceTreasury.VWD_Segment_Value_A RIGHT OUTER JOIN
I_ATA_MV_FinanceTreasury.VWD_Party_A ON
(I_ATA_MV_FinanceTreasury.VWD_Segment_Value_A.Segment_Value_KEY=I_ATA_MV_FinanceTreasury.VWD_Party_A.Segment_Value_KEY)
GROUP BY 1
The "Limit size of result set" setting is a little misleading. You can choose an amount lower than the associated setting in the universe, but not higher. That is, if the universe is set to a limit of 5,000, you can set your report to a limit lower than 5,000, but you can't increase it.
Does your query include any measures? If not, and your query is set to retrieve duplicate rows, you will get an un-aggregated result.
If you're comfortable reading SQL, take a look at the report's generated SQL, and that might give you a clue as to what's going on. It's possible that there is a measure in the query that does not have an aggregate function (as it should).
While this may be a little off-topic, I personally would advise against loading that much data into a Web Intelligence document, especially if you're going to aggregate it to 50 rows in your report.
These are not the kind of data volumes WebI was designed to handle (regardless of whether it can or not). Ideally, you should push down as much of the aggregation as possible to your database (which is much better equipped to handle such volumes) and return only the data you really need.
Have a look at this link, which contains some best practices. For example, slide 13 specifies that:
50,000 rows per document is a reasonable number
What you need to do is to add a measure to your query and make sure that this measure uses an aggregate database function (e.g. SUM()). This will cause WebI to create a SQL statement with GROUP BY.
Another alternative is to disable the option Retrieve duplicate rows. You can set this option by opening the data provider's properties.

Do ZeosLib DataSets need to perform the FetchAll method to return the real total row count?

In Firebird/InterBase databases we have TIBQuery, TIBTable and TIBDataSet, which have the FetchAll method to count how many rows a data set has. If we don't call that method, these data sets only register as the "total" the number of rows that the user has already seen through a TDBGrid or TDBNavigator. This "total" can be retrieved by calling the RecordCount method of these data sets.
Another (much more efficient) way to get the real total is to use a separate data set and perform a SELECT COUNT(*) FROM TABLE_NAME, applying any filters we like. That works fine.
But now that I am working with MySQL through ZeosLib, I was wondering whether I need to go to the trouble of keeping a second query in memory.
We know that ZeosLib executes its queries and might internally receive statistics for each query, including the number of rows returned.
Does ZeosLib put that information in RecordCount, or does it work exactly like the InterBase components?
Zeos returns the number of already-fetched records. It does not take into account any applied filters, and it does not do a FetchAll before returning RecordCount.
SELECT COUNT(*) ... is not "much more efficient", because it creates additional server workload, which may sometimes equal the workload of executing the original query.
In general, a data access library may offer three modes of record count calculation: the number of fetched rows; the number of visible rows (like the first, but after applying filters); and SELECT COUNT(*). Whether to FetchAll or not is better controlled explicitly. This is how it is done in AnyDAC.