Open Payments - How to reconcile or explain differences between API data versus live site visual data - socrata

We are connected to the API for https://openpaymentsdata.cms.gov/. In some cases the data points differ (for example, the quantity of general payments) between the raw API data and what is shown on the live site.
What is the source for these data differences?
Which is the correct data: the API or the live site? The best-case scenario is that the API is correct and the live site simply hasn't been updated yet.
We'd like to reconcile this so we can pull the correct data, but if it can't be reconciled through the API, how can the difference best be explained? It looks almost as if a dispute was initiated on the transaction(s), which changed one of the data points. If a dispute is initiated by the profile owner, does the API data change to reflect any updates?
This is my first attempt to reach out for an explanation of how these data differences occur.
Here's a random example. The profile URL: https://openpaymentsdata.cms.gov/physician/209169/
The quantity of general payments shown on the live site is: 139
The quantity of general payments pulled from the API is: 151

I see the issue here. Based on the API call https://openpaymentsdata.cms.gov/resource/bqf5-h6wd.json?physician_profile_id=209169&$$app_token=oXbsFwj7KElCMesuRAZEfTDfB&$select=total_amount_of_payment_usdollars,number_of_payments_included_in_total_amount&$limit=50000&$offset=0 you get a field called "number_of_payments_included_in_total_amount". Most of these values are "1", but the 136th entry is "13". You are summing across that field, which gives 151.
However, the OpenPayments website appears to only be counting entries/rows. There are 139 entries/rows in the results.
Unfortunately, I don't know which one is accurate. It could be a misconfiguration of the OpenPayments website, or it could be a misinterpretation of what number_of_payments_included_in_total_amount means. But at least it explains the difference in how the two counts are produced.
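If you want to reproduce both counts yourself, here is a minimal sketch (anonymous Socrata reads work, but pass your own $$app_token to avoid throttling):

import requests

URL = "https://openpaymentsdata.cms.gov/resource/bqf5-h6wd.json"
params = {
    "physician_profile_id": "209169",
    "$select": "number_of_payments_included_in_total_amount",
    "$limit": 50000,
}
rows = requests.get(URL, params=params).json()

# What the live site appears to show: one count per entry/row.
print("rows:", len(rows))  # 139 for this profile

# What summing the field gives: 138 rows of "1" plus one row of "13".
# .get(..., 0) guards against Socrata omitting null fields from a row.
print("sum :", sum(int(r.get("number_of_payments_included_in_total_amount", 0))
                   for r in rows))  # 151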

Related

Why can I only get 250 rows (shoe transactions) of StockX data when accessing their API?

I have tried to gather data directly from the API of StockX, which seemed possible according to an article from Jan 2019: https://medium.com/@thewillmundy/stockx-sneaker-data-in-three-simple-steps-8977d0016b80 . I am thereby able to get a request URL which gives me some transactions in JSON format.
I have tried changing the parameters within the request URL (limit as well as page), which is possible, but only for the latest 250 transactions (due to the high volume of sales for some shoes, that means I can only retrieve the sales history for the last few days)...
My goal: getting the whole sales history (often several thousand transactions); according to the article mentioned above, that's possible.
Could it be a restriction from StockX?
Or is there a way around it?
I would be so grateful for help!
Best regards, Marvin
I think the API will only give you the 250 most recent sales because that's all the product webpage itself will allow you to load when you click view all sales. Any sales further back in time aren't directly accessible from the product page, and we're essentially requesting the same data that page can request using the link it would use. I guess those are stored and accessed in a different way internally.
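For what it's worth, here's a minimal pagination sketch (the base URL is a stand-in for the request URL from the article; limit and page are the parameters the question mentions) showing how the cap manifests:

import requests

# Hypothetical base URL standing in for the actual request URL.
BASE = "https://stockx.example/api/products/some-shoe/activity"

all_sales, page = [], 1
while True:
    batch = requests.get(BASE, params={"limit": 100, "page": page}).json()
    if not batch:
        break
    all_sales.extend(batch)
    page += 1

# In practice this stops growing at ~250 entries: pages past the most
# recent 250 sales come back empty, matching the product page's own cap.
print(len(all_sales))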
I'm guessing StockX changed its API since that article is a little old. I would try to contact StockX about their API via email, but I don't think they're really continuing developer support:
https://twitter.com/stockx/status/1000004306844647424?lang=en
It's pretty disappointing because I was also looking to work with the sales data but what can you do :/

GoogleBetterAds - violatingSites.list - google-apis-explorer

I can get a list of summaries of violating sites, using the following link:
https://developers.google.com/ad-experience-report/[...]/violatingSites/list
My questions:
Is this list exhaustive?
If not, is it possible to get an exhaustive list, and how?
Is it possible to know how these websites are pulled (the share of websites analysed, etc)?
- Is this list exhaustive?
What's the size of your actual API response?
If the API response keeps getting longer, with new data added on each new request, you can assume you have the exhaustive list (with a possible update latency).
If the API response always has the same size but different contents, i.e. old data disappears and is replaced by new data, it is not exhaustive. One way to test this is to poll the endpoint periodically and compare snapshots, as in the sketch at the end of this answer.
- If not, is it possible to get an exhaustive list, and how?
I have no idea at the moment; the total number of websites could be in the billions...
- Is it possible to know how these websites are pulled (the share of websites analysed, etc)?
I have no idea for the moment either. I think it is either a confidential process, or it is described in the general conditions and, subtly, in the documentation...
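Here is a rough sketch of that snapshot comparison (API_KEY is a placeholder, and the field names are taken from the violatingSites.list documentation; verify them before relying on this):

import requests, time

API_KEY = "YOUR_API_KEY"  # placeholder; use your own Google API key
URL = "https://adexperiencereport.googleapis.com/v1/violatingSites"

def snapshot():
    data = requests.get(URL, params={"key": API_KEY}).json()
    # Field names per the violatingSites.list docs; check before relying on them.
    return {site.get("reviewedSite") for site in data.get("violatingSites", [])}

first = snapshot()
time.sleep(24 * 60 * 60)  # wait a day between snapshots
second = snapshot()

# Growing, add-only list -> possibly exhaustive (with update latency).
# Same size but different members -> a rolling window, not exhaustive.
print(len(first), len(second), len(second - first), len(first - second))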

Summing from large public dataset

I'm working on a web app that will let users explore some data from a public API. The idea is to let the user select a U.S. state and some other parameters, and I'll give them a line chart showing, for example, what percentage of home loans in that state were approved or denied over time.
I can make simple queries along these lines work with a small number of rows, but these are rather large datasets, so I'm only seeing a sliver of the whole. Asking for all the data produces an error. I think the solution is to aggregate the data, but that's where I start getting 400 Bad Request responses from the server.
For example, this is an attempt to summarize 2008 California data to give the total number of applications per approval category:
https://api.consumerfinance.gov/data/hmda/slice/hmda_lar.json?$where=as_of_year=2008,state_abbr="CA"&$select=action_taken,SUM(action_taken)&group=action_taken
All summary variations produce a 400 error. I can't find an example that is very similar to what I'm trying to do, so any help is appreciated.
Publisher's information is here:
http://cfpb.github.io/api/hmda/queries.html
Sorry for the delay, but it's worth noting that that API is based on Qu, CFPB's open-source query platform, and not Socrata, so you can't use the same SoQL query language on that API.
It's unclear how you can engage them for help, but there is a "Contact Us" email link on the docs page and an issues forum on GitHub.
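For reference, based on the examples in the linked docs, a Qu-style attempt might look like the sketch below (I haven't verified this exact query; check the field names and aggregation syntax against the docs):

import requests

BASE = "https://api.consumerfinance.gov/data/hmda/slice/hmda_lar.json"
params = {
    # Qu uses SQL-style AND rather than SoQL's comma-joined clauses,
    # and the group parameter needs the $ prefix as well.
    "$where": "as_of_year=2008 AND state_abbr='CA'",
    "$select": "action_taken,COUNT()",
    "$group": "action_taken",
}
resp = requests.get(BASE, params=params)
print(resp.status_code)
print(resp.json())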

Counting views of any element on website

I am using the following MySQL query to measure view counts:
UPDATE content SET views=views+1 WHERE id='$id'
For example, if I want to check how many times a single page has been viewed, I just put it at the top of the page code. Unfortunately, I always get a count about 5-10x bigger than the results in Google Analytics.
If I am correct, one refresh should increase the value in my database by +1. Doesn't "Views" in Google Analytics work the same way?
If, for example, Google Analytics says a single page has been viewed 100 times while my database says it was viewed 450 times, how could such a simple query generate an additional 350 views? And I don't mean visits or unique visits, just regular views.
Is it possible that Google Analytics interprets the data in a slightly different way and my database result is correct?
There are quite a few reasons why this could be occurring. The usual culprits are bots and spiders. As soon as you use a third-party API like Google Analytics, or Facebook's API, you'll get their bots making hits to your page.
You need to examine each request in more detail. The user agent is a good place to start, although I do recommend researching this area further - discriminating between human and bot traffic is quite a deep subject.
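As a starting point, a sketch along these lines (the bot markers and schema are illustrative, not a complete solution) keeps the most obvious bot traffic out of the counter, and a parameterized query also avoids the injection risk in the original id='$id':

# Illustrative markers only; real bot detection is a much deeper subject.
BOT_MARKERS = ("bot", "spider", "crawler", "facebookexternalhit")

def is_probably_bot(user_agent):
    ua = (user_agent or "").lower()
    return any(marker in ua for marker in BOT_MARKERS)

def count_view(cursor, page_id, user_agent):
    if is_probably_bot(user_agent):
        return  # don't count likely bot hits
    cursor.execute("UPDATE content SET views = views + 1 WHERE id = %s",
                   (page_id,))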
In Google Analytics the data is provided by the user. For example:
A user views a page on your domain; that user's browser is then in charge of communicating the pageview to Google. If something fails along the way, the data will not be included in the reports.
Your SQL system, on the other hand, is a log-based analytics system: the data is collected by your own server, which reduces data-collection failures.
Seen this way, it means some data can be missed with slow connections and users that don't execute JavaScript (ad blockers or bots), or when the HTML page is not rendered properly***.
Now, 5x more is a huge discrepancy; in my experience it should be closer to 8-25% (tested at the transaction level; for pageviews it may be more).
What I recommend:
Save the device, the browser information, the IP, and any other useful metadata, and don't forget the timestamp. That way you can isolate the problem; maybe it's robots or users with ad blockers, or in the worst case your code is not properly implemented (located in the footer, for example). A sketch of this logging follows the footnote below.
*** I added this because I once had a huge discrepancy that turned out to be a server error: the HTML was not rendered properly and users were shown an empty page. MySQL was not fast enough to save the information and render the HTML. I noticed it when a stress test (via Screaming Frog) showed a lot of 5xx errors. (A WordPress blog with no cache.)
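The logging idea might look like this (table and column names are illustrative):

def log_view(cursor, page_id, ip, user_agent):
    # Keeping per-hit metadata lets you separate bots, ad-block users,
    # and real views after the fact.
    cursor.execute(
        "INSERT INTO view_log (page_id, ip, user_agent, viewed_at) "
        "VALUES (%s, %s, %s, NOW())",
        (page_id, ip, user_agent),
    )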

Is it worth excluding null fields from a JSON server response in a web application to reduce traffic?

Let's say that the API is well documented and every possible response field is described.
Should a web application's server API exclude null fields in a JSON response to lower the amount of traffic? Is this a good idea at all?
I was trying to calculate the amount of traffic reduced for a large app like Twitter, and the numbers are actually quite convincing.
For example: if you exclude a single response field, "someGenericProperty":null, which is 26 bytes, from every single API response, while Twitter reportedly handles 13 billion API requests per day, the traffic reduction would be >300 GB.
More than 300 GB less traffic every day is quite a money saver, isn't it? That's probably the most naive and simplistic calculation ever, but still.
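Spelled out, the back-of-the-envelope arithmetic does hold up:

bytes_saved = 26  # len('"someGenericProperty":null')
requests_per_day = 13_000_000_000
print(bytes_saved * requests_per_day / 1e9)  # ~338 GB per day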
In general, no. The more public the API and the more potential consumers it has, the more invariant the API should be.
Developers getting started with the API are confused when a field shows up sometimes but not at other times. This leads to frustration and ultimately wastes the API owner's time in the form of support requests.
There is no way to know exactly how downstream consumers are using an API. Often, they are not using it just as the API developer imagines. Elements that appear or disappear based on the context can break applications that consume the API. The API developer usually has no way to know when a downstream application has been broken, short of complaints from downstream developers.
When data elements appear or disappear, uncertainty is introduced. Was the data element not sent because the API considered it to be irrelevant? Or has the API itself changed? Or is some bug in the consumer's code not parsing the response correctly? If the consumer expects a field and it isn't there, how does that get debugged?
On the server side, extra code is needed to strip those fields out of the response. What if the logic that strips out the data is wrong? It's a chance to inject defects, and it means there is more code that must be maintained.
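For illustration, the stripping logic itself is tiny, which is part of why a defect in it is easy to miss (a minimal sketch; assumes a dict-shaped payload):

def strip_nulls(payload):
    # Drops every null-valued field. Note that it also silently drops fields
    # whose null is meaningful, which is one way this logic goes wrong.
    return {key: value for key, value in payload.items() if value is not None}

print(strip_nulls({"id": 7, "someGenericProperty": None}))  # {'id': 7}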
In many applications, network latency is the dominating factor, not bandwidth. For performance reasons, many API developers will favor a few large requests/responses over many small ones. At my last company, the sales and billing systems would routinely exchange messages of 100 KB, 200 KB or more. Sometimes only a few KB of the data was needed, but overall system performance was better than fetching some data, discovering more was needed, and then sending an additional request for that data.
For most applications some inconsistency is more dangerous than superfluous data is wasteful.
As always, there are a million exceptions. I once interviewed for a job at a torpedo maintenance facility. They had underwater sensors on their firing range to track torpedoes. All sensor data were relayed via acoustic modems to a central underwater data collector. Acoustic underwater modems? Yes. At 300 baud, every byte counts.
There are battery-powered embedded applications where every byte counts as well, and low-frequency RF communication systems.
Another exception is sparse data. For example, imagine a matrix with 4,000,000 rows and 10,000 columns where 99.99% of the values of the matrix are zero. The matrix should be represented with a sparse data structure that does not include the zeros.
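For instance, a dictionary-of-keys sketch stores only the non-zero cells (values here are illustrative):

# A dense 4,000,000 x 10,000 matrix of 8-byte floats would take ~320 GB;
# storing only the ~0.01% non-zero cells as {(row, col): value} is tiny.
sparse = {(12, 9876): 3.14, (2500000, 42): -1.0}

def get(matrix, row, col):
    return matrix.get((row, col), 0.0)  # missing keys are implicit zeros

print(get(sparse, 12, 9876), get(sparse, 0, 0))  # 3.14 0.0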
It definitely depends on the service and the amount of data it provides; you should evaluate the ratio of null to non-null data and set a threshold above which it is worth excluding those elements.
Thanks for sharing; it's an interesting point to me.
The question approaches this from the wrong side: JSON is not the best format for compressing or reducing traffic; something like Google Protocol Buffers or BSON is.
I am carefully re-evaluating nullables in the API schema right now. We use Swagger (OpenAPI), and JSON Schema does not really have something like a nullable type; I think there is a good reason for this.
If you have a JSON response that maps a DB integer field which is suddenly NULL (or can be, according to the DB schema), that is fine for a relational DB but not at all healthy for your API.
I suggest adopting and following a much more elegant approach: make better use of "required", for the response as well.
If a field is optional in the response schema and it has a null value in the DB, do not return the field.
We have enabled strict schema checks for API responses as well; this gives us much better control of our data and forces us not to rely on state in the API.
For the API client that of course means doing checks like:
if ("key" in response) {
console.log("Optional key value:" + response[key]);
} else {
console.log("Optional key not found");
}