google analytics core api results sampling level - google-apps-script

I have a script pulling data from the Google Analytics Core API. Since the results successfully populate a sheet in Google Sheets, I know the data pull works.
I'm reading the documentation here.
In particular, this table:
However, I would like to Logger.log() the sampling level of the query:
// check sampling for each report
if (!results.containsSampledData) {
  Logger.log('sampling: none');
} else {
  Logger.log('sampling: ' + results.query.samplingLevel);
}
When I view the logs I get 'sampling: undefined'.
How do I get the sampling results from the results object?
Here is what generates the results object, though I don't think it's relevant (but I may be wrong):
// get GA data from core api
function getReportDataForProfile(profile, len_results, start_num) {
  var startDate = getLastNdays(30); // set date range here
  var endDate = getLastNdays(0);
  var optArgs = {
    'dimensions': 'ga:dimension5,ga:dimension4', // Comma-separated list of dimensions.
    'start-index': start_num,
    'max-results': len_results,
    'filters': 'ga:source==cj'
  };
  // Make a request to the API.
  var results = Analytics.Data.Ga.get( // mcf for multi channel api, Ga for core
    profile,   // Table id (format ga:xxxxxx).
    startDate, // Start-date (format yyyy-MM-dd).
    endDate,   // End-date (format yyyy-MM-dd).
    'ga:goalCompletionsAll,ga:users,ga:sessions', // Comma-separated list of metrics.
    optArgs);
  return results;
}

I think you missed this sentence:
The following table summarizes all the query parameters accepted by the Core Reporting API.
Those are query parameters. In other words, values that YOU supply. So, you should already know what the sampling level is since you determine it.
Here's the doc on sampling level. If not supplied, it sets samplingLevel to DEFAULT.
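In Apps Script terms, that means you can pass samplingLevel yourself in the optional arguments. A minimal sketch based on the question's optArgs (per the docs, the accepted values are DEFAULT, FASTER and HIGHER_PRECISION):
var optArgs = {
  'dimensions': 'ga:dimension5,ga:dimension4',
  'samplingLevel': 'HIGHER_PRECISION' // or 'FASTER'; omitting it means 'DEFAULT'
};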
EDIT: Here's the doc on the response. It does indeed include a samplingLevel field, but if you scroll further down, samplingLevel isn't one of the fields described in the Response Fields table. I suspect it is either included in the response by accident, or you cannot rely on that field given the lack of documentation.

Ah. If I had read further down I would have seen this paragraph:
Sampling
Google Analytics calculates certain combinations of dimensions and metrics on the fly. To return the data in a reasonable time, Google Analytics may only process a sample of the data.
You can specify the sampling level to use for a request by setting the samplingLevel parameter.
If a Core Reporting API response contains sampled data, then the containsSampledData response field will be true. In addition, 2 properties will provide information about the sampling level for the query: sampleSize and sampleSpace. With these 2 values you can calculate the percentage of sessions that were used for the query. For example, if sampleSize is 201,000 and sampleSpace is 220,000 then the report is based on (201,000 / 220,000) * 100 = 91.36% of sessions.
See Sampling for a general description of sampling and how it is used in Google Analytics.
So to get the sample size as a percentage (what I'm used to seeing) I do this: results.sampleSize / results.sampleSpace
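Putting that together with the earlier check, something like this should log the percentage (a sketch; note the v3 API returns sampleSize and sampleSpace as strings, so parse them first):
if (results.containsSampledData) {
  var pct = 100 * parseInt(results.sampleSize, 10) / parseInt(results.sampleSpace, 10);
  Logger.log('sampling: ' + pct.toFixed(2) + '% of sessions');
} else {
  Logger.log('sampling: none');
}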

Related

Apply OData function on retrieved data in a query

I just started to work with OData and I had the impression that OData querying is quite flexible.
But in some cases I want to retrieve updated/newly calculated data on the fly. In my case this data is SalaryData values. At some point, I want them to be slightly tweaked, with an additional calculation function applied. The critical point is that this must happen at retrieval time, as part of the general request query.
But I don't know whether using a function is applicable in this case.
Ideally, I want a request similar to this:
/odata/Employee(1111)?$expand=SalaryData/CalculculationFunction(40)
Here I want to apply CalculculationFunction with parameters on SalaryData.
Is it possible to do this in OData this way? Or should I create an entity set of salary data and retrieve the calculated data directly, using a query something like
/odata/SalaryData(1111)/CalculculationFunction(40)
But this way is the least preferable for me, because I don't want to use the id of SalaryData in the request.
Current example of the function I created:
[EnableQuery(MaxExpansionDepth = 10, MaxAnyAllExpressionDepth = 10)]
[HttpGet]
[ODataRoute("({key})/FloatingWindow(days={days})")]
public SingleResult<Models.SalaryData> MovingWindow([FromODataUri] Guid key, [FromODataUri] int days)
{
    if (days <= 0)
        return new SingleResult<Models.SalaryData>(Array.Empty<Models.SalaryData>().AsQueryable());

    var cachedSalaryData = GetAllowedSalaryData().FirstOrDefault(x => x.Id.Equals(key));
    var mappedSalaryData = mapper.Map<Models.SalaryData>(cachedSalaryData);
    mappedSalaryData = Models.SalaryData.FloatingWindowAggregation(days, mappedSalaryData);
    var salaryDataResult = new[] { mappedSalaryData };
    return new SingleResult<Models.SalaryData>(salaryDataResult.AsQueryable());
}
There is always an overlap between What is OData Compliant Routing vs What can I do with Routes in Web API. It is not always necessary to conform to the OData (V4) specification, but a non-conforming route will need custom logic on the client as well.
The common workaround for this type of request is to create a Function endpoint bound to the Employee item that accepts the parameter input that will be used to materialize the data. The URL might look like this instead:
/odata/Employee(1111)/WithCalculatedSalary(40)?$expand=SalaryData
This method could then internally call the existing MovingWindow function from the SalaryDataController to build the results. You could also engineer both functions to call a common set-based routine.
The reason you should bind this function to the EmployeeController is that the primary identifying resource that correlates the resulting data together is the Employee.
In this way OData v4 compliant clients would still be able to execute this function and importantly would be able to discover it without any need for customisations.
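From the client side, invoking the bound function is then an ordinary GET; a hypothetical JavaScript sketch (WithCalculatedSalary is the suggested name from above, not an existing endpoint):
// Call the bound function and read the expanded SalaryData from the response.
fetch('/odata/Employee(1111)/WithCalculatedSalary(40)?$expand=SalaryData')
  .then(function (res) { return res.json(); })
  .then(function (employee) { console.log(employee.SalaryData); });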
If you didn't need to return the Employee resource as part of the response then you could still serve a collection of SalaryData from the EmployeeController:
/odata/Employee(1111)/CalculatedSalary(days=40)
[EnableQuery(MaxExpansionDepth = 10, MaxAnyAllExpressionDepth = 10)]
[HttpGet]
[ODataRoute("({key})/CalculatedSalary(days={days})")]
public IQueryable<Models.SalaryData> CalculatedSalary([FromODataUri] int key, [FromODataUri] int days)
{
    ...
}

builder.EntitySet<Employee>("Employee")
       .EntityType
       .Function("CalculatedSalary")
       .ReturnsCollectionFromEntitySet<SalaryData>("SalaryData")
       .Parameter<int>("days");
$compute and $search in ASP.NET Core OData 8
The OData v4.01 specification does have support for the System Query Option $compute, which was designed to enable clients to append computed values to the response structure. You could hijack this pipeline and define your own function that can be executed from a $compute clause, but the expectation is that the system canonical functions are used with a combination of literal values and field references.
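For illustration, a $compute clause built from canonical operators and field references might look like this (BaseSalary and AdjustedSalary are hypothetical properties):
/odata/Employee?$compute=BaseSalary mul 1.4 as AdjustedSalary&$select=Name,AdjustedSalary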
The ASP.NET implementation has only introduced support for this in the OData Lib v8 runtime, and as yet I have not found a good example of how to implement custom functions, but syntactically it is feasible.
The same concept could be used to augment the $apply execution: if this calculation operates over a collection and effectively performs an aggregate evaluation, then $apply may be the more natural fit.
It might be that your current CalculculationFunction can be translated directly into a $compute statement; otherwise, if you promote some of the calculation steps (metadata) to columns in the schema (you might use SQL computed columns for this...) then $compute could be a viable option.

Calling Background colors from google sheets using Google sheets api is missing data

From a previous question linked here (Previous Question) I learned that Sheets.Spreadsheets.get returns a JSON representation of sheet data that would allow me to get the background colors of a sheet within my project. I'd previously been doing this with var BackgroundColors = ActiveWeekSheet.getDataRange().getBackgrounds(); but was told that the JSON method would be a faster read/write method. They directed me to do some reading on JavaScript objects, but after that I'm still confused.
I've got the following code: TestArray = Sheets.Spreadsheets.get("1irmcO8yMxYwkcLaxZd1cN8XsTIhpzI98If_Cxgp1vF8"); which returns JSON with sheet-specific data. A Logger statement of TestArray returns this: testArrayObject: {"properties":{"gridProperties":{"rowCount":1000,"columnCount":26},"sheetType":"GRID","index":0,"sheetId":0,"title":"Awesome"}}
Community members previously suggested I could then find the background colors at: sheets[].data[].rowData[].values[].cellData.effectiveFormat.backgroundColor
I've highlighted one of the cells yellow, but when reviewing the above JSON I can't find anything that references color. There definitely isn't any nesting in the JSON corresponding to sheets->data->rowData->values->cellData.effectiveFormat.backgroundColor.
What am I missing here? Do I need to format things someway? Am I not calling the right JSON to start with?
Thanks!
As written in the documentation,
By default, data within grids will not be returned. You can include grid data one of two ways:
Specify a field mask listing your desired fields using the fields URL parameter in HTTP
Sheets.Spreadsheets.get(spreadsheetId, {
  ranges: "Sheet1!A1:A5",
  fields: "sheets(data(rowData(values(effectiveFormat.backgroundColor))))"
})
Set the includeGridData URL parameter to true. If a field mask is set, the includeGridData parameter is ignored
Sheets.Spreadsheets.get(spreadsheetId, {
  ranges: "Sheet1!A1:A5",
  includeGridData: true
})
Field mask documentation:
In a nutshell,
multiple different fields are comma separated, and
subfields are dot-separated.
For convenience, multiple subfields from the same type can be listed within parentheses.
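For example, a single mask that returns both the displayed value and the background color of each cell might look like this (field names taken from the CellData reference):
fields: "sheets(data(rowData(values(formattedValue,effectiveFormat.backgroundColor))))"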
You may test the API here
There are optional parameters in the spreadsheets.get method that will give you that data, but you need to explicitly include them:
ranges – The ranges to retrieve from the spreadsheet.
includeGridData – The cell data within specified range.
This specifies a range of just one cell (A1 in Sheet1), but you can specify a larger range and navigate through the array if you need to.
var TestArray = Sheets.Spreadsheets.get(SS_ID, {ranges: "Sheet1!A1", includeGridData: true});
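From there you can walk the nested arrays down to the color, something like this (a sketch, assuming the one-cell range above):
var bg = TestArray.sheets[0].data[0].rowData[0].values[0].effectiveFormat.backgroundColor;
Logger.log(bg); // a yellow cell logs roughly {red: 1, green: 1}; channels equal to 0 may be omitted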
It's really important to keep in mind that this returns a Color object with RGBA values ranging from 0 to 1, whereas elsewhere Apps Script uses hex colors or the conventional 0-255 RGB values.
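If you need to feed the result back into methods like setBackground(), which expect hex strings, a small helper does the conversion (a sketch; it treats missing channels as 0, since the API may omit zero-valued fields):
// Convert a Sheets API Color object (0-1 floats) to '#rrggbb'.
function colorToHex(color) {
  function channel(v) {
    var s = Math.round((v || 0) * 255).toString(16);
    return s.length === 1 ? '0' + s : s;
  }
  return '#' + channel(color.red) + channel(color.green) + channel(color.blue);
}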

Pagination yields no results in Google Fit

I am using the REST API of Google Fit. I want to list sessions with the fitness.users.sessions.list method. This gives me a few dozen results.
Now I would like to get more results, so I set pageToken to the value I got from the previous response. But the new results do not contain any data points, just yet another pageToken:
{
  "session": [],
  "deletedSession": [],
  "nextPageToken": "1541027616563"
}
The same happens when I use the pagination function of the Google Python API Client: I iterate on results but never get any new data.
request = self.service.users().sessions().list(userId='me')
while request is not None:
    response = request.execute()
    for ds in response['session']:
        yield ds
    request = self.service.users().sessions().list_next(request, response)
I am sure there is much(!) more session data in Google Fit for my account. Am I missing something regarding pagination?
Thanks
I think that the description of the pageToken parameter is actually rather confusing in the documentation (this answer was written prior to the documentation being updated).
The continuation token, which is used to page through large result sets. To get the next page of results, set this parameter to the value of nextPageToken from the previous response.
This is conflating two concepts: continuation, and paging. There isn't actually any paging in the implementation of Users.sessions.
Sessions are indexed by their modification timestamp. There are two (or three, depending on how you count) ways to interact with the API:
Pass a start and/or end time. Omitted start and end times are taken to be the start and end of time respectively. In this case, you will get back all sessions falling between those times.
Pass neither start nor end times. In this case, you will receive all sessions between some time in the past and now. That time is:
pageToken, if provided
Otherwise, it's 7 days ago (this doesn't actually appear in the documentation, but it is the behavior)
In any of these cases, you receive a nextPageToken back which is just after the most recent session in the results. As such, nextPageToken is really a continuation token, because what it is saying is that you have been told about all sessions modified up to now: pass that token back to be told about anything modified between nextPageToken and "current time" to get updates.
As such, if you issue a request that fetches all sessions for the last 7 days (no start/end time, no page token) and get a nextPageToken, you will only get something back in a request using that nextPageToken if any sessions have been modified in between the first and second requests.
So, if you're making these requests in quick succession, it is expected that you won't see anything in the second response.
In terms of the validity of the startTime you were passing in your comment, that's a bug. RFC3339 defines that fractional seconds should be optional.
I'll see about getting that fixed; but in the interim, just make sure you pass a fractional number of seconds (even if it is just .0, e.g. 2018-10-18T00:00:00.0+00:00).
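For example, a request along these lines should be accepted (hypothetical dates; note the .0 fractional seconds):
GET https://www.googleapis.com/fitness/v1/users/me/sessions?startTime=2018-10-18T00:00:00.0Z&endTime=2018-10-19T00:00:00.0Z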
It may be because the format of the URL you're using is different from the example in the documentation.
You are using:
startTime=2018-10-18T00:00:00+00:00
Whereas the one in the documentation has it as:
startTime=2014-04-01T00:00:00.00Z
The documentation also states that both the startTime and endTime query parameters are required.

Duplicates on Apache Beam / Dataflow inputs even when using withIdAttribute

I am trying to ingest data from a 3rd party API into a Dataflow pipeline. Since the 3rd party doesn't make webhooks available, I wrote a custom script that constantly polls their endpoint for more data.
The data is refreshed every 15 minutes, but since I don't want to miss any datapoints and I want to consume new data as soon as it is available, my "crawler" runs every minute. The script then sends the data to a PubSub topic. It's easy to see that PubSub will receive about 15 copies of each datapoint from the source.
My first attempt to identify and discard those repeated messages was to add a custom attribute to each PubSub message (eventid), created from a hash of its [ID + updated_time] at source.
const attributes = {
  eventid: Buffer.from(`${item.lastupdate}|${item.segmentid}`).toString('base64'),
  timestamp: item.timestamp.toString()
};
const dataBuffer = Buffer.from(JSON.stringify(item));
publisher.publish(dataBuffer, attributes);
Then I configured Dataflow with a withIdAttribute() (which is the new idLabel(), based on Record IDs).
PCollection<String> input = p
    .apply("ReadFromPubSub", PubsubIO
        .readStrings()
        .fromTopic(String.format("projects/%s/topics/%s", options.getProject(), options.getIncomingDataTopic()))
        .withTimestampAttribute("timestamp")
        .withIdAttribute("eventid"))
    .apply("OutputToBigQuery", ...)
With that implementation, I was expecting that when the script sends the same datapoint a second time, the repeated eventid would be the same and the message discarded. But for some reason, I still see duplicates on the output dataset.
Some questions:
Is there a clever way to ingest the data to dataflow from that 3rd party API if they don't provide webhooks?
Any ideas on why Dataflow is not discarding the messages in this situation?
I know about the 10-minute deduplication window in Dataflow, but I see duplicated data even on the 2nd insertion (2 minutes apart).
Any help will be greatly appreciated!
I think you are on the right track; instead of the hash, I recommend using timestamps. A better way to do this is by using windows. Review this document, which filters data that is outside of the window.
Regarding the additional duplicate data: if you are using pull subscriptions and the acknowledgement deadline is reached before the data is processed, the message will be resent, as per the at-least-once delivery semantics. In this case, change the acknowledgement deadline; the default is 10 seconds.
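For example, with the gcloud CLI (my-subscription is a placeholder for your pull subscription):
gcloud pubsub subscriptions update my-subscription --ack-deadline=60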

Using Other Data Sources for cubism.js

I like the user experience of cubism, and would like to use this on top of a backend we have.
I've read the API docs and some of the code; most of this seems to be abstracted away. How could I begin to use other data sources, exactly?
I have a data store of about 6k individual machines with 5 minute precision on around 100 or so stats.
I would like to query some web app with a specific identifier for that machine and then render a dashboard similar to cubism via querying a specific mongo data store.
Writing the webapp or the querying to mongo isn't the issue.
The issue is more in line with the fact that cubism seems to require querying whatever data store you use for each individual data point (say you have 100 stats across a window of a week...expensive).
Is there another way I could leverage this tool to look at data that gets loaded using something similar to the code below?
var data = [];
d3.json("/initial", function(json) { data.concat(json); });
d3.json("/update", function(json) { data.push(json); });
Cubism takes care of initialization and update for you: the initial request is for the full visible window (start to stop, typically 1,440 data points), while subsequent requests are only for the few most recent values (typically 7 data points).
Take a look at context.metric for how to implement a new data source. The simplest possible implementation is like this:
var foo = context.metric(function(start, stop, step, callback) {
  d3.json("/data", function(data) {
    if (!data) return callback(new Error("unable to load data"));
    callback(null, data);
  });
});
You would extend this to change the "/data" URL as appropriate, passing in the start, stop and step times, and whatever else you want to use to identify a metric. For example, both Cube and Graphite use a metric expression as an additional query parameter.
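For example, a parameterized version might look like this (a sketch: the /data endpoint, its query parameters, and the machine/stat identifiers are assumptions about your backend, not part of the cubism API):
function machineMetric(machine, stat) {
  return context.metric(function(start, stop, step, callback) {
    // start and stop are Dates; send them as epoch milliseconds.
    d3.json("/data?machine=" + encodeURIComponent(machine)
        + "&stat=" + encodeURIComponent(stat)
        + "&start=" + (+start) + "&stop=" + (+stop) + "&step=" + step,
      function(data) {
        if (!data) return callback(new Error("unable to load data"));
        callback(null, data); // expects one numeric value per step in [start, stop)
      });
  }, machine + " " + stat);
}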