Reference data join on stream analytics input not giving output - json

I'm trying to set a rule in Azure Stream Analytics job with the use of reference data and input stream which is coming from an event hub.
This is my reference data JSON packet in BLOB storage:
{
"ruleId": 1234,
"Tag" : "TAG1",
"metricName": "velocity",
"alertName": "velocity over 500",
"operator" : "AVGGREATEROREQUAL",
"value": 500
}
And here is the transformation query in the stream analytics job:
WITH
transformedInput AS
(
SELECT
metric = GetArrayElement(DeviceInputStream.data,0),
masterTag = rules.Tag,
ruleId = rules.ruleId,
alertName = rules.alertName,
ruleOperator = rules.operator,
ruleValue = rules.value
FROM
DeviceInputStream
timestamp by EventProcessedUtcTime
JOIN
rules
ON DeviceInputStream.masterTag = rules.Tag
)
--rule output--
SELECT
System.Timestamp as time,
transformedInput.Tag as Tag,
transformedInput.ruleId as ruleId,
transformedInput.alertName as alert,
AVG(metric.velocity) as avg
INTO
alertruleblob
FROM
transformedInput
GROUP BY
transformedInput.masterTag,
transformedInput.ruleId,
transformedInput.alertName,
ruleOperator,
ruleValue,
TumblingWindow(second, 6)
HAVING
ruleOperator = 'AVGGREATEROREQUAL' AND avg(metric.velocity) >= ruleValue
This is not yielding any results. However, when I do a test with sample input and reference data I get the expected results. But this doens't seem to be working with the streaming data. My use case is if the average velocity is greater than 500 for a 6 second window, store that result in another blob storage. The value of velocity has been greater than 500 for sometime, but I'm not getting any results.
What am I doing wrong?

This was working all along. I just had to specify the input path of the reference blob in the reference input path of stream analytics including the file name. I was basically referencing only the blob container without the actual file. So when I changed the path pattern to "filename.json", I got the results. It was a stupid mistake.

Related

Cloud Function running multiple times instead of once

I upload 10 files every day at 11 p.m with a Cron Job to a bucket on GCS. Each file is a .csv with a size from 2 to 30 KB. The file name is always YYYY-MM-DD-ID.csv
A Cloud Function is called everytime I am uploading a file into that bucket to send those .csv files to BigQuery. The trigger type is Cloud Storage on finalise/create events.
My issue is the following:
On BigQuery, each value for each row/columns is multiplied by a multiple. Sometimes it's 1 (so the value is the same), often 2 and sometimes 3. I attached one example bellow with the difference between BigQuery (BQ) and Google Cloud Storage (GCS).
It seems that the cloud function is called multiple times. It's not on the code but rather some duplicate message deliveries from the Cloud Function during the trigger. When I am going o the logs tab for today, I can see the Cloud Function upload_to_bigquery has been called multiple times.
I have tried to fix it but I made a mistake. I thought we could write temporary files to Cloud Functions but we can not. My solution was to write the filename I am uploading to BigQuery on a .txt file. And before to upload a new file on BigQuery, read that .txt file and check if the current file exist on that list. If the filename is already present, skip. Else, write the .txt filename to the list and do my stuff.
if file_to_upload not in text:
text.append(file_to_upload)
with open("all_uploaded_files.txt", "w") as text_file:
for item in text:
text_file.write(item + "\n")
bucket = storage_client.bucket('sfr-test-data')
blob = bucket.blob("all_uploaded_files.txt")
blob.upload_from_filename("all_uploaded_files.txt")
## do my things here
else:
print("file already uploaded")
# skip to new file to upload
But even if I could do that, this solution is not viable. The temporary file will become so large after months of years that it would be a mess. Do you know whats the easiest way to fix this issue?
Cloud Function: upload_to_big_query - main.py
BUCKET = "xxx"
GOOGLE_PROJECT = "xxx"
HEADER_MAPPING = {
"Source/Medium": "source_medium",
"Campaign": "campaign",
"Last Non-Direct Click Conversions": "last_non_direct_click_conversions",
"Last Non-Direct Click Conversion Value": "last_non_direct_click_conversion_value",
"Last Click Prio Conversions": "last_click_prio_conversions",
"Last Click Prio Conversion Value": "last_click_prio_conversion_value",
"Data-Driven Conversions": "dda_conversions",
"Data-Driven Conversion Value": "dda_conversion_value",
"% Change in Conversions from Last Non-Direct Click to Last Click Prio": "last_click_prio_vs_last_click",
"% Change in Conversions from Last Non-Direct Click to Data-Driven": "dda_vs_last_click"
}
SPEND_HEADER_MAPPING = {
"Source/Medium": "source_medium",
"Campaign": "campaign",
"Spend": "spend"
}
tables_schema = {
"google-analytics": [
bigquery.SchemaField("date", bigquery.enums.SqlTypeNames.DATE, mode='REQUIRED'),
bigquery.SchemaField("week", bigquery.enums.SqlTypeNames.INT64, mode='REQUIRED'),
bigquery.SchemaField("goal", bigquery.enums.SqlTypeNames.STRING, mode='REQUIRED'),
bigquery.SchemaField("source", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
bigquery.SchemaField("medium", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
bigquery.SchemaField("campaign", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
bigquery.SchemaField("last_non_direct_click_conversions", bigquery.enums.SqlTypeNames.INT64, mode='NULLABLE'),
bigquery.SchemaField("last_non_direct_click_conversion_value", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
bigquery.SchemaField("last_click_prio_conversions", bigquery.enums.SqlTypeNames.INT64, mode='NULLABLE'),
bigquery.SchemaField("last_click_prio_conversion_value", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
bigquery.SchemaField("dda_conversions", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
bigquery.SchemaField("dda_conversion_value", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
bigquery.SchemaField("last_click_prio_vs_last_click", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
bigquery.SchemaField("dda_vs_last_click", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE')
],
"google-analytics-spend": [
bigquery.SchemaField("date", bigquery.enums.SqlTypeNames.DATE, mode='REQUIRED'),
bigquery.SchemaField("week", bigquery.enums.SqlTypeNames.INT64, mode='REQUIRED'),
bigquery.SchemaField("source", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
bigquery.SchemaField("medium", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
bigquery.SchemaField("campaign", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
bigquery.SchemaField("spend", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
]
}
def download_from_gcs(file):
client = storage.Client()
bucket = client.get_bucket(BUCKET)
blob = bucket.get_blob(file['name'])
file_name = os.path.basename(os.path.normpath(file['name']))
blob.download_to_filename(f"/tmp/{file_name}")
return file_name
def load_in_bigquery(file_object, dataset: str, table: str):
client = bigquery.Client()
table_id = f"{GOOGLE_PROJECT}.{dataset}.{table}"
job_config = bigquery.LoadJobConfig(
source_format=bigquery.SourceFormat.CSV,
skip_leading_rows=1,
autodetect=True,
schema=tables_schema[table]
)
job = client.load_table_from_file(file_object, table_id, job_config=job_config)
job.result() # Wait for the job to complete.
def __order_columns(df: pd.DataFrame, spend=False) ->pd.DataFrame:
# We want to have source and medium columns at the third position
# for a spend data frame and at the fourth postion for others df
# because spend data frame don't have goal column.
pos = 2 if spend else 3
cols = df.columns.tolist()
cols[pos:2] = cols[-2:]
cols = cols[:-2]
return df[cols]
def __common_transformation(df: pd.DataFrame, date: str, goal: str) -> pd.DataFrame:
# for any kind of dataframe, we add date and week columns
# based on the file name and we split Source/Medium from the csv
# into two different columns
week_of_the_year = datetime.strptime(date, '%Y-%m-%d').isocalendar()[1]
df.insert(0, 'date', date)
df.insert(1, 'week', week_of_the_year)
mapping = SPEND_HEADER_MAPPING if goal == "spend" else HEADER_MAPPING
print(df.columns.tolist())
df = df.rename(columns=mapping)
print(df.columns.tolist())
print(df)
df["source_medium"] = df["source_medium"].str.replace(' ', '')
df[["source", "medium"]] = df["source_medium"].str.split('/', expand=True)
df = df.drop(["source_medium"], axis=1)
df["week"] = df["week"].astype(int, copy=False)
return df
def __transform_spend(df: pd.DataFrame) -> pd.DataFrame:
df["spend"] = df["spend"].astype(float, copy=False)
df = __order_columns(df, spend=True)
return df[df.columns[:6]]
def __transform_attribution(df: pd.DataFrame, goal: str) -> pd.DataFrame:
df.insert(2, 'goal', goal)
df["last_non_direct_click_conversions"] = df["last_non_direct_click_conversions"].astype(int, copy=False)
df["last_click_prio_conversions"] = df["last_click_prio_conversions"].astype(int, copy=False)
df["dda_conversions"] = df["dda_conversions"].astype(float, copy=False)
return __order_columns(df)
def transform(df: pd.DataFrame, file_name) -> pd.DataFrame:
goal, date, *_ = file_name.split('_')
df = __common_transformation(df, date, goal)
# we only add goal in attribution df (google-analytics table).
return __transform_spend(df) if "spend" in file_name else __transform_attribution(df, goal)
def main(event, context):
"""Triggered by a change to a Cloud Storage bucket.
Args:
event (dict): Event payload.
context (google.cloud.functions.Context): Metadata for the event.
"""
file = event
file_name = download_from_gcs(file)
df = pd.read_csv(f"/tmp/{file_name}")
transformed_df = transform(df, file_name)
with open(f"/tmp/bq_{file_name}", "w") as file_object:
file_object.write(transformed_df.to_csv(index=False))
with open(f"/tmp/bq_{file_name}", "rb") as file_object:
table = "google-analytics-spend" if "spend" in file_name else "google-analytics"
load_in_bigquery(file_object, dataset='attribution', table=table)
You might would prefer to check this thread:
BigQuery displaying wrong results - Duplicating data from Cloud Function?
Very shortly - the function is to be idempotent, and the state of the process (if the data/file was uploaded into BQ or not) should be kept outside of the cloud function. A text file (in some GCS bucket, not inside the cloud function memory, which can be erased as soon as the cloud function execution is finished) is an option, but GCS has plenty of drawbacks in this particular case. For example, a firestore - is much, much better choice.
You might consider the following algorithm -
When you cloud function starts, it should calculate some hash code based on input data - file/object metadata or file/object data or combination of both. That hash - should be unique for the given set of data.
Your cloud function connects to a predefined firestore collection (the project and the name can be provided in the environment variables) and checks if there a document/record with the given hash as an id - already exists or not.
If that hash already exists (the document exists) in the firestore collection - the cloud function finishes its execution and does not do anything else (can do logging, add some additional details into the firestore document if required, etc.). Thus simply finishes its execution.
If that hash is not found (the document does not exist) - the cloud function creates a new document with the given hash as an id. Some metadata details can be added into that document if needed.
Upon the document is created the cloud function continues the main 'workflow'.
A few things to bear in mind.
1/ IAM permissions - the service account under which the cloud function is running - should have relevant permissions on the firestore. Obviously the firestore API is to be enabled in the given project...
2/ What will happen if the cloud function creates a new firestore document, but then failed to load the data into BigQuery (for any reason). It may be that just a check on the firestore document existence is not enough. Thus, a proper 'state' is to be maintained in the firestore document. For example, when a new document is created (in the firestore), there should be a field __state and a value (for example) IN_PROGRESS is assigned to it. Then, when the data is loaded, the cloud function comes back to the firestore and updates that field with the value DONE (for example). But even that is not enough. As you have a load job - there may be cases, when the load is actually successful, but he cloud function failed (any reason including timeout). You might would like to think what to do in that case as well. In any case, having some 'state' monitoring in the firestore may help to understand/investigate the situation with the loading process. Automation of the monitoring might need developing a separate cloud function, but this is a separate story.
3/ As I mentioned in the thread I pointed above (see reasoning in that answer), loading data from inside the cloud function memory is a risky idea. I would suggest to think about that part of your algorithm again.
4/ It might be a good idea to move the loaded file/object from the "input" bucket to some "processed" (or "archive") bucket in case of success, and to move it into a "failure" bucket, in case the load failed. That is to be done in the cloud function code. Failure outcome can also be reflected in the firestore document (i.e. set the value of the __state field to FAILURE).

AWS Athena and handling json

I have millions of files with the following (poor) JSON format:
{
"3000105002":[
{
"pool_id": "97808",
"pool_name": "WILDCAT (DO NOT USE)",
"status": "Zone Permanently Plugged",
"bhl": "D-12-10N-05E 902 FWL 902 FWL",
"acreage": ""
},
{
"pool_id": "96838",
"pool_name": "DRY & ABANDONED",
"status": "Zone Permanently Plugged",
"bhl": "D-12-10N-05E 902 FWL 902 FWL",
"acreage": ""
}]
}
I've tried to generate an Athena DDL that would accommodate this type (especially the api field) of structure with this:
CREATE EXTERNAL TABLE wp_info (
api:array < struct < pool_id:string,
pool_name:string,
status:string,
bhl:string,
acreage:string>>)
LOCATION 's3://foo/'
After trying to generate a table with this, the following error is thrown:
Your query has the following error(s):
FAILED: ParseException line 2:12 cannot recognize input near ':' 'array' '<' in column type
What is a workable solution to this issue? Note that the api string is different for every one of the million files. The api key is not actually within any of the files, so I hope there is a way that Athena can accommodate just the string-type value for these data.
If you don't have control over the JSON format that you are receiving, and you don't have a streaming service in the middle to transform the JSON format to something simpler, you can use regex functions to retrieve the relevant data that you need.
A simple way to do it is to use Create-Table-As-Select (CTAS) query that will convert the data from its complex JSON format to a simpler table format.
CREATE TABLE new_table
WITH (
external_location = 's3://path/to/ctas_partitioned/',
format = 'Parquet',
parquet_compression = 'SNAPPY')
AS SELECT
regexp_extract(line, '"pool_id": "(\d+)"', 1) as pool_id,
regexp_extract(line, ' "pool_name": "([^"])",', 1) as pool_name,
...
FROM json_lines_table;
You will improve the performance of the queries to the new table, as you are using Parquet format.
Note that you can also update the table when you can new data, by running the CTAS query again with external_location as 's3://path/to/ctas_partitioned/part=01' or any other partition scheme

JSON as input in a mapreduce

I have a JSON file contains fields such as machine_id, category, and ... Category contains states of machines such as "alarm", "failure". I simply like to see how many times each machine_id has been reported using rmr2.
For example, if I have the following:
machine_id, state
48, alarm
39, failure
48, utilization
I like to see this result:
48,2
39,1
What I did:
I wrote a simple mapreduce to read the value of JSON file and used it as an input in the second mapreduce. Code is:
mp = function(k,v){
machine_id=v$machine_id
keyval(machine_id,1) }
rd = function(k,v) keyval(k,length(v))
mapreduce(input = mapreduce(input='\user\cloudera\sample.json', input.format="json" , map=function(k,v) keyval(k,v)) , map=mp, reduce = rd)
Unfortunately, it returns only the last two values of JSON file. It seems that it doesn't read entire of the value of the JSON file. I would appreciate any help.

How to fetch a JSON file to get a row position from a given value or argument

I'm using wget to fetch several dozen JSON files on a daily basis that go like this:
{
"results": [
{
"id": "ABC789",
"title": "Apple",
},
{
"id": "XYZ123",
"title": "Orange",
}]
}
My goal is to find row's position on each JSON file given a value or set of values (i.e. "In which row XYZ123 is located?"). In previous example ABC789 is in row 1, XYZ123 in row 2 and so on.
As for now I use Google Regine to "quickly" visualize (using the Text Filter option) where the XYZ123 is standing (row 2).
But since it takes a while to do this manually for each file I was wondering if there is a quick and efficient way in one go.
What can I do and how can I fetch and do the request? Thanks in advance! FoF0
In python:
import json
#assume json_string = your loaded data
data = json.loads(json_string)
mapped_vals = []
for ent in data:
mapped_vals.append(ent['id'])
The order of items in the list will be indexed according to the json data, since the list is a sequenced collection.
In PHP:
$data = json_decode($json_string);
$output = array();
foreach($data as $values){
$output[] = $values->id;
}
Again, the ordered nature of PHP arrays ensure that the output will be ordered as-is with regard to indexes.
Either example could be modified to use a mapped dictionary (python) or an associative array (php) if needs demand.
You could adapt these to functions that take the id value as an argument, track how far they are into the array, and when found, break out and return the current index.
Wow. I posted the original question 10 months ago when I knew nothing about Python nor computer programming whatsoever!
Answer
But I learned basic Python last December and came up with a solution for not only get the rank order but to insert the results into a MySQL database:
import urllib.request
import json
# Make connection and get the content
response = urllib.request.urlopen(http://whatever.com/search?=ids=1212,125,54,454)
content = response.read()
# Decode Json search results to type dict
json_search = json.loads(content.decode("utf8"))
# Get 'results' key-value pairs to a list
search_data_all = []
for i in json_search['results']:
search_data_all.append(i)
# Prepare MySQL list with ranking order for each id item
ranks_list_to_mysql = []
for i in range(len(search_data_all)):
d = {}
d['id'] = search_data_all[i]['id']
d['rank'] = i + 1
ranks_list_to_mysql.append(d)

Use JSON Input step to process uneven data

I'm trying to process the following with an JSON Input step:
{"address":[
{"AddressId":"1_1","Street":"A Street"},
{"AddressId":"1_101","Street":"Another Street"},
{"AddressId":"1_102","Street":"One more street", "Locality":"Buenos Aires"},
{"AddressId":"1_102","Locality":"New York"}
]}
However this seems not to be possible:
Json Input.0 - ERROR (version 4.2.1-stable, build 15952 from 2011-10-25 15.27.10 by buildguy) :
The data structure is not the same inside the resource!
We found 1 values for json path [$..Locality], which is different that the number retourned for path [$..Street] (3509 values).
We MUST have the same number of values for all paths.
The step provides Ignore Missing Path flag but it only works if all the rows misses the same path. In that case that step acts as as expected an fills the missing values with null.
This limits the power of this step to read uneven data, which was really one of my priorities.
My step Fields are defined as follows:
Am I missing something? Is this the correct behavior?
What I have done is use JSON Input using $.address[*] to read to a jsonRow field the full map of each element p.e:
{"address":[
{"AddressId":"1_1","Street":"A Street"},
{"AddressId":"1_101","Street":"Another Street"},
{"AddressId":"1_102","Street":"One more street", "Locality":"Buenos Aires"},
{"AddressId":"1_102","Locality":"New York"}
]}
This results in 4 jsonRows one for each element, p.e. jsonRow = {"AddressId":"1_101","Street":"Another Street"}. Then using a Javascript step I map my values using this:
var AddressId = getFromMap('AddressId', jsonRow);
var Street = getFromMap('Street', jsonRow);
var Locality = getFromMap('Locality', jsonRow);
In a second script tab I inserted minified JSON parse code from https://github.com/douglascrockford/JSON-js and the getFromMap function:
function getFromMap(key,jsonRow){
try{
var map = JSON.parse(jsonRow);
}
catch(e){
var message = "Unparsable JSON: "+jsonRow+" Desc: "+e.message;
var nr_errors = 1;
var field = "jsonRow";
var errcode = "JSON_PARSE";
_step_.putError(getInputRowMeta(), row, nr_errors, message, field, errcode);
trans_Status = SKIP_TRANSFORMATION;
return null;
}
if(map[key] == undefined){
return null;
}
trans_Status = CONTINUE_TRANSFORMATION;
return map[key]
}
You can solve this by changing the JSONPath and splitting up the steps in two JSON input steps. The following website explains a lot about JSONPath: http://goessner.net/articles/JsonPath/
$..AddressId
Does in fact return all the AddressId's in the address array, BUT since Pentaho is using grid rows for input and output [4 rows x 3 columns], it can't handle a missing value aka null value when you want as results return all the Streets (3 rows) and return all the Locality (2 rows), simply because there are no null values in the array itself as in you can't drive out of your garage with 3 wheels on your car instead of the usual 4.
I guess your script returns null (where X is zero) values like:
A S X
A S X
A S L
A X L
The scripting step can be avoided same by changing the Fields path of the first JSONinput step into:
$.address[*]
This is to retrieve all the 4 address lines. Create a next JSONinput step based on the new source field which contains the address line(s) to retrieve the address details per line:
$.AddressId
$.Street
$.Locality
This yields the null values on the four address lines when a address details is not available in an address line.