Formatting BigQuery query to ML appropriate JSON to pass through ML Predict - json

Using Python 2.7, I want to pass a query result from BigQuery to ML Predict, which has a specific formatting requirement.
First: Is there an easier way to go directly from the BigQuery query to JSON in the correct format, so it can be passed to requests.post(), instead of going through pandas (from what I understand, pandas is still not supported for GCP Standard)?
Second: Is there a way to construct the query to go directly to a JSON format and then modify the JSON to reflect the ML Predict JSON requirements?
Currently my code looks like this:
#I used the bigquery to dataframe option here to view the output.
#I would like to not use pandas in the end code.
logs = log_data.execute(output_options=bq.QueryOutput.dataframe()).result()
data = logs.to_json(orient='index')
print data
'{"0":{"end_time":"2018-04-19","device":"iPad","device_os":"iOS","device_os_version":"5.1.1","latency":0.150959,"megacycles":140.0,"cost":"1.3075e-08","device_brand":"Apple","device_family":"iPad","browser_version":"5.1","app":"567","ua_parse":"0"}}'
#The JSON needs to be in this format according to google documentation.
#data = {
# 'instances': [
# {
# 'key':'',
# 'end_time': '2018-04-19',
# 'device': 'iPad',
# 'device_os': 'iOS',
# 'device_os_version': '5.1.1',
# 'latency': 0.150959,
# 'megacycles':140.0,
# 'cost':'1.3075e-08',
# 'device_brand':'Apple',
# 'device_family':'iPad',
# 'browser_version':'5.1',
# 'app':'567',
# 'ua_parse':'40.9.8'
# }
# ]
#}
So all I would need to change is the leading key '0' to 'instances', and I should be all set to pass it into requests.post().
Is there a way to accomplish this?
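The re-keying described above can be sketched with only the standard library. This is a minimal sketch, assuming the string has the shape printed earlier; note that it is not a pure rename, because '0' maps to a single object while 'instances' maps to a list. The function name index_json_to_instances is hypothetical, and the sample string is shortened from the output shown above.

```python
import json

def index_json_to_instances(index_json, key=''):
    """Turn pandas orient='index' JSON into an {'instances': [...]} payload."""
    by_index = json.loads(index_json)
    instances = []
    # Index labels arrive as strings ("0", "1", ...), so sort them numerically.
    for label in sorted(by_index, key=int):
        record = dict(by_index[label])
        record['key'] = key  # the target format above carries a 'key' field
        instances.append(record)
    return {'instances': instances}

# Shortened version of the output printed above.
data = '{"0":{"end_time":"2018-04-19","device":"iPad","latency":0.150959}}'
payload = index_json_to_instances(data)
print(json.dumps(payload, sort_keys=True))
```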
Edit-Adding BigQuery query:
%%bq query --n log_data
WITH `my.table` AS (
SELECT ARRAY<STRUCT<end_time STRING, device STRING, device_os STRING, device_os_version STRING, latency FLOAT64, megacycles FLOAT64,
cost STRING, device_brand STRING, device_family STRING, browser_version STRING, app STRING, ua_parse STRING>>[] instances
)
SELECT TO_JSON_STRING(t)
FROM `my.table` AS t
WHERE end_time >='2018-04-19'
LIMIT 1
data = log_data.execute().result()
Thanks to @MikhailBerlyant, I have adjusted my query and code to look like this:
%%bq query --n log_data
SELECT [TO_JSON_STRING(t)] AS instance
FROM `yourproject.yourdataset.yourtable` AS t
WHERE end_time >='2018-04-19'
LIMIT 1
But when I run logs = log_data.execute().result() and pass the result into requests.post(), I get this error:
TypeError: QueryResultsTable job_zfVEiPdf2W6msBlT6bBLgMusF49E is not JSON serializable
Is there a way within execute() to just return the JSON?

First: Is there an easier way to go directly from the BigQuery query to JSON in the correct format
See example below
#standardSQL
WITH yourTable AS (
SELECT ARRAY<STRUCT<id INT64, type STRING>>[(1, 'abc'), (2, 'xyz')] instances
)
SELECT TO_JSON_STRING(t)
FROM yourTable t
The result is in the format you asked for:
{"instances":[{"id":1,"type":"abc"},{"id":2,"type":"xyz"}]}
The above demonstrates the query and how it works.
In your real case, you should use something like the below:
SELECT TO_JSON_STRING(t)
FROM `yourproject.yourdataset.yourtable` AS t
WHERE end_time >='2018-04-19'
LIMIT 1
hope this helps :o)
Update based on comments
SELECT [TO_JSON_STRING(t)] AS instance
FROM `yourproject.yourdataset.yourtable` t
WHERE end_time >='2018-04-19'
LIMIT 1
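Each row produced by TO_JSON_STRING is a JSON string, so on the Python side it still has to be parsed and wrapped before it matches the 'instances' layout. Below is a minimal sketch of that step, assuming the strings have already been fetched; how they are fetched depends on the client library and is not shown. The function name json_rows_to_payload and the sample rows are hypothetical.

```python
import json

def json_rows_to_payload(json_rows, key=''):
    """Parse per-row JSON strings and wrap them as {'instances': [...]}."""
    instances = []
    for text in json_rows:
        record = json.loads(text)
        # Add the empty 'key' field only if the row does not already have one.
        record.setdefault('key', key)
        instances.append(record)
    return {'instances': instances}

# Hypothetical stand-in for the strings a query would return.
rows = ['{"id":1,"type":"abc"}', '{"id":2,"type":"xyz"}']
body = json.dumps(json_rows_to_payload(rows))
print(body)
```

The resulting string is plain JSON text, so it could then be sent as the request body.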

I wanted to add this in case someone has the same problem I had, or is at least stuck on where to go once they have the query.
I was able to write a function that formats the query result the way Google ML Predict wants it before it is passed into requests.post(). This is most likely a horrible way to accomplish this, but I could not find a direct way to go from BigQuery to ML Predict in the correct format.
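from google.cloud import bigquery as gcb  # assumed meaning of the gcb alias
def logs(query):
    client = gcb.Client()
    query_job = client.query(query)
    CSV_COLUMNS = 'end_time,device,device_os,device_os_version,latency,megacycles,cost,device_brand,device_family,browser_version,app,ua_parse'.split(',')
    instances = []
    for row in query_job.result():
        # Pair each column name with the row's value, then add the empty key.
        instance = dict(zip(CSV_COLUMNS, list(row)))
        instance.update({'key': ''})
        instances.append(instance)
    return {'instances': instances}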

Unexpected end of JSON input at undefined line XXXX, columns xx-xx while reading in BigQuery

I have a table in BigQuery which has 2 columns: job_id and json_column (a string in JSON format). When I try to read the data and identify some objects, it gives me the error below:
SyntaxError:Unexpected end of JSON input at undefined line XXXX, columns xx-xx
It always gives me line 5931, and when I execute it a second time it gives line 6215.
If it's related to a JSON structure issue, how can I know which row/job_id the number 5931 corresponds to? If I subset for a specific job_id, it returns the values, but when I execute on the complete table, I get this error. I tried looking at the job_ids at the row numbers mentioned, and the code works fine for those job_ids.
Do you think it's a JSON structure issue, and how can I identify which row/job_id has this issue?
Table Structure:
Code:
CREATE TEMPORARY FUNCTION CUSTOM_JSON_EXTRACT(json STRING, json_path STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
var result = jsonPath(JSON.parse(json), json_path);
if(result){return result;}
else {return [];}
"""
OPTIONS (
library="gs://json_temp/jsonpath-0.8.0.js"
);
SELECT job_id,dist,gm,sub_gm
FROM lz_fdp_op.fdp_json_file,
UNNEST(CUSTOM_JSON_EXTRACT(trim(conv_column), '$.Project.OpsLocationInfo.iDistrictId')) dist ,
UNNEST(CUSTOM_JSON_EXTRACT(trim(conv_column), '$.Project.GeoMarketInfo.Geo')) gm,
UNNEST(CUSTOM_JSON_EXTRACT(trim(conv_column), '$.Project.GeoMarketInfo.SubGeo')) sub_gm
Would this work for you?
WITH
T AS (
SELECT
'1000149.04.14' AS job_id,
'{"Project":{"OpsLocationInfo":{"iDistrictId":"A"},"GeoMarketInfo":{"Geo":"B","SubGeo":"C"}}}' AS conv_column
)
SELECT
JSON_EXTRACT_SCALAR(conv_column, '$.Project.OpsLocationInfo.iDistrictId') AS dist,
JSON_EXTRACT_SCALAR(conv_column, '$.Project.GeoMarketInfo.Geo') AS gm,
JSON_EXTRACT_SCALAR(conv_column, '$.Project.GeoMarketInfo.SubGeo') AS sub_gm
FROM
T
BigQuery JSON Functions docs:
https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions
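On the question of which job_id holds the problem value: "Unexpected end of JSON input" is the message JavaScript's JSON.parse gives for empty or truncated input, so one option is to check the strings outside the UDF. The following is a rough client-side sketch in Python, assuming the (job_id, JSON text) pairs have already been exported from the table; the function name find_unparseable and the second and third sample rows are hypothetical.

```python
import json

def find_unparseable(rows):
    """Return (job_id, error message) for every row whose JSON fails to parse."""
    bad = []
    for job_id, text in rows:
        try:
            # Treat NULL like an empty string; both fail to parse.
            json.loads((text or '').strip())
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            bad.append((job_id, str(exc)))
    return bad

sample = [
    ('1000149.04.14', '{"Project":{"GeoMarketInfo":{"Geo":"B"}}}'),
    ('hypothetical-2', '{"Project":{"GeoMarketInfo":{"Geo":"B"}'),  # truncated
    ('hypothetical-3', ''),  # empty string
]
for job_id, message in find_unparseable(sample):
    print(job_id, message)
```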
How can I read multiple arrays in an object in JSON without using unnest?
Can you explain your comment better with an input sample?

Maximo/GIS spatial query

I have a work order in Maximo 7.6.1.1:
The WO has LatitudeY and LongitudeX coordinates in the Service Address tab.
The WO has a custom zone field.
And there is a feature class (polygons) in a separate GIS database.
I want to do a spatial query to return an attribute from the polygon record that the WO intersects and use it to populate zone in the WO.
How can I do this?
Related keyword: Maximo Spatial
Doing this live in Maximo is possible with an automation script, or by writing custom code into Spatial (more challenging). You want to use the /MapServer/identify tool and post the geometry xy, the coordinate system, and the layer you want to query.
You will have to format the geometry object correctly and test your post from the identify window. I usually grab the post from the network section of the browser's developer tools once I get it to work, change the output format to JSON, and use it in my code.
You may actually not need to touch your Maximo environment at all. How about just using a trigger on your work orders table? That trigger can then automatically fill the zone ID from a simple select statement that matches x and y with the zones in the zones table. Here is what that could look like.
This assumes that your work orders are in a table like this:
create table work_orders (
wo_id number primary key,
x number,
y number,
zone_id number
);
and the zones in a table like this:
create table zones (
zone_id number primary key,
shape st_geometry
);
Then the trigger would be like this
create or replace trigger work_orders_fill_zone
before insert or update of x,y on work_orders
for each row
begin
select zone_id
into :new.zone_id
from zones
where sde.st_contains (shape, sde.st_point (:new.x, :new.y, 4326) ) = 1;
end;
/
Some assumptions:
The x and y columns contain coordinates in WGS84 longitude/latitude (not in some projection or some other long/lat coordinate system)
Zones don't overlap: a work order point is always therefore in one and only one zone. If not, then the query may return multiple results, which you then need to handle.
Zones fully cover the territory your work orders can take place in. If a work order location can be outside all your zones, then you also need to handle that (the query would return no result).
The x and y columns are always filled. If they are optional, then you also need to handle that case (set zone_id to NULL if either x or y is NULL)
After that, each time a new work order is inserted in the work_orders table, the zone_id column will be automatically updated.
You can initialize zone_id in your existing work orders with a simple update:
update work_orders set x=x, y=y;
This will make the trigger run for each row in the table ... It may take some time to complete if the table is large.
Adapt the code in the Library Scripts section of Maximo 76 Scripting Features (pdf):
#What the script does:
# 1. Takes the X&Y coordinates of a work order in Maximo
# 2. Generates a URL from the coordinates
# 3. Executes the URL via a separate script/library (LIB_HTTPCLIENT)
# 4. Performs a spatial query in an ESRI REST feature service (a separate GIS system)
# 5. Returns JSON text to Maximo with the attributes of the zone that the work
# order intersected
# 6. Parses the zone number from the JSON text
# 7. Inserts the zone number into the work order record
from psdi.mbo import MboConstants
from java.util import HashMap
from com.ibm.json.java import JSONObject
field_to_update = "ZONE"
gis_field_name = "ROADS_ZONE"
def get_coords():
    """
    Get the y and x coordinates (UTM projection) from the WOSERVICEADDRESS table
    via the SERVICEADDRESS system relationship.
    The datatype of the LatitudeY and LongitudeX fields is decimal.
    """
    laty = mbo.getDouble("SERVICEADDRESS.LatitudeY")
    longx = mbo.getDouble("SERVICEADDRESS.LongitudeX")
    #Test values
    #laty = 4444444.7001941890
    #longx = 666666.0312127020
    return laty, longx
def is_latlong_valid(laty, longx):
    #Verify if the numbers are legitimate UTM coordinates
    return (4000000 <= laty <= 5000000 and
            600000 <= longx <= 700000)
def make_url(laty, longx, gis_field_name):
    """
    Assembles the URL (including the longx and the laty).
    Note: The coordinates are flipped in the url.
    """
    url = (
        "http://hostname.port"
        "/arcgis/rest/services/Example"
        "/Zones/MapServer/15/query?"
        "geometry={0}%2C{1}&"
        "geometryType=esriGeometryPoint&"
        "spatialRel=esriSpatialRelIntersects&"
        "outFields={2}&"
        "returnGeometry=false&"
        "f=pjson"
    ).format(longx, laty, gis_field_name)
    return url
def fetch_zone(url):
    # Get the JSON text from the feature service (the JSON text contains the zone value).
    ctx = HashMap()
    ctx.put("url", url)
    service.invokeScript("LIBHTTPCLIENT", ctx)
    json_text = str(ctx.get("response"))
    # Parse the zone value from the JSON text
    obj = JSONObject.parse(json_text)
    parsed_val = obj.get("features")[0].get("attributes").get(gis_field_name)
    return parsed_val
try:
    laty, longx = get_coords()
    if not is_latlong_valid(laty, longx):
        service.log('Invalid coordinates')
    else:
        url = make_url(laty, longx, gis_field_name)
        zone = fetch_zone(url)
        #Insert the zone value into the zone field in the work order
        mbo.setValue(field_to_update, zone, MboConstants.NOACCESSCHECK)
        service.log(zone)
except:
    #If the script fails, then set the field value to null.
    mbo.setValue(field_to_update, None, MboConstants.NOACCESSCHECK)
    service.log("An exception occurred")
LIBHTTPCLIENT: (a reusable Jython library script)
from psdi.iface.router import HTTPHandler
from java.util import HashMap
from java.lang import String
handler = HTTPHandler()
map = HashMap()
map.put("URL", url)
map.put("HTTPMETHOD", "GET")
responseBytes = handler.invoke(map, None)
response = String(responseBytes, "utf-8")
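The parsing step in fetch_zone indexes the first feature directly, so a point that falls outside every polygon would only be caught by the broad except clause. Below is a minimal sketch of a parse that returns None in that case instead. It is written in plain Python with the standard json module rather than com.ibm.json.java, so it illustrates the logic and is not a drop-in Maximo script; the function name parse_zone and the two sample response bodies are hypothetical, shaped like the response that fetch_zone reads.

```python
import json

def parse_zone(json_text, field_name):
    """Return the field from the first feature, or None if nothing matched."""
    response = json.loads(json_text)
    features = response.get('features') or []
    if not features:
        # The point did not intersect any polygon.
        return None
    return features[0].get('attributes', {}).get(field_name)

# Hypothetical response bodies: one intersecting zone, and no match.
hit = '{"features":[{"attributes":{"ROADS_ZONE":"3"}}]}'
miss = '{"features":[]}'
print(parse_zone(hit, 'ROADS_ZONE'))
print(parse_zone(miss, 'ROADS_ZONE'))
```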

DynamoDB JSON response parsing prints vertically

I have a script that scans a DynamoDB table that stores my instance IDs. Then I try to query another table to see if it also has that same instance and get all of the metadata attributes in a master table. When I iterate through the query using the instance ID from the initial scan of the first table, I am noticing each character of the instance id string is being printed to a new line, instead of the entire string on one line. I am confused how to fix this. Below is my code, sample output, and the expected output.
CODE:
import boto3
import json
from boto3.dynamodb.conditions import Key, Attr
def table_diff():
    dynamo = boto3.client('dynamodb')
    dynamodb = boto3.resource('dynamodb')
    table_missing = dynamodb.Table('RunningInstances')
    missing_response = dynamo.scan(TableName='CWPMissingAgent')
    for instances in missing_response['Items']:
        instance_id = instances['missing_instances']['S']
        # This works how I want, prints i-xxxxx
        print(instance_id)
        for id in instance_id:
            # This does not print how I want (vertically)
            print(id)
            query_response = table_missing.query(KeyConditionExpression=Key('ID').eq(id))
OUTPUT:
i
-
x
x
x
x
x
EXPECTED OUTPUT:
i-xxxxx
etc etc
instance_id is a string. Thus, when you loop over it (for id in instance_id), you are actually looping over each character in the string, and printing them out individually.
Why do you try to loop over it, when you say that just printing it produces the correct result?
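A small sketch of the difference, using a hypothetical list shaped like the 'Items' in the scan response (no AWS calls are made here):

```python
# Hypothetical scan result shaped like the 'Items' list in the question.
items = [
    {'missing_instances': {'S': 'i-xxxxx'}},
    {'missing_instances': {'S': 'i-yyyyy'}},
]

# Iterating over a string yields one character at a time.
chars = [ch for ch in 'i-xxxxx']

# Using the string directly, without an inner loop, keeps each ID whole.
instance_ids = [item['missing_instances']['S'] for item in items]
for instance_id in instance_ids:
    print(instance_id)
```

With the inner loop removed, the query in the question would receive the whole string, as in Key('ID').eq(instance_id).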

Apache Drill: Convert JSON as String to JSON object to retrieve each element

I have the below string in a column in a Hive table which I am trying to query using Apache Drill:
{"cdrreasun":"52","cdxscarc":"20150407161405","cdrend":"20150407155201","cdrdnrar.1un":"24321.70","servlnqlp":"54.201.25.50","men":"42403","xa:lnqruup":"3","cemcau":"120","accuuncl":"21","cdrc:
5","volcuca":"1.7"}
I want to retrieve all values for the key cdrreasun using Apache Drill SQL.
Can't use FLATTEN on the column as it says Flatten does not work with inputs of non-list types.
Can't use KVGEN as well as it works only with MAP datatype.
Drill has function convert_fromJSON which allows converting from String to JSON object. For more details about this function and examples of usage please see https://drill.apache.org/docs/data-type-conversion/#convert_to-and-convert_from
For the example you specified, you can run
convert_fromJSON(colWithJsonText)['cdrreasun']
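As an analogy only, the same parse-then-look-up idea looks like this in Python. This is not Drill code, and the sample string is a shortened, hypothetical version of the column value above, keeping only a few of its keys:

```python
import json

# Shortened stand-in for the column value; the full string above is truncated.
col_with_json_text = '{"cdrreasun":"52","cdrend":"20150407155201","volcuca":"1.7"}'

# Parse the string into a map, then look the key up by name.
parsed = json.loads(col_with_json_text)
value = parsed['cdrreasun']
print(value)
```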
I figured it out, hope it will be helpful for others.
We have to do it in 3 steps if the datatype is of type MAP:
KVGEN() -> FLATTEN() -> convert_from()
If it's of type STRING then KVGEN() function is not needed.
SELECT ratinggrouplist
,t3.cdrlist3.cdrreason AS cdrreason
,t3.cdrlist3.cdrstart AS cdrstart
,t3.cdrlist3.cdrend AS cdrend
,t3.cdrlist3.cdrduration AS cdrduration
FROM (
SELECT ratinggrouplist, convert_from(t2.cdrlist2.`element`, 'JSON') AS cdrlist3
FROM (
SELECT ratinggrouplist ,flatten(t1.cdrlist1.`value`) AS cdrlist2
FROM (
SELECT ratinggrouplist, kvgen(cdrlist) AS cdrlist1
FROM dfs.tmp.SOME_TABLE
) AS t1
) AS t2
) AS t3;

Django / PostgreSQL jsonb (JSONField) - convert select and update into one query

Versions: Django 1.10 and Postgres 9.6
I'm trying to modify a nested JSONField's key in place without a roundtrip to Python. The reason is to avoid race conditions and multiple queries overwriting the same field with different updates.
I tried to chain the methods in the hope that Django would make a single query but it's being logged as two:
Original field value (demo only, real data is more complex):
from exampleapp.models import AdhocTask
record = AdhocTask.objects.get(id=1)
print(record.log)
> {'demo_key': 'original'}
Query:
from django.db.models import F
from django.db.models.expressions import RawSQL
(AdhocTask.objects.filter(id=25)
    .annotate(temp=RawSQL(
        # `jsonb_set` gets the current JSON value of the `log` field,
        # takes the nominated key ("demo_key" in this example)
        # and replaces the value with the JSON provided ("new value").
        # Raw SQL is wrapped in triple quotes to avoid escaping each quote
        """jsonb_set(log, '{"demo_key"}','"new value"', false)""",[]))
    # Finally, get the temp field and overwrite the original JSONField
    .update(log=F('temp'))
)
Query history (shows this as two separate queries):
from django.db import connection
print(connection.queries)
> {'sql': 'SELECT "exampleapp_adhoctask"."id", "exampleapp_adhoctask"."description", "exampleapp_adhoctask"."log" FROM "exampleapp_adhoctask" WHERE "exampleapp_adhoctask"."id" = 1', 'time': '0.001'},
> {'sql': 'UPDATE "exampleapp_adhoctask" SET "log" = (jsonb_set(log, \'{"demo_key"}\',\'"new value"\', false)) WHERE "exampleapp_adhoctask"."id" = 1', 'time': '0.001'}]
It would be much nicer without RawSQL.
Here's how to do it:
from django.db.models.expressions import Func
class ReplaceValue(Func):
    function = 'jsonb_set'
    template = "%(function)s(%(expressions)s, '{\"%(keyname)s\"}','\"%(new_value)s\"', %(create_missing)s)"
    arity = 1
    def __init__(
        self, expression: str, keyname: str, new_value: str,
        create_missing: bool=False, **extra,
    ):
        super().__init__(
            expression,
            keyname=keyname,
            new_value=new_value,
            create_missing='true' if create_missing else 'false',
            **extra,
        )
AdhocTask.objects.filter(id=25) \
    .update(log=ReplaceValue(
        'log',
        keyname='demo_key',
        new_value='another value',
        create_missing=False,
    ))
ReplaceValue.template is the same as your raw SQL statement, just parametrized.
(jsonb_set(log, \'{"demo_key"}\',\'"another value"\', false)) from your query is now jsonb_set("exampleapp.adhoctask"."log", \'{"demo_key"}\',\'"another value"\', false). The parentheses are gone (you can get them back by adding it to the template) and log is referenced in a different way.
Anyone interested in more details regarding jsonb_set should have a look at table 9-45 in postgres' documentation: https://www.postgresql.org/docs/9.6/static/functions-json.html#FUNCTIONS-JSON-PROCESSING-TABLE
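For readers who want to see what the create_missing flag changes, here is a rough pure-Python model of jsonb_set restricted to nested dicts. It is only an illustration of the documented behaviour (a missing parent leaves the value unchanged, and a missing final key is added only when create_missing is true); it does not handle arrays and is not what PostgreSQL or Django actually run. The function name jsonb_set_sketch is hypothetical.

```python
import copy

def jsonb_set_sketch(target, path, new_value, create_missing=True):
    """Rough dict-only model of jsonb_set; returns a new object."""
    result = copy.deepcopy(target)
    if not path:
        return result
    node = result
    for key in path[:-1]:
        if not isinstance(node, dict) or key not in node:
            return result  # a missing parent leaves the value unchanged
        node = node[key]
    if not isinstance(node, dict):
        return result
    last = path[-1]
    # Replace an existing key, or add a new one only if allowed.
    if last in node or create_missing:
        node[last] = new_value
    return result

log = {'demo_key': 'original'}
print(jsonb_set_sketch(log, ['demo_key'], 'another value', False))
print(jsonb_set_sketch(log, ['other_key'], 'x', False))
print(jsonb_set_sketch(log, ['other_key'], 'x', True))
```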
Rubber duck debugging at its best - in writing the question, I've realised the solution. Leaving the answer here in hope of helping someone in future:
Looking at the queries, I realised that the RawSQL was actually being deferred until query two, so all I was doing was storing the RawSQL as a subquery for later execution.
Solution:
Skip the annotate step altogether and use the RawSQL expression straight in the .update() call. This allows you to dynamically update PostgreSQL jsonb sub-keys on the database server without overwriting the whole field:
(AdhocTask.objects.filter(id=25)
    .update(log=RawSQL(
        """jsonb_set(log, '{"demo_key"}','"another value"', false)""",[])
    )
)
> 1 # Success
print(connection.queries)
> {'sql': 'UPDATE "exampleapp_adhoctask" SET "log" = (jsonb_set(log, \'{"demo_key"}\',\'"another value"\', false)) WHERE "exampleapp_adhoctask"."id" = 1', 'time': '0.001'}]
print(AdhocTask.objects.get(id=1).log)
> {'demo_key': 'another value'}