Apache Drill: Convert JSON as String to JSON object to retrieve each element - apache-drill

I have the string below in a column of a Hive table, which I am trying to query using Apache Drill:
{"cdrreasun":"52","cdxscarc":"20150407161405","cdrend":"20150407155201","cdrdnrar.1un":"24321.70","servlnqlp":"54.201.25.50","men":"42403","xa:lnqruup":"3","cemcau":"120","accuuncl":"21","cdrc:
5","volcuca":"1.7"}
I want to retrieve all values for the key cdrreasun using Apache Drill SQL.
I can't use FLATTEN on the column, because Drill reports that Flatten does not work with inputs of non-list types.
I can't use KVGEN either, because it works only with the MAP data type.

Drill has a function, convert_fromJSON, that converts a string to a JSON object. For details and usage examples, see https://drill.apache.org/docs/data-type-conversion/#convert_to-and-convert_from
For the example you gave, you can run:
convert_fromJSON(colWithJsonText)['cdrreasun']
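For readers who want to see the same idea outside Drill, here is a minimal Python sketch: parse the string into an object, then look up one key. The sample string is a shortened, hypothetical stand-in for the column value (the string in the question is truncated), and the variable names are illustrative only.

```python
import json

# Hypothetical, shortened stand-in for the column value in the question.
col_with_json_text = '{"cdrreasun": "52", "cemcau": "120", "volcuca": "1.7"}'

# Parse the text into a dict, roughly what convert_fromJSON does in Drill.
record = json.loads(col_with_json_text)

# Look up a single key, as ['cdrreasun'] does in the Drill expression.
cdrreasun = record["cdrreasun"]
print(cdrreasun)
```

This only illustrates the string-to-object step; it says nothing about how Drill evaluates the expression internally.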

I figured it out; I hope it will be helpful for others.
If the column is of type MAP, it takes three steps:
KVGEN() -> FLATTEN() -> convert_from()
If it is of type STRING, the KVGEN() step is not needed.
SELECT ratinggrouplist
    ,t3.cdrlist3.cdrreason AS cdrreason
    ,t3.cdrlist3.cdrstart AS cdrstart
    ,t3.cdrlist3.cdrend AS cdrend
    ,t3.cdrlist3.cdrduration AS cdrduration
FROM (
    SELECT ratinggrouplist, convert_from(t2.cdrlist2.`element`, 'JSON') AS cdrlist3
    FROM (
        SELECT ratinggrouplist, flatten(t1.cdrlist1.`value`) AS cdrlist2
        FROM (
            SELECT ratinggrouplist, kvgen(cdrlist) AS cdrlist1
            FROM dfs.tmp.SOME_TABLE
        ) AS t1
    ) AS t2
) AS t3;

Related

Unexpected end of JSON input at undefined line XXXX, columns xx-xx while reading in BigQuery

I have a table in BigQuery with two columns: job_id and json_column (a string in JSON format). When I try to read the data and extract some objects, I get the error below:
SyntaxError:Unexpected end of JSON input at undefined line XXXX, columns xx-xx
The first time it reports line 5931, and when I run the query again it reports line 6215.
If this is a JSON structure issue, how can I find out which row/job_id line 5931 corresponds to? If I filter on a specific job_id, the query returns values, but when I run it on the complete table I get this error. I looked at the job_ids at the reported row numbers, and the code works fine for those job_ids.
Do you think it is a JSON structure issue, and how can I identify which row/job_id has the problem?
Code:
CREATE TEMPORARY FUNCTION CUSTOM_JSON_EXTRACT(json STRING, json_path STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
  var result = jsonPath(JSON.parse(json), json_path);
  if (result) { return result; }
  else { return []; }
"""
OPTIONS (
  library="gs://json_temp/jsonpath-0.8.0.js"
);
SELECT job_id, dist, gm, sub_gm
FROM lz_fdp_op.fdp_json_file,
  UNNEST(CUSTOM_JSON_EXTRACT(trim(conv_column), '$.Project.OpsLocationInfo.iDistrictId')) dist,
  UNNEST(CUSTOM_JSON_EXTRACT(trim(conv_column), '$.Project.GeoMarketInfo.Geo')) gm,
  UNNEST(CUSTOM_JSON_EXTRACT(trim(conv_column), '$.Project.GeoMarketInfo.SubGeo')) sub_gm
Would this work for you?
WITH
  T AS (
    SELECT
      '1000149.04.14' AS job_id,
      '{"Project":{"OpsLocationInfo":{"iDistrictId":"A"},"GeoMarketInfo":{"Geo":"B","SubGeo":"C"}}}' AS conv_column
  )
SELECT
  JSON_EXTRACT_SCALAR(conv_column, '$.Project.OpsLocationInfo.iDistrictId') AS dist,
  JSON_EXTRACT_SCALAR(conv_column, '$.Project.GeoMarketInfo.Geo') AS gm,
  JSON_EXTRACT_SCALAR(conv_column, '$.Project.GeoMarketInfo.SubGeo') AS sub_gm
FROM
  T
BigQuery JSON Functions docs:
https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions
How can I read multiple arrays in an object in JSON without using UNNEST?
Can you clarify your comment with a sample input?
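On the original question of finding which job_id holds the malformed JSON: one option, assuming the (job_id, conv_column) pairs can be exported or iterated client-side, is to try parsing each value and collect the failures. The sketch below is a hypothetical helper using only the Python standard library; the function name and sample rows are made up for illustration. An empty or truncated string is one plausible culprit, since JavaScript's JSON.parse('') typically reports "Unexpected end of JSON input".

```python
import json

def find_bad_json_rows(rows):
    """Return (job_id, error message) for every row whose JSON text fails to parse."""
    bad = []
    for job_id, text in rows:
        try:
            json.loads(text.strip())
        except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
            bad.append((job_id, str(exc)))
    return bad

# Hypothetical sample rows: (job_id, conv_column)
sample_rows = [
    ("1000149.04.14", '{"Project": {"GeoMarketInfo": {"Geo": "B"}}}'),
    ("1000150.01.01", '{"Project": {"GeoMarketInfo": {"Geo": "B"}'),  # truncated
    ("1000151.02.02", ""),  # empty string
]

for job_id, message in find_bad_json_rows(sample_rows):
    print(job_id, message)
```

Python's parser and the JavaScript engine behind BigQuery UDFs may not agree on every edge case, so treat the output as a list of candidates to inspect rather than a definitive diagnosis.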

How to filter a value in json field type on Postgres 9.2?

My json field data is like this:
{"active":true,"id":"xxxxxxxxxxxxxxxxx","settings":{"secret":"xxxxxxxxxxxxxxxxxxxxxxxxxx","token":"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx","expires":"2019-12-16 01:11:23"},"plan":"Sample"}
Then I tried to query the field like this:
select * from integrations.accounts where field -> 'id' = 'xxxxxx';
But it gives me an error of:
SQL Error [42883]: ERROR: operator does not exist: json -> unknown
I found that the arrow operator (->) is not supported in version 9.2:
Unsupported versions: 9.3 / 9.2
Is there any alternative way to do this?
Since there is no real JSON support in your version (upgrading is strongly recommended!), you have to parse the JSON text manually in some way, for example with a regular expression:
SELECT
(regexp_matches(my_json_value::text, '"id":"(.*?)"'))[1]
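To see what this pattern captures, here is a small Python sketch that applies the same regular expression with the standard re module. The helper name and the sample value are hypothetical; only the pattern itself comes from the query above.

```python
import re

# Same pattern as in the SQL above: lazily capture the text between the quotes after "id":
ID_PATTERN = re.compile(r'"id":"(.*?)"')

def extract_id(json_text):
    """Return the first captured id value, or None when the pattern does not match."""
    match = ID_PATTERN.search(json_text)
    return match.group(1) if match else None

# Hypothetical sample with the same shape as the data in the question.
sample = '{"active":true,"id":"abc123","settings":{"token":"t0k3n"},"plan":"Sample"}'
print(extract_id(sample))
```

The approach is brittle: the pattern does not match if there is whitespace after the colon, and it would also match an "id" key nested deeper in the document, which is one more reason to upgrade to a version with real JSON operators.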
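On PostgreSQL 9.3 and later, where these operators are available:
-> returns a JSON value and is used for traversing the nested JSON
->> returns text and is used at the level you want to compare
In the above example, if you want to query on token, which is inside settings, you can use the following command: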
select * from integrations.accounts where field -> 'settings' ->> 'token' = 'xxxxxx';
To query on id, which is at the top level, use the following command:
select * from integrations.accounts where field ->> 'id' = 'xxxxxx';

Formatting BigQuery query to ML appropriate JSON to Pass through ML Predict

Using Python 2.7, I want to pass a query result from BigQuery to ML Predict, which requires a specific request format.
First: Is there an easier way to go directly from the BigQuery query to JSON in the correct format, so it can be passed to requests.post() instead of going through pandas (from what I understand, pandas is still not supported for GCP Standard)?
Second: Is there a way to construct the query so that it goes directly to a JSON format, and then modify the JSON to reflect the ML Predict JSON requirements?
Currently my code looks like this:
#I used the bigquery to dataframe option here to view the output.
#I would like to not use pandas in the end code.
logs = log_data.execute(output_options=bq.QueryOutput.dataframe()).result()
data = logs.to_json(orient='index')
print data
'{"0":{"end_time":"2018-04-19","device":"iPad","device_os":"iOS","device_os_version":"5.1.1","latency":0.150959,"megacycles":140.0,"cost":"1.3075e-08","device_brand":"Apple","device_family":"iPad","browser_version":"5.1","app":"567","ua_parse":"0"}}'
#The JSON needs to be in this format according to google documentation.
#data = {
#    'instances': [
#        {
#            'key':'',
#            'end_time': '2018-04-19',
#            'device': 'iPad',
#            'device_os': 'iOS',
#            'device_os_version': '5.1.1',
#            'latency': 0.150959,
#            'megacycles':140.0,
#            'cost':'1.3075e-08',
#            'device_brand':'Apple',
#            'device_family':'iPad',
#            'browser_version':'5.1',
#            'app':'567',
#            'ua_parse':'40.9.8'
#        }
#    ]
#}
So all I would need to change is the leading key '0' to 'instances', and I should be all set to pass it into requests.post().
Is there a way to accomplish this?
Edit-Adding BigQuery query:
%%bq query --n log_data
WITH `my.table` AS (
SELECT ARRAY<STRUCT<end_time STRING, device STRING, device_os STRING, device_os_version STRING, latency FLOAT64, megacycles FLOAT64,
cost STRING, device_brand STRING, device_family STRING, browser_version STRING, app STRING, ua_parse STRING>>[] instances
)
SELECT TO_JSON_STRING(t)
FROM `my.table` AS t
WHERE end_time >='2018-04-19'
LIMIT 1
data = log_data.execute().result()
Thanks to @MikhailBerlyant, I have adjusted my query and code to look like this:
%%bq query --n log_data
SELECT [TO_JSON_STRING(t)] AS instance
FROM `yourproject.yourdataset.yourtable` AS t
WHERE end_time >='2018-04-19'
LIMIT 1
But when I run logs = log_data.execute().result(), I get a QueryResultsTable rather than a JSON string, which results in this error when passed to requests.post():
TypeError: QueryResultsTable job_zfVEiPdf2W6msBlT6bBLgMusF49E is not JSON serializable
Is there a way within execute() to just return the JSON?
First: Is there an easier way to go directly from the BigQuery query to JSON in the correct format
See example below
#standardSQL
WITH yourTable AS (
  SELECT ARRAY<STRUCT<id INT64, type STRING>>[(1, 'abc'), (2, 'xyz')] instances
)
SELECT TO_JSON_STRING(t)
FROM yourTable t
The result is in the format you asked for:
{"instances":[{"id":1,"type":"abc"},{"id":2,"type":"xyz"}]}
The above demonstrates the query and how it works.
In your real case, you should use something like the query below:
SELECT TO_JSON_STRING(t)
FROM `yourproject.yourdataset.yourtable` AS t
WHERE end_time >='2018-04-19'
LIMIT 1
hope this helps :o)
Update based on comments
SELECT [TO_JSON_STRING(t)] AS instance
FROM `yourproject.yourdataset.yourtable` t
WHERE end_time >='2018-04-19'
LIMIT 1
I wanted to add this in case someone has the same problem I had, or is at least stuck on where to go once they have the query.
I was able to write a function that formats the query result the way Google ML Predict expects, so it can be passed into requests.post(). This is most likely a horrible way to accomplish this, but I could not find a direct way to go from BigQuery to ML Predict in the correct format.
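# Assumption: gcb is the BigQuery client library; the original snippet does not show this import.
from google.cloud import bigquery as gcb

def logs(query):
    client = gcb.Client()
    query_job = client.query(query)
    # Column names, in the same order as the columns selected by the query.
    CSV_COLUMNS = 'end_time,device,device_os,device_os_version,latency,megacycles,cost,device_brand,device_family,browser_version,app,ua_parse'.split(',')
    for row in query_job.result():
        var = list(row)
        # Pair each column name with the corresponding value in the row.
        l1 = dict(zip(CSV_COLUMNS, var))
        l1.update({'key': ''})
        # ML Predict expects the rows under an 'instances' key.
        l2 = {'instances': [l1]}
        # The query uses LIMIT 1, so return after the first row.
        return l2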
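For the narrower question of renaming the leading '0' key to 'instances', another option is to reshape the JSON string that to_json(orient='index') produced, using only the standard json module. This is a minimal sketch, not a tested ML Predict client: the function name and the shortened sample string are hypothetical, and the empty 'key' field simply mirrors the target format shown in the question.

```python
import json

def to_instances(index_json, key_value=""):
    """Turn orient='index' JSON ({"0": {...}, "1": {...}}) into {'instances': [...]}."""
    rows = json.loads(index_json)
    instances = []
    # Sort by the numeric row label so the original row order is preserved.
    for label in sorted(rows, key=int):
        instance = dict(rows[label])
        instance["key"] = key_value  # empty 'key' field, as in the target format
        instances.append(instance)
    return {"instances": instances}

# Shortened, hypothetical version of the string printed in the question.
data = '{"0":{"end_time":"2018-04-19","device":"iPad","latency":0.150959}}'
payload = to_instances(data)
print(json.dumps(payload))
```

The resulting dict could then be serialized with json.dumps and sent as the request body; whether the prediction endpoint accepts it still depends on the model's expected input fields.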

Querying json in postgres

I have to extract data from a JSON file that contains spatial information. The content of this file is:
{"vertices":[{"lat":46.744628268759314,"lon":6.569952920654968},
{"lat":46.74441692818192,"lon":6.570487107359068},
{"lat":46.74426116111054,"lon":6.570355867853787},
{"lat":46.74447250168793,"lon":6.569821681149689}],
"name":"demo-field",
"cropType":"sugarbeet",
"cropPlantDistance":0.18000000715255737,
"rowDistance":0.5,"numberOfRows":[28,12,12],"seedingDate":"2016-08-17T07:39+00:00"}
I've created a table and then copied the content of this file into it:
create table field(data json);
COPY field(data) FROM '/home/guest-pc5/field.json';
Now I can query my data:
SELECT json_array_elements(data->'vertices') from field;
{"lat":46.744628268759314,"lon":6.569952920654968}
{"lat":46.74441692818192,"lon":6.570487107359068}
{"lat":46.74426116111054,"lon":6.570355867853787}
{"lat":46.74447250168793,"lon":6.569821681149689}
(4 rows)
The problem is that I can't use the result in that form. I would like to extract only the values of "lat" and "lon" and put them in the field table.
I've tried to use the function json_to_recordset, without success:
select * from json_to_recordset('[{"lat":46.744628268759314,"lon":6.569952920654968},{"lat":46.74441692818192,"lon":6.570487107359068},{"lat":46.74426116111054,"lon":6.570355867853787},{"lat":46.74447250168793,"lon":6.569821681149689}]') as (lat numeric, lon numeric);
ERROR: function json_to_recordset(unknown) does not exist
LINE 1: select * from json_to_recordset('[{"lat":46.744628268759314,...
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
You can use the JSON operator ->> to get the value you want as text out of the json_array_elements output. To make it easier, you can call json_array_elements in the FROM clause (which is a lateral call to a set-returning function):
SELECT
    f.data AS original_json,
    CAST((e.element->>'lat') AS numeric) AS lat,
    CAST((e.element->>'lon') AS numeric) AS lon
FROM
    field AS f,
    json_array_elements(f.data->'vertices') AS e(element);
With that, you can simply create a table (or use INSERT into an existing one):
CREATE TABLE coordinates AS
SELECT
    f.data AS original_json,
    CAST((e.element->>'lat') AS numeric) AS lat,
    CAST((e.element->>'lon') AS numeric) AS lon
FROM
    field AS f,
    json_array_elements(f.data->'vertices') AS e(element);
Note: LATERAL is implicit here, as the LATERAL keyword is optional for set-returning function calls, but you could make it explicit:
FROM
    field f
    CROSS JOIN LATERAL json_array_elements(f.data->'vertices') AS e(element);
Also, LATERAL is 9.3+ only, although you are certainly on at least that version, as you are using json_array_elements (also 9.3+ only).
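As a cross-check outside the database, the same extraction can be sketched in Python: parse the document, walk the vertices array, and collect one (lat, lon) pair per element, which corresponds to one output row of the query. The function name is hypothetical and the sample is a shortened copy of the file content from the question.

```python
import json

# Shortened copy of the file content from the question (two of the four vertices).
field_json = '''{"vertices":[{"lat":46.744628268759314,"lon":6.569952920654968},
{"lat":46.74441692818192,"lon":6.570487107359068}],
"name":"demo-field"}'''

def vertex_rows(data_text):
    """Return one (lat, lon) tuple per element of the 'vertices' array."""
    data = json.loads(data_text)
    return [(vertex["lat"], vertex["lon"]) for vertex in data["vertices"]]

coordinates = vertex_rows(field_json)
for lat, lon in coordinates:
    print(lat, lon)
```

Note that Python parses the numbers as floats, whereas the SQL above casts them to numeric, so the two can differ in precision for very long decimals.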

querying json object from table in postgreSQL

I want to use a WHERE condition on a JSON object in a PostgreSQL table. For example, I have a table 'test' with three columns: name (varchar), url (varchar), and more (json). I need to retrieve data where css21Colors = Purple.
The more column is of type json; its value is shown below.
What is the syntax for such a query?
more = {"colorTree":{"Purple":[{"Spanish Violet":"#522173"}],
"Brown":[{"Dark Puce":"#4e3347"}],"White":[{"White":"#ffffff"}],
"Black":[{"Eerie Black":"#1d0d27"}],"Gray":[{"Rose Quartz":"#a091a4"}]},
"sizeoutscount":0,"css21Colors":{"Purple":69,"Brown":5,"White":4,"Black":17,"Gray":3},
"sizeins": [],"sizeinscount":0,"sizeouts":[],"allsizes":["8","10","16"],
"css3Colors": {"Rose Quartz":3,"White":4,"Dark Puce":5,"Eerie Black":17,"Spanish Violet":69},
"hexColors":{"#522173":69,"#4e3347":5,"#ffffff":4,"#1d0d27":17,"#a091a4":3}}
SELECT more->'css21Colors'->'Purple' FROM test;
Additionally you can query only the rows containing that key.
SELECT
    more->'css21Colors'->'Purple'
FROM
    test
WHERE
    (more->'css21Colors')::jsonb ? 'Purple';
Note the cast to jsonb: the ? operator is defined only for jsonb, so consider switching the column to the jsonb data type.
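To make the filter concrete, here is a small Python sketch of the same logic as the WHERE clause above: keep only rows whose css21Colors object has a 'Purple' key, and return the value stored under it. The function name and the sample rows are hypothetical and only loosely modelled on the data in the question.

```python
import json

def purple_counts(rows):
    """Yield (name, value) for rows whose css21Colors object has a 'Purple' key."""
    for name, more_text in rows:
        colors = json.loads(more_text).get("css21Colors", {})
        if "Purple" in colors:  # same test as the ? 'Purple' key-existence check
            yield name, colors["Purple"]

# Hypothetical rows: (name, more)
rows = [
    ("a", '{"css21Colors": {"Purple": 69, "Brown": 5}}'),
    ("b", '{"css21Colors": {"White": 4}}'),
]
result = list(purple_counts(rows))
print(result)
```

In the database the filtering is better left to the query itself; this sketch is only meant to show which rows the key-existence test keeps.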