Unexpected end of JSON input at undefined line XXXX, columns xx-xx while reading JSON in BigQuery

I have a table in BigQuery with two columns: job_id and a STRING column containing JSON (conv_column in the query below). When I try to read the data and extract some objects, I get the error below:
SyntaxError:Unexpected end of JSON input at undefined line XXXX, columns xx-xx
The first run points at line 5931; the second time I execute the same query it points at line 6215.
If it's a JSON structure issue, how can I find which row/job_id that line number 5931 corresponds to? If I subset to a specific job_id, the query returns values, but when I execute it over the complete table I get this error. I tried looking at the job_ids at the row numbers mentioned, and the code works fine for those job_ids.
Do you think it's a JSON structure issue, and how can I identify which row/job_id causes it?
Code:
CREATE TEMPORARY FUNCTION CUSTOM_JSON_EXTRACT(json STRING, json_path STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
  var result = jsonPath(JSON.parse(json), json_path);
  if (result) { return result; }
  else { return []; }
"""
OPTIONS (
  library="gs://json_temp/jsonpath-0.8.0.js"
);

SELECT job_id, dist, gm, sub_gm
FROM lz_fdp_op.fdp_json_file,
  UNNEST(CUSTOM_JSON_EXTRACT(TRIM(conv_column), '$.Project.OpsLocationInfo.iDistrictId')) dist,
  UNNEST(CUSTOM_JSON_EXTRACT(TRIM(conv_column), '$.Project.GeoMarketInfo.Geo')) gm,
  UNNEST(CUSTOM_JSON_EXTRACT(TRIM(conv_column), '$.Project.GeoMarketInfo.SubGeo')) sub_gm

Would this work for you?
WITH T AS (
  SELECT
    '1000149.04.14' AS job_id,
    '{"Project":{"OpsLocationInfo":{"iDistrictId":"A"},"GeoMarketInfo":{"Geo":"B","SubGeo":"C"}}}' AS conv_column
)
SELECT
  JSON_EXTRACT_SCALAR(conv_column, '$.Project.OpsLocationInfo.iDistrictId') AS dist,
  JSON_EXTRACT_SCALAR(conv_column, '$.Project.GeoMarketInfo.Geo') AS gm,
  JSON_EXTRACT_SCALAR(conv_column, '$.Project.GeoMarketInfo.SubGeo') AS sub_gm
FROM T
BigQuery JSON Functions docs:
https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions
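To pinpoint which row breaks, one option is a small JavaScript UDF that merely tries to parse each value, since the error comes from JSON.parse inside the UDF. A minimal sketch using the table and column names from the question; IS_VALID_JSON is a hypothetical helper, not a built-in:

CREATE TEMPORARY FUNCTION IS_VALID_JSON(json STRING)
RETURNS BOOL
LANGUAGE js AS """
  // Return false for any value JSON.parse rejects.
  try { JSON.parse(json); return true; }
  catch (e) { return false; }
""";

SELECT job_id
FROM lz_fdp_op.fdp_json_file
WHERE NOT IS_VALID_JSON(TRIM(conv_column));

Any job_id this returns holds a string the JavaScript UDF cannot parse.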
How can I read multiple arrays in a JSON object without using UNNEST?
Could you explain your comment better with a sample input?

Related

Error parsing JSON: more than one document in the input (Redshift to Snowflake SQL)

I'm trying to convert a query from Redshift to Snowflake SQL.
The Redshift query looks like this:
SELECT
cr.creatives as creatives
, JSON_ARRAY_LENGTH(cr.creatives) as creatives_length
, JSON_EXTRACT_PATH_TEXT(JSON_EXTRACT_ARRAY_ELEMENT_TEXT (cr.creatives,0),'previewUrl') as preview_url
FROM campaign_revisions cr
The Snowflake query looks like this:
SELECT
cr.creatives as creatives
, ARRAY_SIZE(TO_ARRAY(ARRAY_CONSTRUCT(cr.creatives))) as creatives_length
, PARSE_JSON(PARSE_JSON(cr.creatives)[0]):previewUrl as preview_url
FROM campaign_revisions cr
It seems like JSON_EXTRACT_PATH_TEXT isn't converted correctly, as the Snowflake query results in this error:
Error parsing JSON: more than one document in the input
cr.creatives is formatted like this:
"[{""previewUrl"":""https://someurl.com/preview1.png"",""device"":""desktop"",""splitId"":null,""splitType"":null},{""previewUrl"":""https://someurl.com/preview2.png"",""device"":""mobile"",""splitId"":null,""splitType"":null}]"
It seems to me that you are not working with valid JSON data inside Snowflake.
Please review the file format used for your COPY INTO command.
If you open the "JSON" text provided in a text editor, note that it is not parsed or formatted as JSON because of the quoting you have. Once your issue with double quotes / escaped quotes is handled, you should be able to make good progress.
(Screenshot: proper JSON on the left, original data on the right.)
If you are not inclined to reload your data, see if you can create a Javascript User Defined Function to remove the quotes from your string, then you can use Snowflake to process the variant column.
The following is a working proof of concept in plain JavaScript that removes the doubled quotes for you.
var textOriginal = '[{""previewUrl"":""https://someurl.com/preview1.png"",""device"":""desktop"",""splitId"":null,""splitType"":null},{""previewUrl"":""https://someurl.com/preview2.png"",""device"":""mobile"",""splitId"":null,""splitType"":null}]';

function parseText(input) {
  // Collapse the doubled quotes, then parse the result as JSON.
  var a = input.replaceAll('""', '"');
  a = JSON.parse(a);
  return a;
}

x = parseText(textOriginal);
console.log(x);
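Wrapped as a Snowflake JavaScript UDF, the same idea could look like this (a sketch; FIX_QUOTES is a hypothetical name, and it assumes the doubled quotes are the only problem with the stored text):

CREATE OR REPLACE FUNCTION FIX_QUOTES(S STRING)
RETURNS VARIANT
LANGUAGE JAVASCRIPT
AS
$$
  // Collapse doubled quotes and parse the result into a VARIANT.
  return JSON.parse(S.replace(/""/g, '"'));
$$;

SELECT FIX_QUOTES(cr.creatives)[0]:previewUrl::STRING AS preview_url
FROM campaign_revisions cr;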
For anyone else seeing this doubled double-quote issue in JSON fields coming from CSV files in a Snowflake external stage (a slightly different issue than the original question posted):
The issue is likely that you need to use the FIELD_OPTIONALLY_ENCLOSED_BY setting, specifically FIELD_OPTIONALLY_ENCLOSED_BY = '"', when setting up your file format (see the Snowflake file format docs).
Example of creating such a file format:
create or replace file format mydb.myschema.my_tsv_file_format
type = CSV
field_delimiter = '\t'
FIELD_OPTIONALLY_ENCLOSED_BY = '"';
And example of querying from a stage using this file format:
select
  $1 field_one,
  $2 field_two
  -- ...and so on
from '@my_s3_stage/path/to/file/my_tab_separated_file.csv' (file_format => 'my_tsv_file_format')

How to INSERT JSON objects in an array in MySQL?

I am currently assigned to a task to transform our active non-normalized table into a normalized one. We decided to use database triggers to facilitate the bulk migration and subsequent data changes until we discontinue the old table.
Below are the structure and sample of our old table:
SELECT * FROM TabHmIds;
ID  EntitlementID  TabId  HmId
1   101            201    301
2   102            202    302
The required structure and sample of our new table should look like:
SELECT * FROM tab_integration;
id  tab_id  integration_id  metadata
1   201     1               { "paid_id": {"entitlement_id": 101, "id": 301} }
2   202     1               { "paid_id": {"entitlement_id": 102, "id": 302} }
The following is what I have done in my INSERT trigger so far:
CREATE TRIGGER tab_integration_after_insert AFTER INSERT ON `TabHmIds`
FOR EACH ROW
BEGIN
    DECLARE var_metadata JSON;
    DECLARE var_new_metadata JSON;
    DECLARE var_hm_metadata JSON;
    DECLARE var_integration_id INT(11);

    SELECT metadata, integration_id
    INTO var_metadata, var_integration_id
    FROM `go`.`tab_integration` gti
    WHERE gti.`tab_id` = NEW.`TabId`;

    SET var_hm_metadata = JSON_OBJECT('entitlement_id', NEW.`EntitlementId`, 'id', NEW.`HmId`);

    IF var_integration_id = 1 THEN
        IF var_metadata IS NULL THEN
            SET var_new_metadata = JSON_OBJECT('paid_id', var_hm_metadata);
        ELSE
            SET @paid_id = JSON_UNQUOTE(JSON_EXTRACT(var_metadata, '$.paid_id'));
            SET var_new_metadata = JSON_ARRAY_APPEND(var_metadata, '$.paid_id', var_hm_metadata);
        END IF;
    END IF;

    UPDATE `tab_integration` gti SET `metadata` = var_new_metadata WHERE `tab_id` = NEW.`TabId`;
END
However, what I get is this:
SELECT * FROM tab_integration;
id  tab_id  integration_id  metadata
1   201     1               { "paid_id": "{\"entitlement_id\": 101, \"id\": 301}" }
2   202     1               { "paid_id": "{\"entitlement_id\": 102, \"id\": 302}" }
From the table above, the JSON object is stored as a STRING. I am aware that JSON_OBJECT treats the passed value as a string, so I used JSON_UNQUOTE(JSON_EXTRACT(...)) to convert the paid_id path value to JSON, but it does not get parsed into JSON. I also tried JSON_MERGE_PRESERVE to put the JSON object under the paid_id path, but I end up getting:
{"paid_id": [], "entitlement_id": 101, "id": 301}
I also tried to put the JSON array into a temporary table using JSON_TABLE, modify the values in the temporary table, and convert that temporary table back to JSON using JSON_ARRAYAGG. But Workbench keeps saying I have an error in my syntax even though I directly copied examples from the web.
I also tried CASTing a well-formed string to JSON, but Workbench also throws a syntax error.
I have spent a week on resolving this data structure in MySQL.
Database scripting is not my strong suit and I am new to the JSON functions in MySQL. Thank you in advance to those who will reply.
In case it's needed: my server reports version 10.4.13-MariaDB (I'm using MySQL Workbench), but the script should also work on MySQL 5.7.
I found the answer to my problem.
Before inserting the new JSON data, I CAST it to CHAR and REPLACEd the characters that were keeping it a string. I did the following and it worked!
# After performing the needed JSON manipulations, cleanse the JSON string to JSON.
SET var_new_metadata = CAST(var_new_metadata AS CHAR);
SELECT REPLACE(REPLACE(REPLACE(var_new_metadata, '\\', ''), '\"\{', '{'), '\}\"', '}') INTO var_new_metadata;
After cleansing the data, the UPDATE call still worked; I tried some JSON manipulations afterwards and, yes, it still works!
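For context on why the quoting appears in the first place: JSON_OBJECT embeds a plain string as a quoted, escaped string, but nests a genuinely JSON-typed value as an object. On MariaDB, JSON is just an alias for LONGTEXT, which is also the likely reason the CAST to JSON raised a syntax error. A minimal sketch of the difference on MySQL 5.7+ (not MariaDB):

-- nested_json yields  {"paid_id": {"entitlement_id": 101, "id": 301}},
-- nested_string yields {"paid_id": "{\"entitlement_id\": 101, \"id\": 301}"}.
SELECT
  JSON_OBJECT('paid_id', CAST('{"entitlement_id": 101, "id": 301}' AS JSON)) AS nested_json,
  JSON_OBJECT('paid_id', '{"entitlement_id": 101, "id": 301}') AS nested_string;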

Azure ADF - Get field list of a .csv file from a Lookup activity

Context: Azure ADF. Brief process description:
Get a list of the fields defined in the first row of a .csv (blobbed) file. This is the first step: detect the fields.
The 2nd step would be a kind of comparison with the actual columns of a SQL table.
The 3rd is a stored procedure execution to perform the ALTER TABLE task, finishing with a (customized) table containing all fields needed to successfully load the .csv file into the SQL table.
To begin my ADF pipeline, I set up a Lookup activity that queries the first line of my blobbed file, with the "First row only" flag = ON. As a second pipeline activity, an Append Variable task, I would like to get all the .csv fields (first row) retrieved from the Lookup activity as a list.
Here is where it's getting to be a nightmare.
As far as I know, with dynamic content I can get an array with all values (with a format like {"field1_name":"field1_value_1st_row", "field2_name":"field2_value_1st_row", etc.})
with something like #activity('Lookup1').output.firstrow,
or any array element with #activity('Lookup1').output.firstrow.<element_name>,
but I can't figure out how to get a list of all field names (keys?) of the array.
I will appreciate any advice, many thanks!
I would keep the Lookup activity part, because it seems that you are familiar with it.
You could use an Azure Function HTTP trigger to get the key list of the firstrow JSON object. For example, your JSON object looks like this, as you mentioned in your question:
{"field1_name":"field1_value_1st_row", "field2_name":"field2_value_1st_row"}
Azure Function code:
module.exports = async function (context, req) {
    context.log('JavaScript HTTP trigger function processed a request.');
    // Collect the keys of the posted JSON object.
    var array = [];
    for (var key in req.body) {
        array.push(key);
    }
    context.res = {
        body: { "keyValue": array }
    };
};
Then use Azure Function Activity to get the output:
#activity('<AzureFunctionActivityName>').keyValue
Use Foreach Activity to loop the keyValue array:
#item()
Still based on the above sample input data, please refer to my sample code:
dct = {"field1_name": "field1_value_1st_row", "field2_name": "field2_value_1st_row"}
keys = []
for key in dct.keys():
    keys.append(key)
print(keys)
dicOutput = {"keys": keys}
print(dicOutput)
Have you considered doing this in an ADF data flow instead? You would map the incoming fields to a SQL dataset without a target schema, define a new table name in the dataset definition, and then map the incoming fields from your CSV to a new target table schema definition. ADF will write the rows to a new table using that file's schema.

Apache Drill: Convert JSON as String to JSON object to retrieve each element

I have the below string in a column in a Hive table which I am trying to query using Apache Drill:
{"cdrreasun":"52","cdxscarc":"20150407161405","cdrend":"20150407155201","cdrdnrar.1un":"24321.70","servlnqlp":"54.201.25.50","men":"42403","xa:lnqruup":"3","cemcau":"120","accuuncl":"21","cdrc:
5","volcuca":"1.7"}
I want to retrieve all values for the key cdrreasun using Apache Drill SQL.
I can't use FLATTEN on the column, as it says FLATTEN does not work with inputs of non-list types.
I can't use KVGEN either, as it works only with the MAP datatype.
Drill has the function convert_fromJSON, which converts a string to a JSON object. For more details about this function and examples of usage, please see https://drill.apache.org/docs/data-type-conversion/#convert_to-and-convert_from
For the example you specified, you can run
convert_fromJSON(colWithJsonText)['cdrreasun']
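Put together as a full statement, a minimal sketch (the table and column names are placeholders) could be:

SELECT convert_fromJSON(t.colWithJsonText)['cdrreasun'] AS cdrreasun
FROM dfs.tmp.SOME_TABLE t;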
I figured it out; hope it will be helpful for others.
We have to do it in 3 steps if the column is of type MAP:
KVGEN() -> FLATTEN() -> convert_from()
If it's of type STRING, the KVGEN() step is not needed.
SELECT ratinggrouplist
,t3.cdrlist3.cdrreason AS cdrreason
,t3.cdrlist3.cdrstart AS cdrstart
,t3.cdrlist3.cdrend AS cdrend
,t3.cdrlist3.cdrduration AS cdrduration
FROM (
SELECT ratinggrouplist, convert_from(t2.cdrlist2.`element`, 'JSON') AS cdrlist3
FROM (
SELECT ratinggrouplist ,flatten(t1.cdrlist1.`value`) AS cdrlist2
FROM (
SELECT ratinggrouplist, kvgen(cdrlist) AS cdrlist1
FROM dfs.tmp.SOME_TABLE
) AS t1
) AS t2
) AS t3;

Formatting a BigQuery query to ML-appropriate JSON to pass through ML Predict

Using Python 2.7, I want to pass a query from BigQuery to ML Predict, which has a specific formatting requirement.
First: Is there an easier way to go directly from the BigQuery query to JSON in the correct format so it can be passed to requests.post(), instead of going through pandas (from what I understand, pandas is still not supported for GCP Standard)?
Second: Is there a way to construct the query to go directly to a JSON format and then modify the JSON to reflect the ML Predict JSON requirements?
Currently my code looks like this:
#I used the bigquery to dataframe option here to view the output.
#I would like to not use pandas in the end code.
logs = log_data.execute(output_options=bq.QueryOutput.dataframe()).result()
data = logs.to_json(orient='index')
print data
'{"0":{"end_time":"2018-04-19","device":"iPad","device_os":"iOS","device_os_version":"5.1.1","latency":0.150959,"megacycles":140.0,"cost":"1.3075e-08","device_brand":"Apple","device_family":"iPad","browser_version":"5.1","app":"567","ua_parse":"0"}}'
#The JSON needs to be in this format according to google documentation.
#data = {
# 'instances': [
# {
# 'key':'',
# 'end_time': '2018-04-19',
# 'device': 'iPad',
# 'device_os': 'iOS',
# 'device_os_version': '5.1.1',
# 'latency': 0.150959,
# 'megacycles':140.0,
# 'cost':'1.3075e-08',
# 'device_brand':'Apple',
# 'device_family':'iPad',
# 'browser_version':'5.1',
# 'app':'567',
# 'ua_parse':'40.9.8'
# }
# ]
#}
So all I would need to change is the leading key '0' to 'instances' and I should be all set to pass it into requests.post().
Is there a way to accomplish this?
Edit - adding the BigQuery query:
%%bq query --n log_data
WITH `my.table` AS (
SELECT ARRAY<STRUCT<end_time STRING, device STRING, device_os STRING, device_os_version STRING, latency FLOAT64, megacycles FLOAT64,
cost STRING, device_brand STRING, device_family STRING, browser_version STRING, app STRING, ua_parse STRING>>[] instances
)
SELECT TO_JSON_STRING(t)
FROM `my.table` AS t
WHERE end_time >='2018-04-19'
LIMIT 1
data = log_data.execute().result()
Thanks to @MikhailBerlyant I have adjusted my query and code to look like this:
%%bq query --n log_data
SELECT [TO_JSON_STRING(t)] AS instance
FROM `yourproject.yourdataset.yourtable` AS t
WHERE end_time >='2018-04-19'
LIMIT 1
But when I run logs = log_data.execute().result() I get back a QueryResultsTable, which results in this error when passing it into requests.post():
TypeError: QueryResultsTable job_zfVEiPdf2W6msBlT6bBLgMusF49E is not JSON serializable
Is there a way within execute() to just return the JSON?
First: Is there an easier way to go directly from the BigQuery query to JSON in the correct format
See example below
#standardSQL
WITH yourTable AS (
  SELECT ARRAY<STRUCT<id INT64, type STRING>>[(1, 'abc'), (2, 'xyz')] instances
)
SELECT TO_JSON_STRING(t)
FROM yourTable t
with the result in the format you asked for:
{"instances":[{"id":1,"type":"abc"},{"id":2,"type":"xyz"}]}
The above demonstrates the query and how it works.
In your real case, you should use something like below:
SELECT TO_JSON_STRING(t)
FROM `yourproject.yourdataset.yourtable` AS t
WHERE end_time >='2018-04-19'
LIMIT 1
hope this helps :o)
Update based on comments
SELECT [TO_JSON_STRING(t)] AS instance
FROM `yourproject.yourdataset.yourtable` t
WHERE end_time >='2018-04-19'
LIMIT 1
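If you need the whole payload as one value, a variation of the same idea (just a sketch, reusing the hypothetical table from above) aggregates every row into the instances wrapper:

#standardSQL
SELECT CONCAT('{"instances":[', STRING_AGG(TO_JSON_STRING(t), ','), ']}') AS payload
FROM `yourproject.yourdataset.yourtable` t
WHERE end_time >= '2018-04-19'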
I wanted to add this in case someone has the same problem I had, or is at least stuck on where to go once you have the query.
I was able to write a function that formats the query results the way Google ML Predict wants them to be passed into requests.post(). This is most likely a horrible way to accomplish this, but I could not find a direct way to go from BigQuery to ML Predict in the correct format.
# Assumes the BigQuery client library is imported as gcb, e.g.:
# from google.cloud import bigquery as gcb

def logs(query):
    client = gcb.Client()
    query_job = client.query(query)
    CSV_COLUMNS = ('end_time,device,device_os,device_os_version,latency,'
                   'megacycles,cost,device_brand,device_family,'
                   'browser_version,app,ua_parse').split(',')
    for row in query_job.result():
        var = list(row)
        l1 = dict(zip(CSV_COLUMNS, var))
        l1.update({'key': ''})       # ML Predict expects a 'key' field
        l2 = {'instances': [l1]}     # wrap the row in the instances format
    return l2