How to get the output of a BigQuery query in a specific JSON format

I have a BigQuery table in this format:
DataProvider,Id,Name,Time
ABC,f8453e99-516f-4f15-a3bd-8749089b6934,"xyz",43200
ABC,f8453e99-516f-4f15-a3bd-8749089b6934,"123",43200
ABC,00453e99-516f-4f15-a3bd-8749089b6934,"xyz",43200
I want to generate the output in this JSON format:
{"dataProviderId":"ABC","items":[{"Id":"f8453e99-516f-4f15-a3bd-8749089b6934","data":[{"Name":"xyz","Time":43200},{"Name":"123","Time":43200}]},
{"Id":"00453e99-516f-4f15-a3bd-8749089b6934","data":[{"Name":"xyz","Time":43200}]}]}

From the CLI, you can use the bq command with the --format flag and pass prettyjson (an easy-to-read JSON format).
bq query --format=prettyjson --use_legacy_sql=false 'SELECT * FROM `project_id.dataset.table`' > output.json
The > at the end of the command redirects the output to a new file, so the query results end up in output.json.
I hope it helps.

Below is for BigQuery Standard SQL
#standardSQL
SELECT TO_JSON_STRING(t) json
FROM (
SELECT dataProvider, ARRAY_AGG(STRUCT(id, data)) items
FROM (
SELECT dataProvider, id, ARRAY_AGG(STRUCT(name, time)) data
FROM `project.dataset.table` t
GROUP BY dataProvider, id
)
GROUP BY dataProvider
) t
Applied to the sample data in your question, the output is
Row json
1 {"dataProvider":"ABC","items":[{"id":"f8453e99-516f-4f15-a3bd-8749089b6934","data":[{"name":"xyz","time":43200},{"name":"123","time":43200}]},{"id":"00453e99-516f-4f15-a3bd-8749089b6934","data":[{"name":"xyz","time":43200}]}]}
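For readers who want to see the two-level grouping outside SQL, here is a minimal Python sketch of what the nested ARRAY_AGG calls do: group by provider, then by id, and collect the name/time pairs. The `rows` list and the `nest` helper are illustrative names of my own, not anything BigQuery provides, and the first-seen ordering is a choice of this sketch (ARRAY_AGG without ORDER BY does not guarantee an order).

```python
import json

# Sample rows copied from the question: (DataProvider, Id, Name, Time)
rows = [
    ("ABC", "f8453e99-516f-4f15-a3bd-8749089b6934", "xyz", 43200),
    ("ABC", "f8453e99-516f-4f15-a3bd-8749089b6934", "123", 43200),
    ("ABC", "00453e99-516f-4f15-a3bd-8749089b6934", "xyz", 43200),
]


def nest(rows):
    """Group rows by provider, then by id, keeping first-seen order."""
    providers = {}  # provider -> {id -> [{"name": ..., "time": ...}, ...]}
    for provider, item_id, name, time in rows:
        items = providers.setdefault(provider, {})
        items.setdefault(item_id, []).append({"name": name, "time": time})
    return [
        {
            "dataProvider": provider,
            "items": [{"id": i, "data": d} for i, d in items.items()],
        }
        for provider, items in providers.items()
    ]


result = nest(rows)
for doc in result:
    # One compact JSON document per provider, like TO_JSON_STRING per row
    print(json.dumps(doc, separators=(",", ":")))
```

The inner dictionary plays the role of the inner GROUP BY dataProvider, id, and the list comprehension plays the role of the outer GROUP BY dataProvider.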

Related

Convert BigQuery rows to array of JSON

I want to convert all the rows of a BigQuery query output to an array of JSON.
For example: I want to convert the following output rows
Col1 Col2
ex1a ex1b
ex2a ex2b
Convert this to the following JSON:
[
{
"Col1":"ex1a",
"Col2":"ex1b"
},
{
"Col1":"ex2a",
"Col2":"ex2b"
}
]
Use the approach below:
select format('[%s]', string_agg(to_json_string(t)))
from your_table t
Applied to the sample data in your question, this yields the JSON array you asked for.
Another option (with same output) is
select to_json_string(array_agg(t))
from your_table t
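As a rough illustration of why the two queries give the same result, here is a small Python sketch. It assumes the rows have already been fetched as dictionaries (the `rows` list is made up from the sample data); `joined` mirrors the string_agg approach and `whole` mirrors serializing the aggregated array directly.

```python
import json

# Sample rows from the question, as they might look after fetching
rows = [
    {"Col1": "ex1a", "Col2": "ex1b"},
    {"Col1": "ex2a", "Col2": "ex2b"},
]

# Option 1: serialize each row, join with commas, wrap in brackets
joined = "[%s]" % ",".join(json.dumps(r, separators=(",", ":")) for r in rows)

# Option 2: serialize the whole list of rows in one go
whole = json.dumps(rows, separators=(",", ":"))

print(joined)
print(whole)
```

Both strings parse back to the same list, which is the property the two SQL variants share.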

How to extract JSON array stored as string in BigQuery

I have a JSON array that looks similar to this
[{"key":"Email","slug":"customer-email","value":"abc@gmail.com"},{"key":"Phone Number","slug":"mobile-phone-number","value":"123456789"},{"key":"First Name","slug":"first-name","value":"abc"},{"key":"Last Name","slug":"last-name","value":"xyz"},{"key":"Date of birth","slug":"date-of-birth","value":"01/01/1990"}]
The tricky part is that this array is stored as a string. So I think the first step would be to convert the string into an array, then unnest it, and then follow the method from a related question (How do I parse value from JSON array into columns in BigQuery).
I wonder whether this approach is workable. If so, the challenge I am having is converting the string into an array. If not, or if you know a more efficient method, please help. Thanks.
Have you tried json_extract_array?
select json_extract_array(
"""[{"key":"Email","slug":"customer-email","value":"abc@gmail.com"},{"key":"Phone Number","slug":"mobile-phone-number","value":"123456789"},{"key":"First Name","slug":"first-name","value":"abc"},{"key":"Last Name","slug":"last-name","value":"xyz"},{"key":"Date of birth","slug":"date-of-birth","value":"01/01/1990"}]""");
Below is for BigQuery Standard SQL
#standardSQL
SELECT
id,
JSON_EXTRACT_ARRAY(json_string) AS json_array
FROM `project.dataset.table`
Applied to the sample data from your question, as below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, '[{"key":"Email","slug":"customer-email","value":"abc@gmail.com"},{"key":"Phone Number","slug":"mobile-phone-number","value":"123456789"},{"key":"First Name","slug":"first-name","value":"abc"},{"key":"Last Name","slug":"last-name","value":"xyz"},{"key":"Date of birth","slug":"date-of-birth","value":"01/01/1990"}]' json_string
)
SELECT
id,
JSON_EXTRACT_ARRAY(json_string) AS json_array
FROM `project.dataset.table`
The output is:
Row id json_array
1 1 {"key":"Email","slug":"customer-email","value":"abc@gmail.com"}
{"key":"Phone Number","slug":"mobile-phone-number","value":"123456789"}
{"key":"First Name","slug":"first-name","value":"abc"}
{"key":"Last Name","slug":"last-name","value":"xyz"}
{"key":"Date of birth","slug":"date-of-birth","value":"01/01/1990"}
From this point, you can use the solution in "How do I parse value from JSON array into columns in BigQuery", which you referenced in your question.
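To make the two steps concrete (string to array, then array elements to columns), here is a minimal Python sketch. It uses a shortened copy of the sample array, and the names `json_string`, `elements` and `by_slug` are only illustrative; it mirrors the idea of JSON_EXTRACT_ARRAY followed by a pivot, not BigQuery's actual implementation.

```python
import json

# Shortened copy of the sample: a JSON array stored as a plain string
json_string = (
    '[{"key":"Phone Number","slug":"mobile-phone-number","value":"123456789"},'
    '{"key":"First Name","slug":"first-name","value":"abc"},'
    '{"key":"Last Name","slug":"last-name","value":"xyz"}]'
)

# Step 1: string -> array (roughly what JSON_EXTRACT_ARRAY gives you)
elements = json.loads(json_string)

# Step 2: array elements -> one "column" per slug
by_slug = {element["slug"]: element["value"] for element in elements}

print(len(elements))
print(by_slug["first-name"], by_slug["last-name"])
```

In SQL the second step would be an UNNEST over the extracted array plus a pivot on slug, which is what the referenced question covers.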

Loading Json Array File in Hive

I have a file which contains the data as follows
[{"col1":"col1","col2":1}
,{"col1":"col11","col2":11}
,{"col1":"col111","col2":2}
]
I am trying to load this file into a Hive table.
I am using the following Hive SerDe:
CREATE EXTERNAL TABLE my_table (
my_array ARRAY<struct<col1:string,col2:int>>
)ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES ( "ignore.malformed.json" = "true")
LOCATION "MY_LOCATION";
I get an error when I run select * after running the create command:
['*org.apache.hive.service.cli.HiveSQLException:java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected:25:24', 'org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:499', 'org.apache.hive.service.cli.operation.OperationManager:getOperationNextRowSet:OperationManager.java:307', 'org.apache.hive.service.cli.session.HiveSessionImpl:fetchResults:HiveSessionImpl.java:878', 'sun.reflect.GeneratedMethodAccessor29:invoke::-1', 'sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43', 'java.lang.reflect.Method:invoke:Method.java:498', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:78', 'org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:36', 'org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:63', 'java.security.AccessController:doPrivileged:AccessController.java:-2', 'javax.security.auth.Subject:doAs:Subject.java:422', 'org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1698', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:59', 'com.sun.proxy.$Proxy35:fetchResults::-1', 'org.apache.hive.service.cli.CLIService:fetchResults:CLIService.java:559', 'org.apache.hive.service.cli.thrift.ThriftCLIService:FetchResults:ThriftCLIService.java:751', 'org.apache.hive.service.rpc.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1717', 'org.apache.hive.service.rpc.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1702', 'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', 'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', 'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56', 'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286', 
'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1149', 'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:624', 'java.lang.Thread:run:Thread.java:748', '*java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected:29:4', 'org.apache.hadoop.hive.ql.exec.FetchOperator:getNextRow:FetchOperator.java:521', 'org.apache.hadoop.hive.ql.exec.FetchOperator:pushRow:FetchOperator.java:428', 'org.apache.hadoop.hive.ql.exec.FetchTask:fetch:FetchTask.java:147', 'org.apache.hadoop.hive.ql.Driver:getResults:Driver.java:2207', 'org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:494', '*org.apache.hadoop.hive.serde2.SerDeException:java.io.IOException: Start token not found where expected:30:1', 'org.apache.hive.hcatalog.data.JsonSerDe:deserialize:JsonSerDe.java:184', 'org.apache.hadoop.hive.ql.exec.FetchOperator:getNextRow:FetchOperator.java:502', '*java.io.IOException:Start token not found where expected:30:0', 'org.apache.hive.hcatalog.data.JsonSerDe:deserialize:JsonSerDe.java:170'], statusCode=3), results=None, hasMoreRows=None)
I tried several things, none of which worked as expected. I can't change the input data format as it is someone else who is providing the data.
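The file is valid JSON, but not in a layout this SerDe can read. org.apache.hive.hcatalog.data.JsonSerDe expects every row to be a single JSON object on one line, starting with a curly brace; a line that starts with "[" or "," produces the "Start token not found where expected" error. So change your JSON file to look something like below.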
{"my_array":[{"col1":"col1","col2":1},{"col1":"col11","col2":11},{"col1":"col111","col2":2}]}
Create your table in the exact same way as you are doing it already.
CREATE EXTERNAL TABLE my_table
(
my_array ARRAY<struct<col1:string,col2:int>>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES ( "ignore.malformed.json" = "true")
LOCATION "MY_LOCATION";
Now run select * on your newly created table to see the following result.
[{"col1":"col1","col2":1},{"col1":"col11","col2":11},{"col1":"col111","col2":2}]
Use select my_array.col1 from my_table; to see the values for col1 from your array.
["col1","col11","col111"]
PS - Not the most efficient way to store the data. Consider transforming the data and storing it as ORC/Parquet.
Hope that helps!
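Since the question says the input format cannot be changed at the source, the rewrite could be done as a pre-processing step before the file lands in the table location. Below is a minimal Python sketch under that assumption; `wrap_array` and `to_json_lines` are hypothetical helper names, and reading or writing the actual files is left out.

```python
import json


def wrap_array(text, field="my_array"):
    """Turn a top-level JSON array into a one-line object: {field: [...]}."""
    records = json.loads(text)
    return json.dumps({field: records}, separators=(",", ":"))


def to_json_lines(text):
    """Alternative: emit one JSON object per line, one line per array element."""
    records = json.loads(text)
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in records)


# The file content from the question, including its line breaks
raw = """[{"col1":"col1","col2":1}
,{"col1":"col11","col2":11}
,{"col1":"col111","col2":2}
]"""

wrapped = wrap_array(raw)
lines = to_json_lines(raw)
print(wrapped)
print(lines)
```

The wrapped form matches the table definition above (a single my_array column). The one-object-per-line form would instead need a table declared with col1 string and col2 int columns.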
It looks like the issue is with your JSON data. Can you try the example below?
Create an employee JSON file with the content below and place it in HDFS.
[root@quickstart spark]# hadoop fs -cat /user/cloudera/spark/employeejson/*
{"Name":"Vinayak","age":35}
{"Name":"Nilesh","age":37}
{"Name":"Raju","age":30}
{"Name":"Karthik","age":28}
{"Name":"Shreshta","age":1}
{"Name":"Siddhish","age":2}
Add the jar below (only needed if you get an error):
hive> ADD JAR /usr/lib/hive-hcatalog/lib/hive-hcatalog-core.jar;
hive>
CREATE TABLE employeefromjson(name string, age int)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/user/cloudera/hive/employeefromjson'
;
hive> LOAD DATA INPATH '/user/cloudera/spark/employeejson' OVERWRITE INTO TABLE employeefromjson;
hive> select * from employeefromjson;
OK
Vinayak 35
Nilesh 37
Raju 30
Karthik 28
Shreshta 1
Siddhish 2
Time taken: 0.174 seconds, Fetched: 6 row(s)
The SerDe expects each record to start with '{' and not with '['. That is the problem. Hive maps each top-level key of the JSON object to a table column, in the form {"key": value}; what you have in your file is a bare array value with no key. So change your JSON to the format below:
{"my_array":[{"col1":"col1","col2":1},{"col1":"col11","col2":11},{"col1":"col111","col2":2}]}
Your Create table statement should work fine.
If you want to get the data for each column for all the rows, use the below query.
select my_array.col1, my_array.col2 from my_table;
The above command will give you the below result.
OK
["col1","col11","col111"] [1,11,2]
If you want to get the result column-wise for each row separately, use the query below.
select a.* from my_table m lateral view outer inline (m.my_array) a;
The above command will give you the below result.
OK
col1 1
col11 11
col111 2
Hope this helps!
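For anyone unsure what the lateral view outer inline query does to the array, here is a small Python sketch of the same row expansion. The function name and the in-memory `table` are made up for illustration; the None-filled row for an empty array reflects my understanding of the OUTER keyword.

```python
def lateral_view_outer_inline(rows, column, fields):
    """Expand an array-of-structs column into one output row per struct.

    Rows whose array is missing or empty still yield one row of None values,
    which is the behaviour the OUTER keyword is meant to provide.
    """
    out = []
    for row in rows:
        structs = row.get(column) or []
        if not structs:
            out.append(tuple(None for _ in fields))
            continue
        for struct in structs:
            out.append(tuple(struct.get(f) for f in fields))
    return out


# One table row holding the whole array, as in the wrapped JSON file
table = [
    {"my_array": [
        {"col1": "col1", "col2": 1},
        {"col1": "col11", "col2": 11},
        {"col1": "col111", "col2": 2},
    ]}
]

expanded = lateral_view_outer_inline(table, "my_array", ["col1", "col2"])
for record in expanded:
    print(*record)
```

Each struct in the array becomes its own row, with the struct fields as columns.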

Parse JSON into U-SQL then convert to csv

I'm trying to convert some telemetry data that is in JSON format into CSV format, then write it out to a file, using U-SQL.
The problem is that some of the JSON key names have periods in them, so when I do the SELECT operation, U-SQL does not recognize them. When I check the output file, all I see is the values for "p1". How can I write the JSON key names in the script so that they are recognized? Thanks in advance for any help!
Code:
REFERENCE ASSEMBLY MATSDevDB.[Newtonsoft.Json];
REFERENCE ASSEMBLY MATSDevDB.[Microsoft.Analytics.Samples.Formats];
USING Microsoft.Analytics.Samples.Formats.Json;
@jsonDocuments =
EXTRACT jsonString string
FROM @"adl://xxxx.azuredatalakestore.net/xxxx/{*}/{*}/{*}/telemetry_{*}.json"
USING Extractors.Tsv(quoting:false);
@jsonify =
SELECT Microsoft.Analytics.Samples.Formats.Json.JsonFunctions.JsonTuple(jsonString) AS json
FROM @jsonDocuments;
@columnized = SELECT
json["EventInfo.Source"] AS EventInfoSource,
json["EventInfo.InitId"] AS EventInfoInitId,
json["EventInfo.Sequence"] AS EventInfoSequence,
json["EventInfo.Name"] AS EventInfoName,
json["EventInfo.Time"] AS EventInfoTime,
json["EventInfo.SdkVersion"] AS EventInfoSdkVersion,
json["AppInfo.Language"] AS AppInfoLanguage,
json["UserInfo.Language"] AS UserInfoLanguage,
json["DeviceInfo.BrowserName"] AS DeviceInfoBrowswerName,
json["DeviceInfo.BrowserVersion"] AS BrowswerVersion,
json["DeviceInfo.OsName"] AS DeviceInfoOsName,
json["DeviceInfo.OsVersion"] AS DeviceInfoOsVersion,
json["DeviceInfo.Id"] AS DeviceInfoId,
json["p1"] AS p1,
json["PipelineInfo.AccountId"] AS PipelineInfoAccountId,
json["PipelineInfo.IngestionTime"] AS PipelineInfoIngestionTime,
json["PipelineInfo.ClientIp"] AS PipelineInfoClientIp,
json["PipelineInfo.ClientCountry"] AS PipelineInfoClientCountry,
json["PipelineInfo.IngestionPath"] AS PipelineInfoIngestionPath,
json["AppInfo.Id"] AS AppInfoId,
json["EventInfo.Id"] AS EventInfoId,
json["EventInfo.BaseType"] AS EventInfoBaseType,
json["EventInfo.IngestionTime"] AS EventINfoIngestionTime
FROM @jsonify;
OUTPUT @columnized
TO "adl://xxxx.azuredatalakestore.net/poc/TestResult.csv"
USING Outputters.Csv(quoting : false);
JSON:
{"EventInfo.Source":"JS_default_source","EventInfo.Sequence":"1","EventInfo.Name":"daysofweek","EventInfo.Time":"2018-01-25T21:09:36.779Z","EventInfo.SdkVersion":"ACT-Web-JS-2.6.0","AppInfo.Language":"en","UserInfo.Language":"en-US","UserInfo.TimeZone":"-08:00","DeviceInfo.BrowserName":"Chrome","DeviceInfo.BrowserVersion":"63.0.3239.132","DeviceInfo.OsName":"Mac OS X","DeviceInfo.OsVersion":"10","p1":"V1","PipelineInfo.IngestionTime":"2018-01-25T21:09:33.9930000Z","PipelineInfo.ClientCountry":"CA","PipelineInfo.IngestionPath":"FastPath","EventInfo.BaseType":"custom","EventInfo.IngestionTime":"2018-01-25T21:09:33.9930000Z"}
I got this to work with single quotes and single square brackets, e.g.
@columnized = SELECT
json["['EventInfo.Source']"] AS EventInfoSource,
...
Full code:
@columnized = SELECT
json["['EventInfo.Source']"] AS EventInfoSource,
json["['EventInfo.InitId']"] AS EventInfoInitId,
json["['EventInfo.Sequence']"] AS EventInfoSequence,
json["['EventInfo.Name']"] AS EventInfoName,
json["['EventInfo.Time']"] AS EventInfoTime,
json["['EventInfo.SdkVersion']"] AS EventInfoSdkVersion,
json["['AppInfo.Language']"] AS AppInfoLanguage,
json["['UserInfo.Language']"] AS UserInfoLanguage,
json["['DeviceInfo.BrowserName']"] AS DeviceInfoBrowswerName,
json["['DeviceInfo.BrowserVersion']"] AS BrowswerVersion,
json["['DeviceInfo.OsName']"] AS DeviceInfoOsName,
json["['DeviceInfo.OsVersion']"] AS DeviceInfoOsVersion,
json["['DeviceInfo.Id']"] AS DeviceInfoId,
json["p1"] AS p1,
json["['PipelineInfo.AccountId']"] AS PipelineInfoAccountId,
json["['PipelineInfo.IngestionTime']"] AS PipelineInfoIngestionTime,
json["['PipelineInfo.ClientIp']"] AS PipelineInfoClientIp,
json["['PipelineInfo.ClientCountry']"] AS PipelineInfoClientCountry,
json["['PipelineInfo.IngestionPath']"] AS PipelineInfoIngestionPath,
json["['AppInfo.Id']"] AS AppInfoId,
json["['EventInfo.Id']"] AS EventInfoId,
json["['EventInfo.BaseType']"] AS EventInfoBaseType,
json["['EventInfo.IngestionTime']"] AS EventINfoIngestionTime
FROM @jsonify;
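My guess at why the brackets help is that the key is treated as a path expression, so a period means "go one level deeper" unless the whole name is quoted inside brackets. The toy Python resolver below only illustrates that idea; `select_value` is a made-up function and is not how the Microsoft.Analytics.Samples.Formats library is actually implemented.

```python
import re


def select_value(doc, path):
    """Toy path lookup: ['a.b'] is one literal key, a.b descends two levels."""
    bracketed = re.fullmatch(r"\['(.*)'\]", path)
    if bracketed:
        # Bracket notation: the quoted text is a single key, dots included
        return doc.get(bracketed.group(1))
    node = doc
    for part in path.split("."):
        # Dot notation: every segment is a separate nesting level
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


# Flat document with dotted key names, like the telemetry sample
doc = {"EventInfo.Source": "JS_default_source", "p1": "V1"}

print(select_value(doc, "EventInfo.Source"))      # looks for doc["EventInfo"]["Source"]
print(select_value(doc, "['EventInfo.Source']"))  # looks for doc["EventInfo.Source"]
print(select_value(doc, "p1"))
```

Under this reading, "p1" worked in the original script because it contains no period, which matches the symptom described in the question.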

Hive: parsing JSON

I am trying to get some values out of nested JSON for millions of rows (5 TB+ table). What is the most efficient way to do this?
Here is an example:
{"country":"US","page":227,"data":{"ad":{"impressions":{"s":10,"o":10}}}}
I need these values out of the above JSON:
Country Page impressions_s impressions_o
------- ---- ------------- -------------
US      227  10            10
This is Hive's json_tuple function, I am not sure if this is the best function.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-getjsonobject
You can use get_json_object:
select get_json_object(fieldname, '$.country'),
get_json_object(fieldname, '$.data.ad.impressions.s') from ...
You will get better performance with json_tuple, but I found a how-to for getting values out of JSON nested inside JSON.
To format your table you can use something like this:
from table t lateral view
explode( split(regexp_replace(get_json_object(ln, '$.data.ad.impressions.s'), '\\[|\\]', ''), ',' ) ) tb1 as s
The code above turns your array into a column.
For more: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
I hope this helps.
Here is what you can quickly try; I would suggest using a JSON SerDe.
nano /tmp/hive-parsing-json.json
{"country":"US","page":227,"data":{"ad":{"impressions":{"s":10,"o":10}}}}
Create base table :
hive > CREATE TABLE hive_parsing_json_table ( json string );
Load json file to Table :
hive > LOAD DATA LOCAL INPATH '/tmp/hive-parsing-json.json' INTO TABLE hive_parsing_json_table;
Query the table :
hive > select v1.Country, v1.Page, v4.impressions_s, v4.impressions_o
from hive_parsing_json_table hpjp
LATERAL VIEW json_tuple(hpjp.json, 'country', 'page', 'data') v1
as Country, Page, data
LATERAL VIEW json_tuple(v1.data, 'ad') v2
as Ad
LATERAL VIEW json_tuple(v2.Ad, 'impressions') v3
as Impressions
LATERAL VIEW json_tuple(v3.Impressions, 's' , 'o') v4
as impressions_s,impressions_o;
Output :
v1.country v1.page v4.impressions_s v4.impressions_o
US 227 10 10
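The chain of json_tuple calls walks down one level at a time: top level, then data, then ad, then impressions. The Python sketch below shows the same descent on the sample line; `extract` is just an illustrative helper, and the use of .get with empty-dict defaults is my own choice so that a missing level yields None instead of an error.

```python
import json

# The sample record from the question
line = '{"country":"US","page":227,"data":{"ad":{"impressions":{"s":10,"o":10}}}}'


def extract(line):
    """Return (country, page, impressions_s, impressions_o) from one JSON line."""
    doc = json.loads(line)
    # Descend data -> ad -> impressions, tolerating missing levels
    impressions = doc.get("data", {}).get("ad", {}).get("impressions", {})
    return (
        doc.get("country"),
        doc.get("page"),
        impressions.get("s"),
        impressions.get("o"),
    )


print(extract(line))
```

Each .get call corresponds to one LATERAL VIEW json_tuple step in the query above.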
Using Hive's native JSON SerDe ('org.apache.hive.hcatalog.data.JsonSerDe') you can do this. Here are the steps:
ADD JAR /path/to/hive-hcatalog-core.jar;
Create a table as below:
CREATE TABLE json_serde_nestedjson (
country string,
page int,
data struct < ad: struct < impressions: struct < s:int, o:int > > >
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
Then load the data (stored in a file):
LOAD DATA LOCAL INPATH '/tmp/nested.json' INTO TABLE json_serde_nestedjson;
Then get the required data using:
SELECT country, page, data.ad.impressions.s, data.ad.impressions.o
FROM json_serde_nestedjson;
Implementing a SerDe to parse your data in JSON is a better way for your case.
A tutorial on how to implement SerDe for parsing JSON can be found here
http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/
You can use the following sample SerDe implementation as well
https://github.com/rcongiu/Hive-JSON-Serde