Converting JSON to Columns in BigQuery

I'm trying to parse a JSON column and map it into individual columns based on key-value pairs. Here's what my input looks like, and I've added the sample output. I am doing this in GCP BigQuery.
Input: JSON column
{"id":"1","timestamp":"2022-09-05", "data":{"fruits":"apple", "name":"abc"}},
{"id":"2","timestamp":"2022-09-06", "data":{"vegetables":"tomato", "name":"def"}},
{"id":"3","timestamp":"2022-09-07", "data":{"fruits":"banana", "name":"ghi"}}
Sample Output:
id | timestamp  | fruits | vegetables | name
---+------------+--------+------------+------
 1 | 2022-09-05 | apple  | null       | abc
 2 | 2022-09-06 | null   | tomato     | def
 3 | 2022-09-07 | banana | null       | ghi
P.S. -> I've tried going through a few of the answers on similar use cases, but they didn't quite work for me.
Thanks in advance!

to parse a JSON column and map it into individual columns based on key-value pairs
Consider the query below:
select
json_value(json, '$.id') id,
json_value(json, '$.timestamp') timestamp,
json_value(json, '$.data.fruits') fruits,
json_value(json, '$.data.vegetables') vegetables,
json_value(json, '$.data.name') name
from your_table
If applied to the sample data in your question, the output matches the expected result shown above.
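For a self-contained test, here is a minimal sketch (hedged: it assumes the raw JSON strings live in a STRING column named json, and the CTE stands in for your_table):
with your_table as (
  select '{"id":"1","timestamp":"2022-09-05", "data":{"fruits":"apple", "name":"abc"}}' as json union all
  select '{"id":"2","timestamp":"2022-09-06", "data":{"vegetables":"tomato", "name":"def"}}' union all
  select '{"id":"3","timestamp":"2022-09-07", "data":{"fruits":"banana", "name":"ghi"}}'
)
select
  json_value(json, '$.id') id,
  json_value(json, '$.timestamp') timestamp,
  json_value(json, '$.data.fruits') fruits,
  json_value(json, '$.data.vegetables') vegetables,
  json_value(json, '$.data.name') name
from your_table
-- expected:
-- 1 | 2022-09-05 | apple  | null   | abc
-- 2 | 2022-09-06 | null   | tomato | def
-- 3 | 2022-09-07 | banana | null   | ghi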

Related

GROUP_CONCAT using WHERE IN is not working as expected

I have a table [country] as follows:
id | country_name
---+---------------
 1 | India
 2 | USA
 3 | Nepal
 4 | SriLanka
When I try querying as follows, it works as expected
select group_concat(country_name)
from country
where id in (1, 2)
I get the result I want:
India,USA
But when I run the query this way, I get a different result:
select GROUP_CONCAT(country_name)
from country
where id in (CONVERT(REPLACE(REPLACE('[1,2]','[',''),']', ''), CHARACTER));
The result I get is
India
I'd appreciate help with this.
You can try this:
select GROUP_CONCAT(country_name) from country
where id in (CONVERT(REPLACE(REPLACE('[1,2]','[',''),']',''), CHAR));
I am not sure why you use the second query, but you may try using JSON:
select GROUP_CONCAT(country_name) from country where JSON_SEARCH(CAST('["1","2"]' AS JSON), "one", id )
Using JSON_SEARCH, MySQL will look for id in your JSON array.
The problem is that it works with ["1","2"] and not with [1,2].
At last I found the answer:
select GROUP_CONCAT(country_name) from country where FIND_IN_SET(id,REPLACE(REPLACE('[1,2]','[',''),']',''));
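For context on why the IN version only returned India: the nested REPLACE/CONVERT expression reduces to the single string '1,2', and when MySQL compares the numeric id against that one value, the string is coerced to the number 1, so only id = 1 matches. Below is a minimal, hedged sketch (table recreated from the sample in the question) showing the FIND_IN_SET approach end to end:
-- Recreate the sample table from the question.
CREATE TABLE country (id INT PRIMARY KEY, country_name VARCHAR(50));
INSERT INTO country VALUES (1, 'India'), (2, 'USA'), (3, 'Nepal'), (4, 'SriLanka');

-- FIND_IN_SET(id, '1,2') returns the 1-based position of id in the list (0 if absent),
-- which is truthy for ids 1 and 2 only.
SELECT GROUP_CONCAT(country_name)
FROM country
WHERE FIND_IN_SET(id, REPLACE(REPLACE('[1,2]','[',''),']',''));
-- Returns: India,USA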

Select query for rows which contain specific values within a JSON array?

I have this table currently
ID | Seller_Name | Seller_Non_Working_Day
1 | Seller A | [1,7]
2 | Seller B | [1]
3 | Seller C | []
4 | Seller D | [1,7]
I'm trying to extract sellers who are not working on Sundays only, i.e. with [1] in Seller_Non_Working_Day. This field is of JSON type.
This is my query, and I'm not getting any response :(
select * from table_name
where Seller_Non_Working_Day IN ('[1]')
Can anyone assist, please?
MySQL's JSON_OVERLAPS function compares two JSON documents or arrays. It returns 1 if they share any values and 0 if they do not.
JSON_OVERLAPS(doc_path1, doc_path2)
For your query, I have created a second array with one value, [1], i.e. Sunday. The function compares it against every record and returns 0 or 1; rows whose Seller_Non_Working_Day includes Sunday return 1. To eliminate the non-required records that return 1, I have added a WHERE clause. Here is the query:
SELECT * FROM table_name
WHERE JSON_OVERLAPS(Seller_Non_Working_Day , '[1]') != 1 ;
There is another function, JSON_CONTAINS, which will list all records containing a specific value, e.g. [1], at any position within the JSON array. The syntax for JSON_CONTAINS is:
JSON_CONTAINS(target, candidate[, path])
target is the field to search. Here it would be Seller_Non_Working_Day.
candidate is the value to find. Here it would be [1].
path is an optional location within the document to restrict the search.
One can use the query below to fetch all sellers which are not working on Sundays (i.e. Sunday appears in their non-working days):
SELECT * FROM table_name
WHERE JSON_CONTAINS(Seller_Non_Working_Day , '[1]');
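As a hedged, self-contained sketch (table recreated from the question; JSON_OVERLAPS requires MySQL 8.0.17+), you can verify the behaviour like this:
-- Recreate the sample data from the question.
CREATE TABLE table_name (
  ID INT,
  Seller_Name VARCHAR(20),
  Seller_Non_Working_Day JSON
);
INSERT INTO table_name VALUES
  (1, 'Seller A', '[1,7]'),
  (2, 'Seller B', '[1]'),
  (3, 'Seller C', '[]'),
  (4, 'Seller D', '[1,7]');

-- JSON_OVERLAPS: 1 when the array shares a value with '[1]' -> Sellers A, B and D.
SELECT ID, Seller_Name, JSON_OVERLAPS(Seller_Non_Working_Day, '[1]') AS has_sunday
FROM table_name;

-- JSON_CONTAINS: rows whose array includes 1 at any position -> Sellers A, B and D.
SELECT * FROM table_name WHERE JSON_CONTAINS(Seller_Non_Working_Day, '[1]');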

How to get from nested JSON by int rather than by name in MySQL 8

So I'm currently using MySQL's JSON field to store some data.
So the 'reports' table looks like this:
id | stock_id | type | doc |
1 | 5 | Income_Statement | https://pastebin.com/bj1hdK0S|
The pastebin is the content of the json field
What I want to do is get a number (ebit) from the first object under yearly (2018-12-31) in the JSON and then use that in a WHERE clause, so that it only returns rows where ebit > 50000000, for example. The issue is that the dates under yearly are not standard (i.e. one might be 2018-12-31, the other might be 2018-12-15). So essentially I want a way to get the data using integer indexes rather than the actual names of the objects, something like yearly.[0].ebit.
How would I do this in MySQL? Alternatively, if it's not possible in MySQL, would it be possible in either PostgreSQL or Mongo? If so, could you give me an example? Most of the data fits well into MySQL; only this table has a JSON column, which is why I started with MySQL.
I can't speak for MySQL or MongoDB, but here's a simple version using the PostgreSQL JSONB type:
SELECT (doc->'yearly'-> max(years) -> 'ebit')::numeric AS ebit
FROM reports, jsonb_object_keys(doc->'yearly') AS years
GROUP BY reports.doc;
...with simplistic test data:
WITH reports(doc) AS (
SELECT '{"yearly":{"2018-12-31":{"ebit":123},"2017-12-31":{"ebit":1.23}}}'::jsonb
)
SELECT (doc->'yearly'-> max(years) -> 'ebit')::numeric AS ebit
FROM reports, jsonb_object_keys(doc->'yearly') AS years
GROUP BY reports.doc;
...gives:
ebit
------
123
(1 row)
So I've basically selected the latest entry under "yearly" without knowing actual values but assuming that the key date formatting will allow a sort order (in this case it seems to comply with ISO-8601).
Using data type JSON instead of JSONB would preserve object key order but is not as efficient in PostgreSQL further down the road and wouldn't help here either.
If you then want to select only those reports entries whose latest ebit is greater than a certain value, just pack it into a sub-select or a CTE. I usually prefer CTEs because they are easier to read, so here we go:
WITH
reports (id, doc) AS (
VALUES
(1, '{"yearly":{"2018-12-31":{"ebit":123},"2017-12-31":{"ebit":1.23}}}'::jsonb),
(2, '{"yearly":{"2018-12-23":{"ebit":50},"2017-12-22":{"ebit":"1200.00"}}}'::jsonb)
),
r_ebit (id, ebit) AS (
SELECT reports.id, (reports.doc->'yearly'-> max(years) -> 'ebit')::numeric AS ebit
FROM reports, jsonb_object_keys(doc->'yearly') AS years
GROUP BY reports.id, reports.doc
)
SELECT id, ebit
FROM r_ebit
WHERE ebit > 100;
However, as you can already see, it is not possible to filter the original rows directly with this strategy; the keys have to be exploded and aggregated first. A pre-processing step would make sense here so that the JSON format actually is filter-friendly.
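One possible pre-processing sketch (an assumption about what such a step could look like, not a prescribed design): materialize the latest ebit into a plain numeric column so ordinary WHERE filters and indexes apply. The column name latest_ebit is hypothetical; the extraction logic is the same as above.
-- Hypothetical helper column, populated once (or by a trigger/ETL step).
ALTER TABLE reports ADD COLUMN latest_ebit numeric;

UPDATE reports AS r
SET latest_ebit = sub.ebit
FROM (
  SELECT id, (doc->'yearly'-> max(years) ->> 'ebit')::numeric AS ebit
  FROM reports, jsonb_object_keys(doc->'yearly') AS years
  GROUP BY id, doc
) AS sub
WHERE r.id = sub.id;

-- Now filtering (and indexing) the base table is straightforward:
SELECT id, latest_ebit FROM reports WHERE latest_ebit > 100;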
ADDENDUM
To add the possibility of selecting the values for the n-th completed fiscal year, we need to resort to window functions, and we also need to reduce the resulting set to only return a single row per actual group (in the demonstration case: reports.id):
WITH reports(id, doc) AS (VALUES
(1, '{"yearly":{"2018-12-31":{"ebit":123},"2017-12-31":{"ebit":1.23},"2016-12-31":{"ebit":"23.42"}}}'::jsonb),
(2, '{"yearly":{"2018-12-23":{"ebit":50},"2017-12-22":{"ebit":"1200.00"}}}'::jsonb)
)
SELECT DISTINCT ON (1) reports.id, (reports.doc->'yearly'-> (lead(years, 0) over (partition by reports.doc order by years desc nulls last)) ->>'ebit')::numeric AS ebit
FROM reports, jsonb_object_keys(doc->'yearly') AS years
GROUP BY 1, reports.doc, years.years ORDER BY 1;
...will behave exactly as using the max aggregate function previously. Increasing the offset parameter within the lead(years, <offset>) function call will select the n-th year backwards (because of the descending order of the window partition).
The DISTINCT ON (1) clause is the magic that reduces the result to a single row per distinct column value (first column = reports.id). This is why the NULLS LAST is very important inside the window OVER clause.
Here are results for different offsets (I've added a third historic entry for the first id but not for the second to also show how it deals with absent entries):
N = 0:
id | ebit
----+------
1 | 123
2 | 50
N = 1
id | ebit
----+---------
1 | 1.23
2 | 1200.00
N = 2
id | ebit
----+-------
1 | 23.42
2 |
...which means absent entries will just result in a NULL value.

Athena unable to parse date using OpenCSVSerde

I have a very simple csv file on S3
"i","d","f","s"
"1","2018-01-01","1.001","something great!"
"2","2018-01-02","2.002","something terrible!"
"3","2018-01-03","3.003","I'm an oil man"
I'm trying to create a table on top of this using the following command:
CREATE EXTERNAL TABLE test (i int, d date, f float, s string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://mybucket/test/'
TBLPROPERTIES ("skip.header.line.count"="1");
When I query the table (select * from test) I'm getting an error like this:
HIVE_BAD_DATA:
Error parsing field value '2018-01-01' for field 1: For input string: "2018-01-01"
Some more info:
If I change the d column to a string the query will succeed
I've previously parsed dates in text files using Athena; I believe using LazySimpleSerDe
Definitely seems like a problem with the OpenCSVSerde
The documentation definitely implies that this is supported. Looking for anyone who has encountered this, or any suggestions.
In fact, it is a problem with the documentation that you mentioned. You were probably referring to this excerpt:
[OpenCSVSerDe] recognizes the DATE type if it is specified in the UNIX
format, such as YYYY-MM-DD, as the type LONG.
Understandably, you were formatting your date as YYYY-MM-DD. However, the documentation is deeply misleading in that sentence. When it refers to UNIX format, it actually has UNIX Epoch Time in mind.
Based on the definition of the UNIX Epoch, your dates should be integers (hence the reference to the type LONG in the documentation): specifically, the number of days that have elapsed since January 1, 1970.
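As a quick sanity check (a hedged sketch using Athena's Presto date functions, not part of the original answer), you can compute that day number directly:
-- Days elapsed since the UNIX epoch for 2018-01-01:
SELECT date_diff('day', DATE '1970-01-01', DATE '2018-01-01');  -- 17532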
For instance, your sample CSV should look like this:
"i","d","f","s"
"1","17532","1.001","something great!"
"2","17533","2.002","something terrible!"
"3","17534","3.003","I'm an oil man"
Then you can run that exact same command:
CREATE EXTERNAL TABLE test (i int, d date, f float, s string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://mybucket/test/'
TBLPROPERTIES ("skip.header.line.count"="1");
If you query your Athena table with select * from test, you will get:
i d f s
--- ------------ ------- ---------------------
1 2018-01-01 1.001 something great!
2 2018-01-02 2.002 something terrible!
3 2018-01-03 3.003 I'm an oil man
An analogous problem also compromises the explanation on TIMESTAMP in the aforementioned documentation:
[OpenCSVSerDe] recognizes the TIMESTAMP type if it is specified in the
UNIX format, such as yyyy-mm-dd hh:mm:ss[.f...], as the type LONG.
It seems to indicate that we should format TIMESTAMPs as yyyy-mm-dd hh:mm:ss[.f...]. Not really. In fact, we need to use UNIX Epoch Time again, but this time with the number of milliseconds that have elapsed since Midnight 1 January 1970.
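Again as a hedged sanity check with Athena's Presto functions (not from the original answer), you can confirm what such a millisecond value represents:
-- Interpret an epoch-millisecond value; from_unixtime() expects seconds.
SELECT from_unixtime(1564286638027 / 1000.0);  -- 2019-07-28 04:03:58.027 UTC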
For instance, consider the following sample CSV:
"i","d","f","s","t"
"1","17532","1.001","something great!","1564286638027"
"2","17533","2.002","something terrible!","1564486638027"
"3","17534","3.003","I'm an oil man","1563486638012"
And the following CREATE TABLE statement:
CREATE EXTERNAL TABLE test (i int, d date, f float, s string, t timestamp)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://mybucket/test/'
TBLPROPERTIES ("skip.header.line.count"="1");
This will be the result set for select * from test:
i d f s t
--- ------------ ------- --------------------- -------------------------
1 2018-01-01 1.001 something great! 2019-07-28 04:03:58.027
2 2018-01-02 2.002 something terrible! 2019-07-30 11:37:18.027
3 2018-01-03 3.003 I'm an oil man 2019-07-18 21:50:38.012
One workaround is to declare the d column as string and then, in the SELECT query, use DATE(d) or date_parse to parse the value as a date.
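A minimal sketch of that workaround (hedged: the table name test_str is hypothetical, everything else mirrors the statements above):
CREATE EXTERNAL TABLE test_str (i int, d string, f float, s string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://mybucket/test/'
TBLPROPERTIES ("skip.header.line.count"="1");

-- Parse the string at query time instead of in the table definition:
SELECT i, DATE(d) AS d, f, s FROM test_str;
-- or: SELECT date_parse(d, '%Y-%m-%d') FROM test_str;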

Retrieving nested values from PostgreSQL JSON columns

Our database contains a table "log" like this:
id | purchases (json)
---+------------------------------------------------------------------------------
 1 | {"apple":{"price":5,"seller":"frank"}, "bred":{"price":3,"seller":"kathy"}}
 2 | {"milk":{"price":3,"seller":"anne"}, "banana":{"price":2,"seller":"frank"}}
 3 | {"bred":{"price":4,"seller":"kathy"}}
We would like to retrieve all records containing "seller":"kathy". We tried simple queries like this:
SELECT id FROM log WHERE purchases ->> 'seller' LIKE 'kathy'
SELECT id FROM log WHERE purchases = '{"seller":"kathy"}'
We researched here and elsewhere for some hours ... it seems a bit more complex because the values are nested? We found e.g. some Java or PL/pgSQL implementations, but we are still hoping there is a "pure SQL" way. What would be a proper solution? Or should we re-organize our content like this:
id | purchases (json)
---+-----------------------------------------------------------------------------------------------
 1 | [{"product":"apple","price":5,"seller":"frank"},{"product":"bred","price":3,"seller":"kathy"}]
 2 | [{"product":"milk","price":3,"seller":"anne"},{"product":"banana","price":2,"seller":"frank"}]
 3 | [{"product":"bred","price":4,"seller":"kathy"}]
But from what we found, this would be even more complex, because we would have to explode the arrays within the query. Any short hint? Thanks!
Check the json_each() function and the #>> operator among the Postgres JSON functions:
WITH log(id,purchases) AS ( VALUES
(1,'{"apple":{"price":5,"seller":"frank"}, "bred":{"price":3,"seller":"kathy"}}'::JSON),
(2,'{"milk":{"price":3,"seller":"anne"}, "banana":{"price":2,"seller":"frank"}}'::JSON),
(3,'{"bred":{"price":4,"seller":"kathy"}}'::JSON)
)
SELECT log.* FROM log,
json_each(log.purchases) as purchase
WHERE
purchase.value#>>'{seller}' = 'kathy';
Result:
id | purchases
----+-----------------------------------------------------------------------------
1 | {"apple":{"price":5,"seller":"frank"}, "bred":{"price":3,"seller":"kathy"}}
3 | {"bred":{"price":4,"seller":"kathy"}}
(2 rows)
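If you did re-organize the content into arrays as sketched in the question, a hedged equivalent would explode each array with json_array_elements() instead of json_each():
WITH log(id, purchases) AS ( VALUES
  (1,'[{"product":"apple","price":5,"seller":"frank"},{"product":"bred","price":3,"seller":"kathy"}]'::JSON),
  (2,'[{"product":"milk","price":3,"seller":"anne"},{"product":"banana","price":2,"seller":"frank"}]'::JSON),
  (3,'[{"product":"bred","price":4,"seller":"kathy"}]'::JSON)
)
SELECT DISTINCT log.id
FROM log,
  json_array_elements(log.purchases) AS purchase
WHERE
  purchase->>'seller' = 'kathy';
-- returns ids 1 and 3, as before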