Iterate over ARRAY<JSON> in BigQuery

I want to fetch a value from a JSON object in BigQuery. I have JSON like the one below:
{"fruit":[{"apples":5,"oranges":10},{"apples":2,"oranges":4}]}
The condition is: if apples = 2, return the corresponding value of oranges, which is 4 here.
How do I iterate through an ARRAY in BigQuery?

Consider below example:

#standardSQL
with `project.dataset.table` as (
  select '{"fruit":[{"apples":5,"oranges":10},{"apples":2,"oranges":4}]}' json
)
select
  json_extract(x, '$.oranges') oranges
from `project.dataset.table`,
unnest(json_extract_array(json, '$.fruit')) x
where json_extract(x, '$.apples') = '2'
with output:

oranges
-------
4
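On current BigQuery, JSON_VALUE returns unquoted scalar values, which avoids quoting surprises when comparing extracted fields. A sketch of the same query under that assumption:

select
  json_value(x, '$.oranges') oranges
from `project.dataset.table`,
unnest(json_extract_array(json, '$.fruit')) x
where json_value(x, '$.apples') = '2'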

Related

Extract and explode inner nested element as rows from string nested structure

I would like to explode a column into rows in a PySpark (Hive) dataframe.
There are two columns in the dataframe.
The column "business_id" is a string.
The column "sports_info" is a struct; its single field holds an array of strings.
Data:
business_id  sports_info
"abc-123"    {"sports_type": ["{sport_name:most_recent, sport_events:[{sport_id:568, val:10.827},{id:171,score:8.61}]}"]}
I need to get a dataframe like:
business_id  sport_id
"abc-123"    568
"abc-123"    171
I defined:
schema = StructType([
    StructField("sports_type", ArrayType(StringType()), True)
])
df = spark.createDataFrame(data=data, schema=schema)  # I am not sure how to create the df
df.printSchema()
df.show(truncate=False)
def get_ids(val):
    # parse the stringified struct and collect the sport ids
    sport_ids_vals = eval(val.sports_type[0])['sport_events']
    ids = [s['sport_id'] for s in sport_ids_vals]
    return ids

df2 = df.withColumn('sport_new',
                    F.udf(lambda x: get_ids(x), ArrayType(StringType()))('sports_info'))
How could I create the df and extract/explode the inner nested elements?
df2 = df.withColumn('sport_new', expr("transform (sports_type, x -> regexp_extract( x, 'sport_id:([0-9]+)',1))")).show()
Explained:

expr(                        # use a SQL expression; the only way to access transform pre-Spark 3
  "transform(                # run a SQL function on an array
    sports_type,             # the column to use
    x ->                     # the name for each element in the array, then the code to run on it
      regexp_extract(        # use SQL regex functions to pull data out of the string
        x,                   # the string to run the regex on
        'sport_id:([0-9]+)', # find sport_id and capture the number following it
        1))"
)
This will likely run faster than a UDF, as it can be vectorized.
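For completeness, a minimal end-to-end sketch that builds the dataframe from the question and explodes every id to its own row. The data literal and the regex pattern are assumptions, and regexp_extract_all requires Spark 3.1+:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

spark = SparkSession.builder.getOrCreate()

# one row shaped like the question: a struct holding an array of strings
data = [("abc-123",
         (["{sport_name:most_recent, sport_events:[{sport_id:568, val:10.827},{id:171,score:8.61}]}"],))]
schema = StructType([
    StructField("business_id", StringType(), True),
    StructField("sports_info", StructType([
        StructField("sports_type", ArrayType(StringType()), True)
    ]), True),
])
df = spark.createDataFrame(data, schema)

# pull every sport_id/id number out of each string, then explode to one row per id
ids = F.expr(
    "flatten(transform(sports_info.sports_type, "
    "x -> regexp_extract_all(x, '(sport_)?id:([0-9]+)', 2)))")
df.select("business_id", F.explode(ids).alias("sport_id")).show()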

Postgres select value by key from json in a list

Given the following:
create table test (
id int,
status text
);
insert into test values
(1,'[]'),
(2,'[{"A":"d","B":"c"}]'),
(3,'[{"A":"g","B":"f"}]');
Is it possible to return the following?
id A B
1 null null
2 d c
3 g f
I am attempting something like this:
select id,
status::json ->> 0 #> "A" from test
Try this to address your specific example:
SELECT id, (status :: json)#>>'{0,A}' AS A, (status :: json)#>>'{0,B}' AS B
FROM test
with the result:

id  a     b
--  ----  ----
1   null  null
2   d     c
3   g     f
See the manual:
jsonb #>> text[] → text
Extracts JSON sub-object at the specified path as text.
'{"a": {"b": ["foo","bar"]}}'::json #>> '{a,b,1}' → bar
This does it:
SELECT id,
       (status::json->0)->>'A' AS a,
       (status::json->0)->>'B' AS b
FROM test;
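If the arrays may hold more than one object, a variant (a sketch, not from the answers above) expands them with json_array_elements via a lateral join, while the LEFT JOIN keeps the ids whose array is empty:

SELECT t.id, elem->>'A' AS a, elem->>'B' AS b
FROM test t
LEFT JOIN LATERAL json_array_elements(t.status::json) AS elem ON true;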

Convert key-value pairs in a column into new columns in Python

I want to parse a column and extract the key-value pairs as new columns.
Input:
I have a dataframe (called df) with the following structure:
ID data
A1 {"userMatch": "{"match":{"phone":{"name":{"score":1}},"name":{"score":1}}}"}
A2 {"userMatch": "{"match":{"phone":{"name":{"score":0.934}},"name":{"score":0.952}}}"}
Expected Output:
I want to create new columns called 'score1' and 'score2' holding the values from the key-value pairs:
ID score1 score2
A1 1 1
A2 0.934 0.952
Attempted Solution:
data_json = df['data'].transform(lambda x: json.loads(x))
df['score1'] = data_json.str.get('userMatch').str.get('match').str.get('phone').str.get('name').str.get('score')
df['score2'] = data_json.str.get('userMatch').str.get('match').str.get('phone').str.get('name').str.get('name').str.get('score')
Error:
TypeError: the JSON object must be str, bytes or bytearray, not Series
Notes:
I am not even sure how to get to score2.
Using my previous thought regarding regex, this is how I would approach your problem:
import re

def getOffset(row, offset):
    # pull every number out of the raw data string
    vals = re.findall(r"[-+]?\d*\.\d+|\d+", row.data)
    if len(vals) > offset:
        return vals[offset]
    return None

df['score1'] = df.apply(lambda row: getOffset(row, 0), axis=1)
df['score2'] = df.apply(lambda row: getOffset(row, 1), axis=1)
df.drop(['data'], axis=1, inplace=True)
This yields a dataframe of the form:
ID score1 score2
0 A1 1 1
1 A2 0.934 0.952
This isn't pretty, but it works with split(). I couldn't get the value to parse as a dictionary; I kept getting invalid-syntax or missing-delimiter errors.
import io
import pandas as pd

df = pd.read_csv(io.StringIO('''ID data
A1 {"userMatch": "{"match":{"phone":{"name":{"score":1}},"name":{"score":1}}}"}
A2 {"userMatch": "{"match":{"phone":{"name":{"score":0.934}},"name":{"score":0.952}}}"}'''), sep=' ', engine='python')
df['score1'] = df['data'].apply(lambda x: x.split('{"userMatch": "{"match":{"phone":{"name":{"score":')[1].split('}', 1)[0])
df['score2'] = df['data'].apply(lambda x: x.split('{"userMatch": "{"match":{"phone":{"name":{"score":')[1].split(',"name":{"score":')[1].split('}', 1)[0])
Output:
ID data score1 score2
0 A1 {"userMatch": "{"match":{"phone":{"name":{"score":1}},"name":{"score":1}}}"} 1 1
1 A2 {"userMatch": "{"match":{"phone":{"name":{"score":0.934}},"name":{"score":0.952}}}"} 0.934 0.952
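A more compact variant (a sketch; it assumes every row carries its "score" values in the same order) pulls the numbers with one vectorized regex instead of nested split() calls:

import pandas as pd

df = pd.DataFrame({
    "ID": ["A1", "A2"],
    "data": [
        '{"userMatch": "{"match":{"phone":{"name":{"score":1}},"name":{"score":1}}}"}',
        '{"userMatch": "{"match":{"phone":{"name":{"score":0.934}},"name":{"score":0.952}}}"}',
    ],
})

# every number that follows a "score": key, in order of appearance
scores = df["data"].str.findall(r'"score":([\d.]+)')
df["score1"] = scores.str[0].astype(float)
df["score2"] = scores.str[1].astype(float)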

Compare JSON values in MariaDB

How can I compare two JSON values in MariaDB? Two values such as {"b": 1, "a": 2} and {"a": 2, "b": 1} should be equal. Does MariaDB have a function to reorder the elements of a JSON value?
If you expect to need this (uncommon) kind of comparison, build the JSON in some canonical way before storing it. The obvious way for a simple JSON like yours is to alphabetize the keys. How to do that will depend on the "encode" library you are using for JSON.
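For example, with Python's standard json module (one possible "encode" library), sort_keys produces a canonical key order, so equal objects serialize to identical strings:

import json

a = json.dumps({"b": 1, "a": 2}, sort_keys=True)
b = json.dumps({"a": 2, "b": 1}, sort_keys=True)
print(a == b)  # True - both serialize to '{"a": 2, "b": 1}'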
Just use JSON_EXTRACT; it doesn't care about the position of a key within a JSON string.
Query

SELECT
    JSON_EXTRACT(@json_string_1, '$.a') AS a1
  , JSON_EXTRACT(@json_string_2, '$.a') AS a2
  , JSON_EXTRACT(@json_string_1, '$.b') AS b1
  , JSON_EXTRACT(@json_string_2, '$.b') AS b2
FROM (
  SELECT
      @json_string_1 := '{"b":1,"a":2}'
    , @json_string_2 := '{"a":2,"b":1}'
) AS json_strings
Result
a1 a2 b1 b2
------ ------ ------ --------
2 2 1 1
Now use this result as a derived table so we can check whether a1 equals a2 and b1 equals b2.
Query

SELECT
  1 AS json_equal
FROM (
  SELECT
      JSON_EXTRACT(@json_string_1, '$.a') AS a1
    , JSON_EXTRACT(@json_string_2, '$.a') AS a2
    , JSON_EXTRACT(@json_string_1, '$.b') AS b1
    , JSON_EXTRACT(@json_string_2, '$.b') AS b2
  FROM (
    SELECT
        @json_string_1 := '{"b":1,"a":2}'
      , @json_string_2 := '{"a":2,"b":1}'
  ) AS json_strings
) AS json_data
WHERE
  json_data.a1 = json_data.a2
  AND
  json_data.b1 = json_data.b2
Result
json_equal
------------
1
Disclaimer: I work for MariaDB
See my answer at https://dba.stackexchange.com/a/300235/208895 for an example of how to use JSON_EQUALS, available as of MariaDB 10.7.
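A minimal sketch against the values from the question (assumes MariaDB 10.7 or later):

SELECT JSON_EQUALS('{"b": 1, "a": 2}', '{"a": 2, "b": 1}') AS json_equal;
-- returns 1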

Scala Spark - For loop in Data Frame and compare date

I have a Data Frame which has 3 columns like this:
---------------------------------------------
| x(string) | date(date) | value(int) |
---------------------------------------------
I want to SELECT all the rows [i] that satisfy all 4 conditions:
1) row [i] and row [i - 1] have the same value in column 'x'
AND
2) 'date' at row [i] == 'date' at row [i - 1] + 1 (two consecutive days)
AND
3) 'value' at row [i] > 5
AND
4) 'value' at row [i - 1] <= 5
I think maybe I need a for loop, but I don't know how exactly. Any help is much appreciated!
This can be done easily with window functions; look at the lag function:
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
// test data
val list = Seq(
  ("x", "2016-12-13", 1),
  ("x", "2016-12-14", 7)
)
val df = sc.parallelize(list).toDF("x", "date", "value")

// add lags - so each row also carries the previous row's values
val withPrevs = df
  .withColumn("prevX", lag('x, 1).over(Window.orderBy($"date")))
  .withColumn("prevDate", lag('date, 1).over(Window.orderBy($"date")))
  .withColumn("prevValue", lag('value, 1).over(Window.orderBy($"date")))

// filter values and select only the needed fields
withPrevs
  .where('x === 'prevX)
  .where('value > lit(5))
  .where('prevValue <= lit(5))
  .where('date === date_add('prevDate, 1))
  .select('x, 'date, 'value)
  .show()
Note that this cannot be done without an ordering, here by date; a Dataset has no meaningful order on its own, so you must specify the order explicitly.
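One possible refinement, not part of the original answer: partitioning the window by x lets lag look only at previous rows with the same x and avoids pulling the whole dataset into a single partition:

// partition by x so lag() only sees the previous row with the same x
val byX = Window.partitionBy('x).orderBy($"date")
val withPrevs = df
  .withColumn("prevDate", lag('date, 1).over(byX))
  .withColumn("prevValue", lag('value, 1).over(byX))
// the 'x === 'prevX filter is then implied by the partitioning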
If you have a DataFrame created, then all you need to do is call the filter function on the DataFrame with all your conditions.
For example:
df1.filter($"Column1" === 2 || $"Column2" === 3)
You can pass as many conditions as you want; it will return a new DataFrame with the filtered data.
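Applied to this question, the same idea would chain all four conditions over the lag columns built in the first answer (a sketch reusing those prevX/prevDate/prevValue columns):

withPrevs.filter(
  $"x" === $"prevX" &&                    // 1) same x as the previous row
  $"date" === date_add($"prevDate", 1) && // 2) exactly one day later
  $"value" > 5 &&                         // 3) current value above 5
  $"prevValue" <= 5                       // 4) previous value at or below 5
).select($"x", $"date", $"value").show()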