I have a use case to store dynamic JSON objects in a column in BigQuery. The schema of the objects is generated dynamically by the source and is not known beforehand, and the number of key-value pairs in an object can differ as well, as shown below.
Example JSON objects:
{"Fruit":"Apple","Price":"10","Sale":"No"}
{"Movie":"Avatar","Genre":"Fiction"}
I could achieve this in Hive by defining the column as a map<string, string> and querying the data in the column like col_name["Fruit"] or col_name["Movie"] for the corresponding row.
Is there an equivalent in BigQuery? I came across the 'RECORD' data type, but it requires the schema to be the same for all objects in the column.
Note: Storing the column as a string datatype is not an option, as users need to query the data by key directly, without parsing it after retrieval.
Storing the data as a JSON string seems to be the only way to implement your requirement, at the moment. As a workaround, you can create a JavaScript UDF that parses the JSON string and extracts the necessary information. Below is a sample UDF.
CREATE TEMP FUNCTION extract_from_json(json STRING, key STRING)
RETURNS STRING
LANGUAGE js AS """
const obj = JSON.parse(json);
return obj[key];
""";
WITH json_table AS (
SELECT '{"Fruit":"Apple","Price":"10","Sale":"No"}' json_data UNION ALL
SELECT '{"Movie":"Avatar","Genre":"Fiction"}' json_data
)
SELECT extract_from_json(json_data, 'Movie') AS movie
FROM json_table;
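Note that for a row whose JSON lacks the requested key (the first row has no "Movie" key), the JavaScript lookup yields undefined, which BigQuery should surface as NULL.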
You can also check out the newly introduced JSON data type in BigQuery. This data type offers more flexibility when handling JSON data, but note that it is still in preview and is not recommended for production; you will have to enroll in the preview. For more information, refer to the BigQuery documentation on working with JSON data.
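As a rough sketch of what that could look like once enrolled (the mydataset.products dataset and table names here are hypothetical):
CREATE TABLE mydataset.products (json_data JSON);
INSERT INTO mydataset.products VALUES
  (JSON '{"Fruit":"Apple","Price":"10","Sale":"No"}'),
  (JSON '{"Movie":"Avatar","Genre":"Fiction"}');
-- JSON_VALUE extracts a scalar as a STRING; a missing key yields NULL
SELECT JSON_VALUE(json_data, '$.Fruit') AS fruit
FROM mydataset.products;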
So I have three databases: an Oracle one, a SQL Server one, and a Postgres one. I have a table that has two columns, name and value, both text. The value is a stringified JSON object. I need to update a nested value.
This is what I currently have:
name: 'MobilePlatform',
value:
'{
"iosSupported":true,
"androidSupported":false
}'
I want to add {"enableTwoFactorAuth": false} into it.
In PostgreSQL you should be able to do this:
UPDATE mytable
SET value = jsonb_set(value::jsonb, '{enableTwoFactorAuth}', 'false')
WHERE name = 'MobilePlatform';
In Postgres, the plain concatenation operator || for jsonb could do it:
UPDATE mytable
SET value = value::jsonb || '{"enableTwoFactorAuth":false}'::jsonb
WHERE name = 'MobilePlatform';
If a top-level key "enableTwoFactorAuth" already exists, it is replaced. So it's an "upsert" really.
Or use jsonb_set() for manipulating nested values.
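For example, a sketch for a nested key, assuming a hypothetical wrapper object flags inside the value (the trailing true asks jsonb_set to create the key if it is missing):
UPDATE mytable
SET value = jsonb_set(value::jsonb, '{flags,enableTwoFactorAuth}', 'false', true)
WHERE name = 'MobilePlatform';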
The cast back to text works implicitly as an assignment cast. (The result is in standard format; any insignificant whitespace is removed.)
If the content is valid JSON, the storage type should be json to begin with. In Postgres, jsonb would be preferable as it's easier to manipulate, but that's not directly portable to the other two RDBMS mentioned.
(Or, possibly, a normalized design without JSON altogether.)
For Oracle 21:
update mytable
set json_col = json_transform(
    json_col,
    INSERT '$.value.enableTwoFactorAuth' = 'false'
)
where json_exists(json_col, '$?(@.name == "MobilePlatform")');
With json_col being a JSON column, or a VARCHAR2/CLOB column with an IS JSON constraint.
(But it must be JSON if you want a multivalue index on the name attribute:
create multivalue index ix_json_col_name on mytable t ( t.json_col.name.string() );
)
Two of the databases you are using support a native JSON data type, so it doesn't make sense to keep the data as a stringified JSON object in a text column.
Oracle: https://docs.oracle.com/en/database/oracle/oracle-database/21/adjsn/json-in-oracle-database.html
PostgreSQL: https://www.postgresql.org/docs/current/datatype-json.html
Apart from these, SQL Server also provides functions to work with JSON data.
MS SQL Server: https://learn.microsoft.com/en-us/sql/relational-databases/json/json-data-sql-server?view=sql-server-ver16
Using a JSON type column in any of the above databases would enable you to use their JSON functions to perform the tasks that you are looking for.
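For example, a sketch of how this could look in SQL Server with JSON_MODIFY, reusing the table and column names from the question (the CAST to BIT is intended to write a JSON boolean false rather than the string 'false'):
-- JSON_MODIFY edits JSON text stored in an NVARCHAR/text-style column
UPDATE mytable
SET value = JSON_MODIFY(value, '$.enableTwoFactorAuth', CAST(0 AS BIT))
WHERE name = 'MobilePlatform';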
If you have to use text only, then you can use REPLACE to add the key-value pair at the end of your JSON:
update dataTable set value = REPLACE(value, '}', ',"enableTwoFactorAuth": false}') where name = 'MobilePlatform';
Here dataTable is the name of the table. Be aware that REPLACE rewrites every '}' in the value, so this only works while the stored JSON is flat, with a single closing brace.
The cleaner and less risky way would be to connect to the database from the application and use JSON methods such as JSON.parse in JavaScript or json.loads in Python. This gives you a JSON object (a dictionary in Python) to work on. You can look for similar methods in other languages as well.
But I would suggest using JSON columns instead of text to store JSON values wherever possible.
I apologize in advance if this is very simple and I am just missing it.
Would any of you know how to put custom attributes as column headers? I currently have a simple opt-in survey on Connect, and I would like to have each of the 4 items as column headers with the score in the table results. I pull the data into Excel using an ODBC connection, so ideally I would like to just add this to the end of my current table if I can figure out how to do it.
This is how it currently looks in the output
{"effortscore":"5","promoterscore":"5","satisfactionscore":"5","survey_opt_in":"True"}
If you have any links or resources I can follow to improve my knowledge, that would be great.
Thanks in advance.
There are multiple options for querying data in JSON format in Athena, and based on your use case (data source, query frequency, query destination, etc.) you can choose what makes the most sense.
String Column + JSON functions
This is usually the most straightforward option and a good starting point. You define survey_output as a string column, and when you need to extract specific attributes from the JSON string, you apply the JSON functions available in Trino/Athena: https://trino.io/docs/current/functions/json.html. For example:
SELECT
id,
json_query(
survey_output,
'lax $.satisfactionscore'
) AS satisfactionscore
FROM customers
String Column + JSON functions + View
Another way to simplify access to the data, without json_query calls in every query, is to define a VIEW on that table using the json_query syntax in the view creation. A DBA defines the view once, and when users query the data, they see only the columns they care about. For example:
CREATE VIEW survey_results AS
SELECT
id,
json_query(
survey_output,
'lax $.satisfactionscore'
) AS satisfactionscore
FROM customers;
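Users then query the view like any other table, for example:
SELECT id, satisfactionscore
FROM survey_results;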
With such dynamic view creation, you have more flexibility in what data will be easily exposed to the users.
Create a Table with STRUCT
Another option is to create the external table from the data source (files in S3, for example) with the STRUCT definition.
CREATE EXTERNAL TABLE survey (
id string,
survey_results struct<
effortscore:string,
promoterscore:string,
satisfactionscore:string,
survey_opt_in:string
>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://<YOUR BUCKET HERE>/<FILES>'
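Once the table is defined this way, the struct fields can be addressed with dot notation, for example:
SELECT
  id,
  survey_results.satisfactionscore,
  survey_results.survey_opt_in
FROM survey;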
I have a PySpark dataframe where one column (quite long strings) holds a JSON string with many keys, and I am only interested in one key. May I know how to extract the value for that key?
Here is an example of the string in the column userbehavior:
[{"num":"1234","Projections":"test", "intent":"test", "Mtime":11333.....}]
I wish to extract the value for "Mtime" only. I tried using:
user_hist_df=user_hist_df.select(get_json_object(user_hist_df.userbehavior, '$.Mtime').alias("Time"))
However, it does not work.
You are almost right. It isn't working because your JSON is an array of objects. Just change the path so it addresses the array:
get_json_object('userbehavior', '$[*].Mtime').alias("Time")
In order to extract from a JSON column you can use from_json() and specify the schema, e.g.:
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import MapType, StringType

df = df.withColumn("parsed_col", from_json(col("Body"), MapType(StringType(), StringType())))
Once you have parsed the JSON as per the schema, just extract the column you need:
df = df.withColumn("col_1", col("parsed_col").getItem("col_1"))
I have a table in AWS Glue, and the crawler has defined one field as array.
The content is in S3 files that have a json format.
The table is TableA, and the field is members.
There are a lot of other fields such as strings, booleans, doubles, and even structs.
I am able to query them all using a simple query such as:
SELECT
content.my_boolean,
content.my_string,
content.my_struct.value
FROM schema.tableA;
The issue is when I add content.members into the query.
The error I get is: [Amazon](500310) Invalid operation: schema "content" does not exist.
Content exists, because I am able to select other fields from the main key in the JSON (content).
It is probably something related to how to query an array field in Spectrum.
Any ideas?
You have to alias the table to extract the fields from the external schema:
SELECT
a.content.my_boolean,
a.content.my_string,
a.content.my_struct.value
FROM schema.tableA a;
I had the same issue with my data. I really don't know why it needs this alias, but it works. If you need to access the elements of an array you have to explode it, like:
SELECT member.<your-field>
FROM schema.tableA a, a.content.members AS member;
You need to create a Glue Classifier. Select JSON as the classifier type, and for the JSON path input the following:
$[*]
Then run your crawler. It will infer your schema and populate your table with the correct fields instead of just one big array. Not sure if this was what you were looking for, but I figured I'd drop this here just in case others have the same problem I had.
OK, so I'm getting a big JSON string from an API call, and I want to save some of that string into Cassandra. I'm trying to parse the JSON string into a more table-like structure, but with only some fields. The overall schema contains a register array whose elements carry an #attributes.regnum field plus data.date and data.value arrays.
And I want my table structure using regnum, date and value fields.
With sqlContext.read.json(vals).select(explode('register) as 'reg).select("reg.#attributes.regnum", "reg.data.date", "reg.data.value").show I can get a table with one row per regnum. But the date and value fields in that result are arrays. I would like to have one element per record, duplicating the corresponding regnum for each record. Any help is very much appreciated.
You can cast your DataFrame to a Dataset and then flatMap over it:
df.select("reg.#attributes.regnum", "reg.data.date", "reg.data.value")
  .as[(Long, Array[String], Array[String])] // typed as (regnum, dates, values)
  .flatMap(s => s._2.zip(s._3).map(p => (s._1, p._1, p._2))) // one output row per (regnum, date, value)
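Alternatively, if you register the selected DataFrame as a temporary view (the view name parsed below is hypothetical), a Spark SQL (2.4+) sketch using arrays_zip and explode achieves the same flattening:
-- assumes a view `parsed` with columns regnum, date (array) and value (array)
SELECT regnum, z.date, z.value
FROM parsed
LATERAL VIEW explode(arrays_zip(date, value)) t AS z;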