SOLVED - BigQuery: Cannot read repeated field of type STRUCT as optional STRUCT Field - json

I have MongoDB data that is extracted to BigQuery by a Google Cloud Function. We use a table partitioned by date. One field (location) in the collection unexpectedly changed from an object (RECORD) to an array. We changed it back to an object (RECORD), and now extracting this field fails. The error message is:
Cannot read repeated field of type STRUCT as optional STRUCT Field: location
I thought this was because some partitions still held the array data type, but after we backfilled all the partitions the error is still the same. Any ideas?
UPDATE: Some partitions were still ARRAY, so they had to be changed to RECORD.
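One way to keep the type from flipping again is to normalize the field in the Cloud Function before the load. The following is only a minimal sketch: `normalize_location` is a hypothetical helper, not part of the original pipeline, and taking the first element of an array is an assumption about which value should survive.

```python
from typing import Any


def normalize_location(doc: dict) -> dict:
    """Return a copy of doc whose 'location' is a single object or None."""
    out = dict(doc)
    loc: Any = out.get("location")
    if isinstance(loc, list):
        # Assumption: when the field arrived as an array, the first element
        # is the one to keep; an empty array becomes None (NULL).
        loc = loc[0] if loc else None
    if loc is not None and not isinstance(loc, dict):
        # Anything else would not fit a RECORD column, so fail loudly.
        raise TypeError(f"unexpected location type: {type(loc).__name__}")
    out["location"] = loc
    return out
```

Running every document through a helper like this before loading means each partition receives the same shape for location, so a stray array cannot turn the column into a repeated field again.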

Related

Creating External Table with Redshift Spectrum from nested JSON

I’m creating an external table from json data with input format org.apache.hadoop.mapred.TextInputFormat and output format org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat with SerDe org.openx.data.jsonserde.JsonSerDe.
One of the attributes of the json is a highly nested json called groups. The nested data doesn't follow a strict schema, so not all json within groups have the same attributes. I'm having trouble accessing group's attributes and I suspect that I am not casting groups to the proper datatype.
Here is a sample of the data
{"entity":"1111111","date":"2019-05-29T00:00:00.000Z","dataset":"authorizations","aggregations":{"sellersAuths":1,"sellersDeAuths":0},"groups":{"sellersAuths":{"mws_region":{"USAmazon":1},"created_by":{"SWIPE":1},"last_updated_by":{"SWIPE":1}},"sellersDeAuths":{"mws_region":{"EUAmazon":0},"created_by":{"SellerCent":0},"last_updated_by":{"JPAmazon":0}}}}
{"entity":"22222222","date":"2019-05-29T00:00:00.000Z","dataset":"authorizations","aggregations":{"sellersAuths":1,"sellersDeAuths":0},"groups":{"sellersAuths":{"mws_region":{"EUAmazon":1},"created_by":{"SWIPE":1},"last_updated_by":{"SWIPE":1}},"sellersDeAuths":{"mws_region":{"EUAmazon":0},"created_by":{"SWIPE":0},"last_updated_by":{"SWIPE":0}}}}
{"entity":"3333333","date":"2019-05-29T00:00:00.000Z","dataset":"authorizations","aggregations":{"sellersAuths":1,"sellersDeAuths":0},"groups":{"sellersAuths":{"mws_region":{"EUAmazon":1},"created_by":{"SWIPE":1},"last_updated_by":{"SWIPE":1}},"sellersDeAuths":{"mws_region":{"EUAmazon":0},"created_by":{"SWIPE":0},"last_updated_by":{"SWIPE":0}}}}
I've tried a couple of different ways of casting the data type of groups when creating the external table. I tried the super type: when I select groups I get the entire json, but when I select an attribute of groups, such as select groups.sellersAuths from ... or select groups."sellersAuths" from ..., I get relation groups does not exist.
I've also tried casting it as a struct<key:VARCHAR, value:struct<key:VARCHAR, value:struct<key:VARCHAR, value:FLOAT8>>>; however, when I access something like groups.key or groups.value.key, I always get NULL. I'm not sure how to cast the data type of groups when creating the external table, or whether my use case is what the super type is for.
I've also tried using JSON_PARSE after casting the data to VARCHAR, super, or struct, but that presents issues as well.
Thanks a ton for reading!
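A likely reason the struct<key, value, ...> cast returns NULL is that the sample records contain no literal fields named key or value; the keys under groups are data and vary per record. One possible workaround, sketched here as an assumption rather than a confirmed fix, is to flatten groups into fixed-schema rows before the data reaches the external table. `flatten_groups` is a hypothetical helper, and the (entity, metric, dimension, key, value) layout is an illustrative choice.

```python
import json
from typing import Iterator, Tuple


def flatten_groups(line: str) -> Iterator[Tuple[str, str, str, str, float]]:
    """Yield one (entity, metric, dimension, key, value) row per leaf of 'groups'."""
    record = json.loads(line)
    for metric, dimensions in record.get("groups", {}).items():
        # metric is e.g. "sellersAuths"; dimensions maps e.g. "mws_region" to counts
        for dimension, counts in dimensions.items():
            for key, value in counts.items():
                yield (record["entity"], metric, dimension, key, float(value))
```

Rows in this shape have the same five columns no matter which keys a given record happens to contain, so they can be declared with plain scalar types.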

How to store dynamically generated JSON object in BigQuery Table?

I have a use case to store dynamic JSON objects in a column in BigQuery. The schema of the object is generated dynamically by the source and is not known beforehand. The number of key-value pairs in the object can differ as well, as shown below.
Example JSON objects:
{"Fruit":"Apple","Price":"10","Sale":"No"}
{"Movie":"Avatar","Genre":"Fiction"}
I could achieve this in Hive by defining the column as a map<string, string> and querying it like col_name["Fruit"] or col_name["Movie"] for the corresponding row.
Is there an equivalent in BigQuery? I came across the 'RECORD' data type, but the schema needs to be the same for all the objects in the column.
Note: Storing the column as a string is not an option, as the users need to query the data by key directly, without parsing it after retrieval.
Storing the data as a JSON string seems to be the only way to implement your requirement, at the moment. As a workaround, you can create a JavaScript UDF that parses the JSON string and extracts the necessary information. Below is a sample UDF.
CREATE TEMP FUNCTION extract_from_json(json STRING, key STRING)
RETURNS STRING
LANGUAGE js AS """
const obj = JSON.parse(json);
return obj[key];
""";
WITH json_table AS (
SELECT '{"Fruit":"Apple","Price":"10","Sale":"No"}' json_data UNION ALL
SELECT '{"Movie":"Avatar","Genre":"Fiction"}' json_data
)
SELECT extract_from_json(json_data, 'Movie') AS movie
FROM json_table
You can also check out the newly introduced JSON data type in BigQuery. The data type offers more flexibility when handling JSON data but please note that the data type is still in preview and is not recommended for production. You will have to enroll in this preview. For more information on working with JSON data, refer to this documentation.
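For readers who want to try the lookup logic outside BigQuery, here is a rough Python equivalent of the UDF above. It is a sketch of the same idea, not BigQuery code; returning None for a missing key stands in for the NULL the query would produce.

```python
import json
from typing import Optional


def extract_from_json(json_text: str, key: str) -> Optional[str]:
    """Parse a JSON object and return the value stored under key, or None."""
    obj = json.loads(json_text)
    value = obj.get(key)
    # Mirror the UDF's STRING return type; a missing key stays None.
    return None if value is None else str(value)


rows = [
    '{"Fruit":"Apple","Price":"10","Sale":"No"}',
    '{"Movie":"Avatar","Genre":"Fiction"}',
]
movies = [extract_from_json(r, "Movie") for r in rows]
```

The first row has no "Movie" key, so its entry in movies is None, while the second yields "Avatar".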

How to query an array field (AWS Glue)?

I have a table in AWS Glue, and the crawler has defined one field as array.
The content is in S3 files that have a json format.
The table is TableA, and the field is members.
There are a lot of other fields such as strings, booleans, doubles, and even structs.
I am able to query them all using a simple query such as:
SELECT
content.my_boolean,
content.my_string,
content.my_struct.value
FROM schema.tableA;
The issue appears when I add content.members to the query.
The error I get is: [Amazon](500310) Invalid operation: schema "content" does not exist.
content does exist, because I am able to select other fields under the main key in the json (content).
It is probably related to how queries against array fields work in Spectrum.
Any ideas?
You have to alias the table to extract the fields from the external schema:
SELECT
a.content.my_boolean,
a.content.my_string,
a.content.my_struct.value
FROM schema.tableA a;
I had the same issue with my data. I really don't know why it needs this alias, but it works. If you need to access elements of an array, you have to explode it like this:
SELECT member.<your-field>
FROM schema.tableA a, a.content.members AS member;
Reference
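To make the array join above concrete, here is a small Python sketch of what exploding content.members amounts to: one output row per array element. `explode_members` and the sample field name are hypothetical, and dropping rows with an empty or missing array is my understanding of how the comma join behaves, not something the answer states.

```python
from typing import Any, Iterator, List


def explode_members(rows: List[dict], field: str) -> Iterator[Any]:
    """Yield field from every element of content.members, one value per element."""
    for row in rows:
        # Rows whose array is missing or empty contribute nothing, like an inner join.
        for member in row["content"].get("members") or []:
            yield member.get(field)
```

Called on a row whose members array holds two elements, this yields two values, which is the same row multiplication the query produces.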
You need to create a Glue Classifier.
Select JSON as Classifier type
and for the JSON Path input the following:
$[*]
then run your crawler. It will infer your schema and populate your table with the correct fields instead of just one big array. Not sure if this was what you were looking for but figured I'd drop this here just in case others had the same problem I had.
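As a rough picture of what the $[*] path asks for: each element of a top-level JSON array is treated as its own record, instead of the whole file being one array value. The sketch below only imitates that split in Python; `split_top_level_array` is a hypothetical helper and does not call any Glue API.

```python
import json
from typing import Any, List


def split_top_level_array(text: str) -> List[Any]:
    """Return the elements of a top-level JSON array as separate records."""
    data = json.loads(text)
    # A document that is not an array is treated as a single record.
    return data if isinstance(data, list) else [data]
```

A file containing an array of two objects comes back as two records, each with its own fields.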

how to cast json output field to date type

I used a wrapper between MongoDB and Postgres.
Now I have my tables and my data, and I can access them.
When I listed my table, I found that the date field nested under an object is not of type date or timestamp.
Let me explain with a short screenshot.
When I use PostgreSQL, my object is converted to json and I get all the fields, but the dates come through as numbers, as in the screenshot,
and I can't alter the type of the date field.
What should I do?
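The screenshot is not available here, so the following rests on an assumption: that the numbers are Unix epoch milliseconds. If they are in another unit, the divisor below would need to change. `millis_to_datetime` is a hypothetical helper showing the conversion.

```python
from datetime import datetime, timezone


def millis_to_datetime(millis: int) -> datetime:
    """Convert epoch milliseconds (assumed unit) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
```

For example, 1559088000000 maps to 2019-05-29 00:00:00 UTC under this assumption. A quick sanity check like this on a few values is an easy way to confirm the unit before converting in the database.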

how separate json field in postgres and got the field

I'm working with MongoDB, and I used a Mongo/Postgres wrapper.
Now I can find my tables and data.
I want to compute some statistics, but I can't reach into objects that have the json type in Postgres.
My problem is that I get the whole object as json, but I need to separate the fields.
I used this:
CREATE FOREIGN TABLE rents( _id NAME, status text, "from" json )
SERVER mongo_server
OPTIONS (database 'tr', collection 'rents');
The field "from" is an object.
I found something like this :
but nothing happened
The error (why a screenshot??) means that the data are not in valid json format.
As a first step, you could define the column as type text instead of json. Then querying the foreign table will probably work, and you can see what is actually returned and why PostgreSQL thinks that this is not valid JSON.
Maybe you can create a view on top of the foreign table that converts the value to valid JSON for further processing.
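Once the column is read as text, the returned values can also be checked outside the database. This is a minimal sketch with a hypothetical helper, `json_error`; note that Python's parser is not identical to PostgreSQL's (for instance, it accepts NaN by default), so it is only a first approximation of what PostgreSQL would reject.

```python
import json
from typing import Optional


def json_error(text: str) -> Optional[str]:
    """Return a short description of why text is not valid JSON, or None if it parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return f"{exc.msg} at position {exc.pos}"
    return None
```

Feeding it a value copied from the text column shows where parsing stops, which helps when writing the view that repairs the value.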