AWS Glue Crawler for JSONB column in PostgreSQL RDS

I've created a crawler that looks at a PostgreSQL 9.6 RDS table with a JSONB column, but the crawler identifies the column type as "string". When I then try to create a job that loads data from a JSON file on S3 into the RDS table, I get an error.
How can I map a JSON file source to a JSONB target column?

It's not quite a direct copy, but an approach that has worked for me is to define the column on the target table as TEXT. After the Glue job populates the field, I then convert it to JSONB. For example:
alter table postgres_table
alter column column_with_json set data type jsonb using column_with_json::jsonb;
Note the use of the cast for the existing text data. Without that, the alter column would fail.
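For completeness, here is a minimal sketch of what the Glue side of that approach could look like, assuming the crawled catalog entries are named testdb, s3_json_source and postgres_table (all hypothetical names, not from the original answer). The job serializes the JSON struct to a string so it fits the TEXT column, and the ALTER above is run afterwards:

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql.functions import to_json

glueContext = GlueContext(SparkContext.getOrCreate())

# read the JSON file that was crawled from S3 (names are assumptions)
source = glueContext.create_dynamic_frame.from_catalog(
    database="testdb", table_name="s3_json_source", transformation_ctx="source")

# serialize the struct column to a JSON string so it fits the TEXT target column
df = source.toDF().withColumn("column_with_json", to_json("column_with_json"))

# write into the crawled RDS table whose JSON column is still declared as TEXT
glueContext.write_dynamic_frame.from_catalog(
    frame=DynamicFrame.fromDF(df, glueContext, "sink"),
    database="testdb", table_name="postgres_table", transformation_ctx="sink")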

The crawler will identify the JSONB column type as "string", but you can use the Unbox class in Glue to convert this column to JSON.
Let's take the following table in PostgreSQL:
create table persons (id integer, person_data jsonb, creation_date timestamp);
Here is an example of one record from the persons table:
ID = 1
PERSON_DATA = {
    "firstName": "Sergii",
    "age": 99,
    "email": "Test#test.com"
}
CREATION_DATE = 2021-04-15 00:18:06
The following code needs to be added in the Glue job:
# 1. create a dynamic frame from the catalog
df_persons = glueContext.create_dynamic_frame.from_catalog(database = "testdb", table_name = "persons", transformation_ctx = "df_persons")
# 2. in 'path', give the name of the JSONB column that needs to be converted to JSON
df_persons_json = Unbox.apply(frame = df_persons, path = "person_data", format = "json")
# 3. convert the dynamic frame to a Spark data frame
datf_persons_json = df_persons_json.toDF()
# 4. after that you can process this column as JSON, or build a data frame with all necessary columns; each JSON element can be selected as a separate column:
final_df_person = datf_persons_json.select("id", "person_data.age", "person_data.firstName", "creation_date")
You can also check the following link:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-Unbox.html
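As a follow-up, once the JSON elements are available as separate columns you can convert the DataFrame back to a DynamicFrame and write it wherever you need it. A rough sketch (the S3 path is a placeholder, not from the original answer):

from awsglue.dynamicframe import DynamicFrame

# convert back so the Glue writers can be used
final_dyf_person = DynamicFrame.fromDF(final_df_person, glueContext, "final_dyf_person")

# write the flattened records out, e.g. to S3 as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=final_dyf_person,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/persons_flattened/"},
    format="parquet")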

Related

postgres store reference to field in json

It is possible to store json in postgres using the json data type. Check this tutorial for an introduction: http://www.postgresqltutorial.com/postgresql-json/
Consider I am storing the following json in such a field:
{
    "address": {
        "street1": "123 seasame st"
    }
}
I want to store separately a reference to the street field. For example, I might have another object which is using data from this json structure and wants to store a reference to where it got the data. Maybe something like this:
class Product():
    __tablename__ = 'Address'
    street_1 = Column(String)
    data_source = ?
Now I could make data_source a string and just store namespaces like address.street, but if I did this postgres has no idea what that means. Working with that in queries would mean parsing the string and other inefficient stuff. Does postgres support referring to fields stored inside json data structures?
This question is related to JSON foreign keys in PostgreSQL, but in this case I don't necessarily want a fk relationship. I just want to create a reference, which is not necessarily enforced in the way a fk is.
Update:
To be more clear, I want to reference the location of something in the json structure on another attribute and store that reference in a column. In the below code, Address.data_source is a reference to the location of the street data (for example address.street1 in this case).
class Address():
    __tablename__ = 'Address'
    street_1 = Column(String)
    sample_id = Column(Integer, ForeignKey('DataSample.uid'))
    data_source = ?

class DataSample():
    __tablename__ = 'DataSample'
    uid = Column(Integer, primary_key=True)
    data = Column(JSONB)

body = {
    "address": {
        "street1": "123 seasame st"
    }
}
datasample = DataSample(data=body)
address = Address(street_1=datasample.data['address']['street_1'],
                  sample_id=datasample.uid,
                  data_source=?)
As clarified, the question is seeking a way to flexibly specify a path within a JSON object of a particular record. Keys are being handled in normal columns. Constraints on JSONB fields are not available, and there is no specific support for specifying paths within JSON objects.
I worked with the following in SQL Fiddle using PostgreSQL 9.6:
CREATE TABLE datasample (
id integer PRIMARY KEY,
data jsonb
);
CREATE TABLE address (
id integer PRIMARY KEY,
street_1 text,
sample_id integer REFERENCES datasample (id),
data_source text
);
INSERT INTO datasample(id, data)
VALUES (1, '{"address":{"street_1": "123 seasame st"}}');
INSERT INTO address(id,street_1, sample_id, data_source)
VALUES (1,'123 seasame st',1,'datasample.data->''address''->>''street''');
A typical lookup of the street address (needed to retrieve street_1) would resemble:
SELECT datasample.data->'address'->>'street_1'
FROM datasample
WHERE id=1;
There is no special postgres type for identifying columns. Strings are the closest available, and you will need to retrieve the string (or array of strings, or object containing strings, if one of those simplifies parsing) and use it to build the query. In the first code block, I stored it as the (escaped) query fragment 'datasample.data->''address''->>''street'''. Though longer, it would require only retrieval and unescaping to use in a new custom query. I did not find a way to use the string as a fragment within the same SQL statement, though it might be possible to combine it with other bits of text to form a full statement that could be run through EXECUTE.
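To illustrate the "retrieve the string and use it to build the query" part, here is a minimal client-side sketch using psycopg2 (my own addition, not part of the original answer; connection details are placeholders, and it assumes the stored fragment is only ever written by trusted application code):

import psycopg2

# connection parameters are placeholders
conn = psycopg2.connect(dbname="mydb", user="me", password="secret", host="localhost")
cur = conn.cursor()

# 1. fetch the stored path fragment and the referenced sample for an address row
cur.execute("SELECT data_source, sample_id FROM address WHERE id = %s", (1,))
data_source, sample_id = cur.fetchone()

# 2. splice the fragment into a new query; safe only because data_source
#    is written by our own code, never by end users
query = "SELECT {} FROM datasample WHERE id = %s".format(data_source)
cur.execute(query, (sample_id,))
print(cur.fetchone()[0])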

Insert JSON to Hadoop using Spark (Java)

I'm very new to Hadoop,
and I'm using Spark with Java.
I have dynamic JSON, for example:
{
    "sourceCode":"1234",
    "uuid":"df123-....",
    "title":"my title"
}{
    "myMetaDataEvent": {
        "date":"10/10/2010",
    },
    "myDataEvent": {
        "field1": {
            "field1Format":"fieldFormat",
            "type":"Text",
            "value":"field text"
        }
    }
}
Sometimes I can see only field1 and sometimes I can see field1...field50
And maybe the user can add fields/remove fields from this JSON.
I want to insert this dynamic JSON into Hadoop (into a Hive table) from Spark Java code.
How can I do it?
I want the user to be able to run Hive queries afterwards, e.g.: select * from MyTable where type="Text"
I have around 100B JSON records per day that I need to insert into Hadoop,
so what is the recommended way to do that?
* I've looked at the following: SO Question, but that assumes a known JSON schema, which isn't my case.
Thanks
I encountered a similar problem and was able to resolve it using the approach below (so this might help if you create the schema before you parse the JSON).
For a field with a string data type you could create the schema like this:
StructField field = DataTypes.createStructField(<name of the field>, DataTypes.StringType, true);
For a field with an int data type you could create the schema like this:
StructField field = DataTypes.createStructField(<name of the field>, DataTypes.IntegerType, true);
After you have added all the fields to a List<StructField>,
e.g.:
List<StructField> innerField = new ArrayList<StructField>();
.... Field adding logic ....
e.g.:
innerField.add(field1);
innerField.add(field2);
// One instance may come, or multiple instances of the value may come in an array; in that case the struct needs to be wrapped in an ArrayType.
ArrayType getArrayInnerType = DataTypes.createArrayType(DataTypes.createStructType(innerField));
StructField getArrayField = DataTypes.createStructField(<name of field>, getArrayInnerType,true);
You can then create the schema:
StructType structuredSchema = DataTypes.createStructType(Arrays.asList(getArrayField));
Then I read the JSON using the generated schema via the Dataset API:
Dataset<Row> dataRead = sqlContext.read().schema(structuredSchema).json(fileName);
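The question was asked about Java, but for completeness here is a rough PySpark sketch of the same idea (build the schema up front, read the JSON with it, then append into a Hive table); the field names, file path and table name are assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

spark = SparkSession.builder.appName("json-to-hive").enableHiveSupport().getOrCreate()

# inner struct built field by field, mirroring the Java example above
inner_fields = StructType([
    StructField("field1Format", StringType(), True),
    StructField("type", StringType(), True),
    StructField("value", StringType(), True),
])

# wrap in an ArrayType when multiple instances can arrive together
schema = StructType([
    StructField("myDataEvent", ArrayType(inner_fields), True),
])

# read with the explicit schema and append into a Hive table (names are hypothetical)
df = spark.read.schema(schema).json("/data/events/*.json")
df.write.mode("append").saveAsTable("mydb.my_events")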

Deedle - how to use 'ParseExact' within the Frame.ReadCsv schema

I have a CSV file of data in the form
21.06.2016 23:00:00.349, 153.461, 153.427
21.06.2016 23:00:00.400, 153.460, 153.423
etc
The initial step of creating a frame involves the optional inclusion of a 'schema' to specify or rename column headers and specify types:
let df = Frame.ReadCsv(__SOURCE_DIRECTORY__ + "/data/GBPJPY.csv", hasHeaders=true, inferTypes=false, schema="TS (DateTimeOffset), Bid (float(3)), Ask (float(3))")
I would like to specify the first column of string values to be ParseExact'ed to DateTimeOffset of the format
"dd.mm.yyyy HH:mm:ss.fff"
(I'm assuming the use of the setting System.Globalization.CultureInfo.InvariantCulture).
How do I express the schema such that it will parse the datetime string in that first column within Frame.ReadCsv("file.csv", schema = ........ )? Or is this not possible to accomplish within the schema argument?

Hive json serde - Keys with white spaces not populating

I have an external table that is built off of a json file.
All of the json keys are columns and are populated as expected except for one key that has a space.
Here is the DDL:
CREATE EXTERNAL TABLE foo.bar
( event ARRAY<STRUCT<
    value:STRING
    ,info:STRUCT<
        id:STRING
        ,event_source:STRING>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES("mapping.event_source"="event source")
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'foo/bar'
All of the values show up as expected except for event_source, which shows up as NULL. The original form of event_source in the json file is 'event source' without the single quotes.
Is there something I need to do different with the WITH SERDEPROPERTIES setting in order to get the key to work properly?
Thanks
You mean that the JSON has data like
{ id: "myid", event source: "eventsource" }
If so, there's not much that can be done since it's simply broken JSON.
If not, can you post a sample of the JSON you're trying to read?
I have encountered a similar problem to the one above, but with the slight variation that the input data is valid JSON.
I have an external table that is built off of a json file. All of the json keys are populated except one, msrp_currency.
Here is the DDL:
CREATE EXTERNAL TABLE foo.bar
( id string,
variants array<struct<pid:string, msrp_currency:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( "ignore.malformed.json" = "true" ,
'mapping.variants.msrp_currency' = 'variants.msrpcurrency')
LOCATION 'foo/bar'
All of the values show up as expected except for msrp_currency, which shows up as NULL. The reason I need to introduce the underscore is that later I need to extract the same field value as msrpCurrency using the Brickhouse to_json UDF.
sample values:
{ "pid": "mypid", "msrpCurrency": "USD" }

issue with Hive Serde dealing nested structs

I am trying to load a huge volume of JSON data with a nested structure into Hive using a JSON SerDe. Some of the field names in the nested structure start with $. I am mapping Hive field names using SerDe properties, but when I query the table I get NULL in the field starting with $. I've tried different syntaxes, but no luck.
Sample JSON:
{
    "_id" : "319FFE15FF90",
    "SomeThing" :
    {
        "$SomeField" : 22,
        "AnotherField" : 2112,
        "YetAnotherField": 1
    }
    . . . etc . . . .
Using a schema as follows:
create table testSample
(
`_id` string,
something struct
<
$somefield:int,
anotherfield:bigint,
yetanotherfield:int
>
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties
(
"mapping.somefield" = "$somefield"
);
This schema builds OK; however, somefield (the one starting with $) in the above table always returns null (all the other values exist and are correct).
We've been trying a lot of syntax combinations, but to no avail.
Does anyone know the trick to map a nested field with a leading $ in its name?
You almost got it right. Try creating the table like this.
The mistake you're making is that the SerDe property mapping (mapping.somefield = "$somefield") says "when looking for the Hive column named 'somefield', look for the JSON field '$somefield'", but in Hive you defined the column with the dollar sign, which, if not outright illegal, is certainly not best practice in Hive.
create table testSample
(
`_id` string,
something struct
<
somefield:int,
anotherfield:bigint,
yetanotherfield:int
>
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties
(
"mapping.somefield" = "$somefield"
);
I tested it with some test data:
{ "_id" : "123", "something": { "$somefield": 12, "anotherfield":13,"yetanotherfield":100}}
hive> select something.somefield from testSample;
OK
12
I am suddenly starting to see this problem too, but for normal column names as well (no special characters such as $).
I am populating an external table (Temp) from another internal table (Table2) and want the output of the Temp table in JSON format. I want column names in camel case in the output JSON file, so I am also using the SerDe properties in the Temp table to specify the correct names. However, I am seeing that when I do Select * from the Temp table, it gives NULL values for the columns whose names have been used in the mapping.
I am running Hive 0.13. Here are the commands:
Create table command:
CREATE EXTERNAL TABLE Temp (
data STRUCT<
customerId:BIGINT, region:STRING, marketplaceId:INT, asin:ARRAY<STRING>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'mapping.customerid' = 'customerId',
'mapping.marketplaceid' = 'marketplaceId'
)
LOCATION '/output';
INSERT INTO TABLE Temp
SELECT
named_struct ('customerId',customerId, 'region', region, 'marketplaceId', marketplaceId, 'asin', asin)
FROM Table2;
Select * from Temp:
{"customerid":null,"region":"EU","marketplaceid":null,"asin":["B000FC1PZC"]}
{"customerid":null,"region":"EU","marketplaceid":null,"asin":["B000FC1C9G"]}
See how "customerid" and "marketplaceid" are null. Generated JSON file is:
{"data":{"region":"EU","asin":["B000FC1PZC"]}}
{"data":{"region":"EU","asin":["B000FC1C9G"]}}
Now, if I remove the WITH SERDEPROPERTIES clause, the table starts returning all values:
{"customerid":1,"region":"EU","marketplaceid":4,"asin":["B000FC1PZC"]}
{"customerid":2,"region":"EU","marketplaceid":4,"asin":["B000FC1C9G"]}
And then the JSON file so generated is:
{"data":{"region":"EU","marketplaceid":4,"asin":["B000FC1PZC"],"customerid":1}}
{"data":{"region":"EU","marketplaceid":4,"asin":["B000FC1C9G"],"customerid":2}}