AWS Athena and handling JSON

I have millions of files with the following (poor) JSON format:
{
  "3000105002": [
    {
      "pool_id": "97808",
      "pool_name": "WILDCAT (DO NOT USE)",
      "status": "Zone Permanently Plugged",
      "bhl": "D-12-10N-05E 902 FWL 902 FWL",
      "acreage": ""
    },
    {
      "pool_id": "96838",
      "pool_name": "DRY & ABANDONED",
      "status": "Zone Permanently Plugged",
      "bhl": "D-12-10N-05E 902 FWL 902 FWL",
      "acreage": ""
    }
  ]
}
I've tried to generate an Athena DDL that would accommodate this type of structure (especially the api field) with this:
CREATE EXTERNAL TABLE wp_info (
api:array < struct < pool_id:string,
pool_name:string,
status:string,
bhl:string,
acreage:string>>)
LOCATION 's3://foo/'
After trying to generate a table with this, the following error is thrown:
Your query has the following error(s):
FAILED: ParseException line 2:12 cannot recognize input near ':' 'array' '<' in column type
What is a workable solution to this issue? Note that the api string is different for every one of the million files. The literal key api does not actually appear in any of the files, so I hope there is a way for Athena to accommodate just the string value of that top-level key.

If you don't have control over the JSON format that you are receiving, and you don't have a streaming service in the middle to transform the JSON format to something simpler, you can use regex functions to retrieve the relevant data that you need.
A simple way to do it is to use a CREATE TABLE AS SELECT (CTAS) query that converts the data from its complex JSON format to a simpler table format.
CREATE TABLE new_table
WITH (
    external_location = 's3://path/to/ctas_partitioned/',
    format = 'Parquet',
    parquet_compression = 'SNAPPY')
AS SELECT
    regexp_extract(line, '"pool_id": "(\d+)"', 1) AS pool_id,
    regexp_extract(line, '"pool_name": "([^"]+)"', 1) AS pool_name,
    ...
FROM json_lines_table;
Queries against the new table will also perform better, since the data is stored in Parquet format.
Note that you can also update the table when you get new data, by running the CTAS query again with external_location set to 's3://path/to/ctas_partitioned/part=01' or any other partition scheme.
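If you want to automate that refresh, a minimal sketch using boto3's Athena client might look like the following; the table name, database, result bucket, and partition value are hypothetical placeholders:
import boto3

athena = boto3.client("athena")

# hypothetical: one CTAS per new batch of files, writing under part=02 this time
ctas_query = """
CREATE TABLE new_table_part_02
WITH (
    external_location = 's3://path/to/ctas_partitioned/part=02/',
    format = 'Parquet',
    parquet_compression = 'SNAPPY')
AS SELECT
    regexp_extract(line, '"pool_id": "(\\d+)"', 1) AS pool_id,
    regexp_extract(line, '"pool_name": "([^"]+)"', 1) AS pool_name
FROM json_lines_table
"""

# start_query_execution only submits the query; poll get_query_execution
# with the returned QueryExecutionId if you need to wait for completion
athena.start_query_execution(
    QueryString=ctas_query,
    QueryExecutionContext={"Database": "my_database"},                       # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://path/to/athena-results/"},  # hypothetical results bucket
)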

PySpark Delta Lake JSON datatype evolution issue (merge incompatible exception)

I am working on PySpark (3.x) and Delta Lake, and I am facing some challenges with respect to data types.
We receive data as JSON, do some flattening on the JSON datasets, and save them as Delta tables with the "mergeSchema" option set to true, as shown below. We are not imposing any schema on the table.
df.write \
    .format("delta") \
    .partitionBy("country", "city") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save(delta_path)
The problem we are facing is that the data type of the JSON fields changes very often. For example, in the Delta table "field_1" is stored with data type StringType, but the data type of 'field_1' in the new JSON comes in as LongType. Because of this we get a merge incompatible exception.
ERROR : Failed to merge fields 'field_1' and 'field_1'. Failed to merge incompatible data types StringType and LongType
How can we handle such data type evolution in Delta tables? I don't want to handle data type changes at the field level, because we have more than 300 fields coming in as part of the JSON.
According to the article Diving Into Delta Lake: Schema Enforcement & Evolution the option mergeSchema=true can handle the following scenarios:
Adding new columns (this is the most common scenario)
Changing of data types from NullType -> any other type, or upcasts from ByteType -> ShortType -> IntegerType
The article also gives a hint on what can be done in your case (a short sketch of that overwrite follows the quoted list):
"Other changes, which are not eligible for schema evolution, require that the schema and data are overwritten by adding .option("overwriteSchema", "true"). For example, in the case where the column “Foo” was originally an integer data type and the new schema would be a string data type, then all of the Parquet (data) files would need to be re-written. Those changes include:"
Dropping a column
Changing an existing column’s data type (in place)
Renaming column names that differ only by case (e.g. “Foo” and “foo”)
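In practice, for a change like StringType to LongType this means rewriting the table. A minimal sketch, reusing df and delta_path from the question but with overwriteSchema; note that this overwrites the existing data, so it is only an option when a full rewrite is acceptable:
# Full rewrite with the new schema; mergeSchema cannot reconcile String vs Long,
# so the table data and schema are replaced in one overwrite.
df.write \
    .format("delta") \
    .partitionBy("country", "city") \
    .option("overwriteSchema", "true") \
    .mode("overwrite") \
    .save(delta_path)
If a full rewrite is not acceptable, the usual alternative is to cast the incoming columns to the types the table already uses before appending, which is roughly what the merged-schema approach in the next answer does for the common columns.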
I have also taken an approach similar to nilesh1212, that is, manually merging the schema.
In my case my script can handle nested types; it can be found here:
https://github.com/miguellobato84/spark-delta-schema-evolution
Also, I wrote this article regarding this issue:
https://medium.com/@miguellobato84/improving-delta-lake-schema-evolution-2cce8db2f0f5
In order to get my issue resolved, I have written a new function that essentially merges the schema of the Delta table (if the Delta table exists) with the JSON schema.
At a high level, I create a new schema: this new schema is essentially a combination of the common columns from the Delta Lake table and the new columns from the JSON fields. I then recreate the data frame by applying this new schema.
This has solved my issue.
from pyspark.sql.types import StructField, StructType

def get_merged_schema(delta_table_schema, json_data_schema):
    print('str(len(delta_table_schema.fields)) -> ' + str(len(delta_table_schema.fields)))
    print('str(len(json_data_schema.fields)) -> ' + str(len(json_data_schema.fields)))
    no_common_elements = False
    no_new_elements = False
    import numpy as np
    struct_field_array = []
    if len(set(delta_table_schema.names).intersection(set(json_data_schema.names))) > 0:
        common_col = set(delta_table_schema.names).intersection(set(json_data_schema.names))
        print('common_col len: -> ' + str(len(common_col)))
        # for the common columns, keep the data type the Delta table already uses
        for name in common_col:
            for f in delta_table_schema.fields:
                if f.name == name:
                    struct_field_array.append(StructField(f.name, f.dataType, f.nullable))
    else:
        no_common_elements = True
        print("no common elements")
    if len(np.setdiff1d(json_data_schema.names, delta_table_schema.names)) > 0:
        diff_list = np.setdiff1d(json_data_schema.names, delta_table_schema.names)
        print('diff_list len: -> ' + str(len(diff_list)))
        # add the columns that only exist in the incoming JSON, with their JSON data type
        for name in diff_list:
            for f in json_data_schema.fields:
                if f.name == name:
                    struct_field_array.append(StructField(f.name, f.dataType, f.nullable))
    else:
        no_new_elements = True
        print("no new elements")
    print('len(StructType(struct_field_array)) -> ' + str(len(StructType(struct_field_array))))
    df = spark.createDataFrame(spark.sparkContext.emptyRDD(), StructType(struct_field_array))
    if no_common_elements and no_new_elements:
        return StructType(None)
    else:
        return df.select(sorted(df.columns)).schema
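As a usage illustration, here is a minimal sketch of how such a helper could be wired into the append; the paths and the re-read step are hypothetical and depend on how the JSON is actually loaded:
from delta.tables import DeltaTable

json_path = "/mnt/landing/events/"   # hypothetical landing location of the raw JSON
delta_path = "/mnt/lake/events/"     # hypothetical Delta table location

json_df = spark.read.json(json_path)

if DeltaTable.isDeltaTable(spark, delta_path):
    delta_schema = spark.read.format("delta").load(delta_path).schema
    merged_schema = get_merged_schema(delta_schema, json_df.schema)
    # re-read the JSON with the merged schema so the common columns keep the
    # data types that the Delta table already uses
    json_df = spark.read.schema(merged_schema).json(json_path)

json_df.write \
    .format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save(delta_path)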

Insert JSON to Hadoop using Spark (Java)

I'm very new to Hadoop, and I'm using Spark with Java.
I have dynamic JSON, for example:
{
  "sourceCode": "1234",
  "uuid": "df123-....",
  "title": "my title"
}
{
  "myMetaDataEvent": {
    "date": "10/10/2010"
  },
  "myDataEvent": {
    "field1": {
      "field1Format": "fieldFormat",
      "type": "Text",
      "value": "field text"
    }
  }
}
Sometimes I can see only field1 and sometimes I can see field1...field50.
And maybe the user can add fields to or remove fields from this JSON.
I want to insert this dynamic JSON into Hadoop (into a Hive table) from Spark Java code. How can I do it?
I want the user to be able to run Hive queries afterwards, e.g.: select * from MyTable where type="Text"
I have around 100B JSON records per day that I need to insert into Hadoop, so what is the recommended way to do that?
I've looked at the following: SO Question, but that assumes a known JSON schema, which isn't my case.
Thanks
I had encountered a similar kind of problem, and I was able to resolve it this way (so this might help if you create the schema before you parse the JSON).
For a field with a string data type, you could create the field as:
StructField field = DataTypes.createStructField(<name of the field>, DataTypes.StringType, true);
For a field with an int data type, you could create the field as:
StructField field = DataTypes.createStructField(<name of the field>, DataTypes.IntegerType, true);
Then add all the fields to a List<StructField>, e.g.:
List<StructField> innerField = new ArrayList<StructField>();
.... Field adding logic ....
innerField.add(field1);
innerField.add(field2);
// If the value can arrive as a single instance or as multiple instances in an array, it needs to be wrapped in an ArrayType.
ArrayType getArrayInnerType = DataTypes.createArrayType(DataTypes.createStructType(innerField));
StructField getArrayField = DataTypes.createStructField(<name of field>, getArrayInnerType,true);
You can then create the schema (createStructType expects a list or array of fields):
StructType structuredSchema = DataTypes.createStructType(new StructField[]{ getArrayField });
Then read the JSON using the generated schema with the Dataset API:
Dataset<Row> dataRead = sqlContext.read().schema(structuredSchema).json(fileName);

AWS Glue Crawler for JSONB column in PostgreSQL RDS

I've created a crawler that looks at a PostgreSQL 9.6 RDS table with a JSONB column, but the crawler identifies the column type as "string". When I then try to create a job that loads data from a JSON file on S3 into the RDS table, I get an error.
How can I map a JSON file source to a JSONB target column?
It's not quite a direct copy, but an approach that has worked for me is to define the column on the target table as TEXT. After the Glue job populates the field, I then convert it to JSONB. For example:
alter table postgres_table
alter column column_with_json set data type jsonb using column_with_json::jsonb;
Note the use of the cast for the existing text data. Without that, the alter column would fail.
The crawler will identify the JSONB column type as "string", but you can try to use the Unbox class in Glue to convert this column to JSON.
Let's check the following table in PostgreSQL:
create table persons (id integer, person_data jsonb, creation_date timestamp )
Here is an example of one record from the persons table:
ID = 1
PERSON_DATA = {
  "firstName": "Sergii",
  "age": 99,
  "email": "Test@test.com"
}
CREATION_DATE = 2021-04-15 00:18:06
The following code needs to be added in Glue:
# 1. create a dynamic frame from the catalog
df_persons = glueContext.create_dynamic_frame.from_catalog(database = "testdb", table_name = "persons", transformation_ctx = "df_persons")
# 2. in 'path' put the jsonb column name that needs to be converted to json
df_persons_json = Unbox.apply(frame = df_persons, path = "person_data", format = "json")
# 3. convert the dynamic frame to a data frame
datf_persons_json = df_persons_json.toDF()
# 4. after that you can process this column as a json data type, or build a dataframe with all the necessary columns; each json element can be added as a separate column:
final_df_person = datf_persons_json.select("id", "person_data.age", "person_data.firstName", "creation_date")
You can also check the following link:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-Unbox.html

Parse complex JSON string contained in Hadoop

I want to parse a string of complex JSON in Pig. Specifically, I want Pig to understand my JSON array as a bag instead of as a single chararray. I found that complex JSON can be parsed using Twitter's Elephant Bird or Mozilla's Akela library. (I found some additional libraries, but I cannot use a 'Loader'-based approach since I use HCatalog Loader to load data from Hive.)
But the problem is the structure of my data; each value of the map structure contains the value part of a complex JSON document. For example:
1. My table looks like this (WARNING: the type of 'complex_data' is not STRING but a MAP<STRING, STRING>!):
TABLE temp_table
(
user_id BIGINT COMMENT 'user ID.',
complex_data MAP <STRING, STRING> COMMENT 'complex json data'
)
COMMENT 'temp data.'
PARTITIONED BY(created_date STRING)
STORED AS RCFILE;
2. And 'complex_data' contains (the values that I want to get are marked with two *s, so basically #'d'#'f' from each PARSED_STRING(complex_data#'c')):
{ "a": "[]",
"b": "\"sdf\"",
"**c**":"[{\"**d**\":{\"e\":\"sdfsdf\"
,\"**f**\":\"sdfs\"
,\"g\":\"qweqweqwe\"},
\"c\":[{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"}]
},
{\"**d**\":{\"e\":\"sdfsdf\"
,\"**f**\":\"sdfs\"
,\"g\":\"qweqweqwe\"},
\"c\":[{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"}]
},]"
}
3. So, I tried... (same approach for Elephant Bird)
REGISTER '/path/to/akela-0.6-SNAPSHOT.jar';
DEFINE JsonTupleMap com.mozilla.pig.eval.json.JsonTupleMap();
data = LOAD temp_table USING org.apache.hive.hcatalog.pig.HCatLoader();
values_of_map = FOREACH data GENERATE complex_data#'c' AS attr:chararray; -- IT WORKS
-- dump values_of_map shows the correct chararray data per row, e.g.:
-- ([{"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
--   {"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
--   {"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... }])
-- ([{"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
--   {"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
--   {"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... }]) ...
attempt1 = FOREACH data GENERATE JsonTupleMap(complex_data#'c'); -- THIS LINE CAUSES AN ERROR
attempt2 = FOREACH data GENERATE JsonTupleMap(CONCAT(CONCAT('{\\"key\\":', complex_data#'c'), '}')); -- IT ALSO DOES NOT WORK
I guessed that "attempt1" failed because the value doesn't contain full JSON. However, when I CONCAT as in "attempt2", I generate an additional \ mark (so each line starts with {\"key\":). I'm not sure whether these additional marks break the parsing rule or not. In any case, I want to parse the given JSON string so that Pig can understand it. If you have any method or solution, please feel free to let me know.
I finally solved my problem by using the jyson library with a Jython UDF.
I know that I could solve it using Java or other languages, but I think that Jython with jyson is the simplest answer to this issue.
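The answer does not include the code, so here is a minimal sketch of what such a Jython UDF with jyson could look like. The UDF and output field names are hypothetical and follow the #'d'#'f' example above; it assumes the jyson jar has been REGISTERed so it is on the classpath, and the @outputSchema decorator is provided by Pig's Jython script engine:
# extract_udf.py -- register in the Pig script with:
#   REGISTER '/path/to/jyson-1.0.2.jar';
#   REGISTER 'extract_udf.py' USING jython AS myudfs;
from com.xhaus.jyson import JysonCodec as json

@outputSchema('values: {(f_value: chararray)}')
def extract_d_f(json_str):
    # json_str is the chararray taken from complex_data#'c'
    if json_str is None:
        return None
    items = json.loads(json_str)            # the outer JSON array becomes a list of dicts
    out = []
    for item in items:
        d = item.get('d')
        if d is not None:
            out.append((d.get('f'),))       # one-field tuple per bag element
    return out
It would then be called in the Pig script as: values = FOREACH data GENERATE myudfs.extract_d_f(complex_data#'c');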

MySQL query to retrieve tabular data in JSON format

I have a table like the one below in a MySQL database:
user-name   mail
ganesh      g@g.com
gani        gani@gani.com
gan         gan@gan.com
I need a query to retrieve the above table in JSON format.
Example:
[{
  "user-name": "ganesh",
  "mail": "g@g.com"
},
{
  "user-name": "gani",
  "mail": "gani@gani.com"
},
{
  "user-name": "gan",
  "mail": "gan@gan.com"
}]
I need help to do the above.
It's not recommended to do such things in the DBMS; do it in the script that is loading the data instead. If you're wrapping some legacy code you can't edit, then wrap it with more code to format the data.
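For the script route, a minimal sketch in Python, assuming the mysql-connector-python driver and hypothetical connection details and table name:
import json
import mysql.connector   # assumes the mysql-connector-python package is installed

conn = mysql.connector.connect(
    host="localhost", user="app", password="secret", database="mydb"   # hypothetical credentials
)
cursor = conn.cursor(dictionary=True)   # rows come back as dicts keyed by column name
cursor.execute("SELECT `user-name`, mail FROM users")   # 'users' is a hypothetical table name

# json.dumps turns the list of row dicts directly into the requested JSON array
print(json.dumps(cursor.fetchall(), indent=2))

cursor.close()
conn.close()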
If all that fails, do something like this: http://www.thomasfrank.se/mysql_to_json.html
SELECT CONCAT(
    "[",
    GROUP_CONCAT(
        CONCAT("{username:'", username, "'"),
        CONCAT(",email:'", email, "'}")
    ),
    "]"
) AS json
FROM users;