Create Table in Athena From Nested JSON - json

I have nested JSON of type
[{
"emails": [{
"label": "",
"primary": "",
"relationdef_id": "",
"type": "",
"value": ""
}],
"licenses": [{
"allocated": "",
"parent_type": "",
"parentid": "",
"product_type": "",
"purchased_license_id": "",
"service_type": ""
}, {
"allocated": "",
"parent_type": "",
"parentid": "",
"product_type": "",
"purchased_license_id": "",
"service_type": ""
}]
}, {
"emails": [{
"label": "",
"primary": "",
"relationdef_id": "",
"type": "",
"value": ""
}],
"licenses": [{
"allocated": "2016-04-26 01:46:26",
"parent_type": "",
"parentid": "",
"product_type": "",
"purchased_license_id": "",
"service_type": ""
}]
}]
which Athena is not able to turn into a table.
I have also tried changing it to a sequence of separate objects:
{
"emails": [{
"label": "",
"primary": "",
"relationdef_id": "",
"type": "",
"value": ""
}
],
"licenses": [{
"allocated": "",
"parent_type": "",
"parentid": "",
"product_type": "",
"purchased_license_id": "",
"service_type": ""
},{
"allocated": "",
"parent_type": "",
"parentid": "",
"product_type": "",
"purchased_license_id": "",
"service_type": ""
}
]
}
{
"emails": [{
"label": "",
"primary": "",
"relationdef_id": "",
"type": "",
"value": ""
}
],
"licenses": [{
"allocated": "",
"parent_type": "",
"parentid": "",
"product_type": "",
"purchased_license_id": "",
"service_type": ""
}
]
}
{
"emails": [{
"label": "",
"primary": "",
"relationdef_id": "",
"type": "",
"value": ""
}
],
"licenses": [{
"allocated": "",
"parent_type": "",
"parentid": "",
"product_type": "",
"purchased_license_id": "",
"service_type": ""
}
]
}
with Query:
CREATE EXTERNAL TABLE `test_orders1`(
`emails` array<struct<`label`: string, `primary`: string,`relationdef_id`: string,`type`: string, `value`: string>>,
`licenses` array<struct<`allocated`: string, `parent_type`: string, `parentid`: string, `product_type`: string,`purchased_license_id`: string, `service_type`: string>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
but only 1 row is created.
Is there a way to use nested JSON of type JSONArray in an Athena table?
Or how can I change the nested JSON so that it works for me?

When querying JSON data, Athena requires the files to be formatted with one JSON document per line. It's unclear from your question whether this is the case: the examples you give are multiline, but perhaps that's only to make the question clearer.
The table DDL you include looks like it should work on the second example data, provided that it is formatted as one document per line, e.g.
{"emails": [{"label": "", "primary": "", "relationdef_id": "", "type": "", "value": ""}], "licenses": [{"allocated": "", "parent_type": "", "parentid": "", "product_type": "", "purchased_license_id": "", "service_type": ""}, { "allocated": "", "parent_type": "", "parentid": "", "product_type": "", "purchased_license_id": "", "service_type": ""}]}
{"emails": [{"label": "", "primary": "", "relationdef_id": "", "type": "", "value": ""}], "licenses": [{"allocated": "", "parent_type": "", "parentid": "", "product_type": "", "purchased_license_id": "", "service_type": ""}]}
{"emails": [{"label": "", "primary": "", "relationdef_id": "", "type": "", "value": ""}], "licenses": [{"allocated": "", "parent_type": "", "parentid": "", "product_type": "", "purchased_license_id": "", "service_type": ""}]}
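If the data currently exists as a single JSON array (the first example in the question), one way to get the one-document-per-line layout is to rewrite the file before uploading it. A minimal sketch in Python, assuming the whole array fits in memory; the function name to_json_lines is hypothetical:

```python
import json

def to_json_lines(array_text):
    """Convert the text of a JSON array into one compact JSON document per line."""
    records = json.loads(array_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # json.dumps without indent never emits raw newlines, so each record stays on one line.
    return "\n".join(json.dumps(record) for record in records) + "\n"

if __name__ == "__main__":
    sample = '[{"emails": [{"value": ""}], "licenses": []}, {"emails": [], "licenses": [{"allocated": "2016-04-26 01:46:26"}]}]'
    print(to_json_lines(sample), end="")
```

Each output line is then a complete document that the JSON SerDe can read as one row.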

Related

how to give a json with about 1000 fields random values

I have a json with about 1000 fields like below
{
"Order": {
"AdjustmentInvoicePending": "",
"AllAddressesVerified": "",
"AllocationRuleID": "",
"AuthorizationExpirationDate": "",
"BillToKey": "",
"CarrierAccountNo": "",
"ChainType": "",
"ChargeActualFreightFlag": "",
"ComplimentaryGiftBoxQty": "",
"ContactKey": "",
"CustomerAge": "",
"CustomerEMailID": "",
"CustomerFirstName": "",
"CustomerLastName": "",
"CustomerPhoneNo": "",
"CustomerZipCode": "",
"Division": "",
"DocumentType": "",
"DraftOrderFlag": "",
"EnterpriseCode": "",
"EntryType": "",
"FreightTerms": "",
"HasDeliveryLines": "",
"HasDerivedChild": "",
"HasDerivedParent": "",
"HasProductLines": "",
"HoldFlag": "",
"HoldReasonCode": "",
"InvoiceComplete": "",
"isHistory": "",
"MaxOrderStatus": "",
"MaxOrderStatusDesc": "",
"MinOrderStatus": "",
"MinOrderStatusDesc": "",
"NoOfAuthStrikes": "",
"NotifyAfterShipmentFlag": "",
"OrderClosed": "",
"OrderDate": "",
"OrderHeaderKey": "",
"OrderNo": "",
"OrderType": "",
"OriginalTax": "",
"OriginalTotalAmount": "",
"OtherCharges": "",
"PaymentRuleId": "",
"PaymentStatus": "",
"PendingTransferIn": "",
"PriorityNumber": "",
"ReqCancelDate": "",
"ReturnByGiftRecipient": "",
"SaleVoided": "",
"ScacAndService": "",
"ScacAndServiceKey": "",
"SellerOrganizationCode": "",
"ShipToKey": "",
"SourcingClassification": "",
"Status": "",
"Tax": "",
"TotalAdjustmentAmount": "",
"AdditionalAddresses": {
"AdditionalAddress": [
{
"AddressType": "",
"EntityKey": "",
"PersonInfoKey": "",
"PersonInfo": {
"AddressID": "",
"AddressLine1": "",
"AddressLine2": "",
"AddressLine3": "",
"AddressLine4": "",
"AddressLine5": "",
"AddressLine6": "",
"AddressType": "",
"AlternateEmailID": "",
"Beeper": "",
"City": "",
"Company": "",
"Country": "",
"DayFaxNo": "",
"DayPhone": "",
"Department": "",
"EMailID": "",
"EveningFaxNo": "",
"EveningPhone": "",
"FirstName": "",
"IsAddressVerified": "",
"IsCommercialAddress": "",
"JobTitle": "",
"LastName": "",
"Latitude": "",
"Longitude": "",
"MiddleName": "",
"MobilePhone": "",
"OtherPhone": "",
"PersonID": "",
"PersonInfoKey": "",
"State": "",
"Suffix": "",
"TaxGeoCode": "",
"Title": "",
"ZipCode": ""
}
}
]
}
I want an output like "AdjustmentInvoicePending": "WQEMQAMCDQ", with all null (empty) spots replaced by random unique values that change each time I run the code.
I tried using Python: I converted the JSON and assigned it to a variable, but I am stuck from there. I am new to programming and this is my first time posting on slack, so please go easy on me if anything is unclear, and please forgive any typos.
I recommend just using nested loops to iterate through and add random values. A snippet is below; you'll need to adapt it to work precisely for your problem. With that said, you can't guarantee both randomness and uniqueness. For most uses you can get by with a pseudorandom generator, but there is always a chance of collisions when dealing with randomness.
Also, this only fills plain string fields one level down. It skips the nested dictionaries and the list embedded in them, so you'll need to add some extra control-flow logic to deal with that circumstance, but this should help with getting you started.
import string
import random
N = 10 # length of the string
for key, inner in json_var.items(): # json_var is your parsed JSON (a dict)
    for field, value in inner.items():
        if isinstance(value, str): # nested dicts and lists are left untouched
            inner[field] = "".join(random.choices(string.ascii_uppercase, k=N))
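To cover the nested dictionaries and the embedded list mentioned above, the same idea can be written recursively. This is a sketch rather than a drop-in solution: fill_random is a hypothetical helper, it treats only empty strings as the spots to fill, and it remembers the values it has already produced so that no two fields share a value within one run.

```python
import random
import string

def fill_random(node, used=None, n=10):
    """Replace every empty string in a nested dict/list structure, in place."""
    if used is None:
        used = set()
    # Dicts and lists expose (key, value) pairs differently; normalise them.
    items = node.items() if isinstance(node, dict) else enumerate(node)
    for key, value in list(items):
        if isinstance(value, (dict, list)):
            fill_random(value, used, n)
        elif value == "":
            candidate = "".join(random.choices(string.ascii_uppercase, k=n))
            while candidate in used:  # retry on the (unlikely) collision
                candidate = "".join(random.choices(string.ascii_uppercase, k=n))
            used.add(candidate)
            node[key] = candidate
    return node

if __name__ == "__main__":
    order = {"Order": {"Status": "", "AdditionalAddresses": {"AdditionalAddress": [{"City": ""}]}}}
    print(fill_random(order))
```

Because the values come from random.choices, they differ between runs unless the generator is seeded.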

Making blocks optional and conditional entry where there is no header name

I am trying to write a json schema to import a csv file. A simplified version of the file is:
000,OUTCOME,111,20220320,FT
001,A.Name1,,,A,
002,A.Name2,,20220101,C,
999,2
000 is a header block.
001 and 002 are the body and can have multiple instances. Structurally they are the same.
999 is the trailer.
The schema I have is:
{
"conf_name": "FileTest",
"has_blocks": true,
"has_headers": false,
"enabled_extensions": [
{
"extension": "csv",
"delimiter": ","
}
],
"file_type": "FileTest",
"columns": [
{
"blockName": "000",
"name": "",
"startPosition": "",
"maxLength": "",
"minLength": "",
"data": {
"mandatory": "M",
"type": "text",
"format": "",
"values": [
"OUTCOME"
],
"conditional": ""
}
},
{
"blockName": "000",
"name": "",
"startPosition": "",
"maxLength": "",
"minLength": "",
"data": {
"mandatory": "M",
"type": "int",
"format": "",
"values": "",
"conditional": ""
}
},
{
"blockName": "000",
"name": "",
"startPosition": "",
"maxLength": "",
"minLength": "",
"data": {
"mandatory": "M",
"type": "date",
"format": "yyyyMMdd",
"values": "",
"conditional": ""
}
},
{
"blockName": "000",
"name": "",
"startPosition": "",
"maxLength": "",
"minLength": "",
"data": {
"mandatory": "M",
"type": "text",
"format": "",
"values": "",
"conditional": ""
}
},
{
"blockName": "001",
"name": "",
"startPosition": "",
"maxLength": "",
"minLength": "",
"data": {
"mandatory": "M",
"type": "text",
"format": "",
"values": "",
"conditional": ""
}
},
{
"blockName": "001",
"name": "",
"startPosition": "",
"maxLength": "",
"minLength": "",
"data": {
"mandatory": "O",
"type": "text",
"format": "",
"values": "",
"conditional": ""
}
},
{
"blockName": "001",
"name": "",
"startPosition": "",
"maxLength": "",
"minLength": "",
"data": {
"mandatory": "C",
"type": "date",
"format": "yyyyMMdd",
"values": "",
"conditional": {
"columnName": "Column5",
"operation": "==C THEN NOTEMPTY",
"value": ""
}
}
},
{
"blockName": "001",
"name": "Column5",
"startPosition": "",
"maxLength": "",
"minLength": "",
"data": {
"mandatory": "M",
"type": "text",
"format": "",
"values": [
"A",
"B",
"C"
],
"conditional": ""
}
},
{
"blockName": "001",
"name": "",
"startPosition": "",
"maxLength": "",
"minLength": "",
"data": {
"mandatory": "O",
"type": "text",
"format": "",
"values": [
"Y",
"N"
],
"conditional": ""
}
},
{
"blockName": "002",
"name": "",
"startPosition": "",
"maxLength": "",
"minLength": "",
"data": {
"mandatory": "M",
"type": "text",
"format": "",
"values": "",
"conditional": ""
}
},
{
"blockName": "002",
"name": "",
"startPosition": "",
"maxLength": "",
"minLength": "",
"data": {
"mandatory": "O",
"type": "text",
"format": "",
"values": "",
"conditional": ""
}
},
{
"blockName": "002",
"name": "",
"startPosition": "",
"maxLength": "",
"minLength": "",
"data": {
"mandatory": "C",
"type": "date",
"format": "yyyyMMdd",
"values": "",
"conditional": {
"columnName": "Column5",
"operation": "==C THEN NOTEMPTY",
"value": ""
}
}
},
{
"blockName": "002",
"name": "",
"startPosition": "",
"maxLength": "",
"minLength": "",
"data": {
"mandatory": "M",
"type": "text",
"format": "",
"values": [
"A",
"B",
"C"
],
"conditional": ""
}
},
{
"blockName": "002",
"name": "",
"startPosition": "",
"maxLength": "",
"minLength": "",
"data": {
"mandatory": "O",
"type": "text",
"format": "",
"values": [
"Y",
"N"
],
"conditional": ""
}
},
{
"blockName": "999",
"name": "",
"startPosition": "",
"maxLength": "",
"minLength": "",
"data": {
"mandatory": "M",
"type": "int",
"format": "",
"values": "",
"conditional": ""
}
}
]
}
I apologise now if that is an affront to JSON users everywhere!
This works if there are both 001 and 002 lines in the file, but I've now been told that there may be times when the file is all 001 or all 002 but not both. Is there a way to make them optional? "oneOf" seemed likely, but I could not fathom how to use it in this situation.
My other issue was with the conditional entry. The data does not have a header row so I'm not sure how to check it. The intention is that if the 5th column contains a "C" then the 4th should have a date in it.
Any help would be gratefully received. Thanks
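The schema format itself is specific to the importing tool, so the following does not change the schema. As a plain illustration of the two rules described above (body blocks that may or may not appear, and the positional check that column 4 holds a date whenever column 5 is "C"), here is a minimal sketch in Python. The function name check_rows and the decision to skip header and trailer rows are assumptions made for the example.

```python
import csv
import io
from datetime import datetime

BODY_BLOCKS = {"001", "002"}  # either, both, or neither may appear in a file

def check_rows(text):
    """Return a list of (line_number, message) for body rows that break the rule."""
    problems = []
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not row or row[0] not in BODY_BLOCKS:
            continue  # header (000), trailer (999) and blank lines are not checked here
        # Columns are addressed by position because the file has no header row.
        fourth = row[3] if len(row) > 3 else ""
        fifth = row[4] if len(row) > 4 else ""
        if fifth == "C":
            try:
                datetime.strptime(fourth, "%Y%m%d")
            except ValueError:
                problems.append((line_no, "column 4 must be a yyyyMMdd date when column 5 is C"))
    return problems

if __name__ == "__main__":
    sample = "000,OUTCOME,111,20220320,FT\n001,A.Name1,,,A,\n002,A.Name2,,20220101,C,\n999,2\n"
    print(check_rows(sample))
```

Nothing in the loop requires both block types to be present, so a file containing only 001 rows or only 002 rows is checked the same way.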

Develop bargraphs from json object list directly

I have multiple json scripts with a similar format as:
{
"name": "xyz",
"slug": "xyz",
"supplier": "xyz.limited",
"attributes": [
{
"name": "mass",
"productConfiguration": "base",
"description": "",
"value": "0.24",
"minimumValue": "",
"maximumValue": "",
"measurementUnit": "kg",
"uuid": "",
"attributeClassUuid": "",
"productUuid": ""
},
{
"name": "length",
"productConfiguration": "base",
"description": "assumed to be length",
"value": "58.0",
"minimumValue": "",
"maximumValue": "",
"measurementUnit": "mm",
"uuid": "",
"attributeClassUuid": "",
"productUuid": ""
},
{
"name": "height",
"productConfiguration": "base",
"description": "assumed to be height",
"value": "25.0",
"minimumValue": "",
"maximumValue": "",
"measurementUnit": "mm",
"uuid": "",
"attributeClassUuid": "",
"productUuid": ""
},
{
"name": "width",
"productConfiguration": "base",
"description": "assumed to be width",
"value": "58.0",
"minimumValue": "",
"maximumValue": "",
"measurementUnit": "mm",
"uuid": "",
"attributeClassUuid": "",
"productUuid": ""
},
{
"name": "momentum-storage",
"productConfiguration": "base",
"description": "",
"value": "0.050",
"minimumValue": "",
"maximumValue": "",
"measurementUnit": "N m s",
"uuid": "",
"attributeClassUuid": "",
"productUuid": ""
},
{
"name": "maximum-torque",
"productConfiguration": "base",
"description": "",
"value": "0.007",
"minimumValue": "",
"maximumValue": "",
"measurementUnit": "N m",
"uuid": "",
"attributeClassUuid": "",
"productUuid": ""
},
{
"name": "voltage",
"productConfiguration": "base",
"description": "voltage is given in DC current",
"value": "12.0",
"minimumValue": "",
"maximumValue": "",
"measurementUnit": "V",
"uuid": "",
"attributeClassUuid": "",
"productUuid": ""
},
{
"name": "maximum-power",
"productConfiguration": "base",
"description": "",
"value": "9.0",
"minimumValue": "",
"maximumValue": "",
"measurementUnit": "W",
"uuid": "",
"attributeClassUuid": "",
"productUuid": ""
},
{
"name": "lifetime",
"productConfiguration": "base",
"description": "",
"value": "",
"minimumValue": "5.0",
"maximumValue": "",
"measurementUnit": "yr",
"uuid": "",
"attributeClassUuid": "",
"productUuid": ""
}
]
}
Each of these files has a different set of attributes defined. I would like to plot the number of times each attribute is repeated across all files. I tried to do this with Python 3.5 using 'os' and 'pandas.DataFrame', but got lost somewhere. I could use some help with this. Thanks in advance!
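One way to approach this is to count the attribute names first and plot afterwards. A minimal sketch using only the standard library, assuming the files sit in one directory and end in .json (both are assumptions, and count_attributes is a hypothetical name). It counts each attribute at most once per file:

```python
import json
import os
from collections import Counter

def count_attributes(directory):
    """Count in how many JSON files of `directory` each attribute name appears."""
    counts = Counter()
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(".json"):
            continue
        with open(os.path.join(directory, filename), encoding="utf-8") as f:
            product = json.load(f)
        # A set ensures an attribute repeated inside one file is counted once.
        names = {attr["name"] for attr in product.get("attributes", [])}
        counts.update(names)
    return counts
```

The resulting Counter behaves like a dict, so it can be handed to pandas for the bar graph, for example with pandas.Series(counts).plot.bar(), which needs pandas and matplotlib installed.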

Can I split compressed files into json components in Camel via Spring DSL

In a nutshell, I need to take a gzipped file containing json very similar to this example, unzip it (I know how to do that), get each json object as a string and push it to AMQ from where it will be popped to a webservice. I'm fine with all of this with one object, but I will be receiving a file that represents an array. If this were an array of strings or xml, I see how Camel processes it, but I don't see a way to split json. Also, this will require streaming as these files can be very large. Edited to try to make request clearer, and provide a sample json.
[
{
"rickenbackerRepair": {
"estimateId": 22788411
},
"repairShop": {
"inspectionSite": {
"inspectionDate": ""
},
"repairFacility": {
"companyIdCode": "",
"companyName": "",
"city": "",
"stateProvince": "",
"zipPostalCode": "",
"country": ""
},
"repairInformation": {
"guitarDateInShop": "",
"guitarTimeInShop": "",
"authorizationMemo": "",
"guitarTargetCompletionDate": "",
"guitarTargetCompletionTime": "",
"guitarCompletionDate": "",
"guitarCompletionTime": ""
},
"locationOfguitar": {
"city": "",
"stateProvince": "",
"zipPostalCode": "",
"country": ""
}
},
"instrumentIdentifier": {
"guitar": {
"claimRelated": {
"primaryPointOfImpact": "",
"secondaryPointOfImpact": ""
},
"identification": {
"databaseguitarCode": "",
"manufacturingStateProvince": "",
"serialNumber": "",
"guitarCondition": "",
"productionDate": "",
"year": "",
"model": "",
"guitarType": "",
"bodyStyle": "",
"trimCode": "",
"trimColor": "",
"optionsList": ""
}
}
},
"lin": [{
"internalControl": {
"lineIndicator": ""
},
"part": {
"description": {
"partType": "",
"descriptionJudgmentFlag": "",
"oemPartNumber": "",
"priceIncludedIndicator": "",
"alternatePartIndicator": "",
"taxableFlag": "",
"databasePartPrice": "",
"actualPartPrice": "",
"priceJudgmentFlag": "",
"certifiedFlag": "",
"quantity": ""
},
"nonOemSupplier": {
"companyIdCode": "",
"nonOemPartNumber": "",
"nonOemSupplierUserOverride": "",
"nonOemSupplierMemo": ""
},
"adjustment": {
"percent": "",
"amount": ""
}
},
"labor": {
"description": {
"type": "",
"actualHours": "",
"hoursJudgmentFlag": "",
"typeJudgmentFlag": ""
},
"miscSublet": {
"amount": "",
"subletFlag": ""
}
}
}],
"stl": {
"subtotal": [{
"totalType": "",
"totalTypeCode": "",
"subtotalDetail": {
"taxableAmount": ""
}
}]
}
},
{
"rickenbackerRepair": {
"estimateId": 22788412
},
"repairShop": {
"inspectionSite": {
"inspectionDate": ""
},
"repairFacility": {
"companyIdCode": "",
"companyName": "",
"city": "",
"stateProvince": "",
"zipPostalCode": "",
"country": ""
},
"repairInformation": {
"guitarDateInShop": "",
"guitarTimeInShop": "",
"authorizationMemo": "",
"guitarTargetCompletionDate": "",
"guitarTargetCompletionTime": "",
"guitarCompletionDate": "",
"guitarCompletionTime": ""
},
"locationOfguitar": {
"city": "",
"stateProvince": "",
"zipPostalCode": "",
"country": ""
}
},
"instrumentIdentifier": {
"guitar": {
"claimRelated": {
"primaryPointOfImpact": "",
"secondaryPointOfImpact": ""
},
"identification": {
"databaseguitarCode": "",
"manufacturingStateProvince": "",
"serialNumber": "",
"guitarCondition": "",
"productionDate": "",
"year": "",
"model": "",
"guitarType": "",
"bodyStyle": "",
"trimCode": "",
"trimColor": "",
"optionsList": ""
}
}
},
"lin": [{
"internalControl": {
"lineIndicator": ""
},
"part": {
"description": {
"partType": "",
"descriptionJudgmentFlag": "",
"oemPartNumber": "",
"priceIncludedIndicator": "",
"alternatePartIndicator": "",
"taxableFlag": "",
"databasePartPrice": "",
"actualPartPrice": "",
"priceJudgmentFlag": "",
"certifiedFlag": "",
"quantity": ""
},
"nonOemSupplier": {
"companyIdCode": "",
"nonOemPartNumber": "",
"nonOemSupplierUserOverride": "",
"nonOemSupplierMemo": ""
},
"adjustment": {
"percent": "",
"amount": ""
}
},
"labor": {
"description": {
"type": "",
"actualHours": "",
"hoursJudgmentFlag": "",
"typeJudgmentFlag": ""
},
"miscSublet": {
"amount": "",
"subletFlag": ""
}
}
}],
"stl": {
"subtotal": [{
"totalType": "",
"totalTypeCode": "",
"subtotalDetail": {
"taxableAmount": ""
}
}]
}
}
]
You should be able to use a jsonpath expression to split the incoming message (file) and process each element individually.
<route>
  <from uri="file://path" />
  <split>
    <jsonpath>$[*]</jsonpath>
    <to uri="direct:doSomething" />
  </split>
</route>

How to Parse this Json using Gson and get the field I want?

{
"ws_result":
[
{
"token": "",
"norm_token": "",
"len": "",
"type": "",
"pos": "",
"prop": "",
"stag": "",
"child":
[
{
"token": "",
"norm_token":"",
"len": "",
"type": "",
"pos": "",
"prop": "",
"stag": "",
"child":
[
{
"token": "",
"norm_token":"",
"len": "",
"type": "",
"pos": "",
"prop": "",
"stag": "",
"child": [ ]
},
{
"token": "",
"norm_token":"",
"len": "",
"type": "",
"pos": "",
"prop": "",
"stag": "",
"child": [ ]
}
]
},
{
"token": "",
"norm_token":"",
"len": "",
"type": "",
"pos": "",
"prop": "",
"stag": "",
"child":
[
{
"token": "",
"norm_token":"",
"len": "",
"type": "",
"pos": "",
"prop": "",
"stag": "",
"child": [ ]
}
]
}
]
},
{
"token": "",
"norm_token":"",
"len": "",
"type": "",
"pos": "",
"prop": "",
"stag": "2",
"child": [ ]
},
{
"token": "",
"norm_token": "",
"len": "",
"type": "",
"pos": "",
"prop": "",
"stag": "",
"child": [ ]
}
]
}
Some children are empty, some are not, and some children contain more children. How do I actually parse this and get what I want? I am totally new to JSON, and I am trying to use Gson. What I want is to get the value of a token with a specific type in the nested JSON. Thanks a lot for any help and directions.
I tried using com.google.gson.stream.JsonReader, but it's not working:
JsonReader jsonReader = new JsonReader(new StringReader(result));
jsonReader.beginObject();
while(jsonReader.hasNext()){
String field = jsonReader.nextName();
if (field.equals("type")){
System.out.println(jsonReader.nextString());
} else if (field.equals("token")){
System.out.println(jsonReader.nextString());
} else {
jsonReader.skipValue();
}
}
jsonReader.endObject();
Parse your json recursively like this:
http://snipplr.com/view/71742/java-reflection-and-recursive-json-deserializer-using-gson/
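import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import java.util.Iterator;
import java.util.Map.Entry;

// Walks every member of o and recurses into nested objects.
private void parse(JsonObject o, PackagingResponse r) {
    Iterator<Entry<String, JsonElement>> i = o.entrySet().iterator();
    while (i.hasNext()) {
        Entry<String, JsonElement> e = i.next();
        JsonElement el = e.getValue();
        if (el.isJsonObject())
            parse(el.getAsJsonObject(), r);
        //......
    }
}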
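The recursion itself does not depend on Gson. Sketched in Python for brevity, the lookup described in the question (every token whose type matches a given value, at any depth) could look like this. The function name tokens_of_type is hypothetical, and the field names "token", "type" and "child" are taken from the JSON in the question:

```python
def tokens_of_type(nodes, wanted_type):
    """Collect the token of every node whose type equals wanted_type, depth first."""
    found = []
    for node in nodes:
        if node.get("type") == wanted_type:
            found.append(node.get("token"))
        # An empty "child" list ends the recursion for this branch.
        found.extend(tokens_of_type(node.get("child", []), wanted_type))
    return found
```

It would be called on the top-level list, for example tokens_of_type(document["ws_result"], some_type), where document is the parsed JSON.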