Twilio Event Streams Schema to Hive DDL

I'm working with Twilio's Event Streams product. I'm successfully receiving events and parsing them into Pydantic models, and all of this works.
Now I need to retain these events historically. They're going to S3, and I plan on using Athena to query the data.
Here's the question -- the JSON schema for CallSummary events is nested and complex. Before I tediously hand-write a CREATE TABLE statement for this model, is there an easier approach I should be following?
For reference, I'm building the Python models using datamodel-codegen like this:
curl -s https://events-schemas.twilio.com/VoiceInsights.CallSummary/1 > call_summary_event.json
datamodel-codegen --input call_summary_event.json --input-file-type jsonschema --output call_summary_model.py
Using 1 call summary record and hive-json-schema (from quux00), I can get about half of the fields:
java -cp target/json-hive-schema-1.0.jar net.thornydev.JsonHiveSchema CA0000123.json
CREATE TABLE x (
metrics array<struct<account_sid:string, call_sid:string, carrier_edge:null, client_edge:null, direction:string, edge:string, sdk_edge:null, sip_edge:struct<codec:int, codec_name:string, cumulative:struct<jitter:struct<avg:double, max:double>, packets_lost:int, packets_sent:int>, interval:struct<jitter:struct<avg:double, max:double>, packets_loss_percentage:double, packets_lost:int, packets_sent:int>, metadata:struct<edge_location:string, region:string, twilio_ip:string>>, timestamp:string>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
Unfortunately, many of the fields in my call summary JSON records are not populated, so I can't simply run sample data through a script like this and get a CREATE TABLE statement with complete nested structs.
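One workaround I'm considering is to deep-merge a batch of sample records into a single synthetic record, so the schema tool sees every field populated at least once. A rough sketch of the idea (the file names are made up, and arrays keep whatever the first non-null sample had):

import json
from glob import glob

def deep_merge(target, source):
    # Recursively copy fields from source into target, filling anything
    # that is null or absent in target.
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            deep_merge(target[key], value)
        elif target.get(key) is None:
            target[key] = value

merged = {}
for path in glob("samples/CA*.json"):
    with open(path) as f:
        deep_merge(merged, json.load(f))

with open("merged_sample.json", "w") as f:
    json.dump(merged, f)  # feed this to JsonHiveSchema instead of one record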
Before I go further down this path, is there something I could use to go from the published JSON schema straight to Hive DDL?
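Worst case, I suppose I could walk the schema myself. A rough sketch of that idea, assuming the schema sticks to the standard type/properties/items keywords and local definitions (anything unrecognized falls back to string):

import json

HIVE_SCALARS = {"string": "string", "integer": "int", "number": "double", "boolean": "boolean"}

def hive_type(schema, defs):
    if "$ref" in schema:  # resolve local references like "#/definitions/Foo"
        schema = defs[schema["$ref"].split("/")[-1]]
    t = schema.get("type")
    if isinstance(t, list):  # e.g. ["string", "null"] -> keep the non-null type
        t = next((x for x in t if x != "null"), None)
    if t == "object":
        fields = ",".join(
            f"{name}:{hive_type(sub, defs)}"
            for name, sub in schema.get("properties", {}).items()
        )
        return f"struct<{fields}>"
    if t == "array":
        return f"array<{hive_type(schema.get('items', {}), defs)}>"
    return HIVE_SCALARS.get(t, "string")

with open("call_summary_event.json") as f:
    root = json.load(f)

defs = root.get("definitions", {})
for name, sub in root.get("properties", {}).items():
    print(f"`{name}` {hive_type(sub, defs)},")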

Using a Glue Crawler, I was able to get this table created. Note that objects are stored in S3 paths with year/month/day partitions, e.g. s3://bucket/com.twilio.voice.insights.call-summary.complete/year=2022/month=10/day=10/CA000123.json
'Create Table' DDL statement from Athena after the table was created by the Crawler:
CREATE EXTERNAL TABLE `com_twilio_voice_insights_call_summary_complete`(
`specverson` string COMMENT 'from deserializer',
`type` string COMMENT 'from deserializer',
`source` string COMMENT 'from deserializer',
`id` string COMMENT 'from deserializer',
`dataschema` string COMMENT 'from deserializer',
`datacontenttype` string COMMENT 'from deserializer',
`time` string COMMENT 'from deserializer',
`data` struct<call_sid:string,account_sid:string,parent_call_sid:string,parent_account_sid:string,start_time:string,end_time:string,duration:int,connect_duration:int,call_type:string,call_state:string,from_:struct<caller:string,callee:string,carrier:string,connection:string,number_prefix:string,location:struct<lat:double,lon:double>,city:string,country_code:string,country_subdivision:string,ip_address:string,sdk:struct<type:string,version:string,platform:string,region:string,selected_region:string,browser:struct<name:string,major:string,version:string>,os:struct<name:string,version:string>,device:struct<model:string,type:string,vendor:string,arch:string>,engine:struct<name:string,version:string>>>,to:struct<caller:string,callee:string,carrier:string,connection:string,number_prefix:string,location:struct<lat:double,lon:double>,city:string,country_code:string,country_subdivision:string,ip_address:string,sdk:struct<type:string,version:string,platform:string,region:string,selected_region:string,browser:struct<name:string,major:string,version:string>,os:struct<name:string,version:string>,device:struct<model:string,type:string,vendor:string,arch:string>,engine:struct<name:string,version:string>>>,processing_state:string,processing_version:int,sip_edge:struct<properties:struct<q850_cause:int,last_sip_response_num:int,pdd_ms:int,route_id:string,media_region:string,signaling_region:string,twilio_media_ip:string,twilio_signaling_ip:string,external_media_ip:string,external_signaling_ip:string,sip_call_id:string,user_agent:string,selected_region:string,region:string,trunk_sid:string,disconnected_by:string,direction:string,settings:string>,metrics:struct<inbound:struct<codec:int,codec_name:string,packets_received:int,packets_sent:string,packets_lost:int,packets_loss_percentage:double,jitter:struct<min:double,max:double,avg:double,value:string>,rtt:string,mos:string,audio_in:string,audio_out:string,latency:string,bytes_received:string,bytes_sent:string,packet_delay_variation:string>,outbound:struct<codec:int,codec_name:string,packets_received:string,packets_sent:int,packets_lost:int,packets_loss_percentage:double,jitter:struct<min:double,max:double,avg:double,value:string>,rtt:string,mos:string,audio_in:string,audio_out:string,latency:struct<min:double,max:double,avg:double,value:string>,bytes_received:string,bytes_sent:string,packet_delay_variation:struct<d50:int,d70:int,d90:int,d120:int,d150:int,d200:int,d300:int>>>,tags:array<string>,events:string>,carrier_edge:struct<properties:struct<q850_cause:int,last_sip_response_num:int,pdd_ms:int,route_id:string,media_region:string,signaling_region:string,twilio_media_ip:string,twilio_signaling_ip:string,external_media_ip:string,external_signaling_ip:string,sip_call_id:string,user_agent:string,selected_region:string,region:string,trunk_sid:string,disconnected_by:string,direction:string,settings:string>,metrics:struct<inbound:struct<codec:int,codec_name:string,packets_received:int,packets_sent:string,packets_lost:int,packets_loss_percentage:double,jitter:struct<min:double,max:double,avg:double,value:string>,rtt:string,mos:string,audio_in:string,audio_out:string,latency:string,bytes_received:string,bytes_sent:string,packet_delay_variation:struct<d50:int,d70:int,d90:int,d120:int,d150:int,d200:int,d300:int>>,outbound:struct<codec:int,codec_name:string,packets_received:string,packets_sent:int,packets_lost:int,packets_loss_percentage:double,jitter:struct<min:double,max:double,avg:double,value:string>,rtt:string,mos:string,audio_in:string,audio_out:string,latency:struct<min:double,max:double,avg:double,value:string>,bytes_received:string,bytes_sent:string,packet_delay_variation:struct<d50:int,d70:int,d90:int,d120:int,d150:int,d200:int,d300:int>>>,tags:array<string>,events:string>,sdk_edge:struct<properties:struct<q850_cause:int,last_sip_response_num:int,pdd_ms:int,route_id:string,media_region:string,signaling_region:string,twilio_media_ip:string,twilio_signaling_ip:string,external_media_ip:string,external_signaling_ip:string,sip_call_id:string,user_agent:string,selected_region:string,region:string,trunk_sid:string,disconnected_by:string,direction:string,settings:struct<ice_restart_enabled:boolean,dscp:boolean,edge:string,selected_edges:array<string>>>,metrics:struct<inbound:struct<codec:string,codec_name:string,packets_received:int,packets_sent:string,packets_lost:int,packets_loss_percentage:double,jitter:struct<min:double,max:double,avg:double,value:string>,rtt:struct<min:double,max:double,avg:double,value:string>,mos:struct<min:double,max:double,avg:double,value:string>,audio_in:struct<min:double,max:double,avg:double,value:string>,audio_out:struct<min:double,max:double,avg:double,value:string>,latency:string,bytes_received:string,bytes_sent:string,packet_delay_variation:string>,outbound:struct<codec:int,codec_name:string,packets_received:string,packets_sent:int,packets_lost:string,packets_loss_percentage:string,jitter:string,rtt:string,mos:string,audio_in:string,audio_out:string,latency:string,bytes_received:string,bytes_sent:string,packet_delay_variation:string>>,tags:array<string>,events:struct<groups:struct<settings:int,network_information:int,pc_connection_state:int,dtls_transport_state:int,audio_level_warning_cleared:int,audio_level_warning_raised:int,ice_candidate:int,network_quality_warning_raised:int,signaling_state:int,connection:int,get_user_media:int,ice_connection_state:int,ice_gathering_state:int,network_quality_warning_cleared:int,audio:int>,levels:struct<info:int,warning:int,debug:int,error:int>,errors:struct<31201:int,31208:int,53405:int,31000:int,31003:int>,feedback:struct<reason:string,score:int>>>,client_edge:struct<properties:struct<q850_cause:int,last_sip_response_num:int,pdd_ms:int,route_id:string,media_region:string,signaling_region:string,twilio_media_ip:string,twilio_signaling_ip:string,external_media_ip:string,external_signaling_ip:string,sip_call_id:string,user_agent:string,selected_region:string,region:string,trunk_sid:string,disconnected_by:string,direction:string,settings:string>,metrics:struct<inbound:struct<codec:int,codec_name:string,packets_received:int,packets_sent:string,packets_lost:int,packets_loss_percentage:double,jitter:struct<min:double,max:double,avg:double,value:string>,rtt:string,mos:string,audio_in:string,audio_out:string,latency:string,bytes_received:string,bytes_sent:string,packet_delay_variation:struct<d50:int,d70:int,d90:int,d120:int,d150:int,d200:int,d300:int>>,outbound:struct<codec:int,codec_name:string,packets_received:string,packets_sent:int,packets_lost:int,packets_loss_percentage:double,jitter:struct<min:double,max:double,avg:double,value:string>,rtt:string,mos:string,audio_in:string,audio_out:string,latency:struct<min:double,max:double,avg:double,value:string>,bytes_received:string,bytes_sent:string,packet_delay_variation:struct<d50:int,d70:int,d90:int,d120:int,d150:int,d200:int,d300:int>>>,tags:array<string>,events:string>,tags:array<string>,attributes:struct<conference_participant:boolean>,properties:struct<q850_cause:int,last_sip_response_num:int,pdd_ms:int,route_id:string,media_region:string,signaling_region:string,twilio_media_ip:string,twilio_signaling_ip:string,external_media_ip:string,external_signaling_ip:string,sip_call_id:string,user_agent:string,selected_region:string,region:string,trunk_sid:string,disconnected_by:string,direction:string,settings:string>> COMMENT 'from deserializer')
PARTITIONED BY (
`year` string,
`month` string,
`day` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'paths'='data,datacontenttype,dataschema,id,source,specverson,time,type')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucket/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='com.twilio.voice.insights.call-summary.complete',
'averageRecordSize'='4048',
'classification'='json',
'compressionType'='none',
'objectCount'='1240',
'recordCount'='1239',
'sizeKey'='5030025',
'typeOfData'='file')
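Since objects land under new year=/month=/day= prefixes each day, the partitions still have to be registered before Athena can see fresh data. A small boto3 sketch of that step (the database name and query-results bucket are placeholders):

import boto3

# Register newly arrived year=/month=/day= prefixes as partitions.
athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE com_twilio_voice_insights_call_summary_complete",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)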

Related

Error in data while creating external tables in Athena

I have my data in CSV format in the below form:
Id -> tinyint
Name -> String
Id Name
1 Alex
2 Sam
When I export the CSV file to S3 and create an Athena table, the data is transformed into the following format.
Id Name
1 "Alex"
2 "Sam"
How do I get rid of the double quotes while creating the table?
Any help is appreciated.
By default, if no SerDe is specified, Athena uses LazySimpleSerDe, which does not support quoted values and reads the quotes as part of the value. If your CSV file contains quoted values, use OpenCSVSerde (specify the correct separatorChar if it is not a comma):
CREATE EXTERNAL TABLE mytable(
id tinyint,
Name string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
LOCATION 's3://my-bucket/mytable/'
;
Read the manual: https://docs.aws.amazon.com/athena/latest/ug/csv-serde.html
See also this answer about data types in OpenCSVSerDe
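To see the difference outside Athena: Python's csv module does quote-aware parsing the way OpenCSVSerde does, while a naive split keeps the quotes, like LazySimpleSerDe. A quick illustration:

import csv
import io

line = '1,"Alex"'
print(next(csv.reader(io.StringIO(line))))  # ['1', 'Alex'] - quotes stripped
print(line.split(","))                      # ['1', '"Alex"'] - quotes kept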

Athena DDL statement for different data structures

I have data in XML form which I have converted to JSON through a Glue crawler. The problem is in writing the DDL statement for the table in Athena: as you can see below, there is a Contact attribute in the JSON data. In some places it is a structure (single instance) and in others it is an array (multiple instances). I am sharing the DDL statements below as well, one for each type.
JSON Data Type 1
"ContactList": {
"Contact": {
}
}
Athena DDL Statement
CREATE EXTERNAL TABLE IF NOT EXISTS table_name (
ContactList: struct<
Contact: struct<
>
>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3_bucket_path'
TBLPROPERTIES ('has_encrypted_data'='false')
JSON Data Type 2
"ContactList": {
"Contact": [
{},
{}
]
}
Athena DDL Statement
CREATE EXTERNAL TABLE IF NOT EXISTS table_name (
ContactList: struct<
Contact: array <
struct<
>
>
>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3_bucket_path'
TBLPROPERTIES ('has_encrypted_data'='false')
I am able to write a DDL statement for only one case at a time, and each works perfectly for its own type. My question is: how can we write a DDL statement that caters to both shapes, whether Contact is a struct or an array? Thanks in advance.
The way you solve this in Athena is to use the string type for the Contact field of the ContactList column, and then use JSON functions in your queries.
When you query, you can for example do the following (assuming contacts have a "name" field):
SELECT
COALESCE(
json_extract_scalar(ContactList.Contact, '$[0].name'),
json_extract_scalar(ContactList.Contact, '$.name')
) AS name
FROM table_name
This uses json_extract_scalar which parses a string as JSON and then extracts a value using a JSONPath expression. COALESCE picks the first non-null value, so if the first JSONPath expression does not yield any value (because the property is not an array), the second is attempted.
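If you control the pipeline that produces the JSON, an alternative is to normalize the shape before it lands in S3, so Contact is always an array and a single array<struct<...>> column definition covers everything. A rough sketch of that idea (key names follow the samples above):

def normalize_contacts(record):
    # Coerce a single Contact object into a one-element list in place.
    contact = record.get("ContactList", {}).get("Contact")
    if isinstance(contact, dict):
        record["ContactList"]["Contact"] = [contact]
    return record

print(normalize_contacts({"ContactList": {"Contact": {"name": "Alex"}}}))
# {'ContactList': {'Contact': [{'name': 'Alex'}]}}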

hive parsing json records as NULL

I have a simple hive table:
hive> show create table tweets;
OK
CREATE EXTERNAL TABLE `tweets`(
`json_body` string COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'file:/tmp/1'
TBLPROPERTIES (
'bucketing_version'='2',
'transient_lastDdlTime'='1551081429')
Time taken: 0.124 seconds, Fetched: 13 row(s)
In the folder /tmp/1 there is a file test.json, and the only contents in the file are {"appname":"app-name"}.
A select from tweets returns NULL:
hive> select * From tweets;
OK
NULL
Time taken: 0.097 seconds, Fetched: 1 row(s)
I know either the file format is wrong or something else is going on. Can someone please help?
If you want JsonSerDe to parse attributes then create table like this:
CREATE EXTERNAL TABLE tweets (
appname string
)
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/tmp/1' --this is HDFS/S3 location
;
Read also the docs about JsonSerDe.
And if you want to get the whole JSON object as a string json_body then you do not need JSON SerDe, use TEXTFILE instead:
CREATE EXTERNAL TABLE tweets (
json_body string
)
STORED AS TEXTFILE
LOCATION '/tmp/1' --this is HDFS/S3 location
;
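One more thing worth checking with either table: these JSON SerDes expect each record to be a single JSON object on one line, so a pretty-printed file will also come back as NULL. A small sketch that rewrites a file into that one-object-per-line shape (file names are made up):

import json

with open("test.json") as f:
    records = json.load(f)  # a single object or a list of objects

if isinstance(records, dict):
    records = [records]

with open("test.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")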

insert data into table using csv file in HIVE

CREATE TABLE `rk_test22`(
`index` int,
`country` string,
`description` string,
`designation` string,
`points` int,
`price` int,
`province` string,
`region_1` string,
`region_2` string,
`taster_name` string,
`taster_twitter_handle` string,
`title` string,
`variety` string,
`winery` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'input.regex'=',(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://namever/user/hive/warehouse/robert.db/rk_test22'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='1',
'skip.header.line.count'='1',
'totalSize'='52796693',
'transient_lastDdlTime'='1516088117');
I created the Hive table using the above command. Now I want to load the following line (in a CSV file) into the table using the load data command. The load data command shows status OK, but I cannot see any data in the table.
0,Italy,"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,#kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
If you are loading a one-line CSV file, then that line is skipped because of this property: 'skip.header.line.count'='1'
Also, the regex should contain one capturing group for each column, like in this answer: https://stackoverflow.com/a/47944328/2700344
And why do you provide these settings in the table DDL:
'COLUMN_STATS_ACCURATE'='true'
'numFiles'='1',
'totalSize'='52796693',
'transient_lastDdlTime'='1516088117'
All these should be set automatically after DDL and ANALYZE.
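As a sanity check on the quoting, you can try the same lookahead pattern from the SERDEPROPERTIES in Python against the sample line; it should produce exactly 14 fields, one per column in the DDL:

import re

line = ('0,Italy,"Aromas include tropical fruit, broom, brimstone and dried '
        "herb. The palate isn't overly expressive, offering unripened apple, "
        'citrus and dried sage alongside brisk acidity.",Vulkà Bianco,87,,'
        "Sicily & Sardinia,Etna,,Kerin O’Keefe,#kerinokeefe,"
        'Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia')

# Split on commas that sit outside double quotes.
fields = re.split(r',(?=(?:[^"]*"[^"]*")*[^"]*$)', line)
assert len(fields) == 14  # one per column in the table
print(fields[2])          # the quoted description, embedded commas intact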

golang - mysql driver - database functions

I have created a struct to store spatial types and a scan function to help query rows in my database. I am having issues inserting this type.
I can insert data using the following SQL:
INSERT INTO `table` (`spot`) VALUES (GeomFromText('POINT(10 10)'));
If I use the Value interface in database/sql/driver:
type Value interface{}
Value is a value that drivers must be able to handle. It is either nil or an instance of one of these types:
int64
float64
bool
[]byte
string [*] everywhere except from Rows.Next.
time.Time
And use this code:
func (p Point) Value() (driver.Value, error) {
return "GeomFromText('" + p.ToWKT() + "')", nil
}
I end up with the following SQL statement going to the database:
INSERT INTO `table` (`spot`) VALUES ('GeomFromText('POINT(10 10)')');
The issue is that the function GeomFromText ends up inside quotes. Is there a way to avoid this? I am using gorm and trying to keep raw SQL queries to a minimum.
The MySQL type being used on the database end is POINT.
Please see the two URLs below, where the concept was poached from:
Schema
-- http://howto-use-mysql-spatial-ext.blogspot.com/
create table Points
( id int auto_increment primary key,
name VARCHAR(20) not null,
location Point NOT NULL,
description VARCHAR(200) not null,
SPATIAL INDEX(location),
key(name)
)engine=MyISAM; -- for use of spatial indexes and avoiding error 1464
-- insert a row, so we can prove Update later will work
INSERT INTO Points (name, location, description) VALUES
( 'point1' , GeomFromText( ' POINT(31.5 42.2) ' ) , 'some place');
Update statement
-- concept borrowed from http://stackoverflow.com/a/7135890
UPDATE Points
set location = PointFromText(CONCAT('POINT(',13.33,' ',26.48,')'))
where id=1;
Verify
select * from points;
(when you open the Value Editor to see the blob, the point is updated)
So, the takeaway is to build the geometry with CONCAT() inside the UPDATE statement itself, rather than passing GeomFromText(...) through as a quoted string value.