I'm working with Twilio's Event Streams product. I'm successfully receiving events and parsing them into Pydantic models; all of this works.
Now I need to retain these events historically. They're going to S3, and I plan on using Athena to query the data.
Here's the question -- the JSON schema for CallSummary events is nested and complex. Before I tediously create a CREATE TABLE statement for this model, is there an easier approach I should be following?
For reference, I'm building the Python models using datamodel-codegen like this:
curl -s https://events-schemas.twilio.com/VoiceInsights.CallSummary/1 > call_summary_event.json
datamodel-codegen --input call_summary_event.json --input-file-type jsonschema --output call_summary_model.py
Using a single call summary record and hive-json-schema (from quux00), I can get about half of the fields:
java -cp target/json-hive-schema-1.0.jar net.thornydev.JsonHiveSchema CA0000123.json
CREATE TABLE x (
metrics array<struct<account_sid:string, call_sid:string, carrier_edge:null, client_edge:null, direction:string, edge:string, sdk_edge:null, sip_edge:struct<codec:int, codec_name:string, cumulative:struct<jitter:struct<avg:double, max:double>, packets_lost:int, packets_sent:int>, interval:struct<jitter:struct<avg:double, max:double>, packets_loss_percentage:double, packets_lost:int, packets_sent:int>, metadata:struct<edge_location:string, region:string, twilio_ip:string>>, timestamp:string>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
Unfortunately, many fields in my call summary JSON records are not populated, so I can't simply run a script like this over sample data to generate a complete CREATE TABLE with nested structs.
Before I go about doing this further, is there something I could use to go from the published JSON schema straight to Hive DDL?
Using a Glue Crawler, I was able to get this table created. Note that objects are stored in S3 paths with year/month/day partitions, e.g. s3://bucket/com.twilio.voice.insights.call-summary.complete/year=2022/month=10/day=10//CA000123.json
Here is the 'Create Table' DDL statement from Athena after the table was created by the Crawler:
CREATE EXTERNAL TABLE `com_twilio_voice_insights_call_summary_complete`(
`specversion` string COMMENT 'from deserializer',
`type` string COMMENT 'from deserializer',
`source` string COMMENT 'from deserializer',
`id` string COMMENT 'from deserializer',
`dataschema` string COMMENT 'from deserializer',
`datacontenttype` string COMMENT 'from deserializer',
`time` string COMMENT 'from deserializer',
`data` struct<call_sid:string,account_sid:string,parent_call_sid:string,parent_account_sid:string,start_time:string,end_time:string,duration:int,connect_duration:int,call_type:string,call_state:string,from_:struct<caller:string,callee:string,carrier:string,connection:string,number_prefix:string,location:struct<lat:double,lon:double>,city:string,country_code:string,country_subdivision:string,ip_address:string,sdk:struct<type:string,version:string,platform:string,region:string,selected_region:string,browser:struct<name:string,major:string,version:string>,os:struct<name:string,version:string>,device:struct<model:string,type:string,vendor:string,arch:string>,engine:struct<name:string,version:string>>>,to:struct<caller:string,callee:string,carrier:string,connection:string,number_prefix:string,location:struct<lat:double,lon:double>,city:string,country_code:string,country_subdivision:string,ip_address:string,sdk:struct<type:string,version:string,platform:string,region:string,selected_region:string,browser:struct<name:string,major:string,version:string>,os:struct<name:string,version:string>,device:struct<model:string,type:string,vendor:string,arch:string>,engine:struct<name:string,version:string>>>,processing_state:string,processing_version:int,sip_edge:struct<properties:struct<q850_cause:int,last_sip_response_num:int,pdd_ms:int,route_id:string,media_region:string,signaling_region:string,twilio_media_ip:string,twilio_signaling_ip:string,external_media_ip:string,external_signaling_ip:string,sip_call_id:string,user_agent:string,selected_region:string,region:string,trunk_sid:string,disconnected_by:string,direction:string,settings:string>,metrics:struct<inbound:struct<codec:int,codec_name:string,packets_received:int,packets_sent:string,packets_lost:int,packets_loss_percentage:double,jitter:struct<min:double,max:double,avg:double,value:string>,rtt:string,mos:string,audio_in:string,audio_out:string,latency:string,bytes_received:string,bytes_sent:string,packet_delay_variation:string>,outbound:struct<codec:int,codec_name:string,packets_received:string,packets_sent:int,packets_lost:int,packets_loss_percentage:double,jitter:struct<min:double,max:double,avg:double,value:string>,rtt:string,mos:string,audio_in:string,audio_out:string,latency:struct<min:double,max:double,avg:double,value:string>,bytes_received:string,bytes_sent:string,packet_delay_variation:struct<d50:int,d70:int,d90:int,d120:int,d150:int,d200:int,d300:int>>>,tags:array<string>,events:string>,carrier_edge:struct<properties:struct<q850_cause:int,last_sip_response_num:int,pdd_ms:int,route_id:string,media_region:string,signaling_region:string,twilio_media_ip:string,twilio_signaling_ip:string,external_media_ip:string,external_signaling_ip:string,sip_call_id:string,user_agent:string,selected_region:string,region:string,trunk_sid:string,disconnected_by:string,direction:string,settings:string>,metrics:struct<inbound:struct<codec:int,codec_name:string,packets_received:int,packets_sent:string,packets_lost:int,packets_loss_percentage:double,jitter:struct<min:double,max:double,avg:double,value:string>,rtt:string,mos:string,audio_in:string,audio_out:string,latency:string,bytes_received:string,bytes_sent:string,packet_delay_variation:struct<d50:int,d70:int,d90:int,d120:int,d150:int,d200:int,d300:int>>,outbound:struct<codec:int,codec_name:string,packets_received:string,packets_sent:int,packets_lost:int,packets_loss_percentage:double,jitter:struct<min:double,max:double,avg:double,value:string>,rtt:string,mos:string,audio_in:string,audio_out:string,latency:struct<min:double,max:double,avg:double,value:string>,bytes_received:string,bytes_sent:string,packet_delay_variation:struct<d50:int,d70:int,d90:int,d120:int,d150:int,d200:int,d300:int>>>,tags:array<string>,events:string>,sdk_edge:struct<properties:struct<q850_cause:int,last_sip_response_num:int,pdd_ms:int,route_id:string,media_region:string,signaling_region:string,twilio_media_ip:string,twilio_signaling_ip:string,external_media_ip:string,external_signaling_ip:string,sip_call_id:string,user_agent:string,selected_region:string,region:string,trunk_sid:string,disconnected_by:string,direction:string,settings:struct<ice_restart_enabled:boolean,dscp:boolean,edge:string,selected_edges:array<string>>>,metrics:struct<inbound:struct<codec:string,codec_name:string,packets_received:int,packets_sent:string,packets_lost:int,packets_loss_percentage:double,jitter:struct<min:double,max:double,avg:double,value:string>,rtt:struct<min:double,max:double,avg:double,value:string>,mos:struct<min:double,max:double,avg:double,value:string>,audio_in:struct<min:double,max:double,avg:double,value:string>,audio_out:struct<min:double,max:double,avg:double,value:string>,latency:string,bytes_received:string,bytes_sent:string,packet_delay_variation:string>,outbound:struct<codec:int,codec_name:string,packets_received:string,packets_sent:int,packets_lost:string,packets_loss_percentage:string,jitter:string,rtt:string,mos:string,audio_in:string,audio_out:string,latency:string,bytes_received:string,bytes_sent:string,packet_delay_variation:string>>,tags:array<string>,events:struct<groups:struct<settings:int,network_information:int,pc_connection_state:int,dtls_transport_state:int,audio_level_warning_cleared:int,audio_level_warning_raised:int,ice_candidate:int,network_quality_warning_raised:int,signaling_state:int,connection:int,get_user_media:int,ice_connection_state:int,ice_gathering_state:int,network_quality_warning_cleared:int,audio:int>,levels:struct<info:int,warning:int,debug:int,error:int>,errors:struct<31201:int,31208:int,53405:int,31000:int,31003:int>,feedback:struct<reason:string,score:int>>>,client_edge:struct<properties:struct<q850_cause:int,last_sip_response_num:int,pdd_ms:int,route_id:string,media_region:string,signaling_region:string,twilio_media_ip:string,twilio_signaling_ip:string,external_media_ip:string,external_signaling_ip:string,sip_call_id:string,user_agent:string,selected_region:string,region:string,trunk_sid:string,disconnected_by:string,direction:string,settings:string>,metrics:struct<inbound:struct<codec:int,codec_name:string,packets_received:int,packets_sent:string,packets_lost:int,packets_loss_percentage:double,jitter:struct<min:double,max:double,avg:double,value:string>,rtt:string,mos:string,audio_in:string,audio_out:string,latency:string,bytes_received:string,bytes_sent:string,packet_delay_variation:struct<d50:int,d70:int,d90:int,d120:int,d150:int,d200:int,d300:int>>,outbound:struct<codec:int,codec_name:string,packets_received:string,packets_sent:int,packets_lost:int,packets_loss_percentage:double,jitter:struct<min:double,max:double,avg:double,value:string>,rtt:string,mos:string,audio_in:string,audio_out:string,latency:struct<min:double,max:double,avg:double,value:string>,bytes_received:string,bytes_sent:string,packet_delay_variation:struct<d50:int,d70:int,d90:int,d120:int,d150:int,d200:int,d300:int>>>,tags:array<string>,events:string>,tags:array<string>,attributes:struct<conference_participant:boolean>,properties:struct<q850_cause:int,last_sip_response_num:int,pdd_ms:int,route_id:string,media_region:string,signaling_region:string,twilio_media_ip:string,twilio_signaling_ip:string,external_media_ip:string,external_signaling_ip:string,sip_call_id:string,user_agent:string,selected_region:string,region:string,trunk_sid:string,disconnected_by:string,direction:string,settings:string>> COMMENT 'from deserializer')
PARTITIONED BY (
`year` string,
`month` string,
`day` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'paths'='data,datacontenttype,dataschema,id,source,specversion,time,type')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucket/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='com.twilio.voice.insights.call-summary.complete',
'averageRecordSize'='4048',
'classification'='json',
'compressionType'='none',
'objectCount'='1240',
'recordCount'='1239',
'sizeKey'='5030025',
'typeOfData'='file')
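With the table in place, note that new year/month/day prefixes keep arriving, and each partition must be registered before Athena will read it. A minimal sketch, assuming the table name above and the Hive-style partition keys from the DDL (a scheduled Crawler run can also do this for you):
MSCK REPAIR TABLE com_twilio_voice_insights_call_summary_complete;
-- Nested fields are then addressable with dot notation:
SELECT data.call_sid,
       data.duration,
       data.sip_edge.properties.pdd_ms
FROM com_twilio_voice_insights_call_summary_complete
WHERE year = '2022' AND month = '10' AND day = '10';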
Does anyone know how to import geonames.org data into my database? The one I'm trying to import is http://download.geonames.org/export/dump/DO.zip, and my DB is a MySQL db.
I found the following by looking in the readme file included in the zip you linked to, in the section called "The main 'GeoName' table has the following fields:".
First, create the database and table on your MySQL instance. The field types are given in each row of the section quoted above.
CREATE DATABASE DO_test;
CREATE TABLE `DO_test`.`DO_table` (
`geonameid` INT,
`name` varchar(200),
`asciiname` varchar(200),
`alternatenames` varchar(5000),
`latitude` DECIMAL(10,7),
`longitude` DECIMAL(10,7),
`feature class` char(1),
`feature code` varchar(10),
`country code` char(2),
`cc2` char(60),
`admin1 code` varchar(20),
`admin2 code` varchar(80),
`admin3 code` varchar(20),
`admin4 code` varchar(20),
`population` bigint,
`elevation` INT,
`gtopo30` INT,
`timezone` varchar(100),
`modification date` date
)
CHARACTER SET utf8;
After the table is created you can import the data from the file. Fields are delimited by tabs and rows by newlines:
LOAD DATA INFILE '/path/to/your/file/DO.txt' INTO TABLE `DO_test`.`DO_table`;
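LOAD DATA's defaults already match this layout (fields terminated by tabs, lines by newlines), so the statement above works as-is; an equivalent, more explicit form spelling out the delimiters would be:
LOAD DATA INFILE '/path/to/your/file/DO.txt'
INTO TABLE `DO_test`.`DO_table`
CHARACTER SET utf8
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';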
I recently made a shell script that downloads the latest data from the geonames site and imports it into a MySQL database. It is based on knowledge from the GeoNames Forum and saved me a lot of time.
It is in its first version but is fully functional. Maybe it can help.
You can access it at http://codigofuerte.github.com/GeoNames-MySQL-DataImport/
For everyone in the future:
On the geonames.org forum there is a 2008 thread called "import all geonames dump into MySQL": http://forum.geonames.org/gforum/posts/list/732.page
Also try googling: import dump into [postgresql OR SQL server OR MySQL] site:forum.geonames.org to find more answers, going back as far as 2006.
Edited to provide a synopsis:
The official GeoNames readme at http://download.geonames.org/export/dump/ gives a good description of the dump files and their contents.
Dump files can be imported directly into MySQL tables, for example:
SET character_set_database=utf8;
LOAD DATA INFILE '/home/data/countryInfo.txt' INTO TABLE _geo_countries IGNORE 51 LINES(ISO2,ISO3,ISO_Numeric,FIPSCode,AsciiName,Capital,Area_SqKm,Population,ContinentCode,TLD,CurrencyCode,CurrencyName,PhoneCodes,PostalCodeFormats,PostalCodeRegex,Languages,GeonameID,Neighbours,EquivalentFIPSCodes);
SET character_set_database=default;
Be careful with the character set: if you use the CSV import feature of an old (circa 2012) phpMyAdmin, you may lose UTF-8 characters even when the column collation is set to utf8_general_ci.
Currently there are four essential tables: continents, countries (countryInfo.txt), divisions (admin1), and cities or locations (geonames).
The admin1, 2, 3, and 4 dump files hold the different levels of internal divisions of countries: admin1 is the first level, such as US states or the provinces of other countries; admin2 is more detailed, covering the internal divisions of a state or province; and so on for levels 3 and 4.
The per-country dump files listed there contain not only cities but all the locations in that country, down to individual shopping centers. There is also a huge file, allCountries.txt, which is more than 1 GB after extraction from the zip file. If we want only the cities, we should choose one of the dump files cities1000.txt, cities5000.txt, or cities15000.txt, where the number represents the minimum population of the listed cities. We store cities in the geonames table (you may call it geo_locations or geo_cities); a sketch of joining cities to divisions follows below.
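To make the relationship concrete: admin1 codes in the cities data are only unique per country, and the admin1 dump (admin1CodesASCII.txt) keys each row as 'COUNTRYCODE.ADMIN1CODE', e.g. 'US.CA'. A minimal join sketch, assuming a divisions table _geo_divisions with a code column loaded from that file and a cities table _geo_cities shaped like the 'geoname' table (both table names are illustrative, not part of the dump):
SELECT ci.name AS city, d.name AS division
FROM _geo_cities ci
JOIN _geo_divisions d
  ON d.code = CONCAT(ci.country_code, '.', ci.admin1_code);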
Before importing the *.txt dump files, do some research on the LOAD DATA syntax in the MySQL documentation.
The readme text file (also shown in the footer of the dump page) provides a good description, for example:
The main 'geoname' table has the following fields :
---------------------------------------------------
geonameid : integer id of record in geonames database
name : name of geographical point (utf8) varchar(200)
asciiname : name of geographical point in plain ascii characters, varchar(200)
alternatenames : alternatenames, comma separated varchar(5000)
latitude : latitude in decimal degrees (wgs84)
longitude : longitude in decimal degrees (wgs84)
feature class : see http://www.geonames.org/export/codes.html, char(1)
feature code : see http://www.geonames.org/export/codes.html, varchar(10)
country code : ISO-3166 2-letter country code, 2 characters
cc2 : alternate country codes, comma separated, ISO-3166 2-letter country code, 60 characters
admin1 code : fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)
admin2 code : code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80)
admin3 code : code for third level administrative division, varchar(20)
admin4 code : code for fourth level administrative division, varchar(20)
population : bigint (8 byte int)
elevation : in meters, integer
dem : digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat.
timezone : the timezone id (see file timeZone.txt) varchar(40)
modification date : date of last modification in yyyy-MM-dd format
Also, regarding the varchar(5000) column, be aware of the 64 KB row-size limit in MySQL 5.0 and later: see "Is a VARCHAR(20000) valid in MySQL?"
These are my notes from a successful import. As of writing I was testing with MySQL 5.7.16 on Windows 7. Follow these steps to import:
Download the desired data file from the official download page. In my case I chose cities1000.zip because it's much smaller (21 MB) than the all-inclusive allCountries.zip (1.4 GB).
Create the following schema and table according to readme.txt on the download page, where the fields are specified below the text "the main 'geoname' table has the following fields".
CREATE SCHEMA geonames DEFAULT CHARSET utf8 COLLATE utf8_general_ci;
CREATE TABLE geonames.cities1000 (
id INT,
name VARCHAR(200),
ascii_name VARCHAR(200),
alternate_names VARCHAR(10000) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
latitude DECIMAL(10, 7),
longitude DECIMAL(10, 7),
feature_class CHAR(1),
feature_code VARCHAR(10),
country_code CHAR(2),
cc2 CHAR(60),
admin1_code VARCHAR(20),
admin2_code VARCHAR(80),
admin3_code VARCHAR(20),
admin4_code VARCHAR(20),
population BIGINT,
elevation INT,
dem INT,
timezone VARCHAR(100),
modification_date DATE
)
CHARACTER SET utf8;
Field names are arbitrary as long as the column sizes and field types match what is specified. alternate_names is specially defined with the character set utf8mb4 because values in this column contain 4-byte Unicode characters, which are not supported by MySQL's utf8 character set.
Check the values of these parameters: character_set_client, character_set_results, character_set_connection [7]:
SHOW VARIABLES LIKE '%char%';
If they are not utf8mb4, then change them:
SET character_set_client = utf8mb4;
SET character_set_results = utf8mb4;
SET character_set_connection = utf8mb4;
Import data from file using LOAD DATA INFILE ...
USE geonames;
LOAD DATA INFILE 'C:\\ProgramData\\MySQL\\MySQL Server 5.7\\Uploads\\cities1000.txt' INTO TABLE cities1000
CHARACTER SET utf8mb4 (id, name, ascii_name, alternate_names, latitude, longitude, feature_class, feature_code,
country_code, cc2, admin1_code, admin2_code, admin3_code, admin4_code, population, @val1,
@val2, timezone, modification_date)
SET elevation = IF(@val1 = '', NULL, @val1), dem = IF(@val2 = '', NULL, @val2);
Explanation for the statement:
The file should be placed under the location designated by MySQL for importing data from files. You can check the location with SHOW VARIABLES LIKE 'secure_file_priv';. In my case it's C:\ProgramData\MySQL\MySQL Server 5.7\Uploads. On Windows you need to use doubled backslashes to represent one backslash in the path. This error is shown when the path is not given correctly: [HY000][1290] The MySQL server is running with the --secure-file-priv option so it cannot execute this statement.
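As an aside, if moving the file into that directory is inconvenient, LOAD DATA LOCAL INFILE reads the file from the client side instead and is not subject to secure_file_priv, though it requires local_infile to be enabled on both client and server. A sketch (the client-side path is illustrative; use the same column list and SET clause as in the statement above):
SHOW VARIABLES LIKE 'local_infile';
SET GLOBAL local_infile = 1;
LOAD DATA LOCAL INFILE 'C:\\Users\\you\\Downloads\\cities1000.txt' INTO TABLE cities1000;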
With CHARACTER SET utf8mb4 you're telling MySQL what encoding to expect from the file. When this is not given explicitly, or the column encoding is not utf8mb4, an error like this will be seen: [HY000][1300] Invalid utf8 character string: 'Gorad Safija,SOF,Serdica,Sofi,Sofia,Sofiae,Sofie,Sofii,Sofij,Sof' [5]. In my case I found it was due to Gothic letters in the alternate names, such as 𐍃𐍉𐍆𐌹𐌰 (id 727011), 𐌺𐌿𐍂𐌹𐍄𐌹𐌱𐌰 (id 3464975), and 𐌺𐍉𐌽𐌸𐌴𐍀𐌸𐌹𐍉𐌽 (id 3893894). These letters need to be stored as 4-byte characters (utf8mb4), while my encoding at the time was utf8, which only supports up to 3-byte characters [6]. You can change the column encoding after the table is created:
ALTER TABLE cities1000 MODIFY alternate_names VARCHAR(10000) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
To check the encoding of a column:
SELECT character_set_name, COLLATION_NAME FROM information_schema.COLUMNS WHERE table_schema = 'geonames' AND table_name = 'cities1000' AND column_name = 'alternate_names';
To test if the characters can be stored:
UPDATE cities1000 SET alternate_names = '𐍃𐍉𐍆𐌹𐌰' WHERE id = 1;
Values for some columns need to be "improved" before they are inserted, such as elevation and dem. They are of type INT, and the values for them in the file can be empty strings, which can't be stored in an INT column. So you need to convert those empty strings to NULL for these columns; the latter part of the statement serves exactly this purpose. This error is shown when the values are not properly converted first: [HY000][1366] Incorrect integer value: '' for column 'elevation' at row 1 [3], [4].
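As a side note, NULLIF() expresses the same empty-string-to-NULL conversion more compactly; the final clause of the LOAD DATA statement above could equivalently be written as:
SET elevation = NULLIF(@val1, ''), dem = NULLIF(@val2, '');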
References
1. http://www.geonames.org/
2. http://download.geonames.org/export/dump/
3. https://dev.mysql.com/doc/refman/8.0/en/load-data.html
4. https://dba.stackexchange.com/a/111044/94778
5. https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-conversion.html
6. https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html
7. https://stackoverflow.com/a/35156926/4357087
8. https://dev.mysql.com/doc/refman/5.7/en/charset-connection.html