I have millions of files with the following (poor) JSON format:
{
"3000105002":[
{
"pool_id": "97808",
"pool_name": "WILDCAT (DO NOT USE)",
"status": "Zone Permanently Plugged",
"bhl": "D-12-10N-05E 902 FWL 902 FWL",
"acreage": ""
},
{
"pool_id": "96838",
"pool_name": "DRY & ABANDONED",
"status": "Zone Permanently Plugged",
"bhl": "D-12-10N-05E 902 FWL 902 FWL",
"acreage": ""
}]
}
I've tried to generate an Athena DDL that would accommodate this type (especially the api field) of structure with this:
CREATE EXTERNAL TABLE wp_info (
api:array < struct < pool_id:string,
pool_name:string,
status:string,
bhl:string,
acreage:string>>)
LOCATION 's3://foo/'
After trying to generate a table with this, the following error is thrown:
Your query has the following error(s):
FAILED: ParseException line 2:12 cannot recognize input near ':' 'array' '<' in column type
What is a workable solution to this issue? Note that the api string is different for every one of the million files. The api key is not actually within any of the files, so I hope there is a way that Athena can accommodate just the string-type value for these data.
If you don't have control over the JSON format that you are receiving, and you don't have a streaming service in the middle to transform the JSON format to something simpler, you can use regex functions to retrieve the relevant data that you need.
A simple way to do it is to use Create-Table-As-Select (CTAS) query that will convert the data from its complex JSON format to a simpler table format.
CREATE TABLE new_table
WITH (
external_location = 's3://path/to/ctas_partitioned/',
format = 'Parquet',
parquet_compression = 'SNAPPY')
AS SELECT
regexp_extract(line, '"pool_id": "(\d+)"', 1) as pool_id,
regexp_extract(line, ' "pool_name": "([^"])",', 1) as pool_name,
...
FROM json_lines_table;
You will improve the performance of the queries to the new table, as you are using Parquet format.
Note that you can also update the table when you can new data, by running the CTAS query again with external_location as 's3://path/to/ctas_partitioned/part=01' or any other partition scheme
I have a complex JSON file that looks like this: http://pastebin.com/4UfadbqS
I would like to load only several values from these JSON objects using Pig Latin. I tried doing that like this:
mydata = LOAD 'data.json'
USING JsonLoader('id:chararray, created_at:chararray,
user: {(language:chararray)}’);
STORE mydata
INTO 'output';
But it seems that Pig Latin is just taking the first 3 values from the JSON and saving them (it does not recognize the column name as a key). Is there a way to achieve this? OR should I just list ALL the values from JSON in a Pig and filter them after that?
There are few problems in the above approach
1. JsonLoader will always expect the full schema of your input but you gave only three fields.
2. JsonLoader will always expect the entire input as a single line but your input is multiline.
3. JsonLoader will not support nested schema but your input contains nested schema.
To solve all the above problems you have use the thirdparty library elephant-bird jar.
Download the (elephant-bird-pig-4.1.jar and elephant-bird-hadoop-compat-4.1.jar) jar file from this link
http://www.java2s.com/Code/Jar/e/elephant.htm and try the below approach
I copied your entire input and formatted as a single line as below.
input.json
{"filter_level":"medium","retweeted":false,"in_reply_to_screen_name":null,"possibly_sensitive":false,"truncated":false,"lang":"en","in_reply_to_status_id_str":null,"id":488927960280211456,"in_reply_to_user_id_str":null,"in_reply_to_status_id":null,"created_at":"Tue Jul 15 06:08:04 +0000 2014","favorite_count":0,"place":null,"coordinates":null,"text":"RT #BulleyBufton: #MinaANDMaya PLEASE RT /VOTE BULLEY. Last day to help me win my old rescue #HilbraesDogs £5k https://t.co/Y8g47fLYY1 http\u2026","contributors":null,"retweeted_stt
atus":{"filter_level":"low","contributors":null,"text":"#MinaANDMaya PLEASE RT /VOTE BULLEY. Last day to help me win my old rescue #HilbraesDogs £5k https://t.co/Y8g47fLYY1 httpp
://t.co/DDco9wVXtP","geo":null,"retweeted":false,"in_reply_to_screen_name":"MinaANDMaya","possibly_sensitive":false,"truncated":false,"lang":"en","entities":{"trends":[],"symbols":[],"urls":[{"expanded_url":"https://www.animalfriendsquote.co.uk/fb-worldcup/","indices":[93,116],"display_url":"animalfriendsquote.co.uk/fb-worldcup/","url":"https://t.co/Y8g47fLYY1"}],"hashtags":[],"media":[{"sizes":{"thumb":{"w":150,"resize":"crop","h":150},"small":{"w":340,"resize":"fit","h":455},"large":{"w":706,"resize":"fit","h":946},"medium":{"w":600,"resize":"fit","h":803}},"id":488926730481332224,"media_url_https":"https://pbs.twimg.com/media/BskERVuIcAAJZGu.jpg","media_url":"http://pbs.twimg.com/media/BskERVuIcAAJZGu.jpg","expanded_url":"http://twitter.com/BulleyBufton/status/488926827394904064/photo/1","indices":[117,139],"id_str":"488926730481332224","type":"photo","display_url":"pic.twitter.com/DDco9wVXtP","url":"http://t.co/DDco9wVXtP"}],"user_mentions":[{"id":132204038,"name":"Mina*Bad Yoga Kitty*","indices":[0,12],"screen_name":"MinaANDMaya","id_str":"132204038"},{"id":2308374684,"name":"Julianna Kaminski","indices":[75,88],"screen_name":"HilbraesDogs","id_str":"2308374684"}]},"in_reply_to_status_id_str":null,"id":488926827394904064,"source":"<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android<\/a>","in_reply_to_user_id_str":"132204038","favorited":false,"in_reply_to_status_id":null,"retweet_count":6,"created_at":"Tue Jul 15 06:03:34 +0000 2014","in_reply_to_user_id":132204038,"favorite_count":3,"id_str":"488926827394904064","place":null,"user":{"location":"CHICAGO , USA","default_profile":false,"statuses_count":8868,"profile_background_tile":true,"lang":"en","profile_link_color":"AD54E8","profile_banner_url":"https://pbs.twimg.com/profile_banners/225136520/1403608773","id":225136520,"following":null,"favourites_count":5082,"protected":false,"profile_text_color":"3D1957","verified":false,"description":"I'm Bulley, I'm proof that there is always hope.\r\nI was in rescue kennels in UK for 9yrs. #ada_bscakes took me in.\r\nWe've moved to America to start a new life.","contributors_enabled":false,"profile_sidebar_border_color":"000000","name":"BULLEY","profile_background_color":"0A0A0A","created_at":"Fri Dec 10 19:55:17 +0000 2010","default_profile_image":false,"followers_count":3421,"profile_image_url_https":"https://pbs.twimg.com/profile_images/486614595457789952/gtcLac9w_normal.jpeg","geo_enabled":true,"profile_background_image_url":"http://pbs.twimg.com/profile_background_images/378800000166829702/isbjd7O4.jpeg","profile_background_image_url_https":"https://pbs.twimg.com/profile_background_images/378800000166829702/isbjd7O4.jpeg","follow_request_sent":null,"url":null,"utc_offset":-39600,"time_zone":"International Date Line West","notifications":null,"profile_use_background_image":true,"friends_count":3702,"profile_sidebar_fill_color":"7AC3EE","screen_name":"BulleyBufton","id_str":"225136520","profile_image_url":"http://pbs.twimg.com/profile_images/486614595457789952/gtcLac9w_normal.jpeg","listed_count":29,"is_translator":false},"coordinates":null},"geo":null,"entities":{"trends":[],"symbols":[],"urls":[{"expanded_url":"https://www.animalfriendsquote.co.uk/fb-worldcup/","indices":[111,134],"display_url":"animalfriendsquote.co.uk/fb-worldcup/","url":"https://t.co/Y8g47fLYY1"}],"hashtags":[],"media":[{"sizes":{"thumb":{"w":150,"resize":"crop","h":150},"small":{"w":340,"resize":"fit","h":455},"large":{"w":706,"resize":"fit","h":946},"medium":{"w":600,"resize":"fit","h":803}},"id":488926730481332224,"media_url_https":"https://pbs.twimg.com/media/BskERVuIcAAJZGu.jpg","media_url":"http://pbs.twimg.com/media/BskERVuIcAAJZGu.jpg","expanded_url":"http://twitter.com/BulleyBufton/status/488926827394904064/photo/1","source_status_id_str":"488926827394904064","indices":[139,140],"source_status_id":488926827394904064,"id_str":"488926730481332224","type":"photo","display_url":"pic.twitter.com/DDco9wVXtP","url":"http://t.co/DDco9wVXtP"}],"user_mentions":[{"id":225136520,"name":"BULLEY","indices":[3,16],"screen_name":"BulleyBufton","id_str":"225136520"},{"id":132204038,"name":"Mina*Bad Yoga Kitty*","indices":[18,30],"screen_name":"MinaANDMaya","id_str":"132204038"},{"id":2308374684,"name":"Julianna Kaminski","indices":[93,106],"screen_name":"HilbraesDogs","id_str":"2308374684"}]},"source":"<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android<\/a>","favorited":false,"in_reply_to_user_id":null,"retweet_count":0,"id_str":"488927960280211456","user":{"location":"","default_profile":false,"statuses_count":1370,"profile_background_tile":true,"lang":"zh-tw","profile_link_color":"038544","profile_banner_url":"https://pbs.twimg.com/profile_banners/2272804116/1404662156","id":2272804116,"following":null,"favourites_count":2000,"protected":false,"profile_text_color":"333333","verified":false,"description":"No More Sorrow","contributors_enabled":false,"profile_sidebar_border_color":"000000","name":"Winnie","profile_background_color":"14DBBA","created_at":"Thu Jan 02 10:13:01 +0000 2014","default_profile_image":false,"followers_count":311,"profile_image_url_https":"https://pbs.twimg.com/profile_images/478106512083017728/4ao_8JjE_normal.jpeg","geo_enabled":false,"profile_background_image_url":"http://pbs.twimg.com/profile_background_images/431815421189029888/YrRNpUfd.jpeg","profile_background_image_url_https":"https://pbs.twimg.com/profile_background_images/431815421189029888/YrRNpUfd.jpeg","follow_request_sent":null,"url":null,"utc_offset":null,"time_zone":null,"notifications":null,"profile_use_background_image":true,"friends_count":455,"profile_sidebar_fill_color":"DDEEF6","screen_name":"winnie341881","id_str":"2272804116","profile_image_url":"http://pbs.twimg.com/profile_images/478106512083017728/4ao_8JjE_normal.jpeg","listed_count":0,"is_translator":false}}
PigScript:
REGISTER '/tmp/elephant-bird-hadoop-compat-4.1.jar';
REGISTER '/tmp/elephant-bird-pig-4.1.jar';
A = LOAD 'input.json ' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;
B = FOREACH A GENERATE myMap#'id' AS ID,myMap#'created_at' AS createdAT,myMap#'user' AS User;
DUMP B;
Output:
(488927960280211456,Tue Jul 15 06:08:04 +0000 2014,[location#,default_profile#false,profile_background_tile#true,statuses_count#1370,lang#zh-tw,profile_link_color#038544,profile_banner_url#https://pbs.twimg.com/profile_banners/2272804116/1404662156,id#2272804116,following#,protected#false,favourites_count#2000,profile_text_color#333333,contributors_enabled#false,description#No More Sorrow,verified#false,name#Winnie,profile_sidebar_border_color#000000,profile_background_color#14DBBA,created_at#Thu Jan 02 10:13:01 +0000 2014,default_profile_image#false,followers_count#311,geo_enabled#false,profile_image_url_https#https://pbs.twimg.com/profile_images/478106512083017728/4ao_8JjE_normal.jpeg,profile_background_image_url#http://pbs.twimg.com/profile_background_images/431815421189029888/YrRNpUfd.jpeg,profile_background_image_url_https#https://pbs.twimg.com/profile_background_images/431815421189029888/YrRNpUfd.jpeg,follow_request_sent#,url#,utc_offset#,time_zone#,notifications#,friends_count#455,profile_use_background_image#true,profile_sidebar_fill_color#DDEEF6,screen_name#winnie341881,id_str#2272804116,profile_image_url#http://pbs.twimg.com/profile_images/478106512083017728/4ao_8JjE_normal.jpeg,is_translator#false,listed_count#0])
In elephantbird library all the values will be stored as key/value pair(ie MAP datatype), so it will be easy to extract the required fields from the loaded data.
In the above pigscript i have extracted the value of 'id','created_at' and 'user' as per your need.
Suppose you want to extract some fields from 'user' data( ex: 'friends_count' and 'followers_count'), in that case you need to project the 'user' field and extract the required data. sample code below.
PigScript:
REGISTER '/tmp/elephant-bird-hadoop-compat-4.1.jar';
REGISTER '/tmp/elephant-bird-pig-4.1.jar';
A = LOAD 'input.json ' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;
B = FOREACH A GENERATE 'user' AS User;
C = FOREACH B GENERATE User#'friends_count', User#'followers_count';
DUMP C;
Output:
(455,311)
I have the following MySQL Query using RMySQL. All the Database parameters are set before and Query is working well. Is there a possibility to put it in a loop to get multiple zoo objects from more than one dpname? Thanks!
dpname.df<-"%Name%"
paste.query.df<-paste("select handle from db.connect where dpname like '",dpname.df,"'",sep='')
handle.df<-dbGetQuery(dbLT,paste.query.df)
paste.query2.df<-paste("select time,value from db.data where handle='",handle.df,"' and time between '",x,"' and '",y,"'",sep='')
df <- dbGetQuery(dbLT,paste.query2.df)
df$time<-as.POSIXct(df$time,format="%Y-%m-%d %H:%M:%S")
df.zoo<-zoo(df[,-1],df[,1])
I tried to set a function with mapply:
query<-function(x,y,dpname.df)
{
paste.query.df<-paste("select handle from db.connect where dpname like '",dpname.df,"'",sep='')
handle.df<-dbGetQuery(dbLT,paste.query.df)
paste.query2.df<-paste("select time,value from db.data where handle='",handle.df,"' and time between '",x,"' and '",y,"'",sep='')
dbGetQuery(dbLT,paste.query2.df)
}
which I can run: with mapply(query,x,y,dpname.df)
But I can't get multiple outputs for each! query. Is it possible to set another List with the output names? So I could also put the zoo and POSIXct stuff in my function.Thanks!