Load only a few values from a complex JSON object in Pig Latin
I have a complex JSON file that looks like this: http://pastebin.com/4UfadbqS
I would like to load only a few values from each of these JSON objects using Pig Latin. I tried doing that like this:
mydata = LOAD 'data.json'
USING JsonLoader('id:chararray, created_at:chararray,
user: {(language:chararray)}');
STORE mydata
INTO 'output';
But it seems that Pig is just taking the first three values from the JSON and saving them (it does not match the schema's column names against the JSON keys). Is there a way to achieve this, or should I just list ALL the values from the JSON in the Pig schema and filter them afterwards?
There are a few problems with the above approach:
1. JsonLoader always expects the full schema of your input, but you gave only three fields.
2. JsonLoader always expects each input record on a single line, but your input spans multiple lines.
3. JsonLoader does not support nested schemas, but your input is nested.
To solve all of the above problems, you have to use the third-party elephant-bird library.
Download the jars (elephant-bird-pig-4.1.jar and elephant-bird-hadoop-compat-4.1.jar) from http://www.java2s.com/Code/Jar/e/elephant.htm and try the approach below.
I copied your entire input and reformatted it as a single line, as below.
input.json
{"filter_level":"medium","retweeted":false,"in_reply_to_screen_name":null,"possibly_sensitive":false,"truncated":false,"lang":"en","in_reply_to_status_id_str":null,"id":488927960280211456,"in_reply_to_user_id_str":null,"in_reply_to_status_id":null,"created_at":"Tue Jul 15 06:08:04 +0000 2014","favorite_count":0,"place":null,"coordinates":null,"text":"RT #BulleyBufton: #MinaANDMaya PLEASE RT /VOTE BULLEY. Last day to help me win my old rescue #HilbraesDogs £5k https://t.co/Y8g47fLYY1 http\u2026","contributors":null,"retweeted_stt
atus":{"filter_level":"low","contributors":null,"text":"#MinaANDMaya PLEASE RT /VOTE BULLEY. Last day to help me win my old rescue #HilbraesDogs £5k https://t.co/Y8g47fLYY1 httpp
://t.co/DDco9wVXtP","geo":null,"retweeted":false,"in_reply_to_screen_name":"MinaANDMaya","possibly_sensitive":false,"truncated":false,"lang":"en","entities":{"trends":[],"symbols":[],"urls":[{"expanded_url":"https://www.animalfriendsquote.co.uk/fb-worldcup/","indices":[93,116],"display_url":"animalfriendsquote.co.uk/fb-worldcup/","url":"https://t.co/Y8g47fLYY1"}],"hashtags":[],"media":[{"sizes":{"thumb":{"w":150,"resize":"crop","h":150},"small":{"w":340,"resize":"fit","h":455},"large":{"w":706,"resize":"fit","h":946},"medium":{"w":600,"resize":"fit","h":803}},"id":488926730481332224,"media_url_https":"https://pbs.twimg.com/media/BskERVuIcAAJZGu.jpg","media_url":"http://pbs.twimg.com/media/BskERVuIcAAJZGu.jpg","expanded_url":"http://twitter.com/BulleyBufton/status/488926827394904064/photo/1","indices":[117,139],"id_str":"488926730481332224","type":"photo","display_url":"pic.twitter.com/DDco9wVXtP","url":"http://t.co/DDco9wVXtP"}],"user_mentions":[{"id":132204038,"name":"Mina*Bad Yoga Kitty*","indices":[0,12],"screen_name":"MinaANDMaya","id_str":"132204038"},{"id":2308374684,"name":"Julianna Kaminski","indices":[75,88],"screen_name":"HilbraesDogs","id_str":"2308374684"}]},"in_reply_to_status_id_str":null,"id":488926827394904064,"source":"<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android<\/a>","in_reply_to_user_id_str":"132204038","favorited":false,"in_reply_to_status_id":null,"retweet_count":6,"created_at":"Tue Jul 15 06:03:34 +0000 2014","in_reply_to_user_id":132204038,"favorite_count":3,"id_str":"488926827394904064","place":null,"user":{"location":"CHICAGO , USA","default_profile":false,"statuses_count":8868,"profile_background_tile":true,"lang":"en","profile_link_color":"AD54E8","profile_banner_url":"https://pbs.twimg.com/profile_banners/225136520/1403608773","id":225136520,"following":null,"favourites_count":5082,"protected":false,"profile_text_color":"3D1957","verified":false,"description":"I'm Bulley, I'm proof that there is always hope.\r\nI was in rescue kennels in UK for 9yrs. 
#ada_bscakes took me in.\r\nWe've moved to America to start a new life.","contributors_enabled":false,"profile_sidebar_border_color":"000000","name":"BULLEY","profile_background_color":"0A0A0A","created_at":"Fri Dec 10 19:55:17 +0000 2010","default_profile_image":false,"followers_count":3421,"profile_image_url_https":"https://pbs.twimg.com/profile_images/486614595457789952/gtcLac9w_normal.jpeg","geo_enabled":true,"profile_background_image_url":"http://pbs.twimg.com/profile_background_images/378800000166829702/isbjd7O4.jpeg","profile_background_image_url_https":"https://pbs.twimg.com/profile_background_images/378800000166829702/isbjd7O4.jpeg","follow_request_sent":null,"url":null,"utc_offset":-39600,"time_zone":"International Date Line West","notifications":null,"profile_use_background_image":true,"friends_count":3702,"profile_sidebar_fill_color":"7AC3EE","screen_name":"BulleyBufton","id_str":"225136520","profile_image_url":"http://pbs.twimg.com/profile_images/486614595457789952/gtcLac9w_normal.jpeg","listed_count":29,"is_translator":false},"coordinates":null},"geo":null,"entities":{"trends":[],"symbols":[],"urls":[{"expanded_url":"https://www.animalfriendsquote.co.uk/fb-worldcup/","indices":[111,134],"display_url":"animalfriendsquote.co.uk/fb-worldcup/","url":"https://t.co/Y8g47fLYY1"}],"hashtags":[],"media":[{"sizes":{"thumb":{"w":150,"resize":"crop","h":150},"small":{"w":340,"resize":"fit","h":455},"large":{"w":706,"resize":"fit","h":946},"medium":{"w":600,"resize":"fit","h":803}},"id":488926730481332224,"media_url_https":"https://pbs.twimg.com/media/BskERVuIcAAJZGu.jpg","media_url":"http://pbs.twimg.com/media/BskERVuIcAAJZGu.jpg","expanded_url":"http://twitter.com/BulleyBufton/status/488926827394904064/photo/1","source_status_id_str":"488926827394904064","indices":[139,140],"source_status_id":488926827394904064,"id_str":"488926730481332224","type":"photo","display_url":"pic.twitter.com/DDco9wVXtP","url":"http://t.co/DDco9wVXtP"}],"user_mentions":[{"id":225136520,"name":"BULLEY","indices":[3,16],"screen_name":"BulleyBufton","id_str":"225136520"},{"id":132204038,"name":"Mina*Bad Yoga Kitty*","indices":[18,30],"screen_name":"MinaANDMaya","id_str":"132204038"},{"id":2308374684,"name":"Julianna Kaminski","indices":[93,106],"screen_name":"HilbraesDogs","id_str":"2308374684"}]},"source":"<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android<\/a>","favorited":false,"in_reply_to_user_id":null,"retweet_count":0,"id_str":"488927960280211456","user":{"location":"","default_profile":false,"statuses_count":1370,"profile_background_tile":true,"lang":"zh-tw","profile_link_color":"038544","profile_banner_url":"https://pbs.twimg.com/profile_banners/2272804116/1404662156","id":2272804116,"following":null,"favourites_count":2000,"protected":false,"profile_text_color":"333333","verified":false,"description":"No More Sorrow","contributors_enabled":false,"profile_sidebar_border_color":"000000","name":"Winnie","profile_background_color":"14DBBA","created_at":"Thu Jan 02 10:13:01 +0000 
2014","default_profile_image":false,"followers_count":311,"profile_image_url_https":"https://pbs.twimg.com/profile_images/478106512083017728/4ao_8JjE_normal.jpeg","geo_enabled":false,"profile_background_image_url":"http://pbs.twimg.com/profile_background_images/431815421189029888/YrRNpUfd.jpeg","profile_background_image_url_https":"https://pbs.twimg.com/profile_background_images/431815421189029888/YrRNpUfd.jpeg","follow_request_sent":null,"url":null,"utc_offset":null,"time_zone":null,"notifications":null,"profile_use_background_image":true,"friends_count":455,"profile_sidebar_fill_color":"DDEEF6","screen_name":"winnie341881","id_str":"2272804116","profile_image_url":"http://pbs.twimg.com/profile_images/478106512083017728/4ao_8JjE_normal.jpeg","listed_count":0,"is_translator":false}}
PigScript:
REGISTER '/tmp/elephant-bird-hadoop-compat-4.1.jar';
REGISTER '/tmp/elephant-bird-pig-4.1.jar';
-- '-nestedLoad' keeps nested JSON objects available as nested maps
A = LOAD 'input.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;
B = FOREACH A GENERATE myMap#'id' AS ID, myMap#'created_at' AS createdAT, myMap#'user' AS User;
DUMP B;
Output:
(488927960280211456,Tue Jul 15 06:08:04 +0000 2014,[location#,default_profile#false,profile_background_tile#true,statuses_count#1370,lang#zh-tw,profile_link_color#038544,profile_banner_url#https://pbs.twimg.com/profile_banners/2272804116/1404662156,id#2272804116,following#,protected#false,favourites_count#2000,profile_text_color#333333,contributors_enabled#false,description#No More Sorrow,verified#false,name#Winnie,profile_sidebar_border_color#000000,profile_background_color#14DBBA,created_at#Thu Jan 02 10:13:01 +0000 2014,default_profile_image#false,followers_count#311,geo_enabled#false,profile_image_url_https#https://pbs.twimg.com/profile_images/478106512083017728/4ao_8JjE_normal.jpeg,profile_background_image_url#http://pbs.twimg.com/profile_background_images/431815421189029888/YrRNpUfd.jpeg,profile_background_image_url_https#https://pbs.twimg.com/profile_background_images/431815421189029888/YrRNpUfd.jpeg,follow_request_sent#,url#,utc_offset#,time_zone#,notifications#,friends_count#455,profile_use_background_image#true,profile_sidebar_fill_color#DDEEF6,screen_name#winnie341881,id_str#2272804116,profile_image_url#http://pbs.twimg.com/profile_images/478106512083017728/4ao_8JjE_normal.jpeg,is_translator#false,listed_count#0])
In the elephant-bird loader, every record is presented as key/value pairs (the Pig MAP datatype), so it is easy to extract the required fields from the loaded data.
In the above Pig script I extracted the values of 'id', 'created_at' and 'user', as per your need.
Suppose you want to extract some fields from the 'user' data (e.g. 'friends_count' and 'followers_count'); in that case you need to project the 'user' field and extract the required data. Sample code below.
PigScript:
REGISTER '/tmp/elephant-bird-hadoop-compat-4.1.jar';
REGISTER '/tmp/elephant-bird-pig-4.1.jar';
A = LOAD 'input.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;
B = FOREACH A GENERATE myMap#'user' AS User; -- project the nested 'user' map
C = FOREACH B GENERATE User#'friends_count', User#'followers_count';
DUMP C;
Output:
(455,311)
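As a quick sanity check outside Pig, the same fields can be pulled with a few lines of Python (a sketch, assuming input.json holds one JSON object per line):

import json

# Mirror the Pig projection: id, created_at and user for each record.
with open('input.json') as f:
    for line in f:
        record = json.loads(line)
        print(record['id'], record['created_at'], record['user'])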
Related
How to convert csv to Json using expression transformation in informatica?
I have a csv file that I am converting to JSON array format. Below are the row-wise operations in the expression transformation for the two fields, region and country.

Json (output port):
'{'||'"region": '||'"'||Region||'",'||'"Country": '||'"'||Country||'"'||'},'
output: {"region": "Australia and Oceania", "Country": "Tuvalu"},

This output is saved to a text file with the session file properties set to fixed width.

Second mapping expression:
JSON (input)
V_JSON_start (variable port): INSTR(JSON,'{',1,1)
V_JSON_end (variable port): INSTR(JSON,'}',1,10)
O_Json (output port): '['||SUBSTR(JSON,V_JSON_start,V_JSON_end)||']'
output: [{"region": "Australia and Oceania","Country": "Tuvalu"}, {"region": "Central America and the Caribbean","Country": "Grenada"}]

When I try to fetch the next 10 records as JSON, it pulls 20 records instead of ten. Below is the expression:
JSON (input)
V_JSON_start (variable port): INSTR(JSON,'{',1,11)
V_JSON_end (variable port): INSTR(JSON,'}',1,20)
O_Json (output port): '['||SUBSTR(JSON,V_JSON_start,V_JSON_end)||']'

Kindly look into this and correct where I am missing.
Input: flat file (csv with two fields, region and country).
Expected output: 5 sessions, each session producing 10 records in JSON format, e.g.
[{"region":"value","country":"value"}, {"region":"value","country":"value"}]
session1 (csv to json) --> session2 -- session3 -- session4 -- session5 -- session6 (all parallel sessions using the file of the 1st session, 5 records in JSON format)
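A likely cause, for what it's worth: Informatica's SUBSTR takes a start position and a length, not an end position. For the first chunk the start is 1, so passing V_JSON_end as the length happens to work; for the second chunk (starting at record 11) it grabs roughly twice as much text, which matches the 20 records being returned. The length argument should be V_JSON_end - V_JSON_start + 1. A minimal Python sketch of the intended batching (the helper names are illustrative, not Informatica functions):

def nth_index(s, ch, n):
    # Position of the n-th occurrence of ch in s, or -1 if absent.
    pos = -1
    for _ in range(n):
        pos = s.find(ch, pos + 1)
        if pos == -1:
            return -1
    return pos

def chunk(records, first, last):
    # Slice from the '{' of record `first` to the '}' of record `last`,
    # i.e. SUBSTR(records, start, end - start + 1) in Informatica terms.
    start = nth_index(records, '{', first)
    end = nth_index(records, '}', last)
    return '[' + records[start:end + 1] + ']'

data = ''.join('{"region":"r%d","Country":"c%d"},' % (i, i) for i in range(1, 51))
print(chunk(data, 11, 20))  # exactly records 11-20, not 20 records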
Do not show JSON data in columns
The software I'm using saves a copy of what I believe is JSON data to a separate table when I save records in the database. What I want is to be able to query the JSON data contained in the DATASETS column separately. My server is SQL Server 2012. This is the query I tried so far:

SELECT TOP 1 IND, SNAPSHOTDATE, DATASETS, USERNAME, OWNERFORM FROM TBLSNAPSHOTS

Result:

105 2018-09-14 02:59:34.000 { "Datasets": [{"Name": "TBLSTOKLAR","Lines": [{"IND": "102","STOKNO": "","MALINCINSI": "TITIZ PLASTIK BUYUK KASIK 10 ADET","STOKKODU": "8691262708050","ANABIRIM": "102","BIRIMEX": "102","ALTSEVIYE": "","KRITIKSEVIYE": "","USTSEVIYE": "","DEPOSEVIYESI": "True","URETICI": "","AYLIKVADE": "0","SERINO": "","DEPO": "1","STOKGRUBU": "","GARANTI": "0","PRIM": "0","IPTAL": "False","STOKTIPI": "0","STOKTAKIP": "0","TEMINYERI": "1","RAFOMRU": "0","RESIM": "","KALAN": "0","REZERV": "0","KOD1": "","KOD2": "","KOD3": "","KOD4": "","KOD5": "","KOD6": "","KOD7": "","KOD8": "","KOD9": "","KOD10": "","TAKSITSAYISI": "0","ISTIHBARAT": "","FIYATYOK": "","DELETED": "","ALISFIYATI": "0","ESKIALISFIYATI": "0","SONALISTARIHI": "","SONSATISTARIHI": "","KARTINACILMATARIHI": "14.09.18 ı. 02:57:58","DEVIRIND": "","MALIYET": "1","KDVGRUBU": "1","AKTIF": "False","ISCILIKIND": "0","ISCILIKBIRIMIND": "0","ISCILIKACIKLAMA": "","ISCILIKSTOKKODU": "","ALISFIYATIDEGISMETARIHI": "","STATUS": "1","DALISFIYATI": "","APB": "","OIV": "0","KARORANI": "0","OTV": "0","ISK": "0","STOKGRUPTANIMI": "","ISKSATISFIYATI2": "0","ISKSATISFIYATI3": "0","ALISKDVORANI": "18","ALISISKORANI": "","SIPARISALINMASIN": "False","SIPARISVERILMESIN": "False","P1": "","P2": "","P3": "","SATISKOSULU": "","DEFAULTALISFIYATI": "","DEFAULTALISFIYATIDEGISMESTARIHI": "","KDVGRUBUT": "","HEDEFSATISFIAYTI": "","KURUMISKONTOSU": "","TICARIISKONTO": "","ITSBILDIRIMI": "False","MAXISKORANI": "","IMALATCISATISFIYATI": "","DKUR": "1","ACILSEVK": "False","SOGUKSEVK": "False","ICMIKTAR": "","TICARISEKIL": "","MAXISKTUTAR": "","TAXE": "","KOD11": "","DAPB": "","IKINCIEL": "","ETICARET": "","STOKNEVI": "0","OTVORANSAL": "True","POZ": "","YAZARKASA": "False","KOD12": "","KOD13": "","KOD14": "","KOD15": "","KOD16": "","KOD17": "","KOD18": "","KOD19": "","KOD20": "","KOD21": "","UID": "{0DE71D73-E447-45B0-BF6A-1D312DBAFDD2}"}]}]} ADMIN frmEdtStok
In SQL Server 2012: no, you can't directly query the JSON. SQL Server 2016 added functions that let you do this: https://learn.microsoft.com/en-us/sql/t-sql/functions/json-query-transact-sql?view=sql-server-2017
But if you need to stay on 2012, you are limited to string-parsing it (don't do this) or writing/finding a CLR function that parses it using .NET code and returns the results.
If you simply must do it quickly, there are some hacky solutions to parse it, like so: https://www.red-gate.com/simple-talk/sql/t-sql-programming/consuming-json-strings-in-sql-server/ but don't expect it to work smoothly with complex JSON.
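If parsing outside the database is an option, one workaround is to pull the column and parse it client-side. A minimal sketch (the connection string is a placeholder, pyodbc is just one possible driver, and the table/column/key names are taken from the question):

import json
import pyodbc

# Placeholder connection string; adjust driver/server/database as needed.
conn = pyodbc.connect('DRIVER={SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes')
cursor = conn.cursor()
cursor.execute("SELECT TOP 1 IND, SNAPSHOTDATE, DATASETS FROM TBLSNAPSHOTS")

for ind, snapshot_date, datasets in cursor.fetchall():
    doc = json.loads(datasets)          # DATASETS holds the JSON text
    for dataset in doc['Datasets']:
        print(ind, dataset['Name'])     # e.g. 105 TBLSTOKLAR
        for line in dataset['Lines']:
            print(line.get('STOKKODU'), line.get('MALINCINSI'))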
Extract fields from log file where data is stored half json and half plain text
I am new to Spark and want to read a log file and create a dataframe out of it. My data is half JSON and I cannot convert it into a dataframe properly. Below is the first row in the file:

[2017-01-06 07:00:01] userid:444444 11.11.111.0 info {"artist":"Tears For Fears","album":"Songs From The Big Chair","song":"Everybody Wants To Rule The World","id":"S4555","service":"pandora"}

The first part is plain text and the last part between { } is JSON. I tried a few things: converting it first to an RDD, then map and split, then converting back to a DataFrame, but I cannot extract the values from the JSON part of the row. Is there a trick to extract fields in this context? The final output should be like:

TimeStamp userid ip artist album song id service
2017-01-06 07:00:01 444444 11.11.111.0 Tears For Fears Songs From The Big Chair Everybody Wants To Rule The World S4555 pandora
You just need to parse out the pieces with a Python UDF into a tuple, then tell Spark to convert the RDD to a dataframe. The easiest way to do this is probably a regular expression. For example:

import re
import json

def parse(row):
    # One named group per field, joined by single spaces to match the log layout.
    pattern = ' '.join([
        r'\[(?P<ts>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d)\]',
        r'userid:(?P<userid>\d+)',
        r'(?P<ip>\d+\.\d+\.\d+\.\d+)',
        r'(?P<level>\w+)',
        r'(?P<json>.+$)'
    ])
    match = re.match(pattern, row)
    parsed_json = json.loads(match.group('json'))
    return (match.group('ts'), match.group('userid'), match.group('ip'),
            match.group('level'), parsed_json['artist'], parsed_json['song'],
            parsed_json['service'])

lines = [
    '[2017-01-06 07:00:01] userid:444444 11.11.111.0 info {"artist":"Tears For Fears","album":"Songs From The Big Chair","song":"Everybody Wants To Rule The World","id":"S4555","service":"pandora"}'
]

rdd = sc.parallelize(lines)
df = rdd.map(parse).toDF(['ts', 'userid', 'ip', 'level', 'artist', 'song', 'service'])
df.show()

This prints

+-------------------+------+-----------+-----+---------------+--------------------+-------+
|                 ts|userid|         ip|level|         artist|                song|service|
+-------------------+------+-----------+-----+---------------+--------------------+-------+
|2017-01-06 07:00:01|444444|11.11.111.0| info|Tears For Fears|Everybody Wants T...|pandora|
+-------------------+------+-----------+-----+---------------+--------------------+-------+
I have used the following, just some parsing utilizing PySpark's power:

from pyspark.sql.types import StructField, StructType, StringType

# r1 is assumed to be an RDD of rows with a .value field,
# e.g. spark.read.text(...).rdd
parts = r1.map(lambda x: x.value.replace('[','').replace('] ','###')
    .replace(' userid:','###').replace('null','"null"').replace('""','"NA"')
    .replace(' music_info {"artist":"','###').replace('","album":"','###')
    .replace('","song":"','###').replace('","id":"','###')
    .replace('","service":"','###').replace('"}','###').split('###'))

people = parts.map(lambda p: (p[0], p[1], p[2], p[3], p[4], p[5], p[6], p[7]))

schemaString = "timestamp mac userid_ip artist album song id service"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
df = spark.createDataFrame(people, schema)  # assumes a SparkSession named spark

With this I got almost what I want, and performance was super fast.

+-------------------+-----------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------+
|          timestamp|              mac|           userid_ip|              artist|               album|                song|                  id|service|
+-------------------+-----------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------+
|2017-01-01 00:00:00|00:00:00:00:00:00|111122 22.235.17...|The United States...| This Is Christmas!|Do You Hear What ...|            S1112536|pandora|
|2017-01-01 00:00:00|00:11:11:11:11:11|123123 108.252.2...|                  NA| Dinner Party Radio|                  NA|                null|pandora|
+-------------------+-----------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------+
Extract Input parameters from "mxnet" model
I have saved the model using

mx.model.save(model = fit_dl, prefix = "model", iteration = 10)

and loaded it later with

fit <- mx.model.load(prefix = "model", iteration = 10)

Now, using the object fit, I want to extract the input features (the column names of the training data). How do I do that?
Posting for the sake of the whole open source community. As per my email exchange with the maintainer of the mxnet package, Qiang Kou replied as follows:

From: Qiang Kou
To: Shiv Onkar Kumar
Sent: Wednesday, 14 June 2017 10:33 PM
Subject: Re: Extract Input parameters from "mxnet" model

Hi, Shiv,
I don't think this is possible, since we never store this information in the model.
Best,
Qiang Kou
How to parse JSON response in Ruby
The end goal is for this to be part of a chatbot that returns an airport's weather. Using import.io, I built an endpoint to query the weather service, which provides this response:

{"extractorData"=> {"url"=> "https://www.aviationweather.gov/metar/data?ids=kokb&format=decoded&hours=0&taf=off&layout=on&date=0", "resourceId"=>"66ca907842aabb6b08b8bc12049ad533", "data"=> [{"group"=> [{"Timestamp"=>[{"text"=>"Data at: 2135 UTC 12 Dec 2016"}], "Airport"=>[{"text"=>"KOKB (Oceanside Muni, CA, US)"}], "FullText"=> [{"text"=> "KOKB 122052Z AUTO 24008KT 10SM CLR 18/13 A3006 RMK AO2 SLP179 T01780133 58021"}], "Temperature"=>[{"text"=>"17.8°C ( 64°F)"}], "Dewpoint"=>[{"text"=>"13.3°C ( 56°F) [RH = 75%]"}], "Pressure"=> [{"text"=> "30.06 inches Hg (1018.0 mb) [Sea level pressure: 1017.9 mb]"}], "Winds"=> [{"text"=>"from the WSW (240 degrees) at 9 MPH (8 knots; 4.1 m/s)"}], "Visibility"=>[{"text"=>"10 or more sm (16+ km)"}], "Ceiling"=>[{"text"=>"at least 12,000 feet AGL"}], "Clouds"=>[{"text"=>"sky clear below 12,000 feet AGL"}]}]}]}, "pageData"=> {"resourceId"=>"66ca907842aabb6b08b8bc12049ad533", "statusCode"=>200, "timestamp"=>1481578559306}, "url"=> "https://www.aviationweather.gov/metar/data?ids=kokb&format=decoded&hours=0&taf=off&layout=on&date=0", "runtimeConfigId"=>"2ddb288f-9e57-4b58-a690-1cd409f9edd3", "timestamp"=>1481579246454, "sequenceNumber"=>-1}

I seem to be running into two issues. How do I:
1. pull each field and write it into its own variable
2. ignore the "text" modifier in the response?
If you're getting a response object, you might want to do something like

parsed_json = JSON.parse(response.body)

Then you can do things like

parsed_json["some_field"]

Note that JSON.parse returns string keys by default; pass symbolize_names: true if you prefer symbol keys such as parsed_json[:some_field].
The simple answer is:

require 'json'
foo = JSON['{"a":1}']
foo # => {"a"=>1}

JSON is smart enough to look at the parameter and, based on whether it's a string or an Array or Hash, parse it or serialize it. In the above case it parsed it back into a Hash. From that point it takes normal Ruby to dive into the hash you got back and access particular values:

foo = JSON['{"a":1, "b":[{"c":3}]}']
foo # => {"a"=>1, "b"=>[{"c"=>3}]}
foo['b'][0]['c'] # => 3

How to walk through a hash is covered extensively on the internet and here on Stack Overflow, so search around and see what you can find.