Parse Complex JSON String in Pig - json

I want to parse a string of complex JSON in Pig. Specifically, I want Pig to treat my JSON array as a bag rather than as a single chararray. When using JsonLoader, I can do this easily by specifying the schema, as in this question. Is there any way either to have Pig infer the schema for me or to specify it when Pig is parsing a string? I've been using JsonStringToMap, but I can't find a way to specify a schema or to make it recognize that my JSON array is an array and not a single chararray.

I wound up using JsonTupleMap() from Mozilla's Akela library for Pig. It does exactly what I want: it parses all of my JSON, even when it's complex, and it does so without my providing a schema. If you run into the same problem, use that.
Example usage:
REGISTER '/path/to/akela-0.5-SNAPSHOT.jar';
DEFINE JsonTupleMap com.mozilla.pig.eval.json.JsonTupleMap();
loaded = LOAD '$INPUT' AS (json_string:chararray, ...);
jsonified = FOREACH loaded GENERATE JsonTupleMap(json_string) AS json:map[], ...;
some_generate = FOREACH jsonified GENERATE json#'key'#'sub_key';

Related

Prevent parsing a JSON node with common-lisp YASON library

I am using the YASON library in Common Lisp. I want to parse a JSON string, but I would like the parser to leave one of its nodes unparsed.
For example:
{
"metadata1" : "mydata1",
"metadata2" : "mydata2",
"payload" : {...my long payload object},
"otherNodesToParse" : {...}
}
How can I set up the YASON parser to parse my JSON but skip the payload node and keep it as a string in JSON format?
Use case: let's say I just want the envelope data (everything that's not the payload), and I want to forward the payload as-is (as a JSON string) to another system.
Parsing the whole JSON (including the payload) and then re-encoding the payload to JSON is inefficient. The payload could also be pretty big.
How do you know where the end of the payload object is in the stream? You do so by parsing the stream: if you don't parse the stream you simply can't know where the end of the object is: that's the nature of JSON's syntax (as it is the nature of CL's default syntax). For instance the only way you can know the difference between where to continue after
{x:1}
and after
{x:1.2}
is by parsing the two things.
So you must necessarily parse the whole thing.
So the answer to your question is: you can't do this.
You could (but not, I think, with YASON) decide that you did not want to build an object as a result of the parse. And perhaps, if the stream you are parsing corresponds to something with random access like a string or a file, you could note the start and end positions in the stream to later extract a string from it corresponding to the unparsed data (or you could perhaps build it up as you go).
It looks as if some or all of this might be possible with CL-JSON, but you'd have to work at it.
Unless the objects you are reading are vast, the benefit of this seems questionable-to-none. If you really do want to do something like this efficiently, you need a serialisation scheme which tells you how long things are.
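To illustrate the "note the start and end positions" idea from the answer above, here is a minimal sketch in Python rather than Common Lisp (it does not use YASON or CL-JSON). It assumes the whole document is available as a string, and the function name split_envelope and its raw_key parameter are made up for this example. The payload is still scanned by the parser, because that is the only way to find where it ends, but it is never re-encoded: the original substring is sliced out instead.

```python
import json

_WHITESPACE = " \t\r\n"


def _skip_ws(text, i):
    """Return the index of the first non-whitespace character at or after i."""
    while i < len(text) and text[i] in _WHITESPACE:
        i += 1
    return i


def split_envelope(text, raw_key="payload"):
    """Parse a top-level JSON object, but keep raw_key's value as raw text.

    Returns (envelope_dict, raw_text_or_None). Note that raw_decode still
    builds (and we then discard) a Python object for the payload; avoiding
    that entirely would need an event-based parser.
    """
    decoder = json.JSONDecoder()
    envelope, raw = {}, None
    i = _skip_ws(text, 0)
    if text[i:i + 1] != "{":
        raise ValueError("expected a JSON object")
    i = _skip_ws(text, i + 1)
    if text[i:i + 1] == "}":
        return envelope, raw
    while True:
        # raw_decode returns the parsed value and the index just past it.
        key, i = decoder.raw_decode(text, i)
        if not isinstance(key, str):
            raise ValueError("object keys must be strings")
        i = _skip_ws(text, i)
        if text[i:i + 1] != ":":
            raise ValueError("expected ':' after key")
        start = _skip_ws(text, i + 1)
        value, end = decoder.raw_decode(text, start)
        if key == raw_key:
            raw = text[start:end]  # forward this substring untouched
        else:
            envelope[key] = value
        i = _skip_ws(text, end)
        if text[i:i + 1] == "}":
            return envelope, raw
        if text[i:i + 1] != ",":
            raise ValueError("expected ',' or '}'")
        i = _skip_ws(text, i + 1)
```

Calling split_envelope(doc) on a document shaped like the one in the question would give the envelope fields as a dict and the payload as the exact text that appeared in the input. Whether this saves anything worthwhile depends, as noted above, on how large the payloads really are.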

Apache Pig: Store only specific fields using JsonStorage()

I loaded a JSON file using JsonLoader() and it loads correctly. Now I want to store only a few fields of each JSON object into a file using JsonStorage(). My Pig script is:
data_input = LOAD '$DATA_INPUT' USING JsonLoader(<<schema>>);
x = FOREACH data_input GENERATE (user__id_str);
STORE x INTO '$DATA_OUTPUT' USING JsonStorage();
Expected output:
{"user__id_str":12345}
{"user__id_str":12345}
{"user__id_str":123467}
Output I am getting:
{"user__id_str":null}
{"user__id_str":null}
{"user__id_str":null}
What is wrong?
EDIT: The schema is huge; it consists of 306 fields:
user__contributors_enabled:chararray,retweeted_status__user__friends_count:int,quoted_status__extended_entities__media:chararray,retweeted_status__user__profile_background_image_url:chararray,quoted_status__user__is_translation_enabled:chararray,user__geo_enabled:chararray,avl_word_tags_all:chararray,quoted_status__user__profile_background_color:chararray,quoted_status__user__id_str:chararray,retweeted_status__place__bounding_box__coordinates:chararray,retweeted_status__quoted_status__metadata__result_type:chararray,retweeted_status__user__utc_offset:int,retweeted_status__user__contributors_enabled:chararray,retweeted_status__in_reply_to_screen_name:chararray,retweeted_status__place__place_type:chararray,retweeted_status__quoted_status__user__profile_background_image_url_https:chararray,user__utc_offset:int,quoted_status__favorited:chararray,user__entities__description__urls:chararray,place__url:chararray,quoted_status__user__profile_sidebar_border_color:chararray,favorited:chararray,retweeted_status__user__profile_banner_url:chararray,quoted_status__entities__user_mentions:chararray,retweet_count:int,retweeted_status__user__entities__description__urls:chararray,retweeted_status__quoted_status__user__is_translation_enabled:chararray,retweeted_status__entities__media:chararray,place__bounding_box__type:chararray,text_to_syntaxnet:chararray,quoted_status__user__chararrayed_count:int,avl_pos_tags:chararray,retweeted_status__user__statuses_count:int,quoted_status__metadata__iso_language_code:chararray,created_at:chararray,avl_lexicon_text:chararray,retweeted_status__lang:chararray,place__country:chararray,quoted_status__user__verified:chararray,retweeted_status__quoted_status__user__profile_background_tile:chararray,quoted_status__user__utc_offset:int,retweeted_status__quoted_status__user__location:chararray,quoted_status__created_at:chararray,retweeted_status__quoted_status__lang:chararray,place__place_type:chararray,user__profile_image_url:chararray,quoted_status__use
r__profile_use_background_image:chararray,user__name:chararray,user__notifications:chararray,user__id:int,in_reply_to_status_id:int,retweeted_status__metadata__iso_language_code:chararray,id:int,retweeted_status__user__follow_request_sent:chararray,retweeted_status__quoted_status__user__profile_use_background_image:chararray,retweeted_status__quoted_status__user__statuses_count:int,quoted_status__id_str:chararray,retweeted_status__user__profile_image_url:chararray,user__protected:chararray,user__profile_image_url_https:chararray,retweeted_status__source:chararray,quoted_status__source:chararray,retweeted_status__user__profile_link_color:chararray,retweeted_status__quoted_status__id_str:chararray,user__followers_count:int,retweeted_status__quoted_status__user__notifications:chararray,avl_num_sentences:int,retweeted_status__quoted_status__truncated:chararray,retweeted_status__text:chararray,quoted_status__favorite_count:int,quoted_status__metadata__result_type:chararray,truncated:chararray,metadata__iso_language_code:chararray,user__profile_banner_url:chararray,retweeted_status__quoted_status__user__profile_image_url_https:chararray,retweeted_status__quoted_status__user__utc_offset:int,quoted_status__user__profile_link_color:chararray,quoted_status__user__profile_image_url_https:chararray,retweeted_status__user__screen_name:chararray,retweeted_status__favorited:chararray,avl_lang:chararray,retweeted_status__user__location:chararray,retweeted_status__quoted_status__user__has_extended_profile:chararray,retweeted_status__quoted_status__user__verified:chararray,user__description:chararray,retweeted_status__user__profile_use_background_image:chararray,retweeted_status__quoted_status__user__contributors_enabled:chararray,quoted_status__is_quote_status:chararray,avl_sent:chararray,quoted_status__entities__media:chararray,quoted_status__possibly_sensitive:chararray,quoted_status__user__favourites_count:int,retweeted_status__quoted_status__user__default_profile_image:chararray
,avl_num_words:int,quoted_status__user__friends_count:int,id_str:chararray,user__default_profile:chararray,user__profile_text_color:chararray,quoted_status__user__description:chararray,retweeted_status__user__favourites_count:int,retweeted_status__quoted_status__user__friends_count:int,quoted_status__user__name:chararray,retweeted_status__quoted_status__created_at:chararray,user__verified:chararray,quoted_status_id_str:chararray,user__profile_sidebar_border_color:chararray,retweeted_status__quoted_status__user__profile_text_color:chararray,retweeted_status__quoted_status__user__following:chararray,favorite_count:int,retweeted_status__quoted_status__entities__symbols:chararray,source:chararray,quoted_status_id:int,user__profile_use_background_image:chararray,retweeted_status__user__following:chararray,quoted_status__user__location:chararray,coordinates__type:chararray,retweeted_status__user__id:int,retweeted_status__quoted_status__text:chararray,quoted_status__entities__urls:chararray,retweeted_status__in_reply_to_status_id_str:chararray,text:chararray,retweeted_status__quoted_status__is_quote_status:chararray,quoted_status__id:int,user__entities__url__urls:chararray,quoted_status__user__contributors_enabled:chararray,retweeted_status__quoted_status__user__favourites_count:int,retweeted_status__quoted_status__id:int,retweeted_status__retweet_count:int,retweeted_status__favorite_count:int,metadata__result_type:chararray,retweeted_status__user__protected:chararray,retweeted_status__quoted_status__user__name:chararray,possibly_sensitive:chararray,retweeted_status__user__profile_sidebar_fill_color:chararray,retweeted_status__user__profile_image_url_https:chararray,retweeted_status__quoted_status_id:int,place__contained_within:chararray,retweeted_status__user__id_str:chararray,retweeted_status__user__entities__url__urls:chararray,retweeted_status__id_str:chararray,retweeted_status__quoted_status__entities__user_mentions:chararray,in_reply_to_status_id_str:chararray,retwee
ted_status__user__has_extended_profile:chararray,user__default_profile_image:chararray,user__is_translator:chararray,place__bounding_box__coordinates:chararray,retweeted_status__is_quote_status:chararray,quoted_status__user__entities__description__urls:chararray,entities__urls:chararray,retweeted_status__quoted_status__favorite_count:int,quoted_status__truncated:chararray,retweeted_status__user__default_profile_image:chararray,user__statuses_count:int,retweeted_status__quoted_status__user__entities__description__urls:chararray,retweeted_status__quoted_status__entities__hashtags:chararray,retweeted_status__quoted_status__user__description:chararray,retweeted_status__user__verified:chararray,retweeted_status__user__followers_count:int,avl_syn_1:chararray,quoted_status__user__default_profile:chararray,retweeted_status__place__bounding_box__type:chararray,retweeted_status__id:int,retweeted_status__user__lang:chararray,retweeted_status__quoted_status__user__default_profile:chararray,retweeted_status__quoted_status__user__profile_link_color:chararray,retweeted_status__in_reply_to_user_id:int,retweeted_status__user__is_translation_enabled:chararray,retweeted_status__user__chararrayed_count:int,quoted_status__user__default_profile_image:chararray,quoted_status__retweet_count:int,retweeted_status__user__profile_background_tile:chararray,quoted_status__user__id:int,retweeted_status__quoted_status__user__screen_name:chararray,retweeted_status__user__notifications:chararray,coordinates__coordinates:chararray,avl_brand_1:chararray,retweeted_status__quoted_status__metadata__iso_language_code:chararray,retweeted_status__quoted_status__retweeted:chararray,retweeted_status__quoted_status_id_str:chararray,retweeted_status__user__profile_text_color:chararray,quoted_status__retweeted:chararray,retweeted_status__user__is_translator:chararray,retweeted_status__user__default_profile:chararray,retweeted_status__extended_entities__media:chararray,avl_word_tags:chararray,quoted_status__user_
_follow_request_sent:chararray,retweeted_status__quoted_status__possibly_sensitive:chararray,user__screen_name:chararray,quoted_status__user__profile_banner_url:chararray,extended_entities__media:chararray,retweeted_status__quoted_status__retweet_count:int,quoted_status__user__profile_background_image_url:chararray,place__name:chararray,user__created_at:chararray,lang:chararray,in_reply_to_screen_name:chararray,retweeted_status__in_reply_to_status_id:int,quoted_status__user__profile_text_color:chararray,user__url:chararray,retweeted_status__user__profile_background_image_url_https:chararray,retweeted_status__truncated:chararray,entities__symbols:chararray,retweeted_status__quoted_status__user__profile_sidebar_border_color:chararray,quoted_status__entities__hashtags:chararray,retweeted_status__created_at:chararray,place__country_code:chararray,quoted_status__user__screen_name:chararray,avl_score:int,quoted_status__user__lang:chararray,avl_source:chararray,place__full_name:chararray,retweeted_status__place__url:chararray,retweeted_status__user__profile_background_color:chararray,quoted_status__user__following:chararray,quoted_status__user__profile_image_url:chararray,quoted_status__text:chararray,user__chararrayed_count:int,retweeted_status__quoted_status__user__protected:chararray,avl_words_not_in_lexicon:chararray,retweeted_status__quoted_status__user__id_str:chararray,quoted_status__user__followers_count:int,retweeted_status__quoted_status__extended_entities__media:chararray,retweeted_status__quoted_status__user__is_translator:chararray,user__time_zone:chararray,retweeted_status__metadata__result_type:chararray,in_reply_to_user_id_str:chararray,quoted_status__user__profile_background_image_url_https:chararray,avl_num_paragraphs:int,retweeted_status__quoted_status__user__profile_background_color:chararray,retweeted_status__quoted_status__user__followers_count:int,quoted_status__user__has_extended_profile:chararray,retweeted_status__user__profile_sidebar_border_color
:chararray,avl_brand_all:chararray,retweeted_status__place__country_code:chararray,retweeted_status__user__description:chararray,quoted_status__user__profile_background_tile:chararray,retweeted_status__quoted_status__user__geo_enabled:chararray,quoted_status__user__created_at:chararray,entities__hashtags:chararray,retweeted_status__user__time_zone:chararray,quoted_status__user__geo_enabled:chararray,retweeted_status__possibly_sensitive:chararray,retweeted_status__user__name:chararray,retweeted:chararray,quoted_status__user__entities__url__urls:chararray,user__profile_background_tile:chararray,user__follow_request_sent:chararray,retweeted_status__quoted_status__entities__urls:chararray,quoted_status__user__statuses_count:int,retweeted_status__quoted_status__user__profile_background_image_url:chararray,user__is_translation_enabled:chararray,user__profile_background_image_url_https:chararray,user__friends_count:int,retweeted_status__quoted_status__user__id:int,geo__coordinates:chararray,user__following:chararray,user__favourites_count:int,retweeted_status__place__country:chararray,retweeted_status__quoted_status__user__chararrayed_count:int,user__profile_link_color:chararray,retweeted_status__place__full_name:chararray,quoted_status__user__protected:chararray,quoted_status__user__notifications:chararray,user__lang:chararray,retweeted_status__place__contained_within:chararray,retweeted_status__entities__hashtags:chararray,retweeted_status__entities__urls:chararray,user__profile_background_image_url:chararray,retweeted_status__quoted_status__favorited:chararray,retweeted_status__place__name:chararray,user__profile_background_color:chararray,geo__type:chararray,retweeted_status__entities__symbols:chararray,retweeted_status__place__id:chararray,quoted_status__lang:chararray,retweeted_status__retweeted:chararray,avl_sentences:chararray,avl_global_idx:int,retweeted_status__entities__user_mentions:chararray,retweeted_status__quoted_status__user__time_zone:chararray,user__id_s
tr:chararray,quoted_status__user__profile_sidebar_fill_color:chararray,quoted_status__entities__symbols:chararray,retweeted_status__user__url:chararray,retweeted_status__quoted_status__user__profile_sidebar_fill_color:chararray,quoted_status__user__is_translator:chararray,retweeted_status__quoted_status__user__lang:chararray,user__profile_sidebar_fill_color:chararray,retweeted_status__quoted_status__source:chararray,entities__media:chararray,entities__user_mentions:chararray,retweeted_status__user__created_at:chararray,user__has_extended_profile:chararray,quoted_status__user__time_zone:chararray,is_quote_status:chararray,place__id:chararray,retweeted_status__quoted_status__user__created_at:chararray,user__location:chararray,retweeted_status__quoted_status__user__follow_request_sent:chararray,quoted_status__user__url:chararray,retweeted_status__user__geo_enabled:chararray,in_reply_to_user_id:int,retweeted_status__in_reply_to_user_id_str:chararray,retweeted_status__quoted_status__user__profile_banner_url:chararray,retweeted_status__quoted_status__entities__media:chararray,retweeted_status__quoted_status__user__profile_image_url:chararray
I found the answer: the JSON objects in the input file do not all have the same schema, so I guess JsonLoader() is not able to load them according to the schema defined for it.
I used Elephant Bird instead, which made life easier.
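One way to confirm this kind of diagnosis before loading is to count how many distinct sets of keys occur in the input. The sketch below is plain Python, not Pig, and assumes one JSON object per line; the function name key_set_counts is made up for this example.

```python
import json
from collections import Counter


def key_set_counts(lines):
    """Count how often each distinct set of top-level keys occurs."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        # A sorted tuple of key names identifies the record's "shape".
        counts[tuple(sorted(record.keys()))] += 1
    return counts
```

If the resulting counter has more than one entry, the records do not all share one schema, and a loader that relies on a single fixed schema may not behave as expected.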

Pig Json Multistorage?

Using Pig (0.14), I'm interested in the following use case: I wish to process my raw JSON into multiple output directories based on a key and store the result (aggregated data) as JSON. The JSON has an evolving (dynamic) schema, which is read in with Elephant Bird and (so far) has not caused any problems.
I can either store the output in the correct directories (using MultiStorage) or store it as JSON (using JsonStorage), but not both. As far as I can tell, there is no publicly available UDF for this purpose.
Have I missed something, or is it just a case of writing my own UDF to do this? This seems like a simple use case, and I would have thought it would be supported.
For those who are looking for an answer to this: a UDF is required.
It is possible (and relatively straightforward) to combine the JsonStorage and MultiStorage UDFs to create a pseudo "JsonMultiStorage" class.
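To make the intended behaviour concrete, here is a rough Python model of what such a combined store function would do: route each record to a directory named after its key and write it there as one JSON object per line. This is not a Pig UDF (a real one would be a Java StoreFunc), and the function name store_json_by_key and the file name part-00000.json are assumptions chosen for the example.

```python
import json
import os
from collections import defaultdict


def store_json_by_key(records, out_dir, key_field):
    """Write records as JSON lines into one sub-directory per key value.

    Returns a dict mapping each key value to the file that was written.
    Assumes key values are safe to use as directory names.
    """
    buckets = defaultdict(list)
    for record in records:
        buckets[str(record.get(key_field))].append(record)

    written = {}
    for key, rows in buckets.items():
        key_dir = os.path.join(out_dir, key)
        os.makedirs(key_dir, exist_ok=True)
        path = os.path.join(key_dir, "part-00000.json")
        with open(path, "w", encoding="utf-8") as handle:
            for row in rows:
                handle.write(json.dumps(row, sort_keys=True) + "\n")
        written[key] = path
    return written
```

The split into "decide the directory" and "serialise the record" mirrors the two existing UDFs: the first half is MultiStorage's job, the second is JsonStorage's.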

Loading json with varying schema into PIG

I ran into an issue loading a set of JSON documents into Pig.
I have a lot of JSON documents that all vary in the fields they contain. The fields I need are present in most documents, and where they are missing I would like to get a null value.
I just downloaded and compiled the latest Pig version (0.12, straight from the Apache git repository) to be sure this hasn't been solved yet.
What I have is a json document like this:
{"foo":1,"bar":2,"baz":3}
When I load this into PIG using this
Json1 = LOAD 'test.json' USING JsonLoader('foo:int,bar:int,baz:int');
DESCRIBE Json1;
DUMP Json1;
I get the expected results
Json1: {foo: int,bar: int,baz: int}
(1,2,3)
However when the fields are in a different order in the schema :
Json2 = LOAD 'test.json' USING JsonLoader('baz:int,bar:int,foo:int');
DESCRIBE Json2;
DUMP Json2;
I get an undesired result:
Json2: {baz: int,bar: int,foo: int}
(1,2,3)
That should have been
(3,2,1)
Apparently the field names in the schema definition have nothing to do with the field names in the JSON.
What I need is to load specific fields from a JSON file (with embedded documents!) into Pig.
How do I resolve this?
I think this is a known issue even with the latest version of Pig, so there isn't an easy way around it other than to use a more capable JSON loader.
Use the Elephant Bird JsonLoader instead, which will behave the way you expect: in other words, it matches fields by name regardless of the order in which they appear.
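The difference between the two behaviours can be modelled in a few lines of Python. This is only an illustration of the results shown in the question, not Pig's or Elephant Bird's actual implementation; both function names are made up for the example.

```python
import json


def load_by_position(line, schema):
    """Assign values to schema fields in document order, ignoring names."""
    values = list(json.loads(line).values())
    return tuple(values[i] if i < len(values) else None
                 for i in range(len(schema)))


def load_by_name(line, schema):
    """Look each schema field up by name; missing fields become None."""
    record = json.loads(line)
    return tuple(record.get(name) for name in schema)
```

With the document {"foo":1,"bar":2,"baz":3} and the schema order baz, bar, foo, the positional version yields (1, 2, 3), matching the undesired output above, while the name-based version yields (3, 2, 1). The name-based version also returns a null (None) for any field a document lacks, which is the behaviour the question asks for.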

Convert JSON to CSV or unstructured text

How might you take JSON output (e.g., from http://www.kinggary.com/tools/todoist-export.php) and strip the names to yield just the values from each pair, as CSV or human-friendly text? I want a more readable, human-editable backup of my friend's data on todoist.com.
Your example site generates XML for me, not JSON. In either case I'd probably reach for Ruby:
require 'net/http'
require 'rexml/document'
xml = Net::HTTP.get_response(URI.parse("http://www.kinggary.com/tools/todoist-export.php?completed=incomplete&retrieval=view&submit=Submit&process=true&key=MYKEY")).body
data = REXML::Document.new(xml)
data.elements.each('//task/content') do |e|
  puts e.text
end
There's a good discussion of how to do this with Python at "How can I convert JSON to CSV?".
Can you JSON decode it to an array and just iterate the array for values? A sample of the JSON output would be helpful.
What language? PHP has a json_decode() function that turns the JSON into an object or associative array. You could then loop through the array or get the values from the object to turn it into whatever format you like.
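The decode-then-iterate approach suggested above can be sketched in Python with the standard json and csv modules. No sample of the real export was posted, so this assumes the JSON is either a single flat object or an array of flat objects; the function name json_to_csv and the field names in the usage note are hypothetical.

```python
import csv
import io
import json


def json_to_csv(json_text):
    """Drop the names and emit one CSV row of values per JSON object."""
    records = json.loads(json_text)
    if isinstance(records, dict):
        records = [records]  # treat a single object as a one-row table
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for record in records:
        # Values are written in the order the keys appear in the document.
        writer.writerow(list(record.values()))
    return out.getvalue()
```

For example, json_to_csv('[{"content": "Buy milk", "priority": 1}]') gives the single line Buy milk,1. The csv module takes care of quoting values that themselves contain commas, which a hand-rolled join would get wrong.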