Load different json schemas in PIG - json

I would like to know how to read different Json schemes from one File In PIG.
In Hadoop I would use an jsonparser and with if questions I would find out what kind of json Element it is.
The Json Elements inside one Doccument Are:
{"a": "bla", "e": 123, "f": 333}
{ "a": "bla", "c": "aa"}
I Tried to load the first Json Array with the following Command:
A = load '/usr/local/hadoop/stuff.net' USING USING JsonLoader('a:chararray, e:int, f:int');
DUMP A;
It Throws The Error: ERROR 2088: Fetch failed. Couldn't retrieve result
The Second query is working:
B = load '/home/hadoop/Desktop/aaa' USING JsonLoader('a:chararray, c:chararray');
DUMP B;
But it also shows me results from the first statement.
So I wanted to ask how to load different Json schemas from the same file or isn't that possible?

I think that you can use Twitter's elephantbird project. Some examples you can find here.
Usage is quite easy, you just register jar file and than you can use elephant UDF function to load nested json:
REGISTER 'elephant-bird.jar';
json_file_00 = LOAD 'json_file.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
json_file_01 = FOREACH json_file_00 GENERATE json_file_00#'fieldName' AS field_name;
I am also using akela project from mozilla which is great but outdated.

Related

AWS Glue classifier for extracting JSON array values

I have files in S3 with inline JSON per line of the structure:
{ "resources": [{"resourceType":"A","id":"A",...},{...}] }
If I run glue over it, I get "resource: array" as the top level element. I want however the elements of the array to be inspected and used as the top level table elements. All the elements per resources array will have the same schema. So I expect
resourceType: string
id: string
....
Theoretically, a custom JSON classifier should handle this:
$.resources[*]
However, the path is not picked up. So I still get the resources:array as the top level element.
I could now run some pre-processing to extract the array elements myself and write them line per line. However, I want to understand why my path is not working.
UPDATE 1:
It might be something with the JSON that I do not understand (its valid JSON produced via JAVA Jackson). If I remove the outer object with the resources attribute and change the structure to
[{"resourceType":"A","id":"A",...},{...}]
the classifier $[*] should pick the sub-objects up. But I still get array:array as top level element.
UPDATE 2:
Its indeed a formatting issue. If I change the JSON files to
[
{"resourceType":"A","id":"A",...},{...}
]
$[*] starts to work.
UPDATE 3:
Its however not fixing the issue with $.resources[*] to reformat to
{
"resources": [
{"resourceType":"A","id":"A",...},{...}
]
}
UPDATE 4:
If I take my file and run it through a Intellij re-format, hence produce a JSON object where all nested elements have line breaks, it also starts working with $.resources[*]. Basically, like in UPDATE 3 just applied down the structure.
{
"resources": [
{
"resourceType":"A",
"id":"A"
},
{
...
}
]
}
What bothers me is, that the requirements regarding the structure are still not clear to me, since UPDATE 2 worked, but not UPDATE 3. I also find nowhere in the documentation a formal requirement regarding the JSON structure.
In this sense, I think I got to the conclusion of my own question, but the systematics stay a bit unclear.
To conclude here:
The issue is related to unclear documented JSON formatting requirements of Glue.
A normalisation via json.dumps(my_json, separators=(',',':')) produces compact JSON that works for my use case.
I normalised now the content via a lambda.
Lambda code as reference for whomever it may help:
s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=my_bucket)
for page in pages:
try:
contents = page["Contents"]
except KeyError:
break
for obj in contents:
key = obj["Key"]
obj = s3.get_object(Bucket=my_bucket, Key=key)
j = json.loads(obj['Body'].read().decode('utf-8'))
new_json = json.dumps(j, separators=(',',':'))
target = 'nrmlzd/' + key
s3.put_object(
Body=new_json,
Bucket=my_bucket,
Key= target
)

Are My Data Better Suited To A CSV Import, Rather Than a JSON Import?

I am trying to force myself to use mongoDB, using the excuse of the "convenience" of it being able to accept JSON data. Of course, it's not as simple as that (it never is!).
At the moment, for this use case, I think I should revert to a traditional CSV import, and possibly a traditional RDBMS (e.g. MariaDB or MySQL). Am I wrong?
I found a possible solution in CSV DATA import to nestable json data, which seems to be a lot of faffing around.
The problem:
I am pulling some data from an online database, which returns data in blocks like this (actually it's all on one line, but I have broken it up to improve readability):
[
[8,1469734163000,50.84516753,0.00021818,2],
[6,1469734163000,50.80342373,0.00021818,2],
[4,1469734163000,50.33066367,0.00021818,2],
[12,1469734164000,40.31650031,0.00021918,2],
[10,1469734164000,11.36652478,0.00021818,2],
[14,1469734165000,52.03905845,0.00021918,2],
[16,1469734168000,57.32,0.00021918,2]
]
According to the command python -mjson.tool this is valid JSON.
But this command barfs
mongoimport --jsonArray --db=bitfinexLendingHistory --collection=fUSD --file=test.json
with
2019-12-31T12:23:42.934+0100 connected to: localhost
2019-12-31T12:23:42.935+0100 Failed: error unmarshaling bytes on document #3: JSON decoder out of sync - data changing underfoot?
2019-12-31T12:23:42.935+0100 imported 0 documents
The named DB and collection already exist.
$ mongo
> use bitfinexLendingHistory
switched to db bitfinexLendingHistory
> db.getCollectionNames()
[ "fUSD" ]
>
I realise that, at this stage, I have no <whatever the mongoDB equivalent of a column header is called in this case> defined, but I suspect the problem above is independent of that.
By wrapping by data above in the way as shown below, I managed to get the data imported.
{
"arf":
[
[8,1469734163000,50.84516753,0.00021818,2],
[6,1469734163000,50.80342373,0.00021818,2],
[4,1469734163000,50.33066367,0.00021818,2],
[12,1469734164000,40.31650031,0.00021918,2],
[10,1469734164000,11.36652478,0.00021818,2],
[14,1469734165000,52.03905845,0.00021918,2],
[16,1469734168000,57.32,0.00021918,2]
]
}
Next step is to determine if that is what I want, and if so, work out how to query it.

How to query json file located at s3 using presto

I have one json file stored in amazon-s3 location, I want to query this json file using presto. how can I achieve this?
Option 1 - Presto on EMR with json_extract built-in function
I am supposing that you have already launched Presto using EMR.
The easiest way to do this would be to use the json_extract function that comes by default with Presto.
So imagine you have a json file on s3 like this:
{"a": "a_value1", "b": { "bb": "bb_value1" }, "c": "c_value1"}
{"a": "a_value2", "b": { "bb": "bb_value2" }, "c": "c_value2"}
{"a": "a_value3", "b": { "bb": "bb_value3" }, "c": "c_value3"}
...
...
Each row represents a json tree object.
So you can simply define in presto a table with one field which is of string type, and then easily query the table with json_extract.
SELECT json_extract(json_field, '$.b.bb') as extract
FROM my_table
The result would be something like:
| extract |
|-----------------|
| bb_value1 |
| bb_value2 |
| bb_value3 |
This can be a fast and easy way to read a json file using presto, but unfortunately this doesn't scale well on big json files.
Some presto docs on json_extract: https://prestodb.github.io/docs/current/functions/json.html#json_extract
Option 2 - Presto on EMR with a specific Serde for json files
You can also customize your presto in bootstrap phase of your emr cluster, by adding custom plugins or SerDe libraries.
So you just have to choose one of the available JSON SerDe libraries (e.g. org.openx.data.jsonserde.JsonSerDe) and follow their guide to define a table that matches the structure of the Json file.
You will be able to access to the fields of the json file in a similar way of the json_extract (using the dotted notation), and it should be faster and scale well on big files. Unfortunately using this method you have 2 main problems:
1) Defining a table for complex files is like being in hell.
2) You may have internal java cast exception, because the data in the json couldn't be easily casted by the SerDe library.
Option 3 - Athena Built-In JSON Serde
https://docs.aws.amazon.com/athena/latest/ug/json.html
It seems that you have some Json SerDe also built-in Athena, I have personally never tried these but they are managed by AWS so should be easier to set up everything.
Rather than installing and running your own Presto service, there are some other options you can try:
Amazon Athena is a fully-managed Presto service. You can use it to query large datastores in Amazon S3, including compressed and partitioned data.
Amazon S3 Select allows you to run a query on a single object stored in Amazon S3. This is possibly simpler for your particular use-case.

Apache Pig: Store only specific fields using JsonStorage()

I loaded a json file using JsonLoad() and it loads correctly. Now I want to store only few fields of the json object into a file using jsonStorage(). My Pig script is:
data_input = LOAD '$DATA_INPUT' USING JsonLoader(<<schema>>);
x = FOREACH data_input GENERATE (user__id_str);
STORE x INTO '$DATA_OUTPUT' USING JsonStorage();
Expected output:
{"user__id_str":12345}
{"user__id_str":12345}
{"user__id_str":123467}
Output I am getting:
{"user__id_str":null}
{"user__id_str":null}
{"user__id_str":null}
What is wrong?
EDIT: The schema is huge: consists of 306 fields:
user__contributors_enabled:chararray,retweeted_status__user__friends_count:int,quoted_status__extended_entities__media:chararray,retweeted_status__user__profile_background_image_url:chararray,quoted_status__user__is_translation_enabled:chararray,user__geo_enabled:chararray,avl_word_tags_all:chararray,quoted_status__user__profile_background_color:chararray,quoted_status__user__id_str:chararray,retweeted_status__place__bounding_box__coordinates:chararray,retweeted_status__quoted_status__metadata__result_type:chararray,retweeted_status__user__utc_offset:int,retweeted_status__user__contributors_enabled:chararray,retweeted_status__in_reply_to_screen_name:chararray,retweeted_status__place__place_type:chararray,retweeted_status__quoted_status__user__profile_background_image_url_https:chararray,user__utc_offset:int,quoted_status__favorited:chararray,user__entities__description__urls:chararray,place__url:chararray,quoted_status__user__profile_sidebar_border_color:chararray,favorited:chararray,retweeted_status__user__profile_banner_url:chararray,quoted_status__entities__user_mentions:chararray,retweet_count:int,retweeted_status__user__entities__description__urls:chararray,retweeted_status__quoted_status__user__is_translation_enabled:chararray,retweeted_status__entities__media:chararray,place__bounding_box__type:chararray,text_to_syntaxnet:chararray,quoted_status__user__chararrayed_count:int,avl_pos_tags:chararray,retweeted_status__user__statuses_count:int,quoted_status__metadata__iso_language_code:chararray,created_at:chararray,avl_lexicon_text:chararray,retweeted_status__lang:chararray,place__country:chararray,quoted_status__user__verified:chararray,retweeted_status__quoted_status__user__profile_background_tile:chararray,quoted_status__user__utc_offset:int,retweeted_status__quoted_status__user__location:chararray,quoted_status__created_at:chararray,retweeted_status__quoted_status__lang:chararray,place__place_type:chararray,user__profile_image_url:chararray,quoted_status__user__profile_use_background_image:chararray,user__name:chararray,user__notifications:chararray,user__id:int,in_reply_to_status_id:int,retweeted_status__metadata__iso_language_code:chararray,id:int,retweeted_status__user__follow_request_sent:chararray,retweeted_status__quoted_status__user__profile_use_background_image:chararray,retweeted_status__quoted_status__user__statuses_count:int,quoted_status__id_str:chararray,retweeted_status__user__profile_image_url:chararray,user__protected:chararray,user__profile_image_url_https:chararray,retweeted_status__source:chararray,quoted_status__source:chararray,retweeted_status__user__profile_link_color:chararray,retweeted_status__quoted_status__id_str:chararray,user__followers_count:int,retweeted_status__quoted_status__user__notifications:chararray,avl_num_sentences:int,retweeted_status__quoted_status__truncated:chararray,retweeted_status__text:chararray,quoted_status__favorite_count:int,quoted_status__metadata__result_type:chararray,truncated:chararray,metadata__iso_language_code:chararray,user__profile_banner_url:chararray,retweeted_status__quoted_status__user__profile_image_url_https:chararray,retweeted_status__quoted_status__user__utc_offset:int,quoted_status__user__profile_link_color:chararray,quoted_status__user__profile_image_url_https:chararray,retweeted_status__user__screen_name:chararray,retweeted_status__favorited:chararray,avl_lang:chararray,retweeted_status__user__location:chararray,retweeted_status__quoted_status__user__has_extended_profile:chararray,retweeted_status__quoted_status__user__verified:chararray,user__description:chararray,retweeted_status__user__profile_use_background_image:chararray,retweeted_status__quoted_status__user__contributors_enabled:chararray,quoted_status__is_quote_status:chararray,avl_sent:chararray,quoted_status__entities__media:chararray,quoted_status__possibly_sensitive:chararray,quoted_status__user__favourites_count:int,retweeted_status__quoted_status__user__default_profile_image:chararray,avl_num_words:int,quoted_status__user__friends_count:int,id_str:chararray,user__default_profile:chararray,user__profile_text_color:chararray,quoted_status__user__description:chararray,retweeted_status__user__favourites_count:int,retweeted_status__quoted_status__user__friends_count:int,quoted_status__user__name:chararray,retweeted_status__quoted_status__created_at:chararray,user__verified:chararray,quoted_status_id_str:chararray,user__profile_sidebar_border_color:chararray,retweeted_status__quoted_status__user__profile_text_color:chararray,retweeted_status__quoted_status__user__following:chararray,favorite_count:int,retweeted_status__quoted_status__entities__symbols:chararray,source:chararray,quoted_status_id:int,user__profile_use_background_image:chararray,retweeted_status__user__following:chararray,quoted_status__user__location:chararray,coordinates__type:chararray,retweeted_status__user__id:int,retweeted_status__quoted_status__text:chararray,quoted_status__entities__urls:chararray,retweeted_status__in_reply_to_status_id_str:chararray,text:chararray,retweeted_status__quoted_status__is_quote_status:chararray,quoted_status__id:int,user__entities__url__urls:chararray,quoted_status__user__contributors_enabled:chararray,retweeted_status__quoted_status__user__favourites_count:int,retweeted_status__quoted_status__id:int,retweeted_status__retweet_count:int,retweeted_status__favorite_count:int,metadata__result_type:chararray,retweeted_status__user__protected:chararray,retweeted_status__quoted_status__user__name:chararray,possibly_sensitive:chararray,retweeted_status__user__profile_sidebar_fill_color:chararray,retweeted_status__user__profile_image_url_https:chararray,retweeted_status__quoted_status_id:int,place__contained_within:chararray,retweeted_status__user__id_str:chararray,retweeted_status__user__entities__url__urls:chararray,retweeted_status__id_str:chararray,retweeted_status__quoted_status__entities__user_mentions:chararray,in_reply_to_status_id_str:chararray,retweeted_status__user__has_extended_profile:chararray,user__default_profile_image:chararray,user__is_translator:chararray,place__bounding_box__coordinates:chararray,retweeted_status__is_quote_status:chararray,quoted_status__user__entities__description__urls:chararray,entities__urls:chararray,retweeted_status__quoted_status__favorite_count:int,quoted_status__truncated:chararray,retweeted_status__user__default_profile_image:chararray,user__statuses_count:int,retweeted_status__quoted_status__user__entities__description__urls:chararray,retweeted_status__quoted_status__entities__hashtags:chararray,retweeted_status__quoted_status__user__description:chararray,retweeted_status__user__verified:chararray,retweeted_status__user__followers_count:int,avl_syn_1:chararray,quoted_status__user__default_profile:chararray,retweeted_status__place__bounding_box__type:chararray,retweeted_status__id:int,retweeted_status__user__lang:chararray,retweeted_status__quoted_status__user__default_profile:chararray,retweeted_status__quoted_status__user__profile_link_color:chararray,retweeted_status__in_reply_to_user_id:int,retweeted_status__user__is_translation_enabled:chararray,retweeted_status__user__chararrayed_count:int,quoted_status__user__default_profile_image:chararray,quoted_status__retweet_count:int,retweeted_status__user__profile_background_tile:chararray,quoted_status__user__id:int,retweeted_status__quoted_status__user__screen_name:chararray,retweeted_status__user__notifications:chararray,coordinates__coordinates:chararray,avl_brand_1:chararray,retweeted_status__quoted_status__metadata__iso_language_code:chararray,retweeted_status__quoted_status__retweeted:chararray,retweeted_status__quoted_status_id_str:chararray,retweeted_status__user__profile_text_color:chararray,quoted_status__retweeted:chararray,retweeted_status__user__is_translator:chararray,retweeted_status__user__default_profile:chararray,retweeted_status__extended_entities__media:chararray,avl_word_tags:chararray,quoted_status__user__follow_request_sent:chararray,retweeted_status__quoted_status__possibly_sensitive:chararray,user__screen_name:chararray,quoted_status__user__profile_banner_url:chararray,extended_entities__media:chararray,retweeted_status__quoted_status__retweet_count:int,quoted_status__user__profile_background_image_url:chararray,place__name:chararray,user__created_at:chararray,lang:chararray,in_reply_to_screen_name:chararray,retweeted_status__in_reply_to_status_id:int,quoted_status__user__profile_text_color:chararray,user__url:chararray,retweeted_status__user__profile_background_image_url_https:chararray,retweeted_status__truncated:chararray,entities__symbols:chararray,retweeted_status__quoted_status__user__profile_sidebar_border_color:chararray,quoted_status__entities__hashtags:chararray,retweeted_status__created_at:chararray,place__country_code:chararray,quoted_status__user__screen_name:chararray,avl_score:int,quoted_status__user__lang:chararray,avl_source:chararray,place__full_name:chararray,retweeted_status__place__url:chararray,retweeted_status__user__profile_background_color:chararray,quoted_status__user__following:chararray,quoted_status__user__profile_image_url:chararray,quoted_status__text:chararray,user__chararrayed_count:int,retweeted_status__quoted_status__user__protected:chararray,avl_words_not_in_lexicon:chararray,retweeted_status__quoted_status__user__id_str:chararray,quoted_status__user__followers_count:int,retweeted_status__quoted_status__extended_entities__media:chararray,retweeted_status__quoted_status__user__is_translator:chararray,user__time_zone:chararray,retweeted_status__metadata__result_type:chararray,in_reply_to_user_id_str:chararray,quoted_status__user__profile_background_image_url_https:chararray,avl_num_paragraphs:int,retweeted_status__quoted_status__user__profile_background_color:chararray,retweeted_status__quoted_status__user__followers_count:int,quoted_status__user__has_extended_profile:chararray,retweeted_status__user__profile_sidebar_border_color:chararray,avl_brand_all:chararray,retweeted_status__place__country_code:chararray,retweeted_status__user__description:chararray,quoted_status__user__profile_background_tile:chararray,retweeted_status__quoted_status__user__geo_enabled:chararray,quoted_status__user__created_at:chararray,entities__hashtags:chararray,retweeted_status__user__time_zone:chararray,quoted_status__user__geo_enabled:chararray,retweeted_status__possibly_sensitive:chararray,retweeted_status__user__name:chararray,retweeted:chararray,quoted_status__user__entities__url__urls:chararray,user__profile_background_tile:chararray,user__follow_request_sent:chararray,retweeted_status__quoted_status__entities__urls:chararray,quoted_status__user__statuses_count:int,retweeted_status__quoted_status__user__profile_background_image_url:chararray,user__is_translation_enabled:chararray,user__profile_background_image_url_https:chararray,user__friends_count:int,retweeted_status__quoted_status__user__id:int,geo__coordinates:chararray,user__following:chararray,user__favourites_count:int,retweeted_status__place__country:chararray,retweeted_status__quoted_status__user__chararrayed_count:int,user__profile_link_color:chararray,retweeted_status__place__full_name:chararray,quoted_status__user__protected:chararray,quoted_status__user__notifications:chararray,user__lang:chararray,retweeted_status__place__contained_within:chararray,retweeted_status__entities__hashtags:chararray,retweeted_status__entities__urls:chararray,user__profile_background_image_url:chararray,retweeted_status__quoted_status__favorited:chararray,retweeted_status__place__name:chararray,user__profile_background_color:chararray,geo__type:chararray,retweeted_status__entities__symbols:chararray,retweeted_status__place__id:chararray,quoted_status__lang:chararray,retweeted_status__retweeted:chararray,avl_sentences:chararray,avl_global_idx:int,retweeted_status__entities__user_mentions:chararray,retweeted_status__quoted_status__user__time_zone:chararray,user__id_str:chararray,quoted_status__user__profile_sidebar_fill_color:chararray,quoted_status__entities__symbols:chararray,retweeted_status__user__url:chararray,retweeted_status__quoted_status__user__profile_sidebar_fill_color:chararray,quoted_status__user__is_translator:chararray,retweeted_status__quoted_status__user__lang:chararray,user__profile_sidebar_fill_color:chararray,retweeted_status__quoted_status__source:chararray,entities__media:chararray,entities__user_mentions:chararray,retweeted_status__user__created_at:chararray,user__has_extended_profile:chararray,quoted_status__user__time_zone:chararray,is_quote_status:chararray,place__id:chararray,retweeted_status__quoted_status__user__created_at:chararray,user__location:chararray,retweeted_status__quoted_status__user__follow_request_sent:chararray,quoted_status__user__url:chararray,retweeted_status__user__geo_enabled:chararray,in_reply_to_user_id:int,retweeted_status__in_reply_to_user_id_str:chararray,retweeted_status__quoted_status__user__profile_banner_url:chararray,retweeted_status__quoted_status__entities__media:chararray,retweeted_status__quoted_status__user__profile_image_url:chararray
I found the answer: All the json objects in the input file donot have the same schema. So i guess it is not able to load according to the defined schema defined for the JsonLoader().
Used Elephant bird which made life easier.

Loading json with varying schema into PIG

I ran into an issue loading a set json documents into PIG.
What I have is a lot of json documents that all vary in the fields they have, the fields that I need are in most documents and in whare missing I would like to get a null value.
I just downloaded and compiled the latest Pig version (0.12 straight from the apache git repository) just to be sure this hasn't been solved yet.
What I have is a json document like this:
{"foo":1,"bar":2,"baz":3}
When I load this into PIG using this
Json1 = LOAD 'test.json' USING JsonLoader('foo:int,bar:int,baz:int');
DESCRIBE Json1;
DUMP Json1;
I get the expected results
Json1: {foo: int,bar: int,baz: int}
(1,2,3)
However when the fields are in a different order in the schema :
Json2 = LOAD 'test.json' USING JsonLoader('baz:int,bar:int,foo:int');
DESCRIBE Json2;
DUMP Json2;
I get an undesired result:
Json2: {baz: int,bar: int,foo: int}
(1,2,3)
That should have been
(3,2,1)
Apparently the field names in the schema definition have nothing to do with the fieldnames in the json.
What I need is to load specific fields from a json file (with embedded documents!) into PIG.
How do I resolve this?
I think this is a known issue with even the latest version of Pig, so there isn't an easy way around this other than to use a more capable JsonLoader.
Use the Elephant Bird JSONLoader instead which will behave the way you expect - in other words respect field ordering.