Loading JSON with varying schema into Pig

I ran into an issue loading a set of JSON documents into Pig.
I have a lot of JSON documents that all vary in the fields they have. The fields that I need are in most documents, and where they are missing I would like to get a null value.
I just downloaded and compiled the latest Pig version (0.12 straight from the apache git repository) just to be sure this hasn't been solved yet.
What I have is a json document like this:
{"foo":1,"bar":2,"baz":3}
When I load this into PIG using this
Json1 = LOAD 'test.json' USING JsonLoader('foo:int,bar:int,baz:int');
DESCRIBE Json1;
DUMP Json1;
I get the expected results
Json1: {foo: int,bar: int,baz: int}
(1,2,3)
However, when the fields are in a different order in the schema:
Json2 = LOAD 'test.json' USING JsonLoader('baz:int,bar:int,foo:int');
DESCRIBE Json2;
DUMP Json2;
I get an undesired result:
Json2: {baz: int,bar: int,foo: int}
(1,2,3)
That should have been
(3,2,1)
Apparently the field names in the schema definition are not matched against the field names in the JSON; the values seem to be assigned by position.
What I need is to load specific fields from a json file (with embedded documents!) into PIG.
How do I resolve this?

I think this is a known issue even in the latest version of Pig, so there isn't an easy way around it other than to use a more capable JSON loader.
Use the Elephant Bird JsonLoader instead, which will behave the way you expect; in other words, it will respect field ordering.
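The behavior the question asks for (look fields up by name, get null when a field is absent) can be illustrated outside Pig with a small Python sketch. This is only an illustration of the semantics, not Pig or Elephant Bird code; the helper name project_by_name is hypothetical.

```python
import json

def project_by_name(line, field_names):
    """Return a tuple of the requested fields, looked up by key.

    Fields absent from the document come back as None (the null the
    question asks for), and the order of keys in the JSON is irrelevant.
    """
    doc = json.loads(line)
    return tuple(doc.get(name) for name in field_names)

sample = '{"foo":1,"bar":2,"baz":3}'
print(project_by_name(sample, ["baz", "bar", "foo"]))  # (3, 2, 1)
print(project_by_name(sample, ["foo", "qux"]))         # (1, None)
```

The first call gives the (3,2,1) that the question expected from the reordered schema, because each value is found by its key rather than by its position in the document.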

Related

Merging and/or Reading 88 JSON Files into Dataframe - different datatypes

I basically have a procedure where I make multiple calls to an API and, using a token within the JSON return, pass that back to a function to call the API again to get a "paginated" file.
In total I have to call and download 88 JSON files that total 758mb. The JSON files are all formatted the same way and have the same "schema" or at least should do. I have tried reading each JSON file after it has been downloaded into a data frame, and then attempted to union that dataframe to a master dataframe so essentially I'll have one big data frame with all 88 JSON files read into.
However, the problem I encounter is that at roughly file 66 the system (Python/Databricks/Spark) decides to change the data type of a field. It is always a string, and then, I'm guessing, when a value actually appears in that field it changes to a boolean. The unionByName then fails because of the different data types.
What is the best way for me to resolve this? I thought about reading using "extend" to merge all the JSON files into one big file however a 758mb JSON file would be a huge read and undertaking.
Could the other solution be to explicitly set the schema that the JSON file is read into so that it is always the same type?
If you know the attributes of those files, you can define the schema before reading them and create an empty df with that schema, so you can do a unionByName with allowMissingColumns=True.
Something like:
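from pyspark.sql.types import *
# Define the expected schema up front so every file is merged into the same columns
my_schema = StructType([
    StructField('file_name', StringType(), True),
    StructField('id', LongType(), True),
    StructField('dataset_name', StringType(), True),
    StructField('snapshotdate', TimestampType(), True)
])
# Empty DataFrame that carries the schema
output = sqlContext.createDataFrame(sc.emptyRDD(), my_schema)
df_json = spark.read.[...your JSON file...]
# unionByName returns a new DataFrame, so keep the result
output = output.unionByName(df_json, allowMissingColumns=True)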
I'm not sure this is what you are looking for. I hope it helps
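The idea raised in the question, pinning every field to one declared type so that a late file cannot change it, can be sketched in plain Python without Spark. The field name is_active, the SCHEMA mapping and the conform helper are all hypothetical and only stand in for the real column that flips from string to boolean.

```python
import json

# Hypothetical schema: field name -> the single type every file must use.
SCHEMA = {"id": int, "dataset_name": str, "is_active": str}

def conform(record, schema=SCHEMA):
    """Return a copy of record with exactly the schema's fields.

    Missing fields become None; present values are cast to the declared
    type, so a boolean that first shows up in a later file is stored as
    a string just like in the earlier files.
    """
    out = {}
    for name, typ in schema.items():
        value = record.get(name)
        out[name] = None if value is None else typ(value)
    return out

early = json.loads('{"id": 1, "dataset_name": "a", "is_active": null}')
late = json.loads('{"id": 2, "dataset_name": "b", "is_active": true}')
rows = [conform(early), conform(late)]
print(rows)
```

In PySpark the corresponding step is to hand the schema to the reader, as in spark.read.schema(my_schema).json(path), so that Spark does not infer a different type per file.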

How do I read a Large JSON Array File in PySpark

Issue
I recently encountered a challenge in Azure Data Lake Analytics when I attempted to read in a Large UTF-8 JSON Array file and switched to HDInsight PySpark (v2.x, not 3) to process the file. The file is ~110G and has ~150m JSON Objects.
HDInsight PySpark does not appear to support the Array of JSON file format for input, so I'm stuck. Also, I have "many" such files with different schemas, each containing hundreds of columns, so creating the schemas for those is not an option at this point.
Question
How do I use out-of-the-box functionality in PySpark 2 on HDInsight to enable these files to be read as JSON?
Thanks,
J
Things I tried
I used the approach at the bottom of a page from Databricks that supplied the code snippet below:
import json
df = sc.wholeTextFiles('/tmp/*.json').flatMap(lambda x: json.loads(x[1])).toDF()
display(df)
I tried the above, not understanding how "wholeTextFiles" works, and of course ran into OutOfMemory errors that killed my executors quickly.
I attempted loading to an RDD and other open methods, but PySpark appears to support only the JSONLines JSON file format, and I have the Array of JSON Objects due to ADLA's requirement for that file format.
I tried reading in as a text file, stripping Array characters, splitting on the JSON object boundaries and converting to JSON like the above, but that kept giving errors about being unable to convert unicode and/or str (ings).
I found a way through the above, and converted to a dataframe containing one column with Rows of strings that were the JSON Objects. However, I did not find a way to output only the JSON Strings from the data frame rows to an output file by themselves. They always came out as
{'dfColumnName':'{...json_string_as_value}'}
I also tried a map function that accepted the above rows, parsed as JSON, extracted the values (JSON I wanted), then parsed the values as JSON. This appeared to work, but when I would try to save, the RDD was type PipelineRDD and had no saveAsTextFile() method. I then tried the toJSON method, but kept getting errors about "found no valid JSON Object", which I did not understand admittedly, and of course other conversion errors.
I finally found a way forward. I learned that I could read json directly from an RDD, including a PipelineRDD. I found a way to remove the unicode byte order header, wrapping array square brackets, split the JSON Objects based on a fortunate delimiter, and have a distributed dataset for more efficient processing. The output dataframe now had columns named after the JSON elements, inferred the schema, and dynamically adapts for other file formats.
Here is the code - hope it helps!:
# Spark considers arrays of JSON objects to be an invalid format,
# and unicode files are prefixed with a byte order marker.
# `partitions` (the minimum number of partitions) is defined elsewhere.
thanksMoiraRDD = sc.textFile('/a/valid/file/path', partitions).map(
    lambda x: x.encode('utf-8', 'ignore').strip(u",\r\n[]\ufeff")
)
df = sqlContext.read.json(thanksMoiraRDD)
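What the per-line strip does can be checked on a tiny sample in plain Python 3, without a cluster. This is a minimal sketch that assumes, as the answer does, that every JSON object sits on its own line; the function name array_lines_to_records is hypothetical, and unlike the Spark version it skips the lines that become empty explicitly.

```python
import json

# Same character set the answer strips from each line.
STRIP_CHARS = ",\r\n[]\ufeff"

def array_lines_to_records(lines):
    """Yield one parsed object per line of a JSON array file.

    Assumes every object sits on its own line, so that removing the
    byte order mark, the enclosing brackets and the trailing commas
    leaves either an empty string or one complete JSON object.
    """
    for line in lines:
        cleaned = line.strip(STRIP_CHARS)
        if cleaned:  # skip the bare "[" and "]" lines
            yield json.loads(cleaned)

sample = [
    '\ufeff[\r\n',
    '{"a": 1, "b": "x"},\r\n',
    '{"a": 2, "b": "y"}\r\n',
    ']\r\n',
]
records = list(array_lines_to_records(sample))
print(records)
```

Because strip only removes characters from the two ends of a line, brackets and commas inside an object are left alone. The approach stops working if a single object spans several lines.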

Apache Pig: Store only specific fields using JsonStorage()

I loaded a JSON file using JsonLoader() and it loads correctly. Now I want to store only a few fields of the JSON object into a file using JsonStorage(). My Pig script is:
data_input = LOAD '$DATA_INPUT' USING JsonLoader(<<schema>>);
x = FOREACH data_input GENERATE (user__id_str);
STORE x INTO '$DATA_OUTPUT' USING JsonStorage();
Expected output:
{"user__id_str":12345}
{"user__id_str":12345}
{"user__id_str":123467}
Output I am getting:
{"user__id_str":null}
{"user__id_str":null}
{"user__id_str":null}
What is wrong?
EDIT: The schema is huge: consists of 306 fields:
user__contributors_enabled:chararray,retweeted_status__user__friends_count:int,quoted_status__extended_entities__media:chararray,retweeted_status__user__profile_background_image_url:chararray,quoted_status__user__is_translation_enabled:chararray,user__geo_enabled:chararray,avl_word_tags_all:chararray,quoted_status__user__profile_background_color:chararray,quoted_status__user__id_str:chararray,retweeted_status__place__bounding_box__coordinates:chararray,retweeted_status__quoted_status__metadata__result_type:chararray,retweeted_status__user__utc_offset:int,retweeted_status__user__contributors_enabled:chararray,retweeted_status__in_reply_to_screen_name:chararray,retweeted_status__place__place_type:chararray,retweeted_status__quoted_status__user__profile_background_image_url_https:chararray,user__utc_offset:int,quoted_status__favorited:chararray,user__entities__description__urls:chararray,place__url:chararray,quoted_status__user__profile_sidebar_border_color:chararray,favorited:chararray,retweeted_status__user__profile_banner_url:chararray,quoted_status__entities__user_mentions:chararray,retweet_count:int,retweeted_status__user__entities__description__urls:chararray,retweeted_status__quoted_status__user__is_translation_enabled:chararray,retweeted_status__entities__media:chararray,place__bounding_box__type:chararray,text_to_syntaxnet:chararray,quoted_status__user__chararrayed_count:int,avl_pos_tags:chararray,retweeted_status__user__statuses_count:int,quoted_status__metadata__iso_language_code:chararray,created_at:chararray,avl_lexicon_text:chararray,retweeted_status__lang:chararray,place__country:chararray,quoted_status__user__verified:chararray,retweeted_status__quoted_status__user__profile_background_tile:chararray,quoted_status__user__utc_offset:int,retweeted_status__quoted_status__user__location:chararray,quoted_status__created_at:chararray,retweeted_status__quoted_status__lang:chararray,place__place_type:chararray,user__profile_image_url:chararray,quoted_status__use
r__profile_use_background_image:chararray,user__name:chararray,user__notifications:chararray,user__id:int,in_reply_to_status_id:int,retweeted_status__metadata__iso_language_code:chararray,id:int,retweeted_status__user__follow_request_sent:chararray,retweeted_status__quoted_status__user__profile_use_background_image:chararray,retweeted_status__quoted_status__user__statuses_count:int,quoted_status__id_str:chararray,retweeted_status__user__profile_image_url:chararray,user__protected:chararray,user__profile_image_url_https:chararray,retweeted_status__source:chararray,quoted_status__source:chararray,retweeted_status__user__profile_link_color:chararray,retweeted_status__quoted_status__id_str:chararray,user__followers_count:int,retweeted_status__quoted_status__user__notifications:chararray,avl_num_sentences:int,retweeted_status__quoted_status__truncated:chararray,retweeted_status__text:chararray,quoted_status__favorite_count:int,quoted_status__metadata__result_type:chararray,truncated:chararray,metadata__iso_language_code:chararray,user__profile_banner_url:chararray,retweeted_status__quoted_status__user__profile_image_url_https:chararray,retweeted_status__quoted_status__user__utc_offset:int,quoted_status__user__profile_link_color:chararray,quoted_status__user__profile_image_url_https:chararray,retweeted_status__user__screen_name:chararray,retweeted_status__favorited:chararray,avl_lang:chararray,retweeted_status__user__location:chararray,retweeted_status__quoted_status__user__has_extended_profile:chararray,retweeted_status__quoted_status__user__verified:chararray,user__description:chararray,retweeted_status__user__profile_use_background_image:chararray,retweeted_status__quoted_status__user__contributors_enabled:chararray,quoted_status__is_quote_status:chararray,avl_sent:chararray,quoted_status__entities__media:chararray,quoted_status__possibly_sensitive:chararray,quoted_status__user__favourites_count:int,retweeted_status__quoted_status__user__default_profile_image:chararray
,avl_num_words:int,quoted_status__user__friends_count:int,id_str:chararray,user__default_profile:chararray,user__profile_text_color:chararray,quoted_status__user__description:chararray,retweeted_status__user__favourites_count:int,retweeted_status__quoted_status__user__friends_count:int,quoted_status__user__name:chararray,retweeted_status__quoted_status__created_at:chararray,user__verified:chararray,quoted_status_id_str:chararray,user__profile_sidebar_border_color:chararray,retweeted_status__quoted_status__user__profile_text_color:chararray,retweeted_status__quoted_status__user__following:chararray,favorite_count:int,retweeted_status__quoted_status__entities__symbols:chararray,source:chararray,quoted_status_id:int,user__profile_use_background_image:chararray,retweeted_status__user__following:chararray,quoted_status__user__location:chararray,coordinates__type:chararray,retweeted_status__user__id:int,retweeted_status__quoted_status__text:chararray,quoted_status__entities__urls:chararray,retweeted_status__in_reply_to_status_id_str:chararray,text:chararray,retweeted_status__quoted_status__is_quote_status:chararray,quoted_status__id:int,user__entities__url__urls:chararray,quoted_status__user__contributors_enabled:chararray,retweeted_status__quoted_status__user__favourites_count:int,retweeted_status__quoted_status__id:int,retweeted_status__retweet_count:int,retweeted_status__favorite_count:int,metadata__result_type:chararray,retweeted_status__user__protected:chararray,retweeted_status__quoted_status__user__name:chararray,possibly_sensitive:chararray,retweeted_status__user__profile_sidebar_fill_color:chararray,retweeted_status__user__profile_image_url_https:chararray,retweeted_status__quoted_status_id:int,place__contained_within:chararray,retweeted_status__user__id_str:chararray,retweeted_status__user__entities__url__urls:chararray,retweeted_status__id_str:chararray,retweeted_status__quoted_status__entities__user_mentions:chararray,in_reply_to_status_id_str:chararray,retwee
ted_status__user__has_extended_profile:chararray,user__default_profile_image:chararray,user__is_translator:chararray,place__bounding_box__coordinates:chararray,retweeted_status__is_quote_status:chararray,quoted_status__user__entities__description__urls:chararray,entities__urls:chararray,retweeted_status__quoted_status__favorite_count:int,quoted_status__truncated:chararray,retweeted_status__user__default_profile_image:chararray,user__statuses_count:int,retweeted_status__quoted_status__user__entities__description__urls:chararray,retweeted_status__quoted_status__entities__hashtags:chararray,retweeted_status__quoted_status__user__description:chararray,retweeted_status__user__verified:chararray,retweeted_status__user__followers_count:int,avl_syn_1:chararray,quoted_status__user__default_profile:chararray,retweeted_status__place__bounding_box__type:chararray,retweeted_status__id:int,retweeted_status__user__lang:chararray,retweeted_status__quoted_status__user__default_profile:chararray,retweeted_status__quoted_status__user__profile_link_color:chararray,retweeted_status__in_reply_to_user_id:int,retweeted_status__user__is_translation_enabled:chararray,retweeted_status__user__chararrayed_count:int,quoted_status__user__default_profile_image:chararray,quoted_status__retweet_count:int,retweeted_status__user__profile_background_tile:chararray,quoted_status__user__id:int,retweeted_status__quoted_status__user__screen_name:chararray,retweeted_status__user__notifications:chararray,coordinates__coordinates:chararray,avl_brand_1:chararray,retweeted_status__quoted_status__metadata__iso_language_code:chararray,retweeted_status__quoted_status__retweeted:chararray,retweeted_status__quoted_status_id_str:chararray,retweeted_status__user__profile_text_color:chararray,quoted_status__retweeted:chararray,retweeted_status__user__is_translator:chararray,retweeted_status__user__default_profile:chararray,retweeted_status__extended_entities__media:chararray,avl_word_tags:chararray,quoted_status__user_
_follow_request_sent:chararray,retweeted_status__quoted_status__possibly_sensitive:chararray,user__screen_name:chararray,quoted_status__user__profile_banner_url:chararray,extended_entities__media:chararray,retweeted_status__quoted_status__retweet_count:int,quoted_status__user__profile_background_image_url:chararray,place__name:chararray,user__created_at:chararray,lang:chararray,in_reply_to_screen_name:chararray,retweeted_status__in_reply_to_status_id:int,quoted_status__user__profile_text_color:chararray,user__url:chararray,retweeted_status__user__profile_background_image_url_https:chararray,retweeted_status__truncated:chararray,entities__symbols:chararray,retweeted_status__quoted_status__user__profile_sidebar_border_color:chararray,quoted_status__entities__hashtags:chararray,retweeted_status__created_at:chararray,place__country_code:chararray,quoted_status__user__screen_name:chararray,avl_score:int,quoted_status__user__lang:chararray,avl_source:chararray,place__full_name:chararray,retweeted_status__place__url:chararray,retweeted_status__user__profile_background_color:chararray,quoted_status__user__following:chararray,quoted_status__user__profile_image_url:chararray,quoted_status__text:chararray,user__chararrayed_count:int,retweeted_status__quoted_status__user__protected:chararray,avl_words_not_in_lexicon:chararray,retweeted_status__quoted_status__user__id_str:chararray,quoted_status__user__followers_count:int,retweeted_status__quoted_status__extended_entities__media:chararray,retweeted_status__quoted_status__user__is_translator:chararray,user__time_zone:chararray,retweeted_status__metadata__result_type:chararray,in_reply_to_user_id_str:chararray,quoted_status__user__profile_background_image_url_https:chararray,avl_num_paragraphs:int,retweeted_status__quoted_status__user__profile_background_color:chararray,retweeted_status__quoted_status__user__followers_count:int,quoted_status__user__has_extended_profile:chararray,retweeted_status__user__profile_sidebar_border_color
:chararray,avl_brand_all:chararray,retweeted_status__place__country_code:chararray,retweeted_status__user__description:chararray,quoted_status__user__profile_background_tile:chararray,retweeted_status__quoted_status__user__geo_enabled:chararray,quoted_status__user__created_at:chararray,entities__hashtags:chararray,retweeted_status__user__time_zone:chararray,quoted_status__user__geo_enabled:chararray,retweeted_status__possibly_sensitive:chararray,retweeted_status__user__name:chararray,retweeted:chararray,quoted_status__user__entities__url__urls:chararray,user__profile_background_tile:chararray,user__follow_request_sent:chararray,retweeted_status__quoted_status__entities__urls:chararray,quoted_status__user__statuses_count:int,retweeted_status__quoted_status__user__profile_background_image_url:chararray,user__is_translation_enabled:chararray,user__profile_background_image_url_https:chararray,user__friends_count:int,retweeted_status__quoted_status__user__id:int,geo__coordinates:chararray,user__following:chararray,user__favourites_count:int,retweeted_status__place__country:chararray,retweeted_status__quoted_status__user__chararrayed_count:int,user__profile_link_color:chararray,retweeted_status__place__full_name:chararray,quoted_status__user__protected:chararray,quoted_status__user__notifications:chararray,user__lang:chararray,retweeted_status__place__contained_within:chararray,retweeted_status__entities__hashtags:chararray,retweeted_status__entities__urls:chararray,user__profile_background_image_url:chararray,retweeted_status__quoted_status__favorited:chararray,retweeted_status__place__name:chararray,user__profile_background_color:chararray,geo__type:chararray,retweeted_status__entities__symbols:chararray,retweeted_status__place__id:chararray,quoted_status__lang:chararray,retweeted_status__retweeted:chararray,avl_sentences:chararray,avl_global_idx:int,retweeted_status__entities__user_mentions:chararray,retweeted_status__quoted_status__user__time_zone:chararray,user__id_s
tr:chararray,quoted_status__user__profile_sidebar_fill_color:chararray,quoted_status__entities__symbols:chararray,retweeted_status__user__url:chararray,retweeted_status__quoted_status__user__profile_sidebar_fill_color:chararray,quoted_status__user__is_translator:chararray,retweeted_status__quoted_status__user__lang:chararray,user__profile_sidebar_fill_color:chararray,retweeted_status__quoted_status__source:chararray,entities__media:chararray,entities__user_mentions:chararray,retweeted_status__user__created_at:chararray,user__has_extended_profile:chararray,quoted_status__user__time_zone:chararray,is_quote_status:chararray,place__id:chararray,retweeted_status__quoted_status__user__created_at:chararray,user__location:chararray,retweeted_status__quoted_status__user__follow_request_sent:chararray,quoted_status__user__url:chararray,retweeted_status__user__geo_enabled:chararray,in_reply_to_user_id:int,retweeted_status__in_reply_to_user_id_str:chararray,retweeted_status__quoted_status__user__profile_banner_url:chararray,retweeted_status__quoted_status__entities__media:chararray,retweeted_status__quoted_status__user__profile_image_url:chararray
I found the answer: not all the JSON objects in the input file have the same schema, so I guess JsonLoader() is not able to load them according to the schema defined for it.
I used Elephant Bird, which made life easier.
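One way to confirm that the objects in a file really do differ in schema is to count the distinct sets of top-level keys. The sketch below does that in Python for a file with one JSON object per line; the sample records and the helper name key_set_counts are made up for illustration.

```python
import json
from collections import Counter

def key_set_counts(lines):
    """Count how many records share each distinct set of top-level keys."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line:
            counts[frozenset(json.loads(line))] += 1
    return counts

sample = [
    '{"user__id_str": "12345", "text": "a"}',
    '{"text": "b", "lang": "en"}',
    '{"user__id_str": "123467", "text": "c"}',
]
for keys, n in key_set_counts(sample).most_common():
    print(n, sorted(keys))
```

More than one line of output means the file mixes schemas, which is the situation the answer describes.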

Load different json schemas in PIG

I would like to know how to read different JSON schemas from one file in Pig.
In Hadoop I would use a JSON parser, and with if statements I would find out what kind of JSON element it is.
The JSON elements inside one document are:
{"a": "bla", "e": 123, "f": 333}
{ "a": "bla", "c": "aa"}
I tried to load the first JSON element with the following command:
A = load '/usr/local/hadoop/stuff.net' USING JsonLoader('a:chararray, e:int, f:int');
DUMP A;
It throws the error: ERROR 2088: Fetch failed. Couldn't retrieve result
The Second query is working:
B = load '/home/hadoop/Desktop/aaa' USING JsonLoader('a:chararray, c:chararray');
DUMP B;
But it also shows me results from the first statement.
So I wanted to ask how to load different Json schemas from the same file or isn't that possible?
I think that you can use Twitter's Elephant Bird project. You can find some examples here.
Usage is quite easy: you just register the jar file, and then you can use the Elephant Bird loader to load nested JSON:
REGISTER 'elephant-bird.jar';
json_file_00 = LOAD 'json_file.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
json_file_01 = FOREACH json_file_00 GENERATE $0#'fieldName' AS field_name;
I am also using Mozilla's Akela project, which is great but outdated.
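The "parse, then use if statements to find out what kind of element it is" approach mentioned in the question can be sketched in Python using the two sample elements from the question. The labels returned by classify are hypothetical names chosen for this example.

```python
import json

def classify(record):
    """Name the kind of element by which keys are present (hypothetical labels)."""
    if "e" in record and "f" in record:
        return "with_e_f"
    if "c" in record:
        return "with_c"
    return "unknown"

lines = [
    '{"a": "bla", "e": 123, "f": 333}',
    '{ "a": "bla", "c": "aa"}',
]
by_kind = {}
for line in lines:
    record = json.loads(line)
    by_kind.setdefault(classify(record), []).append(record)
print(by_kind)
```

Loading each line as a map and looking keys up by name, as in the Elephant Bird snippet above, allows the same kind of branching inside Pig.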

Parse Complex JSON String in Pig

I want to parse a string of complex JSON in Pig. Specifically, I want Pig to understand my JSON array as a bag instead of as a single chararray. When using JsonLoader, I can do this easily by specifying the schema, as in this question. Is there any way to either have Pig figure out my schema for me, or to specify it when Pig is parsing a string? I've been using JsonStringToMap, but can't find a way to specify Schema, or to have it properly understand my JSON array is an array and not a single chararray.
I wound up using JsonTupleMap() in Mozilla's Akela library for Pig. It accomplishes exactly what I want by parsing all of my JSON, even when it's complex and even when I don't provide a schema. If you run into the same problem as me, use that.
Example usage:
REGISTER '/path/to/akela-0.5-SNAPSHOT.jar';
DEFINE JsonTupleMap com.mozilla.pig.eval.json.JsonTupleMap();
loaded = LOAD '$INPUT' AS (json_string:chararray, ...);
jsonified = FOREACH loaded GENERATE JsonTupleMap(json_string) AS json:map[], ...;
some_generate = FOREACH jsonified GENERATE json#'key'#'sub_key';
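For comparison, here is what schema-less parsing with nested lookup looks like in Python: the JSON array stays a list instead of collapsing into a single string, and keys are followed one level at a time, much like json#'key'#'sub_key'. This is not how Akela is implemented; the lookup helper and the sample string are hypothetical.

```python
import json

def lookup(doc, *keys):
    """Follow a chain of keys through nested dicts; None if any is missing."""
    current = doc
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

json_string = '{"key": {"sub_key": [1, 2, 3]}, "name": "x"}'
doc = json.loads(json_string)
print(lookup(doc, "key", "sub_key"))  # [1, 2, 3]
print(lookup(doc, "key", "missing"))  # None
```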