How to load JSON plus a value outside the JSON in Pig?

I have a line that contains a value followed by a JSON object:
000000,{"000":{"phoneNumber":null,"firstName":"xyz","lastName":"pqr","email":"email#xyz.com","alternatePickup":true,"sendTextNotification":false,"isSendTextNotification":false,"isAlternatePickup":true}}
I'm trying to load this line in Pig using the Elephant Bird JSON loader, but I'm unable to do so.
I am able to load the following JSON (without the leading value):
{"000":{"phoneNumber":null,"firstName":"xyz","lastName":"pqr","email":"email#xyz.com","alternatePickup":true,"sendTextNotification":false,"isSendTextNotification":false,"isAlternatePickup":true}}
using the following script:
REGISTER json-simple-1.1.1.jar;
REGISTER elephant-bird-pig-4.3.jar;
REGISTER elephant-bird-hadoop-compat-4.3.jar;
json_data = load 'ek.json' using com.twitter.elephantbird.pig.load.JsonLoader() AS (json_key: [(phoneNumber:chararray,firstName:chararray,lastName:chararray,email:chararray,alternatePickup:boolean,sendTextNotification:boolean,isSendTextNotification:boolean,isAlternatePickup:boolean)]);
dump json_data;
But when I include the value outside the JSON in the schema:
json_data = load 'ek.json' using com.twitter.elephantbird.pig.load.JsonLoader() AS (id:int,json_key: [(phoneNumber:chararray,firstName:chararray,lastName:chararray,email:chararray,alternatePickup:boolean,sendTextNotification:boolean,isSendTextNotification:boolean,isAlternatePickup:boolean)]);
it does not work. I'd appreciate any help.

JsonLoader can only load valid JSON, while your format is actually CSV. You have three options, ordered by increasing complexity:
Adjust your input format so that the id is part of the JSON.
Load the data as CSV with two fields (id and json), then use a custom UDF to parse the json field into a tuple.
Write a custom loader that accepts your original format.
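As a rough illustration of the second option, here is a minimal Python sketch of the split-then-parse step. It assumes each line has the form id,{json} and that the id never contains a comma. The function name split_id_and_json is hypothetical, and inside Pig this logic would still have to be wrapped in a UDF, which is not shown here.

```python
import json


def split_id_and_json(line):
    """Split 'id,{json}' at the first comma only, then parse the JSON part.

    Hypothetical helper: assumes the id itself contains no comma.
    """
    record_id, json_text = line.split(",", 1)
    return record_id, json.loads(json_text)


# Shortened version of the sample line from the question
line = '000000,{"000":{"phoneNumber":null,"firstName":"xyz","alternatePickup":true}}'
record_id, payload = split_id_and_json(line)
print(record_id, payload["000"]["firstName"])
```

Splitting on the first comma only matters because the JSON part contains commas of its own.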

You can use the built-in JsonStorage and JsonLoader. With an explicit schema:
a = load 'a.json' using JsonLoader('a0:int,a1:{(a10:int,a11:chararray)},a2:(a20:double,a21:bytearray),a3:[chararray]');
In the following example the data is loaded without a schema; it assumes there is a .pig_schema file (produced by JsonStorage) in the input directory.
a = load 'a.json' using JsonLoader();

Related

Merging and/or Reading 88 JSON Files into Dataframe - different datatypes

I basically have a procedure where I make multiple calls to an API and, using a token within the JSON return, pass that back to a function to call the API again to get a "paginated" file.
In total I have to call and download 88 JSON files that total 758 MB. The JSON files are all formatted the same way and have the same "schema", or at least they should. I have tried reading each JSON file into a DataFrame after it has been downloaded, and then attempted to union that DataFrame to a master DataFrame, so essentially I'll have one big DataFrame with all 88 JSON files read into it.
However, the problem I encounter is that at roughly file 66 the system (Python/Databricks/Spark) decides to change the data type of a field. It is always a string, and then, I'm guessing, when a value actually appears in that field it changes to a boolean. The unionByName then fails because of the different data types.
What is the best way for me to resolve this? I thought about using "extend" to merge all the JSON files into one big file; however, a 758 MB JSON file would be a huge read and undertaking.
Could the other solution be to explicitly set the schema that the JSON files are read with, so that the type is always the same?
If you know the attributes of those files, you can define the schema before reading them and create an empty DataFrame with that schema, so you can do a unionByName with allowMissingColumns=True.
Something like:
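from pyspark.sql.types import StructType, StructField, StringType, LongType, TimestampType

# Explicit schema, so the master DataFrame always has the same column types
my_schema = StructType([
    StructField('file_name', StringType(), True),
    StructField('id', LongType(), True),
    StructField('dataset_name', StringType(), True),
    StructField('snapshotdate', TimestampType(), True)
])
# Empty master DataFrame with the target schema
output = spark.createDataFrame([], my_schema)
df_json = spark.read.[...your JSON file...]
# unionByName returns a new DataFrame, so keep the result
# (allowMissingColumns needs Spark 3.1 or later)
output = output.unionByName(df_json, allowMissingColumns=True)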
I'm not sure this is what you are looking for. I hope it helps
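Before fixing the schema, it can help to confirm which field actually changes type between files. The following is a minimal plain-Python sketch for that, not Spark code: it assumes each file has already been parsed into a list of dict records, and the names field_types, type_conflicts, and the sample file names and fields are hypothetical.

```python
from collections import defaultdict


def field_types(records):
    """Map each field to the set of non-null Python type names seen for it."""
    seen = defaultdict(set)
    for rec in records:
        for key, value in rec.items():
            types = seen[key]  # creates an empty set even if every value is null
            if value is not None:
                types.add(type(value).__name__)
    return dict(seen)


def type_conflicts(files):
    """files: dict of file name -> list of dict records.

    Returns {field: {file name: sorted type names}} for every field whose
    observed types differ between files (an all-null field shows up as []).
    """
    per_field = defaultdict(dict)
    for name, records in files.items():
        for field, types in field_types(records).items():
            per_field[field][name] = sorted(types)
    return {
        field: by_file
        for field, by_file in per_field.items()
        if len({tuple(t) for t in by_file.values()}) > 1
    }


# Made-up sample data: 'flag' is always null in one file, boolean in another
files = {
    "page_01.json": [{"id": 1, "flag": None}],
    "page_66.json": [{"id": 2, "flag": True}],
}
conflicts = type_conflicts(files)
print(conflicts)
```

A field that is null in every record of one file and typed in another is the situation the question guesses at, so it is reported here as a conflict.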

Reading JSON in Azure Synapse

I'm trying to understand the code for reading a JSON file in Synapse Analytics. Here is the code provided by the Microsoft documentation:
Query JSON files using serverless SQL pool in Azure Synapse Analytics
select top 10 *
from openrowset(
bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.jsonl',
format = 'csv',
fieldterminator ='0x0b',
fieldquote = '0x0b'
) with (doc nvarchar(max)) as rows
go
I wonder why the format = 'csv'. Is it trying to convert JSON to CSV to flatten the file?
Why they didn't just read the file as a SINGLE_CLOB I don't know
When you use SINGLE_CLOB, the entire file is imported as one value, and the content of this file is not well formed as a single JSON document. Using SINGLE_CLOB would force us to do more work after the openrowset before we could use the content as JSON (since the value as a whole is not valid JSON, we would need to parse it first). It can be done, but it would probably require more work.
The file consists of multiple JSON-like strings, each on a separate line: "line-delimited JSON", as the document calls it.
By the way, if you check the history of the document on GitHub, you will find that originally this was not the case. As far as I remember, the file originally contained a single JSON document with an array of objects (it was wrapped with [] after being loaded). Someone named "Ronen Ariely" in fact found this issue in the document, which is why you can see my name in the list of authors of the document :-)
I wonder why the format = 'csv'. Is it trying to convert json to csv to flatten the hierarchy?
(1) JSON is not a data type in SQL Server. There is no data type named JSON. What SQL Server has are tools, such as functions, that work on text and support strings in JSON format. Therefore, we do not CONVERT to JSON or from JSON.
(2) The format parameter has nothing to do with JSON. It specifies that the content of the file is a comma-separated values file. You can (and should) use it whenever your file is well formatted as comma-separated values (commonly known as a CSV file).
In this specific sample, the values in the CSV file are strings, each of which is valid JSON. Because fieldterminator and fieldquote are both set to 0x0b, a character that is not expected to occur in the data, each line is returned as a single column (doc). Only after the file has been read with openrowset do we start to parse the text as JSON.
Notice that only after the heading "Parse JSON documents" does the document start to talk about parsing the text as JSON.
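The difference between one JSON document and line-delimited JSON can be seen outside SQL as well. The sketch below is plain Python, not Synapse code, and the two sample lines with their field names are invented for illustration. It shows that parsing the whole text as a single JSON value fails, while parsing line by line works, which mirrors why reading the file as one value would need extra work.

```python
import json

# Invented sample of line-delimited JSON: one JSON object per line
jsonl_text = '{"id": 1, "cases": 5}\n{"id": 2, "cases": 7}\n'


def parse_whole(text):
    """Try to parse the entire text as ONE JSON document; None if invalid."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


def parse_lines(text):
    """Parse each non-empty line as its own JSON document."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


whole = parse_whole(jsonl_text)
rows = parse_lines(jsonl_text)
print(whole)
print(rows)
```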

passing variable to json file while matching response in karate

I'm validating my response from a GET call through a .json file
match response == read('match_response.json')
Now I want to reuse this file for various other features as only one field in the .json varies. Let's say this param in the json file is "varyingField"
I'm trying to pass this field every time I match the response, but I'm not able to:
def varyingField = 'VARIATION1'
match response == read('match_response.json') {'varyingField' : '#(varyingField)'}}
In the json file I have
"varyingField": "#(varyingField)"
You are trying to pass an argument to read for a JSON file? Sorry, that is not supported in Karate; please read the docs.
Use this pattern:
create a JSON file that has all your "happy path" values set
use the read() syntax to load the file (which means this is re-usable across multiple tests)
use the set keyword to update only the field for your scenario or negative test
For more details, refer to this answer: https://stackoverflow.com/a/51896522/143475

Apache Pig: Store only specific fields using JsonStorage()

I loaded a JSON file using JsonLoader() and it loads correctly. Now I want to store only a few fields of the JSON object into a file using JsonStorage(). My Pig script is:
data_input = LOAD '$DATA_INPUT' USING JsonLoader(<<schema>>);
x = FOREACH data_input GENERATE (user__id_str);
STORE x INTO '$DATA_OUTPUT' USING JsonStorage();
Expected output:
{"user__id_str":12345}
{"user__id_str":12345}
{"user__id_str":123467}
Output I am getting:
{"user__id_str":null}
{"user__id_str":null}
{"user__id_str":null}
What is wrong?
EDIT: The schema is huge; it consists of 306 fields:
user__contributors_enabled:chararray,retweeted_status__user__friends_count:int,quoted_status__extended_entities__media:chararray,retweeted_status__user__profile_background_image_url:chararray,quoted_status__user__is_translation_enabled:chararray,user__geo_enabled:chararray,avl_word_tags_all:chararray,quoted_status__user__profile_background_color:chararray,quoted_status__user__id_str:chararray,retweeted_status__place__bounding_box__coordinates:chararray,retweeted_status__quoted_status__metadata__result_type:chararray,retweeted_status__user__utc_offset:int,retweeted_status__user__contributors_enabled:chararray,retweeted_status__in_reply_to_screen_name:chararray,retweeted_status__place__place_type:chararray,retweeted_status__quoted_status__user__profile_background_image_url_https:chararray,user__utc_offset:int,quoted_status__favorited:chararray,user__entities__description__urls:chararray,place__url:chararray,quoted_status__user__profile_sidebar_border_color:chararray,favorited:chararray,retweeted_status__user__profile_banner_url:chararray,quoted_status__entities__user_mentions:chararray,retweet_count:int,retweeted_status__user__entities__description__urls:chararray,retweeted_status__quoted_status__user__is_translation_enabled:chararray,retweeted_status__entities__media:chararray,place__bounding_box__type:chararray,text_to_syntaxnet:chararray,quoted_status__user__chararrayed_count:int,avl_pos_tags:chararray,retweeted_status__user__statuses_count:int,quoted_status__metadata__iso_language_code:chararray,created_at:chararray,avl_lexicon_text:chararray,retweeted_status__lang:chararray,place__country:chararray,quoted_status__user__verified:chararray,retweeted_status__quoted_status__user__profile_background_tile:chararray,quoted_status__user__utc_offset:int,retweeted_status__quoted_status__user__location:chararray,quoted_status__created_at:chararray,retweeted_status__quoted_status__lang:chararray,place__place_type:chararray,user__profile_image_url:chararray,quoted_status__use
r__profile_use_background_image:chararray,user__name:chararray,user__notifications:chararray,user__id:int,in_reply_to_status_id:int,retweeted_status__metadata__iso_language_code:chararray,id:int,retweeted_status__user__follow_request_sent:chararray,retweeted_status__quoted_status__user__profile_use_background_image:chararray,retweeted_status__quoted_status__user__statuses_count:int,quoted_status__id_str:chararray,retweeted_status__user__profile_image_url:chararray,user__protected:chararray,user__profile_image_url_https:chararray,retweeted_status__source:chararray,quoted_status__source:chararray,retweeted_status__user__profile_link_color:chararray,retweeted_status__quoted_status__id_str:chararray,user__followers_count:int,retweeted_status__quoted_status__user__notifications:chararray,avl_num_sentences:int,retweeted_status__quoted_status__truncated:chararray,retweeted_status__text:chararray,quoted_status__favorite_count:int,quoted_status__metadata__result_type:chararray,truncated:chararray,metadata__iso_language_code:chararray,user__profile_banner_url:chararray,retweeted_status__quoted_status__user__profile_image_url_https:chararray,retweeted_status__quoted_status__user__utc_offset:int,quoted_status__user__profile_link_color:chararray,quoted_status__user__profile_image_url_https:chararray,retweeted_status__user__screen_name:chararray,retweeted_status__favorited:chararray,avl_lang:chararray,retweeted_status__user__location:chararray,retweeted_status__quoted_status__user__has_extended_profile:chararray,retweeted_status__quoted_status__user__verified:chararray,user__description:chararray,retweeted_status__user__profile_use_background_image:chararray,retweeted_status__quoted_status__user__contributors_enabled:chararray,quoted_status__is_quote_status:chararray,avl_sent:chararray,quoted_status__entities__media:chararray,quoted_status__possibly_sensitive:chararray,quoted_status__user__favourites_count:int,retweeted_status__quoted_status__user__default_profile_image:chararray
,avl_num_words:int,quoted_status__user__friends_count:int,id_str:chararray,user__default_profile:chararray,user__profile_text_color:chararray,quoted_status__user__description:chararray,retweeted_status__user__favourites_count:int,retweeted_status__quoted_status__user__friends_count:int,quoted_status__user__name:chararray,retweeted_status__quoted_status__created_at:chararray,user__verified:chararray,quoted_status_id_str:chararray,user__profile_sidebar_border_color:chararray,retweeted_status__quoted_status__user__profile_text_color:chararray,retweeted_status__quoted_status__user__following:chararray,favorite_count:int,retweeted_status__quoted_status__entities__symbols:chararray,source:chararray,quoted_status_id:int,user__profile_use_background_image:chararray,retweeted_status__user__following:chararray,quoted_status__user__location:chararray,coordinates__type:chararray,retweeted_status__user__id:int,retweeted_status__quoted_status__text:chararray,quoted_status__entities__urls:chararray,retweeted_status__in_reply_to_status_id_str:chararray,text:chararray,retweeted_status__quoted_status__is_quote_status:chararray,quoted_status__id:int,user__entities__url__urls:chararray,quoted_status__user__contributors_enabled:chararray,retweeted_status__quoted_status__user__favourites_count:int,retweeted_status__quoted_status__id:int,retweeted_status__retweet_count:int,retweeted_status__favorite_count:int,metadata__result_type:chararray,retweeted_status__user__protected:chararray,retweeted_status__quoted_status__user__name:chararray,possibly_sensitive:chararray,retweeted_status__user__profile_sidebar_fill_color:chararray,retweeted_status__user__profile_image_url_https:chararray,retweeted_status__quoted_status_id:int,place__contained_within:chararray,retweeted_status__user__id_str:chararray,retweeted_status__user__entities__url__urls:chararray,retweeted_status__id_str:chararray,retweeted_status__quoted_status__entities__user_mentions:chararray,in_reply_to_status_id_str:chararray,retwee
ted_status__user__has_extended_profile:chararray,user__default_profile_image:chararray,user__is_translator:chararray,place__bounding_box__coordinates:chararray,retweeted_status__is_quote_status:chararray,quoted_status__user__entities__description__urls:chararray,entities__urls:chararray,retweeted_status__quoted_status__favorite_count:int,quoted_status__truncated:chararray,retweeted_status__user__default_profile_image:chararray,user__statuses_count:int,retweeted_status__quoted_status__user__entities__description__urls:chararray,retweeted_status__quoted_status__entities__hashtags:chararray,retweeted_status__quoted_status__user__description:chararray,retweeted_status__user__verified:chararray,retweeted_status__user__followers_count:int,avl_syn_1:chararray,quoted_status__user__default_profile:chararray,retweeted_status__place__bounding_box__type:chararray,retweeted_status__id:int,retweeted_status__user__lang:chararray,retweeted_status__quoted_status__user__default_profile:chararray,retweeted_status__quoted_status__user__profile_link_color:chararray,retweeted_status__in_reply_to_user_id:int,retweeted_status__user__is_translation_enabled:chararray,retweeted_status__user__chararrayed_count:int,quoted_status__user__default_profile_image:chararray,quoted_status__retweet_count:int,retweeted_status__user__profile_background_tile:chararray,quoted_status__user__id:int,retweeted_status__quoted_status__user__screen_name:chararray,retweeted_status__user__notifications:chararray,coordinates__coordinates:chararray,avl_brand_1:chararray,retweeted_status__quoted_status__metadata__iso_language_code:chararray,retweeted_status__quoted_status__retweeted:chararray,retweeted_status__quoted_status_id_str:chararray,retweeted_status__user__profile_text_color:chararray,quoted_status__retweeted:chararray,retweeted_status__user__is_translator:chararray,retweeted_status__user__default_profile:chararray,retweeted_status__extended_entities__media:chararray,avl_word_tags:chararray,quoted_status__user_
_follow_request_sent:chararray,retweeted_status__quoted_status__possibly_sensitive:chararray,user__screen_name:chararray,quoted_status__user__profile_banner_url:chararray,extended_entities__media:chararray,retweeted_status__quoted_status__retweet_count:int,quoted_status__user__profile_background_image_url:chararray,place__name:chararray,user__created_at:chararray,lang:chararray,in_reply_to_screen_name:chararray,retweeted_status__in_reply_to_status_id:int,quoted_status__user__profile_text_color:chararray,user__url:chararray,retweeted_status__user__profile_background_image_url_https:chararray,retweeted_status__truncated:chararray,entities__symbols:chararray,retweeted_status__quoted_status__user__profile_sidebar_border_color:chararray,quoted_status__entities__hashtags:chararray,retweeted_status__created_at:chararray,place__country_code:chararray,quoted_status__user__screen_name:chararray,avl_score:int,quoted_status__user__lang:chararray,avl_source:chararray,place__full_name:chararray,retweeted_status__place__url:chararray,retweeted_status__user__profile_background_color:chararray,quoted_status__user__following:chararray,quoted_status__user__profile_image_url:chararray,quoted_status__text:chararray,user__chararrayed_count:int,retweeted_status__quoted_status__user__protected:chararray,avl_words_not_in_lexicon:chararray,retweeted_status__quoted_status__user__id_str:chararray,quoted_status__user__followers_count:int,retweeted_status__quoted_status__extended_entities__media:chararray,retweeted_status__quoted_status__user__is_translator:chararray,user__time_zone:chararray,retweeted_status__metadata__result_type:chararray,in_reply_to_user_id_str:chararray,quoted_status__user__profile_background_image_url_https:chararray,avl_num_paragraphs:int,retweeted_status__quoted_status__user__profile_background_color:chararray,retweeted_status__quoted_status__user__followers_count:int,quoted_status__user__has_extended_profile:chararray,retweeted_status__user__profile_sidebar_border_color
:chararray,avl_brand_all:chararray,retweeted_status__place__country_code:chararray,retweeted_status__user__description:chararray,quoted_status__user__profile_background_tile:chararray,retweeted_status__quoted_status__user__geo_enabled:chararray,quoted_status__user__created_at:chararray,entities__hashtags:chararray,retweeted_status__user__time_zone:chararray,quoted_status__user__geo_enabled:chararray,retweeted_status__possibly_sensitive:chararray,retweeted_status__user__name:chararray,retweeted:chararray,quoted_status__user__entities__url__urls:chararray,user__profile_background_tile:chararray,user__follow_request_sent:chararray,retweeted_status__quoted_status__entities__urls:chararray,quoted_status__user__statuses_count:int,retweeted_status__quoted_status__user__profile_background_image_url:chararray,user__is_translation_enabled:chararray,user__profile_background_image_url_https:chararray,user__friends_count:int,retweeted_status__quoted_status__user__id:int,geo__coordinates:chararray,user__following:chararray,user__favourites_count:int,retweeted_status__place__country:chararray,retweeted_status__quoted_status__user__chararrayed_count:int,user__profile_link_color:chararray,retweeted_status__place__full_name:chararray,quoted_status__user__protected:chararray,quoted_status__user__notifications:chararray,user__lang:chararray,retweeted_status__place__contained_within:chararray,retweeted_status__entities__hashtags:chararray,retweeted_status__entities__urls:chararray,user__profile_background_image_url:chararray,retweeted_status__quoted_status__favorited:chararray,retweeted_status__place__name:chararray,user__profile_background_color:chararray,geo__type:chararray,retweeted_status__entities__symbols:chararray,retweeted_status__place__id:chararray,quoted_status__lang:chararray,retweeted_status__retweeted:chararray,avl_sentences:chararray,avl_global_idx:int,retweeted_status__entities__user_mentions:chararray,retweeted_status__quoted_status__user__time_zone:chararray,user__id_s
tr:chararray,quoted_status__user__profile_sidebar_fill_color:chararray,quoted_status__entities__symbols:chararray,retweeted_status__user__url:chararray,retweeted_status__quoted_status__user__profile_sidebar_fill_color:chararray,quoted_status__user__is_translator:chararray,retweeted_status__quoted_status__user__lang:chararray,user__profile_sidebar_fill_color:chararray,retweeted_status__quoted_status__source:chararray,entities__media:chararray,entities__user_mentions:chararray,retweeted_status__user__created_at:chararray,user__has_extended_profile:chararray,quoted_status__user__time_zone:chararray,is_quote_status:chararray,place__id:chararray,retweeted_status__quoted_status__user__created_at:chararray,user__location:chararray,retweeted_status__quoted_status__user__follow_request_sent:chararray,quoted_status__user__url:chararray,retweeted_status__user__geo_enabled:chararray,in_reply_to_user_id:int,retweeted_status__in_reply_to_user_id_str:chararray,retweeted_status__quoted_status__user__profile_banner_url:chararray,retweeted_status__quoted_status__entities__media:chararray,retweeted_status__quoted_status__user__profile_image_url:chararray
I found the answer: not all the JSON objects in the input file have the same schema, so I guess JsonLoader() is not able to load them according to the schema defined for it.
I used Elephant Bird instead, which made life easier.
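To check whether the input really mixes several shapes, one option is to count the distinct sets of top-level keys across the lines. This is a minimal plain-Python sketch, not Pig; it assumes one JSON object per line, and the function name key_set_counts and the sample lines are hypothetical.

```python
import json
from collections import Counter


def key_set_counts(lines):
    """Count how many lines share each distinct set of top-level keys."""
    counts = Counter()
    for line in lines:
        if line.strip():
            counts[frozenset(json.loads(line).keys())] += 1
    return counts


# Invented sample: the second object lacks the 'lang' field
lines = [
    '{"user__id_str": "12345", "lang": "en"}',
    '{"user__id_str": "12345"}',
    '{"lang": "en", "user__id_str": "123467"}',
]
counts = key_set_counts(lines)
for keys, n in counts.items():
    print(sorted(keys), n)
```

More than one distinct key set means the objects do not all follow a single schema.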

how to convert nested json file into csv in scala

I want to convert my nested JSON into CSV. I used:
df.write.format("com.databricks.spark.csv").option("header", "true").save("mydata.csv")
This works for normal JSON but not for nested JSON. Is there any way to convert my nested JSON to CSV? Help will be appreciated, thanks!
When you ask Spark to convert a JSON structure to a CSV, Spark can only map the first level of the JSON.
This happens because of the simplicity of CSV files: a CSV just assigns a value to a name. That is why {"name1":"value1", "name2":"value2"...} can be represented as a CSV with this structure:
name1,name2, ...
value1,value2,...
In your case, you are converting a JSON with several levels, so the Spark exception is saying that it cannot figure out how to convert such a complex structure into a CSV.
If you add only a second level to your JSON, it will work, but be careful: it will drop the names of the second level and include only the values, in an array.
You can have a look at this link, which includes an example for JSON datasets.
As I have no information about the nature of the data, I can't say much more about it. But if you need to write the information as a CSV, you will need to simplify the structure of your data.
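One way to simplify the structure is to flatten nested objects into dotted column names before writing the CSV. The following is a minimal plain-Python sketch of that idea, not Spark code; the flatten helper and the sample records are hypothetical, and it only handles nested objects, not arrays.

```python
import csv
import io


def flatten(obj, prefix=""):
    """Flatten nested dicts into one level, joining key names with dots."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Invented sample records with two levels of nesting
records = [
    {"name": "a", "address": {"city": "X", "geo": {"lat": 1.0}}},
    {"name": "b", "address": {"city": "Y", "geo": {"lat": 2.0}}},
]
rows = [flatten(r) for r in records]

# Write to an in-memory buffer; a real script would open a file instead
buffer = io.StringIO()
fieldnames = sorted({key for row in rows for key in row})
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Each nested value ends up in its own named column, so no second-level names are lost.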
Read the JSON file in Spark and create a DataFrame:
val path = "examples/src/main/resources/people.json"
val people = sqlContext.read.json(path)
Save the DataFrame using spark-csv:
people.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("newcars.csv")
Source :
read json
save to csv