How do I query nested JSON after loading it with Elephant Bird?

I'm pretty new to Hadoop and Pig.
I have single-line JSON files that all share the same schema:
{"name":"someName",
 "pkg":[{"F1":"abc","F2":"44","F3":"xyz","F4":1024,
         "info":[{"timestamp":1372631550000,"value":"122","id":"nnn","name":"ppp"},
                 {"timestamp":1372649240000,"value":"222","id":"ggg","name":"qqq"}]},
        {"F1":"abc","f2":"44","F3":"xyz","F4":1024,
         "new":[{"type":"event1","time":1372537000000,"more":"{\"bbad\":\"HELLO\",\"is_done\":0,\"ssss\":-128}"}]}]}
I load all of the JSON files using Elephant Bird:
data = LOAD 'browsers/gzip' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as (json:map[]);
So far the only thing that works for me is querying the "name" field, which returns a bytearray:
b = foreach data generate json#'name' as name;
I then tried to convert it to a map instead:
c = FOREACH data GENERATE json#'name' as (m:map[]);
DESCRIBE c;
and get
c: {tuple_0: (m:map[])}
and the data looks like:
({([F1#"abc",F2#44...])})
Now I need to filter all the records that have pkg.F1 = "abc", or all the ones that have pkg.info.value = 122, and so on.
How do I do that?
A code example would be very helpful, as I have already googled this a lot.
Thanks

Try this
c = FOREACH data GENERATE flatten(json#'name') as (m:map[]);

The problem is that you don't know how your data is organized in Pig. Use
DESCRIBE data;
to find out what structure JsonLoader returns; that should give you enough information to extract your data. With '-nestedLoad', nested JSON objects come back as maps and nested arrays as bags, so you can FLATTEN json#'pkg' and keep dereferencing the resulting maps with # to reach fields such as F1.

Related

Issue connecting to data in Databricks from a data lake and reading JSON into Folium

I'm working on something based on this blog post:
https://python-visualization.github.io/folium/quickstart.html#Getting-Started
specifically part 13, using Choropleth maps.
The piece of code they use is the following:
import folium
import pandas as pd

url = (
    "https://raw.githubusercontent.com/python-visualization/folium/master/examples/data"
)
state_geo = f"{url}/us-states.json"
state_unemployment = f"{url}/US_Unemployment_Oct2012.csv"
state_data = pd.read_csv(state_unemployment)

m = folium.Map(location=[48, -102], zoom_start=3)
folium.Choropleth(
    geo_data=state_geo,
    name="choropleth",
    data=state_data,
    columns=["State", "Unemployment"],
    key_on="feature.id",
    fill_color="YlGn",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="Unemployment Rate (%)",
).add_to(m)
folium.LayerControl().add_to(m)
m
If I use this, I get the requested map.
Now I am trying to do the same with my own data. I work in Databricks,
so I have a JSON file with the GeoJSON data (source_file1) and a CSV file (source_file2) with the data that needs to be plotted on the map.
source_file1 = "dbfs:/mnt/sandbox/MAARTEN/TOPO/Belgie_GEOJSON.JSON"
state_geo = spark.read.json(source_file1,multiLine=True)
source_file2 = "dbfs:/mnt/sandbox/MAARTEN/TOPO/DATASVZ.csv"
df_2 = spark.read.format("CSV").option("inferSchema", "true").option("header", "true").option("delimiter",";").load(source_file2)
state_data = df_2.toPandas()
When I adjust the code as below:
m = folium.Map(location=[48, -102], zoom_start=3)
folium.Choropleth(
    geo_data=state_geo,
    name="choropleth",
    data=state_data,
    columns=["State", "Unemployment"],
    key_on="feature.properties.name_nl",
    fill_color="YlGn",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="% Marktaandeel CC",
).add_to(m)
folium.LayerControl().add_to(m)
m
So when I pass the geo_data parameter as a Spark DataFrame, I get the following error:
ValueError: Cannot render objects with any missing geometries: DataFrame[features: array<struct<geometry:struct<coordinates:array<array<array<string>>>,type:string>,properties:struct<arr_fr:string,arr_nis:bigint,arr_nl:string,fill:string,fill-opacity:double,name_fr:string,name_nl:string,nis:bigint,population:bigint,prov_fr:string,prov_nis:bigint,prov_nl:string,reg_fr:string,reg_nis:string,reg_nl:string,stroke:string,stroke-opacity:bigint,stroke-width:bigint>,type:string>>, type: string]
I think that when the data is transformed from the "blob format" in the Azure data lake into a Spark DataFrame, something goes wrong with the format. I tested this in a Jupyter notebook on my desktop, reading the data straight from the file into folium, and it all works.
If I load it directly from the source, the way the example does with its web page, i.e. I point the 'geo_data' parameter of the folium function straight at the file:
m = folium.Map(location=[48, -102], zoom_start=3)
folium.Choropleth(
    geo_data=source_file1,  # this points directly at the data lake file
    name="choropleth",
    data=state_data,
    columns=["State", "Unemployment"],
    key_on="feature.properties.name_nl",
    fill_color="YlGn",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="% Marktaandeel CC",
).add_to(m)
folium.LayerControl().add_to(m)
m
I get an error again: folium expects a local file path, and it cannot open a path prefixed with "dbfs:".
So I started wondering what the difference is between my JSON file and the one from the blog post. The only thing I can imagine is that the Azure data lake doesn't store my JSON as JSON but as a block blob file, and that for some reason I am not converting it properly so that folium can read it.
[Screenshot: Azure blob storage (data lake)]
So can someone with folium knowledge tell me:
A. Is it not possible to load the geo_data directly from a data lake?
B. In what format do I need to upload the data?
Any thoughts on this would be helpful!
Thanks in advance!
Solved this issue: I just had to replace "dbfs:" with "/dbfs". I had tried that a lot of times before, but had used "/dbfs:" and got another error.
Can't believe I was this stupid :-)
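For anyone hitting the same thing, a minimal sketch of the working call, reusing the paths and variables from the question above: folium only reads local files, and in Databricks the DBFS mount is also exposed on the driver's local filesystem under /dbfs, so the "dbfs:/..." URI simply becomes a "/dbfs/..." path.

import folium

source_file1 = "/dbfs/mnt/sandbox/MAARTEN/TOPO/Belgie_GEOJSON.JSON"  # was "dbfs:/mnt/..."

m = folium.Map(location=[48, -102], zoom_start=3)
folium.Choropleth(
    geo_data=source_file1,   # plain local file path; no Spark DataFrame needed
    name="choropleth",
    data=state_data,         # pandas DataFrame from df_2.toPandas() above
    columns=["State", "Unemployment"],
    key_on="feature.properties.name_nl",
    fill_color="YlGn",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="% Marktaandeel CC",
).add_to(m)
folium.LayerControl().add_to(m)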

Need help separating unorganized JSON/JSON arrays

OK, so I don't fully understand this or how to do it at all, but I am looking either for something that shows me how to split this portion of MySQL JSON into separate rows, or for a way to export it as either CSV or JSON and then split one portion off from the others.
Example of the JSON:
[{"id":2, "identifier":"IDENTIFIER:", "license":"LICENSE:", "firstname":"FIRSTNAME", "lastname":"LASTNAME", "accounts":"{"money":9595,"bank":9595}"},
{"id":2, "identifier":"IDENTIFIER", "license":"LICENSE", "firstname":"FIRSTNAME", "lastname":"LASTNAME", "accounts":"{"black_money":9595,"bank":9595,"money":9595}"}]
I want to be able to separate the three things in the JSON array called accounts. This is all held in a MySQL DB, and I want to be able to run something and get an exportable table that can be imported into Google Sheets or something of that sort, so I can sort the data if need be.
I expect that your JSON is probably like
[{"id":2, "identifier":"IDENTIFIER:", "license":"LICENSE:", "firstname":"FIRSTNAME", "lastname":"LASTNAME", "accounts":"{\"money\":9595,\"bank\":9595}"}, {"id":2, "identifier":"IDENTIFIER", "license":"LICENSE", "firstname":"FIRSTNAME", "lastname":"LASTNAME", "accounts":"{\"black_money\":9595,\"bank\":9595,\"money\":9595}"}]
but the backslashes disappeared when copying/pasting. Your JSON is special in that it embeds other JSON as strings. Try:
function myFunction() {
  // Read the raw JSON from cell A1 of the active sheet.
  var json = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet().getRange('A1').getValue()
  var data = JSON.parse(json)
  var result = []
  result.push(['account.black_money','account.bank','account.money'])
  data.forEach(function(elem){
    // 'accounts' is itself a JSON string, so it needs a second parse.
    var account = JSON.parse(elem.accounts)
    result.push([account.black_money,account.bank,account.money])
  })
  return result
}
Take a copy of: https://docs.google.com/spreadsheets/d/1NwSUF7hRNjcLRbr2HP_mjj-fPfbKJz7BNLYFr184P4o/copy
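If you would rather do the split outside of Google Sheets, here is a small Python sketch along the same lines (the file names are hypothetical, and it assumes the exported JSON has the backslashes intact so it parses cleanly); it flattens the embedded accounts object into its own CSV columns:

import csv
import json

# rows.json: the exported JSON array; each "accounts" value is itself a JSON string.
with open("rows.json") as f:
    rows = json.load(f)

with open("rows.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "identifier", "firstname", "lastname",
                     "black_money", "bank", "money"])
    for row in rows:
        accounts = json.loads(row["accounts"])  # second parse for the embedded JSON
        writer.writerow([row["id"], row["identifier"], row["firstname"], row["lastname"],
                         accounts.get("black_money", 0), accounts.get("bank", 0),
                         accounts.get("money", 0)])

The resulting rows.csv can then be imported straight into Google Sheets.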

Python 3: Replacing special characters in a .csv file after converting it from JSON

I am trying to develop a program using Python 3.6.4 which converts a JSON file into a CSV file, and I also need to clean the data in the CSV file. For example:
My JSON File:
{emp:[{"Name":"Bo#b","email":"bob#gmail.com","Des":"Unknown"},
{"Name":"Martin","email":"mar#tin#gmail.com","Des":"D#eveloper"}]}
Problem 1:
After converting it into CSV, a blank row is created between every two rows, like this:
**Name email Des**
[<BLANK ROW>]
Bo#b bob#gmail.com Unknown
[<BLANK ROW>]
Martin mar#tin#gmail.com D#eveloper
Problem 2:
In my code I am using emp but I need to use it dynamically.
fobj = open("D:/Users/shamiks/PycharmProjects/jsonSamle.txt")
jsonCont = fobj.read()
print(jsonCont)
fobj.close()
employee_parsed = json.loads(jsonCont)
emp_data = employee_parsed['employee']
We will not know the structure or content of the incoming JSON file in advance.
Problem 3:
I also need to remove all # characters from the CSV file.
For Problem 3, you can use .replace (https://www.tutorialspoint.com/python/string_replace.htm).
For Problem 2, you can take the dictionary's keys and then get the first item out of them.
fobj = open("D:/Users/shamiks/PycharmProjects/jsonSamle.txt")
jsonCont = fobj.read().replace("#", "")  # strip the unwanted '#' characters while reading
print(jsonCont)
fobj.close()
employee_parsed = json.loads(jsonCont)
first_key = list(employee_parsed.keys())[0]  # in Python 3, keys() is not subscriptable
emp_data = employee_parsed[first_key]
I can't solve Problem 1 without more code showing how you are exporting the result. It may be that your data has newlines in it, in which case you could add .replace("\n", "") and/or .replace("\r", "") after the previous replace, so the line would read fobj.read().replace("#", "").replace("\n", "").replace("\r", "").
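For completeness, a small end-to-end sketch (hypothetical file names, and it assumes the input is valid JSON) that touches all three problems: it strips the # characters, picks the top-level key dynamically instead of hard-coding 'emp', and opens the output with newline='', which is the usual cause of blank rows when csv.writer is used on Windows:

import csv
import json

# Read the JSON and drop the unwanted '#' characters (Problem 3).
with open("jsonSample.txt") as f:
    parsed = json.loads(f.read().replace("#", ""))

# Pick the top-level key dynamically instead of hard-coding it (Problem 2).
first_key = next(iter(parsed))
records = parsed[first_key]

# newline='' stops csv.writer from emitting an extra blank row on Windows (Problem 1).
with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)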

Convert JSON to array using Perl

I have a chunk of JSON in the following format:
{"page":{"size":7,"number":1,"totalPages":1,"totalElements":7,"resultSetId":null,"duration":0},"content":[{"id":"787edc99-e94f-4132-b596-d04fc56596f9","name":"Verification","attributes":{"ruleExecutionClass":"VerificationRule"},"userTags":[],"links":[{"rel":"self","href":"/endpoint/787edc99-e94f-4132-b596-d04fc56596f9","id":"787edc99-e94f-...
Basically, the size attribute (in this case) tells me that there are 7 parts to the content section. How do I convert this chunk of JSON to an array in Perl, and can I do it using the size attribute? Or is there a simpler way, like just using decode_json()?
Here is what I have so far:
my $resources = get_that_json_chunk(); # function returns exactly the json you see, except all 7 resources in the content section
my @decoded_json = @$resources;
foreach my $resource (@decoded_json) {
I've also tried something like this:
my $deserialize = from_json( $resources );
my @decoded_json = (@{$deserialize});
I want to iterate over the array and handle the data. I've tried a few different ways because I read a little about array refs, but I keep getting "Not an ARRAY reference" errors and "Can't use string ("{"page":{"size":7,"number":1,"to"...) as an ARRAY ref while "strict refs" in use"
Thank you to Matt Jacob:
my $deserialized = decode_json($resources);
print "$_->{id}\n" for @{$deserialized->{content}};

Parse complex JSON string contained in Hadoop

I want to parse a string of complex JSON in Pig. Specifically, I want Pig to understand my JSON array as a bag instead of as a single chararray. I found that complex JSON can be parsed using Twitter's Elephant Bird or Mozilla's Akela library. (I found some additional libraries, but I cannot use a 'Loader'-based approach since I use the HCatalog Loader to load data from Hive.)
But the problem is the structure of my data: each value of the map structure contains the value part of a complex JSON document. For example,
1. My table looks like this (warning: the type of 'complex_data' is not STRING but MAP<STRING, STRING>!):
CREATE TABLE temp_table
(
user_id BIGINT COMMENT 'user ID.',
complex_data MAP <STRING, STRING> COMMENT 'complex json data'
)
COMMENT 'temp data.'
PARTITIONED BY(created_date STRING)
STORED AS RCFILE;
2. And 'complex_data' contains the following (the values I want to get are marked with two *s; basically I want #'d'#'f' from each element of the parsed complex_data#'c'):
{ "a": "[]",
"b": "\"sdf\"",
"**c**":"[{\"**d**\":{\"e\":\"sdfsdf\"
,\"**f**\":\"sdfs\"
,\"g\":\"qweqweqwe\"},
\"c\":[{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"}]
},
{\"**d**\":{\"e\":\"sdfsdf\"
,\"**f**\":\"sdfs\"
,\"g\":\"qweqweqwe\"},
\"c\":[{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"}]
},]"
}
3. So, I tried... (same approach for Elephant Bird)
REGISTER '/path/to/akela-0.6-SNAPSHOT.jar';
DEFINE JsonTupleMap com.mozilla.pig.eval.json.JsonTupleMap();
data = LOAD 'temp_table' USING org.apache.hive.hcatalog.pig.HCatLoader();
values_of_map = FOREACH data GENERATE complex_data#'c' AS attr:chararray; -- IT WORKS
-- dump values_of_map shows correct chararray data per each row
-- eg) ([{"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
--       {"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
--       {"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... }])
--      ([{"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
--       {"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
--       {"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... }]) ...
attempt1 = FOREACH data GENERATE JsonTupleMap(complex_data#'c'); -- THIS LINE CAUSE AN ERROR
attempt2 = FOREACH data GENERATE JsonTupleMap(CONCAT(CONCAT('{\\"key\\":', complex_data#'c'), '}')); -- IT ALSO DOES NOT WORK
I guessed that "attempt1" failed because the value doesn't contain a full JSON document. However, when I CONCAT as in "attempt2", additional \ marks are generated (so each line starts with {\"key\":). I'm not sure whether these additional marks break the parsing rules or not. In any case, I want to parse the given JSON string so that Pig can understand it. If you have any method or solution, please feel free to let me know.
I finally solved my problem by using the jyson library with a Jython UDF.
I know that I could solve it using Java or other languages,
but I think that Jython with jyson is the simplest answer to this issue.
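The answer doesn't show the UDF itself, so here is a rough sketch of what such a Jython UDF might look like (the file name, the output schema, and the assumption that jyson's JysonCodec is available on Pig's classpath are mine, not from the original post):

# json_udf.py -- Jython UDF; requires the jyson jar to be REGISTERed in the Pig script
from com.xhaus.jyson import JysonCodec as json
from pig_util import outputSchema

@outputSchema("values:bag{t:tuple(f:chararray)}")
def extract_f(json_string):
    # json_string is the chararray value of complex_data#'c';
    # return the d -> f value of every element as a bag of tuples.
    if not json_string:
        return None
    items = json.loads(json_string)
    return [(item["d"]["f"],) for item in items]

The script would then be registered in Pig with something like REGISTER 'json_udf.py' USING jython AS json_udf; (after REGISTERing the jyson jar) and called as json_udf.extract_f(complex_data#'c') inside a FOREACH ... GENERATE.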