Multilevel Pandas dataframe to nested JSON

I am trying to turn a data frame into some nested json and have been struggling a bit. Here is an example I created. The use case is to split a document for each guest at a hotel chain, with the guest at the top, the hotel details under the visit data, and the daily charges under the 'measurements' info.
The dataframe:
Here is an example of how I am trying to get the JSON to look:
I have tried creating a multilevel index and using to_json:
Is there a way to do this using to_json() or will I need to build some nested loops to create nested dictionaries? This is the best I have been able to get:

I would recommend a programming approach. pandas.DataFrame.groupby can be useful.
def hotel_data_to_json(df):
    return [
        person_data_to_json(person_df)
        for person_id, person_df
        in df.groupby('person_id')
    ]

def person_data_to_json(df):
    row = df.iloc[0]
    return {
        'person_id': row['person_id'],
        'personal_name': row['personal_name'],
        'family_name': row['family_name'],
        'visits': [
            visit_data_to_json(visit_df)
            for visit_id, visit_df
            in df.groupby('visit_id')
        ]
    }

def visit_data_to_json(df):
    row = df.iloc[0]
    # and so on
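For what it's worth, a minimal sketch of how that last level and the final serialization might look; 'hotel_name', 'visit_date', 'charge_date', and 'amount' are made-up column names, since the original dataframe isn't shown, so substitute whatever your dataframe actually has:

import json

def visit_data_to_json(df):
    row = df.iloc[0]
    return {
        'visit_id': row['visit_id'],
        'hotel_name': row['hotel_name'],        # assumed column name
        'visit_date': str(row['visit_date']),   # assumed column name
        'measurements': [
            {'charge_date': str(r['charge_date']),  # assumed column name
             'amount': r['amount']}                 # assumed column name
            for _, r in df.iterrows()
        ]
    }

# default=str turns anything json can't serialize natively (dates, numpy numbers) into strings
json_output = json.dumps(hotel_data_to_json(df), indent=2, default=str)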

Related

Parse complex Json string contained in Hadoop

I want to parse a string of complex JSON in Pig. Specifically, I want Pig to understand my JSON array as a bag instead of as a single chararray. I found that complex JSON can be parsed by using Twitter's Elephant Bird or Mozilla's Akela library. (I found some additional libraries, but I cannot use a 'Loader'-based approach since I use the HCatalog loader to load data from Hive.)
But the problem is the structure of my data: each value of the map structure contains part of a complex JSON value. For example,
1. My table looks like this (WARNING: the type of 'complex_data' is not STRING but MAP<STRING, STRING>!):
TABLE temp_table
(
    user_id      BIGINT               COMMENT 'user ID.',
    complex_data MAP<STRING, STRING>  COMMENT 'complex json data'
)
COMMENT 'temp data.'
PARTITIONED BY (created_date STRING)
STORED AS RCFILE;
2. And 'complex_data' contains the following (the values I want to get are marked with two *s, so essentially #'d'#'f' from each PARSED_STRING(complex_data#'c')):
{ "a": "[]",
"b": "\"sdf\"",
"**c**":"[{\"**d**\":{\"e\":\"sdfsdf\"
,\"**f**\":\"sdfs\"
,\"g\":\"qweqweqwe\"},
\"c\":[{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"}]
},
{\"**d**\":{\"e\":\"sdfsdf\"
,\"**f**\":\"sdfs\"
,\"g\":\"qweqweqwe\"},
\"c\":[{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"},
{\"d\":21321,\"e\":\"ewrwer\"}]
},]"
}
3. So, I tried... (same approach for Elephant Bird)
REGISTER '/path/to/akela-0.6-SNAPSHOT.jar';
DEFINE JsonTupleMap com.mozilla.pig.eval.json.JsonTupleMap();

data = LOAD 'temp_table' USING org.apache.hive.hcatalog.pig.HCatLoader();

values_of_map = FOREACH data GENERATE complex_data#'c' AS attr:chararray; -- IT WORKS
-- dump values_of_map shows correct chararray data for each row, e.g.
-- ([{"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
--   {"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
--   {"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... }])
-- ([{"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
--   {"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
--   {"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... }]) ...

attempt1 = FOREACH data GENERATE JsonTupleMap(complex_data#'c'); -- THIS LINE CAUSES AN ERROR
attempt2 = FOREACH data GENERATE JsonTupleMap(CONCAT(CONCAT('{\\"key\\":', complex_data#'c'), '}')); -- IT ALSO DOES NOT WORK
I guessed that "attempt1" failed because the value doesn't contain full JSON. However, when I CONCAT as in "attempt2", an additional \ mark is generated (so each line starts with {\"key\": ). I'm not sure whether these additional marks break the parsing rule or not. In any case, I want to parse the given JSON string so that Pig can understand it. If you have any method or solution, please feel free to let me know.
I finally solved my problem by using the jyson library with a Jython UDF.
I know that I could solve it using Java or other languages.
But I think that Jython with jyson is the simplest answer to this issue.
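Neither the jyson code nor the UDF itself is shown above, so here is only a rough sketch of what a Jython UDF using jyson could look like; the file name, output schema, and field access are assumptions based on the JSON structure in the question:

# json_udf.py -- hypothetical Jython UDF file; the jyson jar must also be on the classpath
from com.xhaus.jyson import JysonCodec as json

@outputSchema("values: bag{t: tuple(f_value: chararray)}")
def extract_d_f(json_string):
    # json_string is the chararray obtained from complex_data#'c'
    parsed = json.loads(json_string)
    # collect #'d'#'f' from each array element, following the structure in the question
    return [(item['d']['f'],) for item in parsed]

In the Pig script this would then be registered with something like REGISTER 'json_udf.py' USING jython AS judf; and applied as judf.extract_d_f(complex_data#'c') inside a FOREACH.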

Elixir - Creating JSON object from 2 collections

I'm using Postgrex in Elixir, and when it returns query results, it returns them in the following struct format:
%{columns: ["id", "email", "name"], command: :select, num_rows: 2, rows: [{1, "me#me.com", "Bobbly Long"}, {6, "email#tts.me", "Woll Smoth"}]}
It should be noted I am using Postgrex directly WITHOUT Ecto.
The columns (table headers) are returned as a collection, but the results (rows) are returned as a list of tuples (which seems odd, as they could get very large).
I'm trying to find the best way to programmatically create JSON objects for each result in which the JSON key is the column title and the JSON value the corresponding value from the tuple.
I've tried creating maps from both, merging and then serialising to JSON objects but it seems there should be an easier/better way of doing this.
Has anyone dealt with this before? What is the best way of creating a JSON object from a separate collection and tuple?
Something like this should work:
result = Postgrex.query!(...)

Enum.map(result.rows, fn row ->
  Enum.zip(result.columns, Tuple.to_list(row))
  |> Enum.into(%{})
  |> JSON.encode
end)
This will result in a list of JSON objects where each row in the result set is a JSON object.

Parsing Google Custom Search API for Elasticsearch Documents

After retrieving results from the Google Custom Search API and writing them to JSON, I want to parse that JSON to make valid Elasticsearch documents. You can configure a parent-child relationship for nested results; however, this relationship does not seem to be inferred from the data structure itself. I've tried loading it automatically, but got no results.
Below is some example input that doesn't include things like id or index. I'm trying to focus on creating the correct data structure. I've tried modifying graph algorithms like depth-first search but am running into problems with the different data structures.
Here's some example input:
# mock data structure
google = {"content": "foo",
          "results": {"result_one": {"persona": "phone",
                                     "personb": "phone",
                                     "personc": "phone"},
                      "result_two": ["thing1",
                                     "thing2",
                                     "thing3"],
                      "result_three": "none"},
          "query": ["Taylor Swift", "Bob Dole", "Rocketman"]}
# correctly formatted documents for _source of elasticsearch entry
correct_documents = [
    {"content": "foo"},
    {"results": ["result_one", "result_two", "result_three"]},
    {"result_one": ["persona", "personb", "personc"]},
    {"persona": "phone"},
    {"personb": "phone"},
    {"personc": "phone"},
    {"result_two": ["thing1", "thing2", "thing3"]},
    {"result_three": "none"},
    {"query": ["Taylor Swift", "Bob Dole", "Rocketman"]}
]
Here is my current approach; it is still a work in progress:
def recursive_dfs(graph, start, path=[]):
    '''recursive depth first search from start'''
    path = path + [start]
    for node in graph[start]:
        if node not in path:
            path = recursive_dfs(graph, node, path)
    return path

def branching(google):
    """Get branches as a starting point for dfs"""
    branch = 0
    keys = list(google.keys())  # keys() is not subscriptable in Python 3
    while branch < len(google):
        if isinstance(google[keys[branch]], dict):
            # recursive_dfs(google, google[keys[branch]])
            pass
        else:
            print("branch {}: result {}\n".format(branch, google[keys[branch]]))
        branch += 1

branching(google)
You can see that recursive_dfs() still needs to be modified to handle string and list data structures.
I'll keep going at this but if you have thoughts, suggestions, or solutions then I would very much appreciate it. Thanks for your time.
Here is a possible answer to your problem.
def myfunk(inHole, outHole):
    for keys in inHole.keys():
        is_list = isinstance(inHole[keys], list)
        is_dict = isinstance(inHole[keys], dict)
        if is_list:
            element = inHole[keys]
            new_element = {keys: element}
            outHole.append(new_element)
        if is_dict:
            element = list(inHole[keys].keys())  # list of child keys, not a dict_keys view
            new_element = {keys: element}
            outHole.append(new_element)
            myfunk(inHole[keys], outHole)
        if not (is_list or is_dict):
            new_element = {keys: inHole[keys]}
            outHole.append(new_element)
    return outHole  # list.sort() would return None, so return the accumulated list itself
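For reference, a quick usage sketch against the mock structure from the question (the empty list is just an accumulator that the function fills in place):

documents = []
myfunk(google, documents)
# documents now holds one flat {key: value} dict per node,
# e.g. {"content": "foo"}, {"results": [...]}, {"persona": "phone"}, ...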

Edit Web2py json output for Google Chart

I have managed to get the Google Charts plugin (http://www.web2pyslices.com/slice/show/1721/google-charts-plugin) to work with my web2py application, using the JSON example data (data hard-coded into default.py).
I am struggling with using my own data. The chart does not work with my JSON data, which returns the information as:
{"data": [["2014-03-28", 1000], ["2014-03-25", 1100]]}
When I hardcode the data with the titles, the chart works:
data = [['Date','Sales'],["2014-03-28",1000],["2014-03-25",1100]]
This returns JSON as:
{"data": [["Date", "Sales"], ["2014-03-28", 1000], ["2014-03-25", 1100]]}
The code for this is:
def return_data():
    data = [['Date','Sales'],["2014-03-28",1000],["2014-03-25",1100]]
    return dict(data=data)
Below is the code I am using in default.py to return the information from the database. The query works; it's the chart that doesn't:
def return_data():
    sales = db().select(db.sales.quantity, db.sales.date)
    data = [[row.date, row.quantity] for row in sales]
    return dict(data=data)
Somehow, I think I need to add the 'date' and 'sales' labels to the start of the JSON data, but I have not managed to do this. I think I need to do some sort of encoding? Do I need to use simplejson, or can this be done without it?
Many thanks
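One way to do what the last paragraph describes, sketched under the assumption that the chart only needs a header row and string dates (field names are taken from the question; this is not tested against the plugin):

def return_data():
    sales = db().select(db.sales.quantity, db.sales.date)
    data = [['Date', 'Sales']]  # header row the chart expects
    # str() makes the dates JSON-serializable as "YYYY-MM-DD"
    data += [[str(row.date), row.quantity] for row in sales]
    return dict(data=data)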

How to fetch a JSON file to get a row position from a given value or argument

I'm using wget to fetch several dozen JSON files on a daily basis that go like this:
{
    "results": [
        {
            "id": "ABC789",
            "title": "Apple"
        },
        {
            "id": "XYZ123",
            "title": "Orange"
        }
    ]
}
My goal is to find a row's position in each JSON file given a value or set of values (i.e. "In which row is XYZ123 located?"). In the previous example, ABC789 is in row 1, XYZ123 in row 2, and so on.
As of now I use Google Refine to "quickly" visualize (using the Text Filter option) where XYZ123 stands (row 2).
But since it takes a while to do this manually for each file, I was wondering if there is a quick and efficient way to do it in one go.
What can I do, and how can I fetch the files and make the request? Thanks in advance! FoF0
In Python:
import json

# assume json_string = your loaded data
data = json.loads(json_string)

mapped_vals = []
for ent in data['results']:
    mapped_vals.append(ent['id'])
The order of items in the list will be indexed according to the json data, since the list is a sequenced collection.
In PHP:
$data = json_decode($json_string);
$output = array();
foreach($data->results as $values){
    $output[] = $values->id;
}
Again, the ordered nature of PHP arrays ensures that the output will be ordered as-is with regard to indexes.
Either example could be modified to use a mapped dictionary (Python) or an associative array (PHP) if needs demand.
You could adapt these to functions that take the id value as an argument, track how far they are into the array, and when found, break out and return the current index.
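A small sketch of that adaptation in Python (find_row and the choice to return None for a missing id are illustrative, not from the original answer):

def find_row(data, wanted_id):
    # Walk the results in order, counting rows from 1
    for position, ent in enumerate(data['results'], start=1):
        if ent['id'] == wanted_id:
            return position
    return None  # not found

# e.g. find_row(json.loads(json_string), 'XYZ123') would return 2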
Wow. I posted the original question 10 months ago, when I knew nothing about Python or computer programming whatsoever!
Answer
But I learned basic Python last December and came up with a solution that not only gets the rank order but also inserts the results into a MySQL database:
import urllib.request
import json

# Make connection and get the content
response = urllib.request.urlopen("http://whatever.com/search?=ids=1212,125,54,454")
content = response.read()

# Decode JSON search results to type dict
json_search = json.loads(content.decode("utf8"))

# Get 'results' key-value pairs into a list
search_data_all = []
for i in json_search['results']:
    search_data_all.append(i)

# Prepare MySQL list with ranking order for each id item
ranks_list_to_mysql = []
for i in range(len(search_data_all)):
    d = {}
    d['id'] = search_data_all[i]['id']
    d['rank'] = i + 1
    ranks_list_to_mysql.append(d)
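The post stops before the database step; purely as an illustration, a sketch of how the insert might look, assuming the mysql-connector-python driver and a hypothetical ranks table with id and rank columns (the author's actual driver, schema, and credentials aren't shown):

import mysql.connector  # assumed driver; the original post does not name one

# Placeholder connection details
conn = mysql.connector.connect(host="localhost", user="user",
                               password="password", database="mydb")
cursor = conn.cursor()

# Hypothetical table `ranks` with columns `id` and `rank`
cursor.executemany(
    "INSERT INTO ranks (id, `rank`) VALUES (%(id)s, %(rank)s)",
    ranks_list_to_mysql
)
conn.commit()
conn.close()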