I am trying to convert the results of SciPy hierarchical clustering into JSON for display in d3.js. Here is an example.
The following code produces a dendrogram with 6 branches.
import numpy as np
import pandas as pd
import scipy.spatial
import scipy.cluster
d = {'employee': ['A', 'B', 'C', 'D', 'E', 'F'],
     'skillX': [2, 8, 3, 6, 8, 10],
     'skillY': [8, 15, 6, 9, 7, 10]}
d1 = pd.DataFrame(d)
distMat = xPairWiseDist = scipy.spatial.distance.pdist(np.array(d1[['skillX', 'skillY']]), 'euclidean')
clusters = scipy.cluster.hierarchy.linkage(distMat, method='single')
dendo = scipy.cluster.hierarchy.dendrogram(clusters, labels=list(d1.employee), orientation='right')
dendo
My question:
How can I represent the data in a JSON file in a format that d3.js understands?
{'name': 'Root1',
 'children': [{'name': 'B'},
              {'name': 'E-D-F-C-A',
               'children': [{'name': 'C-A',
                             'children': [{'name': 'A'},
                                          {'name': 'C'}]
                            }
                           ]
              }
             ]
}
The embarrassing truth is that I do not know whether I can extract this information from the dendrogram or from the linkage matrix, or how.
I am thankful for any help I can get.
EDIT TO CLARIFY
So far, I have tried to use the to_tree method but have difficulty understanding its structure (yes, I read the documentation).
a = scipy.cluster.hierarchy.to_tree(clusters, rd=True)
for x in a[1]:
    #print x.get_id()
    if x.is_leaf() != True:
        print x.get_left().get_id(), x.get_right().get_id(), x.get_count()
You can do this in three steps:
Recursively construct a nested dictionary that represents the tree returned by Scipy's to_tree method.
Iterate through the nested dictionary to label each internal node with the leaves in its subtree.
Dump the resulting nested dictionary to JSON and load it into D3.
Construct a nested dictionary representing the dendrogram
For the first step, it is important to call to_tree with rd=False so that the root of the dendrogram is returned. From that root, you can construct the nested dictionary as follows:
# Create a nested dictionary from the ClusterNodes returned by SciPy
def add_node(node, parent):
    # First create the new node and append it to its parent's children
    newNode = dict(node_id=node.id, children=[])
    parent["children"].append(newNode)
    # Recursively add the current node's children
    if node.left: add_node(node.left, newNode)
    if node.right: add_node(node.right, newNode)
T = scipy.cluster.hierarchy.to_tree( clusters , rd=False )
d3Dendro = dict(children=[], name="Root1")
add_node( T, d3Dendro )
# Output: => {'name': 'Root1', 'children': [{'node_id': 10, 'children': [{'node_id': 1, 'children': []}, {'node_id': 9, 'children': [{'node_id': 6, 'children': [{'node_id': 0, 'children': []}, {'node_id': 2, 'children': []}]}, {'node_id': 8, 'children': [{'node_id': 5, 'children': []}, {'node_id': 7, 'children': [{'node_id': 3, 'children': []}, {'node_id': 4, 'children': []}]}]}]}]}]}
The basic idea is to start with a node that is not in the dendrogram and that will serve as the root of the whole dendrogram. Then we recursively add left and right children to this dictionary until we reach the leaves. At this point, we do not have labels for the nodes, so I'm just labeling nodes by their ClusterNode ID.
Label the dendrogram
Next, we need to use the node_ids to label the dendrogram. The comments should be enough explanation for how this works.
from functools import reduce  # built in on Python 2, must be imported on Python 3

# Map leaf ids to labels (leaf ids are the row indices of the original observations)
id2name = dict(enumerate(d1.employee))

# Label each node with the names of each leaf in its subtree
def label_tree(n):
    # If the node is a leaf, then we have its name
    if len(n["children"]) == 0:
        leafNames = [id2name[n["node_id"]]]
    # If not, flatten all the leaves in the node's subtree
    else:
        leafNames = reduce(lambda ls, c: ls + label_tree(c), n["children"], [])
    # Delete the node id since we don't need it anymore and
    # it makes for cleaner JSON
    del n["node_id"]
    # Labeling convention: "-"-separated leaf names
    n["name"] = "-".join(sorted(map(str, leafNames)))
    return leafNames
label_tree(d3Dendro["children"][0])
Dump to JSON and load into D3
Finally, after the dendrogram has been labeled, we just need to output it to JSON and load into D3. I'm just pasting the Python code to dump it to JSON here for completeness.
# Output to JSON
import json
with open("d3-dendrogram.json", "w") as f:
    json.dump(d3Dendro, f, sort_keys=True, indent=4)
Output
I created Scipy and D3 versions of the dendrogram below. For the D3 version, I simply plugged the JSON file I output ('d3-dendrogram.json') into this Gist.
SciPy dendrogram
D3 dendrogram
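The question also asks whether the same structure can be read straight from the linkage matrix. The following is a minimal sketch of that route using only the standard library. It relies on SciPy's documented linkage convention that row i merges clusters Z[i][0] and Z[i][1] into a new cluster with id n + i. The function name linkage_to_d3 and the toy matrix are assumptions made for illustration; the matrix is written by hand and is not the one computed from the employee data.

```python
import json

def linkage_to_d3(Z, labels):
    """Build a d3-style nested dict from a SciPy-style linkage matrix.

    Row i of Z merges clusters Z[i][0] and Z[i][1] into a new cluster
    with id n + i, where n is the number of original observations.
    """
    n = len(labels)
    # Leaves first: cluster id -> (node dict, list of leaf names)
    nodes = {i: ({"name": str(labels[i])}, [str(labels[i])]) for i in range(n)}
    for i, row in enumerate(Z):
        left, left_leaves = nodes.pop(int(row[0]))
        right, right_leaves = nodes.pop(int(row[1]))
        leaves = sorted(left_leaves + right_leaves)
        # Internal nodes are named after the leaves in their subtree
        nodes[n + i] = ({"name": "-".join(leaves), "children": [left, right]}, leaves)
    # After all merges exactly one cluster remains: the root
    root, _ = next(iter(nodes.values()))
    return root

# Hand-written toy linkage matrix (hypothetical, not computed from the data above):
# A and B merge first, then C merges with that pair.
Z = [[0, 1, 1.0, 2],
     [2, 3, 2.0, 3]]
tree = linkage_to_d3(Z, ["A", "B", "C"])
print(json.dumps(tree, indent=2))
```

With a real linkage result, passing clusters and list(d1.employee) should work the same way, since only the first two columns of each row are read.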
Related
I was wondering if there is a way to remove or replace null values and empty square brackets in JSON or in a pandas DataFrame. Replacing them after converting to strings via .astype(str) works, but it turns every JSON value into a string, so I can no longer process the data with the same structure. I would appreciate any solution or recommendation. Thanks.
With the following toy dataframe:
import pandas as pd
df = pd.DataFrame({"col1": ["a", [1, 2, 3], [], "d"], "col2": ["e", [], "f", "g"]})
print(df)
# Output
Here is one way to do it:
df = df.applymap(lambda x: pd.NA if isinstance(x, list) and not x else x)
print(df)
# Output
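If the data is still a parsed JSON structure rather than a DataFrame, the same idea can be sketched with the standard library alone. The helper name drop_empty_lists and its fill parameter are hypothetical; the sample JSON simply mirrors the toy dataframe's columns.

```python
import json

def drop_empty_lists(value, fill=None):
    """Recursively replace every empty list with `fill`."""
    if isinstance(value, list):
        if not value:
            return fill
        return [drop_empty_lists(v, fill) for v in value]
    if isinstance(value, dict):
        return {k: drop_empty_lists(v, fill) for k, v in value.items()}
    # Scalars (str, int, float, bool, None) pass through unchanged
    return value

raw = '{"col1": ["a", [1, 2, 3], [], "d"], "col2": ["e", [], "f", "g"]}'
cleaned = drop_empty_lists(json.loads(raw))
print(json.dumps(cleaned))
```

Because the values are never cast to strings, non-empty lists and numbers keep their original types.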
I have a json file that looks like this:
test= {'kpiData': [{'date': '2020-06-03 10:05',
'a': 'MINIMUMINTERVAL',
'b': 0.0,
'c': True},
{'date': '2020-06-03 10:10',
'a': 'MINIMUMINTERVAL',
'b': 0.0,
'c': True},
{'date': '2020-06-03 10:15',
'a': 'MINIMUMINTERVAL',
'b': 0.0,
'c': True},
{'date': '2020-06-03 10:20',
'a': 'MINIMUMINTERVAL',
'b': 0.0,}
]}
I want to transfer it to a dataframe object, like this:
rdd = sc.parallelize([test])
jsonDF = spark.read.json(rdd)
This results in a corrupted record. From my understanding, the reason is that Python's True and False are not valid JSON values, so I need to transform these entries before calling spark.read.json() (to TRUE, true or "True"). test is a dict and rdd is a pyspark.rdd.RDD object. For a dataframe object the transformation is pretty straightforward, but I didn't find a solution for these objects.
spark.read.json expects an RDD of JSON strings, not an RDD of Python dictionaries. If you convert the dictionary to a JSON string, you should be able to read that into a dataframe:
import json
df = spark.read.json(sc.parallelize([json.dumps(test)]))
Another possible way is to read in the dictionary using spark.createDataFrame:
df = spark.createDataFrame([test])
which will give a different schema with maps instead of structs.
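To see why the dictionary has to go through json.dumps first, here is a small sketch that does not use Spark at all. It only contrasts the Python text form of the dictionary with its JSON form, using a shortened copy of the test data.

```python
import json

test = {"kpiData": [{"date": "2020-06-03 10:05",
                     "a": "MINIMUMINTERVAL",
                     "b": 0.0,
                     "c": True}]}

as_python = str(test)        # Python repr: single quotes and True
as_json = json.dumps(test)   # JSON text: double quotes and true

print(as_python)
print(as_json)

# Only the second string is valid JSON
json.loads(as_json)
try:
    json.loads(as_python)
except json.JSONDecodeError as err:
    print("not valid JSON:", err)
```

json.dumps handles the True to true conversion itself, so no manual replacement of the boolean entries is needed.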
Input:
input_list=['1.exe','2.exe','3.exe','4.exe']
Output format:
out_dict=[{'name':'1.exe',
           'children':[{'name':'2.exe',
                        'children':[{'name':'3.exe',
                                     'children':[{'name':'4.exe'}]}]}]}]
The input is a list as shown above, and we have to obtain the output in the format shown in the lines above.
I tried using nested for loops but it isn't working. How can we implement JSON in this?
input_list=['1.exe','2.exe','3.exe','4.exe']
def split(data):
    try:
        first_value = data[0]
        data = [{'name': first_value, 'children': split(data[1:])} if split(data[1:]) != [] else {'name': first_value}]
        return data
    except:
        return data
print (split(input_list))
output:
[{'name': '1.exe', 'children':
[{'name': '2.exe', 'children':
[{'name': '3.exe', 'children':
[{'name': '4.exe'}]}]}]}]
Code that is a little easier to understand (with explanations):
input_list=['1.exe','2.exe','3.exe','4.exe']
def split(input_list):
    if len(input_list) == 0:
        return input_list # if there is no data return empty list
    else: # if we have elements
        first_value = input_list[0] # first value
        if split(input_list[1:]) != []: # input_list[1:] will return a list with all values except the first value
            input_list = [{'name':first_value ,'children': split(input_list[1:])}]
            return input_list # return after the last recursion is called
        else:
            input_list = [{'name': first_value}]
            return input_list
print (split(input_list))
output:
[{'name': '1.exe', 'children':
[{'name': '2.exe', 'children':
[{'name': '3.exe', 'children':
[{'name': '4.exe'}]}]}]}]
or:
input_list=['1.exe','2.exe','3.exe','4.exe']
def split(input_list):
    if input_list:
        head, *tail = input_list # This is a nicer way of doing head, tail = input_list[0], input_list[1:]
        if split(tail) != []:
            return [{'name': head, 'children':split(tail)}]
        else:
            return [{'name': head}]
    else:
        return [] # must be an empty list so the != [] check above stops the recursion
print (split(input_list))
Convert from Python to JSON:
import json
# a Python object (dict):
x = {
"name": "John",
"age": 30,
"city": "New York"
}
# convert into JSON:
y = json.dumps(x)
# the result is a JSON string:
print(y)
JSON is a syntax for storing and exchanging data. If you have a Python
object, you can convert it into a JSON string by using the
json.dumps() method.
import json
input_list=['1.exe','2.exe','3.exe','4.exe']
def split(input_list):
    try:
        first_value = input_list[0]
        input_list = {'name': first_value, 'children': split(input_list[1:])} if split(input_list[1:]) != [] else {'name': first_value}
        return input_list
    except:
        return input_list
data = split(input_list)
print (json.dumps(data))
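The recursive answers above call split twice per level, so the work doubles with every extra element. As an alternative sketch (not any of the posters' code), the same list-of-one-dict output can be built in a single pass from the last element backwards. The name chain_to_tree is hypothetical.

```python
import json

def chain_to_tree(names):
    """Nest each name as the only child of the one before it."""
    node = None
    # Walk from the last name to the first, wrapping as we go
    for name in reversed(names):
        child = node
        node = {"name": name}
        if child is not None:
            node["children"] = [child]
    # Match the requested format: a list holding the top-level node
    return [node] if node is not None else []

input_list = ['1.exe', '2.exe', '3.exe', '4.exe']
print(json.dumps(chain_to_tree(input_list)))
```

This avoids recursion entirely, so long input lists do not hit Python's recursion limit.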
I currently have two classes in Python like these ones
class person:
    age=""
    name=""
    ranking = {}
    def addRanking(self):
        #Do Whatever treatment and add to the ranking dict
        pass
class ranking:
    semester = ""
    position = ""
    gpa = ""
I have my persons stored in a dictionary called dictP, and I want to json.dumps() this dictionary, but it doesn't seem to work. Here is my function to dump to JSON:
def toJson():
    jsonfile = open('dict.json', 'w')
    print(json.dump(listP, jsonfile))
I get the famous error: "... is not JSON serializable".
Do you know what I can do to solve this problem? I thought that having two dictionaries (which are serializable) would avoid this kind of issue, but apparently not.
Thanks in advance
Edit:
Here is an example (typed on my phone sorry for typos, I'm not sure it does run but it's so you get the idea):
class person:
    age=""
    name=""
    ranking = {}
    def __init__(self, age, name):
        self.age = age
        self.name = name
        self.ranking = {}
    def addRanking(self,semester,position,gpa):
        #if the semester is not already present in the data for that person
        self.ranking[semester] = make_ranking(semester,position,gpa)
class ranking:
    semester = ""
    position = ""
    gpa = ""
    def __init__(self, semester, position, gpa):
        self.semester = semester
        self.position = position
        self.gpa = gpa
dictP = {}
def make_person(age, name):
    # Some stuff happens there
    return person(age,name)
def make_ranking(semester,position,gpa):
    #some computation there
    return ranking(semester,position,gpa)
def pretending_to_read_csv():
    age = 12
    name = "Alice"
    p = make_person(age, name)
    dictP["1"] = p
    age = 13
    name = "Alice"
    p = make_person(age, name)
    dictP["2"] = p
    #We read a csv for ranking that gives us an ID
    semester = 1
    position = 4
    gpa = 3.2
    id = 1
    dictP["1"].addRanking(semester, position, gpa)
    semester = 2
    position = 4
    gpa = 3.2
    id = 1
    dictP["1"].addRanking(semester, position, gpa)
For a dictionary to be serializable, note that all the keys & values in that dictionary must be serializable as well. You did not show us what listP contains, but I'm guessing it's something like this:
>>> listP
[<__main__.person instance at 0x107b65290>, <__main__.person instance at 0x107b65368>]
Python instances are not serializable.
I think you want a list of dictionaries, which would look like this:
>>> listP
[{'ranking': {}, 'age': 10, 'name': 'fred'}, {'ranking': {}, 'age': 20, 'name': 'mary'}]
This would serialize as you expect:
>>> import json
>>> json.dumps(listP)
'[{"ranking": {}, "age": 10, "name": "fred"}, {"ranking": {}, "age": 20, "name": "mary"}]'
UPDATE
(Thanks for adding example code.)
>>> pretending_to_read_csv()
>>> dictP
{'1': <__main__.person instance at 0x107b65368>, '2': <__main__.person instance at 0x107b863b0>}
Recall that user-defined classes cannot be serialized automatically. It's possible to extend the JSONEncoder directly to handle these cases, but all you really need is a function that can turn your object into a dictionary comprised entirely of primitives.
def convert_ranking(ranking):
    return {
        "semester": ranking.semester,
        "position": ranking.position,
        "gpa": ranking.gpa}
def convert_person(person):
    return {
        "age": person.age,
        "name": person.name,
        # .items() works on Python 2 and 3; .iteritems() is Python 2 only
        "ranking": {semester: convert_ranking(ranking) for semester, ranking in person.ranking.items()}}
One more dictionary comprehension to actually do the conversion and you're all set:
>>> new_dict = {person_id: convert_person(person) for person_id, person in dictP.items()}
>>> from pprint import pprint
>>> pprint(new_dict)
{'1': {'age': 12,
'name': 'Alice',
'ranking': {1: {'gpa': 3.2, 'position': 4, 'semester': 1},
2: {'gpa': 3.2, 'position': 4, 'semester': 2}}},
'2': {'age': 13, 'name': 'Alice', 'ranking': {}}}
Since no user-defined objects are stuffed in there, this will serialize as you hope:
>>> json.dumps(new_dict)
'{"1": {"ranking": {"1": {"position": 4, "semester": 1, "gpa": 3.2}, "2": {"position": 4, "semester": 2, "gpa": 3.2}}, "age": 12, "name": "Alice"}, "2": {"ranking": {}, "age": 13, "name": "Alice"}}'
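The JSONEncoder route mentioned above can be sketched as follows. This is a minimal Python 3 illustration, not the answer's own method: the class names Person, Ranking and ObjectEncoder are stand-ins chosen for the example, and the encoder simply falls back to each object's attribute dictionary.

```python
import json

class Ranking:
    def __init__(self, semester, position, gpa):
        self.semester = semester
        self.position = position
        self.gpa = gpa

class Person:
    def __init__(self, age, name):
        self.age = age
        self.name = name
        self.ranking = {}

class ObjectEncoder(json.JSONEncoder):
    """Hypothetical encoder: falls back to an object's attribute dict."""
    def default(self, o):
        # default() is only called for objects json cannot serialize itself
        if isinstance(o, (Person, Ranking)):
            return vars(o)
        # Let the base class raise TypeError for anything else
        return super().default(o)

alice = Person(12, "Alice")
alice.ranking[1] = Ranking(1, 4, 3.2)
encoded = json.dumps({"1": alice}, cls=ObjectEncoder, sort_keys=True)
print(encoded)
```

Nested objects are handled without extra code, because the encoder calls default again for every unserializable value it meets inside the returned dictionary. The same fallback can also be passed as a plain function through the default argument of json.dumps.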
You can try calling json.dump on the .__dict__ member of your instance. You say that you have a list of person instances so try doing something like this:
listJSON = []
for p in listP:
    #append the value of the dictionary containing data about your person instance to a list
    listJSON.append(p.__dict__)
json.dump(listJSON, jsonfile)
If you are storing your person instances in a dictionary like so: dictP = {'person1': p1, 'person2': p2} this solution will loop through the keys and change their corresponding values to the __dict__ member of the instance:
for key in dictP:
    dictP[key] = dictP[key].__dict__
json.dump(dictP, jsonfile)
I'm working with nested JSON-like data structures in Python 2.7 that I exchange with some foreign Perl code. I just want to 'work with' these nested structures of lists and dictionaries in a more pythonic way.
So if I have a structure like this...
a = {
'x': 4,
'y': [2, 3, { 'a': 55, 'b': 66 }],
}
...I want to be able to deal with it in a python script as if it was nested python classes/Structs, like this:
>>> aa = j2p(a) # <<- this is what I'm after.
>>> print aa.x
4
>>> aa.z = 99
>>> print a
{
'x': 4,
'y': [2, 3, { 'a': 55, 'b': 66 }],
'z': 99
}
>>> aa.y[2].b = 999
>>> print a
{
'x': 4,
'y': [2, 3, { 'a': 55, 'b': 999 }],
'z': 99
}
Thus aa is a proxy into the original structure. This is what I came up with so far, inspired by the excellent What is a metaclass in Python? question.
def j2p(x):
    """j2p creates a pythonic interface to nested arrays and
    dictionaries, as returned by json readers.
    >>> a = { 'x':[5,8], 'y':5}
    >>> aa = j2p(a)
    >>> aa.y=7
    >>> print a
    {'x': [5, 8], 'y':7}
    >>> aa.x[1]=99
    >>> print a
    {'x': [5, 99], 'y':7}
    >>> aa.x[0] = {'g':5, 'h':9}
    >>> print a
    {'x': [ {'g':5, 'h':9} , 99], 'y':7}
    >>> print aa.x[0].g
    5
    """
    if isinstance(x, list):
        return _list_proxy(x)
    elif isinstance(x, dict):
        return _dict_proxy(x)
    else:
        return x
class _list_proxy(object):
    def __init__(self, proxied_list):
        object.__setattr__(self, 'data', proxied_list)
    def __getitem__(self, a):
        return j2p(object.__getattribute__(self, 'data').__getitem__(a))
    def __setitem__(self, a, v):
        return object.__getattribute__(self, 'data').__setitem__(a, v)
class _dict_proxy(_list_proxy):
    def __init__(self, proxied_dict):
        _list_proxy.__init__(self, proxied_dict)
    def __getattribute__(self, a):
        return j2p(object.__getattribute__(self, 'data').__getitem__(a))
    def __setattr__(self, a, v):
        return object.__getattribute__(self, 'data').__setitem__(a, v)
def p2j(x):
    """p2j gives back the underlying json-ic nested
    dictionary/list structure of an object or attribute created with
    j2p.
    """
    if isinstance(x, (_list_proxy, _dict_proxy)):
        return object.__getattribute__(x, 'data')
    else:
        return x
Now I wonder whether there is an elegant way of mapping a whole set of the __*__ special functions, like __iter__, __delitem__? so I don't need to unwrap things using p2j() just to iterate or do other pythonic stuff.
# today:
for i in p2j(aa.y):
    print i
# would like to...
for i in aa.y:
    print i
I think you're making this more complex than it needs to be. If I understand you correctly, all you should need to do is this:
import json
class Struct(dict):
    def __getattr__(self, name):
        return self[name]
    def __setattr__(self, name, value):
        self[name] = value
    def __delattr__(self, name):
        del self[name]
j = '{"y": [2, 3, {"a": 55, "b": 66}], "x": 4}'
aa = json.loads(j, object_hook=Struct)
for i in aa.y:
    print(i)
When you load JSON, the object_hook parameter lets you specify a callable that processes each JSON object as it is decoded. I've just used it to turn each dict into an object that allows attribute access to its keys. See the json module documentation for details.
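One caveat with a dict subclass like this: a missing key surfaces as KeyError rather than AttributeError, and on Python 3 that makes hasattr() raise instead of returning False. A small variant that translates the exception is sketched below. The name SafeStruct is hypothetical, and the code assumes Python 3.

```python
import json

class SafeStruct(dict):
    """dict subclass with attribute access; missing keys raise AttributeError."""
    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None
    def __setattr__(self, name, value):
        self[name] = value
    def __delattr__(self, name):
        try:
            del self[name]
        except KeyError:
            raise AttributeError(name) from None

aa = json.loads('{"y": [2, 3, {"a": 55, "b": 66}], "x": 4}', object_hook=SafeStruct)
aa.y[2].b = 999   # nested dicts were decoded as SafeStruct too
aa.z = 99         # attribute assignment writes a dict key
print(aa.y[2].b, hasattr(aa, "missing"), json.dumps(aa, sort_keys=True))
```

Since the object is still a dict, it can be passed to json.dumps directly and iterated like any other dictionary.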
There is an attrdict library that does exactly that in a very safe manner, but if you want, a quick and dirty (possibly leaking memory) approach was given in this answer:
class AttrDict(dict):
    def __init__(self, *args, **kwargs):
        super(AttrDict, self).__init__(*args, **kwargs)
        self.__dict__ = self
j = '{"y": [2, 3, {"a": 55, "b": 66}], "x": 4}'
aa = json.loads(j, object_hook=AttrDict)
I found the answer: there is intentionally no way to map the special methods automatically in Python through __getattribute__. So to achieve what I want, I need to explicitly define each special method, such as __len__, one after the other.
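A minimal sketch of what defining the special methods explicitly can look like is shown below. The names wrap, ListProxy and DictProxy are hypothetical stand-ins for j2p and its proxy classes, and only a handful of special methods are covered. One assumption to note: a dictionary key named _data would be shadowed by the proxy's own attribute.

```python
def wrap(x):
    """Wrap lists and dicts in a proxy; return other values unchanged."""
    if isinstance(x, list):
        return ListProxy(x)
    if isinstance(x, dict):
        return DictProxy(x)
    return x

class ListProxy(object):
    def __init__(self, data):
        # Bypass any overridden __setattr__ in subclasses
        object.__setattr__(self, '_data', data)
    # Special methods are looked up on the type, so each one is spelled out
    def __getitem__(self, key):
        return wrap(self._data[key])
    def __setitem__(self, key, value):
        self._data[key] = value
    def __delitem__(self, key):
        del self._data[key]
    def __len__(self):
        return len(self._data)
    def __iter__(self):
        return (wrap(item) for item in self._data)

class DictProxy(ListProxy):
    def __getattr__(self, name):
        # Only called when normal attribute lookup fails
        data = object.__getattribute__(self, '_data')
        try:
            return wrap(data[name])
        except KeyError:
            raise AttributeError(name)
    def __setattr__(self, name, value):
        self._data[name] = value

a = {'x': 4, 'y': [2, 3, {'a': 55, 'b': 66}]}
aa = wrap(a)
aa.z = 99
aa.y[2].b = 999
print([item for item in aa.y if isinstance(item, int)], len(aa.y), a)
```

Iteration and len() now work on the proxy directly, and every write still lands in the original structure a.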