JSON to DataFrame and Excel with Python

I would like to ask for some help with converting a nested JSON into a pandas DataFrame.
I have read the quite brilliant input from a couple of years past, but that is outdated now. :(
Flatten double nested JSON
So here is a sample of my input data (mind that classes might contain up to 10 class-name/confidence pairs):
[
    {
        "classifier_id": "my_classifier_id",
        "url": "https://api.eu-de.natural-language-classifier...",
        "text": "for sales? aligning obsolete incentive\u00a0 system to what is the standard today: 100% reference salary, 100% of ref sal if you hit 100% of your quota",
        "top_class": "conditions",
        "classes": [
            {
                "class_name": "conditions",
                "confidence": 0.9074866214536228
            },
            {
                "class_name": "temperature",
                "confidence": 0.09251337854637723
            }
        ]
    },
    {
        "classifier_id": "my_classifier_id",
        "url": "https://api.eu-de.natural-language-classifier...",
        "text": "Complete integration of incentives.\u00a0 People act inline with how they are compensated as the general rule. \u00a0 If we get that right then this model can genuinely change the face of IBM to the client.",
        "top_class": "conditions",
        "classes": [
            {
                "class_name": "conditions",
                "confidence": 0.9683663322166756
            },
            {
                "class_name": "temperature",
                "confidence": 0.0316336677833244
            }
        ]
    },
    {
        "classifier_id": "my_classifier_id",
        "url": "https://api.eu-de.natural-language-classifier.watson...",
        "text": "Enablement, operational support on the most basic things",
        "top_class": "temperature",
        "classes": [
            {
                "class_name": "temperature",
                "confidence": 0.8174158442711534
            },
            {
                "class_name": "conditions",
                "confidence": 0.1825841557288465
            }
        ]
    }
]
What I have tried thus far in Python:
data_df = pd.read_json(r'C:\Users\...\Documents\Python NLP\WATSON NLC\OUTPUT JSON\nlc_data_full.json')
When using this, the classes column still remains in a JSON-like form:
[{'class_name': 'conditions', 'confidence': 0.907486621453622}, {'class_name': 'temperature', 'confidence': 0.092513378546377}]
[{'class_name': 'conditions', 'confidence': 0.9683663322166751}, {'class_name': 'temperature', 'confidence': 0.031633667783324}]
[{'class_name': 'temperature', 'confidence': 0.8174158442711531}, {'class_name': 'conditions', 'confidence': 0.182584155728846}]
I would love to get a format that can be worked with in Excel. Thank you for looking into this.

Well, I think I managed to figure out what everyone else already knew anyway. LOL
So the magic is in the pd.json_normalize function. With the parameters it takes, it can open multi-nested JSON files with relative ease.
Also the pandas site has been a good friend as always: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html
I am calling my dataset: nlc_data = [ .......
Here is a super lightweight solution for cases that do not have such intricate nesting: normie_2 = pd.json_normalize(nlc_data, max_level=0)
This one works for multi nested json files:
result = pd.json_normalize(nlc_data, 'classes', ['text', 'top_class'])
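And since the goal was a format that works in Excel, the flattened frame can be written straight out with pandas' to_excel. A minimal sketch (the output filename is just an example, and writing .xlsx needs the openpyxl package installed):

import pandas as pd

# 'classes' is the record path: one output row per class entry.
# 'text' and 'top_class' are carried along as metadata columns.
result = pd.json_normalize(nlc_data, 'classes', ['text', 'top_class'])

# Write the flat table to an Excel file (illustrative filename).
result.to_excel('nlc_data_flat.xlsx', index=False)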
Well, I guess I got a lot smarter today. Bear with me ... I just might have another awesome question tomorrow.
Bye, Levi

Related

Fiware Orion NGSI-LD : Support of LanguageMap

I would like to know whether the languageMap functionality is supported (or will be supported) in the Orion-LD implementation.
The languageMap is presented in the following video: https://www.youtube.com/watch?v=ll-t8Vi9i50
The idea is to be able to request an attribute in a specific language. Example with the following JSON:
"pitch": {
"type": "Property",
"value": [
{
"language ": "fr",
"article": "Mariage"
},
{
"language": "en",
"article": "Wedding"
},
{
"language": "ru",
"article": "Выставка!"
},
{
"language": "zh",
"article": "展览"
},
{
"language": "ja",
"article": "展覧会"
}
]
},
How can I query the article for a specific language?
Many thanks in advance for your feedback.
Rgds
LC
Yes, it will most definitely be supported.
Not the very first item on our Backlog, but yeah, it's quite prioritized. Third item in a list of hundreds of items ... (right now - the list is somewhat volatile).
It will not be exactly as in your example; please check the latest published ETSI NGSI-LD spec to get it right. It's something like this:
"pitch": {
"type": "LanguageProperty",
"languageMap": {
"en": "marriage",
"fr": "mariage",
"es": "matrimonio",
...
}
}
If you create an issue on Orion-LD's GitHub, you will be informed as soon as the work begins on this, and I can even send a hint every now and then before the work actually begins. Hopefully, work will commence on this interesting feature before summer, but I can't really promise anything.
About queries, if I remember correctly, the query will be done simply with
?q=pitch==matrimonio&lang=es
(please check the API spec to verify what I just said)
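To make that concrete, here is a rough sketch of what such a request could look like over HTTP. The q and lang parameters are taken from the answer above; the endpoint path is the standard NGSI-LD one and localhost:1026 is Orion's default port, but verify everything against the spec:

import requests  # third-party: pip install requests

# Query entities whose 'pitch' language property matches 'matrimonio'
# in Spanish (parameter syntax as given in the answer above).
resp = requests.get(
    'http://localhost:1026/ngsi-ld/v1/entities',
    params={'q': 'pitch==matrimonio', 'lang': 'es'},
    headers={'Accept': 'application/ld+json'},
)
print(resp.json())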

Deeply nested JSON documents in Apache Solr

I have a deeply nested document (pseudo-structure shown below):
[{
    "id": "1",
    "company_id": "1",
    "company_name": "company_1",
    "departments": [{
        "dep1": [{
            "id": 40,
            "name": "xyz"
        },
        {
            "id": 41,
            "name": "xyr"
        }],
        "dep2": [{
        }]
    }],
    "employeePrograms": [{
    }]
}]
How can I index these types of documents in Apache Solr?
The documentation only gives the idea of immediate child documents.
Unfortunately I don't have huge experience with this technology, but I want to help. Here is some official documentation that might be useful: official doc
more specific
If you have some uncommon issue, tell me about it (an error message, or whatever) and I will try my best to help :)
Upd1:
Solr can only maintain a 'flat' representation of the data. What you were trying to do is not really possible. There are a number of workarounds, such as using dynamic fields and using a Solr join to link multiple data sets.
Speaking of deep nesting, I've found an example of a workaround.
If you had something like that:
"docs": [
{
"name": "Product Name",
"categories": [
{
"name": "Category 1",
"priority": 8
},
{
"name": "Category 2",
"priority": 6
}
...
]
},
You have to modify it like this to make it not deeply nested:
"docs": [
{
name: "Sample Product"
categories: [
{
priority_category: "9_Category 1",
},
{
priority_category: "5_Category 2",
}
...
]
},
So, once you've done something similar, check whether there are any errors anywhere.
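For illustration, the reshaping could be done in a few lines of Python before indexing. This is a sketch only, with names taken from the example above:

# Collapse each {priority, name} pair into a single "priority_name"
# string so that the categories list holds flat, single-field objects.
def flatten_categories(doc):
    flattened = dict(doc)
    flattened["categories"] = [
        {"priority_category": f"{c['priority']}_{c['name']}"}
        for c in doc["categories"]
    ]
    return flattened

product = {
    "name": "Product Name",
    "categories": [
        {"name": "Category 1", "priority": 8},
        {"name": "Category 2", "priority": 6},
    ],
}
print(flatten_categories(product))
# {'name': 'Product Name', 'categories': [{'priority_category': '8_Category 1'},
#  {'priority_category': '6_Category 2'}]}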

Parsing through JSON gives undefined?

I have a very complex JSON and a snippet of it is below:
var designerJSON =
{
    "nodes":
    [
        {
            "NodeDefinition": {
                "name": "Start",
                "thumbnail": "Start.png",
                "icon": "Start.png",
                "info": "Entry point",
                "help": "Start point in your workflow.",
                "workflow": "Start",
                "category": "Basic",
                "ui": [
                    {
                        "label": "Entry point",
                        "category": "Help",
                        "component": "label",
                        "type": "label"
                    }
                ]
            },
            "States": [
                {
                    "start": "node1"
                }
            ]
        },
        .......
    ]
}
I would like to get the value of "start" in States, but I am stuck at the very first step of getting into the JSON. When I try
console.log(designerJSON["nodes"]);
I am getting undefined.
I want the value of start, which is designerJSON["nodes"]["States"]["start"].
Can you help?
Thanks in advance
designerJSON["nodes"]["States"]["start"] won't do it.
designerJSON["nodes"] is a list, as is States, so you need to access individual items by index (or iteration).
In the example you have given, you need to use this:
designerJSON['nodes'][0]['States'][0]['start']
or this (cleaner IMO):
designerJSON.nodes[0].States[0].start
You have arrays in your JSON.
Instead of
designerJSON["nodes"]["States"]["start"]
use
designerJSON["nodes"][0]["States"][0]["start"]
P.S. Pay attention to how code is formatted in this topic.
P.P.S. Using brackets to access properties in JS is considered "bad style" (per JSHint recommendations); it's better to access them via dot notation, e.g.:
designerJSON.nodes[0].States[0].start

Handling Incredibly large JSON Document in CouchDB

I'm new to NoSQL databases and I'm having a hard time figuring out how to handle a very large JSON document that could amount to over 20 MB on my local drive. This structure will definitely grow over time, and I worry about the speed of queries and about having to search deep through the returned JSON object nest just to get a string out. My JSON is deeply nested, like this for example:
{
    "exams": {
        "exam1": {
            "year": {
                "math": {
                    "questions": [
                        {
                            "question_text": "first question",
                            "options": [
                                "option1",
                                "option2",
                                "option3",
                                "option4",
                                "option5"
                            ],
                            "answer": 1,
                            "explanation": "explain the answer"
                        },
                        {
                            "question_text": "second question",
                            "options": [
                                "option1",
                                "option2",
                                "option3",
                                "option4",
                                "option5"
                            ],
                            "answer": 1,
                            "explanation": "explain the answer"
                        },
                        {
                            "question_text": "third question",
                            "options": [
                                "option1",
                                "option2",
                                "option3",
                                "option4",
                                "option5"
                            ],
                            "answer": 1,
                            "explanation": "explain the answer"
                        }
                    ]
                },
                "english": {same structure as above}
            },
            "1961": {}
        },
        "exam2": {},
        "exam3": {},
        "exam4": {}
    }
}
In the main application, question objects are created and appended based on the type of exam, year, and subject, making the JSON document huge over time. How can I re-model this to avoid slow queries in the future?
Dominic is right. You need to start dividing the documents and storing them as separate documents.
The next question is how to recompose the document after it has been split.
Considering you're using Couch, I would recommend doing this at the application layer. A good starting point would be to create exam documents and store them in their own database. Then have a document (exams) in another database that has pointers to the exam documents.
You can retrieve the exams document and then fetch individual exams as needed. This could be especially useful with paging, since most people will only want to see the most recent exams.
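To make that concrete, here is a sketch of what the split data model might look like. The document IDs and field names are purely illustrative, not anything CouchDB requires:

# Each exam/year/subject combination becomes its own document,
# stored in a dedicated database (IDs here are made up).
exam_doc = {
    "_id": "exam1:1961:math",
    "type": "exam",
    "questions": [
        {
            "question_text": "first question",
            "options": ["option1", "option2", "option3", "option4", "option5"],
            "answer": 1,
            "explanation": "explain the answer",
        },
    ],
}

# A small index document points at the per-exam documents, so the
# application can page through exams and fetch them one at a time.
exams_index = {
    "_id": "exams_index",
    "type": "index",
    "exams": ["exam1:1961:math", "exam1:1961:english"],
}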

JSON format with gzip compression

My current project sends a lot of data to the browser in JSON via ajax requests.
I've been trying to decide which format I should use. The two I have in mind are
[
    {
        "colname1": "content",
        "colname2": "content"
    },
    {
        "colname1": "content",
        "colname2": "content"
    },
    ...
]
and
{
    "columns": [
        "column name 1",
        "column name 2"
    ],
    "rows": [
        [
            "content",
            "content"
        ],
        [
            "content",
            "content"
        ]
        ...
    ]
}
The first format is better because it is easier to work with: I just have to convert it to an object once it is received. The second will need some post-processing to convert it into a format more like the first so that it is easier to work with in JavaScript (see the sketch below).
The second is better because it is less verbose, and therefore takes up less bandwidth and downloads more quickly. Before compression it is usually between 0.75 and 0.85 times the size of the first format.
Gzip compression complicates things further, bringing the size ratio nearer to between 0.85 and 0.95.
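For reference, the post-processing I mean is roughly this sketch, where payload stands for the parsed second-format response:

// Convert the columns/rows format into an array of row objects,
// i.e. the shape of the first format.
function toObjects(payload) {
    return payload.rows.map(function (row) {
        var obj = {};
        payload.columns.forEach(function (col, i) {
            obj[col] = row[i];
        });
        return obj;
    });
}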
Which format should I go with and why?
I'd suggest using RJSON:
RJSON (Recursive JSON) converts any JSON data collection into more compact recursive form. Compressed data is still JSON and can be parsed with JSON.parse. RJSON can compress not only homogeneous collections, but any data sets with free structure.
Example:
JSON:
{
    "id": 7,
    "tags": ["programming", "javascript"],
    "users": [
        {"first": "Homer", "last": "Simpson"},
        {"first": "Hank", "last": "Hill"},
        {"first": "Peter", "last": "Griffin"}
    ],
    "books": [
        {"title": "JavaScript", "author": "Flanagan", "year": 2006},
        {"title": "Cascading Style Sheets", "author": "Meyer", "year": 2004}
    ]
}
RJSON:
{
    "id": 7,
    "tags": ["programming", "javascript"],
    "users": [
        {"first": "Homer", "last": "Simpson"},
        [2, "Hank", "Hill", "Peter", "Griffin"]
    ],
    "books": [
        {"title": "JavaScript", "author": "Flanagan", "year": 2006},
        [3, "Cascading Style Sheets", "Meyer", 2004]
    ]
}
Shouldn't the second bit of example 1 be "rowname1", etc.? I don't really get example 2, so I guess I would point you towards 1. There is much to be said for having data immediately workable without pre-processing it first. Justification: I once spent too long optimizing an array system that turned out to work perfectly, but it's hell to update now.