How Do I Serialize spaCy Custom Span Extensions as JSON?

I am using spaCy 2.1.6 to define a custom extension on a span.
>>> from spacy import load
>>> nlp = load("en_core_web_lg")
>>> from spacy.tokens import Span
>>> Span.set_extension('my_label', default=None)
>>> d = nlp("The fox jumped.")
>>> d[0:2]._.my_label = "ANIMAL"
>>> d[0:2]._.my_label
'ANIMAL'
The custom span extension does not appear when I serialize the document to JSON.
>>> d.to_json()
{'text': 'The fox jumped.',
 'ents': [],
 'sents': [{'start': 0, 'end': 15}],
 'tokens': [{'id': 0,
             'start': 0,
             'end': 3,
             'pos': 'DET',
             'tag': 'DT',
             'dep': 'det',
             'head': 1},
            {'id': 1,
             'start': 4,
             'end': 7,
             'pos': 'NOUN',
             'tag': 'NN',
             'dep': 'nsubj',
             'head': 2},
            {'id': 2,
             'start': 8,
             'end': 14,
             'pos': 'VERB',
             'tag': 'VBD',
             'dep': 'ROOT',
             'head': 2},
            {'id': 3,
             'start': 14,
             'end': 15,
             'pos': 'PUNCT',
             'tag': '.',
             'dep': 'punct',
             'head': 2}]}
(I'm specifically interested in custom annotations on Spans, but the same appears to be true of the JSON serialization of the Doc object.)
Pickling and unpickling the document does preserve the custom extension.
How do I get the custom span extensions into the JSON serialization, or is that not supported?

Use this function and add your custom extensions any way you want:
import spacy

def doc2json(doc: spacy.tokens.Doc, model: str):
    json_doc = {
        "text": doc.text,
        "text_with_ws": doc.text_with_ws,
        "cats": doc.cats,
        "is_tagged": doc.is_tagged,
        "is_parsed": doc.is_parsed,
        "is_nered": doc.is_nered,
        "is_sentenced": doc.is_sentenced,
    }
    ents = [
        {"start": ent.start, "end": ent.end, "label": ent.label_} for ent in doc.ents
    ]
    if doc.is_sentenced:
        sents = [{"start": sent.start, "end": sent.end} for sent in doc.sents]
    else:
        sents = []
    if doc.is_tagged and doc.is_parsed:
        noun_chunks = [
            {"start": chunk.start, "end": chunk.end} for chunk in doc.noun_chunks
        ]
    else:
        noun_chunks = []
    tokens = [
        {
            "text": token.text,
            "text_with_ws": token.text_with_ws,
            "whitespace": token.whitespace_,
            "orth": token.orth,
            "i": token.i,
            "ent_type": token.ent_type_,
            "ent_iob": token.ent_iob_,
            "lemma": token.lemma_,
            "norm": token.norm_,
            "lower": token.lower_,
            "shape": token.shape_,
            "prefix": token.prefix_,
            "suffix": token.suffix_,
            "pos": token.pos_,
            "tag": token.tag_,
            "dep": token.dep_,
            "is_alpha": token.is_alpha,
            "is_ascii": token.is_ascii,
            "is_digit": token.is_digit,
            "is_lower": token.is_lower,
            "is_upper": token.is_upper,
            "is_title": token.is_title,
            "is_punct": token.is_punct,
            "is_left_punct": token.is_left_punct,
            "is_right_punct": token.is_right_punct,
            "is_space": token.is_space,
            "is_bracket": token.is_bracket,
            "is_currency": token.is_currency,
            "like_url": token.like_url,
            "like_num": token.like_num,
            "like_email": token.like_email,
            "is_oov": token.is_oov,
            "is_stop": token.is_stop,
            "is_sent_start": token.is_sent_start,
            "head": token.head.i,
        }
        for token in doc
    ]
    return {
        "model": model,
        "doc": json_doc,
        "ents": ents,
        "sents": sents,
        "noun_chunks": noun_chunks,
        "tokens": tokens,
    }
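A usage sketch with the pipeline and text from the question (json is only needed if you want a JSON string rather than a plain dict):
import json

# "model" is just a label stored in the output, so pass the model name.
print(json.dumps(doc2json(nlp("The fox jumped."), "en_core_web_lg"), indent=2))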

Since I ran into the same issue and the only other answer didn't really help me, I thought I might as well give other people looking into this some hints.
With spaCy 2.1, print_tree was removed and to_json was added. to_json does not return custom extensions, as "this method will output the same format as the JSON training data expected by spacy train" (https://spacy.io/usage/v2-1).
If you want to output your custom extensions, you need to write your own to_json function.
To do this I recommend extending the to_json() given by spaCy.
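A minimal sketch of that approach (assuming a Doc-level extension registered beforehand; the name my_doc_field is hypothetical):
from spacy.tokens import Doc

# Hypothetical Doc-level extension; register it once at import time.
Doc.set_extension('my_doc_field', default=None)

def to_json_with_extensions(doc):
    # Start from spaCy's built-in training-format JSON ...
    json_doc = doc.to_json()
    # ... and append the custom extension value alongside it.
    json_doc['my_doc_field'] = doc._.my_doc_field
    return json_doc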

Not really a fan of the other two answers here since they seem a bit overkill (extending the Doc object by @Chooklii, or the custom but flaky doc2json method by @Laksh), so I'll just drop what I did for one of my projects here; maybe it is useful to someone.
doc = <YOUR_DOC_OBJECT>
extra_fields = [field for field in dir(doc._) if field not in ('get', 'set', 'has')]
doc_json = doc.to_json()
doc_json.update({field: doc._.get(field) for field in extra_fields})
The doc_json should now have all the fields that you set via the extensions interface provided by spaCy, along with the fields set by other spaCy pipeline components.
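Note that doc._ only exposes Doc-level extensions. For Span-level extensions like my_label in the question, you have to decide which spans to serialize yourself, since arbitrary slices such as d[0:2] are not stored on the Doc. A sketch, assuming the spans you care about are the entity spans:
# Sketch for Span-level extensions, assuming doc.ents holds the spans of
# interest; arbitrary slices are not tracked by the Doc, so keep your own
# list of spans if you annotate those instead.
doc_json = doc.to_json()
doc_json['spans'] = [
    {'start': span.start_char, 'end': span.end_char, 'my_label': span._.my_label}
    for span in doc.ents
]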

Related

I need to create a spark dataframe from a nested json file in scala

I have a Json file that looks like this
{
    "tags": [
        {
            "1": "NpProgressBarTag",
            "2": "userPath",
            "3": "screen",
            "4": 6,
            "12": 9,
            "13": "buttonName",
            "16": 0,
            "17": 10,
            "18": 5,
            "19": 6,
            "20": 1,
            "35": 1,
            "36": 1,
            "37": 4,
            "38": 0,
            "39": "npChannelGuid",
            "40": "npShowGuid",
            "41": "npCategoryGuid",
            "42": "npEpisodeGuid",
            "43": "npAodEpisodeGuid",
            "44": "npVodEpisodeGuid",
            "45": "npLiveEventGuid",
            "46": "npTeamGuid",
            "47": "npLeagueGuid",
            "48": "npStatus",
            "50": 0,
            "52": "gupId",
            "54": "deviceID",
            "55": 1,
            "56": 0,
            "57": "uiVersion",
            "58": 1,
            "59": "deviceOS",
            "60": 1,
            "61": 0,
            "62": "channelLineupID",
            "63": 2,
            "64": "userProfile",
            "65": "sessionId",
            "66": "hitId",
            "67": "actionTime",
            "68": "seekTo",
            "69": "seekFrom",
            "70": "currentPosition"
        }
    ]
}
I tried to create a dataframe using
val path = "some/path/to/jsonFile.json"
val df = sqlContext.read.json(path)
df.show()
when I run this I get
df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
How do we create a df based on the contents of the "tags" key? All I need is to pull the data out of "tags" and apply a case class like this
case class ProgLang (id: String, type: String )
I need to convert this json data into a dataframe with two columns, .toDF(id, Type).
Can anyone shed some light on this error?
You may modify the JSON using Circe.
Given that your values are sometimes Strings and other times Numbers, this was quite complex.
import io.circe._, io.circe.parser._, io.circe.generic.semiauto._
val json = """ ... """ // your JSON here.
val doc = parse(json).right.get
val mappedDoc = doc.hcursor.downField("tags").withFocus { array =>
  array.mapArray { jsons =>
    jsons.map { json =>
      json.mapObject { o =>
        o.mapValues { v =>
          // Cast numbers to strings.
          if (v.isString) v else Json.fromString(v.asNumber.get.toString)
        }
      }
    }
  }
}
final case class ProgLang(id: String, `type`: String )
final case class Tags(tags: List[Map[String, String]])
implicit val TagsDecoder: Decoder[Tags] = deriveDecoder
val tags = mappedDoc.top.get.as[Tags].right.get
val data = for {
  tag <- tags.tags
  (id, _type) <- tag
} yield ProgLang(id, _type)
Now that you have a List of ProgLang, you may create a DataFrame directly from it, save it as a file with one JSON document per line, save it as a CSV file, etc...
If the file is very big, you may use fs2 to stream it while transforming; it integrates nicely with Circe.
DISCLAIMER: I am far from being a "pro" with Circe, and this seems over-complicated for what looks like a "simple task"; there is probably a better / cleaner way of doing it (maybe using Optics?), but hey, it works! Anyway, if anyone knows a better way to solve this, feel free to edit the question or provide yours.
By default Spark expects one JSON document per line; since your file spans multiple lines, it comes back as _corrupt_record unless you enable the multiLine option:
val path = "some/path/to/jsonFile.json"
spark.read
  .option("multiLine", true).option("mode", "PERMISSIVE")
  .json(path)
Try the following code if your JSON file is not very big:
val spark = SparkSession.builder().getOrCreate()
val df = spark.read.json(spark.sparkContext.wholeTextFiles("some/path/to/jsonFile.json").values)

GAE python27 return nested json

This seems such a simple task, yet it eludes me...
class ViewAllDogs(webapp2.RequestHandler):
    """ Returns an array of json objects representing all dogs. """
    def get(self):
        query = Dog.query()
        results = query.fetch(limit=MAX_DOGS)  # 100
        aList = []
        for match in results:
            aList.append({'id': match.id, 'name': match.name,
                          'owner': match.owner, 'arrival_date': match.arrival_date})
            aList.append({'departure_history': {'departure_date': match.departure_date,
                                                'departed_dog': match.departed_dog}})
        self.response.headers['Content-Type'] = 'application/json'
        self.response.write(json.dumps(aList))
The above, my best attempt to date, gets me:
[
    {
        "arrival_date": null,
        "id": "a link to self",
        "owner": 354773,
        "name": "Rover"
    },
    {
        "departure_history": {
            "departed_dog": "Jake",
            "departure_date": "04/24/2017"
        }
    },
    # json array of objects continues...
]
What I'm trying to get is the departure_history nested:
[
    {
        "id": "a link to self...",
        "owner": 354773,
        "name": "Rover",
        "departure_history": {
            "departed_dog": "Jake",
            "departure_date": "04/24/2017"
        },
        "arrival_date": "04/25/2017"
    },
    # json array of objects continues...
]
I've tried a bunch of different combinations, looked at the json docs and the python27 docs, no joy, and burned way too many hours on this. I got this far with the many related SO posts on this topic. Thanks in advance.
You can simplify a little:
aList = []
for match in results:
    aDog = {'id': match.id,
            'name': match.name,
            'owner': match.owner,
            'arrival_date': match.arrival_date,
            'departure_history': {
                'departure_date': match.departure_date,
                'departed_dog': match.departed_dog}
            }
    aList.append(aDog)
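One thing to watch out for (an assumption, since the Dog model isn't shown): if arrival_date and departure_date are date/datetime properties rather than strings, json.dumps will raise a TypeError on them, and you would need a default converter:
# Sketch: stringify values json can't serialize natively (e.g. datetime).
self.response.write(json.dumps(aList, default=str))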
This seems a bit hackish, but it works. If you know a better way, by all means, let me know. Thanks.
class ViewAllDogs(webapp2.RequestHandler):
    """ Returns an array of json objects representing all dogs. """
    def get(self):
        query = Dog.query()
        results = query.fetch(limit=MAX_DOGS)  # 100
        aList = []
        i = 0
        for match in results:
            aList.append({'id': match.id, 'name': match.name,
                          'owner': match.owner, 'arrival_date': match.arrival_date})
            aList[i]['departure_history'] = {'departure_date': match.departure_date,
                                             'departed_dog': match.departed_dog}
            i += 1
        self.response.headers['Content-Type'] = 'application/json'
        self.response.write(json.dumps(aList))

How to read from .csv file and translate into .json (with a different data structure) in python?

Trying to write a python script that will allow me to read a .csv file and reshape the values into a specific format/data structure in .json that I can then import into mongoDB. I'm using pedestrian data as my dataset, and there are over a million entries with redundant data. I'm stuck on writing the actual script and translating the data into my desired .json format.
data.csv - raw CSV, header row first:
Id,Date_Time,Year,Month,Mdate,Day,Time,Sensor_ID,Sensor_Name,Hourly_Counts
1, 01-JUN-2009 00:00,2009,June,1,Monday,0,4,Town Hall (West),194
2, 01-JUN-2009 00:00,2009,June,1,Monday,0,17,Collins Place (South),21
3, 01-JUN-2009 00:00,2009,June,1,Monday,0,18,Collins Place (North),9
4, 01-JUN-2009 00:00,2009,June,1,Monday,0,16,Australia on Collins,39
5, 01-JUN-2009 00:00,2009,June,1,Monday,0,2,Bourke Street Mall (South),28
6, 01-JUN-2009 00:00,2009,June,1,Monday,0,1,Bourke Street Mall (North),37
7, 01-JUN-2009 00:00,2009,June,1,Monday,0,13,Flagstaff Station,1
8, 01-JUN-2009 00:00,2009,June,1,Monday,0,3,Melbourne Central,155
9, 01-JUN-2009 00:00,2009,June,1,Monday,0,15,State Library,98
10, 01-JUN-2009 00:00,2009,June,1,Monday,0,9,Southern Cross Station,7
11, 01-JUN-2009 00:00,2009,June,1,Monday,0,10,Victoria Point,8
12, 01-JUN-2009 00:00,2009,June,1,Monday,0,12,New Quay,30
Because I'll be uploading to mongoDB, the Id in my context is redundant to me so I need my script to skip that. Sensor_ID is not unique but I'm planning to make it the PK and create a list of objects differentiating the Hourly_Count.
I'm aiming to generate a JSON object like this from the data:
data.json
[
    {
        "Sensor_ID": 4,
        "Sensor_Name": "Town Hall(West)",
        "countList":
        [
            {
                "Date_Time": "01-JUN-2009 00:00",
                "Year": 2009,
                "Month": "June",
                "Mdate": 1,
                "Day": "Monday",
                "Time": 0,
                "Hourly_Counts": 194
            },
            {
                "Date_Time": "01-JUN-2009 00:00",
                "Year": 2009,
                "Month": "June",
                "Mdate": 1,
                "Day": "Monday",
                "Time": 1,
                "Hourly_Counts": 82
            }
        ]
    },
    {
        "Sensor_ID": 17,
        "Sensor_Name": "Collins Place(North)",
        "countList":
        [
            {
                "Date_Time": "01-JUN-2009 00:00",
                "Year": 2009,
                "Month": "June",
                "Mdate": 1,
                "Day": "Monday",
                "Time": 0,
                "Hourly_Counts": 21
            }
        ]
    }
]
So on and so forth. I'm trying to make it so that when the script reads a Sensor_ID, it creates a json object from the fields listed and adds it to that sensor's countList. I added a second entry to the countList for Sensor_ID 4 above as an example.
I am using python 2.7.x and I have looked at every question concerning this on stackoverflow and every other website. Very few seem to want to restructure the .csv data when converting to .json, so it's been a bit difficult.
Here's what I have so far; I'm still relatively new to python, so I thought this would be a good thing to try out.
csvtojson.py
import csv
import json

def csvtojson():
    filename = 'data.csv'
    fieldnames = ('Id', 'Date_Time', 'Year', 'Month', 'Mdate', 'Day',
                  'Time', 'Sensor_ID', 'Sensor_Name', 'Hourly_Counts')
    dataTime = ('Date_Time', 'Year', 'Month', 'Mdate', 'Day',
                'Time', 'Hourly_Counts')
    all_data = {}
    with open(filename, 'rb') as csvfile:
        reader = csv.DictReader(csvfile, fieldnames)
        # skip header
        next(reader)
        current_sensorID = None
        for row in reader:
            sensor_ID = row['Sensor_ID']
            sensorName = row['Sensor_Name']
            data = all_data[sensor_ID] = {}
            data['dataTime'] = dict((k, row[k]) for k in dataTime)
    print json.dumps(all_data, indent=4, sort_keys=True)

if __name__ == "__main__":
    csvtojson()
As far as I've got, countList is in its own object, but it's not creating a list of objects, which may mess up the import to mongoDB. It is filtering through Sensor_ID, but it overwrites duplicates instead of adding to countList. And I can't seem to get it into the format/data structure I want; I'm not even sure that's the right structure. The ultimate goal is to import the millions of tuples into mongoDB the way I listed, and I'm trying a small set now to test it out.
Please check the following:
https://github.com/gurdyals/test-repo/tree/master/MongoDB
Use the "MongoDB_py.zip" files.
I did the same to convert csv data to a MongoDB dict.
Please let me know if you have any questions.
Thanks
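For reference, here is a minimal sketch of the grouping idea in plain python (field names taken from the question's data.csv; setdefault keeps one dict per Sensor_ID and appends to its countList instead of overwriting):
import csv
import json

sensors = {}  # keyed by Sensor_ID so repeated rows append instead of overwrite
with open('data.csv', 'rb') as csvfile:
    for row in csv.DictReader(csvfile):
        entry = sensors.setdefault(row['Sensor_ID'], {
            'Sensor_ID': int(row['Sensor_ID']),
            'Sensor_Name': row['Sensor_Name'],
            'countList': [],
        })
        entry['countList'].append({k: row[k] for k in
                                   ('Date_Time', 'Year', 'Month', 'Mdate',
                                    'Day', 'Time', 'Hourly_Counts')})

print json.dumps(sensors.values(), indent=4)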
Here is sample code for doing something similar to the above using python pandas. You could also do some aggregation in the dataframe if you wish to summarise the data to get rid of the redundant data.
import pandas as pd
import json

df = pd.read_csv('data.csv')
# Group by sensor and serialise each group's remaining columns as JSON records.
grouped = df.groupby(['Sensor_ID', 'Sensor_Name']).apply(
    lambda x: x.drop(['Sensor_ID', 'Sensor_Name'], axis=1).to_json(orient='records'))
grouped.name = 'countList'
# Round-trip through JSON to turn the frame into plain dicts and lists.
js = json.loads(pd.DataFrame(grouped).reset_index().to_json(orient='records'))
print json.dumps(js, indent=4)
The output:
[
    {
        "Sensor_ID": 1,
        "countList": "[{\"Id\":6,\"Date_Time\":\" 01-JUN-2009 00:00\",\"Year\":2009,\"Month\":\"June\",\"Mdate\":1,\"Day\":\"Monday\",\"Time\":0,\"Hourly_Counts\":37}]",
        "Sensor_Name": "Bourke Street Mall (North)"
    },
    {
        "Sensor_ID": 2,
        "countList": "[{\"Id\":5,\"Date_Time\":\" 01-JUN-2009 00:00\",\"Year\":2009,\"Month\":\"June\",\"Mdate\":1,\"Day\":\"Monday\",\"Time\":0,\"Hourly_Counts\":28}]",
        "Sensor_Name": "Bourke Street Mall (South)"
    },
    {
        "Sensor_ID": 3,
        "countList": "[{\"Id\":8,\"Date_Time\":\" 01-JUN-2009 00:00\",\"Year\":2009,\"Month\":\"June\",\"Mdate\":1,\"Day\":\"Monday\",\"Time\":0,\"Hourly_Counts\":155}]",
        "Sensor_Name": "Melbourne Central"
    },
    {
        "Sensor_ID": 4,
        "countList": "[{\"Id\":1,\"Date_Time\":\" 01-JUN-2009 00:00\",\"Year\":2009,\"Month\":\"June\",\"Mdate\":1,\"Day\":\"Monday\",\"Time\":0,\"Hourly_Counts\":194}]",
        "Sensor_Name": "Town Hall (West)"
    },
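Note that each countList value above is itself a JSON string rather than a nested array; that is how to_json inside the groupby apply comes out. If you want genuinely nested documents before importing into mongoDB, one way is to decode each string:
# Decode the stringified countList values into real nested lists of dicts.
for record in js:
    record['countList'] = json.loads(record['countList'])
print json.dumps(js, indent=4)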

saving the cppheaderparser output as valid json

The python program http://sourceforge.net/projects/cppheaderparser/ can parse a C++ header file and store the info (about classes etc.) in a python dictionary.
Using the included example program readSampleClass.py and
data_string = ( repr(cppHeader) )
with open('data.txt', 'w') as outfile:
    json.dumps(data_string, outfile)
it saved the output, but it is not valid JSON: it uses single quotes instead of double quotes, and bare values such as False are not quoted at all.
sample of output: (reduced)
{'enums': [], 'variables': [], 'classes':
{'SampleClass':
{'inherits': [], 'line_number': 8, 'declaration_method': 'class', 'typedefs':
{'public': [], 'private': [], 'protected': []
}, 'abstract': False, 'parent': None,'parent': None, 'reference': 0, 'constant': 0, 'aliases': [], 'raw_type': 'void', 'typedef': None, 'mutable': False
}], 'virtual': False, 'rtnType': 'int', 'returns_class': False, 'name': 'anotherFreeFunction', 'constructor': False, 'inline': False, 'returns_pointer': 0, 'defined': False
}]
}
so the question is:
How can I make it use double quotes instead of single quotes, and how can I also make it quote bare values like False in the sample?
I assume it is possible, as the creator of cppheaderparser wrote
about json.dumps(repr(cppHeader))
https://twitter.com/senexcanis/status/559444754166198272
Why use the json lib if it's not valid JSON?
That said, I have never used python before, and it might just not work as I think.
-update-
After some json doc reading, I gave up on json.dump as it seems to do nothing to the output in this case.
I ended up doing
import string  # needed for the (deprecated) string.replace function

data_string = repr(cppHeader)
data_string = string.replace(data_string, '\'', '\"')
data_string = string.replace(data_string, 'False', '\"False\"')
data_string = string.replace(data_string, 'True', '\"True\"')
data_string = string.replace(data_string, 'None', '\"None\"')
data_string = string.replace(data_string, '...', '')
with open('data.txt', 'w') as outfile:
    outfile.write(data_string)
which gives valid JSON - at least for my test C++ headers.
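A less fragile sketch of the same idea, under the assumption that repr(cppHeader) is a parseable python literal (dicts, lists, strings, numbers, True/False/None) once non-literal fragments like '...' are stripped: ast.literal_eval turns it back into python objects, and json.dump then emits proper JSON on its own.
import ast
import json

# Assumes the repr is a plain python literal after stripping '...'.
data = ast.literal_eval(repr(cppHeader).replace('...', ''))
with open('data.txt', 'w') as outfile:
    json.dump(data, outfile, indent=4)  # True/False/None become true/false/null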
-update 2-
The creator of cppheaderparser just released a new 2.6 version where it's possible to write CppHeaderParser.CppHeader("yourHeader.h").toJSON() to save as json.

how to use a jsonify object in assertions?

I am trying to unit test some code and want to assert that the jsonify output of the code is correct. Here is what I have so far.
def test_get_ticket(self):
    with self.app.test_request_context('/?main_id=14522&user_id=82'):
        methodOutput = brain_get_ticket.get_ticket(
            {'main_id': {1: 0}, 'status': {'Closed': 0}, 'available': {'False': 0}},
            "main_id, status, available",
            ['main_id', 'status', 'available'])
        correct_return_output = json.dumps(dict(
            to_be_working_on_last_id=0,
            to_be_working_on_id=6,
            information={'status': {'Closed': 1}, 'available': {'False': 1}, 'main_id': {1: 1}}
        ))
        self.assertEquals(json.loads(methodOutput.data()), correct_return_output,
                          "output was: " + str(methodOutput) +
                          " it should be: " + str(correct_return_output))
The output I'm getting is:
self.assertEquals(json.loads(methodOutput.data()), correct_return_output)
TypeError: 'str' object is not callable
Any suggestions?
Solved:
The main problem was that I was using data as if it were a method rather than a descriptor, like Martijn said. Also, changing correct_return_output to a dictionary instead of a jsonify object, so the comparison is against the decoded method output, worked. THANKS!
Response.data is a descriptor and does not need to be called; you are trying to call the returned JSON string here.
Your better bet is to decode that JSON response; dictionaries are unordered and you should not count on what order the resulting JSON data is listed in. You already do so, but then you should compare that against a dictionary, not a new JSON string!
def test_get_ticket(self):
    with self.app.test_request_context('/?main_id=14522&user_id=82'):
        methodOutput = brain_get_ticket.get_ticket(
            {'main_id': {1: 0}, 'status': {'Closed': 0},
             'available': {'False': 0}},
            "main_id, status, available", ['main_id', 'status', 'available'])
        correct_return_output = dict(
            to_be_working_on_last_id=0,
            to_be_working_on_id=6,
            information={'status': {'Closed': 1},
                         'available': {'False': 1},
                         'main_id': {1: 1}})
        self.assertEquals(
            json.loads(methodOutput.data),
            correct_return_output,
            "output was: {!r}, it should be {!r}".format(
                methodOutput.data, correct_return_output))