JSON hierarchy encoding and event capture

Consider the following JSON structure:
{'100': {'Time': '02:00:00', 'Group': 'A', 'Similar events': [101, 102, 104, 120]},
 '101': {'Time': '02:01:00', 'Group': 'B', 'Similar events': [100, 103, 105, 111]},
 '102': {'Time': '04:00:00', 'Group': 'A', 'Similar events': [104, 100, 107, 121]}}
The top-level keys (e.g. '100', '101', etc.) are unique identifiers. I have since found that this is not an ideal way to store JSON (attempting to load this structure - with many more events - crashed my PC).
After some digging, I believe this is the proper way (or, at least, a much more canonical way) of encoding these data in JSON:
{'Time': [{'100': '02:00:00'},
{'101': '02:01:00'},
{'102': '04:00:00'}],
'Group': [{'100': 'A'},
{'101': 'B'},
{'102': 'A'}],
'Similar events': [{'100': [101, 102, 104, 120]},
{'101': [100, 103, 105, 111]},
{'102': [104, 100, 107, 121]}]}
My machine handles this last attempt much better. Why does my former method of using unique events as (what I think are) individual "rows" cause so much trouble? My gut tells me that each "column" or key within each record in the first attempt becomes a new field, since it sits under a unique identifier (a unique key).

It's difficult to say without more details, such as the total size of your data, the amount of memory on your computer, the software you're using, and the specific operations you're trying to perform, but it may be that the working set of the second representation is smaller for your problem.
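As a rough sketch of how the two layouts behave in practice (assuming the data is headed for pandas; the file name here is hypothetical), the id-keyed layout maps directly onto one DataFrame row per event:

import json
import pandas as pd

# Hypothetical file holding the id-keyed layout from the question
with open('events.json') as f:
    events = json.load(f)

# Event ids ('100', '101', ...) become the index; 'Time', 'Group' and
# 'Similar events' become columns
df = pd.DataFrame.from_dict(events, orient='index')
print(df.head())

Whether this ends up lighter or heavier than the column-oriented layout depends on the parser and on the operations that follow, which is why the details asked for above matter.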

Related

How to parse nested JSON file in Pandas

I'm trying to transform a JSON file generated by the Day One Journal to a text file using Python, but have hit a brick wall.
This is broadly the format:
{'metadata': {'version': '1.0'},
'entries': [{'richText': '{"meta":{"version":1,"small-lines-removed":true,"created":{"platform":"com.bloombuilt.dayone-mac","version":1344}},"contents":[{"attributes":{"line":{"header":1,"identifier":"F78B28DA-488E-489E-9C95-1A0648099792"}},"text":"2022\\n"},{"attributes":{"line":{"header":0,"identifier":"FA8C6594-F43D-4652-B442-DAF72A379799"}},"text":"\\n"},{"attributes":{"line":{"header":0,"identifier":"0923BCC8-B24A-4C0D-963C-73D09561EECD"}},"text":"It’s the beginning of a new year"},{"embeddedObjects":[{"type":"horizontalRuleLine"}]},{"text":"\\n\\n\\n\\n"},{"embeddedObjects":[{"type":"horizontalRuleLine"}]}]}',
'duration': 0,
'creationOSVersion': '12.1',
'weather': {'sunsetDate': '2022-01-12T16:15:28Z',
'temperatureCelsius': 7,
'weatherServiceName': 'HAMweather',
'windBearing': 230,
'sunriseDate': '2022-01-12T08:00:44Z',
'conditionsDescription': 'Mostly Clear',
'pressureMB': 1042,
'visibilityKM': 48.28020095825195,
'relativeHumidity': 81,
'windSpeedKPH': 6,
'weatherCode': 'clear-night',
'windChillCelsius': 6.699999809265137},
'editingTime': 2925.313938140869,
'timeZone': 'Europe/London',
'creationDeviceType': 'Hal 9000',
'uuid': '988D9D9876624FAEB88F9BCC666FD9CD',
'creationDeviceModel': 'MacBookPro15,2',
'starred': False,
'location': {'region': {'center': {'longitude': -0.0095,
'latitude': 51},
'radius': 75},
'localityName': 'London',
'country': 'United Kingdom',
'timeZoneName': 'Europe/London',
'administrativeArea': 'England',
'longitude': -0.0095,
'placeName': 'Somewhere',
'latitude': 51},
'isPinned': False,
'creationDevice': 'somedevice'...,
}
I only want the 'text' (of which there might be a number of 'text' entries) and 'creationDate', so that I've got a daily record.
My code to pull out the data is straightforward:
import json

# Open the JSON file and load it into a dictionary
with open('files/2022.json') as f:
    data = json.load(f)
I've tried using list comprehensions and then concatenating the Series in pandas, but the two don't match in length, because multiple entries on one day mix up the dataframe.
I wanted to use this code:
result = []
for i in data['entries']:
    entry = i['creationDate'] + i['text']
    result.append(entry)
but I get this error:
KeyError: 'text'
What do I need to do?
Update:
{'richText': '{"meta":{"version":1,"small-lines-removed":true,"created":{"platform":"com.bloombuilt.dayone-mac","version":1344}},"contents":[{"text":"Later than I planned\\n"}]}',
'duration': 0,
'creationOSVersion': '12.1',
'weather': {'sunsetDate': '2022-01-12T16:15:28Z',
'temperatureCelsius': 7,
'weatherServiceName': 'HAMweather',
'windBearing': 230,
'sunriseDate': '2022-01-12T08:00:44Z',
'conditionsDescription': 'Mostly Clear',
'pressureMB': 1042,
'visibilityKM': 48.28020095825195,
'relativeHumidity': 81,
'windSpeedKPH': 6,
'weatherCode': 'clear-night',
'windChillCelsius': 6.699999809265137},
'editingTime': 672.3099998235703,
'timeZone': 'Europe/London',
'creationDeviceType': 'Computer',
'uuid': 'F53DCC5E05BB4106A49C76954117DBF4',
'creationDeviceModel': 'xompurwe',
'isPinned': False,
'creationDevice': 'Computer',
'text': 'Later than I planned \\\n',
'modifiedDate': '2022-01-05T01:01:29Z',
'isAllDay': False,
'creationDate': '2022-01-05T00:39:19Z',
'creationOSName': 'macOS'},
Sort of managed to work out a solution - thank you to everyone who helped this morning, particularly #Tomer S.
My solution was:
result = []
for i in data['entries']:
    print(i['creationDate'] + i['text'])
    result.append(entry)
It still won't get what I want
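For the record, here is a minimal sketch of what the loop presumably needs, assuming that entries without a 'text' key (the source of the KeyError above) should simply be skipped:

result = []
for entry in data['entries']:
    text = entry.get('text')  # returns None instead of raising KeyError
    if text is not None:
        result.append(entry['creationDate'] + ' ' + text)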

Data preprocessing for Named Entity Recognition?

I'm working on Named Entity Recognition on a resume dataset, and we have entities like dates, phone numbers, emails, etc.
I'm working out how to preprocess those entities. I'm currently adding a space after each punctuation mark, like this:
DAVID B-Name
John I-Name
, O
IT O
Washington B-Address
, I-Address
DC I-Address
( B-Phone
107 I-Phone
) I-Phone
155 I-Phone
- I-Phone
4838 I-Phone
david B-Email
. I-Email
John I-Email
# I-Email
gmail I-Email
. I-Email
com I-Email
But I'm starting to question how to handle such text during inference. I'm assuming that even at inference we have to preprocess the text using the same process, that is, adding a space after each punctuation mark, isn't it?
But then it won't be so readable, right?
For example, at inference I would have to provide input text like test # example . com, which is not readable, is it? The model will only be able to predict entities in that format.
The problem you're trying to deal with is called tokenization. To deal with the formatting issue that you raise, frameworks will often extract the tokens from the underlying text in a way that preserves the original text, such as by keeping track of the character start and end of each token.
For instance, SpaCy in Python returns an object that stores all of this information:
import spacy
from pprint import pprint

nlp = spacy.load("en_core_web_sm")
doc = nlp("DAVID John, IT\nWashington, DC (107) 155-4838 david.John#gmail.com")
# Each token records its character offsets into the original text
pprint([(token.text, token.idx, token.idx + len(token.text)) for token in doc])
output:
[('DAVID', 0, 5),
('John', 6, 10),
(',', 10, 11),
('IT', 12, 14),
('\n', 14, 15),
('Washington', 15, 25),
(',', 25, 26),
('DC', 27, 29),
('(', 30, 31),
('107', 31, 34),
(')', 34, 35),
('155', 36, 39),
('-', 39, 40),
('4838', 40, 44),
('david.John#gmail.com', 45, 65)]
You could either do the same sort of thing yourself (e.g. keep a counter as you add spaces; see the sketch below for an offset-preserving variant) or use an existing tokenizer (such as SpaCy, CoreNLP, TensorFlow, etc.).
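A minimal sketch of the do-it-yourself route, using a naive regex tokenizer rather than spaCy's (the pattern here is purely illustrative):

import re

text = "DAVID John, IT\nWashington, DC (107) 155-4838 david.John#gmail.com"

# Runs of word characters, or single punctuation marks, each with
# (start, end) character offsets into the original string
tokens = [(m.group(), m.start(), m.end())
          for m in re.finditer(r"\w+|[^\w\s]", text)]
print(tokens)

Because each token keeps its offsets, you can train and predict on the split-up form while still mapping every label back to the readable original text.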

How to read a csv file into a list of lists in SWI prolog where the inner list represents each line of the CSV?

I have a CSV file that looks something like the one below, i.e. not in Prolog format:
james,facebook,intel,samsung
rebecca,intel,samsung,facebook
Ian,samsung,facebook,intel
I am trying to write a Prolog predicate that reads the file and returns a list that looks like
[[james,facebook,intel,samsung],[rebecca,intel,samsung,facebook],[Ian,samsung,facebook,intel]]
to be used further in other predicates.
I am still a beginner and have found some good information on SO and modified it to see if I can get what I want, but I'm stuck, because I only generate a list that looks like this:
[[(james,facebook,intel,samsung)],[(rebecca,intel,samsung,facebook)],[(Ian,samsung,facebook,intel)]]
which means when I call the head of the inner lists I get (james,facebook,intel,samsung) and not james.
Here is the code being used (seen on SO and modified):
stream_representations(Input, Lines) :-
    read_line_to_codes(Input, Line),
    (   Line == end_of_file
    ->  Lines = []
    ;   atom_codes(FinalLine, Line),
        term_to_atom(LineTerm, FinalLine),
        Lines = [[LineTerm] | FurtherLines],
        stream_representations(Input, FurtherLines)
    ).
main(Lines) :-
    open('file.txt', read, Input),
    stream_representations(Input, Lines),
    close(Input).
The problem lies with term_to_atom(LineTerm,FinalLine).
First we read a line of the CSV file into a list of character codes with
read_line_to_codes(Input, Line).
Let's simulate input with atom_codes/2:
?- atom_codes('james,facebook,intel,samsung',Line).
Line = [106, 97, 109, 101, 115, 44, 102, 97, 99|...].
Then we recompose the original atom into FinalLine (this seems wasteful; there must be a way to hoover up a line into an atom directly):
?- atom_codes('james,facebook,intel,samsung', Line),
   atom_codes(FinalLine, Line).
Line = [106, 97, 109, 101, 115, 44, 102, 97, 99|...],
FinalLine = 'james,facebook,intel,samsung'.
Then we try to map this atom in FinalLine to a term, LineTerm, using term_to_atom/2:
?- atom_codes('james,facebook,intel,samsung', Line),
   atom_codes(FinalLine, Line),
   term_to_atom(LineTerm, FinalLine).
Line = [106, 97, 109, 101, 115, 44, 102, 97, 99|...],
FinalLine = 'james,facebook,intel,samsung',
LineTerm = (james, facebook, intel, samsung).
You see the problem here: LineTerm is not quite a list, but a nested term that uses the functor ',' to separate elements:
?- atom_codes('james,facebook,intel,samsung', Line),
   atom_codes(FinalLine, Line),
   term_to_atom(LineTerm, FinalLine),
   write_canonical(LineTerm).
','(james,','(facebook,','(intel,samsung)))
Line = [106, 97, 109, 101, 115, 44, 102, 97, 99|...],
FinalLine = 'james,facebook,intel,samsung',
LineTerm = (james, facebook, intel, samsung).
This ','(james,','(facebook,','(intel,samsung))) term will thus also be in the final result, just written differently: (james,facebook,intel,samsung) and packed into a list:
[(james,facebook,intel,samsung)]
You do not want this term, you want a list. You could use atomic_list_concat/2 to create a new atom that can be read as a list:
?- atom_codes('james,facebook,intel,samsung', Line),
   atom_codes(FinalLine, Line),
   atomic_list_concat(['[', FinalLine, ']'], ListyAtom),
   term_to_atom(LineTerm, ListyAtom),
   LineTerm = [V1, V2, V3, V4].
Line = [106, 97, 109, 101, 115, 44, 102, 97, 99|...],
FinalLine = 'james,facebook,intel,samsung',
ListyAtom = '[james,facebook,intel,samsung]',
LineTerm = [james, facebook, intel, samsung],
V1 = james,
V2 = facebook,
V3 = intel,
V4 = samsung.
But that's rather barbaric.
We must do this whole processing in fewer steps:
Read a line of comma-separated strings on input.
Transform this into a list of either atoms or strings directly.
DCGs seem like the correct solution. Maybe someone can add a two-liner; a non-DCG sketch follows below.
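In the meantime, here is a minimal non-DCG sketch using the SWI-Prolog built-ins split_string/4 and atom_string/2 (note that SWI-Prolog also ships library(csv), whose csv_read_file/3 handles quoting and other real-world CSV details):

:- use_module(library(readutil)).

% Split one line on commas into a list of atoms.
line_atoms(Line, Atoms) :-
    split_string(Line, ",", "", Strings),
    maplist(atom_string, Atoms, Strings).

stream_representations(Input, Lines) :-
    read_line_to_string(Input, Line),
    (   Line == end_of_file
    ->  Lines = []
    ;   line_atoms(Line, Atoms),
        Lines = [Atoms | FurtherLines],
        stream_representations(Input, FurtherLines)
    ).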

Yii2 eager loading with subquery instead of array with id's

For a big application I am using the following query to get all projects with relations:
$project_query = Project::find()
    ->with(['category', 'deliveryTickets', 'garbagePerProjects', 'hourLogs',
            'materialPerProjects', 'employeePerProjects', 'contact', 'invoices'])
    ->where(['project.organization_id' => $this->organization_id]);
which generates, for example, the following query:
SELECT * FROM `delivery_ticket` WHERE `project_id` IN (124, 137, 147, 148, 149, 219, 222, 241, 1263, 1324, 1325, 1333, 1378, 1423, 1499, 1627, 1687, 1688, 1689, 1690, 1705, 1706, 1962, 2047, 2643, 2774, 2876, 2912, 3005, 3287, 3334, 4251, 4570, 4758, 4963, 5644, 6168, 6605, 6639, 6991, 7000, 7003, 7098, 7530, 7531, 7733, 7734, 7823, 7927, 8452, 8752, 8868, 8903, 8914, 8916, 8917, 8921, 8923, 8931, 8947, 8948, 8949, 8952, 8969, 9042, 9134, 9136, 9137, 9280, 9671, 10262, 10272, 10712, 10730, 11436, 11459, 11520, 11641, 11774, 11776, 12028, 12178, 12323, 12831, 12884, 13050, 13478, 13479, 13595, 13651, 13716, 13946, 14431, 14447, 14523, 15303, 15343, 16269, 16270, 16491, 16513, 17950, 17951)
MySQL EXPLAIN shows that it is using range instead of eq_ref,
and therefore my page takes 3 seconds to load.
How can I turn this query into a subquery?
range is ambiguous in this situation. To investigate further, provide the output of EXPLAIN SELECT ... and perform these steps:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE "Handler%";
and look at the largest number that comes out. Possible cases:
Same as the number of rows in the table -- You need INDEX(project_id); see the one-liner after this list.
Same as the number of rows in the output (103?) -- Then it did the optimal thing, namely leapfrogging through the table rather than doing some big "range" scan as implied. As for the "3 seconds" -- that will take some more head scratching.
Some other number -- What version are you running? (This may take more investigation.) And provide SHOW CREATE TABLE delivery_ticket.
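If the first case applies, the fix is a single statement; a sketch, assuming the delivery_ticket table from the generated query (the index name is arbitrary):

ALTER TABLE delivery_ticket ADD INDEX idx_project_id (project_id);

With that index, each id in the IN list should become a cheap index lookup rather than part of one big scan.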

Get JSON's attribute value in Chatterbot and Django integration

statement.text in the ChatterBot and Django integration returns
{'text': u'How are you doing?', 'created_at': datetime.datetime(2017, 2, 20, 7, 37, 30, 746345, tzinfo=<UTC>), 'extra_data': {}, 'in_response_to': [{'text': u'Hi', 'occurrence': 3}]}
I want the value of the text attribute, so that it prints How are you doing?
ChatterBot returns a JSON object (a Python dict), so you can use ordinary dictionary operations, like the following:
In [1]: data = {'text': u'How are you doing?', 'created_at': datetime.datetime(2017, 2, 20, 7, 37, 30, 746345, tzinfo=<UTC>), 'extra_data': {}, 'in_response_to': [{'text': u'Hi', 'occurrence': 3}]}
In [2]: data['text']  # or data.get('text'), which is the safer approach
What you got is a dictionary. The value of a dictionary entry can be obtained with the get() method. You can also use data['text'], but that performs no error checking: it raises a KeyError when the key is missing, whereas get() returns None if the key is not present.
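A minimal sketch of the difference (the 'reply' key and the default value are just illustrative):

data = {'text': u'How are you doing?'}

print(data['text'])              # 'How are you doing?'; raises KeyError if missing
print(data.get('text'))          # same value, but None if the key were missing
print(data.get('reply', 'N/A'))  # a default can be supplied instead of None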