render displaCy entity recognition visualization into Plotly Dash

I want to render a piece of spaCy's entity recognition visualization in a Plotly Dash app.
The HTML of the ER visualization to render is as follows:
<div class="entities" style="line-height: 2.5">
    <mark class="entities" style="background: ...">
        <span>...</span>
    </mark>
    <mark class="entities" style="background: ...">
        <span>...</span>
    </mark>
</div>
I have tried parsing the HTML with BeautifulSoup and converting it to Dash with the following code, but when I run convert_html_to_dash(html_parsed), it throws KeyError: 'style'.
html_parsed = bs.BeautifulSoup(html, 'html.parser')

def convert_html_to_dash(el, style=None):
    if type(el) == bs.element.NavigableString:
        return str(el)
    else:
        name = el.name
        style = extract_style(el) if style is None else style
        contents = [convert_html_to_dash(x) for x in el.contents]
        return getattr(html, name.title())(contents, style=style)

def extract_style(el):
    return {k.strip(): v.strip() for k, v in
            [x.split(": ") for x in el.attrs["style"].split(";")]}

Not every tag has a style attribute. For tags that don't, you are attempting to access a non-existent key in the attrs dictionary. Python's response is a KeyError.
If you use get() instead, it will return a default value instead of raising a KeyError. You can specify a default value as the second argument to get():
return {k.strip(): v.strip() for k, v in
        [x.split(': ') for x in el.attrs.get('style', '').split(';')]}
Here I have chosen the empty string as the default value.
With only this change, your code still remains somewhat brittle. What if the input does not exactly match what you expect?
For one thing, there might not be a space after the colon. Changing split(': ') to split(':') will make it work even if there is no space – if there is one it will be removed anyway since you are calling strip() after splitting.
And what if after splitting on ';' you receive something other than a key-value pair in the list? It is best to check if it is a valid pair (contains exactly one colon), and skip it otherwise.
Your code becomes:
return {k.strip(): v.strip() for k, v in
        [x.split(':') for x in el.attrs.get('style', '').split(';')
         if x.count(':') == 1]}
Note that I have opted for single quotation marks. Your code uses both kinds; it is best to pick one and stick with it.
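Putting it all together, here is a sketch of the full conversion with the hardened extract_style. The raw displaCy markup is named displacy_html here to avoid shadowing the Dash html module, and the Dash import is an assumption based on your use of html.Div-style components:
import bs4 as bs
from dash import html  # in older Dash versions: import dash_html_components as html

def extract_style(el):
    return {k.strip(): v.strip() for k, v in
            [x.split(':') for x in el.attrs.get('style', '').split(';')
             if x.count(':') == 1]}

def convert_html_to_dash(el, style=None):
    # Text nodes become plain strings
    if isinstance(el, bs.element.NavigableString):
        return str(el)
    # Map each tag to the matching Dash component: div -> html.Div, mark -> html.Mark, ...
    name = el.name
    style = extract_style(el) if style is None else style
    contents = [convert_html_to_dash(x) for x in el.contents]
    return getattr(html, name.title())(contents, style=style)

html_parsed = bs.BeautifulSoup(displacy_html, 'html.parser')
layout = convert_html_to_dash(html_parsed.find('div'))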

Related

dumping list to JSON file creates list within a list [["x", "y","z"]], why?

I want to append multiple list items to a JSON file, but it creates a list within a list, and therefore I cannot access the list from Python. Since the code overwrites the existing data in the JSON file, there should not be any list there already. I also tried it with just text in the file, without brackets. It still creates a list within a list, so [["x", "y", "z"]] instead of ["x", "y", "z"].
import json

filename = 'vocabulary.json'
print("Reading %s" % filename)
try:
    with open(filename, "rt") as fp:
        data = json.load(fp)
        print("Data: %s" % data)  # check
except IOError:
    print("Could not read file, starting from scratch")
    data = []

# Add some data
TEMPORARY_LIST = []
new_word = input("give new word: ")
TEMPORARY_LIST.append(new_word.split())
print(TEMPORARY_LIST)  # check
data = TEMPORARY_LIST

print("Overwriting %s" % filename)
with open(filename, "wt") as fp:
    json.dump(data, fp)
Example run and output when appending the list of split words:
Reading vocabulary.json
Data: [['my', 'dads', 'house', 'is', 'nice']]
give new word: but my house is nicer
[['but', 'my', 'house', 'is', 'nicer']]
Overwriting vocabulary.json
So, if I understand correctly, you are trying to overwrite a list in a JSON file with a new list created from user input. Incidentally, the nesting comes from TEMPORARY_LIST.append(new_word.split()): split() already returns a list, and appending that list to another list yields [["x", "y", "z"]]. For easiest data manipulation, set up your JSON file in dictionary form:
{
    "words": [
        "my",
        "dad's",
        "house",
        "is",
        "nice"
    ]
}
You should then set up functions to separate your functionality to make it more manageable:
def load_json(filename):
    with open(filename, "r") as f:
        return json.load(f)
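We also need a write_json() helper to write the data back out; a minimal sketch:
def write_json(filename, data):
    with open(filename, "w") as f:
        json.dump(data, f, indent=4)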
Now, we can use those functions to load the JSON, access the words list, and overwrite it with the new word.
data = load_json("vocabulary.json")
new_word = input("Give new word: ").split()
data["words"] = new_word
write_json("vocabulary.json", data)
If the user inputs "but my house is nicer", the JSON file will look like this:
{
    "words": [
        "but",
        "my",
        "house",
        "is",
        "nicer"
    ]
}
Edit
Okay, I have a few suggestions to make before I get into solving the issue. First, it's great that you have delegated much of the program's functionality to respective functions. However, using global variables is generally discouraged: it makes things extremely difficult to debug, because any function that uses a global could have mutated it by accident. To fix this, use function parameters and pass the data around accordingly. In small programs like this, you can think of main() as the point all data comes to and from: main() passes data to the other functions and receives new or edited data back. One final recommendation: only use all-capital variable names for constants. For example, PI = 3.14159 is a constant, so it is conventional to name it in all caps.
Without using global, main() will look much cleaner:
def main():
    choice = input("Do you want to start or manage the list? (start/manage)")
    if choice == "start":
        data = load_json("vocabulary.json")
        words = data["words"]
        dictee(words)
    elif choice == "manage":
        manage_list()
You can use the load_json() function from earlier (notice that I deleted write_json(), more on that later) if the user chooses to start the game. If the user chooses to manage the file, we can write something like this:
def manage_list():
    choice = input("Do you want to add or clear the list? (add/clear)")
    if choice == "add":
        words_to_add = get_new_words()
        add_words("vocabulary.json", words_to_add)
    elif choice == "clear":
        clear_words("vocabulary.json")
We get the user input first and then we can call two other functions, add_words() and clear_words():
def add_words(filename, words):
    with open(filename, "r+") as f:
        data = json.load(f)
        data["words"].extend(words)
        f.seek(0)
        json.dump(data, f, indent=4)

def clear_words(filename):
    with open(filename, "w+") as f:
        data = {"words": []}
        json.dump(data, f, indent=4)
I did not use the load_json() function in the two functions above. My reasoning is that it would open the file more times than needed, which would hurt performance. Furthermore, in these two functions we already have the file open, so it is okay to load the JSON data right there: it only takes one line, data = json.load(f). You may also notice that add_words() opens the file in "r+" mode. This is the basic mode for reading and writing. clear_words() uses "w+", which not only opens the file for reading and writing but also truncates it if it exists (which is also why we don't need to load the JSON data in clear_words()). Because these two functions handle writing and overwriting the data, we no longer need the write_json() function I initially suggested.
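One piece not shown above is get_new_words(); judging by the prompt in the transcript below, a minimal version might be:
def get_new_words():
    return input("Please enter the words you want to add, separated by spaces: ").split()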
We can then add to the list like so:
>>> Do you want to start or manage the list? (start/manage)manage
>>> Do you want to add or clear the list? (add/clear)add
>>> Please enter the words you want to add, separated by spaces: these are new words
And the JSON file becomes:
{
    "words": [
        "but",
        "my",
        "house",
        "is",
        "nicer",
        "these",
        "are",
        "new",
        "words"
    ]
}
We can then clear the list like so:
>>> Do you want to start or manage the list? (start/manage)manage
>>> Do you want to add or clear the list? (add/clear)clear
And the JSON file becomes:
{
    "words": []
}
Great! Now we've implemented the ability for the user to manage the list. Let's move on to creating the functionality for the game: dictee()
You mentioned that you want to randomly select an item from a list and remove it from that list so it doesn't get asked twice. There are a multitude of ways you can accomplish this. For example, you could use random.shuffle:
def dictee(words):
    correct = 0
    incorrect = 0
    random.shuffle(words)
    for word in words:
        # ask word
        # evaluate response
        # increment correct/incorrect
        # ask if you want to play again
        pass
random.shuffle randomly shuffles the list in place. Then you can iterate through the list using for word in words: and start the game. You don't necessarily need random.choice here, because shuffling the list and then iterating through it is essentially selecting random values without repetition.
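For illustration, here's a sketch of what the filled-in loop might look like; the prompt text and scoring are assumptions, not part of the original:
import random

def dictee(words):
    correct = 0
    incorrect = 0
    random.shuffle(words)
    for word in words:
        answer = input("Type the word '%s': " % word)  # ask word (placeholder prompt)
        if answer == word:                             # evaluate response
            correct += 1                               # increment correct/incorrect
        else:
            incorrect += 1
    print("Score: %d correct, %d incorrect" % (correct, incorrect))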
I hope this helped illustrate how powerful functions and function parameters are. They not only help you separate your code, but also make it easier to manage and understand, and help you write cleaner code.

How can I define a Raku grammar to parse TSV text?

I have some TSV data
ID Name Email
1 test test@email.com
321 stan stan@nowhere.net
I would like to parse this into a list of hashes
@entities[0]<Name> eq "test";
@entities[1]<Email> eq "stan@nowhere.net";
I'm having trouble with using the newline metacharacter to delimit the header row from the value rows. My grammar definition:
use v6;
grammar Parser {
    token TOP { <headerRow><valueRow>+ }
    token headerRow { [\s*<header>]+\n }
    token header { \S+ }
    token valueRow { [\s*<value>]+\n? }
    token value { \S+ }
}
my $dat = q:to/EOF/;
ID Name Email
1 test test@email.com
321 stan stan@nowhere.net
EOF
say Parser.parse($dat);
But this is returning Nil. I think I'm misunderstanding something fundamental about regexes in raku.
Probably the main thing that's throwing it off is that \s matches horizontal and vertical space. To match just horizontal space, use \h, and to match just vertical space, \v.
One small recommendation I'd make is to avoid including the newlines in the tokens. You might also want to use the % or %% quantifier modifiers, as they're designed for handling exactly this kind of separated-list work:
grammar Parser {
    token TOP {
        <headerRow> \n
        <valueRow>+ %% \n
    }
    token headerRow { <.ws>* %% <header> }
    token valueRow { <.ws>* %% <value> }
    token header { \S+ }
    token value { \S+ }
    token ws { \h* }
}
The result of Parser.parse($dat) for this is the following:
「ID Name Email
1 test test@email.com
321 stan stan@nowhere.net
」
 headerRow => 「ID Name Email」
  header => 「ID」
  header => 「Name」
  header => 「Email」
 valueRow => 「 1 test test@email.com」
  value => 「1」
  value => 「test」
  value => 「test@email.com」
 valueRow => 「 321 stan stan@nowhere.net」
  value => 「321」
  value => 「stan」
  value => 「stan@nowhere.net」
 valueRow => 「」
which shows us that the grammar has successfully parsed everything. However, let's focus on the second part of your question: you want the result to be available in a variable. To do that, you'll need to supply an actions class, which for this project is very simple. You just make a class whose methods match the names of your grammar's rules (very simple ones, like value/header, that don't require special processing beyond stringification, can be ignored). There are more creative/compact ways to handle this processing, but I'll go with a fairly rudimentary approach for illustration. Here's our class:
class ParserActions {
    method headerRow ($/) { ... }
    method valueRow ($/) { ... }
    method TOP ($/) { ... }
}
Each method has the signature ($/), where $/ is the regex match variable. So now, let's ask what information we want from each token. In the header row, we want each of the header values, in a row. So:
method headerRow ($/) {
    my @headers = $<header>.map: *.Str;
    make @headers;
}
Any token with a quantifier on it will be treated as a Positional, so we could also access each individual header match with $<header>[0], $<header>[1], etc. But those are match objects, so we just quickly stringify them. The make command allows other tokens to access this special data that we've created.
Our value row will look nearly identical, because the $<value> tokens are what we care about.
method valueRow ($/) {
    my @values = $<value>.map: *.Str;
    make @values;
}
When we get to the last method, we will want to create the array of hashes.
method TOP ($/) {
    my @entries;
    my @headers = $<headerRow>.made;
    my @rows = $<valueRow>.map: *.made;
    for @rows -> @values {
        my %entry = flat @headers Z @values;
        @entries.push: %entry;
    }
    make @entries;
}
Here you can see how we access the stuff we processed in headerRow() and valueRow(): you use the .made method. Because there are multiple valueRows, to get each of their made values we need to map over them. (This is a situation where I tend to write my grammar with simply <header><data>, defining data as multiple rows, but this is simple enough that it's not too bad.)
Now that we have the headers and rows in two arrays, it's simply a matter of making them an array of hashes, which we do in the for loop. The flat @x Z @y just interleaves the elements, and the hash assignment Does What We Mean, but there are other ways to build the hash you want.
Once you're done, you just make it, and then it will be available in the made of the parse:
say Parser.parse($dat, :actions(ParserActions)).made
-> [{Email => test@email.com, ID => 1, Name => test} {Email => stan@nowhere.net, ID => 321, Name => stan} {}]
It's fairly common to wrap these into a method, like
sub parse-tsv($tsv) {
    return Parser.parse($tsv, :actions(ParserActions)).made
}
That way you can just say
my @entries = parse-tsv($dat);
say @entries[0]<Name>;  # test
say @entries[1]<Email>; # stan@nowhere.net
TL;DR: you don't. Just use Text::CSV, which is able to deal with every format.
I will show how the old Text::CSV module can be put to use here:
use Text::CSV;

my $text = q:to/EOF/;
ID Name Email
1 test test@email.com
321 stan stan@nowhere.net
EOF

my @data = $text.lines.map: *.split(/\t/).list;
say @data.perl;
my $csv = csv( in => @data, key => "ID");
print $csv.perl;
The key part here is the data munging that converts the initial file into an array of arrays (in @data). It's only needed, however, because the csv command is not able to deal with strings; if the data is in a file, you're good to go.
The last line will print:
${" 1" => ${:Email("test\#email.com"), :ID(" 1"), :Name("test")}, " 321" => ${:Email("stan\#nowhere.net"), :ID(" 321"), :Name("stan")}}%
The ID field will become the key to the hash, making the whole thing a hash of hashes.
TL;DR regexs backtrack. tokens don't. That's why your pattern isn't matching. This answer focuses on explaining that, and how to trivially fix your grammar. However, you should probably rewrite it, or use an existing parser, which is what you should definitely do if you just want to parse TSV rather than learn about raku regexes.
A fundamental misunderstanding?
I think I'm misunderstanding something fundamental about regexes in raku.
(If you already know the term "regexes" is a highly ambiguous one, consider skipping this section.)
One fundamental thing you may be misunderstanding is the meaning of the word "regexes". Here are some popular meanings folk assume:
Formal regular expressions.
Perl regexes.
Perl Compatible Regular Expressions (PCRE).
Text pattern matching expressions called "regexes" that look like any of the above and do something similar.
None of these meanings are compatible with each other.
While Perl regexes are semantically a superset of formal regular expressions, they are far more useful in many ways but also more vulnerable to pathological backtracking.
While Perl Compatible Regular Expressions are compatible with Perl in the sense they were originally the same as standard Perl regexes in the late 1990s, and in the sense that Perl supports pluggable regex engines including the PCRE engine, PCRE regex syntax is not identical to the standard Perl regex used by default by Perl in 2020.
And while text pattern matching expressions called "regexes" generally do look somewhat like each other, and do all match text, there are dozens, perhaps hundreds, of variations in syntax, and even in semantics for the same syntax.
Raku text pattern matching expressions are typically called either "rules" or "regexes". The use of the term "regexes" conveys the fact that they look somewhat like other regexes (although the syntax has been cleaned up). The term "rules" conveys the fact they are part of a much broader set of features and tools that scale up to parsing (and beyond).
The quick fix
With the above fundamental aspect of the word "regexes" out of the way, I can now turn to the fundamental aspect of your "regex"'s behavior.
If we switch three of the patterns in your grammar from the token declarator to the regex declarator, your grammar works as you intended:
grammar Parser {
    regex TOP { <headerRow><valueRow>+ }
    regex headerRow { [\s*<header>]+\n }
    token header { \S+ }
    regex valueRow { [\s*<value>]+\n? }
    token value { \S+ }
}
The sole difference between a token and a regex is that a regex backtracks whereas a token doesn't. Thus:
say 'ab' ~~ regex { [ \s* a  ]+ b };  # 「ab」
say 'ab' ~~ token { [ \s* a  ]+ b };  # 「ab」
say 'ab' ~~ regex { [ \s* \S ]+ b };  # 「ab」
say 'ab' ~~ token { [ \s* \S ]+ b };  # Nil
During processing of the last pattern (that could be and often is called a "regex", but whose actual declarator is token, not regex), the \S will swallow the 'b', just as it temporarily will have done during processing of the regex in the prior line. But, because the pattern is declared as a token, the rules engine (aka "regex engine") does not backtrack, so the overall match fails.
That's what's going on in your OP.
The right fix
A better solution in general is to wean yourself from assuming backtracking behavior, because it can be slow and even catastrophically slow (indistinguishable from the program hanging) when used in matching against a maliciously constructed string or one with an accidentally unfortunate combination of characters.
Sometimes regexs are appropriate. For example, if you're writing a one-off and a regex does the job, then you're done. That's fine. That's part of the reason that / ... / syntax in raku declares a backtracking pattern, just like regex. (Then again you can write / :r ... / if you want to switch on ratcheting -- "ratchet" means the opposite of "backtrack", so :r switches a regex to token semantics.)
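A quick sketch of :r in action, mirroring the earlier token/regex examples:
say 'ab' ~~ / [ \s* \S ]+ b /;    # 「ab」 (/.../  backtracks, like regex)
say 'ab' ~~ / :r [ \s* \S ]+ b /; # Nil  (:r ratchets, like token)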
Occasionally backtracking still has a role in a parsing context. For example, while the grammar for raku generally eschews backtracking, and instead has hundreds of rules and tokens, it nevertheless still has 3 regexs.
I've upvoted @user0721090601++'s answer because it's useful. It also addresses several things that immediately seemed to me to be idiomatically off in your code, and, importantly, sticks to tokens. It may well be the answer you prefer, which will be cool.

How do I search for a string in this JSON with Python

My JSON file looks something like:
{
    "generator": {
        "name": "Xfer Records Serum",
        ....
    },
    "generator": {
        "name": "Lennar Digital Sylenth1",
        ....
    }
}
I ask the user for a search term, and the input is searched for in the name key only; all matching results are returned. That means if I input just 's', both of the entries above should be returned. Also, please explain how to return the names of all the objects that are generators. The simpler the method, the better for me. I am using the json library, but if another library is required, that's not a problem.
Before switching to JSON I tried XML, but it did not work.
If your goal is just to search all name properties, this will do the trick:
import re

def search_names(term, lines):
    # Raw strings for the pattern; re.escape keeps metacharacters in the term literal
    name_search = re.compile(r'\s*"name"\s*:\s*"(.*' + re.escape(term) + r'.*)",?$', re.I)
    return [x.group(1) for x in [name_search.search(y) for y in lines] if x]

with open('path/to/your.json') as f:
    lines = f.readlines()

print(search_names('s', lines))
which would return both names you listed in your example.
The search_names() function builds a regular expression that matches any line starting with "name": (with a varying amount of whitespace), followed by your search term surrounded by any other characters, then terminated by a double quote, an optional comma, and the end of the line. It applies that pattern to each line from the file, filters out the non-matching lines, and returns the value of the name property (the capture-group contents) for each match.
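Worth noting: the snippet in the question is not valid JSON, because an object cannot repeat the "generator" key; json.loads will silently keep only the last duplicate, which is why a line-based regex search is used here at all. A quick demonstration:
import json

raw = '{"generator": {"name": "Xfer Records Serum"}, "generator": {"name": "Lennar Digital Sylenth1"}}'
print(json.loads(raw))
# {'generator': {'name': 'Lennar Digital Sylenth1'}} -- the first "generator" is silently dropped
If you can restructure the file as a list of objects ([{...}, {...}]), the json library alone will do the job.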

How to send Markdown to an API

I'm trying to send some Markdown text to a REST API. I just figured out that line breaks are not accepted in JSON.
For example, how do I send this to my API:
An h1 header
============
Paragraphs are separated by a blank line.
2nd paragraph. *Italic*, **bold**, and `monospace`. Itemized lists
look like:
* this one
* that one
* the other one
Note that --- not considering the asterisk --- the actual text
content starts at 4-columns in.
> Block quotes are
> written like so.
>
> They can span multiple paragraphs,
> if you like.
Use 3 dashes for an em-dash. Use 2 dashes for ranges (ex., "it's all
in chapters 12--14"). Three dots ... will be converted to an ellipsis.
Unicode is supported. ☺
as
{
    "body": " (the markdown) "
}
As you're trying to send it to a REST API endpoint, I'll assume you're looking for a way to do it in JavaScript (since you didn't specify what tech you're using).
Rule of thumb: unless your goal is to re-build a JSON builder, use the ones that already exist.
And, guess what, JavaScript has its JSON tools built in (see the documentation).
As shown in the documentation, you can use the JSON.stringify function to convert an object, such as a string, into a JSON-compliant encoded string that can later be decoded on the server side.
This example illustrates how to do so:
var arr = {
    text: "This is some text"
};
var json_string = JSON.stringify(arr);
// Result is:
// '{"text":"This is some text"}'
// Now json_string contains a JSON-compliant encoded string.
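To actually send the encoded string to your endpoint, you could use fetch; the URL here is a placeholder:
var markdownText = "An h1 header\n============\n\nParagraphs are separated by a blank line.";
fetch("https://example.com/api/posts", {          // placeholder endpoint
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ body: markdownText })  // "body" key as in the question
});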
You can also decode JSON client-side with JavaScript using the other method, JSON.parse() (see the documentation):
var json_string = '{"text":"This is some text"}';
var arr = JSON.parse(json_string);
// Now arr is an object containing the value
// "This is some text" under the key "text".
If that doesn't answer your question, please edit it to make it more precise, especially on what tech you're using, and I'll edit this answer accordingly.
You need to replace the line endings with \n and then pass the result in your body key.
Also, make sure you escape double quotes (") as \", or else your body will end there.
# An h1 header\n============\n\nParagraphs are separated by a blank line.\n\n2nd paragraph. *Italic*, **bold**, and `monospace`. Itemized lists\nlook like:\n\n * this one\n * that one\n * the other one\n\nNote that --- not considering the asterisk --- the actual text\ncontent starts at 4-columns in.\n\n> Block quotes are\n> written like so.\n>\n> They can span multiple paragraphs,\n> if you like.\n\nUse 3 dashes for an em-dash. Use 2 dashes for ranges (ex., \"it's all\nin chapters 12--14\"). Three dots ... will be converted to an ellipsis.\nUnicode is supported.
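For comparison, if you're building the request body in Python rather than JavaScript, json.dumps performs all of this escaping automatically (a sketch):
import json

markdown_text = 'An h1 header\n============\n\nUse 2 dashes for ranges (ex., "it\'s all in chapters 12--14").'
payload = json.dumps({"body": markdown_text})
print(payload)  # newlines become \n and double quotes become \" in the output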

access leaves of json tree

I have a JSON file of the form:
{"id":442500000116137984, "reply":0, "children":[{"id":442502378957201408, "reply":0, "children":[]}]}
{"id":442500001084612608, "reply":0, "children":[{"id":442500145871990784, "reply":1, "children":[{"id":442500258421952512, "reply":1, "children":[]}]}]}
{"id":442500000258342912, "reply":0, "children":[{"id":442500636668489728, "reply":0, "children":[]}]}
Each line here refers to a separate tree. Now I want to go to the leaves of every tree and do something, basically:
import json

f = open("file", 'r')
for line in f:
    tree = json.loads(line)
    # somehow walk through the tree and find leaves
    if isLeaf(child):
        print("Reached Leaf")
How do I walk through this tree object to detect all leaves?
This should work.
import json

f = open("file", 'r')
leafArray = []

def parseTree(obj):
    # Collect nodes with no children; recurse into the rest
    if len(obj["children"]) == 0:
        leafArray.append(obj)
    else:
        for child in obj["children"]:
            parseTree(child)

for line in f:
    leafArray = []
    tree = json.loads(line.strip())
    parseTree(tree)
    print("")
    for each in leafArray:
        print(each)
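If you'd rather avoid the module-level list, the same traversal can be written as a recursive generator (a sketch, Python 3):
import json

def leaves(obj):
    # Yield every node whose "children" list is empty
    if not obj["children"]:
        yield obj
    else:
        for child in obj["children"]:
            yield from leaves(child)

with open("file") as f:
    for line in f:
        for leaf in leaves(json.loads(line)):
            print(leaf)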
You know, I once had to deal with a lot of hypermedia objects out of JSON, so I wrote this library. The problem was that I didn't know the depths of the trees beforehand, so I needed to be able to search around and get what I called the "paths" (the set of keys/indices you would use to reach a leaf) and values.
Anyway, you can mine it for ideas (I wrote it only for Python 3.3+, but here's the method inside a class that would do what you want).
The basic idea is that you walk down the tree and check the objects you encounter and if you get more dictionaries (even inside of lists), you keep plunging deeper (I found it easier to write it as a recursive generator mostly by subclassing collections.MutableMapping and creating a class with a custom enumerate).
You keep track of the path you've taken along the way and once you get a value that doesn't merit further exploration (it's not a dict or a list), then you yield your path and the value:
def enumerate(self, path=None):
    """Iterate through the PelicanJson object yielding 1) the full path to
    each value and 2) the value itself at that path.
    """
    if path is None:
        path = []
    for k, v in self.store.items():
        current_path = path[:]
        current_path.append(k)
        if isinstance(v, PelicanJson):
            yield from v.enumerate(path=current_path)
        elif isinstance(v, list):
            for idx, list_item in enumerate(v):
                list_path = current_path[:]
                list_path.append(idx)
                if isinstance(list_item, PelicanJson):
                    yield from list_item.enumerate(path=list_path)
                else:
                    yield list_path, list_item
        else:
            yield current_path, v
Because this is exclusively for Python 3, it takes advantage of things like yield from, so it won't work out of the box for you (and I certainly don't mean to offer my solution as the only one). Personally, I just got frustrated with reusing a lot of this logic in various functions, so writing this library saved me a lot of work and let me get back to doing weird things with the Hypermedia APIs I had to deal with.
You can do something like this (I don't know the syntax of Python):
temp = tree  # your JSON object from each line
while (temp.children != []) {
    temp = temp.children;
}
Your temp will now be the leaf.
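In actual Python, children is a list of nodes, so the loop has to pick one child to follow, e.g. the first (this reaches a single leaf; the recursive approach above finds all of them):
temp = tree
while temp["children"]:
    temp = temp["children"][0]  # descend into the first child
# temp is now one leaf of the tree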