Checking against Nltk POS tags - nltk

I am just learning nltk using Python. I am using POS tagging. What I want to know is how do I use the tags. For example, this is the pseudocode:
import nltk
words = []
teststr = "George did well in the test."
tokens = nltk.word_tokenize(teststr)
words = nltk.pos_tag(tokens)
I want to do something like this:
if words[i] == "proper noun":
    # do something
How do I check whether a word is a noun or a verb or any other part of speech.
Can someone please help me out here?
Thanks.

If you look at the result of your pos_tag call, you get back the following list:
[('George', 'NNP'), ('did', 'VBD'), ('well', 'RB'), ('in', 'IN'), ('the', 'DT'), ('test', 'NN'), ('.', '.')]
To do something when a word is a proper noun while iterating through the list, compare the second element of each entry against the tag:
if words[i][1] == 'NNP':
    # do something
NNP is the tag for a singular proper noun. Each entry in the list is a tuple, with the word as the first value and its POS tag as the second.
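To check for broader categories such as "any noun" or "any verb", one option is to match on the tag prefix, since the Penn Treebank tags that pos_tag returns by default group nouns under NN* and verbs under VB*. The sketch below hard-codes the tagged list from above so it runs without NLTK installed; words_with_tag is a hypothetical helper name, not part of NLTK.

```python
# Output of nltk.pos_tag() copied from the answer above, so this runs without NLTK.
tagged = [('George', 'NNP'), ('did', 'VBD'), ('well', 'RB'), ('in', 'IN'),
          ('the', 'DT'), ('test', 'NN'), ('.', '.')]


def words_with_tag(tagged_words, prefix):
    """Return the words whose POS tag starts with the given prefix (hypothetical helper)."""
    return [word for word, tag in tagged_words if tag.startswith(prefix)]


proper_nouns = words_with_tag(tagged, 'NNP')  # matches NNP and NNPS
nouns = words_with_tag(tagged, 'NN')          # matches NN, NNS, NNP, NNPS
verbs = words_with_tag(tagged, 'VB')          # matches VB, VBD, VBG, VBN, VBP, VBZ
print(proper_nouns, nouns, verbs)
```

Unpacking each tuple as (word, tag) in the loop avoids the words[i][1] indexing entirely.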

dumping list to JSON file creates list within a list [["x", "y","z"]], why?

I want to append multiple list items to a JSON file, but it creates a list within a list, so I cannot access the list from Python. Since the code overwrites the existing data in the JSON file, there should not be any list there already. I also tried having just plain text in the file, without brackets. It still creates a list within a list, so I get [["x", "y","z"]] instead of ["x", "y","z"].
import json
filename = 'vocabulary.json'
print("Reading %s" % filename)
try:
    with open(filename, "rt") as fp:
        data = json.load(fp)
    print("Data: %s" % data)  # check
except IOError:
    print("Could not read file, starting from scratch")
    data = []
# Add some data
TEMPORARY_LIST = []
new_word = input("give new word: ")
TEMPORARY_LIST.append(new_word.split())
print(TEMPORARY_LIST)  # check
data = TEMPORARY_LIST
print("Overwriting %s" % filename)
with open(filename, "wt") as fp:
    json.dump(data, fp)
Example run and output when appending the list of split words:
Reading vocabulary.json
Data: [['my', 'dads', 'house', 'is', 'nice']]
give new word: but my house is nicer
[['but', 'my', 'house', 'is', 'nicer']]
Overwriting vocabulary.json
So, if I understand what you are trying to accomplish correctly, it looks like you are trying to overwrite a list in a JSON file with a new list created from user input. For easiest data manipulation, set up your JSON file in dictionary form:
{
"words": [
"my",
"dad's",
"house",
"is",
"nice"
]
}
You should then set up functions to separate your functionality to make it more manageable:
def load_json(filename):
    with open(filename, "r") as f:
        return json.load(f)
Now, we can use those functions to load the JSON, access the words list, and overwrite it with the new word.
data = load_json("vocabulary.json")
new_word = input("Give new word: ").split()
data["words"] = new_word
write_json("vocabulary.json", data)
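The write_json() function is called above but its definition is not shown. A minimal sketch of what it could look like, assuming it simply mirrors load_json() and dumps the whole dictionary back to the file (the indent=4 setting is an assumption chosen to match the formatted JSON shown in this answer):

```python
import json


def write_json(filename, data):
    """Overwrite the file with the given data as formatted JSON (assumed counterpart to load_json)."""
    with open(filename, "w") as f:
        json.dump(data, f, indent=4)
```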
If the user inputs "but my house is nicer", the JSON file will look like this:
{
"words": [
"but",
"my",
"house",
"is",
"nicer"
]
}
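As for why the original code produced [["x", "y", "z"]]: str.split() already returns a list, and list.append() adds that whole list as a single element. list.extend() adds the elements one by one instead. A small sketch of the difference (both function names are hypothetical, used only for illustration):

```python
def add_words_nested(existing, new_input):
    """Reproduces the nesting: append() adds the split list as ONE element."""
    result = list(existing)
    result.append(new_input.split())
    return result


def add_words_flat(existing, new_input):
    """extend() adds each split word as its own element."""
    result = list(existing)
    result.extend(new_input.split())
    return result


print(add_words_nested([], "x y z"))
print(add_words_flat([], "x y z"))
```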
Edit
Okay, I have a few suggestions to make before I get into solving the issue. Firstly, it's great that you have delegated much of the program's functionality to separate functions. However, using global variables is generally discouraged because it makes debugging difficult: any function that uses the variable could have mutated it by accident. To fix this, use function parameters and return values to pass the data around. In small programs like this, you can think of main() as the hub through which all data flows: main() passes data to other functions and receives new or edited data back. One final recommendation: only use all-capital names for variables that are meant to be constants. For example, PI = 3.14159 is a constant, so by convention its name is written in all caps.
Without using global, main() will look much cleaner:
def main():
    choice = input("Do you want to start or manage the list? (start/manage)")
    if choice == "start":
        data = load_json("vocabulary.json")  # load_json() takes the filename
        words = data["words"]
        dictee(words)
    elif choice == "manage":
        manage_list()
You can use the load_json() function from earlier (notice that I deleted write_json(), more on that later) if the user chooses to start the game. If the user chooses to manage the file, we can write something like this:
def manage_list():
    choice = input("Do you want to add or clear the list? (add/clear)")
    if choice == "add":
        words_to_add = get_new_words()
        add_words("vocabulary.json", words_to_add)
    elif choice == "clear":
        clear_words("vocabulary.json")
We get the user input first and then we can call two other functions, add_words() and clear_words():
def add_words(filename, words):
    with open(filename, "r+") as f:
        data = json.load(f)
        data["words"].extend(words)
        f.seek(0)  # go back to the start before rewriting
        json.dump(data, f, indent=4)
        f.truncate()  # drop any leftover bytes from the old content
def clear_words(filename):
    with open(filename, "w+") as f:
        data = {"words": []}
        json.dump(data, f, indent=4)
I did not use the load_json() function in the two functions above, because that would open the file more times than needed, which would hurt performance. In these two functions we already have to open the file, so it is okay to load the JSON data there with a single line: data = json.load(f). You may also notice that in add_words() the file mode is "r+", the basic mode for reading and writing an existing file. clear_words() uses "w+", which also opens the file for reading and writing but truncates it if it already exists; that is why we don't need to load the JSON data in clear_words(). Because these two functions handle writing and overwriting the data, we don't need the write_json() function that I had initially suggested.
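One detail about "r+" worth keeping in mind: the mode does not truncate the file, so if the rewritten content is shorter than what was there before, stale bytes remain at the end unless f.truncate() is called after the dump. A minimal sketch, where remove_word() is a hypothetical helper that is not part of the original program:

```python
import json


def remove_word(filename, word):
    """Rewrite the JSON file in place without the given word (hypothetical helper)."""
    with open(filename, "r+") as f:
        data = json.load(f)
        data["words"] = [w for w in data["words"] if w != word]
        f.seek(0)                     # rewind before writing the new content
        json.dump(data, f, indent=4)
        f.truncate()                  # cut off whatever is left of the old, longer content
```

Without the truncate() call, the file would end with the tail of the old JSON and could no longer be parsed.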
We can then add to the list like so:
>>> Do you want to start or manage the list? (start/manage)manage
>>> Do you want to add or clear the list? (add/clear)add
>>> Please enter the words you want to add, separated by spaces: these are new words
And the JSON file becomes:
{
"words": [
"but",
"my",
"house",
"is",
"nicer",
"these",
"are",
"new",
"words"
]
}
We can then clear the list like so:
>>> Do you want to start or manage the list? (start/manage)manage
>>> Do you want to add or clear the list? (add/clear)clear
And the JSON file becomes:
{
"words": []
}
Great! Now, we implemented the ability for the user to manage the list. Let's move on to creating the functionality for the game: dictee()
You mentioned that you want to randomly select an item from a list and remove it from that list so it doesn't get asked twice. There are a multitude of ways you can accomplish this. For example, you could use random.shuffle:
import random
def dictee(words):
    correct = 0
    incorrect = 0
    random.shuffle(words)
    for word in words:
        # ask word
        # evaluate response
        # increment correct/incorrect
        pass
    # ask if you want to play again
random.shuffle shuffles the list in place. You can then iterate through the list using for word in words: and start the game. You don't need random.choice here, because iterating through a shuffled list already gives you the words in random order without repeats.
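The skeleton above could be filled in along these lines. How a word is "asked" is not specified, so this sketch assumes the simplest possible rule: the word is shown and the typed answer must match it exactly. The ask parameter is an assumption added so the function can be exercised without a keyboard.

```python
import random


def dictee(words, ask=input):
    """Ask every word once in random order; return (correct, incorrect).

    Assumption: an answer counts as correct when it equals the word itself.
    """
    words = list(words)        # copy, so the caller's list is not reordered
    random.shuffle(words)
    correct = 0
    incorrect = 0
    for word in words:
        answer = ask("Type the word '%s': " % word)
        if answer.strip() == word:
            correct += 1
        else:
            incorrect += 1
    return correct, incorrect
```

Returning the two counters instead of printing them keeps the function easy to call from main() and easy to test.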
I hope this helped illustrate how powerful functions and function parameters are. They not only help you separate your code, but also make it easier to manage, understand, and write cleaner code.

How do I search for a string in this JSON with Python

My JSON file looks something like:
{
    "generator": {
        "name": "Xfer Records Serum",
        ....
    },
    "generator": {
        "name": "Lennar Digital Sylenth1",
        ....
    }
}
I ask the user for a search term, and the input is searched for in the name key only. All matching results are returned, so if I enter just 's', both of the entries above should be returned. Please also explain how to return the names of all objects that are generators. The simpler the method, the better. I use the json library, but if another library is required, that's not a problem.
Before switching to JSON I tried XML but it did not work.
If your goal is just to search all name properties, this will do the trick:
import re
def search_names(term, lines):
    # Raw strings keep the regex escapes (\s) from being treated as string escapes.
    name_search = re.compile(r'\s*"name"\s*:\s*"(.*' + term + r'.*)",?$', re.I)
    return [x.group(1) for x in [name_search.search(y) for y in lines] if x]
with open('path/to/your.json') as f:
    lines = f.readlines()
print(search_names('s', lines))
which would return both names you listed in your example.
The search_names() function builds a regular expression that matches any line starting with "name": " (with any amount of whitespace), followed by your search term with any other characters around it, terminated by " and an optional , at the end of the line. It applies that expression to each line from the file, filters out the non-matching lines, and returns the value of the name property (the contents of the capture group) for each match.
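If you would rather stay with the json library, note that json.load() keeps only the last value when an object repeats a key, as the two "generator" entries do. One way around that is the object_pairs_hook argument, which receives every key/value pair of every object as it is decoded. The sketch below assumes the elided "...." parts are ordinary key/value pairs; find_names is a hypothetical function name.

```python
import json


def find_names(text, term):
    """Return every "name" value in the JSON text that contains term (case-insensitive)."""
    names = []

    def collect(pairs):
        # Called once per decoded object with ALL its pairs, duplicates included.
        for key, value in pairs:
            if key == "name" and isinstance(value, str):
                names.append(value)
        return dict(pairs)

    json.loads(text, object_pairs_hook=collect)
    return [name for name in names if term.lower() in name.lower()]


sample = """
{
    "generator": {"name": "Xfer Records Serum"},
    "generator": {"name": "Lennar Digital Sylenth1"}
}
"""
print(find_names(sample, "s"))
```

Calling find_names(sample, "") returns every name, which covers listing all generators. Storing the generators in a JSON array instead of under repeated keys would make the file easier to work with.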

Counting occurrences of a list item from a list?

(See edit at the bottom of this post)
I'm making a program in Elixir that counts the types of HTML tags from a list of tags that I've already obtained. This means that the key should be the tag and the value should be the count.
e.g. in the following sample file
<html><head><body><sometag><sometag><sometag2><sometag>
My output should be something like the following:
html: 1
head: 1
body: 1
sometag: 3
sometag2: 1
Here is my code:
def tags(page) do
  taglist = Regex.scan(~r/<[a-zA-Z0-9]+/, page)
  dict = Map.new()
  Enum.map(taglist, fn(x) ->
    tag = String.to_atom(hd(x))
    Map.put_new(dict, tag, 1)
  end)
end
I know I should be probably using Enum.each instead but when I do that my dictionary ends up just being empty instead of incorrect.
With Enum.map, this is the output I receive:
iex(15)> A3.test
[%{"<html" => 1}, %{"<body" => 1}, %{"<p" => 1}, %{"<a" => 1}, %{"<p" => 1},
%{"<a" => 1}, %{"<p" => 1}, %{"<a" => 1}, %{"<p" => 1}, %{"<a" => 1}]
As you can see, there are duplicate entries and it's turned into a list of dictionaries. For now I'm not even trying to get the count working, so long as the dictionary doesn't duplicate entries (which is why the value is always just "1").
Thanks for any help.
EDIT: ------------------
Okay so I figured out that I need to use Enum.reduce
The following code produces the output I'm looking for (for now):
def tags(page) do
  rawTagList = Regex.scan(~r/<[a-zA-Z0-9]+/, page)
  tagList = Enum.map(rawTagList, fn(tag) -> String.to_atom(hd(tag)) end)
  Enum.reduce(tagList, %{}, fn(tag, acc) ->
    Map.put_new(acc, tag, 1)
  end)
end
Output:
%{"<a": 1, "<body": 1, "<html": 1, "<p": 1}
Now I have to complete the challenge of actually counting the tags as I go...If anyone can offer any insight on that I'd be grateful!
First of all, it is not the best idea to parse HTML with regexes. See this question for more details (especially the accepted answer).
Secondly, you are trying to write imperative code in a functional language (this applies to the first version of your code). Data in Elixir is immutable, so dict will always be an empty map. Enum.map takes a list and always returns a new list of the same length with every element transformed. Your transformation function takes the empty map and puts one key-value pair into it.
As a result, you get a list of one-element maps. The line:
Map.put_new(dict, tag, 1)
doesn't update dict in place; it creates a new map from the old one, which is empty. In your example it is exactly the same as:
%{tag => 1}
You have a couple of options to do it differently. The closest approach is to use Enum.reduce, which takes a list, an initial accumulator, and a function elem, acc -> new_acc.
taglist
|> Enum.reduce(%{}, fn(tag, acc) -> Map.update(acc, tag, 1, &(&1 + 1)) end)
It looks a little complicated because it uses a couple of pieces of syntactic sugar. taglist |> Enum.reduce(%{}, fun) is the same as Enum.reduce(taglist, %{}, fun). &(&1 + 1) is shorthand for fn(counter) -> counter + 1 end.
Map.update takes four arguments: the map to update, the key to update, the initial value to insert if the key doesn't exist, and a function that is applied to the existing value if the key does exist.
So, those two lines of code do this:
iterate over the list: Enum.reduce
starting with an empty map: %{}
take the current element and the map, fn(tag, acc), and either:
  if the key doesn't exist, insert 1
  if it exists, increment it by one: &(&1 + 1)
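For readers more at home in Python, the same fold can be sketched with functools.reduce. This is only an analogue of the Elixir code above, not a replacement for it; count_tags and step are hypothetical names, and the regex captures the tag name without the leading "<" so the keys match the desired output in the question.

```python
import re
from functools import reduce


def count_tags(page):
    """Count tag names by folding over the list, building a new dict at each step."""
    tags = re.findall(r"<([a-zA-Z0-9]+)", page)

    def step(acc, tag):
        # Like Map.update(acc, tag, 1, &(&1 + 1)): insert 1 or increment, without mutating acc.
        return {**acc, tag: acc.get(tag, 0) + 1}

    return reduce(step, tags, {})


print(count_tags("<html><head><body><sometag><sometag><sometag2><sometag>"))
```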

python parsing json without key

This feels like an easy question but I can't seem to find the answer.
I'm working in python 3.3 and have something I retrieved from a JSON that looks like this:
some_list = json_response['key']
# some_list == {'a':'b','c':'d','e':'f'}{'g':'h','i':'j','k':'l'}
I've been trying to access each {} on its own with no success.
What would be the easiest way to go about doing this?
Thanks in advance.
Assuming the value of some_list is the string {'a':'b', 'c':'d','e':'f'}{'g':'h', 'i':'j', 'k':'l'} and you cannot change how it is written/dumped into json_response['key'], you can replace }{ with },{ so that the dictionaries are delimited, and load the string with literal_eval():
>>> from ast import literal_eval
>>> some_list = "{'a':'b', 'c':'d','e':'f'}{'g':'h', 'i':'j', 'k':'l'}"
>>> literal_eval(some_list.replace("}{", "},{").strip())
({'a': 'b', 'c': 'd', 'e': 'f'}, {'i': 'j', 'k': 'l', 'g': 'h'})
The result is a tuple of dictionaries.
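If the value turns out to be valid JSON (double-quoted strings) rather than Python literals, another option is json.JSONDecoder.raw_decode(), which parses one object from a given position and reports where it stopped, so back-to-back objects can be read one at a time without rewriting the string. A minimal sketch under that assumption; iter_concatenated is a hypothetical name.

```python
import json


def iter_concatenated(text):
    """Yield each JSON value from a string of back-to-back JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode() does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


for item in iter_concatenated('{"a": "b", "c": "d"}{"g": "h", "i": "j"}'):
    print(item)
```

Unlike the replace("}{", "},{") approach, this does not break when "}{" happens to appear inside a string value.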
One of the problems I had with the value was that, much of the time, I couldn't tell whether it was JSON or not.
In the end I took the roundabout way of casting it to a string, so that I know what it is, and just using a regex.

access leaves of json tree

I have a JSON file of the form:
{"id":442500000116137984, "reply":0, "children":[{"id":442502378957201408, "reply":0, "children":[]}]}
{"id":442500001084612608, "reply":0, "children":[{"id":442500145871990784, "reply":1, "children":[{"id":442500258421952512, "reply":1, "children":[]}]}]}
{"id":442500000258342912, "reply":0, "children":[{"id":442500636668489728, "reply":0, "children":[]}]}
In this each line refers to a separate tree. Now I want to go to the leaves of every tree and do something, basically
import json
f = open("file", 'r')
for line in f:
    tree = json.loads(line)
    #somehow walk through the tree and find leaves
    if isLeaf(child):
        print "Reached Leaf"
How do I walk through this tree object to detect all leaves?
This should work.
import json
leafArray = []
def parseTree(obj):
    # A node with no children is a leaf; otherwise recurse into each child.
    if len(obj["children"]) == 0:
        leafArray.append(obj)
    else:
        for child in obj["children"]:
            parseTree(child)
f = open("file", 'r')
for line in f:
    leafArray = []  # reset the collected leaves for each tree
    tree = json.loads(line.strip())
    parseTree(tree)
    print("")
    for each in leafArray:
        print(each)
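The same walk can also be written without the shared leafArray by making the function a generator. This is a sketch that assumes Python 3.3+ (for yield from) and that every node has a "children" list, as in the sample lines; iter_leaves is a hypothetical name.

```python
import json


def iter_leaves(node):
    """Yield every node in the tree whose "children" list is empty."""
    if not node["children"]:
        yield node
    else:
        for child in node["children"]:
            yield from iter_leaves(child)


line = ('{"id":1, "reply":0, "children":['
        '{"id":2, "reply":0, "children":[]},'
        '{"id":3, "reply":1, "children":[{"id":4, "reply":1, "children":[]}]}]}')
tree = json.loads(line)
for leaf in iter_leaves(tree):
    print(leaf["id"])
```

Because nothing is stored globally, there is no per-line reset to forget, and the caller can stop early after the first few leaves.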
You know, I once had to deal with a lot of hypermedia objects out of JSON, so I wrote this library. The problem was that I didn't know the depths of the trees beforehand, so I needed to be able to search around and get what I called the "paths" (the set of keys/indices you would use to reach a leaf) and values.
Anyway, you can mine it for ideas (I wrote it only for Python3.3+, but here's the method inside a class that would do what you want).
The basic idea is that you walk down the tree and check the objects you encounter, and if you get more dictionaries (even inside of lists), you keep plunging deeper. I found it easier to write this as a recursive generator, mostly by subclassing collections.abc.MutableMapping and creating a class with a custom enumerate.
You keep track of the path you've taken along the way and once you get a value that doesn't merit further exploration (it's not a dict or a list), then you yield your path and the value:
def enumerate(self, path=None):
    """Iterate through the PelicanJson object yielding 1) the full path to
    each value and 2) the value itself at that path.
    """
    if path is None:
        path = []
    for k, v in self.store.items():
        current_path = path[:]
        current_path.append(k)
        if isinstance(v, PelicanJson):
            yield from v.enumerate(path=current_path)
        elif isinstance(v, list):
            for idx, list_item in enumerate(v):
                list_path = current_path[:]
                list_path.append(idx)
                if isinstance(list_item, PelicanJson):
                    yield from list_item.enumerate(path=list_path)
                else:
                    yield list_path, list_item
        else:
            yield current_path, v
Because this is exclusively for Python3, it takes advantage of things like yield from, so it won't work out of the box for you (and I certainly don't mean to offer my solution as the only one). Personally, I just got frustrated with reusing a lot of this logic in various functions, so writing this library saved me a lot of work and I could go back to doing weird things with the Hypermedia APIs I had to deal with.
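You can do something like this (I don't know Python well, so treat it as a sketch):
temp = tree  # the JSON object parsed from one line
while temp["children"]:
    temp = temp["children"][0]
temp will now be a leaf. Note that this only follows the first child at each level, so it reaches a single leaf per tree; visiting every leaf needs a recursive walk over all the children.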