Remove stopwords with nltk.corpus from a list of lists - nltk

I have a list of lists containing the separated words of a review, which looks like this:
texts = [['fine','for','a','night'],['it','was','good']]
I want to remove all stopwords, using the nltk.corpus package, and put the remaining words back into the list. The end result should be a list consisting of lists of words without stopwords. This is what I tried:
import nltk
nltk.download()  # to download the stopwords corpus
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
words_reviews = []
for review in texts:
    wr = []
    for word in review:
        if word not in stopwords:
            wr.append(word)
    words_reviews.append(wr)
This code actually worked, but now I get the error: AttributeError: 'list' object has no attribute 'words', referring to stopwords. I made sure that I installed all packages. What could be the problem?

The problem is that you redefine stopwords in your code:
from nltk.corpus import stopwords
stopwords=stopwords.words('english')
After the first line, stopwords is a corpus reader with a words() method. After the second line, it is a list. Proceed accordingly.
Incidentally, looking things up in a list is really slow, so you'll get much better performance if you use a set:
stopwords = set(stopwords.words('english'))
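For example, here is a minimal fixed version of the original loop, with the word set stored under a different name so the corpus reader isn't shadowed (the variable names are illustrative):

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # fetch only the stopwords corpus

# Keep the corpus reader and the word collection under different names
stop_set = set(stopwords.words('english'))  # set membership tests are O(1)

texts = [['fine', 'for', 'a', 'night'], ['it', 'was', 'good']]
words_reviews = []
for review in texts:
    words_reviews.append([word for word in review if word not in stop_set])

print(words_reviews)  # [['fine', 'night'], ['good']]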

Instead of
[word for word in text_tokens if not word in stopwords.words()]
define the set once,
all_stopwords = set(stopwords.words('english'))
and use
[word for word in text_tokens if word not in all_stopwords]

I removed the set and it worked; maybe you could try the same.

Related

NLTK reconstruct sentence from tokens

I have used NLTK to tokenise a sentence; I would now like to reconstruct the sentence into a string.
I've looked over the docs but can't see an obvious way to do this. Is this possible at all?
tokens = [token.lower() for token in tokensCorrect]
The nltk provides no such function. Whitespace is thrown away during tokenization, so there is no way to get back exactly what you started with; the whitespace might have included newlines and multiple spaces, and there's no way to get these back. The best you can do is to join the sentence into a string that looks like a normal sentence. A simple " ".join(tokens) will put a space before and after all punctuation, which looks odd:
>>> print(" ".join(tokens))
This is a sentence .
So you need to get rid of spaces before most punctuation, except for a select few like ( and `` that should have the space after them removed. Even then it's sometimes guesswork, since the apostrophe ' is sometimes used between words, sometimes before, and sometimes after. ("Nuthin' doin', y'all!") Good luck with that.
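If an approximation is good enough, a rough sketch of that cleanup with regular expressions might look like this (the patterns are illustrative heuristics, not a complete detokenizer):

import re

tokens = ['This', 'is', 'a', 'sentence', '.']
text = " ".join(tokens)
# Remove the space before common closing punctuation
text = re.sub(r'\s+([.,!?;:)\]])', r'\1', text)
# Remove the space after opening brackets
text = re.sub(r'([(\[])\s+', r'\1', text)
print(text)  # This is a sentence.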
My recommendation is to hold on to the original strings from which you tokenized the sentence, and go back to those. You don't show where your sentences come from so there's nothing more to say really.

Does nltk contain Arabic stop words? If not, how can I add them?

I tried this but it doesn't work
from nltk.corpus import stopwords
stopwords_list = stopwords.words('arabic')
print(stopwords_list)
Update [January 2018]: The nltk data repository has included Arabic stopwords since October, 2017, so this issue no longer arises. The above code will work as expected.
As of October, 2017, the nltk includes a collection of Arabic stopwords. If you ran nltk.download() after that date, this issue will not arise. If you have been a user of nltk for some time and you now lack the Arabic stopwords, use nltk.download() to update your stopwords corpus.
If you call nltk.download() without arguments, you'll find that the stopwords corpus is shown as "out of date" (in red). Download the current version that includes Arabic.
Alternately, you can simply update the stopwords corpus by running the following code once, from the interactive prompt:
>>> import nltk
>>> nltk.download("stopwords")
Note:
Looking words up in a list is really slow. Use a set, not a list. E.g.,
arb_stopwords = set(nltk.corpus.stopwords.words("arabic"))
Original answer (still applicable to languages that are not included)
Why don't you just check what the stopwords collection contains:
>>> from nltk.corpus import stopwords
>>> stopwords.fileids()
['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian',
'italian', 'norwegian', 'portuguese', 'russian', 'spanish', 'swedish',
'turkish']
So no, there's no list for Arabic. I'm not sure what you mean by "add it", but the stopwords lists are just lists of words. They don't even do morphological analysis, or other things you might want in an inflecting language. So if you have (or can put together) a list of Arabic stopwords, just put them in a set() and you're one step ahead of where you'd be if your code worked.
There's an Arabic stopword list here:
https://github.com/mohataher/arabic-stop-words/blob/master/list.txt
If you save this file with the filename arabic in the stopwords folder of your nltk_data directory (i.e. nltk_data/corpora/stopwords/arabic), you will then be able to call it with nltk using your code above, which was:
from nltk.corpus import stopwords
stopwords_list = stopwords.words('arabic')
(Note that the possible locations of your nltk_data directory can be seen by typing nltk.data.path in your Python interpreter).
You can also use alexis' suggestion to check if it is found.
Do heed his advice to convert the stopwords list to a set: stopwords_set = set(stopwords.words('arabic')), as it can make a real difference to performance.
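As a quick check that the custom file is picked up (assuming it was saved as nltk_data/corpora/stopwords/arabic as described above):

from nltk.corpus import stopwords

print('arabic' in stopwords.fileids())    # True if the file was found
arabic_stopwords = set(stopwords.words('arabic'))
print(len(arabic_stopwords))              # number of stopwords loaded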
You should use this library called Arabic-Stopwords; here is the pip command for it:
pip install Arabic-Stopwords
Once installed, it can be imported with:
import arabicstopwords.arabicstopwords as stp
It is much better than the one in nltk.
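A hedged sketch of how it might be used (the function name stp.is_stop() is taken from the package's documented API; verify it against the installed version):

import arabicstopwords.arabicstopwords as stp

tokens = ['ثم', 'ذهب', 'إلى', 'المدرسة']  # example tokens
# is_stop() is assumed from the package's documentation; it checks one word
filtered = [w for w in tokens if not stp.is_stop(w)]
print(filtered)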

Import huge data from CSV into Neo4j

When I try to import huge data into Neo4j, the import tool gives the following error (everything after "This is what I read:" is the content of the offending CSV field, not my own code):
there's a field starting with a quote and whereas it ends that quote there seems to be characters in that field after that ending quote. That isn't supported. This is what I read: 'Hello! I am trying to combine 2 variables to one variable. The variables are Public Folder Names and the ParentPath. Both can be found using Get-PublicFolder Basically I want an array of Public Folders Path and Name so I will have an array like /Engineering/NewUsers Below is my code $parentpath = Get-PublicFolder -ResultSize Unlimited -Identity """ "'
It seems that there may be some information lacking from your question, especially about the data that is being parsed, the stack trace and so on.
Anyway, I think you can get around this by changing which character is treated as the quote character. How are you calling the import tool, and which version of Neo4j are you running?
Try including the argument --quote %; here % is just an arbitrary character chosen to serve as the quote character. Would that help?
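For example, the flag might be passed like this to the 3.x batch import tool (a sketch only; the file names are placeholders, and in newer versions the equivalent command is neo4j-admin import):

bin/neo4j-import --into data/databases/graph.db \
    --nodes nodes.csv --relationships rels.csv \
    --quote %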

How to convert a glob with braces to a regex

I want to change a glob such as c{at,lif} into a regex. What would it look like? I've tried using /c[at,lif]/ but that did not work.
For basic GREP operation see, e.g., http://www.regular-expressions.info/refquick.html
From http://www.regular-expressions.info/alternation.html:
If you want to search for the literal text cat or dog, separate both options with a vertical bar or pipe symbol: cat|dog. If you want more options, simply expand the list: cat|dog|mouse|fish.
This suggests the following should work:
/c(at|lif)/
Obligatory What Was Wrong With Yours, Then:
/c[at,lif]/
The square brackets [..] are not used in GREP for grouping, but to define a character class. That is, here you create a custom class which allows one of the characters at,lif. Thus it matches ca or c, or cf -- but always only one character. Adding a repetition code c[at,lif]+ only appears to work because it will then match both cat and clif, but also cilt, calf, and c,a,t.
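To see the difference concretely, here is a small Python check (outside GREP, but the regex semantics are the same):

import re

alt = re.compile(r'c(at|lif)')   # alternation: "c" followed by "at" or "lif"
cls = re.compile(r'c[at,lif]+')  # character class: "c" plus a run of a, t, ",", l, i, f

for word in ['cat', 'clif', 'calf', 'c,a,t']:
    print(word, bool(alt.search(word)), bool(cls.search(word)))
# cat True True; clif True True; calf False True; c,a,t False True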

Sublime search with * (search multiple partials)

I would like to find the result 'the_red_cat' using Sublime search. I have enabled regex but cannot figure out how to get the result I need.
I would like to search for something along the lines of *the*ca* but it does not seem to yield any results.
The regex .*the.*ca.* finds the whole lines that contain the text. In a regex, "any number of any characters" is written .*, which is the equivalent of the glob *.
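The same pattern can be sanity-checked outside Sublime, e.g. in Python:

import re

lines = ['the_red_cat', 'the red dog', 'a cat']
pattern = re.compile(r'.*the.*ca.*')
print([line for line in lines if pattern.search(line)])  # ['the_red_cat']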