How to read a tagged and lemmatized corpus with NLTK?

I'm creating a corpus for my personal conlang, and I've been tagging and lemmatizing everything manually. The format is token/tag/lemma. For example, je/c/je nu/bd/nu jet/vasp3s/imi ambrós/nmns/ambrós ,/x/_ todá/bd/todá Petros/nmns/Petros est/vaii3s/esmi endó/p/endó ./x/_.
How can I read a file with this format using NLTK in Python? I tried using the Tagged Corpus Reader, but it wasn't recognizing the lemma information.
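A minimal sketch of one workaround, since the built-in tagged readers expect token/tag pairs rather than token/tag/lemma triples: parse the triples yourself (the file name and function name below are placeholders):
def read_tagged_lemmatized(path):
    """Read whitespace-separated token/tag/lemma triples, one sentence per line."""
    sentences = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            sentence = []
            for item in line.split():
                token, tag, lemma = item.split("/")   # e.g. "jet/vasp3s/imi"
                sentence.append((token, tag, lemma))
            if sentence:
                sentences.append(sentence)
    return sentences

# e.g. corpus = read_tagged_lemmatized("corpus.txt")
# corpus[0] -> [('je', 'c', 'je'), ('nu', 'bd', 'nu'), ('jet', 'vasp3s', 'imi'), ...]
From these tuples you can pull out whichever pair you need, for instance (token, tag) for training a tagger or (token, lemma) for lookups.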

Related

Does nltk contain Arabic stop words, and if not, how can I add them?

I tried this, but it doesn't work:
from nltk.corpus import stopwords
stopwords_list = stopwords.words('arabic')
print(stopwords_list)
Update [January 2018]: The nltk data repository has included Arabic stopwords since October 2017, so this issue no longer arises; the above code now works as expected.
If you ran nltk.download() after that date, you already have them. If you have been using nltk for a while and now find you lack the Arabic stopwords, use nltk.download() to update your stopwords corpus.
If you call nltk.download() without arguments, you'll find that the stopwords corpus is shown as "out of date" (in red). Download the current version that includes Arabic.
Alternately, you can simply update the stopwords corpus by running the following code once, from the interactive prompt:
>>> import nltk
>>> nltk.download("stopwords")
Note:
Looking words up in a list is really slow. Use a set, not a list. E.g.,
arb_stopwords = set(nltk.corpus.stopwords.words("arabic"))
Original answer (still applicable to languages that are not included)
Why don't you just check what the stopwords collection contains:
>>> from nltk.corpus import stopwords
>>> stopwords.fileids()
['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian',
'italian', 'norwegian', 'portuguese', 'russian', 'spanish', 'swedish',
'turkish']
So no, there's no list for Arabic. I'm not sure what you mean by "add it", but the stopwords lists are just lists of words. They don't even do morphological analysis, or other things you might want in an inflecting language. So if you have (or can put together) a list of Arabic stopwords, just put them in a set() and you're one step ahead of where you'd be if your code worked.
There's an Arabic stopword list here:
https://github.com/mohataher/arabic-stop-words/blob/master/list.txt
If you save this file with the filename arabic inside the stopwords corpus folder of your nltk_data directory (typically nltk_data/corpora/stopwords/), you will then be able to load it with nltk using your code above, which was:
from nltk.corpus import stopwords
stopwords_list = stopwords.words('arabic')
(Note that the possible locations of your nltk_data directory can be seen by typing nltk.data.path in your Python interpreter).
You can also use alexis' suggestion to check if it is found.
Do heed his advice to convert the stopwords list to a set: stopwords_set = set(stopwords.words('arabic')), as it can make a real difference to performance.
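A quick way to check everything is wired up (just a sketch; the exact nltk_data location varies by system):
import nltk
from nltk.corpus import stopwords

print(nltk.data.path)                     # directories nltk searches for nltk_data
print('arabic' in stopwords.fileids())    # True once the file is in corpora/stopwords/
arb_stopwords = set(stopwords.words('arabic'))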
You should use the Arabic-Stopwords library. Here is the pip command for it:
pip install Arabic-Stopwords
Just install it; once installed, it can be imported with:
import arabicstopwords.arabicstopwords as stp
It is much better than the one in nltk.

Taking count in Rapidminer

How can I get a row count of a list that is in a Word document? If the same list is in Excel, I can get the count using the Aggregate operator, but with a Word document it does not work.
I recommend the answer from @awchisholm as it's the easiest solution. However, if you have several Word documents this might become impractical.
In this case you can use the Loop Zip Files operator to unzip the Word document, look inside it for the file /word/document.xml, and, using RapidMiner's text functions (or Read XML), find each instance of <w:p ...>...</w:p>. Each of these represents a new line, so you can count them from there.
There is also an XML document in the unzipped directory called /docProps/app.xml; you can read this in to find some meta information about the document, such as the number of words, characters, and pages. Unfortunately, I've found it unreliable for the number of lines, which is why I recommend searching for the <w:p> tag instead.
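If it helps to see the idea outside RapidMiner, here is a rough Python sketch of the same approach (the file name is hypothetical): it unzips the .docx and counts the <w:p> elements in word/document.xml.
import re
import zipfile

# A .docx file is a zip archive; word/document.xml holds the body text,
# and each <w:p ...> element is one paragraph, i.e. one line of the list.
with zipfile.ZipFile("list.docx") as docx:          # hypothetical file name
    document_xml = docx.read("word/document.xml").decode("utf-8")

paragraph_count = len(re.findall(r"<w:p[ />]", document_xml))
print(paragraph_count)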
RapidMiner cannot easily read Word documents. You have to save the document as a text file and use the Read CSV operator to read the file.

Using NLTK RegexpParser to find subject, object, verb combinations

I'm trying to extract subject-object-verb combinations using the NLTK toolkit. This is my code so far. How would I be able to do it?
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

grammar = r"""
  NP:
    {<.*>+}                  # Chunk everything
    }<VBD|VBZ|VBP|IN>+{      # Chink sequences of verbs and prepositions
"""
cp = nltk.RegexpParser(grammar)
s = "This song is the best song in the world. I really love it."
for t in sent_tokenize(s):
    text = nltk.pos_tag(word_tokenize(t))
    print(cp.parse(text))
One approach you can try is to chunk the sentences into NPs (noun phrases) and VPs (verb phrases) and then build an RBS (rule-based system) on top of this to establish the chunk roles. For example, if the VP is in the active voice, then the subject should be the NP chunk in front of the VP; if it's in the passive voice, it should be the following NP.
You can also have a look at Pattern.en. Its parser has relation extraction included: http://www.clips.ua.ac.be/pages/pattern-en#parser
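A rough, simplified sketch of that first approach (the chunk patterns and the active-voice rule below are assumptions, not a complete grammar): chunk NPs and VPs, then treat the NP before a VP as the subject and the NP after it as the object.
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

grammar = r"""
  NP: {<DT|PRP\$>?<JJ.*>*<NN.*|PRP>+}   # optional determiner, adjectives, then nouns or a pronoun
  VP: {<MD>?<VB.*>+}                    # optional modal followed by verbs
"""
chunker = nltk.RegexpParser(grammar)

def svo_triples(sentence):
    tagged = nltk.pos_tag(word_tokenize(sentence))
    # Keep only the NP/VP chunks, in left-to-right order.
    chunks = [st for st in chunker.parse(tagged).subtrees()
              if st.label() in ("NP", "VP")]
    triples = []
    for i, chunk in enumerate(chunks):
        if chunk.label() != "VP":
            continue
        # Active-voice rule: nearest NP before the VP is the subject,
        # nearest NP after it is the object.
        subj = next((c for c in reversed(chunks[:i]) if c.label() == "NP"), None)
        obj = next((c for c in chunks[i + 1:] if c.label() == "NP"), None)
        if subj is not None and obj is not None:
            triples.append((" ".join(w for w, _ in subj.leaves()),
                            " ".join(w for w, _ in chunk.leaves()),
                            " ".join(w for w, _ in obj.leaves())))
    return triples

s = "This song is the best song in the world. I really love it."
for sent in sent_tokenize(s):
    print(svo_triples(sent))
# With the default tagger this should print something like:
# [('This song', 'is', 'the best song')]
# [('I', 'love', 'it')]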

RapidMiner Text Processing: How To write ngrams to file

How can I write ngrams extracted from Text to a new XLS or CSV file?
The process I created is shown below. I would like to know how to connect the Write Document utility and at which level. In the Main Process or in the Vector Creation? Which pipe goes where?
Screenshot Main Process:
Screenshot Vector Creation process:
Screenshot ngrams produced:
Screenshot Write Document operator:
I am using RapidMiner Studio 6.0.003 Community Edition
EDIT Solution:
There are two outputs from the Process Documents from Files operator. The top one is an example set and will correspond to the document vector generated by the operator. The bottom one is a word list that contains all the different words, including n-grams, that form the attributes within the document vector.
To write the word list to a file, you have to convert it to an example set using the WordList to Data operator. The example set that is produced can then be written to CSV or XLSX using the Write CSV or Write Excel operators.

How to use an extended ASCII character for quote in COPY in cqlsh

I am uploading data from a big .csv file into Cassandra using COPY in cqlsh.
I am using Cassandra 1.2 and CQL 3.0.
However, since " is part of my data, I have to use some other character as the quote character, ideally an extended ASCII character. I tried various approaches, but they fail.
The following works, but I need to use an extended ASCII character for my purpose:
copy <tablename> (<columnnames>) from '<filename>' with delimiter='|' and quote='"';
copy <tablename> (<columnnames>) from '<filename>' with delimiter='|' and quote='~';
When I give quote='ß', I get the error below:
:"quotechar" must be an 1-character string
Please advise on how I can use an extended ASCII character for the quote parameter.
Thanks in advance
A note on the COPY documentation page suggests that for bulk loading (as in your case), the json2sstable utility should be used. You can then load the sstables into your cluster using sstableloader. So I suggest that you write a script/program to convert your CSV to JSON and use these tools for your big CSV; JSON will have no problem handling any character in the ASCII table.
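A very rough sketch of the conversion step (the column names and file names below are made up, and the exact JSON layout json2sstable expects depends on your column family, so treat this only as the CSV-reading half of such a script):
import csv
import json

columns = ["id", "name", "comment"]   # hypothetical column names

with open("data.csv", newline="") as src, open("data.json", "w") as dst:
    # The source file is pipe-delimited and uses '~' as the quote character,
    # matching the COPY options from the question.
    reader = csv.reader(src, delimiter="|", quotechar="~")
    json.dump([dict(zip(columns, row)) for row in reader], dst, ensure_ascii=False)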
I had a similar problem, and inspected the source code of cqlsh (it's a python script). In my case, I was generating the csv with python, so it was a matter of finding the right python csv parameters.
Here's the key information from cqlsh:
csv_dialect_defaults = dict(delimiter=',', doublequote=False,
                            escapechar='\\', quotechar='"')
So if you are lucky enough to generate your .csv file from python, it's just a matter of using the csv module with:
writer = csv.writer(open("output.csv", 'w'), **csv_dialect_defaults)
Hope this helps, even if you are not using python.
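For completeness, a short usage sketch with those defaults (the rows and file name are made up); with doublequote=False and escapechar='\\', fields containing the quote character are written with a backslash escape rather than a doubled quote, which should match what that version of cqlsh expects by default:
import csv

# Same dialect settings as in cqlsh's source, used on the writing side.
csv_dialect_defaults = dict(delimiter=',', doublequote=False,
                            escapechar='\\', quotechar='"')

rows = [
    ("id-1", 'He said "hello"'),      # value containing the quote character
    ("id-2", "plain text"),
]

with open("output.csv", "w", newline="") as f:
    csv.writer(f, **csv_dialect_defaults).writerows(rows)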