I am trying to develop a spell checker for Arabic, and I am using neuspell for this task.
My data consists of two text files, correct_text.txt and misspelled_text.txt.
The problem I am facing is that I am using the AraT5 tokenizer to tokenize the words (other tokenizers give UNK tokens for the misspelled words).
The AraT5 tokenizer instead splits a misspelled word into several subword tokens, which guarantees that I won't get UNK tokens for the misspelled words.
Example:
correct_sentence = "لا أحب أكل الطعام"
misspelled_sentence = "لا أحب أكلالطعام"
AraT5 tokenizer for the misspelled sentence: ['▁لا', '▁أحب', '▁أكل', 'ال', 'طعام']
Other tokenizers for the misspelled sentence: ['لا', 'أحب', 'UNK'] # assuming that "أكلالطعام" is not in the vocab/dataset (real-word examples)
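For reference, these tokenizations can be reproduced with the Hugging Face tokenizer API; the checkpoint name UBC-NLP/AraT5-base below is an assumption, so substitute whichever AraT5 checkpoint you actually use:

from transformers import AutoTokenizer

# load the AraT5 SentencePiece tokenizer (checkpoint name is an assumption)
tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/AraT5-base")

print(tokenizer.tokenize("لا أحب أكل الطعام"))   # correct sentence
print(tokenizer.tokenize("لا أحب أكلالطعام"))    # misspelled: the missing space yields extra subword tokens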
Now, since the AraT5 tokenizer returns more tokens for the misspelled text than for the correct text, the two sequences have different lengths. This causes an issue when training, because the correct and misspelled sentences are expected to have the same length.
What is the trick to solve this issue? I can't just expand my vocabulary to cover every possible misspelled word in a language.
I receive a CSV file from a 3rd party that I need to IMPORT into Access. They claim they are unable to add any sort of text qualifier, and all my common delimiter options (comma, tab, pipe, $, ~, ^, etc.) seem to appear in the data, so none is reliable to use in an Import Spec. I cannot edit the data, but we can adjust the delimiter. Record counts are in the 500K range x 50 columns (about 250 MB).
I tried a non-ASCII character as a delimiter (i.e., ÿ), which I can add to an Import Spec, and the sample data appears to delimit OK, but I get an error (Subscript out of Range) when attempting the actual import. I also tried a multi-character delimiter, but no go.
Any suggestions to let me receive these CSV tables? It is a daily task, with many low-skilled users, remote locations, and the import function behind a button.
Sample raw data, truncated for width (June 7; not sure if this helps the discussion):
9798ÿ9798ÿ451219417ÿ9033504ÿ9033504ÿPUNCH BIOPSY 4MM UNI-PUNCH SS SEAMLS RAZOR SHARP BLADE...
9798ÿ9798ÿ451219418ÿ1673BXÿ1673BXÿCLEANER INST 1GL KLENZYME LATEXÿSTERIS PLCÿ1673BXÿ1673BX...
9798ÿ9798ÿ451219419ÿA4823PRÿA4823PRÿBAG BIOHAZ THK1.3 MIL 24X23IN RED LDPE PRINT INF WASTE...
9798ÿ9798ÿ451219420ÿCUR9225ÿCUR9225ÿGLOVE EXAM CURAD MEDIUM LATEX FREEÿMEDLINE INDUSTRIES,...
9798ÿ9798ÿ451219421ÿCUR9226ÿCUR9226ÿGLOVE EXAM CURAD LARGE LATEX FREEÿMEDLINE INDUSTRIES, ...
9798ÿ9798ÿ451219422ÿ90176101ÿ90176101ÿDRAPE CONSUMABLE PK EQUIP OEC UROVIEW 2800 STERILE L...
Try another extended-ASCII character (128 - 254). The chosen delimiter ÿ (255) apparently doesn't work, but it's already a suspicious character since it has all bits set and sometimes has special meaning for that reason.
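To help pick a candidate, here is a small sketch (run outside Access) that scans the raw file and lists the extended-ASCII byte values that never occur, i.e. bytes that would be safe delimiter choices; the file name daily_export.csv is hypothetical:

# list extended-ASCII bytes (128-254) that never appear in the raw file
with open("daily_export.csv", "rb") as f:
    used = set(f.read())

unused = [b for b in range(128, 255) if b not in used]
print([bytes([b]).decode("cp1252", errors="replace") for b in unused])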
It's also good to consider the code page. If you're in the US using the standard English version of Windows, it's likely that Access is using the default "Western European (Windows)" (Windows-1252) code page. But if you're outside the US or have other languages installed, your default code page may treat certain characters differently. For reference, I'm using Access 2013 on Windows 10. In the Access text import wizard, clicking the [Advanced...] button shows more options, including the selection of the import code page. Since you're having problems with the import, it is worth inspecting that setting.
For the record, I had similar results as you and others using the sample data and delimiter ÿ (255).
Next I tried À (192) which is a standard letter character in various code pages, so it should likely work even if the default were not Windows-1252. Indeed, it worked on my system and resulted in no errors.
To get the import working without errors at first, I would set all fields to Short Text or Long Text before specifying integer, date, or other non-text types. If all text columns work, then try the specific field types. In this way, you can at least differentiate between delimiter errors and other data errors.
This isn't to discourage other options like fixed-width text, especially since in that case you won't have to worry about the delimiter at all.
I have a text as an in put, wh ere ther e are occassi on aly brok en wor ds.
Is there a function in NLTK or similar that could return the output as
I have a text as an input, where there are occassionaly broken words.?
You will not get everything from a single function, but you can do this with the help of the PyEnchant library to check the spelling of words. These are the steps you can follow (a rough sketch appears after the list):
Take the sentence.
Tokenize the words using the NLTK word tokenizer.
Check each word against the dictionary provided by PyEnchant.
If the word is in the dictionary, it is correct; otherwise, get suggested words for it using the suggest function provided by PyEnchant.
Compute the minimum edit distance (Levenshtein distance) between the incorrect word and each suggested word.
Take the suggestion with the minimum distance.
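Here is a minimal sketch of those steps, assuming the pyenchant and nltk packages are installed (plus the NLTK punkt tokenizer data and an "en_US" Enchant dictionary); note that it corrects individual misspelled tokens but does not by itself rejoin broken words:

import enchant
import nltk
from nltk.metrics.distance import edit_distance

def correct_words(text):
    d = enchant.Dict("en_US")
    corrected = []
    for word in nltk.word_tokenize(text):
        if not word.isalpha() or d.check(word):
            corrected.append(word)  # punctuation or an already-correct word
            continue
        suggestions = d.suggest(word)  # candidate corrections from PyEnchant
        if suggestions:
            # keep the suggestion with the smallest Levenshtein distance
            corrected.append(min(suggestions, key=lambda s: edit_distance(word, s)))
        else:
            corrected.append(word)
    return " ".join(corrected)

print(correct_words("there are occassi on aly brok en wor ds"))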
I will not say it performs efficiently, because the PyEnchant dictionary contains a lot of words that do not seem legal, but it works in some cases.
The above method uses Levenshtein distance; you can also do spell correction using n-grams or the Jaccard coefficient.
I have already implemented this task; you can check it on my GitHub: https://github.com/rameshjesswani/Semantic-Textual-Similarity/blob/master/nlp_basics/nltk/string_similarity.ipynb
I am looking for an open-source OCR (maybe Tesseract) that uses a dictionary to match words against. For example, I know that this OCR will only be used to search for certain names. Imagine I have a written master guest list and I want to scan it in under a second with the OCR and check it against a database of names.
I understand that a traditional OCR can attempt to read every letter and then I could cross-reference the results with the 100 names, but this takes too long. If the OCR were focusing on just those 100 names and nothing else, it should be able to do all this in a split second. For example, there is no point in guessing that a word might be "Jach", since "Jach" isn't a name in my database. The OCR should be able to infer that it is "Jack", since that is an actual name in the database.
Is this possible?
It should be possible. Think of it this way: instead of having your OCR look for 'J', it could look for 'Jack' directly, treating the whole word as a single symbol.
So when you train / calibrate your OCR, train it with images of whole words, just as you would with images of individual symbols.
(If this feature is not directly available in your OCR, then first map images of whole words to a unique symbol and later transform that symbol into the final word string.)
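As a complementary post-processing idea (not the whole-word training described above), you can also snap each OCR'd token to the nearest name in your database by approximate string matching; a minimal sketch, where known_names and ocr_words are hypothetical inputs:

from difflib import get_close_matches

known_names = ["Jack", "Jill", "Maria"]  # the ~100 names from the database
ocr_words = ["Jach", "Mar1a", "Bob"]     # raw OCR output

for word in ocr_words:
    match = get_close_matches(word, known_names, n=1, cutoff=0.6)
    print(word, "->", match[0] if match else "no confident match")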
I'm developing a system where keywords are extracted from plain text.
The requirements for a keyword are:
Between 1 - 45 letters long
Word must exist within the WordNet database
Must not be a "common" word
Must not be a curse word
I have fulfilled requirements 1-3; however, I can't find a way to identify curse words. How do I filter them?
I know this would not be a definitive method of filtering out all the curse words, but all keywords are first set to a "pending" state before being "approved" by a moderator. However, if I can get WordNet to filter out most of the curse words, it would make the moderator's job easier.
It's strange: the Unix command-line version of WordNet (wn) will give you the desired information with the -domn (domain) option:
wn ass -domnn (-domnv for a verb)
...
>>> USAGE->(noun) obscenity#2, smut#4, vulgarism#1, filth#4, dirty word#1
>>> USAGE->(noun) slang#2, cant#3, jargon#1, lingo#1, argot#1, patois#1, vernacular#1
However, the equivalent method in NLTK just returns empty lists:
from nltk.corpus import wordnet
a = wordnet.synsets('ass')
for s in a:
    for l in s.lemmas():
        print(l.usage_domains())
[]
[]
...
As an alternative, you could try to filter words that have "obscene", "coarse" or "slang" in their synset's definition. But it's probably much easier to filter against a fixed list as suggested before (like the one at noswearing.com).
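A minimal sketch of that definition-based filter, assuming the NLTK WordNet corpus is downloaded; the marker list is just an illustrative guess and will both miss and over-flag some words:

from nltk.corpus import wordnet

MARKERS = ("obscene", "obscenity", "vulgar", "coarse", "slang")

def looks_like_curse(word):
    # flag the word if any of its synset definitions mention a marker term
    return any(marker in syn.definition().lower()
               for syn in wordnet.synsets(word)
               for marker in MARKERS)

print(looks_like_curse("ass"))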
For the 4th requirement, it would be better and more effective to collect a list of curse words and remove them through an iterative process.
To achieve this you can check out this blog.
I will summarize it here.
1. Load the swear-words text file from here.
2. Compare it with the text and remove any word that matches.
curseWords = {'fuck'}  # assumption: the set of swear words loaded from the file in step 1

def remove_curse_words():
    text = 'Hey Bro Fuck you'
    text = ' '.join(word for word in text.split() if word.lower() not in curseWords)
    return text

print(remove_curse_words())
The output would be:
Hey Bro you
I have a set of objects that I read information out of, and that information ends up becoming a MATLAB M-file. One piece of that information ends up being a MATLAB function name. I need to remove all of the disallowed characters from that string before writing the M-file out to the filesystem. Can someone tell me which characters are allowed in a MATLAB function name?
Legal names follow the pattern [A-Za-z][A-Za-z0-9_]*, i.e. an alphabetic character followed by zero or more alphanumeric-or-underscore characters, up to NAMELENGTHMAX characters.
Since MATLAB variable and function naming rules are the same, you might find genvarname useful. It sanitizes arbitrary strings into legal MATLAB names.
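If you are building the name outside MATLAB before writing the M-file, here is a minimal Python sketch of sanitizing per that pattern; the NAMELENGTHMAX value of 63 is an assumption, so check namelengthmax in your MATLAB installation:

import re

NAMELENGTHMAX = 63  # assumption; query namelengthmax in MATLAB to be sure

def sanitize_matlab_name(raw, fallback='fcn'):
    name = re.sub(r'[^A-Za-z0-9_]', '', raw)  # drop disallowed characters
    name = re.sub(r'^[^A-Za-z]+', '', name)   # must start with a letter
    return (name or fallback)[:NAMELENGTHMAX]

print(sanitize_matlab_name('3rd-order filter!'))  # rdorderfilter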
The short answer...
Any alphanumeric characters or underscores, as long as the name starts with a letter.
The longer answer...
The MATLAB documentation has a section "Working with M-Files" that discusses naming in a little more detail. Specifically, it points out the functions NAMELENGTHMAX (the maximum number of characters in a name that MATLAB will pay attention to), ISVARNAME (to check whether a variable/function name is valid), and ISKEYWORD (to display the reserved keywords).
Edited:
this may be more informative:
http://scv.bu.edu/documentation/tutorials/MATLAB/functions.html