fix misspelled words in a corpus without dictionary


We have a history of conversations between humans (any language, any vocabulary), so it contains a lot of spelling errors:
"hellobb do u hav skip?" => "hello baby, do you have skype?"
Before running a deep learning task against this data set (finding synonyms, etc.), I would like to fix these errors.
Is it a good idea? I've never worked with such bad quality data. Wondering if there is a "magic solution" to achieve this.
Otherwise, I plan to use:
word embeddings (word2vec) to check whether good and bad words are similar
a distance function between words
if wordA is less frequent than wordB, then fix(wordA) = wordB
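A minimal sketch of the distance-plus-frequency part of this plan, using only the standard library. The names `build_corrections`, `max_distance`, and `min_ratio` are made up for illustration, the thresholds are arbitrary, and the word2vec similarity check is left out. It compares every pair of word types, so it would need an index or a candidate generator on a large vocabulary.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def build_corrections(tokens, max_distance=1, min_ratio=5):
    """Map each rare word to the most frequent word that is close to it.

    A candidate must be at least min_ratio times more frequent than the
    rare word and at most max_distance edits away (both are assumptions).
    """
    counts = Counter(tokens)
    corrections = {}
    for word, freq in counts.items():
        best = None
        for other, other_freq in counts.items():
            if other == word or other_freq < min_ratio * freq:
                continue
            if edit_distance(word, other) <= max_distance:
                if best is None or other_freq > counts[best]:
                    best = other
        if best is not None:
            corrections[word] = best
    return corrections


# Toy corpus: frequent correct forms plus one misspelling of each.
tokens = ("hello " * 10 + "helo " + "have " * 8 + "hav " + "skype " * 6).split()
corrections = build_corrections(tokens)
print(corrections)  # {'helo': 'hello', 'hav': 'have'}
```

Note that this cannot fix real-word errors such as "skip" for "skype", because "skip" is itself a frequent, valid word; that kind of error needs context.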

At the moment there is no magic solution that guarantees fixing every misspelling in your text, but here are some options you can consider:
Dictionary-based approach. I found Hunspell very handy in this case. It combines a dictionary with affix rules and uses edit-based and n-gram similarity heuristics to suggest the correct spelling. Dictionaries are available for many natural languages, and bindings exist for many programming languages. Although it is a dictionary-based approach, it is superior to many more sophisticated approaches. It is used in widely deployed applications such as LibreOffice, Firefox, and Chrome.
Statistical and traditional approach. Another possible solution is to develop your own statistical models, such as language models. A language model trained on a large corpus, at the character level and the word level, can find many misspellings in the text. Many speech recognition systems and search engines use language modeling at their heart to fix misspellings.
Deep learning approach. If you look at NLPProgress.com, most of the state-of-the-art research uses seq2seq models to attack the grammatical error correction problem. The main intuition behind these models is to train a neural network on pairs of sentences (erroneous and corrected) so that the network learns how to fix the errors. These approaches require quite a lot of sentence pairs to give reliable results. If the available corpora do not fit your needs, you can generate your own misspellings, e.g. by corrupting some tokens in your text.
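As a rough illustration of generating your own training pairs, here is a sketch that corrupts random tokens with one character-level edit each. The function names, the three edit types, and the `error_rate` default are assumptions made for this example; real misspellings follow other patterns too (phonetic spellings, keyboard neighbours, abbreviations).

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def corrupt_word(word, rng):
    """Apply one random edit (delete, swap, or substitute) to a word."""
    if len(word) < 3:
        return word  # leave very short words alone
    i = rng.randrange(len(word) - 1)
    op = rng.choice(["delete", "swap", "substitute"])
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "swap":
        # exchange the characters at positions i and i + 1
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    return word[:i] + rng.choice(ALPHABET) + word[i + 1:]


def make_pair(sentence, error_rate=0.3, seed=None):
    """Return (noisy, clean): the seq2seq input and its target."""
    rng = random.Random(seed)
    noisy = [corrupt_word(w, rng) if rng.random() < error_rate else w
             for w in sentence.split()]
    return " ".join(noisy), sentence


noisy, clean = make_pair("hello baby do you have skype", seed=0)
print(noisy, "=>", clean)
```

Passing a fixed `seed` makes the corruption reproducible, which helps when comparing models trained on the same synthetic data.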

Related

Why no programming in English? What is the difference between natural languages and programming languages?

What is the key difference between natural languages (such as English and French) and programming languages like C++ and Perl?
I am familiar with the ambiguity problem, but can't it be solved using an interactive compiler, or by using a subset of the natural language with a strict grammar while still retaining the essence of the language?
Another issue is context. But lawyers have ways to solve this issue. (This question is not about reducing programming complexity; it's simply about the concrete reasons for, and roadblocks to, using natural languages to instruct computers.)
Is there any other significant problem besides these two? Or do these two have greater consequences than I mentioned above? Are the interactive solution and lawyers' language technically not feasible for programming?

This is an extremely interesting question and in short, yes, there are some very good reasons why we don't use English to write programs.
It's been said before that the greatest gift computer science has given us is not the ability to talk to computers, but the formal languages that now exist for describing algorithms, which give us better tools for communicating these ideas to other people even when no computer is involved. Indeed, the best software engineers see their job primarily as writing software that is readable to other people, so as to make maintenance and the addition of new features as easy as possible. This is not possible in a language as big and as free-form as a natural, spoken language.
Ambiguity
One reason is ambiguity. Have you ever looked at a menu in a restaurant and seen that with your burger you can get "Coleslaw and fries or salad"? What does this mean? Can I get both coleslaw and fries, with the other option being a salad alone? Or do I always get coleslaw and have to choose between the fries and a salad? English is full of these things.
I used to teach a class on this and an example I liked to use to explain ambiguity was as follows. I asked the students to write a one paragraph story ending with the sentence "Tom asked Chris if he could help him". About half the time the stories written indicated that the student interpreted the sentence as Tom asking for assistance from Chris. The other half of the time people thought Tom was offering to lend Chris a hand.
If you think about it, there are a lot of people who do write programs in English. They're called product managers and the compiler they use is software engineers. The problem here is that a software engineer has to inject a lot of his own understanding of the problem to understand what the description really means. And trust me, there is a lot of back and forth. Even on very simple business requirements I must clarify ambiguities.
Context
I would not agree that lawyers have ways to solve the context problem. We continually have ongoing arguments in courts among, in some cases, some of the most educated people in the country, about the meanings of various laws. Sometimes this involves arguing about the context in which the laws were written long ago. Sometimes it involves applying a law to a new, previously non-existent context like the Internet. The fact that we have thousands of lawyers working on disambiguating these issues is proof that it cannot be handled by a simple computer program like a compiler. It's just too hard a problem.
Conciseness
Another issue is just the ability to be concise. Mathematics long ago invented notations for many different concepts in the maths because it's just easier to read if there's a special syntax that is concise and has a well defined meaning. A mathematician knows what it means when I say "f(x) = 3x+1". It means the same thing as "There is a function called f and it has one argument. The value of applying f to a number is the number that is one more than three times the number given." But the former is a lot easier to read once you've learned the syntax. The same is true for programming languages. Programming languages are specialized to describe computations.
Implementation
Creators of programming languages deliberately create very small languages. These are, in fact, subsets of English with some extra syntax. Understanding all of English in all of its free-form ways, with, worse yet, an ever-increasing vocabulary, is a job for Natural Language Processing (NLP). A very hard job. If you want to be able to assign unambiguous meaning to a program and have the program's behavior never change, you need a well-defined syntax and semantics.
The take-away point here is that English is a very big, very flexible language with no formal specification. Programming languages need to have a well-defined syntax and semantics in order for algorithms to have unambiguous and unchanging meaning. Someone could, in fact, write a formal syntax for a subset of English and give it unambiguous meaning. But this would be a huge, huge job.
It's been done
Check out BabelBuster. The idea here was to take C and convert it to and from a very small subset of very rigorous English, such that one could write a program in C and then convert it to English. During the DeCSS DVD decryption arguments, the MPAA was trying to get programs that could decrypt their DVDs declared illegal. BabelBuster fought back with a very interesting idea: create a way to convert English, which is protected under freedom of speech, into working C code, and thus make the point that C code is also just a language that should be protected as such. Therefore one should be able to publish code such as DeCSS, which cracks DVD encryption. It's an interesting piece of work relevant to your question regardless of which side you're on.
The problem with BabelBuster is that you need to write your program in a very, very limited subset of English. But it is possible to do this.
Conclusion
English, like all natural languages, allows us to describe a computation or algorithm, but the language is verbose, offers many ways to say the same thing, depends on the context of the speaker, and is not formally specified. If your goal is to describe computations, you should take English and choose a minimal workable subset in which you can say everything you need. Formally specify what each word in this subset will mean. Then create a few special notations to make it concise to say the things you say, just like mathematics did. If you do this, you'll end up with a typical programming language, or something like it.

There are three principal reasons.
First, as Gabe says - people have figured out through trial and error that programming in things that are close to English sentences only forces programmers to type more useless cruft. (And yes, COBOL was explicitly designed to read more "naturally".)
To a programmer,
windows++
is more readable than
You should now increment the number of windows by one.
For example, Tetris is a rather easy game to code. I would be terribly surprised if you managed to write an English explanation that is detailed enough for a computer (remember, computers are dumb, so you have to spell it all out) in fewer pages than a short novel.
The second reason is that the range of things a computer knows how to do is rather small, so the number of language constructs that are needed for that is also limited. In contrast, natural languages need to be able to express the entirety of human experience, which does require many language constructs to pull off. For example, "According to his wife, John would have caught the fish yesterday if it hadn't rained" is not expressible in C - and does not need to be.
And third is, indeed, ambiguity, as you yourself note. There are a lot of places where a software error is simply not permissible. People create enough bugs in unambiguous languages; allowing ambiguity would be a disaster waiting to happen. And on the same subject, we are still unable to parse human language sufficiently well: state-of-the-art parsers still have unacceptably high error rates.

It is possible to automatically translate structured English into code, as long as a restricted subset of the English language is used.
As a proof of concept, I have developed a programming language called EngScript, which translates English sentences into Python source code.
Arithmetic operations can be written in plain English:
print{3 to the power of 2}
print{3 raised to the power of 2}
#Both of these statements print "9".
print{3 plus (the sum of 1 and 2)}
#This prints "6".
Variables can be initialized in plain English, too:
let x be (x plus 1)
if (x is not equal to 7) :
    print x
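To make the general idea concrete, here is a toy translator for a tiny subset of English arithmetic, written in Python. It is not EngScript and does not reflect how EngScript is implemented; the rule table and the names `RULES`, `translate`, and `evaluate` are invented for this sketch. It only illustrates why a restricted subset makes the translation mechanical.

```python
import re

# Ordered rewrite rules: longer phrases come first so that
# "raised to the power of" is not half-consumed by "to the power of".
RULES = [
    (r"the sum of (\w+) and (\w+)", r"(\1 + \2)"),
    (r"raised to the power of", "**"),
    (r"to the power of", "**"),
    (r"is not equal to", "!="),
    (r"\bplus\b", "+"),
]


def translate(expr: str) -> str:
    """Rewrite a restricted-English arithmetic phrase into a Python expression."""
    for pattern, replacement in RULES:
        expr = re.sub(pattern, replacement, expr)
    return expr


def evaluate(expr: str):
    """Translate and evaluate; anything outside the tiny subset is rejected."""
    code = translate(expr)
    # Only digits, whitespace, parentheses, and the operators above may remain.
    if not re.fullmatch(r"[\d\s+*()!=]+", code):
        raise ValueError(f"cannot translate: {expr!r}")
    return eval(code, {"__builtins__": {}}, {})


print(translate("3 raised to the power of 2"))    # 3 ** 2
print(evaluate("3 to the power of 2"))            # 9
print(evaluate("3 plus (the sum of 1 and 2)"))    # 6
```

Any phrasing outside the rule table raises an error, which is the trade-off: the input looks like English, but the writer still has to learn exactly which sentences are allowed.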

Interesting NLP/machine-learning style project -- analyzing privacy policies

I wanted some input on an interesting problem I've been assigned. The task is to analyze hundreds, and eventually thousands, of privacy policies and identify core characteristics of them. For example, do they take the user's location?, do they share/sell with third parties?, etc.
I've talked to a few people, read a lot about privacy policies, and thought about this myself. Here is my current plan of attack:
First, read a lot of privacy policies and find the major "cues" or indicators that a certain characteristic is met. For example, if hundreds of privacy policies have the same line, "We will take your location.", that line could be a cue with 100% confidence that that privacy policy includes taking the user's location. Other cues would give much smaller degrees of confidence about a certain characteristic. For example, the presence of the word "location" might increase the likelihood that the user's location is stored by 25%.
The idea would be to keep developing these cues, and their appropriate confidence intervals to the point where I could categorize all privacy policies with a high degree of confidence. An analogy here could be made to email-spam catching systems that use Bayesian filters to identify which mail is likely commercial and unsolicited.
I wanted to ask whether you guys think this is a good approach to this problem. How exactly would you approach a problem like this? Furthermore, are there any specific tools or frameworks you'd recommend using? Any input is welcome. This is my first time doing a project that touches on artificial intelligence, specifically machine learning and NLP.

The idea would be to keep developing these cues, and their appropriate confidence intervals to the point where I could categorize all privacy policies with a high degree of confidence. An analogy here could be made to email-spam catching systems that use Bayesian filters to identify which mail is likely commercial and unsolicited.
This is text classification. Given that you have multiple output categories per document, it's actually multilabel classification. The standard approach is to manually label a set of documents with the classes/labels that you want to predict, then train a classifier on features of the documents; typically word or n-gram occurrences or counts, possibly weighted by tf-idf.
The popular learning algorithms for document classification include naive Bayes and linear SVMs, though other classifier learners may work too. Any classifier can be extended to a multilabel one by the one-vs.-rest (OvR) construction.
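Here is a small, self-contained sketch of the one-vs.-rest construction with a multinomial naive Bayes learner on word counts. In practice you would more likely use a library, with tf-idf weighting and a linear SVM; this version sticks to the standard library so the mechanics are visible. The class and function names, the label names, and the four toy "policies" are all made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class BinaryNaiveBayes:
    """Multinomial naive Bayes for one yes/no label, with add-one smoothing."""

    def fit(self, docs, labels):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = Counter(labels)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        return self

    def predict(self, doc):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in (True, False):
            # Smoothed log prior, then smoothed log likelihood of each known word.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(doc):
                if word in self.vocab:  # words never seen in training are ignored
                    score += math.log((self.word_counts[label][word] + 1)
                                      / (total_words + len(self.vocab)))
            scores[label] = score
        return scores[True] > scores[False]


def fit_one_vs_rest(docs, label_sets):
    """Train one binary classifier per label: 'has this label' vs. the rest."""
    all_labels = sorted(set().union(*label_sets))
    return {label: BinaryNaiveBayes().fit(docs, [label in s for s in label_sets])
            for label in all_labels}


def predict_labels(models, doc):
    return {label for label, model in models.items() if model.predict(doc)}


# Hypothetical hand-labelled training set.
docs = [
    "we collect your location",
    "we share data with partners",
    "we collect location and share data with partners",
    "we use cookies",
]
label_sets = [{"location"}, {"sharing"}, {"location", "sharing"}, set()]

models = fit_one_vs_rest(docs, label_sets)
print(predict_labels(models, "your location is collected"))  # {'location'}
```

With hundreds of real policies you would also want a held-out test set, since a classifier this simple is easy to fool with negations such as "we never share".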

A very interesting problem indeed!
On a higher level, what you want is summarization: a document has to be reduced to a few key phrases. This is far from being a solved problem. A simple approach would be to search for keywords as opposed to key phrases. You can try something like LDA for topic modelling to find what each document is about. You can then search for topics that are present in all documents; I suspect what will come up is stuff to do with licenses, location, copyright, etc. MALLET has an easy-to-use implementation of LDA.

I would approach this as a machine learning problem where you are trying to classify things in multiple ways, i.e. wants location, wants SSN, etc.
You'll need to enumerate the characteristics you want to use (location, SSN), and then for each document say whether that document uses that info or not. Choose your features, train your classifier, and then classify and test.
I think simple features like words and n-grams would probably get you pretty far, and a dictionary of words related to stuff like SSN or location would finish it nicely.
Use the machine learning algorithm of your choice- Naive Bayes is very easy to implement and use and would work ok as a first stab at the problem.

Digit Recognition with Bayesian classes

I need to write an OCR program for digits only. I will use the MNIST dataset. The problem is I do not know where to start. There are a lot of papers which don't really explain the algorithm. I don't really have much knowledge about pattern recognition. So I have a few questions.
Q1 : Where can I find the algorithm (or a tutorial)
Q2 : How do I classify digits? I don't need very advanced things. First thing that comes to my mind is finding the ratio of upper half/lower half and left side/ right side. Is there more useful and easy classification methods.
Q3 : What is back propagation and the layers which is shown in most of the papers. Do I need them for my simple OCR.
Note: I know my OCR program won't be accurate. It isn't very important for now.

If the closest engineering library to you has a section on image processing, computer vision, or machine vision, then with luck that library will have a copy of a book I recommend for OCR:
Character Recognition Systems by Cheriet, Kharma, Liu, and Suen
This book provides a fairly comprehensive overview of OCR techniques and recent research. It does not go into great depth on any particular subject, but it does provide references to academic papers.
Make sure you have access to a good introductory textbook on image processing. The book by Gonzalez and Woods is a standard in many universities:
Digital Image Processing by Gonzalez and Woods
Even "simple" OCR gets tricky very quickly. It could be overwhelming if you jump into a class about neural networks, Bayes theorem, etc., before you have a firm grasp of basic image processing principles.
If you can, try writing one or more OCR algorithms for machine-printed characters before you attempt to write an algorithm for handwritten characters.
Q1 : Where can I find the algorithm (or a tutorial)
There are numerous algorithms for OCR. The Cheriet book will give you a good start.
Q2 : How do I classify digits? I don't need very advanced things. First thing that comes to my mind is finding the ratio of upper half/lower half and left side/ right side. Is there more useful and easy classification methods.
Try implementing that technique and see how well it works. Even if the implementation doesn't work as well as you'd like, lessons learned while implementing it could help you later.
You can also subdivide a character into a 2 x 2 grid or 3 x 3 grid and check the relative densities of pixels. Unlike machine-printed characters, handwritten characters won't line up nicely in rectilinear grids.
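The grid idea can be sketched in a few lines. This assumes the character is already a binary image stored as a list of rows of 0/1 values (1 = ink) and is at least as large as the grid; the function name `zone_densities` is made up for this example.

```python
def zone_densities(image, rows=3, cols=3):
    """Fraction of ink pixels in each cell of a rows x cols grid, row by row."""
    height, width = len(image), len(image[0])
    features = []
    for r in range(rows):
        top, bottom = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            cell = [image[y][x]
                    for y in range(top, bottom)
                    for x in range(left, right)]
            features.append(sum(cell) / len(cell))
    return features


# A crude 6 x 6 "1": a two-pixel-wide vertical bar in the middle.
one = [[0, 0, 1, 1, 0, 0] for _ in range(6)]
print(zone_densities(one, rows=3, cols=3))
# [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
```

The resulting feature vector can be compared against per-digit averages or fed to any simple classifier. The upper/lower and left/right ratios from the question are the 2 x 1 and 1 x 2 special cases of the same idea.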
Template matching using normalized correlation is simple, and it can work reasonably well for machine printed characters for a single, known font. It's relatively simple to implement and worth learning:
http://en.wikipedia.org/wiki/Cross-correlation#Normalized_cross-correlation
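A minimal sketch of template matching with zero-mean normalized correlation, assuming the sample and the templates are already cropped and scaled to the same size. The 3 x 3 templates and the function names are toy placeholders, not real font data.

```python
import math


def normalized_correlation(a, b):
    """Zero-mean normalized cross-correlation of two same-sized images, in [-1, 1]."""
    xs = [p for row in a for p in row]
    ys = [p for row in b for p in row]
    if len(xs) != len(ys):
        raise ValueError("images must have the same size")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mean_x) ** 2 for x in xs)
                    * sum((y - mean_y) ** 2 for y in ys))
    return num / den if den else 0.0  # a flat image correlates with nothing


def classify(image, templates):
    """Return the label of the template that correlates best with the image."""
    return max(templates,
               key=lambda label: normalized_correlation(image, templates[label]))


templates = {
    "1": [[0, 1, 0],
          [0, 1, 0],
          [0, 1, 0]],
    "7": [[1, 1, 1],
          [0, 0, 1],
          [0, 0, 1]],
}
noisy_one = [[0, 1, 0],
             [0, 1, 0],
             [0, 1, 1]]  # one stray pixel
print(classify(noisy_one, templates))  # 1
```

Because the score is normalized, it is insensitive to overall brightness and contrast, but not to shifts, rotation, or scale, which is why it suits a single known font better than handwriting.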
For OCR it's common to thin the characters in your sample as an initial step. Thinning is a technique to reduce a character (or any other shape) to a representation that is 1 pixel wide. Once you have a thinned character it can be easier to identify lines and intersections. If you can identify lines (or curves) and intersections, then one technique is to look at the relative position and angle of each line with respect to the others.
Common thinning algorithms include Stentiford and Zhang-Suen. There's a freeware version of WinTopo that demonstrates both of these algorithms:
http://wintopo.com/
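For reference, here is a compact sketch of Zhang-Suen thinning as I understand it from the usual description: two sub-iterations that peel boundary pixels until nothing changes. It assumes a binary image as a list of rows of 0/1 values (1 = ink) and treats everything outside the image as background. It has not been tuned for speed.

```python
def zhang_suen_thin(image):
    """Return a thinned copy of a binary image (1 = ink) using Zhang-Suen."""
    img = [list(row) for row in image]
    height, width = len(img), len(img[0])
    # Neighbours P2..P9, clockwise, starting from the pixel directly above.
    offsets = [(-1, 0), (-1, 1), (0, 1), (1, 1),
               (1, 0), (1, -1), (0, -1), (-1, -1)]

    def neighbours(y, x):
        return [img[y + dy][x + dx]
                if 0 <= y + dy < height and 0 <= x + dx < width else 0
                for dy, dx in offsets]

    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_clear = []
            for y in range(height):
                for x in range(width):
                    if img[y][x] != 1:
                        continue
                    n = neighbours(y, x)
                    p2, p3, p4, p5, p6, p7, p8, p9 = n
                    ink = sum(n)  # number of ink neighbours
                    # number of 0 -> 1 transitions going once around the pixel
                    transitions = sum(n[i] == 0 and n[(i + 1) % 8] == 1
                                      for i in range(8))
                    if step == 0:
                        side_ok = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:
                        side_ok = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if 2 <= ink <= 6 and transitions == 1 and side_ok:
                        to_clear.append((y, x))
            # Clear only after the full scan so each pass sees a consistent image.
            for y, x in to_clear:
                img[y][x] = 0
            changed = changed or bool(to_clear)
    return img


bar = [[1, 1, 1] for _ in range(5)]  # a 3-pixel-wide vertical stroke
for row in zhang_suen_thin(bar):
    print("".join("#" if p else "." for p in row))
```

One known quirk worth checking on your own data: very small blobs (for example an isolated 2 x 2 square) can be erased completely, so it is worth filtering noise before thinning rather than after.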
You can look into academic papers about "stroke extraction", but those techniques tend to be more difficult to implement.
Q3 : What is back propagation and the layers which is shown in most of the papers. Do I need them for my simple OCR.
These terms refer to artificial neural networks. For a simple OCR algorithm you'll hard-code the recognition logic OR use simple training methods. Artificial neural networks can be trained to recognize characters that aren't hard-coded in your software.
http://en.wikipedia.org/wiki/Neural_network
Although you don't need to learn about artificial neural networks to write a simple OCR algorithm, a simple algorithm will have only limited success with handwritten characters.
Above all, keep in mind that OCR for handwritten characters is an extremely difficult problem. If you could achieve a handwritten character read rate of 20% with a simple technique, then consider that a success.

Extracting 'useful' information out of sentences?

I am currently trying to understand sentences of this form:
The problem was more with the set-top box than the television. Restarting the set-top box solved the problem.
I am totally new to Natural Language Processing and started using Python's NLTK package to get my hands dirty. However, I am wondering if someone could give me an overview of the high-level steps involved in achieving this.
What I am trying to do is identify what the problem was (in this case, the set-top box) and whether the action that was taken resolved the problem (in this case, yes, because restarting fixed it). If all the sentences were of this form, my life would be easier, but because this is natural language, the sentences could also be of the following form:
I took a look at the car and found nothing wrong with it. However, I suspect there is something wrong with the engine
So in this case, the problem was with the car. The action taken did not resolve the problem because of the presence of the word suspect. And the potential problem could be with the engine.
I am not looking for an absolute answer as I suspect this is very complex. What I am looking for is rather a high-level overview that will point me in the right direction. If there is an easier or alternative way to do this, that is welcome as well.

Really, the best you could hope for is a naive Bayes classifier with a sufficiently large training set (probably larger than the one you have), and you would have to be willing to tolerate a fair rate of false determinations.
Seeking the holy grail of NLP is bound to leave you somewhat unsatisfied.

Probably, if the sentences are well-formed, I would experiment with dependency parsing (http://nltk.googlecode.com/svn/trunk/doc/api/nltk.parse.malt.MaltParser-class.html#raw_parse). That gives you a graph of the constituents of a sentence and you can tell the relations between the lexical items. Later, you can extract phrases from the output of a dependency parser (http://nltk.googlecode.com/svn/trunk/doc/book/ch08.html#code-cfg2) That could help you to extract the direct object of a sentence, or the verb phrase in a sentence.
If you just want to get phrases or "chunks" from a sentence, you can try a chunk parser (http://nltk.googlecode.com/svn/trunk/doc/api/nltk.chunk-module.html). You can also carry out named entity recognition (http://streamhacker.com/2009/02/23/chunk-extraction-with-nltk/). It's usually used to extract instances of places, organizations, or people's names, but it could work in your case as well.
Assuming that you solve the problem of extracting noun/verb phrases from a sentence, you may need to filter them out to ease the job of your domain expert (too many phrases could overwhelm a judge). You may carry out a frequency analysis on your phrases, remove very frequent ones that are not usually related to the problem domain, or compile a white-list and keep the phrases that contain a pre-defined set of words, etc.
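The filtering step could look roughly like this, assuming the phrases have already been extracted as plain strings. The function name, the `max_share` cut-off, and the example white-list are invented for the illustration.

```python
from collections import Counter


def filter_phrases(phrases, max_share=0.5, whitelist=None):
    """Drop overly frequent phrases and, optionally, keep only white-listed ones.

    A phrase is dropped when it accounts for more than max_share of all
    extracted phrases; with a white-list, a phrase must contain at least
    one white-listed word to be kept.
    """
    counts = Counter(p.lower() for p in phrases)
    total = sum(counts.values())
    kept = []
    for phrase, n in counts.most_common():
        if n / total > max_share:
            continue  # too frequent to be informative
        if whitelist and not (set(phrase.split()) & whitelist):
            continue  # contains none of the domain words
        kept.append((phrase, n))
    return kept


phrases = ["the problem", "set-top box", "the problem", "the television",
           "set-top box", "the problem", "the engine", "the problem",
           "the problem"]
print(filter_phrases(phrases, whitelist={"box", "television", "engine"}))
# [('set-top box', 2), ('the television', 1), ('the engine', 1)]
```

The counts are returned along with the phrases so that a domain expert can review the most common candidates first.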

Does knowing a Natural Language well help with Programming?

We all hear that math at least helps a little bit with programming. My question though, does English or other natural language skills help with programming? I know it has to help with technical documentation, but what about actual programming? Are certain constructs in a programming language also there in natural languages? Does knowing how to write a 20 page research paper help with writing a 20k loc programming project?

Dijkstra went so far as to say: "Besides a mathematical inclination, an exceptionally good mastery of one's native tongue is the most vital asset of a competent programmer."
Edit: yes, I'm reasonably certain he was talking about the programming part of the job. Here's a bit more complete quote:
The problems of business administration in general and database management in particular are much too difficult for people who think in IBMerese, compounded by sloppy English.
About the use of language: it is impossible to sharpen a pencil with a blunt axe. It is equally vain to try to do it with ten blunt axes instead.
Besides a mathematical inclination, an exceptionally good mastery of one's native tongue is the most vital asset of a competent programmer.
From EWD498.
I certainly can't speak for Dijkstra, but I think it's impossible to cleanly separate the part where you're doing actual programming from the part where you're interacting with people. Just for example, even when you're working alone, it's crucial that you're able to understand (clearly and unambiguously) notes you wrote down about what to do, the nature of a bug, etc. A good command of English is necessary even when nobody else is involved at all (and, of course, that's unusual except on trivial tasks).

I don't know about causality, but the skill set required to write well overlaps quite a bit with the one required for programming: knowing how to plan, being able to keep a myriad of details consistent, being able to make things clear for a future reader, knowing how to organize your thoughts and the resultant product. That isn't to say that a successful author would make a good programmer, but a programmer with good language skills and the same logic/math/deductive skills is probably a better programmer than one with poor language skills -- at least the code has a greater chance of being understandable.

Yes. Strong natural language skills help you to organize your thoughts in a coherent way that can easily be understood by others. That can help improve your code in everything from naming variables, methods, classes, etc., to expressing the contexts of objects in your model. Practices such as pair programming require you to be able to communicate well with your partner in order to write good code. Techniques such as Domain-Driven Design emphasize using the domain language of the business in your code. Natural language skills facilitate that. And there is a strong drive in the development industry toward more natural language-like tools, e.g. many of the newer testing tools like RSpec, Gherkin, etc., are moving toward more natural language-like syntax. One of the things many people like about dynamic languages like Ruby and Python is that the code tends to read more like a natural language.

Let me state what should be obvious: every healthy person above 12 knows at least one natural language. Moreover, every healthy person above 12 is able to generate and parse natural language, a complex and rich language, and to express and understand an extremely large set of ideas. In general, people are not likely to be limited in their ability to discuss issues by their language, but by the type of things they have experienced and learned.
Having said that, there are several language-related skills that you might have thought about.
Writing style. You mentioned those specifically. Written language is different from spoken language. Way less intuitive. This is one reason people have to get coached in writing through their years in the education system.
Coding doesn't really involve writing. I mean, there are comments, but they can be rather laconic. Of course, the work of a programmer usually involves at least some writing of documents, and writing abilities do make a difference there.
Analytical skills. Analytical skills are a complicated (not to say fuzzy) concept. Analytical skills aren't really about language, but insofar as they are taught and tested at all, it's in the context of writing essays.
Analytical skills are obviously very important in programming. I am not sure that these are exactly the same skills required to write a good essay about Euthanasia or whatever, but as was previously suggested, they may be related.
Foreign language. For people whose native language isn't English, a certain command of English may be needed. Not in the coding itself (knowing what "while" means in English isn't really critical to understanding what it does in Java), but because much training and support material is available mainly in English (did anyone mention Stack Overflow?). The English requirement may differ depending on the country you are in and the company you work for, though.
Communication Skills. Ahhm. I was never exactly sure what this means exactly. Maybe it's a cultural thing. I do suspect it's less about knowing a language and more about knowing people.
So to sum up, Dijkstra is a venerable computer scientist, but I am not sure he knew that much about language.

Programming isn't just about writing code. On any programming project of any size there will be the need for:
initial project proposal documents
design and architectural documents
programmer's manual
user's manual
training materials
communication with third party suppliers
etc.
On every big project I've worked on I'd guess I spent at least 50% of my time on the English language documents. So yes, an ability to explain and express yourself well is extremely important. Does it lead to writing better code? Once again, I would say yes: the need to provide clear documentation spills over into the need to write better code, interfaces, etc.
