Add punctuation to text - nltk

I am looking for a way to add punctuation to a sentence as in:
hey mike how are you -> Hey Mike, how are you?
If that model takes care of correct casing I would not be mad either. I've used nltk, spacy and CodeNLP in the past but I cannot recall (or find) anything that would allow me to enhance a sentence like that.
Is there a way to do this with any of those libraries?

According to this post, it is studied in speech recognition (for transcription) and Natural Language Processing (NLP).
One of the implementations referenced can be found here.
Taking your example sentence as input in the demo results in "Hey mike, how are you". As you can see, the result is somewhat closer to what someone would expect, but not quite the same.

The task of adding proper punctuation to a given string is often referred to as "punctuation restoration" in the research community. nltk, spacy and CodeNLP do not have this feature.
The https://github.com/ottokart/punctuator2 package that Simon suggested is Python 2.7 + Theano + MIT license + word-level prediction + published in 2016. A slightly more recent package is https://github.com/geyang/deep-auto-punctuation (PyTorch + char-level prediction + published in 2017, but has no license).

Related

handwriting recognition with simple training

I've been reading (and trying) OCR programs suggested in previous answers but I'm still without a clear answer to my problem.
I need to recognize handwritten English text. The text would be multiple lines, but each line is only one or two words long. The text comes from a different person each time. I could ask that person to provide a training file (e.g. with the alphabet and the digits 0-9), but I cannot really ask for much more complicated training than this.
I need to integrate the recognition as part of another (Java) application but the solution doesn't need to be Java. I can just execute it from Java and get the results from a text file.
Any recommendations?
I've already tested Tesseract (bad results without training, and training looks quite complex). Java OCR looked like the perfect solution (simple training, open source and Java) but it doesn't work well even with their own examples (has anybody had a better experience?). GOCR does not seem very active.
Of course I prefer free solutions but this is not a MUST (though the problem I see with a commercial option is that I must be able to integrate it in my own app which will be offered as SaaS)
From my experience ABBYY is one of the best for handwriting recognition, even without training. (It's possibly one of the most expensive too, though...) They have an SDK for Java.
http://www.abbyy.com
With a free trial, it's definitely worth a look!
I am on the lookout for handwritten text recognition software. So far, the only one giving better results than even ABBYY 11 on the same text has been SimpleOCR, which is freeware for OCR but only a 14-day trial for HCR!
I know I am answering after nearly 6 years. But if anyone's still looking, try using tensorflow. Their website has a simple example for handwritten digit recognition (MNIST). You can use this example and adapt it for handwritten alphabet recognition (you need training data for this; I used NIST Special Database 19 to get this data).

What's a good explanation of statistical machine translation?

I'm trying to find a good high level explanation of how statistical machine translation works. That is, supposing I have a corpus of non-aligned English, French and German texts, how could I use that to translate any sentence from one language to another ? It's not that I'm looking to build a Google Translate myself, but I'd like to understand how it works in more detail.
I've searched Google but come across nothing good: everything either quickly needs advanced mathematics knowledge to understand or is way too generalized. Wikipedia's article on SMT seems to be both, so it doesn't really help much. I'm skeptical that this is such a complex area that it's simply not possible to understand without all the mathematics.
Can anyone give, or know of, a general step-by-step explanation of how such a system works, targeted towards programmers (so code examples are fine) but without needing a mathematics degree to understand ? Or a book that's like this would be great too.
Edit: A perfect example of what I'm looking for would be an SMT equivalent to Peter Norvig's great article on spelling correction. That gives a good idea of what's involved in writing a spell checker, without going into detailed maths on Levenshtein/soundex/smoothing algorithms etc...
Here is a nice video lecture (in 2 parts):
http://videolectures.net/aerfaiss08_koehn_pbfs/
For in-depth details, I highly advise this book:
http://www.amazon.com/Statistical-Machine-Translation-Philipp-Koehn/dp/0521874157
Both are from the guy who created the most widely used MT system in research. The book covers all the fundamental stuff, and is very well explained and accurate. This is probably one of the de facto standard books that any researcher beginning in this field should read.
The Atlantic Online had a very straightforward nontechnical description of statistical machine translation back in December 1998:
Lost in Translation by Stephen Budiansky
I've read nontechnical stuff on statistical MT before but always wondered "yeah but how does the statistical stuff know which words map to which when word orders vary and supposedly no dictionary and no grammar are used?" Well this article actually does answer that and it's simple and straightforward and I was quite surprised.
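The "which words map to which" question can be illustrated with a minimal sketch of IBM Model 1 style word alignment, trained with expectation-maximization. This is only a toy under stated assumptions: the three sentence pairs are made up for illustration, the names `train_ibm1` and `best_translation` are hypothetical, and the model needs sentence-aligned parallel text (it cannot learn from unrelated, non-aligned corpora). Real systems add phrase tables, reordering and a language model on top of this idea.

```python
from collections import defaultdict

# Toy sentence-aligned corpus (an assumption, made up for illustration).
corpus = [
    ("the house".split(), "das haus".split()),
    ("the book".split(), "das buch".split()),
    ("a book".split(), "ein buch".split()),
]

def train_ibm1(pairs, iterations=20):
    """Estimate t[(f, e)], roughly P(foreign word f | English word e)."""
    foreign_vocab = {f for _, f_sent in pairs for f in f_sent}
    # Start with no knowledge at all: every pairing is equally likely.
    t = defaultdict(lambda: 1.0 / len(foreign_vocab))
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: split each foreign word's "vote" among the English words
        # in the same sentence pair, proportionally to the current t.
        for e_sent, f_sent in pairs:
            for f in f_sent:
                norm = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: re-estimate the probabilities from the fractional counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

def best_translation(t, e_word):
    """Return the foreign word with the highest probability for e_word."""
    candidates = {f: p for (f, e), p in t.items() if e == e_word}
    return max(candidates, key=candidates.get)

if __name__ == "__main__":
    table = train_ibm1(corpus)
    for word in ("the", "house", "book", "a"):
        print(word, "->", best_translation(table, word))
```

No dictionary is supplied anywhere: "das" co-occurs with "the" in two sentence pairs, which pulls probability away from the competing pairings, and that in turn makes "haus" the most likely partner for "house".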
A Peter Norvig talk from Google Developer Day 2007, Theorizing from Data: Avoiding the Capital Mistake, contains some accessible high-level explanation of the principles of statistical machine translation (starting from about 21:20).

OCR lib for math formulas

I need an open OCR library which is able to scan complex printed math formulas (for example some formulas which were generated via LaTeX). I want to get some LaTeX-like output (or just some AST-like data).
Is there something like this already? Or are current OCR techniques only able to parse line-oriented text?
(Note that I also posted this question on Metaoptimize because some people there might have additional knowledge.)
The problem was also described by OpenAI as im2latex.
SESHAT is an open source system written in C++ for recognizing handwritten mathematical expressions. SESHAT was developed as part of a PhD thesis at the PRHLT research center at Universitat Politècnica de València.
An online demo: http://cat.prhlt.upv.es/mer/
The source: https://github.com/falvaro/seshat
Seshat is an open-source system for recognizing handwritten mathematical expressions. Given a sample represented as a sequence of strokes, the parser is able to convert it to LaTeX or other formats like InkML or MathML.
According to the answers on Metaoptimize and the discussion on the Tesseract mailing list, there doesn't seem to be an open/free solution yet which can do that.
The only solution which seems to be able to do it (but I cannot verify as it is Windows-only and non-free) is, like a few other people have mentioned, the InftyProject.
InftyReader is the only one I'm aware of. It is NOT free software (it seems the money goes to a non-profit org, IIRC).
http://www.sciaccess.net/en/InftyReader/
I don't know why PDF can't have metadata in LaTeX? As in: put the LaTeX equation in it! Is this so hard? (I dunno anything about PDF syntax, but I imagine it can be done).
LaTeX syntax is THE ONE TRIED AND TRUE STANDARD for mathematics notation. It seems amazingly stupid that the folks who produced MathML and other stuff don't take this into consideration. InftyReader generates MathML or LaTeX syntax.
If I want HTML (pure) I then use TTH to read the LaTeX syntax. Just works.
ABBYY FineReader (a great OCR program) claims you can train the software for Math, but this is immensely braindead (who has the time?)
And Unicode has lots of math symbols. That today's OCR readers can't grok them shows the sorry state of software and the brain deficit in this activity.
As to "one symbol at a time", TeX obviously has rules as to where it will place symbols. They can't write software that knows those rules?! TeX is even public domain! They can just "use it" in their commercial products.
Check out "Web Equation." It can convert handwritten equations to LaTeX, MathML, or SymbolTree. I'm not sure if the engine is open source.
Considering that current technologies read one symbol at a time (see http://detexify.kirelabs.org/classify.html), I doubt there is an OCR for full mathematical equations.
Infty works fairly well. My former company integrated it into an application that reads equations out loud for blind people and is getting good feedback from users.
http://www.inftyproject.org/en/download.html
Since the output from math OCR for complex formulas will likely have bugs -- even humans have trouble with it -- you will have to proofread the results, at least if they matter. The (human) proofreader will then have to correct the results, meaning you need to have a math formula editor. Given the effort needed by humans, and the probably limited corpus of complex formulas, you might find it easier to assign the task to humans.
As a research problem, reading math via OCR is fun -- you need a formalism for 2-D grammars plus a symbol recognizer.
In addition to references already mentioned here, why not google for this? There is work that was done at Caltech, Rochester, U. Waterloo, and UC Berkeley. How much of it is ready to use out of the box? Dunno.
As of August 2019, there are a few options, depending on what you need:
For converting printed math equations/formulas to LaTeX, Mathpix is absolutely the best choice. It's free.
For converting handwritten math to LaTeX or printed math, MyScript is the best option, although its app costs a few dollars.
You know, there's an application in Win7 just for that: Math Input Panel. It even handles handwritten input (it's actually made for this). Give it a shot if you have Win7, it's free!
There is this great short video: http://www.youtube.com/watch?v=LAJm3J36tLQ
explaining how you can train your FineReader to recognize math formulas. If you use FineReader already, better to stick with one tool. Of course it is not freeware :(

PHP/Python/C/C++ library/application to match/correct/give suggestions to input

I'd like to have a simple & lightweight library/application in PHP/Python/C/C++ to match/correct/give suggestions to input. Example in/out:
Input: Webdevelopment ==> Output: Web Development
Input: Web developmen ==> Output: Web Development
Input: Web develop ==> Output: Web Development
Given there is a database of correct words and phrases, I just need the library to match/guess phrases. Please suggest if you know any.
How to Write a Spelling Corrector from Google's Director of Research Peter Norvig contains a spelling corrector in 21 lines of Python, complete with explanations.
You will have to convert this into a module yourself, but that should be easy. Of course, you will also need a corpus (i.e. words), but he gives sources for these as well.
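As a rough idea of what such a module could look like, here is a condensed sketch in the spirit of that approach: generate every string one edit away from the input and keep the known word that is most frequent in the corpus. It is not the article's code verbatim; the tiny `WORDS` corpus is an assumption for illustration, and it only looks one edit away, so it will miss words that need two or more corrections.

```python
from collections import Counter

# Tiny stand-in corpus (an assumption); in practice, count words from real text.
WORDS = Counter("web development web design software development developer".split())

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings that are one delete, transpose, replace or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """Keep only the candidates that appear in the corpus."""
    return {w for w in words if w in WORDS}

def correction(word):
    """Prefer the word itself, then known words one edit away, else give up."""
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=lambda w: WORDS[w])

if __name__ == "__main__":
    for typo in ("developmen", "desing", "web"):
        print(typo, "->", correction(typo))
```

This works word by word, so phrase-level inputs such as "Webdevelopment" would need an extra step (for example, comparing the whole input against the phrase database).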
I guess what you want to do is compute the edit distance between strings (an input, output pair).
One of the simpler ones (that I've used for figuring out a team's full name from its 3-letter short one - it's a long story..) is the Levenshtein distance. The last external link on the page has a bunch of different implementations of it (turns out it's standard on PHP 4.0.1+).
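For the phrase-matching case in the question, a minimal sketch could compute the Levenshtein distance between the input and every entry in the database and return the closest one. The `PHRASES` list and the `suggest` helper are hypothetical stand-ins for the real database and API, and comparing case-insensitively is a design assumption.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

# Hypothetical database of correct phrases.
PHRASES = ["Web Development", "Graphic Design", "Data Science"]

def suggest(text, phrases=PHRASES):
    """Return the phrase with the smallest case-insensitive edit distance."""
    return min(phrases, key=lambda p: levenshtein(text.lower(), p.lower()))

if __name__ == "__main__":
    for text in ("Webdevelopment", "Web developmen", "Web develop"):
        print(text, "==>", suggest(text))
```

A linear scan like this is fine for a small database; for a large one, a distance threshold or an index would be needed to keep lookups fast.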

Natural Language Parsing for ToDo application

I'm wondering if someone could lead me to any examples of natural language parsing for to do lists. Nothing as intense as real Natural Language Parsing, but something that could process the line:
Go to George's house at 3pm on Tuesday with Kramer
as well as the line:
3 on tuesday go to georges
and get the same output.
I've seen other to do applications that do this sort of work in the past. Is there anything out there with examples or have people just custom written this code themselves?
Somebody pointed out this natural language parsing library on this site. Kudos to whoever you are for posting the link: http://code.gustavonarea.net/booleano/
That's a great idea! As you might imagine, this is vastly complex and can be approached in many different ways. Perhaps check out the Natural Language Toolkit for starters, which is mostly Python but also requires building some OCaml and Java components. I also recommend reading some books and/or papers on lexical semantics.
I wrote something similar to this in Perl. The input would be a day/time with the name of some action. Sentences like: "3pm Run full unit test suite", "reboot servers on dec 25", etc.
I used the Perl module Date::Manip since it's awesome for this sort of thing and coded the rest of the logic manually.
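The same "pull out the date/time, keep the rest as the action" idea can be sketched in Python with two regular expressions. This is a hand-rolled illustration, not a port of Date::Manip: the `parse_todo` function is hypothetical, it only understands weekday names and simple clock times, and it assumes (loudly) that a bare hour from 1 to 7 with no am/pm means the afternoon.

```python
import re

DAYS = "monday|tuesday|wednesday|thursday|friday|saturday|sunday"
DAY_RE = re.compile(rf"\b(?:on\s+)?({DAYS})\b", re.IGNORECASE)
TIME_RE = re.compile(r"\b(?:at\s+)?(\d{1,2})(?::(\d{2}))?\s*(am|pm)?\b", re.IGNORECASE)

def parse_todo(line):
    """Split a to-do line into task text, weekday, hour (24h) and minute."""
    day = hour = None
    minute = 0

    match = DAY_RE.search(line)
    if match:
        day = match.group(1).lower()
        line = line[:match.start()] + " " + line[match.end():]

    match = TIME_RE.search(line)
    if match:
        hour = int(match.group(1))
        minute = int(match.group(2) or 0)
        meridiem = (match.group(3) or "").lower()
        if meridiem == "pm" and hour < 12:
            hour += 12
        elif meridiem == "am" and hour == 12:
            hour = 0
        elif not meridiem and 1 <= hour <= 7:
            hour += 12  # assumption: a bare "3" means 3pm, not 3am
        line = line[:match.start()] + " " + line[match.end():]

    task = " ".join(line.split())  # collapse the gaps left by removed pieces
    return {"task": task, "day": day, "hour": hour, "minute": minute}

if __name__ == "__main__":
    print(parse_todo("Go to George's house at 3pm on Tuesday with Kramer"))
    print(parse_todo("3 on tuesday go to georges"))
```

Both example lines end up with the same day and hour; the remaining task text still differs ("George's house" vs "georges"), and reconciling that would need the kind of fuzzy matching or custom logic mentioned above.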