PHP/Python/C/C++ library/application to match/correct/give suggestions to input - language-agnostic

I'd like to have a simple & lightweight library/application in PHP/Python/C/C++ library/application to match/correct/give suggestions to input. Example in/out:
Input: Webdevelopment ==> Output: Web Development
Input: Web developmen ==> Output: Web Development
Input: Web develop ==> Output: Web Development
Given there is database of correct words and phrases, I just need the library to match/guess phrases. Please suggest if you know any.

How to Write a Spelling Corrector from Google's Director of Resarch Peter Norvik contains a spelling corrector in 21 lines of Python, complete with explanations.
You will have to convert this into a module yourself, but that should be easy. Of course, you will also need a corpus (i.e. words), but he gives sources for these as well.

I guess what you want to do is compute the edit distance between strings (an input, output pair).
One of the simpler ones (that I've used for figuring out a team's full name from it's 3 letter short one - it's a long story..) is the Levenshtein distance. The last external link on the page has a bunch of different implementations of it (turns out it's standard on PHP 4.0.1+).

Related

Custom translator - How can I train the machine to recognize the right translation solution (synonyms)?

I'm pretty new with Custom Translator and I'm working on a fashion-related EN_KO project.
There are many cases where a single English term has two possible translations into Korean. An example: if "fastening"is related to "bags, backpacks..." is 잠금 but if it's related to "clothes, shoes..." is 여밈.
I'd like to train the machine to recognize these differences. Could it be useful to upload a phrase dictionary? Any ideas? Thanks!
The purpose of training a custom translation system is to teach it how to translate terms in context.
The best way to teach the system how to translate is training with parallel documents of full sentence prose: the same document in two languages. A translation memory extract in a TMX or XLIFF file is the best material, but many other document formats are suitable as well, as long as you have both languages. Have at least 10000 sentences in both languages, upload to http://customtranslator.ai, and build a custom system with it.
If you have documents in Korean that are representative of the terminology and style you want to achieve, without an English match, you can automatically translate those to English, and add to the training material as parallel documents. Be sure to not use the automatically translated documents in the other direction.
A phrase dictionary is of limited help, because it is unaware of context. It is useful only in bootstrapping your custom system or for very rare terms where you cannot find or create a sentence.

Resources on generating equivalent phrases (same language translation)?

I'm interested in building a program that takes some text (an article, for example) and then generates a new text with equivalent meaning, but I'm not sure how to get started on such a problem.
Can anyone recommend some code/books/papers/techniques that would help me tackle this?
Did you check this one:
fastsubs: a program to generate most likely substitutes for words in a given text based on an n-gram language model.
More info is available at:
http://denizyuret.blogspot.be/2012/05/fastsubs-efficient-admissible-algorithm.html

OCR lib for math formulas

I need an open OCR library which is able to scan complex printed math formulas (for example some formulas which were generated via LaTeX). I want to get some LaTeX-like output (or just some AST-like data).
Is there something like this already? Or are current OCR technics just able to parse line-oriented text?
(Note that I also posted this question on Metaoptimize because some people there might have additional knowledge.)
The problem was also described by OpenAI as im2latex.
SESHAT is a open source system written in C++ for recognizing handwritten mathematical expressions. SESHAT was developed as part of a PhD thesis at the PRHLT research center at Universitat Politècnica de València.
An online demo:http://cat.prhlt.upv.es/mer/
The source: https://github.com/falvaro/seshat
Seshat is an open-source system for recognizing handwritten mathematical expressions. Given a sample represented as a sequence of strokes, the parser is able to convert it to LaTeX or other formats like InkML or MathML.
According to the answers on Metaoptimize and the discussion on the Tesseract mailinglist, there doesn't seem to be an open/free solution yet which can do that.
The only solution which seems to be able to do it (but I cannot verify as it is Windows-only and non-free) is, like a few other people have mentioned, the InftyProject.
InftyReader is the only one I'm aware of. It is NOT free software (it seems the money goes to a non-profit org, IIRC).
http://www.sciaccess.net/en/InftyReader/
I don't know why PDF can't have metadata in LaTeX? As in: put the LaTeX equation in it! Is this so hard? (I dunno anything about PDF syntax, but I imagine it can be done).
LaTeX syntax is THE ONE TRIED AND TRUE STANDARD for mathematics notation. It seems amazingly stupid that folks that produced MathML and other stuff don't take this in consideration. InftyReader generates MathML or LaTeX syntax.
If I want HTML (pure) I then use TTH to read the LaTeX syntax. Just works.
ABBYY FineReader (a great OCR program) claims you can train the software for Math, but this is immensely braindead (who has the time?)
And Unicode has lots of math symbols. That today's OCR readers can't grok them shows the sorry state of software and the brain deficit in this activity.
As to "one symbol at a time", TeX obviously has rules as to where it will place symbols. They can't write software that know those rules?! TeX is even public domain! They can just "use it" in their comercial products.
Check out "Web Equation." It can convert handwritten equations to LaTeX, MathML, or SymbolTree. I'm not sure if the engine is open source.
Considering that current technologies read one symbol at a time (see http://detexify.kirelabs.org/classify.html), I doubt there is an OCR for full mathematical equations.
Infty works fairly well. My former company integrated it into an application that reads equations out loud for blind people and is getting good feedback from users.
http://www.inftyproject.org/en/download.html
Since the output from math OCR for complex formulas will likely have bugs -- even humans have trouble with it -- you will have to proofread th results, at least if they matter. The (human) proofreader will then have to correct the results, meaning you need to have a math formula editor. Given the effort needed by humans, the probably limited corpus of complex formulas, you might find it easier to assign the task to humans.
As a research problem, reading math via OCR is fun -- you need a formalism for 2-D grammars plus a symbol recognizer.
In addition to references already mentioned here, why not google for this? There is work that was done at Caltech, Rochester, U. Waterloo, and UC Berkeley. How much of it is ready to use out of the box? Dunno.
As of August 2019, there are a few options, depending on what you need:
For converting printed math equations/formulas to LaTex, Mathpix is absolutely the best choice. It's free.
For converting handwritten math to LaTex or printed math, MyScript is the best option, although its app costs a few dollars.
You know, there's an application in Win7 just for that: Math Input Panel. It even handles handwritten input (it's actually made for this). Give it a shot if you have Win7, it's free!
there is this great short video: http://www.youtube.com/watch?v=LAJm3J36tLQ
explaining how you can train your Fine Reader to recognize math formulas. If you use Fine Reader already, better to stick with one tool. Of course it is not free ware :(

Reliably extracting identity fields from scanned documents / images?

I have to pull two pre-printed (not hand-written) fields out of a paper form, such that it can be automatically routed after being scanned. The fields contain batch and item identifiers, like "GG-9192" or "EPN/245G".
I've tried the following software:
Tesseract-OCR
Cuneiform
Canon ImageRunner built-in OCR
Asprise OCR Java API (demo)
I've tried the following settings:
Scanning at resolutions of 300dpi and 600dpi
Tried different fonts, including OCR-A and OCR-B.
In all cases output was pretty much all over the place. I can kick back documents for which I can't properly extract the necessary information, but I'm thinking it's going to be at least half of them. I considered some sort of fuzzy logic based on known values in a database, but sometimes these identifiers can differ by a single character, like "123G" and "123C".
Is this a lost cause? Perhaps OCR just isn't mature enough to handle a requirement of this nature? What other techniques might you recommend? Barcodes?
Edit: the containing application is in Java, so any recommendations for which there are free or cheap Java-based APIs for would help.
Edit 2: if anyone is interested...without any special tuning, Cuneiform for Linux and the Canon ImageRunner worked best, with Tesserect-OCR and Asprise Java API producing the worst results...none of the four was acceptable for anything but standard document search grade OCR. I'm beginning to think that this isn't going to work out.
If you have control over the fields, why use a human-readable format in the first place? For scanning, it seems like a QR Code, or something similar would be best. It is marked for orientation, and has some built-in error correction.
http://en.wikipedia.org/wiki/QR_Code
I started digging for products starting with Tomato's suggestion. I tried ABBYY and CVISION. Both have products that can automate OCR:
CVISION Maestro Recognition Server 4.0
ABBYY Recognition Server 2.0
In addition, ABBYY has SDKs for various platforms, and CVISION has an SDK that appears to work with at least VB/VC++.
I haven't tried either SDK yet, and am not sure it's necessary for my project. All I need is PDFs coming in that I can extract the text from. I did however try CVISION's server product and with the OCR on its most accurate settings, it worked really well. I haven't tried ABBYY's server product yet because I have to go through a reseller to get a trial. I'm in the process of doing so, but if it starts getting annoying I'm probably going to go with CVISION. I did try ABBYY's FineReader standalone product, and it worked very well, so I assume that their server product would also.

Tools to help reverse engineer binary file formats

What tools are available to aid in decoding unknown binary data formats?
I know Hex Workshop and 010 Editor both support structures. These are okay to a limited extent for a known fixed format but get difficult to use with anything more complicated, especially for unknown formats. I guess I'm looking at a module for a scripting language or a scriptable GUI tool.
For example, I'd like to be able to find a structure within a block of data from limited known information, perhaps a magic number. Once I've found a structure, then follow known length and offset words to find other structures. Then repeat this recursively and iteratively where it makes sense.
In my dreams, perhaps even automatically identify possible offsets and lengths based on what I've already told the system!
Here are some tips that come to mind:
From my experience, interactive scripting languages (I use Python) can be a great help. You can write a simple framework to deal with binary streams and some simple algorithms. Then you can write scripts that will take your binary and check various things. For example:
Do some statistical analysis on various parts. Random data, for example, will tell you that this part is probably compressed/encrypted. Zeros may mean padding between parts. Scattered zeros may mean integer values or Unicode strings and so on. Try to spot various offsets. Try to convert parts of the binary into 2 or 4 byte integers or into floats, print them and see if they make sence. Write some functions that will search for repeating or very similar parts in the data, this way you can easily spot headers.
Try to find as many strings as possible, try different encodings (c strings, pascal strings, utf8/16, etc.). There are some good tools for that (I think that Hex Workshop has such a tool). Strings can tell you a lot.
Good luck!
For Mac OS X, there's a great tool that's even better than my iBored: Synalyze It!
(http://www.synalysis.net/)
Compared to iBored, it is better suited for non-blocked files, while also giving full control over structures, including scriptability (with Lua). And it visualizes structures better, too.
Tupni; to my knowledge not directly available out of Microsoft Research, but there is a paper about this tool which can be of interest to someone wanting to write a similar program (perhaps open source):
Tupni: Automatic Reverse Engineering of Input Formats (# ACM digital library)
Abstract
Recent work has established the importance of automatic reverse
engineering of protocol or file format specifications. However, the
formats reverse engineered by previous tools have missed important
information that is critical for security applications. In this
paper, we present Tupni, a tool that can reverse engineer an input
format with a rich set of information, including record sequences,
record types, and input constraints. Tupni can generalize the format
specification over multiple inputs. We have implemented a
prototype of Tupni and evaluated it on 10 different formats: five
file formats (WMF, BMP, JPG, PNG and TIF) and five network
protocols (DNS, RPC, TFTP, HTTP and FTP). Tupni identified all
record sequences in the test inputs. We also show that, by aggregating
over multiple WMF files, Tupni can derive a more complete
format specification for WMF. Furthermore, we demonstrate the
utility of Tupni by using the rich information it provides for zeroday
vulnerability signature generation, which was not possible with
previous reverse engineering tools.
My own tool "iBored", which I released just recently, can do parts of this. I wrote the tool to visualize and debug file system formats (UDF, HFS, ISO9660, FAT etc.), and implemented search, copy and later even structure and templates support. The structure support is pretty straight-forward, and the templates are a way to identify structures dynamically.
The entire thing is programmable in a Visual BASIC dialect, allowing you to test values, read specific blocks, and all.
The tool is free, works on all platforms (Win, Mac, Linux), but as it's personal tool which I just released to the public to share it, it's not much documented.
However, if you want to give it a try, and like to give feedback, I might add more useful features.
I'd even open source it, but as it's written in REALbasic, I doubt many people will join such a project.
Link: iBored home page
I still occasionally use an old hex editor called A.X.E., Advanced Hex Editor. It seems to have largely disappeared from the Internet now, though Google should still be able to find it for you. The last version I know of was version 3.4, but I've really only used the free-for-personal-use version 2.1.
Its most interesting feature, and the one I've had the most use for deciphering various game and graphics formats, is its graphical view mode. That basically just shows you the file with each byte turned into a color-coded pixel. And as simple as that sounds, it has made my reverse-engineering attempts a lot easier at times.
I suppose doing it by eye is quite the opposite of doing automatic analysis, though, and the graphical mode won't be much use for finding and following offsets...
The later version has some features that sound like they could fit your needs (scripts, regularity finder, grammar generator), but I have no idea how good they are.
There is Hachoir which is a Python library for parsing any binary format into fields, and then browse the fields. It has lots of parsers for common formats, but you can also write own parsers for your files (eg. when working with code that reads or writes binary files, I usually write a Hachoir parser first to have a debugging aid). Looks like the project is pretty much inactive by now, though.
Kaitai is an open-source language for describing binary structures in data streams. It comes with a translator that can output parsing code for many programming languages, for inclusion in your own program code.
My project icebuddha.com supports this using python to describe the format in the browser.
A cut'n'paste of my answer to a similar question:
One tool is WinOLS, which is designed for interpreting and editing vehicle engine managment computer binary images (mostly the numeric data in their lookup tables). It has support for various endian formats (though not PDP, I think) and viewing data at various widths and offsets, defining array areas (maps) and visualising them in 2D or 3D with all kinds of scaling and offset options. It also has a heuristic/statistical automatic map finder, which might work for you.
It's a commercial tool, but the free demo will let you do everything but save changes to the binary and use engine management features you don't need.