SSIS Term Extraction for non-English text - sql-server-2008

I started learning data mining with SQL Server and noticed that SQL Server Integration Services can perform Term Extraction from English text. However, I'm interested in performing text mining on non-English text, specifically Ukrainian. So these are my questions:
Is there a way to implement term extraction from non-English text in SSIS? If so, any suitable resources would be appreciated :)
If the answer to the first question is yes, I would also like to know whether any custom solutions for non-English text already exist.
Thanks in advance:)

The documentation states that the term extraction transformation supports only English, and there's no mention of a mechanism for adding other languages.
Therefore, I would assume that you need to find some sort of tool that can do term extraction on Ukrainian text and work out how to integrate it into SSIS. Finding a tool like that isn't really an SSIS issue; it's a general NLP or linguistics question, so you might get a better answer in another forum.
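To make the "external tool" route a bit more concrete, here is a minimal sketch of what naive, frequency-based term extraction for Ukrainian could look like in a standalone script that SSIS invokes (for example via an Execute Process Task). Everything in it (the stopword list, the extract_terms helper, the sample text) is an illustrative assumption, not anything SSIS or SQL Server provides; a real solution would add a proper Ukrainian lemmatizer and part-of-speech tagger.

```python
# Illustrative only: naive frequency-based term extraction for Ukrainian text.
# The stopword list, function name, and sample text are assumptions, not part
# of SSIS; a real solution would add lemmatization and POS tagging.
import re
from collections import Counter

# A tiny example stopword list; a real one would be much larger.
STOPWORDS = {"і", "та", "в", "у", "на", "з", "до", "що", "це", "не"}

def extract_terms(text, top_n=20):
    """Return the most frequent Ukrainian word forms as candidate terms."""
    # Keep runs of Cyrillic letters, including Ukrainian-specific ones, plus apostrophes.
    tokens = re.findall(r"[А-Яа-яЄєІіЇїҐґ']+", text)
    tokens = [t.lower() for t in tokens if t.lower() not in STOPWORDS and len(t) > 2]
    return Counter(tokens).most_common(top_n)

if __name__ == "__main__":
    sample = "Видобування термінів з українського тексту: терміни, текст, аналіз тексту."
    for term, count in extract_terms(sample):
        print(term, count)
```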

Related

Why did IATA choose XML over JSON for NDC services?

Maybe it looks silly... :)
But I am trying to find out what the compelling reason was for IATA to choose XML over JSON.
I found a lot of documentation on NDC XML, but couldn't find a specific feature of XML that isn't possible in JSON, considering the advantages of using JSON.
I could use some help in understanding it..
Thanks in advance..
A "why" question like this can be interpreted in two ways: (a) is there any historical evidence to show who made the decision, when they made it, and what their arguments were for making it? Or (b) can you think of any good reason why intelligent people might have made this choice?
I can't answer (a), but for (b) you have to look at the timeline. With something as big as IATA, it's likely they've been talking about this for at least 10 and maybe 20 years. Ten years ago, JSON was being promoted as "lightweight" - it didn't carry all the baggage of schemas, validation, transformation and query languages that went with XML. If you were in an airline, you didn't think of that as baggage, you thought of it as essential infrastructure. Being "lightweight" simply isn't a benefit in that world; on the contrary, the word is almost a suggestion that it's not up to doing heavyweight tasks.
Frankly (and at the risk of straying towards question (a)) I think it's very unlikely that the question of using JSON ever came up; they would all have been far too heavily committed to XML before anyone ever took JSON seriously. Don't forget that in 2005 XML was delivering things no one dreamt possible ten years earlier: a robust and rigorous data syntax, completely standardised, with full Unicode support, available at low cost on all platforms, with lots of tools around to support declarative processing. JSON was a new kid on the block, threatening to disrupt the consensus and fragment the industry, and for the people in this kind of community, that wasn't seen as something they needed or wanted.

Custom translator - How can I train the machine to recognize the right translation solution (synonyms)?

I'm pretty new with Custom Translator and I'm working on a fashion-related EN_KO project.
There are many cases where a single English term has two possible translations into Korean. An example: if "fastening" relates to "bags, backpacks..." it should be 잠금, but if it relates to "clothes, shoes..." it should be 여밈.
I'd like to train the machine to recognize these differences. Could it be useful to upload a phrase dictionary? Any ideas? Thanks!
The purpose of training a custom translation system is to teach it how to translate terms in context.
The best way to teach the system how to translate is training with parallel documents of full sentence prose: the same document in two languages. A translation memory extract in a TMX or XLIFF file is the best material, but many other document formats are suitable as well, as long as you have both languages. Have at least 10000 sentences in both languages, upload to http://customtranslator.ai, and build a custom system with it.
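Since everything hinges on having enough parallel sentences, a quick way to sanity-check a translation memory before uploading is to extract and count the EN/KO segment pairs. Below is a rough, self-contained sketch for a simple TMX file; the file name, helper name, and language codes are assumptions for illustration.

```python
# Rough sketch: pull EN/KO sentence pairs out of a TMX translation memory
# so they can be reviewed or counted before uploading to Custom Translator.
import xml.etree.ElementTree as ET

# Qualified name that ElementTree uses for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def read_tmx_pairs(path, src_lang="en", tgt_lang="ko"):
    pairs = []
    root = ET.parse(path).getroot()
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None and seg.text:
                segs[lang.split("-")[0]] = seg.text.strip()
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs

if __name__ == "__main__":
    pairs = read_tmx_pairs("fashion_memory.tmx")  # hypothetical file name
    print(f"{len(pairs)} sentence pairs found (aim for at least 10000)")
```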
If you have documents in Korean that are representative of the terminology and style you want to achieve, without an English match, you can automatically translate those to English, and add to the training material as parallel documents. Be sure to not use the automatically translated documents in the other direction.
A phrase dictionary is of limited help, because it is unaware of context. It is useful only in bootstrapping your custom system or for very rare terms where you cannot find or create a sentence.

fix misspelled words in a corpus without dictionary

We have a history of conversations between humans (any language, any vocabulary), so there are a lot of spelling errors:
"hellobb do u hav skip?" => "hello baby, do you have skype?"
Before running a deep learning task against this data set (finding synonyms, etc.), I would like to fix these errors.
Is this a good idea? I've never worked with such bad-quality data. I'm wondering if there is a "magic solution" to achieve this.
Otherwise, I plan to use (see the sketch after this list):
word embeddings (word2vec) to check if good and bad words are similar
distance function between words
if wordA is less frequent than wordB, then fix(wordA) = wordB
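For what it's worth, that plan can be prototyped with nothing more than word counts and an edit-distance function. The sketch below is self-contained and illustrative only: the thresholds and the toy corpus are made up, and it skips the word2vec similarity check, which you would want on a real corpus to filter out false matches.

```python
# Sketch of the plan above: map rare ("less frequent") words to similar,
# more frequent words. Thresholds and the toy corpus are illustrative only.
from collections import Counter

def edit_distance(a, b):
    """Plain Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def build_fix_table(tokens, min_count=50, max_dist=2):
    """fix(wordA) = wordB when wordA is rare and wordB is a frequent near-match."""
    counts = Counter(tokens)
    frequent = [w for w, c in counts.items() if c >= min_count]
    fixes = {}
    for word, count in counts.items():
        if count >= min_count:
            continue
        # Brute-force comparison; restrict candidates (e.g. by length) on real data.
        candidates = [(edit_distance(word, w), -counts[w], w) for w in frequent]
        if candidates:
            dist, _, best = min(candidates)
            if dist <= max_dist:
                fixes[word] = best
    return fixes

tokens = "hello do you hav skype hello do you have skype have skype".split()
print(build_fix_table(tokens, min_count=2))  # -> {'hav': 'have'}
```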
There is no magic solution at the moment that is guaranteed to fix all the misspellings in your text, but here are some options you can consider:
Dictionary-based approach. I found Hunspell very handy in this case. It uses language modeling and Levenshtein distance to suggest the correct spelling. It is available for many natural languages and has bindings for many programming languages. Although it is a dictionary-based approach, it is superior to many more sophisticated approaches, and it is used in the vast majority of word-processing applications.
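As a rough illustration, the pyhunspell bindings expose the dictionary check and the suggestion list directly; the package name and the dictionary paths below vary by system and are assumptions here.

```python
# Sketch using the pyhunspell bindings (dictionary paths are system-specific).
import hunspell

checker = hunspell.HunSpell("/usr/share/hunspell/en_US.dic",
                            "/usr/share/hunspell/en_US.aff")

def correct(word):
    if checker.spell(word):          # word is already in the dictionary
        return word
    suggestions = checker.suggest(word)
    return suggestions[0] if suggestions else word

print([correct(w) for w in "helo do u hav skype".split()])
```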
Statistical and traditional approach. Another possible solution is to develop your own statistical models, such as a language model. Training a language model on a large corpus, at the character level and/or word level, can find many misspellings in the text. Many speech recognition systems and search engines use language modeling at their core to fix misspellings.
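To make the character-level idea concrete, here is a toy sketch that scores words by the average log-probability of their character trigrams and flags unlikely ones as probable misspellings; the corpus, smoothing, and padding scheme are arbitrary choices for illustration.

```python
# Toy character-level language model: words whose character trigrams are
# unlikely under the corpus get a low average log-probability.
import math
from collections import Counter

def train_char_trigrams(words):
    counts, context = Counter(), Counter()
    for w in words:
        padded = f"^^{w}$"
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
            context[padded[i:i + 2]] += 1
    return counts, context

def avg_logprob(word, counts, context, vocab_size=30):
    padded = f"^^{word}$"
    n = len(padded) - 2
    total = 0.0
    for i in range(n):
        tri, bi = padded[i:i + 3], padded[i:i + 2]
        # Add-one smoothing so unseen trigrams do not give -inf.
        total += math.log((counts[tri] + 1) / (context[bi] + vocab_size))
    return total / n

corpus = "hello do you have skype hello you have a nice day".split()
counts, context = train_char_trigrams(corpus)
for w in ["hello", "helo", "skip", "skype"]:
    print(w, round(avg_logprob(w, counts, context), 2))
```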
Deep learning approach. If you look at NLPProgress.com, most of the state-of-the-art research uses seq2seq models to attack the grammatical error correction problem. The main intuition behind these models is to train a neural network on pairs of sentences, from which the network learns how to fix the errors. These approaches require quite a lot of sentence pairs to give reliable results. If the available corpora do not fit your needs, you can generate your own misspellings, e.g. by replacing some tokens in your text.
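For that last point, generating synthetic (noisy, clean) training pairs is straightforward; the corruption operations and the noise rate in this sketch are arbitrary illustrative choices.

```python
# Sketch: generate synthetic (noisy, clean) sentence pairs by randomly
# corrupting tokens, to augment seq2seq training data.
import random

def corrupt_word(word):
    if len(word) < 3:
        return word
    i = random.randrange(len(word) - 1)
    op = random.choice(["swap", "drop", "repeat"])
    if op == "swap":      # transpose two adjacent characters
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "drop":      # delete one character
        return word[:i] + word[i + 1:]
    return word[:i] + word[i] + word[i:]  # duplicate one character

def make_pair(sentence, noise_rate=0.3):
    noisy = [corrupt_word(w) if random.random() < noise_rate else w
             for w in sentence.split()]
    return " ".join(noisy), sentence

random.seed(0)
print(make_pair("hello baby do you have skype"))
```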

handwriting recognition with simple training

I've been reading (and trying) OCR programs suggested in previous answers but I'm still without a clear answer to my problem.
I need to recognize handwritten English text. The text would be multiple lines, but each line is only one or two words long. The text is from a different person each time. I could ask that person to provide a training file (e.g. with the alphabet and the digits 0-9), but I cannot really ask for much more complicated training than this.
I need to integrate the recognition as part of another (Java) application but the solution doesn't need to be Java. I can just execute it from Java and get the results from a text file.
Any recommendations?
I've already tested Tesseract (bad results without training, and training looks quite complex). Java OCR looked like the perfect solution (simple training, open source, and Java), but it doesn't work well even with their own examples (has anybody had a better experience?). GOCR does not seem very active.
Of course I prefer free solutions but this is not a MUST (though the problem I see with a commercial option is that I must be able to integrate it in my own app which will be offered as SaaS)
From my experience ABBYY is one of the best for handwriting recognition, even without training. (It's possibly one of the most expensive too, though...) They have an SDK for Java.
http://www.abbyy.com
With a free trial, it's definitely worth a look!
I am on the lookout for handwritten text recognition software. So far the only one giving better results than even ABBYY 11 has been SimpleOCR, using the same text for both; it is freeware for OCR but only a 14-day trial for HCR!
I know I am answering after nearly 6 years. But if anyone's still looking, try using TensorFlow. Their website has a simple example for handwritten digit recognition (MNIST). You can use this example and implement it for handwritten alphabet recognition (you need training data for this; I used NIST Special Database 19 to get it).
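For reference, the kind of example referred to looks roughly like the sketch below: a minimal Keras classifier on MNIST. The layer sizes and epoch count are arbitrary choices, and for alphabet recognition you would swap in NIST SD19 images and change the number of output classes.

```python
# Minimal MNIST classifier sketch in the style of the TensorFlow tutorials.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation="softmax"),  # 10 digits; use 26+ for letters
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5)
print(model.evaluate(x_test, y_test))
```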

OCR lib for math formulas

I need an open OCR library which is able to scan complex printed math formulas (for example some formulas which were generated via LaTeX). I want to get some LaTeX-like output (or just some AST-like data).
Is there something like this already? Or are current OCR techniques only able to parse line-oriented text?
(Note that I also posted this question on Metaoptimize because some people there might have additional knowledge.)
The problem was also described by OpenAI as im2latex.
SESHAT is an open-source system written in C++ for recognizing handwritten mathematical expressions. SESHAT was developed as part of a PhD thesis at the PRHLT research center at Universitat Politècnica de València.
An online demo: http://cat.prhlt.upv.es/mer/
The source: https://github.com/falvaro/seshat
Seshat is an open-source system for recognizing handwritten mathematical expressions. Given a sample represented as a sequence of strokes, the parser is able to convert it to LaTeX or other formats like InkML or MathML.
According to the answers on Metaoptimize and the discussion on the Tesseract mailinglist, there doesn't seem to be an open/free solution yet which can do that.
The only solution which seems to be able to do it (but I cannot verify as it is Windows-only and non-free) is, like a few other people have mentioned, the InftyProject.
InftyReader is the only one I'm aware of. It is NOT free software (it seems the money goes to a non-profit org, IIRC).
http://www.sciaccess.net/en/InftyReader/
I don't know why PDF can't have metadata in LaTeX. As in: put the LaTeX equation in it! Is that so hard? (I don't know anything about PDF syntax, but I imagine it can be done.)
LaTeX syntax is THE ONE TRIED AND TRUE STANDARD for mathematics notation. It seems amazingly stupid that the folks who produced MathML and other stuff didn't take this into consideration. InftyReader generates MathML or LaTeX syntax.
If I want HTML (pure) I then use TTH to read the LaTeX syntax. Just works.
ABBYY FineReader (a great OCR program) claims you can train the software for Math, but this is immensely braindead (who has the time?)
And Unicode has lots of math symbols. That today's OCR readers can't grok them shows the sorry state of software and the brain deficit in this activity.
As to "one symbol at a time", TeX obviously has rules as to where it will place symbols. They can't write software that know those rules?! TeX is even public domain! They can just "use it" in their comercial products.
Check out "Web Equation." It can convert handwritten equations to LaTeX, MathML, or SymbolTree. I'm not sure if the engine is open source.
Considering that current technologies read one symbol at a time (see http://detexify.kirelabs.org/classify.html), I doubt there is an OCR for full mathematical equations.
Infty works fairly well. My former company integrated it into an application that reads equations out loud for blind people and is getting good feedback from users.
http://www.inftyproject.org/en/download.html
Since the output from math OCR for complex formulas will likely have bugs -- even humans have trouble with it -- you will have to proofread the results, at least if they matter. The (human) proofreader will then have to correct the results, meaning you need a math formula editor. Given the effort needed from humans and the probably limited corpus of complex formulas, you might find it easier to assign the task to humans in the first place.
As a research problem, reading math via OCR is fun -- you need a formalism for 2-D grammars plus a symbol recognizer.
In addition to references already mentioned here, why not google for this? There is work that was done at Caltech, Rochester, U. Waterloo, and UC Berkeley. How much of it is ready to use out of the box? Dunno.
As of August 2019, there are a few options, depending on what you need:
For converting printed math equations/formulas to LaTex, Mathpix is absolutely the best choice. It's free.
For converting handwritten math to LaTex or printed math, MyScript is the best option, although its app costs a few dollars.
You know, there's an application in Win7 just for that: Math Input Panel. It even handles handwritten input (it's actually made for this). Give it a shot if you have Win7, it's free!
There is a great short video: http://www.youtube.com/watch?v=LAJm3J36tLQ
explaining how you can train FineReader to recognize math formulas. If you already use FineReader, it's better to stick with one tool. Of course, it is not freeware :(