Train Tesseract for specific words - possible?

I want to use Tesseract to extract about 10-20 keywords from a document. The document will contain only English characters/words. What I am interested in is something like "Age: 23". Here "Age" is the keyword I am looking for, and I want to extract the "23" (its value) as well.
The first approach that comes to mind is to extract the whole page as text and then look for the keywords in the recognized text. But in terms of training Tesseract, is there a better approach, given that I know the keywords, that might result in better accuracy?
I am more or less aware of the limitations of Tesseract OCR and am trying to get the most out of it within those limitations. Thanks for all your expert advice.

Try bazaar matching pattern in Tesseract.
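For the baseline approach described in the question (OCR the whole page, then search the recognized text), a minimal sketch using pytesseract and a regular expression might look like the following. The keyword list, the "Key: value" layout, and the file name are assumptions for illustration:

```python
import re

import pytesseract
from PIL import Image

# Hypothetical keywords we expect to find in the document.
KEYWORDS = ["Age", "Name", "Date of Birth"]

def extract_fields(image_path):
    """OCR the whole page, then regex-search for 'Keyword: value' pairs."""
    text = pytesseract.image_to_string(Image.open(image_path))
    fields = {}
    for kw in KEYWORDS:
        # Match e.g. "Age: 23" or "Age - 23", tolerating extra whitespace.
        m = re.search(rf"{re.escape(kw)}\s*[:\-]\s*(\S+)", text, re.IGNORECASE)
        if m:
            fields[kw] = m.group(1)
    return fields

print(extract_fields("form_page.png"))  # e.g. {'Age': '23'}
```

This does not improve recognition accuracy itself; it only makes the post-processing robust to layout noise around the keywords.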

Related

Keyword extraction and keyword-based text classification

Currently I am working on a project that requires keyword extraction, or you could call it keyword-based text classification. The dataset contains three columns: text, keywords, and cc terms. I need to extract keywords from the text and then classify the text based on those keywords. Each row in the dataset has its own keywords, and I want to extract similar kinds of keywords. I want to train a model on the text and keyword columns so that it is able to extract keywords for unseen text. Please help.
Keyword extraction is typically done using TF-IDF scores, simply by setting a score threshold. When training a classifier, though, it does not make much sense to cut the keywords off at a certain threshold: knowing that something is unlikely to be a keyword can also be a valuable piece of information for the classifier.
The simplest way to get the TF-IDF scores for particular words is TfidfVectorizer in scikit-learn, which handles the laborious text preprocessing steps (tokenization, stop-word removal).
You can probably achieve better results by fine-tuning BERT for your classification task (but of course at the expense of much higher computational costs).
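As a rough illustration of the thresholding idea above, here is a minimal TfidfVectorizer sketch (scikit-learn >= 1.0 assumed for get_feature_names_out; the documents and the threshold value are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents; in the question's setting this would be the `text` column.
docs = [
    "deep learning models for keyword extraction",
    "classification of documents based on extracted keywords",
    "keyword based text classification with tf idf",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()

THRESHOLD = 0.3  # arbitrary cut-off; tune on your own data
for i, doc in enumerate(docs):
    row = tfidf[i].toarray().ravel()
    # Keep the terms whose TF-IDF score clears the threshold, highest first.
    keywords = [terms[j] for j in row.argsort()[::-1] if row[j] >= THRESHOLD]
    print(doc, "->", keywords)
```

For the classification step itself you would typically feed the full TF-IDF matrix (not the thresholded keyword lists) into the classifier, for the reason given above.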

How can I consider word dependence along with the semantic information in information retrieval?

I am working on a project in which text retrieval is an important part. There is a reference collection (D), and users can enter queries (Q). Therefore, like a search engine, the goal is to retrieve the documents most related to each query.
I used pre-trained word embeddings to extract semantic knowledge about each word within a text. I then aggregated the words' continuous vectors to represent each text as a single vector (using a mean/sum aggregate function). Next, I indexed the source vectors and retrieved the vectors most similar to the query vector. However, the results were not acceptable. I also tested traditional approaches like the BOW technique. While these approaches work very well in some situations, they do not consider semantic and syntactic information (which makes them perform poorly on some queries).
Based on my investigation, considering word dependence (for example, words co-occurring in the same sentence) along with the semantic information (obtained using the pre-trained word embeddings) could be very useful. However, I do not know how to combine them in a way that is applicable to IR.
It should be noted that:
I'm not looking for paragraph2vec or doc2vec; those require training on a large corpus, which I don't have. Instead, I want to use existing word embeddings.
I'm not looking for a re-ranking technique like a learning-to-rank approach. Instead, I'm looking for a way to take advantage of both syntactic and semantic information in the representation step, i.e., mapping the text or query to a feature vector.
Any help would be appreciated.
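For reference, the mean-aggregation baseline described in the question can be sketched as follows, using gensim's downloader for pre-trained GloVe vectors (the model name, the toy documents, and the query are assumptions):

```python
import numpy as np
import gensim.downloader as api  # pre-trained embeddings, no training required

# Small pre-trained model; any static word-embedding model would do.
wv = api.load("glove-wiki-gigaword-50")

def doc_vector(text):
    """Average the embeddings of the in-vocabulary words (the mean-aggregation baseline)."""
    vecs = [wv[w] for w in text.lower().split() if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

docs = ["the cat sat on the mat", "stock markets fell sharply today"]
query = "feline resting on a rug"

doc_vecs = [doc_vector(d) for d in docs]
q_vec = doc_vector(query)
ranked = sorted(zip(docs, (cosine(q_vec, v) for v in doc_vecs)),
                key=lambda x: x[1], reverse=True)
print(ranked)
```

Any answer to the actual question would replace doc_vector with a representation that also encodes word order or co-occurrence, which is exactly the part the asker is looking for.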

How a CFG and Google n-grams can be combined to generate sentences

I have a valid list of grammar rules and lexical items for generating grammatically correct, yet meaningless, phrases. I want to bring in the Google n-gram corpus to generate only the sensible sentences. Is this feasible, and is there a paper on it? I am using NLTK and the Stanford CoreNLP tools.
No, it is not feasible. Real sentences have structure and meaning dependencies that go well beyond what can be captured in ngrams.
I suppose you're thinking of generating a random structure by expanding your CFG, then using ngrams to select among the possible vocabulary choices. It's a pretty simple thing to code: Chop off your grammar at the part-of-speech level, generate a "sentence" with your CFG as a string of POS tags, and use the ngrams to fill them out one by one.
To work with Google's entire 5-gram collection you'll need a lot of disk space and a huge amount of RAM or some clever programming, so I recommend you experiment with one of the NLTK's tagged corpora (e.g., the Brown corpus with the "universal" tagset). Starting from any text, it is not hard to collect its ngrams, write a random text generator, and confirm that it produces semi-cohesive but undeniably incoherent (and still mostly ungrammatical) nonsense.
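A rough sketch of that procedure with NLTK and the Brown corpus (standing in for the Google 5-grams) might look like this; the toy grammar, whose terminals are universal POS tags, is made up for illustration:

```python
import random
from collections import defaultdict

from nltk import CFG
from nltk.corpus import brown  # requires nltk.download('brown') and nltk.download('universal_tagset')
from nltk.parse.generate import generate

# Toy grammar whose terminals are POS tags rather than words.
grammar = CFG.fromstring("""
S -> NP VP
NP -> 'DET' 'NOUN' | 'DET' 'ADJ' 'NOUN'
VP -> 'VERB' NP
""")

# For each (previous word, tag) pair, collect the words observed in that slot.
by_context = defaultdict(list)
by_tag = defaultdict(list)
prev = "<s>"
for word, tag in brown.tagged_words(tagset="universal"):
    word = word.lower()
    by_context[(prev, tag)].append(word)
    by_tag[tag].append(word)
    prev = word

def fill(tag_sequence):
    """Replace each POS tag with a word, conditioning on the previously chosen word."""
    words, prev = [], "<s>"
    for tag in tag_sequence:
        candidates = by_context.get((prev, tag)) or by_tag[tag]
        prev = random.choice(candidates)  # frequency-weighted via duplicates in the list
        words.append(prev)
    return " ".join(words)

for tag_seq in generate(grammar, n=5):
    print(fill(tag_seq))
```

The output illustrates the point above: the sentences are locally plausible but globally incoherent.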

Tag Cloud Data Backend

I want to be able to generate tag clouds from free text that comes from any number of different sources. For clarity, I'm not talking about how to display a tag cloud once the critical tags/phrases are already discovered; I'm hoping to be able to discover the meaningful phrases themselves... preferably on a PHP/MySQL stack.
If I had to do this myself, I'd start by establishing some kind of index of words/phrases that gives a "normal" frequency for any word/phrase. E.g., "Constantinople" occurs once in every 1,000,000 words on average (normal frequency 0.000001). Then, as I analyze a body of text, I'd find the individual words/phrases (another challenge!), find the frequency of each within the input, and measure it against the expected frequency. Words with the highest ratio against expected frequency get boosted priority in the cloud.
I'd like to believe someone else has already done this, WAY better than I could hope to, but I'll be damned if I can find it.
Any recommendations??
You need an inverted index, used by full-text search engines. A text search library like Lucene or Xapian should help, many such libraries have PHP bindings.
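For completeness, here is a small sketch of the frequency-ratio scoring the question outlines, in Python rather than PHP just to show the idea, with the Brown corpus standing in for the "normal" background frequencies:

```python
import re
from collections import Counter

from nltk.corpus import brown  # requires nltk.download('brown')

# Background ("normal") word frequencies from a general-purpose corpus.
background = Counter(w.lower() for w in brown.words() if w.isalpha())
bg_total = sum(background.values())

def tag_scores(text, min_count=2):
    """Rank words by observed frequency divided by expected background frequency."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = len(words)
    scores = {}
    for w, c in counts.items():
        if c < min_count:
            continue
        observed = c / total
        expected = (background[w] + 1) / (bg_total + len(background))  # add-one smoothing
        scores[w] = observed / expected
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(tag_scores("lucene lucene indexing inverted index index lucene search")[:5])
```

The same statistic can be computed inside MySQL if the background counts are stored in a table, which keeps everything on the PHP/MySQL stack the question mentions.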

Best practices for searching for alternate forms of a word with Lucene

I have a site which is searchable using Lucene. I've noticed from logs that users sometimes don't find what they're looking for because they enter a singular term, but only the plural version of that term is used on the site. I would like the search to find uses of other forms of a word as well. This is a problem that I'm sure has been solved many times over, so what are the best practices for this?
Please note: this site only has English content.
Some approaches I've thought of:
Look up the word in some kind of thesaurus file to determine alternate forms of a given word.
Some examples:
Searches for "car", also add "cars" to the query.
Searches for "carry", also add "carries" and "carried" to the query.
Searches for "small", also add "smaller" and "smallest" to the query.
Searches for "can", also add "can't", "cannot", "cans", and "canned" to the query.
And it should work in reverse (i.e. search for "carries" should add "carry" and "carried").
Drawbacks:
Doesn't work for many new technical words unless the dictionary/thesaurus is updated frequently.
I'm not sure about the performance of searching the thesaurus file.
Generate the alternate forms algorithmically, based on some heuristics (roughly sketched in code below).
Some examples:
If the word ends in "s" or "es" or "ed" or "er" or "est", drop the suffix
If the word ends in "ies" or "ied" or "ier" or "iest", convert to "y"
If the word ends in "y", convert to "ies", "ied", "ier", and "iest"
Try adding "s", "es", "er" and "est" to the word.
Drawbacks:
Generates lots of non-words for most inputs.
Feels like a hack.
Looks like something you'd find on TheDailyWTF.com. :)
Something much more sophisticated?
I'm thinking of doing some kind of combination of the first two approaches, but I'm not sure where to find a thesaurus file (or what it's called, as "thesaurus" isn't quite right, but neither is "dictionary").
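For reference, here is roughly what the heuristics in the second approach would look like in code (and, as noted in the drawbacks, they happily generate non-words):

```python
def guess_forms(word):
    """Naive suffix heuristics from the question; produces many non-words."""
    word = word.lower()
    forms = {word}
    # Strip common suffixes.
    for suffix in ("iest", "ier", "ies", "ied", "est", "er", "es", "ed", "s"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            forms.add(stem)
            if suffix in ("ies", "ied", "ier", "iest"):
                forms.add(stem + "y")
    # Add common suffixes.
    if word.endswith("y"):
        forms.update(word[:-1] + s for s in ("ies", "ied", "ier", "iest"))
    forms.update(word + s for s in ("s", "es", "er", "est"))
    return forms

print(guess_forms("carry"))  # includes 'carries' and 'carried', plus junk like 'carrys'
```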
Consider including the PorterStemFilter in your analysis pipeline. Be sure to perform the same analysis on queries as is used when building the index.
I've also used the Lancaster stemming algorithm with good results. Using the PorterStemFilter as a guide, it is easy to integrate with Lucene.
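The Lucene filters themselves are Java, but the behavioural difference between the two stemmers is easy to see with NLTK's implementations (used here purely for comparison; the word list is arbitrary):

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Compare how each algorithm reduces the same words.
for w in ["cars", "carries", "carried", "smallest", "canned", "organization"]:
    print(f"{w:14} porter={porter.stem(w):10} lancaster={lancaster.stem(w)}")
```

Lancaster is generally the more aggressive of the two, which is worth keeping in mind alongside the later comment that Porter can already be too aggressive for keyword search.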
Word stemming works OK for English; however, for languages where stemming is nearly impossible (like mine), option #1 is viable. I know of at least one such implementation for my language (Icelandic) for Lucene, and it seems to work very well.
Some of those look like pretty neat ideas. Personally, I would just add some tags to the query (query transformation) to make it fuzzy, or you can use the built-in FuzzyQuery, which uses Levenshtein edit distance and would help with misspellings.
With the fuzzy-search 'query tag', Levenshtein distance is also used. Consider a search for 'car': if you change the query to 'car~', it will find 'car', 'cars', and so on. There are other query transformations that should handle almost everything you need.
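A trivial sketch of that query transformation, appending Lucene's fuzzy operator to each bare term (field-qualified terms are left alone; the exact edit-distance behaviour depends on your Lucene version):

```python
def fuzzify(query):
    """Append Lucene's '~' fuzzy operator to each plain term in the query."""
    return " ".join(
        term if term.endswith("~") or ":" in term else term + "~"
        for term in query.split()
    )

print(fuzzify("car rental"))  # -> "car~ rental~"
```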
If you're working in a specialised field (I did this with horticulture) or with a language that doesn't play nicely with normal stemming methods, you could use the query logging to create a manual stemming table.
Just create a word -> stem mapping for all the mismatches you can think of / that people are searching for; then, when indexing or searching, replace any word that occurs in the table with the appropriate stem. Thanks to query caching, this is a pretty cheap solution.
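A minimal sketch of that table applied at index/search time (the entries here are hypothetical):

```python
# Hand-built mapping for terms the standard stemmer gets wrong.
MANUAL_STEMS = {
    "tomatoes": "tomato",
    "cacti": "cactus",
    "seedlings": "seedling",
}

def normalize(tokens):
    """Replace any token found in the table with its stem; leave the rest alone."""
    return [MANUAL_STEMS.get(t.lower(), t.lower()) for t in tokens]

print(normalize(["Cacti", "and", "tomatoes"]))  # -> ['cactus', 'and', 'tomato']
```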
Stemming is a pretty standard way to address this issue. I've found that the Porter stemmer is way too aggressive for standard keyword search; it ends up conflating words that have different meanings. Try the KStemmer algorithm.