Tesseract with limited words - ocr

Is it possible to recognize a limited set of words in Tesseract?
I need to recognize a set of words (around 200) and want tesseract correct some words to the closest matching ones. In order to do that, I've updated the language models with my words (eng.word-dawg and eng.freq-dawg) and increased the sensitivity by setting language_model_penalty_non_freq_dict_word and language_model_penalty_non_dict_word to large numbers (tried 0.9 and 1.0). However, this does not have any affect on the output.
I have a word (BENZOATE) which tesseract always recognize as UENZOATE. This is weird as I have BENZOATE in my dictionary.

Related

Primefaces Signature component curve fitting to keep String value to less than 4000 characters

I need to persist the value of p:signature in Oracle. I'm using the String value (JSON lines) of the component but often users get too elaborate with their cursive signature and the string exceeds the 4000 character limit on the Oracle field. I implemented a validator to ensure 4k or less but users get frustrated when form kicks back and they have to retry.
Is there a way to minify the json representation of the line data generated but still have the signature still visually look the same? Like a curve fitting function. If I simply truncate the string to 4k, that just truncates the end of the signature.
It would be nice if the component had a way to set precision of the curve or max flag that would automatically keep the JSON representation to less a maximum number of characters.
You are trying to hammer a square peg in a round hole. A signature is as big as it is. Even if you compress it, it's never guaranteed to be less than 4k. You should use the correct column type for your data. In this case either a CLOB (or a BLOB).

Data PreProcessing for BERT (base-german)

I am working on a sentiment analysis solution with BERT to analyze tweets in german. My training dataset of is a class of 1000 tweets, which have been manually annotated into the classes neutral, positive and negative.
The dataset with 10.000 tweets is quite unevenly distributed:
approx.
3000 positive
2000 negative
5000 neutral
the tweets contain formulations with #names, https links, numbers, punctuation marks, smileys like :3 :D :) etc..
The interesting thing is, if I remove them with the following code during Data Cleaning, the F1 score gets worse. Only the removal of https links (if I do it alone) leads to a small improvement.
# removing the punctuation and numbers
def remove_punct(text):
text = re.sub(r'http\S+', '', text) # removing links
text = re.sub(r'#\S+', '', text) # removing referencing on usernames with #
text = re.sub(r':\S+', '', text) # removing smileys with : (like :),:D,:( etc)
text = "".join([char for char in text if char not in string.punctuation])
text = re.sub('[0-9]+', '', text)
return text
data['Tweet_clean'] = data['Tweet'].apply(lambda x: remove_punct(x)) # extending the dataset with the column tweet_clean
data.head(40)
also steps like stop words removal or lemmitazation lead more to a deterioration. Is this because I do something wrong or can the model BERT actually handle such values?
A second question is:
I found other records that were also manually annotated, but these are not tweets and the structure of the sentences and language use is different. Would you still recommend to add these records to my original?
There are about 3000 records in German.
My last question:
Should I reduce the class sizes to the size of the smallest unit and thus balance?
BERT can handle punctuation, smileys etc. Of course, smileys contribute a lot to sentiment analysis. So, don't remove them. Next, it would be fair to replace #mentions and links with some special tokens, because the model will probably never see them again in the future.
If your model is designed for tweets, I suggest that you fine-tune BERT with additional corpus, and after fine-tune with Twitter corpus. Or do it simultaneously. More training samples is generally better.
No, it is better to use class weights instead of downsampling.
Based on this paper (By Adam Ek, Jean-Philippe Bernardy and Stergios Chatzikyriakidis), BERT models outperform BiLSTM in terms of better generalizing to punctuations. Looking at the experiments' results in the paper, I say keep the punctuations.
I couln't find anything solid for smiley faces; However, after doing some experiments with the HuggingFace API, I didn't notice much difference with/without smiley faces.

How can I determine which regular expressions from a list possibly overlap

I have a table of regular expressions that are in an MySQL table that I match text against.
Is there a way, using MySQL or any other language (preferably Perl) that I can take this list of expressions and determine which of them MAY overlap. This should be independent of whatever text may be supplied to the expressions.
All of the expression have anchors.
Here is an example of what I am trying to get:
Expressions:
^a$
^b$
^ab
^b.*c
^batch
^catch
Result:
'^b.*c' and '^batch' MAY overlap
Thoughts?
Thanks,
Scott
Further explanation:
I have a list of user-created regexes and an imported list of strings that are to be matched against the regexes. In this case the strings are "clean" data (ie they are not user-created but imported from another source - they must not change).
When a user adds to the list of regexes I do not want any collisions on either the existing list of strings nor any future strings (which can not be guessed ahead of time - the only constraints being they are ASCII printable characters no longer than 255 characters).
A brute-force method would be to create a "rainbow" table of all of the permutations of strings and each time a regex is added run all of the regexes against the rainbow table. However I'd like to avoid this (I'm not even sure of the cost) and so was wondering aloud as to the possibility of an algorithm that would AT LEAST show which regexes in a list MAY collide.
I will punt on full REs. Even limiting to BREs and/or MySQL-pre-8.0 will be challenging. Here are some thoughts.
If end-anchored and no + or *, the calculate the length. The fixed-length can be used as a discriminator. Also, it could be used for toning back the "brute force" by perhaps an order of magnitude.
Anything followed by + or * gets turned into .* for simplicity. (Re the "may collide" rule.)
Any RE with explicit characters (including those followed by +) becomes a discriminator in some situations. Eg, ^a.*b$ vs ^a.*c$.
For those anchored at the end, reverse the pattern and test it that way. (I don't know how difficult reversing is.)
If you can say that a particular character must be at any position, then use it as a discriminator: ^a.b.*c$ -- a in pos 1; b in pos 3; c at end. Perhaps this can be extended to character classes: ^\w may match, but ^\d and ^a.*\d$ can't.

How to process my images to help Tesseract?

I have some images containing only digits, and a semicolon.
Example:
You can see more here: https://imgur.com/a/54dsl6h
They seem pretty clean and straightforward to me, but Tesseract considers them as empty "pages" (Empty page!!).
I tried both with oem 1 and oem 0 with a character list:
tesseract processed/35.0.png stdout -c tessedit_char_whitelist=0123456789: --oem 0
tesseract processed/35.0.png stdout
What can I do to get Tesseract to recognize the characters better?
Tesseract still gives me pretty bad results overall, but making the text bolder with a simple dilatation algorithm helped a bit.
In the end, since the font is really square, I used a trick, where I defined a bunch of segments for each digits, and depending on which segments intersect, or dont intersect with the digit, I can determine with 99% accuracy which digit it is.

SublimeText 2: What does the number on the left side mean in command palette?

When you search for a file name- on the left it gives you numbers ranging from 0-999. What do these numbers represent? It seems like a search ranking but I'm not sure.
They are a measure of the likelihood that the result will match your search query. This algorithm happens under the hood on most predictive or autocomplete searches (like Google's or Mac's Spotlight search) but the ST2 team decided it would be neat to show you the numeric result.
It takes a few items into consideration. Each one of these criteria adds more value to that result:
of matching characters
How frequently the file has been used
Proximity to the top folder
Whether the letters are in sequence or dispersed through the filename
Whether the filename starts with the matched letters, or the matched letters are somewhere in between.
In the example below, you can see the values go up as "index.html" gradually becomes more accurate. As expected, buried files, or files that are used less frequently get a lower value.