Google Cloud Vision Text Detection splitting words based on delimiters. Are there any parameters/properties available to combine? - ocr

Google Cloud OCR splits a word whenever a symbol such as ' or " or x or - occurs in it.
Is there any way to split only when a space is encountered? At the symbol level in FullTextAnnotation I found properties that denote break types, but these symbols don't have any BreakType either.
Example Snapshot-
https://snipboard.io/0UN4ca.jpg
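One possible workaround, sketched here under assumptions rather than taken from the API documentation, is to post-process the response: concatenate consecutive words and cut only where the last symbol of a word carries a whitespace-type detectedBreak. The sketch assumes the JSON layout of the REST response (words containing symbols, each with text and an optional property.detectedBreak.type) and that the input is the word list of a single paragraph. The helper word() and the sample data are hypothetical and exist only for the demonstration.

```python
# Break types treated as real separators. HYPHEN is left out on purpose so that
# a word hyphenated across lines is joined with its continuation.
SPACE_BREAKS = {"SPACE", "SURE_SPACE", "EOL_SURE_SPACE", "LINE_BREAK"}


def merge_words(words):
    """Join consecutive words unless a whitespace-type break separates them."""
    merged, current = [], ""
    for w in words:
        symbols = w.get("symbols", [])
        if not symbols:
            continue
        current += "".join(s["text"] for s in symbols)
        # The break, if any, is attached to the last symbol of the word.
        last = symbols[-1]
        break_type = last.get("property", {}).get("detectedBreak", {}).get("type")
        if break_type in SPACE_BREAKS:
            merged.append(current)
            current = ""
    if current:
        merged.append(current)
    return merged


def word(text, break_type=None):
    """Hypothetical helper that builds a word dict shaped like the JSON response."""
    symbols = [{"text": ch} for ch in text]
    if break_type:
        symbols[-1]["property"] = {"detectedBreak": {"type": break_type}}
    return {"symbols": symbols}


# Made-up paragraph in which "don't" and "re-use" were split at the symbols.
paragraph_words = [
    word("don"), word("'"), word("t", "SPACE"),
    word("re"), word("-"), word("use", "LINE_BREAK"),
]
print(merge_words(paragraph_words))
```

Bounding boxes are not merged in this sketch; if they are needed, the boxes of the joined words would have to be combined separately.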

Related

What does single character dot mean in google search?

I was reading a book about GHDB and I came across this:
intitle:index.of
results:36,000
intitle:"index of"
results:18,000
So, what does the dot character mean in search?
Most characters outside of the standard a-z, A-Z, 0-9 are ignored by Google search. The single period, '.', is one of them. In Gmail search, however, a single period acts as a wildcard catch-all term, so for example searching 'color.' would return mails with 'color' and also mails with 'colorful' or 'colorless' in them.
You can find more about Google Search Operators here, though this is not a typical question for Stack Overflow.

How to convert text to number in Google Script

I am a new user of Google Script and scripts in general.
My company has Office licences and for strategic reasons it wants to use Google services.
My problem is that we extract various data containing numbers from a piece of software. When we paste this data into a spreadsheet, the negative numbers are not recognized because they look like this:
(screenshot)
I would like to apply the script only to a selection in the active spreadsheet, so that the text "1 234,56-" becomes the number "-1 234,56". The selection may also contain positive numbers such as "1 234,56".
Thank you for your help.
Best regards,
Anthony.
=VALUE(REGEXREPLACE(REGEXREPLACE(TEXT(A1; "0000.00"); "\s"; ""); "(.*?)-"; "-$1"))
This will first convert the number to a text string, then remove the whitespace character, then move the - sign in front of the number, and lastly convert it back to a numeric value.
Before:
1 234,56-
352,90
2 342,89-
24,0
45,00-
After (and you can use Sheets' number formats to further alter if needed):
−1234,56
352,9
−2342,89
24
−45
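The same three steps (strip whitespace, move the trailing minus to the front, parse) can also be sketched outside Sheets. The Python below is only an illustration of the logic, not Apps Script; the function name parse_trailing_minus is made up, and it assumes the comma is the decimal separator, as in the sample data.

```python
import re


def parse_trailing_minus(text: str) -> float:
    """Convert strings such as '1 234,56-' into the float -1234.56."""
    # Remove every whitespace character, including non-breaking spaces.
    compact = re.sub(r"\s", "", text)
    # A trailing '-' marks a negative value; move it to the front.
    if compact.endswith("-"):
        compact = "-" + compact[:-1]
    # Assumption: ',' is the decimal separator, as in the sample data.
    return float(compact.replace(",", "."))


samples = ["1 234,56-", "352,90", "2 342,89-", "24,0", "45,00-"]
converted = [parse_trailing_minus(s) for s in samples]
print(converted)
```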
If your range is not too long, you can try something like this:
How to replace text in Google Spreadsheet using App Scripts?
I had the same problem with a file of 5k lines where I needed to replace "." with ","; it works but takes a bit of time :)
I hope this helps.
Best regards

how to convert/match a handwritten list of names? (HWR)

I would like to see if I can scan a sign-in sheet for a class. The good news is I know 90% of the names that might be written.
My idea was to use tesseract to parse an image of names, and then use the Levenshtein algorithm to compare each line with a list of names in my database. If I get a reasonably close match, then that name is right.
Does this approach sound like a good one? If not, other ideas?
I tried using tesseract on a sample sheet (see below)
I used:
tesseract simple.png -psm 4 outtxt
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
I am assuming it didn't like line 2 because I went below the line.
The results I got were:
1.. AM: (harm;
l. ’E (J 22 a 00k
2‘ wau \\) [HQ
4. KIM TAYLOE
5. LN] Davis
6‘ Mzflé! Ha K
Obviously not the greatest. My guess is the distance matches for 4 & 5 would work, but the rest are not even close.
I have control of my sign-in sheet, but not the handwriting of folks coming in, so if there are any changes to the sheet that would help, please let me know.
Since your goal is to get names only, I would suggest restricting tessedit_char_whitelist to the characters you expect, such as uppercase letters, digits and the period ("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789."), so that you will not get unexpected characters like \\) [ in the output.
Your initial approach of calculating the Levenshtein distance is fine if you succeed in extracting text from the handwritten image (which is a hard task for tesseract).
I would also suggest running some preprocessing on your image. For example, you can remove the horizontal lines and extract text ROIs around them. In the best case you will be able to extract separated characters, but even if you don't, you will get better results and will be able to distinguish the resulting names line by line.
You should also try the other recommended steps for improving output quality, which you can find in the Tesseract OCR wiki (link)
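The matching step itself can be sketched as follows. This is a minimal illustration, not a tuned solution: the roster is hypothetical, the function best_match and its max_ratio threshold of 0.4 are assumptions chosen for the demo, and the OCR lines are taken from the output shown in the question.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution
        prev = curr
    return prev[-1]


def best_match(ocr_line, known_names, max_ratio=0.4):
    """Return the closest known name, or None if nothing is close enough."""
    # Keep only letters and spaces; this drops line numbers and OCR noise.
    cleaned = "".join(ch for ch in ocr_line if ch.isalpha() or ch == " ")
    cleaned = cleaned.strip().lower()
    if not cleaned:
        return None
    name, dist = min(((n, levenshtein(cleaned, n.lower())) for n in known_names),
                     key=lambda pair: pair[1])
    # Assumption: accept a match only if the distance is small relative
    # to the length of the candidate name.
    return name if dist <= max_ratio * len(name) else None


roster = ["Kim Taylor", "Lori Davis"]  # hypothetical list of expected names
for line in ["4. KIM TAYLOE", "l. ’E (J 22 a 00k"]:
    print(repr(line), "->", best_match(line, roster))
```

The threshold trades false matches against missed ones, so it is worth checking against a few real sheets before relying on it.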

How to train tesseract and how to recognize multiple columns

I have the task of converting a PDF with images into a txt or csv file to store in a database. I am trying to use OCR on images like the one attached.
The results are as poor as the following:
`20—0
¿ ABÚEADD LDIDI ALBARH, JDSE
AHTÚHIÚ
—- EnlúndeLarreájzegm25- Sºt] . . . . . 944 355019
: ABDGADD 5E'I'IEH ÁLUAREI 5EUERIHD`
Of special importance is the phone number (944 355019), it seems close to correct but it still has wrong digits which makes the whole thing useless.
After much reading I still do not know how to train tesseract. I am following these instructions, among others, which leave me with doubts such as:
It talks about getting a sample of the fonts to train. I have an image, so how do I get the exact font to somehow generate the training data?
More often than not the text ends up somewhere other than where you would expect to find it. I just read that this is because tesseract does OCR on a per-column basis (and then I read that it does not, so I am confused). So, which one is it, and how do I make it write the text horizontally?

Can an OCR run in a split-second if it is highly targeted? (Small dictionary)

I am looking for an open source ocr (maybe tesseract) that uses a dictionary to match words against. For example, I know that this ocr will only be used to search for certain names. Imagine I have a master guest list (written) and I want to scan this list in under a second with the ocr and check this against a database of names.
I understand that a traditional ocr can attempt to read every letter and then I could just cross reference the results with the 100 names, but this takes too long. If the ocr was just focusing on those 100 words and nothing else then it should be able to do all this in a split second. i.e. There is no point in guessing that a word might be "Jach" since "Jach" isn't a name in my database. The ocr should be able to infer that it is "Jack" since that is an actual name in the database.
Is this possible?
It should be possible. Think of it this way: instead of having your OCR look for 'J', it could look for 'Jack' directly, treating the whole word as an individual symbol.
So when you train/calibrate your OCR, train it with images of whole words, just as you would for an individual symbol.
(If this feature is not directly available in your OCR, first map images of whole words to unique symbols and later transform each symbol into the final word string.)
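The symbol-mapping workaround in the last paragraph can be sketched like this. It covers only the bookkeeping around the OCR engine, not the training itself. The guest list, the function names and the use of Unicode Private Use Area code points as stand-in symbols are all assumptions made for illustration.

```python
# Private Use Area code points serve as stand-in "symbols" for whole words.
PUA_START = 0xE000


def build_symbol_tables(names):
    """Assign one unique placeholder symbol to each known word."""
    word_to_symbol = {name: chr(PUA_START + i) for i, name in enumerate(names)}
    symbol_to_word = {sym: name for name, sym in word_to_symbol.items()}
    return word_to_symbol, symbol_to_word


def decode(symbols, symbol_to_word):
    """Turn recognized placeholder symbols back into the final word strings."""
    return [symbol_to_word[s] for s in symbols if s in symbol_to_word]


guest_list = ["Jack", "Maria", "Chen"]  # hypothetical master list
word_to_symbol, symbol_to_word = build_symbol_tables(guest_list)

# Stand-in for what an OCR engine trained on whole-word images might emit.
recognized = word_to_symbol["Jack"] + word_to_symbol["Chen"]
print(decode(recognized, symbol_to_word))
```

When training, each whole-word image would be labelled with its placeholder symbol from word_to_symbol, and decode() would be applied to whatever the engine returns.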