Can Tesseract OCR recognize subscripts and superscripts?

I have problems with the general recognition of subscript and superscript in text fragments.
Example image: (not included here)
I used Tesseract 4.1.1 with the training data available at https://github.com/tesseract-ocr/tessdata_best. The numerous options were left at their default values, except:
tessedit_create_hocr = 1 (to get the result as hOCR)
hocr_font_info = 1 (to get additional font info such as font size)
hocr_char_boxes = 1 (to get character-based results)
The language was set to eng. Neither with page segmentation mode 3 (PSM_AUTO_OSD) nor 11 (PSM_SPARSE_TEXT) nor 12 (PSM_SPARSE_TEXT_OSD) was the subscript/superscript recognized correctly.
In the output, the sub/sup fragments were all more or less wrong:
"SubtextSub" is recognized as "Subtextsu,"
"SuptextSub" is recognized as "Suptexts?"
"P0" is recognized as "Po"
"P100" is recognized as "P1go"
"a2+b2" is recognized as "a+b?"
Using Tesseract for OCR, is there a way to ...
optimize subscript/superscript handling?
get information about recognized subscripts/superscripts (in the hOCR output, ideally for each character)?

Working on the quality of the image, as suggested in other questions/answers on this topic, didn't really change anything.
Following these two links from the tesseract-ocr Google group, at first it really seemed to be a question of training:
link1 and link2.
But after doing some experiments I found out that the OEM_DEFAULT OCR engine mode just doesn't surface the needed information. I found a partial solution to the problem. Partial, because I now get most of the sub/sup information and the recognized characters are right in most cases, but not for all characters.
Using the OEM_TESSERACT_ONLY OCR engine mode (i.e. the legacy engine) and some API methods provided by Tess4J, I came up with the following Java test class:
public class SubSupEvaluator {

    public void determineSubSupCharacters(BufferedImage image) {
        //1. initialize Tesseract and set image infos
        TessBaseAPI handle = TessAPI1.TessBaseAPICreate();
        try {
            int bpp = image.getColorModel().getPixelSize();
            int bytespp = bpp / 8;
            int bytespl = (int) Math.ceil(image.getWidth() * bpp / 8.0);
            TessBaseAPIInit2(handle, new File("./tessdata/").getAbsolutePath(), "eng", TessOcrEngineMode.OEM_TESSERACT_ONLY);
            TessBaseAPISetPageSegMode(handle, TessPageSegMode.PSM_AUTO_OSD);
            TessBaseAPISetImage(handle, ImageIOHelper.convertImageData(image), image.getWidth(), image.getHeight(), bytespp, bytespl);

            //2. start the actual OCR run
            TessBaseAPIRecognize(handle, null);

            //3. iterate over the result character-wise
            TessResultIterator ri = TessBaseAPIGetIterator(handle);
            TessPageIterator pi = TessResultIteratorGetPageIterator(ri);
            TessPageIteratorBegin(pi);
            do {
                //determine the character
                Pointer ptr = TessResultIteratorGetUTF8Text(ri, TessPageIteratorLevel.RIL_SYMBOL);
                String character = ptr.getString(0);
                TessDeleteText(ptr); //release memory

                //determine position information
                IntBuffer leftB = IntBuffer.allocate(1);
                IntBuffer topB = IntBuffer.allocate(1);
                IntBuffer rightB = IntBuffer.allocate(1);
                IntBuffer bottomB = IntBuffer.allocate(1);
                TessPageIteratorBoundingBox(pi, TessPageIteratorLevel.RIL_SYMBOL, leftB, topB, rightB, bottomB);

                //write the info to the console
                System.out.println(String.format("%s - position [%d %d %d %d], subscript: %b, superscript: %b", character,
                        leftB.get(), topB.get(), rightB.get(), bottomB.get(),
                        TessAPI1.TessResultIteratorSymbolIsSubscript(ri) == TessAPI1.TRUE,
                        TessAPI1.TessResultIteratorSymbolIsSuperscript(ri) == TessAPI1.TRUE));
            } while (TessPageIteratorNext(pi, TessPageIteratorLevel.RIL_SYMBOL) == TessAPI1.TRUE);
        } finally {
            TessBaseAPIDelete(handle); //release memory
        }
    }
}
The legacy mode only works with the 'normal' training data. Using the '-best' training data produces an error.
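For completeness, the legacy-engine run can also be reproduced without Tess4J, e.g. from Python with pytesseract; this is just a rough sketch (the image path is made up, and --oem 0 still needs the non-'best' eng.traineddata). Note that the per-character subscript/superscript flags used in the Java class above come from the result-iterator API, which pytesseract, being a command-line wrapper, does not expose.

import pytesseract
from PIL import Image

img = Image.open("subsup_sample.png")  # hypothetical sample image

# --oem 0 selects the legacy engine; hocr_font_info/hocr_char_boxes mirror the
# settings from the question. The hOCR output is returned as bytes.
hocr = pytesseract.image_to_pdf_or_hocr(
    img,
    lang="eng",
    extension="hocr",
    config="--oem 0 --psm 3 -c hocr_font_info=1 -c hocr_char_boxes=1",
)
print(hocr.decode("utf-8"))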

There is very little information on this topic.
One option to enhance sub/superscript character recognition (even if it does not give you the positions themselves) is to preprocess the image, e.g. with cv2 or PIL (Pillow), and then run Tesseract on it; a rough sketch of that idea follows after the links below.
See
How to detect subscript numbers in an image using OCR?
Related (but otherwise not answering the question):
https://www.mail-archive.com/tesseract-ocr@googlegroups.com/msg19434.html
https://github.com/tesseract-ocr/tesseract/blob/master/src/ccmain/superscript.cpp
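A minimal sketch of that preprocessing idea (the file name, scale factor and thresholding choices are assumptions on my part, not something taken from the linked answer):

import cv2
import pytesseract

# Hypothetical input file; the goal is to enlarge the small sub/superscript glyphs
# enough that Tesseract treats them as ordinary characters.
img = cv2.imread("formula.png", cv2.IMREAD_GRAYSCALE)
img = cv2.resize(img, None, fx=3, fy=3, interpolation=cv2.INTER_CUBIC)
_, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Plain-text output; this tends to improve the recognized characters but, as noted
# above, does not tell you which of them were sub- or superscript.
print(pytesseract.image_to_string(img, lang="eng", config="--psm 6"))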

What do you think about getting Tesseract to recognize single letters?
Tesseract does not recognize single characters
I tried it with the option --psm 10
tesseract imTstg.png out5 --psm 10
but it did not seem to work. I am thinking about just running YOLO to detect the single letters.
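For reference, this is roughly the combination I would try next before switching to a detector, sketched with pytesseract; the scale factor and the whitelist are guesses, and whitelist support depends on the engine/version you run.

import cv2
import pytesseract

# Same image as in the command above; upscale so the single glyph is not too small.
img = cv2.imread("imTstg.png", cv2.IMREAD_GRAYSCALE)
img = cv2.resize(img, None, fx=4, fy=4, interpolation=cv2.INTER_CUBIC)

# --psm 10: treat the image as a single character. Restricting the alphabet can help
# if the expected characters are known.
config = "--psm 10 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
print(pytesseract.image_to_string(img, config=config))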

Understanding DetectedBreak in google OCR full text annotations

I am trying to convert the full-text annotations of a Google Vision OCR result to line level and word level; the result is organized in a Block, Paragraph, Word and Symbol hierarchy.
However, when converting symbols to word text and words to line text, I need to understand the DetectedBreak property.
I went through this documentation, but I did not understand a few of them.
Can somebody explain what the following breaks mean? I only understood LINE_BREAK and SPACE.
EOL_SURE_SPACE
HYPHEN
LINE_BREAK
SPACE
SURE_SPACE
UNKNOWN
Can they be replaced by either a newline character or a space?
The link you provided has the most detailed explanation available of what each of these stands for. I suppose the best way to get a better understanding is to run OCR on different images and compare the response with what you see on the corresponding image. The following Python script runs DOCUMENT_TEXT_DETECTION on an image saved in GCS and prints all detected breaks except the ones you have no trouble understanding (LINE_BREAK and SPACE), along with the word immediately preceding each break, to make the comparison easier.
import sys
import os
from google.cloud import storage
from google.cloud import vision

def detect_breaks(gcs_image):
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/json'
    client = vision.ImageAnnotatorClient()
    feature = vision.types.Feature(
        type=vision.enums.Feature.Type.DOCUMENT_TEXT_DETECTION)
    image_source = vision.types.ImageSource(image_uri=gcs_image)
    image = vision.types.Image(source=image_source)
    request = vision.types.AnnotateImageRequest(
        features=[feature], image=image)
    annotation = client.annotate_image(request).full_text_annotation
    breaks = vision.enums.TextAnnotation.DetectedBreak.BreakType
    word_text = ""
    for page in annotation.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    for symbol in word.symbols:
                        word_text += symbol.text
                        if symbol.property.detected_break.type:
                            if symbol.property.detected_break.type in (breaks.SPACE, breaks.LINE_BREAK):
                                word_text = ""
                            else:
                                # print the word preceding the break together with the break info
                                print(word_text, symbol.property.detected_break)
                                word_text = ""

if __name__ == '__main__':
    detect_breaks(sys.argv[1])
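To answer the last part of the question: for a plain-text reconstruction these break types can indeed be collapsed to spaces and newlines. One possible mapping, reusing the breaks enum from the script above (this is my own interpretation of the documented types, not an official rule):

breaks = vision.enums.TextAnnotation.DetectedBreak.BreakType

# HYPHEN marks a line-wrapping hyphen that is not part of the word itself,
# so here it is rendered as "-" followed by a newline; adjust as needed.
BREAK_TO_TEXT = {
    breaks.UNKNOWN: "",
    breaks.SPACE: " ",
    breaks.SURE_SPACE: " ",       # a very wide, certain space
    breaks.EOL_SURE_SPACE: "\n",  # a line-wrapping break
    breaks.HYPHEN: "-\n",
    breaks.LINE_BREAK: "\n",      # a break that ends a paragraph
}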

How can I search pipeline with another pipeline value on google cloud dataflow

I would like to search for text which includes a specified word in stream data with Google Cloud Dataflow.
In detail, I will deal with the following two streams:
stream A: each element of the stream is a "word"
stream B: each element of the stream is a "text", and each text consists of "words"; a text may contain a "word" from stream A
Many "texts" flow into stream B frequently. On the other hand, a "word" flows into stream A only occasionally.
When a "word" flows into stream A, I would like to find every "text" that contains that "word" and flowed into stream B within the last 5 minutes.
Example
time stream A : stream B
00:01 - this is an apple
00:02 - this is an orange
00:03 - I have an apple
00:04 apple <= "this is an apple" and "I have an apple" are found
00:05 this <= "this is an apple" and "this is an orange" are found
Can I do this kind of search with Google Cloud Dataflow?
If I understand your question correctly, there are multiple ways to achieve something like what you want. I will describe two variations.
The basic idea in my example code is to use an inner join and SlidingWindows of five minutes. You can implement the join using ParDo side inputs or CoGroupByKey, depending on your data sizes.
Here is how you set up your inputs and windowing:
PCollection<String> streamA = ...;
PCollection<String> streamB = ...;

PCollection<String> windowedStreamA = streamA.apply(
    Window.into(
        SlidingWindows.of(Duration.standardMinutes(5)).every(...)));

PCollection<String> windowedStreamB = streamB.apply(
    Window.into(
        SlidingWindows.of(Duration.standardMinutes(5)).every(...)));
You may want to adjust the size of windows or period to meet your specification & performance needs.
Here is a sketch of how to do the join with side inputs. This will iterate over the entire five minute window of streamB for each element of streamA, so performance will suffer if windows get large.
// build the side input from the windowed stream so the join happens per window
PCollectionView<Iterable<String>> streamBview =
    windowedStreamB.apply(View.asIterable());

PCollection<String> matches = windowedStreamA.apply(
    ParDo.of(new DoFn<String, String>() {
      @Override
      public void processElement(ProcessContext context) {
        // scan every text in the current five-minute window of stream B
        for (String text : context.sideInput(streamBview)) {
          if (split(text).contains(context.element())) {
            context.output(text);
          }
        }
      }
    }).withSideInputs(streamBview));
Here is a sketch of how to do this with CoGroupByKey by pre-splitting the text and joining each keyword with the lines that contain that keyword. There is similar logic in the TfIdf example included with the SDK.
PCollection<KV<String, Void>> keyedStreamA = windowedStreamA.apply(
    MapElements
        .via((String word) -> KV.of(word, (Void) null))
        .withOutputType(new TypeDescriptor<KV<String, Void>>() {}));

PCollection<KV<String, String>> keyedStreamB = windowedStreamB.apply(
    FlatMapElements
        .via((String text) -> split(text).stream()
            .map(word -> KV.of(word, text))
            .collect(Collectors.toList()))
        .withOutputType(new TypeDescriptor<KV<String, String>>() {}));

TupleTag<Void> tagA = new TupleTag<Void>() {};
TupleTag<String> tagB = new TupleTag<String>() {};

KeyedPCollectionTuple<String> coGbkInput = KeyedPCollectionTuple
    .of(tagA, keyedStreamA)
    .and(tagB, keyedStreamB);

PCollection<String> matches = coGbkInput
    .apply(CoGroupByKey.<String>create())
    .apply(FlatMapElements
        .via((KV<String, CoGbkResult> result) -> {
          // emit the texts for this keyword only if the keyword actually
          // arrived on stream A in the same window
          if (result.getValue().getAll(tagA).iterator().hasNext()) {
            return result.getValue().getAll(tagB);
          }
          return Collections.<String>emptyList();
        })
        .withOutputType(new TypeDescriptor<String>() {}));
The best approach will depend on your data. If you are OK with getting more matches than just the last five minutes you can tune the amount of data duplication in windows by enlarging your sliding windows and having a larger period. You can also use triggers to tune when output is produced.

Write a CSV file in Shift-JIS (MFC VC++, Windows Embedded - WinCE)

As the title says, I have been trying to write data that the user enters into a CEdit control to a file.
The system is a handheld terminal running Windows CE, in which my test application is running, and I try to enter test data (Japanese characters in Romaji, Hiragana, Katakana and Kanji mixed along with normal English alphanumeric data) that initially is displayed in a CListCtrl. The characters display properly on the handheld display screen in my test application UI.
Finally, I try to read back the data from the List control and write it to a text CSV file. The data I get on reading back from the control is correct, but on writing it to the CSV, things mess up and my CSV file is unreadable and shows strange symbols and nonsense alphanumeric garbage.
I searched about this and ended up at a similar question on Stack Overflow:
UTF-8, CString and CFile? (C++, MFC)
I tried some of their suggestions and finally ended up with a proper UTF-8 CSV file.
The write-to-csv-file code goes like this:
CStdioFile cCsvFile;
cCsvFile.Open(cFileName, CFile::modeCreate | CFile::modeWrite);

unsigned char BOM[3] = {0xEF, 0xBB, 0xBF}; // UTF-8 BOM
cCsvFile.Write(BOM, 3);                    // write the BOM first

for (int i = 0; i < M_cDataList.GetItemCount(); i++)
{
    CString cDataStr = _T("\"") + M_cDataList.GetItemText(i, 0) + _T("\",");
    cDataStr += _T("\"") + M_cDataList.GetItemText(i, 1) + _T("\",");
    cDataStr += _T("\"") + M_cDataList.GetItemText(i, 2) + _T("\"\r\n");

    // convert the wide string to UTF-8 before writing
    CT2CA outputString(cDataStr, CP_UTF8);
    cCsvFile.Write(outputString, ::strlen(outputString));
}
cCsvFile.Close();
So far it is OK.
Now, for my use case, I would like to change things a bit such that the CSV file is encoded as Shift-JIS, not UTF-8.
For Shift-JIS, what BOM do I use, and what changes should I make to the above code?
Thank you for any suggestions and help.
Codepage for Shift-JIS is apparently 932. Use WideCharToMultiByte and MultiByteToWideChar for conversion. For example:
CStringW source = L"日本語ABC平仮名ABCひらがなABC片仮名ABCカタカナABC漢字ABC①";
CStringA destination = CW2A(source, 932);
CStringW convertBack = CA2W(destination, 932);
//Testing:
ASSERT(source == convertBack);
AfxMessageBox(convertBack);
As far as I can tell there is no BOM for Shift-JIS. Perhaps you just want to work with UTF-16 instead. For example:
CStdioFile file;
file.Open(L"utf16.txt", CFile::modeCreate | CFile::modeWrite| CFile::typeUnicode);
BYTE bom[2] = { 0xFF, 0xFE };
file.Write(bom, 2);
CString str = L"日本語";
file.WriteString(str);
file.Close();
P.S. According to this page there are some problems between codepage 932 and Shift-JIS, although I couldn't reproduce any errors.
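If it helps to double-check the bytes you end up with, the codepage-932 conversion can be reproduced outside MFC as well; here is a small Python sketch (the strings and file name are made up) showing that the encoded output carries no BOM:

# cp932 is Python's codec for Windows codepage 932.
text = u"日本語ABC平仮名ひらがな片仮名カタカナ漢字ABC"
data = text.encode("cp932")

assert data.decode("cp932") == text   # the round trip is lossless for these characters

with open("shiftjis.csv", "wb") as f:  # note: no BOM bytes are written for Shift-JIS
    f.write(data)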

Can I configure Tesseract to recognize texts from image with only specified length?

I am working on some OCR experiments where I would like to improve the quality of Tesseract output. Basically the test subject is things like CAPTCHA, random characters on an obfuscated image. Now Tesseract isn't doing a very good job. Partially because sometimes it identifies certain character as several characters/digits separately.
I am wondering if telling Tesseract that, my specific image should always contain a text of length, say six, could improve the OCR recognition result a bit. But I am not sure if this is even supported in Tesseract.
I didn't find documentation on that point. Could someone help point out if such feature exists, and if does, what configuration parameter I can set. Thanks!
Try this example for limiting how much of the text is recognized. It iterates over the text components Tesseract detects, so you can set the bound of the for loop to the number of components you want to process.
Consider the following code:
Pix *image = pixRead("/usr/src/tesseract-3.02/phototest.tif");
tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
api->Init(NULL, "eng");
api->SetImage(image);

Boxa* boxes = api->GetComponentImages(tesseract::RIL_TEXTLINE, true, NULL, NULL);
printf("Found %d textline image components.\n", boxes->n);
for (int i = 0; i < boxes->n; i++) {
    BOX* box = boxaGetBox(boxes, i, L_CLONE);
    api->SetRectangle(box->x, box->y, box->w, box->h);
    char* ocrResult = api->GetUTF8Text();
    int conf = api->MeanTextConf();
    fprintf(stdout, "Box[%d]: x=%d, y=%d, w=%d, h=%d, confidence: %d, text: %s",
            i, box->x, box->y, box->w, box->h, conf, ocrResult);
    delete[] ocrResult;   // GetUTF8Text() returns a string the caller must free
    boxDestroy(&box);     // release the cloned box
}
In for (int i = 0; i < boxes->n; i++), replace boxes->n with 20 if you only want the first 20 text components.
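Alternatively, since (as far as I know) Tesseract has no parameter that fixes the output length, you can post-process the result and only accept it when it has the expected length. A small pytesseract sketch; the file name, expected length and page segmentation mode are assumptions:

import pytesseract
from PIL import Image

EXPECTED_LEN = 6  # the known CAPTCHA length

img = Image.open("captcha.png")  # hypothetical file
# --psm 8: treat the image as a single word, which suits most CAPTCHAs.
raw = pytesseract.image_to_string(img, config="--psm 8")
text = "".join(raw.split())  # drop whitespace/newlines Tesseract appends

if len(text) == EXPECTED_LEN:
    print("accepted:", text)
else:
    print("rejected (length %d): %r" % (len(text), text))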

Character confidence for Tesseract 3.02 using config file

How would I get the % confidence per character detected?
By searching around I found that you should set save_blob_choices to T.
So I added that as a line to the hocr config file in tessdata/configs and called tesseract with it.
This is all I'm getting in the generated html file:
<span class='ocr_line' id='line_1' title="bbox 0 0 50 17"><span class='ocrx_word' id='word_1' title="bbox 3 2 45 15"><strong>31,835</strong></span>
As you can see, there aren't any confidence annotations, not even per word.
I don't have Visual Studio, so I'm not able to make code changes myself. But I'm also open to answers describing code changes, as well as how I would compile the code without VS.
Here is sample code for getting the confidence of each word.
You can replace RIL_WORD with RIL_SYMBOL to get the confidence of each character.
mTess.Recognize(0);
tesseract::ResultIterator* ri = mTess.GetIterator();
if (ri != 0)
{
    do
    {
        const char* word = ri->GetUTF8Text(tesseract::RIL_WORD);
        if (word != 0)
        {
            float conf = ri->Confidence(tesseract::RIL_WORD);
            printf(" word:%s, confidence: %f\n", word, conf);
        }
        delete[] word;    // free the string returned by GetUTF8Text()
    } while (ri->Next(tesseract::RIL_WORD));
    delete ri;
}
You will have to write a program to do this. Take a look at the ResultIterator API example on the Tesseract site. For your case, be sure to set the save_blob_choices variable and iterate at the RIL_SYMBOL level.
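If recompiling the C++ code is not an option, the same iteration can also be done from Python via the tesserocr bindings. A rough sketch; the file name is made up and I have not verified this against 3.02 specifically:

from tesserocr import PyTessBaseAPI, RIL, iterate_level

with PyTessBaseAPI() as api:           # uses the default tessdata path and language
    api.SetImageFile("31835.png")      # hypothetical input image
    api.Recognize()
    ri = api.GetIterator()
    # RIL.SYMBOL walks the result character by character; use RIL.WORD for words
    for symbol in iterate_level(ri, RIL.SYMBOL):
        ch = symbol.GetUTF8Text(RIL.SYMBOL)
        conf = symbol.Confidence(RIL.SYMBOL)
        print(ch, conf)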