Character confidence for Tesseract 3.02 using config file - ocr

How would I get the % confidence per character detected?
By searching around I found that you should set save_blob_choices to T.
So I added that as a line in the hocr config file in tessdata/configs and called tesseract with it.
This is all I'm getting in the generated html file:
<span class='ocr_line' id='line_1' title="bbox 0 0 50 17"><span class='ocrx_word' id='word_1' title="bbox 3 2 45 15"><strong>31,835</strong></span>
As you can see, there are no confidence annotations, not even per word.
I don't have Visual Studio, so I'm not able to make any code changes. But I'm also open to answers describing code changes, as well as how I would compile the code without VS.

Here is the sample code of getting confidence of each word.
You can even replace RIL_WORD with RIL_SYMBOL to get confidence of each character.
mTess.Recognize(0);
tesseract::ResultIterator* ri = mTess.GetIterator();
if (ri != 0)
{
    do
    {
        const char* word = ri->GetUTF8Text(tesseract::RIL_WORD);
        if (word != 0)
        {
            float conf = ri->Confidence(tesseract::RIL_WORD);
            printf(" word:%s, confidence: %f", word, conf);
        }
        delete[] word;
    } while (ri->Next(tesseract::RIL_WORD));
    delete ri;
}

You will have to write a program to do this. Take a look at the ResultIterator API example at Tesseract site. For your case, be sure to set save_blob_choices variable and iterate at RIL_SYMBOL level.

Related

Can Tesseract OCR recognize subscripts and superscripts?

I have problems with the general recognition of subscript and superscript in text fragments.
Example-image:
I used Tesseract 4.1.1 with the training data available under https://github.com/tesseract-ocr/tessdata_best. The numerous options had default values except:
tessedit_create_hocr = 1 (to get result as HOCR)
hocr_font_info = 1 (to get additional font infos like font size)
hocr_char_boxes = 1 (to get character-based result)
The language was set to eng. The subscript/superscript was not recognized correctly with page segmentation mode 3 (PSM_AUTO_OSD), 11 (PSM_SPARSE_TEXT) or 12 (PSM_SPARSE_TEXT_OSD).
In the output the sub/sup-fragments were all more or less wrong:
"SubtextSub" is recognized as "Subtextsu,"
"SuptextSub" is recognized as "Suptexts?"
"P0" is recognized as "Po"
"P100" is recognized as "P1go"
"a2+b2" is recognized as "a+b?"
Using Tesseract for OCR is there a way to ...?
optimize subscript/superscript handling
get infos about recognized subscript/superscript (in the hocr-output - ideally for each character)
Working on the quality of the image as suggested in other questions/answers to this topic didn't really change anything.
Following these 2 links from the tesseract-google-newsgroup at first it really seemed to be a question of training:
link1 and link2.
But after doing some experiments I found out that the OEM_DEFAULT OCR engine mode I had used just doesn't provide the needed information. I found a partial solution to the problem. Partial, because I now get most of the sub/sup information and the recognized characters are right in most cases, but not for all characters.
Using the OEM_TESSERACT_ONLY OCR engine mode (= the legacy mode) and some API methods provided by Tess4J, I came up with the following Java test class:
public class SubSupEvaluator {
public void determineSubSupCharacters(BufferedImage image) {
//1. initialize Tesseract and set image infos
TessBaseAPI handle = TessAPI1.TessBaseAPICreate();
try {
int bpp = image.getColorModel().getPixelSize();
int bytespp = bpp / 8;
int bytespl = (int) Math.ceil(image.getWidth() * bpp / 8.0);
TessBaseAPIInit2(handle, new File("./tessdata/").getAbsolutePath(), "eng", TessOcrEngineMode.OEM_TESSERACT_ONLY);
TessBaseAPISetPageSegMode(handle, TessPageSegMode.PSM_AUTO_OSD);
TessBaseAPISetImage(handle, ImageIOHelper.convertImageData(image), image.getWidth(), image.getHeight(), bytespp, bytespl);
//2. start actual OCR run
TessBaseAPIRecognize(handle, null);
//3. iterate over the result character-wise
TessResultIterator ri = TessBaseAPIGetIterator(handle);
TessPageIterator pi = TessResultIteratorGetPageIterator(ri);
TessPageIteratorBegin(pi);
do {
//determine character
Pointer ptr = TessResultIteratorGetUTF8Text(ri, TessPageIteratorLevel.RIL_SYMBOL);
String character = ptr.getString(0);
TessDeleteText(ptr); //release memory
//determine position information
IntBuffer leftB = IntBuffer.allocate(1);
IntBuffer topB = IntBuffer.allocate(1);
IntBuffer rightB = IntBuffer.allocate(1);
IntBuffer bottomB = IntBuffer.allocate(1);
TessPageIteratorBoundingBox(pi, TessPageIteratorLevel.RIL_SYMBOL, leftB, topB, rightB, bottomB);
//write info to console
System.out.println(String.format("%s - position [%d %d %d %d], subscript: %b, superscript: %b", character, leftB.get(), topB.get(),
rightB.get(), bottomB.get(), TessAPI1.TessResultIteratorSymbolIsSubscript(ri) == TessAPI1.TRUE,
TessAPI1.TessResultIteratorSymbolIsSuperscript(ri) == TessAPI1.TRUE));
} while (TessPageIteratorNext(pi, TessPageIteratorLevel.RIL_SYMBOL) == TessAPI1.TRUE);
} finally {
TessBaseAPIDelete(handle); //release memory
}
}
}
The legacy mode only works with the 'normal' training data. Using the '-best' training data produces an error.
There is very little information on this topic.
One option to improve recognition of sub/superscript characters (even if not of their position) is to preprocess the image, e.g. with cv2 or PIL (Pillow), and then run Tesseract on it.
See
How to detect subscript numbers in an image using OCR?
Related (but otherwise not answering the question):
https://www.mail-archive.com/tesseract-ocr@googlegroups.com/msg19434.html
https://github.com/tesseract-ocr/tesseract/blob/master/src/ccmain/superscript.cpp
What do you guys think about getting Tesseract to recognize single letters?
Tesseract does not recognize single characters
I tried it with the option --psm 10
tesseract imTstg.png out5 --psm 10
but it did not seem to work. I am thinking about just running YOLO to detect the single letters.

Unpivot Csv Files with changing schemas on linux

From one of our customers, we receive x number of csv files on our sftp server. The files usually vary in header names, column count and of course row count (usually somewhere between a couple of thousand and a couple of million rows; most files do not exceed 350 MB). Currently we process all the files through SSIS using a custom C# script.
What I want to accomplish is this: move the entire process to Linux (our sftp server), in order to shorten the data flow and the pre-processing time.
This may very well be a trivial task for a lot of you guys, but I can't say I belong to that category, having no real experience developing on Linux.
So how should I do this? Are there any feasible solutions with regard to time efficiency, memory consumption etc.?
Csv files could look like this, except the number of user columns always change:
eg. Filename: userdata.csv
Question; user1; user2; user3; user4
How old are you; 20; 22; 45; 54
How tall are you; 186; 176; 166; 195
And the output I'm after looks like this:
Question; Value; User; Filename
How old are you; 20; user1; userdata
How old are you; 22; user2; userdata
How old are you; 45; user3; userdata
How old are you; 54; user4; userdata
How tall are you; 186; user1; userdata
How tall are you; 176; user2; userdata
How tall are you; 166; user3; userdata
How tall are you; 195; user4; userdata
Suggestions, advice...anything is most welcome.
Update:
Just to elaborate on the input/output specifics..
input.csv (The result of a questionnaire)
2 questions, "How old are you" and "How tall are you" answered by 4 users, "user1", "user2", "user3" and "user4".
For the purpose of this example "user1" - "user4" is used.
In our live data the users real names are used.
The number of user columns will vary depending on how many participated in the questionnaire.
output.csv
The header row is changed to display 4 static fields: Question, Value, User and Filename.
Instead of having a row per question as in the input file, we need a row per user.
The Filename column should hold the name of the input file without extension.
character encoding is UTF-8 and the separator is semicolon. Qualifiers are not used.
So, after a bit of reading in here and a lot of trial and error, it seems I have a working solution. Though it might not be pretty and leaves room for improvement, this is what I've got:
A scheduled bash script which loops a filename array and passes the individual filenames to the awk script.
orgFile.sh
#!/bin/bash
# shopt and arrays are bash features, so the script needs bash rather than sh
shopt -s nullglob
fileList=(*.csv)
for i in "${fileList[@]}"; do
    awk -v filename="$i" -f newFile.awk "$i"
done
newFile.awk
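#!/usr/bin/awk -f
# Return the file name without its extension (text before the first dot).
function fname(file,    a)
{
    split(file, a, ".")
    return a[1]
}
BEGIN {
    FS = ";"
    fn = "done_" filename
    print "Question;Value;User;Filename" > fn
}
NR == 1 {
    # Remember the header row so each value can be paired with its user.
    for (i = 1; i <= NF; i++)
        headers[i] = $i
    next
}
{
    # Column 1 is the question; emit one output row per user column.
    for (i = 2; i <= NF; i++)
        print $1 FS $i FS headers[i] FS fname(filename) > fn
}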
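For comparison, the same unpivot step can be sketched in C++. This is only a minimal sketch under the assumptions stated in the question (semicolon separator, no qualifiers, first column holds the question). The names unpivot and splitFields are hypothetical, and trimming the spaces around each field is my own assumption based on the sample rows.

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Split one line on the separator and trim surrounding blanks from each field.
static std::vector<std::string> splitFields(const std::string& line, char sep) {
    std::vector<std::string> fields;
    std::string field;
    std::istringstream in(line);
    while (std::getline(in, field, sep)) {
        const std::size_t first = field.find_first_not_of(" \t\r");
        const std::size_t last = field.find_last_not_of(" \t\r");
        fields.push_back(first == std::string::npos
                             ? std::string()
                             : field.substr(first, last - first + 1));
    }
    return fields;
}

// Hypothetical helper: reads a wide questionnaire CSV and writes one row per
// (question, user) pair. baseName is the input file name without extension.
void unpivot(std::istream& in, std::ostream& out, const std::string& baseName) {
    std::string line;
    if (!std::getline(in, line)) return;  // empty input: nothing is written
    const std::vector<std::string> headers = splitFields(line, ';');

    out << "Question;Value;User;Filename\n";
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        const std::vector<std::string> row = splitFields(line, ';');
        // Column 0 is the question; every further column belongs to one user.
        for (std::size_t i = 1; i < row.size() && i < headers.size(); ++i) {
            out << row[0] << ';' << row[i] << ';' << headers[i] << ';'
                << baseName << '\n';
        }
    }
}
```

The function reads line by line, so memory use stays proportional to the longest row rather than to the file size. It could be called with a std::ifstream for the input and a std::ofstream for the output.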

Write a CSV file in Shift-JIS (MFC VC++, Windows Embedded - WinCE)

As the title says, I have been trying to write data that the user enters into a CEdit control to a file.
The system is a handheld terminal running Windows CE, in which my test application is running, and I try to enter test data (Japanese characters in Romaji, Hiragana, Katakana and Kanji mixed along with normal English alphanumeric data) that initially is displayed in a CListCtrl. The characters display properly on the handheld display screen in my test application UI.
Finally, I try to read back the data from the List control and write it to a text CSV file. The data I get on reading back from the control is correct, but on writing it to the CSV, things mess up and my CSV file is unreadable and shows strange symbols and nonsense alphanumeric garbage.
I searched about this, and I ended up with a similar question on stackOverflow:
UTF-8, CString and CFile? (C++, MFC)
I tried some of their suggestions and finally ended up with a proper UTF-8 CSV file.
The write-to-csv-file code goes like this:
CStdioFile cCsvFile = CStdioFile();
cCsvFile.Open(cFileName, CFile::modeCreate|CFile::modeWrite);
unsigned char BOM[3]={0xEF, 0xBB, 0xBF}; // Utf-8 BOM (unsigned: the values do not fit in a signed char)
cCsvFile.Write(BOM,3); // Write the BOM first
for(int i = 0; i < M_cDataList.GetItemCount(); i++)
{
CString cDataStr = _T("\"") + M_cDataList.GetItemText(i, 0) + _T("\",");
cDataStr += _T("\"") + M_cDataList.GetItemText(i, 1) + _T("\",");
cDataStr += _T("\"") + M_cDataList.GetItemText(i, 2) + _T("\"\r\n");
CT2CA outputString(cDataStr, CP_UTF8);
cCsvFile.Write(outputString, ::strlen(outputString));
}
cCsvFile.Close();
So far it is OK.
Now, for my use case, I would like to change things a bit such that the CSV file is encoded as Shift-JIS, not UTF-8.
For Shift-JIS, what BOM do I use, and what changes should I make to the above code?
Thank you for any suggestions and help.
Codepage for Shift-JIS is apparently 932. Use WideCharToMultiByte and MultiByteToWideChar for conversion. For example:
CStringW source = L"日本語ABC平仮名ABCひらがなABC片仮名ABCカタカナABC漢字ABC①";
CStringA destination = CW2A(source, 932);
CStringW convertBack = CA2W(destination, 932);
//Testing:
ASSERT(source == convertBack);
AfxMessageBox(convertBack);
As far as I can tell there is no BOM for Shift-JIS. Perhaps you just want to work with UTF16. For example:
CStdioFile file;
file.Open(L"utf16.txt", CFile::modeCreate | CFile::modeWrite| CFile::typeUnicode);
BYTE bom[2] = { 0xFF, 0xFE };
file.Write(bom, 2);
CString str = L"日本語";
file.WriteString(str);
file.Close();
ps, according to this page there are some problems between codepage 932 and Shift-JIS, although I couldn't duplicate any errors.
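Since a Shift-JIS file carries no BOM, a reader can only recognise the two BOMs shown above and has to treat everything else as unmarked data. A minimal, portable sketch of that check follows; the names Bom and detectBom are hypothetical and not part of MFC.

```cpp
#include <cassert>
#include <string>

enum class Bom { None, Utf8, Utf16LE };

// Inspect the first bytes of a buffer for the UTF-8 BOM (EF BB BF) or the
// UTF-16 little-endian BOM (FF FE). Anything else, including Shift-JIS text,
// is reported as having no BOM.
Bom detectBom(const std::string& bytes) {
    if (bytes.size() >= 3 && bytes.compare(0, 3, "\xEF\xBB\xBF") == 0) {
        return Bom::Utf8;
    }
    if (bytes.size() >= 2 && bytes.compare(0, 2, "\xFF\xFE") == 0) {
        return Bom::Utf16LE;
    }
    return Bom::None;
}
```

A result of Bom::None does not identify the encoding; the consumer of a Shift-JIS CSV still has to be told, or has to assume, that the file uses code page 932.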

Can I configure Tesseract to recognize texts from image with only specified length?

I am working on some OCR experiments where I would like to improve the quality of Tesseract output. Basically the test subject is things like CAPTCHA: random characters on an obfuscated image. Right now Tesseract isn't doing a very good job, partly because it sometimes identifies a single character as several separate characters/digits.
I am wondering whether telling Tesseract that my specific image should always contain a text of a given length, say six, could improve the OCR result a bit. But I am not sure if this is even supported in Tesseract.
I didn't find documentation on that point. Could someone point out whether such a feature exists and, if it does, which configuration parameter I can set? Thanks!
Try this example for specifying the length of the text. Set the bound of the for loop to the length you need.
Consider following code:
Pix *image = pixRead("/usr/src/tesseract-3.02/phototest.tif");
tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
api->Init(NULL, "eng");
api->SetImage(image);
Boxa* boxes = api->GetComponentImages(tesseract::RIL_TEXTLINE, true, NULL, NULL);
printf("Found %d textline image components.\n", boxes->n);
for (int i = 0; i < boxes->n; i++) {
    BOX* box = boxaGetBox(boxes, i, L_CLONE);
    api->SetRectangle(box->x, box->y, box->w, box->h);
    char* ocrResult = api->GetUTF8Text();
    int conf = api->MeanTextConf();
    fprintf(stdout, "Box[%d]: x=%d, y=%d, w=%d, h=%d, confidence: %d, text: %s",
            i, box->x, box->y, box->w, box->h, conf, ocrResult);
    delete[] ocrResult;  // GetUTF8Text returns a buffer the caller must free
    boxDestroy(&box);    // release the clone obtained from boxaGetBox
}
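In for (int i = 0; i < boxes->n; i++), replace boxes->n by 20 if you want a specified length of 20. Note that this loop runs over the detected text-line components, so the bound limits how many lines are processed rather than how many characters are recognized, and it must not exceed boxes->n.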
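Because the loop above works on text-line components, one way to enforce an expected character count is to check the recognized string afterwards. The following is only a sketch of such a post-processing step; it is not a Tesseract feature, the helper names stripWhitespace and hasExpectedLength are hypothetical, and it assumes single-byte (ASCII) characters such as those in a typical CAPTCHA.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Remove spaces and line breaks from the raw OCR text.
std::string stripWhitespace(const std::string& raw) {
    std::string out;
    for (unsigned char c : raw) {
        if (!std::isspace(c)) out.push_back(static_cast<char>(c));
    }
    return out;
}

// Accept a candidate only if it has exactly the expected number of characters.
// The count is in bytes, hence the ASCII assumption stated above.
bool hasExpectedLength(const std::string& raw, std::size_t expected) {
    return stripWhitespace(raw).size() == expected;
}
```

A caller could pass the string returned by GetUTF8Text() and retry with different preprocessing or a different page segmentation mode whenever hasExpectedLength(text, 6) returns false.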

Flash ABC: What does the number part of <file>.as$<number> mean in swfdump output?

If I take a swf, and run it through swfdump
swfdump.exe -abc file.swf > ABC.txt
On the first run I may get some output in ABC.txt like this
ObjectConfig.as$60
And on a subsequent run of the same SWF get a different output
ObjectConfig.as$61
What is the meaning of the number after the $ ?
This is part of the debug metadata that the mxmlc compiler adds to the bytecode when you do a debug compile, debug=true. If you do a normal release compile, this info is omitted.
This metadata stores filenames and line numbers so that you can see the location in your source while debugging. Although I'm not sure on the exact meaning of these particular numbers, they seem to be a unique identifier or index of that file for the debugger, perhaps in case of two classes with the same name.
The best I can see is in the source code for swfdump, it calls swf_GetString. Somewhere in this chain it adds what looks like a debugLine or a scopeDepth to the end of the class name:
char* swf_GetString(TAG*t)
{
int pos = t->pos;
while(t->pos < t->len && swf_GetU8(t));
/* make sure we always have a trailing zero byte */
if(t->pos == t->len) {
if(t->len == t->memsize) {
swf_ResetWriteBits(t);
swf_SetU8(t, 0);
t->len = t->pos;
}
t->data[t->len] = 0;
}
return (char*)&(t->data[pos]);
}