Tesseract provides confidence scores in its TSV output, but I am looking for a confidence score for the whole processed image.
The code below returns the confidence score for the whole image.
String datapath = "D:\\Tesseract";
String language = "eng";
TessAPI1 api = new TessAPI1();
TessBaseAPI handle = api.TessBaseAPICreate();
File image = new File("testocr.png");
Leptonica leptInstance = Leptonica.INSTANCE;
Pix pix = leptInstance.pixRead(image.getPath());
api.TessBaseAPIInit3(handle, datapath, language);
api.TessBaseAPISetImage2(handle, pix);
int conf = api.TessBaseAPIMeanTextConf(handle); // mean confidence (0-100) over all recognized text
System.out.println("conf: " + conf);
// release the Pix and the API handle
LeptUtils.dispose(pix);
api.TessBaseAPIDelete(handle);
You need to add the following Maven dependency:
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.3.0</version>
</dependency>
I'm trying to figure out how sequence-to-sequence loss is calculated. I am using the HuggingFace transformers library in this case, but this might actually be relevant to other DL libraries as well.
So to get the required data we can do:
from transformers import EncoderDecoderModel, BertTokenizer
import torch
import torch.nn.functional as F
torch.manual_seed(42)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
MAX_LEN = 128
tokenize = lambda x: tokenizer(x, max_length=MAX_LEN, truncation=True, padding=True, return_tensors="pt")
model = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased') # initialize Bert2Bert from pre-trained checkpoints
input_seq = ["Hello, my dog is cute", "my cat cute"]
output_seq = ["Yes it is", "ok"]
input_tokens = tokenize(input_seq)
output_tokens = tokenize(output_seq)
outputs = model(
    input_ids=input_tokens["input_ids"],
    attention_mask=input_tokens["attention_mask"],
    decoder_input_ids=output_tokens["input_ids"],
    decoder_attention_mask=output_tokens["attention_mask"],
    labels=output_tokens["input_ids"],
    return_dict=True)
idx = output_tokens["input_ids"]
logits = F.log_softmax(outputs["logits"], dim=-1)
mask = output_tokens["attention_mask"]
Edit 1
Thanks to @cronoik I was able to replicate the loss calculated by HuggingFace as:
output_logits = logits[:,:-1,:]
output_mask = mask[:,:-1]
label_tokens = output_tokens["input_ids"][:, 1:].unsqueeze(-1)
select_logits = torch.gather(output_logits, -1, label_tokens).squeeze()
huggingface_loss = -select_logits.mean()
However, since the last two tokens of the second sequence are just padding, shouldn't we calculate the loss as:
seq_loss = (select_logits * output_mask).sum(dim=-1, keepdims=True) / output_mask.sum(dim=-1, keepdims=True)
seq_loss = -seq_loss.mean()
This takes into account the length of each output sequence in the batch and masks out the padding. I think this is especially useful when we have batches with outputs of varying lengths.
OK, I found out where I was making the mistakes. This is all thanks to this thread in the HuggingFace forum.
The output labels need to have -100 at the masked (padding) positions; the transformers library does not do this for you.
One silly mistake I made was with the mask. It should have been output_mask = mask[:, 1:] instead of :-1.
1. Using the Model
We need to set the masked positions of the labels to -100. It is important to use clone, as shown below:
labels = output_tokens["input_ids"].clone()
labels[output_tokens["attention_mask"]==0] = -100
outputs = model(
    input_ids=input_tokens["input_ids"],
    attention_mask=input_tokens["attention_mask"],
    decoder_input_ids=output_tokens["input_ids"],
    decoder_attention_mask=output_tokens["attention_mask"],
    labels=labels,
    return_dict=True)
2. Calculating Loss
So the final way to replicate it is as follows:
idx = output_tokens["input_ids"]
logits = F.log_softmax(outputs["logits"], dim=-1)
mask = output_tokens["attention_mask"]
# shift things
output_logits = logits[:,:-1,:]
label_tokens = idx[:, 1:].unsqueeze(-1)
output_mask = mask[:,1:]
# gather the logits and mask
select_logits = torch.gather(output_logits, -1, label_tokens).squeeze()
# the two values below should match
print(-select_logits[output_mask == 1].mean(), outputs["loss"])
However, the above ignores the fact that the logits come from two different sequences. So an alternative way of calculating the loss could be:
seq_loss = (select_logits * output_mask).sum(dim=-1, keepdims=True) / output_mask.sum(dim=-1, keepdims=True)
-seq_loss.mean()  # negate the per-sequence mean log-likelihood to get a loss
Thanks for sharing. However, the new version of transformers as of today no longer "shifts", so the following is not needed:
#shift things
output_logits = logits[:,:-1,:]
label_tokens = idx[:, 1:].unsqueeze(-1)
output_mask = mask[:,1:]
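With that change, the replication becomes simpler. Here is a minimal sketch of what that would look like; it assumes (not verified against a specific release) that the loss is computed directly against the unshifted labels with -100 at the padded positions:
labels = output_tokens["input_ids"].clone()
labels[output_tokens["attention_mask"] == 0] = -100  # padded positions are ignored by cross_entropy
# F.cross_entropy expects (batch, vocab, seq_len) logits and (batch, seq_len) targets
unshifted_loss = F.cross_entropy(outputs["logits"].transpose(1, 2), labels)
# under the above assumption, unshifted_loss should match outputs["loss"]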
I have problems with the general recognition of subscript and superscript in text fragments.
Example-image:
I used Tesseract 4.1.1 with the training data available at https://github.com/tesseract-ocr/tessdata_best. The numerous options had their default values, except:
tessedit_create_hocr = 1 (to get result as HOCR)
hocr_font_info = 1 (to get additional font infos like font size)
hocr_char_boxes = 1 (to get character-based result)
The language was set to eng. Neither with page segmentation mode 3 (PSM_AUTO_OSD), nor 11 (PSM_SPARSE_TEXT), nor 12 (PSM_SPARSE_TEXT_OSD) was the subscript/superscript recognized correctly.
In the output, the sub/sup fragments were all more or less wrong:
"SubtextSub" is recognized as "Subtextsu,"
"SuptextSub" is recognized as "Suptexts?"
"P0" is recognized as "Po"
"P100" is recognized as "P1go"
"a2+b2" is recognized as "a+b?"
Using Tesseract for OCR, is there a way to ...?
optimize subscript/superscript handling
get information about recognized subscript/superscript (in the HOCR output, ideally for each character)
Working on the quality of the image, as suggested in other questions/answers on this topic, didn't really change anything.
Following these two links from the Tesseract Google group, at first it really seemed to be a question of training:
link1 and link2.
But after doing some experiments I found out that the OEM_DEFAULT engine mode simply does not produce the needed information. I found a partial solution to the problem. Partial, because I now get most of the information about sub/sup, and the recognized characters are also right in most cases, but not for all characters.
Using the OEM_TESSERACT_ONLY engine mode (i.e. the legacy mode) and some API methods provided by Tess4J, I came up with the following Java test class:
public class SubSupEvaluator {

    public void determineSubSupCharacters(BufferedImage image) {
        //1. initialize Tesseract and set image infos
        TessBaseAPI handle = TessAPI1.TessBaseAPICreate();
        try {
            int bpp = image.getColorModel().getPixelSize();
            int bytespp = bpp / 8;
            int bytespl = (int) Math.ceil(image.getWidth() * bpp / 8.0);
            TessBaseAPIInit2(handle, new File("./tessdata/").getAbsolutePath(), "eng", TessOcrEngineMode.OEM_TESSERACT_ONLY);
            TessBaseAPISetPageSegMode(handle, TessPageSegMode.PSM_AUTO_OSD);
            TessBaseAPISetImage(handle, ImageIOHelper.convertImageData(image), image.getWidth(), image.getHeight(), bytespp, bytespl);

            //2. start actual OCR run
            TessBaseAPIRecognize(handle, null);

            //3. iterate over the result character-wise
            TessResultIterator ri = TessBaseAPIGetIterator(handle);
            TessPageIterator pi = TessResultIteratorGetPageIterator(ri);
            TessPageIteratorBegin(pi);
            do {
                //determine character
                Pointer ptr = TessResultIteratorGetUTF8Text(ri, TessPageIteratorLevel.RIL_SYMBOL);
                String character = ptr.getString(0);
                TessDeleteText(ptr); //release memory

                //determine position information
                IntBuffer leftB = IntBuffer.allocate(1);
                IntBuffer topB = IntBuffer.allocate(1);
                IntBuffer rightB = IntBuffer.allocate(1);
                IntBuffer bottomB = IntBuffer.allocate(1);
                TessPageIteratorBoundingBox(pi, TessPageIteratorLevel.RIL_SYMBOL, leftB, topB, rightB, bottomB);

                //write info to console
                System.out.println(String.format("%s - position [%d %d %d %d], subscript: %b, superscript: %b", character,
                        leftB.get(), topB.get(), rightB.get(), bottomB.get(),
                        TessAPI1.TessResultIteratorSymbolIsSubscript(ri) == TessAPI1.TRUE,
                        TessAPI1.TessResultIteratorSymbolIsSuperscript(ri) == TessAPI1.TRUE));
            } while (TessPageIteratorNext(pi, TessPageIteratorLevel.RIL_SYMBOL) == TessAPI1.TRUE);
        } finally {
            TessBaseAPIDelete(handle); //release memory
        }
    }
}
The legacy mode only works with 'normal' training data; using the '-best' training data produces an error.
There is very little information on this topic.
One option to enhance sub/superscript character recognition (even if not the position detection itself) is to preprocess the image, e.g. with cv2 or PIL (Pillow), and then run Tesseract on it; a rough sketch follows the links below.
See
How to detect subscript numbers in an image using OCR?
Related (but otherwise not answering the question):
https://www.mail-archive.com/tesseract-ocr@googlegroups.com/msg19434.html
https://github.com/tesseract-ocr/tesseract/blob/master/src/ccmain/superscript.cpp
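For illustration, here is a minimal preprocessing sketch in Python. The file name fragment.png, the 3x upscaling, the Otsu binarization and the --oem 0 / --psm 3 settings are assumptions for demonstration, not tuned values, and --oem 0 needs the 'normal' (non-best) traineddata, as noted above:
import cv2
import pytesseract

# upscale and binarize the fragment before handing it to Tesseract
img = cv2.imread("fragment.png", cv2.IMREAD_GRAYSCALE)
img = cv2.resize(img, None, fx=3, fy=3, interpolation=cv2.INTER_CUBIC)
_, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# --oem 0 selects the legacy engine, which the answer above found necessary for sub/superscript info
hocr = pytesseract.image_to_pdf_or_hocr(img, extension="hocr",
                                        config="--oem 0 --psm 3 -c hocr_font_info=1")
print(hocr[:500])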
What do you guys think about getting Tesseract to recognize single letters?
Tesseract does not recognize single characters
I tried it with the option --psm 10
tesseract imTstg.png out5 --psm 10
but it did not seem to work. I am thinking about just running YOLO to detect the single letters.
I am developing an OCR system based on JavaCV.
I use the following libraries in my project:
https://github.com/bytedeco/javacv
https://github.com/bytedeco/javacpp-presets/tree/master/tesseract
In one case I need to find a part of an image and recognize the letters in it.
I store that part of the image as an IplImage.
But for Tesseract I must use the PIX format.
How can I convert an IplImage to a PIX?
Posting the hack-like solution found by the author of the question. It can also be found here.
IplImage prepareImg = ...
// round-trip through a file: write the IplImage to disk and read it back as a PIX
cvSaveImage("plate.jpg", prepareImg);
PIX pixImage = pixRead("plate.jpg");
And based on this question, you can convert an IplImage to a BufferedImage as follows:
public static BufferedImage toBufferedImage(IplImage src) {
    OpenCVFrameConverter.ToIplImage iplConverter = new OpenCVFrameConverter.ToIplImage();
    Java2DFrameConverter bimConverter = new Java2DFrameConverter();
    Frame frame = iplConverter.convert(src);
    BufferedImage img = bimConverter.convert(frame);
    // copy into a new BufferedImage so the result does not share the converter's reusable buffer
    BufferedImage result = new BufferedImage(img.getWidth(), img.getHeight(), BufferedImage.TYPE_INT_RGB);
    result.getGraphics().drawImage(img, 0, 0, null);
    img.flush();
    return result;
}
IplImage prepareImg = ...
cvSaveImage("test.jpg", prepareImg);
PIX pixImage = pixRead("/test.jpg");
--- Source: same GitHub issue, as mentioned in a comment by rajind ruparathna.
I have just started using RapidMiner for text classification. I have created a process in which I used the "Process Documents from Files" operator for TF-IDF conversion. I want to ask how to use this operator from Java code. I searched on the internet, but all examples use an already created process or a word list generated from documents. I want to start from scratch, i.e.:
1) Process Documents from Files
1.1) Tokenization
1.2) Filtering
1.3) Stemming
1.4) N-Gram
2) Validation
2.1) Training (K-NN)
2.2) Apply Model
Maybe the source code below can help you:
// note: when embedding RapidMiner, it has to be initialized (RapidMiner.init()) before a process can be run
// load a process definition created in the RapidMiner GUI
String processDefinitionFileName = "/home/maximk/.RapidMiner5/repositories/Local Repository/processes/processOpenCSV.rmp";
File processDefinition = new File(processDefinitionFileName);
Process readCSV = new Process(processDefinition);

// wrap the input file and run the process
File csvFile = new File("/home/maximk/test.csv");
IOObject inObject = new SimpleFileObject(csvFile);
IOContainer inParameters = new IOContainer(inObject);
IOContainer outParameters = readCSV.run(inParameters);

// the first output port holds the resulting example set
SimpleExampleSet resultDataSet = (SimpleExampleSet) outParameters.getElementAt(0);
Is there a straightforward way to load a .csv file into Simplegeo Storage? I don't have great coding skills and I'm trying to get things set up so I can ask a freelancer to create some maps for my app. If someone has existing code to do this I can probably figure out how to make it work for my situation.
I just skimmed the API. Here's a basic example in Python.
Assumed csv format:
layer, id, lat, lon
from simplegeo.models import Record, Client
lines = open('file.csv').read().split('\n')
client = Client('your-oauth-token', 'your-oauth-secret')

for line in lines:
    parts = line.split(',')
    if len(parts) == 4:
        layer = parts[0].strip()
        id = parts[1].strip()
        lat = float(parts[2].strip())
        lon = float(parts[3].strip())
        r = Record(layer, id, lat, lon)
        client.storage.add_record(r)
After a bit more digging, I found a Python example on their site for this exact purpose:
https://simplegeo.com/docs/tutorials/general-hackery#how-import-csv-file-simplegeo
import csv
import simplegeo
OAUTH_TOKEN = '[insert_oauth_token_here]'
OAUTH_SECRET = '[insert_oauth_secret_here]'
CSV_FILE = '[insert_csv_file_here]'
LAYER = '[insert_layer_name_here]'
client = simplegeo.Client(OAUTH_TOKEN, OAUTH_SECRET)
def insert(data):
    layer = LAYER
    id = data.pop("id")
    lat = data.pop("latitude")
    lon = data.pop("longitude")
    # Grab more columns if you wish
    record = simplegeo.Record(layer, id, lat, lon, **data)
    client.add_record(record)

r = csv.DictReader(open(CSV_FILE, mode='U'))
for l in r:
    insert(l)