Using Tesseract to OCR matchTemplate Regions of Interest (ROI)

This is my very first attempt at using Python. I normally use .NET, but to identify shapes in documents I have turned to Python and OpenCV for image processing.
I am using OpenCV template matching (cv2.matchTemplate) to discover Regions of Interest (ROI) in my documents.
This works well: the template matches the ROIs and rectangles are drawn to mark the matches.
The ROIs in my images contain text which I also need to OCR and extract. I am trying to do this with Tesseract, but judging by my results I think I am approaching it wrongly.
My process is this:
Run cv2.matchTemplate
Loop through matched ROI's
Add rectangle info. to image
Pass rectangle info. to Tesseract
Add text returned from tesseract to image
Write the final image
In the image below you can see the matched regions (which are fine), but the text in the ROI doesn't match the text returned by Tesseract (bottom right of each ROI).
Please could someone take a look and advise where I am going wrong?
import cv2
import numpy as np
import pytesseract
import imutils

img_rgb = cv2.imread('images/pd2.png')
img_gray = cv2.cvtColor(img_rgb, cv2.COLOR_BGR2GRAY)
template = cv2.imread('images/matchMe.png', 0)
w, h = template.shape[::-1]

res = cv2.matchTemplate(img_gray, template, cv2.TM_CCOEFF_NORMED)
threshold = 0.45
loc = np.where(res >= threshold)

for pt in zip(*loc[::-1]):
    cv2.rectangle(img_rgb, pt, (pt[0] + w, pt[1] + h), (0, 0, 255), 2)
    roi = img_rgb[pt, (pt[0] + w, pt[1] + h)]
    config = "-l eng --oem 1 --psm 7"
    text = pytesseract.image_to_string(roi, config=config)
    print(text)
    cv2.putText(img_rgb, text, (pt[0] + w, pt[1] + h),
                cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 0, 255), 3)

cv2.imwrite('images/results.png', img_rgb)

There were two issues in your code:
1. You were modifying the image (drawing the rectangle) before running OCR, so the annotation could end up inside the region passed to Tesseract.
2. roi was not constructed properly; a NumPy region has to be sliced as img[y1:y2, x1:x2].
import cv2
import numpy as np
import pytesseract

img_rgb = cv2.imread('tess.png')
img_gray = cv2.cvtColor(img_rgb, cv2.COLOR_BGR2GRAY)
template = cv2.imread('matchMe.png', 0)
w, h = template.shape[::-1]

res = cv2.matchTemplate(img_gray, template, cv2.TM_CCOEFF_NORMED)
threshold = 0.45
loc = np.where(res >= threshold)

for pt in zip(*loc[::-1]):
    # slice the untouched ROI (rows = y range, columns = x range) and OCR it first
    roi = img_rgb[pt[1]:pt[1] + h, pt[0]:pt[0] + w]
    config = "-l eng --oem 1 --psm 7"
    text = pytesseract.image_to_string(roi, config=config)
    print(text)
    # only now draw the rectangle and the recognised text on the image
    cv2.rectangle(img_rgb, pt, (pt[0] + w, pt[1] + h), (0, 0, 255), 2)
    cv2.putText(img_rgb, text, (pt[0] + w, pt[1] + h),
                cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 0, 255), 3)

cv2.imwrite('results.png', img_rgb)
You may still need to feed Tesseract a properly filtered (binarized, denoised) image to get any meaningful recognition. Hope this helps.
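For instance, a minimal preprocessing sketch (not part of the original answer; the upscaling factor and the use of Otsu thresholding are assumptions you would tune for your documents):

import cv2
import pytesseract

def ocr_roi(roi_bgr):
    # Upscale, grayscale and binarize the ROI before handing it to Tesseract.
    roi = cv2.resize(roi_bgr, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary, config="-l eng --oem 1 --psm 7")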

Related

Optical Character Recognition with Python Tesseract on a series of symbols

Hello, I am looking for guidance. I have been using pytesseract to do OCR, but I can't get it to recognise a series of equal signs placed together in an image. Any guidance on how to address this issue? I tested the image with AWS Rekognition and Google Vision and got the same results. I also tried selecting an ROI with OpenCV and focusing the OCR on that, and it still came out empty, i.e. no characters recognised. I would appreciate any guidance.
Thank you.
Your text seems to be difficult to extract. Try to work on the full image when extracting text with Tesseract.
I made one attempt at your problem, but as you can see the bounding boxes for the characters are not what you would expect.
This is the code:
import cv2
import numpy as np
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'

originalImage = cv2.imread('a.png')
grayImage = cv2.cvtColor(originalImage, cv2.COLOR_BGR2GRAY)
(thresh, blackAndWhiteImageOriginal) = cv2.threshold(grayImage, 127, 255, cv2.THRESH_BINARY_INV)
blackAndWhiteImage = cv2.dilate(blackAndWhiteImageOriginal, np.ones((3, 3), np.uint8))

ocr_output_details = pytesseract.image_to_data(blackAndWhiteImage, output_type=pytesseract.Output.DICT,
                                               config="--psm 7 -c tessedit_char_whitelist==")

rgbImage = cv2.cvtColor(blackAndWhiteImage, cv2.COLOR_GRAY2RGB)
for i in range(len(ocr_output_details['level'])):
    (x, y, w, h) = (ocr_output_details['left'][i], ocr_output_details['top'][i],
                    ocr_output_details['width'][i], ocr_output_details['height'][i])
    cv2.rectangle(rgbImage, (x, y), (x + w, y + h), (0, 0, 255), 2)

print('Text: ', ocr_output_details['text'])

cv2.imshow('Boxes', rgbImage)
cv2.waitKey(0)
cv2.destroyAllWindows()
And the result:
Result 1
Using another, more appropriate full image with the expected character size, I can extract the equal symbols perfectly with Tesseract.
This is the code:
import cv2
import numpy as np
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'

originalImage = cv2.imread('b.jpg')
grayImage = cv2.cvtColor(originalImage, cv2.COLOR_BGR2GRAY)
(thresh, blackAndWhiteImageOriginal) = cv2.threshold(grayImage, 127, 255, cv2.THRESH_BINARY)
blackAndWhiteImage = cv2.erode(blackAndWhiteImageOriginal, np.ones((3, 3), np.uint8))

img = originalImage
img_copy = img.copy()
gray = cv2.cvtColor(img_copy, cv2.COLOR_BGR2GRAY)
_, thresh = cv2.threshold(gray, 180, 255, cv2.THRESH_BINARY)

results = pytesseract.image_to_data(thresh, config="-c tessedit_char_whitelist== --psm 6")

text = []
for b in map(str.split, results.splitlines()[1:]):
    if len(b) == 12:
        x, y, w, h = map(int, b[6:10])
        cv2.rectangle(originalImage, (x, y), (x + w, y + h), (255, 0, 0), 2)
        cv2.putText(originalImage, b[11], (x, y + h + 15), cv2.FONT_HERSHEY_COMPLEX, 0.6, 0)
        text.append(b[11])

print('Text: ', text)

cv2.imshow("Result", originalImage)
cv2.waitKey(0)
And the result:
Result 2
You can try to improve the results using the Tesseract documentation: Tesseract - Improving the quality of the output.
Important things to do are:
Use white for the background and black for the character font color
Select the desired Tesseract psm mode. In the previous cases I was using psm modes 6 and 7, which treat the image as a single uniform block of text and as a single text line, respectively
Try to use the tessedit_char_whitelist config variable to specify only the characters that you are searching for (see the short sketch below)
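As a rough illustration of those three points (a minimal sketch, not from the original answer; the file name and the inversion heuristic are assumptions):

import cv2
import numpy as np
import pytesseract

img = cv2.imread('equals.png', cv2.IMREAD_GRAYSCALE)  # hypothetical input image

# Tesseract prefers black characters on a white background; invert if the image is mostly dark.
if np.mean(img) < 127:
    img = cv2.bitwise_not(img)
_, binary = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)

# psm 7 = single text line; the whitelist restricts recognition to '=' characters.
config = "--psm 7 -c tessedit_char_whitelist=="
print(pytesseract.image_to_string(binary, config=config))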

I am trying to visualize attention using an image captioning model but get a tensor size error

Here is the code I used to visualize the attention for an image:
checkpoint = torch.load('/content/drive/MyDrive/database/checkpoints/' + MODEL_NAME + '_BEST.pt', map_location='cpu')
autoencoder.load_state_dict(checkpoint['model'])
autoencoder = autoencoder
autoencoder.eval()
decoder = autoencoder
decoder.eval()
encoder = autoencoder.encoder
encoder.eval()
# embeddings = torch.FloatTensor(len(EN.vocab), 300)
# init_embeddings(embeddings)
rev_word_map = {v: k for k, v in EN.vocab.stoi.items()}
image = '/content/drive/MyDrive/database/visual/testing1.jpg'
sequence, alphas = generate_caption(image,autoencoder.to(DEVICE), autoencoder.encoder.to(DEVICE),autoencoder.decoder.to(DEVICE),img_transform,5, EN, MAX_LEN,DEVICE, image_shape=(256, 256))
alphas = torch.FloatTensor(alphas)
visualize_att(image, sequence, alphas, rev_word_map)
I get an error saying that the size of tensor a must match the size of tensor b:
RuntimeError: The size of tensor a (196) must match the size of tensor b (5) at non-singleton dimension 2

Move an object towards a target in PyBullet

I'm pretty new to PyBullet and physics engines in general. My first step is trying to get one object to move towards another.
import pybullet as p
import time
import pybullet_data

DURATION = 10000

physicsClient = p.connect(p.GUI)  # or p.DIRECT for non-graphical version
p.setAdditionalSearchPath(pybullet_data.getDataPath())  # optionally
print("data path: %s " % pybullet_data.getDataPath())
p.setGravity(0, 0, -10)
planeId = p.loadURDF("plane.urdf")
cubeStartPos = [0, 0, 1]
cubeStartOrientation = p.getQuaternionFromEuler([0, 0, 0])
boxId = p.loadURDF("r2d2.urdf", cubeStartPos, cubeStartOrientation)
gemId = p.loadURDF("duck_vhacd.urdf", [2, 2, 1], p.getQuaternionFromEuler([0, 0, 0]))

for i in range(DURATION):
    p.stepSimulation()
    time.sleep(1. / 240.)
    gemPos, gemOrn = p.getBasePositionAndOrientation(gemId)
    cubePos, cubeOrn = p.getBasePositionAndOrientation(boxId)
    oid, lk, frac, pos, norm = p.rayTest(cubePos, gemPos)[0]
    # rt = p.rayTest(cubePos, gemPos)
    # print("rayTest: %s" % rt[0][1])
    print("rayTest: Norm: ")
    print(norm)
    p.applyExternalForce(objectUniqueId=boxId, linkIndex=-1, forceObj=pos,
                         posObj=gemPos, flags=p.WORLD_FRAME)
    print(cubePos, cubeOrn)

p.disconnect()
But this just gets R2 to wiggle a bit. How do I do this?
First of all, if you are moving a robot, you should do something a little more sophisticated by sending commands to the joints of the robot. Here is an example.
Now, assuming that you are moving something less complicated by applying an external force, the simplest thing you can do is multiply a factor alpha by the difference between the two positions; this would be your force.
For your example this would be:
import numpy as np
import pybullet as p
import time
import pybullet_data

DURATION = 10000
ALPHA = 300

physicsClient = p.connect(p.GUI)  # or p.DIRECT for non-graphical version
p.setAdditionalSearchPath(pybullet_data.getDataPath())  # optionally
print("data path: %s " % pybullet_data.getDataPath())
p.setGravity(0, 0, -10)
planeId = p.loadURDF("plane.urdf")
cubeStartPos = [0, 0, 1]
cubeStartOrientation = p.getQuaternionFromEuler([0, 0, 0])
boxId = p.loadURDF("r2d2.urdf", cubeStartPos, cubeStartOrientation)
gemId = p.loadURDF("duck_vhacd.urdf", [2, 2, 1], p.getQuaternionFromEuler([0, 0, 0]))

for i in range(DURATION):
    p.stepSimulation()
    time.sleep(1. / 240.)
    gemPos, gemOrn = p.getBasePositionAndOrientation(gemId)
    boxPos, boxOrn = p.getBasePositionAndOrientation(boxId)
    # proportional "controller": push the box towards the gem
    force = ALPHA * (np.array(gemPos) - np.array(boxPos))
    p.applyExternalForce(objectUniqueId=boxId, linkIndex=-1,
                         forceObj=force, posObj=boxPos, flags=p.WORLD_FRAME)
    print('Applied force vector = {}'.format(force))
    print('Applied force magnitude = {}'.format(np.linalg.norm(force)))

p.disconnect()

Return Caffe model results to an HTML web page

I have a Caffe model which detects objects in a video and shows the results as bounding boxes and class labels for each object. I'm using the OpenCV function cv2.imshow() to show the results, but I want to show them on an HTML web page. I'm not using any frameworks like Django.
How should I do this?
this is part of my code I want to show its result on a web page:
# loop over the frames from the video stream
while True:
    # grab the frame from the threaded video stream and resize it
    # to have a maximum width of 400 pixels
    # frame = vs.read()
    ret, frame = cap.read()
    frame = imutils.resize(frame, width=400)

    # grab the frame dimensions and convert it to a blob
    (h, w) = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (400, 300)),
                                 0.008, (400, 300), 128)

    # pass the blob through the network and obtain the detections and predictions
    net.setInput(blob)
    detections = net.forward()

    # loop over the detections
    for i in np.arange(0, detections.shape[2]):
        # extract the confidence (i.e., probability) associated with
        # the prediction
        confidence = detections[0, 0, i, 2]
        if confidence > treshHold:
            # extract the index of the class label from the
            # `detections`, then compute the (x, y)-coordinates of
            # the bounding box for the object
            idx = int(detections[0, 0, i, 1])
            box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
            (startX, startY, endX, endY) = box.astype("int")
            # draw the prediction on the frame
            label = "{}: {:.2f}%".format(CLASSES[idx], confidence * 100)
            cv2.rectangle(frame, (startX, startY), (endX, endY),
                          COLORS[idx], 2)
            y = startY - 15 if startY - 15 > 15 else startY + 15
            cv2.putText(frame, label, (startX, y),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, COLORS[idx], 2)

    # show the output frame
    cv2.imshow("Frame", frame)
    key = cv2.waitKey(1) & 0xFF

    # if the `q` key was pressed, break from the loop
    if key == ord("q"):
        break
Compose an HTML page, insert the necessary links, write the result to a file and you are done.
Check here for an example: link
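In rough terms, that could look like the sketch below (not the answer's original code; the file names latest.jpg and index.html and the one-second refresh are assumptions): write each annotated frame to an image file, generate a small HTML page that keeps reloading it, and serve the folder with Python's built-in http.server instead of a framework.

import cv2

def publish_frame(frame, image_path='latest.jpg', page_path='index.html'):
    # Save the annotated frame where the web page can reach it.
    cv2.imwrite(image_path, frame)
    # A static page that re-requests the image once per second.
    html = ('<html><head><meta http-equiv="refresh" content="1"></head>'
            '<body><img src="{}"></body></html>'.format(image_path))
    with open(page_path, 'w') as f:
        f.write(html)

# Inside the while loop, after the boxes are drawn:
#     publish_frame(frame)
# Then serve the folder with:  python -m http.server 8000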

Getting different accuracies using different Caffe classes (98.65 vs 98.1 vs 98.20)

When I train and then test my model using Caffe's command line interface, I get e.g. 98.65%, whereas when I write the code myself (given below) to calculate accuracy from the same pre-trained model, I get e.g. 98.1% using Caffe.Net.
Everything is straightforward and I have no idea what is causing the issue.
I also tried using Caffe.Classifier and its predict method, and yet I get another, lower accuracy (i.e. 98.20%!).
Here is the snippet of code I wrote:
import sys
import caffe
import numpy as np
import lmdb
import argparse
from collections import defaultdict
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import itertools
from sklearn.metrics import roc_curve, auc
import random

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--proto', help='path to the network prototxt file(deploy)', type=str, required=True)
    parser.add_argument('--model', help='path to your caffemodel file', type=str, required=True)
    parser.add_argument('--mean', help='path to the mean file(.binaryproto)', type=str, required=True)
    #group = parser.add_mutually_exclusive_group(required=True)
    parser.add_argument('--db_type', help='lmdb or leveldb', type=str, required=True)
    parser.add_argument('--db_path', help='path to your lmdb/leveldb dataset', type=str, required=True)
    args = parser.parse_args()

    predicted_lables = []
    true_labels = []
    misclassified = []
    class_names = ['unsafe', 'safe']
    count = 0
    correct = 0
    batch = []
    plabe_ls = []
    batch_size = 50
    cropx = 224
    cropy = 224
    i = 0
    multi_crop = False
    use_caffe_classifier = True

    caffe.set_mode_gpu()
    # Extract mean from the mean image file
    mean_blobproto_new = caffe.proto.caffe_pb2.BlobProto()
    f = open(args.mean, 'rb')
    mean_blobproto_new.ParseFromString(f.read())
    mean_image = caffe.io.blobproto_to_array(mean_blobproto_new)
    f.close()

    net = caffe.Classifier(args.proto, args.model,
                           mean=mean_image[0].mean(1).mean(1),
                           image_dims=(224, 224))
    net1 = caffe.Net(args.proto, args.model, caffe.TEST)
    net1.blobs['data'].reshape(batch_size, 3, 224, 224)
    data_blob_shape = net1.blobs['data'].data.shape

    #check and see if its lmdb or leveldb
    if(args.db_type.lower() == 'lmdb'):
        lmdb_env = lmdb.open(args.db_path)
        lmdb_txn = lmdb_env.begin()
        lmdb_cursor = lmdb_txn.cursor()
        for key, value in lmdb_cursor:
            count += 1
            datum = caffe.proto.caffe_pb2.Datum()
            datum.ParseFromString(value)
            label = int(datum.label)
            image = caffe.io.datum_to_array(datum).astype(np.float32)
            #key,image,label
            #buffer n image
            if(count % 5000 == 0):
                print('{0} samples processed so far'.format(count))
            if(i < batch_size):
                i += 1
                inf = key, image, label
                batch.append(inf)
                #print(key)
            if(i >= batch_size):
                #process n image
                ims = []
                for x in range(len(batch)):
                    img = batch[x][1]
                    #img has c,w,h shape! its already gone through transpose and channel swap when it was being saved into lmdb!
                    #Method III : use center crop just like caffe does in test time
                    if (use_caffe_classifier != True):
                        #center crop
                        c, w, h = img.shape
                        startx = h//2 - cropx//2
                        starty = w//2 - cropy//2
                        img = img[:, startx:startx + cropx, starty:starty + cropy]
                        #transpose the image so we can subtract from mean
                        img = img.transpose(2, 1, 0)
                        img -= mean_image[0].mean(1).mean(1)
                        #transpose back to the original state
                        img = img.transpose(2, 1, 0)
                        ims.append(img)
                    else:
                        ims.append(img.transpose(2, 1, 0))

                if (use_caffe_classifier != True):
                    net1.blobs['data'].data[...] = ims[:]
                    out_1 = net1.forward()
                    plabe_ls = out_1['pred']
                else:
                    out_1 = net.predict(np.asarray(ims), oversample=multi_crop)
                    plabe_ls = out_1

                plbl = np.asarray(plabe_ls)
                plbl = plbl.argmax(axis=1)
                for j in range(len(batch)):
                    if (plbl[j] == batch[j][2]):
                        correct += 1
                    else:
                        misclassified.append(batch[j][0])
                    predicted_lables.append(plbl[j])
                    true_labels.append(batch[j][2])
                batch.clear()
                i = 0
                sys.stdout.write("\rAccuracy: %.2f%%" % (100. * correct / count))
                sys.stdout.flush()

        print(", %i/%i corrects" % (correct, count))
What is causing this difference in accuracies?
More information:
I am using Python 3.5 on Windows.
I read images from an LMDB dataset.
The images are 256x256 and center cropped to 224x224.
The model is fine-tuned from GoogLeNet.
For Caffe.predict to work well I had to change classify.py.
In training, I just use Caffe's defaults, such as random crops at training time and a center crop at test time.
Changes:
changed line 35 to:
self.transformer.set_transpose(in_, (2, 1, 0))
and line 99 to:
predictions = predictions.reshape((len(predictions) // 10, 10, -1))
1) First off, you need to revert line 35 (32?) of classify.py: self.transformer.set_transpose(in_, (2, 1, 0)) back to the original self.transformer.set_transpose(in_, (2, 0, 1)). That way it expects HWC input and transposes internally to CHW for downstream processing.
2) Run your Classifier branch as it is. You are likely to get a bad result. Please check this. If so, it means the image database is not CWH as you've commented, but actually CHW. After you've confirmed this, change your Classifier branch: ims.append(img.transpose(2,1,0)) becomes ims.append(img.transpose(1,2,0)). Re-test your Classifier branch. The result should be 98.2% (go to Step 3) or 98.65% (try Step 4).
3) If your result in Step 2 is 98.2%, also undo your second change to classify.py. Theoretically, since your images have an even height/width, // and / should make no difference. If it does differ, or crashes, something is seriously wrong with your image database: your assumption about the image size is incorrect. You need to check these. They could be off by a pixel or so, and that could explain the slight discrepancies in accuracy.
4) If your result in Step 2 is 98.65%, then you need to make changes to the Caffe.Net branch of your code. The database images are CHW, so you need to change the first transpose to img = img.transpose(1,2,0) and the second transpose, after mean subtraction, to img = img.transpose(2,0,1). Then run your Caffe.Net branch. If you still get 98.1% as before, you should check that mean subtraction is performed correctly by your network.
In Steps (2) and (4), it's possible to get worse results, which means that the problem is likely a difference in mean subtraction for your trained Net vs your expectations in Python code. Check this.
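For clarity, here is a small sketch of the transposes Step 4 describes for the Caffe.Net branch (an illustration only, assuming the LMDB images really are CHW; the per-channel mean values here are placeholders, not your actual mean file):

import numpy as np

img = np.random.rand(3, 256, 256).astype(np.float32)           # stand-in for a CHW image from LMDB
mean_pixel = np.array([104.0, 117.0, 123.0], dtype=np.float32)  # hypothetical per-channel mean

img_hwc = img.transpose(1, 2, 0)      # CHW -> HWC so the mean broadcasts over the channel axis
img_hwc -= mean_pixel
img_chw = img_hwc.transpose(2, 0, 1)  # HWC -> CHW, the layout caffe.Net's data blob expects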
About your 98.2% for the caffe.Classifier:
If you look at lines 78-80, the center crop is done along crop_dims, not img_dims. If you further look at line 42, the caffe.Classifier constructor, crop_dims is never user-determined; it is determined by the size of the Net's input blobs. Lastly, if you look at line 70, img_dims is used to resize the images prior to center cropping. So what's happening with your setup is: the images first get resized to 224 x 224, then uselessly center cropped to 224 x 224 (I assume this is the HxW for your Net). You will obviously get results poorer than 98.65%. What you need to do is change img_dims to (256, 256). That prevents the resizing. The crop will be picked up automatically from your Net and you should get your 98.65%.
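Concretely, in the questioner's code above that would just mean changing the Classifier construction (a sketch; the constructor keyword there is image_dims, and whether the Net's input really is 224x224 is an assumption):

net = caffe.Classifier(args.proto, args.model,
                       mean=mean_image[0].mean(1).mean(1),
                       image_dims=(256, 256))  # resize to 256x256; the 224x224 center crop comes from the Net's input blob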