I have a machine learning model trained on a custom data set that produces bounding boxes for the specific portions of an image containing text I am interested in. I am trying to see if I can perform OCR on a bounding box and retrieve the chunk of text it contains using Tesseract. Can someone point me in the right direction to obtain the text, given the bounding box?
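If the model returns pixel coordinates, one straightforward route is to crop each box out of the image and hand just that crop to Tesseract. Below is a minimal sketch using pytesseract and Pillow; the (x_min, y_min, x_max, y_max) box format and the file name are assumptions, so adapt them to whatever your model actually outputs.

```python
from PIL import Image
import pytesseract

def ocr_box(image_path, box):
    """Crop one bounding box out of the image and OCR just that region."""
    image = Image.open(image_path)
    x_min, y_min, x_max, y_max = box  # assumed box format from your model
    crop = image.crop((x_min, y_min, x_max, y_max))
    # --psm 6 treats the crop as a single uniform block of text, which
    # usually behaves better on small crops than full page segmentation.
    return pytesseract.image_to_string(crop, config="--psm 6")

print(ocr_box("document.png", (120, 80, 560, 140)))  # hypothetical path and box
```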
When we use Google Vision's DOCUMENT_TEXT_DETECTION on an image, it decides what the blocks in the image are and what text is in each block.
Here I want to get the text for blocks that are defined by me (I already have a model that identifies the different blocks in an image).
Simply put, I want the text within the blocks defined by me, not the ones defined by Google Vision.
How can I achieve this?
For now, I have decided to filter symbols by the given block's vertices. It would be better if there were a way to directly find the intersecting symbols; for now, I'm going to loop through every symbol.
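For reference, a rough sketch of that symbol loop, assuming the response object comes from google.cloud.vision's document_text_detection and that a user-defined block is an axis-aligned rectangle (x_min, y_min, x_max, y_max). A real implementation might substitute a proper polygon intersection test for the center check used here.

```python
def symbol_center_in_rect(symbol, rect):
    """True if the symbol's bounding-box center lies inside the rectangle."""
    x_min, y_min, x_max, y_max = rect
    vs = symbol.bounding_box.vertices
    cx = sum(v.x for v in vs) / len(vs)
    cy = sum(v.y for v in vs) / len(vs)
    return x_min <= cx <= x_max and y_min <= cy <= y_max

def text_in_block(response, block_rect):
    """Collect the text of all symbols falling inside one user-defined block."""
    words_out = []
    for page in response.full_text_annotation.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    text = "".join(s.text for s in word.symbols
                                   if symbol_center_in_rect(s, block_rect))
                    if text:
                        words_out.append(text)
    return " ".join(words_out)
```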
I found a better way to do this. First I merge the blocks vertically, and between every pair of blocks I insert a textual separator, so that after every block there is a known line of text. We can then feed this image with the merged blocks to the Google Vision API as input. In the response we get the full text of our input, which also contains the text we previously placed between the blocks, so we can split the whole text on that separator and end up with block-wise text.
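A sketch of that merge step, assuming Pillow and blocks given as (x_min, y_min, x_max, y_max) crops from the same source image. The separator token is arbitrary, and rendering it in a large, clean font improves the chance that the OCR reads it back intact for splitting.

```python
from PIL import Image, ImageDraw

SEPARATOR = "XSEPARATORX"  # arbitrary token, chosen to be easy to OCR and split on

def merge_blocks(image_path, blocks, gap=40):
    """Stack the block crops vertically, drawing a separator line between them."""
    image = Image.open(image_path)
    crops = [image.crop(b) for b in blocks]  # blocks: (x_min, y_min, x_max, y_max)
    width = max(c.width for c in crops)
    height = sum(c.height for c in crops) + gap * len(crops)
    merged = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(merged)
    y = 0
    for crop in crops:
        merged.paste(crop, (0, y))
        y += crop.height
        # Pillow's default font is small; a TrueType font via ImageFont.truetype
        # makes the separator easier for the OCR engine to read back.
        draw.text((10, y + 10), SEPARATOR, fill="black")
        y += gap
    return merged

# After sending the merged image to the Vision API:
# block_texts = full_text.split(SEPARATOR)
```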
Friends, I have images (more than 100) with their ground truths (black-and-white masks). How can I automatically derive bounding box values from them in Pascal VOC format, i.e. XML files?
I mean creating xmin, xmax, ymin, ymax values from the masks and saving them as XML files. I used LabelImg, but I did not find an automatic way to do this there. I will use them for deep learning with Pascal VOC.
Is there a code sample, tool, or link showing how to do this?
If you want to get a bounding box from a mask, you just need to use numpy.where() to get the indexes of each mask's foreground pixels; the min and max of those indexes are exactly the coordinates of the bounding box.
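A minimal sketch of that, plus writing the result out as a bare-bones Pascal VOC annotation with the standard library. The class name and file names below are placeholders.

```python
import numpy as np
import xml.etree.ElementTree as ET

def mask_to_box(mask):
    """mask: 2-D array, nonzero where the object is."""
    ys, xs = np.where(mask > 0)  # row and column indexes of foreground pixels
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def write_voc_xml(filename, size, box, class_name, out_path):
    """Emit a minimal Pascal VOC annotation for one object."""
    ann = ET.Element("annotation")
    ET.SubElement(ann, "filename").text = filename
    sz = ET.SubElement(ann, "size")
    for tag, value in zip(("width", "height", "depth"), size):
        ET.SubElement(sz, tag).text = str(value)
    obj = ET.SubElement(ann, "object")
    ET.SubElement(obj, "name").text = class_name
    bb = ET.SubElement(obj, "bndbox")
    for tag, value in zip(("xmin", "ymin", "xmax", "ymax"), box):
        ET.SubElement(bb, tag).text = str(value)
    ET.ElementTree(ann).write(out_path)

# Hypothetical usage for one mask:
# box = mask_to_box(mask)
# write_voc_xml("img001.jpg", (640, 480, 3), box, "object", "img001.xml")
```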
From what I have read, I understand that the methods used in Faster R-CNN and SSD involve generating a set of anchor boxes. We first downsample the training image using a CNN, and for every pixel in the downsampled feature map (which will form the center for our anchor boxes) we project it back onto the training image. We then draw the anchor boxes centered around that pixel using our pre-determined scales and ratios. What I don't understand is why we don't directly assume the centers of our anchor boxes on the training image with a suitable stride and use the CNN to output only the classification and regression values. What are we gaining by using the CNN to determine the centers of our anchor boxes, which are ultimately going to be distributed evenly on the training image?
To state it more clearly:
Where will the centers of our anchor boxes be on the training image before our first prediction of the offset values, and how do we decide those?
I think the confusion comes from this:
What are we gaining by using the CNN to determine the centers of our anchor boxes, which are ultimately going to be distributed evenly on the training image?
The network usually doesn't predict centers but corrections to a prior belief. The initial anchor centers are distributed evenly across the image and, as such, don't fit the objects in the scene tightly enough. Those anchors just constitute a prior in the probabilistic sense. What exactly your network outputs is implementation-dependent, but will likely just be updates, i.e. corrections to those initial priors. This means that the "centers" predicted by your network are some delta_x, delta_y that adjust the bounding boxes.
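A small numpy sketch of this prior-plus-correction scheme: the anchor centers form a fixed, evenly spaced grid determined by the feature-map stride, and the network's outputs only nudge them. The delta parameterization below follows the common Faster R-CNN / SSD convention, but exact details vary between implementations.

```python
import numpy as np

def anchor_centers(feature_h, feature_w, stride):
    """Evenly spaced centers on the input image, one per feature-map cell."""
    xs = (np.arange(feature_w) + 0.5) * stride
    ys = (np.arange(feature_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    return cx.ravel(), cy.ravel()

def decode(anchor_cx, anchor_cy, anchor_w, anchor_h, dx, dy, dw, dh):
    """Apply the network's predicted corrections to an anchor prior."""
    cx = anchor_cx + dx * anchor_w   # center shift is scaled by anchor size
    cy = anchor_cy + dy * anchor_h
    w = anchor_w * np.exp(dw)        # width/height corrections are log-scale
    h = anchor_h * np.exp(dh)
    return cx, cy, w, h
```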
Regarding this part:
why we don't directly assume the centers of our anchor boxes on the training image with a suitable stride and use the CNN to output only the classification and regression values
The regression values should still contain sufficient information to determine a bounding box in a unique way. Predicting width, height and center offsets (corrections) is a straightforward way to do it, but it's certainly not the only way. For example, you could modify the network to predict, for each pixel, the distance vector to its nearest object center, or you could use parametric curves. However, crude fixed anchor centers are not a good idea, since they will also cause problems in classification: you use them to pool features that are supposed to be representative of the object.
I'm using ForgeViewer to display both IFC models and custom geometry (point clouds and meshes using THREE.js directly), and I'm using the Section tool to cut away parts of the model.
Is there any way I can set the size of the planes in the UI? I want the arrows and planes to be centered around specific models, making them easier to use. It would also be nice to be able to set the default size and position of the cutting box.
The size of the cutting plane/box, as well as the position of the manipulation gizmo, is estimated by the Section tool based on the bounding box of all visible objects. There's no UI to change that behavior, but you might be able to reverse engineer the official Section tool and perhaps modify it to your needs.
Edit: alternatively, you could retrieve the THREE.js geometry representing the cutting plane after it's been created by the Section tool (and placed into viewer.impl.sceneAfter) and customize it as needed.
Adding to Petr's answer...
Use the 'box section' tool (see screenshot) and manually adjust the box size by clicking on each of the box faces.
Then use Augusto's blog post (below) to programmatically capture your box section (using viewer.getState();) and replay it (viewer.setCutPlanes(planes);).
https://forge.autodesk.com/blog/viewer-setcutplanes
I am developing an OCR application to detect credit cards.
After scanning the image I get a list of words with their positions.
Any tips/suggestions on the best approach to detect which words correspond to each field of the credit card (number, date, name)?
For example:
position = 96.00 491.00
text = CARDHOLDER
Thanks in advance
Your first problem is that most OCRs are not optimised for small amounts of text that take up most of the "page" (or card image, in your case) in spatially separated chunks. They expect lines, or pages of text from a scanned book or a newspaper. So straight away they're not likely to do that well at analysing the image.
Because the font is fairly uniform they'll likely recognise the characters well, but the layout will confuse the page segmentation algorithm and so the text you get out might not be in the right order. For example, the "1234" of the card number and the smaller "1234" below it constitute a single column of text, likewise the second two sets of four numbers and the expiration date.
For specialized cases where you know the layout in advance, you really want to develop your own page segmentation algorithm to break the image up into zones, e.g. card number, cardholder name, start and expiration dates. This shouldn't be too hard, because I think the locations of these components are standardised on credit cards. Assuming good preprocessing and binarization, you could basically take a horizontal histogram and split the image at the troughs.
Then extract each zone as a separate image containing just one line of text and feed it to the OCR.
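A sketch of that histogram split, assuming the image has already been binarized into a 2-D numpy array with text pixels as 1 and background as 0.

```python
import numpy as np

def split_into_zones(binary, min_height=5):
    """Cut the image at horizontal troughs (rows containing no text pixels)."""
    is_text_row = binary.sum(axis=1) > 0     # per-row text-pixel presence
    zones, start = [], None
    for y, has_text in enumerate(is_text_row):
        if has_text and start is None:
            start = y                        # a run of text rows begins
        elif not has_text and start is not None:
            if y - start >= min_height:      # ignore specks thinner than this
                zones.append(binary[start:y])
            start = None
    if start is not None:
        zones.append(binary[start:])
    return zones                             # one zone per line/field
```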
Alternatively (the quick-and-dirty approach):
Instruct the OCR that what you want to recognise consists of a single column (i.e. prevent it from trying to figure out the page layout itself). You can do this with Tesseract using the -psm (page segmentation mode) parameter (--psm in Tesseract 4 and later), set to, probably, 6 (but experiment to see which mode gives you the best results).
Make Tesseract output hOCR format, which you can set in the config file. hOCR output includes the bounding boxes of the recognised lines relative to the whole image.
Write an algorithm that compares the bounding boxes in the hOCR output to where you know each card component should be, looking for some percentage of overlap (it won't match exactly, for obvious reasons). A sketch of these steps follows.
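A sketch of those three steps with pytesseract, whose image_to_pdf_or_hocr call can emit hOCR directly. The file name and zone coordinates are placeholders, and the regex is a quick way to pull the line bboxes out of the hOCR rather than a full HTML parse.

```python
import re
import pytesseract
from PIL import Image

image = Image.open("card.png")  # hypothetical filename
hocr = pytesseract.image_to_pdf_or_hocr(
    image, extension="hocr", config="--psm 6").decode("utf-8")

def overlap_fraction(a, b):
    """Fraction of box a's area covered by box b; boxes are (x0, y0, x1, y1)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    area = (a[2] - a[0]) * (a[3] - a[1])
    return ix * iy / area if area else 0.0

NUMBER_ZONE = (40, 180, 600, 240)  # hypothetical card-number region

# hOCR lines look like: <span class='ocr_line' ... title='bbox x0 y0 x1 y1; ...'>
for m in re.finditer(
        r'class=[\'"]ocr_line[\'"][^>]*title=[\'"]bbox (\d+) (\d+) (\d+) (\d+)',
        hocr):
    box = tuple(int(v) for v in m.groups())
    if overlap_fraction(box, NUMBER_ZONE) > 0.5:
        print("line at", box, "is probably the card number")
```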
In addition to the good tips provided by Mikesname, you can greatly improve the recognition result, regardless of which OCR engine you use, by using image processing to convert the image to bitonal (pure black and white), as in the attached copy of your image.
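A sketch of that bitonal conversion using OpenCV's Otsu threshold; the file names are placeholders, and adaptive thresholding may work better for unevenly lit card photos.

```python
import cv2

gray = cv2.imread("card.png", cv2.IMREAD_GRAYSCALE)  # hypothetical filename
# Otsu picks the global threshold automatically from the image histogram.
_, bitonal = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite("card_bitonal.png", bitonal)
```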