FirebaseVisionImage / ML Toolkit cropRect() support - firebase-mlkit

I am posting this question by request of a Firebase engineer.
I am using the Camera2 API in conjunction with Firebase-mlkit vision. I am using both barcode and on-platform OCR. The things I am trying to decode are mostly labels on equipment. In testing the application I have found that trying to scan the entire camera image produces mixed results. The main problem is that the field of view is too wide.
If there are multiple bar codes in view, firebase returns multiple results. You can sort of work around this by looking at the coordinates and picking the one closest to the center.
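The "closest to the center" workaround can be sketched roughly as follows. This is only an illustration in plain Java: `Box` and `closestToCenter` are hypothetical names standing in for whatever bounding boxes the detector hands back, not part of the Firebase API.

```java
import java.util.List;

final class CenterPicker {
    /** Axis-aligned box in pixel coordinates (hypothetical stand-in for a detector result's bounding box). */
    static final class Box {
        final int left, top, right, bottom;
        Box(int left, int top, int right, int bottom) {
            this.left = left; this.top = top; this.right = right; this.bottom = bottom;
        }
        double centerX() { return (left + right) / 2.0; }
        double centerY() { return (top + bottom) / 2.0; }
    }

    /** Returns the index of the box whose center is nearest the image center, or -1 if the list is empty. */
    static int closestToCenter(List<Box> boxes, int imageWidth, int imageHeight) {
        double cx = imageWidth / 2.0;
        double cy = imageHeight / 2.0;
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < boxes.size(); i++) {
            Box b = boxes.get(i);
            double dx = b.centerX() - cx;
            double dy = b.centerY() - cy;
            double dist = dx * dx + dy * dy; // squared distance is enough for comparison
            if (dist < bestDist) {
                bestDist = dist;
                best = i;
            }
        }
        return best;
    }
}
```

It only picks among results that were already decoded, so it does nothing for the time spent scanning the full frame.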
When scanning text, it's more or less the same, except that you get multiple Blocks, many times incomplete (you'll get a couple of letters here and there).
You can't just narrow the camera mode, though - for this type of scanning, the user benefits from the "wide" camera view for alignment. The ideal situation would be if you have a camera image (let's say for the sake of argument it's 1920x1080) but only a subset of the image is given to firebase-ml. You can imagine a camera view that has a guide box on the screen, and you orient and zoom the item you want to scan within that box.
You can select what kind of image comes from the Camera2 API, but firebase-ml spits out warnings if you choose anything other than YUV_420_888. The problem is that there's not a great way in the Android API to deal with YUV images unless you do it yourself. That's what I ultimately ended up doing - I solved my problem by writing a RenderScript that takes an input YUV image, converts it to RGBA, crops it, then applies any rotation if necessary. The result of this is a Bitmap, which I then feed into either the FirebaseVisionBarcodeDetector or FirebaseVisionTextRecognizer.
Note that the Bitmap itself causes mlkit runtime warnings, urging me to use the YUV format instead. This is possible, but difficult. You would have to read the byte array and stride information from the original camera2 YUV image and create your own. The object that comes from camera2 is unfortunately a package-protected class, so you can't subclass it or create your own instance - you'd essentially have to start from scratch. (I'm sure there's a reason Google made this class package-protected, but it's extremely annoying that they did.)
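To give an idea of what "reading the byte array and stride information" involves, here is a minimal sketch that crops a rectangle out of a luma (Y) plane held in a plain byte array. The class and method names are made up for illustration, and it assumes a pixel stride of 1, which is what the Y plane of a YUV_420_888 image has. The chroma planes would need the same treatment with halved coordinates and their own pixel stride.

```java
final class LumaCrop {
    /**
     * Copies a width x height window starting at (left, top) out of a luma plane.
     * rowStride is the number of bytes between the starts of consecutive rows,
     * which may be larger than the image width because of padding.
     * Assumes pixel stride 1 (one byte per luma sample, no interleaving).
     */
    static byte[] cropLuma(byte[] plane, int rowStride, int left, int top, int width, int height) {
        byte[] out = new byte[width * height];
        for (int row = 0; row < height; row++) {
            int src = (top + row) * rowStride + left; // skip padding via rowStride
            System.arraycopy(plane, src, out, row * width, width);
        }
        return out;
    }
}
```

The output is tightly packed, so its effective row stride equals the crop width.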
The steps I outlined above all work, but with format warnings from mlkit. What makes it even better is the performance gain - the barcode scanner operating on an 800x300 image takes a tiny fraction as long as it does on the full size image!
It occurs to me that none of this would be necessary if firebase paid attention to cropRect. According to the Image API, cropRect defines what portion of the image is valid. That property seems to be mutable, meaning you can get an Image and change its cropRect after the fact. That sounds perfect. I thought that I could get an Image off of the ImageReader, set cropRect to a subset of that image, and pass it to Firebase and that Firebase would ignore anything outside of cropRect.
This does not seem to be the case. Firebase seems to ignore cropRect. In my opinion, firebase should either support cropRect, or the documentation should explicitly state that it ignores it.
My request to the firebase-mlkit team is:
Define the behavior I should expect with regard to cropRect, and document it more explicitly.
Explain at least a little about how images are processed by these recognizers. Why is it so insistent that YUV_420_888 be used? Maybe only the Y channel is used in decoding? Doesn't the recognizer have to convert to RGBA internally? If so, why does it get angry at me when I feed in Bitmaps?
Make these recognizers either pay attention to cropRect, or state that they don't and provide another way to tell these recognizers to work on a subset of the image, so that I can get the performance (reliability and speed) that one would expect out of having to ML correlate/transform/whatever a smaller image.
--Chris

Related

What is the best practice for displaying huge chat logs or console logs in a scrollable window? (AS3)

I'm writing a graphic console that highlights different entries and stores things when you input them (in AS3) but I've found that once there are thousands of entries, the program starts lagging and scrolling is slow. If I want scrolling to be animated with acceleration it gets even slower.
How do I move the giant block of objects that are my stored entries up and down?
Do I have to progressively load messages around where the user is looking? How does the scrollbar handle this, then?
You should create a custom container instead of a TextField; that also makes accelerated scrolling easier to build.
Each log entry would be an extended DisplayObject that holds anything you want, much like inflating layouts in Android.
The most important part is reducing memory usage:
Store only the plain text of the log entries in something like a global array. When the scroll position gets close enough, generate the layouts and add them to the container to show them, and conversely remove the entries that are far behind.
However, this process still uses a lot of memory at runtime.
So, following the concept of Android's DiskLruCache, you can store the part of the invisible data that is too far from the scroll position on disk instead of in memory, using SharedObjects.
How do I move the giant block of objects that are my stored entries up and down?
You don't. As you have noticed, when the number of DisplayObjects on the display list greatly increases, the memory overhead increases and the housekeeping needed to manage those DisplayObjects eventually causes performance to suffer. You don't mention any details of how you are implementing what you have so far, so my comments will be general.
The way this is handled by various platform list components in Flex, iOS and, I assume, Flash is to only display the minimum number of objects needed, and as the user scrolls, objects are shuffled in and out of the render list. A further optimization is to use a "pool" of "template" objects which are reused so you don't pay an initialization time penalty. There is probably an actual name for this ("...buffering...") technique but I don't know what it is (hopefully some kind person will provide it and a link to a fuller description of how it works).
And as for how it works - you can DIY it, figuring out, as the user scrolls, which objects are moving off-screen and can be recycled, which are going to move on-screen, etc. Of course this all assumes that you have your objects stored in a data structure like an Array, ArrayList or ArrayCollection. As an alternative to coding all this from scratch, you might see if the DataGrid or List components will meet your needs - they manage all of this for you.
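If you do go the DIY route, the core bookkeeping is small. Below is a rough sketch in Java (not AS3) that assumes fixed-height rows; `visibleRange` and `RowPool` are hypothetical names, and a `StringBuilder` stands in for whatever display object renders one entry.

```java
import java.util.ArrayDeque;

final class VirtualList {
    /** Returns {first, lastExclusive}: the entry indices that intersect the viewport. Assumes fixed-height rows. */
    static int[] visibleRange(int scrollY, int viewportHeight, int rowHeight, int entryCount) {
        int first = Math.min(entryCount, Math.max(0, scrollY / rowHeight));
        // Round up so a partially visible last row is included.
        int last = Math.min(entryCount, (scrollY + viewportHeight + rowHeight - 1) / rowHeight);
        return new int[] { first, Math.max(first, last) };
    }

    /** Pool of reusable row objects, so scrolling does not allocate a new one per entry. */
    static final class RowPool {
        private final ArrayDeque<StringBuilder> free = new ArrayDeque<>();
        int created = 0;

        /** Hands out a recycled row if one is available, otherwise creates one, and binds the entry text to it. */
        StringBuilder obtain(String text) {
            StringBuilder row = free.poll();
            if (row == null) {
                row = new StringBuilder();
                created++;
            }
            row.setLength(0);
            row.append(text);
            return row;
        }

        /** Returns a row that scrolled off-screen to the pool. */
        void recycle(StringBuilder row) {
            free.push(row);
        }
    }
}
```

On each scroll event you would compute the new range, recycle the rows that fell outside it, and obtain rows for the indices that entered it. The scrollbar only needs `entryCount * rowHeight` as the total height, not the rows themselves.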
Flash Tutorial: The DataGrid Component (youTube video)
Customize the List component
Lots of other examples and resources out there.
(Again, I work in Flex, where the DataGrid and other list-based components can be customized extensively using "skins" and custom item renderers for visual style - not sure if it is the same in Flash.)

what is COLOR_FormatYUV420Flexible?

I wish to encode video/avc on my android encoder. The encoder (Samsung S5) publishes COLOR_FormatYUV420Flexible as one of its supported formats. yay!
But I don't quite understand what it is or how I can use it. The docs say:
Flexible 12 bits per pixel, subsampled YUV color format with 8-bit chroma and luma components.
Chroma planes are subsampled by 2 both horizontally and vertically. Use this format with Image. This format corresponds to YUV_420_888, and can represent the COLOR_FormatYUV411Planar, COLOR_FormatYUV411PackedPlanar, COLOR_FormatYUV420Planar, COLOR_FormatYUV420PackedPlanar, COLOR_FormatYUV420SemiPlanar and COLOR_FormatYUV420PackedSemiPlanar formats
This seems to suggest that I can use this const with just about any kind of YUV data: planar, semi-planar, packed etc. This seems unlikely: how would the encoder know how to interpret the data unless I specify exactly where the U/V values are?
Is there any metadata that I need to provide in addition to this const? Does it just work?
Almost, but not quite.
This constant can be used with almost any form of YUV (planar, semiplanar, packed and all that). But the catch is that you don't get to choose the layout and have the encoder support it - it's the other way around. The encoder will choose the surface layout and will describe it via the flexible description, and you need to support it, whichever one it happens to be.
In practice, when using this, you don't call getInputBuffers() or getInputBuffer(int index), you call getInputImage(int index), which returns an Image, which contains pointers to the start of the three planes, and their row and pixel strides.
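As an illustration of what supporting "whichever layout it happens to be" means, here is a small sketch that writes one tightly packed plane into a destination buffer using a given row stride and pixel stride. It is plain Java so it can be run anywhere; on Android the buffer and strides would come from `Image.Plane.getBuffer()`, `getRowStride()` and `getPixelStride()`, while the class and method names here are made up.

```java
import java.nio.ByteBuffer;

final class PlaneWriter {
    /**
     * Writes a tightly packed width x height plane into dst, honouring the
     * destination's row stride (bytes between row starts) and pixel stride
     * (bytes between neighbouring samples of the same plane).
     */
    static void writePlane(byte[] src, int width, int height,
                           ByteBuffer dst, int rowStride, int pixelStride) {
        for (int row = 0; row < height; row++) {
            int rowStart = row * rowStride;
            for (int col = 0; col < width; col++) {
                // Absolute put: does not move the buffer's position.
                dst.put(rowStart + col * pixelStride, src[row * width + col]);
            }
        }
    }
}
```

For the U and V planes the width and height are half those of the Y plane, and a pixel stride of 2 is what you would see when the encoder picked a semiplanar layout.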
Note - when calling queueInputBuffer afterwards, you have to supply a size parameter, which can be tricky to figure out - see https://stackoverflow.com/a/35403738/3115956 for more details on that.

Extract or crop image from within TIFF

I need to extract/crop the logotype (BEAVER) in the middle from a TIFF file that looks like this: http://i41.tinypic.com/2i7rbie.jpg
And then I need to automate the process so it can be repeated about 9 million times...
My guess is that I would have to use some OCR software. But is it possible for such a software to "crop anything that starts below this point and ends above this point"?
Thoughts?
Typically, OCR software only extracts text from images and converts it into some text-specific format. It does not crop. However, you can use OCR technologies to achieve your task. I would recommend the following:
OCR whole page
Get coordinates of recognized text
Apply your magic rules to the recognized text to locate the area to crop, such as everything between the "application filled" and "STATEMENT" sentences.
Cut from image that area and export it where you want it.
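The rule step above might look something like this in code. It is only a sketch: `Line` is a hypothetical stand-in for whatever your OCR engine returns (text plus vertical coordinates), and it assumes the lines are sorted top to bottom.

```java
import java.util.List;
import java.util.Locale;

final class AnchorCrop {
    /** One recognized text line with its vertical extent in pixels (hypothetical OCR result type). */
    static final class Line {
        final String text;
        final int top, bottom;
        Line(String text, int top, int bottom) {
            this.text = text; this.top = top; this.bottom = bottom;
        }
    }

    /**
     * Returns {top, bottom} of the band between the line containing `above`
     * and the next line containing `below`, or null if the anchors are
     * missing or out of order. Assumes lines are sorted top to bottom.
     */
    static int[] bandBetween(List<Line> lines, String above, String below) {
        String a = above.toLowerCase(Locale.ROOT);
        String b = below.toLowerCase(Locale.ROOT);
        Integer start = null;
        Integer end = null;
        for (Line line : lines) {
            String t = line.text.toLowerCase(Locale.ROOT);
            if (start == null) {
                if (t.contains(a)) start = line.bottom; // crop begins below the first anchor
            } else if (end == null && t.contains(b)) {
                end = line.top;                         // and ends above the second anchor
            }
        }
        if (start == null || end == null || end <= start) return null;
        return new int[] { start, end };
    }
}
```

A null result is a natural signal to route the image to a manual review queue instead of guessing.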
The real challenge is the amount of material you want to process. You have to be very careful when defining your "smart rules" to make sure they don't produce false positives, and always send suspicious images to a separate queue that you will later review manually and use to update your rules.
In general it may look like this:
Take the first 10 images, define logo detection rules, test and see if everything works well.
Then run on the next 10, see what was processed wrongly and what was not processed, update the rules, and re-process those 10 to make sure everything works well now.
Re-run it on new batches of the same size until it starts working well.
Then increase the batch size from 10 to 100, and keep going with those batches until everything again starts working smoothly.
Then continue this way, perfecting your rules and increasing the batch size. At some point you will reach production speed.
Most likely you will encounter some strange images that either contradict existing rules or are just wrong. You don't always have to update your rules to accommodate them. It may turn out that there are only a dozen images like that in your whole 9 million collection. It might be better to leave them in the exceptions queue for manual processing rather than risk the stability of your magic rules.

document image processing

I am working on an application for processing document images (mainly invoices) and basically, I'd like to convert certain regions of interest into an XML structure and then classify the document based on that data. Currently I am using ImageJ for analyzing the document image and Asprise/tesseract for OCR.
Now I am looking for something to make developing easier. Specifically, I am looking for something to automatically deskew a document image and analyze the document structure (e.g. converting an image into a quadtree structure for easier processing). Although I prefer Java and ImageJ I am interested in any libraries/code/papers regardless of the programming language it's written in.
While the system I am working on should as far as possible process data automatically, the user should oversee the results and, if necessary, correct the classification suggested by the system. Therefore I am interested in using machine learning techniques to achieve more reliable results. When similar documents are processed, e.g. invoices of a specific company, its structure is usually the same. When the user has previously corrected data of documents from a company, these corrections should be considered in the future. I have only limited knowledge of machine learning techniques and would like to know how I could realize my idea.
The following prototype in Mathematica finds the coordinates of blocks of text and performs OCR within each block. You may need to adapt the parameter values to fit the dimensions of your actual images. I do not address the machine learning part of the question; perhaps you would not even need it for this application.
Import the picture, create a binary mask for the printed parts, and enlarge these parts using a horizontal closing (dilation and erosion).
Query for each blob's orientation, cluster the orientations, and determine the overall rotation by averaging the orientations of the largest cluster.
Use the previous angle to straighten the image. At this time OCR is possible, but you would lose the spatial information for the blocks of text, which will make the post-processing much more difficult than it needs to be. Instead, find blobs of text by horizontal closing.
For each connected component, query for the bounding box position and the centroid position. Use the bounding box positions to extract the corresponding image patch and perform OCR on the patch.
At this point, you have a list of strings and their spatial positions. That's not XML yet, but it sounds like a good starting point to be tailored straightforwardly to your needs.
This is the code. Again, the parameters (structuring elements) of the morphological functions may need to change, based on the scale of your actual images; also, if the invoice is too tilted, you may need to "rotate" roughly the structuring elements in order to still achieve good "un-skewing."
img = ColorConvert[Import@"http://www.team-bhp.com/forum/attachments/test-drives-initial-ownership-reports/490952d1296308008-laura-tsi-initial-ownership-experience-img023.jpg", "Grayscale"];
b = ColorNegate@Binarize[img];
mask = Closing[b, BoxMatrix[{2, 20}]]
orientations = ComponentMeasurements[mask, "Orientation"];
angles = FindClusters@orientations[[All, 2]]
\[Theta] = Mean[angles[[1]]]
straight = ColorNegate@Binarize[ImageRotate[img, \[Pi] - \[Theta], Background -> 1]]
TextRecognize[straight]
boxes = Closing[straight, BoxMatrix[{1, 20}]]
comp = MorphologicalComponents[boxes];
measurements = ComponentMeasurements[{comp, straight}, {"BoundingBox", "Centroid"}];
texts = TextRecognize@ImageTrim[straight, #] & /@ measurements[[All, 2, 1]];
Cases[Thread[measurements[[All, 2, 2]] -> texts], (_ -> t_) /; StringLength[t] > 0] // TableForm
The paper we use for skew angle detection is: Skew detection and text line position determination in digitized documents by Gatos et al. The only limitation of this paper's method is that it can only detect skew between -5 and +5 degrees. Beyond that, we need something to slap the user with a message! :)
In your case, where the scans are primarily invoices, you may well be able to use: Multiresolution Analysis in Extraction of Reference Lines from Documents with Gray Level Background by Tag et al.
We wrote the code in MATLAB, if you need help let me know!
I worked on a similar project once and, being a long-time user of OpenCV, I ended up using it once again. OpenCV is a popular cross-platform computer vision library that offers programming interfaces for C and C++.
I found an interesting blog that had a post on how to detect the skew angle of a text using OpenCV, and then another on how to deskew.
To retrieve the text of the document and be able to pass a smaller image to tesseract, I suggest taking a look at the bounding box technique.
I don't know if the image acquisition procedure is your responsibility, but if it is you might want to take a look at how to do camera calibration with OpenCV to fix the distortion in the image caused by some camera lenses.

Optical character recognition

Hey everyone,
I'm trying to create a program in Java that can read numbers off the screen, and also recognise images on the screen. I was wondering how I can achieve this?
The font of the numbers will always be the same. I have never programmed anything like this before, but my idea of how it works is to have the program take a screenshot, then overlay the image of each number on the relevant section of the screenshot and check if they match, repeating this for each number. If this is the correct way to do this, how would I put that in code?
Thanks in advance for any help.
You could always train a neural net to do it for you. They can get pretty accurate sometimes. If you use something like Matlab it actually has capabilities for that already. Apparently there's a neural network library for java (http://neuroph.sourceforge.net/) although I've never used it personally.
Here's a tutorial about using neuroph: http://www.certpal.com/blogs/2010/04/java-neural-networks-and-neuroph-a-tutorial/
You can use a neural network, support vector machine, or other machine learning construct for this. But it will not do the entire job. If you do a screen shot, you are going to be left with a very large image that you will need to find the individual characters on. You also need to deal with the fact that the camera might not be pointed straight at the text that you want to read. You will likely need to use a series of algorithms to lock onto the right parts of the image and then downsample it in a way that size becomes neutral.
Here is a simple Java applet I wrote that does some of this.
http://www.heatonresearch.com/articles/42/page1.html
It lets you draw on a relatively large area and locks in on your char. Then it recognizes it. I am using the alphabet, but digits should be easier. The complete Java source code is included.
One simpler approach could be to use template matching. If the fonts are the same, and/or the size (in pixels) is known, then simple template matching can do the job for you. If the size of the input is unknown, you might have to create copies of the images at different scales and do the matching at each scale.
The one with the extreme value (highest or lowest, depending on the method you follow for template matching) is your result.
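As a rough idea of what that looks like in Java, here is a minimal sketch that slides a template over a grayscale image and scores each position by the sum of absolute differences (SAD), so the lowest value wins. SAD is just one possible matching method, chosen here for simplicity; the class name and the use of plain int arrays for pixels are assumptions made for the example.

```java
final class TemplateMatch {
    /**
     * Slides the template over the image and returns {x, y, score} for the
     * placement with the lowest sum of absolute differences.
     * Both arrays are indexed [row][column] and hold grayscale values.
     * Returns {-1, -1, Long.MAX_VALUE} if the template does not fit.
     */
    static long[] bestMatch(int[][] image, int[][] template) {
        int ih = image.length, iw = image[0].length;
        int th = template.length, tw = template[0].length;
        long best = Long.MAX_VALUE;
        int bestX = -1, bestY = -1;
        for (int y = 0; y + th <= ih; y++) {
            for (int x = 0; x + tw <= iw; x++) {
                long sad = 0;
                // Stop summing rows early once this position cannot beat the best so far.
                for (int ty = 0; ty < th && sad < best; ty++) {
                    for (int tx = 0; tx < tw; tx++) {
                        sad += Math.abs(image[y + ty][x + tx] - template[ty][tx]);
                    }
                }
                if (sad < best) {
                    best = sad;
                    bestX = x;
                    bestY = y;
                }
            }
        }
        return new long[] { bestX, bestY, best };
    }
}
```

To read digits you would run this once per digit template over the region of the screenshot where a digit is expected and keep the template with the lowest score.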
Follow this link for details