I have a collection of Chinese character images (each containing a single character, not running text). I need a way to OCR them, i.e. to map them to Unicode.
However, a crucial fact is that many of the images are a bit blurred or quite small. Thus the algorithm (or library, or online service) should not just return one Unicode value, but some kind of probability vector that estimates, for each candidate character, how likely it is that the given image represents it.
For example, one such image could have the following distribution:
咳 95%
骇 4%
该 1%
I'd rather not train a neural network myself; I'm sure that all OCR models are probabilistic underneath, so all I'm looking for is an OCR solution that exposes those probabilities for single characters.
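To make the desired output concrete, here is a minimal sketch of the kind of interface I'm after: top-k softmax probabilities from a single-character classifier. The model and charset below are purely hypothetical placeholders, not an existing OCR library.

```python
# Minimal sketch of the desired output: top-k probabilities for one character image.
# `charset` and `model` are hypothetical placeholders, not an existing OCR library.
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

charset = ["咳", "骇", "该"]                    # a real charset would hold thousands of characters
model = torch.nn.Linear(64 * 64, len(charset))  # stand-in for a real pretrained classifier
model.eval()

preprocess = transforms.Compose([
    transforms.Grayscale(),
    transforms.Resize((64, 64)),
    transforms.ToTensor(),
])

def char_probabilities(path, k=3):
    """Return the top-k (character, probability) pairs for one image."""
    x = preprocess(Image.open(path)).reshape(1, -1)
    with torch.no_grad():
        probs = F.softmax(model(x), dim=1)[0]
    top = torch.topk(probs, k)
    return [(charset[i], float(p)) for p, i in zip(top.values, top.indices)]

# desired behaviour, e.g.:
# char_probabilities("some_char.png") -> [("咳", 0.95), ("骇", 0.04), ("该", 0.01)]
```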
Related
zi2zi, a GAN that generates Chinese character glyphs, uses pix2pix for generating the images. I have also seen many other applications use pix2pix for tasks that aren't obviously image-to-image translation. I compared the code of zi2zi with regular pix2pix and found some implementation details that I couldn't understand.
What is the target, and where is the random noise? Unlike image-to-image translation tasks, where there is an obvious target image, what is supposed to serve as the target for character generation?
Suppose the output of the encoder portion of the U-Net is the latent space; how are we then supposed to set the latent space to a certain value for evaluation or latent-space exploration, given that the decoder is also affected by skip connections from the encoder network?
I also want to ask how pix2pix generalizes to these types of problems, since pix2pix wasn't meant to be a general-purpose solution.
After digging through the code for a few hours, I discovered how zi2zi applies the pix2pix methodology. If I am correct, the data is split into two parts: real_A and real_B. real_A is fed into the generator along with the class label embedding_ids, producing fake_B. The discriminator then aims at discriminating between fake_B and real_B, with real_A as the target image.
In conclusion, this seemingly works like an autoencoder, but with the discriminator acting as an evaluation metric. Conceptually, there isn't much difference between pix2pix and other GANs with encoders.
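A rough sketch of that data flow, as I understand it: this is not zi2zi's actual code; the network bodies are trivial stand-ins, the sizes are invented, and the pairing of (real_A, B) at the discriminator follows the usual pix2pix convention.

```python
# Rough sketch of the zi2zi-style data flow described above. NOT the real zi2zi code:
# the network bodies are stand-ins and all sizes are invented for illustration.
import torch
import torch.nn as nn

NUM_FONTS, EMB_DIM, IMG = 10, 128, 64

class Generator(nn.Module):
    """Encodes real_A, concatenates a font-class embedding (embedding_ids),
    and decodes to fake_B: the same character rendered in the target style."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_FONTS, EMB_DIM)
        self.encode = nn.Sequential(nn.Flatten(), nn.Linear(IMG * IMG, 256), nn.ReLU())
        self.decode = nn.Sequential(nn.Linear(256 + EMB_DIM, IMG * IMG), nn.Tanh())

    def forward(self, real_A, embedding_ids):
        z = torch.cat([self.encode(real_A), self.embed(embedding_ids)], dim=1)
        return self.decode(z).view(-1, 1, IMG, IMG)

class Discriminator(nn.Module):
    """Scores a (source, candidate) pair: (real_A, real_B) should score high,
    (real_A, fake_B) should score low."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(2 * IMG * IMG, 1))

    def forward(self, real_A, B):
        return self.net(torch.cat([real_A, B], dim=1))

G, D = Generator(), Discriminator()
real_A = torch.randn(4, 1, IMG, IMG)               # source-style glyphs
real_B = torch.randn(4, 1, IMG, IMG)               # same glyphs in the target style
embedding_ids = torch.randint(0, NUM_FONTS, (4,))  # which target font to produce

fake_B = G(real_A, embedding_ids)
d_real, d_fake = D(real_A, real_B), D(real_A, fake_B)  # what the discriminator compares
```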
I am learning OCR and reading this book https://www.amazon.com/Character-Recognition-Different-Languages-Computing/dp/3319502514
The authors define 8 stages for implementing OCR that follow one after another (2 after 1, 3 after 2, etc.):
1. Optical scanning
2. Location segmentation
3. Pre-processing
4. Segmentation
5. Representation
6. Feature extraction
7. Recognition
8. Post-processing
This is what they write about representation (#5):
The fifth OCR component is representation. The image representation
plays one of the most important roles in any recognition system. In
the simplest case, gray level or binary images are fed to a
recognizer. However, in most of the recognition systems in order to
avoid extra complexity and to increase the accuracy of the algorithms,
a more compact and characteristic representation is required. For this
purpose, a set of features is extracted for each class that helps
distinguish it from other classes while remaining invariant to
characteristic differences within the class. The character image
representation methods are generally categorized into three major
groups: (a) global transformation and series expansion (b) statistical
representation and (c) geometrical and topological representation.
This is what they write about feature extraction (#6):
The sixth OCR component is feature extraction. The objective of
feature extraction is to capture essential characteristics of symbols.
Feature extraction is accepted as one of the most difficult problems
of pattern recognition. The most straightforward way of describing a
character is by its actual raster image. Another approach is to extract
certain features that characterize symbols but leave out the unimportant
attributes. The techniques for extraction of such features are divided
into three groups, viz. (a) distribution of points, (b) transformations
and series expansions, and (c) structural analysis.
I am totally confused. I don't understand what representation is. As I understand it, after segmentation we must extract some features from the image, for example a topological structure like the Freeman chain code, and match them against a model saved at the learning stage, i.e. do recognition. In other words: segmentation, then feature extraction, then recognition. I don't understand what must be done at the representation stage. Please explain.
The representation component takes the raster image produced by segmentation and converts it into a simpler format (a "representation") that preserves the characteristic properties of classes. This is in order to reduce the complexity of the recognition process later on. The Freeman chain code you mention is one such representation.
Some (most?) authors conflate representation and feature extraction into a single step, but the authors of your book have chosen to treat them separately. Changing the representation isn't mandatory, but doing so reduces the complexity, and so increases the accuracy, of the training and recognition steps.
It is from this simpler representation that features are extracted in the feature extraction step. Which features are extracted will depend upon the representation chosen. This paper - Feature Extraction Methods for Character Recognition - A Survey - describes 11 different feature extraction methods that can be applied to 4 different representations.
The extracted features are what are passed to the trainer or recognizer.
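To make the distinction concrete, here is a minimal sketch (using OpenCV and NumPy, neither of which the book mandates) of a Freeman chain-code representation followed by a simple feature-extraction step, a direction histogram, which is what would then be handed to the trainer or recognizer.

```python
# Sketch: representation = Freeman chain code of the character boundary,
# feature extraction = normalized histogram of the 8 chain-code directions.
import cv2
import numpy as np

# step (dx, dy) -> Freeman code, with y pointing down as in image coordinates
FREEMAN = {(1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
           (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7}

def chain_code(binary_img):
    """Representation: Freeman chain code of the largest outer contour."""
    contours, _ = cv2.findContours(binary_img, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    boundary = max(contours, key=cv2.contourArea).squeeze(1)      # (N, 2) points
    steps = np.diff(np.vstack([boundary, boundary[:1]]), axis=0)  # close the loop
    return [FREEMAN[(int(dx), int(dy))] for dx, dy in steps]

def direction_histogram(code):
    """Feature extraction: how often each of the 8 directions occurs."""
    hist = np.bincount(code, minlength=8).astype(float)
    return hist / hist.sum()

# toy example: a filled square standing in for a character image
img = np.zeros((32, 32), np.uint8)
img[8:24, 8:24] = 255
print(direction_histogram(chain_code(img)))
```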
I am interested in understanding how optical character recognition works on numbers in particular, in an attempt to make my own. What is the logic behind determining a number (the requirements an image must meet to count as a given digit)? Are there any resources out there that describe the algorithms used to detect numbers? I imagine it largely comes down to comparing the locations of non-whitish pixels?
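A naive baseline along those lines (comparing the pixel layout of an unknown image against stored reference images, one per digit) might look like this sketch; the templates dict here is a hypothetical set of same-sized reference images, not part of any standard library.

```python
# Naive nearest-template digit recognizer: compare where the dark pixels of the
# unknown image are against a stored reference image for each digit.
import numpy as np

def binarize(img, threshold=128):
    """1 where a pixel is dark (ink), 0 where it is whitish."""
    return (np.asarray(img) < threshold).astype(np.uint8)

def recognize(img, templates):
    """templates: dict mapping '0'..'9' to reference images of the same size."""
    unknown = binarize(img)
    # pick the digit whose template disagrees with the unknown image
    # at the fewest pixel locations
    return min(templates, key=lambda d: int(np.sum(binarize(templates[d]) != unknown)))
```

Real systems go much further (normalization, feature extraction, statistical classifiers), but this is essentially the "compare non-whitish pixel locations" idea.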
I have extracted features from many images of isolated characters (such as gradient, neighbouring pixel weight, and geometric properties). How can I use HMMs as a classifier trained on this data? All the literature I have read about HMMs refers to states and state transitions, but I can't connect that to features and class labelling. The example on JAHMM's home page doesn't relate to my problem.
I need to use an HMM not because it will work better than other approaches for this problem, but because of constraints on the project topic.
There was an answer to this question for online recognition, but I want the same for offline recognition, and in a little more detail.
EDIT: I partitioned each character into a grid with a fixed number of squares. Now I am planning to perform feature extraction on each grid block, thus obtaining a sequence of features for each sample by moving from left to right and top to bottom.
Would this represent an adequate "sequence" for an HMM, i.e. would an HMM be able to model the temporal variation of the data even though the character is not drawn from left to right and top to bottom? If not, please suggest an alternative.
Should I feed in a lot of features or start with a few? How do I know whether the HMM is underperforming or the features are bad? I am using JAHMM.
Am I right that extracting stroke features is difficult and that they can't logically be combined with grid features (since an HMM expects a sequence generated by some random process)?
I've usually seen neural networks used for this sort of recognition task, e.g. here, here, here, and here. Since a simple Google search turns up so many hits for neural networks in OCR, I'll assume you are set on using HMMs (a project limitation, correct?). Regardless, these links can offer some insight into gridding the image and obtaining image features.
Your approach for turning a grid into a sequence of observations is reasonable. In this case, be sure you do not confuse observations and states. The features you extract from one block should be collected into one observation, i.e. a feature vector. (In comparison to speech recognition, your block's feature vector is analogous to the feature vector associated with a speech phoneme.) You don't really have much information regarding the underlying states. This is the hidden aspect of HMMs, and the training process should inform the model how likely one feature vector is to follow another for a character (i.e. transition probabilities).
Since this is an off-line process, don't be concerned with the temporal aspects of how characters are actually drawn. For the purposes of your task, you've imposed a temporal order on the sequence of observations with your left-to-right, top-to-bottom block sequence. This should work fine.
As for HMM performance: choose a reasonable vector of salient features. In speech recognition, the dimensionality of a feature vector can be high (>10). (This is also where the cited literature can assist.) Set aside a percentage of the training data so that you can properly test the model. First, train the model, and then evaluate it on the training dataset. How well does it classify your characters? If it does poorly, re-evaluate the feature vector. If it does well on the training data, test the generality of the classifier by running it on the reserved test data.
As for the number of states, I would start with a heuristically derived number. Assuming your character images are scaled and normalized, perhaps something like 40%(?) of the blocks are occupied? This is a crude guess on my part since a source image was not provided. For an 8x8 grid, this would imply that about 25 blocks are occupied. We could then start with 25 states, but that's probably naive: empty blocks can convey information (meaning the number of states might increase), but some feature sets may be observed in similar states (meaning the number of states might decrease). If it were me, I would probably pick something like 20 states. Having said that: be careful not to confuse features and states. Your feature vector is a representation of things observed in a particular state. If the tests described above show your model is performing poorly, tweak the number of states up or down and try again.
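Since you're tied to JAHMM I won't post Java, but here is roughly what the recipe above looks like as a sketch with Python's hmmlearn: one HMM per character class, one observation (feature vector) per grid block read left-to-right, top-to-bottom, and classification by the highest log-likelihood. The per-block feature (ink density) is only a stand-in for your own gradient/geometric features.

```python
# Sketch of the recipe above with hmmlearn instead of JAHMM: one Gaussian HMM per
# character class; one observation (feature vector) per grid block, read
# left-to-right, top-to-bottom. The per-block feature here (ink density) is a stand-in.
import numpy as np
from hmmlearn.hmm import GaussianHMM

GRID = 8  # 8x8 grid -> a sequence of 64 observations per character image

def block_features(img):
    """Split a binary character image into GRID x GRID blocks and return one
    feature vector per block (here just the fraction of ink pixels)."""
    h, w = img.shape
    bh, bw = h // GRID, w // GRID
    feats = [[img[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw].mean()]
             for r in range(GRID) for c in range(GRID)]
    return np.array(feats)                       # shape (GRID*GRID, n_features)

def train_models(samples_by_class, n_states=20):
    """samples_by_class: dict mapping a label to a list of binary images."""
    models = {}
    for label, images in samples_by_class.items():
        seqs = [block_features(img) for img in images]
        X, lengths = np.concatenate(seqs), [len(s) for s in seqs]
        m = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
        m.fit(X, lengths)                        # Baum-Welch training
        models[label] = m
    return models

def classify(img, models):
    """Pick the class whose HMM assigns the highest log-likelihood."""
    obs = block_features(img)
    return max(models, key=lambda label: models[label].score(obs))
```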
Good luck.
So the question is: How does a computer go from binary code representing the letter "g" to the correct combination of pixel illuminations?
Here is what I have managed to figure out so far. I understand how the CPU takes the input generated by the keyboard and stores it in the RAM, and then retrieves it to do operations on using an instruction set. I also understand how it does these operations in detail. Then the CPU transmits the output of an operation which for this example is an instruction set that retrieves the "g" from the memory address and sends it to the monitor output.
Now my question is does the CPU convert the letter "g" to a bitmap directly or does it use a GPU that is either built-in or separate, OR does the monitor itself handle the conversion?
Also, is it possible to write your own code that interprets the binary and formats it for display?
In most systems the CPU doesn't speak with the monitor directly; it sends commands to a graphics card which in turn generates an electric signal that the monitor translates into a picture on the screen. There are many steps in this process and the processing model is system dependent.
From the software perspective, communication with the graphics card is made through a graphics card driver that translates your program's and the operating system's requests into something that the hardware on the card can understand.
There are different kinds of drivers; the simplest to explain is a text mode driver. In text mode the screen is composed of a number of cells, each of which can hold exactly one of a set of predefined characters. The driver includes a predefined bitmap font that describes how each character looks by specifying which pixels are on and which are off. When a program requests a character to be printed on the screen, the driver looks it up in the font and tells the card to change the electric signal it's sending to the monitor so that the pixels on the screen reflect what's in the font.
Text mode has limited use, though. You get only one choice of font, a limited choice of colors, and you can't draw graphics like lines or circles: you're limited to characters. For high quality graphics output a different driver is used. Graphics cards typically include a memory buffer that contains the contents of the screen in a well defined format, like "n bits per pixel, m pixels per row, ..". To draw something on the screen you just have to write to this memory buffer. In order to do that, the driver maps the buffer into the computer's memory so that the operating system and programs can use the buffer as if it were a part of RAM. Programs can then directly write the pixels they want to show, and to put the letter "g" on the screen it's up to the application programmer to output pixels in a pattern that resembles that letter. Of course there are many libraries to help programmers do this, otherwise the current state of the graphical user interface would be even sorrier than it is.
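To illustrate the idea, here is a toy sketch: a made-up 8x8 glyph (not the real VGA font) is looked up in a tiny bitmap font and its bits are copied into a framebuffer-like array, which is then printed as text.

```python
# Toy illustration of the ideas above: look up a glyph in a bitmap font and copy its
# pixels into a framebuffer. The 8x8 bitmap is a made-up approximation of "g",
# not the real VGA / Code page 437 font.
FONT = {
    "g": [
        0b00000000,
        0b00111100,
        0b01000010,
        0b01000010,
        0b00111110,
        0b00000010,
        0b01000010,
        0b00111100,
    ],
}

WIDTH, HEIGHT = 32, 16                              # a very small "screen", 1 bit per pixel
framebuffer = [[0] * WIDTH for _ in range(HEIGHT)]

def put_char(ch, col, row):
    """Copy the glyph's bitmap into the framebuffer at character cell (col, row)."""
    for y, bits in enumerate(FONT[ch]):
        for x in range(8):
            framebuffer[row * 8 + y][col * 8 + x] = (bits >> (7 - x)) & 1

put_char("g", 0, 0)
for scanline in framebuffer[:8]:                    # show the cell we just drew
    print("".join("#" if px else "." for px in scanline[:8]))
```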
Of course, this is a simplification of what actually goes on in a computer, and there are systems that don't work exactly like this; for example, some CPUs have an integrated graphics card, and some output devices are not based on drawing pixels but on plotting lines. But I hope this clears up the confusion a little.
See here http://en.m.wikipedia.org/wiki/Code_page_437
It describes the character-based mechanism used to display characters on a VGA monitor in character (text) mode.