What is representation in optical character recognition?

What is representation in optical character recognition? - ocr

I am learning OCR and reading this book https://www.amazon.com/Character-Recognition-Different-Languages-Computing/dp/3319502514
The authors define 8 processes to implement OCR that follow one by one (2 after 1, 3 after 2 etc):
Optical scanning
Location segmentation
Pre-processing
Segmentation
Representation
Feature extraction
Recognition
Post-processing
This is what they write about representation (#5)
The fifth OCR component is representation. The image representation
plays one of the most important roles in any recognition system. In
the simplest case, gray level or binary images are fed to a
recognizer. However, in most of the recognition systems in order to
avoid extra complexity and to increase the accuracy of the algorithms,
a more compact and characteristic representation is required. For this
purpose, a set of features is extracted for each class that helps
distinguish it from other classes while remaining invariant to
characteristic differences within the class.The character image
representation methods are generally categorized into three major
groups: (a) global transformation and series expansion (b) statistical
representation and (c) geometrical and topological representation.
This is what they write about feature extraction (#6)
The sixth OCR component is feature extraction. The objective of
feature extraction is to capture essential characteristics of symbols.
Feature extraction is accepted as one of the most difficult problems
of pattern recognition. The most straight forward way of describing
character is by actual raster image. Another approach is to extract
certain features that characterize symbols but leaves the unimportant
attributes. The techniques for extraction of such features are divided
into three groups’ viz. (a) distribution of points (b) transformations
and series expansions and (c) structural analysis.
I am totally confused. I don't understand what is representation. As I understand after segmentation we must take from image some features, for example topological structure like Freeman chain code and must match to some saved at the learning stage model - i.e. to do recognition. By other words - segmentation - feature extraction - recognition. I don't understand what must be done on representation stage. Please, explain.

The representation component takes the raster image produced by segmentation and converts it into a simpler format (a "representation") that preserves the characteristic properties of classes. This is in order to reduce the complexity of the recognition process later on. The Freeman chain code you mention is one such representation.
Some (most?) authors conflate representation and feature extraction into a single step, but the authors of your book have chosen to treat them separately. Changing the representation isn't mandatory, but doing so reduces the complexity, and so increases the accuracy, of the training and recognition steps.
It is from this simpler representation that features are extracted in the feature extraction step. Which features are extracted will depend upon the representation chosen. This paper - Feature Extraction Methods for Character Recognition - A Survey - describes 11 different feature extraction methods that can be applied to 4 different representations.
The extracted features are what are passed to the trainer or recognizer.

Related

Why does backprop algorithm store the inputs to the non-linearity of the hidden layers?

I have been reading the Deep Learning book by Ian Goodfellow and it mentions in Section 6.5.7 that
The main memory cost of the algorithm is that we need to store the input to the nonlinearity of the hidden layer.
I understand that backprop stores the gradients in a similar fashion to dynamic programming so not to recompute them. But I am confused as to why it stores the input as well?

Backpropagation is a special case of reverse mode automatic differentiation (AD).
In contrast to the forward mode, the reverse mode has the major advantage that you can compute the derivative of an output w.r.t. all inputs of a computation in one pass.
However, the downside is that you need to store all intermediate results of the algorithm you want to differentiate in a suitable data structure (like a graph or a Wengert tape) for as long as you are computing its Jacobian with reverse mode AD, because you're basically "working your way backwards" through the algorithm.
Forward mode AD does not have this disadvantage, but you need to repeat its calculation for every input, so it only makes sense if your algorithm has a lot more output variables than input variables.

How can tasks that aren't Image-to-image translation work with Pix2pix?

zi2zi, a Chinese alphabet generating GAN uses pix2pix for generating images. I also have seen many other applications using pix2pix for tasks that aren't related to image-to image translation. I compared the code of zi2zi with regular pix2pix, and found some implementation that I couldn't understand.
What is the target source and where is the random noise? Unlike image-to-image translation tasks where there exists an obvious target image, what is supposed to be the target source for character generation?
Suppose the output of the encoder portion of the unet is the latent space, then how are we supposed to set the latent space to a certain value for evaluation, exploration of the latent space while the decoder is effected by skip-connections of the encoder network?
I want to ask how pix2pix generalizes with these types of problems pix2pix isn't meant to be a powerful solution.

After digging in the code for a few hours I discovered how zi2zi utilizes the pix2pix methodology. If I am correct, the data is split into two parts: real_A and real_B. real_A is fed into the generator along with the class label embedding_ids and produces fake_b. The discriminator then aims at discriminating a fake_b and real_b with real_a as the target image.
Conclusively, this seemingly works like an autoencoder, but with the discriminator as an evaluation metric. In concept, there isn't much that is a difference between pix2pix and other GANs with encoders.

How to input audio data into deep learning algorithm?

I'm very new in deep learning, and I'm targeting to use GAN (Generative Adversarial Network) to recognize emotional speech. I've only known images being as inputs to most deep learning algorithms, such as GAN. but I'm curious as to how audio data can be an input into it, besides of using images of the spectrograms as the input. also, i'd appreciate it if you can explain it in laymen terms.

Audio data can be be represented in form of numpy arrays but before moving to that you must understand what audio really is. If you give a thought on what an audio looks like, it is nothing but a wave like format of data, where the amplitude of audio change with respect to time.
Assuming that our audio is represented in time domain, we can extract the values at every half-second(arbitrary). This is called sampling rate.
Converting the data into frequency domain can reduce the amount of computation requires as the sampling rate is less.
Now, let's load the data. We'll use a library called librosa , which can be installed using pip.
data, sampling_rate = librosa.load('audio.wav')
Now, you have both the data and the sampling rate. We can plot the waveform now.
librosa.display.waveplot(data, sr=sampling_rate)
Now, you have the audio data in form of numpy array. You can now study the features of the data and extract the ones you find interesting to train your models.

Further to Ayush’s discussion, for information on the challenges and work arounds of dealing with large amounts of data at different time scales in audio data I suggest this post on WaveNet: https://deepmind.com/blog/article/wavenet-generative-model-raw-audio
After that it sounds like you want to do classification. In that case a GAN on it’s own is not suitable. If you have plenty of data you could use a straight LSTM (or another type of RNN) which is designed to model time series, or you can take set sized chunks of input and use a 1-d CNN (similar to WaveNet). If you have lots of unlabelled data from the same or similar domain and limited training data you could use a GAN to learn to generate new samples, then use the discriminator from the GAN as pre-trained weights for a CNN classifier.

Since you are trying to perform Speech Emotion Recognition (SER) using deep learning, you can go for a recurrent architecture (LSTM or GRU) or a combination of CNN and recurrent network architecture (CRNN) instead of GANs since GANs are complicated and difficult to train.
In a CRNN, the CNN layers will extract features of varying details and complexity, whereas the recurrent layers will take care of the temporal dependencies. You can then finally use a fully connected layer for regression or classification output, depending on whether your output label is discrete (for categorical emotions like angry, sad, neutral etc) or continuous (arousal and valence space).
Regarding the choice of input, you can use either a spectrogram input (2D) or raw speech signal (1D) as input. For spectrogram input, you have to use a 2D CNN whereas for a raw speech signal you can use a 1D CNN. Mel scale spectrograms are usually preferred over linear spectrograms since our ears hear frequencies in log scale and not linearly.
I have used a CRNN architecture to estimate the level of verbal conflict arising from conversational speech. Even though it is not SER, it is a very similar task.
You can find more details in the paper
http://www.eecs.qmul.ac.uk/~andrea/papers/2019_SPL_ConflictNET_Rajan_Brutti_Cavallaro.pdf
Also, check my github code for the same paper
https://github.com/smartcameras/ConflictNET
and a SER paper whose code I reproduced in Python
https://github.com/vandana-rajan/1D-Speech-Emotion-Recognition
And finally as Ayush mentioned, Librosa is one of the best Python libraries for audio processing. You have functions to create spectrograms in Librosa.

Using HMM for offline character recognition

I have extracted features from many images of isolated characters (such as gradient, neighbouring pixel weight and geometric properties. How can I use HMMs as a classifier trained on this data? All literature I read about HMM refers to states and state transitions but I can't connect it to features and class labeling. The example on JAHMM's home page doesn't relate to my problem.
I need to use HMM not because it will work better than other approaches for this problem but because of constraints on project topic.
There was an answer to this question for online recognition but I want the same for offline and in a little more detail
EDIT: I partitioned each character into a grid with fixed number of squares. Now I am planning to perform feature extraction on each grid block and thus obtain a sequence of features for each sample by moving from left to right and top to bottom.
Would this represent an adequate "sequence" for an HMM i.e. would an HMM be able to guess the temporal variation of the data, even though the character is not drawn from left to right and top to bottom? If not suggest an alternate way.
Should I feed a lot of features or start with a few? how do I know if the HMM is underforming or if the features are bad? I am using JAHMM.
Extracting stroke features is difficult and cant be logically combined with grid features? (since HMM expects a sequence generated by some random process)

I've usually seen neural networks used for this sort of recognition task, i.e. here, here here, and here. Since a simple google search turns up so many hits for neural networks in OCR, I'll assume you are set in using HMMs (a project limitation, correct?) Regardless, these links can offer some insight into gridding the image and obtaining image features.
Your approach for turning a grid into a sequence of observations is reasonable. In this case, be sure you do not confuse observations and states. The features you extract from one block should be collected into one observation, i.e. a feature vector. (In comparison to speech recognition, your block's feature vector is analogous to the feature vector associated with a speech phoneme.) You don't really have much information regarding the underlying states. This is the hidden aspect of HMMs, and the training process should inform the model how likely one feature vector is to follow another for a character (i.e. transition probabilities).
Since this is an off-line process, don't be concerned with the temporal aspects of how characters are actually drawn. For the purposes of your task, you've imposed a temporal order on the sequence of observations with your the left-to-right, top-to-bottom block sequence. This should work fine.
As for HMM performance: choose a reasonable vector of salient features. In speech recog, the dimensionality of a feature vector can be high (>10). (This is also where the cited literature can assist.) Set aside a percentage of the training data so that you can properly test the model. First, train the model, and then evaluate the model on the training dataset. How well does classify your characters? If it does poorly, re-evaluate the feature vector. If it does well on the test data, test the generality of the classifier by running it on the reserved test data.
As for the number of states, I would start with something heuristically derived number. Assuming your character images are scaled and normalized, perhaps something like 40%(?) of the blocks are occupied? This is a crude guess on my part since a source image was not provided. For an 8x8 grid, this would imply that 25 blocks are occupied. We could then start with 25 states - but that's probably naive: empty blocks can convey information (meaning the number of states might increase), but some features sets may be observed in similar states (meaning the number of states might decrease.) If it were me, I would probably pick something like 20 states. Having said that: be careful not to confuse features and states. Your feature vector is a representation of things observed in a particular state. If the tests described above show your model is performing poorly, tweak the number of states up or down and try again.
Good luck.

What is "Orthogonality"?

What does "orthogonality" mean when talking about programming languages?
What are some examples of Orthogonality?

Orthogonality is the property that means "Changing A does not change B". An example of an orthogonal system would be a radio, where changing the station does not change the volume and vice-versa.
A non-orthogonal system would be like a helicopter where changing the speed can change the direction.
In programming languages this means that when you execute an instruction, nothing but that instruction happens (which is very important for debugging).
There is also a specific meaning when referring to instruction sets.

From Eric S. Raymond's "Art of UNIX programming"
Orthogonality is one of the most important properties that can help make even complex designs compact. In a purely orthogonal design, operations do not have side effects; each action (whether it's an API call, a macro invocation, or a language operation) changes just one thing without affecting others. There is one and only one way to change each property of whatever system you are controlling.

Think of it has being able to change one thing without having an unseen affect on another part.

Broadly, orthogonality is a relationship between two things such that they have minimal effect on each other.
The term comes from mathematics, where two vectors are orthogonal if they intersect at right angles.
Think about a typical 2 dimensional cartesian space (your typical grid with X/Y axes). Plot two lines: x=1 and y=1. The two lines are orthogonal. You can change x=1 by changing x, and this will have no effect on the other line, and vice versa.
In software, the term can be appropriately used in situations where you're talking about two parts of a system which behave independently of each other.

If you have a set of constructs. A langauge is said to be orthogonal if it allows the programmer to mix these constructs freely. For example, in C you can't return an array(static array), C is said to be unorthognal in this case:
int[] fun(); // you can't return a static array.
// Of course you can return a pointer, but the langauge allows passing arrays.
// So, it is unorthognal in case.

Most of the answers are very long-winded, and even obscure. The point is: if a tool is orthogonal, it can be added, replaced, or removed, in favor of better tools, without screwing everything else up.
It's the difference between a carpenter having a hammer and a saw, which can be used for hammering or sawing, or having some new-fangled hammer/saw combo, which is designed to saw wood, then hammer it together. Either will work for sawing and then hammering together, but if you get some task that requires sawing, but not hammering, then only the orthogonal tools will work. Likewise, if you need to screw instead of hammering, you won't need to throw away your saw, if it's orthogonal (not mixed up with) your hammer.
The classic example is unix command line tools: you have one tool for getting the contents of a disk (dd), another for filtering lines from the file (grep), another for writing those lines to a file (cat), etc. These can all be mixed and matched at will.

While talking about project decisions on programming languages, orthogonality may be seen as how easy is for you to predict other things about that language for what you've seen in the past.
For instance, in one language you can have:
str.split
for splitting a string and
len(str)
for getting the lenght.
On a language more orthogonal, you would always use str.x or x(str).
When you would clone an object or do anything else, you would know whether to use
clone(obj)
or
obj.clone
That's one of the main points on programming languages being orthogonal. That avoids you to consult the manual or ask someone.
The wikipedia article talks more about orthogonality on complex designs or low level languages.
As someone suggested above on a comment, the Sebesta book talks cleanly about orthogonality.
If I would use only one sentence, I would say that a programming language is orthogonal when its unknown parts act as expected based on what you've seen.
Or... no surprises.
;)

From Robert W. Sebesta's "Concepts of Programming Languages":
As examples of the lack of orthogonality in a high-level language,
consider the following rules and exceptions in C. Although C has two
kinds of structured data types, arrays and records (structs), records
can be returned from functions but arrays cannot. A member of a
structure can be any data type except void or a structure of the same
type. An array element can be any data type except void or a function.
Parameters are passed by value, unless they are arrays, in which case
they are, in effect, passed by reference (because the appearance of an
array name without a subscript in a C program is interpreted to be
the address of the array’s first element)

from wikipedia:
Computer science
Orthogonality is a system design property facilitating feasibility and compactness of complex designs. Orthogonality guarantees that modifying the technical effect produced by a component of a system neither creates nor propagates side effects to other components of the system. The emergent behavior of a system consisting of components should be controlled strictly by formal definitions of its logic and not by side effects resulting from poor integration, i.e. non-orthogonal design of modules and interfaces. Orthogonality reduces testing and development time because it is easier to verify designs that neither cause side effects nor depend on them.
For example, a car has orthogonal components and controls (e.g. accelerating the vehicle does not influence anything else but the components involved exclusively with the acceleration function). On the other hand, a non-orthogonal design might have its steering influence its braking (e.g. electronic stability control), or its speed tweak its suspension.1 Consequently, this usage is seen to be derived from the use of orthogonal in mathematics: One may project a vector onto a subspace by projecting it onto each member of a set of basis vectors separately and adding the projections if and only if the basis vectors are mutually orthogonal.
An instruction set is said to be orthogonal if any instruction can use any register in any addressing mode. This terminology results from considering an instruction as a vector whose components are the instruction fields. One field identifies the registers to be operated upon, and another specifies the addressing mode. An orthogonal instruction set uniquely encodes all combinations of registers and addressing modes.

From Wikipedia:
Orthogonality is a system design
property facilitating feasibility and
compactness of complex designs.
Orthogonality guarantees that
modifying the technical effect
produced by a component of a system
neither creates nor propagates side
effects to other components of the
system. The emergent behavior of a
system consisting of components should
be controlled strictly by formal
definitions of its logic and not by
side effects resulting from poor
integration, i.e. non-orthogonal
design of modules and interfaces.
Orthogonality reduces testing and
development time because it is easier
to verify designs that neither cause
side effects nor depend on them.
For example, a car has orthogonal
components and controls (e.g.
accelerating the vehicle does not
influence anything else but the
components involved exclusively with
the acceleration function). On the
other hand, a non-orthogonal design
might have its steering influence its
braking (e.g. electronic stability
control), or its speed tweak its
suspension.[1] Consequently, this
usage is seen to be derived from the
use of orthogonal in mathematics: One
may project a vector onto a subspace
by projecting it onto each member of a
set of basis vectors separately and
adding the projections if and only if
the basis vectors are mutually
orthogonal.
An instruction set is said to be
orthogonal if any instruction can use
any register in any addressing mode.
This terminology results from
considering an instruction as a vector
whose components are the instruction
fields. One field identifies the
registers to be operated upon, and
another specifies the addressing mode.
An orthogonal instruction set uniquely
encodes all combinations of registers
and addressing modes.
To put it in the simplest terms possible, two things are orthogonal if changing one has no effect upon the other.

Orthogonality means the degree to which language consists of a set of independent primitive constructs that can be combined as necessary to express a program.
Features are orthogonal if there are no restrictions on how they may be combined
Example : non-orthogonality
PASCAL: functions can't return structured types.
Functional Languages are highly orthogonal.

Real life examples of orthogonality in programming languages
There are a lot of answers already that explain what orthogonality generally is while specifying some made up examples. E.g. this answer explains it well. I wanted to provide (and gather) some real life examples of orthogonal or non-orthogonal features in programming languages:
Orthogonal: C++20 Modules and Namespaces
On the cppreference-page about the new Modules system in c++20 is written:
Modules are orthogonal to namespaces
In this case they write that modules are orthogonal to namespaces because a statement like import foo will not import the module-namespace related to foo:
import foo; // foo exports foo::bar()
bar (); // Error
foo::bar (); // Ok
using namespace foo;
bar (); // Ok
(adapted from modules-cppcon2017 slide 9)

In programming languages a programming language feature is said to be orthogonal if it is bounded with no restrictions (or exceptions).
For example, in Pascal functions can't return structured types. This is a restriction on returning values from a function. Therefore we it is considered as a non-orthogonal feature. ;)

Orthogonality in Programming:
Orthogonality is an important concept, addressing how a relatively small number of components can be combined in a relatively small number of ways to get the desired results. It is associated with simplicity; the more orthogonal the design, the fewer exceptions. This makes it easier to learn, read and write programs in a programming language. The meaning of an orthogonal feature is independent of context; the key parameters are symmetry and consistency (for example, a pointer is an orthogonal concept).
from Wikipedia

Orthogonality in a programming language means that a relatively small set of
primitive constructs can be combined in a relatively small number of ways to
build the control and data structures of the language. Furthermore, every pos-
sible combination of primitives is legal and meaningful. For example, consider data types. Suppose a language has four primitive data types (integer, float,
double, and character) and two type operators (array and pointer). If the two
type operators can be applied to themselves and the four primitive data types,
a large number of data structures can be defined.
The meaning of an orthogonal language feature is independent of the
context of its appearance in a program. (the word orthogonal comes from the
mathematical concept of orthogonal vectors, which are independent of each
other.) Orthogonality follows from a symmetry of relationships among primi-
tives. A lack of orthogonality leads to exceptions to the rules of the language.
For example, in a programming language that supports pointers, it should be
possible to define a pointer to point to any specific type defined in the language.
However, if pointers are not allowed to point to arrays, many potentially useful user-defined data structures cannot be defined.
We can illustrate the use of orthogonality as a design concept by compar-
ing one aspect of the assembly languages of the IBM mainframe computers
and the VAX series of minicomputers. We consider a single simple situation:
adding two 32-bit integer values that reside in either memory or registers and
replacing one of the two values with the sum. The IBM mainframes have two
instructions for this purpose, which have the forms
A Reg1, memory_cell
AR Reg1, Reg2
where Reg1 and Reg2 represent registers. The semantics of these are
Reg1 ← contents(Reg1) + contents(memory_cell)
Reg1 ← contents(Reg1) + contents(Reg2)
The VAX addition instruction for 32-bit integer values is
ADDL operand_1, operand_2
whose semantics is
operand_2 ← contents(operand_1) + contents(operand_2)
In this case, either operand can be a register or a memory cell.
The VAX instruction design is orthogonal in that a single instruction can
use either registers or memory cells as the operands. There are two ways to
specify operands, which can be combined in all possible ways. The IBM design
is not orthogonal. Only two out of four operand combinations possibilities are
legal, and the two require different instructions, A and AR . The IBM design
is more restricted and therefore less writable. For example, you cannot add
two values and store the sum in a memory location. Furthermore, the IBM
design is more difficult to learn because of the restrictions and the additional instruction.
Orthogonality is closely related to simplicity: The more orthogonal the
design of a language, the fewer exceptions the language rules require. Fewer
exceptions mean a higher degree of regularity in the design, which makes the
language easier to learn, read, and understand. Anyone who has learned a sig-
nificant part of the English language can testify to the difficulty of learning its
many rule exceptions (for example, i before e except after c).

The basic idea of orthogonality is that things that are not related conceptually should not be related in the system. Parts of the architecture that really have nothing to do with the other, such as the database and the UI, should not need to be changed together. A change to one should not cause a change to the other.

Orthogonality is the idea that things that are not related conceptually should not be related in the system so parts of the architecture that have nothing to do with each other, like the database and UI should not be changed together. A change to one part of your system should not cause the change to the other.
If for example, you change a few lines on the screen and cause a change in the database schema, this is called coupling. You usually want to minimize coupling between things that are mostly unrelated because it can grow and the system can become a nightmare to maintain in the long run.

From Michael C. Feathers' book "Working Effectively With Legacy Code":
If you want to change existing behavior in your code and there is exactly one place you have to go to make that change, you've got orthogonality.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008