Possible to pass both Embedding Input and Numerical Feature Input into model? - deep-learning

For a transaction categorisation model, I have a number of useful features. One of my features is a word embedding of the transaction description. Another is the amount of the transaction in cents.
Are there any approaches to allow simultaneous embedding inputs and standard numerical feature input?
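One common approach (not mentioned in the post itself) is to give the model two inputs and concatenate the pooled embedding with the numeric features. Below is a minimal Keras functional-API sketch; the vocabulary size, sequence length, layer widths, and number of categories are all illustrative assumptions.

# Minimal sketch (illustrative sizes): embed the description tokens,
# pool them to a fixed-size vector, and concatenate with numeric features.
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, seq_len = 10_000, 20          # assumed tokenizer settings
desc_in = keras.Input(shape=(seq_len,), name="description_tokens")
amount_in = keras.Input(shape=(1,), name="amount_cents")

x = layers.Embedding(vocab_size, 32)(desc_in)   # (batch, seq_len, 32)
x = layers.GlobalAveragePooling1D()(x)          # (batch, 32)

merged = layers.Concatenate()([x, amount_in])   # embedding output + numeric feature
h = layers.Dense(64, activation="relu")(merged)
out = layers.Dense(10, activation="softmax", name="category")(h)  # 10 classes assumed

model = keras.Model([desc_in, amount_in], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

In practice the amount would usually be scaled (for example log-transformed) before being concatenated with the embedding output.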

Related

Why does backprop algorithm store the inputs to the non-linearity of the hidden layers?

I have been reading the Deep Learning book by Ian Goodfellow and it mentions in Section 6.5.7 that
The main memory cost of the algorithm is that we need to store the input to the nonlinearity of the hidden layer.
I understand that backprop stores the gradients in a similar fashion to dynamic programming so as not to recompute them. But I am confused as to why it stores the inputs as well?
Backpropagation is a special case of reverse mode automatic differentiation (AD).
In contrast to the forward mode, the reverse mode has the major advantage that you can compute the derivative of an output w.r.t. all inputs of a computation in one pass.
However, the downside is that you need to store all intermediate results of the algorithm you want to differentiate in a suitable data structure (like a graph or a Wengert tape) for as long as you are computing its Jacobian with reverse mode AD, because you're basically "working your way backwards" through the algorithm.
Forward mode AD does not have this disadvantage, but you need to repeat its calculation for every input, so it only makes sense if your algorithm has a lot more output variables than input variables.
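As a rough illustration of why those intermediates are needed (a NumPy sketch of a single hidden layer, not the book's notation): the backward sweep needs the cached pre-activation z to evaluate the nonlinearity's derivative, and the cached layer input x to form the weight gradient.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W):
    """One hidden layer; cache the values the backward pass will reuse."""
    z = W @ x              # pre-activation: the "input to the nonlinearity"
    h = sigmoid(z)
    cache = (x, z)         # stored during the forward pass
    return h, cache

def backward(dL_dh, W, cache):
    """Reverse-mode sweep: both cached tensors are needed here."""
    x, z = cache
    dh_dz = sigmoid(z) * (1 - sigmoid(z))   # needs the stored z
    dL_dz = dL_dh * dh_dz
    dL_dW = np.outer(dL_dz, x)              # needs the stored x
    dL_dx = W.T @ dL_dz
    return dL_dW, dL_dx

x = np.random.randn(3)
W = np.random.randn(4, 3)
h, cache = forward(x, W)
dL_dW, dL_dx = backward(np.ones(4), W, cache)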

RNN (LSTM) that uses up to the nth time step

Usually in an RNN only the previous input and hidden state are used to calculate the output.
However, what would happen if we used up to the n previous steps, in essence feeding an n-gram to the neural network?
Since n-grams are generally quite good at short text generation, this added information would lessen the burden on the hidden state to memorize short-term information and let it focus on the broader context of the text.
This seems like quite a simple thing, but I'm unable to find any paper that has implemented it.
The closest thing I've seen to what you are describing is the attention mechanism in an autoencoder, where a Dense layer essentially controls which of the encoded hidden states should be used by the decoding layer, instead of relying solely on the last hidden state.
Here is the paper if you want to read more.
This architecture is intended to circumvent the limit on how much information can be stored in one hidden state over a long sequence.
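As a rough sketch of the idea in the question (purely illustrative, not taken from any paper): one could concatenate the embeddings of the previous n tokens into each timestep's input, so the recurrent layer sees an n-gram window at every step.

import numpy as np

def ngram_windows(embedded_seq, n):
    """Turn a (T, d) sequence of token embeddings into a (T, n*d) sequence
    where step t contains the concatenated embeddings of tokens t-n+1 .. t
    (zero-padded at the start). The result can be fed to any RNN/LSTM."""
    T, d = embedded_seq.shape
    padded = np.vstack([np.zeros((n - 1, d)), embedded_seq])
    return np.stack([padded[t:t + n].reshape(-1) for t in range(T)])

seq = np.random.randn(7, 16)        # 7 tokens, 16-dim embeddings (made-up sizes)
windows = ngram_windows(seq, n=3)   # shape (7, 48)
print(windows.shape)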

What does the EpisodeParameterMemory of keras-rl do?

I have found the keras-rl/examples/cem_cartpole.py example and I would like to understand it, but I can't find any documentation.
What does the line
memory = EpisodeParameterMemory(limit=1000, window_length=1)
do? What is the limit and what is the window_length? What effect does increasing either or both parameters have?
EpisodeParameterMemory is a special class that is used for CEM. In essence it stores the parameters of a policy network that were used for an entire episode (hence the name).
Regarding your questions: The limit parameter simply specifies how many entries the memory can hold. After exceeding this limit, older entries will be replaced by newer ones.
The second parameter is not used in this specific type of memory (CEM is somewhat of an edge case in Keras-RL, mostly there as a simple baseline). Typically, however, the window_length parameter controls how many observations are concatenated to form a "state". This may be necessary if the environment is not fully observable (think of it as approximately transforming a POMDP into an MDP). DQN on Atari uses this since a single frame is clearly not enough to infer the velocity of a ball with a feed-forward network, for example.
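For illustration only (this is not keras-rl's actual code), stacking the last window_length observations into a state might look roughly like this:

from collections import deque
import numpy as np

class ObservationStacker:
    """Keep the last `window_length` observations and stack them into one
    state (conceptual sketch; a hypothetical helper, not a keras-rl class)."""
    def __init__(self, window_length, obs_shape):
        self.frames = deque(
            [np.zeros(obs_shape)] * window_length, maxlen=window_length
        )

    def step(self, observation):
        self.frames.append(observation)          # oldest frame drops out
        return np.stack(self.frames)             # shape: (window_length, *obs_shape)

stacker = ObservationStacker(window_length=4, obs_shape=(84, 84))
state = stacker.step(np.random.rand(84, 84))     # e.g. one Atari frame
print(state.shape)                               # (4, 84, 84)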
Generally, I recommend reading the relevant paper (again, CEM is somewhat of an exception). It should then become relatively clear what each parameter means. I agree that Keras-RL desperately needs documentation but I don't have time to work on it right now, unfortunately. Contributions to improve the situation are of course always welcome ;).
A little late to the party, but I feel like the answer doesn't really answer the question.
I found this description online (https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html#replay-memory):
We'll be using experience replay memory for training our DQN. It stores the transitions that the agent observes, allowing us to reuse this data later. By sampling from it randomly, the transitions that build up a batch are decorrelated. It has been shown that this greatly stabilizes and improves the DQN training procedure.
Basically you observe and save all of your state transitions so that you can train your network on them later on (instead of having to make observations from the environment all the time).
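A minimal sketch of such a replay memory (a simplified illustration, not the exact class from the PyTorch tutorial):

import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ("state", "action", "reward", "next_state"))

class ReplayMemory:
    """Store transitions and sample decorrelated mini-batches from them."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # old entries drop out automatically

    def push(self, *args):
        self.buffer.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=1000)
memory.push([0.0, 1.0], 1, 0.5, [0.1, 0.9])    # toy transition
# batch = memory.sample(32)  # once enough transitions have been stored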

Using HMM for offline character recognition

I have extracted features from many images of isolated characters (such as gradients, neighbouring pixel weights and geometric properties). How can I use HMMs as a classifier trained on this data? All the literature I have read about HMMs refers to states and state transitions, but I can't connect it to features and class labeling. The example on JAHMM's home page doesn't relate to my problem.
I need to use HMM not because it will work better than other approaches for this problem but because of constraints on project topic.
There was an answer to this question for online recognition, but I want the same for offline recognition, in a little more detail.
EDIT: I partitioned each character into a grid with a fixed number of squares. Now I am planning to perform feature extraction on each grid block and thus obtain a sequence of features for each sample by moving from left to right and top to bottom.
Would this represent an adequate "sequence" for an HMM, i.e. would an HMM be able to capture the temporal variation of the data, even though the character is not drawn from left to right and top to bottom? If not, please suggest an alternative.
Should I feed in a lot of features or start with a few? How do I know if the HMM is underperforming or if the features are bad? I am using JAHMM.
Extracting stroke features is difficult, and can they be logically combined with grid features (since an HMM expects a sequence generated by some random process)?
I've usually seen neural networks used for this sort of recognition task, e.g. here, here, here, and here. Since a simple Google search turns up so many hits for neural networks in OCR, I'll assume you are set on using HMMs (a project limitation, correct?). Regardless, these links can offer some insight into gridding the image and obtaining image features.
Your approach for turning a grid into a sequence of observations is reasonable. In this case, be sure you do not confuse observations and states. The features you extract from one block should be collected into one observation, i.e. a feature vector. (In comparison to speech recognition, your block's feature vector is analogous to the feature vector associated with a speech phoneme.) You don't really have much information regarding the underlying states. This is the hidden aspect of HMMs, and the training process should inform the model how likely one feature vector is to follow another for a character (i.e. transition probabilities).
Since this is an off-line process, don't be concerned with the temporal aspects of how the characters are actually drawn. For the purposes of your task, you've imposed a temporal order on the sequence of observations with your left-to-right, top-to-bottom block sequence. This should work fine.
As for HMM performance: choose a reasonable vector of salient features. In speech recognition, the dimensionality of a feature vector can be high (>10). (This is also where the cited literature can assist.) Set aside a percentage of the training data so that you can properly test the model. First train the model, then evaluate it on the training dataset. How well does it classify your characters? If it does poorly, re-evaluate the feature vector. If it does well on the training data, test the generality of the classifier by running it on the reserved test data.
As for the number of states, I would start with a heuristically derived number. Assuming your character images are scaled and normalized, perhaps something like 40%(?) of the blocks are occupied? This is a crude guess on my part since a source image was not provided. For an 8x8 grid, this would imply that about 25 blocks are occupied. We could then start with 25 states, but that's probably naive: empty blocks can convey information (meaning the number of states might increase), but some feature sets may be observed in similar states (meaning the number of states might decrease). If it were me, I would probably pick something like 20 states. Having said that, be careful not to confuse features and states: your feature vector is a representation of things observed in a particular state. If the tests described above show your model is performing poorly, tweak the number of states up or down and try again.
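To make the grid-to-sequence idea concrete, here is a rough sketch using Python's hmmlearn rather than JAHMM (an assumption purely for illustration): train one HMM per character class on the block feature-vector sequences, then classify a new sequence by picking the class whose model gives the highest log-likelihood.

import numpy as np
from hmmlearn import hmm   # illustrative only; the project itself uses JAHMM (Java)

def train_class_models(sequences_by_class, n_states=20):
    """sequences_by_class: {label: list of (n_blocks, n_features) arrays},
    one array per training image, blocks ordered left-to-right, top-to-bottom."""
    models = {}
    for label, seqs in sequences_by_class.items():
        X = np.vstack(seqs)                      # concatenate all sequences
        lengths = [len(s) for s in seqs]         # so hmmlearn knows the boundaries
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        m.fit(X, lengths)
        models[label] = m
    return models

def classify(models, seq):
    """Pick the character whose HMM assigns the sequence the highest log-likelihood."""
    return max(models, key=lambda label: models[label].score(seq))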
Good luck.

Coarse-grained vs fine-grained

What is the difference between coarse-grained and fine-grained?
I have searched these terms on Google, but I couldn't find what they mean.
From Wikipedia (granularity):
Granularity is the extent to which a system is broken down into small parts, either the system itself or its description or observation. It is the extent to which a larger entity is subdivided. For example, a yard broken into inches has finer granularity than a yard broken into feet.
Coarse-grained systems consist of fewer, larger components than fine-grained systems; a coarse-grained description of a system regards large subcomponents while a fine-grained description regards smaller components of which the larger ones are composed.
In simple terms:
Coarse-grained - larger components than fine-grained ones; a coarse-grained service simply wraps one or more fine-grained services together into a broader operation.
Fine-grained - the smaller, lower-level components of which the larger ones are composed.
It is better to have more coarse-grained service operations, which are composed of fine-grained operations.
Coarse-grained: a few objects hold a lot of related data, which is why services have a broader scope of functionality. Example: a single "Account" object holds the customer name, address, account balance, opening date, last change date, etc.
Thus: increased design complexity, but a smaller number of calls for various operations.
Fine-grained: more objects, each holding less data, which is why services have a narrower scope of functionality. Example: an Account object holds the balance, a Customer object holds the name and address, an AccountOpenings object holds the opening date, etc.
Thus: decreased design complexity, but a higher number of calls for various service operations.
Relationships are then defined between these objects.
One more way to understand it is to think in terms of communication between processes and threads. Processes communicate with the help of coarse-grained communication mechanisms like sockets, signal handlers, shared memory, semaphores and files. Threads, on the other hand, have access to the shared memory space that belongs to their process, which allows them to use finer-grained communication mechanisms.
Source: Java concurrency in practice
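A toy illustration of that contrast (my own sketch, not from the book): threads can simply read and write the same object, while processes exchange data through a coarser mechanism such as a queue.

import threading, multiprocessing

shared = []  # ordinary Python object living in this process's memory

def thread_worker():
    # Threads: fine-grained sharing -- write straight into the shared object.
    shared.append("written directly into shared memory")

def process_worker(q):
    # Processes: coarse-grained communication through an explicit channel.
    q.put("sent through an inter-process queue")

if __name__ == "__main__":
    t = threading.Thread(target=thread_worker)
    t.start(); t.join()
    print(shared)                       # ['written directly into shared memory']

    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=process_worker, args=(q,))
    p.start()
    print(q.get())                      # 'sent through an inter-process queue'
    p.join()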
In the context of services:
http://en.wikipedia.org/wiki/Service_Granularity_Principle
By definition a coarse-grained service operation has broader scope than a fine-grained service, although the terms are relative. The former typically requires increased design complexity but can reduce the number of calls required to complete a task.
A fine-grained service interface is about the same thing as a chatty interface.
In terms of a dataset such as a text file, coarse-grained means we can transform the dataset as a whole but not an individual element of it, while fine-grained means we can transform individual elements of the dataset.
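If this is referring to Spark RDDs (an assumption on my part), a quick PySpark illustration: a transformation like map is coarse-grained because it is applied to the dataset as a whole, not to one chosen element.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("granularity-demo").getOrCreate()

lines = spark.sparkContext.textFile("data.txt")   # hypothetical input file
upper = lines.map(lambda line: line.upper())      # coarse-grained: applies to the whole dataset

# There is no fine-grained "update element 42 in place" operation on an RDD;
# you derive a new dataset instead.
print(upper.take(5))
spark.stop()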
Coarse-grained and fine-grained are both about optimizing the number of services, but the difference is the level at which they do it. I'll explain with an example, which should make it easy to understand.
Fine-grained: for example, I have 100 services like findById, findByCategory, findByName, and so on. Instead of that many services, why not provide a single find(id, category, name, ...)? That way we can reduce the number of services. This is just an example, but the goal is to optimize the number of services.
Coarse-grained: for example, I have 100 clients, and each client has its own set of 100 services, so I would have to provide 100*100 services in total, which is very hard to manage. Instead, I identify all the services that are common to most of the clients as one shared service set and keep the remaining ones separate. For example, if 50 of the 100 services are common, then I only have to manage 100*50 + 50 services.
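A small sketch of the contrast between the many narrow lookups and the single consolidated operation described above (illustrative Python; the class and method names are made up):

class FineGrainedAccountService:
    # Many narrow operations, one per lookup field (a "chatty" interface).
    def find_by_id(self, id): ...
    def find_by_category(self, category): ...
    def find_by_name(self, name): ...

class CoarseGrainedAccountService:
    # One broader operation that covers the same lookups in a single call.
    def find(self, id=None, category=None, name=None):
        """Return accounts matching whichever criteria were supplied."""
        ...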
Coarse-grained granularity does not always mean bigger components. If you go by the literal meaning of the word coarse, it means harsh or not appropriate. For example, in software project management, if you break a small system down into a few components that are equal in size but vary in complexity and features, this could lead to coarse-grained granularity. Conversely, for a fine-grained breakdown, you would divide the components based on the cohesiveness of the functionality each component provides.
Coarse-grained and fine-grained: both of these modes define how cores are shared between multiple Spark tasks. As the name suggests, fine-grained mode shares the cores at a more granular level. Fine-grained mode has been deprecated by Spark and will soon be removed.
Coarse-grained services provide broader functionality compared to fine-grained services. Depending on the business domain, a single service can be created to serve a single business unit, or multiple specialised fine-grained services can be created if the subunits are largely independent of each other.
A coarse-grained service may be harder to maintain and less adaptable to change because of its size, while fine-grained services may introduce the additional complexity of managing multiple services.
Granularity has an important application when storing large-scale data, where space is very important.
The meaning of granularity according to the Oxford dictionary is:
"The scale or level of detail in a set of data."
According to the Cambridge dictionary:
"A lot of small details included in information, making it possible for you to understand very clearly what is happening."
So, going by the meaning of the word, it is some kind of partitioning of data for a continuous process.
Finer granularity uses small partition intervals, so that a detailed representation can be achieved.
Coarser granularity, on the other hand, uses larger intervals, which saves storage.
Which type of granularity to use is application-specific.
For example, if we have an application where recent information is more important than older information, the recent data can be represented in detail using finer granularity, while the older data can be represented using coarser granularity to save storage.