What does the EpisodeParameterMemory of keras-rl do? - reinforcement-learning

I have found the keras-rl/examples/cem_cartpole.py example and I would like to understand it, but I can't find any documentation.
What does the line
memory = EpisodeParameterMemory(limit=1000, window_length=1)
do? What is the limit and what is the window_length? What effect does increasing either or both parameters have?

EpisodeParameterMemory is a special class that is used for CEM. In essence it stores the parameters of a policy network that were used for an entire episode (hence the name).
Regarding your questions: The limit parameter simply specifies how many entries the memory can hold. After exceeding this limit, older entries will be replaced by newer ones.
The second parameter is not used in this specific type of memory (CEM is somewhat of an edge case in Keras-RL and mostly there as a simple baseline). Typically, however, the window_length parameter controls how many observations are concatenated to form a "state". This may be necessary if the environment is not fully observable (think of it as transforming a POMDP into an MDP, or at least approximately). DQN on Atari uses this since a single frame is clearly not enough to infer the velocity of a ball with an FF network, for example.
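For intuition, here is a rough, library-agnostic sketch of what a window_length-style memory does when it assembles a "state" (this is not keras-rl's actual implementation; the class name and the padding behaviour are illustrative assumptions):

    from collections import deque
    import numpy as np

    class WindowedObservations:
        """Illustrative sketch: stack the last `window_length` observations into one state."""

        def __init__(self, window_length):
            self.window_length = window_length
            self.frames = deque(maxlen=window_length)

        def append(self, observation):
            self.frames.append(np.asarray(observation))

        def current_state(self):
            # Assumes at least one observation has been appended.
            # Pad with copies of the oldest frame until the window is full,
            # then stack to shape (window_length, *observation.shape).
            frames = list(self.frames)
            while len(frames) < self.window_length:
                frames.insert(0, frames[0])
            return np.stack(frames)

    # With window_length=1 (as in the CEM example) the "state" is just the latest observation;
    # with window_length=4 you would get e.g. four stacked Atari frames.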
Generally, I recommend reading the relevant paper (again, CEM is somewhat of an exception). It should then become relatively clear what each parameter means. I agree that Keras-RL desperately needs documentation but I don't have time to work on it right now, unfortunately. Contributions to improve the situation are of course always welcome ;).

A little late to the party, but I feel like the answer doesn't really answer the question.
I found this description online (https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html#replay-memory):
We’ll be using experience replay memory for training our DQN. It stores the transitions that the agent observes, allowing us to reuse this data later. By sampling from it randomly, the transitions that build up a batch are decorrelated. It has been shown that this greatly stabilizes and improves the DQN training procedure.
Basically you observe and save all of your state transitions so that you can train your network on them later on (instead of having to make observations from the environment all the time).
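As a hedged illustration (this is neither keras-rl's nor PyTorch's implementation, just a minimal sketch of the idea), a replay memory with a limit and random sampling could look like this:

    import random
    from collections import deque, namedtuple

    Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

    class ReplayMemory:
        def __init__(self, limit):
            # Once `limit` is exceeded, the deque drops its oldest entries,
            # matching the behaviour described above for limit=1000.
            self.buffer = deque(maxlen=limit)

        def append(self, *args):
            self.buffer.append(Transition(*args))

        def sample(self, batch_size):
            # Sampling uniformly at random decorrelates the transitions within a batch.
            return random.sample(list(self.buffer), batch_size)

        def __len__(self):
            return len(self.buffer)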

Related

Web Audio Pitch Detection for Tuner

So I have been making a simple HTML5 tuner using the Web Audio API. I have it all set up to respond to the correct frequencies; the problem seems to be with getting the actual frequencies. Using the input, I create an array of the spectrum where I look for the highest value and use that frequency as the one to feed into the tuner. The problem is that when creating an analyser in Web Audio it cannot get more fine-grained than an FFT size of 2048. Using this, if I play a 440 Hz note, the closest note in the array is something like 430 Hz and the next value seems to be higher than 440. Therefore the tuner will think I am playing these notes when in fact the loudest frequency should be 440 Hz and not 430 Hz. Since this frequency does not exist in the analyser array, I am trying to figure out a way around this, or whether I am missing something very obvious.
I am very new at this so any help would be very appreciated.
Thanks
There are a number of approaches to implementing pitch detection. This paper provides a review of them. Their conclusion is that using FFTs may not be the best way to go - however, it's unclear quite what their FFT-based algorithm actually did.
If you're simply tuning guitar strings to fixed frequencies, much simpler approaches exist. Building a fully chromatic tuner that does not know a-priori the frequency to expect is hard.
The FFT approach you're using is entirely possible (I've built a robust musical instrument tuner using this approach that is being used white-label by a number of 3rd parties). However you need a significant amount of post-processing of the FFT data.
To start, you solve the resolution problem using the Short-Time Fourier Transform (STFT) - or more precisely - a succession of them. The process is described nicely in this article.
If you intend building a tuner for guitar and bass guitar (and let's face it, everyone who asks this question here is), you'll need at least a 4096-point DFT with overlapping windows in order to get adequate resolution at the bottom E1 string at ~41 Hz.
You have a bunch of other algorithmic and usability hurdles to overcome. Not least, perceived pitch and the spectral peak aren't always the same. Simply taking the spectral peak from the STFT doesn't work reliably (this is also why the basic auto-correlation approach is broken).
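As a rough illustration of one such post-processing step (written in Python/NumPy rather than against the Web Audio API, and assuming a magnitude spectrum is already available), parabolic interpolation around the spectral peak recovers a frequency estimate that falls between FFT bins:

    import numpy as np

    def interpolated_peak_frequency(magnitudes, sample_rate, fft_size):
        """Refine the peak bin of a magnitude spectrum by fitting a parabola
        through the peak and its two neighbours."""
        k = int(np.argmax(magnitudes))
        if k == 0 or k == len(magnitudes) - 1:
            return k * sample_rate / fft_size   # no neighbours to interpolate with
        alpha, beta, gamma = magnitudes[k - 1], magnitudes[k], magnitudes[k + 1]
        # Offset (in bins) of the parabola's vertex from bin k.
        offset = 0.5 * (alpha - gamma) / (alpha - 2 * beta + gamma)
        return (k + offset) * sample_rate / fft_size

    # Example: a 440 Hz sine analysed with a 2048-point FFT at 44.1 kHz.
    sr, n = 44100, 2048
    t = np.arange(n) / sr
    spectrum = np.abs(np.fft.rfft(np.sin(2 * np.pi * 440 * t) * np.hanning(n)))
    print(interpolated_peak_frequency(spectrum, sr, n))   # far closer to 440 than the ~21.5 Hz bin spacing

In a real tuner this would be applied to successive overlapping STFT frames rather than to a single synthetic block.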

Using HMM for offline character recognition

I have extracted features from many images of isolated characters (such as gradient, neighbouring pixel weight, and geometric properties). How can I use HMMs as a classifier trained on this data? All the literature I read about HMMs refers to states and state transitions, but I can't connect it to features and class labelling. The example on JAHMM's home page doesn't relate to my problem.
I need to use HMM not because it will work better than other approaches for this problem but because of constraints on project topic.
There was an answer to this question for online recognition but I want the same for offline and in a little more detail
EDIT: I partitioned each character into a grid with fixed number of squares. Now I am planning to perform feature extraction on each grid block and thus obtain a sequence of features for each sample by moving from left to right and top to bottom.
Would this represent an adequate "sequence" for an HMM, i.e. would an HMM be able to capture the temporal variation of the data, even though the character is not drawn from left to right and top to bottom? If not, please suggest an alternative.
Should I feed in a lot of features or start with a few? How do I know if the HMM is underperforming or if the features are bad? I am using JAHMM.
Extracting stroke features is difficult, and can they even be logically combined with grid features (since an HMM expects a sequence generated by some random process)?
I've usually seen neural networks used for this sort of recognition task, i.e. here, here, here, and here. Since a simple Google search turns up so many hits for neural networks in OCR, I'll assume you are set on using HMMs (a project constraint, correct?). Regardless, these links can offer some insight into gridding the image and obtaining image features.
Your approach for turning a grid into a sequence of observations is reasonable. In this case, be sure you do not confuse observations and states. The features you extract from one block should be collected into one observation, i.e. a feature vector. (In comparison to speech recognition, your block's feature vector is analogous to the feature vector associated with a speech phoneme.) You don't really have much information regarding the underlying states. This is the hidden aspect of HMMs, and the training process should inform the model how likely one feature vector is to follow another for a character (i.e. transition probabilities).
Since this is an off-line process, don't be concerned with the temporal aspects of how characters are actually drawn. For the purposes of your task, you've imposed a temporal order on the sequence of observations with your left-to-right, top-to-bottom block sequence. This should work fine.
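A minimal sketch of that data layout (the per-block features here are hypothetical stand-ins, and JAHMM itself is Java, so treat this Python as illustration only):

    import numpy as np

    def block_observation_sequence(char_image, grid=(8, 8)):
        """Split a normalised character image into grid blocks (left-to-right,
        top-to-bottom) and return one feature vector per block."""
        rows, cols = grid
        h, w = char_image.shape
        bh, bw = h // rows, w // cols
        sequence = []
        for r in range(rows):
            for c in range(cols):
                block = char_image[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
                gy, gx = np.gradient(block.astype(float))
                # Hypothetical per-block features: ink density and mean gradient magnitude.
                sequence.append([block.mean(), np.hypot(gx, gy).mean()])
        return np.array(sequence)   # shape: (rows * cols, n_features) -- one observation per block

Each row of the returned array is one observation in the HMM sense; the hidden states themselves are learned during training, not supplied.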
As for HMM performance: choose a reasonable vector of salient features. In speech recognition, the dimensionality of a feature vector can be high (>10). (This is also where the cited literature can assist.) Set aside a percentage of the training data so that you can properly test the model. First, train the model, then evaluate it on the training dataset. How well does it classify your characters? If it does poorly, re-evaluate the feature vector. If it does well on the training data, test the generality of the classifier by running it on the reserved test data.
As for the number of states, I would start with a heuristically derived number. Assuming your character images are scaled and normalized, perhaps something like 40%(?) of the blocks are occupied? This is a crude guess on my part since a source image was not provided. For an 8x8 grid, this would imply that 25 blocks are occupied. We could then start with 25 states - but that's probably naive: empty blocks can convey information (meaning the number of states might increase), but some feature sets may be observed in similar states (meaning the number of states might decrease). If it were me, I would probably pick something like 20 states. Having said that: be careful not to confuse features and states. Your feature vector is a representation of things observed in a particular state. If the tests described above show your model is performing poorly, tweak the number of states up or down and try again.
Good luck.

Does anyone know what "Quantum Computing" is?

In physics, it's the ability for particles to exist in multiple/parallel dynamic states at a particular point in time. In computing, would it be the ability of a data bit to equal 1 or 0 at the same time, a third value like NULL [unknown], or multiple values? How can this technology be applied to computer processors, programming, security, etc.? Has anyone built a practical quantum computer or developed a quantum programming language where, for example, the program code dynamically changes or is autonomous?
I have done research in quantum computing, and here is what I hope is an informed answer.
It is often said that qubits as you see them in a quantum computer can exist in a "superposition" of 0 and 1. This is true, but in a more subtle way than you might first guess. Even with a classical computer with randomness, a bit can exist in a superposition of 0 and 1, in the sense that it is 0 with some probability and 1 with some probability. Just as when you roll a die and don't look at the outcome, or receive e-mail that you haven't yet read, you can view its state as a superposition of the possibilities. Now, this may sound like just flim-flam, but the fact is that this type of superposition is a kind of parallelism and that algorithms that make use of it can be faster than other algorithms. It is called randomized computation, and instead of superposition you can say that the bit is in a probabilistic state.
The difference between that and a qubit is that a qubit can have a fat set of possible superpositions with more properties. The set of probabilistic states of an ordinary bit is a line segment, because all there is is the probability of being 0 or 1. The set of states of a qubit is a round 3-dimensional ball. Now, probabilistic bit strings are more complicated and more interesting than just individual probabilistic bits, and the same is true of strings of qubits. If you can make qubits like this, then actually some computational tasks wouldn't be any easier than before, just as randomized algorithms don't help with all problems. But some computational problems, for example factoring numbers, have new quantum algorithms that are much faster than any known classical algorithm. It is not a matter of clock speed or Moore's law, because the first useful qubits could be fairly slow and expensive. It is only sort-of parallel computation, just as an algorithm that makes random choices is only in a weak sense making all choices in parallel. But it is "randomized algorithms on steroids"; that's my favorite summary for outsiders.
Now the bad news. In order for a classical bit to be in a superposition, it has to be a random choice that is secret from you. Once you look at a flipped coin, the coin "collapses" to either heads for sure or tails for sure. The difference between that and a qubit is that in order for a qubit to work as one, its state has to be secret from the rest of the physical universe, not just from you. It has to be secret from wisps of air, from nearby atoms, etc. On the other hand, for qubits to be useful for a quantum computer, there has to be a way to manipulate them while keeping their state a secret. Otherwise their quantum randomness or quantum coherence is wrecked. Making qubits at all isn't easy, but it is done routinely. Making qubits that you can manipulate with quantum gates, without revealing what is in them to the physical environment, is incredibly difficult.
People don't know how to do that except in very limited toy demonstrations. But if they could do it well enough to make quantum computers, then some hard computational problems would be much easier for these computers. Others wouldn't be easier at all, and a great deal is unknown about which ones can be accelerated and by how much. It would definitely have various effects on cryptography; it would break the widely used forms of public-key cryptography. But other kinds of public-key cryptography have been proposed that could be okay. Moreover, quantum computing is related to the quantum key distribution technique, which looks very safe, and secret-key cryptography would almost certainly still be fairly safe.
The other context in which the word "quantum" comes up in computing is the "entangled pair". Essentially, if you can create an entangled pair of particles which have a physical "spin", quantum physics dictates that the spins of the two particles will always be opposite.
If you could create an entangled pair and then separate them, you could use the device to transmit data without interception by changing the spin on one of the particles. You could then create a signal modulated by the particles' information which is theoretically unbreakable, as you cannot know what spin was on the particles at any given time by intercepting the information between the two signal points.
A whole lot of very interested organisations are researching this technique for secure communications.
Yes, there is quantum encryption, by which if someone tries to spy on your communication, it destroys the datastream such that neither they nor you can read it.
However, the real power of quantum computing lies in the fact that a qubit can be in a superposition of 0 and 1. Big deal. However, if you have, say, eight qubits, you can now represent a superposition of all integers from 0 to 255. This lets you do some rather interesting things in polynomial instead of exponential time. Factorization of large numbers (i.e., breaking RSA, etc.) is one of them.
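As a toy illustration of that counting argument (this just simulates the state vector classically with NumPy, which of course gives none of the quantum speed-up):

    import numpy as np

    n_qubits = 8
    dim = 2 ** n_qubits          # 256 basis states, |0> through |255>

    # An equal superposition over all integers 0..255: one complex amplitude per basis state.
    state = np.full(dim, 1 / np.sqrt(dim), dtype=complex)

    probabilities = np.abs(state) ** 2
    print(dim, probabilities.sum())   # 256 amplitudes whose probabilities sum to 1.0

Classically simulating n qubits needs 2**n amplitudes, which is exactly why a real 8-qubit register is interesting and a simulated one is not.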
There are a number of applications of quantum computing.
One huge one is the ability to solve NP-hard problems in P-time, by using the indeterminacy of qubits to essentially brute-force the problem in parallel.
(The struck-out sentence is false. Quantum computers do not work by brute-forcing all solutions in parallel, and they are not believed to be able to solve NP-complete problems in polynomial time. See e.g. here.)
Just an update on the quantum computing industry, based on Greg Kuperberg's answer:
The D-Wave Two system uses quantum annealing.
A superposition of quantum states collapses to a single state when an observation is made. The current quantum annealing technology applies a physical bias to pairs of qubits; this bias adds constraints to the qubits so that, when the observation happens, they have a higher probability of collapsing to the result we want to see.
Reference:
How does a quantum machine work
I monitor recent non-peer-reviewed articles on the subject; this is what I extrapolate from what I have read. A qubit, in addition to what has been said above (namely that it can hold values in superposition), can also encode multiple properties at once, for example spin up/down together with horizontal/vertical polarization (abbreviated +H, -H, +V, -V, LH, LV). Not all of the combinations are valid, and the additional values that can be encoded depend on the type of qubit.
Each type is used somewhat like RAM vs. ROM, etc.: a photon with a wavelength, an electron with a charge, a photon with a spin, and so on. Some combinations are not valid, and some require additional algorithms in order to pass the argument to the next variable (the location where data is stored) or qubit (the location of the superposition of values to be returned), if you will, simply because the use of wires is by necessity limited due to size and space. One of the greatest challenges is controlling or removing quantum decoherence. This usually means isolating the system from its environment, as interactions with the external world cause the system to decohere. In November 2011, researchers factorised 143 using 4 qubits. That same year, D-Wave Systems announced the first commercial quantum annealer on the market under the name D-Wave One; the company claims this system uses a 128-qubit processor chipset. In May 2013, Google Inc. announced that it was launching the Quantum AI Lab, hopefully to boost AI. I really do hope I didn't waste anyone's time with things they already knew. If you learned something, please upvote.
As I cannot yet comment: it really depends on what type of qubit you are working with to know the number of states, for example the UNSW silicon qubit vs. a diamond nitrogen-vacancy centre, or solid-state NMR in phosphorus-doped silicon vs. liquid NMR of the same.

What kind of learning algorithm would you use to build a model of how long it takes a human to solve a given Sudoku situation?

I don't have much experience in machine learning, pattern recognition, data mining, etc. and in their underlying theory and systems.
I would like to develop an artificial model of the time it takes a human to make a move in a given Sudoku puzzle.
So what I'm looking for as an output from the machine learning process is a model that can give predictions on how long it takes a target human to make a move in a given Sudoku situation.
The same input doesn't always map to the same outcome. It takes the human different amounts of time to make a move in the same situation, but my hypothesis is that there's a tendency in the resulting probability distribution. (My educated guess is that it is ~normal.)
I have ideas about the factors that influence the distribution (like #empty slots) but would preferably leave it to the system to figure these patterns out. Please notice, that I'm not interested in the patterns, just the model.
I can generate sample and test data easily by running sudoku puzzles and measuring the times it takes to make the moves.
What kind of learning algorithm would you suggest to use for this?
I was thinking NNs, but I'm not sure if they can have the desired property of giving weighted random outcomes for the same input.
If I understand this correctly you have an input vector of length 81, which contains 1 if the square is filled in and 0 otherwise. You want to learn a function which returns a probability distribution which models the response time of a human to that board position.
My first response would be that this is a regression problem and you should try straightforward linear regression. This will not provide you with a distribution of response times, but a single 'best-guess' response time.
I'm not clear on why you want to model a distribution of response times. However, if you really do want to output a distribution, then it sounds like you want to look at Bayesian methods. I'm not really an expert on Bayesian inference, so I can't help you much further here.
However, I don't really think your approach is going to work because I agree with your intuition about features such as the number of empty slots being important. There are also other obvious features, such as the number of empty slots per row/column that are likely to be important. Explicitly putting these features in your representation will probably be much more successful than expecting that the learning algorithm will infer something similar on its own.
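As a hedged sketch of that suggestion (the specific features and data shapes here are illustrative assumptions, not recommendations), explicit features plus plain linear regression might look like this with scikit-learn:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def board_features(board):
        """board: 9x9 array with 0 for empty cells and 1-9 for filled cells."""
        board = np.asarray(board)
        empty = board == 0
        return [
            empty.sum(),              # total number of empty cells
            empty.sum(axis=1).max(),  # emptiest row
            empty.sum(axis=0).max(),  # emptiest column
        ]

    def fit_move_time_model(boards, times):
        """boards: list of 9x9 positions; times: measured seconds for the next human move."""
        X = np.array([board_features(b) for b in boards])
        model = LinearRegression().fit(X, np.asarray(times))
        return model   # model.predict(...) returns a single best-guess time, not a distribution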
The monte carlo method seems like it would work well here but would require a stack of solutions the size of the moon to really do it. And it wouldn't give you the time per person, just the time on average.
My understanding of it, tenuous as it is, is that you have a database with a board position and the time it took a human to make the next move. At the very least you have a starting point for most moves. Even if a position is not in the database, you could start to calculate how long it would take to make a move based on some algorithm. Though I know you specified you wanted machine learning to do this, it might be worth segmenting the problem into something a little smaller and then building on it.
If you have some guesstimate as to what influences the function (# of empty cell, etc), try to train a classifier on a vector of features, and not on the 81 cells vector (0/1 or 0..9, doesn't really matter for my argument).
I think that your claim:
we wouldn't have to necessary know the underlying patterns, the "trained patterns" in a learning system automatically encodes these sometimes quite delicate and subtle patterns inside them -- that's one of their great power
is wrong. You do have to give the network the right domain. For example, when trying to detect objects in an image, working in the pixel domain is pointless. You'll only get results if you first run some feature detection to detect edges, corners, etc.
Theoretically, with enough non-linearity (in an NN, enough layers in the network) it can detect such things, but in practice I have never seen that work without giving the classifier the right features to work with.
I was thinking NNs, but I'm not sure if they can have the desired property of giving weighted random outcomes for the same input.
You're just trying to learn a function from 2^81 or 10^81 (or a much smaller feature space as I suggest) to R (response time between 0 and Inf) or some discretization of that. So NN and other classifiers can do that.

What is Cyclomatic Complexity?

A term that I see every now and then is "Cyclomatic Complexity". Here on SO I saw some Questions about "how to calculate the CC of Language X" or "How do I do Y with the minimum amount of CC", but I'm not sure I really understand what it is.
On the NDepend website, I saw an explanation that basically says "the number of decisions in a method: each if, for, &&, etc. adds +1 to the CC score". Is that really it? If yes, why is this bad? I can see that one might want to keep the number of if-statements fairly low to keep the code easy to understand, but is that really all there is to it?
Or is there some deeper concept to it?
I'm not aware of a deeper concept. I believe it's generally considered in the context of a maintainability index. The more branches there are within a particular method, the more difficult it is to maintain a mental model of that method's operation (generally).
Methods with higher cyclomatic complexity are also more difficult to obtain full code coverage on in unit tests. (Thanks Mark W!)
That brings all the other aspects of maintainability in, of course. Likelihood of errors/regressions/so forth. The core concept is pretty straight-forward, though.
Cyclomatic complexity measures the number of times you must execute a block of code with varying parameters in order to execute every path through that block. A higher count is bad because it increases the chances for logical errors escaping your testing strategy.
Cyclomatic complexity = number of decision points + 1
The decision points may be your conditional statements like if, if-else, switch, for loops, while loops, etc.
The following chart describes the type of the application:
Cyclomatic complexity 1-10: considered a normal application
Cyclomatic complexity 11-20: moderate application
Cyclomatic complexity 21-50: risky application
Cyclomatic complexity more than 50: unstable application
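For example, counting decision points in a small function (a sketch; note that tools differ slightly in whether boolean operators such as "and"/"&&" add to the count):

    def shipping_discount(order):
        if order["total"] > 100:                     # decision point 1
            discount = 0.10
        else:
            discount = 0.0
        for item in order["items"]:                  # decision point 2
            if item["fragile"] and item["heavy"]:    # decision points 3 and 4 ('and' adds a branch)
                discount += 0.01
        return discount

    # Decision points = 4, so cyclomatic complexity = 4 + 1 = 5.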
Wikipedia may be your friend on this one: Definition of cyclomatic complexity
Basically, you have to imagine your program as a control flow graph and then
The complexity is (...) defined as:
M = E − N + 2P
where
M = cyclomatic complexity,
E = the number of edges of the graph
N = the number of nodes of the graph
P = the number of connected components
CC is a concept that attempts to capture how complex your program is and how hard it is to test it in a single integer number.
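A small worked example of the formula (the nodes and edges are counted from a hand-drawn control flow graph of this hypothetical function, so treat the numbers as a sketch):

    def sign_label(x):
        if x >= 0:                   # decision node
            label = "non-negative"   # then-branch node
        else:
            label = "negative"       # else-branch node
        return label                 # exit node

    # N (nodes) = 4: decision, then-branch, else-branch, exit
    # E (edges) = 4: decision->then, decision->else, then->exit, else->exit
    # P (connected components) = 1
    # M = E - N + 2P = 4 - 4 + 2 = 2, which matches "one decision point + 1".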
Yep, that's really it. The more execution paths your code can take, the more things that must be tested, and the higher probability of error.
Another interesting point I've heard:
The places in your code with the biggest indents should have the highest CC. These are generally the most important areas to ensure testing coverage because it's expected that they'll be harder to read/maintain. As other answers note, these are also the more difficult regions of code to ensure coverage.
Cyclomatic Complexity really is just a scary buzzword. In fact it's a measure of code complexity used in software development to point out more complex parts of code (more likely to be buggy, and therefore has to be very carefully and thoroughly tested). You can calculate it using the E-N+2P formula, but I would suggest you have this calculated automatically by a plugin. I have heard of a rule of thumb that you should strive to keep the CC below 5 to maintain good readability and maintainability of your code.
I have just recently experimented with the Eclipse Metrics Plugin on my Java projects, and it has a really nice and concise Help file which will of course integrate with your regular Eclipse help and you can read some more definitions of various complexity measures and tips and tricks on improving your code.
That's it, the idea is that a method which has a low CC has fewer forks, loops, etc., which all make a method more complex. Imagine reviewing 500,000 lines of code with an analyzer and seeing a couple of methods which have an order of magnitude higher CC. This lets you focus on refactoring those methods for better understanding (it's also common that methods with a high CC have a high bug rate).
Each decision point in a routine (loop, switch, if, etc.) essentially boils down to an if statement equivalent. For each if you have 2 code paths that can be taken. So with the 1st branch there are 2 code paths, with the second there are 4 possible paths, with the 3rd there are 8, and so on. There are at least 2**N code paths where N is the number of branches.
This makes it difficult to understand the behavior of code and to test it when N grows beyond some small number.
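A tiny illustration of that growth (the flags are hypothetical and the branches are assumed independent):

    def configure(a, b, c):
        settings = []
        if a:                      # 2 paths so far
            settings.append("A")
        if b:                      # 4 paths
            settings.append("B")
        if c:                      # 8 paths: 2**3 input combinations to cover them all
            settings.append("C")
        return settings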
The answers provided so far do not mention the correlation of software quality to cyclomatic complexity. Research has shown that having a lower cyclomatic complexity metric should help develop software that is of higher quality. It can help with the software quality attributes of readability, maintainability, and portability. In general one should attempt to obtain a cyclomatic complexity metric of between 5 and 10.
One of the reasons for using metrics like cyclomatic complexity is that in general a human being can only keep track of about 7 (plus or minus 2) pieces of information simultaneously in their head. Therefore, if your software is overly complex, with multiple decision paths, it is unlikely that you will be able to visualize how it will behave (i.e. it will have a high cyclomatic complexity metric). This would most likely lead to developing erroneous or bug-ridden software. More information about this can be found here and also on Wikipedia.
Cyclomatic complexity is computed using the control flow graph: the number of linearly independent paths through a program's source code is its cyclomatic complexity (counting constructs such as if, if-else, for, and while).
Cyclomatic complexity is basically a metric for finding areas of code that need more attention for maintainability; it is essentially an input to refactoring.
It definitely gives an indication of areas for code improvement in terms of avoiding deeply nested loops, conditions, etc.
That's sort of it. However, each branch of a "case" or "switch" statement tends to count as 1. In effect, this means CC hates case statements, and any code that requires them (command processors, state machines, etc).
Consider the control flow graph of your function, with an additional edge running from the exit to the entrance. The cyclomatic complexity is the maximum number of cuts we can make without separating the graph into two pieces.
For example:
    def F(condition1, condition2):
        if condition1:
            ...
        else:
            ...
        if condition2:
            ...
        else:
            ...
[Image: control flow graph of F]
You can probably intuitively see why the linked graph has a cyclomatic complexity of 3.
Cyclomatic complexity is a measure of how complex a unit of software is. It measures the number of different paths a program might follow through conditional logic constructs (if, while, for, switch/case, etc.). If you would like to learn more about calculating it, here is a YouTube video you can watch: https://www.youtube.com/watch?v=PlCGomvu-NM
It is important in designing test cases because it reveals the different paths or scenarios a program can take.
"To have good testability and maintainability, McCabe recommends
that no program module should exceed a cyclomatic complexity of 10"(Marsic,2012, p. 232).
Reference:
Marsic, I. (2012, September). Software Engineering. Rutgers University. Retrieved from www.ece.rutgers.edu/~marsic/books/SE/book-SE_marsic.pdf