What kind of learning algorithm would you use to build a model of how long it takes a human to solve a given Sudoku situation? - language-agnostic

I don't have much experience in machine learning, pattern recognition, data mining, etc. and in their underlying theory and systems.
I would like to develop an artificial model of the time it takes a human to make a move in a given Sudoku puzzle.
So what I'm looking for as an output from the machine learning process is a model that can predict how long it takes a target human to make a move in a given Sudoku situation.
The same input doesn't always map to the same outcome. The human takes different amounts of time to make a move in the same situation, but my hypothesis is that there's a tendency in the resulting probability distribution. (My educated guess is that it is roughly normal.)
I have ideas about the factors that influence the distribution (like the number of empty slots) but would prefer to leave it to the system to figure these patterns out. Please note that I'm not interested in the patterns themselves, just the model.
I can generate sample and test data easily by running Sudoku puzzles and measuring the times it takes to make the moves.
What kind of learning algorithm would you suggest to use for this?
I was thinking NNs, but I'm not sure if they can have the desired property of giving weighted random outcomes for the same input.

If I understand this correctly, you have an input vector of length 81, which contains a 1 if the square is filled in and a 0 otherwise. You want to learn a function which returns a probability distribution that models the response time of a human to that board position.
My first response would be that this is a regression problem and you should try straightforward linear regression. This will not provide you with a distribution of response times, but a single 'best-guess' response time.
I'm not clear on why you want to model a distribution of response times. However, if you really do want to output a distribution, then it sounds like you want to look at Bayesian methods. I'm not really an expert on Bayesian inference, so I can't help you much further here.
However, I don't really think your approach is going to work, because I agree with your intuition that features such as the number of empty slots are important. There are also other obvious features, such as the number of empty slots per row/column, that are likely to matter. Explicitly putting these features in your representation will probably be much more successful than expecting the learning algorithm to infer something similar on its own.
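If you do go the feature route, a minimal sketch of this in Python with scikit-learn might look like the following (the particular features and the placeholder data are my own assumptions, not the poster's):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hand-crafted features from a 9x9 board (0 = empty, 1-9 = filled).
    def board_features(board):
        empty = (board == 0)
        return [
            empty.sum(),               # total number of empty cells
            empty.sum(axis=1).max(),   # most empty cells in any row
            empty.sum(axis=0).max(),   # most empty cells in any column
        ]

    # Placeholder data; in practice these would be recorded boards and measured move times.
    boards = [np.random.randint(0, 10, (9, 9)) for _ in range(200)]
    times = np.random.exponential(scale=10.0, size=200)

    X = np.array([board_features(b) for b in boards])
    model = LinearRegression().fit(X, times)
    print(model.predict(X[:3]))        # 'best-guess' response times

This only gives a single predicted time per board, as described above, not a distribution.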

The Monte Carlo method seems like it would work well here, but it would require a stack of solutions the size of the moon to really do it. And it wouldn't give you the time per person, just the time on average.
My understanding of it, tenuous as it is, is that you have a database with a board position and the time it took a human to make the next move. At the very least you have a starting point for most moves. Even if a position isn't in the database, you could start to estimate how long a move would take based on some algorithm. Though I know you specified you wanted machine learning to do this, it might be worth segmenting the problem into something a little smaller and then building on it.

If you have some guesstimate as to what influences the function (number of empty cells, etc.), try to train a classifier on a vector of features, not on the vector of 81 cells (0/1 or 0..9, it doesn't really matter for my argument).
I think that your claim:
we wouldn't necessarily have to know the underlying patterns; the "trained patterns" in a learning system automatically encode these sometimes quite delicate and subtle patterns inside them -- that's one of their great powers
is wrong. You do have to give the network the right domain. For example, when trying to detect objects in an image, working in the pixel domain is pointless. You'll only get results if you first run some feature detection to find edges, corners, etc.
Theoretically, with enough non-linearity (in a NN, enough layers in the network) it can detect such things, but in practice I have never seen that work without giving the classifier the right features to work with.
I was thinking NNs, but I'm not sure if they can have the desired property of giving weighted random outcomes for the same input.
You're just trying to learn a function from 2^81 or 10^81 possible inputs (or a much smaller feature space, as I suggest) to R (a response time between 0 and infinity), or some discretization of that. So NNs and other classifiers can do that.
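If you also want the "weighted random outcomes for the same input" from the question, one simple sketch (my own construction, not part of the answer above) is to let a small network predict the mean response time from the feature vector and then sample around it using the spread of the training residuals, leaning on the poster's guess that the distribution is roughly normal:

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    # Placeholder feature vectors and response times; real data would come from measured moves.
    X = np.random.rand(500, 3)
    y = 5 + 10 * X[:, 0] + np.random.randn(500)

    net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000).fit(X, y)
    sigma = np.std(y - net.predict(X))          # residual spread as a crude noise estimate

    def sample_response_time(features, n=5):
        mu = net.predict(features.reshape(1, -1))[0]
        return np.random.normal(mu, sigma, n)   # draws from the assumed ~normal distribution

    print(sample_response_time(np.array([0.4, 0.2, 0.7])))

A fuller treatment would predict the variance per input as well (or use the Bayesian methods mentioned earlier), but this shows the basic shape.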

Related

Train neural network with unlimited training data

Does fitting a neural network with generated, and thus infinite training samples work?
Will the batch size still matter? Should training samples be repeated over some batches or is it ok to use every sample only once?
Does the term "epoch" make any sense, if there is no complete dataset to iterate over?
Does validation make any sense, if every sample from the training dataset is already a new one? If not, will the training loss behave like the validation loss would?
Does fitting a neural network with generated, and thus infinite training samples work?
Yes, it is completely fine, in fact it will likely be a much better setup than the one you are used to.
Will the batch size still matter?
Yes, batch size controls the noise in the gradient estimate: the bigger the batch, the smaller the error.
Should training samples be repeated over some batches or is it ok to use every sample only once?
If you can avoid repeating them and just keep generating, you will be in a cleaner mathematical setup; in practice it likely won't matter much.
Does the term "epoch" make any sense, if there is no complete dataset to iterate over?
The term "epoch" is one of the big mistakes that we made as a community; it really is meaningless even when the dataset is finite. Avoiding it completely will simplify your life: just think in terms of gradient updates/samples consumed and forget about epochs.
Does validation make any sense, if every sample from the training dataset is already a new one? If not, will the training loss behave like the validation loss would?
It does still make sense as an additional verification that you are making progress; just remember to make sure you do not "generate" your validation set during training. That being said, it is much less important than in other cases, as long as your test scenario is also generated in the same way. For example, this is a reason why many RL papers (especially from the Atari era) did not have validation sets: the training and test "environments" were exactly the same.
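To make the setup concrete, here is a minimal sketch of such a training loop in Python/PyTorch (the toy task, architecture and step counts are placeholders of my own, not from the question): every batch is freshly generated, progress is counted in gradient updates rather than epochs, and the validation set is generated once up front so it is never trained on.

    import torch
    from torch import nn

    def generate_batch(batch_size=64):
        x = torch.rand(batch_size, 10)                      # synthetic inputs
        y = (x.sum(dim=1, keepdim=True) > 5).float()        # synthetic labels
        return x, y

    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()

    # Fixed validation set, generated once *before* training and never used for updates.
    val_x, val_y = generate_batch(1024)

    for step in range(10_000):                              # think in gradient updates, not epochs
        x, y = generate_batch()
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
        if step % 1000 == 0:
            with torch.no_grad():
                val_loss = loss_fn(model(val_x), val_y)
            print(f"step {step}: train {loss.item():.4f}  val {val_loss.item():.4f}")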

Adding noise to image for deep learning, yes or no?

I've been thinking that adding noise to an image can prevent overfitting and also "increase" the dataset by adding variations to it. I'm only trying to add some random 1s to images that have shape (256, 256, 3) and use uint8 to represent their colors. I don't think that affects the visualization at all (I showed both images with matplotlib and they seem almost the same), and there is only a ~0.01 mean difference in the sum of their values.
But it doesn't seem to bring any advantage. After training for a long time, it's still not as good as the model that doesn't use noise.
Has anyone tried to use noise for image classification tasks like this? Is it eventually better?
I wouldn't add noise to your data. Some papers employ input deformations during training to increase robustness and convergence speed of models. However, these deformations are statistically inefficient (not just on images but on any kind of data).
You can read "Intriguing properties of neural networks" by Szegedy et al. for more details (and refer to references 9 & 13 for papers that use deformations).
If you want to avoid overfitting, you might be interested to read about regularization instead.
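If it is regularization you are after, a minimal sketch (my own example, not from the paper above) of two standard options in Python/PyTorch is dropout inside the model and L2 weight decay in the optimizer:

    import torch
    from torch import nn

    # A small classifier for (256, 256, 3) images with two common regularizers.
    model = nn.Sequential(
        nn.Flatten(),
        nn.Linear(256 * 256 * 3, 128),
        nn.ReLU(),
        nn.Dropout(p=0.5),          # randomly zeroes activations during training
        nn.Linear(128, 10),
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)  # L2 penalty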
Yes, you may add noise to extend your dataset and avoid overfitting your training set, but make sure it is random; otherwise your network will take this noise as something it should learn (and that's not something you want). I wouldn't use this method first to do that, though: I would first rotate and/or flip my samples.
Either way, your network should perform better than, or at least as well as, your previous network.
The first things I would check are: how do you measure your performance? What was your performance before and after? And did you change anything else?
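For what it's worth, a minimal sketch of that kind of augmentation in Python/NumPy (the noise amplitude and transforms are illustrative choices of my own, not from the answer) could be:

    import numpy as np

    def add_noise(img, amplitude=1):
        # Add small random values to a uint8 image without overflowing.
        noise = np.random.randint(0, amplitude + 1, img.shape, dtype=np.uint8)
        return np.clip(img.astype(np.int16) + noise, 0, 255).astype(np.uint8)

    def augment(img):
        if np.random.rand() < 0.5:
            img = np.fliplr(img)                          # horizontal flip
        img = np.rot90(img, k=np.random.randint(4))       # random 90-degree rotation
        return add_noise(img)

    image = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
    augmented = augment(image)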
There are a couple of works that deal with this problem. Because you make the training set harder, the training error will be higher, but your generalization might be better. It has been shown that adding noise can have a stabilizing effect when training Generative Adversarial Networks (adversarial training).
For classification tasks it is not that cut and dried. Not many works have actually dealt with this topic. The closest one, to the best of my knowledge, is this one from Google (https://arxiv.org/pdf/1412.6572.pdf), where they show the limitations of training without noise. They do report a regularization effect, but not actually better results than using other methods.

CNN (neural network) for finding the supernova

I'm a student majoring in physics as well as CS. One of my tasks is to find supernovae. The discovery of a supernova is tedious and tough.
By comparing the picture taken now with one from before, we may find a bright spot on the picture, and that may be the supernova.
The pictures have a lot of noise, and there are always many ghost spots caused by the instability of the instruments, or other lights creating illusions.
However, the supernova has some obvious characteristics: it always shows up near fixed stars, its light is circular in shape, etc. There are already some conventional methods used for this, but they don't perform well.
So I wonder if it's worthwhile trying it on CNN.
Which kind of data can CNN do well on?
Thanks.
So I wonder if it's worthwhile trying it on CNN.
I think CNN is overkill for this problem.
Which kind of data can CNN do well on?
Data with complex localised relationships in the structure and a large number of features. You use a convolution across a local frame to learn the representation.
The problem you have is very simple. You don't have many parameters (i.e. the colour is grayscale), and the representation of a supernova is all contained within the immediate vicinity of its occurrence.
I think you would probably have much more success with some really simple algorithm such as:
Find all fixed stars
Search for any big 'blobs' of light with specific parameters
Search for any circles of light
These alone will massively reduce the computational size of the problem. From there, there are a number of ML approaches you could take.
CNNs are generally for very big data sets with highly complex non-linear relationships. This (may?) be a big data set but it is certainly not complex in this particular task.
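As a rough illustration of the "simple algorithm first" idea (the thresholds and the library choice are my own assumptions, not the answerer's), a Python sketch with scikit-image could subtract the reference frame and look for compact bright blobs in the difference image:

    import numpy as np
    from skimage.feature import blob_log

    def candidate_spots(new_frame, reference_frame, threshold=0.1):
        # Keep only brightness that appeared since the reference frame.
        diff = np.clip(new_frame.astype(float) - reference_frame.astype(float), 0, None)
        diff /= diff.max() + 1e-9                   # normalise to [0, 1]
        # Each row is (y, x, sigma); sigma * sqrt(2) approximates the blob radius.
        return blob_log(diff, min_sigma=1, max_sigma=5, threshold=threshold)

    new = np.random.rand(128, 128)                  # placeholder grayscale frames
    ref = np.random.rand(128, 128)
    print(candidate_spots(new, ref))

Candidates found this way could then be filtered by the known characteristics (circular shape, proximity to a fixed star) or fed into whichever ML approach you settle on.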

Web Audio Pitch Detection for Tuner

So I have been making a simple HTML5 tuner using the Web Audio API. I have it all set up to respond to the correct frequencies; the problem seems to be with getting the actual frequencies. Using the input, I create an array of the spectrum, where I look for the highest value and use that frequency as the one to feed into the tuner. The problem is that when creating an analyser in Web Audio, it cannot become more specific than an FFT size of 2048. With that, if I play a 440 Hz note, the closest frequency in the array is something like 430 Hz, and the next value is higher than 440. Therefore the tuner will think I am playing these notes, when in fact the loudest frequency should be 440 Hz and not 430 Hz. Since this frequency does not exist in the analyser array, I am trying to figure out a way around this, or whether I am missing something very obvious.
I am very new at this so any help would be very appreciated.
Thanks
There are a number of approaches to implementing pitch detection. This paper provides a review of them. Their conclusion is that using FFTs may not be the best way to go - however, it's unclear quite what their FFT-based algorithm actually did.
If you're simply tuning guitar strings to fixed frequencies, much simpler approaches exist. Building a fully chromatic tuner that does not know a-priori the frequency to expect is hard.
The FFT approach you're using is entirely possible (I've built a robust musical instrument tuner using this approach that is being used white-label by a number of 3rd parties). However you need a significant amount of post-processing of the FFT data.
To start, you solve the resolution problem using the Short-Time FFT (STFT) -- or more precisely, a succession of them. The process is described nicely in this article.
If you intend to build a tuner for guitar and bass guitar (and let's face it, everyone who asks the question here is), you'll need at least a 4092-point DFT with overlapping windows in order not to violate the Nyquist rate on the bottom E1 string at ~41 Hz.
You have a bunch of other algorithmic and usability hurdles to overcome. Not least, the perceived pitch and the spectral peak aren't always the same. Taking the spectral peak from the STFT doesn't work reliably (this is also why the basic auto-correlation approach is broken).
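To make the peak-refinement part concrete, here is a minimal sketch in Python/NumPy (the window length and interpolation scheme are illustrative choices of mine, not the white-label implementation described above): take a windowed FFT and refine the peak bin with parabolic interpolation, which recovers sub-bin resolution so 440 Hz no longer has to fall exactly on a bin.

    import numpy as np

    def peak_frequency(samples, sample_rate, n_fft=4096):
        windowed = samples[:n_fft] * np.hanning(n_fft)
        spectrum = np.abs(np.fft.rfft(windowed))
        k = int(np.argmax(spectrum[1:-1])) + 1          # peak bin (skip the edges)
        a, b, c = spectrum[k - 1], spectrum[k], spectrum[k + 1]
        delta = 0.5 * (a - c) / (a - 2 * b + c)         # parabolic offset in bins
        return (k + delta) * sample_rate / n_fft

    sr = 44100
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 440 * t)
    print(peak_frequency(tone, sr))                     # close to 440 Hz

In a real tuner this would be applied to successive overlapping windows (the STFT mentioned above) and combined with the extra post-processing the answer describes, since the raw spectral peak alone is not reliable.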

Using HMM for offline character recognition

I have extracted features from many images of isolated characters (such as gradient, neighbouring pixel weight and geometric properties). How can I use HMMs as a classifier trained on this data? All the literature I have read about HMMs refers to states and state transitions, but I can't connect that to features and class labeling. The example on JAHMM's home page doesn't relate to my problem.
I need to use HMM not because it will work better than other approaches for this problem but because of constraints on project topic.
There was an answer to this question for online recognition but I want the same for offline and in a little more detail
EDIT: I partitioned each character into a grid with a fixed number of squares. Now I am planning to perform feature extraction on each grid block and thus obtain a sequence of features for each sample by moving from left to right and top to bottom.
Would this represent an adequate "sequence" for an HMM, i.e. would an HMM be able to capture the temporal variation of the data, even though the character is not drawn from left to right and top to bottom? If not, please suggest an alternative.
Should I feed in a lot of features or start with a few? How do I know if the HMM is underperforming or if the features are bad? I am using JAHMM.
Extracting stroke features is difficult, and they can't logically be combined with grid features, can they? (Since an HMM expects a sequence generated by some random process.)
I've usually seen neural networks used for this sort of recognition task, i.e. here, here, here, and here. Since a simple Google search turns up so many hits for neural networks in OCR, I'll assume you are set on using HMMs (a project limitation, correct?). Regardless, these links can offer some insight into gridding the image and obtaining image features.
Your approach for turning a grid into a sequence of observations is reasonable. In this case, be sure you do not confuse observations and states. The features you extract from one block should be collected into one observation, i.e. a feature vector. (In comparison to speech recognition, your block's feature vector is analogous to the feature vector associated with a speech phoneme.) You don't really have much information regarding the underlying states. This is the hidden aspect of HMMs, and the training process should inform the model how likely one feature vector is to follow another for a character (i.e. transition probabilities).
Since this is an off-line process, don't be concerned with the temporal aspects of how characters are actually drawn. For the purposes of your task, you've imposed a temporal order on the sequence of observations with your left-to-right, top-to-bottom block sequence. This should work fine.
As for HMM performance: choose a reasonable vector of salient features. In speech recognition, the dimensionality of a feature vector can be high (>10). (This is also where the cited literature can assist.) Set aside a percentage of the training data so that you can properly test the model. First train the model, then evaluate it on the training dataset. How well does it classify your characters? If it does poorly, re-evaluate the feature vector. If it does well on the training data, test the generality of the classifier by running it on the reserved test data.
As for the number of states, I would start with a heuristically derived number. Assuming your character images are scaled and normalized, perhaps something like 40%(?) of the blocks are occupied? This is a crude guess on my part, since a source image was not provided. For an 8x8 grid, this would imply that about 25 blocks are occupied. We could then start with 25 states -- but that's probably naive: empty blocks can convey information (meaning the number of states might increase), but some feature sets may be observed in similar states (meaning the number of states might decrease). If it were me, I would probably pick something like 20 states. Having said that: be careful not to confuse features and states. Your feature vector is a representation of things observed in a particular state. If the tests described above show your model is performing poorly, tweak the number of states up or down and try again.
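To tie the pieces together, here is a minimal sketch of the whole scheme in Python with hmmlearn (you are using JAHMM in Java, and the per-block feature here is a hypothetical placeholder, so treat this purely as an illustration of the structure): one HMM per character class, trained on block-feature sequences, with classification by the highest log-likelihood.

    import numpy as np
    from hmmlearn import hmm

    def block_features(image, grid=(8, 8)):
        # Hypothetical feature: mean intensity per grid block,
        # read left-to-right, top-to-bottom, as one observation per block.
        h, w = image.shape
        gh, gw = h // grid[0], w // grid[1]
        feats = [image[r * gh:(r + 1) * gh, c * gw:(c + 1) * gw].mean()
                 for r in range(grid[0]) for c in range(grid[1])]
        return np.array(feats).reshape(-1, 1)

    def train_models(samples_by_class, n_states=20):
        models = {}
        for label, images in samples_by_class.items():
            seqs = [block_features(img) for img in images]
            X = np.concatenate(seqs)
            lengths = [len(s) for s in seqs]
            m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
            m.fit(X, lengths)                   # learns states and transition probabilities
            models[label] = m
        return models

    def classify(image, models):
        seq = block_features(image)
        return max(models, key=lambda label: models[label].score(seq))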
Good luck.