I was wondering which is the best machine learning technique to approximate a function that takes a 32-bit number and returns another 32-bit number, from a set of observations.
Thanks!
Multilayer perceptron neural networks would be worth taking a look at. Though you'll need to process the inputs to a floating point number between 0 and 1, and then map the outputs back to the original range.
There are several possible solutions to your problem:
1.) Fitting a linear hypothesis with least-squares method
In that case, you are approximating a hypothesis y = ax + b with the least squares method. This one is really easy to implement, but sometimes, a linear model is not good enough to fit your data. But - I would give this one a try first.
Good thing is that there is a closed form, so you can directly calculate parameters a and b from your data.
See Least Squares
2.) Fitting a non-linear model
Once seen that your linear model does not describe your function very well, you can try to fit higher polynomial models to your data.
Your hypothesis then might look like
y = ax² + bx + c
y = ax³ + bx² + cx + d
etc.
You can also use least squares method to fit your data, and techniques from the gradient descent types (simmulated annealing, ...). See also this thread: Fitting polynomials to data
Or, as in the other answer, try fitting a Neural Network - the good thing is that it will automatically learn the hypothesis, but it is not so easy to explain what the relation between input and output is. But in the end, a neural network is also a linear combination of nonlinear functions (like sigmoid or tanh functions).
Related
I think the answer would be yes, but I'm unable to reason out a good explanation on this.
The mathematical argument lies in a power to represent linearity, we can use following three lemmas to show that:
Lemma 1
With affine transformations (linear layer) we can map the input hypercube [0,1]^d into arbitrary small box [a,b]^k. Proof is quite simple, we can just make all the biases to be equal to a, and make weights multiply by (b-a).
Lemma 2
For sufficiently small scale, many non-linearities are approximately linear. This is actually very much a definition of a derivative, or, taylor expansion. In particular let us take relu(x), for x>0 it is in fact, linear! What about sigmoid? Well if we look at a tiny tiny region [-eps, eps] you can see that it approaches a linear function as eps->0!
Lemma 3
Composition of affine functions is affine. In other words, if I were to make a neural network with multiple linear layers, it is equivalent of having just one. This comes from the matrix composition rules:
W2(W1x + b1) + b2 = W2W1x + W2b1 + b2 = (W2W1)x + (W2b1 + b2)
------ -----------
New weights New bias
Combining the above
Composing the three lemmas above we see that with a non-linear layer, there always exists an arbitrarily good approximation of the linear function! We simply use the first layer to map entire input space into the tiny part of the pre-activation spacve where your linearity is approximately linear, and then we "map it back" in the following layer.
General case
This is a very simple proof, now in general you can use Universal Approximation Theorem to show that a non-linear neural network (Sigmoid, Relu, many others) that is sufficiently large, can approximate any smooth target function, which includes linear ones. This proof (originally given by Cybenko) is however much more complex and relies on showing that specific classes of functions are dense in the space of continuous functions.
Technically, yes.
The reason you could use a non-linear activation function for this task is that you can manually alter the results. Let's say the range the activation function outputs is between 0.0-1.0, then you can round up or down to get a binary 0/1. Just to be clear, rounding up or down isn't linear activation, but for this specific question the purpose of the network was for classification, where some kind of rounding has to be applied.
The reason you shouldn't is the same reason that you shouldn't attach an industrial heater to a fan and call it a hair-drier, it's unnecessarily powerful and it could potentially waste resources and time.
I hope this answer helped, have a good day!
Hello Guys,
I am working right now on an Autoencoder reducing some simple 2D Data to 1D. The architecture is 2 - 10 - 1 - 10 - 2 Neurons/Layer. As Activation Function I use sigmoid in every layer but the output-layer, where I use the identity.
I am using the Accord.NET Framework to build that.
I am Pre-Training the Autoencoder with RBMs and CD-Algorithm, where I can change the initial weights, the learning rate, the momentum and the weight decay.
The Fine-Tuning is accomplished by backpropagation where I can configure the learning rate and the momentum.
The data is some artificially created shape and is marked green in the picture:
data + reconstruction
The reconstruction of the autoencoder is the yellow line. Which leads to my problem. Somehow the encoder is not able to create a non-linear shape as output.
Although I tested arround a lot and changed values a dozen times, I am not getting better results. Maybe someone here has an idea how I could find the problem.
Thanks!
Look in general any neural network is based on a linear representation for your feature against the output so what the net is actually doing (consider two features) is [w1*x1 + w2*x2 = output].
What you need to do to achieve a non-linear representation is to use extra feature(s) which is a non-linear representation of the old feature(s). Let's say for example use x1^2 as an extra feature or x2^2 or both of them. Hence, the net will give this global equation [w1*x1 + w2*x2 + w3*x1^2 = output] which in nature is a non-linear equation and then you can have a nonlinear representation.
The extra feature equation depends mainly on your data. I have used a quadratic equation in my example but it is not always the correct thing to do. Referring to
Your data I think you need to use a cos(x) or sin(x) representation.
I am attempting to fit a circle to some data. This requires numerically solving a set of three non-linear simultaneous equations (see the Full Least Squares Method of this document).
To me it seems that the NEWTON function provided by IDL is fit for solving this problem. NEWTON requires the name of a function that will compute the values of the equation system for particular values of the independent variables:
FUNCTION newtfunction,X
RETURN, [Some function of X, Some other function of X]
END
While this works fine, it requires that all parameters of the equation system (in this case the set of data points) is hard coded in the newtfunction. This is fine if there is only one data set to solve for, however I have many thousands of data sets, and defining a new function for each by hand is not an option.
Is there a way around this? Is it possible to define functions programmatically in IDL, or even just pass in the data set in some other manner?
I am not an expert on this matter, but if I were to solve this problem I would do the following. Instead of solving a system of 3 non-linear equations to find the three unknowns (i.e. xc, yc and r), I would use an optimization routine to converge to a solution by starting with an initial guess. For this steepest descent, conjugate gradient, or any other multivariate optimization method can be used.
I just quickly derived the least square equation for your problem as (please check before use):
F = (sum_{i=1}^{N} (xc^2 - 2 xi xc + xi^2 + yc^2 - 2 yi yc + yi^2 - r^2)^2)
Calculating the gradient for this function is fairly easy, since it is just a summation, and therefore writing a steepest descent code would be trivial, to calculate xc, yc and r.
I hope it helps.
It's usual to use a COMMON block in these types of functions to pass in other parameters, cached values, etc. that are not part of the calling signature of the numeric routine.
I have a function f(x) = 1/(x + a+ b*I*sign(x)) and I want to calculate the
integral of
dx dy dz f(x) f(y) f(z) f(x+y+z) f(x-y - z)
over the entire R^3 (b>0 and a,- b are of order unity). This is just a representative example -- in practice I have n<7 variables and 2n-1 instances of f(), n of them involving the n integration variables and n-1 of them involving some linear combintation of the integration variables. At this stage I'm only interested in a rough estimate with relative error of 1e-3 or so.
I have tried the following libraries :
Steven Johnson's cubature code: the hcubature algorithm works but is abysmally slow, taking hundreds of millions of integrand evaluations for even n=2.
HintLib: I tried adaptive integration with a Genz-Malik rule, the cubature routines, VEGAS and MISER with the Mersenne twister RNG. For n=3 only the first seems to be somewhat viable option but it again takes hundreds of millions of integrand evaluations for n=3 and relerr = 1e-2, which is not encouraging.
For the region of integration I have tried both approaches: Integrating over [-200, 200]^n (i.e. a region so large that it essentially captures most of the integral) and the substitution x = sinh(t) which seems to be a standard trick.
I do not have much experience with numerical analysis but presumably the difficulty lies in the discontinuities from the sign() term. For n=2 and f(x)f(y)f(x-y) there are discontinuities along x=0, y=0, x=y. These create a very sharp peak around the origin (with a different sign in the various quadrants) and sort of 'ridges' at x=0,y=0,x=y along which the integrand is large in absolute value and changes sign as you cross them. So at least I know which regions are important. I was thinking that maybe I could do Monte Carlo but somehow "tell" the algorithm in advance where to focus. But I'm not quite sure how to do that.
I would be very grateful if you had any advice on how to evaluate the integral with a reasonable amount of computing power or how to make my Monte Carlo "idea" work. I've been stuck on this for a while so any input would be welcome. Thanks in advance.
One thing you can do is to use a guiding function for your Monte Carlo integration: given an integral (am writing it in 1D for simplicity) of ∫ f(x) dx, write it as ∫ f(x)/g(x) g(x) dx, and use g(x) as a distribution from which you sample x.
Since g(x) is arbitrary, construct it such that (1) it has peaks where you expect them to be in f(x), and (2) such that you can sample x from g(x) (e.g., a gaussian, or 1/(1+x^2)).
Alternatively, you can use a Metropolis-type Markov chain MC. It will find the relevant regions of the integrand (almost) by itself.
Here are a couple of trivial examples.
I want to invert a 4x4 matrix. My numbers are stored in fixed-point format (1.15.16 to be exact).
With floating-point arithmetic I usually just build the adjoint matrix and divide by the determinant (e.g. brute force the solution). That worked for me so far, but when dealing with fixed point numbers I get an unacceptable precision loss due to all of the multiplications used.
Note: In fixed point arithmetic I always throw away some of the least significant bits of immediate results.
So - What's the most numerical stable way to invert a matrix? I don't mind much about the performance, but simply going to floating-point would be to slow on my target architecture.
Meta-answer: Is it really a general 4x4 matrix? If your matrix has a special form, then there are direct formulas for inverting that would be fast and keep your operation count down.
For example, if it's a standard homogenous coordinate transform from graphics, like:
[ux vx wx tx]
[uy vy wy ty]
[uz vz wz tz]
[ 0 0 0 1]
(assuming a composition of rotation, scale, translation matrices)
then there's an easily-derivable direct formula, which is
[ux uy uz -dot(u,t)]
[vx vy vz -dot(v,t)]
[wx wy wz -dot(w,t)]
[ 0 0 0 1 ]
(ASCII matrices stolen from the linked page.)
You probably can't beat that for loss of precision in fixed point.
If your matrix comes from some domain where you know it has more structure, then there's likely to be an easy answer.
I think the answer to this depends on the exact form of the matrix. A standard decomposition method (LU, QR, Cholesky etc.) with pivoting (an essential) is fairly good on fixed point, especially for a small 4x4 matrix. See the book 'Numerical Recipes' by Press et al. for a description of these methods.
This paper gives some useful algorithms, but is behind a paywall unfortunately. They recommend a (pivoted) Cholesky decomposition with some additional features too complicated to list here.
I'd like to second the question Jason S raised: are you certain that you need to invert your matrix? This is almost never necessary. Not only that, it is often a bad idea. If you need to solve Ax = b, it is more numerically stable to solve the system directly than to multiply b by A inverse.
Even if you have to solve Ax = b over and over for many values of b, it's still not a good idea to invert A. You can factor A (say LU factorization or Cholesky factorization) and save the factors so you're not redoing that work every time, but you'd still solve the system each time using the factorization.
You might consider doubling to 1.31 before doing your normal algorithm. It'll double the number of multiplications, but you're doing a matrix invert and anything you do is going to be pretty tied to the multiplier in your processor.
For anyone interested in finding the equations for a 4x4 invert, you can use a symbolic math package to resolve them for you. The TI-89 will do it even, although it'll take several minutes.
If you give us an idea of what the matrix invert does for you, and how it fits in with the rest of your processing we might be able to suggest alternatives.
-Adam
Let me ask a different question: do you definitely need to invert the matrix (call it M), or do you need to use the matrix inverse to solve other equations? (e.g. Mx = b for known M, b) Often there are other ways to do this w/o explicitly needing to calculate the inverse. Or if the matrix M is a function of time & it changes slowly then you could calculate the full inverse once, & there are iterative ways to update it.
If the matrix represents an affine transformation (many times this is the case with 4x4 matrices so long as you don't introduce a scaling component) the inverse is simply the transpose of the upper 3x3 rotation part with the last column negated. Obviously if you require a generalized solution then looking into Gaussian elimination is probably the easiest.