Convergence rate adaptive Runge Kutta methods - numerical-methods

I have a system of ODEs and I solve them with two different algorithms:
A normal, constant stepsize, Runge Kutta 4 implementation
A modified Runge Kutta 4, with variable stepsize control
My professor asked me to make a plot showing the convergence rates of the two algorithms to determine which one works best.
We usually do this by plotting the difference between the analytical solution and the numerical one at the final time against the stepsize.
However, I don't know how to perform such analysis since for the adaptive version of the algorithm the stepsize is not fixed, hence any error vs step plot would make no sense...
Is there any other way to do it I'm not seeing?
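For reference, the fixed-step study described above can be sketched as follows. This is only an illustration: the test problem y' = -y and the helper name rk4_solve are assumptions, not part of the original system of ODEs.

```python
import math

def rk4_solve(f, t0, y0, t_end, n_steps):
    """Integrate y' = f(t, y) from t0 to t_end with n_steps equal RK4 steps."""
    h = (t_end - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h / 2 * k1)
        k3 = f(t + h / 2, y + h / 2 * k2)
        k4 = f(t + h, y + h * k3)
        y += h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return y

# Hypothetical test problem: y' = -y, y(0) = 1, exact solution exp(-t).
f = lambda t, y: -y
exact = math.exp(-1.0)

step_counts = [10, 20, 40, 80]
errors = [abs(rk4_solve(f, 0.0, 1.0, 1.0, n) - exact) for n in step_counts]

# Observed order between consecutive refinements: log2(e(h) / e(h/2)).
orders = [math.log2(errors[i] / errors[i + 1]) for i in range(len(errors) - 1)]

for n, e in zip(step_counts, errors):
    print(f"h = {1.0 / n:.4f}  error = {e:.3e}")
print("observed orders:", [round(p, 2) for p in orders])
```

The adaptive solver has no single stepsize to put on the horizontal axis. One common substitute, offered here as a suggestion rather than a definitive answer, is to plot the final-time error against a quantity that is defined for both solvers, such as the number of right-hand-side evaluations, or against the requested tolerance for the adaptive one.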

Regression problem getting much better results when dividing values by 100

I'm working on a regression problem in pytorch. My target values can be either between 0 to 100 or 0 to 1 (they represent % or % divided by 100).
The data is unbalanced, I have much more data with lower targets.
I've noticed that when I run the model with targets in the range 0-100, it doesn't learn - the validation loss doesn't improve, and the loss on the 25% large targets is very big, much bigger than the std in this group.
However, when I run the model with targets in the range 0-1, it does learn and I get good results.
If anyone can explain why this happens, and if using the ranges 0-1 is "cheating", that will be great.
Also - should I scale the targets? (either if I use the larger or the smaller range).
Some additional info - I'm trying to fine tune bert for a specific task. I use MSEloss.
Thanks!
I think your observation relates to batch normalization. There is a paper on the subject, and numerous Medium/Towards Data Science posts, which I will not list here. The idea is that if there were no non-linearities in your model and loss function, the scaling would not matter. But even MSE is non-linear, which makes training sensitive to the scaling of both the targets and the input data. You can experiment with inserting batch normalization layers into your model, after dense or convolutional layers. In my experience this often improves accuracy.
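To make the scale sensitivity of MSE concrete, here is a small NumPy sketch (not PyTorch, and with made-up data) comparing the loss and its gradient when the same targets are expressed as fractions and as percentages. The arrays and helper names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical targets and predictions expressed as fractions in [0, 1].
y_frac = rng.uniform(0.0, 1.0, size=1000)
pred_frac = np.clip(y_frac + rng.normal(0.0, 0.05, size=1000), 0.0, 1.0)

def mse(pred, target):
    """Mean squared error."""
    return np.mean((pred - target) ** 2)

def mse_grad(pred, target):
    """Gradient of the mean squared error with respect to each prediction."""
    return 2.0 * (pred - target) / pred.size

# The same data, once as fractions and once as percentages.
loss_frac = mse(pred_frac, y_frac)
loss_pct = mse(100.0 * pred_frac, 100.0 * y_frac)

grad_frac = mse_grad(pred_frac, y_frac)
grad_pct = mse_grad(100.0 * pred_frac, 100.0 * y_frac)

loss_ratio = loss_pct / loss_frac
grad_ratio = np.abs(grad_pct).max() / np.abs(grad_frac).max()
print(loss_ratio, grad_ratio)
```

Multiplying the targets by 100 multiplies the MSE by 100^2 and its gradient by 100, so a learning rate that works for the 0-1 range may be far too large for the 0-100 range. This is one plausible contributing factor, not a full diagnosis of the behaviour described in the question.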

Why do we need the hyperparameters beta and alpha in LDA?

I'm trying to understand the technical part of Latent Dirichlet Allocation (LDA), but I have a few questions on my mind:
First: Why do we need to add alpha and gamma every time we sample the equation below? What if we delete the alpha and gamma from the equation? Would it still be possible to get the result?
Second: In LDA, we randomly assign a topic to every word in the document. Then, we try to optimize the topic by observing the data. Where is the part which is related to posterior inference in the equation above?
If you look at the inference derivation on Wiki, alpha and beta are introduced simply because theta and phi are each drawn from a Dirichlet distribution that is uniquely determined by alpha and beta respectively. The Dirichlet distribution is chosen as the prior (e.g. P(phi|beta)) mainly to make the math tractable by exploiting conjugacy (here the Dirichlet is the conjugate prior of the categorical distribution; the categorical distribution is the special case of the multinomial distribution where n is set to one, i.e. only one trial). Also, the Dirichlet distribution lets us "inject" our belief that the doc-topic and topic-word distributions are concentrated on a few topics and words for a given document or topic (if we set low hyperparameters). If you remove alpha and beta, I am not sure how it will work.
The posterior inference is replaced by inference on the joint probability: at least in Gibbs sampling, you need the joint probability while picking one dimension at a time to "transform the state", as in the Metropolis-Hastings paradigm. The formula you put here is essentially derived from the joint probability P(w,z). I would refer you to the book Monte Carlo Statistical Methods (by Robert and Casella) to fully understand why the inference works.
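As an illustration of what the hyperparameters do during sampling, here is a minimal sketch assuming the standard collapsed Gibbs conditional for LDA, p(z = k | rest) proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta). The notation may differ from the equation in the question, and the counts below are invented for the example.

```python
import numpy as np

def topic_conditional(n_dk, n_kw, n_k, vocab_size, alpha, beta):
    """Normalised conditional over topics for one word token.

    n_dk: topic counts in the current document, current token excluded
    n_kw: counts of the current word in each topic, current token excluded
    n_k:  total number of tokens assigned to each topic, current token excluded
    """
    weights = (n_dk + alpha) * (n_kw + beta) / (n_k + vocab_size * beta)
    return weights / weights.sum()

# Hypothetical counts for K = 3 topics and a vocabulary of 50 words.
n_dk = np.array([3.0, 1.0, 0.0])
n_kw = np.array([2.0, 0.0, 1.0])
n_k = np.array([10.0, 8.0, 5.0])
vocab_size = 50

p_smoothed = topic_conditional(n_dk, n_kw, n_k, vocab_size, alpha=0.1, beta=0.01)
p_no_prior = topic_conditional(n_dk, n_kw, n_k, vocab_size, alpha=0.0, beta=0.0)
print(p_smoothed)
print(p_no_prior)
```

With alpha and beta set to zero, any topic with a zero count in either factor gets probability exactly zero and can never be sampled for this token, so in this form of the sampler the hyperparameters act as pseudo-counts that keep every topic reachable.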

Recurrent NNs: what's the point of parameter sharing? Doesn't padding do the trick anyway?

The following is how I understand the point of parameter sharing in RNNs:
In regular feed-forward neural networks, every input unit is assigned an individual parameter, which means that the number of input units (features) corresponds to the number of parameters to learn. In processing e.g. image data, the number of input units is the same over all training examples (usually constant pixel size * pixel size * rgb frames).
However, sequential input data like sentences can come in highly varying lengths, which means that the number of parameters will not be the same depending on which example sentence is processed. That is why parameter sharing is necessary for efficiently processing sequential data: it makes sure that the model always has the same input size regardless of the sequence length, as it is specified in terms of transition from one state to another. It is thus possible to use the same transition function with the same weights (input to hidden weights, hidden to output weights, hidden to hidden weights) at every time step. The big advantage is that it allows generalization to sequence lengths that did not appear in the training set.
My questions are:
Is my understanding of RNNs, as summarized above, correct?
In the actual Keras code example I looked at for LSTMs, the sentences were padded to equal lengths beforehand. Doesn't this defeat the whole purpose of parameter sharing in RNNs?
Parameter Sharing
Being able to efficiently process sequences of varying length is not the only advantage of parameter sharing. As you said, you can achieve that with padding. The main purpose of parameter sharing is a reduction of the parameters that the model has to learn. This is the whole purpose of using an RNN.
If you learned a different network for each time step and fed the output of the first model to the second etc., you would end up with a regular feed-forward network. For 20 time steps, you would have 20 models to learn. In convolutional nets, parameters are shared by the convolutional filters because we can assume that there are similar interesting patterns in different regions of the picture (for example a simple edge). This drastically reduces the number of parameters we have to learn. Analogously, in sequence learning we can often assume that there are similar patterns at different time steps. Compare 'Yesterday I ate an apple' and 'I ate an apple yesterday'. These two sentences mean the same, but the 'I ate an apple' part occurs at different time steps. By sharing parameters, you only have to learn what that part means once. Otherwise, you would have to learn it for every time step at which it could occur in your model.
There is a drawback to sharing the parameters. Because our model applies the same transformation to the input at every time step, it now has to learn a transformation that makes sense for all time steps. So it has to remember which word came at which time step, i.e. 'chocolate milk' should not lead to the same hidden and memory state as 'milk chocolate'. But this drawback is small compared to using a large feed-forward network.
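The parameter-count argument above can be sketched with a plain tanh RNN in NumPy. The dimensions and names (W_xh, W_hh, rnn_forward) are assumptions chosen for illustration, and this is a simple RNN cell, not the LSTM from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 8

# One set of weights, reused at every time step.
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

def rnn_forward(sequence):
    """Run a plain tanh RNN over a (T, input_dim) array and return the final state."""
    h = np.zeros(hidden_dim)
    for x_t in sequence:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return h

# Number of parameters with sharing: independent of the sequence length.
shared_params = W_xh.size + W_hh.size + b_h.size

def unshared_params(num_steps):
    """Parameter count if every time step had its own copy of the weights."""
    return num_steps * shared_params

# The same weights handle sequences of different lengths.
h_short = rnn_forward(rng.normal(size=(3, input_dim)))
h_long = rnn_forward(rng.normal(size=(20, input_dim)))
print(h_short.shape, h_long.shape)
print(shared_params, unshared_params(20))
```

The same three arrays process a 3-step and a 20-step sequence, whereas one network per time step would multiply the parameter count by the number of steps.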
Padding
As for padding the sequences: the main purpose is not directly to let the model predict sequences of varying length. Like you said, this can be done by using parameter sharing. Padding is used for efficient training, specifically to keep the number of computational graphs built during training low. Without padding, we have two options for training:
We unroll the model for each training sample. So, when we have a sequence of length 7, we unroll the model to 7 time steps, feed the sequence, do back-propagation through the 7 time steps and update the parameters. This seems intuitive in theory. But in practice, this is inefficient, because TensorFlow's computational graphs do not allow recurrence; they are feed-forward.
The other option is to create the computational graphs before starting training. We let them share the same weights and create one computational graph for every sequence length in our training data. But when our dataset has 30 different sequence lengths this means 30 different graphs during training, so for large models, this is not feasible.
This is why we need padding. We pad all sequences to the same length and then only need to construct one computational graph before starting training. When you have both very short and very long sequences (lengths 5 and 100, for example), you can combine bucketing with padding. This means you pad the sequences to different bucket lengths, for example [5, 20, 50, 100]. Then, you create a computational graph for each bucket. The advantage is that you don't have to pad a sequence of length 5 to 100, which would waste a lot of time on "learning" the 95 padding tokens in there.
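The bucketing idea can be sketched in plain Python as follows. The bucket lengths come from the example above; the padding token id 0 and the function names are assumptions for illustration, not the API of any framework.

```python
import bisect

BUCKETS = [5, 20, 50, 100]   # bucket lengths from the example above
PAD = 0                      # hypothetical padding token id

def pad_to_bucket(seq, buckets=BUCKETS, pad=PAD):
    """Pad seq to the smallest bucket length that can hold it."""
    idx = bisect.bisect_left(buckets, len(seq))
    if idx == len(buckets):
        raise ValueError(f"sequence of length {len(seq)} exceeds the largest bucket")
    target = buckets[idx]
    return seq + [pad] * (target - len(seq))

def group_by_bucket(seqs, buckets=BUCKETS):
    """Return {bucket_length: [padded sequences]} so each bucket can share one graph."""
    grouped = {b: [] for b in buckets}
    for seq in seqs:
        padded = pad_to_bucket(seq, buckets)
        grouped[len(padded)].append(padded)
    return grouped

# Hypothetical token-id sequences of lengths 3, 5, 7 and 60.
sequences = [[1] * 3, [2] * 5, [3] * 7, [4] * 60]
grouped = group_by_bucket(sequences)
for length, batch in grouped.items():
    print(length, len(batch))
```

Here the length-3 sequence is padded with only 2 tokens instead of 97, which is the saving that bucketing is meant to provide.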

Implementing large linear regression models using CUDA

For analysis of 10^6 genetic factors and their GeneXGene interactions (~5x10^11), I have numerous and independent linear regression problems which are probably suitable for analysis on GPUs.
The objective is to exhaustively search for GeneXGene interaction effects in modulating an outcome variable (a brain phenotype) using linear regression with the interaction term included.
As far as I know, the Householder QR factorization could be the solution for fitting regression models, however, given that each regression matrix in this particular work could easily approach the size of ~ 10'000x10, even each single regression matrix does not seem to fit in GPU on-chip memory (shared, registers etc.).
Should I accept this as a problem which is inherently bandwidth-limited and keep the matrices in GPU global memory during regression analysis, or are other strategies possible?
EDIT
Here are more details about the problem:
There will be approximately 10'000 subjects, each with ~1M genetic parameters (genetic matrix:10'000x10^6). The algorithm in each iteration should select two columns of this genetic matrix (10'000x2) and also around 6 other variables unrelated to genetic data (age, gender etc) so the final regression model will be dealing with a matrix like the size of 10'000x[2(genetic factors)+6(covariates)+2(intercept&interaction term)] and an outcome variable vector (10'000x1). This same process will be repeated ~5e11 times each time with a given pair of genetic factors. Those models passing a predefined statistical threshold should be saved as output.
The specific problem is that although there are ~5e11 separate regression models, even a single one does not seem to fit in on-chip memory.
I also guess that sticking with CUDA libraries may not be the solution here as this mandates most of the data manipulation to take place on the CPU side and only sending each QR decomposition to GPU?
Your whole data matrix (1e4 x 1e6) may be too large to fit in global memory, while each of your least-squares problems (1e4 x 10) may be too small to fully utilize the GPU.
For each least squares problem, you could use cuSolver for QR factorization and triangular solving.
http://docs.nvidia.com/cuda/cusolver/index.html#cuds-lt-t-gt-geqrf
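To show the structure of a single regression, here is a CPU-side NumPy sketch of the QR factorization followed by a triangular solve. It is not cuSolver code, the data are randomly generated stand-ins, and the column layout (intercept, two genetic factors, interaction, six covariates) is assumed from the description in the question.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects = 10_000

# Hypothetical stand-in data: two genetic columns, six covariates, one outcome.
g1 = rng.integers(0, 3, size=n_subjects).astype(float)
g2 = rng.integers(0, 3, size=n_subjects).astype(float)
covariates = rng.normal(size=(n_subjects, 6))
y = 0.5 * g1 * g2 + rng.normal(size=n_subjects)

def design_matrix(g1, g2, covariates):
    """Columns: intercept, g1, g2, g1*g2 interaction, then the covariates."""
    ones = np.ones_like(g1)
    return np.column_stack([ones, g1, g2, g1 * g2, covariates])

def fit_qr(X, y):
    """Least squares via thin QR: X = QR, then solve R beta = Q^T y."""
    Q, R = np.linalg.qr(X)   # Q is (n, p), R is (p, p) upper triangular
    # np.linalg.solve is a general solver; a dedicated triangular solver
    # would exploit the structure of R.
    return np.linalg.solve(R, Q.T @ y)

X = design_matrix(g1, g2, covariates)
beta = fit_qr(X, y)
print(X.shape, beta.shape)
print("interaction coefficient:", beta[3])
```

Only the two genetic columns and the interaction column change from one model to the next, so the intercept and covariate columns could be reused across all pairs. Whether that helps on the GPU depends on the implementation.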
If the problem size is too small to fully utilize the GPU, you could use concurrent kernel execution to solve multiple equations at the same time.
https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-cuda-7-streams-simplify-concurrency/
For the whole data matrix, if it cannot fit into global memory, you could work on only part of it at a time. For example, you could divide the matrix into ten (1e4 x 1e5) blocks; each time you load two of the blocks through PCIe, select all possible two-column combinations with one column from each block, solve the equations and then load another two blocks. Maximizing the block size will help you minimize the PCIe data transfer. With proper design, I'm sure the time for PCIe data transfer will be much smaller than the time for solving 1e12 equations. Furthermore, you could overlap the data transfer with solver kernel executions.
https://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/
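The blockwise enumeration of column pairs can be sketched in plain Python. The function names are hypothetical, the sizes are tiny so the result can be checked by hand, and the sketch additionally assumes that pairs within a single block are handled when a block is paired with itself, a detail the description above leaves open.

```python
from itertools import combinations, product

def column_blocks(n_columns, block_size):
    """Split column indices 0..n_columns-1 into consecutive blocks."""
    return [range(start, min(start + block_size, n_columns))
            for start in range(0, n_columns, block_size)]

def blockwise_pairs(n_columns, block_size):
    """Yield every unordered column pair (i, j) with i < j, one pair of blocks at a time."""
    blocks = column_blocks(n_columns, block_size)
    for a in range(len(blocks)):
        for b in range(a, len(blocks)):
            # In a real pipeline, blocks a and b would be transferred to the GPU here.
            if a == b:
                yield from combinations(blocks[a], 2)
            else:
                yield from product(blocks[a], blocks[b])

pairs = list(blockwise_pairs(n_columns=12, block_size=5))
print(len(pairs))
```

For 12 columns this visits each of the 12 * 11 / 2 = 66 pairs exactly once.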

Solving a system of equations with a 10th-degree polynomial, LSM

I have a numerical problem solving a system of equations (10th-degree polynomial) using the ordinary least squares method (LSM). I obtain parameters with both huge and very small values, so I cannot invert the matrix constructed in this method: the precision is too low even with extended-precision variables. I have tried this in C++, Matlab and Delphi.
Does anybody know of tools with which I can do this with enough accuracy, or numerical tips to get good results? Standard matrix calculation unfortunately does not work.
I think that your problem comes from the fact that you are using 10th order polynomials, which quite often lead to numerical problems:
First of all, they can be unsuitable because of large oscillations. Even when interpolating a simple function, these oscillations can be present, see the famous Runge's example.
Secondly, fitting high-order polynomials can lead to ill-conditioned linear systems, which is why you could not invert the matrix (which you should not do anyway). I made a simple experiment: I took 11 equidistant points (on the interval [0,1]) and assembled the matrix of the linear system to solve. Matlab gives me a condition number of about 1e8, so the least-squares (normal equations) matrix has a condition number of about 1e16. So your matrix is 'close to singular', which means that all numerical precision is lost.
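The experiment above can be reproduced along these lines in NumPy rather than Matlab. This is a sketch that assumes a monomial (Vandermonde) basis, since the answer does not state the basis explicitly; the exact figures printed may differ from those quoted.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 11)           # 11 equidistant points on [0, 1]
A = np.vander(x, 11)                    # columns x^10, x^9, ..., x^0

cond_A = np.linalg.cond(A)
cond_normal = np.linalg.cond(A.T @ A)   # matrix of the normal equations

print(f"cond(A)     = {cond_A:.2e}")
print(f"cond(A^T A) = {cond_normal:.2e}")
```

Forming A^T A squares the condition number, which is why solving the normal equations loses roughly twice as many digits as working with A directly.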
So, the best way to get rid of your problem is to get rid of the 10th order polynomial. You should maybe consider lower order polynomials, splines or piecewise polynomial approximations.
If you really need 10th order polynomials (e.g. if you know that your data have been generated by such a polynomial), then do not invert the matrix. Use a good preconditioner and an iterative method to solve the system without inverting the matrix.
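As a hedged illustration of solving the fit without inverting anything, the sketch below uses NumPy's np.linalg.lstsq, a direct SVD-based least-squares solver. Note that this swaps in a different technique from the preconditioned iterative method suggested above; the data are synthetic and generated from a known 10th-degree polynomial.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data generated by a known 10th-degree polynomial on [0, 1].
true_coeffs = rng.uniform(-1.0, 1.0, size=11)   # highest power first
x = np.linspace(0.0, 1.0, 50)
y = np.polyval(true_coeffs, x)

A = np.vander(x, 11)                            # 50 x 11 design matrix

# Solve min ||A c - y|| directly; A^T A is never formed or inverted.
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

max_residual = np.max(np.abs(A @ coeffs - y))
print(f"max residual = {max_residual:.2e}")
```

Working with A itself avoids the squared condition number of the normal equations. The recovered coefficients can still be less accurate than the fitted values, because A remains ill-conditioned in the monomial basis.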