How to set max interations for QP subproblem of sqp optimization? - octave

I am trying to run a nonlinear optimization using Octave's sqp solver, but I'm getting a warning that says "sqp: QP subproblem failed to converge in 200 iterations". From what I understand, the sqp (nonlinear) solver makes successive calls to the QP (quadratic) solver. But I only seem to be able to set the max iterations for the initial sqp call--not the QP subcalls.
The two things I tried so far were setting the max iterations to 500 in the sqp call - sqp (x0, phi, g, h, lb, ub, maxiter=500) - and including the line optimset('MaxIter',500) at the start of my script, but both of those only set the max iterations for the sqp solver and not the QP subproblems. Is there any way to set the max iterations for the QP subproblems as well?

Currently this is not possible as you can see in the QP call inside the SQP implementation: http://hg.savannah.gnu.org/hgweb/octave/file/8c648c3a2c8f/scripts/optimization/sqp.m#l412
Please start a feature request on Savannah: https://www.gnu.org/software/octave/bugs.html

Related

calculating DFT of time signal in MATLAB

This code computes the DFT from time domain.
Can anybody see the code below and help me to get the right answer?
my problem is:
when I change N value, for example, to 4, 5, 10 ,or other values.
X(1) changes with that. but I think X(1) must be the same for every value of N.
just like the shape below: the N value changes but the vertical value is the same.
I appreciate if you help me.
Thank you.
enter image description here
clear; clc;
% %% Analytical
N=4;
k=0:N-1;
X=zeros(N,1);
t=k/N;
x=(5+2*cos(2*pi*t-pi/2)+3*cos(4*pi*t))
%x=abs((1-(0.012.*(pi.*52.*(t-0.3721)).^2)).*exp(-(pi.*52.*(t-0.3721).^2)))
abs(sum(x))
for k=0:N-1
for n=0:N-1
X(k+1)=X(k+1)+x(n+1).*exp(-1i.*2.*pi.*(n).*(k)/N);
end
end
k1=[0:N-1];
stem(k1,abs(X))
% xlim([0 1])
% ylim([-1 1])
xlabel('Frequency');
ylabel('|X(k)|');
title('Frequency domain - Magnitude response')
Your definition of DFT (which is probably the most common definition) does not have the property that X(1) remains constant with N. Instead, it is X(1)/N which will remain constant. To use this DFT to get the magnitudes of the input at various frequencies, you'll need to divide the DFT output by N.
To verify this, you can call Matlab's fft function and compare with your results. You should get the same answer from Matlab's fft. Note that Matlab's fft documentation says:
The resulting FFT amplitude is A*n/2, where A is the original amplitude and n is the number of FFT points.

Statistical method to know when enough performance test iterations have been performed

I'm doing some performance/load testing of a service. Imagine the test function like this:
bytesPerSecond = test(filesize: 10MB, concurrency: 5)
Using this, I'll populate a table of results for different sizes and levels of concurrency. There are other variables too, but you get the idea.
The test function spins up concurrency requests and tracks throughput. This rate starts off at zero, then spikes and dips until it eventually stabilises on the 'true' value.
However it can take a while for this stability to occur, and there are lot of combinations of input to evaluate.
How can the test function decide when it's performed enough samples? By enough, I suppose I mean that the result isn't going to change beyond some margin if testing continues.
I remember reading an article about this a while ago (from one of the jsperf authors) that discussed a robust method, but I cannot find the article any more.
One simple method would be to compute the standard deviation over a sliding window of values. Is there a better approach?
IIUC, you're describing the classic problem of estimating the confidence interval of the mean with unknown variance. That is, suppose you have n results, x1, ..., xn, where each of the xi is a sample from some process of which you don't know much: not the mean, not the variance, and not the distribution's shape. For some required confidence interval, you'd like to now whether n is large enough so that, with high probability the true mean is within the interval of your mean.
(Note that with relatively-weak conditions, the Central Limit Theorem guarantees that the sample mean will converge to a normal distribution, but to apply it directly you would need the variance.)
So, in this case, the classic solution to determine if n is large enough, is as follows:
Start by calculating the sample mean μ = ∑i [xi] / n. Also calculate the normalized sample variance s2 = ∑i [(xi - μ)2] / (n - 1)
Depending on the size of n:
If n > 30, the confidence interval is approximated as μ ± zα / 2(s / √(n)), where, if necessary, you can find here an explanation on the z and α.
If n < 30, the confidence interval is approximated as μ ± tα / 2(s / √(n)); see again here an explanation of the t value, as well as a table.
If the confidence is enough, stop. Otherwise, increase n.
Stability means rate of change (derivative) is zero or close to zero.
The test function spins up concurrency requests and tracks throughput.
This rate starts off at zero, then spikes and dips until it eventually
stabilises on the 'true' value.
I would track your past throughput values. For example last X values or so. According to this values, I would calculate rate of change (derivative of your throughput). If your derivative is close to zero, then your test is stable. I will stop test.
How to find X? I think instead of constant value, such as 10, choosing a value according to maximum number of test can be more suitable, for example:
X = max(10,max_test_count * 0.01)

Octave FWHM calculation

I am having some problem about calculating the FWHM of my data. Because the "fwhm" function in signal package results in a 100 times bigger value than i expected to get.
What i did is that,
Depending on the gaussian distribution function (you can find it on wikipedia) I produced some data. In this function you can give a specific sigma (RMS) value (FWHM=sigma*2.355). Here is that the script I wrote to understand the situation
x=10:0.01:40;
x0=25;
sigma=0.25;
y=(1/(sigma*sqrt(2*pi)))*exp(-((x-x0).^2)/(2*sigma^2));
z=fwhm(y)/2.355;
plot(x,y)
when I compared the results the output of "fwhm" function (24.999) is 100 times bigger than the one I used (0.25) in the function.
If you have any idea it will be very helpful.
Thanks in advance.
Your z is 100 times bigger because your steps in x are 1/100 (0.01). If you use fwhm(y) it is expected that the stepsize in x is 1. If not you have to specify that.
In your case you should do:
z=fwhm(x, y)/2.355
z = 0.24999
which matches your sigma

cublas one function call produced three executions

I called the cublas_Sgemm_v2 function for 10236 times with first matrix non-transposed and the second transposed. However, in the nvprof results, I saw three items produced from that function call. The (m, n, k) values to the function call are (588, 588, 20).
There are the items listed in the nvprof results.
Time(%) Time Calls Avg Min Max Name
12.32% 494.86ms 10236 48.344us 47.649us 49.888us sgemm_sm35_ldg_nt_128x8x128x16x16
8.64% 346.91ms 10236 33.890us 32.352us 35.488us sgemm_sm35_ldg_nt_64x16x128x8x32
8.11% 325.63ms 10236 31.811us 31.360us 32.512us sgemm_sm35_ldg_nt_128x16x64x16x16
Is this expected and why is that? Can someone explain what do the values in the function names such as sgemm_sm35_ldg_nt_128x8x128x16x16 mean?
I also have other function calls to cublas_Sgemm_v2 with different transpose settings and I only see one item per each function call.
UPDATE:
As #Marco13 asked, I put more results here:
Time(%) Time Calls Avg Min Max Name
--------------------------------------------------------------------------------
Resulted from 7984 calls with (Trans, NonTrans) with (m, n, k) = (588, 100, 588)
20.84% 548.30ms 7984 68.675us 58.977us 81.474us sgemm_sm35_ldg_tn_32x16x64x8x16
Resulted from 7984 calls with (NonTrans, NonTrans) with (m, n, k) = (588, 100, 588)
12.95% 340.71ms 7984 42.674us 21.856us 64.514us sgemm_sm35_ldg_nn_64x16x64x16x16
All the following resulted from 3992 calls with (NonTrans, Trans) with (m, n, k) = (588, 588, 100)
9.81% 258.15ms 3992 64.666us 61.601us 68.642us sgemm_sm35_ldg_nt_128x8x128x16x16
6.84% 179.90ms 3992 45.064us 40.097us 49.505us sgemm_sm35_ldg_nt_64x16x128x8x32
6.33% 166.51ms 3992 41.709us 38.304us 61.185us sgemm_sm35_ldg_nt_128x16x64x16x16
Another run with 588 changed to 288:
Time(%) Time Calls Avg Min Max Name
--------------------------------------------------------------------------------
Resulted from 7984 calls with (Trans, NonTrans) with (m, n, k) = (288, 100, 288)
22.01% 269.11ms 7984 33.706us 30.273us 39.232us sgemm_sm35_ldg_tn_32x16x64x8x16
Resulted from 7984 calls with (NonTrans, NonTrans) with (m, n, k) = (288, 100, 288)
14.79% 180.78ms 7984 22.642us 18.752us 26.752us sgemm_sm35_ldg_nn_64x16x64x16x16
Resulted from 3992 calls with (NonTrans, Trans) with (m, n, k) = (288, 288, 100)
7.43% 90.886ms 3992 22.766us 19.936us 25.024us sgemm_sm35_ldg_nt_64x16x64x16x16
From the last three lines is looks like certain transposition types can be more efficient than the others, and certain matrix sizes are more economic in terms of computation time over matrix size. What is the guideline of ensuring economic computation?
UPDATE 2:
For the case of (m, n, k) = (588, 100, 588) above, I manually transposed the matrix before calling the sgemm function, then there is only one item in the nvprof result. The time it take is only a little less than the sum of the two items in the above table. So there is no much performance gain from doing so.
Time(%) Time Calls Avg Min Max Name
--------------------------------------------------------------------------------
31.65% 810.59ms 15968 50.763us 21.505us 72.098us sgemm_sm35_ldg_nn_64x16x64x16x16
Sorry, not an answer - but slightly too long for a comment:
Concerning the edit, about the influence of the "transpose" state: Transposing a matrix might cause an access pattern that is worse in terms of memory coalescing. A quick websearch brings brings some results about this ( https://devtalk.nvidia.com/default/topic/528450/cuda-programming-and-performance/cublas-related-question/post/3734986/#3734986 ), but with a slightly different setup than yours:
DGEMM performance on a K20c
args: ta=N tb=N m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.13280010 sec GFLOPS=1034.93
args: ta=T tb=N m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.13872910 sec GFLOPS=990.7
args: ta=N tb=T m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.12521601 sec GFLOPS=1097.61
args: ta=T tb=T m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.13652611 sec GFLOPS=1006.69
In this case, the differences do not seem worth the hassle of changing the matrix storage (e.g. from column-major to row-major, to avoid transposing the matrix), because all patterns seem to run with a similar speed. But your mileage may vary - particularly, the difference in your tests between (t,n) and (n,n) are very large (548ms vs 340ms), which I found quite surprising. If you have the choice to easily switch between various representations of the matrix, then a benchmark covering all the four cases may be worthwhile.
In any case, regarding your question about the functions that are called there: The CUBLAS code for the sgemm function in CUBLAS 1.1 was already full of unrolled loops and already contained 80 (!) versions of the sgemm function for different cases that have been assembled using a #define-hell. It has to be assumed that this has become even more unreadable in the newer CUBLAS versions, where the newer compute capabilities have to be taken into account - and the function names that you found there indicated that this indeed is the case:
sgemm_sm35_ldg_nt_64x16x128x8x32
sm35 : Runs on a device with compute capability 3.5
ldg : ? Non-texture-memory version ? (CUBLAS 1.1 contained functions called sgemm_main_tex_* which worked on texture memory, and functions sgemm_main_gld_* which worked on normal, global memory)
nt : First matrix is Not transposed, second one is Transposed
64x16x128x8x32 - Probably related to tile sizes, maybe shared memory etc...
Still, I think it's surprising that a single call to sgemm causes three of these internal functions to be called. But as mentioned in the comment: I assume that they try to handle the "main" part of the matrix with a specialized, efficient version, and "border tiles" with one that is capable of doing range checks and/or cope with warps that are not full. (Not very precise, just to be suggestive: A matrix of size 288x288 could be handled by an efficient core for matrices of size 256x256, and two calls for the remaining 32x288 and 288x32 entries).
But all this is also the reason why I guess there can hardly be a general guideline concerning the matrix sizes: The "best" matrix size in terms of computation time over matrix size will at least depend on
the hardware version (compute capability) of the target system
the transposing-flags
the CUBLAS version
EDIT Concerning the comment: One could imagine that there should be a considerable difference between the transosed and the non-transposed processing. When multiplying two matrices
a00 a01 a02 b00 b01 b02
a10 a11 a12 * b10 b11 b12
a20 a21 a22 b20 b21 b22
Then the first element of the result will be
a00 * b00 + a01 * b10 + a02 * b20
(which simply is the dot product of the first row of a and the first column of b). For this computation one has to read consecutive values from a. But the values that are read from b are not consecutive. Instead, they are "the first value in each row". One could think that this would have a negative impact on memory coalescing. But for sure, the NVIDIA engineers have tried hard to avoid any negative impact here, and the implementation of sgemm in CUBLAS is far, far away from "a parallel version of the naive 3-nested-loops implementation" where this access pattern would have such an obvious drawback.

FFT Magnitude Output

I'm just getting into signal processing and need to do some DFT/FFT work.
If I take a signal with two freqs of 2Hz and 5Hz: x(t)=sin(2*2pi*t)+sin(5*2pi*t). I sample at 100Hz for 5 sec (so my DFT size is 500).
Because my inputs are real values I get a symmetric DFT, so can discard the 2nd half and convert the DFT values into magnitude by doing sqrt(re^2+c^2).
My bin widths are 100/500 = 0.2Hz, and so I get:
With peaks at 2Hz and 5Hz as expected.
My question is: why are the magnitudes different?
On a related note, why are there not two perfect spikes at 2hz and 5Hz, i.ee the graph has non-zero values at 1.5 and 2.5 etc. Is this spectral leakage?
I expect your 500 data points are being processed as a 512 point FFT (most FFT libraries do not support arbitrary size inputs and so typically they zero pad to the next highest power of 2). If that is the case then you will be seeing the effects of spectral leakage. Applying a window function prior to the FFT should fix this. Note that you will still see "skirts" on either side of your peaks - this is due to the uncertainty introduced by a finite sampling window.