Related
I'm attempting my first solo project, after taking an introductory course to machine learning, where I'm trying to use linear approximation to predict the outcome of addition/subtraction of two numbers.
I have 3 features: first number, subtraction/addition (0 or 1), and second number.
So my input looks something like this:
3 0 1
4 1 2
3 0 3
With corresponding output like this:
2
6
0
I have (I think) successfully implemented logistic regression algorithm, as the squared error does gradually decrease, but in 100 values, ranging from 0 to 50, the squared error value flattens out at around 685.6 after about 400 iterations.
Graph: Squared Error vs Iterations
.
To fix this, I have tried using a larger dataset for training, getting rid of regularization, and normalizing the input values.
I know that one of the steps to fix high bias is to add complexity to the approximation, but I want to maximize the performance at this particular level. Is it possible to go any further on this level?
My linear approximation code in Octave:
% Iterate
for i = 1 : iter
% hypothesis
h = X * Theta;
% reg theta prep
regTheta = Theta;
regTheta(:, 1) = 0;
% cost calc
J(i, 2) = (1 / (2 * m)) * (sum((h - y) .^ 2) + lambda * sum(sum(regTheta .^ 2,1),2));
% theta calc
Theta = Theta - (alpha / m) * ((h - y)' * X)' + lambda * sum(sum(regTheta, 1), 2);
end
Note: I'm using 0 for lambda, as to ignore regularization.
Suppose I have a 32 or 64 bit unsigned integer.
What is the fastest way to find the index i of the leftmost bit such that the number of 0s in the leftmost i bits equals the number of 1s in the leftmost i bits?
I was thinking of some bit tricks like the ones mentioned here.
I am interested in recent x86_64 processor. This might be relevant as some processor support instructions as POPCNT (count the number of 1s) or LZCNT (counts the number of leading 0s).
If it helps, it is possible to assume that the first bit has always a certain value.
Example (with 16 bits):
If the integer is
1110010100110110b
^
i
then i=10 and it corresponds to the marked position.
A possible (slow) implementation for 16-bit integers could be:
mask = 1000000000000000b
pos = 0
count=0
do {
if(x & mask)
count++;
else
count--;
pos++;
x<<=1;
} while(count)
return pos;
Edit: fixed bug in code as per #njuffa comment.
I don't have any bit tricks for this, but I do have a SIMD trick.
First a few observations,
Interpreting 0 as -1, this problem becomes "find the first i so that the first i bits sum to 0".
0 is even but all the bits have odd values under this interpretation, which gives the insight that i must be even and this problem can be analyzed by blocks of 2 bits.
01 and 10 don't change the balance.
After spreading the groups of 2 out to bytes (none of the following is tested),
// optionally use AVX2 _mm_srlv_epi32 instead of ugly variable set
__m128i spread = _mm_shuffle_epi8(_mm_setr_epi32(x, x >> 2, x >> 4, x >> 6),
_mm_setr_epi8(0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15));
spread = _mm_and_si128(spread, _mm_set1_epi8(3));
Replace 00 by -1, 11 by 1, and 01 and 10 by 0:
__m128i r = _mm_shuffle_epi8(_mm_setr_epi8(-1, 0, 0, 1, 0,0,0,0,0,0,0,0,0,0,0,0),
spread);
Calculate the prefix sum:
__m128i pfs = _mm_add_epi8(r, _mm_bsrli_si128(r, 1));
pfs = _mm_add_epi8(pfs, _mm_bsrli_si128(pfs, 2));
pfs = _mm_add_epi8(pfs, _mm_bsrli_si128(pfs, 4));
pfs = _mm_add_epi8(pfs, _mm_bsrli_si128(pfs, 8));
Find the highest 0:
__m128i iszero = _mm_cmpeq_epi8(pfs, _mm_setzero_si128());
return __builtin_clz(_mm_movemask_epi8(iszero) << 15) * 2;
The << 15 and *2 appear because the resulting mask is 16 bits but the clz is 32 bit, it's shifted one less because if the top byte is zero that indicates that 1 group of 2 is taken, not zero.
This is a solution for 32-bit data using classical bit-twiddling techniques. The intermediate computation requires 64-bit arithmetic and logic operations. I have to tried to stick to portable operations as far as it was possible. Required is an implementation of the POSIX function ffsll to find the least-significant 1-bit in a 64-bit long long, and a custom function rev_bit_duos that reverses the bit-duos in a 32-bit integer. The latter could be replaced with a platform-specific bit-reversal intrinsic, such as the __rbit intrinsic on ARM platforms.
The basic observation is that if a bit-group with an equal number of 0-bits and 1-bits can be extracted, it must contain an even number of bits. This means we can examine the operand in 2-bit groups. We can further restrict ourselves to tracking whether each 2-bit increases (0b11), decreases (0b00) or leaves unchanged (0b01, 0b10) a running balance of bits. If we count positive and negative changes with separate counters, 4-bit counters will suffice unless the input is 0 or 0xffffffff, which can be handled separately. Based on comments to the question, these cases shouldn't occur. By subtracting the negative change count from the positive change count for each 2-bit group we can find at which group the balance becomes zero. There may be multiple such bit groups, we need to find the first one.
The processing can be parallelized by expanding each 2-bit group into a nibble that then can serve as a change counter. The prefix sum can be computed via integer multiply with an appropriate constant, which provides the necessary shift & add operations at each nibble position. Efficient ways for parallel nibble-wise subtraction are well-known, likewise there is a well-known technique due to Alan Mycroft for detecting zero-bytes that is trivially changeable to zero-nibble detection. POSIX function ffsll is then applied to find the bit position of that nibble.
Slightly problematic is the requirement for extraction of a left-most bit group, rather than a right-most, since Alan Mycroft's trick only works for finding the first zero-nibble from the right. Also, handling the prefix-sum for left-most bit group require use of a mulhi operation which may not be easily available, and may be less efficient than standard integer multiplication. I have addressed both of these issues by simply bit-reversing the original operand up front.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
/* Reverse bit-duos using classic binary partitioning algorithm */
inline uint32_t rev_bit_duos (uint32_t a)
{
uint32_t m;
a = (a >> 16) | (a << 16); // swap halfwords
m = 0x00ff00ff; a = ((a >> 8) & m) | ((a << 8) & ~m); // swap bytes
m = (m << 4)^m; a = ((a >> 4) & m) | ((a << 4) & ~m); // swap nibbles
m = (m << 2)^m; a = ((a >> 2) & m) | ((a << 2) & ~m); // swap bit-duos
return a;
}
/* Return the number of most significant (leftmost) bits that must be extracted
to achieve an equal count of 1-bits and 0-bits in the extracted bit group.
Return 0 if no such bit group exists.
*/
int solution (uint32_t x)
{
const uint64_t mask16 = 0x0000ffff0000ffffULL; // alternate half-words
const uint64_t mask8 = 0x00ff00ff00ff00ffULL; // alternate bytes
const uint64_t mask4h = 0x0c0c0c0c0c0c0c0cULL; // alternate nibbles, high bit-duo
const uint64_t mask4l = 0x0303030303030303ULL; // alternate nibbles, low bit-duo
const uint64_t nibble_lsb = 0x1111111111111111ULL;
const uint64_t nibble_msb = 0x8888888888888888ULL;
uint64_t a, b, r, s, t, expx, pc_expx, nc_expx;
int res;
/* common path can't handle all 0s and all 1s due to counter overflow */
if ((x == 0) || (x == ~0)) return 0;
/* make zero-nibble detection work, and simplify prefix sum computation */
x = rev_bit_duos (x); // reverse bit-duos
/* expand each bit-duo into a nibble */
expx = x;
expx = ((expx << 16) | expx) & mask16;
expx = ((expx << 8) | expx) & mask8;
expx = ((expx << 4) | expx);
expx = ((expx & mask4h) * 4) + (expx & mask4l);
/* compute positive and negative change counts for each nibble */
pc_expx = expx & ( expx >> 1) & nibble_lsb;
nc_expx = ~expx & (~expx >> 1) & nibble_lsb;
/* produce prefix sums for positive and negative change counters */
a = pc_expx * nibble_lsb;
b = nc_expx * nibble_lsb;
/* subtract positive and negative prefix sums, nibble-wise */
s = a ^ ~b;
r = a | nibble_msb;
t = b & ~nibble_msb;
s = s & nibble_msb;
r = r - t;
r = r ^ s;
/* find first nibble that is zero using Alan Mycroft's magic */
r = (r - nibble_lsb) & (~r & nibble_msb);
res = ffsll (r) / 2; // account for bit-duo to nibble expansion
return res;
}
/* Return the number of most significant (leftmost) bits that must be extracted
to achieve an equal count of 1-bits and 0-bits in the extracted bit group.
Return 0 if no such bit group exists.
*/
int reference (uint32_t x)
{
int count = 0;
int bits = 0;
uint32_t mask = 0x80000000;
do {
bits++;
if (x & mask) {
count++;
} else {
count--;
}
x = x << 1;
} while ((count) && (bits <= (int)(sizeof(x) * CHAR_BIT)));
return (count) ? 0 : bits;
}
int main (void)
{
uint32_t x = 0;
do {
uint32_t ref = reference (x);
uint32_t res = solution (x);
if (res != ref) {
printf ("x=%08x res=%u ref=%u\n\n", x, res, ref);
}
x++;
} while (x);
return EXIT_SUCCESS;
}
A possible solution (for 32-bit integers). I'm not sure if it can be improved / avoid the use of lookup tables. Here x is the input integer.
//Look-up table of 2^16 elements.
//The y-th is associated with the first 2 bytes y of x.
//If the wanted bit is in y, LUT1[y] is minus the position of the bit
//If the wanted bit is not in y, LUT1[y] is the number of ones in excess in y minus 1 (between 0 and 15)
LUT1 = ....
//Look-up talbe of 16 * 2^16 elements.
//The y-th element is associated to two integers y' and y'' of 4 and 16 bits, respectively.
//y' is the number of excess ones in the first byte of x, minus 1
//y'' is the second byte of x. The table contains the answer to return.
LUT2 = ....
if(LUT1[x>>16] < 0)
return -LUT1[x>>16];
return LUT2[ (LUT1[x>>16]<<16) | (x & 0xFFFF) ]
This requires ~1MB for the lookup tables.
The same idea also works using 4 lookup tables (one per byte of x). The requires more operations but brings down the memory to 12KB.
LUT1 = ... //2^8 elements
LUT2 = ... //8 * 2^8 elements
LUT3 = ... //16 * 2^8 elements
LUT3 = ... //24 * 2^8 elements
y = x>>24
if(LUT1[y] < 0)
return -LUT1[y];
y = (LUT1[y]<<8) | ((x>>16) & 0xFF);
if(LUT2[y] < 0)
return -LUT2[y];
y = (LUT2[y]<<8) | ((x>>8) & 0xFF);
if(LUT3[y] < 0)
return -LUT3[y];
return LUT4[(LUT2[y]<<8) | (x & 0xFF) ];
f(n)= 1 + 2 + 3 + ยท ยท + n
g(n) = 3(n^2) + nlogn
Determining f = O(g) or
f = โฆ(g) or f = ฮ(g)
.As per my effort and understanding one guess It might be f=O(g) as g(n) has a n^2 power which grows faster than n .
Another way : if divided both by n , f(n) will have a constant 1 and g(n) : nlogn which grows faster than constant 1 . so , f=O(g) .
Is that a correct answer?
What actually is scaling property of Big-O ?
How to prove : For any constant c > 0, cf(n) is O(f(n)).
Understanding so far :
cf(n) < (c + k)f(n) holds for all n > 0 and k > 0.
i. Constant factors are ignored.
ii. Only the powers and functions of n should be exploited
It is this ignoring of constant factors that motivates for such a
notation. Which proves f is O(f).
Is this explanation enough to prove that scaling property of Big-O ?
f(n)=O(g(n)) if there is a positive constant c such as.
|f(n)| <= c*|g(n)| for all n>=n(initial)
and since f(n)=(n(n-1))/2 ----> (n^2)
n^2<= n^2 + nlogn (ignore the constants), for all n>=1 then yes your answer is right.
I am new to Prolog and while I can understand the code, I find it hard to create a program. I am trying to create a function that takes an integer and return 2^(integer) example pow(4) returns 16 (2^4). I also need it to be in a loop to keep taking input until user inputs negative integer then it exits.
In this example, C is counter, X is user input, tried to include variable for output but cant think how to integrate it.
pow(0):- 0.
pow(1):- 2.
pow(X):-
X > 1,
X is X-1,
power(X),
C is X-1,
pow(X1),
X is 2*2.
pow(X):- X<0, C is 0.
pow(C).
You really need to read something about Prolog before trying to program in it. Skim through http://en.wikibooks.org/wiki/Prolog, for example.
Prolog doesn't have "functions": there are predicates. All inputs and outputs are via predicate parameters, the predicate itself doesn't return anything.
So pow(0):- 0. and pow(1):- 2. don't make any sense. What you want is pow(0, 0). and pow(1, 2).: let the first parameter be the input, and the second be the output.
X is X-1 also doesn't make sense: in Prolog variables are like algebra variables, X means the same value through the whole system of equations. Variables are basically write-once, and you have to introduce new variables in this and similar cases: X1 is X-1.
Hope that's enough info to get you started.
The [naive] recursive solution:
pow2(0,1) . % base case: any number raised to the 0 power is 1, by definition
pow2(N,M) :- % a positive integral power of 2 is computed thus:
integer(N) , % - verify than N is an inetger
N > 0 , % - verify that N is positive
N1 is N-1 , % - decrement N (towards zero)
pow2(N1,M1) , % - recurse down (when we hit zero, we start popping the stack)
M is M1*2 % - multiply by 2
. %
pow2(N,M) :- % negative integral powers of 2 are computed the same way:
integer(N) , % - verify than N is an integer
N < 0 , % - verify than N is negative
N1 is N+1 , % - increment N (towards zero).
pow2(N1,M) , % - recurse down (we we hit zero, we start popping the stack)
M is M / 2.0 % - divide by 2.
. % Easy!
The above, however, will overflow the stack when the recursion level is sufficiently high (ignoring arithmetic overflow issues). SO...
The tail-recursive solution is optimized away into iteration:
pow2(N,M) :- %
integer(N) , % validate that N is an integer
pow2(N,1,M) % invoke the worker predicate, seeding the accumulator with 1
. %
pow2(0,M,M) . % when we hit zero, we're done
pow2(N,T,M) :- % otherwise...
N > 0 , % - if N is positive,
N1 is N-1 , % - decrement N
T1 is T*2 , % - increment the accumulator
pow2(N1,T1,M) % - recurse down
. %
pow2(N,T,M) :- % otherwise...
N < 0 , % - if N is negative,
N1 is N+1 , % - increment N
T1 is T / 2.0 , % - increment the accumulator
pow2(N1,T1,M) % - recurse down
. %
If I have an FFT implementation of a certain size M (power of 2), how can I calculate the FFT of a set of size P=k*M, where k is a power of 2 as well?
#define M 256
#define P 1024
complex float x[P];
complex float X[P];
// Use FFT_M(y) to calculate X = FFT_P(x) here
[The question is expressed in a general sense on purpose. I know FFT calculation is a huge field and many architecture specific optimizations were researched and developed, but what I am trying to understand is how is this doable in the more abstract level. Note that I am no FFT (or DFT, for that matter) expert, so if an explanation can be laid down in simple terms that would be appreciated]
Here's an algorithm for computing an FFT of size P using two smaller FFT functions, of sizes M and N (the original question call the sizes M and k).
Inputs:
P is the size of the large FFT you wish to compute.
M, N are selected such that MN=P.
x[0...P-1] is the input data.
Setup:
U is a 2D array with M rows and N columns.
y is a vector of length P, which will hold FFT of x.
Algorithm:
step 1. Fill U from x by columns, so that U looks like this:
x(0) x(M) ... x(P-M)
x(1) x(M+1) ... x(P-M+1)
x(2) x(M+2) ... x(P-M+2)
... ... ... ...
x(M-1) x(2M-1) ... x(P-1)
step 2. Replace each row of U with its own FFT (of length N).
step 3. Multiply each element of U(m,n) by exp(-2*pi*j*m*n/P).
step 4. Replace each column of U with its own FFT (of length M).
step 5. Read out the elements of U by rows into y, like this:
y(0) y(1) ... y(N-1)
y(N) y(N+1) ... y(2N-1)
y(2N) y(2N+1) ... y(3N-1)
... ... ... ...
y(P-N) y(P-N-1) ... y(P-1)
Here is MATLAB code which implements this algorithm. You can test it by typing fft_decomposition(randn(256,1), 8);
function y = fft_decomposition(x, M)
% y = fft_decomposition(x, M)
% Computes FFT by decomposing into smaller FFTs.
%
% Inputs:
% x is a 1D array of the input data.
% M is the size of one of the FFTs to use.
%
% Outputs:
% y is the FFT of x. It has been computed using FFTs of size M and
% length(x)/M.
%
% Note that this implementation doesn't explicitly use the 2D array U; it
% works on samples of x in-place.
q = 1; % Offset because MATLAB starts at one. Set to 0 for C code.
x_original = x;
P = length(x);
if mod(P,M)~=0, error('Invalid block size.'); end;
N = P/M;
% step 2: FFT-N on rows of U.
for m = 0 : M-1
x(q+(m:M:P-1)) = fft(x(q+(m:M:P-1)));
end;
% step 3: Twiddle factors.
for m = 0 : M-1
for n = 0 : N-1
x(m+n*M+q) = x(m+n*M+q) * exp(-2*pi*j*m*n/P);
end;
end;
% step 4: FFT-M on columns of U.
for n = 0 : N-1
x(q+n*M+(0:M-1)) = fft(x(q+n*M+(0:M-1)));
end;
% step 5: Re-arrange samples for output.
y = zeros(size(x));
for m = 0 : M-1
for n = 0 : N-1
y(m*N+n+q) = x(m+n*M+q);
end;
end;
err = max(abs(y-fft(x_original)));
fprintf( 1, 'The largest error amplitude is %g\n', err);
return;
% End of fft_decomposition().
kevin_o's response worked quite well. I took his code and eliminated the loops using some basic Matlab tricks. It functionally is identical to his version
function y = fft_decomposition(x, M)
% y = fft_decomposition(x, M)
% Computes FFT by decomposing into smaller FFTs.
%
% Inputs:
% x is a 1D array of the input data.
% M is the size of one of the FFTs to use.
%
% Outputs:
% y is the FFT of x. It has been computed using FFTs of size M and
% length(x)/M.
%
% Note that this implementation doesn't explicitly use the 2D array U; it
% works on samples of x in-place.
q = 1; % Offset because MATLAB starts at one. Set to 0 for C code.
x_original = x;
P = length(x);
if mod(P,M)~=0, error('Invalid block size.'); end;
N = P/M;
% step 2: FFT-N on rows of U.
X=fft(reshape(x,M,N),[],2);
% step 3: Twiddle factors.
X=X.*exp(-j*2*pi*(0:M-1)'*(0:N-1)/P);
% step 4: FFT-M on columns of U.
X=fft(X);
% step 5: Re-arrange samples for output.
x_twiddle=bsxfun(#plus,M*(0:N-1)',(0:M-1))+q;
y=X(x_twiddle(:));
% err = max(abs(y-fft(x_original)));
% fprintf( 1, 'The largest error amplitude is %g\n', err);
return;
% End of fft_decomposition()
You could just use the last log2(k) passes of a radix-2 FFT, assuming the previous FFT results are from appropriately interleaved data subsets.
Well an FFT is basically a recursive type of Fourier Transform. It relies on the fact that as wikipedia puts it:
The best-known FFT algorithms depend upon the factorization of N, but there are FFTs with O(N log N) complexity for >all N, even for prime N. Many FFT algorithms only depend on the fact that e^(-2pi*i/N) is an N-th primitive root of unity, and >thus can be applied to analogous transforms over any finite field, such as number-theoretic transforms. Since the >inverse DFT is the same as the DFT, but with the opposite sign in the exponent and a 1/N factor, any FFT algorithm >can easily be adapted for it.
So this has pretty much already been done in the FFT. If you are talking about getting longer period signals out of your transform you are better off doing an DFT over the data sets of limited frequencies. There might be a way to do it from the frequency domain but IDK if anyone has actually done it. You could be the first!!!! :)