When to use logarithmic variables - regression

I want to make a regression. My dependent variable Y is a score of discrete numbers from 0 to 12. I have 6 independent variables. Four of them are in percent between 0.4 % and 40 %. The two other independent variables are not in percent. One of them measures years and is between 1.8 and 67. The other independent variable measures size and is between 0.008 and 5,117 (very high).
I don't want to use winsorizing, that's why I decided to use logarithm. My question now: Is it good when I log the two independent variables which are not in percent, because they have high vaules?
That means I would use: reg Y x1 x2 x3 x4 logx5 logx6
Why I would like to use log is because I want to look out for the outliers. I have not really outliers, but I think it's better to use log for comparision reasons.
Or do I just have to use log, when the distribution of some variables is skewed? Meaning log leads to a lower skewness and is thus used appropriately.
Thank you in advance.
Lukas

Related

Lasso Regression

For the lasso (linear regression with L1 regularization) with a fixed value of λ, it is necessary to
use cross–validation to select the best optimization algorithm.
I know for a fact that we can use cross validation to find optimal value of λ, but is it neccesary to use cross validation in case λ is fixed?
Any thoughts please?
Cross Validation isn't about if your Regularization Parameter is Fixed or not. Its more related to the R^2 metric.
Lets say you consider 100 records and divide your data into 5 sub-datasets , means each sub-data contains 20 records.
Now out of 5 sub-datas , there are 5 different ways to assign anyone of the sub-data as Cross-Validation (CV) Data.
For all these 5 scenarios, we can find out the R^2, and then find out the Average R^2.
This way, you can have a comparison of your R-score with the Average R-score.

Find the Relationship Between Two Logarithmic Equations

No idea if I am asking this question in the right place, but here goes...
I have a set of equations that were calculated based on numbers ranging from 4 to 8. So an equation for when this number is 5, one for when it is 6, one for when it is 7, etc. These equations were determined from graphing a best fit line to data points in a Google Sheet graph. Here is an example of a graph...
Example...
When the number is between 6 and 6.9, this equation is used: windGust6to7 = -29.2 + (17.7 * log(windSpeed))
When the number is between 7 and 7.9, this equation is used: windGust7to8 = -70.0 + (30.8 * log(windSpeed))
I am using these equations to create an image in python, but the image is too choppy since each equation covers a range from x to x.9. In order to smooth this image out and make it more accurate, I really would need an equation for every 0.1 change in number. So an equation for 6, a different equation for 6.1, one for 6.2, etc.
Here is an example output image that is created using the current equations:
So my question is: Is there a way to find the relationship between the two example equations I gave above in order to use that to create a smoother looking image?
This is not about logarithms; for the purposes of this derivation, log(windspeed) is a constant term. Rather, you're trying to find a fit for your mapping:
6 (-29.2, 17.7)
7 (-70.0, 30.8)
...
... and all of the other numbers you have already. You need to determine two basic search paramteres:
(1) Where in each range is your function an exact fit? For instance, for the first one, is it exactly correct at 6.0, 6.5, 7.0, or elsewhere? Change the left-hand column to reflect that point.
(2) What sort of fit do you want? You are basically fitting a pair of parameterized equations, one for each coefficient:
x y x y
6 -29.2 6 17.7
7 -70.0 7 30.8
For each of these, you want to find the coefficients of a good matching function. This is a large field of statistical and algebraic study. Since you have four ranges, you will have four points for each function. It is straightforward to fit a cubic equation to each set of points in Cartesian space. However, the resulting function may not be as smooth as you like; in such a case, you may well find that a 4th- or 5th- degree function fits better, or perhaps something exponential, depending on the actual distribution of your points.
You need to work with your own problem objectives and do a little more research into function fitting. Once you determine the desired characteristics, look into scikit for fitting functions to do the heavy computational work for you.

Plot power series gnuplot

I'd like to know how to plot power series (whose variable is x), but I don't even know where to start with.
I know it might not be possible plot infinite series, but it'd do as well plotting the sum of the first n terms.
Gnuplot has a sum function, which can be used inside the using statement to sum up several columns or terms. Together with the special file name + you can implement power series.
Consider the exponention function, which has a power series
\sum_{n=0}^\infty x^n/n!
So, we define a term as
term(x, n) = x**n/n!
Now we can plot the power series up to the n=5 term with
set xrange [0:4]
term(x, n) = x**n/n!
set samples 20
plot '+' using 1:(sum [n=0:5] term($1, n))
To plot the results when using 2 to 7 terms and compare it with the actual exp function, use
term(x, n) = x**n/n!
set xrange [-2:2]
set samples 41
set key left
plot exp(x), for [i=1:6] '+' using 1:(sum[t=0:i] term($1, t)) title sprintf('%d terms', i)
The easiest way that I can think of is to generate a file that has a column of x-values and a column of f(x) values, then just plot the table like you would any other data. A power series is continuous, so you can just connect the dots and have a fairly accurate representation (provided your dots are close enough together). Also, when evaluating f(x), you just sum up the first N terms (where N is big enough). Big enough means that the sum of the rest of the terms is smaller than whatever error you allow. (*If you want 3 good digits, then N needs to be large enough that the remaining sum is smaller than .001.)
You can pull out a calc II textbook to determine how to bound the error on the tail of the sum. A lot of calc classes briefly cover it, but students tend to feel like the error estimates are pointless (I know because I've taught the course a few times.) As an example, if you have an alternating series (whose terms are decreasing in absolute value), then the absolute value of the first term you omit (don't sum) is an upperbound on your error.
*This statement is not 100% true, it is slightly over simplified, but is correct for most practical purposes.

Temperature Scale in SA

First, this is not a question about temperature iteration counts or automatically optimized scheduling. It's how the data magnitude relates to the scaling of the exponentiation.
I'm using the classic formula:
if(delta < 0 || exp(-delta/tK) > random()) { // new state }
The input to the exp function is negative because delta/tK is positive, so the exp result is always less then 1. The random function also returns a value in the 0 to 1 range.
My test data is in the range 1 to 20, and the delta values are below 20. I pick a start temperature equal to the initial computed temperature of the system and linearly ramp down to 1.
In order to get SA to work, I have to scale tK. The working version uses:
exp(-delta/(tK * .001)) > random()
So how does the magnitude of tK relate to the magnitude of delta? I found the scaling factor by trial and error, and I don't understand why it's needed. To my understanding, as long as delta > tK and the step size and number of iterations are reasonable, it should work. In my test case, if I leave out the extra scale the temperature of the system does not decrease.
The various online sources I've looked at say nothing about working with real data. Sometimes they include the Boltzmann constant as a scale, but since I'm not simulating a physical particle system that doesn't help. Examples (typically with pseudocode) use values like 100 or 1000000.
So what am I missing? Is scaling another value that I must set by trial and error? It's bugging me because I don't just want to get this test case running, I want to understand the algorithm, and magic constants mean I don't know what's going on.
Classical SA has 2 parameters: startingTemperate and cooldownSchedule (= what you call scaling).
Configuring 2+ parameters is annoying, so in OptaPlanner's implementation, I automatically calculate the cooldownSchedule based on the timeGradiant (which is a double going from 0.0 to 1.0 during the solver time). This works well. As a guideline for the startingTemperature, I use the maximum score diff of a single move. For more information, see the docs.

Compressing a binary matrix

We were asked to find a way to compress a square binary matrix as much as possible, and if possible, to add redundancy bits to check and maybe correct errors.
The redundancy thing is easy to implement in my opinion. The complicated part is compressing the matrix. I thought about using run-length after reshaping the matrix to a vector because there will be more zeros than ones, but I only achieved a 40bits compression (we are working on small sizes) although I thought it'd be better.
Also, after run-length an idea was Huffman coding the matrix, but a dictionary must be sent in order to recover the original information.
I'd like to know what would be the best way to compress a binary matrix?
After reading some comments, yes #Adam you're right, the 14x14 matrix should be compressed in 128bits, so if I only use the coordinates (rows&cols) for each non-zero element, still it would be 160bits (since there are twenty ones). I'm not looking for an exact solution but for a useful idea.
You can only talk about compressing something if you have a distribution and a representation. That's the issue of the dictionary you have to send along: you always need some sort of dictionary of protocol to uncompress something. It just so happens that things like .zip and .mpeg already have those dictionaries/codecs. Even something as simple as Huffman-encoding is an algorithm; on the other side of the communication channel (you can think of compression as communication), the other person already has a bit of code (the dictionary) to perform the Huffman decompression scheme.
Thus you cannot even begin to talk about compressing something without first thinking "what kinds of matrices do I expect to see?", "is the data truly random, or is there order?", and if so "how can I represent the matrices to take advantage of order in the data?".
You cannot compress some matrices without increasing the size of other objects (by at least 1 bit). This is bad news if all matrices are equally probable, and you care equally about them all.
Addenda:
The answer to use sparse matrix machinery is not necessarily the right answer. The matrix could for example be represented in python as [[(r+c)%2 for c in range (cols)] for r in range(rows)] (a checkerboard pattern), and a sparse matrix wouldn't compress it at all, but the Kolmogorov complexity of the matrix is the above program's length.
Well, I know every matrix will have the same number of ones, so this is kind of deterministic. The only think I don't know is where the 1's will be. Also, if I transmit the matrix with a dictionary and there are burst errors, maybe the dictionary gets affected so... wouldnt be the resulting information corrupted? That's why I was trying to use lossless data compression such as run-length, the decoder just doesnt need a dictionary. --original poster
How many 1s does the matrix have as a fraction of its size, and what is its size (NxN -- what is N)?
Furthermore, this is an incorrect assertion and should not be used as a reason to desire run-length encoding (which still requires a program); when you transmit data over a channel, you can always add error-correction to this data. "Data" is just a blob of bits. You can transmit both the data and any required dictionaries over the channel. The error-correcting machinery does not care at all what the bits you transmit are for.
Addendum 2:
There are (14*14) choose 20 possible arrangements, which I assume are randomly chosen. If this number was larger than 128^2 what you're trying to do would be impossible. Fortunately log_2((14*14) choose 20) ~= 90bits < 128bits so it's possible.
The simple solution of writing down 20 numbers like 32,2,67,175,52,...,168 won't work because log_2(14*14)*20 ~= 153bits > 128bits. This would be equivalent to run-length encoding. We want to do something like this but we are on a very strict budget and cannot afford to be "wasteful" with bits.
Because you care about each possibility equally, your "dictionary"/"program" will simulate a giant lookup table. Matlab's sparse matrix implementation may work but is not guaranteed to work and is thus not a correct solution.
If you can create a bijection between the number range [0,2^128) and subsets of size 20, you're good to go. This corresponds to enumerating ways to descend the pyramid in http://en.wikipedia.org/wiki/Binomial_coefficient to the 20th element of row 196. This is the same as enumerating all "k-combinations". See http://en.wikipedia.org/wiki/Combination#Enumerating_k-combinations
Fortunately I know that Mathematica and Sage and other CAS software can apparently generate the "5th" or "12th" or arbitrarily numbered k-subset. Looking through their documentation, we come upon a function called "rank", e.g. http://www.sagemath.org/doc/reference/sage/combinat/subset.html
So then we do some more searching, and come across some arcane Fortran code like http://people.sc.fsu.edu/~jburkardt/m_src/subset/ksub_rank.m and http://people.sc.fsu.edu/~jburkardt/m_src/subset/ksub_unrank.m
We could reverse-engineer it, but it's kind of dense. But now we have enough information to search for k-subset rank unrank, which leads us to http://www.site.uottawa.ca/~lucia/courses/5165-09/GenCombObj.pdf -- see the section
"Generating k-subsets (of an n-set): Lexicographical
Ordering" and the rank and unrank algorithms on the next few pages.
In order to achieve the exact theoretically optimal compression, in the case of a uniformly random distribution of 1s, we must thus use this technique to biject our matrices to our output number of range <2^128. It just so happens that combinations have a natural ordering, known as ranking and unranking of combinations. You assign a number to each combination (ranking), and if you know the number you automatically know the combination (unranking). Googling k-subset rank unrank will probably yield other algorithms.
Thus your solution would look like this:
serialize the matrix into a list
e.g. [[0,0,1][0,1,1][1,0,0]] -> [0,0,1,0,1,1,1,0,0]
take the indices of the 1s:
e.g. [0,0,1,0,1,1,1,0,0] -> [3,5,6,7]
1 2 3 4 5 6 7 8 9 a k=4-subset of an n=9 set
take the rank
e.g. compressed = rank([3,5,6,7], n=9)
compressed==412 (or something, I made that up)
you're done!
e.g. 412 -binary-> 110011100 (at most n=9bits, less than 2^n=2^9=512)
to uncompress, unrank it
I'll get to 128 bits in a sec, first here's how you fit a 14x14 boolean matrix with exactly 20 nonzeros into 136 bits. It's based on the CSC sparse matrix format.
You have an array c with 14 4-bit counters that tell you how many nonzeros are in each column.
You have another array r with 20 4-bit row indices.
56 bits (c) + 80 bits (r) = 136 bits.
Let's squeeze 8 bits out of c:
Instead of 4-bit counters, use 2-bit. c is now 2*14 = 28 bits, but can't support more than 3 nonzeros per column. This leaves us with 128-80-28 = 20 bits. Use that space for array a4c with 5 4-bit elements that "add 4 to an element of c" specified by the 4-bit element. So, if a4c={2,2,10,15, 15} that means c[2] += 4; c[2] += 4 (again); c[10] += 4;.
The "most wasteful" distribution of nonzeros is one where the column count will require an add-4 to support 1 extra nonzero: so 5 columns with 4 nonzeros each. Luckily we have exactly 5 add-4s available.
Total space = 28 bits (c) + 20 bits
(a4c) + 80 bits (r) = 128 bits.
Your input is a perfect candidate for a sparse matrix. You said you're using Matlab, so you already have a good sparse matrix built for you.
spm = sparse(dense_matrix)
Matlab's sparse matrix implementation uses Compressed Sparse Columns, which has memory usage on the order of 2*(# of nonzeros) + (# of columns), which should be pretty good in your case of 20 nonzeros and 14 columns. Storing 20 values sure is better than storing 196...
Also remember that all matrices in Matlab are going to be composed of doubles. Just because your matrix can be stored as a 1-bit boolean doesn't mean Matlab won't stick it into a 64-bit floating point value... If you do need it as a boolean you're going to have to make your own type in C and use .mex files to interface with Matlab.
After thinking about this again, if all your matrices are going to be this small and they're all binary, then just store them as a binary vector (bitmask). Going off your 14x14 example, that requires 196 bits or 25 bytes (plus n, m if your dimensions are not constant). That same vector in Matlab would use 64 bits per element, or 1568 bytes. So storing the matrix as a bitmask takes as much space as 4 elements of the original matrix in Matlab, for a compression ratio of 62x.
Unfortunately I don't know if Matlab supports bitmasks natively or if you have to resort to .mex files. If you do get into C++ you can use STL's vector<bool> which implements a bitmask for you.