Compare large sets of weighted tag clouds? - mysql

I have thousands of large sets of tag cloud data; I can retrieve a weighted tag cloud for each set with a simple select/group statement, for example:
SELECT tag, COUNT( * ) AS weight
FROM tags
WHERE set_id = $set_id
GROUP BY tag
ORDER BY COUNT( * ) DESC
What I'd like to know is this: what is the best way to compare weighted tag clouds and find the other sets that are most similar, taking the weight (the number of occurrences within the set) into account and possibly even computing a comparison score, all in one somewhat efficient statement?
I found the web to be lacking quality literature on the topic. I think the question is broadly relevant, so I have abstracted my example to keep it generally applicable.

First you need to normalize every tag cloud as you would a vector, treating a tag cloud as an n-dimensional vector in which every dimension represents a word and its value represents the weight of that word.
You can do this by calculating the norm (or magnitude) of every cloud, which is the square root of the sum of the squared weights:
m = sqrt( w1*w1 + w2*w2 + ... + wn*wn )
Then you generate your normalized tag cloud by dividing each weight by the norm of the cloud.
After this you can easily calculate similarity using a scalar product between the clouds: just multiply every pair of corresponding components and add all of the products together. E.g.:
v1 = { a: 0.12, b: 0.31, c: 0.17, e: 0.11 }
v2 = { a: 0.21, b: 0.11, d: 0.08, e: 0.28 }
similarity = v1.a*v2.a + v1.b*v2.b + 0 + 0 + v1.e*v2.e
If one vector has a tag that the other one doesn't, then that specific product is simply 0.
This similarity is within the range [0,1]: 0 means no correlation, while 1 means equality.
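For concreteness, here is a minimal Python sketch of this normalize-then-dot-product (cosine) similarity; it is my own illustration with tag clouds as {tag: weight} dicts, not the asker's SQL (a pure-MySQL version would typically involve a self-join on the tags table).

import math

def normalize(cloud):
    # divide each weight by the cloud's norm (magnitude)
    norm = math.sqrt(sum(w * w for w in cloud.values()))
    return {tag: w / norm for tag, w in cloud.items()}

def similarity(cloud1, cloud2):
    v1, v2 = normalize(cloud1), normalize(cloud2)
    # tags missing from the other cloud contribute 0 to the product
    return sum(w * v2.get(tag, 0.0) for tag, w in v1.items())

print(similarity({"a": 3, "b": 1}, {"a": 1, "c": 2}))  # ~0.42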


Is it possible that the number of basic functions is more than the number of observations in spline regression?

I want to run a regression spline with a B-spline basis function. The data is structured in such a way that the number of observations is less than the number of basis functions, and yet I get a good result.
But I'm not sure whether this is a valid approach.
Do I have to have more rows than columns, as in linear regression?
Thank you.
When the number of observations, N, is small, it's easy to fit a model with basis functions and get a low squared error. If you have more basis functions than observations, then you can have 0 residuals (a perfect fit to the data). But that is not to be trusted, because it may not be representative of new data points. So yes, you want to have more observations than columns. Mathematically, you cannot properly estimate more than N coefficients, because the design matrix becomes collinear. As a rule of thumb, 15-20 observations are usually needed for each additional variable / spline term.
But this isn't always the case, such as in genetics, where we have hundreds of thousands of potential variables and a small sample size. In that case, we turn to tools that help with a small sample size, such as cross validation and the bootstrap.
Bootstrap (i.e. resample with replacement) your data points and refit the splines many times (100 will probably do). Then average the splines and use that as the final spline function. Or you could do cross validation, where you train on a smaller dataset (say 70%) and then test on the remaining data. A sketch of the bootstrap idea follows.
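Here is a minimal bootstrap-and-average sketch, in Python with scipy rather than the R tools mentioned below; the toy data and smoothing settings are my own assumptions for illustration.

import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 40))                   # toy data: 40 observations
y = np.sin(x) + rng.normal(0, 0.3, x.size)

grid = np.linspace(0, 10, 200)
fits = []
for _ in range(100):                                  # ~100 resamples will do
    idx = np.sort(rng.integers(0, x.size, x.size))    # resample with replacement
    xs, ys = x[idx], y[idx]
    keep = np.concatenate(([True], np.diff(xs) > 0))  # spline needs strictly increasing x
    fits.append(UnivariateSpline(xs[keep], ys[keep], k=3)(grid))

avg_fit = np.mean(fits, axis=0)                       # the final averaged spline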
In the functional data analysis framework, there are packages in R that construct and fit spline bases (cubic, B-spline, etc.). These packages include refund, fda, and fda.usc.
For example,
B <- smooth.construct.cc.smooth.spec(object = list(term = "day.t", bs.dim = 12, fixed = FALSE, dim = 1, p.order = NA, by = NA), data = list(day.t = 200:320), knots = list())
constructs a spline basis of dimension 12 (over time, day.t), but you can also use these packages to help choose the basis dimension.

regarding one code segment in computing log_sum_exp

In this tutorial on using PyTorch to implement a BiLSTM-CRF, the author implements the following function. Specifically, I do not quite understand what max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1]) tries to do, or which mathematical formula it corresponds to.
# Compute log sum exp in a numerically stable way for the forward algorithm
def log_sum_exp(vec):
    max_score = vec[0, argmax(vec)]
    max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
    return max_score + \
        torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))
Looking at the code, it seems like vec has a shape of (1, n).
Now we can follow the code line by line:
max_score = vec[0, argmax(vec)]
Indexing vec at row 0 and column argmax(vec) is just a fancy way of taking the maximal value of vec. So, max_score is (as the name suggests) the maximal value of vec.
max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
Next, we want to subtract max_score from each of the elements of vec.
To do so the code creates a vector of the same shape as vec with all elements equal to max_score.
First, max_score is reshaped to have two dimensions using the view command; then the reshaped 2-d tensor is "stretched" to have length n using the expand command.
Finally, the log sum exp is computed robustly:
return max_score + \
    torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))
The validity of this computation follows from the identity log(sum_i exp(x_i)) = m + log(sum_i exp(x_i - m)), which holds for any constant m (here m = max_score).
The rationale behind it is that exp(x) can "explode" for x > 0, therefore, for numerical stability, it is best to subtract the maximal value before taking exp.
As a side note, I think a slightly more elegant way to do the same computation, taking advantage of broadcasting, would be
max_score, _ = vec.max(dim=1, keepdim=True) # take max along second dimension
lse = max_score + torch.log(torch.sum(torch.exp(vec - max_score), dim=1))
return lse
Also note that log sum exp is already implemented by PyTorch: torch.logsumexp.
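As a quick sanity check (my own example, not from the tutorial), the stable version stays finite where the naive formula overflows:

import torch

vec = torch.tensor([[1000.0, 1000.5, 999.0]])        # large scores

naive = torch.log(torch.sum(torch.exp(vec)))         # exp overflows to inf
max_score, _ = vec.max(dim=1, keepdim=True)
stable = max_score + torch.log(torch.sum(torch.exp(vec - max_score), dim=1))

print(naive)                          # tensor(inf)
print(stable)                         # finite, ~1001.10
print(torch.logsumexp(vec, dim=1))    # same value as the stable version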

Backpropagation on Two Layered Networks

I have been following the cs231n lectures from Stanford and trying to complete the assignments on my own, sharing my solutions both on GitHub and my blog. But I'm having a hard time understanding how to model backpropagation. I mean, I can code modular forward and backward passes, but what bothers me is if I have the model below: a two-layer neural network (diagram omitted).
Let's assume that our loss function here is a softmax loss. In my modular softmax_loss() function I am calculating the loss and the gradient with respect to the scores (dSoft = dL/dY). After that, when I follow the graph backwards, say for b2, db2 would be equal to dSoft*1, and dW2 would be equal to dSoft*dX2 (the outputs of the ReLU gate). What's the chain rule here? Why isn't dSoft equal to 1? Because dL/dL would be 1?
The softmax loss function outputs a number given an input x.
What dSoft means is that you're computing the derivative of the function softmax(x) with respect to the input x. Then to calculate the derivative with respect to W of the last layer you use the chain rule, i.e. dL/dW = dsoftmax/dx * dx/dW. Note that x = W*x_prev + b, where x_prev is the input to the last node. Therefore dx/dW is just x_prev and dx/db is just 1, which means that dL/dW (or simply dW) is dsoftmax/dx * x_prev, and dL/db (or simply db) is dsoftmax/dx * 1. Note that here dsoftmax/dx is the dSoft we defined earlier.
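To make the chain rule concrete, here is a minimal NumPy sketch (my own illustration, assuming cs231n-style shapes; the names X2, W2, b2 and softmax_loss are hypothetical), where dscores plays the role of dSoft:

import numpy as np

def softmax_loss(scores, y):
    shifted = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    N = scores.shape[0]
    loss = -np.log(probs[np.arange(N), y]).mean()
    dscores = (probs - np.eye(scores.shape[1])[y]) / N     # dL/dscores, i.e. "dSoft"
    return loss, dscores

rng = np.random.default_rng(0)
N, H, C = 5, 4, 3
X2 = np.maximum(rng.normal(size=(N, H)), 0)    # ReLU outputs feeding the last layer
W2 = rng.normal(size=(H, C))
b2 = np.zeros(C)
y = rng.integers(0, C, size=N)

loss, dscores = softmax_loss(X2 @ W2 + b2, y)
dW2 = X2.T @ dscores        # chain rule: dL/dW2 = x_prev^T * dL/dscores
db2 = dscores.sum(axis=0)   # chain rule: dL/db2 = 1 * dL/dscores, summed over the batch

dSoft is not 1 because it is dL/dscores, not dL/dL: the seed dL/dL = 1 has already been pushed through the softmax loss to produce it.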

Function to dampen a value

I have a list of documents, each with a relevance score for a search query. I need older documents to have their relevance score dampened, to introduce their date into the ranking process. I already tried fiddling with functions such as 1/(1 + date_difference), but the reciprocal function drops off too sharply for recent dates that are close together.
I was thinking maybe a mathematical function with range (0..1) and domain (0..x) to scale their score, where the x-axis is the age of a document. It's best to explain what I further need from the function by an image: [diagram of the desired decay curve omitted]
Decaying behavior is often modeled well by an exponential function (many decaying processes in nature follow one). You would use two positive parameters A and B and get
y(x) = A exp(-B x)
Since you want a y-range of [0,1], set A = 1. Larger B gives faster decay.
If a simple 1/(1+x) decreases too quickly too soon, a sigmoid function like 1/(1+e^-x) or the error function might be better suited to your purpose. Let the current date be somewhere in the negative numbers for such a function, and you can get a value that is current for some configurable time and then decreases towards a base value.
log((x+1)-age_of_document)
Where the base of the logarithm is (x+1). Note that x is as per your diagram: the "threshold". If the age of the document is greater than x, the score goes negative. Multiply by the maximum possible score to introduce scaling.
E.g. domain = (0,10) with a maximum score of 10: 10*log(11 - age)/log(11)
A bit late, but as thiton says, you might want to use a sigmoid function instead, since it has a "floor" value for your long tail data points. E.g.:
0.8/(1+5^(x-3)) + 0.2
You can adjust the constants 5 and 3 to control the slope of the curve. The 0.2 is where the floor will be.
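For comparison, here is a small Python sketch (my own; the parameter values are arbitrary assumptions) evaluating the three proposed shapes side by side: exponential decay, the log-based score, and the sigmoid with a floor.

import math

def exp_decay(age, B=0.3):                  # y = A*exp(-B*age) with A = 1
    return math.exp(-B * age)

def log_decay(age, threshold=10):           # log base (threshold+1): 1 at age 0, 0 at age == threshold
    return math.log(threshold + 1 - age, threshold + 1)

def sigmoid_floor(age, base=5, shift=3, floor=0.2):
    return (1 - floor) / (1 + base ** (age - shift)) + floor

for age in range(0, 11):
    print(age, round(exp_decay(age), 3), round(log_decay(age), 3), round(sigmoid_floor(age), 3))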

mysql/stats: Weighting an average to accentuate differences from the mean

This is for a new feature on http://cssfingerprint.com (see /about for general info).
The feature looks up the sites you've visited in a database of site demographics, and tries to guess what your demographic stats are based on that.
All my demographics are in 0..1 probability format, not ratios or absolute numbers or the like.
Essentially, you have a large number of data points that each pull your estimate towards their own demographics. However, just taking the average is poor, because adding in a lot of generic data drags the number down.
For example, suppose you've visited sites S0..S50. All except S0 are 48% female; S0 is 100% male. If I'm guessing your gender, I want to have a value close to 100%, not just the 49% that a straight average would give.
Also, consider that most demographics (i.e. everything other than gender) do not have an average of 50%. For example, the average probability of having kids aged 0-17 is ~37%. The more a given site's demographics differ from this average (e.g. maybe it's a site for parents, or for child-free people), the more it should count in my guess of your status.
What's the best way to calculate this?
For extra credit: what's the best way to calculate this, that is also cheap & easy to do in mysql?
ETA: I think that something approximating what I want is Φ(AVG(z-score ^ 2, sign preserved)). But I'm not sure if this is a good weighting function.
(Φ is the standard normal distribution function - http://en.wikipedia.org/wiki/Standard_normal_distribution#Definition)
A good framework for these kinds of calculations is Bayesian inference. You have a prior distribution of the demographics, e.g. 50% male, 37% childless, etc. Preferably you would have it multivariate (10% male childless 0-17 Caucasian ...), but you can start with one variable at a time.
Starting from this prior, each site contributes new information about the likelihood of a demographic category, and you get a posterior estimate that informs your final guess. Under some independence assumptions the updating formula is as follows:
posterior odds = (prior odds) * (site likelihood ratio),
where odds = p/(1-p), and the likelihood ratio is a multiplier modifying the odds after visiting the site. There are various formulas for it, but in this case I would just apply the odds formula above to the site's population and the general population to calculate it.
For example, for a site that has 35% of its visitors in the "under 20" age group, which represents 20% of the population, the site likelihood ratio would be
LR = (0.35/0.65) / (0.2/0.8) = 2.154
so visiting this site would raise the odds of being "under 20" 2.154-fold.
A site that is 100% male would have an infinite LR, but you would probably want to limit it somewhat by, say, using only 99.9% male. A site that is 50% male would have an LR of 1, so it would not contribute any information on gender distribution.
Suppose you start knowing nothing about a person: his or her odds of being "under 20" are 0.2/0.8 = 0.25. Suppose the first site has LR = 2.154 for this outcome; now the odds of being "under 20" become 0.25*2.154 = 0.538 (corresponding to a probability of 35%). If the second site has the same LR, the posterior odds become 1.16, which is already 54% (probability = odds/(1+odds)), and so on. At the end you would pick the category with the highest posterior probability.
There are loads of caveats to these calculations - for example, the independence assumption is likely wrong - but they provide a good start.
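Here is a minimal Python sketch (my own) of that sequential odds updating, reproducing the "under 20" numbers above:

def likelihood_ratio(site_p, population_p):
    return (site_p / (1 - site_p)) / (population_p / (1 - population_p))

prior_p = 0.20                    # population share of "under 20"
odds = prior_p / (1 - prior_p)    # prior odds = 0.25

for site_p in (0.35, 0.35):       # two sites, each 35% "under 20"
    odds *= likelihood_ratio(site_p, prior_p)            # LR = 2.154
    print(round(odds, 3), round(odds / (1 + odds), 3))   # 0.538 -> 0.35, then 1.16 -> 0.537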
The naive Bayesian formula for your case looks like this:
SELECT probability
FROM (
    SELECT @apriori := CAST(@apriori * ratio / (@apriori * ratio + (1 - @apriori) * (1 - ratio)) AS DECIMAL(30, 30)) AS probability,
           @step := @step + 1 AS step
    FROM (
        SELECT @apriori := 0.5, @step := 0
    ) vars,
    (
        SELECT 0.99 AS ratio
        UNION ALL SELECT 0.48
        UNION ALL SELECT 0.48
        UNION ALL SELECT 0.48
        UNION ALL SELECT 0.48
        UNION ALL SELECT 0.48
        UNION ALL SELECT 0.48
        UNION ALL SELECT 0.48
    ) q
) q2
ORDER BY step DESC
LIMIT 1
Quick 'n' dirty: get a male score by multiplying the male probabilities, and a female score by multiplying the female probabilities. Predict the larger. (Actually, don't multiply; sum the log of each probability instead.) I think this is a maximum likelihood estimator if you make the right (highly unrealistic) assumptions.
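A tiny sketch (my own illustration) of that quick 'n' dirty log-probability scoring, on the S0..S50 gender example from the question:

import math

male_probs = [0.52] * 50 + [0.999]   # P(male) per visited site; capped at 0.999 to avoid log(0)
female_probs = [1 - p for p in male_probs]

male_score = sum(math.log(p) for p in male_probs)
female_score = sum(math.log(p) for p in female_probs)
print("male" if male_score > female_score else "female")   # the one 100% male site dominates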
The standard formula for calculating the weighted mean is covered in other questions on this site. I think you could look into those approaches and then work out how to calculate your weights.
In your gender example above you could adopt something along the lines of a set of weights {1, ..., 0, ..., 1}: a linear decrease from 1 to 0 as the male share goes from 0% to 50%, and then a corresponding increase back up to 1 at 100%. If you want the effect to be skewed in favour of the outlying values, then you can easily come up with an exponential or trigonometric function that provides a different set of weights. A normal distribution curve would also do the trick. A sketch follows.
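As an illustration (my own sketch, not a standard formula), a weighted mean whose weights grow with distance from the population mean behaves as described on the gender example:

def weighted_mean(values, mean=0.5, power=1):
    # weight each value by its distance from the population mean;
    # power > 1 skews the effect further towards outlying values
    weights = [abs(v - mean) ** power for v in values]
    total = sum(weights)
    if total == 0:
        return mean                # every site sits exactly at the mean
    return sum(w * v for w, v in zip(weights, values)) / total

sites = [0.52] * 50 + [1.0]        # 50 near-average sites plus one 100% male site
print(weighted_mean(sites))        # 0.68, well above the plain average of ~0.53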