Comapring the rows of same dataframe - mysql

A B C
1 2 3
4 2 3
1 2 3
I want to compare row1 with row2 and row2 with row3 so on rown with row1.
If they are same I want to print it is as "same" or else "different" in another data frame
output for above table:
A B C
Different same same
Different same same
same same same
For the below code I'm getting
True or false as the output. I want to replace that with Different and same.
compare = t(combn(nrow(Data.matrix),2,FUN=function(x)we2009[x[1],]==Data.matrix[x[2],]))
rownames(compare) = combn(nrow(Data.matrix),2,FUN=function(x)paste0("seq",x[1],"_seq",x[2]))
View(compare)

there are plenty of options how to do that ,
since you add MySQL tag the easiest way is to do that with sql in case you have limited number of columns , you can also use package SQL in r
library(sqldf)
sqldf('select
case when a=b then 'same' else 'different' as a
case when b=c then 'same' else 'different' as b
case when c=a then 'same' else 'different' as c
from my_dataset'

Does this deliver the desired results?
data_test = data.frame(A = c(1,4,1), B = c(2,2,2), C = c(3,3,3))
# create shifted helper-columns
data_test_help = cbind(data_test, data_test[c(2:NROW(data_test), 1),])
# apply comparision on each row
t(apply(data_test_help,1, function(f) f[1:3] == f[4:6]))
# for same, different notation instead of true false
t(apply(data_test_help,1, function(f) ifelse(f[1:3] == f[4:6], "same", "Different")))

Related

How to calculate a probability vector and an observation count vector for a range of bins?

I want to test the hypothesis whether some 30 occurrences should fit a Poisson distribution.
#GNU Octave
X = [8 0 0 1 3 4 0 2 12 5 1 8 0 2 0 1 9 3 4 5 3 3 4 7 4 0 1 2 1 2]; #30 observations
bins = {0, 1, [2:3], [4:5], [6:20]}; #each bin can be single value or multiple values
I am trying to use Pearson's chi-square statistics here and coded the below function. I want a Poisson vector to contain corresponding Poisson probabilities for each bin and count the observations for each bin. I feel the loop is rather redundant and ugly. Can you please let me know how can I re-factor the function without the loop and make the whole calculation cleaner and more vectorized?
function result= poissonGoodnessOfFit(bins, observed)
assert(iscell(bins), "bins should be a cell array");
assert(all(cellfun("ismatrix", bins)) == 1, "bin entries either scalars or matrices");
assert(ismatrix(observed) && rows(observed) == 1, "observed data should be a 1xn matrix");
lambda_head = mean(observed); #poisson lambda parameter estimate
k = length(bins); #number of bin groups
n = length(observed); #number of observations
poisson_probability = []; #variable for poisson probability for each bin
observations = []; #variable for observation counts for each bin
for i=1:k
if isscalar(bins{1,i}) #this bin contains a single value
poisson_probability(1,i) = poisspdf(bins{1, i}, lambda_head);
observations(1, i) = histc(observed, bins{1, i});
else #this bin contains a range of values
inner_bins = bins{1, i}; #retrieve the range
inner_bins_k = length(inner_bins); #number of values inside
inner_poisson_probability = []; #variable to store individual probability of each value inside this bin
inner_observations = []; #variable to store observation counts of each value inside this bin
for j=1:inner_bins_k
inner_poisson_probability(1,j) = poisspdf(inner_bins(1, j), lambda_head);
inner_observations(1, j) = histc(observed, inner_bins(1, j));
endfor
poisson_probability(1, i) = sum(inner_poisson_probability, 2); #assign over the sum of all inner probabilities
observations(1, i) = sum(inner_observations, 2); #assign over the sum of all inner observation counts
endif
endfor
expected = n .* poisson_probability; #expected observations if indeed poisson using lambda_head
chisq = sum((observations - expected).^2 ./ expected, 2); #Pearson Chi-Square statistics
pvalue = 1 - chi2cdf(chisq, k-1-1);
result = struct("actual", observations, "expected", expected, "chi2", chisq, "pvalue", pvalue);
return;
endfunction
There's a couple of things worth noting in the code.
First, the 'scalar' case in your if block is actually identical to your 'range' case, since a scalar is simply a range of 1 element. So no special treatment is needed for it.
Second, you don't need to create such explicit subranges, your bin groups seem to be amenable to being used as indices into a larger result (as long as you add 1 to convert from 0-indexed to 1-indexed indices).
Therefore my approach would be to calculate the expected and observed numbers over the entire domain of interest (as inferred from your bin groups), and then use the bin groups themselves as 1-indices to obtain the desired subgroups, summing accordingly.
Here's an example code, written in the octave/matlab compatible subset of both languges:
function Result = poissonGoodnessOfFit( BinGroups, Observations )
% POISSONGOODNESSOFFIT( BinGroups, Observations) calculates the [... etc, etc.]
pkg load statistics; % only needed in octave; for matlab buy statistics toolbox.
assert( iscell( BinGroups ), 'Bins should be a cell array' );
assert( all( cellfun( #ismatrix, BinGroups ) ) == 1, 'Bin entries either scalars or matrices' );
assert( ismatrix( Observations ) && rows( Observations ) == 1, 'Observed data should be a 1xn matrix' );
% Define helpful variables
RangeMin = min( cellfun( #min, BinGroups ) );
RangeMax = max( cellfun( #max, BinGroups ) );
Domain = RangeMin : RangeMax;
LambdaEstimate = mean( Observations );
NBinGroups = length( BinGroups );
NObservations = length( Observations );
% Get expected and observed numbers per 'bin' (i.e. discrete value) over the *entire* domain.
Expected_Domain = NObservations * poisspdf( Domain, LambdaEstimate );
Observed_Domain = histc( Observations, Domain );
% Apply BinGroup values as indices
Expected_byBinGroup = cellfun( #(c) sum( Expected_Domain(c+1) ), BinGroups );
Observed_byBinGroup = cellfun( #(c) sum( Observed_Domain(c+1) ), BinGroups );
% Perform a Chi-Square test on the Bin-wise Expected and Observed outputs
O = Observed_byBinGroup; E = Expected_byBinGroup ; df = NBinGroups - 1 - 1;
ChiSquareTestStatistic = sum( (O - E) .^ 2 ./ E );
PValue = 1 - chi2cdf( ChiSquareTestStatistic, df );
Result = struct( 'actual', O, 'expected', E, 'chi2', ChiSquareTestStatistic, 'pvalue', PValue );
end
Running with your example gives:
X = [8 0 0 1 3 4 0 2 12 5 1 8 0 2 0 1 9 3 4 5 3 3 4 7 4 0 1 2 1 2]; % 30 observations
bins = {0, 1, [2:3], [4:5], [6:20]}; % each bin can be single value or multiple values
Result = poissonGoodnessOfFit( bins, X )
% Result =
% scalar structure containing the fields:
% actual = 6 5 8 6 5
% expected = 1.2643 4.0037 13.0304 8.6522 3.0493
% chi2 = 21.989
% pvalue = 0.000065574
A general comment about the code; it is always preferable to write self-explainable code, rather than code that does not make sense by itself in the absence of a comment. Comments generally should only be used to explain the 'why', rather than the 'how'.

Function on each row of pandas DataFrame but not generating a new column

I have a data frame in pandas as follows:
A B C D
3 4 3 1
5 2 2 2
2 1 4 3
My final goal is to produce some constraints for an optimization problem using the information in each row of this data frame so I don't want to generate an output and add it to the data frame. The way that I have done that is as below:
def Computation(row):
App = pd.Series(row['A'])
App = App.tolist()
PT = [row['B']] * len(App)
CS = [row['C']] * len(App)
DS = [row['D']] * len(App)
File3 = tuplelist(zip(PT,CS,DS,App))
return m.addConstr(quicksum(y[r,c,d,a] for r,c,d,a in File3) == 1)
But it does not work out by calling:
df.apply(Computation, axis = 1)
Could you please let me know if there is anyway to do this process?
.apply will attempt to convert the value returned by the function to a pandas Series or DataFrame. So, if that is not your goal, you are better off using .iterrows:
# In pseudocode:
for row in df.iterrows:
constrained = Computation(row)
Also, your Computation can be expressed as:
def Computation(row):
App = list(row['A']) # Will work as long as row['A'] is iterable
# For the next 3 lines, see note below.
PT = [row['B']] * len(App)
CS = [row['C']] * len(App)
DS = [row['D']] * len(App)
File3 = tuplelist(zip(PT,CS,DS,App))
return m.addConstr(quicksum(y[r,c,d,a] for r,c,d,a in File3) == 1)
Note: [<list>] * n will create n pointers or references to the same <list>, not n independent lists. Changes to one copy of n will change all copies in n. If that is not what you want, use a function. See this question and it's answers for details. Specifically, this answer.

Boolean Logic with If case

There are 2 cases given in the question and on that basis we have to answer.
Cases:
if((NOT(value>=1) OR NOT(value<=10))
if((NOT(value>=1) AND NOT(value<=10))
Now the questions are:
which case you are going to use if the given value either is 1 or 10 ?
which case you are going to use if the given value must be 1 or 10 ?
the problem is whether I takes 1 or 10 I am getting same answer in both the cases. That is if(0) and thus if statement is false in both the cases.?
(NOT(value>=1) OR NOT(value<=10)) = (value < 1) OR (value > 10)
This case is true for [-Infinity ... 0] or [11 ... + Infinity]
Is false for 1 or 10
((NOT(value>=1) AND NOT(value<=10)) = (value < 1) AND (value > 10)
This case is always false, as no number can be smaller than 1 and bigger than 10 the same time.

Data set too large to load into memory for processing

I have a bigger rapidly growing data set of around 4 million rows, in order to define and exclude the outliers (for statistics / analytics usage) I need the algorithm to consider all entries in this data set. However this is too much data to load into memory and my system chokes. I'm currently using this to collect and process the data:
#scoreInnerFences = innerFence Post.where( :source => 1 ).
order( :score ).
pluck( :score )
Using the typical divide and conquer method won't work, I don't think because every entry has to be considered to keep my outlier calculation accurate. How can this be achieved efficiently?
innerFence identifies the lower quartile and upper quartile of the data set, then uses those findings to calculate the outliers. Here is the (yet to be refactored, non-DRY) code for this:
def q1(s)
q = s.length / 4
if s.length % 2 == 0
return ( s[ q ] + s[ q - 1 ] ) / 2
else
return s[ q ]
end
end
def q2(s)
q = s.length / 4
if s.length % 2 == 0
return ( s[ q * 3 ] + s[ (q * 3) - 1 ] ) / 2
else
return s[ q * 3 ]
end
end
def innerFence(s)
q1 = q1(s)
q2 = q2(s)
iq = (q2 - q1) * 3
if1 = q1 - iq
if2 = q2 + iq
return [if1, if2]
end
This is not the best way, but it is an easy way:
Do several querys. First you count the number of scores:
q = Post.where( :source => 1 ).count
then you do your calculations
then you fetch the scores
q1 = Post.where( :source => 1 ).
reverse_order(:score).
select("avg(score) as score").
offset(q).limit((q%2)+1)
q2 = Post.where( :source => 1 ).
reverse_order(:score).
select("avg(score) as score").
offset(q*3).limit((q%2)+1)
The code is probably wrong but I'm sure you get the idea.
For large datasets, I sometimes drop down below ActiveRecord. It's a memory hog, even I imagine, using pluck. Of course it's less portable, but sometimes it's worth it.
scores = Post.connection.execute('select score from posts where score > 1 order by score').map(&:first)
Don't know if that will help enough for 4 million record. If not, maybe look at a stored procedure?

Mysql Precedence Logic

Any explanation to the following queries :
Select x FROM y WHERE a = 1 OR a = 2 AND (b = 1 OR b = 2)
why it doesn't return the correct info while this return the correct info :
Select x FROM y WHERE (a = 1 OR a = 2) AND (b = 1 OR b = 2)
Am i missing something here ?
X Y (X OR Y) X OR Y
1 0 1 1
0 1 1 1
1 1 1 1
0 0 0 0
I know in term of precedence the () have priority , but why should i add them the the first part of the query ?
Correct me if I'm wrong
Thank you
AND has a higher precedence than OR so your first query is equivalent to this:
Select x FROM y WHERE a = 1 OR a = 3 OR (a = 2 AND (b = 1 OR b = 2))
Which is not equivalent to
Select x FROM y WHERE (a = 1 OR a = 2 OR a = 3) AND (b = 1 OR b = 2)
I guess you forgot the a = 3 part in your first query.
Operator precedence in MySQL
Because ambiguity is an undesirable trait?
Also, the optimizer will re-order your WHERE Conditions if it thinks it will perform better. Your ambiguity will, therefore, cause different results depending on how/what it evaluates first.
Always be explicit with your intentions.
Using parentheses in your WHERE clause does not just affect precedence but also groups predicates together. In your example the difference in results is more a matter of grouping rather than precedence.
You could think of this: (pN = predicate expression)
WHERE a = 1 OR a = 2 AND (b = 1 OR b = 2)
as:
WHERE p1 OR p2 AND p3
And this:
WHERE (a = 1 OR a = 2 OR a = 3) AND (b = 1 OR b = 2)
as:
WHERE p1 AND p2
and so it becomes clear that the results could be quite different.