Move values in a data frame from one column to another based on matching criteria - json

I am receiving output from a JSON object; however, the JSON returns three fields, sometimes two, and sometimes one, depending on the input. As a result I have a data frame which looks like this:
mixed score type
1 1 0.0183232 positive
2 neutral <NA> <NA>
3 -0.566558 negative <NA>
4 0.473484 positive <NA>
5 0.856743 positive <NA>
6 -0.422655 negative <NA>
Mixed can take values of 1 or 0
Score can take a positive or negative value between -1 and +1
Type can take a value of either positive, negative or neutral
I'm wondering how I can rearrange the values in the data.frame so that they are in the correct column i.e.
mixed score type
1 1 0.018323 positive
2 <NA> <NA> neutral
3 <NA> -0.566558 negative
4 <NA> 0.473484 positive
5 <NA> 0.856743 positive
6 <NA> -0.422655 negative

Not an elegant solution at all, but the best I could come up with.
### Seed the initial data frame
mixed = c("1", "neutral", "0.473484", "-0.566558", "0.856743", "-0.422655", "-0.692675")
score = c("0.0183232", "0", "positive", "negative", "positive", "negative", "negative")
type = c("positive", "0", "0", "0", "0", "0", "0")
# Keep the columns as character so the string comparisons below work
df = data.frame(mixed, score, type, stringsAsFactors = FALSE)
# Create a second data frame (3 columns, same number of rows as df) for the output
df.2 <- as.data.frame(matrix(0, ncol = 3, nrow = nrow(df)))
names(df.2) <- c("mixed", "score", "type")
df.2
# Check each column cell by cell and copy matching values to the shadow data frame
# Set all <NA> values to 0
df[is.na(df)] <- 0
# Set the iteration length to the number of rows
l <- nrow(df)
# Checked the mixed column for '1' and then copy it to the new frame
for(l in 1:l)
if (df$mixed[l] == '1')
{
df.2$mixed[l] <-df$mixed[l]
}
# Checked the mixed column for a value which is less than 1 and then copy it to the score column in the new frame
for(l in 1:l)
if (df$mixed[l] < '1')
{
df.2$score[l] <-df$mixed[l]
}
# Checked the mixed column for positive/negative/neutral and then copy it to the type column in the new frame
for(l in 1:l)
if (df$mixed[l] == "positive" | df$mixed[l] == "negative" | df$mixed[l] == "neutral")
{
df.2$type[l] <-df$mixed[l]
}
# Checked the score column for '1' and then copy it to mixed column in the new frame
for(l in 1:l)
if (df$score[l] == '1')
{
df.2$mixed[l] <-df$score[l]
}
# Checked the score column for a value which is less than 1 and then copy it to the score column in the new frame
for(l in 1:l)
if (df$score[l] < '1')
{
df.2$score[l] <-df$score[l]
}
# Checked the score column for positive/negative/neutral and then copy it to the type column in the new frame
for(l in 1:l)
if (df$score[l] == "positive" | df$score[l] == "negative" | df$score[l] == "neutral")
{
df.2$type[l] <-df$score[l]
}
# Checked the type column for '1' and then copy it to mixed column in the new frame **This one works***
for(l in 1:l)
if (df$type[l] == '1')
{
df.2$mixed[l] <-df$type[l]
}
# Check the type column for a value which is less than 1 and copy it to the score column in the new frame; the '0' placeholders are skipped so they do not erase scores already copied
for(l in 1:l)
if (df$type[l] != '0' && df$type[l] < '1')
{
df.2$score[l] <-df$type[l]
}
# Checked the type column for positive/negative/neutral and then copy it to the type column in the new frame **This one works***
for(l in 1:l)
if (df$type[l] == "positive" | df$type[l] == "negative" | df$type[l] == "neutral")
{
df.2$type[l] <-df$type[l]
}
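The loops above all apply the same three tests to every cell: is the value the mixed flag, a number, or one of the labels positive/negative/neutral. Here is a minimal sketch of that classify-by-value idea, written in Python rather than R purely for illustration. The function name `sort_row`, the use of `None` for a missing cell, and the rule that a bare "0" or "1" is always the mixed flag are assumptions, not part of the answer above.

```python
LABELS = {"positive", "negative", "neutral"}

def sort_row(values):
    """Place each raw value of one row into the column it belongs to."""
    row = {"mixed": None, "score": None, "type": None}
    for value in values:
        if value is None:
            continue                      # missing cell (<NA>)
        if value in LABELS:
            row["type"] = value
        elif value in ("0", "1"):
            row["mixed"] = int(value)     # assumption: a bare 0/1 is the flag
        else:
            row["score"] = float(value)   # any other number is the score
    return row

raw = [
    ["1", "0.0183232", "positive"],
    ["neutral", None, None],
    ["-0.566558", "negative", None],
]
fixed = [sort_row(r) for r in raw]
for r in fixed:
    print(r)
```

Because each value is classified on its own, it does not matter which column it arrived in, so no per-column loop is needed. A score of exactly 0 or 1 would be misread as the flag under the assumption above.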

How to calculate a probability vector and an observation count vector for a range of bins?

I want to test the hypothesis whether some 30 occurrences should fit a Poisson distribution.
#GNU Octave
X = [8 0 0 1 3 4 0 2 12 5 1 8 0 2 0 1 9 3 4 5 3 3 4 7 4 0 1 2 1 2]; #30 observations
bins = {0, 1, [2:3], [4:5], [6:20]}; #each bin can be single value or multiple values
I am trying to use Pearson's chi-square statistic here and coded the function below. I want a Poisson vector to contain the corresponding Poisson probabilities for each bin, and I want to count the observations for each bin. I feel the loop is rather redundant and ugly. Can you please let me know how I can refactor the function without the loop and make the whole calculation cleaner and more vectorized?
function result= poissonGoodnessOfFit(bins, observed)
assert(iscell(bins), "bins should be a cell array");
assert(all(cellfun("ismatrix", bins)) == 1, "bin entries either scalars or matrices");
assert(ismatrix(observed) && rows(observed) == 1, "observed data should be a 1xn matrix");
lambda_head = mean(observed); #poisson lambda parameter estimate
k = length(bins); #number of bin groups
n = length(observed); #number of observations
poisson_probability = []; #variable for poisson probability for each bin
observations = []; #variable for observation counts for each bin
for i=1:k
if isscalar(bins{1,i}) #this bin contains a single value
poisson_probability(1,i) = poisspdf(bins{1, i}, lambda_head);
observations(1, i) = histc(observed, bins{1, i});
else #this bin contains a range of values
inner_bins = bins{1, i}; #retrieve the range
inner_bins_k = length(inner_bins); #number of values inside
inner_poisson_probability = []; #variable to store individual probability of each value inside this bin
inner_observations = []; #variable to store observation counts of each value inside this bin
for j=1:inner_bins_k
inner_poisson_probability(1,j) = poisspdf(inner_bins(1, j), lambda_head);
inner_observations(1, j) = histc(observed, inner_bins(1, j));
endfor
poisson_probability(1, i) = sum(inner_poisson_probability, 2); #assign over the sum of all inner probabilities
observations(1, i) = sum(inner_observations, 2); #assign over the sum of all inner observation counts
endif
endfor
expected = n .* poisson_probability; #expected observations if indeed poisson using lambda_head
chisq = sum((observations - expected).^2 ./ expected, 2); #Pearson Chi-Square statistics
pvalue = 1 - chi2cdf(chisq, k-1-1);
result = struct("actual", observations, "expected", expected, "chi2", chisq, "pvalue", pvalue);
return;
endfunction
There's a couple of things worth noting in the code.
First, the 'scalar' case in your if block is actually identical to your 'range' case, since a scalar is simply a range of 1 element. So no special treatment is needed for it.
Second, you don't need to create such explicit subranges, your bin groups seem to be amenable to being used as indices into a larger result (as long as you add 1 to convert from 0-indexed to 1-indexed indices).
Therefore my approach would be to calculate the expected and observed numbers over the entire domain of interest (as inferred from your bin groups), and then use the bin groups themselves as 1-indices to obtain the desired subgroups, summing accordingly.
Here's some example code, written in the Octave/MATLAB-compatible subset of both languages:
function Result = poissonGoodnessOfFit( BinGroups, Observations )
% POISSONGOODNESSOFFIT( BinGroups, Observations) calculates the [... etc, etc.]
pkg load statistics; % only needed in octave; for matlab buy statistics toolbox.
assert( iscell( BinGroups ), 'Bins should be a cell array' );
assert( all( cellfun( @ismatrix, BinGroups ) ) == 1, 'Bin entries either scalars or matrices' );
assert( ismatrix( Observations ) && rows( Observations ) == 1, 'Observed data should be a 1xn matrix' );
% Define helpful variables
RangeMin = min( cellfun( @min, BinGroups ) );
RangeMax = max( cellfun( @max, BinGroups ) );
Domain = RangeMin : RangeMax;
LambdaEstimate = mean( Observations );
NBinGroups = length( BinGroups );
NObservations = length( Observations );
% Get expected and observed numbers per 'bin' (i.e. discrete value) over the *entire* domain.
Expected_Domain = NObservations * poisspdf( Domain, LambdaEstimate );
Observed_Domain = histc( Observations, Domain );
% Apply BinGroup values as indices
Expected_byBinGroup = cellfun( @(c) sum( Expected_Domain(c+1) ), BinGroups );
Observed_byBinGroup = cellfun( @(c) sum( Observed_Domain(c+1) ), BinGroups );
% Perform a Chi-Square test on the Bin-wise Expected and Observed outputs
O = Observed_byBinGroup; E = Expected_byBinGroup ; df = NBinGroups - 1 - 1;
ChiSquareTestStatistic = sum( (O - E) .^ 2 ./ E );
PValue = 1 - chi2cdf( ChiSquareTestStatistic, df );
Result = struct( 'actual', O, 'expected', E, 'chi2', ChiSquareTestStatistic, 'pvalue', PValue );
end
Running with your example gives:
X = [8 0 0 1 3 4 0 2 12 5 1 8 0 2 0 1 9 3 4 5 3 3 4 7 4 0 1 2 1 2]; % 30 observations
bins = {0, 1, [2:3], [4:5], [6:20]}; % each bin can be single value or multiple values
Result = poissonGoodnessOfFit( bins, X )
% Result =
% scalar structure containing the fields:
% actual = 6 5 8 6 5
% expected = 1.2643 4.0037 13.0304 8.6522 3.0493
% chi2 = 21.989
% pvalue = 0.000065574
A general comment about the code: it is always preferable to write self-explanatory code, rather than code that does not make sense by itself in the absence of a comment. Comments should generally only be used to explain the 'why', rather than the 'how'.
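As a cross-check of the per-bin numbers, here is a minimal plain-Python sketch of the same calculation. It is not vectorized and is not meant to replace the Octave function; it only recomputes the observed and expected counts from the definition of the Poisson probability mass function. The name `poisson_bin_table` and the use of Python lists and ranges for the bin groups are assumptions made for this illustration.

```python
import math

def poisson_bin_table(bin_groups, observations):
    """Return (observed counts, expected counts) per bin group."""
    n = len(observations)
    lam = sum(observations) / n          # estimate of the Poisson rate
    observed, expected = [], []
    for group in bin_groups:
        # Count the observations whose value falls in this bin group.
        observed.append(sum(1 for x in observations if x in group))
        # Sum the Poisson probabilities of every value in the group.
        prob = sum(math.exp(-lam) * lam**k / math.factorial(k) for k in group)
        expected.append(n * prob)
    return observed, expected

X = [8, 0, 0, 1, 3, 4, 0, 2, 12, 5, 1, 8, 0, 2, 0, 1, 9, 3, 4, 5,
     3, 3, 4, 7, 4, 0, 1, 2, 1, 2]
bins = [[0], [1], range(2, 4), range(4, 6), range(6, 21)]
observed, expected = poisson_bin_table(bins, X)
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(observed)
print([round(e, 4) for e in expected])
print(round(chi2, 3))
```

With the example data this should agree with the `actual`, `expected` and `chi2` values listed above. As in the Octave version, a scalar bin is simply a group with one element.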

How to write a JSON object from R dataframe with grouping

In general I feel there is a need to make JSON objects by folding multiple columns. There is no direct way to do this, as far as I know. Please point it out if there is one.
I have data of this form:
A B C
1 a x
1 a y
1 c z
2 d p
2 f q
2 f r
How do I write a json which looks like
{'query':'1', 'type':[{'name':'a', 'values':[{'value':'x'}, {'value':'y'}]}, {'name':'c', 'values':[{'value':'z'}]}]}
and similarly for 'query':'2'
I am looking to write them out in the mongo import/export format, with one JSON object per line.
Any pointers are also appreciated.
You've got a little "non-standard" thing going with two keys of "value" (I don't know if this is legal json), as you can see here:
(js <- jsonlite::fromJSON('{"query":"1", "type":[{"name":"a", "values":[{"value":"x"}, {"value":"y"}]}, {"name":"c", "values":[{"value":"z"}]}]}'))
## $query
## [1] "1"
##
## $type
## name values
## 1 a x, y
## 2 c z
... with a data.frame cell containing a list of data.frames:
js$type$values[[1]]
## value
## 1 x
## 2 y
class(js$type$values[[1]])
## [1] "data.frame"
If you can accept your "type" variable containing a vector instead of a named-list, then perhaps the following code will suffice:
jsonlite::toJSON(lapply(unique(dat[, 'A']), function(a1) {
list(query = a1,
type = lapply(unique(dat[dat$A == a1, 'B']), function(b2) {
list(name = b2,
values = dat[(dat$A == a1) & (dat$B == b2), 'C'])
}))
}))
## [{"query":[1],"type":[{"name":["a"],"values":["x","y"]},{"name":["c"],"values":["z"]}]},{"query":[2],"type":[{"name":["d"],"values":["p"]},{"name":["f"],"values":["q","r"]}]}]
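For comparison, here is a minimal sketch that builds exactly the structure asked for in the question, including the `{"value": ...}` wrappers, and emits one JSON object per line. It uses Python's standard `json` module rather than R, and the variable names (`rows`, `nested`, `lines`) are assumptions for this illustration. An array of single-key objects that all use the key "value" is valid JSON, since the repeated key never occurs twice within the same object.

```python
import json

rows = [
    ("1", "a", "x"), ("1", "a", "y"), ("1", "c", "z"),
    ("2", "d", "p"), ("2", "f", "q"), ("2", "f", "r"),
]

# Fold C under B under A; dicts keep insertion order in Python 3.7+.
nested = {}
for a, b, c in rows:
    nested.setdefault(a, {}).setdefault(b, []).append(c)

lines = []
for a, by_name in nested.items():
    doc = {
        "query": a,
        "type": [
            {"name": b, "values": [{"value": c} for c in cs]}
            for b, cs in by_name.items()
        ],
    }
    lines.append(json.dumps(doc))   # one document per line

print("\n".join(lines))
```

Each element of `lines` is a complete document, which is the line-per-document layout the question asks for.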

Use sum of multiple pandas columns in mapping a function

I am trying to create a new column in my DataFrame.
I want the new column to be a*b if the sum of a few other columns is == 0, 1 if the sum is == 1, and 0 otherwise.
The number of columns that I am summing across is dynamic in that it may be 3 columns I am summing across or it could be 100. I have a list of those column names (list_to_check) which could be of any length.
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c':['dd','ee','ff'], 'd1':[5,0,1], 'd2':[5,0,1], 'dn':[5,0,1]})
list_to_check = ['d1','d2','dn']
def func(a,b,c):
if sum(c) == 0:
a*b
elif sum(c) == 1:
1
else:
0
df['new_column'] = np.vectorize(func)(df.a,df.b,df[list_to_check])
vals = df[list_to_check].sum(1)
df['new_col'] = 0
df.loc[vals == 0, 'new_col'] = df.a * df.b
df.loc[vals == 1, 'new_col'] = 1
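The same three-way rule can also be written as a single expression. This is a minimal sketch using `np.select`, which picks the first choice whose condition is true and falls back to `default` otherwise. It reuses the sample frame from the question and assumes the columns in `list_to_check` are numeric.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': ['dd', 'ee', 'ff'],
                   'd1': [5, 0, 1], 'd2': [5, 0, 1], 'dn': [5, 0, 1]})
list_to_check = ['d1', 'd2', 'dn']

# Row-wise sum over however many columns are listed.
vals = df[list_to_check].sum(axis=1)

# a*b where the sum is 0, 1 where the sum is 1, otherwise 0.
df['new_col'] = np.select([vals == 0, vals == 1],
                          [df['a'] * df['b'], 1],
                          default=0)
print(df[['a', 'b', 'new_col']])
```

Because the sum is taken over `df[list_to_check]`, the code does not change whether the list holds 3 column names or 100.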

HowTo: select all rows in a cell array, where a particular column has a particular value

I have a cell array, A. I would like to select all rows where the first column (for example) has the value 1234 (for example).
When A is not a cell array, I can accomplish this by:
B = A(A(:,1) == 1234,:);
But when A is a cell array, I get this error message:
error: binary operator `==' not implemented for `cell' by `scalar' operations
Does anyone know how to accomplish this, for a cell array?
The problem is the expression a(:,1) == 1234 (and also a{:,1} == 1234).
For example:
octave-3.4.0:48> a
a =
{
[1,1] = 10
[2,1] = 13
[3,1] = 15
[4,1] = 13
[1,2] = foo
[2,2] = 19
[3,2] = bar
[4,2] = 999
}
octave-3.4.0:49> a(:,1) == 13
error: binary operator `==' not implemented for `cell' by `scalar' operations
octave-3.4.0:49> a{:,1} == 13
error: binary operator `==' not implemented for `cs-list' by `scalar' operations
I don't know if this is the simplest or most efficient way to do it, but this works:
octave-3.4.0:49> cellfun(@(x) isequal(x, 13), a(:,1))
ans =
0
1
0
1
octave-3.4.0:50> a(cellfun(@(x) isequal(x, 13), a(:,1)), :)
ans =
{
[1,1] = 13
[2,1] = 13
[1,2] = 19
[2,2] = 999
}
I guess the class of A is cell (you can see this in the Workspace box).
So you may need to convert A to a matrix with cell2mat(A), which only works if every cell holds the same data type.
Then index it just as you did in Matlab: B = A(A(:,1) == 1234,:);
I don't have Octave available at the moment to try it out, but I believe that the following would do it:
B = A(A{:,1} == 1234,:);
When dealing with cells, () returns the cell and {} returns the contents of the cell.

Subsetting a data frame in a function using another data frame as parameter

I would like to submit a data frame to a function and use it to subset another data frame.
This is the basic data frame:
foo <- data.frame(var1= c(1, 1, 1, 2, 2, 3), var2=c('A', 'A', 'B', 'B', 'C', 'C'))
I use the following function to find out the frequencies of var2 for specified values of var1.
foobar <- function(x, y, z){
a <- subset(x, (x$var1 == y))
b <- subset(a, (a$var2 == z))
n=nrow(b)
return(n)
}
Examples:
foobar(foo, 1, "A") # returns 2
foobar(foo, 1, "B") # returns 1
foobar(foo, 3, "C") # returns 1
This works. But now I want to submit a data frame of values to foobar. Instead of the above examples, I would like to submit df to foobar and get the same results as above (2, 1, 1)
df <- data.frame(var1=c(1, 1, 3), var2=c("A", "B", "C"))
When I change foobar to accept two arguments, as in foobar(foo, df), and use y[, c(var1)] and y[, c(var2)] instead of the two parameters y and z, it still doesn't work. How can I do this?
edit1: last paragraph clarified
edit2: var1 type corrected
Try this:
library(plyr)
match_df <- function(x, match) {
vars <- names(match)
# Create unique id for each row
x_id <- id(match[vars])
match_id <- id(x[vars])
# Match identifiers and return subsetted data frame
x[match(x_id, match_id, nomatch = 0), ]
}
match_df(foo, df)
# var1 var2
# 1 1 A
# 3 1 B
# 5 2 C
Your function foobar is expecting three arguments, and you only supplied two arguments to it with foobar(foo, df). You can use apply to get what you want:
apply(df, 1, function(x) foobar(foo, x[1], x[2]))
And in use:
> apply(df, 1, function(x) foobar(foo, x[1], x[2]))
[1] 2 1 1
To respond to your edit:
I'm not entirely sure what y[, c(var1)] means, but here's an attempt at trying to figure out what you are trying to do.
What I think you were trying to do was: foobar(foo, y = df[, "var1"], z = df[, "var2"]).
First, note that the use of c() is not needed here: you can reference the columns you want by placing the name of the column in quotes, or reference the column by number (as I did above). Secondly, df[, "var1"] returns all of the rows for the column named var1, which has a length of three:
> length(df[, "var1"])
[1] 3
The function you defined is not set up to deal with vectors of length greater than 1. That is why we need to iterate through each row of your data frame to grab a single value, process it, and then go to the next row in the data.frame. That is what the apply function does. It is equivalent to saying something along the lines of for (i in 1:nrow(df)) but is a more idiomatic way of handling such issues.
Finally, is there a reason you generated var1 as a factor? It probably makes more sense to treat these as numeric, in my opinion. Compare:
> str(df)
'data.frame': 3 obs. of 2 variables:
$ var1: Factor w/ 2 levels "1","3": 1 1 2
$ var2: Factor w/ 3 levels "A","B","C": 1 2 3
Versus
> df2 <- data.frame(var1=c(1,1,3), var2=c("A", "B", "C"))
> str(df2)
'data.frame': 3 obs. of 2 variables:
$ var1: num 1 1 3
$ var2: Factor w/ 3 levels "A","B","C": 1 2 3
In summary - apply is the function you are after here. You may want to spend some time thinking about whether your data should be numeric or a factor, but apply is still what you want.
foobar2 <- function(x, df) {
.dofun <- function(y, z){
a <- subset(x, x$var1==y)
b <- subset(a, a$var2==z)
n <- nrow(b)
return (n)
}
ans <- mapply(.dofun, as.character(df$var1), as.character(df$var2))
names(ans) <- NULL
return(ans)
}
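The apply and mapply answers both do the same thing: for each (var1, var2) pair in df, they count how many rows of foo match it. Here is a minimal sketch of that counting idea in Python rather than R, using `collections.Counter`. Representing each row as a tuple and the names `queries` and `pair_counts` are assumptions for this illustration.

```python
from collections import Counter

# Rows of foo and of the query frame df, as (var1, var2) tuples.
foo = [(1, "A"), (1, "A"), (1, "B"), (2, "B"), (2, "C"), (3, "C")]
queries = [(1, "A"), (1, "B"), (3, "C")]

# Count every distinct pair once, then look each query up.
pair_counts = Counter(foo)
result = [pair_counts[q] for q in queries]
print(result)
```

A `Counter` returns 0 for a pair that never occurs, so query rows with no match in foo need no special handling.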