Selecting rows in R based on condition - function

I have a dataframe grouped by ID1, ID2 and ID3 with variables V1, V2, V3 and V4. I am trying to capture, per group, the values that are most different from the median. For this, I have subtracted the median from each value and squared the result (because some differences are negative). Below is an example dataframe.
a <- c("A", "B", "C", "D")
b <- c("X", "Y", "Z", "T")
c <- c("1", "2", "3", "4")
d <- c(1.23, 2.03, 2.45, 5.66)
e <- c(1, 2, 3, 4)
df <- data.frame(a, b, c, d, e)
colnames(df) <- c("ID1", "ID2", "ID3", "V1", "V2")
I have made a function med_removed as follows.
med_removed <- function(x, na.rm = TRUE, ...) {
  mad <- sort((x - median(x, na.rm = na.rm))^2)
  head(mad, 4)
}
df_selected <- df %>% group_by(ID1, ID2, ID3) %>% mutate_all(med_removed)
The problem is that I want to select rows in the original dataframe based on (x - median(x))^2, picking up the top 2 values per group.
Does anyone know a good way of doing that?
Thanks
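For what it's worth, here is a minimal sketch of one possible approach using dplyr::slice_max (an assumption on my part, not an accepted answer; it needs dplyr >= 1.0.0 and assumes the goal is the two rows per group whose V1 is furthest from the group median; in the toy data every group has a single row, so the effect only shows on the real data):
library(dplyr)

df %>%
  group_by(ID1, ID2, ID3) %>%
  mutate(dist = (V1 - median(V1, na.rm = TRUE))^2) %>%  # squared distance from the group median
  slice_max(dist, n = 2, with_ties = FALSE) %>%         # keep the 2 most extreme rows per group
  ungroup()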

Related

Json to pandas dataframe with slight modification

I have JSON data as below:
{
  "X": "abc",
  "Y": 1,
  "Z": 4174,
  "t_0": {
    "M": "bm",
    "T": "sp",
    "CUD": 4,
    "t_1": {
      "CUD": "1",
      "BBC": "09",
      "CPR": -127
    },
    "EVV": "10.7000",
    "BBC": -127,
    "CMIX": "25088"
  },
  "EYR": "sp"
}
The problem is that converting this to a pandas DataFrame creates two columns with the same name, CUD: one is under t_0 and the other is under t_1, but they are different events. How can I append the JSON tag names to the column names so that I can differentiate the two columns with the same name, something like t_0_CUD and t_1_CUD?
My code is below:
df = pd.io.json.json_normalize(json_data)
df.columns = df.columns.map(lambda x: x.split(".")[-1])
If you use only the first part of your solution, it returns what you need, except that . is used instead of _:
df = pd.io.json.json_normalize(json_data)
print (df)
X Y Z EYR t_0.M t_0.T t_0.CUD t_0.t_1.CUD t_0.t_1.BBC t_0.t_1.CPR \
0 abc 1 4174 sp bm sp 4 1 09 -127
t_0.EVV t_0.BBC t_0.CMIX
0 10.7000 -127 25088
If you need _:
df.columns = df.columns.str.replace('.', '_', regex=False)
print (df)
X Y Z EYR t_0_M t_0_T t_0_CUD t_0_t_1_CUD t_0_t_1_BBC t_0_t_1_CPR \
0 abc 1 4174 sp bm sp 4 1 09 -127
t_0_EVV t_0_BBC t_0_CMIX
0 10.7000 -127 25088

Function for changing size of points in geom_point based on sum of presence absence data

I'm working with a species dataset in presence/absence format where samples have been taken multiple times a day over a period of several days.
Here's a dummy version of the data:
dummy = structure(list(Sample = c("A1", "A1", "A1", "A2", "A2", "A2",
"B1", "B1", "B1", "B2", "B2", "B2"), Species = c("snuffles1",
"snuffles2", "snuffles3", "snuffles1", "snuffles2", "snuffles3",
"snuffles1", "snuffles2", "snuffles3", "snuffles1", "snuffles2",
"snuffles3"), Presence = c(1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1
), Day = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B",
"B", "B")), row.names = c(NA, -12L), class = c("tbl_df", "tbl",
"data.frame"))
ggplot(dummy[which(dummy$Presence>0),], aes(x = Day, y = Species, color = Species)) +
geom_point(alpha=0.5) +
geom_count(aes(size = sum(dummy$Presence)))
I would like to plot the data in ggplot with the size of each point dependent on the number of observations within that group (i.e. if on Day A snuffles1 was observed 2 times, then the point should be size 2, whereas if on Day B snuffles1 was observed once, the point would be size 1). I hope this makes sense. This counting presence/absence based on group is similar, but not quite what I need.
My guess is that I have to use some sort of function to count the number of observations for each species, dependent on which variable I'm considering, but I can't work out how to do this.
Thanks for any and all advice.
Make an additional count by group, then plot this data frame as an extra layer using geom_point.
I am adding breaks to scale_size in order to show only the existing sizes.
library(tidyverse)
count_dum <- dummy %>% group_by(Day, Species) %>% summarise(count = sum(Presence))
ggplot(dummy[which(dummy$Presence > 0), ], aes(x = Day, y = Species, color = Species)) +
geom_point(data = count_dum, aes(size = count), alpha = 0.5) +
scale_size_continuous(breaks = unique(count_dum$count))
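An alternative, shorter sketch (my assumption, not part of the answer above): geom_count() tallies overlapping points on its own, so after filtering to Presence > 0 each point is sized by the number of rows at that Day/Species combination, which for 0/1 data equals the sum of Presence.
library(tidyverse)

ggplot(filter(dummy, Presence > 0), aes(x = Day, y = Species, color = Species)) +
  geom_count(alpha = 0.5)  # point size = number of observations at each (Day, Species)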

Convert JSON array keys to CSV column names and values

I am parsing JSON data to write a CSV file, using the tidyjson package for this.
At some point I need to turn each subject below into a separate column with its score as the value; that is, Physics and Mathematics should become column names and the scores their values.
{
  "results": {
    "subjects": [
      {
        "subject": {
          "name": "Physics",
          "code": "PHY"
        },
        "score": 70
      },
      {
        "subject": {
          "name": "Mathematics",
          "code": "MATH"
        },
        "score": 50
      }
    ]
  }
}
I have tried the following:
json_data %>%
as.tbl_json %>%
gather_array %>%
spread_values(user_id = jstring("user_id")) %>%
enter_object("results") %>%
enter_object("subjects") %>%
gather_array("subjects") %>%
spread_values(score = jstring("score")) %>%
enter_object("subject") %>%
spread_values(subject = jstring("subject")) %>%
mutate(Physics = case_when(.$name == "Physics" ~ score)) %>%
mutate(Mathematics = case_when(.$name == "Mathematics" ~ score))
But this shows multiple rows for one student. I need a single row, with each subject and its score as a column value.
So you need a unique row based on subject name? In that case you can use aggregate.
If you have a data frame named df like this:
subject <- c("phy", "math", "phy", "math")
Score <- c(10, NA, NA, 20)
df <- data.frame(subject, Score)
then,
aggregate(x=df[c("Score")], by=list(subjectName=df$subject), max, na.rm = TRUE)
output
subjectName Score
phy 10
math 20
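The question also asks for the subjects spread into columns on a single row. A minimal sketch of that reshape with tidyr::pivot_wider, using a hypothetical long data frame scores with one row per subject (the names and values are assumptions for illustration):
library(tidyr)

# hypothetical long format: one row per subject
scores <- data.frame(subject = c("Physics", "Mathematics"), score = c(70, 50))

# spread subject names into columns, scores as values
pivot_wider(scores, names_from = subject, values_from = score)
# a single row with columns Physics = 70 and Mathematics = 50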

JSON to R data frame: preserve repeated values

I have a JSON data source that is a list of objects. Some of the object properties are themselves lists. I want to turn the whole thing into a data frame, preserving the lists as data frame values.
Example JSON data:
[{
  "id": "A",
  "p1": [1, 2, 3],
  "p2": "foo"
}, {
  "id": "B",
  "p1": [4, 5, 6],
  "p2": "bar"
}]
Desired data frame:
id p2 p1
1 A foo 1, 2, 3
2 B bar 4, 5, 6
Failed attempt 1
I have found this nicely straightforward way of parsing my JSON:
unlisted_data <- lapply(fromJSON(json_str), function(x){unlist(x)})
data.frame(do.call("rbind", unlisted_data))
However, the unlisting process spreads my repeated value across multiple columns:
id p11 p12 p13 p2
1 A 1 2 3 foo
2 B 4 5 6 bar
I expected that calling unlist with the recursive = FALSE option would take care of this, but it doesn't.
Failed attempt 2
I noticed that I can almost do this with the I function:
> data.frame(I(parsed_json[[1]]))
parsed_json..1..
id A
p1 1, 2, 3
p2 foo
But the rows and columns are reversed. Transposing the result mangles the repeated data:
> t(data.frame(I(parsed_json[[1]])))
id p1 p2
parsed_json..1.. "A" Numeric,3 "foo"
The jsonlite package can handle this just fine:
library(jsonlite)
# txt holds the JSON string shown above
fromJSON(txt)
# id p1 p2
#1 A 1, 2, 3 foo
#2 B 4, 5, 6 bar
fromJSON(txt)$p1
#[[1]]
#[1] 1 2 3
#
#[[2]]
#[1] 4 5 6
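If you prefer to stay in base R along the lines of the failed attempts, here is a sketch (my assumption, not part of the answer above): parse without simplification and wrap the list column in I() while building the data frame row by row.
# txt holds the JSON string shown above
parsed_json <- fromJSON(txt, simplifyVector = FALSE)  # keep nested arrays as plain lists
rows <- lapply(parsed_json, function(x) {
  data.frame(id = x$id, p2 = x$p2, p1 = I(list(unlist(x$p1))))
})
do.call(rbind, rows)
#   id  p2      p1
# 1  A foo 1, 2, 3
# 2  B bar 4, 5, 6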

read mixed data into R

I have a tab-separated ('\t') text file. The first two columns are text and the third one is in JSON format, like {type: [{a: a1, timestamp: 1}, {a: a2, timestamp: 2}]}.
How can I read it into a data frame correctly?
I would like to parse a line like factor1\tparam1\t{type: [{a: a1, timestamp: 1}, {a: a2, timestamp: 2}]} into a data frame like
factor_column param_column a_column ts_column
factor1 param1 a1 1
factor1 param1 a2 2
I have saved that one line of text you have provided into a file called 'parseJSON.txt'. You can then read the file in as per usual using read.table, then make use of library(jsonlite) to parse the 3rd column.
I've also formatted the line of text to include quotes around the JSON code:
factor1 param1 {"type": [{"a": "a1", "timestamp": 1}, {"a":"a2", "timestamp": 2}]}
library(jsonlite)
dat <- read.table("parseJSON.txt",
                  sep = "\t",
                  header = FALSE,
                  quote = "")
# parse 3rd column using jsonlite
js <- fromJSON(as.character(dat[1, 3]))
js is now a list
> js
$type
a timestamp
1 a1 1
2 a2 2
which can be combined with the first two columns of dat
res <- cbind(dat[,1:2],js$type)
names(res) <- c("factor_column", "param_column", "a_column", "ts_column")
which gives
> res
factor_column param_column a_column ts_column
1 factor1 param1 a1 1
2 factor1 param1 a2 2
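A sketch generalizing this to a file with many lines (my assumption, not part of the answer above): parse the JSON cell in every row, then row-bind the pieces.
parsed <- lapply(seq_len(nrow(dat)), function(i) {
  js <- fromJSON(as.character(dat[i, 3]))
  cbind(dat[i, 1:2], js$type)  # the 1-row prefix is recycled across the parsed rows
})
res <- do.call(rbind, parsed)
names(res) <- c("factor_column", "param_column", "a_column", "ts_column")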