Create a new column in a dataframe using a function - function

So I think I am having a simple problem.
I have a dataframe (my_data), that looks like this
Treatment Amount Duration
a 5 3000
b 8 2000
c 6 1000
d 2 5000
Now I want to create a new dataframe (my_data_1) which adds a new column based on a simple function Duration/Amount.
My_data_1 should look like this:
Treatment Amount Duration Mean duration
a 5 3000 600
b 8 2000 250
c 6 1000 167
d 2 5000 2500
I tried to write a function and implement it into my dataframe
mean_duration <- function(md){my_data$Duration / my_data$Amount}
my_data_1$md <- with(my_data, ave(Duration, Amount, FUN = mean_duration))
Where did I go wrong?

Related

How to perform a many-to-many or (at least) a outer-join in SPSS

usually I use [R] for my data analysis, but these days I have to use SPSS. I was expecting that data manipulation might get a little bit more difficult this way, but after my first day I kind of surrender :D and I really would appreciate some help ...
My problem is the following:
I have two data sets, which have an ID number. Neither data sets have a unique ID (in one data set, which should have unique IDs, there is kind of a duplicated row)
In a perfect world I would like to keep this duplicated row and simply perform a many-to-many-join. But I accepted, that I might have to delete this "bad" row (in dataset A) and perform a 1:many-join (join dataset B to dataset A, which contains the unique IDs).
If I run the join (and accept that it seems not to be possible to run a 1:many, but only a many:1-join), I have the problem, that I lose IDs. If I join dataset A to dataset B I lose all cases, that are not part of dataset B. But I really would like to have both IDs like in a full join or something.
Do you know if there is (kind of) a simple solution to my problem?
Example:
dataset A:
ID
VAL1
1
A
1
B
2
D
3
K
4
A
dataset B:
ID
VAL2
1
g
2
k
4
a
5
c
5
d
5
a
2
x
expected result (best solution):
ID
VAL1
VAL2
1
A
g
1
B
g
2
D
k
3
K
NA
4
A
a
2
D
x
expected result (second best solution):
ID
VAL1
VAL2
1
A
g
2
D
k
3
K
NA
4
A
a
5
NA
c
5
NA
d
5
NA
a
2
D
x
what I get (worst solution):
ID
VAL1
VAL2
1
A
g
2
D
k
4
A
a
5
NA
c
5
NA
d
5
NA
a
2
D
x
From your example It looks like what you need is a full many to many join, based on the ID's existing in dataset A. You can get this by creating a full Cartesian-Product of the two dataset, using dataset A as the first\left dataset.
The following syntax assumes you have the STATS CARTPROD extention command installed. If you don't you can see here about installing it.
First I'll recreate your example to demonstrate on:
dataset close all.
data list list/id1 vl1 (2F3) .
begin data
1 232
1 433
2 456
3 246
4 468
end data.
dataset name aaa.
data list list/id2 vl2 (2F3) .
begin data
1 111
2 222
4 333
5 444
5 555
5 666
2 777
3 888
end data.
dataset name bbb.
Now the actual work is fairly simple:
DATASET ACTIVATE aaa.
STATS CARTPROD VAR1=id1 vl1 INPUT2=bbb VAR2=id2 vl2
/SAVE OUTFILE="C:\somepath\yourcartesianproduct.sav".
* The new dataset now contains all possible combinations of rows in the two datasets.
* we will select only the relevant combinations, where the two ID's match.
select if id1=id2.
exe.

Post Increment date field in mySQL query using R

I am trying to query a table in our mySQL database using the DBI R package. However, I need to pull the fields from the table by changing the date field on a monthly basis and limiting it to 1.
I'm having trouble with the looping and sql query text. I would like to create a loop that changes the date (monthly) and then prints that to a database query that will then pull all the data that matches the monthly conditions.
This is my code so far:
for (i in seq(0,12,1)){
results <- dbGetQuery(myDB, paste("SELECT * FROM cost_and_price_period WHERE start_date <=", '01-[[i]]-2019'))
}
The main issue is that R doesn't acknowledge post-increment operators like ++, so I know I could just make 12 individual queries and then rbind them, but I would prefer to do one efficient query. Does anyone have any ideas?
This below solution could give you an idea how to proceed with your problem.
DummyTable
id names dob
1 1 aa 2018-01-01
2 2 bb 2018-02-01
3 3 cc 2018-03-01
4 4 dd 2018-04-01
5 5 ee 2018-05-01
6 6 ff 2018-06-01
7 7 gg 2018-07-01
8 8 hh 2018-08-01
9 9 ii 2018-09-01
10 10 jj 2018-10-01
11 11 kk 2018-11-01
12 12 ll 2018-12-01
13 13 ll 2018-12-01
Imagine we have the above table in MySQL. Then we need to access the data for 1st day of every month and store whole records as a data frame.
### Using for loop like from your question
n <- 12
df <- vector("list", n)
for (i in seq(1:12)){
df[[i]] <- data.frame(dbGetQuery(pool, paste0("SELECT * FROM dummyTable WHERE dob = '2018-",i,"-01';" ))) # in iteration `i` corresponds for month number
}
df <- do.call(rbind, df)
### Using lapply(preferred way)
n <- seq(1:12)
df <- lapply(n, function(x){
dbGetQuery(pool, paste0("SELECT * FROM dummyTable WHERE dob = '2018-",x,"-01';" ))
})
df <- do.call(rbind, df)
So output of df data frame will give the matched records from MySQL.

Cannot print out the latest results of table

I have the following table:
NAMES:
Fname | stime | etime | Ver | Rslt
x 4 5 1.01 Pass
x 8 10 1.01 Fail
x 6 7 1.02 Pass
y 4 8 1.01 Fail
y 9 10 1.01 Fail
y 11 12 1.01 Pass
y 10 14 1.02 Fail
m 1 2 1.01 Fail
m 4 6 1.01 Fail
The result I am trying to output is:
x 8 10 1.01 Fail
x 6 7 1.02 Pass
y 11 12 1.01 Pass
y 10 14 1.02 Fail
m 4 6 1.01 Fail
What the result means:
Fnames are an example of tests that are run. Each test was run on different platforms of software (The version numbers) Some tests were run on the same platform twice: It passed the first time and failed the second time or vice versa. My required output is basically the latest result of each case for each version. So basically the results above are all unique by their combination of Fname and Ver(sion), and they are selected by the latest etime from the unique group.
The query I have so far is:
select Fname,stime,max(etime),ver,Rslt from NAMES group by Fname,Rslt;
This however, does not give me the required output.
The output I get is (wrong):
x 4 10 1.01 Fail
x 6 7 1.02 Pass
y 4 12 1.01 Pass
y 10 14 1.02 Fail
m 1 6 1.01 Fail
Basically it takes the max time, but it does not really print the correct data out, it prints the max time, but it prints the initial time of the whole unique group of data, instead of the initial time of that particular test (record).
I have tried so long to fix this, but I seem to be going no where. I have a feeling there is a join somewhere in here, but I tried that too, no luck.
Any help is appreciated,
Thank you.
Use a subquery to get the max ETime by FName and Ver, then join your main table to it:
SELECT
NAMES.FName,
NAMES.STime,
NAMES.ETime,
NAMES.Ver,
NAMES.Rslt
FROM NAMES
INNER JOIN (
SELECT FName, Ver, MAX(ETime) AS MaxETime
FROM NAMES
GROUP BY FName, Ver
) T ON NAMES.FName = T.FName AND NAMES.Ver = T.Ver AND NAMES.ETime = T.MaxETime
You could first find which is the latests=max(etime) for each case for each version ?
select Fname,Ver,max(etime) from NAMES group by Fname,Ver;
From there you would display the whole thing via joining it again?
select *
from
NAMES
inner join
(select Fname,Ver,max(etime) as etime from NAMES group by Fname,Ver ) sub1
using (Fname,Ver,etime)
order by fname,Ver;

Extracting data from data frame in R

I am very new to R (computer programming in general) and am working on a bioinformatics project. I made a MySQL database and using RMySQL connected to that database in the MySQL server in R. From here I issued queries to select a certain field from a table, fetch this data and make it into a data frame in R as seen below:
> rs = dbSendQuery(con, "select mastitis_no from experiment")
> data = fetch(rs, n=-1)
> data
mastitis_no
1 5
2 2
3 8
4 6
5 2
....
> rt = dbSendQuery(con, "select BMSCC from experiment")
> datas = fetch(rt, n=-1)
> datas
BMSCC
1 14536
2 10667
3 23455
4 17658
5 14999
....
> ru = dbSendQuery(con, "select cattle_hygiene_score_avg from experiment")
> dat = fetch(ru, n=-1)
> dat
cattle_hygiene_score_avg
1 1.89
2 1.01
3 1.21
4 1.22
5 1.93
....
My first 2 data frames are integers and my third data frame is in decimal format. I am able to run a simple correlation test on these data frames, but a detailed test (or plot) cannot be run as seen below.
> cor(data, datas)
BMSCC
mastitis_no 0.8303017
> cor.test(data, datas)
Error in cor.test.default(data, datas) : 'x' must be a numeric vector
Therefore I accessed the data inside those data frames using the usual list idexing device $, however the decimal data frame did not work as noted below.
> data$mastitis
[1] 5 2 8 6 2 0 5 6 7 3 0 1 0 3 2 2 0 5 2 1
> datas$BMSCC
[1] 14536 10667 23455 17658 14999 5789 18234 22390 19069 13677 13536 11667 13455
[14] 17678 14099 15789 8234 21390 16069 13597
> dat$hygiene
NULL
by doing this I am able to perform a spearman rank correlation test and scatter plot on the first two data frames but not the decimal data frame. Any suggestion on what I need to do? I am sure the answer is quite simple but I cannot find the coding necessary for this simple task. Any help would be much appreciated.

How to apply a formula for removing data noise in R?

I am working on NGSim Traffic data, having 18 columns and 1180598 rows in a text file. I want to smooth the position data, in the column 'Local Y'. I know there are built-in functions for data smoothing in R but none of them seem to match with the formula I am required to apply. The data in text file looks something like this:
Index VehicleID Total_Frames Local Y
1 2 5 35.381
2 2 5 39.381
3 2 5 43.381
4 2 5 47.38
5 2 5 51.381
6 4 8 504.828
7 4 8 508.325
8 4 8 512.841
9 4 8 516.338
10 4 8 520.854
11 4 8 524.592
12 4 8 528.682
13 4 8 532.901
14 5 7 39.154
15 5 7 43.153
16 5 7 47.154
17 5 7 51.154
18 5 7 55.153
19 5 7 59.154
20 5 7 63.154
The above data columns are just example taken out of original file. Here you can see 3 vehicles, with vehicle IDs = 2, 4 and 5 but in fact there are 2169 vehicles with different IDS. The column Total_Frames tell us how many times vehicle Id of each vehicle is repeated in the first column, for example in the table above, vehicle ID 2 is repeated 5 times, hence '5' in Total_Frames column. Following is the formula I am required to apply to remove data noise (smoothing) from column 'Local Y':
Smoothed Position Value = (1/(Summation of [EXP^-abs(i-k)/delta] from k=i-D to i+D)) * ( (Summation of (Local Y) *[EXP^-abs(i-k)/delta] from k=i-D to i+D))
where,
i = index #
delta = 5
D = 15
I have tried using the built-in functions, which I know of, but they don't smooth the data as required. My question is: Is there any built-in function in R which can do the data smoothing in the way of given formula or which could take this formula as an argument? I need to apply the formula to every value in Local Y which has 15 values before and 15 values after them (i-D and i+D) for same vehicle Id. Can anyone give me any idea how to approach the problem? Thanks in advance.
You can place your formula in a function and then use the apply function of R to apply it to the elements in your "Local Y" column of the dataframe