I am very new to R (and computer programming in general) and am working on a bioinformatics project. I made a MySQL database and, using RMySQL, connected to it from R. From there I issued queries to select a certain field from a table, fetched the data, and made it into a data frame in R, as seen below:
> rs = dbSendQuery(con, "select mastitis_no from experiment")
> data = fetch(rs, n=-1)
> data
mastitis_no
1 5
2 2
3 8
4 6
5 2
....
> rt = dbSendQuery(con, "select BMSCC from experiment")
> datas = fetch(rt, n=-1)
> datas
BMSCC
1 14536
2 10667
3 23455
4 17658
5 14999
....
> ru = dbSendQuery(con, "select cattle_hygiene_score_avg from experiment")
> dat = fetch(ru, n=-1)
> dat
cattle_hygiene_score_avg
1 1.89
2 1.01
3 1.21
4 1.22
5 1.93
....
My first two data frames contain integers and my third is in decimal format. I am able to run a simple correlation on these data frames, but a detailed test (or plot) cannot be run, as seen below.
> cor(data, datas)
BMSCC
mastitis_no 0.8303017
> cor.test(data, datas)
Error in cor.test.default(data, datas) : 'x' must be a numeric vector
Therefore I accessed the data inside those data frames using the usual list indexing operator $; however, the decimal data frame did not work, as noted below.
> data$mastitis
[1] 5 2 8 6 2 0 5 6 7 3 0 1 0 3 2 2 0 5 2 1
> datas$BMSCC
[1] 14536 10667 23455 17658 14999 5789 18234 22390 19069 13677 13536 11667 13455
[14] 17678 14099 15789 8234 21390 16069 13597
> dat$hygiene
NULL
By doing this I am able to perform a Spearman rank correlation test and scatter plot on the first two data frames, but not on the decimal data frame. Any suggestion on what I need to do? I am sure the answer is quite simple, but I cannot find the code necessary for this simple task. Any help would be much appreciated.
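A minimal sketch of the column access involved, assuming the columns keep the names used in the SELECT statements: $ on a data frame matches a full column name or a prefix of it, so data$mastitis finds mastitis_no, but dat$hygiene returns NULL because "hygiene" is not a prefix of cattle_hygiene_score_avg. Using the full name (or [[ ]]) should give numeric vectors that cor.test() and plot() accept:

hyg <- dat$cattle_hygiene_score_avg        # full column name works
hyg <- dat[["cattle_hygiene_score_avg"]]   # equivalent, with no partial matching
cor.test(data$mastitis_no, datas$BMSCC, method = "spearman")
plot(datas$BMSCC, hyg)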
Usually I use R for my data analysis, but these days I have to use SPSS. I expected that data manipulation might get a little more difficult this way, but after my first day I am about ready to surrender :D and I would really appreciate some help ...
My problem is the following:
I have two data sets, each with an ID number. Neither data set has unique IDs (in the one data set that should have unique IDs, there is a kind of duplicated row).
In a perfect world I would like to keep this duplicated row and simply perform a many-to-many join. But I have accepted that I might have to delete this "bad" row (in dataset A) and perform a 1:many join (joining dataset B to dataset A, which contains the unique IDs).
If I run the join (and accept that it seems to be possible to run only a many:1 join, not a 1:many join), I have the problem that I lose IDs. If I join dataset A to dataset B, I lose all cases that are not part of dataset B. But I would really like to keep both sets of IDs, as in a full join or something similar.
Do you know if there is (kind of) a simple solution to my problem?
Example:
dataset A:
ID  VAL1
1   A
1   B
2   D
3   K
4   A
dataset B:
ID  VAL2
1   g
2   k
4   a
5   c
5   d
5   a
2   x
expected result (best solution):
ID  VAL1  VAL2
1   A     g
1   B     g
2   D     k
3   K     NA
4   A     a
2   D     x
expected result (second best solution):
ID  VAL1  VAL2
1   A     g
2   D     k
3   K     NA
4   A     a
5   NA    c
5   NA    d
5   NA    a
2   D     x
what I get (worst solution):
ID  VAL1  VAL2
1   A     g
2   D     k
4   A     a
5   NA    c
5   NA    d
5   NA    a
2   D     x
From your example it looks like what you need is a full many-to-many join, based on the IDs existing in dataset A. You can get this by creating a full Cartesian product of the two datasets, using dataset A as the first (left) dataset.
The following syntax assumes you have the STATS CARTPROD extension command installed. If you don't, you can see here how to install it.
First I'll recreate your example to demonstrate on:
dataset close all.
data list list/id1 vl1 (2F3) .
begin data
1 232
1 433
2 456
3 246
4 468
end data.
dataset name aaa.
data list list/id2 vl2 (2F3) .
begin data
1 111
2 222
4 333
5 444
5 555
5 666
2 777
3 888
end data.
dataset name bbb.
Now the actual work is fairly simple:
DATASET ACTIVATE aaa.
STATS CARTPROD VAR1=id1 vl1 INPUT2=bbb VAR2=id2 vl2
/SAVE OUTFILE="C:\somepath\yourcartesianproduct.sav".
* The new dataset now contains all possible combinations of rows in the two datasets.
* we will select only the relevant combinations, where the two ID's match.
select if id1=id2.
exe.
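For comparison only (this is not part of the SPSS syntax above): if the same two tables were R data frames A and B (hypothetical names) with columns ID/VAL1 and ID/VAL2, the two desired results would correspond to standard merge() calls, which may help translate between the tools.

merge(A, B, by = "ID", all.x = TRUE)      # "best solution": many-to-many match, unmatched IDs from A kept with NA
merge(A_dedup, B, by = "ID", all = TRUE)  # "second best": full outer join after removing the duplicated row (A_dedup is the de-duplicated A)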
I am trying to query a table in our MySQL database using the DBI R package. However, I need to pull the fields from the table by changing the date field on a monthly basis and limiting it to 1.
I'm having trouble with the looping and the SQL query text. I would like to create a loop that changes the date (monthly) and then pastes it into a database query that pulls all the data matching that month's conditions.
This is my code so far:
for (i in seq(0,12,1)){
results <- dbGetQuery(myDB, paste("SELECT * FROM cost_and_price_period WHERE start_date <=", '01-[[i]]-2019'))
}
The main issue is that R doesn't support post-increment operators like ++, so I know I could just make 12 individual queries and then rbind them, but I would prefer to do one efficient query. Does anyone have any ideas?
The solution below should give you an idea of how to proceed with your problem.
DummyTable
id names dob
1 1 aa 2018-01-01
2 2 bb 2018-02-01
3 3 cc 2018-03-01
4 4 dd 2018-04-01
5 5 ee 2018-05-01
6 6 ff 2018-06-01
7 7 gg 2018-07-01
8 8 hh 2018-08-01
9 9 ii 2018-09-01
10 10 jj 2018-10-01
11 11 kk 2018-11-01
12 12 ll 2018-12-01
13 13 ll 2018-12-01
Imagine we have the above table in MySQL. We need to access the records for the 1st day of every month and store them together as a data frame.
### Using a for loop, as in your question
n <- 12
df <- vector("list", n)
for (i in seq_len(n)) {
  # i is the month number; sprintf() zero-pads it so the date literal matches '2018-01-01', '2018-02-01', ...
  df[[i]] <- dbGetQuery(pool, sprintf("SELECT * FROM dummyTable WHERE dob = '2018-%02d-01';", i))
}
df <- do.call(rbind, df)
### Using lapply (preferred way)
df <- lapply(seq_len(12), function(x) {
  dbGetQuery(pool, sprintf("SELECT * FROM dummyTable WHERE dob = '2018-%02d-01';", x))
})
df <- do.call(rbind, df)
The resulting df data frame will then contain the matched records from MySQL.
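A small variation on the same idea (a sketch, not part of the original answer): DBI also supports parameterized queries, which avoids pasting values into the SQL string by hand. Assuming the same pool/connection object and table:

dates <- sprintf("2018-%02d-01", 1:12)   # first day of each month
df <- do.call(rbind, lapply(dates, function(d) {
  dbGetQuery(pool, "SELECT * FROM dummyTable WHERE dob = ?", params = list(d))
}))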
What would be an efficient way to count the number of rows of an SQL table accessed through dplyr? An MWE using SQLite is below, but I use PostgreSQL and have the same issue. Basically, dim() is not very consistent. I used
dim()
This works for a table referenced directly from the database (first case), but is not consistent when I create a tbl from an SQL query for the same table (second case). My tables have millions of rows, but I see this even with a small table of 1000 rows. I get NA or ??. Is there anything I am missing?
#MWE
test_db <- src_sqlite("test_db.sqlite3", create = T)
library(nycflights13)
flights_sqlite <- copy_to(test_db, flights, temporary = FALSE, indexes = list(
c("year", "month", "day"), "carrier", "tailnum"))
flights_postgres <- tbl(test_db, "flights")
First case (table from direct schema)
flights_postgres
> flights_postgres
Source: postgres 9.3.5 []
From: flights [336,776 x 16]
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
1 2013 1 1 517 2 830 11 UA N14228 1545 EWR IAH 227 1400 5 17
2 2013 1 1 533 4 850 20 UA N24211 1714 LGA IAH 227 1416 5 33
#using dim()
> dim(flights_postgres)
[1] 336776 16
The above works and gets the count of the number of rows.
Second case (table from SQL query)
## use the flights schema above but can also be used to create other variables (like lag, lead) in run time
flight_postgres_2 <- tbl(test_db, sql("SELECT * FROM flights"))
>flight_postgres_2
Source: postgres 9.3.5 []
From: <derived table> [?? x 16]
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
1 2013 1 1 517 2 830 11 UA N14228 1545 EWR IAH 227 1400 5 17
2 2013 1 1 533 4 850 20 UA N24211 1714 LGA IAH 227 1416 5 33
>
> dim(flight_postgres_2)
[1] NA 16
As you can see, it prints either ?? or NA, so it is not very helpful.
I got around this by either using collect() or converting the output to a data frame with as.data.frame() in order to check the dimensions. But these two methods may not be ideal solutions, given the time they can take for a larger number of rows.
I think the answer is what #alistaire suggests: Do it in the database.
> flight_postgres_2 %>% summarize(n())
Source: sqlite 3.8.6 [test_db.sqlite3]
From: <derived table> [?? x 1]
n()
(int)
1 336776
.. ...
Asking dim to do this would be having your cake (lazy evaluation of SQL with dplyr, keeping data in the database) and eating it too (having full access to the data in R).
Note that this is doing #alistaire's approach underneath:
> flight_postgres_2 %>% summarize(n()) %>% explain()
<SQL>
SELECT "n()"
FROM (SELECT COUNT() AS "n()"
FROM (SELECT * FROM flights) AS "zzz11") AS "zzz13"
<PLAN>
selectid order from detail
1 0 0 0 SCAN TABLE flights USING COVERING INDEX flights_year_month_day
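If you want the count back in R as a plain number rather than as a remote one-row table, one option (a sketch, assuming a reasonably current dplyr) is to compute it in the database and collect just that single value:

n_rows <- flight_postgres_2 %>%
  summarise(n = n()) %>%   # COUNT(*) runs in the database
  collect() %>%            # bring back only the one-row result
  pull(n)                  # extract it as a plain integer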
It often happens that data will be given to you with wrapped columns. Consider, for example:
CCY Decimals CCY Decimals CCY Decimals
AUD/CAD 5 EUR/CZK 4 GBP/NOK 5
AUD/CHF 5 EUR/DKK 5 GBP/NZD 5
AUD/DKK 5 EUR/GBP 5 GBP/PLN 5
AUD/JPY 3 EUR/HKD 5 GBP/SEK 5
AUD/NOK 5 EUR/HUF 3 GBP/SGD 5
...
Which should be parsed as a dataframe of two columns (CCY and Decimals), not six. My question is, what is the most idiomatic way of achieving this?
I would have wanted something like the following:
data = pd.read_csv("file.csv")
data.groupby(axis=1,by=data.columns.map(lambda s: s.replace("\..",""))).\
apply(lambda df : df.values.flatten())
When reading the csv file we end up with columns CCY,Decimals,CCY.1,Decimals.1 .. etc. The groupby operation returns a collection of data frames:
<pandas.core.groupby.DataFrameGroupBy object at 0x3a52b10>
Which we would then flatten using numpy functionality. So we would be converting DataFrames with repeating columns into Series, and then merging these into a resulting DataFrame.
However, this doesn't work. I've tried passing different key arguments to groupby, but it always complains about being unable to reindex non-unique columns.
There are a number of existing questions that deal with flattening groups of columns (e.g. "Flattening" output of group.nth in Pandas), but I can't find any that do this for repeating columns.
To use groupby, I'd do:
>>> groups = df.groupby(axis=1,by=lambda x: x.rsplit(".",1)[0])
>>> pd.DataFrame({k: v.values.flat for k,v in groups})
CCY Decimals
0 AUD/CAD 5
1 EUR/CZK 4
2 GBP/NOK 5
3 AUD/CHF 5
4 EUR/DKK 5
5 GBP/NZD 5
6 AUD/DKK 5
7 EUR/GBP 5
8 GBP/PLN 5
9 AUD/JPY 3
10 EUR/HKD 5
11 GBP/SEK 5
12 AUD/NOK 5
13 EUR/HUF 3
14 GBP/SGD 5
[15 rows x 2 columns]
and then sort.
I am working on NGSim traffic data, with 18 columns and 1180598 rows in a text file. I want to smooth the position data in the column 'Local Y'. I know there are built-in functions for data smoothing in R, but none of them seem to match the formula I am required to apply. The data in the text file looks something like this:
Index VehicleID Total_Frames Local Y
1 2 5 35.381
2 2 5 39.381
3 2 5 43.381
4 2 5 47.38
5 2 5 51.381
6 4 8 504.828
7 4 8 508.325
8 4 8 512.841
9 4 8 516.338
10 4 8 520.854
11 4 8 524.592
12 4 8 528.682
13 4 8 532.901
14 5 7 39.154
15 5 7 43.153
16 5 7 47.154
17 5 7 51.154
18 5 7 55.153
19 5 7 59.154
20 5 7 63.154
The above columns are just an example taken from the original file. Here you can see 3 vehicles, with vehicle IDs 2, 4 and 5, but in fact there are 2169 vehicles with different IDs. The Total_Frames column tells us how many times each vehicle ID is repeated in the VehicleID column; for example, in the table above, vehicle ID 2 is repeated 5 times, hence the '5' in the Total_Frames column. The following is the formula I am required to apply to remove data noise (smoothing) from the column 'Local Y':
Smoothed Local Y(i) = ( Σ_{k=i-D}^{i+D} Local Y(k) * exp(-|i-k| / delta) ) / ( Σ_{k=i-D}^{i+D} exp(-|i-k| / delta) )
where,
i = index #
delta = 5
D = 15
I have tried the built-in functions I know of, but they don't smooth the data as required. My question is: is there any built-in function in R which can do the smoothing according to this formula, or which could take the formula as an argument? I need to apply the formula to every value of Local Y, using the 15 values before and the 15 values after it (i-D to i+D) for the same vehicle ID. Can anyone give me an idea of how to approach the problem? Thanks in advance.
You can place your formula in a function and then use R's apply functions to apply it to the elements of the 'Local Y' column of your data frame, for example as sketched below.
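A rough sketch of that approach (my own illustration, not tested against the NGSim data; the column names df$Vehicle_ID and df$Local_Y are assumptions, and the window is simply clipped at the ends of each vehicle's series since the formula does not say how to handle the boundaries):

smooth_vehicle <- function(y, D = 15, delta = 5) {
  sapply(seq_along(y), function(i) {
    k <- max(1, i - D):min(length(y), i + D)   # indices i-D .. i+D, clipped at the ends
    w <- exp(-abs(i - k) / delta)              # exponential weights
    sum(y[k] * w) / sum(w)                     # weighted (smoothed) value
  })
}

# apply the smoother separately within each vehicle ID
df$Local_Y_smoothed <- ave(df$Local_Y, df$Vehicle_ID, FUN = smooth_vehicle)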