Post-increment date field in MySQL query using R

I am trying to query a table in our MySQL database using the DBI R package. However, I need to pull the fields from the table by changing the date field on a monthly basis and limiting it to 1.
I'm having trouble with the looping and the SQL query text. I would like to create a loop that changes the date (monthly) and then prints it into a database query that pulls all the data matching that month's conditions.
This is my code so far:
for (i in seq(0,12,1)){
  results <- dbGetQuery(myDB, paste("SELECT * FROM cost_and_price_period WHERE start_date <=", '01-[[i]]-2019'))
}
The main issue is that R doesn't support post-increment operators like ++. I know I could just make 12 individual queries and then rbind them, but I would prefer one efficient query. Does anyone have any ideas?

The solution below could give you an idea of how to proceed with your problem.
DummyTable
id names dob
1 1 aa 2018-01-01
2 2 bb 2018-02-01
3 3 cc 2018-03-01
4 4 dd 2018-04-01
5 5 ee 2018-05-01
6 6 ff 2018-06-01
7 7 gg 2018-07-01
8 8 hh 2018-08-01
9 9 ii 2018-09-01
10 10 jj 2018-10-01
11 11 kk 2018-11-01
12 12 ll 2018-12-01
13 13 ll 2018-12-01
Imagine we have the above table in MySQL. We then need to access the records for the 1st day of every month and store them all as a data frame.
### Using a for loop, as in your question
n <- 12
df <- vector("list", n)
for (i in 1:n){
  # i corresponds to the month number in each iteration
  df[[i]] <- dbGetQuery(pool, paste0("SELECT * FROM dummyTable WHERE dob = '2018-", i, "-01';"))
}
df <- do.call(rbind, df)
### Using lapply (preferred way)
n <- 1:12
df <- lapply(n, function(x){
  dbGetQuery(pool, paste0("SELECT * FROM dummyTable WHERE dob = '2018-", x, "-01';"))
})
df <- do.call(rbind, df)
The resulting df data frame will contain the matched records from MySQL.
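Since the question specifically asks for a single efficient query, here is a minimal sketch of that alternative (my addition, assuming the same pool connection and dummyTable as above): build the twelve dates in R and match them all with one WHERE ... IN clause.
# Sketch: one round trip instead of twelve, matching the 1st of every month in 2018.
dates <- sprintf("'2018-%02d-01'", 1:12)  # '2018-01-01', ..., '2018-12-01'
query <- paste0("SELECT * FROM dummyTable WHERE dob IN (", paste(dates, collapse = ", "), ");")
df <- dbGetQuery(pool, query)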

Related

Best way to plot by hour by day by month in R

Currently I have created the following data frame in R, but I am having trouble with my visualisation.
The data frame looks as follows:
date weekday dayhour amount
2017-06 0 1 100
2017-06 0 2 200
2017-06 0 3 150
2017-06 0 4 600
2017-06 0 5 75
....
2018-06 6 21 60
2018-06 6 22 90
2018-06 6 23 150
2018-06 6 24 110
The amount is the average for that weekday and hour in that month. So, for example, in June 2017 the first hour of each Monday has an average amount of 100.
Now the idea is to plot the data in R as several graphs showing the data by hour by weekday for a given month: 12 plots, each with the amount on the y axis and the hour+weekday on the x axis.
I have tried several approaches, such as looping through the months and plotting them with par(mfrow = c(2,6)), and plotting them one by one. However, I am still a rookie with R and I can't find any good documentation or tutorial on how to do this. So far I have only been able to stack the data points in one loop by weekday, and not by hour, by doing the following on the dataset without hours included yet:
increase = 7
for (i in (length(occupancy_by_day)/7)) {
  data = head(occupancy_by_day, increase:increase+increase)
  plot(average_occupancy ~ Weekday, data=data)
  increase = increase + 7
}
My closest guess to the correct answer at this moment is something like this:
par(mfrow = c(2,6))
increase = 06
for (i in (length(occupancy_by_day)/30.5)) {
  data = occupancy_by_day[occupancy_by_day$date == paste(c('2017-',increase)), ]
  plot(amount ~ weekday, data=data)
  increase = increase + 1
}
This gives me the error:
Error in plot.window(...) : need finite 'xlim' values
Does anyone know a good solution to plotting the data in R?
Thanks in advance for any help/comments!
EDIT:
The priority in this post is how to plot the data by hour by weekday. I could iterate through the months manually, but I would still need to plot them; a loop for each month would be an added bonus. Right now I have the following:
data =occupancy_by_day[occupancy_by_day$date == '2017-06', ]
plot(Amount ~ weekday+dayhour, data=data)
This sadly only plots the data by dayhour.
ADDED DRAWING OF CONCEPT:
https://imgur.com/qKFbbmJ
ANSWER:
Eventually I did a little workaround and plotted them with:
ggplot(data = data[data$date == '2017-12', ], aes(plotstamp, Amount, group=Weekday, col=Weekday)) +
geom_line() +
geom_point() +
ggtitle("December 2017")
The plotstamp is an extra column/index I added to my data frame, which allowed me to plot the values continuously. Then I just plotted each month separately.
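For reference (my addition, not part of the original answer): assuming the column names from the ggplot call above (the question's table shows them lowercase, so adjust as needed), such a plotstamp column can be built as a continuous within-week index, and ggplot2's facet_wrap is a way to get all monthly panels in one call instead of plotting each month separately.
library(ggplot2)

# Sketch: a continuous within-week index from weekday and hour,
# then one panel per month instead of twelve separate plots.
occupancy_by_day$plotstamp <- occupancy_by_day$Weekday * 24 + occupancy_by_day$dayhour

ggplot(occupancy_by_day, aes(plotstamp, Amount, group = Weekday, col = Weekday)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ date)  # date holds "YYYY-MM" values, so this yields one panel per month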
Make similar data
I think this is the partial solution you asked for in your edit (if I understand your task correctly), but I believe you can loop through the months in the same way.
The only way I could think of was to transform the dates you have into a date-time class. I used some prepared date data, but you can fix yours using the strptime() and paste() commands to match mine. Also, the data I made covers only two days.
date1 <- c(rep("2017-06-1",24),rep("2017-06-2",24))
weekday <- c(rep(0,24),rep(1,24))
dayhour <- c(1:24,1:24)
# Add dayhour to date
date <- paste(date1, dayhour, sep = " ")
date <- strptime(date, "%Y-%m-%d %H")
amount <- c(1:24,(48:25)*2)
dat <- data.frame(date,weekday,dayhour,amount)
View(dat)
plot(x=dat$date, y=dat$amount)
This is what the created data looks like.
date weekday dayhour amount
1 2017-06-01 01:00:00 0 1 1
2 2017-06-01 02:00:00 0 2 2
3 2017-06-01 03:00:00 0 3 3
4 2017-06-01 04:00:00 0 4 4
....
46 2017-06-02 22:00:00 1 22 54
47 2017-06-02 23:00:00 1 23 52
48 2017-06-03 00:00:00 1 24 50
Loop for the plot.
If you write this in an R Markdown document you will get nice pages for each plot, so you don't have to use par(mfrow = c(1,2)). You probably also need to adjust the loop arguments to fit your data.
par(mfrow = c(1,2))
start <- 1
end <- 24
step <- 0
for (i in 1:(length(dat$date)/24)) {
  data <- dat[(start+step) : (end+step), ] # The parentheses at (start+step) and (end+step) are important!
  plot(x = data$date, y = data$amount)
  step <- step + 24 # advance one full day (24 rows) per iteration
}
I hope this helps you.
P.S. This is the first answer I have written, so feel free to edit or improve it.

Group by / Summing values with the same column value

I am trying to solve a problem that looks like the code written below, but from a lack of knowledge, and despite reading through the SQLAlchemy documentation, I have not found a solution to my problem yet.
Objective:
Get the summed value of sales_in_usd where the year in the year column is the same.
What I have so far, from debugging and reading a bit through Stack Overflow, the documentation, and Google, is the following query:
session.query(fact_corporate_sales, Company, Sales,
Time, Segment, func.sum(Sales.sales_in_usd).label('summary')).\
join(Sales).\
join(Time).\
join(Company).\
join(Segment).\
order_by(Time.year.desc()).\
filter(Company.company_name.like(filtered)).\
group_by(fact_corporate_sales.fact_cps_id, Company.company_name,fact_corporate_sales.cps_id).\
all()
The fact_cps_id is unique in the fact table, and the same table stores the keys of the dimension tables as well:
I have a fact table which stores 4 foreign keys from 4 dimension tables.
fact_cps_id company_id sales_id time_id segment_id
1 4 2 1 2
2 4 1 1 3
3 4 3 2 1
4 4 2 2 4
5 4 4 3 2
6 4 99 1 1
dim_company
company_id company_name
1 Nike
2 Adidas
3 Puma
4 Reebok
dim_segment
segment_id segment_nom
1 basketball
2 running
3 soccer
4 watersports
dim_time
time_id quarter year
1 1 2013
2 2 2013
3 1 2014
4 3 2014
dim_sales
sales_id sales_in_euro
1 2000
2 3200
3 1400
4 1590
.. ..
99 1931
So basically, as you can see in the tables and the query, what I was trying to do was sum up all sales from the same year (for example, dim_time.year).
If we look into the fact table, we can see that time_id = 1 appears three times here, so those values could be summed up and displayed as a summary.
I know from standard SQL that this is possible using GROUP BY and the aggregate SUM function.
My result (the time_id notes are only for reference and were not part of the output):
13132.0 <- time_id = 1
21201.0 <- time_id = 2
23923.0 <- time_id = 1
31232.0 <- time_id = 99
32021.0 <- time_id = 2
32342.0 <- time_id = 1
131231.0 <- time_id = 4
I printed the actual query to the console and got this [I had to remove .all(), because 'list' has no attribute called 'statement']:
SELECT fact_corporate_sales.cps_fact_id, fact_corporate_sales.cps_id,
fact_corporate_sales.company_id, fact_corporate_sales.time_id, fact_corporate_sales.segment_id, sum(dim_corporate_sales.sales_in_usd) AS summary
FROM fact_corporate_sales INNER JOIN dim_corporate_sales ON dim_corporate_sales.cps_id = fact_corporate_sales.cps_id INNER JOIN dim_time ON dim_time.time_id = fact_corporate_sales.time_id INNER JOIN dim_company ON dim_company.company_id = fact_corporate_sales.company_id INNER JOIN dim_segment ON dim_segment.segment_id = fact_corporate_sales.segment_id
WHERE dim_company.company_name LIKE %s GROUP BY fact_corporate_sales.cps_fact_id ORDER BY dim_time.year DESC
And if I want to group by, for example, dim_time.year only, I get the following response from MySQL in the console:
Error Code: 1055. Expression #1 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'db.fact_corporate_sales.fact_cps_id' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by
The only solution was to execute the following SQL:
engine.execute("SET sql_mode='';")
As the response to my failed query was:
"this is incompatible with sql_mode=only_full_group_by"
I had to disable that sql_mode, and after doing so I got my result.
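A stricter alternative (my note, not from the original post) is to keep only_full_group_by enabled and select only the grouping column plus aggregates. As a hedged sketch in raw SQL, using the table and column names from the generated query above and run here through R's DBI to match the rest of this page (con is an assumed open connection):
# Sketch only: every selected column is either grouped or aggregated,
# so this satisfies only_full_group_by.
yearly <- dbGetQuery(con, "
  SELECT dim_time.year, SUM(dim_corporate_sales.sales_in_usd) AS summary
  FROM fact_corporate_sales
  INNER JOIN dim_corporate_sales ON dim_corporate_sales.cps_id = fact_corporate_sales.cps_id
  INNER JOIN dim_time ON dim_time.time_id = fact_corporate_sales.time_id
  GROUP BY dim_time.year
  ORDER BY dim_time.year DESC
")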

Count number of rows when using dplyr to access sql table/query

What would be an efficient way to count the number of rows when using dplyr to access an SQL table? The MWE below uses SQLite, but I use PostgreSQL and have the same issue. Basically dim() is not very consistent. I used
dim()
This works for a table referenced directly in the database (first case), but is not very consistent when I create a tbl from an SQL query on the same table (second case): I get NA or ??. My real row counts are in the millions, but I see this even with a small table of 1000 rows. Is there anything I am missing?
#MWE
test_db <- src_sqlite("test_db.sqlite3", create = T)
library(nycflights13)
flights_sqlite <- copy_to(test_db, flights, temporary = FALSE, indexes = list(
c("year", "month", "day"), "carrier", "tailnum"))
flights_postgres <- tbl(test_db, "flights")
First case (tbl created from a table name)
flights_postgres
> flights_postgres
Source: postgres 9.3.5 []
From: flights [336,776 x 16]
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
1 2013 1 1 517 2 830 11 UA N14228 1545 EWR IAH 227 1400 5 17
2 2013 1 1 533 4 850 20 UA N24211 1714 LGA IAH 227 1416 5 33
#using dim()
> dim(flights_postgres)
[1] 336776 16
The above works and gets the count of the number of rows.
Second case (tbl created from an SQL query)
## use the flights schema above but can also be used to create other variables (like lag, lead) in run time
flight_postgres_2 <- tbl(test_db, sql("SELECT * FROM flights"))
>flight_postgres_2
Source: postgres 9.3.5 []
From: <derived table> [?? x 16]
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
1 2013 1 1 517 2 830 11 UA N14228 1545 EWR IAH 227 1400 5 17
2 2013 1 1 533 4 850 20 UA N24211 1714 LGA IAH 227 1416 5 33
>
> dim(flight_postgres_2)
[1] NA 16
As you can see, it prints either ?? or NA, so it is not very helpful.
I got around this by using collect(), or by converting the output to a data frame using as.data.frame(), to check the dimensions. But these two methods may not be the ideal solution, given the time they may take for larger numbers of rows.
I think the answer is what @alistaire suggests: do it in the database.
> flight_postgres_2 %>% summarize(n())
Source: sqlite 3.8.6 [test_db.sqlite3]
From: <derived table> [?? x 1]
n()
(int)
1 336776
.. ...
Asking dim to do this would be having your cake (lazy evaluation of SQL with dplyr, keeping data in the database) and eating it too (having full access to the data in R).
Note that this is doing @alistaire's approach underneath:
> flight_postgres_2 %>% summarize(n()) %>% explain()
<SQL>
SELECT "n()"
FROM (SELECT COUNT() AS "n()"
FROM (SELECT * FROM flights) AS "zzz11") AS "zzz13"
<PLAN>
selectid order from detail
1 0 0 0 SCAN TABLE flights USING COVERING INDEX flights_year_month_day
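As a side note (my addition, not from the original answers): dplyr also provides tally() as a shorthand for summarize(n()); it is likewise translated to SQL, so the count still happens in the database:
flight_postgres_2 %>% tally()  # same in-database COUNT as summarize(n())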

Cannot print out the latest results of table

I have the following table:
NAMES:
Fname | stime | etime | Ver | Rslt
x 4 5 1.01 Pass
x 8 10 1.01 Fail
x 6 7 1.02 Pass
y 4 8 1.01 Fail
y 9 10 1.01 Fail
y 11 12 1.01 Pass
y 10 14 1.02 Fail
m 1 2 1.01 Fail
m 4 6 1.01 Fail
The result I am trying to output is:
x 8 10 1.01 Fail
x 6 7 1.02 Pass
y 11 12 1.01 Pass
y 10 14 1.02 Fail
m 4 6 1.01 Fail
What the result means:
Fname values are examples of tests that are run. Each test was run on different software platforms (the version numbers), and some tests were run on the same platform twice: passing the first time and failing the second, or vice versa. My required output is basically the latest result of each test for each version. So the results above are unique by their combination of Fname and Ver(sion), and each is the row with the latest etime within its group.
The query I have so far is:
select Fname,stime,max(etime),ver,Rslt from NAMES group by Fname,Rslt;
This, however, does not give me the required output.
The output I get is (wrong):
x 4 10 1.01 Fail
x 6 7 1.02 Pass
y 4 12 1.01 Pass
y 10 14 1.02 Fail
m 1 6 1.01 Fail
Basically it takes the max time, but it does not print the correct data: alongside the max time it prints the initial time of the whole group instead of the initial time of the particular record that has that max time.
I have tried for so long to fix this, but I seem to be getting nowhere. I have a feeling there is a join needed somewhere in here, but I tried that too, with no luck.
Any help is appreciated,
Thank you.
Use a subquery to get the max ETime by FName and Ver, then join your main table to it:
SELECT
NAMES.FName,
NAMES.STime,
NAMES.ETime,
NAMES.Ver,
NAMES.Rslt
FROM NAMES
INNER JOIN (
SELECT FName, Ver, MAX(ETime) AS MaxETime
FROM NAMES
GROUP BY FName, Ver
) T ON NAMES.FName = T.FName AND NAMES.Ver = T.Ver AND NAMES.ETime = T.MaxETime
You could first find the latest time, max(etime), for each test and each version:
select Fname,Ver,max(etime) from NAMES group by Fname,Ver;
From there you can display the whole rows by joining it back:
select *
from
NAMES
inner join
(select Fname,Ver,max(etime) as etime from NAMES group by Fname,Ver ) sub1
using (Fname,Ver,etime)
order by fname,Ver;
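As a further option (my addition, not from the original answers): on MySQL 8.0+ a window function avoids the self-join. A hedged sketch, shown via R's DBI with an assumed open connection con to match the rest of this page:
# Sketch: rank runs within each (Fname, Ver) group by descending etime
# and keep only the latest one.
latest <- dbGetQuery(con, "
  SELECT Fname, stime, etime, Ver, Rslt
  FROM (
    SELECT n.*, ROW_NUMBER() OVER (PARTITION BY Fname, Ver ORDER BY etime DESC) AS rn
    FROM NAMES n
  ) ranked
  WHERE rn = 1
  ORDER BY Fname, Ver
")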

Extracting data from data frame in R

I am very new to R (and computer programming in general) and am working on a bioinformatics project. I made a MySQL database and connected to it from R using RMySQL. From there I issued queries to select a certain field from a table, fetched the data, and made it into a data frame in R, as seen below:
> rs = dbSendQuery(con, "select mastitis_no from experiment")
> data = fetch(rs, n=-1)
> data
mastitis_no
1 5
2 2
3 8
4 6
5 2
....
> rt = dbSendQuery(con, "select BMSCC from experiment")
> datas = fetch(rt, n=-1)
> datas
BMSCC
1 14536
2 10667
3 23455
4 17658
5 14999
....
> ru = dbSendQuery(con, "select cattle_hygiene_score_avg from experiment")
> dat = fetch(ru, n=-1)
> dat
cattle_hygiene_score_avg
1 1.89
2 1.01
3 1.21
4 1.22
5 1.93
....
My first two data frames contain integers and my third contains decimals.
> cor(data, datas)
BMSCC
mastitis_no 0.8303017
> cor.test(data, datas)
Error in cor.test.default(data, datas) : 'x' must be a numeric vector
Therefore I accessed the data inside those data frames using the usual list indexing operator $; however, the decimal data frame did not work, as noted below.
> data$mastitis
[1] 5 2 8 6 2 0 5 6 7 3 0 1 0 3 2 2 0 5 2 1
> datas$BMSCC
[1] 14536 10667 23455 17658 14999 5789 18234 22390 19069 13677 13536 11667 13455
[14] 17678 14099 15789 8234 21390 16069 13597
> dat$hygiene
NULL
By doing this I am able to perform a Spearman rank correlation test and scatter plot on the first two data frames, but not on the decimal data frame. Any suggestion on what I need to do? I am sure the answer is quite simple, but I cannot find the code necessary for this simple task. Any help would be much appreciated.
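This thread ends without a posted answer, so as an editorial note: the likely cause is that $ does partial matching on the prefix of a column name. data$mastitis works because "mastitis" is a unique prefix of mastitis_no, but "hygiene" is not a prefix of cattle_hygiene_score_avg, so dat$hygiene returns NULL. A minimal sketch of the fix, using the objects from the question:
dat$cattle_hygiene_score_avg       # the full column name always works
dat[["cattle_hygiene_score_avg"]]  # [[ ]] avoids relying on partial matching

# cor.test() and plot() want numeric vectors, not one-column data frames:
cor.test(data$mastitis_no, datas$BMSCC, method = "spearman")
plot(data$mastitis_no, datas$BMSCC)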