How do I download a table from the CDC website as a CSV file?

I am trying to use BRFSS Data from the CDC in R. In particular, I am trying to read the 2014-2018 data into separate dataframes (step 1 complete), add column titles to the dataframes (what I'm working on), and combine all years into one dataframe.
The column titles are not in the ASC data file, but they are on this website in an HTML table:
https://www.cdc.gov/brfss/annual_data/2017/llcp_varlayout_17_onecolumn.html
How can I take the table from this website and download it as a CSV file?
P.S. This is the code I am trying to replicate in order to use the data (if anyone uses BRFSS data and has a better way, let me know): https://michaelminn.net/tutorials/r-brfss/ The author already created a CSV of column titles that he uses, but it is for a different year, so I can't use it, and he doesn't give instructions.

You can use rvest to read the table into a data frame:
library(rvest)
url <- "https://www.cdc.gov/brfss/annual_data/2017/llcp_varlayout_17_onecolumn.html"
data <- read_html(url) %>%
html_element(xpath="//main//table") %>%
html_table()
data
#> # A tibble: 358 × 3
#> `Starting Column` `Variable Name` `Field Length`
#> <int> <chr> <int>
#> 1 1 _STATE 2
#> 2 17 FMONTH 2
#> 3 19 IDATE 8
#> 4 19 IMONTH 2
#> 5 21 IDAY 2
#> 6 23 IYEAR 4
#> 7 32 DISPCODE 4
#> 8 36 SEQNO 10
#> 9 36 _PSU 10
#> 10 63 CTELENM1 1
#> # … with 348 more rows
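To get from the scraped table to a CSV file, base R's write.csv should be enough. The sketch below uses a small stand-in data frame (the first three rows of the output above) instead of the live page, and the file name is hypothetical; with the real result you would pass `data` instead of `layout`.

```r
# Stand-in for the scraped layout table (first rows of the output above)
layout <- data.frame(
  `Starting Column` = c(1L, 17L, 19L),
  `Variable Name`   = c("_STATE", "FMONTH", "IDATE"),
  `Field Length`    = c(2L, 2L, 8L),
  check.names = FALSE
)

# Hypothetical output path; a temp directory is used so the sketch runs anywhere
csv_path <- file.path(tempdir(), "brfss_2017_layout.csv")
write.csv(layout, csv_path, row.names = FALSE)

# Read it back; check.names = FALSE keeps the headers with spaces intact
layout_back <- read.csv(csv_path, check.names = FALSE)
```

Setting row.names = FALSE avoids an extra unnamed index column in the file.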

Related

Add percent labels to a stacked bar graph of counts (no y variable in aes)

My data: each row is a participant (let's call it: pID) in my study. They all answered a question which could take response_values (Q_RV) of 1,2,3,4 or 5. Each participant is also labelled by health status (S) (1, 2, or 3).
data looks something like this:
#> # A tibble: 8 x 3
#> pID Q_RV S
#> <fct> <fct> <int>
#> 1 1 1
#> 2 1 1
#> 3 3 1
#> 4 3 2
#> 5 1 2
#> 6 2 1
#> 7 4 3
#> 8 5 1
I've made a stacked bar graph using counts of each response value, and filled each bar by health status:
plot <- ggplot(data, aes (x=Q_RV, fill=S)) + [other stuff to make the plot look nice]
and I get this:
[Figure: plot showing counts for each response value]
Now, I'd love to add a percent label above each bar that shows the percent of responses that had each value. In other words, over the far left bar, it should be roughly 75.5%
How do I do it? Every question I've looked at uses a y argument in the aes....
Edit:
Found the answer here:
Adding percentage labels to a barplot with y-axis count
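For readers who land here, one way to do this (not necessarily the approach in the linked answer) is to count the responses once outside ggplot and pass the totals to geom_text as a separate data frame. The data frame `survey` and the helper table `totals` below are hypothetical stand-ins shaped like the data in the question.

```r
library(ggplot2)

# Hypothetical stand-in data: one row per participant
survey <- data.frame(
  pID  = factor(1:8),
  Q_RV = factor(c(1, 1, 3, 3, 1, 2, 4, 5)),
  S    = factor(c(1, 1, 1, 2, 2, 1, 3, 1))
)

# Total responses per value, plus a percent-of-all-responses label
totals <- as.data.frame(table(Q_RV = survey$Q_RV), responseName = "n")
totals$label <- sprintf("%.1f%%", 100 * totals$n / sum(totals$n))

# Bars still use counts with no y aesthetic; only the label layer gets y = n
p <- ggplot(survey, aes(x = Q_RV)) +
  geom_bar(aes(fill = S)) +
  geom_text(data = totals, aes(y = n, label = label), vjust = -0.5)
```

Keeping fill inside geom_bar means the text layer does not inherit it, so there is one label per bar and not one per fill segment.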

Rvest cannot find eq tags

I am currently using R to capture a table of columns. Using Rvest as well as finding its css selector, I am able to extract most of them using the html_nodes or html_table function. However, on some, when the css selector includes "eq(somenumber)", I am not able to extract the data. From what I know this eq tag has something to do with Java, but was wondering if there is a way I can use Rvest to get these tags or if there is another package I can do that.
To get the complete table from the link, you can use:
library(rvest)
url <- 'https://www.ancestry.com/search/collections/62096/?count=50&marriage=1910&marriage_x=0-0-0'
result <- url %>% read_html %>% html_table() %>% .[[1]]
result
# `View Record` Name `Marriage Date` `Marriage Place` `Certificate Number` `View Images`
# <chr> <chr> <chr> <chr> <chr> <lgl>
# 1 View Record Mary Cordey year Hall certificate number NA
# 2 View Record Ralph W Craddock year Douglas certificate number NA
# 3 View Record Charles Courtney year Otoe certificate number NA
# 4 View Record Bessie A Crile year Saline certificate number NA
# 5 View Record Guy Crane year Douglas certificate number NA
# 6 View Record Storpha L Crow year Douglas certificate number NA
# 7 View Record Ernestine Crabtree year Lancaster certificate number NA
# 8 View Record Oscar C Croft year York certificate number NA
# 9 View Record Ansil B Crabill year Webster certificate number NA
#10 View Record Belva M Craig year Merrick certificate number NA
# … with 40 more rows
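On the `eq` part of the question: `:eq(n)` is a jQuery extension, not a standard CSS selector, which is why rvest does not accept it. A rough sketch of the usual substitutes follows, using a hypothetical inline page. Note that `:eq()` is zero-based and applies to the whole matched set, while `:nth-child()` and XPath positions are one-based and relative to the parent, so the translation depends on how the original selector was built.

```r
library(rvest)

# Hypothetical inline page standing in for a real site
page <- read_html("<table>
  <tr><td>a1</td><td>b1</td><td>c1</td></tr>
  <tr><td>a2</td><td>b2</td><td>c2</td></tr>
</table>")

# Standard CSS: second cell of every row (one-based, relative to the parent row)
second_col_css <- html_text(html_elements(page, "tr td:nth-child(2)"))

# XPath equivalent, also one-based
second_col_xpath <- html_text(html_elements(page, xpath = "//tr/td[2]"))

# Closest match to jQuery's td:eq(1): select all cells, then index in R
cells <- html_elements(page, "td")
second_cell <- html_text(cells[[2]])
```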

Post Increment date field in mySQL query using R

I am trying to query a table in our mySQL database using the DBI R package. However, I need to pull the fields from the table by changing the date field on a monthly basis and limiting it to 1.
I'm having trouble with the looping and sql query text. I would like to create a loop that changes the date (monthly) and then prints that to a database query that will then pull all the data that matches the monthly conditions.
This is my code so far:
for (i in seq(0,12,1)){
results <- dbGetQuery(myDB, paste("SELECT * FROM cost_and_price_period WHERE start_date <=", '01-[[i]]-2019'))
}
The main issue is that R doesn't acknowledge post-increment operators like ++, so I know I could just make 12 individual queries and then rbind them, but I would prefer to do one efficient query. Does anyone have any ideas?
The solution below may give you an idea of how to proceed.
DummyTable
id names dob
1 1 aa 2018-01-01
2 2 bb 2018-02-01
3 3 cc 2018-03-01
4 4 dd 2018-04-01
5 5 ee 2018-05-01
6 6 ff 2018-06-01
7 7 gg 2018-07-01
8 8 hh 2018-08-01
9 9 ii 2018-09-01
10 10 jj 2018-10-01
11 11 kk 2018-11-01
12 12 ll 2018-12-01
13 13 ll 2018-12-01
Imagine we have the above table in MySQL. We need to fetch the records dated the 1st day of each month and store them all in one data frame.
### Using a for loop, as in your question
n <- 12
df <- vector("list", n)
for (i in seq_len(n)) {
  # i is the month number; sprintf zero-pads it so the date reads 2018-01-01, not 2018-1-01
  df[[i]] <- dbGetQuery(pool, sprintf("SELECT * FROM dummyTable WHERE dob = '2018-%02d-01';", i))
}
df <- do.call(rbind, df)
### Using lapply (preferred way)
months <- seq_len(12)
df <- lapply(months, function(m) {
  dbGetQuery(pool, sprintf("SELECT * FROM dummyTable WHERE dob = '2018-%02d-01';", m))
})
df <- do.call(rbind, df)
In both cases, df ends up as a single data frame holding the matching records from MySQL.
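Since the question asks for one efficient query, a possible variation is to generate the twelve dates in R and send them in a single IN clause. This is only a sketch: it reuses the table and column names from the dummy table above, builds the query string without touching a database, and leaves the actual call commented out because it needs an open connection (`pool` in the answer).

```r
# First day of each month in 2018, generated without a manual counter
month_starts <- seq(as.Date("2018-01-01"), by = "month", length.out = 12)

# Quote each date and join them into one comma-separated list
date_list <- paste0("'", format(month_starts, "%Y-%m-%d"), "'", collapse = ", ")

# One query covering all twelve dates instead of twelve round trips
query <- paste0("SELECT * FROM dummyTable WHERE dob IN (", date_list, ");")

# With an open connection this would run as:
# df <- DBI::dbGetQuery(pool, query)
```

Pasting values into SQL is acceptable here only because the dates are generated by the script itself; for user-supplied values a parameterized query is the safer route.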

Count number of rows when using dplyr to access sql table/query

What is an efficient way to count the number of rows when using dplyr to access an SQL table? The MWE below uses SQLite, but I use PostgreSQL and have the same issue. I used
dim()
This works for a table taken directly from the database (first case), but is inconsistent when I create a tbl from an SQL query on the same table (second case). My tables have millions of rows, but I see this even with as few as 1,000 rows: I get NA or ??. Is there anything that I am missing?
# MWE
library(dplyr)
library(nycflights13)
test_db <- src_sqlite("test_db.sqlite3", create = TRUE)
flights_sqlite <- copy_to(test_db, flights, temporary = FALSE, indexes = list(
  c("year", "month", "day"), "carrier", "tailnum"))
flights_postgres <- tbl(test_db, "flights")
First case (table from direct schema)
flights_postgres
> flights_postgres
Source: postgres 9.3.5 []
From: flights [336,776 x 16]
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
1 2013 1 1 517 2 830 11 UA N14228 1545 EWR IAH 227 1400 5 17
2 2013 1 1 533 4 850 20 UA N24211 1714 LGA IAH 227 1416 5 33
#using dim()
> dim(flights_postgres)
[1] 336776 16
The above works and returns the number of rows.
Second case (table from SQL query)
## use the flights schema above but can also be used to create other variables (like lag, lead) in run time
flight_postgres_2 <- tbl(test_db, sql("SELECT * FROM flights"))
>flight_postgres_2
Source: postgres 9.3.5 []
From: <derived table> [?? x 16]
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
1 2013 1 1 517 2 830 11 UA N14228 1545 EWR IAH 227 1400 5 17
2 2013 1 1 533 4 850 20 UA N24211 1714 LGA IAH 227 1416 5 33
>
> dim(flight_postgres_2)
[1] NA 16
As you can see, the row count prints as either ?? or NA, which is not very helpful.
I got around this by using collect(), or by converting the output to a data frame with as.data.frame(), and then checking the dimensions. But neither method is ideal, given how long it can take for a large number of rows.
I think the answer is what @alistaire suggests: Do it in the database.
> flight_postgres_2 %>% summarize(n())
Source: sqlite 3.8.6 [test_db.sqlite3]
From: <derived table> [?? x 1]
n()
(int)
1 336776
.. ...
Asking dim to do this would be having your cake (lazy evaluation of SQL with dplyr, keeping data in the database) and eating it too (having full access to the data in R).
Note that this is doing @alistaire's approach underneath:
> flight_postgres_2 %>% summarize(n()) %>% explain()
<SQL>
SELECT "n()"
FROM (SELECT COUNT() AS "n()"
FROM (SELECT * FROM flights) AS "zzz11") AS "zzz13"
<PLAN>
selectid order from detail
1 0 0 0 SCAN TABLE flights USING COVERING INDEX flights_year_month_day
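To get the count back into R as a plain number, pull() can be added after the summarise step. The sketch below assumes the dbplyr and RSQLite packages are installed and uses an in-memory SQLite database with the built-in mtcars data as a stand-in for the flights table; the names `con`, `cars_query` and `n_rows` are made up for the example.

```r
library(dplyr)

# In-memory SQLite database with a built-in dataset as a stand-in table
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "mtcars", mtcars)

# A tbl built from raw SQL, like the second case in the question
cars_query <- tbl(con, sql("SELECT * FROM mtcars"))

# The count runs in the database; pull() brings back only the single number
n_rows <- cars_query %>% summarise(n = n()) %>% pull(n)

DBI::dbDisconnect(con)
```

Only one value crosses from the database to R, so this should stay cheap even when the underlying query returns millions of rows.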

Using rvest to return descendants of a table

I am having trouble figuring out why the following code isn't returning the information specified by the xpath.
I am trying to select the count data found in the 'Core Questions' section of the page. I wanted to get it working for the table of the first question and then intended to extend it to do the same thing for each question/table on the page. Unfortunately, I can't get it to pull down the section of the table I am interested in. I imagine the answer involves specifying the children of the <tr> node I am interested in, i.e. multiple <td> tags, but my attempts to do this continue to fail. Would anyone be able to help me specify the part of the table I am interested in? (Bonus points if it can be done for all ten tables on the page!)
library(rvest)
# read_html() replaces the deprecated html() in current rvest
detailed <- read_html("https://www.deakin.edu.au/evaluate/results/old/detail-rep.php?schedule_select=1301&faculty_select=01&school_select=0104&unit_select=MIS202&location_select=B")
q1 <- detailed %>%
html_nodes(xpath='//*[@id="main"]/div/div/form/fieldset[2]/table[1]/tbody/tr/td[2]/div/table/tbody/tr[5]') %>%
html_table(header = TRUE, fill=TRUE)
When I go to the ancestor table it pulls down the information but it is extremely messy and difficult to interpret. When I try to specify elements within this table I am unable to extract info. Is anyone able to explain to me why the descendants of table[1] are not being extracted? Here is the code to pull down table[1]:
q1 <- detailed %>%
html_nodes(xpath='//*[@id="main"]/div/div/form/fieldset[2]/table[1]') %>%
html_table(header = TRUE, fill = TRUE)
Does this get you where you need to be?
allqs <- detailed %>%
html_nodes(css = ".result center") %>%
html_text()
t(matrix(as.numeric(allqs), 5, 10, dimnames = list(c("Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"),
paste0("Q", 1:10))))
Which gives:
Strongly Disagree Disagree Neutral Agree Strongly Agree
Q1 0 4 4 9 1
Q2 1 2 2 11 2
Q3 0 0 2 11 5
Q4 1 3 2 9 3
Q5 0 3 4 10 1
Q6 0 1 5 7 2
Q7 0 3 6 6 3
Q8 1 0 2 7 8
Q9 0 0 5 7 5
Q10 0 1 4 7 5
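As for why the long XPath in the question returned nothing: one possible cause, not verified against this particular page, is the `tbody` steps. Browsers add a `<tbody>` element to tables when building the DOM, so an XPath copied from developer tools can contain `tbody` even if the HTML sent by the server (which is what rvest parses) does not. The hypothetical inline page below illustrates the effect.

```r
library(rvest)

# Hypothetical raw HTML as a server might send it: the table has no <tbody>
page <- read_html("<div id='main'><table>
  <tr><td>Q1</td><td>4</td></tr>
  <tr><td>Q2</td><td>2</td></tr>
</table></div>")

# XPath as copied from browser developer tools, including the added tbody
with_tbody <- html_elements(page, xpath = "//*[@id='main']/table/tbody/tr/td[2]")

# Same path without tbody, matching the HTML that rvest actually parsed
without_tbody <- html_elements(page, xpath = "//*[@id='main']/table/tr/td[2]")

n_with_tbody <- length(with_tbody)
counts <- html_text(without_tbody)
```

If this is the cause, dropping the `tbody` steps or using a shorter CSS selector, as in the answer above, avoids the mismatch.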