Add percent labels to a stacked bar graph of counts (no y variable in aes)

My data: each row is a participant (pID) in my study. Each participant answered a question whose response value (Q_RV) is 1, 2, 3, 4, or 5, and each is also labelled by health status (S): 1, 2, or 3.
The data looks something like this:
#> # A tibble: 8 x 3
#> pID Q_RV S
#> <fct> <fct> <int>
#> 1 1 1
#> 2 1 1
#> 3 3 1
#> 4 3 2
#> 5 1 2
#> 6 2 1
#> 7 4 3
#> 8 5 1
I've made a stacked bar graph using counts of each response value, and filled each bar by health status:
plot <- ggplot(data, aes (x=Q_RV, fill=S)) + [other stuff to make the plot look nice]
and I get this:
[Image: plot showing counts for each response value]
Now, I'd love to add a percent label above each bar that shows the percent of responses that had each value. In other words, over the far left bar, it should be roughly 75.5%
How do I do it? Every question I've looked at uses a y argument in the aes...
Edit:
Found the answer here:
Adding percentage labels to a barplot with y-axis count
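For readers who land here first, this is a minimal sketch of one way to do it, not necessarily what the linked answer does. It assumes ggplot2 3.3.0 or later (for after_stat()), and the small data frame is a stand-in built from the sample rows above. The fill is mapped only in the bar layer, so the text layer counts each response value once and divides by the total number of responses.

```r
library(ggplot2)

# Stand-in data modelled on the sample rows in the question (an assumption)
data <- data.frame(
  pID  = factor(1:8),
  Q_RV = factor(c(1, 1, 3, 3, 1, 2, 4, 5)),
  S    = c(1L, 1L, 1L, 2L, 2L, 1L, 3L, 1L)
)

p <- ggplot(data, aes(x = Q_RV)) +
  # Stacked bars of counts, filled by health status
  geom_bar(aes(fill = factor(S))) +
  # One label per bar: that bar's count as a percent of all responses
  geom_text(
    aes(label = after_stat(sprintf("%.1f%%", 100 * count / sum(count)))),
    stat = "count",
    vjust = -0.5
  ) +
  labs(fill = "S")

# print(p) draws the plot
```

Because no fill is mapped in geom_text(), the labels sit at the top of each full stack rather than on the individual segments.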

How do I download a table from the CDC website as a CSV file?

I am trying to use BRFSS Data from the CDC in R. In particular, I am trying to read the 2014-2018 data into separate dataframes (step 1 complete), add column titles to the dataframes (what I'm working on), and combine all years into one dataframe.
The column titles are not in the ASC data file, but they are on this website in an HTML table:
https://www.cdc.gov/brfss/annual_data/2017/llcp_varlayout_17_onecolumn.html
How can I take the table from this website and download it as a CSV file?
P.S. This is the tutorial whose code I am trying to replicate in order to use the data (if anyone uses BRFSS data and has a better way, let me know): https://michaelminn.net/tutorials/r-brfss/ The author already created a CSV of column titles, but it is for a different year, so I can't use it, and he doesn't give instructions for making one.
You can use rvest
library(rvest)
url <- "https://www.cdc.gov/brfss/annual_data/2017/llcp_varlayout_17_onecolumn.html"
data <- read_html(url) %>%
  html_element(xpath="//main//table") %>%
  html_table()
data
#> # A tibble: 358 × 3
#> `Starting Column` `Variable Name` `Field Length`
#> <int> <chr> <int>
#> 1 1 _STATE 2
#> 2 17 FMONTH 2
#> 3 19 IDATE 8
#> 4 19 IMONTH 2
#> 5 21 IDAY 2
#> 6 23 IYEAR 4
#> 7 32 DISPCODE 4
#> 8 36 SEQNO 10
#> 9 36 _PSU 10
#> 10 63 CTELENM1 1
#> # … with 348 more rows
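To finish the job the question asks about, the scraped table still has to be written to disk. A minimal sketch using base R's write.csv(); the small data frame stands in for the scraped tibble so the snippet runs without network access, and the file name is made up for illustration.

```r
# Stand-in for the scraped table: first three rows of the output above
layout <- data.frame(
  `Starting Column` = c(1L, 17L, 19L),
  `Variable Name`   = c("_STATE", "FMONTH", "IDATE"),
  `Field Length`    = c(2L, 2L, 8L),
  check.names = FALSE  # keep the spaces in the column names
)

# Hypothetical output path; replace tempdir() with your own folder
out_file <- file.path(tempdir(), "llcp_varlayout_17.csv")

# row.names = FALSE avoids an extra unnamed index column in the CSV
write.csv(layout, out_file, row.names = FALSE)

# Read it back to confirm the round trip
back <- read.csv(out_file, check.names = FALSE)
```

With the real data, the same write.csv() call on the object returned by html_table() should be all that is needed.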

Use a transaction database to calculate the probability of an item appearing in a future transaction using R or SQL

I have a database of transactions like in the table below
user_id order_id order_number product_name n
<int> <int> <int> <fctr> <int>
1 11878590 3 Pistachios 1
1 11878590 3 Soda 1
1 12878790 4 Yogurt 1
1 12878790 4 Cheddar Popcorn 1
1 12878790 4 Cinnamon Toast Crunch 1
2 12878791 11 Milk Chocolate Almonds 1
2 12878791 11 Half & Half 1
2 12878791 11 String Cheese 1
11 12878792 19 Whole Milk 1
11 12878792 19 Pistachios 1
11 12878792 19 Soda 1
11 12878792 19 Paper Towel Rolls 1
The table has multiple users who each have multiple transactions. Some users only have 3 transactions, other users have 15, etc. This is all in one table.
I'm trying to calculate a transition matrix for a Markov model. I want to find the probability that an item will be in a new basket given that it was present in the previous basket of transactions.
I want my final table to look something like this:
user_id product_name probability_present probability_absent
1 Soda .5 .5
1 Pistachios .5 .5
I'm having trouble figuring out how to get the data into a form so that I can calculate the probabilities and specifically coming up with a way to compare all of the t,t-1 combinations.
I have code that I've written to get things into this form, but I'm stuck at this point. I've written my code using the dplyr R package, but I could translate something in SQL into the R code. I can post my code in R if it will be helpful, but it is pretty simple at this point as I just had to do a few joins to get the table into this shape.
What else do I have to do to get the table/values that I'm trying to calculate?
This seems to give you the desired probabilities:
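SELECT t.user_id,
       t.product_name,
       -- share of this user's orders that contain the product;
       -- * 1.0 avoids integer division in dialects that truncate
       COUNT(DISTINCT t.order_id) * 1.0 / u.n_orders AS prob_present,
       1 - COUNT(DISTINCT t.order_id) * 1.0 / u.n_orders AS prob_absent
FROM tbl t
JOIN (SELECT user_id, COUNT(DISTINCT order_id) AS n_orders
      FROM tbl
      GROUP BY user_id) u ON u.user_id = t.user_id
WHERE t.user_id = 1
GROUP BY t.user_id, t.product_name, u.n_orders;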
Or at least it gives you the numbers you have. If this is not right, please provide a slightly more complex example dataset.
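If what you need is the basket-to-basket transition described in the question (present at t given present at t-1), here is a rough base R sketch. The data frame tx and the function transition_probs() are made up for illustration, and it assumes order_number gives the order of a user's baskets.

```r
# Toy data: one user, four baskets (hypothetical values)
tx <- data.frame(
  user_id      = c(1, 1, 1, 1, 1, 1, 1),
  order_number = c(1, 1, 2, 2, 3, 3, 4),
  product_name = c("Soda", "Pistachios", "Soda", "Yogurt",
                   "Yogurt", "Pistachios", "Soda"),
  stringsAsFactors = FALSE
)

transition_probs <- function(d) {
  # Presence matrix: one row per basket (sorted by order_number),
  # one column per product the user ever bought
  present <- unclass(table(d$order_number, as.character(d$product_name))) > 0
  if (nrow(present) < 2) return(NULL)  # no t-1/t pair to compare
  prev <- present[-nrow(present), , drop = FALSE]  # baskets t-1
  curr <- present[-1, , drop = FALSE]              # baskets t
  # P(present at t | present at t-1); NaN if never present at t-1
  p <- colSums(prev & curr) / colSums(prev)
  data.frame(
    user_id             = d$user_id[1],
    product_name        = colnames(present),
    probability_present = p,
    probability_absent  = 1 - p,
    row.names = NULL
  )
}

# Apply per user and stack the results
res <- do.call(rbind, lapply(split(tx, tx$user_id), transition_probs))
```

A product that only ever appears in a user's last basket has no "previous" occurrence, so its probability comes out as NaN; decide separately how you want to treat those rows.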

How to Merge Column of Clusters and Columns with X-Y coord?

I have a column with the name of the points, a column with the X coordinates and a column with Y coordinates.
This is the table I'm working on:
I want to create a table with three columns: one with the clusters' IDs, one with X coordinates, and one with Y coordinates, so that for each cluster I have its X-Y coordinates.
I've tried the following code:
Xcoord <- sort(unique(tabprof$X_coord))
clusters <- sort(unique(tabprof$Cluster_ID))
I tried this in order to merge the two vectors, but it wasn't possible because they have different lengths, probably because some clusters share the same X coord value.
Following our discussion in the comments, I'll provide a new solution. I'll use fake data.
A <- c(1,1,1,1,2,2,2,2)
B <- c(3,3,4,4,3,3,3,5)
df <- data.frame(A,B)
res <- unique(df)
> df
A B
1 1 3
2 1 3
3 1 4
4 1 4
5 2 3
6 2 3
7 2 3
8 2 5
> res
A B
1 1 3
3 1 4
5 2 3
8 2 5
So as you see, if column A is the ClusterID and B the X coords, ClusterID values are duplicated, but each row is a unique ID-coord combination. What is more, it's no problem if two different IDs have the same coords.
I hope this is helpful.
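Applied to a table shaped like the one in the question, the same idea might look like the sketch below. The toy tabprof and the column name Y_coord are assumptions (only Cluster_ID and X_coord appear in the posted code).

```r
# Toy stand-in for tabprof; Point and Y_coord are assumed column names
tabprof <- data.frame(
  Point      = paste0("p", 1:6),
  Cluster_ID = c(1, 1, 1, 2, 2, 2),
  X_coord    = c(3, 3, 4, 3, 3, 5),
  Y_coord    = c(7, 7, 8, 7, 9, 9)
)

# Keep only the three columns of interest, then drop duplicated rows
clusters_xy <- unique(tabprof[, c("Cluster_ID", "X_coord", "Y_coord")])

# Optional: sort by cluster, then coordinates, and reset the row names
clusters_xy <- clusters_xy[order(clusters_xy$Cluster_ID,
                                 clusters_xy$X_coord,
                                 clusters_xy$Y_coord), ]
rownames(clusters_xy) <- NULL
```

Calling unique() on the whole data frame compares complete rows, which is why it avoids the length mismatch you get from taking unique() of each column separately.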

Using rvest to return descendants of a table

I am having trouble figuring out why the following code isn't returning the information specified by the xpath.
I am trying to select the count data found in the 'Core Questions' section of the page. I wanted to get it working for the table of the first question and then extend it to do the same thing for each question/table on the page. Unfortunately I can't get it to pull down the section of the table I am interested in. I imagine the answer involves specifying the children of the <tr> node I am interested in, i.e. multiple <td> tags, but my attempts to do this continue to fail. Would anyone be able to help me specify the part of the table I am interested in? (Bonus points if it can be done for all ten tables on the page!)
library(rvest)
# read_html() replaces the older html(), which current rvest no longer provides
detailed <- read_html("https://www.deakin.edu.au/evaluate/results/old/detail-rep.php?schedule_select=1301&faculty_select=01&school_select=0104&unit_select=MIS202&location_select=B")
q1 <- detailed %>%
  html_nodes(xpath='//*[@id="main"]/div/div/form/fieldset[2]/table[1]/tbody/tr/td[2]/div/table/tbody/tr[5]') %>%
  html_table(header = TRUE, fill=TRUE)
When I go to the ancestor table it pulls down the information but it is extremely messy and difficult to interpret. When I try to specify elements within this table I am unable to extract info. Is anyone able to explain to me why the descendants of table[1] are not being extracted? Here is the code to pull down table[1]:
q1 <- detailed %>%
  html_nodes(xpath='//*[@id="main"]/div/div/form/fieldset[2]/table[1]') %>%
  html_table(header = TRUE, fill = TRUE)
Does this get you where you need to be?
allqs <- detailed %>%
  html_nodes(css = ".result center") %>%
  html_text()
t(matrix(as.numeric(allqs), 5, 10,
         dimnames = list(c("Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"),
                         paste0("Q", 1:10))))
Which gives:
Strongly Disagree Disagree Neutral Agree Strongly Agree
Q1 0 4 4 9 1
Q2 1 2 2 11 2
Q3 0 0 2 11 5
Q4 1 3 2 9 3
Q5 0 3 4 10 1
Q6 0 1 5 7 2
Q7 0 3 6 6 3
Q8 1 0 2 7 8
Q9 0 0 5 7 5
Q10 0 1 4 7 5
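In case the t(matrix(...)) step looks opaque: the scraped values arrive as one flat character vector, five counts per question, and the reshape just folds it into one row per question. A small offline sketch with the first two questions' counts from the output above (the vector is typed in by hand here, not scraped):

```r
# Flat vector as html_text() would return it: 5 counts per question
allqs <- c("0", "4", "4", "9", "1",    # Q1
           "1", "2", "2", "11", "2")   # Q2

labels <- c("Strongly Disagree", "Disagree", "Neutral",
            "Agree", "Strongly Agree")
qnames <- paste0("Q", 1:2)

# Fill row by row: each run of 5 values becomes one question's row
by_row <- matrix(as.numeric(allqs), ncol = 5, byrow = TRUE,
                 dimnames = list(qnames, labels))

# The form used in the answer: fill columns, then transpose
via_t <- t(matrix(as.numeric(allqs), 5, 2,
                  dimnames = list(labels, qnames)))
```

Both give the same matrix; byrow = TRUE just skips the transpose.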

Implementation of the Gower distance function

I have a matrix (size: 28 columns and 47 rows) with numbers. This matrix has an extra row that contains headers for the columns ("ordinal" and "nominal").
I want to use the Gower distance function on this matrix. Here it says that:
The final dissimilarity between the ith and jth units is obtained as a weighted sum of dissimilarities for each variable:
d(i,j) = sum_k(delta_ijk * d_ijk ) / sum_k( delta_ijk )
In particular, d_ijk represents the distance between the ith and jth unit computed considering the kth variable. It depends on the nature of the variable:
factor or character columns are considered as categorical nominal variables and d_ijk = 0 if x_ik = x_jk, 1 otherwise;
ordered columns are considered as categorical ordinal variables and the values are substituted with the corresponding position index, r_ik, in the factor levels. These position indexes (that are different from the output of the R function rank) are transformed in the following manner:
z_ik = (r_ik - 1)/(max(r_ik) - 1)
These new values, z_ik, are treated as observations of an interval scaled variable.
As far as the weight delta_ijk is concerned:
delta_ijk = 0 if x_ik = NA or x_jk = NA;
delta_ijk = 1 in all the other cases.
I know that there is a gower.dist function, but I must do it that way.
So, for "d_ijk", "delta_ijk" and "z_ik", I tried to make functions, as I didn't find a better way.
I started with "delta_ijk" and I tried this:
Delta=function(i,j){for (i in 1:28){for (j in 1:47){
+{if (MyHeader[i,j]=="nominal")
+ result=0
+{else if (MyHeader[i,j]=="ordinal") result=1}}}}
+;result}
But I got an error. So I got stuck and I can't do the rest.
P.S. Excuse me if I make mistakes, but English is not a language I use very often.
Why do you want to reinvent the wheel, billyt? There are several functions/packages in R that will compute this for you, including daisy() in package cluster, which comes with R.
First things first though, get those "data type" headers out of your data. If this truly is a matrix then character information in this header row will make the whole matrix a character matrix. If it is a data frame, then all columns will likely be factors. What you want to do is code the type of data in each column (component of your data frame) as 'factor' or 'ordered'.
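## stringsAsFactors = TRUE reproduces the output below on R >= 4.0.0,
## where data.frame() no longer converts strings to factors by default
df <- data.frame(A = c("ordinal",1:3), B = c("nominal","A","B","A"),
                 C = c("nominal",1,2,1), stringsAsFactors = TRUE)
Which gives the following; note that all columns are stored as factors because of the extra info.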
> head(df)
A B C
1 ordinal nominal nominal
2 1 A 1
3 2 B 2
4 3 A 1
> str(df)
'data.frame': 4 obs. of 3 variables:
$ A: Factor w/ 4 levels "1","2","3","ordinal": 4 1 2 3
$ B: Factor w/ 3 levels "A","B","nominal": 3 1 2 1
$ C: Factor w/ 3 levels "1","2","nominal": 3 1 2 1
If we get rid of the first row and recode into the correct types, we can compute Gower's coefficient easily.
> headers <- df[1,]
> df <- df[-1,]
> DF <- transform(df, A = ordered(A), B = factor(B), C = factor(C))
> ## We've previously shown you how to do this (above line) for lots of columns!
> str(DF)
'data.frame': 3 obs. of 3 variables:
$ A: Ord.factor w/ 3 levels "1"<"2"<"3": 1 2 3
$ B: Factor w/ 2 levels "A","B": 1 2 1
$ C: Factor w/ 2 levels "1","2": 1 2 1
> require(cluster)
> daisy(DF)
Dissimilarities :
2 3
3 0.8333333
4 0.3333333 0.8333333
Metric : mixed ; Types = O, N, N
Number of objects : 3
Which gives the same as gower.dist() (package StatMatch) for this data, although in a slightly different format; as.matrix(daisy(DF)) would be equivalent:
> gower.dist(DF)
[,1] [,2] [,3]
[1,] 0.0000000 0.8333333 0.3333333
[2,] 0.8333333 0.0000000 0.8333333
[3,] 0.3333333 0.8333333 0.0000000
You say you can't do it this way? Can you explain why not? You seem to be going to some degree of effort to do something that other people have already coded up for you. This isn't homework, is it?
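If you really do have to code it by hand, the definition quoted in the question can be written out fairly directly. This is only a sketch under assumptions: gower_manual() is a made-up name, it handles only factor/character and ordered columns, and it takes the "interval scaled" distance for the z values to be the absolute difference divided by the range of z (the quoted text does not spell that part out). It also assumes no ordered column is constant.

```r
# Same three-row example as above, already recoded
DF <- data.frame(
  A = ordered(c(1, 2, 3)),
  B = factor(c("A", "B", "A")),
  C = factor(c(1, 2, 1))
)

gower_manual <- function(df) {
  n <- nrow(df)
  num <- matrix(0, n, n)  # running sum over k of delta_ijk * d_ijk
  den <- matrix(0, n, n)  # running sum over k of delta_ijk
  for (k in seq_along(df)) {
    x <- df[[k]]
    if (is.ordered(x)) {
      r <- as.integer(x)                         # position index r_ik
      z <- (r - 1) / (max(r, na.rm = TRUE) - 1)  # z_ik as in the quote
      # assumed interval-scaled distance: |z_i - z_j| / range(z)
      d <- abs(outer(z, z, "-")) / diff(range(z, na.rm = TRUE))
    } else {
      x <- as.character(x)
      d <- outer(x, x, "!=") * 1                 # 0 if equal, 1 otherwise
    }
    delta <- !is.na(d)   # delta_ijk = 0 when either value is NA
    d[!delta] <- 0
    num <- num + d
    den <- den + delta
  }
  num / den
}

m <- gower_manual(DF)
```

On this toy data the off-diagonal entries come out as 5/6, 1/3 and 5/6, which agrees with the daisy() and gower.dist() output shown above.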
I'm not sure what your logic is doing, but you are putting too many "{" in there for your own good. I generally use the {} pairs to surround the consequent-clause:
Delta=function(i,j){
  for (i in 1:28) {
    for (j in 1:47) {
      if (MyHeader[i,j]=="nominal") {
        result=0
      # the "{" in the next line before else was sabotaging your efforts
      } else if (MyHeader[i,j]=="ordinal") {
        result=1
      }
    }
  }
  result  # returned after both loops, as in your original
}
Thanks Gavin and DWin for your help. I managed to solve the problem and find the right distance matrix. I used daisy() after I recoded the class of the data and it worked.
P.S. The solution that you suggested in my other topic for changing the class of the columns:
DF$nominal <- as.factor(DF$nominal)
DF$ordinal <- as.ordered(DF$ordinal)
didn't work. It changed only the first nominal and ordinal column.
Thanks again for your help.
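For anyone who hits the same problem with the recode, one possible way to convert every column according to its header type is sketched below. The toy data mirrors the three-column example above; the names raw, types and body are made up for illustration.

```r
# Toy data with the type labels in the first row, as in the question
raw <- data.frame(
  A = c("ordinal", "1", "2", "3"),
  B = c("nominal", "A", "B", "A"),
  C = c("nominal", "1", "2", "1"),
  stringsAsFactors = FALSE
)

types <- unlist(raw[1, ])        # named vector: column name -> type label
body  <- raw[-1, , drop = FALSE] # the actual observations

# Recode each column by its own label; body[] keeps the data frame shape
body[] <- Map(function(col, type) {
  # Note: levels are sorted as text, so "10" would sort before "2"
  if (type == "ordinal") factor(col, ordered = TRUE) else factor(col)
}, body, types)
rownames(body) <- NULL
```

Map() walks the columns and their type labels in parallel, so it does not matter how many nominal or ordinal columns there are or what they are called.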