In R ggraph, how to fix circular plot structure - igraph

The code below does not produce the plot I want (the hierarchical data structure does not show in the current plot). How can I fix it so the plot reflects that structure? Thanks!
library(ggraph)
library(igraph)
library(tidyverse)

md <- data.frame(category = c('FDM', 'FDM', 'FDM'),
                 item     = c('A1', 'A1', 'C1'),
                 subitem  = c('A11', 'A12', 'C1'),
                 amount   = c(1, 2, 3))

vertices <- md %>%
  gather(key = 'type', value = 'item', -amount) %>%
  select(-type) %>%
  group_by(item) %>%
  summarise(amount = sum(amount))

pt <- graph_from_data_frame(md, vertices = vertices)

ggraph(pt, layout = 'circlepack', weight = amount) +
  geom_node_circle(aes(fill = depth)) +
  geom_node_label(aes(label = paste0(name, '\n', amount))) +
  theme_void()
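One possible fix, offered only as a hedged sketch rather than a confirmed answer: a circlepack layout needs a rooted tree, so the edge list must contain a parent-child row for every level (category -> item and item -> subitem), not the raw rows of md. Reusing the objects above, something along these lines:

# Sketch only: build an explicit two-level edge list and matching vertices.
# Rows where item == subitem (here C1) are dropped from the second level to
# avoid self-loops, so C1 becomes a leaf directly under FDM.
edges <- bind_rows(
  md %>% distinct(from = category, to = item),
  md %>% filter(item != subitem) %>% distinct(from = item, to = subitem)
)

# One vertex per unique name; note C1's amount is counted once as an item and
# once as a subitem here (as in the original vertices) - adjust if unwanted.
vertices2 <- md %>%
  gather(key = 'type', value = 'name', -amount) %>%
  select(-type) %>%
  group_by(name) %>%
  summarise(amount = sum(amount))

pt2 <- graph_from_data_frame(edges, vertices = vertices2)

ggraph(pt2, layout = 'circlepack', weight = amount) +
  geom_node_circle(aes(fill = depth)) +
  geom_node_label(aes(label = paste0(name, '\n', amount))) +
  theme_void()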

Related

R: Inferring a Common Merge Key

I am trying to web-scrape a site to get addresses for a set of names (part A) along with their longitudes and latitudes (part B). I don't know how to do this all together, so I did it in two parts:
# part A
library(tidyverse)
library(rvest)
library(httr)
library(XML)

# Define function to scrape 1 page
get_info <- function(page_n) {
  cat("Scraping page ", page_n, "\n")
  page <- paste0("https://www.mywebsite/", page_n, "?extension") %>% read_html()
  tibble(title = page %>%
           html_elements(".title a") %>%
           html_text2(),
         adress = page %>%
           html_elements(".marker") %>%
           html_text2(),
         page = page_n)
}

# Apply function to pages 1:10
df_1 <- map_dfr(1:10, get_info)

# Check dimensions
dim(df_1)
[1] 90
Here is part B:
# Recognize pattern in websites
part1 <- "https://www.mywebsite/"
part2 <- c(0:55)
part3 <- "extension"
temp <- data.frame(part1, part2, part3)

# Create list of websites
temp$all_websites <- paste0(temp$part1, temp$part2, temp$part3)

# Scrape
df_2 <- list()
for (i in 1:10) {
  tryCatch({
    url_i <- temp$all_websites[i]
    page_i <- read_html(url_i)
    b_i <- page_i %>% html_nodes("head")
    listanswer_i <- b_i %>% html_text() %>% strsplit("\\n")
    df_2[[i]] <- listanswer_i
    print(listanswer_i)
  }, error = function(e) {})
}

# Extract long/lat from results
lat_long <- grep("LatLng", unlist(df_2[]), value = TRUE)
df_2 <- data.frame(str_match(lat_long, "LatLng(\\s*(.*?)\\s*);"))
df_2 <- df_2 %>% filter(X1 != "LatLngBounds();")
> dim(df_2)
[1] 86 3
We can see that df_1 and df_2 have a different number of rows, and there is no common merge key between them. How can I re-write my code so that it creates a merge key between df_1 and df_2, allowing me to merge the common records in these two data frames?
I am not sure multiple requests to the same URIs are needed. Some lat/long values are not listed either on the results pages or on the result-specific linked webpage, e.g. Toronto Beaches Dentist from the current page 2 results has no lat/long shown on either page 2 or its own page. In those cases, you may choose to fill the blanks using another service that returns lat/long based on an address.
You can re-write your function and alter your regex patterns to produce 2 dataframes which can be joined and the resultant dataframe returned. With the appropriate regex changes, as given below, you can use the address column to join the 2 dataframes. I dislike a key which is an address but it does appear to be internally consistent across the result page. I have used a left join to return all rows from the dentist listings i.e. the practice business names.
library(tidyverse)
library(rvest)

urls <- sprintf("https://www.dentistsearch.ca/search-doctor/%i?category=0&services=0&province=55&city=&k=", 1:10)
pages <- lapply(urls, read_html)

get_dentist_info <- function(page) {
  page_text <- page %>% html_text()
  address_keys <- page_text %>%
    str_match_all('marker_\\d+\\.set\\("content", "(.*?)"\\);') %>%
    .[[1]] %>%
    .[, 2]
  lat_long <- page_text %>%
    str_match_all("LatLng\\((.*)\\);(?![\\s\\S]+myOptions)") %>%
    .[[1]] %>%
    .[, 2]
  lat_lon <- tibble(address = address_keys, lat_long = lat_long) %>%
    separate(lat_long, into = c("lat", "long"), sep = ", ") %>%
    mutate(lat = as.numeric(lat), long = as.numeric(long))
  practice_info <- tibble(
    title = page %>% html_elements(".title > a") %>% html_text(trim = T),
    address = page %>% html_elements(".marker") %>% html_text()
  )
  dentist_info <- left_join(practice_info, lat_lon, by = "address")
  return(dentist_info)
}

all_dentist_info <- map_dfr(pages, get_dentist_info)
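The approach above leaves NA coordinates for listings whose pages never expose a LatLng. As a hedged follow-up (not part of the original answer), those gaps could be filled by geocoding the address text, for example with the tidygeocoder package (assumed installed):

# Sketch only: geocode listings whose coordinates were missing on the site,
# using the raw address string; results depend on how well those strings
# match OpenStreetMap's index.
library(tidygeocoder)
library(dplyr)

filled <- all_dentist_info %>%
  filter(is.na(lat)) %>%
  geocode(address = address, method = "osm", lat = lat_geo, long = long_geo) %>%
  mutate(lat  = coalesce(lat,  lat_geo),
         long = coalesce(long, long_geo)) %>%
  select(-lat_geo, -long_geo)

all_dentist_info <- all_dentist_info %>%
  filter(!is.na(lat)) %>%
  bind_rows(filled)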

Tabulizing data off website PDFs (w/ various formats) assigning each event values according to HTML link titles

I've been trying to automate the process of manually typing in the data from the ATF's trace data site (see "URL" below), but it's been a fairly big pain: so far I've only been able to collect each URL that holds the PDFs and assign it to its correct State/Territory and Year. The newer files (2017-2019) have data tables that are relatively easy to pull data from compared to 2014-2016; compare, for example, page 10 of the Trace Data report 2019 with the Trace Data report 2014.
It's the latter that I'm having the most trouble with, as the data is not stored in anything that looks like a table but is arranged around a pie chart. There are some promising R packages such as "pdftools" and "tesseract", but I'm very much an amateur when it comes to troubleshooting advanced analytical packages like these.
It's my guess that I'm still a ways off from where I want to be with the final product, as I would also need to mine the text at the bottom of page 10 to find how many "other" weapons were traced to a city, as well as the number of weapons for which a recovery city couldn't be determined. But if anyone has any suggestions on what I could try next, or even on making the working code more efficient, I'd appreciate it.
URL <- "https://www.atf.gov/resource-center/data-statistics"
html <- paste(readLines(URL))
library(xml2)
library(tidyverse)
library(rvest)
library(stringr)
x <- c('\t\t\t<div>([^<]*)</div>','\t\t</tr><tr><td>([^<]*)</td>','\t\t\t<td>([^<]*)</td>')
r <- read_html(URL) %>% html_nodes("a") %>% map_df(~{
Link <- .x %>% html_attr("href")
Title <- .x %>% html_text()
data_frame(Link, Title)
}) %>%
dplyr::filter(grepl('node',Link, fixed = T))
r <- as.data.frame(r)
x <- c('<ul><li>([^<]*)</li>','\t<li>([^<]*)</li>')
states <- c('Alabama','Alaska','Arizona','Arkansas','California','Colorado','Connecticut','Delaware','District of Columbia','Florida','Georgia','Guam & Northern Mariana Islands','Hawaii','Idaho','Illinois','Indiana','Iowa','Kansas','Kentucky','Louisiana','Maine','Maryland','Massachusetts','Michigan','Minnesota','Mississippi','Missouri','Montana','Nebraska','Nevada','New Hampshire','New Jersey','New Mexico','New York','North Carolina','North Dakota','Ohio','Oklahoma','Oregon','Pennsylvania','Puerto Rico','Rhode Island','South Carolina','South Dakota','Tennessee','Texas','Utah','Vermont','Virginia','Washington','West Virginia','Wisconsin','Wyoming')
s <- list()
for (i in 1:nrow(r)) {
  s[[i]] <- read_html(r$Link[i]) %>% html_nodes("a") %>% map_df(~{
    Link <- .x %>% html_attr("href")
    Title <- .x %>% html_text()
    data_frame(Link, Title)
  }) %>%
    mutate(Year = r$Title[i]) %>%  # '=' rather than '<-' so the column is actually named Year
    dplyr::filter(Title %in% states | str_detect(Title, "Virgin Islands")) %>%
    dplyr::filter(grepl('download', Link, fixed = T))
  trace_list <- do.call(rbind, s)
}
names(trace_list)[3] <- "Year"
Progress so far...
library(pdftools)

pdf_file <- "https://www.atf.gov/file/146951/download"
text <- pdf_text(pdf_file)
cat(text[10])

vtext <- as.list(str_split(text[10], "\n"))
x <- data.frame(matrix(unlist(vtext), nrow = length(vtext), byrow = TRUE), stringsAsFactors = FALSE)
x1 <- pivot_longer(x, cols = 1:length(x), names_to = "X1", values_to = "X2")
x1$X2 <- trimws(x1$X2)
x1 <- x1[c(8, 12), ]
x1[1, 2] <- sub(" ", "_", x1[1, 2], fixed = T)

library(splitstackshape)
x1 <- as.data.frame(cSplit(x1, 'X2', sep = " ", type.convert = FALSE))
x1 <- x1[, c(2:length(x1))]
colnames(x1) <- x1[1, ]
x1 <- x1[-1, ]
x2 <- pivot_longer(x1, cols = 1:length(x1), names_to = "city", values_to = "count")
Mixing both pdftools and tesseract...
library(tesseract)
img_file <- pdftools::pdf_convert("https://www.atf.gov/file/89621/download", format = 'tiff', dpi = 400)
text <- ocr(img_file)
strsplit(text[10],"\n")
Expected output:

year  state  city         count
2019  AL     Birmingham   100
2018  CA     Los Angeles  200
2017  CA     None         30
2017  CA     Other        400
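Purely as a hedged sketch of one possible next step (and assuming the pie-chart labels survive OCR as plain "City name 1,234" lines, which will not always hold), the OCR'd page text could be scanned for city/count pairs like this:

# Sketch only: pull "<city> <count>" pairs out of the OCR'd page text.
# Assumes each label ends up on its own line; OCR noise and labels split
# across lines will need extra handling.
library(stringr)
library(dplyr)

lines <- trimws(unlist(strsplit(text[10], "\n")))
m <- str_match(lines, "^([A-Za-z .'-]+?)\\s+([0-9][0-9,]*)$")

city_counts <- tibble(
  city  = m[, 2],
  count = as.integer(gsub(",", "", m[, 3]))
) %>%
  filter(!is.na(city))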

How to parse addresses from website specifying class in R?

I would like to parse the addresses of all stores on the following website:
https://www.carrefour.fr/magasin/region/, looping through the regions. So, starting for example with the region "auvergne-rhone-alpes-84", the full URL is https://www.carrefour.fr/magasin/region/auvergne-rhone-alpes-84. Note that I can add more regions afterwards; I just want to make it work with one for now.
carrefour <- "https://www.carrefour.fr/magasin/region/"
addresses_vector = c()
for (current_region in c("auvergne-rhone-alpes-84")) {
current_region_url = paste(carrefour, current_region, "/", sep="")
x <- GET(url=current_region_url)
html_doc <- read_html(x) %>%
html_nodes("[class = 'ds-body-text ds-store-card__details--content ds-body-text--size-m ds-body-text--color-standard-2']")
addresses_vector <- c(addresses_vector, html_doc %>%
rvest::html_nodes('body')%>%
xml2::xml_find_all(".//div[contains(#class, 'ds-body-text ds-store-card__details--content ds-body-text--size-m ds-body-text--color-standard-2')]") %>%
rvest::html_text())
}
I also tried x %>% read_html() %>% rvest::html_nodes(xpath = "/html/body/main/div[1]/div/div[2]/div[2]/ol/li[1]/div/div[1]/div[2]/div[2]") %>% rvest::html_text() (copying the whole XPath by hand), x %>% read_html() %>% html_nodes("div.ds-body-text.ds-store-card__details--content.ds-body-text--size-m.ds-body-text--color-standard-2") %>% html_text(), and several other ways, but I always get a character(0) element returned.
Any help is appreciated!
You could write a couple of custom helper functions, then use purrr to map the store-data function over the output of the first helper.
First, extract the region URLs, region names, and region ids, and store these in a tibble. This is the first helper function, get_regions.
Then use another function, get_store_info, to pull the store info from each region URL. That info sits in an attribute of a div tag; the page extracts it dynamically when JavaScript runs in the browser, which doesn't happen with rvest, so we read the attribute directly instead.
Apply the store-info function over the list of region URLs and region ids.
If you use map2_dfr to pass both the region id and the region link to the function that extracts the store data, you then have the region id available to join the map2_dfr result back onto the region tibble generated earlier.
Then do some column cleaning, e.g. drop the columns you don't want (a short sketch of this step follows the code below).
library(rvest)
library(purrr)
library(dplyr)
library(stringr)
library(readr)
library(jsonlite)

get_regions <- function() {
  url <- "https://www.carrefour.fr/magasin"
  page <- read_html(url)
  regions <- page %>% html_nodes(".store-locator-footer-list__item > a")
  t <- tibble(
    region = regions %>% html_text(trim = T),
    link = regions %>% html_attr("href") %>% url_absolute(url),
    region_id = NA_integer_
  ) %>%
    mutate(region_id = str_match(link, "-(\\d+)$")[, 2] %>% as.integer())
  return(t)
}

get_store_info <- function(region_url, r_id) {
  region_page <- read_html(region_url)
  store_data <- region_page %>%
    html_node("#store-locator") %>%
    html_attr(":context-stores") %>%
    parse_json(simplifyVector = T) %>%
    as_tibble()
  store_data$region_id <- r_id
  return(store_data)
}

region_df <- get_regions()
store_df <- map2_dfr(region_df$link, region_df$region_id, get_store_info)
final_df <- inner_join(region_df, store_df, by = 'region_id') # now clean columns within this
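As a hedged illustration of that cleaning step (not part of the original answer; the remaining column names depend on the JSON the site returns, so none of those are assumed here):

# Sketch only: inspect the joined columns, then drop the helpers that are no
# longer needed (the join key and the region link carried over from region_df).
glimpse(final_df)

final_df <- final_df %>%
  select(-region_id, -link)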

Several lines with different style in Caption in both html and docx - flextable

I need to show the data caption, the computer name, and the time period in the table header.
I also have these requirements: zebra theme, and merged cells where needed. That's why I chose flextable.
Here is my code:
library(officer)   # border settings library
library(flextable) # drawing tables library
library(dplyr)

Caption <- "<b><big>Computer01.domain.com</big></b><br>Network Interface<br>Gbit Total/sec<br><small>2021-05-14 18:04 to 2021-05-25 13:29</small>"

bold_border <- fp_border(color = "gray", width = 2)
std_border <- fp_border(color = "gray")
stub <- "2021-05-14 01:40 to 2021-05-17 08:26"

table_data <- data.frame(
  Instance = c("Intel[R] Ethernet 10G",
               "Intel[R] Ethernet Converged Network Adapter _1",
               "Intel[R] Ethernet Converged Network Adapter _2",
               "Intel[R] Ethernet 10G",
               "Intel[R] Gigabit"),
  Max = c(2.45, 2.41, 2.29, 2.17, 0),
  Avg = c(0.15, 0.15, 0.15, 0.17, 0)
)

table <- table_data %>% flextable() %>%
  set_caption(caption = Caption, html_escape = F) %>%
  bg(bg = "#579FAD", part = "header") %>%
  color(color = "white", part = "header") %>%
  theme_zebra(
    odd_header = "#579FAD",
    odd_body = "#E0EFF4",
    even_header = "transparent",
    even_body = "transparent"
  ) %>%
  set_table_properties(width = 1, layout = "autofit") %>%
  hline(part = "all", border = std_border) %>%
  vline(part = "all", border = std_border) %>%
  border_outer(border = bold_border, part = "all") %>%
  fix_border_issues() %>%
  set_header_labels(values = list(Instance = InstanceName)) %>%
  flextable::font(part = "all", fontname = "Calibri")

save_as_docx(table, path = file.path("c:\\temp", "test01.docx"))
save_as_html(table, path = file.path("c:\\temp", "test01.html"))
Here is what I got in HTML, which is okay for me:
But in the docx format, my header style is not applied:
How can I create a header like the one I made for HTML that can be saved to both HTML and docx?
If I have to create separate tables, one for HTML and another for docx, that's not ideal but acceptable. In that case, my question is how to create the header I made in HTML, but for the docx format?
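Purely as a hedged sketch (not an answer from the original thread): one portable option is to build the multi-line heading with add_header_lines() instead of an HTML caption string, and then style each added row, since row-level styling is rendered by both the HTML and Word writers.

# Sketch only: caption lines as extra header rows, so the styling survives in
# the docx output as well. Rows are added above the column labels; check their
# order in your flextable version.
table2 <- table_data %>%
  flextable() %>%
  add_header_lines(values = c(
    "Computer01.domain.com",
    "Network Interface",
    "Gbit Total/sec",
    "2021-05-14 18:04 to 2021-05-25 13:29"
  )) %>%
  bold(i = 1, part = "header") %>%                 # the "big, bold" line
  fontsize(i = 1, size = 14, part = "header") %>%
  fontsize(i = 4, size = 8, part = "header") %>%   # the "small" period line
  align(i = 1:4, align = "left", part = "header") %>%
  flextable::font(part = "all", fontname = "Calibri")

save_as_docx(table2, path = file.path("c:\\temp", "test02.docx"))
save_as_html(table2, path = file.path("c:\\temp", "test02.html"))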

R: Converting ggplot objects to interactive graphs

I am using the R programming language. I am trying to take different types of graphs (bar graphs, pie charts) and put them on the same page. I generated some fake data and made several graphs, then I put them together (see: Combining Different Types of Graphs Together (R)).
library(dplyr)
library(ggplot2)
library(cowplot)
library(gridExtra)
library(plotly)

date <- seq(as.Date("2014/1/1"), as.Date("2016/1/1"), by = "day")
var <- rnorm(731, 10, 10)
group <- sample(LETTERS[1:4], 731, replace = TRUE, prob = c(0.25, 0.22, 0.25, 0.25))
data <- data.frame(date, var, group)
data$year <- as.numeric(format(data$date, '%Y'))
data$year <- as.factor(data$year)

### Pie charts
Pie_2014 <- data %>%
  filter((data$year == "2014")) %>%
  group_by(group) %>%
  summarise(n = n())
Pie_2014_graph <- ggplot(Pie_2014, aes(x = "", y = n, fill = group)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) + ggtitle("Pie Chart 2014")

Pie_2015 <- data %>%
  filter((data$year == "2015")) %>%
  group_by(group) %>%
  summarise(n = n())
Pie_2015_graph <- ggplot(Pie_2015, aes(x = "", y = n, fill = group)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) + ggtitle("Pie Chart 2015")

Pie_total <- data %>%
  group_by(group) %>%
  summarise(n = n())
Pie_total_graph <- ggplot(Pie_total, aes(x = "", y = n, fill = group)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) + ggtitle("Pie Chart Average")

### Bar plots
Bar_years <- data %>%
  group_by(year, group) %>%
  summarise(mean = mean(var))
Bar_years_plot <- ggplot(Bar_years, aes(fill = group, y = mean, x = year)) +
  geom_bar(position = "dodge", stat = "identity") + ggtitle("Bar Plot All Years")

Bar_total <- data %>%
  group_by(group) %>%
  summarise(mean = n())
Bar_total_plot <- ggplot(Bar_total, aes(x = group, y = mean, fill = group)) +
  geom_bar(stat = "identity") + theme_minimal() + ggtitle("Bar Plot Average")

# Assembling the graphs can be done two different ways
# First way: gridExtra
g1 <- grid.arrange(Pie_2014_graph, Pie_2015_graph, Pie_total_graph, nrow = 1)
g2 <- grid.arrange(Bar_total_plot, Bar_years_plot, nrow = 1)
g <- grid.arrange(g1, g2, ncol = 1)

# Second way: cowplot
# arrange subplots in rows
top_row <- plot_grid(Pie_2014_graph, Pie_2015_graph, Pie_total_graph)
middle_row <- plot_grid(Bar_years_plot, Bar_total_plot)
# arrange our new rows into combined plot
p <- plot_grid(top_row, middle_row, nrow = 2)
p
From here, I am trying to use the plotly::ggplotly() command to make the above output "interactive" (move the mouse over the graphs and see labels). I know that this works for individual plots:
ggplotly(Bar_years_plot)
However, this command does not seem to work with the "cowplot" and the "gridExtra" outputs:
#gridExtra version:
ggplotly(g)
Error in UseMethod("ggplotly", p) :
no applicable method for 'ggplotly' applied to an object of class "c('gtable', 'gTree', 'grob', 'gDesc')"
#cowplot version: (produces empty plot)
ggplotly(p)
Warning messages:
1: In geom2trace.default(dots[[1L]][[1L]], dots[[2L]][[1L]], dots[[3L]][[1L]]) :
geom_GeomDrawGrob() has yet to be implemented in plotly.
If you'd like to see this geom implemented,
Please open an issue with your example code at
https://github.com/ropensci/plotly/issues
2: In geom2trace.default(dots[[1L]][[1L]], dots[[2L]][[1L]], dots[[3L]][[1L]]) :
geom_GeomDrawGrob() has yet to be implemented in plotly.
If you'd like to see this geom implemented,
Please open an issue with your example code at
https://github.com/ropensci/plotly/issues
Does anyone know if there is a quick way to use the ggplotly() function for objects created with "gridExtra" or "cowplot"?
I know that with a bit of work, it might be possible using "htmltools":
library(htmltools)
doc <- htmltools::tagList(
  div(Pie_2014_graph, style = "float:left;width:50%;"),
  div(Pie_2015_graph, style = "float:left;width:50%;"),
  div(Pie_total_graph, style = "float:left;width:50%;"),
  div(Bar_years_plot, style = "float:left;width:50%;"),
  div(Bar_total_plot, style = "float:left;width:50%;")
)
save_html(html = doc, file = "out.html")
But I am not sure how to do this.
Can someone please show me how to make the collections of graphs interactive, using either ggplotly() or htmltools?
Thanks.
You should apply ggplotly() to the individual graphs, not to the assembled collections.
For example:
Pie_2014_graph <- ggplotly(
  ggplot(Pie_2014, aes(x = "", y = n, fill = group)) +
    geom_bar(stat = "identity", width = 1) +
    coord_polar("y", start = 0) + ggtitle("Pie Chart 2014")
)