I have a dataframe with two columns: one is the name of a Facebook page, the other is the link to that page.
name.page link.page
"FBpage1" "http://facebook/FBpage1"
"FBpage2" "http://facebook/FBpage2"
"FBpage3" "http://facebook/FBpage3"
"FBpage4" "http://facebook/FBpage4"
and I want the output to be only name.page rendered as a hyperlink, so I can click the page name and go to that FB page.
name.page
"FBpage1"
"FBpage2"
"FBpage3"
"FBpage4"
I am using R2HTML but don't know how to do this.
Any suggestions?
Thank you!
I don't know about using the R2HTML package, but what I gather is that you need a way to generate the markup for each link so that the table contains a direct link to each page. For demonstration purposes I'm using htmltools and knitr to accomplish this.
PS: please use dput so we can more easily reproduce your data in the future.
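For example, assuming your data frame is called df (the name is just for illustration), dput prints a structure() call that anyone can paste straight into R to rebuild your object, roughly like this:
dput(df)
## structure(list(name.page = c("FBpage1", "FBpage2", "FBpage3", "FBpage4"),
##     link.page = c("http://facebook/FBpage1", "http://facebook/FBpage2",
##     "http://facebook/FBpage3", "http://facebook/FBpage4")),
##     class = "data.frame", row.names = c(NA, -4L))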
The part that produces the markup is the sprintf call inside the pipe:
library(dplyr)

read.table(
textConnection('name.page link.page
"FBpage1" "http://facebook/FBpage1"
"FBpage2" "http://facebook/FBpage2"
"FBpage3" "http://facebook/FBpage3"
"FBpage4" "http://facebook/FBpage4"')) %>% {
colnames(.) <- as.character(.[1, ])
.[-1, ] %>% mutate(
link_display =
# wrap each URL in an <a> tag, with the page name as the display text
sprintf('<a href="%s">%s</a>', link.page, name.page)
) %>%
knitr::kable(., format = "html", escape = FALSE) %>%
htmltools::HTML() %>%
htmltools::html_print()
}
So using sprintf I placed the link in the href attribute of an <a> tag, which is the HTML markup for creating a link, and then used the name column as the display text. The result is a rendered table in which each page name is a clickable link.
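If you'd rather write a standalone HTML file without the extra packages, here is a minimal base-R sketch that produces the same anchor markup; the file name links.html is just an example:
# rebuild the example data
df <- data.frame(
  name.page = paste0("FBpage", 1:4),
  link.page = paste0("http://facebook/FBpage", 1:4),
  stringsAsFactors = FALSE
)
# one table row per page, each wrapping the link in an <a> tag
rows <- sprintf('<tr><td><a href="%s">%s</a></td></tr>', df$link.page, df$name.page)
writeLines(c("<table>", "<tr><th>name.page</th></tr>", rows, "</table>"), "links.html")
# open links.html in a browser and click the page names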
I am trying to extract links from the summary section of a wikipedia page. I tried the below methods :
This url extracts all the links of the Deep learning page:
https://en.wikipedia.org/w/api.php?action=query&prop=links&titles=Deep%20learning
And for extracting the links associated with any section, I can filter based on the section id; e.g.,
for the Definition section of the same page I can use this url: https://en.wikipedia.org/w/api.php?action=parse&prop=links&page=Deep%20learning&section=1
for the Overview section of the same page I can use this url: https://en.wikipedia.org/w/api.php?action=parse&prop=links&page=Deep%20learning&section=2
But I am unable to figure out how to extract only the links from the summary section.
I even tried using pywikibot to extract linked pages and adjusting the plnamespace variable, but couldn't get links for the summary section only.
You need to use https://en.wikipedia.org/w/api.php?action=parse&prop=links&page=Deep%20learning&section=0
Note, however, that this also includes links in the {{machine learning bar}} and {{Artificial intelligence|Approaches}} templates (to the right of the screen).
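For example, here is a quick R sketch of that call using httr and jsonlite; the exact shape of the returned JSON (in particular the "*" field that holds each link title) is an assumption you should verify against the live response:
library(httr)
library(jsonlite)

# ask the parse API for the links in section 0 (the lead/summary section)
resp <- GET(
  "https://en.wikipedia.org/w/api.php",
  query = list(action = "parse", prop = "links",
               page = "Deep learning", section = 0, format = "json")
)
parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
# each link entry is assumed to carry its title in the "*" field
head(parsed$parse$links$`*`)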
You can use Pywikibot with the following commands:
>>> import pywikibot
>>> from pywikibot import textlib
>>> site = pywikibot.Site('wikipedia:en') # create a Site object
>>> page = pywikibot.Page(site, 'Deep learning') # create a Page object
>>> sect = textlib.extract_sections(page.text, site) # divide content into sections
>>> links = sorted(link.group('title') for link in pywikibot.link_regex.finditer(sect.head))
Now links is a list containing all link titles in alphabetical order. If you prefer Page objects as the result, you may create them with
>>> pages = [pywikibot.Page(site, title) for title in links]
It's up to you to create a script from these code snippets.
I am trying to scrape information from HTML webpages. I have the direct links but for some reason cannot get to the relevant text.
These are two examples of the webpages:
http://151.12.58.148:8080/CPC/CPC.detail.html?A00002
http://151.12.58.148:8080/CPC/CPC.detail.html?A00003
After I read the html, I am left with all the source code aside from the relevant text (which should change from page to page).
For example, the first link gives a page with this:
data di nascita 1872
which is coded, when I inspect it on my browser, as:
<p y:role="datasubset" y:arg="DATA_NASCITA" class="smalltitle">
<span class="celllabel">data di nascita</span>
<span y:role="multivaluedcontent" y:arg="DATA_NASCITA">1872</span>
</p>
However, when I read it with my code:
library(rvest)

link <- 'http://151.12.58.148:8080/CPC/CPC.detail.html?A00002'
page <- read_html(link)
write.table(as.character(page), "page.txt")
and I print "page", to check what I am getting, the same part of the code is:
<p y:role=\"datasubset\" y:arg=\"NASCITA\" class=\"smalltitle\">
<span class=\"celllabel\">luogo di nascita</span>
<span y:role=\"multivaluedcontent\" y:arg=\"NASCITA\"></span>
</p>
without 1872, which is the piece of information I am interested in.
(and also without some of the surrounding markup; I'm not sure if that is indicative of anything).
I can't seem to get around it, would anyone have suggestions?
Thank you very much!
To expand a bit further: the site's HTML loads a bunch of javascript and then has a template that is filled in after the document loads; it also uses the query parameter as some type of value that gets computed. I tried to just read in the target javascript file and parse it with V8, but there are too many external dependencies.
To read this, you'll need to use something like splashr or seleniumPipes. I'm partial to the former as I wrote it 😎.
Using either requires running an external program. I will not go into how to install Splash or Selenium in this answer. That's leg work you have to do but splashr makes it pretty easy to use Splash if you are comfortable with Docker.
This bit sets up the necessary packages and starts the Splash server (it will auto-download it first if Docker is available on your system):
library(rvest)
library(splashr)
library(purrr)
start_splash()
This next bit tells Splash to fetch & render the page and then retrieves the page content after javascript has done its work:
splash_local %>%
splash_response_body(TRUE) %>%
splash_user_agent(ua_macos_chrome) %>%
splash_go("http://151.12.58.148:8080/CPC/CPC.detail.html?A00002") %>%
splash_wait(2) %>%
splash_html() -> pg
Unfortunately, it's still a mess. They used namespaces, which are fine in XML docs but somewhat problematic the way they've used them here. We can work around that with some clever XPath:
html_nodes(pg, "body") %>%
html_nodes(xpath=".//*[local-name()='h4' or local-name()='p' or local-name()='span']/text()") %>%
html_text(trim=TRUE) %>%
discard(`==`, "")
## [1] "Abachisti Vittorio" "data di nascita" "1872"
## [4] "luogo di nascita" "Mirandola, Modena, Emilia Romagna, Italia" "luogo di residenza"
## [7] "Mirandola, Modena, Emilia Romagna, Italia" "colore politico" "socialista"
## [10] "condizione/mestiere/professione" "falegname" "annotazioni riportate sul fascicolo"
## [13] "radiato" "Unità archivistica" "busta"
## [16] "1" "estremi cronologici" "1905-1942"
## [19] "nel fascicolo è presente" "scheda biografica" "A00002"
Do this after you're all done with Splash/splashr to remove the running Docker container:
killall_splash()
I'm trying to scrape the Crossfit Games Open leaderboard. I have a version that worked in previous years, but the website changed and I can't seem to update my code to get it to work with the new site.
The issue is that I can't seem to get the correct CSS selector to grab the athletes' names and the links to their profiles.
My old code does something similar to this:
library(rvest)
# old site
old_url <- "https://games.crossfit.com/scores/leaderboard.php?stage=1&sort=1&page=1&division=1®ion=0&numberperpage=100&competition=0&frontpage=0&expanded=0&year=16&scaled=0&full=1&showtoggles=0&hidedropdowns=1&showathleteac=1&is_mobile=1"
old_page <- read_html(old_url)
# get the athletes profile url
athlete_link <- html_attr(html_nodes(old_page, "td.name a"), "href")
athlete_name <- html_text(html_nodes(old_page, "td.name a"))
head(athlete_link)
# [1] "http://games.crossfit.com/athlete/124483" "http://games.crossfit.com/athlete/2725" "http://games.crossfit.com/athlete/199938"
# [4] "http://games.crossfit.com/athlete/173837" "http://games.crossfit.com/athlete/2476" "http://games.crossfit.com/athlete/499296"
head(athlete_name)
# [1] "Josh Bridges" "Noah Ohlsen" "Jacob Heppner" "Jonne Koski" "Luke Schafer" "Andrew Kuechler"
# new site
new_url <- "https://games.crossfit.com/leaderboard?page=1&competition=1&year=2017&division=2&scaled=0&sort=0&fittest=1&fittest1=0&occupation=0"
new_page <- read_html(new_url)
# get the athletes profile url
# I would have thought something like this would get it,
# but it doesn't seem to pull anything
html_attr(html_nodes(new_page, "td.name a.profile-link"), "href")
# character(0)
html_text(html_nodes(new_page, "td.name div.full-name"))
# character(0)
I've tried various other CSS selectors, SelectorGadget, and a few other things. I'm experienced in R, but this is the only real web scraping project I've ever done, so I'm probably missing something very basic.
Which selector should I be using to grab this data?
It looks like the content of this page is generated dynamically with some javascript. You can inspect the source of the page and you'll see something like:
<div class="modal-body">
<!-- dynamically generated content goes here -->
</div>
where the table should go. In these cases, rvest isn't enough.
You can check this recent blog post that has useful pointers: https://rud.is/b/2017/02/09/diving-into-dynamic-website-content-with-splashr/
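If you want to stay in R, here is a rough sketch reusing the splashr approach from that post; it assumes you have a Splash instance running (see start_splash()) and that the rendered page still uses td.name markup, both of which you should verify:
library(rvest)
library(splashr)

new_url <- "https://games.crossfit.com/leaderboard?page=1&competition=1&year=2017&division=2&scaled=0&sort=0&fittest=1&fittest1=0&occupation=0"

# render the page with Splash so the javascript-built table is present in the HTML
splash_local %>%
  splash_response_body(TRUE) %>%
  splash_go(new_url) %>%
  splash_wait(3) %>%
  splash_html() -> pg

# these selectors are guesses based on the old markup; adjust them after
# inspecting the rendered page in a browser
html_attr(html_nodes(pg, "td.name a"), "href")
html_text(html_nodes(pg, "td.name a"), trim = TRUE)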
I'm trying to use the read_html function in the rvest package, but have come across a problem I am struggling with.
For example, if I were trying to read in the bottom table that appears on this page, I would use the following code:
library(rvest)
html_content <- read_html("https://projects.fivethirtyeight.com/2016-election-forecast/washington/#now")
By inspecting the HTML code in the browser, I can see that the content I would like is contained in a <table> tag (specifically, it is all contained within <table class="t-calc">). But when I try to extract this using:
tables <- html_nodes(html_content, xpath = '//table')
I retrieve the following:
> tables
{xml_nodeset (4)}
[1] <table class="tippingpointroi unexpanded">\n <tbody>\n <tr data-state="FL" class=" "> ...
[2] <table class="tippingpointroi unexpanded">\n <tbody>\n <tr data-state="NV" class=" "> ...
[3] <table class="scenarios">\n <tbody/>\n <tr data-id="1">\n <td class="description">El ...
[4] <table class="t-desktop t-polls">\n <thead>\n <tr class="th-row">\n <th class="t ...
Which includes some of the table elements on the page, but not the one I am interested in.
Any suggestions on where I am going wrong would be most appreciated!
The table is built dynamically from data in JavaScript variables on the page itself. Either use RSelenium to grab the text of the page after it's rendered and pass the page into rvest OR grab a treasure trove of all the data by using V8:
library(rvest)
library(V8)
URL <- "http://projects.fivethirtyeight.com/2016-election-forecast/washington/#now"
pg <- read_html(URL)
js <- html_nodes(pg, xpath=".//script[contains(., 'race.model')]") %>% html_text()
ctx <- v8()
ctx$eval(JS(js))
race <- ctx$get("race", simplifyVector=FALSE)
str(race) ## output too large to paste here
If they ever change the formatting of the JavaScript (it's an automated process so it's unlikely but you never know) then the RSelenium approach will be better provided they don't change the format of the table structure (again, unlikely, but you never know).
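For reference, a minimal sketch of the RSelenium route (it assumes a working local Selenium/browser setup; the XPath for the t-calc table comes from the question and may need adjusting):
library(RSelenium)
library(rvest)

# start a local Selenium-driven browser session
drv <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- drv$client
remDr$navigate("https://projects.fivethirtyeight.com/2016-election-forecast/washington/#now")
Sys.sleep(3)  # give the javascript time to build the table

# hand the rendered source to rvest and pull the table of interest
pg <- read_html(remDr$getPageSource()[[1]])
calc_table <- html_nodes(pg, xpath = '//table[contains(@class, "t-calc")]')

remDr$close()
drv$server$stop()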
The following is working:
library(RCurl)
library(XML)

url_a <- getURL("https://www.sec.gov/Archives/edgar/data/6494/000119312504029815/d8k.htm")
dfx_a <- htmlParse(url_a)
The following is not working:
url_b <- getURL("http://www.sec.gov/Archives/edgar/data/1639947/000104746916009939/a2226912zs-11.htm")
dfx_b <- htmlParse(url_b)
Why is this?
Maybe you are missing the "s" in "http" in the second URL; it should be https.
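In other words, something like this should work (a quick sketch of the suggested fix, mirroring the first, working example):
library(RCurl)
library(XML)

# same request as before, but over https:// like the working example
url_b <- getURL("https://www.sec.gov/Archives/edgar/data/1639947/000104746916009939/a2226912zs-11.htm")
dfx_b <- htmlParse(url_b)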