Scraping GitHub commit author element - html

Are there any HTML whizzes out there able to extract the text of an element from this link: https://github.com/tidyverse/ggplot2?
The element text required is the commit author's user name shown next to the latest commit.
I am currently using rvest in R. I have tried XPath, CSS, etc., but I am just unable to extract the user name. I'm quite happy to take a link containing the name and clean up the text with regex if needed.
Any help greatly appreciated.

library(rvest)
read_html("https://github.com/tidyverse/ggplot2") %>%
html_nodes(".user-mention") %>%
html_text()
# [1] "thomasp85"
But if you are trying to grab information from multiple repos, you may want to consider using the official GitHub REST API and/or a lightweight R client package for it.
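As a minimal sketch of the API route (using jsonlite rather than a dedicated client package; the exact field names in the response are assumptions to verify against the GitHub commits endpoint documentation):

library(jsonlite)

# ask the GitHub REST API for the most recent commit on the default branch
commits <- fromJSON("https://api.github.com/repos/tidyverse/ggplot2/commits?per_page=1")

commits$author$login        # GitHub user name of the commit author
commits$commit$author$name  # author name recorded in the commit itself

Note that unauthenticated requests to api.github.com are rate-limited, so for many repos you would want to authenticate.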

Related

returning character(0) when scraping with rvest

I'm trying to do some web scraping with rvest. I'm new to R, so I have a bit of a knowledge barrier. I want to scrape the following URL:
https://www.spa.gov.sa/search.php?lang=ar&search=%D8%AD%D9%83%D9%85
That directs to a website in Arabic, but I don't think you need to be able to read Arabic to advise me. Basically, this is the first results page for a specific search term on this website (which is not a search engine). What I want to do is use rvest to scrape this page and return a list of the titles of the hyperlinks returned by the search. Using SelectorGadget, I identified that the node containing those titles is called ".h2NewsTitle". However, when I try to scrape that node using the code below, all I get in return is character(0):
library(tidyverse)
library(rvest)
read_html("https://www.spa.gov.sa/search.php?lang=ar&search=%D8%AD%D9%83%D9%85") %>%
html_nodes(".h2NewsTitle") %>%
html_text()
I don't think the issue here has to do with the Arabic text itself. I'm pretty sure everything is in UTF-8, and I can scrape other nodes on the same page and return Arabic text without issue. For example, the code below returns the Arabic text "بحث أسبوعي", which corresponds to the Arabic text in that node on the page itself:
read_html("https://www.spa.gov.sa/search.php?lang=ar&search=%D8%AD%D9%83%D9%85") %>%
html_nodes("WeeklySearch") %>%
html_text()
So I'm unsure why it is that, when I try to scrape the ".h2NewsTitle" node, I just get character(0) in return. I wonder if it has to do with some elements being rendered with JavaScript. This is a bit outside my expertise, so any advice on how to proceed would be appreciated. I'd like to continue using R, but I am open to switching to Python/Beautiful Soup if that is better suited for this.
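One quick way to test the JavaScript suspicion (a diagnostic sketch, not part of the original question) is to check whether the class name appears anywhere in the raw HTML the server returns; if it does not, the titles are injected client-side and rvest alone will never see them:

library(rvest)

raw <- read_html("https://www.spa.gov.sa/search.php?lang=ar&search=%D8%AD%D9%83%D9%85")

# FALSE here means the titles are rendered by JavaScript after the page loads,
# so a plain GET (rvest, httr, readLines) cannot retrieve them
grepl("h2NewsTitle", as.character(raw))

If that check comes back FALSE, a headless-browser approach (for example RSelenium, as in the last answer below) or reproducing the site's underlying search request is the usual way forward.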

rvest returning empty list

I am trying to import a table from a website by copying the XPath of the relevant HTML and scraping it with the rvest package. I have done this successfully multiple times before, but when I try it now I merely produce an empty list. In an attempt to diagnose my problem, I ran the following code (taken from https://www.r-bloggers.com/using-rvest-to-scrape-an-html-table/). However, this code also produces an empty list for me.
Thanks in advance for the help!
library(rvest)
url <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
population <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/table[1]') %>%
  html_table()
Your xpath query is wrong. The table is not a direct child of the node with an id of mw-content-text. It is a descendant though. Try
html_nodes(xpath = '//*[@id="mw-content-text"]//table[1]')
Web scraping is a very fragile endeavor and can easily break when websites change their HTML.
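Putting that together, a minimal corrected version of the original pipeline (the table index is kept from the question and may need adjusting if Wikipedia reshuffles the page):

library(rvest)

url <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"

population <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]//table[1]') %>%  # descendant, not direct child
  html_table()

population[[1]]  # the parsed table as a data frame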

Can't XML or HTML Parse This Website in R

I need to scrape this page to get the value of the comment, as well as the Document and Submitter Information on the right side:
https://www.regulations.gov/document?D=FDA-2014-N-1207-7673
I've tried using read_html() and read_xml() from the xml2 package with no luck. I've tried getURLContent() from RCurl followed by xmlParse() and htmlParse() from the XML package.
I even tried simply readLines(), which does not actually get me the content of the website.
I suppose I don't have a great understanding of how this all works. Previous websites I have always been able to scrape simply with html_parse(), html_nodes() and html_attr(). How can I accomplish scraping this website?

not scraping the html source, but the actual website

I am working on a project where I want to scrape a page like this in order to get the city of origin. I tried to use the CSS selector ".type-12~ .type-12+ .type-12", but I do not get the text into R.
Link:
https://www.kickstarter.com/projects/1141096871/support-ctrl-shft/description
I use rvest and the read_html function.
However, it seems that the source has some scripts in it. Is there a way to scrape the website after the scripts have returned their results (as you see it with a browser)?
PS: I looked at similar questions but did not find the answer.
Code:
main.names <- read_html("https://www.kickstarter.com/projects/1141096871/support-ctrl-shft/description")
names1 <- main.names %>%       # feed `main.names` to the next step
  html_nodes("div.mb0-md") %>% # get the CSS nodes
  html_text()                  # extract the text
You should not do it. They provide an API, which you can find here: https://status.kickstarter.com/api
Using APIs or Ajax/JSON calls is usually better since:
The server isn't overloaded by your scraper visiting every link it can find and causing unnecessary traffic, which is bad both for the speed of your program and for the servers of the site you are scraping.
You don't have to worry that a changed class name or id will break your code.
The second point in particular should interest you, since it can take hours to find out which class is no longer returning a value.
But to answer your question:
When you use the right scraper you can find everything you want. What tools are you using? There are ways to get data either before or after the site is loaded. You can execute the JS on the site separately and find hidden content, or look for things like display:none CSS classes...
It really depends on what you are using and how you use it.
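As a rough sketch of the "scrape after the scripts have run" route (it assumes a local Selenium/Chrome setup is available and that the question's CSS selector still matches once the page has rendered; neither is verified against the live page):

library(RSelenium)
library(rvest)

# start a local Selenium session; this requires a compatible browser driver
rD <- rsDriver(browser = "chrome", verbose = FALSE)
remDr <- rD$client

remDr$navigate("https://www.kickstarter.com/projects/1141096871/support-ctrl-shft/description")
Sys.sleep(5)  # crude wait so the page's scripts have time to finish

rendered <- remDr$getPageSource()[[1]]  # HTML as the browser sees it after JavaScript has run

read_html(rendered) %>%
  html_nodes(".type-12 ~ .type-12 + .type-12") %>%  # selector from the question (assumed to match)
  html_text()

remDr$close()
rD$server$stop()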

RSelenium scraping for Disqus comments

I'm trying to scrape or obtain the text of Disqus comments from an online local newspaper using RSelenium in Chrome, but I am finding the going a little tough for my capabilities. I have searched in many places but did not find the right information, or (most probably) I am using the wrong search terms.
So far I have managed to get the "normal" HTML from the pages but cannot pinpoint the right class, CSS selector or id to get the Disqus comments. I have also tried SelectorGadget, but it only points to #dsq-app2, which selects the whole Disqus area at once and does not allow me to select smaller parts of it. I tried the same with RSelenium using elems <- mybrowser$findElement(using = "id", "dsq-app2"), with an "environment" being stored in elems. Then I tried to find child elements within elems but came up blank.
Viewing the page via the developer tools I can see that the interesting stuff is within an iframe called #dsq-app2, and I have managed to extract all its source through elems$getPageSource() after switching to the frame using elems$switchToFrame("dsq-app2"). This outputs all the HTML as one big "dirty" chunk, and short of searching for the required stuff held in <p> tags and other elements of interest, such as posters' usernames in data-role="username", I don't seem to find the right way forward.
I have also tried following the advice given here, but the Disqus setup is a little different. One of the pages I'm trying is this one, with the bulk of the comments area inside a section called conversation, a ton of other ids such as posts, and the un-ordered list with id=post-list that ultimately carries the comments I need to scrape.
Any ideas or help tips are most welcome and received with thanks.
After a lot of testing and experimenting I managed to get it working. I don't know if it's the cleanest or prettiest solution, but it works. I hope others will find it useful. Basically, what I did was find the URL that points to the comments only. It is found within the "dsq-app2" iframe, in an attribute called src. At first I was also switching to the iframe, but I found that this works without doing so.
remDr$navigate("toTheRequiredPage")
elemsource <- remDr$findElement(using = "id", value = "dsq-app2")
src <- elemsource$getElementAttribute("src")  # find the src attribute within the iframe
remDr$navigate(src[[1]]) # navigate to the src url
# find the posts from the new page
elem <- remDr$findElement(using = "id", value = "posts")
elem.posts <- elem$findChildElements(using = "id", value = "post-list")
elem.msgs <- elem.posts[[1]]$findChildElements(using = "class name", value = "post-message")
length(elem.msgs)
msgtext <- elem.msgs[[1]]$getElementText() # find first post's text
msgtext # print message
Update: I found out that if I use remDr$switchToFrame("dsq-app2") I do not need to navigate to the src URL as explained above. So there are actually two ways of scraping (a sketch of the first variant follows below):
Use switchToFrame("nameOfFrame"), or
Use my prior solution of navigating to the src URL taken from the iframe.
Hope this makes it clearer.
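A minimal sketch of the switchToFrame() variant, reusing the element ids from the answer above ("toTheRequiredPage" remains a placeholder URL):

remDr$navigate("toTheRequiredPage")
remDr$switchToFrame("dsq-app2")  # step straight into the Disqus iframe

# same element lookups as before, now run inside the iframe
elem <- remDr$findElement(using = "id", value = "posts")
elem.posts <- elem$findChildElements(using = "id", value = "post-list")
elem.msgs <- elem.posts[[1]]$findChildElements(using = "class name", value = "post-message")

elem.msgs[[1]]$getElementText()  # text of the first comment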