Get two different HTML nodes at the same time - html

I'm trying to scrape a web page.
I want to build a dataset from two different HTML nodes: ".table-grosse-schrift" and "td.zentriert.no-border".
url <- "https://www.transfermarkt.co.uk/serie-a/spieltag/wettbewerb/IT1/saison_id/2016/spieltag/4"
tt <- read_html(url) %>% html_nodes(".table-grosse-schrift") %>% html_text() %>% as.matrix()
temp1 <- data.frame(as.character(gsub("\r|\n|\t|\U00A0", "", tt[,])))
temp2 <- read_html(url) %>% html_nodes("td.zentriert.no-border") %>% html_text() %>% data.frame()
The problem is that the order of the ".table-grosse-schrift" nodes on the web page keeps changing, so I cannot match the data from the two nodes.
I thought the solution could be to get both nodes' data at the same time, like this:
tt <- read_html(url) %>% html_nodes(".table-grosse-schrift") %>% html_nodes("td.zentriert.no-border") %>% html_text() %>% as.matrix()
But this code does not work.

If I understand correctly, you should be able to use following-sibling to select the next corresponding sibling in the pair of nodes that you need.
"The following-sibling axis indicates all the nodes that have the same parent as the context node and appear after the context node in the source document." (Source: https://developer.mozilla.org/en-US/docs/Web/XPath/Axes/following-sibling)
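To make the axis concrete, here is a minimal Python/lxml sketch on toy markup (the label/value class names are made up for illustration): each "label" node's value is pulled from the sibling that follows it, regardless of where the pair sits on the page.

```python
from lxml import html

# Toy markup: label/value pairs as siblings under one parent.
doc = html.fromstring(
    '<div><span class="label">A</span><span class="value">1</span>'
    '<span class="label">B</span><span class="value">2</span></div>'
)

# For each label, take the first value-span among its following siblings.
pairs = [
    (n.text, n.xpath('following-sibling::span[@class="value"][1]')[0].text)
    for n in doc.xpath('//span[@class="label"]')
]
print(pairs)  # [('A', '1'), ('B', '2')]
```

Because each value is located relative to its own label, reordering the pairs on the page cannot desynchronize the two columns.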

Related

Looping with different row numbers in R

I wonder if you could give me a hint on how to get over a problem I encountered when trying to extract data from HTML files. I looked through other questions on the issue but still cannot figure out exactly what changes I should make. I have five HTML files in a folder. From each of them, I want to extract HTML links (/item.asp?id=) which I will later use. First, I extracted this data without any effort by reading every HTML file separately and creating a separate data frame for each one with the links I need. Then I used rbind() to merge the columns from each data frame. The catch is that the first three HTML pages have 20 rows of the data I need, the fourth has 16 rows, and the fifth and last has 9 rows.
The looping code works just fine over the first three pages, which have 20 rows each, but it fails for the fourth and fifth HTML pages because the row number there is different. I get this error:
Error in `[[<-.data.frame`(`*tmp*`, i, value = c("/item.asp?id=22529120", : replacement has 16 rows, data has 20
The code is as follows:
#LOOP over others
path <- "C:/Users/Dasha/Downloads/R STUDIO/RECTORS/test retrieve"
out.file <- ""
file.names <- dir(path, pattern = ".html")
for (i in 1:length(file.names)) {
  page <- read_html(file.names[i])
  links <- page %>% html_nodes("a") %>% html_attr("href")
  ## get all links into a data frame
  df <- as.data.frame(links)
  ## keep the links which contain /item.asp
  page_article <- df[grep("/item.asp", df$links), ]
  ## for each HTML file, save a separate data frame with a links column
  java[i] <- as.data.frame(page_article)
  ## save the number of the page this link is on
  page_num[i] <- paste(toString(i))
  ## save the id of the person this page belongs to
  id[i] <- as.character(file.names[i])
}
Can anyone give me a bit of advice on how to solve this? If I am successful, I should then be able to create a single column with the links, another column with an id, and the number of the HTML page.
Write a function which returns a dataframe after reading from each HTML file.
read_html_files <- function(filename) {
  page <- read_html(filename)
  links <- page %>% html_nodes("a") %>% html_attr("href")
  page_article <- grep("/item.asp", links, value = TRUE)
  data.frame(filename, page_article)
}
Use purrr::map_df to apply this function to every file and combine the output into one dataframe (result).
path = "C:/Users/Dasha/Downloads/R STUDIO/RECTORS/test retrieve"
file.names <- list.files(path, pattern ="\\.html$", full.names = TRUE)
result <- purrr::map_df(file.names, read_html_files, .id = 'id')
result
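For comparison, the same read-one-file-then-combine pattern can be sketched in Python with pandas and BeautifulSoup; the file names and markup below are throwaway examples created on the fly:

```python
import os
import tempfile

import pandas as pd
from bs4 import BeautifulSoup

def read_html_file(path):
    # Parse one file and return a DataFrame of its /item.asp links.
    with open(path, encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'html.parser')
    links = [a['href'] for a in soup.find_all('a', href=True)
             if '/item.asp' in a['href']]
    return pd.DataFrame({'filename': os.path.basename(path), 'link': links})

# Two throwaway files with different numbers of matching links,
# mimicking the uneven row counts from the question.
tmp = tempfile.mkdtemp()
pages = {
    'a.html': '<a href="/item.asp?id=1">x</a><a href="/item.asp?id=2">y</a>',
    'b.html': '<a href="/item.asp?id=3">z</a><a href="/other">w</a>',
}
for name, body in pages.items():
    with open(os.path.join(tmp, name), 'w', encoding='utf-8') as f:
        f.write(body)

files = sorted(os.path.join(tmp, n) for n in pages)
result = pd.concat(map(read_html_file, files), ignore_index=True)
print(result)  # three rows: uneven per-file counts combine cleanly
```

As with map_df, each file contributes however many rows it has, so the differing row counts are no longer a problem.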

Web scraping - unable to determine the node or text-heading argument to extract data from a URL via html_node/html_nodes in the rvest package

I aim to extract three data points from a URL. I am able to locate the specific top-level and individual CSS nodes and XPaths using SelectorGadget, and I aim to use html_node(read_html(url), css) to extract the elements I am interested in.
Using the main node CSS ("._2t2gK1hs"), I was able to extract the first element as a string. The top CSS node appears to contain only the first element, not the other two, although all three elements (one text and two numeric) share the same CSS class pattern (a "._39sLqIkw" heading followed by a "._1NHwuRzF" value).
(Snapshot of SelectorGadget highlighting the specific data points I would like to extract.)
In attempting to extract the data I tried:
page0_url <- read_html("https://www.tripadvisor.com/Hotel_Review-g1063979-d1902679-Reviews-Mas_El_Mir-Ripoll_Province_of_Girona_Catalonia.html")
html_node(page0_url, "._2t2gK1hs")
#Resulting in a string with the top element I aim to extract embedded.
{html_node}
<div class="_2t2gK1hs" data-tab="TABS_ABOUT" data-section-signature="about" id="ABOUT_TAB">
[1] <div>\n<div class="_39sLqIkw">PRICE RANGE</div>\n<div class="_1NHwuRzF">€124<!-- --> - <!-- -->€222<!-- --> <!-- -->(Based on Average Rates for a Standard Room) ...
# Failed to extract the two remaining elements by selecting their individual CSS selectors or XPaths.
library(rvest)
page0_url <- read_html("https://www.tripadvisor.com/Hotel_Review-g1063979-d1902679-Reviews-Mas_El_Mir-Ripoll_Province_of_Girona_Catalonia.html")
page0_url %>%
  html_nodes(xpath = '//*[contains(concat(" ", @class, " "), " _1NHwuRzF ")]') %>%
  html_text(trim = TRUE)
# I also tried, without success, passing the specific element node followed/preceded by PRICE RANGE, LOCATION, NUMBER OF ROOMS.
# How should I pass the argument, and which node(s) should I use in the function above?
#Expected result
PRICE RANGE
122 222
LOCATION
Spain Catalonia Province of Gerona Ripoll
NUMBER OF ROOMS
5
Thank you
Those classes look dynamic. Here is a hopefully more robust selector strategy, based on the relationship between more stable-looking elements and avoiding the likely dynamic class values:
library(rvest)
library(magrittr)
page0_url <- read_html('https://www.tripadvisor.com/Hotel_Review-g1063979-d1902679-Reviews-Mas_El_Mir-Ripoll_Province_of_Girona_Catalonia.html')
data <- page0_url %>%
  html_nodes('.in-ssr-only [data-tab=TABS_ABOUT] div[class]') %>%
  html_text()
data
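The same idea can be demonstrated in Python with BeautifulSoup, on toy markup modeled on the snippet earlier in the question: anchor on the stable data-tab attribute rather than the auto-generated class names.

```python
from bs4 import BeautifulSoup

# Toy markup with the stable data-tab attribute and dynamic-looking classes.
html = ('<div class="_2t2gK1hs" data-tab="TABS_ABOUT">'
        '<div class="_39sLqIkw">PRICE RANGE</div>'
        '<div class="_1NHwuRzF">€124 - €222</div>'
        '</div>')

soup = BeautifulSoup(html, 'html.parser')
# Select descendants of the stable container that carry any class at all.
texts = [d.get_text() for d in soup.select('[data-tab=TABS_ABOUT] div[class]')]
print(texts)  # ['PRICE RANGE', '€124 - €222']
```

Even if the class names are regenerated on the next deploy, the data-tab attribute keeps the selector working.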

Can't access specific content in html page with rvest and selectorGadget

I'm trying to scrape an NCBI page (https://www.ncbi.nlm.nih.gov/protein/29436380) to obtain information about a protein. I need to access the gene_synonyms and GeneID fields. I have tried to find the relevant nodes with the SelectorGadget add-on in Chrome and with the inspector in Firefox. I have tried this code:
require("dplyr")
require("rvest")
require("stringr")
GIwebPage <- read_html("https://www.ncbi.nlm.nih.gov/protein/29436380")
TestHTML <- GIwebPage %>%
  html_node("div.grid, div#maincontent.col.nine_col, div.sequence, pre.genebank, .feature") %>%
  html_text(trim = TRUE)
Then I try to find the relevant text but it is simply not there.
str_extract_all(TestHTML, pattern = "(synonym).{30}")
[[1]]
character(0)
str_extract_all(TestHTML, pattern = "(GeneID:).{30}")
[[1]]
character(0)
All I seem to be accessing is some of the text content of the column on the right.
str_extract_all(TestHTML, pattern = "(protein).{30}")
[[1]]
[1] "protein codes including ambiguities a"
[2] "protein sequence for myosin-9 (NP_00"
[3] "protein should not be confused with t"
[4] "protein, partial [Homo sapiens]gi|294"
[5] "protein codes including ambiguities a"
I have tried so many combinations of node selections with html_node() that I no longer know what to try. Is this content buried in some structure I can't see, or am I just not skilled enough to work out which node to select?
Thanks a lot,
José.
The page loads this information dynamically; the underlying data is stored at another location.
Using your browser's developer tools, look for the request:
The information you are looking for is stored at "viewer.fcgi"; right-click to copy the link.
See this similar question/answer: R not accepting xpath query
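As an alternative to hunting for viewer.fcgi in the network tab, NCBI's documented E-utilities efetch endpoint serves the same record as plain text. The sketch below only builds the request URL; fetching it (for example with requests.get) returns the GenPept flat file, which contains the gene_synonym and GeneID qualifiers.

```python
from urllib.parse import urlencode

# Build an E-utilities efetch request for the protein record.
# db/id/rettype/retmode follow the documented efetch parameters;
# rettype='gp' asks for the GenPept flat-file format.
base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'
params = {'db': 'protein', 'id': '29436380', 'rettype': 'gp', 'retmode': 'text'}
url = base + '?' + urlencode(params)
print(url)
```

The flat-file text can then be searched with the same str_extract_all patterns from the question, since it is not wrapped in dynamically loaded HTML.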

How to scrape text based on a specific link with BeautifulSoup?

I'm trying to scrape text from a website, but specifically only the text that's linked to with one of two specific links, and then additionally scrape another text string that follows shortly after it.
The second text string is easy to scrape because it includes a unique class I can target, so I've already gotten that working, but I haven't been able to successfully scrape the first text (with the one of two specific links).
I found this SO question ( Find specific link w/ beautifulsoup ) and tried to implement variations of that, but wasn't able to get it to work.
Here's a snippet of the HTML code I'm trying to scrape. This pattern recurs repeatedly over the course of each page I'm scraping:
<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179
The two parts I'm trying to scrape and then store together in a list are the two Chinese-language text strings.
The first of these, 女孩, which means female, is the one I haven't been able to scrape successfully.
This is always preceded by one of these two links:
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19 (Female)
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=15 (Male)
I've tested a whole bunch of different things, including things like:
gender_containers = soup.find_all('a', href = 'forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19')
print(gender_containers.get_text())
But for everything I've tried, I keep getting errors like:
ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I think that I'm not successfully finding those links to grab the text, but my rudimentary Python skills thus far have failed me in figuring out how to make it happen.
What I ultimately want is to scrape each page so that the two strings in this code (女孩 and 寻找2003年出生2004年失踪贵州省...)
<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179
...are scraped as two separate variables, so that I can store them as two items in a list, then iterate down to the next instance of this code, scrape those two text snippets, store them as another list, and so on. I'm building a list of lists in which I want each row/nested list to contain two strings: the gender (女孩 or 男孩) and then the longer string, which has much more variation.
(But currently I have working code that scrapes and stores that, I just haven't been able to get the gender part to work.)
Sounds like you could use an attribute = value CSS selector with the $ (ends-with) operator.
If there can only be one occurrence per page
soup.select_one("[href$='typeid=19'], [href$='typeid=15']").text
This assumes typeid=19 or typeid=15 only occur at the end of the href values of interest. The "," between the two in the selector allows matching on either.
You could additionally handle the possibility of the link not being present as follows:
from bs4 import BeautifulSoup

# the anchor wrapping the gender tag is added here so the demo matches;
# the question's snippet shows only the <em> tag, but each gender is
# preceded by one of the two typeid links
html = '''<a href="forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19"><em>[女孩]</em></a> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''
soup = BeautifulSoup(html, 'html.parser')
node = soup.select_one("[href$='typeid=19'], [href$='typeid=15']")
gender = node.text if node is not None else 'Not found'
print(gender)
Multiple values (note select, not select_one, to get all matches):
genders = [item.text for item in soup.select("[href$='typeid=19'], [href$='typeid=15']")]
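To pair each gender with the longer string that follows it, the link's next_sibling (the text node right after the anchor) can be read in the same pass. The anchor in the markup below is an assumption, since the question's snippet shows only the <em> tag:

```python
from bs4 import BeautifulSoup

# Assumed markup: the typeid link wraps the gender tag, followed by the
# free-floating descriptive text from the question's snippet.
html = '''<a href="forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19"><em>[女孩]</em></a> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''

soup = BeautifulSoup(html, 'html.parser')
rows = []
for a in soup.select("[href$='typeid=19'], [href$='typeid=15']"):
    gender = a.get_text(strip=True)   # e.g. '[女孩]'
    detail = a.next_sibling           # the text node right after the link
    rows.append([gender, detail.strip() if detail else ''])
print(rows)
```

On a full page, the loop yields one [gender, description] pair per listing, which matches the list-of-lists structure you are building.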
Try the following code.
from bs4 import BeautifulSoup
data='''<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''
soup=BeautifulSoup(data,'html.parser')
print(soup.select_one('em').text)
Output:
[女孩]

child number changes when web scraping for different patents

I am relatively new to web scraping.
I am having problems with child numbers when scraping multiple patents. The child number changes according to the location of the table in the web page: sometimes the child is "div:nth-child(17)" and other times it is "div:nth-child(18)" when searching for different patents.
My line of code is this one:
IPCs <- sapply("http://www.sumobrain.com/patents/us/Sonic-pulse-echo-method-apparatus/4202215.html", function(url1){
  tryCatch(
    url1 %>%
      as.character() %>%
      read_html() %>%
      html_nodes("#inner_content2 > div:nth-child(17) > div.disp_elm_value3 > table") %>%
      html_table(),
    error = function(e){NA}
  )
})
When I search for another patent (for example: "http://www.sumobrain.com/patents/us/Method-apparatus-quantitative-depth-differential/4982090.html") the child number changes to (18).
I am planning to analyse more than a thousand patents so I would need a code that work for both child numbers. Is there a CSS selector which allows me to select more children? I have tried the "div:nth-child(n)" and "div:nth-child(*)" but they do not work.
I am also open to using a different method. Does anybody have any suggestions?
Try these chained pseudo-classes:
They select a range of children, from the 17th through the 18th:
div:nth-child(n+17):nth-child(-n+18)
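To pick up both the 17th and the 18th child, the two pseudo-classes can be chained as nth-child(n+17):nth-child(-n+18) (at least 17, at most 18). A quick check with BeautifulSoup, whose soupsieve backend supports these selectors, on toy markup:

```python
from bs4 import BeautifulSoup

# Twenty numbered children; select positions 17 through 18 inclusive.
html = '<div id="c">' + ''.join(f'<p>{i}</p>' for i in range(1, 21)) + '</div>'
soup = BeautifulSoup(html, 'html.parser')
hits = soup.select('#c > p:nth-child(n+17):nth-child(-n+18)')
print([p.text for p in hits])  # ['17', '18']
```

The same selector string works unchanged in rvest's html_nodes(), so the one expression covers patents where the table is the 17th or the 18th child.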