Select XPath to scrape data with Scrapy - html

I have looked at some of the XPath threads on here and read two XPath guides, but I am having trouble writing working code to scrape a webpage full of restaurant names, phone numbers, and addresses with Scrapy.
[Picture of the site's source code with the data I want highlighted]
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    titles = hxs.select("//b")
    items = []
    for titles in titles:
        item = kosherscrapeitem
        item["BusinessName"] = titles.select('//td[#class="line-content"]//html/body/table/tbody/tr[()').extract()
This is the XPath I get when copying the XPath from Chrome:
/html/body/table/tbody/tr[485]/td[2]/text()
Can someone help me with my XPath expression?
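Based on the Chrome-copied path, one plausible fix is to select every table row and read the second cell with an XPath relative to that row, rather than hard-coding tr[485] or using #class (which is CSS/jQuery syntax, not XPath). Below is a minimal sketch; the item class name KosherScrapeItem and the assumption that the business name sits in the second td of each row come from the question and are not verified against the live page:

def parse(self, response):
    items = []
    # Select every row of the table instead of hard-coding tr[485].
    for row in response.xpath('//table//tr'):
        item = KosherScrapeItem()  # assumed item class name
        # Relative XPath: the second cell of the current row, mirroring the
        # Chrome-copied /html/body/table/tbody/tr[N]/td[2]/text().
        # .get() is the newer Scrapy API; .extract_first() works on older versions.
        item['BusinessName'] = row.xpath('./td[2]/text()').get()
        items.append(item)
    return items

Note that Chrome inserts tbody into copied paths even when the downloaded HTML has no tbody tag, so it is usually safer to leave tbody out of XPath expressions written against the raw page.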

Related

Looping with different row numbers in R

I wonder if you could give me a hint on how to get over a problem I encountered when trying to extract data from HTML files. I looked through other questions regarding the issue but still cannot figure out exactly what changes I should make. I have five HTML files in a folder. From each of them, I want to extract HTML links which I will later use. At first I extracted this data without any trouble by reading each HTML file separately and creating a separate data frame for each file with the links I need (/item.asp?id=). Then I used rbind to combine the results from each data frame. The key here is that the first three HTML pages have 20 rows of the data I need, the fourth HTML has 16 rows, and the fifth and last has 9 rows.
The looping code works just fine when I loop over the first three pages, which have 20 rows each, but it breaks on the fourth and fifth HTML pages because the row number is different there. I get the following error:
Error in `[[<-.data.frame`(`*tmp*`, i, value = c("/item.asp?id=22529120",  :
  replacement has 16 rows, data has 20
The code is as follows:
#LOOP over others
path = "C:/Users/Dasha/Downloads/R STUDIO/RECTORS/test retrieve"
out.file <- ""
file.names <- dir(path, pattern = ".html")
for (i in 1:length(file.names))
{
  page <- read_html(file.names[i])
  links <- page %>% html_nodes("a") %>% html_attr("href")
  ## get all links into a dataframe
  df <- as.data.frame(links)
  ## get links which contain /item.asp
  page_article <- df[grep("/item.asp", df$links), ]
  ## for each HTML save a separate data frame with links column
  java[i] <- as.data.frame(page_article)
  ## save number of a page where this link is
  page_num[i] <- paste(toString(i))
  ## save id of a person this page belongs to
  id[i] <- as.character(file.names[i])
}
Can anyone give a bit of advice on how to solve this issue? If it works, I should then be able to create a single column with the links, another column with an id, and the number of the HTML page each link came from.
Write a function which returns a dataframe after reading from each HTML file.
library(rvest)  # provides read_html(), html_nodes(), html_attr() and the %>% pipe

read_html_files <- function(filename) {
  page <- read_html(filename)
  links <- page %>% html_nodes("a") %>% html_attr("href")
  page_article <- grep("/item.asp", links, value = TRUE)
  data.frame(filename, page_article)
}
Use purrr::map_df to apply this function to every file and combine the output into one data frame (result).
path = "C:/Users/Dasha/Downloads/R STUDIO/RECTORS/test retrieve"
file.names <- list.files(path, pattern ="\\.html$", full.names = TRUE)
result <- purrr::map_df(file.names, read_html_files, .id = 'id')
result

Filter part of the HTML page when scraping results with Scrapy

I want to scrape the products listed on this webpage, so I tried to extract all of the data-tcproduct attributes from the div.product-tile elements. Each attribute contains numerous things, including the URL of the product page I need to visit.
So I did:
def parse_brand(self, response):
    for d in set(response.css('div.product-tile::attr(data-tcproduct)').extract()):
        d = json.loads(d)
        yield scrapy.Request(url=d['product_url_page'].replace("p","P"), callback=self.parse_item)
Yet I noticed that some of the div.product-tile elements seem to be hidden on the page, and I am not interested in those. The ones I want to scrape are under product-listing-title instead.
So how can I filter part of the HTML page when scraping results with Scrapy?
I don't think you need product-listing-title. You need the items from the search-result-content div instead:
for d in response.css('div.search-result-content div.product-tile::attr(data-tcproduct)').extract():
    d = json.loads(d)
    yield scrapy.Request(url=d['product_url_page'].replace("p","P"), callback=self.parse_item)

How to scrape text based on a specific link with BeautifulSoup?

I'm trying to scrape text from a website, but specifically only the text that is linked with one of two specific links, and then additionally scrape another text string that follows shortly after it.
The second text string is easy to scrape because it includes a unique class I can target, so I've already gotten that working, but I haven't been able to successfully scrape the first text (the one with one of the two specific links).
I found this SO question ( Find specific link w/ beautifulsoup ) and tried to implement variations of that, but wasn't able to get it to work.
Here's a snippet of the HTML code I'm trying to scrape. This pattern recurs repeatedly over the course of each page I'm scraping:
<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179
The two parts I'm trying to scrape and then store together in a list are the two Chinese-language text strings.
The first of these, 女孩, which means female, is the one I haven't been able to scrape successfully.
This is always preceded by one of these two links:
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19 (Female)
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=15 (Male)
I've tried a whole bunch of different things, including:
gender_containers = soup.find_all('a', href = 'forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19')
print(gender_containers.get_text())
But for everything I've tried, I keep getting errors like:
ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I think that I'm not successfully finding those links to grab the text, but my rudimentary Python skills thus far have failed me in figuring out how to make it happen.
What I want to have happen ultimately is to scrape each page such that the two strings in this code (女孩 and 寻找2003年出生2004年失踪贵州省...)
<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179
...are scraped as two separate variables so that I can store them as two items in a list, then iterate down to the next instance of this code, scrape those two text snippets, store them as another list, and so on. I'm building a list of lists in which I want each row/nested list to contain two strings: the gender (女孩 or 男孩) and then the longer string, which has a lot more variation.
(I already have working code that scrapes and stores the longer string; I just haven't been able to get the gender part to work.)
Sounds like you could use an [attribute$=value] CSS selector, i.e. the $ "ends with" operator.
If there can only be one occurrence per page:
soup.select_one("[href$='typeid=19'], [href$='typeid=15']").text
This assumes typeid=19 or typeid=15 only occurs at the end of the href values of interest. The "," between the two selectors allows matching on either.
You could additionally handle the possibility of the link not being present, as follows:
from bs4 import BeautifulSoup
html ='''<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''
soup=BeautifulSoup(html,'html.parser')
match = soup.select_one("[href$='typeid=19'], [href$='typeid=15']")
gender = match.text if match is not None else 'Not found'
print(gender)
Multiple values:
genders = [item.text for item in soup.select("[href$='typeid=19'], [href$='typeid=15']")]
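To pair each gender with the longer description that follows it, one option is to read the text node immediately after the matched link. A minimal sketch, assuming each gender em tag is wrapped in one of the two links (the snippet pasted in the question omits the surrounding a tag, so the href shown here is illustrative):

from bs4 import BeautifulSoup

# Hypothetical markup: the question's snippet omits the surrounding <a>,
# so the link wrapping the <em> is an assumption.
html = '''<a href="forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19"><em>[女孩]</em></a> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''

soup = BeautifulSoup(html, 'html.parser')

rows = []
for link in soup.select("[href$='typeid=19'], [href$='typeid=15']"):
    gender = link.get_text(strip=True)    # e.g. [女孩]
    details = link.next_sibling           # the text node right after the link
    details = details.strip() if details else ''
    rows.append([gender, details])

print(rows)  # [['[女孩]', '寻找2003年出生2004年失踪...']]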
Try the following code.
from bs4 import BeautifulSoup
data='''<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''
soup=BeautifulSoup(data,'html.parser')
print(soup.select_one('em').text)
Output:
[女孩]

Trouble pulling HTML data

I am attempting to use pandas to scrape a very simple HTML table from a website. My code works fine on other sites, but not on the site of interest; there it gives me "ValueError: No text parsed from document". Are there any common situations that would cause this?
I've also tried using requests, but the text property shows up empty as well.
urllogin = 'http://website.html'
values = {'user': 'id',
          'password': 'pass'}
r = requests.post(urllogin, data=values)
url = 'http://wesbite/table'
tables = pd.read_html('url')
EDIT
This returns nothing as well.
soup = BeautifulSoup(urlopen(url), "html.parser")
print(soup.prettify())
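Two things stand out in the snippet above: pd.read_html('url') passes the literal string 'url' rather than the url variable, and read_html opens its own unauthenticated connection, so any cookies from the earlier login post are lost. If the table sits behind that login, a sketch along these lines (reusing the question's placeholder URLs and credentials) keeps the session alive and hands the fetched HTML to pandas:

import pandas as pd
import requests

# Placeholder URLs and credentials from the question.
urllogin = 'http://website.html'
values = {'user': 'id', 'password': 'pass'}
url = 'http://wesbite/table'

with requests.Session() as s:
    s.post(urllogin, data=values)   # log in; cookies are stored on the session
    r = s.get(url)                  # fetch the table page with those cookies

# Parse the downloaded HTML text rather than letting pandas fetch the URL itself.
tables = pd.read_html(r.text)

If r.text still comes back empty, the table is likely rendered by JavaScript and will not appear in the raw HTML at all, which would also explain why soup.prettify() shows nothing.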

Accessing HTML tables with rvest

So I want to scrape some NBA data. The following is what I have so far, and it is perfectly functional:
install.packages('rvest')
library(rvest)
url = "https://www.basketball-reference.com/boxscores/201710180BOS.html"
webpage = read_html(url)
table = html_nodes(webpage, 'table')
data = html_table(table)
away = data[[1]]
home = data[[3]]
colnames(away) = away[1,] #set appropriate column names
colnames(home) = home[1,]
away = away[away$MP != "MP",] #remove rows that are just column names
home = home[home$MP != "MP",]
The problem is that these tables don't include the team names, which is important. To get this information, I was thinking I would scrape the four factors table on the webpage; however, rvest doesn't seem to recognize it as a table. The div that contains the four factors table is:
<div class="overthrow table_container" id="div_four_factors">
And the table is:
<table class="suppress_all sortable stats_table now_sortable" id="four_factors" data-cols-to-freeze="1"><thead><tr class="over_header thead">
This made me think that I could access the table via something along the lines of
table = html_nodes(webpage,'#div_four_factors')
but this doesn't seem to work, as I am getting just an empty list. How can I access the four factors table?
I am by no means an HTML expert, but it appears that the table you are interested in is commented out in the page source, and the comment is only removed at some point before the page is rendered, which is why rvest never sees it as a table node.
If we assume that the home team is always listed second, we can just use positional arguments and scrape another element on the page:
table <- html_nodes(webpage, '#bottom_nav_container')
teams <- html_text(table[1]) %>%
  stringr::str_split("Schedule\n")
away$team <- trimws(teams[[1]][1])
home$team <- trimws(teams[[1]][2])
Obviously not the cleanest solution, but such is life in the world of web scraping.