Ruby Nokogiri: extract an HTML tag value

There's a website whose search results span many pages, and I'd like to know the total number of pages for each search.
As the snapshot below shows, the last page is page 41 and it is not clickable, so I want to extract that value, 41, from those 2 span tags.
I tried XPath, but would prefer a CSS solution. Any help?
Thanks
page_temp = Nokogiri::HTML(browser.html)
page_temp.xpath('tr[@td = "colspan="32""]').each do |node|
  puts node.text
end
Click here to view the snapshot

Since you are using Ruby, here's some simple code you can use:
page_temp = Nokogiri::HTML(browser.html)
all_pages = page_temp.search("td[colspan='32'] tr td")
puts all_pages.map{|p| p.text} # list all page numbers
puts all_pages.last.text # list the last page number

Related

Filter part of the html page when scraping results with Scrapy

I want to scrape the products that are listed in this webpage. So I tried to extract all of the data-tcproduct attributes from the div.product-tile. It contains numerous things including the url of the products I need to visit.
So I did:
def parse_brand(self, response):
    for d in set(response.css('div.product-tile::attr(data-tcproduct)').extract()):
        d = json.loads(d)
        yield scrapy.Request(url=d['product_url_page'].replace("p","P"), callback=self.parse_item)
Yet I noticed that some of the div.product-tile elements seem to be hidden on the page, and I am not interested in them. The ones I want to scrape are under product-listing-title.
So how can I filter part of the HTML page when scraping results with Scrapy?
I don't think that you need product-listing-title. You need items from search-result-content div instead:
for d in response.css('div.search-result-content div.product-tile::attr(data-tcproduct)').extract():
    d = json.loads(d)
    yield scrapy.Request(url=d['product_url_page'].replace("p","P"), callback=self.parse_item)

How to scrape text based on a specific link with BeautifulSoup?

I'm trying to scrape text from a website, but only the text that is wrapped in one of two specific links, plus another text string that follows shortly after it.
The second text string is easy to scrape because it includes a unique class I can target, so I've already gotten that working, but I haven't been able to successfully scrape the first text (with the one of two specific links).
I found this SO question ( Find specific link w/ beautifulsoup ) and tried to implement variations of that, but wasn't able to get it to work.
Here's a snippet of the HTML code I'm trying to scrape. This pattern recurs repeatedly over the course of each page I'm scraping:
<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179
The two parts I'm trying to scrape and then store together in a list are the two Chinese-language text strings.
The first of these, 女孩, which means female, is the one I haven't been able to scrape successfully.
This is always preceded by one of these two links:
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19 (Female)
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=15 (Male)
I've tested a whole bunch of different things, including things like:
gender_containers = soup.find_all('a', href = 'forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19')
print(gender_containers.get_text())
But for everything I've tried, I keep getting errors like:
ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I think that I'm not successfully finding those links to grab the text, but my rudimentary Python skills thus far have failed me in figuring out how to make it happen.
What I want to have happen ultimately is to scrape each page such that the two strings in this code (女孩 and 寻找2003年出生2004年失踪贵州省...)
<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179
...are scraped as two separate variables, so that I can store them as two items in a list, then move on to the next instance of this code, scrape those two text snippets, store them as another list, and so on. I'm building a list of lists in which each row (nested list) contains two strings: the gender (女孩 or 男孩) and then the longer string, which has a lot more variation.
(But currently I have working code that scrapes and stores that, I just haven't been able to get the gender part to work.)
Sounds like you could use an attribute = value CSS selector with the $ (ends with) operator.
If there can only be one occurrence per page:
soup.select_one("[href$='typeid=19'], [href$='typeid=15']").text
This is assuming those typeid=19 or typeid=15 only occur at the end of the strings of interest. The "," between the two in the selector is to allow for matching on either.
You could additionally handle possibility of not being present as follows:
from bs4 import BeautifulSoup
html = '''<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''
soup = BeautifulSoup(html, 'html.parser')
# Look the link up once, then fall back to a default if it is missing
link = soup.select_one("[href$='typeid=19'], [href$='typeid=15']")
gender = link.text if link is not None else 'Not found'
print(gender)
Multiple values:
genders = [item.text for item in soup.select("[href$='typeid=19'], [href$='typeid=15']")]
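For reference, the ends-with match that `[href$='typeid=19']` performs can be reproduced with only the standard library. The sketch below is not BeautifulSoup: it uses `html.parser.HTMLParser` together with `str.endswith`. The class name `GenderLinkParser` and the sample markup (including the `<a>` tags and the text 男孩) are assumptions made for illustration, since the snippet in the question does not show the links themselves.

```python
from html.parser import HTMLParser


class GenderLinkParser(HTMLParser):
    """Collect the text of every <a> whose href ends with one of SUFFIXES."""

    SUFFIXES = ("typeid=19", "typeid=15")

    def __init__(self):
        super().__init__()
        self.genders = []
        self._inside = False  # True while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        if tag == "a" and href.endswith(self.SUFFIXES):
            self._inside = True
            self.genders.append("")

    def handle_data(self, data):
        if self._inside:
            self.genders[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._inside = False


# Hypothetical markup; the real page's link structure is assumed here.
html = (
    '<em>[<a href="forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19">女孩</a>]</em>'
    '<em>[<a href="forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=15">男孩</a>]</em>'
)
parser = GenderLinkParser()
parser.feed(html)
print(parser.genders)
```

Passing a tuple to `endswith` plays the same role as the comma in the CSS selector: either suffix is accepted.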
Try the following code.
from bs4 import BeautifulSoup
data='''<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''
soup=BeautifulSoup(data,'html.parser')
print(soup.select_one('em').text)
Output:
[女孩]

How to find the index of HTML child tag in Selenium WebDriver?

I am trying to find a way to return the index of a HTML child tag based on its xpath.
For instance, on the right rail of a page, I have three elements:
//*[@id="ctl00_ctl50_g_3B684B74_3A19_4750_AA2A_FB3D56462880"]/div[1]/h4
//*[@id="ctl00_ctl50_g_3B684B74_3A19_4750_AA2A_FB3D56462880"]/div[2]/h4
//*[@id="ctl00_ctl50_g_3B684B74_3A19_4750_AA2A_FB3D56462880"]/div[3]/h4
Assume that I've found the first element, and I want to return the number inside the tag div, which is 1. How can I do it?
I referred to this previous post (How to count HTML child tag in Selenium WebDriver using Java) but still cannot figure it out.
You can get the number using regex:
var regExp = /div\[([^\]]+)\]/;
var matches = regExp.exec("//*[@id=\"ctl00_ctl50_g_3B684B74_3A19_4750_AA2A_FB3D56462880\"]/div[2]/h4");
console.log(matches[1]); // returns 2
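The same extraction can be sketched in Python with the standard `re` module. This is only an illustration of the regex approach; the helper name `div_index` is hypothetical, and the sketch assumes the index of interest is the last `div[N]` step in the XPath string.

```python
import re

# Matches any "div[N]" step in an XPath string and captures the digits N.
DIV_INDEX = re.compile(r"div\[(\d+)\]")


def div_index(xpath: str):
    """Return the numeric index of the last div[N] step, or None if absent."""
    matches = DIV_INDEX.findall(xpath)
    return int(matches[-1]) if matches else None


xpath = '//*[@id="ctl00_ctl50_g_3B684B74_3A19_4750_AA2A_FB3D56462880"]/div[2]/h4'
print(div_index(xpath))
```

Capturing only digits (`\d+`) avoids accidentally matching predicates such as `div[@class="x"]`.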
You can select the preceding siblings in XPath to get all the reports before your current one, like this:
//h4[contains(text(),'hello1')]/preceding-sibling::h4
Now you only have to count how many you found plus the current and you have your index.
Another option would be to select all the reports at once and loop over them, checking their content. They always come back in the same order as they appear in the DOM.
For Java it could look like this:
List<WebElement> reports = driver.findElements(By.xpath("//*[@id='ctl00_ctl50_g_3B684B74_3A19_4750_AA2A_FB3D56462880']/div/h4"));
for (WebElement element : reports) {
    if (element.getText().contains("report1")) {
        return reports.indexOf(element) + 1;
    }
}
Otherwise you will have to parse the xpath by yourself to extract the value (see LG3527118's answer for this).
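The "position among siblings" idea can also be tried without a browser. The sketch below uses Python's standard-library `xml.etree.ElementTree` on a small made-up fragment (the markup, the `rail` id, and the `report2` text are assumptions, not the asker's page). It walks the `div` children in document order and returns the 1-based position of the one whose `h4` has the wanted text, which is the number an XPath `div[N]` step would use.

```python
import xml.etree.ElementTree as ET

# Made-up markup standing in for the right rail; not the real page.
FRAGMENT = """
<div id="rail">
  <div><h4>report1</h4></div>
  <div><h4>report2</h4></div>
  <div><h4>report3</h4></div>
</div>
"""


def div_position(root, heading_text):
    """Return the 1-based position of the child div whose h4 text matches."""
    divs = root.findall("div")  # direct div children, in document order
    for position, div in enumerate(divs, start=1):
        h4 = div.find("h4")
        if h4 is not None and h4.text == heading_text:
            return position
    return None


root = ET.fromstring(FRAGMENT)
print(div_position(root, "report2"))
```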

Selenium, Python 3, simple scraping text from Erowid LSD experiences?

Based off of an answer on here about a similar thing, I tried to scrape the text of Erowid trip experiences. The URL has a bunch of trip links. I want to click each link and then print the 'report-text-surround' element, which is the trip text.
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.erowid.org/experiences/exp.cgi?S1=2&S2=-3&C1=9&Str=')
# I tried to get hrefs by xpath, knowing that each trip link starts with 'exp.php?ID'.
view_links = driver.find_elements_by_xpath("""//*[contains(text(), 'exp.php?ID')]""")
for index, view in enumerate(view_links):
    html = view.get_attribute('innerHTML')
    href = html.split('"')[1]
    view_links[index] = href
# And then visit each href and get the data
for href in view_links:
    driver.get(href)
    # I know this is the element containing the trip text.
    trip_text = driver.find_elements_by_class_name('report-text-surround')
    for trip in trip_text:
        print (trip.text.encode('utf-8'))
So you are pretty close but there are just 2 small mistakes.
trip_text = driver.find_elements_by_class_name('report-text-surround')
for trip in trip_text:
    print (trip.text.encode('utf-8'))
Your driver.find_elements_by_class_name call should not be plural, as there is only one element with the class 'report-text-surround' on the page. That element has many child elements, but the class itself occurs only once. This means you get all the text at once; you could change this, but you would have to go through the child elements or fetch the elements separately.
You can change that entire section to this:
text = driver.find_element_by_class_name('report-text-surround').text.encode('utf-8')
print(text)
That will give you all of the text in the entire article. An easy way to split this up after would be to split each part of the text by \n\n.
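A minimal sketch of that split, using only built-in string methods. The helper name `split_paragraphs` and the sample string are made up for illustration; in practice the input would be the element's text before it is encoded to bytes.

```python
def split_paragraphs(text: str):
    """Split text on blank lines and drop empty pieces."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


# Made-up stand-in for the string returned by the element's .text attribute.
sample = "First paragraph.\n\nSecond paragraph,\nstill the second.\n\n\n\nThird."
for paragraph in split_paragraphs(sample):
    print(repr(paragraph))
```

Filtering out empty pieces keeps runs of several blank lines from producing empty paragraphs.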

$_SERVER[QUERY_STRING] copying itself

This is the part of the code for paging (the page 1, page 2, ... links at the bottom). $_SERVER[QUERY_STRING] is used to copy what was searched on the previous page, so that page 2 displays results for the same query.
The problem is that on page 2 the query string already includes the page number (&page=2), so when you click through to page 3, $_SERVER[QUERY_STRING] copies both the query (which I need, e.g. ?search=salad) and the old page number (which is unnecessary), and the result looks like this: &page=2&page=3
Is there a good way to do this? It would be nice if only the page number changed instead of the whole string being copied.
<a href='$_SERVER[PHP_SELF]?$_SERVER[QUERY_STRING]?start=$back'><font face='Verdana' size='2'>PREV</font></a>
$query = http_build_query(array('page' => $num) + $_GET);
printf('<a href="%s?%s">Prev</a>', $_SERVER['PHP_SELF'], $query);
This uses the $_GET array, which contains all the values of $_SERVER['QUERY_STRING'] in a neat array, "overwrites" the page value of that array, then re-assembles it into a URL-encoded query string.
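The same merge-then-rebuild idea can be sketched outside PHP. The Python snippet below is only an illustration: `with_page` is a hypothetical helper name, and the `page` parameter is taken from the question. It parses the current query string, drops any previous `page` value, appends the new one, and re-encodes everything else unchanged.

```python
from urllib.parse import parse_qsl, urlencode


def with_page(query_string: str, page: int) -> str:
    """Return query_string with its 'page' parameter set to `page`.

    `with_page` is a hypothetical helper name used only for this sketch.
    """
    # Keep every existing parameter except any previous 'page' value.
    params = [
        (key, value)
        for key, value in parse_qsl(query_string, keep_blank_values=True)
        if key != "page"
    ]
    # Append the new page number exactly once.
    params.append(("page", str(page)))
    return urlencode(params)


print(with_page("search=salad&page=2", 3))
```

Because the old `page` entry is removed before the new one is added, repeated clicks never accumulate `&page=2&page=3`.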