Scrapy: List all links and info contained in the same page from a website - html

I have the following basic mini spider that I use to get all links from a website.
from scrapy.item import Field, Item
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class SampleItem(Item):
    link = Field()

class SampleSpider(CrawlSpider):
    name = "sample_spider"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    rules = (
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        item = SampleItem()
        item['link'] = response.url
        return item
I was wondering whether it would be possible to have this same spider also scrape some HTML (like the snippet below) from these same links, and to list the link and the info in a CSV in two separate columns?
<span class="price">50,00 €</span>

Yes, that's possible, of course. First of all, you need to use a feed export. This can be set in settings.py with these options:
FEED_FORMAT = 'csv'
FEED_URI = 'file:///absolute/path/to/the/output.csv'
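If you prefer to keep the feed configuration next to the spider, the same options can also be set per spider via the custom_settings class attribute; a minimal sketch (the command line scrapy crawl sample_spider -o output.csv achieves much the same):
class SampleSpider(CrawlSpider):
    name = "sample_spider"
    # per-spider feed configuration instead of settings.py
    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'file:///absolute/path/to/the/output.csv',
    }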
Then you will have to adjust your items to hold more fields. Currently, you only use the link; you will want to add a price field.
class SampleItem(Item):
    link = Field()
    price = Field()
One side note: usually we define items in the items.py file, because generally multiple spiders should scrape the same type of item from several pages. You would then import them into your spider using from scrapername.items import SampleItem, as sketched below. An example application for this would be a price scraper which scrapes both Amazon and some smaller shops.
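A minimal sketch of that layout (the scrapername package name is just a placeholder for your project name):
# items.py
from scrapy.item import Field, Item

class SampleItem(Item):
    link = Field()
    price = Field()

# in your spider module
from scrapername.items import SampleItem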
Finally, you will have to adjust the parse_page method of your spider. Currently it only saves the URL into your item; you want to find the price and save it as well. Finding numbers or text on a page is a key element of scraping, and for this purpose we have selectors. Scrapy supports XPath, CSS and regular-expression selectors. The first two are especially useful because they can be nested. Regular expressions are generally used once you have found the correct HTML element but there is too much information within it.
A problem you might encounter is that a page can have multiple .price elements. Have you made sure there is only one? Otherwise the selector will give you all of them, and you might have to refine it using additional surrounding tags.
So, let's assume there is only this one .price element, and construct our selector. We use a CSS selector here because it's more intuitive in this case. You can call the selectors directly on the response using the css and xpath methods. Both always return selector objects on which you can call css() and xpath() again. To get the textual representation you need to call extract() on them. This might be annoying at first, but nesting selectors is very convenient. Note that the selectors give you the full HTML element including the tag; to get only the text content you need to make this explicit: for CSS selectors via ::text, for XPath via /text().
def parse_page(self, response):
    item = SampleItem()
    item['link'] = response.url
    try:
        item['price'] = response.css('.price::text')[0].extract()
    except IndexError:
        # do whatever is best if price cannot be found
        item['price'] = None
    return item
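For comparison, the equivalent lookup with an XPath selector would look roughly like this (a sketch assuming the <span class="price"> markup from the question):
# XPath equivalent of the CSS selector above
item['price'] = response.xpath('//span[@class="price"]/text()')[0].extract()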


Selenium: using find_element but ending up with half the website

I finished the linked tutorial and tried to modify it to get something else from a different website. I am trying to get the margin table of HHI, but the website is coded in such a strange way that I am quite confused.
I found the child element of the parent that has the text, with the XPath //a[@name="HHI"]. Its parent is <font size="2"></font> and contains the text I want, but there are a lot of tags named exactly <font size="2"></font>, so I can't just use //font[@size="2"].
Attempting to use the full XPath prints out half of the website's content.
The full XPath:
/html/body/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr[3]/td/pre/font/table/tbody/tr/td[2]/pre/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font
Is there anyway to select that particular font tag and print the text?
website:
https://www.hkex.com.hk/eng/market/rm/rm_dcrm/riskdata/margin_hkcc/merte_hkcc.htm
Tutorial
https://www.youtube.com/watch?v=PXMJ6FS7llk&t=8740s&ab_channel=freeCodeCamp.org
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import pandas as pd
# prepare it to automate
from datetime import datetime
import os
import sys
import csv

application_path = os.path.dirname(sys.executable)  # export the result to the same folder as the executable
now = datetime.now()  # for stamping the export name with a date
month_day_year = now.strftime("%m%d%Y")  # MMDDYYYY

website = "https://www.hkex.com.hk/eng/market/rm/rm_dcrm/riskdata/margin_hkcc/merte_hkcc.htm"
path = "C:/Users/User/PycharmProjects/Automate with Python – Full Course for Beginners/venv/Scripts/chromedriver.exe"

# headless mode
options = Options()
options.headless = True

service = Service(executable_path=path)
driver = webdriver.Chrome(service=service, options=options)
driver.get(website)
containers = driver.find_element(by="xpath", value='')  # or find_elements
hhi = containers.text  # if using find_elements: containers[0].text
print(hhi)
Update:
Thanks to Conal Tuohy, I learned a few new tricks in XPath. The website is written in such a strange way that even with the XPath that locates the exact font tag, the result still prints all the text in every following tag.
I tried to make a list of the different products with .split("Back to Top"), then sliced out the first item and used .split("\n"). I will keep calling .split() on the nested lists until they fit neatly into a DataFrame with strike prices as the index and maturity dates as the columns.
Probably not the most efficient way, but it works for now.
product = "HHI"
containers = driver.find_element(by="xpath", value=f'//font[a/@name="{product}"]')
hhi = containers.text.split("Back to Top")
# print(hhi)
hhi1 = hhi[0].split("\n")
df = pd.DataFrame(hhi1)
# print(df)
df.to_csv(f"{product}_{month_day_year}.csv")
You're right that the HTML is just awful! But if you're after the text of the table, it seems to me you ought to select the text node that follows the b element that follows the a[@name="HHI"]; something like this:
//a[@name="HHI"]/following-sibling::b/following-sibling::text()[1]
EDIT
Of course that XPath won't work in Selenium, because it identifies a text node rather than an element. So your best option is to return the font element that directly contains the //a[@name="HHI"], which will include some cruft (the Back to Top link, etc.) but which will at least contain the tabular data you want:
//a[@name="HHI"]/parent::font
i.e. "the parent font element of the a element whose name attribute equals HHI"
or equivalently:
//font[a/@name="HHI"]
i.e. "the font element which has, among its child a elements, one whose name attribute equals HHI"

Select parent of XCUIElement

How can I select the parent element of an XCUIElement in XCUITest? According to the documentation, the class has children() and descendants() but nothing to select parents or siblings. It seems to me I must be missing something - how can Apple have an element tree without navigation in both directions???
I know there is a method containing() on XCUIElementQuery but that is not the same. I also know that an accessibilityIdentifier might help but I am thinking of writing a generic method for testing any view with a given Navbar label. Passing in all the identifiers of all the elements I would like to access does not seem like a good option.
Unfortunately, Apple provides no directly named method to access parent elements, similar to the children() and descendants() it does provide, but you were actually on the right track with containing(). There are two ways I usually approach it when I need to locate a parent element based on its children/descendants:
Using containing(_:identifier:)
let parentElement = app.otherElements.containing(.textField, identifier: "test").firstMatch
or
let parentElement = app.otherElements.containing(.textField, identifier: "test").element(boundBy: 0)
Using containing(_ predicate: NSPredicate)
let parentElement = app.otherElements.containing(NSPredicate(format: "label CONTAINS[c] 'test'")).firstMatch
or
let parentElement = app.otherElements.containing(NSPredicate(format: "label CONTAINS[c] 'test'")).element(boundBy: 0)
These are just examples with random data/element types because you didn't mention exactly what you want to achieve but you can go from there.
Update:
As usual, the Apple documentation doesn't do us a good service. They say 'descendants', but what they actually mean is both direct descendants (children) and non-direct descendants. Unfortunately there is no guarantee and no generic solution; it should be based on your current needs and the application's implementation. Some more examples that could be useful:
If you don't want the first element from the query you are better off using element(boundBy: index). So if you know that XCUIElementQuery will give you 5 elements and you know you need the 3rd one:
let parentElement = app.otherElements.containing(.textField, identifier: "test").element(boundBy: 2)
Fine-graining your element locators. Let's say you have 3 views with identifier "SomeView"; these 3 views each contain 2 other subviews, and the subviews have a button with identifier "SomeButton".
let parentViews = app.otherElements.matching(identifier: "SomeView")
let subView = parentViews.element(boundBy: 2).otherElements.containing(.button, identifier: "SomeButton").element(boundBy: 1)
This will give you the second subview containing a button with identifier "SomeButton" from the third parent view with identifier "SomeView". Using such an approach you can fine tune until you get exactly what you need and not all parents, grandparents, great-grandparents etc.
I wish Apple provided a bit more flexibility for locators in XCTest, like XPath does for Appium, but even these tools are sufficient most of the time.

Filter part of the html page when scraping results with Scrapy

I want to scrape the products that are listed on this webpage. So I tried to extract all of the data-tcproduct attributes from the div.product-tile elements. The attribute contains numerous things, including the URL of the products I need to visit.
So I did:
def parse_brand(self, response):
    for d in set(response.css('div.product-tile::attr(data-tcproduct)').extract()):
        d = json.loads(d)
        yield scrapy.Request(url=d['product_url_page'].replace("p", "P"), callback=self.parse_item)
Yet I noticed that some attributes from the div.product-tile elements seem to be hidden in the page, and I am not interested in them. Those I want to scrape are rather under product-listing-title.
So how can I filter part of the HTML page when scraping results with Scrapy?
I don't think you need product-listing-title. You need items from the search-result-content div instead:
for d in response.css('div.search-result-content div.product-tile::attr(data-tcproduct)').extract():
    d = json.loads(d)
    yield scrapy.Request(url=d['product_url_page'].replace("p", "P"), callback=self.parse_item)
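If duplicate tiles are a concern, the set() de-duplication from the question can be kept alongside the narrowed selector; a sketch (parse_item is assumed to exist elsewhere in the spider):
import json
import scrapy

def parse_brand(self, response):
    # narrow to the visible search results, then de-duplicate the JSON attributes
    tiles = response.css('div.search-result-content div.product-tile::attr(data-tcproduct)').extract()
    for d in set(tiles):
        d = json.loads(d)
        yield scrapy.Request(url=d['product_url_page'].replace("p", "P"),
                             callback=self.parse_item)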

How to scrape text based on a specific link with BeautifulSoup?

I'm trying to scrape text from a website, but specifically only the text that's linked to with one of two specific links, and then additionally scrape another text string that follows shortly after it.
The second text string is easy to scrape because it includes a unique class I can target, so I've already gotten that working, but I haven't been able to successfully scrape the first text (with the one of two specific links).
I found this SO question (Find specific link w/ beautifulsoup) and tried to implement variations of that, but wasn't able to get it to work.
Here's a snippet of the HTML code I'm trying to scrape. This pattern recurs repeatedly throughout each page I'm scraping:
<a href="forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19"><em>[女孩]</em></a> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179
The two parts I'm trying to scrape and then store together in a list are the two Chinese-language text strings.
The first of these, 女孩, which means female, is the one I haven't been able to scrape successfully.
This is always preceded by one of these two links:
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19 (Female)
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=15 (Male)
I've tested a whole bunch of different things, including things like:
gender_containers = soup.find_all('a', href = 'forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19')
print(gender_containers.get_text())
But for everything I've tried, I keep getting errors like:
ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I think that I'm not successfully finding those links to grab the text, but my rudimentary Python skills thus far have failed me in figuring out how to make it happen.
What I want to have happen, ultimately, is to scrape each page so that the two strings in this code (女孩 and 寻找2003年出生2004年失踪贵州省...)
<a href="forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19"><em>[女孩]</em></a> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179
...are scraped as two separate variables, so that I can store them as two items in a list and then iterate down to the next instance of this code, scrape those two text snippets, and store them as another list, etc. I'm building a list of lists in which I want each row/nested list to contain two strings: the gender (女孩 or 男孩) and then the longer string, which has a lot more variation.
(But currently I have working code that scrapes and stores that, I just haven't been able to get the gender part to work.)
Sounds like you could use an [attribute$=value] CSS selector with the $ 'ends with' operator.
If there can only be one occurrence per page:
soup.select_one("[href$='typeid=19'], [href$='typeid=15']").text
This assumes those typeid=19 or typeid=15 substrings only occur at the end of the href strings of interest. The "," between the two in the selector allows matching on either.
You could additionally handle the possibility of it not being present, as follows:
from bs4 import BeautifulSoup

html = '''<a href="forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19"><em>[女孩]</em></a> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''
soup = BeautifulSoup(html, 'html.parser')
match = soup.select_one("[href$='typeid=19'], [href$='typeid=15']")
gender = match.text if match is not None else 'Not found'
print(gender)
Multiple values:
genders = [item.text for item in soup.select("[href$='typeid=19'], [href$='typeid=15']")]
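Building on that, a rough sketch of collecting the [gender, description] pairs the question asks for, assuming each matched link's text is the gender tag and the text node right after it is the longer string:
from bs4 import BeautifulSoup

html = '''<a href="forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19"><em>[女孩]</em></a> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''
soup = BeautifulSoup(html, 'html.parser')
rows = []
for a in soup.select("[href$='typeid=19'], [href$='typeid=15']"):
    gender = a.get_text(strip=True).strip('[]')  # e.g. 女孩
    description = a.next_sibling                 # the text node after the link
    rows.append([gender, description.strip() if description else ''])
print(rows)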
Try the following code.
from bs4 import BeautifulSoup

data = '''<a href="forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19"><em>[女孩]</em></a> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''
soup = BeautifulSoup(data, 'html.parser')
print(soup.select_one('em').text)
print(soup.select_one('em').text)
Output:
[女孩]

How to find the index of HTML child tag in Selenium WebDriver?

I am trying to find a way to return the index of an HTML child tag based on its XPath.
For instance, on the right rail of a page, I have three elements:
//*[@id="ctl00_ctl50_g_3B684B74_3A19_4750_AA2A_FB3D56462880"]/div[1]/h4
//*[@id="ctl00_ctl50_g_3B684B74_3A19_4750_AA2A_FB3D56462880"]/div[2]/h4
//*[@id="ctl00_ctl50_g_3B684B74_3A19_4750_AA2A_FB3D56462880"]/div[3]/h4
Assume that I've found the first element, and I want to return the number inside the tag div, which is 1. How can I do it?
I referred to this previous post (How to count HTML child tag in Selenium WebDriver using Java) but still cannot figure it out.
You can get the number using regex:
var regExp = /div\[(\d+)\]/;
var matches = regExp.exec('//*[@id="ctl00_ctl50_g_3B684B74_3A19_4750_AA2A_FB3D56462880"]/div[2]/h4');
console.log(matches[1]); // returns "2"
You can select preceding siblings in XPath to get all the reports before your current one, like this:
//h4[contains(text(),'hello1')]/preceding-sibling::h4
Now you only have to count how many you found, add one for the current element, and you have your index; a sketch follows below.
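A quick sketch of that counting approach with Selenium's Python bindings (the hello1 locator is carried over from the XPath above):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL
# count the h4 siblings before the current one, then add one for the element itself
preceding = driver.find_elements(
    By.XPATH, "//h4[contains(text(),'hello1')]/preceding-sibling::h4")
index = len(preceding) + 1
print(index)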
Another option would be to select all the reports at once and loop over them, checking their content. They always come in the same order as they appear in the DOM.
For Java it could look like this:
List<WebElement> reports = driver.findElements(By.xpath("//*[@id='ctl00_ctl50_g_3B684B74_3A19_4750_AA2A_FB3D56462880']/div/h4"));
for (WebElement element : reports) {
    if (element.getText().contains("report1")) {
        return reports.indexOf(element) + 1;
    }
}
Otherwise you will have to parse the XPath yourself to extract the value (see LG3527118's answer for this).
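For completeness, the same index extraction in Python, mirroring the regex answer above (the XPath string is the one from the question):
import re

# pull the positional index out of the div[...] step of the XPath
xpath = '//*[@id="ctl00_ctl50_g_3B684B74_3A19_4750_AA2A_FB3D56462880"]/div[2]/h4'
match = re.search(r'div\[(\d+)\]', xpath)
if match:
    print(match.group(1))  # -> 2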