How to scrape text based on a specific link with BeautifulSoup? - html

I'm trying to scrape text from a website, but specifically only the text that's linked to with one of two specific links, and then additionally scrape another text string that follows shortly after it.
The second text string is easy to scrape because it includes a unique class I can target, so I've already gotten that working, but I haven't been able to successfully scrape the first text (with the one of two specific links).
I found this SO question (Find specific link w/ beautifulsoup) and tried to implement variations of it, but wasn't able to get it to work.
Here's a snippet of the HTML code I'm trying to scrape. This pattern recurs repeatedly on each page I'm scraping:
<em>[<a href="forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19">女孩</a>]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179
The two parts I'm trying to scrape and then store together in a list are the two Chinese-language text strings.
The first of these, 女孩, which means female, is the one I haven't been able to scrape successfully.
This is always preceded by one of these two links:
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19 (Female)
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=15 (Male)
I've tested a whole bunch of different things, including things like:
gender_containers = soup.find_all('a', href = 'forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19')
print(gender_containers.get_text())
But for everything I've tried, I keep getting errors like:
ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I think that I'm not successfully finding those links to grab the text, but my rudimentary Python skills thus far have failed me in figuring out how to make it happen.
What I want to have happen ultimately is to scrape each page such that the two strings in this code (女孩 and 寻找2003年出生2004年失踪贵州省...)
<em>[<a href="forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19">女孩</a>]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179
...are scraped as two separate variables so that I can store them as two items in a list, then iterate down to the next instance of this code, scrape those two text snippets, store them as another list, and so on. I'm building a list of lists in which I want each row/nested list to contain two strings: the gender (女孩 or 男孩) and then the longer string, which has a lot more variation.
(I already have working code that scrapes and stores the longer string; I just haven't been able to get the gender part to work.)

Sounds like you could use an [attribute=value] CSS selector with the $ (ends with) operator.
If there can only be one occurrence per page:
soup.select_one("[href$='typeid=19'], [href$='typeid=15']").text
This assumes typeid=19 or typeid=15 only occurs at the end of the href values of interest. The "," between the two parts of the selector allows matching on either one.
You could additionally handle possibility of not being present as follows:
from bs4 import BeautifulSoup
html = '''<em>[<a href="forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19">女孩</a>]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''
soup = BeautifulSoup(html, 'html.parser')
match = soup.select_one("[href$='typeid=19'], [href$='typeid=15']")
gender = match.text if match is not None else 'Not found'
print(gender)
Multiple values:
genders = [item.text for item in soup.select("[href$='typeid=19'], [href$='typeid=15']")]
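For the list-of-lists goal described in the question, each gender link can then be paired with the description that follows it. Below is a minimal sketch of that idea; the find_next() hop and the 'xst' class on the description link are assumptions about the page structure (substitute whatever unique class the page actually uses):
from bs4 import BeautifulSoup

html = '''<em>[<a href="forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19">女孩</a>]</em> <a class="xst" href="#">寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179</a>'''
soup = BeautifulSoup(html, 'html.parser')

records = []
for gender_link in soup.select("[href$='typeid=19'], [href$='typeid=15']"):
    gender = gender_link.get_text(strip=True)
    # walk forward in document order to the next description link;
    # 'xst' is a placeholder for the unique class mentioned in the question
    description = gender_link.find_next('a', class_='xst')
    records.append([gender, description.get_text(strip=True) if description else None])

print(records)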

Try the following code.
from bs4 import BeautifulSoup
data='''<em>[<a href="forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19">女孩</a>]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''
soup=BeautifulSoup(data,'html.parser')
print(soup.select_one('em').text)
Output:
[女孩]

Related

ROBOTFRAMEWORK - Looping through all images on a page - pulling the link

I am working on a test that checks that all images on a page are visible. I'm running into an issue where it's only pulling the link from the first img on the page and logging that same link once per loop iteration. I'm currently getting a count of all the images, and within that count I loop through and pull the img source. There are no special classes or ids; the only thing I have to go off of is the img tag. I'm guessing I will somehow need to parse the entire HTML, since Robot Framework only looks at what is viewable on the screen?
My end goal is to pull all img sources on a page and confirm each one returns a 200 status code.
Here is what I have now:
@{all_image_sources}    Create List
${all_images}    Get Element Count    //body//img
FOR    ${image}    IN RANGE    ${all_images}
    ${img_src}    Get Element Attribute    tag:img    src
    Log    ${img_src}
    Append To List    ${all_image_sources}    ${img_src}
END
Log List    ${all_image_sources}
You might consider using Get WebElements; this will give you each image locator in a list. You can then loop through the list to get each src attribute.
Example:
@{all_image_sources}    Create List
${all_images}    Get WebElements    //body//img
FOR    ${image}    IN    @{all_images}
    ${img_src}    Get Element Attribute    ${image}    src
    Append To List    ${all_image_sources}    ${img_src}
END
Log List    ${all_image_sources}
Get WebElements
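The end goal mentioned above (confirming each src returns a 200 status code) can also be checked outside Robot Framework once the URLs are collected. A minimal Python sketch, assuming the list of sources has already been handed over:
import requests

# hypothetical list, e.g. exported from ${all_image_sources}
image_sources = ["https://example.com/a.png", "https://example.com/b.png"]

for src in image_sources:
    # HEAD is usually enough to verify availability without downloading the image
    status = requests.head(src, allow_redirects=True).status_code
    print(src, status)
    assert status == 200, f"{src} returned {status}"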

xpath scraping data from the second page

I am trying to scrape data from this webpage: http://webfund6.financialexpress.net/clients/zurichcp/PortfolioPriceTable.aspx?SchemeID=33, and I specifically need data for fund number 26.
I have no problem getting data from the first page with this address (funds number 1-25), but for the life of me I can't scrape anything from the second page. Can someone help?
Thanks!
Here is the code I use in Google Sheets:
=IMPORTXML("http://webfund6.financialexpress.net/clients/zurichcp/PortfolioPriceTable.aspx?SchemeID=33","/html/body/form[@id='MainForm']/table/tr/td/div[@id='main']/div[@id='tabResult']/div[@id='Prices']/table/thead/tr[26]/td[@class='Center'][1]")
You can do 2 things - one is to append the PgIndex=2 onto the end of your URL, and then you can also significantly simplify your xpath to this:
//*[@id='Prices']//tr[2]/td[2]
This specifically grabs the second row on the table (tr which means table-row), in order to bypass the header row, then grabs the second field which is the table-data cell.
=IMPORTXML("http://webfund6.financialexpress.net/clients/zurichcp/PortfolioPriceTable.aspx?SchemeID=33&PgIndex=2","//*[@id='Prices']//tr[2]/td[2]")
To get the second page, add &PgIndex=2 to your url. Then adjust the /table/thead/tr[26] to /table/thead/tr[2]. The result is:
=IMPORTXML("http://webfund6.financialexpress.net/clients/zurichcp/PortfolioPriceTable.aspx?SchemeID=33&PgIndex=2","/html/body/form[@id='MainForm']/table/tr/td/div[@id='main']/div[@id='tabResult']/div[@id='Prices']/table/thead/tr[2]/td[@class='Center'][1]")
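Outside Google Sheets, the same shortened XPath can be sanity-checked with Python and lxml. This is just a sketch and assumes the page responds to a plain GET with the &PgIndex=2 parameter, as it does for IMPORTXML:
import requests
from lxml import html

url = ("http://webfund6.financialexpress.net/clients/zurichcp/"
       "PortfolioPriceTable.aspx?SchemeID=33&PgIndex=2")
tree = html.fromstring(requests.get(url).content)
# same XPath as in the IMPORTXML formula: second row, second cell of the Prices table
cells = tree.xpath("//*[@id='Prices']//tr[2]/td[2]/text()")
print(cells[0].strip() if cells else "no match")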

Scrapy : List all links and infos contained in same page from a website

I have the following mini basic spider I use to get all links from a website.
from scrapy.item import Field, Item
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
class SampleItem(Item):
    link = Field()

class SampleSpider(CrawlSpider):
    name = "sample_spider"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]
    rules = (
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        item = SampleItem()
        item['link'] = response.url
        return item
I was wondering whether it would be possible to have this same spider also scrape some HTML (like the snippet below) from these same links, and list the link and the info in a CSV in two separate columns?
<span class="price">50,00 €</span>
Yes, that's possible of course. First of all you need to use a feed export. This can be set in the settings.py with the options:
FEED_FORMAT = 'csv'
FEED_URI = 'file:///absolute/path/to/the/output.csv'
Then you will have to adjust your items to allow more elements. Currently, you only use the link. You will want to add a price field.
class SampleItem(Item):
    link = Field()
    price = Field()
One sidenote: Usually we define items in the items.py file, because generally multiple spiders should scrape the same type of item from several pages. You would then import them into your spider using from scrapername.items import SampleItem. An example application for this would be a price scraper which scrapes both Amazon and some smaller shops.
Finally, you will have to adjust the parse_page method of your spider. Currently you only save the URL into your item. You want to find the price and save it as well. Finding numbers or text on a page is a key element of scraping. For this purpose we have selectors. Scrapy supports XPath, CSS and regular expression selectors. The first two are especially useful because they can be nested. Regular expressions would generally be used when you have found the correct HTML element but there is too much information within it.
A problem you might encounter is that a page might have multiple .price elements. Have you made sure there is only one? Otherwise the selector will give you all of them, and you might have to refine your selector using additional tags or attributes.
So, let's assume there is only this one .price element and construct our selector. We use CSS selector here, because it's more intuitive in this case. You can call the selectors directly on the response using css and xpath methods. Both of them always return elements on which you might use css() and xpath() again. To get the textual representation you need to call extract() on them. This might be annoying at the beginning, but nesting selectors is very convenient. Note that the selectors give you the full HTML element including the tag. To only get the text content, you need to make this explicit. For CSS selectors via ::text, for XPath via /text().
def parse_page(self, response):
    item = SampleItem()
    item['link'] = response.url
    try:
        item['price'] = response.css('.price::text')[0].extract()
    except IndexError:
        # do whatever is best if price cannot be found
        item['price'] = None
    return item
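With the feed settings above in place, running the spider the usual way (scrapy crawl sample_spider) should then write one CSV row per crawled page, with the link and price columns coming from the two item fields.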

Trouble with Xpath in Google Spreadsheets (ImportXML)

This is a great site, and I've already had a lot of questions answered simply by scrolling and searching through other postings. Unfortunately, I can't seem to track down an answer that specifically helps this problem, and figured I would try posting and looking for help-
I'm using ImportXML and Google Spreadsheets to 'scrape' a few product descriptions from a retail site. It's been working fine for the most part, and I have done it in 2 ways:
1) Specific call to the description part of a post:
=ImportXML(A1,"//div[@class='desc']")
2) Call to the entire 'product Card', which also returns info such as product title, price, time posted, and places these items in adjacent cells in my Google spreadsheet:
=ImportXML(A1,"//div[@class='productCard']")
Both have worked fine, but I've ran into a different problem using each method. If I can resolve even one of these problems, then I'll happily scrap the other method, I just need one of them to work. The problems are:
Method 1) The website prohibits sellers from including contact information in product postings -- when they include an email address anyway, the site automatically blocks it, so that in the posting it simply appears as "...you can reach me at [obscured]" or something like that. The [obscured] appears in different-coloured text and is obviously treated differently somehow. When I scrape these descriptions using Method 1, ImportXML appears to get 'bumped' when it hits the word [obscured], and it passes the remaining text from that product description to the next cell over in my spreadsheet. This ruins the entire organization of the sheet, and I'd like to find a way to get ImportXML to just ignore the [obscured] and still place the entire text of the product description in one cell.
Method 2) My call for the entire 'product Card' is as follows:
=ImportXML(A1,"//div[@class='productCard']")
As mentioned, this works fine (for most products), and I don't mind the additional info (price, date, etc.) being posted in adjacent cells.
However, the website also allows certain products to be 'featured', where they appear in a different colour box on the site, and are therefore more likely to get a buyer's attention.
Using this method, the 'featured' products are not scraped or imported into my spreadsheet, but are simply passed over.
The source code (on actual site) (via 'inspect element' in Safari) for both the description (Method 1) and product card (Method 2) look as follows (for a normal product (a) and a featured product (b)):
(a)
<div id="productSearchResults">
<div class="productCard tracked">
<div>...</div>
<div class="stats">...</div>
<div class="desc collapsed descFull">...</div>
</div>
(b)
<div id="productSearchResults">
<div class="productCard featured tracked">
<div>...</div>
<div class="stats">...</div>
<div class="desc collapsed descFull">...</div>
</div>
You can see in both (a) an (b) the 'desc' class that I call in Method 1, which seems to work fine.
From my reading on this site, I think I've learned that a given class can't have more than one word, and therefore the use of "desc collapsed descFull" and "productCard tracked" and "productCard featured tracked" don't represent classes with 3, 2 and 3 words in the title, but instead cases where multiple classes have been assigned?
Regardless, the call to 'desc' (Method 1) works fine and seems to get all descriptions.
In method 2 therefore, I would have thought that a call to 'productCard' would get the info for all products, both featured and regular, as 'featured' is an extra class assigned to some 'productCard's. If I call all 'productCard's, shouldn't the normal AND featured ones be returned? This is currently not the case. I've tried calling just 'tracked' and just 'featured' as classes, and neither returns anything, so my logic that they are their own class equivalent to 'productCard' may be flawed.
In summary, the 'desc' call in Method 1 works fine, and even gets descriptions for 'featured' products. However, when contact information is included in the description and is displayed as [obscured] it bumps my data into the next cell in the spreadsheet, immediately following the word. This throws off and ruins all organization.
In Method 2, I am not getting the featured products at all, which greatly weakens what I am trying to do. Can either (or both!) of these problems be fixed??
Thanks so so much for any help you can give me.
***UPDATE: As seen in the comments below, use of contains() as suggested improved Method 2 by retrieving both regular and featured products. However, featured product cards have extra text elements, and since the entire card is being scraped in this method, featured products do not match the cell alignment that regular products do. If there is a way to fix Method 1, this would therefore be much better.
As outlined in the comments below, the [obscured] text appears in a 'span' that follows underneath/indented from the
<div class="desc descFull collapsed">
as
<span class="obscureText">[obscured]</span>
Is there any way that I can import the 'desc's as I have been, but tell the XPath to essentially 'ignore' the [obscured] span, or at least deal with it in a way that doesn't make description text immediately after [obscured] appear one cell over?
Thanks so much everyone!
You can wrap your function with the CONCATENATE() function to make sure it all shows up in one cell:
=concatenate(ImportXML(A1,"//div[@class='productCard']"))
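If this ever moves from Sheets into Python, the obscured span can simply be removed before reading the description. A minimal BeautifulSoup sketch, assuming the class names shown above:
from bs4 import BeautifulSoup

html = '''<div class="productCard featured tracked">
  <div class="desc collapsed descFull">you can reach me at
    <span class="obscureText">[obscured]</span> about the item</div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')

for desc in soup.select("div.desc"):
    # drop the obscured span so it cannot split or pollute the description text
    for span in desc.select("span.obscureText"):
        span.decompose()
    print(" ".join(desc.get_text().split()))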

Get tabledata from html, JSOUP

What is the best way to extract data from a table from an url?
In short I need to get the actual data from the these 2 tables at: http://www.oddsportal.com/sure-bets/
In this example the data would be "Paddy power" and "3.50"
See this image:
(Sorry for posting the image like this, but I still need reputation; I will edit later)
http://img837.imageshack.us/img837/3219/odds2.png
I have tried with Jsoup, but I don't know if this is the best way?
And I can't seem to navigate correctly down the tables. I have tried things like this:
tables = doc.getElementsByAttributeValueStarting("class", "center");
link = doc.select("div#col-content > title").first();
String text1 = doc.select("div.odd").text();
The tables call seems to get some data, but doesn't include the text in the table.
Sorry, man. The second field you want to retrieve is filled by JavaScript. Jsoup does not execute JavaScript.
To select title of first row you can use:
Document doc = Jsoup.connect("http://www.oddsportal.com/sure-bets/").get();
Elements tables = doc.select("table.table-main").select("tr:eq(2)").select("td:eq(2)");
System.out.println(tables.select("a").attr("title"));
The selects are chained for readability.