How to get plain text with Xpath - html

Hi, I have this piece of HTML and I want to get the text nodes from it:
<span id="product_description" itemprop="description" class="">
<h1>Toltec Lighting 216-BRZ-508 Leaf Collection Traditional Potrack With Italian Marble Glass In Bronze</h1>
<br class="">
<span style="font-weight: bold;" class="">MANUFACTURE: </span>
Toltec Lighting
<br class=" xh-highlight">
<span style="font-weight: bold;" class="">COLLECTION: </span>
Leaf
<br class=" xh-highlight">
</span>
I want to get a list of values. In this case it would be "Toltec Lighting" and "Leaf".

You can try this:
//span[@id='product_description']/text()
Or, if you also need to make sure no empty (whitespace-only) text nodes are selected:
//span[@id='product_description']/text()[normalize-space()]
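To make it concrete which nodes that expression picks, here is a minimal Python sketch. It assumes only the standard library, whose ElementTree module has no text() step, so the same selection is expressed through .text and .tail. The snippet is a shortened copy of the question's markup with the <br> tags self-closed so the XML parser accepts it, and direct_text_nodes is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's snippet; <br> is self-closed so that the
# stdlib XML parser accepts it (an assumption, real pages are rarely valid XML).
SNIPPET = """<span id="product_description">
<h1>Toltec Lighting 216-BRZ-508 Leaf Collection</h1>
<br/>
<span style="font-weight: bold;">MANUFACTURE: </span>
Toltec Lighting
<br/>
<span style="font-weight: bold;">COLLECTION: </span>
Leaf
<br/>
</span>"""


def direct_text_nodes(element):
    """Return the non-blank text nodes that are direct children of element."""
    # Direct text lives in element.text (before the first child) and in each
    # child's .tail (the text that follows that child's closing tag).
    pieces = [element.text] + [child.tail for child in element]
    # Dropping blank pieces mirrors the [normalize-space()] predicate;
    # strip() additionally trims the surrounding newlines for convenience.
    return [p.strip() for p in pieces if p and p.strip()]


root = ET.fromstring(SNIPPET)
values = direct_text_nodes(root)
print(values)
```

The labels inside the bold spans are not returned, because they are text nodes of the nested spans rather than of the outer one.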

You may try using this:
//*[text()='Toltec Lighting']

Related

Selenium - What strategies for get link?

I have many web elements like this
<a data-control-name="browsemap_profile" href="/in/quyen-nguyen-63098b123/" id="ember278" class="pv-browsemap-section__member ember-view"> <img width="56" src="https://media-exp1.licdn.com/dms/image/C5603AQFHZ41UPexTLQ/profile-displayphoto-shrink_100_100/0?e=1599091200&v=beta&t=lkoiKVK58W1tciUEc5UUohvEsa99lLTv66a1PJ4hp5k" loading="lazy" height="56" alt="member_name" id="ember279" class="lazy-image pv-browsemap-section__member-image EntityPhoto-circle-4 ember-view">
<div class="pv-browsemap-section__member-detail">
<h3 id="ember280" class="pv-browsemap-section__member-detail--has-hover actor-name-with-distance ember-view"> <span class="name-and-icon"><span class="name">Quyen Nguyen</span>
<span class="distance-and-badge">
<span data-test-distance-badge="" id="ember281" class="distance-badge separator ember-view"><span class="visually-hidden">
2nd degree connection
</span>
<span class="dist-value">2nd</span>
</span><!----> </span>
</span>
</h3>
<p class="pv-browsemap-section__member-headline t-14 t-black t-normal">
<div style="line-height:2rem;max-height:4rem;-webkit-line-clamp:2;" id="ember282" class="inline-show-more-text inline-show-more-text--is-collapsed inline-show-more-text--is-collapsed-with-line-clamp ember-view">I'm looking for IT Director/Admissions Director/Training Head/Marketing Director
<!---->
<!----></div>
</p>
</div>
</a>
I want to get a list of values like /in/quyen-nguyen-63098b123/. What ways are there to select these elements and then extract this data?
I also want to get a list of ids matching the pattern ember278, ember279, ember238, etc.
Use
IEnumerable<IWebElement> connectionBlocks = driver.FindElements(By.XPath("//a[@id[starts-with(., 'ember') and string-length() > 5]]"));
Then use a regular expression to parse the results further.
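The regular-expression step could look roughly like the sketch below. It is written in Python rather than C#, assumes the page source is already available as a string, and uses an abbreviated copy of the question's markup. The two patterns are illustrative guesses at what "ember id" and "profile link" mean here.

```python
import re

# Abbreviated copy of the markup from the question.
HTML = '''<a data-control-name="browsemap_profile" href="/in/quyen-nguyen-63098b123/" id="ember278" class="pv-browsemap-section__member ember-view">
<img width="56" alt="member_name" id="ember279" class="lazy-image ember-view">
<h3 id="ember280" class="actor-name-with-distance ember-view"></h3>
</a>'''

# An id attribute whose value is "ember" followed by one or more digits.
EMBER_ID = re.compile(r'\bid="(ember\d+)"')
# An href attribute whose value looks like a profile path, e.g. /in/<slug>/
PROFILE_HREF = re.compile(r'\bhref="(/in/[^"]+)"')

ember_ids = EMBER_ID.findall(HTML)
profile_links = PROFILE_HREF.findall(HTML)

print(ember_ids)
print(profile_links)
```

Matching attributes with a regex is brittle if the attribute quoting or order changes. Reading the id and href attributes from the elements Selenium already returned is usually the safer route.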

accessing specific elements in a class HTML

I would like to get only the name of the product and its seller. I do not want the description or feedback.
<div class="m-l-50 col-md-7 ">
<span class="font-size-15 " style="vertical-align:top"><strong>How to fix hdd</strong></span><br>
<span>Seller: bestbuy</span><br>
<span>Description: This Method will show you how to </span><br>
Feedback:<strong> <span style="color: green;"> 74 </span> : <span style="color: red;">1 </span><br>
MY CODE
def scrape_this_page(page_source):
    page_source=BeautifulSoup(page_source,"html.parser")
    products = page_source.findAll(class_='m-l-50 col-md-7')
    for product in products:
        names.append(product.span[0])
    for product in products:
        sellers.append(product.span[1])
In Selenium, just use a CSS selector, for example: driver.find_element_by_css_selector("div.some_class_name.another_class_name")
And in BeautifulSoup use page_source.select("div.some_class_name.another_class_name")
If you don't have any class name, you have to iterate (for loop) over the elements and check whether the text starts with "Seller", or access them by index (elements[0]), which may be unstable.
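The "iterate and check the text" idea can be sketched as follows. To keep it dependency-free this uses Python's built-in html.parser instead of BeautifulSoup, which is a substitution on my part. SpanText is a hypothetical helper class, and the markup is the question's snippet with the malformed Feedback line left out and a closing </div> added.

```python
from html.parser import HTMLParser

# The question's snippet, minus the Feedback line, with the div closed.
HTML = '''<div class="m-l-50 col-md-7 ">
<span class="font-size-15 " style="vertical-align:top"><strong>How to fix hdd</strong></span><br>
<span>Seller: bestbuy</span><br>
<span>Description: This Method will show you how to </span><br>
</div>'''


class SpanText(HTMLParser):
    """Collect the text content of every outermost <span>, in document order."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <span> elements currently open
        self.texts = []   # one string per outermost <span>

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            if self.depth == 0:
                self.texts.append("")  # start collecting a new span
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:  # only keep text that sits inside a span
            self.texts[-1] += data


parser = SpanText()
parser.feed(HTML)
parser.close()

spans = [t.strip() for t in parser.texts]
name = spans[0]  # positional access: breaks if the layout changes
seller = next(t for t in spans if t.startswith("Seller"))  # text-based access

print(name)
print(seller)
```

The prefix check survives reordering of the spans, while the index-based lookup for the name does not, which is the instability the answer warns about.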

How to add a link in the end of each element (string) from the list? Using Thymeleaf

I am fetching lines of text from the list one by one and I need to add a hyper link in the end of each line. Trying the code below, but link is not displayed.
<p th:each="releases : ${release}"
class="releases" th:text="${releases}" th:href="www.abc.com"> New Releases </p>
<p th:each="releases : ${release}"> <span class="releases" th:text="${releases.split('Spotify')[0]}">
New Releases </span> <a class="spoturl" th:href="${releases.split('URL:\s')[1]}"> Spotify URL </a> </p>
My solution
If you want to add a link to the end of each "release" string, you can use this:
<p th:each="releases : ${release}"
class="releases">
<span th:text="${releases}"></span>
<a th:href="@{www.abc.com/${rel}(rel=${releases})}"
th:text=" '[link]'"></a>
</p>
So, for example, if the items in the release list are Some_Release and Another_Release, you will get this:
Some_Release [link]
Another_Release [link]
Each link text will have a customized href.
Try this
<p th:each="releases : ${release}" th:href="www.abc.com"> <span class="releases" th:text="${releases}"> New Releases </span> </p>

How to get the text inside a span tag which is inside another tag using beautifulsoup?

How do I get the value of all the tags that have class="no-wrap text-right circulating-supply"? What I used was:
text=[ ]
text=(soup.find_all(class_="no-wrap text-right circulating-supply"))
Output of text[0]:
'\n\n17,210,662\nBTC\n'
I just want to extract the numeric value.
Example of one instance:
<td class="no-wrap text-right circulating-supply" data-sort="17210662.0">
<span data-supply="17210662.0">
<span data-supply-container="">
17,210,662
</span>
<span class="hidden-xs">
BTC
</span>
</span>
</td>
Thanks.
If all the elements have a similar HTML structure, try the following to get the required output:
texts = [node.text.strip().split('\n')[0] for node in soup.find_all(class_="no-wrap text-right circulating-supply")]
This might look like overkill, but you could use a regex to extract the numbers:
from bs4 import BeautifulSoup
html = """<td class="no-wrap text-right circulating-supply" data-sort="17210662.0">
<span data-supply="17210662.0">
<span data-supply-container="">
17,210,662
</span>
<span class="hidden-xs">
BTC
</span>
</span>
</td>"""
import re
soup = BeautifulSoup(html,'html.parser')
coin_value = [re.findall(r'(\d+)', node.text.replace(',','')) for node in soup.find_all(class_="no-wrap text-right circulating-supply")]
print(coin_value)
prints
[['17210662']]

Troubleshooting XPath Expression to select nodes based on child node

NB - This question is very similar to the other one I asked - Xpath Expression to select nodes based on presence of child node? - however, I'm trying to extend it, and failing.
I have a HTML page listing products.
I'm trying to use Xpath to distinguish between available and sold-out products.
Available products look like this:
<div class="product-widget-container">
<article itemscope="" itemtype="http://schema.org/Product" class="product grid_4 full space omega large " data-productid="1996364" data-name="Daily Wrinkle Defence Essential Skin Reviver Cream Cleanser - 100ml" data-actual-price="5.99" data-is-available="true" data-low-stock="" data-popularity="6" data-smallimgsrc="https://staging.foo.com.au/site_media/uploads/product_image/2014/1/16/pd1996364_94d4a520-7e4a-11e3-930f-000c29c9a057_image_310x434.JPG" data-largeimgsrc="https://staging.foo.com.au/site_media/uploads/product_image/2014/1/16/pd1996364_94d4a520-7e4a-11e3-930f-000c29c9a057_image_310x434.JPG" data-sizes="[]" data-available-sizes="[]" data-categories="[119977]" data-brand="That Natural Source" data-discount="83" data-default-order="9">
<figure>
<div class="product-img-container ">
<img itemprop="image" class="lazy product-img" src="https://staging.foo.com.au/site_media/uploads/product_image/2014/1/16/pd1996364_94d4a520-7e4a-11e3-930f-000c29c9a057_image_310x434.JPG" data-original="https://staging.foo.com.au/site_media/uploads/product_image/2014/1/16/pd1996364_94d4a520-7e4a-11e3-930f-000c29c9a057_image_310x434.JPG" alt="Up to 85% off Summer Looks Daily Wrinkle Defence Essential Skin Reviver Cream Cleanser - 100ml " style="display: inline;">
<span class="arrow arrow-up"></span>
<div class="quick-buy" style="display: none;">
<span class="arrow-down-trans"></span>
<div class="select-size">
<form class="express-buy" action="/basket/add/1996364/" method="post">
<input type="hidden" id="id_quantity_1996364" class="purchase-quantity" name="quantity" value="1">
<input type="hidden" value="" name="addbasket.x">
<span>
<input class="add-to-basket btn btn-primary btn-large " type="submit" value="ADD TO BASKET">
</span>
</form>
</div>
</div>
</div>
<a itemprop="url" class="overlay-link" href="/event/outlet/up-to-off-summer-looks/1996364-daily-wrinkle-defence-essential-skin-reviver-cream-cleanser-100ml/" title="Daily Wrinkle Defence Essential Skin Reviver Cream Cleanser - 100ml"></a>
<figcaption>
<h2 itemprop="name" class="mason name">
That Natural Source: Daily Wrinkle Defence Essential Skin Reviver Cream Cleanser - 100ml
</h2>
<small itemprop="brand" class="bed"> Up to 85% off Summer Looks</small>
<small class="bed shoes-price">
$5.99
<del>$34.95 RRP</del>
<span class="discount">(83% discount)</span>
</small>
</figcaption>
</figure>
</article>
</div>
Sold-out products look like this:
<div class="product-widget-container">
<article itemscope="" itemtype="http://schema.org/Product" class="product grid_4 full space omega large " data-productid="1996526" data-name="#T58 When Monkeys Fly! - Oz The Great And Powerful Collection By OPI" data-actual-price="10.99" data-is-available="" data-low-stock="true" data-popularity="1" data-smallimgsrc="https://staging.foo.com.au/site_media/uploads/product_image/2014/1/16/pd1996526_d0402efe-7e4a-11e3-930f-000c29c9a057_image_310x434.jpg" data-largeimgsrc="https://staging.foo.com.au/site_media/uploads/product_image/2014/1/16/pd1996526_d0402efe-7e4a-11e3-930f-000c29c9a057_image_310x434.jpg" data-sizes="[]" data-available-sizes="[]" data-categories="[119968]" data-brand="OPI" data-discount="0" data-default-order="39">
<div class="stock-status be_sprites sold-out">Sold Out</div>
<figure>
<div class="product-img-container ">
<img itemprop="image" class="lazy product-img" src="https://staging.foo.com.au/site_media/uploads/product_image/2014/1/16/pd1996526_d0402efe-7e4a-11e3-930f-000c29c9a057_image_310x434.jpg" data-original="https://staging.foo.com.au/site_media/uploads/product_image/2014/1/16/pd1996526_d0402efe-7e4a-11e3-930f-000c29c9a057_image_310x434.jpg" alt="Up to 85% off Summer Looks #T58 When Monkeys Fly! - Oz The Great And Powerful Collection By OPI " style="display: inline;">
<span class="arrow arrow-up"></span>
</div>
<a itemprop="url" class="overlay-link" href="/event/outlet/up-to-off-summer-looks/1996526-t58-when-monkeys-fly-oz-the-great-and-powerful-collection-by-opi/" title="#T58 When Monkeys Fly! - Oz The Great And Powerful Collection By OPI"></a>
<figcaption>
<h2 itemprop="name" class="mason name">
Opi: #T58 When Monkeys Fly! - Oz The Great And Powerful Collection By OPI
</h2>
<small itemprop="brand" class="bed"> Up to 85% off Summer Looks</small>
<small class="bed shoes-price">
$10.99
</small>
</figcaption>
</figure>
</article>
</div>
I was thinking I could key on either the "sold-out" class on the <div>, or the "Sold Out" text within it.
I've tried all of the following, and none of them seem to work - they all give me the full set of products:
//div[@class="product-widget-container" and not(div[@class="stock-status be_sprites sold-out"])]
//div[@class="product-widget-container" and not(div[contains(@class, "sold-out")])]
//div[@class="product-widget-container" and not(div[contains(., "Sold Out")])]
Any thoughts on what I'm doing wrong in my XPath expression?
Cheers,
Victor
Your expressions have the right idea, but you don't need to nest [ ] brackets. Once you open them, you are inside a predicate: everything you write there is part of the condition. So when you want to check an attribute of a child node, you just need to select it: node[child/@attribute].
You also need to check for the div at any depth, since it isn't a direct child of the container. If you write div[div/@class="foo"], you are checking for <div><div class="foo"></div></div>. If you write div[.//div/@class="foo"], you are checking for <div><anything><bar><div class="foo"></div></bar></anything></div>.
Something like
//div[@class="product-widget-container" and not(.//div/@class="stock-status be_sprites sold-out")]
should work !
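The direct-child versus any-depth distinction can be checked with a small Python sketch. It assumes only the standard library, whose ElementTree XPath subset has no not(), so the negation is done in Python. PAGE is a trimmed, well-formed stand-in for the real listing, not the original markup.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the product listing in the question.
PAGE = """<body>
<div class="product-widget-container">
  <article data-productid="1996364">
    <figure><div class="product-img-container "/></figure>
  </article>
</div>
<div class="product-widget-container">
  <article data-productid="1996526">
    <div class="stock-status be_sprites sold-out">Sold Out</div>
    <figure><div class="product-img-container "/></figure>
  </article>
</div>
</body>"""

SOLD_OUT_CLASS = "stock-status be_sprites sold-out"
# "div[...]" only looks at direct children; ".//div[...]" looks at any depth.
AS_CHILD = "div[@class='%s']" % SOLD_OUT_CLASS
AS_DESCENDANT = ".//div[@class='%s']" % SOLD_OUT_CLASS

root = ET.fromstring(PAGE)
containers = root.findall(".//div[@class='product-widget-container']")

# The child-only test never finds the badge, so every container passes it.
passes_child_test = [c for c in containers if c.find(AS_CHILD) is None]
# The descendant test finds the badge inside <article> and filters it out.
available = [c for c in containers if c.find(AS_DESCENDANT) is None]
available_ids = [c.find("article").get("data-productid") for c in available]

print(len(passes_child_test), available_ids)
```

The child-only check lets both containers through, which matches the "full set of products" symptom described in the question.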
try
//div[@class='product-widget-container' and not(@class='stock-status be_sprites sold-out')]
You should remove div[ and ] from the predicate.