I am trying to extract the links to similar apps from this Google Play Store page, using XPath:
https://play.google.com/store/apps/details?id=com.mojang.minecraftpe
The links I want to extract are marked in green in the screenshot.
HTML sample
<div class="details">
<a title="Temple Run" href="/store/apps/details?id=com.imangi.templerun" class="title">Temple Run
<span class="paragraph-end"/>
</a>
<div>....</div>
<div>....</div>
</div>
I used the XPath below in the Chrome console to locate a single link, but it doesn't return the href attribute of the tag. It does work for other attributes (for example, "title").
This XPath doesn't work (extracting "href"):
//*[@id="body-content"]/div/div/div[2]/div[1]//*/a[2]/@href
This XPath works (extracting "title"):
//*[@id="body-content"]/div/div/div[2]/div[1]//*/a[2]/@title
Python code
HTML of individual tiles on the right of the linked page is in the following form * :
<div class="details">
<a title="Temple Run" href="/store/apps/details?id=com.imangi.templerun" class="title">Temple Run
<span class="paragraph-end"/>
</a>
<div>....</div>
<div>....</div>
</div>
It turns out that class="title" uniquely identifies your target <a> elements on that page, so the XPath can be as simple as:
//a[@class="title"]/@href
The problem you noticed seems to be specific to the Chrome XPath evaluator **. Since you mentioned Python, a short Python session shows that the XPath itself works fine:
>>> from urllib2 import urlopen
>>> from lxml import html
>>> req = urlopen('https://play.google.com/store/apps/details?id=com.mojang.minecraftpe')
>>> raw = req.read()
>>> root = html.fromstring(raw)
>>> [h for h in root.xpath("//a[@class='title']/@href")]
['/store/apps/details?id=com.imangi.templerun', '/store/apps/details?id=com.lego.superheroes.dccomicsteamup', '/store/apps/details?id=com.turner.freefurall', '/store/apps/details?id=com.mtvn.Nickelodeon.GameOn', '/store/apps/details?id=com.disney.disneycrossyroad_goo', '/store/apps/details?id=com.rovio.angrybirdsstarwars.ads.iap', '/store/apps/details?id=com.rovio.angrybirdstransformers', '/store/apps/details?id=com.disney.dinostampede_goo', '/store/apps/details?id=com.turner.atskisafari', '/store/apps/details?id=com.moose.shopville', '/store/apps/details?id=com.DisneyDigitalBooks.SevenDMineTrain', '/store/apps/details?id=com.turner.copatoon', '/store/apps/details?id=com.turner.wbb2016', '/store/apps/details?id=com.tov.google.ben10Xenodrome', '/store/apps/details?id=com.turner.ggl.gumballrainbowruckus', '/store/apps/details?id=com.lego.starwars.theyodachronicles', '/store/apps/details?id=com.mojang.scrolls']
*) Stripped-down version. You can take this as an example of how to provide a minimal HTML sample.
**) I can reproduce this problem: @href values are printed as empty strings in my Chrome console. The same problem has happened to others as well: Chrome element inspector Xpath with @href won't show link text
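The hrefs returned above are relative paths. As a minimal sketch, assuming only the standard library is available, the same class-based selection can be tried on the stripped-down sample and the results resolved against the site root. Note that ElementTree supports only a subset of XPath, so the attribute is read with `.get("href")` instead of `/@href`; `extract_title_links`, `SAMPLE`, and `BASE_URL` are names invented for this illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Stripped-down tile markup from the answer above (well-formed, so ElementTree can parse it).
SAMPLE = """<div class="details">
<a title="Temple Run" href="/store/apps/details?id=com.imangi.templerun" class="title">Temple Run
<span class="paragraph-end"/>
</a>
<div>....</div>
<div>....</div>
</div>"""

# Assumption: relative hrefs are resolved against the Play Store origin.
BASE_URL = "https://play.google.com"


def extract_title_links(markup, base_url=BASE_URL):
    """Return absolute URLs for every <a class="title"> in the markup."""
    root = ET.fromstring(markup)
    links = []
    # ElementTree's XPath subset cannot select attributes directly,
    # so select the elements and read href from each one.
    for anchor in root.findall(".//a[@class='title']"):
        href = anchor.get("href")
        if href:
            links.append(urljoin(base_url, href))
    return links


print(extract_title_links(SAMPLE))
```

For real pages, which are rarely well-formed XML, lxml's HTML parser as used in the session above is the more robust choice.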
Related
I am trying to get several pieces of text from a web element, but I am unable to.
The HTML code is as follows:
<span class="versionInfo">
<span class="menu-highight">SoftFEPVis (GUI): </span>
"1.6.4"
</span>
SoftFEPVis (GUI): and 1.6.4 are the texts I would like to extract.
I am able to locate the element and print out its class (menu-highlight), but I am unable to extract SoftFEPVis (GUI): and 1.6.4.
I tried:
Version_Number = Browser.find_element(By.XPATH,'//*[@id="versionDropDown"]/div/span[3]/span').getText()
and got an error:
'WebElement' object has no attribute getText.
Please help.
Instead of using .getText() you could use:
.get_attribute('innerText')
or
.get_attribute('innerHTML')
or
.text
If it helps, here is a more in-depth discussion of the topic:
Given a (python) selenium WebElement can I get the innerText?
getText() is a Selenium Java client method, whereas your code and the error message suggest you are using the Selenium Python client.
Solution
To print the texts SoftFEPVis (GUI): and 1.6.4, use the text attribute with either of the following locator strategies:
Using css_selector and the text attribute:
print(Browser.find_element(By.CSS_SELECTOR, "span.versionInfo").text)
Using xpath and the text attribute:
print(Browser.find_element(By.XPATH, "//span[@class='versionInfo']").text)
Note: You have to add the following import:
from selenium.webdriver.common.by import By
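The options in the answers above can be combined into one small helper. This is only a sketch: it assumes that `.text` comes back empty when the element is not visible, in which case it falls back to the `textContent` property via `get_attribute`. `read_text` and `FakeElement` are hypothetical names; `FakeElement` merely stands in for a Selenium WebElement so the logic can be run without a browser.

```python
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement (demo only)."""

    def __init__(self, text, attributes):
        self.text = text
        self._attributes = attributes

    def get_attribute(self, name):
        return self._attributes.get(name)


def read_text(element):
    """Prefer the rendered text; fall back to textContent when it is empty."""
    visible = element.text
    if visible and visible.strip():
        raw = visible
    else:
        # textContent also covers elements that are hidden on the page.
        raw = element.get_attribute("textContent") or ""
    # Collapse line breaks and runs of spaces into single spaces.
    return " ".join(raw.split())


shown = FakeElement("SoftFEPVis (GUI): 1.6.4", {})
hidden = FakeElement("", {"textContent": "\nSoftFEPVis (GUI): \n1.6.4\n"})
print(read_text(shown))
print(read_text(hidden))
```

With a real driver, the same helper would be called as `read_text(Browser.find_element(By.CSS_SELECTOR, "span.versionInfo"))`.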
Using Chrome and XPath in Python 3, I am trying to extract the value of an "href" attribute on this web page. The "href" attribute contains the link to the movie trailer ("bande-annonce" in French) that I am interested in.
First thing, using xpath, it appears that the "a" tag is a "span" tag. In fact, using this code:
import urllib.request
from lxml import etree
response_main = urllib.request.urlopen("http://www.allocine.fr/film/fichefilm_gen_cfilm=231874.html")
htmlparser = etree.HTMLParser()
tree_main = etree.parse(response_main, htmlparser)
tree_main.xpath('//*[@id="content-start"]/article/section[3]/div[2]/div/div/div/div[1]/*')
I get this result:
[<Element span at 0x111f70c08>]
So the "div" tag contains no "a" tag, just a "span" tag. I've read that the HTML shown in a browser doesn't always reflect the "real" HTML sent by the server. So I tried this command to extract the href:
response_main=urllib.request.urlopen("http://www.allocine.fr/film/fichefilm_gen_cfilm=231874.html")
htmlparser = etree.HTMLParser()
tree_main = etree.parse(response_main, htmlparser)
tree_main.xpath('//*[@id="content-start"]/article/section[3]/div[2]/div/div/div/div[1]/span/@href')
Unfortunately, this returns nothing... And when I check the attributes within the "span" tag with this command:
tree_main.xpath('//*[@id="content-start"]/article/section[3]/div[2]/div/div/div/div[1]/span/@*')
I got the value of the "class" attribute, but nothing about "href"... :
['ACrL3ZACrpZGVvL3BsYXllcl9nZW5fY21lZGlhPTE5NTYwMDcyJmNmaWxtPTIzMTg3NC5odG1s meta-title-link']
I'd like some help understanding what's happening here. Why is the "a" tag a "span" tag? And, most importantly, how can I extract the value of the "href" attribute?
Thanks a lot for your help!
The required link is generated dynamically with JavaScript. With urllib.request you get only the initial HTML page source, whereas you need the HTML as it stands after all the JavaScript has been executed.
You can use selenium + chromedriver to get dynamically generated content:
from selenium import webdriver as web
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait
driver = web.Chrome("/path/to/chromedriver")
driver.get("http://www.allocine.fr/film/fichefilm_gen_cfilm=231874.html")
link = wait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//div[@class='meta-title']/a[@class='xXx meta-title-link']")))
print(link.get_attribute('href'))
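To check whether a link is present in the server-sent HTML at all, it can help to inspect which tag actually carries a given class before reaching for a browser. The sketch below uses only the standard library. Both markup strings are hypothetical stand-ins (the class token and the href value are invented), mimicking the situation described in the question: a `<span>` without `href` in the raw source and an `<a>` with `href` after JavaScript has run. `ClassFinder` and `find_by_class` are names made up for this example.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Record (tag name, has href?) for every start tag carrying a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # Attribute values can be None for valueless attributes.
        classes = (attr_map.get("class") or "").split()
        if self.wanted_class in classes:
            self.matches.append((tag, "href" in attr_map))


def find_by_class(markup, wanted_class):
    parser = ClassFinder(wanted_class)
    parser.feed(markup)
    parser.close()
    return parser.matches


# Hypothetical stand-in for the raw page source returned by urllib.request.
RAW_HTML = '<div class="meta-title"><span class="placeholder-token meta-title-link">Bande-annonce</span></div>'
# Hypothetical stand-in for the DOM after JavaScript has rewritten the element.
RENDERED_HTML = '<div class="meta-title"><a class="xXx meta-title-link" href="/hypothetical/trailer.html">Bande-annonce</a></div>'

print(find_by_class(RAW_HTML, "meta-title-link"))
print(find_by_class(RENDERED_HTML, "meta-title-link"))
```

If the raw source reports a `span` with no `href`, the link is being built client-side, and a browser-driven approach such as the Selenium code above is needed.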
I am trying to get the value of the title attribute for the following HTML code:
<span class='overlay' title id='ab12'></span>
This code is written for a tooltip. When I view the source code for this HTML page, I see the following:
<span class='overlay' title="Test Tooltip"></span>
So basically id='ab12' in HTML code denotes Test Tooltip.
Could you tell me how I can get this text value (Test Tooltip) using Selenium WebDriver?
Your question is somewhat confusing: I don't follow what you are saying about id='ab12', but in the HTML you provided, class='overlay' is fixed.
Assuming you're using Java, try By.className() to locate the <span> element, then getAttribute("title") to get the tooltip text, as below:
WebElement el = driver.findElement(By.className("overlay"));
String tooltip = el.getAttribute("title");
I am able to successfully test the content in a website where the content does not have any html element formatting such as <b>, <i>, <sup>, etc. This is easy. I just use String.equals("expectedContent"). However, when there is an html element involved in the middle such as <br> or <p>, the test fails because that is not included in the unformatted expected content. Is there a way for Selenium to ignore those html elements so I can compare apples to apples?
here is the sample html:
<p><strong>Paragraph-a.</strong></p>
<div>
<p>paragraph-b.</p><p>paragraph-c.</p>
</div>
my test content is: Paragraph-a. paragraph-b. paragraph-c.
Thanks in advance for your help.
The following results are based off the HTML in the question, slightly modified to include a <br> tag in the first paragraph.
<html><body>
<p><strong>Para<br>graph-a.</strong></p>
<div>
<p>paragraph-b.</p><p>paragraph-c.</p>
</div>
</body></html>
The Python 2.7.6 code I'm using is as follows:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("file:///C:/testing/test.html")
element = browser.find_element_by_xpath("/html/body")
print element.text
browser.close()
The simple XPath /html/body retrieves the text of the elements without any of the tags:
Para
graph-a.
paragraph-b.
paragraph-c.
I can drill down to the contents of the first paragraph using /html/body/p/strong.
Para
graph-a.
Can you tell what I think the problem is yet? Tags disappear in the sense that it's not outputting the <strong>, but the <br> tag translates into a newline. Let's add a few lines of code to the Python script, just before the browser close:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("file:///C:/testing/test.html")
element = browser.find_element_by_xpath("/html/body/p/strong")
text = element.text  # keep the rendered text for the comparisons below
print text
print text == "Paragraph-a."
print text == "Para<br>graph-a."
print text == "Para\ngraph-a."
browser.close()
This script outputs the following:
Para
graph-a.
False
False
True
The conclusion is that while we can ignore most HTML tags, we need to be careful when comparing against elements that include line breaks.
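One way to make such comparisons less brittle is to normalize whitespace on both sides before comparing. The sketch below assumes the rendered text of the original three-paragraph sample comes back with one newline between paragraphs, as in the output above; `normalize_whitespace` is a helper name invented for this example.

```python
def normalize_whitespace(text):
    """Collapse newlines, tabs, and repeated spaces into single spaces."""
    return " ".join(text.split())


# Assumed shape of element.text for the question's original HTML.
rendered = "Paragraph-a.\nparagraph-b.\nparagraph-c."
expected = "Paragraph-a. paragraph-b. paragraph-c."

print(rendered == expected)
print(normalize_whitespace(rendered) == normalize_whitespace(expected))

# A <br> in the middle of a word still leaves a visible gap after normalizing.
print(normalize_whitespace("Para\ngraph-a.") == "Paragraph-a.")
```

The last comparison is the caveat: normalization handles breaks between words and paragraphs, but a line break inside a word still produces a mismatch, which is the case the answer warns about.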
Please try the script below:
int no_of_paragraphs = driver.findElements(By.tagName("p")).size();
for(int i=1;i<=no_of_paragraphs;i++)
{
System.out.print(driver.findElement(By.cssSelector("p:nth-of-type("+i+")")).getText() + "\t");
}
I have some HTML like this:
<h4 class="box_header clearfix">
<span>
<a rel="dialog" href="http://www.google.com/?q=word">Search</a>
</span>
<small>
<span>
<a rel="dialog" href="http://www.google.com/?q=word">Search</a>
</span>
</h4>
I am trying to get the href here in Java using Selenium. I have tried the following:
selenium.getText("xpath=/descendant::h4[@class='box_header clearfix']/");
selenium.getAttribute("xpath=/descendant::h4[@class='box_header clearfix']/");
But neither of these works. It keeps complaining that my XPath is invalid. Can someone tell me what mistake I am making?
You should use getAttribute to get the href of the link. Your XPath needs a reference to the final node, plus the required attribute. The following should work:
selenium.getAttribute("xpath=/descendant::h4[@class='box_header clearfix']/a@href");
You could also modify your XPath so that it's a bit more flexible to change, or even use CSS to locate the element:
//modified xpath
selenium.getAttribute("//h4[contains(@class,'box_header')]/a@href");
//css locator
selenium.getAttribute("css=.box_header a@href");
I had similar problems with Selenium and XPath in the past and couldn't really resolve them (other than by changing the expression). Just to be sure, I suggest trying your XPath expressions with the XPath Checker add-on for Firefox.