How to access various parts of a link with XPath - html

I'm fairly new to XPath and wanted to see how granular you can get when accessing various HTML components.
I'm currently using this XPath
//*[@id="resultsDiv"]/p[1]/a
to access the HTML (abbreviated) below:
<p style="margin:0;border-width:0px;">Bill%20Jones</p>
The XPath returns this: Bill%20Jones
But what I'm trying to get is simply the PersonID = 140476.
Question: Is it possible to write an XPath that results in 140476, or do I need to take what was returned and use a regular expression or some other method to extract the PersonID?

If this XPath,
//*[@id="resultsDiv"]/p[1]/a
selects this a element,
Bill%20Jones
then this XPath,
substring-after(//*[@id='resultsDiv']/p[1]/a/@href, 'PersonID=')
will return 140476 alone, as requested.
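As a rough illustration, the same extraction can be sketched in Python using only the standard library. xml.etree.ElementTree supports just a subset of XPath and has no substring-after(), so str.partition stands in for it here. The href value below is made up, since the question did not show the link's actual markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the real href was not shown in the question.
html = """
<body>
  <div id="resultsDiv">
    <p style="margin:0;border-width:0px;">
      <a href="person.html?PersonID=140476">Bill%20Jones</a>
    </p>
  </div>
</body>
"""

root = ET.fromstring(html)

# Same location path as in the answer, written relative to the root element.
link = root.find(".//*[@id='resultsDiv']/p[1]/a")

# str.partition returns (before, separator, after); the last item plays the
# role of substring-after() and is empty when the separator is missing.
person_id = link.get("href").partition("PersonID=")[2]
print(person_id)
```

With a full XPath 1.0 engine, the substring-after() expression from the answer can be evaluated directly and no string handling in the host language is needed.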

Related

How to select all elements with a specific name under every li node with the same structure?

I have a certain bunch of XPath locators that hold the elements I want to extract, and they have a similar structure:
/div/ul/li[1]/div/div[2]/a
/div/ul/li[2]/div/div[2]/a
/div/ul/li[3]/div/div[2]/a
...
They are simplified from a Pixiv user page. Each /div/div[2]/a element has a title string, so these are artwork titles.
I want to use a single expression to fetch all of the above a elements in a WebExtension called PageProbe. I've tried several approaches, but none of them returns the wanted result.
However, the following expression does return all the a elements, including the ones I don't need.
/div/
The following expression returns the a element under only the first li item.
/div/ul/li/div/div[2]/a
Sorry for not providing enough info earlier. Hope someone can help me out. Thanks.
According to the information you gave here you can simply use this xpath:
/div/ul/li/div/div[2]/a
However, I'm fairly sure a better locator exists, based on other attributes such as class names.
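A small sketch of why leaving the index off the li step should match every item, using Python's standard xml.etree.ElementTree (which implements only a subset of XPath). The markup and the titles are invented for illustration; the real Pixiv page is far more complex, and how PageProbe evaluates expressions is not known here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified markup; the titles are invented.
page = """
<div>
  <ul>
    <li><div><div>thumb</div><div><a>Title A</a></div></div></li>
    <li><div><div>thumb</div><div><a>Title B</a></div></div></li>
    <li><div><div>thumb</div><div><a>Title C</a></div></div></li>
  </ul>
</div>
"""

root = ET.fromstring(page)  # root is the outer <div>

# No index on "li": the step matches every li, not just the first one.
all_links = root.findall("./ul/li/div/div[2]/a")
titles = [a.text for a in all_links]

# With an index, only that one li is considered.
first_only = [a.text for a in root.findall("./ul/li[1]/div/div[2]/a")]
print(titles, first_only)
```

If a tool still shows a single result for the unindexed path, one possible cause is that it asks the engine for the first matching node only; that is a guess, not something the question confirms.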

XPath containing 2 or more "OR" conditions not working?

There is an anchor tag whose text can be changed by the user. I want to write an XPath query that searches for multiple link text values in a single statement.
Signin
The link text can be changed per user profile, e.g. "Login", "Log-in", "Click here to login" or "Login now".
If I write this XPath:
//a[contains(text(),'Login') or contains(text(),'Log-in') or contains(text(),'Login now') or contains(text(),'click here to Login')]
Then it fails. I need to click this element using Selenium.
Please help.
Important notes:
Only use contains() when you need substring testing. See What does contains() do in XPath?
Understand string-values: See Testing text() nodes vs string values in XPath
Beware of whitespace variations. See What is the purpose of normalize-space()?
Your posted markup has Signin, but your XPath does not.
Mind case sensitivity: click here to Login is not the same as Click here to Login.
XPath 1.0
If you're certain there are no whitespace variations:
//a[.='Login' or .='Log-in' or .='Click here to login' or .='Login now']
Otherwise:
//a[ normalize-space()='Login'
or normalize-space()='Log-in'
or normalize-space()='Click here to login'
or normalize-space()='Login now']
XPath 2.0
//a[normalize-space()=('Login','Log-in','Click here to login','Login now')]
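When the list of possible link texts changes often, the XPath 1.0 form above can be generated instead of typed by hand. The helper below is a hypothetical sketch (its name is not from any library) and assumes that no label contains a single quote.

```python
def xpath_for_link_texts(labels):
    """Build an XPath 1.0 expression matching <a> elements by exact,
    whitespace-normalized text. Assumes no label contains a single quote."""
    for label in labels:
        if "'" in label:
            raise ValueError(f"label contains a single quote: {label!r}")
    tests = [f"normalize-space()='{label}'" for label in labels]
    return "//a[" + " or ".join(tests) + "]"


labels = ["Login", "Log-in", "Click here to login", "Login now"]
expression = xpath_for_link_texts(labels)
print(expression)
```

The resulting string can then be handed to whatever locates the element, for example Selenium's XPath locator.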
XPath 2.0 is based on sequences, so you can use a comma-separated list, i.e. a sequence, to imitate logical OR conditions.
XPath
//a[text()=("Login","Log-in","Click here to login")]

How do I get rid of the tags in XPath

I have a bunch of html files with tons of data in it and I want to extract the important parts of it.
The files are all very similar; I've to search for a <tr> which contains a certain keyword. The third column of this table row always contains the name of the "block" I'm searching for (it's a few table rows).
//body/table/tbody/tr[td = "Deployed to"]/td[3]/div//span[text()]
With this XPath query I get the names (maybe one, maybe more).
The problem is, how do I get rid of the tags around the data?
Right now my output is something like this:
<span class="log_entry_text">Name1</span><span class="log_entry_text">Name2</span><span class="log_entry_text">Name3</span>
I want to have something like that: Name1 Name2 Name3
So I can use it for extracting these blocks more easily.
With string() I can only extract the first element (the result would be: Name1).
Thanks for helping me!
Just wrap your XPath in the data() function (XPath 2.0+), e.g. data(//body/table/tbody/tr[td = "Deployed to"]/td[3]/div//span[text()]), to retrieve the text.
Your XPath expression asks to retrieve span elements and that's what it has returned. If you're seeing tags with angle brackets in the output, that's because of the way the XPath result is being processed and rendered by the receiving application.
If you're in XPath 2.0+ or XQuery 1.0+ you can combine the several span elements into a single string using
string-join(//path/span, ' ')
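Where only XPath 1.0 is available, the joining has to happen in the host language. A minimal sketch with Python's standard xml.etree.ElementTree follows; the table around the spans is invented, and only the span markup comes from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical table; only the span markup comes from the question.
html = """
<body><table><tbody>
  <tr>
    <td>Deployed to</td>
    <td>ignored</td>
    <td><div>
      <span class="log_entry_text">Name1</span>
      <span class="log_entry_text">Name2</span>
      <span class="log_entry_text">Name3</span>
    </div></td>
  </tr>
</tbody></table></body>
"""

root = ET.fromstring(html)  # root is <body>

# Select the span elements, then take each one's text content.
spans = root.findall("./table/tbody/tr[td='Deployed to']/td[3]/div//span")

# Python stand-in for string-join(..., ' ').
names = " ".join("".join(span.itertext()).strip() for span in spans)
print(names)
```

The XPath returns elements in both cases; turning them into plain text is a separate step, whether done by string-join() or by the calling code.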

Web scraping without id VBA

I'm trying to scrape a web page. Some elements were easy to get, but I have a problem with those that have no id, like this one.
<TABLE class=DisplayMain1 cellSpacing=1 cellPadding=0><TBODY>
<TR class=TitleLabelBig1>
<TD class=Title1 colSpan=100><SPAN style="FONT-FAMILY: arial narrow; FONT-WEIGHT: normal">Tool & </SPAN><BR>PE311934-1-1 </TD></TR></TBODY></TABLE>
I want this ---► PE311934-1-1
I tried document.getElementsByClassName, but VBA gave me an error.
Any tips?
Use Regular Expressions and the XMLHttpRequest object in VBA
I made an add-in some time ago that does just that:
http://www.analystcave.com/excel-tools/excel-scrape-html-add/
If you just want the source code then here (GetElementByRegex function):
http://www.analystcave.com/excel-scrape-html-element-id/
Now the actual regex will be quite simple:
</SPAN><BR>(.*?)</TD></TR></TBODY></TABLE>
If it captures too many items, simply expand the regex.
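To see what that pattern captures, here is a quick check in Python rather than VBA, run against the fragment from the question. It only exercises the regex itself and does not reproduce the add-in or the XMLHttpRequest part.

```python
import re

# HTML fragment copied from the question.
html = """<TABLE class=DisplayMain1 cellSpacing=1 cellPadding=0><TBODY>
<TR class=TitleLabelBig1>
<TD class=Title1 colSpan=100><SPAN style="FONT-FAMILY: arial narrow; FONT-WEIGHT: normal">Tool & </SPAN><BR>PE311934-1-1 </TD></TR></TBODY></TABLE>"""

pattern = re.compile(r"</SPAN><BR>(.*?)</TD></TR></TBODY></TABLE>")

match = pattern.search(html)
# The captured group keeps the trailing space from the source, so strip it.
part_number = match.group(1).strip() if match else None
print(part_number)
```

The pattern depends on the exact tag sequence and letter case, so a small change in the page markup would make it stop matching.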
You don't specify the error and there is not enough HTML to know how many elements there are on the page.
You may have forgotten to use an index with document.getElementsByClassName("Title1"), as it returns a collection
For example, the first item would be: document.getElementsByClassName("Title1")(0)
In the same way, you could use a CSS querySelector such as .Title1
Which says the same thing i.e. select the elements with ClassName "Title1".
For the first instance simply use:
document.querySelector(".Title1")
For a NodeList of all matching elements, use
document.querySelectorAll(".Title1")
and then iterate over its length.
You would access the .innerText property of the element, generally, to retrieve the required string.
For the snippet shown, assuming the item is the first .Title1 on the page the CSS selector retrieves the following from your HTML
The resultant string can then be processed for what you want. This method, and regex, are fragile at best considering how easily an updated source page can break these methods.
In your above example, you can use the class name, .Title1, and then use Replace() to remove the Tool & .
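The class-name approach can be mimicked outside the browser as well. The sketch below uses Python's standard html.parser as a stand-in for the DOM calls; the FirstByClass class is a hypothetical helper, it only looks at td elements, and it does not handle nested tables.

```python
from html.parser import HTMLParser


class FirstByClass(HTMLParser):
    """Collect the text of the first <td> carrying a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.inside = False
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and not self.done and self.class_name in classes:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "td" and self.inside:
            self.inside = False
            self.done = True

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


# HTML fragment copied from the question.
html = """<TABLE class=DisplayMain1 cellSpacing=1 cellPadding=0><TBODY>
<TR class=TitleLabelBig1>
<TD class=Title1 colSpan=100><SPAN style="FONT-FAMILY: arial narrow; FONT-WEIGHT: normal">Tool & </SPAN><BR>PE311934-1-1 </TD></TR></TBODY></TABLE>"""

parser = FirstByClass("Title1")
parser.feed(html)
parser.close()

# Rough equivalent of reading .innerText and then calling Replace().
inner_text = "".join(parser.chunks)
part_number = inner_text.replace("Tool &", "").strip()
print(part_number)
```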

Having trouble selecting some specific xpath... (html table, scrapy, xpath)

I'm trying to scrape data (using scrapy) from tables that can be found here:
http://www.bettingtools.co.uk/tipster-table/tipsters
My spider functions when I parse response within the following xpath:
//*[@id="imagetable"]/tbody/tr
Every table on the page shares that id, so I'm basically grabbing all the table data.
However, I only want the table data for the current month (tables in the right column).
When I try and be more specific with my xpath, I get an invalid xpath error even though it seems to be correct. I've tried:
- //*[@id="content"]/[contains(@class, "column2")]/[contains(@class, "table3")]/[@id="imagetable"]/tbody/tr
- //*[@id="content"]/div[contains(@class, "column2")]/div[contains(@class, "table3")]/[@id="imagetable"]/tbody/tr
- //*[@id="content"]/div[2]/div[1]/[@id="imagetable"]/tbody/tr
Also, when I try to select the XPath of a specific table on the page with Chrome I just get //*[@id="imagetable"].
Am I missing something obvious here? Why are the 3 above xpath examples I've tried not valid?
Thanks
What makes those three XPath expressions invalid is the part with this pattern:
/[predicate expression here]
That step has no node test, so there is nothing for the predicate to apply to. It should look like this instead:
/*[predicate expression here]
Here are some examples of valid steps:
1. /table[@id="imagetable"]
2. /div[contains(@class, "column2")]
3. /*[contains(@class, "table3")]
For this specific task, you can try the following XPath, which selects rows from the table inside <div class="column2">:
//div[@class='column2']//table[@id="imagetable"]/tbody/tr
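A minimal sketch of that last expression in action, using Python's standard xml.etree.ElementTree (a subset of XPath, so this is not how Scrapy evaluates it). The page layout and cell texts are invented, since the real page markup was not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page layout; the real page was not shown.
page = """
<body>
  <div id="content">
    <div class="column1">
      <table id="imagetable"><tbody><tr><td>last month</td></tr></tbody></table>
    </div>
    <div class="column2">
      <table id="imagetable"><tbody>
        <tr><td>tipster A</td></tr>
        <tr><td>tipster B</td></tr>
      </tbody></table>
    </div>
  </div>
</body>
"""

root = ET.fromstring(page)

# Every predicate is attached to a node test (div, table), never to a bare "/".
rows = root.findall(".//div[@class='column2']//table[@id='imagetable']/tbody/tr")
cells = [row.find("td").text for row in rows]
print(cells)
```

Note that @class='column2' is an exact match on the whole attribute value; if the real div carries several classes, contains(@class, "column2") from the question is the safer test in a full XPath engine.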
See my answer to "Selenium automation- finding best xpath". In short: check the XPath in the browser. The browser can give you a unique locator, which you should then verify.