Having trouble selecting some specific xpath... (html table, scrapy, xpath)

I'm trying to scrape data (using scrapy) from tables that can be found here:
http://www.bettingtools.co.uk/tipster-table/tipsters
My spider works when I parse the response with the following XPath:
//*[@id="imagetable"]/tbody/tr
Every table on the page shares that id, so I'm basically grabbing all the table data.
However, I only want the table data for the current month (tables in the right column).
When I try and be more specific with my xpath, I get an invalid xpath error even though it seems to be correct. I've tried:
- //*[@id="content"]/[contains(@class, "column2")]/[contains(@class, "table3")]/[@id="imagetable"]/tbody/tr
- //*[@id="content"]/div[contains(@class, "column2")]/div[contains(@class, "table3")]/[@id="imagetable"]/tbody/tr
- //*[@id="content"]/div[2]/div[1]/[@id="imagetable"]/tbody/tr
Also, when I try to select the XPath of a specific table on the page with Chrome I just get //*[@id="imagetable"].
Am I missing something obvious here? Why are the 3 above xpath examples I've tried not valid?
Thanks

What makes those 3 XPath expressions invalid is the part with this pattern:
/[predicate expression here]
A predicate has to be applied to a node test, and the pattern above has none between the slash and the bracket. It should rather look like this:
/*[predicate expression here]
Here are some examples of valid ones:
1. /table[@id="imagetable"]
2. /div[contains(@class, "column2")]
3. /*[contains(@class, "table3")]
For this specific task, you can try the following XPath, which selects rows from the table inside <div class="column2">:
//div[@class='column2']//table[@id="imagetable"]/tbody/tr
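As a rough illustration of the expression above, here is a minimal sketch using Python's standard-library xml.etree.ElementTree, which implements only a subset of XPath (relative paths and simple attribute predicates). The markup is a hypothetical, simplified stand-in for the real page, and in a Scrapy spider the same expression would normally be passed to response.xpath instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: two columns, each holding a table
# with the same id, as described in the question.
PAGE = """<div id="content">
  <div class="column1">
    <table id="imagetable"><tbody>
      <tr><td>last month A</td></tr>
    </tbody></table>
  </div>
  <div class="column2">
    <table id="imagetable"><tbody>
      <tr><td>this month A</td></tr>
      <tr><td>this month B</td></tr>
    </tbody></table>
  </div>
</div>"""

root = ET.fromstring(PAGE)

# ElementTree paths are relative, hence the leading "." before "//".
rows = root.findall(".//div[@class='column2']//table[@id='imagetable']/tbody/tr")

# Only the rows from the right-hand column are returned.
current_month = [row.findtext("td") for row in rows]
print(current_month)
```

Note that every step has a node test (div, table, tbody, tr) before its predicate, which is what the three failing attempts were missing.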

Check my answer to "Selenium automation- finding best xpath". In short: get the locator from the browser, which can give you a unique one, then verify it.

Related

How to select all elements with a specific name under every li node with the same structure?

I have a certain bunch of XPath locators that hold the elements I want to extract, and they have a similar structure:
/div/ul/li[1]/div/div[2]/a
/div/ul/li[2]/div/div[2]/a
/div/ul/li[3]/div/div[2]/a
...
They are actually simplified from Pixiv user page. Each /div/div[2]/a element has a title string, so they are actually artwork titles.
I want to use a single expression to fetch all the above a elements in a WebExtension called PageProbe. Although I've tried a bunch of methods, none of them returns the wanted result.
However, the following expression does return all the a elements, including the ones I don't need.
/div/
The following expression returns the a element under only the first li item.
/div/ul/li/div/div[2]/a
Sorry for not providing enough info earlier. Hope someone can help me out. Thanks.
According to the information you gave here, you can simply use this XPath:
/div/ul/li/div/div[2]/a
However, I'm quite sure there is a better locator based on other attributes such as class names.
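To show that a step without a positional index matches every li rather than only the first, here is a small sketch with Python's standard-library xml.etree.ElementTree. The markup and titles are invented for illustration and are not taken from the actual Pixiv page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup mirroring the /div/ul/li[n]/div/div[2]/a structure.
DOC = """<div>
  <ul>
    <li><div><div>thumb</div><div><a>Title one</a></div></div></li>
    <li><div><div>thumb</div><div><a>Title two</a></div></div></li>
    <li><div><div>thumb</div><div><a>Title three</a></div></div></li>
  </ul>
</div>"""

root = ET.fromstring(DOC)  # root is the outer <div>

# "li" without an index selects all li children of ul;
# "div[2]" keeps only the second div inside each li's wrapper div.
links = root.findall("./ul/li/div/div[2]/a")
titles = [a.text for a in links]
print(titles)
```

If a tool returns only the first match for such an expression, the cause is more likely how the tool reports results than the expression itself.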

How do I get rid of the tags in XPath

I have a bunch of html files with tons of data in it and I want to extract the important parts of it.
The files are all very similar; I have to search for a <tr> which contains a certain keyword. The third column of this table row always contains the name of the "block" I'm searching for (it's a few table rows).
//body/table/tbody/tr[td = "Deployed to"]/td[3]/div//span[text()]
with this XPath query I get the names (maybe one, maybe more)
The problem is, how do I get rid of the tags around the data?
Right now my output is something like this:
<span class="log_entry_text">Name1</span><span class="log_entry_text">Name2</span><span class="log_entry_text">Name3</span>
I want to have something like that: Name1 Name2 Name3
So I can use it for extracting these blocks more easily.
With string() I can only extract the first element (result would be: Name1)
Thanks for helping me!
Just wrap your XPath in the data() function, like data(//body/table/tbody/tr[td = "Deployed to"]/td[3]/div//span[text()]), to retrieve the text. Note that data() requires XPath 2.0 or later.
Your XPath expression asks to retrieve span elements and that's what it has returned. If you're seeing tags with angle brackets in the output, that's because of the way the XPath result is being processed and rendered by the receiving application.
If you're in XPath 2.0+ or XQuery 1.0+ you can combine the several span elements into a single string using
string-join(//path/span, ' ')
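Where only XPath 1.0 is available, the joining can be done in the host language instead. The following is a minimal sketch, assuming Python with the standard-library xml.etree.ElementTree and a made-up table shaped like the one described in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical log table; only the first row has the keyword "Deployed to".
DOC = """<body><table><tbody>
<tr><td>Deployed to</td><td>x</td><td><div><p>
  <span class="log_entry_text">Name1</span>
  <span class="log_entry_text">Name2</span>
  <span class="log_entry_text">Name3</span>
</p></div></td></tr>
<tr><td>Other</td><td>x</td><td><div>
  <span class="log_entry_text">Ignored</span>
</div></td></tr>
</tbody></table></body>"""

root = ET.fromstring(DOC)  # root is <body>

# Select the span elements, then read their text in Python
# rather than serializing the elements themselves.
spans = root.findall("./table/tbody/tr[td='Deployed to']/td[3]/div//span")

# Python-side counterpart of string-join(..., ' '); empty spans are skipped.
names = " ".join(span.text for span in spans if span.text)
print(names)
```

The tags disappear because the code reads each element's text instead of printing the element.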

Why does my XPath not select based on text()?

I have a page in firefox (no frame) which contains the following part of html code:
...
<div class="col-sm-6 align-right">
<a href="/efelg/download_zip" class="alert-link">
Download all results in .zip format
</a>
</div>
...
which I want to select with a selenium XPATH expression. In order to test my XPATH expression, I installed an add-on for firefox called 'TryXpath' in order to check my expression. However, the expression seems to be incorrect, as no element is selected. Here is the expression:
//a[text()= "Download all results in .zip format"]
But what is wrong with that expression? I found it in different SO answers, but for me it does not seem to work. Why do I get 0 hits? Why does the expression fail to find the HTML element I posted above (no frame, element is visible and clickable...)?
You can try this:
//a[contains(text(),'Download all results in .zip format')]
It works on my side. Please try it and let me know.
The reason your XPath isn't selecting the shown a element is the leading and trailing whitespace surrounding your targeted text. While you could use contains() as the currently upvoted and selected answer does, be aware that it could also match when the targeted string is a substring of what's found in the HTML in an a element; this may or may not be desirable.
Consider instead using normalize-space() and testing via equality:
//a[normalize-space()='Download all results in .zip format']
This will check that the (space-normalized) string value of a equals the given text.
See also
Testing text() nodes vs string values in XPath
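The effect of the surrounding whitespace can be reproduced outside the browser. Below is a small sketch in Python using only the standard library. The normalize_space helper is a hypothetical approximation of the XPath function (Python's str.split treats a few more characters as whitespace than XPath does), not part of any library.

```python
import xml.etree.ElementTree as ET

# The markup from the question, including the line breaks inside <a>.
SNIPPET = """<div class="col-sm-6 align-right">
<a href="/efelg/download_zip" class="alert-link">
Download all results in .zip format
</a>
</div>"""

TARGET = "Download all results in .zip format"


def normalize_space(element):
    """Approximate XPath normalize-space() on an element's string value."""
    # Join all descendant text, then collapse whitespace runs and trim.
    return " ".join("".join(element.itertext()).split())


link = ET.fromstring(SNIPPET).find("a")

# The raw text node starts and ends with a newline, so equality fails.
raw_matches = link.text == TARGET

# After normalization the comparison succeeds.
normalized_matches = normalize_space(link) == TARGET

print(raw_matches, normalized_matches)
```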

How to access various parts of a link with XPath

I'm fairly new to XPath and wanted to see how granular you can get when accessing various HTML components.
I'm currently using this XPath
//*[@id=\"resultsDiv\"]/p[1]/a
to access the HTML (abbreviated) below:
<p style="margin:0;border-width:0px;">Bill%20Jones</p>
The XPath returns this: Bill%20Jones
But what I'm trying to get is simply the PersonID = 140476.
Question: Is it possible to write an XPath that results in 140476, or do I need to take what was returned and use a regular expression or other method to access the PersonID?
If this XPath,
//*[@id=\"resultsDiv\"]/p[1]/a
selects this a element,
Bill%20Jones
then this XPath,
substring-after(//*[@id='resultsDiv']/p[1]/a/@href, 'PersonID=')
will return 140476 alone, as requested.
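If the extraction has to happen after the attribute value has already been fetched, the same logic is easy to mirror in the host language. Here is a minimal Python sketch; the substring_after helper and the href value are hypothetical, since the real link is not shown in the question.

```python
def substring_after(value: str, marker: str) -> str:
    """Mimic XPath substring-after(): text following the first marker.

    Returns an empty string when the marker does not occur, and the
    whole value when the marker is empty, as the XPath function does.
    """
    if not marker:
        return value
    return value.partition(marker)[2]


# Hypothetical href; the actual URL is elided in the question.
href = "details.aspx?PersonID=140476"

person_id = substring_after(href, "PersonID=")
print(person_id)
```

If other query parameters could follow PersonID, the remainder would need further trimming, for example with urllib.parse.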

Get Image with Xpath using class of Div

How do I write the xpath to get the main news image in this article?
The below one failed for me.
//div[contains(@class,'sectionColumns')]//div[contains(@class,'column2']//*img"]
I want it to return all images in case of slideshow. I want it to be flexible as some classes
change when news changes.
Without looking at "this article", there is an obvious syntax error in your XPath expression:
//div[contains(@class,'sectionColumns')]//div[contains(@class,'column2']//*img"]
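The substring *img" near the end of the above contains two errors: * followed by a name, and an unbalanced quote. In addition, the second contains( call is missing its closing parenthesis, and there is a stray ] at the end.
Probably you want:
//div[contains(@class,'sectionColumns')]//div[contains(@class,'column2')]//img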
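For checking the intended selection without a full XPath engine, here is a rough Python approximation using the standard-library xml.etree.ElementTree, which does not support contains(). The markup, class names beyond those in the question, and the helper function are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup with a two-image slideshow in column2.
ARTICLE = """<div class="sectionColumns wide">
  <div class="column1"><img src="ad.png"/></div>
  <div class="column2 main">
    <div class="slideshow">
      <img src="photo1.jpg"/>
      <img src="photo2.jpg"/>
    </div>
  </div>
</div>"""


def divs_with_class_substring(scope, fragment):
    """Yield <div> elements under scope whose class contains fragment.

    Unlike the XPath descendant step, iter() also visits scope itself.
    """
    for div in scope.iter("div"):
        if fragment in div.get("class", ""):
            yield div


root = ET.fromstring(ARTICLE)

# Nested filtering stands in for the two contains(@class, ...) steps.
image_sources = [
    img.get("src")
    for section in divs_with_class_substring(root, "sectionColumns")
    for column in divs_with_class_substring(section, "column2")
    for img in column.iter("img")
]
print(image_sources)
```

Because the match is on a substring of the class attribute, it keeps working when other class names on the same element change.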