selenium xpath scrape of mixed content html span

I'm trying to scrape a span element that has mixed content
<span id="span-id">
<!--starts with some whitespace-->
<b>bold title</b>
<br/>
text here that I want to grab....
</span>
And here's a code snippet of a grab that identifies the span. It picks it up without a problem but the text field of the webelement is blank.
IWebDriver driver = new FirefoxDriver();
driver.Navigate().GoToUrl("http://page-to-examine.com");
var query = driver.FindElement(By.XPath("//span[@id='span-id']"));
I've tried adding /text() to the expression which also returns nothing. If I add /b I do get the text content of the bolded text - which happens to be a title that I'm not interested in.
I'm sure with a bit of xpath magic this should be easy but I'm not finding it so far!! Or is there a better way? Any comments gratefully received.

I've tried adding /text() to the expression which also returns nothing
This selects all the text-node-children of the context node -- and there are three of them.
What you refer to as "nothing" is most probably the first of these, which is a white-space-only text node (thus you see "nothing" in it).
What you need is:
//span[@id='span-id']/text()[3]
Of course, there are other variations possible:
//span[@id='span-id']/text()[last()]
Or:
//span[@id='span-id']/br/following-sibling::text()[1]
XSLT-based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="node()|@*">
"<xsl:copy-of select="//span[@id='span-id']/text()[3]"/>"
</xsl:template>
</xsl:stylesheet>
This transformation simply outputs whatever the XPath expression selects. When applied on the provided XML document (comment removed):
<span id="span-id">
<b>bold title</b>
<br/>
text here that I want to grab....
</span>
the wanted result is produced:
"
text here that I want to grab....
"
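The three text nodes can also be inspected outside XPath. The following is a minimal sketch in Python, assuming the comment-free sample above and only the standard library; the function names are made up for illustration. `xml.etree.ElementTree` has no `text()` step, so the sketch reads the same nodes through `.text` (text before the first child) and `.tail` (text after an element).

```python
import xml.etree.ElementTree as ET

# The sample markup from the question, with the comment removed.
SPAN = """<span id="span-id">
<b>bold title</b>
<br/>
text here that I want to grab....
</span>"""


def text_nodes(span_xml):
    """List the span's direct text nodes in document order."""
    span = ET.fromstring(span_xml)
    nodes = [span.text or ""]
    # Each child's .tail is the text node that follows that child.
    nodes.extend(child.tail or "" for child in span)
    return nodes


def trailing_text(span_xml):
    """Return the text that follows <br/>, with surrounding whitespace removed."""
    span = ET.fromstring(span_xml)
    br = span.find("br")
    return (br.tail or "").strip()


if __name__ == "__main__":
    print(text_nodes(SPAN))
    print(trailing_text(SPAN))
```

The first two entries are whitespace only, which is why a bare /text() appears to return nothing; the last one carries the wanted text.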

I believe the following XPath query should work for your case. The following-sibling axis is useful for what you're trying to do.
//span[@id='span-id']/br/following-sibling::text()

Related

Need an xpath that finds all the elements of a particular type before the first occurrence of a certain element

I need an XPath that fetches all the elements of a particular type, say input, that occur before the first occurrence of another element. The problem is that there is no proper hierarchy between the targeted elements and the 'another element', and there can be any number of 'another element' present in the HTML.
I tried using the following axis, and it works if there is only one 'another element', but if there are many it doesn't work.
<a>
<b>
<input>zyx</input>
<div>abc</div>
<span>def</span>
<input>ghi</input>
</b>
<c>
<div class="SameAttribute">Test</div>
<input>jkl</input>
<div>mno</div>
</c>
<d>
<div class="SameAttribute">Test</div>
<input>pqr</input>
<div>stu</div>
</d>
</a>
As per the HTML structure above, I want only the input elements that are within the <b> tag. The XPath needs to ignore the input elements that are within the <c> and <d> tags.
I tried this:
.//*[self::input][following::div[@class = 'SameAttribute']]
but it picks the elements from both the <b> and <c> tags.
When I try this, nothing gets selected:
.//*[self::input][following::(div[@class = 'SameAttribute'])[1]]
I cannot write XPaths containing any of the tags <b>, <c>, <d> due to other constraints.

i want only the input elements that are within the <b> tag. the xpath
needs to ignore the input elements that are within <c> and <d> tags
Use:
//b//input
I need an xpath that fetches all the elements of a particular element
type, say input, that occurs before the first occurrence of another
element. the problem is, there is no proper hierarchy between the
targeted elements and the 'another element'. and there can be any
number of 'another element' present in the html.
This is not equivalent to the first requirement quoted above.
You don't specify what is meant by "another element", but combining the two quoted requirements and the provided source XML document, one can logically conclude that "another element" here means any following sibling of the element /a/b[1].
These will be selected by:
(//b)[1]//input
or for the provided xml document just:
/a/b[1]//input
If the document had more than one /a/b element and you wanted to get the input descendants of only those /a/b elements that precede any /a/{X} elements, where {X} is a name different from b, use:
/a/b[not(preceding-sibling::*[not(self::b)])]//input
Finally, in the most general case, if you want to select the input descendants of only such b elements that come before any other (non-b) element (excluding the top element: if the top element is a b, then any input descendant of the top element satisfies the requirement), here is one XPath expression that selects these:
/*//b[not(ancestor::*[not(self::b) and parent::*])
and not(preceding::*[not(self::b)])]
//input
Here we use the fact that if an element x is before (in document order) an element y, then x is either an ancestor of y (belongs to its ancestor::* axis) or is a preceding element (belongs to its preceding::* axis).
XSLT-based verification:
This transformation evaluates all 5 XPath expressions and outputs the selected nodes:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:copy-of select="//b//input"/>
==================================
<xsl:copy-of select="(//b)[1]//input"/>
==================================
<xsl:copy-of select="/a/b[1]//input"/>
==================================
<xsl:copy-of select="/a/b[not(preceding-sibling::*[not(self::b)])]//input"/>
==================================
<xsl:copy-of select=
"/*//b[not(ancestor::*[not(self::b) and parent::*])
and not(preceding::*[not(self::b)])]
//input"/>
</xsl:template>
</xsl:stylesheet>
When applied on the originally-provided XML document:
<a>
<b>
<input>zyx</input>
<div>abc</div>
<span>def</span>
<input>ghi</input>
</b>
<c>
<div class="SameAttribute">Test</div>
<input>jkl</input>
<div>mno</div>
</c>
<d>
<div class="SameAttribute">Test</div>
<input>pqr</input>
<div>stu</div>
</d>
</a>
the wanted, correct result is selected when evaluating each expression:
<input>zyx</input>
<input>ghi</input>
==================================
<input>zyx</input>
<input>ghi</input>
==================================
<input>zyx</input>
<input>ghi</input>
==================================
<input>zyx</input>
<input>ghi</input>
==================================
<input>zyx</input>
<input>ghi</input>

You can try this XPath.
This selects one of the matching input elements by index (change the number to pick a different one):
(.//*[self::input][following::div[@class = 'SameAttribute']])[1]
A simpler way, for the input elements inside the <b> tag:
//b//input

One XPath that would seem to meet your criteria is:
//input[not(preceding-sibling::*[contains(@class,'SameAttribute')])]
This will find all input elements that do not have a preceding sibling that has a class attribute that contains the class SameAttribute.

The way you've described the problem, the simplest solution is //b/*. Alternatively, if you want all the elements with the same parent as the first input element, you might want (//input)[1]/following-sibling::*.
You certainly don't want the following axis here: read up on the difference between following and following-sibling.
Your expression //*[self::input] is a very convoluted way of saying //input.

I tried using a combination of the preceding and ancestor axes to arrive at the solution. The following XPath worked for me:
(.//div[@class='SameAttribute'])[1]/preceding::*[self::input][ancestor::a]
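The same "everything before the first marker" idea can be sketched as a plain document-order walk. This is an illustrative Python version using only the standard library; the function name and the `marker_class` parameter are assumptions, not part of the question. It collects input elements until it meets the first div whose class is SameAttribute. Unlike the preceding axis, such a walk would also count an input that is an ancestor of the marker, which does not occur in this sample.

```python
import xml.etree.ElementTree as ET

# The sample document from the question.
DOC = """<a>
<b>
<input>zyx</input>
<div>abc</div>
<span>def</span>
<input>ghi</input>
</b>
<c>
<div class="SameAttribute">Test</div>
<input>jkl</input>
<div>mno</div>
</c>
<d>
<div class="SameAttribute">Test</div>
<input>pqr</input>
<div>stu</div>
</d>
</a>"""


def inputs_before_first_marker(xml_text, marker_class="SameAttribute"):
    """Return the text of every input that starts before the first marker div."""
    root = ET.fromstring(xml_text)
    found = []
    # Element.iter() visits elements in document order (depth-first).
    for element in root.iter():
        if element.tag == "div" and element.get("class") == marker_class:
            break  # first marker reached: ignore everything after it
        if element.tag == "input":
            found.append(element.text)
    return found


if __name__ == "__main__":
    print(inputs_before_first_marker(DOC))
```

No tag names other than input and div appear in the function, which matches the constraint that <b>, <c> and <d> cannot be referenced.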

XSLT How to select only the value of an element with children

I have this code and can't edit it:
<elenco-treni>
<treno id='1'> Moderno
<percorso>Terni - Spoleto</percorso>
<tipo genere='locale'> aaa
<fermata>Narni</fermata>
<fermata informale='s'>Giano</fermata>
</tipo>
</treno>
<treno id='5' codice='G140'> Jazz
<percorso>Orte - Terontola</percorso>
<tipo genere='regionale'>
<fermata>Terni</fermata>
<fermata>Spoleto</fermata>
<fermata>Foligno</fermata>
<fermata>Assisi</fermata>
<fermata>Perugia</fermata>
</tipo>
</treno>
</elenco-treni>
and I got some problems:
When I select "elenco-treni", nothing works:
<xsl:for-each select="elenco-treni">
<xsl:value-of select="treno"/>
gives me a blank result.
I can't get the value of tipo which is "aaa"
<xsl:for-each select="treno">
<xsl:value-of select="tipo"/>
gives me the text of all of tipo's children along with its own value.

This is badly designed XML, in that it is using mixed content (elements that have both text nodes and other elements as children) in a way that mixed content wasn't designed to be used. Constructs like xsl:value-of work well if mixed content is used properly, but they don't work well on this kind of input.
When you're dealing with badly designed XML, the best thing is to start by transforming it to something cleaner. You could do this here with a transformation that wraps the text nodes in an element:
<xsl:template match="treno/text()[normalize-space(.)]">
<veicolo><xsl:value-of select="normalize-space(.)"/></veicolo>
</xsl:template>
This takes care to wrap only the non-whitespace text nodes.
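To see the difference between an element's own leading text and its full string value, here is a small Python sketch using only the standard library. The helper names are made up for illustration. It reads the text that sits directly inside each treno and tipo before the first child element, which is roughly what normalize-space(text()[1]) would select in XPath.

```python
import xml.etree.ElementTree as ET

# The document from the question.
TRAINS = """<elenco-treni>
<treno id='1'> Moderno
<percorso>Terni - Spoleto</percorso>
<tipo genere='locale'> aaa
<fermata>Narni</fermata>
<fermata informale='s'>Giano</fermata>
</tipo>
</treno>
<treno id='5' codice='G140'> Jazz
<percorso>Orte - Terontola</percorso>
<tipo genere='regionale'>
<fermata>Terni</fermata>
<fermata>Spoleto</fermata>
<fermata>Foligno</fermata>
<fermata>Assisi</fermata>
<fermata>Perugia</fermata>
</tipo>
</treno>
</elenco-treni>"""


def own_text(element):
    """Text directly inside the element, before its first child, trimmed."""
    return (element.text or "").strip()


def summarize(xml_text):
    """Return (train name, tipo value) pairs, ignoring child element text."""
    root = ET.fromstring(xml_text)
    rows = []
    for treno in root.findall("treno"):
        tipo = treno.find("tipo")
        rows.append((own_text(treno), own_text(tipo) if tipo is not None else ""))
    return rows


if __name__ == "__main__":
    print(summarize(TRAINS))
```

The second tipo has no text of its own, so its value comes back as an empty string rather than the concatenated fermata names.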

How do I assign the text of HTML Elements to a XSLT Variable?

I want to create a dynamic XSLT Variable. It should fetch the content of the first td of each row, something like this:
<tr><td>1</td><td>not Important</td></tr>
<tr><td>2</td><td>not Important</td></tr>
<tr><td>3</td><td>not Important</td></tr>
My XSL:Variable looks something like this:
<xsl:variable name="name" select="concat('out/',//td[1]/text(),'.html')"/>
I want to use the elements content (in my Case 1, 2, 3) to create new Html Files and name them accordingly:
<xsl:result-document href="{$name}">
Result:
1.html
2.html
3.html
With my current xsl:variable, Oxygen gives me this error:
A sequence of more than one item is not allowed as the second argument of concat()

If you want to map each row to a result document then I suggest to write a template
<xsl:template match="tr">
<xsl:result-document href="out/{td[1]}.html">
...
</xsl:result-document>
</xsl:template>
Then make sure there is an apply-templates for the parent table so that the tr elements are processed.

The problem is that the concat() function puts strings together, but your expression //td[1]/text() selects three text nodes, not just one.
A way to generate these 3 filenames would be iterating over the tr-nodes and selecting the first td-node in each of them:
<xsl:for-each select="//tr">
<xsl:variable name="justOneNameAtATime"
select="concat('out/',.//td[1]/text(),'.html')" />
<!-- do whatever you want with the single name, e.g.: -->
<xsl:result-document href="{$justOneNameAtATime}">
...
</xsl:result-document>
</xsl:for-each>
Notice the dot in front of the "//", meaning that the search for "td"-nodes will only take place in the current context (= in the "tr"-node).
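The "one name per row" logic can be illustrated outside XSLT as well. This is a minimal Python sketch using only the standard library; the wrapping <table> element, the function name and its parameters are assumptions added so that the three rows from the question form a parseable document.

```python
import xml.etree.ElementTree as ET

# The rows from the question, wrapped in an assumed <table> root.
TABLE = """<table>
<tr><td>1</td><td>not Important</td></tr>
<tr><td>2</td><td>not Important</td></tr>
<tr><td>3</td><td>not Important</td></tr>
</table>"""


def output_names(xml_text, prefix="out/", suffix=".html"):
    """Build one file name per row from the text of that row's first td."""
    root = ET.fromstring(xml_text)
    names = []
    for tr in root.iter("tr"):
        first_td = tr.find("td")  # first td child of this row only
        if first_td is not None and first_td.text:
            names.append(prefix + first_td.text.strip() + suffix)
    return names


if __name__ == "__main__":
    print(output_names(TABLE))
```

Looking up the first td relative to each row is what keeps every concatenation down to a single value, which is the point of the dot in the XPath above.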

How to store superscript in XML attribute and read using XSL?

I have a requirement where I need to create an XML document dynamically. Some of the attributes of the nodes of this XML contain superscript characters such as Reg. My question is how I should store such superscript characters in XML and then read them using XSL to render them as HTML. A sample XML is shown below:
<?xml version="1.0" encoding="utf-8"?>
<node name="Some text <sup>®</sup>"/>
I know this cannot be stored under a sup tag inside an attribute, as it breaks the XML. I also tried using &lt;sup&gt; in place of the opening and closing tags, but then it is rendered as the literal text <sup> in the HTML instead of actually making the character superscript.
Please let me know the solution for this problem. I have control over generation of XML. I can write it the correct way, If I know what is the right way to store superscripts.

Since you're using XSL to transform the input into HTML, I would suggest using a different method to encode the fact that some things need to be superscripts. Make up your own simple markup, for example
<node name="Some text [[®]]"/>
The markup can be anything that you can uniquely identify later and doesn't occur naturally in your data. Then in your XSL process the attribute values that can contain this markup with a custom template that converts the special markup to <sup> and </sup>. This allows you to keep the document structure (i.e. not move these string values to text nodes) and still achieve your goal.
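As a language-neutral illustration of the conversion step, here is a minimal Python sketch of turning the made-up [[...]] markers into <sup> elements. The function name and the regular expression are assumptions for illustration; in the real pipeline this logic would live in an XSL template. Everything outside and inside the markers is HTML-escaped so that only the generated tags are treated as markup.

```python
import html
import re

# Hypothetical marker syntax from the example above: [[...]] wraps superscript text.
MARKER = re.compile(r"\[\[(.+?)\]\]")


def render_superscripts(value):
    """Convert [[x]] markers in an attribute value into <sup>x</sup> HTML."""
    parts = []
    last = 0
    for match in MARKER.finditer(value):
        # Escape the plain text between markers.
        parts.append(html.escape(value[last:match.start()]))
        parts.append("<sup>" + html.escape(match.group(1)) + "</sup>")
        last = match.end()
    parts.append(html.escape(value[last:]))
    return "".join(parts)


if __name__ == "__main__":
    print(render_superscripts("Some text [[®]]"))
```

The marker only works if the chosen delimiter never occurs naturally in the data, as noted above.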

Please let me know the solution for this problem. I have control over
generation of XML. I can write it the correct way, If I know what is
the right way to store superscripts.
Because attributes can only contain values (no nodes), the solution is to store markup (nodes) inside elements:
<node>
<name>Some text <sup>®</sup></name>
</node>

If it's only single characters like ® that need to be made superscript, then you can leave the XML free of workarounds like <sup>, i.e. like
<node name="Some text ®"/>
and look for the to-be-superscripted characters during processing. A template like this might help:
<xsl:template match="node/@name">
<xsl:param name="nameString" select="string()"/>
<!-- We're stepping through the string character by character -->
<xsl:variable name="firstChar" select="substring($nameString,1,1)"/>
<!-- Stop once the string is exhausted. contains() is true for an empty
second argument, so an unguarded test would emit an empty sup element. -->
<xsl:if test="$firstChar!=''">
<xsl:choose>
<!-- '®' can be extended to be a longer string of single characters
that are meant to be turned into superscript -->
<xsl:when test="contains('®',$firstChar)">
<sup><xsl:value-of select="$firstChar"/></sup>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="$firstChar"/>
</xsl:otherwise>
</xsl:choose>
<!-- Chop off the first character and recurse on the rest. -->
<xsl:apply-templates select=".">
<xsl:with-param name="nameString" select="substring($nameString,2)"/>
</xsl:apply-templates>
</xsl:if>
</xsl:template>
This approach is, however, not very efficient, especially if you have lots of name attributes and/or very long name attributes. If your application is performance critical, it is better to test whether the impact on processing time is acceptable.

Is there a way to detect numeric string in xslt?

I am now doing an HTML-to-XML XSLT transformation, which is pretty straightforward. But I have one slight problem that is left unsolved.
For example, in my source html, a node looks like:
<p class="Arrow"><span class="char-style-override-12">4</span><span class="char-style-override-13"> </span>Sore, rash, growth, discharge, or swelling.</p>
As you can see, the first child node <span> has a value of 4; it is actually rendered as an arrow point in the browser (maybe because of some encoding issue; it is treated as a numeric value in my XML editor).
So my question is, I wrote a template to match the tag, then pass the text content of it to another template match :
<xsl:template match="text()">
<xsl:variable name="noNum">
<xsl:value-of select="normalize-space(translate(., '4', ''))"/>
</xsl:variable>
<xsl:copy-of select="$noNum"/>
</xsl:template>
As you can see, this is definitely not a good solution: it will replace all the numbers appearing in the string, not only the first character. So I wonder if there is a way to remove only the first character IF it is a number, maybe using a regular expression? Or am I actually going the wrong way, and is there a better way to solve this problem (e.g., changing the encoding)?
Any idea is welcomed! Thanks in advance!

Just use this :
<xsl:variable name="test">4y4145</xsl:variable>
<xsl:if test= "not(string(number(substring($test,1,1)))='NaN')">
<xsl:message terminate="no">
<xsl:value-of select="substring($test,2)"/>
</xsl:message>
</xsl:if>
This is an XSLT 1.0 solution. I think regex is overkill for this.
Output :
[xslt] y4145

Use this single XPath expression:
concat(translate(substring(.,1,1), '0123456789', ''),
substring(.,2)
)
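For readers who want to check the behaviour of that expression, here is a small Python sketch of the same rule: drop the first character only if it is one of the ten ASCII digits, and leave the rest of the string alone. The function and constant names are made up for illustration.

```python
ASCII_DIGITS = "0123456789"


def strip_leading_digit(text):
    """Remove the first character if it is an ASCII digit; keep everything else."""
    first, rest = text[:1], text[1:]
    # Guard against the empty string, which is "in" every Python string.
    if first and first in ASCII_DIGITS:
        return rest
    return text


if __name__ == "__main__":
    print(strip_leading_digit("4y4145"))
```

As with the XPath version, digits later in the string are untouched, which is the difference from the translate-everything template in the question.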
