XPath: select elements between other elements under a condition (containing text) - html

I have a page like this (a speech or dialogue page, where each speaker's name is in bold and is followed by the paragraphs of that speaker's speech):
<body>
<p>
<b>
speaker abc:
</b>
some wanted text here
</p>
<p>
some other text wanted, maybe containing speaker abc
</p>
<p>
some other text wanted, maybe containing speaker cde
</p>
<p>
some other text wanted
</p>
<p>
<b>
speaker cde (can be random):
</b>
</p>
<p>
some other text UNwanted, maybe containing speaker abc
</p>
<p>
some other text UNwanted, maybe containing speaker cde
</p>
<p>
some other text UNwanted
</p>
<p>
<b>
speaker abc:
</b>
</p>
<p>
some other text wanted
</p>
<p>
<b>
speaker fgh:
</b>
</p>
<p>
some other text UNwanted
</p>
</body>
I would like to select (using XPath) all the text marked as wanted in the example, that is, everything spoken by one particular speaker, say abc.
I am not very fluent in XPath and HTML. I suspect the solution involves axes, but I am struggling to figure out how.

This is very difficult to do using XPath 1.0 alone.
In XSLT 2.0+, use positional grouping:
<xsl:for-each-group select="p" group-starting-with="p[b]">...</xsl:for-each-group>
and then select the groups you are interested in.
If you have to do it using XPath 1.0, consider pre-processing the input using XSLT to split the text into speeches, using xsl:for-each-group as suggested.
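For readers without an XSLT 2.0 processor, the same positional grouping idea can be sketched in plain Python. This is only an illustration, assuming the page is well-formed XML and that a p with a b child always starts a new speech; the helper name speeches_by and the trimmed sample are made up for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed-down version of the page from the question (well-formed XML assumed).
PAGE = """<body>
<p><b>speaker abc:</b> some wanted text here</p>
<p>some other text wanted, maybe containing speaker cde</p>
<p><b>speaker cde (can be random):</b></p>
<p>some other text UNwanted, maybe containing speaker abc</p>
<p><b>speaker abc:</b></p>
<p>some other text wanted</p>
<p><b>speaker fgh:</b></p>
<p>some other text UNwanted</p>
</body>"""


def speeches_by(body, speaker):
    """Return the text of every paragraph spoken by `speaker` (hypothetical helper).

    A new group starts at each <p> that has a <b> child, which mirrors
    group-starting-with="p[b]". Only the <b> heading decides who is speaking.
    """
    wanted = []
    current = None  # heading text of the group we are currently inside
    for p in body.findall("p"):
        heading = p.find("b")
        if heading is not None:
            current = "".join(heading.itertext()).strip()
            # Text that follows the heading inside the same <p>.
            text = (heading.tail or "").strip()
        else:
            text = "".join(p.itertext()).strip()
        if current is not None and current.startswith(speaker) and text:
            wanted.append(text)
    return wanted


result = speeches_by(ET.fromstring(PAGE), "speaker abc")
print(result)
```

Because only the bold heading is inspected, a speaker name mentioned inside a speech paragraph does not start or end a group.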

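The following XPath will do this for the first speech:
"//*[preceding-sibling::p[contains(.,'speaker abc')] and following-sibling::p[contains(.,'speaker cde')]]"
It selects every element that has a preceding-sibling p containing the wanted speaker's name and a following-sibling p containing the name of the next, unwanted speaker.
Note that contains(., ...) tests the whole paragraph text, so this only works cleanly when the speaker names do not appear inside the speech paragraphs themselves. It also skips text that sits in the same p as the bold heading, and any later speech by the same speaker that is not followed by a 'speaker cde' paragraph.
With the body text reading "maybe containing abc" and "maybe containing cde", the output is
some other text wanted, maybe containing abc
some other text wanted, maybe containing cde
some other text wanted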

Related

Get text up to a specific tag in XPath

I am trying to get the text up to the first <hr> tag using XPath.
<div class="entry">
<p> some text</p>
<p> some text2</p>
<p> some text3</p>
<p> some text4</p>
<hr>(get text part before this hr tag)
<p> some text5</p>
<hr>
<p> some text6</p>
</div>
I tried this
//hr[1]/ancestor::div[@class="entry"]/text()
and some similar variants, but couldn't get the expected output.
Something along these lines will give you the set of nodes before the first hr node:
//div[@class="entry"]/*[not(preceding-sibling::hr | self::hr)]
It will list those nodes that
are children of the div with class name "entry",
are not preceded by a sibling named hr and
are not themselves an hr node
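The three conditions above amount to "walk the children of the div and stop at the first hr". A minimal Python sketch of that logic follows; it assumes a well-formed variant of the sample (<hr/> instead of <hr>) so the standard-library parser accepts it, and the helper name children_before_first is hypothetical.

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the sample (<hr/> instead of <hr>), an assumption
# made so the standard-library XML parser accepts it.
DOC = """<div class="entry">
<p> some text</p>
<p> some text2</p>
<p> some text3</p>
<p> some text4</p>
<hr/>
<p> some text5</p>
<hr/>
<p> some text6</p>
</div>"""


def children_before_first(parent, tag):
    """Collect child elements of `parent` that come before the first `tag` child."""
    before = []
    for child in parent:
        if child.tag == tag:
            # This child and every later one fail
            # not(preceding-sibling::hr | self::hr), so we can stop here.
            break
        before.append(child)
    return before


div = ET.fromstring(DOC)
texts = [(p.text or "").strip() for p in children_before_first(div, "hr")]
print(texts)
```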

Getting text within an <a> tag inside a <p> tag

Hi, I have been trying to get all the text within the div's p tags up to the hr tag, and somebody gave me this XPath:
//div[@class="entry"]/*[not(preceding-sibling::hr | self::hr)]/text()
It works fine, but it ignores the text inside the <a> tag within the p tag.
Any ideas on how to grab that text as well?
<div class="entry">
<p> some text</p>
<p> some text2</p>
<p> some text3</p>
<p> some text4
<a href='somelink'> this text here i want to get through xpath</a>
some text5
</p>
<hr>(up to this hr tag)
<p> some text5</p>
<hr>
<p> some text6</p>
</div>
One way might be //div[@class="entry"]/*[not(preceding-sibling::hr | self::hr)]//text(), though I might prefer to simply select the elements with //div[@class="entry"]/*[not(preceding-sibling::hr | self::hr)] and use their string value.
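The difference between the direct text nodes and the string value of an element can be seen in a small Python sketch. It uses the standard-library ElementTree on just the fourth paragraph of the sample (an assumption made to keep the example short); itertext() plays the role of //text() here.

```python
import xml.etree.ElementTree as ET

P = ET.fromstring(
    "<p> some text4 "
    "<a href='somelink'> this text here i want to get through xpath</a>"
    " some text5 </p>"
)

# Direct text children only: roughly what /text() returns for this element.
direct = [P.text] + [child.tail for child in P]

# All descendant text in document order: roughly //text(), and once joined,
# the string value of the element.
string_value = "".join(P.itertext())

normalized = " ".join(string_value.split())
print(direct)
print(normalized)
```

The direct text misses the link text, while the joined descendant text includes it.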
You can also pull each paragraph individually by position (XPath positions start at 1, so p[0] matches nothing):
//div[@class="entry"]/p[1]
//div[@class="entry"]/p[2]
//div[@class="entry"]/p[3]
//div[@class="entry"]/p[4]
//div[@class="entry"]/p[5]
//div[@class="entry"]/p[6]

XPath: match br tags that do not have text before or after them in a tag

I have a requirement to eliminate <br> tags enclosed in <p> tags whenever they are not preceded by text or not followed by text. Let me give a complete example.
Tags marked with an asterisk (*) are meant to be matched; the others are meant to be left untouched.
<div>
<p>
<br/>*
<span>Text1</span>
<br/>
<i>Text2
</i>
</p>
<p>
<b>
<i>
<br/>*
</i>
</b>
<span>Text3</span>
<br/>
<br/>
Text4
<i>
<br/>*
</i>
</p>
<p>
<span>Text4</span>
<br/>*
</p>
</div>
Put simply, I need to normalize the text formatting of some Word documents in which the editors used line breaks as if they were paragraphs. Line breaks are meant to break text, not to add spacing between lines; that is the paragraph's job.
So all I need is to keep the <br/> tags that are surrounded by text and match the rest so they can be deleted.
Thanks!
You could use two queries (these need XPath 2.0 or later, because normalize-space() is used as a path step):
//p/descendant-or-self::*/*[1]/self::br[not(preceding-sibling::node()/normalize-space()!='')]
//p/descendant-or-self::*/*[last()]/self::br[not(following-sibling::node()/normalize-space()!='')]
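If an XPath 2.0 engine is not available, the stated requirement can also be approached procedurally. The sketch below is not a translation of the two queries above; it is a different technique that flattens each p into a sequence of text pieces and br elements, then reports every br that has no non-whitespace text before it or none after it within that p. The sample is the question's markup without the asterisk markers, and the id attributes and function names are hypothetical, added only to make the result readable.

```python
import xml.etree.ElementTree as ET

# Sample from the question without the asterisk markers; the id attributes are
# hypothetical and only make the result easy to read.
DOC = """<div>
<p><br id="a"/><span>Text1</span><br id="b"/><i>Text2</i></p>
<p><b><i><br id="c"/></i></b><span>Text3</span><br id="d"/><br id="e"/>Text4<i><br id="f"/></i></p>
<p><span>Text4</span><br id="g"/></p>
</div>"""


def flatten(elem, out):
    """Append non-blank text strings and <br> elements below `elem` in document order."""
    if elem.text and elem.text.strip():
        out.append(elem.text)
    for child in elem:
        if child.tag == "br":
            out.append(child)
        else:
            flatten(child, out)
        if child.tail and child.tail.strip():
            out.append(child.tail)
    return out


def removable_brs(p):
    """Return the <br> elements of `p` with no text before them or none after them."""
    items = flatten(p, [])
    text_positions = [i for i, item in enumerate(items) if isinstance(item, str)]
    if not text_positions:
        # No text at all in this paragraph: every <br> qualifies.
        return [item for item in items if not isinstance(item, str)]
    first, last = text_positions[0], text_positions[-1]
    return [
        item
        for i, item in enumerate(items)
        if not isinstance(item, str) and (i < first or i > last)
    ]


root = ET.fromstring(DOC)
ids = [br.get("id") for p in root.iter("p") for br in removable_brs(p)]
print(ids)
```

The reported ids correspond to the br tags marked with an asterisk in the question.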

XPath for extracting text from self and child nodes

Here is my situation:
I want to select only "Buy 2 Hills Feline Maint Light 10kg and Save a further £4.00!" from the HTML below.
Note: I am using XPath 1.0.
<div>
<a>
<b>
<u>Multi-Buy:</u>
</b>
<br/>
Buy
<b>2</b>
Hills Feline Maint Light 10kg and
<b>
<font color="#CC0000">Save a further £4.00!</font>
</b>
<br/>
<i>Simply add 2 to your basket.</i>
</a>
</div>
Here is my attempt:
//div/a/text()
With this I miss the text of the child nodes.
/div/a//text()
With this I get extra text.
Since this HTML has no structure that would allow a clean extraction, I would propose filtering out the unwanted text nodes explicitly:
/div/a//text()[not(. = 'Multi-Buy:' or contains(., 'to your basket'))]
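The same filter can be tried out in Python with the standard library. This is a sketch under the assumption that the snippet is well-formed XML; itertext() stands in for //text(), and the whitespace is collapsed afterwards because the individual text nodes contain line breaks.

```python
import xml.etree.ElementTree as ET

DOC = """<div><a><b><u>Multi-Buy:</u></b><br/>
Buy
<b>2</b>
Hills Feline Maint Light 10kg and
<b><font color="#CC0000">Save a further £4.00!</font></b><br/>
<i>Simply add 2 to your basket.</i></a></div>"""

anchor = ET.fromstring(DOC).find("a")

# Same filter as the XPath predicate: drop the label and the basket hint.
kept = [
    t for t in anchor.itertext()
    if t != "Multi-Buy:" and "to your basket" not in t
]

# Collapse runs of whitespace (including line breaks) into single spaces.
offer = " ".join("".join(kept).split())
print(offer)
```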

Replacing end tags using Regex

I have some expressions of the form <h2> text </p>.
How do I search for the </p> tags and replace them with </h2> using regex?
Use this regex
<([^>]*)>([^<]*)</[^>]*>
and replace with
<$1>$2</$1>
so for sample input text
<h2> text </p>
<h1> some text </invalidtag>
your result is:
<h2> text </h2>
<h1> some text </h1>
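The substitution can be checked with a short Python sketch. Python's re module writes backreferences in the replacement as \1 and \2 rather than $1 and $2; the variable names are only illustrative.

```python
import re

# Group 1: everything inside the opening tag; group 2: the text up to the next "<".
PATTERN = re.compile(r"<([^>]*)>([^<]*)</[^>]*>")

sample = "<h2> text </p>\n<h1> some text </invalidtag>"

# Rebuild each match, reusing group 1 as the name of the closing tag.
fixed = PATTERN.sub(r"<\1>\2</\1>", sample)
print(fixed)
```

Since group 1 captures everything inside the opening tag, an opening tag with attributes would have its attributes copied into the closing tag as well, so this is best suited to plain tags like the ones in the sample.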
You should learn how regular expressions are used in the context of JavaScript or HTML5.