XPath for extracting text from self and child nodes

Here is my situation:
I want to select only "Buy 2 Hills Feline Maint Light 10kg and Save a further £4.00!" from the HTML below.
Note: I am using XPath 1.0.
<div>
<a>
<b>
<u>Multi-Buy:</u>
</b>
<br/>
Buy
<b>2</b>
Hills Feline Maint Light 10kg and
<b>
<font color="#CC0000">Save a further £4.00!</font>
</b>
<br/>
<i>Simply add 2 to your basket.</i>
</a>
</div>
Here is my attempt:
//div/a/text()
With this, I miss the text of the child nodes.
/div/a//text()
With this, I get extra text.

Since this HTML is not structured in a way that makes this text easy to extract cleanly, I would propose the following:
/div/a//text()[not(. = 'Multi-Buy:' or contains(., 'to your basket'))]
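The same filter can be checked outside an XPath engine. The following is a minimal Python sketch, not the poster's method: it uses only the standard library's xml.etree.ElementTree, which does not support text() node tests, so the two exclusions from the expression above are applied in plain Python over itertext(). The names HTML, wanted and result, and the whitespace handling (strip each piece, join with single spaces), are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The markup from the question; it happens to be well-formed XML.
HTML = """<div>
<a>
<b>
<u>Multi-Buy:</u>
</b>
<br/>
Buy
<b>2</b>
Hills Feline Maint Light 10kg and
<b>
<font color="#CC0000">Save a further £4.00!</font>
</b>
<br/>
<i>Simply add 2 to your basket.</i>
</a>
</div>"""


def wanted(text):
    """Mirror the XPath predicate: drop blanks and the two unwanted strings."""
    s = text.strip()
    return bool(s) and s != "Multi-Buy:" and "to your basket" not in s


root = ET.fromstring(HTML)
a = root.find("a")

# itertext() yields every text fragment under <a> in document order,
# which corresponds to /div/a//text() in the XPath version.
result = " ".join(t.strip() for t in a.itertext() if wanted(t))
print(result)
```

With a full XPath 1.0 engine the expression can be evaluated directly; the text nodes it returns would still need their surrounding whitespace trimmed before being joined.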

Related

Xpath select between elements under condition (containing text)

I have a page like this (a speech or dialogue page, with the speaker's name in bold followed by the paragraphs of their speech):
<body>
<p>
<b>
speaker abc:
</b>
some wanted text here
</p>
<p>
some other text wanted, maybe containing speaker abc
</p>
<p>
some other text wanted, maybe containing speaker cde
</p>
<p>
some other text wanted
</p>
<p>
<b>
speaker cde (can be random):
</b>
</p>
<p>
some other text UNwanted, maybe containing speaker abc
</p>
<p>
some other text UNwanted, maybe containing speaker cde
</p>
<p>
some other text UNwanted
</p>
<p>
<b>
speaker abc:
</b>
</p>
<p>
some other text wanted
</p>
<p>
<b>
speaker fgh:
</b>
</p>
<p>
some other text UNwanted
</p>
</body>
I would like to select (using XPath) all text elements marked as wanted text in the example (that is, everything spoken by one particular speaker, say abc).
I am not very fluent in XPath and HTML. I suspect an axis should be used, but I am struggling to figure out how.

This is very difficult to do using XPath 1.0 alone.
In XSLT 2.0+, use positional grouping:
<xsl:for-each-group select="p" group-starting-with="p[b]">...</
and then select the groups you are interested in.
If you have to do it using XPath 1.0, consider pre-processing the input using XSLT to split the text into speeches, using xsl:for-each-group as suggested.
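If XSLT is not available, the same positional grouping can be done in a host language instead. The sketch below is an illustration in Python (standard library only), not XPath or XSLT: it walks the p elements in document order, treats any p with a b child as the start of a new speech, and keeps the text that follows while the current speaker matches. The function name speeches_by, the shortened sample document, and the startswith match on the speaker label are all hypothetical choices.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the page from the question.
DOC = """<body>
<p><b>speaker abc:</b> first wanted</p>
<p>second wanted, mentions speaker cde</p>
<p><b>speaker cde (can be random):</b></p>
<p>unwanted, mentions speaker abc</p>
<p><b>speaker abc:</b></p>
<p>third wanted</p>
<p><b>speaker fgh:</b></p>
<p>unwanted</p>
</body>"""


def speeches_by(body, speaker):
    """Return the paragraphs spoken by `speaker`, in document order."""
    current = None  # label of the speech we are currently inside
    wanted = []
    for p in body.findall("p"):
        b = p.find("b")
        if b is not None:
            # A bold label starts a new group (like group-starting-with="p[b]").
            current = "".join(b.itertext()).strip()
            text = (b.tail or "").strip()  # text after the label, if any
        else:
            text = "".join(p.itertext()).strip()
        if text and current is not None and current.startswith(speaker):
            wanted.append(text)
    return wanted


result = speeches_by(ET.fromstring(DOC), "speaker abc")
print(result)
```

Because the speaker is decided only by the bold label, paragraphs that merely mention another speaker's name in running text do not end or start a group.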

The following XPath will do this:
"//*[preceding-sibling::p[contains(.,'speaker abc')] and following-sibling::p[contains(.,'speaker cde')]]"
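This restricts the selection to nodes that have a preceding-sibling p containing the wanted speaker's name and a following-sibling p containing the next, unwanted speaker's name. Note that contains() matches anywhere in a paragraph's text, so paragraphs that merely mention a speaker's name also satisfy these conditions; the expression is reliable only when the names appear solely in the bold labels.
The output is: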
some other text wanted, maybe containing abc
some other text wanted, maybe containing cde
some other text wanted

Xpath to get only first text tag and ignore break tag before it

I checked for similar questions but couldn't find an answer to mine.
I need to collect the text value inside an h1 tag (the value "text1" in the examples), which occurs in three different situations. All three HTML snippets are below:
First Case:
<h1 class="h1">
text1
<br>
<span>text2</span>
</h1>
Second Case:
<h1 class="h1">
<span>text1</span>
</h1>
Third Case:
<h1 class="h1">
<br>
text1
<span>text2</span>
</h1>
I used the XPath
//h1[@class="h1"]/text()[1]|//h1[@class="h1"]/span[1]
But it selects the <br> tag in the third case. Is there any way I can ignore the break tag and get the text1 value in all three cases?

Try this:
//h1/descendant-or-self::text()[normalize-space()][1]
It selects the first descendant text node of h1 that is neither empty nor made up only of whitespace.
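To see that the "first non-blank text node" rule covers all three cases, here is a small Python sketch using only the standard library. It is an approximation rather than an XPath evaluation: ElementTree has no text() node test, so the rule is applied over itertext(), and the br tags are written as <br/> because ElementTree needs well-formed XML. The function name first_text and the CASES list are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The three cases from the question, with <br> written as <br/>
# so that the standard-library XML parser accepts them.
CASES = [
    '<h1 class="h1">\ntext1\n<br/>\n<span>text2</span>\n</h1>',
    '<h1 class="h1">\n<span>text1</span>\n</h1>',
    '<h1 class="h1">\n<br/>\ntext1\n<span>text2</span>\n</h1>',
]


def first_text(h1):
    """Return the first text fragment under h1 that is not just whitespace."""
    for fragment in h1.itertext():  # document order, like the descendant axis
        if fragment.strip():        # same role as [normalize-space()]
            return fragment.strip()
    return None


results = [first_text(ET.fromstring(case)) for case in CASES]
print(results)
```

The whitespace-only text node before the <br/> in the third case is skipped for the same reason the [normalize-space()] predicate rejects it in the XPath.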

Loop based on tag in HTML document

I'm trying to extract certain details from articles that are combined in a single HTML file. The file consists of thousands of articles, so I'm trying to automate the extraction using BeautifulSoup. I can extract the details for the first article, but I can't get it to move on to the next article automatically. This is what the HTML looks like:
<DOCFULL> -->
<br/>
<div class="c0">
<p class="c1">
<span class="c2">
2 of 4 DOCUMENTS
</span>
</p>
</div>
<br/>
<div class="c0">
<br/>
<p class="c1">
<span class="c2">
The New York Times
<br/>
</span>
...
</DOCFULL>
...
<DOCFULL> -->
<br/>
<div class="c0">
<p class="c1">
<span class="c2">
1 of 4 DOCUMENTS
So once the following commands have run, I need them to be applied again to the next article, which starts again with -->. But I just cannot get it to work the way I need. For example, to extract 'The New York Times' from the partial HTML above I use the following, and it should automatically be done for the 2nd, 3rd, 4th, etc. article as well.
journal = soup.find_all('span', class_='c2')[1].getText()
If anyone can point me in the right direction, it would be really appreciated!
EDIT:
Just to put what I am trying to achieve into perspective: I can get the latter parts to work, but I cannot get it to check each article in turn.
For Each Article:
* Determine Newspaper
* If newspaper = x
.
.
.
* Else
Continue
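One possible direction, offered as a sketch under assumptions rather than a definitive answer: split the raw file on the <DOCFULL> marker first, then run the per-article extraction on each chunk. To stay runnable without extra packages, the sketch uses the standard library's html.parser instead of BeautifulSoup; the same chunking should carry over by building one soup per chunk, e.g. BeautifulSoup(chunk, 'html.parser'). The class SpanC2Collector, the function journals, and the second article in SAMPLE ("Example Gazette") are made up for illustration, and it is assumed that the second span.c2 of every article holds the newspaper name, as in the sample.

```python
import re
from html.parser import HTMLParser


class SpanC2Collector(HTMLParser):
    """Collects the text of every <span class="c2"> in one article chunk."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # > 0 while inside a span.c2 (tracks nested spans)
        self.spans = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and (self.depth or dict(attrs).get("class") == "c2"):
            self.depth += 1
            if self.depth == 1:
                self.spans.append("")  # start collecting a new span.c2

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.spans[-1] += data


def journals(html):
    """Yield the newspaper name of each article, one article at a time."""
    # Piece 0 is whatever precedes the first <DOCFULL>, so it is skipped.
    for chunk in re.split(r"<DOCFULL>", html)[1:]:
        parser = SpanC2Collector()
        parser.feed(chunk)
        parser.close()
        spans = [s.strip() for s in parser.spans]
        if len(spans) > 1:
            yield spans[1]  # assumption: second span.c2 is the newspaper


# Hypothetical two-article sample in the shape shown in the question.
SAMPLE = """<DOCFULL> -->
<br/><div class="c0"><p class="c1"><span class="c2">2 of 4 DOCUMENTS</span></p></div>
<div class="c0"><p class="c1"><span class="c2">The New York Times<br/></span></p></div>
</DOCFULL>
<DOCFULL> -->
<div class="c0"><p class="c1"><span class="c2">3 of 4 DOCUMENTS</span></p></div>
<div class="c0"><p class="c1"><span class="c2">Example Gazette</span></p></div>
</DOCFULL>"""

names = list(journals(SAMPLE))
print(names)
```

The "If newspaper = x ... Else Continue" step then becomes an ordinary if/continue inside a loop over journals(...), since each iteration sees exactly one article.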

Is there any way to format style and text inside the Text Area?

I have the code below, which I wrote inside a text area to get the formatting shown. The text itself is formatted correctly, but how can I place the same formatted text inside the textarea?
<input type="textarea">
<P align = center>
<B>About Salman Khan </B>
</P>
<P align = left>Salman Khan (pronunciation born Abdul Rashid Salim Salman Khan on 27 December 1965)[5] is an INDIAN film actor and producer. He is cited in the media as one of the most commercially successful actors of Hindi cinema.
</P>
<P>
<ul>
<li>Undertake our tasks and activities in utmost good faith, objectivity, transparency, competence, due care and professionalism.</li>
<li>Abide by the highest standards of politeness and good conduct.</li>
</ul>
</P>
</input>

Sorry, you can't style the text inside a textarea. However, there do seem to be some workarounds available. Here's the first thing I found:
Format text in a <textarea>?
So basically you put all your content in a div and style it to work like a text area. Here you can see I've made it scrollable and editable, and set the height, width and borders. (These don't need to be set, depending on exactly what you want.) I think the biggest problem here is that different browsers style text areas differently, so there is probably no really easy way to style it so that it looks exactly like a native textarea in all browsers.
<div id='fake_textarea' style="overflow:scroll; height:100px; width:400px; border:solid; border-width:1px" contenteditable>
<P align=center>
<B>About Salman Khan </B>
</P>
<P align=left>
Salman Khan (pronunciation born Abdul Rashid Salim Salman Khan on 27 December 1965)[5] is an INDIAN film actor and producer. He is cited in the media as one of the most commercially successful actors of Hindi cinema.
</P>
<P>
<ul>
<li>Undertake our tasks and activities in utmost good faith, objectivity, transparency, competence, due care and professionalism.</li>
<li>Abide by the highest standards of politeness and good conduct.</li>
</ul>
</P>
</div>
Notice that this is the same as your code, except that I replaced the input tag with a styled div.
See the answers to the question linked above if you need to use this field in a form and must submit the contents. It takes a minor workaround.

How to add a <br/> to IE 10 without affecting the other browsers

Hi, I'm sure there is a simple way of doing this, but I just can't find what I'm looking for.
My problem is that I have an ordered list; the code is shown below.
<ol>
<li>Identification code of product type </li>
<b> WC flushing cistern - Class 2</b>
<li> Serial number allowing identification of the construction product as required under article 11(4): </li>
<b> PACWHB313397</b>
<li>Intended use in accordance with the applicable harmonised standard, as foreseen by the manufacturer </li>
<b>Personal Hygiene</b>
<li>Name, and contact address of manufacturer as required under article 11(5): </li>
<b> Thomas Dudley Limited, 295 Birmingham New Road, Dudley, West Midlands, United Kingdom, DY1 4SJ </b>
<li>Where applicable, Name and contact details of representative who's mandate covers tasks specified in article 12(2):</li>
<b>N/A</b>
<li>System or Systems of assessment and verification of constancy of performance of the construction product as set out in CPR, annex V:</li>
<b> System 4</b>
<li>Applicable Standards, In case of the declaration of performance concerning a construction product covered by a harmonised standard</li>
<b>BS EN 14055:2010 WC and Urinal Flushing Cisterns (Class 2)</b>
<b>BS 6920-2.1:2000 Suitability of non-metallic products for use in contact with water intended for human consumption.</b>
<b>BS 1212-3:1990 Diaphragm type float operated valves (plastic bodied) for cold water services only (Excluding Floats)</b>
</ol>
Now in Firefox it displays correctly; for example, it should look like this:
example example
bold text here
example example
bold text here
example example
bold text here
This works in all browsers except IE 10, where it displays like this:
example example bold text here
example example bold text here
example example bold text here
So I was wondering whether there is a simple way to sort this out, using CSS or a particular tag.
Any help will be much appreciated.

Take a look at this fiddle: http://jsfiddle.net/Lcy2L/
You can only have <li> elements directly inside the <ol> element. You should put all your content in the <li> and insert <br /> and <b> or other tags inside the <li> to format it.
Your code could look something like this:
<ol>
<li>
Content<br />
<b>Comment</b>
</li>
</ol>

See this fiddle, which contains an example of what you can do. With CSS:
.styling {
font-weight:bold;
}
and li element:
<li>Identification code of product type
<br><span class="styling"> WC flushing cistern - Class 2</span>
</li>
