How to use XPath to select child text after another child element - html

I'm using the Crawler library, which helps build XPath expressions to get the content of HTML tags. I'm currently reading HTML5 content from a page and I want to retrieve text that is not wrapped in its own tag, like this:
<div class="country">
<strong> USA </strong>
Some text here
</div>
So I'm trying to get the text "Some text here", but the Crawler library only seems to return what is inside a tag, not what sits outside it.
Is there an alternative?
Here is the Crawler part:
$crawler = new Crawler();
$crawler->xpathSingle($xml, '//div[@class="country"]/strong/@text');

Either of these XPaths will return "Some text here" as requested:
normalize-space(substring-after(//div[@class="country"], 'USA'))
normalize-space(//div[@class="country"]/strong/following-sibling::text())
Choose based on the sort of variations you wish to accommodate.
Credit: The second example is derived from a suggestion first made in a comment by @Keith Hall.
Update:
As I mentioned, you'll need to choose your XPath based on the variations you wish to accommodate. No sooner did I post than you encountered a variation:
<div class="country">
<strong> USA </strong>
Some text here
<i>Do not want this text</i>
</div>
You can exclude "Do not want this text" and return "Some text here" as requested by using the second XPath above but grabbing only the first following text node:
normalize-space(//div[@class="country"]/strong/following-sibling::text()[1])
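For readers who only have Python's standard library at hand, here is a minimal sketch of the same idea. It assumes the markup is well formed enough for xml.etree.ElementTree (which has no text() node test) and wraps the question's snippet in a <body> element for illustration. The text that follows an element is exposed as its tail, which roughly corresponds to following-sibling::text()[1] when that text comes directly after the element.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, wrapped in <body> (an assumption for this sketch).
MARKUP = """<body><div class="country">
<strong> USA </strong>
Some text here
<i>Do not want this text</i>
</div></body>"""


def text_after_strong(markup):
    """Return the whitespace-normalized text that directly follows <strong>."""
    root = ET.fromstring(markup)
    strong = root.find(".//div[@class='country']/strong")
    # .tail holds the text between </strong> and the next tag.
    return " ".join((strong.tail or "").split())


print(text_after_strong(MARKUP))
```

The split/join pair plays the role of normalize-space(); the text inside the <i> element is never touched because it belongs to that element, not to the tail of <strong>.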

Related

How can I get the element of a-tag in the div class with selenium?

I am working on a project where I have to get an element from a specific website.
I want to get the text shown below.
<div class="block-content">
<div class="block-heading">
<a href="https://www~~~~~~">
<i class="fa fa-map">
::before
</i>
"Text I want to get"
</a>
</div>
</div>
I have been trying to solve this for a while, but I could not find anything working fine.
I would love you if you could help me.
Thank you.
According to the information you provided, the text you are looking for is inside an a element, so the XPath for this element is something like:
//a[contains(@href,'https://www')]
But since there is also an i element inside it, getting the text from the a element will give you both the text contained in a itself and the text inside i.
So you should get the text from i (which looks like just a space here) and remove it from the text you receive from a.
If you want to perform this action on all the a elements that have an href and contain an i element, you can use the following XPath:
//a[@href and ./i]
If there are more specific criteria for the elements you are looking for, the XPath above should be updated accordingly.
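The idea of keeping only the text that belongs to the a element itself, and not to its i child, can be sketched without a browser. This is a minimal illustration using Python's standard xml.etree.ElementTree rather than Selenium. The markup, the example.com URL, and the "icon" text inside i are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the markup in the question.
MARKUP = """<div class="block-heading">
<a href="https://example.com/shop"><i class="fa fa-map">icon</i>Text I want to get</a>
</div>"""


def own_text(element):
    """Collect text that is a direct child of `element`, skipping child elements' text."""
    parts = [element.text or ""]
    # Text that follows each child (its tail) still belongs to the parent.
    parts.extend(child.tail or "" for child in element)
    return " ".join("".join(parts).split())


anchor = ET.fromstring(MARKUP).find("a")
print(own_text(anchor))
```

This avoids subtracting one string from another: the text inside i is simply never collected.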
From your comment, I understood that you would like to extract that text. So here is the code for you which would extract the text you want.
Selenium::WebDriver::Wait
.new(timeout: 60)
.until { !driver.find_element(xpath: "//i[@class='fa fa-map-marker']/..").text.empty? }
p driver.find_element(xpath: "//i[@class='fa fa-map-marker']/..").text[/(?<=before \")\w+ \w+ \w+ \w+ \w+/]
output
"Text I want to get"
I couldn't get the elements I wanted directly, so here's what I did.
I just post-processed the element text with a few methods.
def seller_name
  shop_info_elements = @driver.find_elements(:class_name, "block-content")
  shop_info_text = shop_info_elements.first.text
  shop_info_text_array = shop_info_text.lines
  seller_name = shop_info_text_array.first.chomp
  seller_name
end
It is not beautiful, but it can work for any other pages on the same site.

XPath : How to get text between 2 html tags with same level?

I'm new to xpath and I'm working with scrapy to get text from different html pages that are generated.
I get the {id} of a header tag from the user (<h1|2|.. id="title-{id}">text</h1|2|3..>). I need to get text from all html tags between this header and the next header of same level. So if the header is h1 I need to get all text of all tags until next h1 header.
All headers ids have same pattern "title-{id}" where {id} is generated.
To make it more clear here is an example :
<html>
<body>
...
<h2 id="tittle-id1">id1</h2>
bunch of tags containing text I want to get
<h2 id="tittle-id2">id2</h2>
...
</body>
</html>
NOTE : I don't know what header it might be. It could be any of the html header tags from <h1> to <h6>
UPDATE :
While trying a few things I realized that I can't be sure the next header is of the same level, or that it even exists, since the headers are used as titles and sub-titles. The given id may belong to the last sub-title, in which case the next header is of a higher level, or there may be no header after it at all. So basically I only have the id of the header and I need to get all the text of the "paragraph".
Workaround:
I found a kind of workaround.
I do it in 3 steps:
First, I use //*[@id='title-{id}'] which gives me the full line with the tag, so now I know which header tag it is.
Second, I use //*[@id='title-{id}']/following-sibling::* to look for the next header of the same or higher level, {myHeader}.
Last, I use //*[@id='title-{id}']/following-sibling::* and //{myHeader}//preceding-sibling::* to get what's in between, or go to the end of the page if no header is found.
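The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the headers and the content sit as siblings under one parent, the markup is well formed enough for xml.etree.ElementTree, and the sample page and the section_text helper are made up for the example. Loose text placed directly in the parent is ignored.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample page with sibling headers and paragraphs.
PAGE = """<body>
<h2 id="title-id1">id1</h2>
<p>first paragraph</p>
<h3 id="title-id3">sub</h3>
<p>second paragraph</p>
<h2 id="title-id2">id2</h2>
<p>other section</p>
</body>"""

HEADER = re.compile(r"h([1-6])$")


def section_text(parent, header_id):
    """Text of the siblings after the header, up to the next header of same or higher level."""
    siblings = list(parent)
    # Step 1: locate the header and learn its level.
    start = next(i for i, el in enumerate(siblings) if el.get("id") == header_id)
    level = int(HEADER.match(siblings[start].tag).group(1))
    collected = []
    # Steps 2 and 3: walk forward, stop at the first header whose level is <= ours.
    for el in siblings[start + 1:]:
        found = HEADER.match(el.tag)
        if found and int(found.group(1)) <= level:
            break
        collected.append(" ".join("".join(el.itertext()).split()))
    return collected


body = ET.fromstring(PAGE)
print(section_text(body, "title-id1"))
print(section_text(body, "title-id3"))
```

If no later header of the same or higher level exists, the loop simply runs to the last sibling, which matches the "go to the end of the page" case.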
Here is the xpath to get all the elements between h2 tags.
//h2/following-sibling::*[count(following-sibling::h2)=1]
Here is the XPath and the sample HTML I used to simulate the scenario (update the id to check the different options shown below).
//*[@id='tittle-id1']/following::*[count(following-sibling::*[name()=name(preceding-sibling::*[@id='tittle-id1'])])=1]
<html><head></head><body>
...
<h2 id="tittle-id1">id1</h2>
<h3 id="tittle-id3"> h3 tag</h3>
<h4 id="tittle-id4"> h4 tag</h4>
<h3 id="tittle-id5"> 2nd h3 tag</h3>
bunch of tags containing text I want to get
<h5 id="tittle-id6"> h5 tag </h5>
<h2 id="tittle-id2">id2</h2>
<h4 id="tittle-id7"> 2nd h4 tag</h4>
...
</body></html>
output if User input: {id1}
output if user input: {id4}
output if user input: {id3}
Note: This XPath is designed to suit the scenario in the original post.
Because predicates in XPath filter the context node list, you can't perform a join selection unless you are able to reintroduce target values from a relative context of your source values. For example, this selects all the elements with the same name as the one that has a specific id attribute:
//*[name()=name(//*[@id=$generated-id-string])]
Now, for the "between marks" problem, use the Kaysian method for intersection as usual:
//*[name()=name(//*[@id=$generated-id-string])]/preceding-sibling::node()[
count(.|//*[@id=$generated-id-string]/following-sibling::node())
=
count(//*[@id=$generated-id-string]/following-sibling::node())
]
Test in http://www.xpathtester.com/xpath/0dcfdf59dccb8faf3705c22167ae45f1
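To make the counting trick less opaque, here is a small Python sketch of the Kaysian intersection. A node from the first set is kept when adding it to the second set does not make the second set any larger, which means it was already a member. The document and the kaysian_intersection helper are made up for illustration, and Python lists stand in for XPath node-sets.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: two marker headers with content between and after them.
DOC = ET.fromstring(
    "<body><h2 id='a'/><p>one</p><p>two</p><h2 id='b'/><p>three</p></body>"
)


def kaysian_intersection(ns1, ns2):
    """Mirror of $ns1[count(. | $ns2) = count($ns2)] using object identity."""
    ns2_ids = {id(node) for node in ns2}
    # The union with a node has the same size only if the node is already in ns2.
    return [node for node in ns1 if len(ns2_ids | {id(node)}) == len(ns2_ids)]


children = list(DOC)
start = next(i for i, el in enumerate(children) if el.get("id") == "a")
end = next(i for i, el in enumerate(children) if el.get("id") == "b")

following_start = children[start + 1:]   # like following-sibling:: of the first mark
preceding_end = children[:end]           # like preceding-sibling:: of the second mark

between = kaysian_intersection(preceding_end, following_start)
print([el.text for el in between])
```

In a general-purpose language a plain set intersection would do. The count comparison is only needed in XPath 1.0, which has a union operator but no intersection operator.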
This is what worked for me :
For this keep in mind that I'm using scrapy with python-2.7 :
# Select every element that has the same tag name as the header with the given id.
name_query = u"//*[name()=name(//*[@id='"+id+"'])]"
all = response.xpath(name_query)
for selector in all.getall():
    if self.id in selector:
        position = all.getall().index(selector)
        # Recover the header tag (h1..h6) and its title from the serialized element.
        balise = "h" + all.getall()[position].split("<h")[1][0]
        title = all.getall()[position].split(">")[1].split("<")[0]
        # Elements whose nearest preceding header is ours and that still have a later header.
        query = u"//*[preceding-sibling::"+balise+"[1] ='"+title+"' and following-sibling::"+balise+"]"
        self.log('query = '+query)
        results = response.xpath(query)
        results.pop(len(results)-1)
        with open(filename,'wb') as f:
            for text in results.css("::text").getall():
                f.write(text.encode('utf-8')+"\n")
This should work in general. I tested it against multiple headers with different levels and it works fine for me.

Regex find text between tags in files

I am trying to search across all files using a regex search in WebStorm. I have 2 scenarios.
Scenario 1: text inside html tag
<p>testing</p>
Scenario 2: dynamic text inside {{ and }} inside html text
<p>{{testing}}</p>
I was able to find text between html tags using below regex for Scenario 1
>(.*?)</
I am trying to find only places with scenario 1 and not with scenario 2. I mean I want to see all the hard coded text between html tags and not any text between {{ and }}. Any suggestion or pointer?
Have you tried using regexr.com?
Edit
How is this:
>(\w+)</
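To see how the two patterns from this thread behave, here is a quick check with Python's re module. WebStorm uses Java's regex engine, but these particular patterns should behave the same way in both. The third pattern, NO_BRACES, is not from the thread. It is a hypothetical variant that excludes braces rather than allowing only word characters, so hard-coded text containing spaces still matches.

```python
import re

SAMPLE = "<p>testing</p>\n<p>{{testing}}</p>\n<p>two words</p>"

BROAD = re.compile(r">(.*?)</")          # pattern from the question
WORD_ONLY = re.compile(r">(\w+)</")      # pattern suggested in the answer
NO_BRACES = re.compile(r">([^<>{}]+)</") # hypothetical alternative, not from the thread

broad = BROAD.findall(SAMPLE)
word_only = WORD_ONLY.findall(SAMPLE)
no_braces = NO_BRACES.findall(SAMPLE)

print(broad)
print(word_only)
print(no_braces)
```

The answer's pattern skips the {{ }} case, but it also skips any hard-coded text that contains spaces or punctuation, which may or may not be acceptable for the search.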

HTML XPath: Extracting text mixed in with multiple level and complex tags?

related questions before:
HTML XPath: Extracting text mixed in with multiple tags?
HTML XPath: Selectively avoiding tags when extracting text
//sorry for my poor English
I'm a beginner at writing web crawlers. I'm trying to extract the main content from web pages (in Chinese) with XPath (though I have learned that there are algorithms, both traditional and machine learning based, for extracting the main content of a page), and I'm very much a beginner at writing XPath rules.
I'm faced with a web page that contains text mixed in with complex tags. I summarize it as follows, where a character (e.g. A, A2) means text only and '...' means more tags, possibly nested, without text. I want to get "AA2BB2CDEFGHIJKLMNOP".
...
<div id="artibody" class="art_context">
<div align="center">...</div>
<div align="center"><font>A</font>A2</div>
<div align="left"><br><br><strong>B</strong>B2</div>
<div align="left">
<p>C<a>D</a>E</p>
<p>F<a>G</a>H<a>I</a>J</p>K
</div>
<div align="center">...</div>
<div align="center"><font>L</font></div>
<p>M</p><!--M contains only text luckly-->
<p>N</p>
<p>O</p>
<p>P<span>...</span><div class="shareBox">...</div>
</p>
<span id="arctTailMark"></span>
<script>
var page_navigation = document.getElementById('page_navigation');
...
</script>
<div style="padding:10px 0 30px 0">...</div>
</div>
Thanks to the previous questions, I wrote this rule:
'string(//div[@class=\"art_context\"])'
I get all the content I want as plain text without tags, but the JS code in <script> is extracted as well. I tried the following, but it does not help. There is still JS code in the result.
'string(//div[@class=\"art_context\" and not(self::script)])'
The following one returns only "\r\n".
'//div[@class=\"art_context\" and not(self::script)]/text()'
Here are my questions:
1. How do I write an XPath rule that meets my need: extracting the content in div[@id="artibody"] except the code in <script>?
2. Is the rule for question 1 simple and robust? I may meet more pages with a div[@id="artibody"] whose descendant nodes are quite different.
3. Any further suggestions on my task? I'm extracting web content from one website, but the main content lies in <div> elements with different id, class, and descendant node structure. I run the spider on my laptop (Intel Core i5 3225, 8G RAM), and using machine learning algorithms may decrease the crawl speed significantly. At the same time, writing many XPath rules is tedious.
I'd appreciate any suggestions on this question (and on my English).
To get all descendant text nodes except the script contents, you can use this:
//div[@class="art_context"]//*[not(self::script)]/text()
In natural language: “Get all text nodes from descendants of all div[@class="art_context"] elements that are not script elements”.
The // after div[@class="art_context"] is needed to select descendants, not just children.
In comparison, the //div[@class="art_context" and not(self::script)]/text() expression in the question says “Get all text-node children of all div[@class="art_context"] elements that are not also script elements.”
So the and not(self::script) part of the expression in the question is redundant: a div is never a script element, so the expression selects just //div[@class="art_context"] anyway, and the /text() part then selects only the text nodes that are direct children of that div, which are just line breaks.
Also, if instead of using XPath to get the set of text nodes you want the result as a single string, you can use the functions string-join(…) and normalize-space(…). Note that string-join(…) requires XPath 2.0 or later:
normalize-space(string-join(//div[@class="art_context"]//*[not(self::script)]/text(), ""))
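If the XPath engine at hand is limited to XPath 1.0, the joining can be done in the host language instead. Here is a minimal sketch with Python's standard xml.etree.ElementTree, assuming well-formed markup. The shortened sample and the text_without_scripts helper are made up for illustration. Unlike the XPath above, it also keeps text placed directly inside the outer div.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the page structure in the question.
ARTICLE = """<div id="artibody" class="art_context">
<div align="center"><font>A</font>A2</div>
<div align="left"><p>C<a>D</a>E</p>K</div>
<script>var page_navigation = 1;</script>
<p>M</p>
</div>"""


def text_without_scripts(element):
    """Concatenate all descendant text in document order, skipping <script> content."""
    if element.tag == "script":
        return ""
    parts = [element.text or ""]
    for child in element:
        parts.append(text_without_scripts(child))
        # Text after a child belongs to the parent, even when the child is a script.
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(ARTICLE)
# Drop all whitespace to get a compact string like the one the question asks for.
result = "".join(text_without_scripts(root).split())
print(result)
```

Removing every whitespace character is a choice made here to mirror the "AA2BB2C..." target in the question. For text where spaces matter, joining the split parts with a single space would be the safer option.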

Xpath query to grab text between different html tags

I am using R to screen scrape. I've grabbed a page and managed to find all the links located in a certain place on the page (anchor tags within anchor tags with a name attribute) using:
links <- xpathSApply(doc, "//a[@name]//a/@href")
Now I have fetched the documents from the links with Curl and I want to scrape a certain amount of text. The text always seems to start after a <p> tag (although there are other <p> tags in the text) and to end before the following text:
</pre><hr>Back to: <a href="#TOP">
I decided to grab all the text between <p> and <a href="#TOP"> and I can't seem to nail the XPath query. So far I have got:
text <- xpathSApply(doc, '"/ //text()[preceding:://a/@href="#TOP"] and following::*//p')
Could anyone point me in the right direction? There are quite a few XPath answers on Stack Overflow, but they don't always explain the answer, which makes it hard to adapt them for my own use.
Sample HTML:
<span ID="MSGHDR-CONTENT-TYPE-H-PRE">Content-type:</b></span> <span ID="MSGHDR-CONTENT- TYPE-PRE">text/plain; charset=us-ascii</span>
</span><p>
lots and lots of text here that I want
</pre><hr>Back to: Top of message | Previous page | Main CYBCOM page<p>
The HTML is badly formed, so it was difficult for me to figure out what a well-formed instance would look like when parsed into a tree of nodes.
Something like the following might work. It assumes that all of the <p> elements declared inside of the <pre> are children of it (even though not closed in the HTML).
It looks for the text() that is a child of a <p> that does not have a child <p> and is a descendant of the <pre> that has a following-sibling whose first <a> has an href with the value "#TOP".
//body/pre[following-sibling::a[position()=1 and @href='#TOP']]//p[not(p)]/text()
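Because the markup is badly formed, an event-based parser that never builds a tree can be a more forgiving alternative to XPath. Below is a minimal sketch using Python's standard html.parser rather than R. It assumes, as described in the question, that the wanted text starts at the first <p> and ends at the closing </pre>. The BetweenMarkers class and the shortened sample are made up for illustration.

```python
from html.parser import HTMLParser


class BetweenMarkers(HTMLParser):
    """Collect text from the first <p> start tag up to the next </pre> end tag."""

    def __init__(self):
        super().__init__()
        self.capturing = False
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and not self.done:
            self.capturing = True

    def handle_endtag(self, tag):
        # Stray end tags are still reported, so unmatched </pre> works as a marker.
        if tag == "pre" and self.capturing:
            self.capturing = False
            self.done = True

    def handle_data(self, data):
        if self.capturing:
            self.chunks.append(data)


# Shortened stand-in for the malformed sample in the question.
SAMPLE = """<span>text/plain; charset=us-ascii</span>
</span><p>
lots and lots of text here that I want
</pre><hr>Back to: <a href="#TOP">Top of message</a> | Previous page<p>"""

parser = BetweenMarkers()
parser.feed(SAMPLE)
parser.close()
text = " ".join("".join(parser.chunks).split())
print(text)
```

The done flag keeps the trailing <p> after the "Back to" line from restarting the capture.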