Xpath get element with condition - html

I have a block of HTML and need to get data out of it. I've tried different XPath expressions, but with no success.
<div>
<div class="some_class">
<a title="id" href="some_href">
<nobr>1<br>
</a>
</div>
<div class="some_other_class">
<a title="name" href="some_href">
<nobr>John<br>
</a>
</div>
</div>
<div>
<div class="some_class">
<a title="id" href="some_href">
<nobr>2<br>
</a>
</div>
<div class="some_other_class">
<a title="name" href="some_href">
<nobr>John<br>
</a>
</div>
</div>
// and many blocks like this
So, these div blocks are the same except for the content of their sub-elements. I need an XPath query to get John's href from the block whose <a title="id"> value is equal to 1.
I've tried something like this:
//div[./div/nobr='1' AND ./div/nobr='John']
to get only the div that contains the data I need; from there it wouldn't be hard to get John's href.
I've also managed to get John's href with:
//a[./nobr='John'][@title='name']/@href
but that expression doesn't depend on the value in the <a title="id"...> element, and it has to.
Any suggestions?

I think what you want is
//div/div[a/@title='id']/following-sibling::div[1]/a/@href
which, given a well-formed input document, will return (individual results separated by --------):
href="some_href"
-----------------------
href="some_href"
You did not explain it very clearly though, as kjhughes has noted, and perhaps your sample HTML is not ideal.
Regarding your attempted path expressions, as the input is HTML, it is hard to know whether
<nobr>John<br>
means that "John" is inside the nobr element or not.

Thanks Mathias, your example was helpful, but since there are many elements with @title='id', it isn't a reliable solution that will always catch the right elements.
I've managed a workaround: first I select the whole div, and then, relative to that div, I extract the href I need.
//div[./div/a[@title='name']/nobr='John' and ./div/a[@title='id']/nobr='1']
.//a[./nobr='John'][@title='name']/@href
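For reference, the two-step workaround can be sketched in plain Python. This is only an illustration under stated assumptions: the markup is made well-formed (each nobr is closed and a wrapper root element is added), the href values are invented so the two blocks can be told apart, and the helper name find_href is hypothetical. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, so the two conditions are checked in Python rather than in a single expression. With a full XPath 1.0 engine, something like //div[div/a[@title='id']/nobr='1']/div/a[@title='name'][nobr='John']/@href should express the same condition, again assuming the text really sits inside nobr with no extra whitespace.

```python
import xml.etree.ElementTree as ET

# Assumed well-formed version of the question's markup (hypothetical hrefs).
DOC = """
<root>
  <div>
    <div class="some_class"><a title="id" href="id_1"><nobr>1</nobr></a></div>
    <div class="some_other_class"><a title="name" href="john_1"><nobr>John</nobr></a></div>
  </div>
  <div>
    <div class="some_class"><a title="id" href="id_2"><nobr>2</nobr></a></div>
    <div class="some_other_class"><a title="name" href="john_2"><nobr>John</nobr></a></div>
  </div>
</root>
"""


def find_href(root, wanted_id, wanted_name):
    """Return the href of the name link in the block whose id link matches."""
    for block in root.findall("div"):
        # Both conditions are evaluated against the same outer div.
        id_text = block.findtext("div/a[@title='id']/nobr", default="")
        name_link = block.find("div/a[@title='name']")
        if name_link is None:
            continue
        name_text = name_link.findtext("nobr", default="")
        if id_text.strip() == wanted_id and name_text.strip() == wanted_name:
            return name_link.get("href")
    return None


root = ET.fromstring(DOC)
print(find_href(root, "1", "John"))
print(find_href(root, "2", "John"))
```

Because the id and the name are tested on the same outer div, the result depends on the id value, which is what the question asked for.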

Related

Find xpath nearest element

I need to build an XPath to an element that sits just before another element on the page. I have a string (FIO) that I can find using XPath, and I need to anchor on it, but I don't understand how to do it.
This is the XPath with which I can find it on the page:
/html/body/div[1]/div[2]/section/div/div[1]/div/ul/li[2]//div[1]/span[contains(., '"+FIO+"')]
In the screenshot, I need to find string 1, which has the XPath:
/html/body/div[1]/div[2]/section/div/div[1]/div/ul/li[2]/ul/li[4]/ul/li[1]/div/div/a
The string with my parameter (FIO), string 2, has the XPath:
/html/body/div[1]/div[2]/section/div/div[1]/div/ul/li[2]/ul/li[4]/ul/li[1]/div/div/div[1]/span
I shortened it and inserted a variable:
/html/body/div[1]/div[2]/section/div/div[1]/div/ul/li[2]//div[1]/span[contains(., '"+FIO+"')]
How can I get the XPath to element 2 by anchoring on element 1? Maybe with following-sibling?
Sorry, I can't copy the code properly, only like this:
</div>
</div>
<ul>
<li>
<div class="structure2__item1">
<div class="structure2__item2" style="">
<a class="structure2__position" href=https://**>
"String 2"
</a>
<div class="structure2__name" style="">
<span>String_FIO</span>
</div>
</div>
</div>
</li>
<li>
//div[child::span[contains(text(), "String_FIO")]]/preceding-sibling::a
This would help fetch the a tag from the span.
(Next time, please follow the posting standards mentioned in the comments.)
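To make the preceding-sibling step concrete, here is a small Python sketch of the same navigation. It is an illustration only: the fragment is a cleaned-up, well-formed version of the snippet in the question, the href is a placeholder rather than the real URL, and link_before_name is a hypothetical helper name. The standard library's xml.etree.ElementTree has no preceding-sibling axis, so the sketch walks the sibling list by hand.

```python
import xml.etree.ElementTree as ET

# Assumed well-formed fragment modelled on the question's snippet;
# the href value is a placeholder, not the real URL.
DOC = """
<ul>
  <li>
    <div class="structure2__item1">
      <div class="structure2__item2">
        <a class="structure2__position" href="https://example.invalid/">String 2</a>
        <div class="structure2__name"><span>String_FIO</span></div>
      </div>
    </div>
  </li>
</ul>
"""


def link_before_name(root, fio):
    """Find the <a> that precedes the div whose <span> contains `fio`."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for span in root.iter("span"):
        if fio not in (span.text or ""):
            continue
        name_div = parent_of.get(span)
        if name_div is None or name_div.tag != "div" or name_div not in parent_of:
            continue
        siblings = list(parent_of[name_div])
        # Walk backwards from the div to the nearest preceding <a> sibling.
        for sibling in reversed(siblings[: siblings.index(name_div)]):
            if sibling.tag == "a":
                return sibling
    return None


root = ET.fromstring(DOC)
link = link_before_name(root, "String_FIO")
print(link.text, link.get("href"))
```

In Selenium or any other full XPath engine, the one-line expression from the answer does the same job; the sketch only spells out what that expression navigates.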

Scrapy, how to extract a subtext from <b>

I have HTML like this:
<section id="SECTION_A">
<h4>List</h4>
<a style="text-decoration: none;" href="#list" data-toggle="collapse">
<div class="ITEM">
TEXT
</div>
</a>
<div id="IDENTIFICATION" class="collapse">
</div>
<a style="text-decoration: none;" href="#list" data-toggle="collapse">
<div class="ITEM2">
TEXT2
</div>
</a>
<div id="IDENTIFICATION2" class="collapse">
<div><b>TITLE</b>: CONTENT</div>
<div><b>TITLE2</b>: CONTENT2</div>
</div>
</section>
I've stored it in a selector like this, because the HTML has several sections with similar structure, tags, and repeated data:
sectionA = response.xpath('//section[@id="SECTION_A"]')
Now I want to extract the ITEMs and their IDENTIFICATIONs and write them to a file.
Extracting the ITEM class names was no problem with:
item = sectionA.xpath('.//div/@class[contains(.,"ITEM")]').extract()
which returns:
[u'ITEM', u'ITEM2']
But I cannot extract the TEXT of the ITEMs. I've tried:
item = sectionA.xpath('.//div/@class[contains(.,"ITEM")]/text()').extract()
but it returns an empty list.
I'm also unable to extract the IDENTIFICATIONs. One problem is that they may have no content, or several entries, so I tried to get a selector for them from the sectionA selector like this:
identifications = sectionA.xpath('.//div/@id[contains(.,"IDENTIFICATION")]')
It returns a selector similar to sectionA, but when I search within it like this, I get nothing:
for id in identifications:
    title = signature.xpath('.//div')
I've tried several combinations like .//div/b or .//b or just .// but I got nothing.
Does anyone know how I can get the ITEM text and the IDENTIFICATION content from HTML like this?
The problem is not in the steps you applied but a logical mistake. You are not getting the text inside the 'ITEM' divs because of an extra / in your expression.
In the code that you wrote:
item = sectionA.xpath('.//div/@class[contains(.,"ITEM")]').extract()
the / before @class in .//div/@class selects the class attribute itself, so the expression means: return the value of every class attribute that contains the substring "ITEM". That is why it returns [u'ITEM', u'ITEM2']. Appending /text() then returns [], because an attribute node has no text children to select.
What you want instead is:
item = sectionA.xpath('.//div[contains(@class,"ITEM")]/text()').extract()
Here sectionA.xpath('.//div[contains(@class,"ITEM")]') yields the selectors for the div elements themselves:
[<Selector xpath='.//div[contains(@class,"ITEM")]' data=u'<div class="ITEM">'>, <Selector xpath='.//div[contains(@class,"ITEM")]' data=u'<div class="ITEM2">'>]
A similar mistake is made in the extraction of the "IDENTIFICATIONS": the expression selects the @id attributes rather than the div elements, and attribute nodes have no descendants, so .//div finds nothing inside them. In addition, .//div matches div elements at any depth below the context node, not just its direct children, which can pick up more than intended if the divs are nested. So a better way is something like the following, adapted to your requirements:
>>> identification = sectionA.xpath('.//div[contains(@id,"IDENTIFICATION")]')
>>> for id in identification:
...     print(id.xpath('div/b').extract())
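The difference between selecting an attribute and selecting the element that carries it can be tried without Scrapy. The following sketch is a stand-in, not Scrapy code: it parses the question's snippet with the standard library's xml.etree.ElementTree (which has no contains(), so the substring test is done in Python), and the function names item_texts and identifications are hypothetical. It also shows one way to split each `<b>TITLE</b>: CONTENT` line into a pair, assuming the content is always the text that directly follows the b element.

```python
import xml.etree.ElementTree as ET

DOC = """
<section id="SECTION_A">
  <h4>List</h4>
  <a style="text-decoration: none;" href="#list" data-toggle="collapse">
    <div class="ITEM">
      TEXT
    </div>
  </a>
  <div id="IDENTIFICATION" class="collapse">
  </div>
  <a style="text-decoration: none;" href="#list" data-toggle="collapse">
    <div class="ITEM2">
      TEXT2
    </div>
  </a>
  <div id="IDENTIFICATION2" class="collapse">
    <div><b>TITLE</b>: CONTENT</div>
    <div><b>TITLE2</b>: CONTENT2</div>
  </div>
</section>
"""


def item_texts(section):
    """Text of every div whose class contains 'ITEM' (the element, not the attribute)."""
    return [
        (div.text or "").strip()
        for div in section.iter("div")
        if "ITEM" in div.get("class", "")
    ]


def identifications(section):
    """Map each IDENTIFICATION div id to its (title, content) pairs; may be empty."""
    result = {}
    for div in section.iter("div"):
        div_id = div.get("id", "")
        if "IDENTIFICATION" not in div_id:
            continue
        pairs = []
        # 'div/b' only looks at direct child divs of this IDENTIFICATION div.
        for b in div.findall("div/b"):
            content = (b.tail or "").lstrip(": ").strip()
            pairs.append(((b.text or "").strip(), content))
        result[div_id] = pairs
    return result


section = ET.fromstring(DOC)
print(item_texts(section))
print(identifications(section))
```

An IDENTIFICATION block without entries simply maps to an empty list, which covers the "may not have any content" case from the question.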

Cannot find correct element with same class name

I have the following HTML snippet:
<div id="result-1">
<div class="page">
<div class="collapsingblock">
<h4>Click Me</h4>
</div>
<div class="collapsingblock collapsed">
<h4>No, Click Me</h4>
</div>
</div>
</div>
What I'm trying to do is find the second collapsingblock and its h4.
I have the following:
(//div[@id="result-1"]/div[@class="page"]/div[@class="collapsingblock"])[2]/h4
My XPath doesn't return the element. If I replace [2] with [1], it finds the first instance of collapsingblock, though.
Any ideas?
Thanks
UPDATE:
I have just noticed that the page uses JavaScript to add/remove an additional class, collapsed, on the second collapsingblock.
The problem is that the value of the class attribute of the second inner div element is not equal to "collapsingblock", as you can see:
<div class="collapsingblock collapsed">
<h4>No, Click Me</h4>
</div>
Even though class has very clear-cut semantics in HTML, it does not mean anything special to XPath; it's an attribute like any other.
Use contains() to avoid this problem:
(//div[@id="result-1"]/div[@class="page"]/div[contains(@class,"collapsingblock")])[2]/h4
Then, the only result of the expression above is
<h4>No, Click Me</h4>
By the way, parentheses around the lefthand part of the expression are not necessary in this case:
//div[@id="result-1"]/div[@class="page"]/div[contains(@class,"collapsingblock")][2]/h4
will do exactly the same, given this particular input document.
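The parentheses are necessary because of precedence: a predicate such as [2] binds more tightly than //, so without parentheses it counts position within each parent rather than within the whole result:
(//div[@id="result-1"]/div[@class="page"]/div[@class="collapsingblock"])[2]/h4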
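The gap between an exact attribute comparison and a class-token match can be reproduced in a few lines of Python. This is a sketch under assumptions, not part of the original answers: it uses the standard library's xml.etree.ElementTree (which lacks contains()), and has_class is a hypothetical helper. Note that contains(@class,"collapsingblock") is a substring test, so it would also match a class such as collapsingblock-header; the usual XPath 1.0 idiom for a whole-token match is contains(concat(' ', normalize-space(@class), ' '), ' collapsingblock '), which is what the helper mimics by splitting on whitespace.

```python
import xml.etree.ElementTree as ET

DOC = """
<div id="result-1">
  <div class="page">
    <div class="collapsingblock"><h4>Click Me</h4></div>
    <div class="collapsingblock collapsed"><h4>No, Click Me</h4></div>
  </div>
</div>
"""

root = ET.fromstring(DOC)  # root is the div with id="result-1"

# Exact comparison, like @class="collapsingblock": the second div's class
# attribute is "collapsingblock collapsed", so it does not match.
exact = root.findall("div[@class='page']/div[@class='collapsingblock']")


def has_class(element, name):
    """True if `name` is one of the whitespace-separated class tokens."""
    return name in element.get("class", "").split()


# Token-based match: both divs qualify, so index 1 is the second block.
blocks = [
    div for div in root.findall("div[@class='page']/div")
    if has_class(div, "collapsingblock")
]
second_heading = blocks[1].find("h4").text

print(len(exact), len(blocks), second_heading)
```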

Rejecting some HTML tags in BeautifulSoup

I know that it might be very simple, but I couldn't find the proper way to handle it.
I have an HTML document whose content I want to extract. The body of this document is:
<div class="articleContent">
<div class="dateblock">
<div class="textsize">
<span class="textsize_label">
Font Size</span> <a href="javascript:decreaseFontSize();"
title="Increase font-size" class="txtsizeminus"><span>-</span></a> <a href="javascript:increaseFontSize();"
title="Increase font-size" class="txtsizeplus"><span>+</span></a>
</div>
<p class="article_date">
Last Update: date
</p>
</div>
<div id="ctl00_ctl00_cpAB_cp1_cbcContentBreak">
<div class="zoomMe">
<P>The Content is here</p>
</div>
What I want is the content of the document, not the other info like "Font Size" and "Last Update". But since all of this information sits in children of "articleContent", I don't know how to get rid of it.
Note that since this additional info might change from one document to another, I cannot use simple regular expressions to remove it from the final strings. I have to filter it out while I'm processing the HTML file.
I am using the following commands to extract this part of the document and its content:
body = soup.find("div", {"class":"articleContent"})
pars= [s for s in body.strings if s.strip() != '']
So, the question is: how do I avoid having that additional info in the "pars" list?
Any ideas?
Thanks
Did you try just looking for the particular tag you want?
desired_div = soup.find("div", attrs={"class": "zoomMe"})
print(desired_div.text)
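The idea of targeting only the wanted div, rather than filtering unwanted text afterwards, can also be sketched without BeautifulSoup. The following is a standard-library stand-in using html.parser, not the BeautifulSoup API; the class name ClassTextExtractor is hypothetical and the sample is a shortened copy of the question's markup, including its unclosed outer divs.

```python
from html.parser import HTMLParser

DOC = """
<div class="articleContent">
<div class="dateblock">
<div class="textsize"><span class="textsize_label">Font Size</span></div>
<p class="article_date">Last Update: date</p>
</div>
<div id="ctl00_ctl00_cpAB_cp1_cbcContentBreak">
<div class="zoomMe">
<P>The Content is here</p>
</div>
"""


class ClassTextExtractor(HTMLParser):
    """Collect non-empty text found inside the first-level div with a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # > 0 while we are inside the wanted div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # nested div inside the wanted one
        elif self.wanted_class in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


parser = ClassTextExtractor("zoomMe")
parser.feed(DOC)
parser.close()
pars = parser.chunks
print(pars)
```

With BeautifulSoup itself, the two lines from the answer are the shorter route; the sketch only shows that selecting by class sidesteps the "Font Size" and "Last Update" text entirely.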

anchor tag not working

I am trying to create an anchor link, but it's not working in any of the browsers.
I am linking from one page to another:
<p>
View All Code Related Issues
</p>
and it goes to this page, which has 10-12 anchor tags:
<div class="grouping">
<h4 id="Code2011">
<a>Code 2011</a>
</h4>
</div>
I tried these too:
<div class="grouping">
<h4 id="Code2011">
<a id="Code2011">Code 2011</a>
</h4>
</div>
and
<div class="grouping">
<h4>
<a name="Code2011">Code 2011</a>
</h4>
</div>
but none of them work. When I go to that page and press Enter in the URL bar, it then works, so my URL seems to be fine. Any ideas?
I found that this works better. Don't know why.
<div class="grouping">
<h4>
<a name="Code2011"></a>
Code 2011
</h4>
</div>
I have found that sometimes you can mistakenly have another element with the same ID. In my case, it was an option tag, which cannot be moved into view. As such, I would recommend you try $('#yourid') to see if there are any tags unexpectedly with the same ID.
In general:
'name' is deprecated, so don't use it.
All ids must be unique, no exceptions. You cannot have duplicated ids.
Anchor ids are safest on anchor tags, although per the HTML specification an id on any element, such as <h4 id="myanchor">, can serve as a link target as long as the id is unique.
Your second example would work for you if you removed (or rename) the id from the H4 tag.
For others' future reference, I've noticed anchors not working well within some divs. They seem to work better when placed next to a recognizable page element like an image or a table row, something on the page that isn't buried within a div. I think what may happen is that, with floated elements and relative positioning, the browser can't find the precise spot of your anchor, so nothing happens.
Try:
Code 2011