Scrapy, how to extract a subtext from <b> - html

I have HTML like this:
<section id="SECTION_A">
<h4>List</h4>
<a style="text-decoration: none;" href="#list" data-toggle="collapse">
<div class="ITEM">
TEXT
</div>
</a>
<div id="IDENTIFICATION" class="collapse">
</div>
<a style="text-decoration: none;" href="#list" data-toggle="collapse">
<div class="ITEM2">
TEXT2
</div>
</a>
<div id="IDENTIFICATION2" class="collapse">
<div><b>TITLE</b>: CONTENT</div>
<div><b>TITLE2</b>: CONTENT2</div>
</div>
</section>
I've stored it in a selector using XPath like this, because the HTML has several sections with similar structure, tags and repeated data:
sectionA = response.xpath('//section[@id="SECTION_A"]')
Now, I want to extract the ITEMS and their IDENTIFICATIONS and write them into a file.
Extracting the ITEM gave no problem with:
item = sectionA.xpath('.//div/@class[contains(.,"ITEM")]').extract()
And it returns:
[u'ITEM', u'ITEM2']
But I cannot extract the TEXT of the ITEMS. I've tried:
item = sectionA.xpath('.//div/@class[contains(.,"ITEM")]/text()').extract()
But it returns an empty list.
I'm also unable to extract the IDENTIFICATIONS. One problem is that they may have no content or several entries, so I've tried to extract a selector for them from the sectionA selector like this:
identifications = sectionA.xpath('.//div/@id[contains(.,"IDENTIFICATION")]')
It returns a selector similar to sectionA, but when I try to search in it I get nothing with this:
for id in identifications:
    title = signature.xpath('.//div')
I've tried several combinations like .//div/b or .//b or just .// but I got nothing.
Does anyone know how I can get the ITEM-TEXT and IDENTIFICATIONS-CONTENT from HTML like this?

The problem is not in the steps you applied but a logical mistake: you get no text inside the 'ITEM' divs because of an extra / in your expression.
In the code that you wrote:
item = sectionA.xpath('.//div/@class[contains(.,"ITEM")]').extract()
This returns [u'ITEM', u'ITEM2'] because the / before @class in //div/@class selects the attribute itself, which basically means: return the value of every class attribute that contains the substring "ITEM". Once the expression points at the attribute, appending /text() returns [], because an attribute node has no text node below it.
What you want instead is:
item = sectionA.xpath('.//div[contains(@class,"ITEM")]/text()').extract()
Here the output of sectionA.xpath('.//div[contains(@class,"ITEM")]') is a list of selectors for the div elements themselves:
[<Selector xpath='.//div[contains(@class,"ITEM")]' data=u'<div class="ITEM">'>, <Selector xpath='.//div[contains(@class,"ITEM")]' data=u'<div class="ITEM2">'>]
A similar mistake is made in the extraction of the "IDENTIFICATIONS": .//div/@id[...] selects the id attributes, and an attribute has no child elements, so any further search from it finds nothing. There is one more logical problem. Using // in title = signature.xpath('.//div') is not ideal, since it matches div elements at any depth below the current node rather than only its direct children, so it can also pick up nested divs you did not intend. A better way is to select the div elements themselves and then use a path relative to each of them, something like the following, as per your requirement:
>>> identification = sectionA.xpath('.//div[contains(@id,"IDENTIFICATION")]')
>>> for id in identification:
...     print(id.xpath('div/b').extract())
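If you want to check the logic outside Scrapy, here is a minimal sketch using only Python's standard xml.etree.ElementTree. This is an assumption on my part, not what Scrapy does internally: ElementTree supports only a subset of XPath (no contains()), so the substring tests are done in Python, and the markup is a trimmed, well-formed copy of the one in the question. It also shows where the text after </b> lives: in ElementTree it is the tail of the b element.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the question's markup.
HTML = """
<section id="SECTION_A">
  <a href="#list"><div class="ITEM">TEXT</div></a>
  <div id="IDENTIFICATION" class="collapse"></div>
  <a href="#list"><div class="ITEM2">TEXT2</div></a>
  <div id="IDENTIFICATION2" class="collapse">
    <div><b>TITLE</b>: CONTENT</div>
    <div><b>TITLE2</b>: CONTENT2</div>
  </div>
</section>
"""

section = ET.fromstring(HTML)

# Same idea as .//div[contains(@class,"ITEM")]/text():
# select the div elements, then read their text.
items = [div.text.strip()
         for div in section.iter("div")
         if "ITEM" in div.get("class", "")]

# Same idea as .//div[contains(@id,"IDENTIFICATION")], followed by
# the relative path div/b inside each matched block.
identifications = {}
for block in section.iter("div"):
    block_id = block.get("id", "")
    if "IDENTIFICATION" not in block_id:
        continue
    pairs = []
    for b in block.findall("div/b"):
        # The text that follows </b> (": CONTENT") is the element's tail.
        content = (b.tail or "").lstrip(": ").strip()
        pairs.append((b.text, content))
    identifications[block_id] = pairs

print(items)
print(identifications)
```

An IDENTIFICATION block with no entries simply yields an empty list, so the "may not have any content" case needs no special handling.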

Related

Find xpath nearest element

I need to build an XPath for an element that comes before a known element on the page. I have a string (FIO) that I can find using XPath, and I need to anchor on it. I don't understand how to do it.
The XPath with which I can find it on the page:
/html/body/div[1]/div[2]/section/div/div[1]/div/ul/li[2]//div[1]/span[contains(., '"+FIO+"')]
Looking at the screenshot, I need to find string 1, which has this XPath:
/html/body/div[1]/div[2]/section/div/div[1]/div/ul/li[2]/ul/li[4]/ul/li[1]/div/div/a
image
The string with my parameter (FIO), string 2, has this XPath:
/html/body/div[1]/div[2]/section/div/div[1]/div/ul/li[2]/ul/li[4]/ul/li[1]/div/div/div[1]/span
I shortened it and inserted a variable:
/html/body/div[1]/div[2]/section/div/div[1]/div/ul/li[2]//div[1]/span[contains(., '"+FIO+"')]
How can I get the XPath to element 2 relative to element 1? Maybe with following-sibling?
Sorry, I can't copy the code correctly, only like this:
</div>
</div>
<ul>
<li>
<div class="structure2__item1">
<div class="structure2__item2" style="">
<a class="structure2__position" href=https://**>
"String 2"
</a>
<div class="structure2__name" style="">
<span>String_FIO</span>
</div>
</div>
</div>
</li>
<li>
//div[child::span[contains(text(), "String_FIO")]]/preceding-sibling::a
This would help fetch the a tag from the span.
(From next time - please look out for the standards mentioned in the comments.)
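A rough way to try the idea in plain Python, assuming a well-formed copy of the fragment: the standard xml.etree.ElementTree has no preceding-sibling axis, so this sketch approximates it by finding the div that holds the span, stepping up to its parent with .., and taking the a child there. The href value and the variable name fio are placeholders, not taken from the real page.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the fragment; the href is a placeholder.
HTML = """
<ul>
  <li>
    <div class="structure2__item1">
      <div class="structure2__item2">
        <a class="structure2__position" href="https://example.invalid">String 2</a>
        <div class="structure2__name"><span>String_FIO</span></div>
      </div>
    </div>
  </li>
</ul>
"""

root = ET.fromstring(HTML)
fio = "String_FIO"  # hypothetical variable holding the known string

# div that has a <span> child with exactly this text -> its parent -> the <a> child.
link = root.find(f".//div[span='{fio}']/../a")

print(link.text)
```

Note that this picks any a child of the shared parent, not only one that precedes the div, so it matches the XPath above only when the markup looks like this fragment.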

Cannot find correct element with same class name

I have the following HTML snippet:
<div id="result-1">
<div class="page">
<div class="collapsingblock">
<h4>Click Me</h4>
</div>
<div class="collapsingblock collapsed">
<h4>No, Click Me</h4>
</div>
</div>
</div>
What I'm trying to do is find the second collapsingblock and its h4.
I have the following:
(//div[@id="result-1"]/div[@class="page"]/div[@class="collapsingblock"])[2]/h4
My xPath doesn't return the element. If I replace it with [1] it finds the first instance of collapsingblock though
Any ideas?
Thanks
UPDATE:
I have just noticed that the page uses JavaScript to add/remove an additional class, collapsed, on the second collapsingblock.
The problem is that the value of the class attribute of the second inner div element is not equal to "collapsingblock", as you can see:
<div class="collapsingblock collapsed">
<h4>No, Click Me</h4>
</div>
Even though class has very clear-cut semantics in HTML, it does not mean anything special to XPath, it's an attribute like any other.
Use contains() to avoid this problem:
(//div[@id="result-1"]/div[@class="page"]/div[contains(@class,"collapsingblock")])[2]/h4
Then, the only result of the expression above is
<h4>No, Click Me</h4>
By the way, parentheses around the left-hand part of the expression are not necessary in this case:
//div[@id="result-1"]/div[@class="page"]/div[contains(@class,"collapsingblock")][2]/h4
will do exactly the same, given this particular input document.
In general, though, the parentheses matter because of precedence: without them, [2] counts position within each parent rather than within the whole result:
(//div[@id="result-1"]/div[@class="page"]/div[@class="collapsingblock"])[2]/h4
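To see the difference between the exact comparison and a class-aware one, here is a minimal sketch with Python's standard xml.etree.ElementTree. It assumes Python is an option for you; ElementTree supports only a subset of XPath (no contains()), so the class test is done in Python on the whitespace-separated class tokens instead.

```python
import xml.etree.ElementTree as ET

HTML = """
<div id="result-1">
  <div class="page">
    <div class="collapsingblock"><h4>Click Me</h4></div>
    <div class="collapsingblock collapsed"><h4>No, Click Me</h4></div>
  </div>
</div>
"""

root = ET.fromstring(HTML)
page = root.find("div[@class='page']")

# Exact attribute comparison: "collapsingblock collapsed" is a different string,
# so only the first block matches.
exact = page.findall("div[@class='collapsingblock']")

# Class-aware comparison: split the attribute into tokens and test membership.
by_token = [div for div in page.findall("div")
            if "collapsingblock" in div.get("class", "").split()]

print(len(exact), len(by_token))
print(by_token[1].find("h4").text)
```

Splitting into tokens is slightly stricter than contains(): a substring test would also match a class such as collapsingblock-header, while the token test would not.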

Change CSS through div and not class

I'm using an app that allows me to change only the CSS of a webpage.
The devs created a step_btn class for 2 distinct buttons (previous and next button) and I want to change the CSS of each button separately.
I tried to make the changes through the div id but nothing happened...
Here is the page code :
<div id="j_id0:j_id73" class="navbuttonsContainer">
<div id="j_id0:j_id74" style="height:35px;overflow:visible;position:relative;" class="rowElem">
<div id="j_id0:j_id78" class="btnWrapper">
<a class="step_btn" onclick="saveAndGoTo('next')" title="Next Page">
<span id="j_id0:nextBtn">Next Page</span>
</a>
</div>
<div id="j_id0:j_id81"></div>
</div>
</div>
Here are my tries :
#j_id0:j_id78
div#j_id0:j_id78 {}
div#j_id0:j_id78 .step_btn {}
Is it my attempts that failed, or does the webpage not interpret my code correctly?
Thanks for your help!
Use an attribute selector to enclose the id as a string, e.g.
[id="j_id0:j_id78"] .step_btn { ... }
In fact, with the id selector, as in #j_id0:j_id78, the parser treats j_id0 as the id and :j_id78 as a pseudo-class.
Note: there is no need to specify the div element in front of the attribute selector, since the id should be unique in the page.

Xpath get element with condition

I have a block of markup and need to get data out of it. I have tried different versions of XPath expressions, but with no success.
<div>
<div class="some_class">
<a title="id" href="some_href">
<nobr>1<br>
</a>
</div>
<div class="some_other_class">
<a title="name" href="some_href">
<nobr>John<br>
</a>
</div>
</div>
<div>
<div class="some_class">
<a title="id" href="some_href">
<nobr>2<br>
</a>
</div>
<div class="some_other_class">
<a title="name" href="some_href">
<nobr>John<br>
</a>
</div>
</div>
// and many blocks like this
So these div blocks are the same except for the content of their sub-elements. I need an XPath query to get the href of John's link in the block whose <a title="id"> is equal to 1.
I've tried something like this:
//div[./div/nobr='1' AND ./div/nobr='John']
to get only the div that contains the data I need; then it wouldn't be hard to get John's href.
Also, I've managed to get John's href with:
//a[./nobr='John'][@title='name']/@href
but that way it doesn't depend on the value of the <a title="id"...> element, and it has to depend on it.
Any suggestions?
I think what you want is
//div/div[a/@title='id']/following-sibling::div[1]/a/@href
which, given a well-formed input document, will return (individual results separated by --------):
href="some_href"
-----------------------
href="some_href"
You did not explain it very clearly though, as kjhughes has noted, and perhaps your sample HTML is not ideal.
Regarding your attempted path expressions, as the input is HTML, it is hard to know whether
<nobr>John<br>
means that "John" is inside the nobr element or not.
Thanks Mathias, your example was helpful, but as there are many elements with @title='id' it isn't a reliable solution that will always catch the right elements.
I've managed to make a workaround: first catch the whole div, and then extract the href I need.
//div[./div/a[@title='name']/nobr='John' and ./div/a[@title='id']/nobr='1']
//a[./nobr='John'][@title='name']/@href
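The two-step workaround can be sketched in Python with the standard xml.etree.ElementTree. This is only an illustration under assumptions: the markup is rewritten to be well-formed with the text inside closed nobr elements, the href values are made distinct so the result can be checked, and find_href is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the question's markup; the distinct href values
# and the closed <nobr> elements are assumptions for the sake of the example.
HTML = """
<root>
  <div>
    <div class="some_class"><a title="id" href="id_1"><nobr>1</nobr></a></div>
    <div class="some_other_class"><a title="name" href="john_1"><nobr>John</nobr></a></div>
  </div>
  <div>
    <div class="some_class"><a title="id" href="id_2"><nobr>2</nobr></a></div>
    <div class="some_other_class"><a title="name" href="john_2"><nobr>John</nobr></a></div>
  </div>
</root>
"""

def find_href(root, wanted_id, wanted_name):
    """Return the href of the name link in the block whose id link matches."""
    for block in root.findall("div"):
        # Step 1: look at both links inside the same outer div.
        id_link = block.find("div/a[@title='id']")
        name_link = block.find("div/a[@title='name']")
        if id_link is None or name_link is None:
            continue
        # Step 2: both conditions must hold within this one block.
        if (id_link.findtext("nobr", "").strip() == wanted_id
                and name_link.findtext("nobr", "").strip() == wanted_name):
            return name_link.get("href")
    return None

root = ET.fromstring(HTML)
print(find_href(root, "1", "John"))
```

Checking both links inside the same outer div is what ties John's href to the id value, which the single //a[...] expression on its own does not do.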

Rejecting some HTML tags in BeautifulSoup

I know that it might be very simple, but I couldn't find the proper way to handle it.
I have an HTML document whose content I want to extract. The body of this document is:
<div class="articleContent">
<div class="dateblock">
<div class="textsize">
<span class="textsize_label">
Font Size</span> <a href="javascript:decreaseFontSize();"
title="Increase font-size" class="txtsizeminus"><span>-</span></a> <a href="javascript:increaseFontSize();"
title="Increase font-size" class="txtsizeplus"><span>+</span></a>
</div>
<p class="article_date">
Last Update: date
</p>
</div>
<div id="ctl00_ctl00_cpAB_cp1_cbcContentBreak">
<div class="zoomMe">
<P>The Content is here</p>
</div>
What I want is the content of the document, not the other info like "Font Size" and "Last Update". But since all of this information sits under "articleContent", I don't know how to get rid of it.
I have to note that since these additional info might change from one document to another, I can not use simple regular expressions to remove them from the final strings. I have to filter them out while I'm processing the HTML file.
I have to add that I am using the following commands to extract this part of the document, and also its content:
body = soup.find("div", {"class":"articleContent"})
pars= [s for s in body.strings if s.strip() != '']
So, the question is: how do I avoid having that additional info in the "pars" list?
Any ideas?
Thanks
Did you try just looking for the particular tag you want?
desired_div = soup.find("div", attrs={"class": "zoomMe"})
print(desired_div.text)