XPath to select all paragraphs between two headers? - html

I am trying to select all p elements located between two h5 elements. The first h5 contains the text "SUBJECT" and the second h5 contains the text "Tender’s Files,".
I don't want the other p elements that come after the second h5.
I have tried the following XPath, adapted from another answer:
//p[preceding-sibling::h5//*[contains(text() , 'SUBJECT')] and following-sibling::h5//*[contains(text() , 'Tender’s Files,')]]
but it does not return the right paragraphs. It still selects other paragraphs after the second h5.
<div>
<table class="table table-striped table-bordered table-hover" width="90%">
<tbody>
<tr>
<td style="vertical-align: middle;" colspan="2" width="90%">
<h5 style="padding-left: 10px;"><strong><span style="color: #3577be;">Tender Title:</span> Testing of Non-Fortified Wheat Flour in NES</strong></h5>
</td>
</tr>
<tr>
<td style="vertical-align: middle;" width="45%">
<h5 style="padding-left: 10px;"><strong><span style="color: #3577be;">Tender No:</span> SYRIA-TA-2021-005</strong></h5>
</td>
<td style="vertical-align: middle;">
<h5 style="padding-left: 10px;"><strong><span style="color: #3577be;">Location:</span> North East Syria</strong></h5>
</td>
</tr>
<tr>
<td style="vertical-align: middle;" colspan="2">
<h5 style="padding-left: 10px;"><strong><span style="color: #3577be;">Tender Package Available from:</span> 2021-01-10</strong></h5>
</td>
</tr>
<tr>
<td style="vertical-align: middle;" colspan="2">
<h5 style="padding-left: 10px;"><strong><span style="color: #3577be;">Deadline for Offer Submission:</span> 2021-01-18 17:00 (Iraqi Time)</strong></h5>
</td>
</tr>
</tbody>
</table>
<table class="table " width="90%">
<tbody>
<tr>
<td style="text-align: center;"> </td>
</tr>
</tbody>
</table>
<h5><strong><u>SUBJECT:</u></strong> <strong>Testing of Non-Fortified Wheat Flour in NES</strong></h5>
<p>Our organization, a non-profit organization, provides humanitarian assistance to “people in need”, is seeking quotations from eligible contractors to <strong>Testing of Non-Fortified Wheat Flour in NES</strong>. Our organization anticipates awarding Multiple or Single contract(s) as a result of this Solicitation. Our organization reserves the right to award more or none under this RFQ.</p>
<p>All bids shall be submitted <strong>via e-mail to</strong> <span id="cloak1f9ac73a082c1f52174ccee4f406b81c"><strong>Syr-tendering#blumont.org</strong></span> <strong>as PDF format and clearly written the subject of the tender</strong> This RFQ is in no way obligates our organization Our organization to award a contract nor does it commit our organization to pay any cost incurred in the preparation and submission of a proposal.</p>
<p>Our organization bears no responsibility for data errors resulting from transmission or conversion processes.</p>
<p> </p>
<ul>
<li><strong>To help us with our procurement effort, please indicate in your email where (ngotenders.net) you saw this tender/procurement notice.</strong></li>
</ul>
<p><strong>Sincerely</strong></p>
<p><strong>Procurement Committee</strong></p>
<h5><strong>Tender’s Files,</strong></h5>
<h5><strong>5ffb04ba52a49-005-announcement.zip, </strong></h5>
<hr>
<h5 dir="rtl"><strong><u>الموضوع</u></strong><strong><u>:</u></strong> <strong>فحص الطحين الغير مدعم في شمال شرق سوريا.</strong><strong> </strong></h5>
<p dir="rtl">منظمتنا و هي منظمة غير ربحية تعمل لخدمة المنكوبين في العالم و تسعى للحصول على عروض أسعار من المقاولين المؤهلين لغرض الموضوع: <strong>فحص الطحين الغير مدعم في شمال شرق سوريا.</strong> وتتوقع منظمتنا منح (عقود) متعددة أو مفردة نتيجة لهذا الطلب. وتحتفظ منظمتنا بالحق في منح التعاقد بأكثر أو أقل من المتوقع للطلب أعلاه.</p>
<p dir="rtl">لهذا الطلب. وتحتفظ منظمتنا بالحق في منح التعاقد بأكثر أو أقل من المتوقع للطلب أعلاه.</p>
<p dir="rtl"> يجب على جميع مقدمي العطاءات تقديم العروض عبر الايميل :<strong>عبر الايميل: </strong><span id="cloakc42a61e471daa10a7992dbd8b44f9b26"><strong>Syr-tendering#blumont.org</strong></span> <strong>و بصيغة</strong><strong> PDF</strong> و تم التوضيح للموضوع المناقصة بان المنظمة لا تلتزم بأي حال من الأحوال بمنح العقد كما أن المنظمة لا تلتزم بدفع أي تكاليف متكبدة في إعداد وتقديم العرض.</p>
<p dir="rtl">كما ان منظمتنا لا تتحمل أية مسؤولية عن أي أخطاء في البيانات الناتجة عن عمليات النقل أو التحويل او المحادثة.</p>
<p dir="rtl">
</p><p dir="rtl"><strong>مع فائق الاحترام و التقدير</strong></p>
<p dir="rtl"><strong>لجنة المشتريات</strong></p>
<h5><strong>Tender’s Files,</strong></h5>
<h5><strong>5ffb04ba52a49-005-announcement.zip, </strong></h5>
</div>

Using techniques from the following Q/A:
XPath to select all elements between two headings?
Testing text() nodes vs string values in XPath
The following XPath,
//p[ preceding-sibling::h5[starts-with(normalize-space(),'SUBJECT:')]
and following-sibling::h5[normalize-space()='Tender’s Files,']]
will select all p elements between your two targeted headlines, as requested.
Update after OP included actual markup:
Your actual markup includes duplicate
<h5><strong>Tender’s Files,</strong></h5>
headings. The above XPath will select through to the last such heading.
If you want to select through only the first such heading, use this XPath instead:
//p[ preceding-sibling::h5[starts-with(normalize-space(),'SUBJECT:')]
and following-sibling::h5[normalize-space()='Tender’s Files,']
and not(preceding-sibling::h5[normalize-space()='Tender’s Files,'])]
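As a minimal sketch of how the second XPath could be checked from Python, assuming the lxml library is available: the SAMPLE markup below is a trimmed-down stand-in for the posted page (the paragraph texts and the "Other section" heading are made up), kept only to show that paragraphs after the first "Tender’s Files," heading are excluded.

```python
from lxml import html

# Trimmed-down stand-in for the page: two sections, each closed by the
# same "Tender’s Files," heading, as in the posted markup.
SAMPLE = """
<div>
  <h5><strong><u>SUBJECT:</u></strong> <strong>Testing</strong></h5>
  <p>first</p>
  <p>second</p>
  <h5><strong>Tender’s Files,</strong></h5>
  <hr>
  <h5><strong>Other section</strong></h5>
  <p>third</p>
  <h5><strong>Tender’s Files,</strong></h5>
</div>
"""

# Same expression as above, split over several string literals.
XPATH = (
    "//p[preceding-sibling::h5[starts-with(normalize-space(),'SUBJECT:')]"
    " and following-sibling::h5[normalize-space()='Tender’s Files,']"
    " and not(preceding-sibling::h5[normalize-space()='Tender’s Files,'])]"
)

tree = html.fromstring(SAMPLE)
paragraphs = [p.text_content() for p in tree.xpath(XPATH)]
print(paragraphs)
```

The third paragraph is dropped because it already has a "Tender’s Files," heading among its preceding siblings.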

Your xpath should work if you add this:
//p[preceding-sibling::h5//*[contains(text() , 'SUBJECT')] and (following-sibling:: h5//*[contains(text() , 'Tender’s Files,')])[2]]

Related

How to filter url links with criteria via beautifulsoup? is it possible? YES indeed

There are always some new posts in any forum. The one I visit gives a "new" sticker to such posts. How do I filter and retrieve the URLs with "new" stickers? Tricky...
I usually just grab everything off the first page, but that seems unprofessional. There are also author and date stickers in each section. Can these be used as filtering criteria via BeautifulSoup? I feel I have so much to learn.
This is the DOM:
<!-- 三級置頂分開 -->
<tbody id="stickthread_10432064">
<tr>
<td class="folder"><img src="images/green001/folder_new.gif"/></td>
<td class="icon">
  </td>
<th class="new">
<label>
<img alt="" src="images/green001/agree.gif"/>
<img alt="本版置顶" src="images/green001/pin_1.gif"/>
 </label>
<em>[痴女]</em> <span id="thread_10432064">(セレブの友)(CESD-???)大槻ひびき</span>
<img alt="附件" class="attach" src="images/attachicons/common.gif"/>
<span class="threadpages"> <img src="images/new2.gif"/></span> ### new sticker
</th>
<td class="author"> ### author sticker
<cite>
新片<img align="absmiddle" border="0" src="images/thankyou.gif"/>12 </cite>
<em>2019-4-23</em> ### date sticker
</td>
<td class="nums"><strong>6</strong> / <em>14398</em></td>
<td class="nums">7.29G / MP4
</td>
<td class="lastpost">
<em>2019-4-25 14:11</em>
<cite>by 22811</cite>
</td>
</tr>
</tbody><!-- 三級置頂分開 -->
Let me put it another way, since I didn't express myself well. For example, I want to find every tbody that has an author of 新片, or a date of 2019-4-23, or a sticker image "images/new2.gif". I would presumably get a list of tbody elements, and then I want to find the hrefs in them via
blue = soup.find_all('a', style="font-weight: bold;color: blue")
Thanks chiefs!
There is a class new so I am wondering if you could just use that? That would be:
items = soup.select('tbody:has(.new)')
for item in items:
    print([i['href'] for i in item.select('a')])
Otherwise, you can use :has and :contains pseudo classes (bs4 4.7.1) to specify those patterns
items = soup.select('tbody:has(.author a:contains("新片")), tbody:has(em:contains("2019-4-23")), tbody:has([src="images/new2.gif"])')
You can then get hrefs with a loop
for item in items:
    print([i['href'] for i in item.select('a')])
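To make the combined selector concrete, here is a small self-contained sketch. The two-thread SAMPLE listing, its a tags, and the href values are invented for illustration and are not taken from the posted DOM. It also assumes a recent soupsieve, where the text pseudo-class is spelled :-soup-contains; older releases only know the :contains spelling used above.

```python
from bs4 import BeautifulSoup

# Hypothetical two-thread listing; the <a> tags and hrefs are made up
# for illustration and do not come from the posted DOM.
SAMPLE = """
<table>
<tbody id="t1"><tr>
  <th class="new"><a href="thread-1.html">first</a>
    <span class="threadpages"><img src="images/new2.gif"/></span></th>
  <td class="author"><cite>新片</cite><em>2019-4-23</em></td>
</tr></tbody>
<tbody id="t2"><tr>
  <th class="common"><a href="thread-2.html">second</a></th>
  <td class="author"><cite>other</cite><em>2019-4-20</em></td>
</tr></tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# A tbody qualifies if ANY of the three criteria holds (comma = OR).
selector = (
    'tbody:has(.author cite:-soup-contains("新片")), '
    'tbody:has(.author em:-soup-contains("2019-4-23")), '
    'tbody:has(img[src="images/new2.gif"])'
)

hrefs = [
    a["href"]
    for tbody in soup.select(selector)
    for a in tbody.select("a[href]")
]
print(hrefs)
```

Only the first tbody matches, so only its link is collected. Dropping one of the comma-separated parts narrows the filter to the remaining criteria.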
First find the parent tag, then its next sibling, and then the tag you want inside it. Try the code below.
from bs4 import BeautifulSoup
import re
data='''<tbody id="stickthread_10432064">
<tr>
<td class="folder"><img src="images/green001/folder_new.gif"/></td>
<td class="icon">
</td>
<th class="new">
<label>
<img alt="" src="images/green001/agree.gif"/>
<img alt="本版置顶" src="images/green001/pin_1.gif"/>
</label>
<em>[痴女]</em> <span id="thread_10432064">(セレブの友)(CESD-???)大槻ひびき</span>
<img alt="附件" class="attach" src="images/attachicons/common.gif"/>
<span class="threadpages"> <img src="images/new2.gif"/></span> ### new sticker
</th>
<td class="author"> ### author sticker
<cite>
新片<img align="absmiddle" border="0" src="images/thankyou.gif"/>12 </cite>
<em>2019-4-23</em> ### date sticker
</td>
<td class="nums"><strong>6</strong> / <em>14398</em></td>
<td class="nums">7.29G / MP4
</td>
<td class="lastpost">
<em>2019-4-25 14:11</em>
<cite>by 22811</cite>
</td>
</tr>
</tbody>'''
soup=BeautifulSoup(data,'html.parser')
for item in soup.find_all('img',src=re.compile('images/new')):
    parent=item.parent.parent
    print(parent.find_next_siblings('td')[0].find('a').text)
    print(parent.find_next_siblings('td')[0].find('em').text)

How to extract different types of bold text and the text in between them using BeautifulSoup?

I have to parse HTML documents that use bold text as section identifiers, but the bold text comes in different forms; some examples are shown below.
Using Beautiful Soup I am able to parse them, but I have to write a lot of if/else branches to handle the different kinds of bold. Is there a better way to find such bold text, and the text in between, without so many if/else branches?
<div style="line-height:120%;padding-bottom:12px;font-size:10pt;">
<font style="font-family:inherit;font-size:10pt;font-weight:bold;">List 1. Work</font>
</div>
<td style="vertical-align:top;padding-left:2px;padding-top:2px;padding-bottom:2px;padding-right:2px;">
<div style="text-align:left;font-size:10pt;">
<font style="font-family:inherit;font-size:10pt;font-weight:bold;">List 1.</font>
</div>
</td>
<td style="vertical-align:top;padding-left:2px;padding-top:2px;padding-bottom:2px;padding-right:2px;">
<div style="text-align:left;font-size:10pt;">
<font style="font-family:inherit;font-size:10pt;font-weight:bold;">Work.</font>
</div>
</td>
<p style="font-family:times;text-align:justify">
<font size="2">
<a name="de42901_List_1._Work"> </a>
<a name="toc_de42901_2"> </a>
</font>
<font size="2"><b> List 1. Work <br> </b></font>
</p>
<p style="font-family:times;text-align:justify">
<font size="2">
<a name="da18101_List_1._Work"> </a>
<a name="toc_da18101_3"> </a>
</font>
<font size="2"><b> List 1. </b></font>
<font size="2"><b><i>Work <br> </i></b></font>
</p>
Use split and join to collapse unwanted whitespace such as \n, \t, and &nbsp;:
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
for b in soup.find_all('b'):
    # split() with no arguments splits on any run of whitespace
    final = ' '.join(b.text.split())
    print(final)
This normalizes every bold string to the same format. Hope it resolves your query.
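The snippet above only sees b tags, while the question also shows bold text expressed through a font-weight:bold style. One possible way to cover both without a chain of if/else is a single predicate passed to find_all. This is only a sketch: the SAMPLE markup is a shortened stand-in for the question's examples, and is_bold and BOLD_STYLE are hypothetical names introduced here.

```python
import re
from bs4 import BeautifulSoup

# Shortened stand-in for the question's three styles of bold text.
SAMPLE = """
<div><font style="font-family:inherit;font-weight:bold;">List 1. Work</font></div>
<p><font size="2"><b> List 1. <br> </b></font><font size="2">plain text</font></p>
<p><strong>List 2.</strong></p>
"""

BOLD_STYLE = re.compile(r"font-weight\s*:\s*bold", re.I)


def is_bold(tag):
    """True for <b>/<strong> tags and for any tag styled font-weight:bold."""
    return tag.name in ("b", "strong") or bool(
        BOLD_STYLE.search(tag.get("style", ""))
    )


soup = BeautifulSoup(SAMPLE, "html.parser")
headings = []
for tag in soup.find_all(is_bold):
    # Skip bold tags nested inside another bold tag to avoid duplicates.
    if tag.find_parent(is_bold):
        continue
    # Collapse newlines, tabs and non-breaking spaces into single spaces.
    headings.append(" ".join(tag.get_text().split()))
print(headings)
```

The text between two bold markers could then be gathered by walking the siblings that follow each matched tag, which is left out here to keep the sketch short.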

BeautifulSoup4 and HTML

I want to extract from the following html code, the following information using python and bs4;
h2 class placename value,
span class value,
div class="aithousaspec" value
<div class="results-list">
<div class="piatsaname">city center</div>
<table>
<tr class="trspacer-up">
<td>
<a href="hall.aspx?id=1001173">
<h2 class="placename">ARENA
<span class="boldelement"><img src="/images/sun.png" height="16" valign="bottom" style="padding:0px 3px 0px 10px" >Θερινός<br>
25 Richmond Avenue st, Leeds</span>
</h2>
<p>
+4497XXXXXXX<br>
STEREO SOUND
</p>
Every Monday 2 tickets 8,00 pounds
</a>
</td>
</tr>
<tr class="trspacer-down">
<td>
<p class="coloredelement">Italian Job</p>
<div class="aithousaspec">
<b></b> Thu.-Wed.: 20.50/ 23.00
<b></b>
</div>
The code that I'm using doesn't seem efficient:
# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
print(soup.prettify())
mydivs = soup.select('div.results-list')
for info in mydivs:
    time = info.select('div.aithousaspec')
    print(time)
    listCinemas = info.select("a[href*=hall.aspx]")
    print(listCinemas)
    print(len(listCinemas))
    for times in time:
        proj = times.find('div.aithousaspec')
        print(proj)
    for names in listCinemas:
        theater = names.find('h2', class_='placename')
        print(names.find('h2').find(text=True).strip())
        print(names.find('h2').contents[1].text.strip())
Is there any better way to get the mentioned info?
data = '''<div class="results-list">
<div class="piatsaname">city center</div>
<table>
<tr class="trspacer-up">
<td>
<a href="hall.aspx?id=1001173">
<h2 class="placename">ARENA
<span class="boldelement"><img src="/images/sun.png" height="16" valign="bottom" style="padding:0px 3px 0px 10px" >Θερινός<br>
25 Richmond Avenue st, Leeds</span>
</h2>
<p>
+4497XXXXXXX<br>
STEREO SOUND
</p>
Every Monday 2 tickets 8,00 pounds
</a>
</td>
</tr>
<tr class="trspacer-down">
<td>
<p class="coloredelement">Italian Job</p>
<div class="aithousaspec">
<b></b> Thu.-Wed.: 20.50/ 23.00
<b></b>
</div>'''
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(data, 'lxml')
print(soup.select('h2.placename')[0].contents[0].strip())
print(re.sub(r'\s{2,}', ' ', soup.select('span.boldelement')[0].text.strip()))
print(soup.select('div.aithousaspec')[0].text.strip())
This will print:
ARENA
Θερινός 25 Richmond Avenue st, Leeds
Thu.-Wed.: 20.50/ 23.00

How to match and replace multiline html file with sed

I have a text file something like this.
<tbody>
<tr>
<td>
String1
</td>
<td>
String2
</td>
<td>
String3
</td>
...
...
<td>
StringN
</td>
</tr>
</tbody>
This is the output that I want.
<tbody>
<tr>
String1;String2;String3;... ...;StringN
</tr>
</tbody>
Here is my BUGGY code.
sed '{
:a
N
$!ba
s|<td.*>\(.*\)</td>|\1|
}'
I wanted to remove all <td> and </td> tags and get all the strings delimited by some string (I can filter those strings later using that as the delimiter character). I used the solution given in this URL. Output does not come as I expected.
This is the actual Code
<tbody>
<tr>
<td>
120.52.72.58:80
</td>
<td>
HTTP
</td>
<td>
<span class="text-danger">Transparent</span>
</td>
<td>
<abbr title="2016-12-15 00:07:46">12h ago</abbr>
</td>
<td class="small">
<span class="text-muted">—</span>
</td>
<td>
<img src="/flags/png/cn.png" alt="China (CN)" title="China (CN)" onerror="this.style.display='none'"> <abbr title="China">CN</abbr>
</td>
<td class="small">
Beijing
</td>
<td class="small">
Beijing
</td>
<td class="small">
China Unicom IP network
</td>
<td class="small">
<span class="text-muted">—</span>
</td>
</tr>
</tbody>
Output does not come as I expected.
Your sed code does not work because the <td.*>\(.*\)</td> matches the part of the pattern space from the first <td up to the last </td> due to the greediness of the * quantifier. Unfortunately, sed doesn't support a more modern regex flavor with ungreedy quantifiers; thus, some other tool would be more appropriate.
I wanted to remove all <td> and </td> tags and get all the strings delimited by some string …
If those tags are always (as in your examples) on a separate line, we can do with a simple sed command:
sed '/<\/*td.*>/d'
All the strings are thereafter delimited by some string which is \n followed by spaces.
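If switching tools is acceptable, the non-greedy matching that sed lacks is available in Python's re module. The following is a rough sketch, not a general HTML parser: it swaps sed for Python, assumes td cells are never nested, and uses a shortened SAMPLE in place of the real table. The names CELL, TAG and join_cells are hypothetical.

```python
import re

# Shortened stand-in for the table in the question.
SAMPLE = """<tbody>
<tr>
<td>
String1
</td>
<td class="small">
<span class="text-muted">String2</span>
</td>
<td>
String3
</td>
</tr>
</tbody>"""

# Non-greedy .*? stops at the first </td>, which is exactly what the
# greedy .* in the sed attempt could not do.
CELL = re.compile(r"<td\b[^>]*>(.*?)</td>", re.S)
TAG = re.compile(r"<[^>]+>")


def join_cells(row_match):
    """Replace the cells of one <tr> with their text joined by ';'."""
    cells = [TAG.sub("", c).strip() for c in CELL.findall(row_match.group(2))]
    return row_match.group(1) + "\n" + ";".join(cells) + "\n" + row_match.group(3)


result = re.sub(r"(<tr\b[^>]*>)(.*?)(</tr>)", join_cells, SAMPLE, flags=re.S)
print(result)
```

For the sample this leaves the tbody and tr tags in place and puts String1;String2;String3 on one line between them. Markup nested inside a cell, such as the span, is stripped before joining.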

Using ruby and nokogiri to parse HTML using HTML comments as markers

How could I use ruby to extract information from a table consisting of these rows? Is it possible to detect the comments using nokogiri?
<!-- Begin Topic Entry 4134 -->
<tr>
<td align="center" class="row2"><image src='style_images/ip.boardpr/f_norm.gif' border='0' alt='New Posts' /></td>
<td align="center" width="3%" class="row1"> </td>
<td class="row2">
<table class='ipbtable' cellspacing="0">
<tr>
<td valign="middle"><alink href='http://www.xxx.com/index.php?showtopic=4134&view=getnewpost'><image src='style_images/ip.boardpr/newpost.gif' border='0' alt='Goto last unread' title='Goto last unread' hspace=2></a></td>
<td width="100%">
<div style='float:right'></div>
<div> <alink href="http://www.xxx.com/index.php?showtopic=4134&hl=">EXTRACT LINK 1</a> </div>
</td>
</tr>
</table>
<span class="desc">EXTRACT DESCRIPTION</span>
</td>
<td class="row2" width="15%"><span class="forumdesc"><alink href="http://www.xxx.com/index.php?showforum=19" title="Living">EXTRACT LINK 2</a></span></td>
<td align="center" class="row1" width='10%'><alink href='http://www.xxx.com/index.php?showuser=1642'>Mr P</a></td>
<td align="center" class="row2"><alink href="javascript:who_posted(4134);">1</a></td>
<td align="center" class="row1">46</td>
<td class="row1"><span class="desc">Today, 12:04 AM<br /><alink href="http://www.xxx.com/index.php?showtopic=4134&view=getlastpost">Last post by:</a> <b><alink href='http://www.xxx.com/index.php?showuser=1649'>underft</a></b></span></td>
</tr>
<!-- End Topic Entry 4134 -->
Try to use xpath instead:
require 'nokogiri'
html_doc = Nokogiri::HTML("<html><body><!-- Begin Topic Entry 4134 --></body></html>")
# returns a NodeSet containing every comment node in the document
html_doc.xpath('//comment()')
You could implement a Nokogiri SAX parser. This is quicker to do than it might seem at first sight. You get events for elements, attributes, and comments.
Within your parser, you should remember the state, for example @currently_interested = true, to know which parts to remember and which to skip.