BeautifulSoup4 and HTML

I want to extract the following information from the HTML below, using Python and bs4:
the text of the h2 with class "placename",
the text of the span inside it,
the text of the div with class "aithousaspec".
<div class="results-list">
<div class="piatsaname">city center</div>
<table>
<tr class="trspacer-up">
<td>
<a href="hall.aspx?id=1001173">
<h2 class="placename">ARENA
<span class="boldelement"><img src="/images/sun.png" height="16" valign="bottom" style="padding:0px 3px 0px 10px" >Θερινός<br>
25 Richmond Avenue st, Leeds</span>
</h2>
<p>
+4497XXXXXXX<br>
STEREO SOUND
</p>
Every Monday 2 tickets 8,00 pounds
</a>
</td>
</tr>
<tr class="trspacer-down">
<td>
<p class="coloredelement">Italian Job</p>
<div class="aithousaspec">
<b></b> Thu.-Wed.: 20.50/ 23.00
<b></b>
</div>
The code that I'm using doesn't seem efficient:
# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
print(soup.prettify())
mydivs = soup.select('div.results-list')
for info in mydivs:
    time = info.select('div.aithousaspec')
    print(time)
    # the attribute value is quoted because it contains a dot
    listCinemas = info.select("a[href*='hall.aspx']")
    print(listCinemas)
    print(len(listCinemas))
    for times in time:
        proj = times.find('div.aithousaspec')
        print(proj)
    for names in listCinemas:
        theater = names.find('h2', class_='placename')
        print(names.find('h2').find(text=True).strip())
        print(names.find('h2').contents[1].text.strip())
Is there any better way to get the mentioned info?

data = '''<div class="results-list">
<div class="piatsaname">city center</div>
<table>
<tr class="trspacer-up">
<td>
<a href="hall.aspx?id=1001173">
<h2 class="placename">ARENA
<span class="boldelement"><img src="/images/sun.png" height="16" valign="bottom" style="padding:0px 3px 0px 10px" >Θερινός<br>
25 Richmond Avenue st, Leeds</span>
</h2>
<p>
+4497XXXXXXX<br>
STEREO SOUND
</p>
Every Monday 2 tickets 8,00 pounds
</a>
</td>
</tr>
<tr class="trspacer-down">
<td>
<p class="coloredelement">Italian Job</p>
<div class="aithousaspec">
<b></b> Thu.-Wed.: 20.50/ 23.00
<b></b>
</div>'''
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(data, 'lxml')
print(soup.select('h2.placename')[0].contents[0].strip())
print(re.sub(r'\s{2,}', ' ', soup.select('span.boldelement')[0].text.strip()))
print(soup.select('div.aithousaspec')[0].text.strip())
This will print:
ARENA
Θερινός 25 Richmond Avenue st, Leeds
Thu.-Wed.: 20.50/ 23.00
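If the page lists more than one cinema, indexing with [0] only returns the first match. The following is a minimal sketch of one way to collect all three values per cinema. It assumes that every tr.trspacer-up row is followed by a sibling tr.trspacer-down row, as in the fragment from the question (closed properly here). The dictionary keys name, address, and times are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened, well-formed version of the fragment from the question.
data = '''<div class="results-list">
<table>
<tr class="trspacer-up"><td>
<a href="hall.aspx?id=1001173">
<h2 class="placename">ARENA
<span class="boldelement">Θερινός<br>
25 Richmond Avenue st, Leeds</span>
</h2>
</a>
</td></tr>
<tr class="trspacer-down"><td>
<p class="coloredelement">Italian Job</p>
<div class="aithousaspec">
<b></b> Thu.-Wed.: 20.50/ 23.00
<b></b>
</div>
</td></tr>
</table>
</div>'''

soup = BeautifulSoup(data, "html.parser")


def clean(tag):
    """Collapse all runs of whitespace in a tag's text, or return None."""
    if tag is None:
        return None
    return " ".join(tag.get_text(" ").split())


cinemas = []
for up in soup.select("tr.trspacer-up"):
    h2 = up.select_one("h2.placename")
    # Assumption: the show times live in the next trspacer-down row.
    down = up.find_next_sibling("tr", class_="trspacer-down")
    cinemas.append({
        # first text node of the h2, i.e. the name before the <span>
        "name": h2.find(string=True).strip() if h2 is not None else None,
        "address": clean(up.select_one("span.boldelement")),
        "times": clean(down.select_one("div.aithousaspec")) if down is not None else None,
    })

print(cinemas)
```

Each missing element yields None instead of raising an IndexError, which may be easier to handle when some rows are incomplete.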

Related

I am trying to scrape overall product details like brand, ingredient and flavour

Can anyone please help me scrape the flavour and brand details as key-value pairs using BeautifulSoup? I am new to this.
The desired output would be:
Flavour - Green Apple
Brand - Carabau
The HTML looks like this:
<tr class="a-spacing-small">
<td class="a-span3">
<span class="a-size-base a-text-bold">Flavour</span>
</td>
<td class="a-span9">
<span class="a-size-base">Green Apple</span>
</td>
<tr class="a-spacing-small">
<td class="a-span3">
<span class="a-size-base a-text-bold">Brand</span>
</td>
<td class="a-span9">
<span class="a-size-base">Carabau</span>
</td>
I have stored the data in the variable html. You can use the find() method on the respective tag to get the exact data; alternatively, you can use find_next().
html="""<tr class="a-spacing-small">
<td class="a-span3">
<span class="a-size-base a-text-bold">Flavour</span>
</td>
<td class="a-span9">
<span class="a-size-base">Green Apple</span>
</td>
</tr>"""
Code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
result = {}  # not named `dict`, which would shadow the built-in type
data = soup.find("td", class_="a-span3").find_next().text
data1 = soup.find("td", class_="a-span9").find("span", class_="a-size-base").text
print(data + " - " + data1)
result[data] = data1
Output:
Flavour - Green Apple
You can do it like this.
Select the <tr> and use .stripped_strings to iterate over the strings inside the <tr>.
Note: if you have multiple <tr> elements, use .find_all() to select each of them and do the same.
from bs4 import BeautifulSoup
s = """
<tr class="a-spacing-small">
<td class="a-span3">
<span class="a-size-base a-text-bold">Flavour</span>
</td>
<td class="a-span9">
<span class="a-size-base">Green Apple</span>
</td>
</tr>
"""
soup = BeautifulSoup(s, 'lxml')
tr = soup.find('tr')
print(list(tr.stripped_strings))
['Flavour', 'Green Apple']
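The note about multiple <tr> elements can be sketched as follows. This is only an illustration: it assumes each row holds exactly one label and one value, as in the question, and that each row is properly closed. The variable name details is made up.

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr class="a-spacing-small">
<td class="a-span3"><span class="a-size-base a-text-bold">Flavour</span></td>
<td class="a-span9"><span class="a-size-base">Green Apple</span></td>
</tr>
<tr class="a-spacing-small">
<td class="a-span3"><span class="a-size-base a-text-bold">Brand</span></td>
<td class="a-span9"><span class="a-size-base">Carabau</span></td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

details = {}
for tr in soup.find_all("tr", class_="a-spacing-small"):
    strings = list(tr.stripped_strings)
    # Assumption: a usable row has exactly one label and one value.
    if len(strings) == 2:
        key, value = strings
        details[key] = value

for key, value in details.items():
    print(f"{key} - {value}")
```

Rows that do not contain exactly two strings are skipped rather than guessed at.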
There's actually no need for .stripped_strings, as mentioned by Ram: you can use a specific CSS selector directly, which is safer because it grabs data only from the elements you name. The .stripped_strings approach also doesn't create the dictionary key-value pair you wanted.
You're looking for this:
# ...
data = []
for result in soup.select('tr'):
    # CSS selector for flavour detail
    flavor_name = result.select_one('.a-span9 .a-size-base').text
    # appends to list() as a dict() -> key-value pair
    data.append({
        "flavour": flavor_name
    })
print(data)
# [{'flavour': 'Green Apple'}]
Full code example (returns a key-value pair):
from bs4 import BeautifulSoup
html = '''
<tr class="a-spacing-small">
<td class="a-span3">
<span class="a-size-base a-text-bold">Flavour</span>
</td>
<td class="a-span9">
<span class="a-size-base">Green Apple</span>
</td>
'''
soup = BeautifulSoup(html, 'html.parser')
# temp list()
data = []
for result in soup.select('tr'):
    # flavor = soup.select_one('.a-text-bold').text # returns just Flavour word
    flavor_name = result.select_one('.a-span9 .a-size-base').text
    data.append({
        "flavour": flavor_name
    })
print(data)
# [{'flavour': 'Green Apple'}]
Access created data:
for flavour in data:
    print(flavour["flavour"])
# Green Apple

Use HTML::TreeBuilder in Perl to extract all instances of a specific span class

I'm trying to write a Perl script that opens an HTML file and extracts anything contained within <span class="postertrip"> tags.
Sample HTML:
<table>
<tbody>
<tr>
<td class="doubledash">>></td>
<td class="reply" id="reply2">
<a name="2"></a> <label><input type="checkbox" name="delete" value="1199313466,2" /> <span class="replytitle"></span> <span class="commentpostername">Test1</span><span class="postertrip">!AAAAAAAA</span> 08/01/03(Thu)02:06</label> <span class="reflink"> No.2 </span> <br /> <span class="filesize">File: <a target="_blank" href="test">1199326003295.jpg</a> -(<em>65843 B, 288x412</em>)</span> <span class="thumbnailmsg">Thumbnail displayed, click image for full size.</span><br /> <a target="_blank" test"> <img src="test" width="139" height="200" alt="65843" class="thumb" /></a>
<blockquote>
<p>Test message 1</p>
</blockquote>
</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<td class="doubledash">>></td>
<td class="reply" id="reply5">
<a name="5"></a> <label><input type="checkbox" name="delete" value="1199313466,5" /> <span class="replytitle"></span> <span class="commentpostername">Test2</span><span class="postertrip">!BBBBBBBB</span> 08/01/03(Thu)16:12</label> <span class="reflink"> No.5 </span>
<blockquote>
<p>Test message 2</p>
</blockquote>
</td>
</tr>
</tbody>
</table>
<table>
<tbody>
<tr>
<td class="doubledash">>></td>
<td class="reply" id="reply7">
<a name="7"></a> <label><input type="checkbox" name="delete" value="1199161229,7" /> <span class="replytitle"></span> <span class="commentpostername">Test3</span><span class="postertrip">!CCCCCCCC.</span> 08/01/01(Tue)17:53</label> <span class="reflink"> No.7 </span>
<blockquote>
<p>Test message 3</p>
</blockquote>
</td>
</tr>
</tbody>
</table>
Desired output:
!AAAAAAAA
!BBBBBBBB
!CCCCCCCC
Current script:
#!/usr/bin/env perl
use warnings;
use strict;
use 5.010;
use HTML::TreeBuilder;
open(my $html, "<", "temp.html")
or die "Can't open";
my $tree = HTML::TreeBuilder->new();
$tree->parse_file($html);
foreach my $e ($tree->look_down('class', 'reply')) {
my $e = $tree->look_down('class', 'postertrip');
say $e->as_text;
}
Bad output of script:
!AAAAAAAA
!AAAAAAAA
!AAAAAAAA
In your foreach loop, you have to look down from the element you found, not from the whole tree. So the correct code is:
foreach my $parent ($tree->look_down('class', 'reply')) {
    my $e = $parent->look_down('class', 'postertrip');
    say $e->as_text;
}
I've never liked HTML::TreeBuilder. It's a bit of a complicated mess, and it hasn't been updated in three years. Using CSS selectors with Mojo::DOM is pretty easy though. Its find does all that work that the various look_downs do:
use v5.10;
use Mojo::DOM;
my $html = do { local $/; <DATA> };
my @values = Mojo::DOM->new( $html )
    ->find( 'td.reply span.postertrip' )
    ->map( 'all_text' )
    ->each;
say join "\n", @values;
Note that in your HTML::TreeBuilder code, you don't have the logic to select the tags you care about. You can do it but you need extra work. The CSS selectors take care of that for you.

How to filter URL links by criteria with BeautifulSoup?

Every forum has some new posts. The one I visit marks new posts with a "new" sticker. How do I filter for and retrieve the URLs of the posts with the new sticker?
I usually just grab everything from the first page, but that seems unprofessional. There are also author and date stickers in each section. Can these be used as filtering criteria with BeautifulSoup? I still have a lot to learn.
This is the DOM:
<!-- 三級置頂分開 -->
<tbody id="stickthread_10432064">
<tr>
<td class="folder"><img src="images/green001/folder_new.gif"/></td>
<td class="icon">
  </td>
<th class="new">
<label>
<img alt="" src="images/green001/agree.gif"/>
<img alt="本版置顶" src="images/green001/pin_1.gif"/>
 </label>
<em>[痴女]</em> <span id="thread_10432064">(セレブの友)(CESD-???)大槻ひびき</span>
<img alt="附件" class="attach" src="images/attachicons/common.gif"/>
<span class="threadpages"> <img src="images/new2.gif"/></span> ### new sticker
</th>
<td class="author"> ### author sticker
<cite>
新片<img align="absmiddle" border="0" src="images/thankyou.gif"/>12 </cite>
<em>2019-4-23</em> ### date sticker
</td>
<td class="nums"><strong>6</strong> / <em>14398</em></td>
<td class="nums">7.29G / MP4
</td>
<td class="lastpost">
<em>2019-4-25 14:11</em>
<cite>by 22811</cite>
</td>
</tr>
</tbody><!-- 三級置頂分開 -->
Let me put it another way, since it seems I didn't express myself well enough. For example, I want to find all 'tbody' elements that have an 'author' of 新片, or a 'date' of 2019-4-23, or a sticker image "images/new2.gif". I would presumably get a list of tbody elements, and then I want to find the hrefs in them via
blue = soup.find_all('a', style="font-weight: bold;color: blue")
Thanks, chiefs!
There is a class new, so I am wondering whether you could just use that. That would be:
items = soup.select('tbody:has(.new)')
for item in items:
    print([i['href'] for i in item.select('a')])
Otherwise, you can use the :has and :contains pseudo-classes (bs4 4.7.1+) to specify those patterns. Note that newer Soup Sieve releases deprecate :contains in favor of :-soup-contains.
items = soup.select('tbody:has(.author a:contains("新片")), tbody:has(em:contains("2019-4-23")), tbody:has([src="images/new2.gif"])')
You can then get the hrefs with a loop:
for item in items:
    print([i['href'] for i in item.select('a')])
First find the parent tag, then find its next sibling, and then find the respective tag inside it. Try the code below:
from bs4 import BeautifulSoup
import re
data='''<tbody id="stickthread_10432064">
<tr>
<td class="folder"><img src="images/green001/folder_new.gif"/></td>
<td class="icon">
</td>
<th class="new">
<label>
<img alt="" src="images/green001/agree.gif"/>
<img alt="本版置顶" src="images/green001/pin_1.gif"/>
</label>
<em>[痴女]</em> <span id="thread_10432064">(セレブの友)(CESD-???)大槻ひびき</span>
<img alt="附件" class="attach" src="images/attachicons/common.gif"/>
<span class="threadpages"> <img src="images/new2.gif"/></span> ### new sticker
</th>
<td class="author"> ### author sticker
<cite>
新片<img align="absmiddle" border="0" src="images/thankyou.gif"/>12 </cite>
<em>2019-4-23</em> ### date sticker
</td>
<td class="nums"><strong>6</strong> / <em>14398</em></td>
<td class="nums">7.29G / MP4
</td>
<td class="lastpost">
<em>2019-4-25 14:11</em>
<cite>by 22811</cite>
</td>
</tr>
</tbody>'''
soup = BeautifulSoup(data, 'html.parser')
for item in soup.find_all('img', src=re.compile('images/new')):
    parent = item.parent.parent
    print(parent.find_next_siblings('td')[0].find('a').text)
    print(parent.find_next_siblings('td')[0].find('em').text)
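The "author or date or sticker" filter can also be written in plain Python without CSS pseudo-classes. The following is only a sketch under stated assumptions: the DOM in the question shows no <a> elements, so the links (thread-1.html, thread-2.html) and the second tbody are hypothetical, and matches() is a helper name invented for this example.

```python
from bs4 import BeautifulSoup

# Simplified markup; the <a href> elements are assumed, not taken from the question.
html = '''
<table>
<tbody id="stickthread_1">
<tr>
<th class="new"><span id="thread_1"><a href="thread-1.html">First</a></span>
<span class="threadpages"><img src="images/new2.gif"/></span></th>
<td class="author"><cite>新片</cite><em>2019-4-23</em></td>
</tr>
</tbody>
<tbody id="normalthread_2">
<tr>
<th class="common"><span id="thread_2"><a href="thread-2.html">Second</a></span></th>
<td class="author"><cite>someone</cite><em>2019-4-20</em></td>
</tr>
</tbody>
</table>
'''


def matches(tbody, author=None, date=None, sticker_src=None):
    """Return True if the tbody satisfies at least one of the given criteria."""
    cell = tbody.find("td", class_="author")
    if cell is not None:
        cite = cell.find("cite")
        em = cell.find("em")
        if author and cite is not None and author in cite.get_text():
            return True
        if date and em is not None and em.get_text(strip=True) == date:
            return True
    if sticker_src and tbody.find("img", src=sticker_src) is not None:
        return True
    return False


soup = BeautifulSoup(html, "html.parser")
hits = [
    tb for tb in soup.find_all("tbody")
    if matches(tb, author="新片", date="2019-4-23", sticker_src="images/new2.gif")
]
# Collect every href found inside the matching tbody elements.
links = [a["href"] for tb in hits for a in tb.find_all("a", href=True)]
print(links)
```

Keeping the criteria in a function makes it straightforward to switch from "any of" to "all of" later, if that is what is needed.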

How to extract different types of bold text and the text in between them using BeautifulSoup?

I have to parse HTML documents that use bold text as section identifiers. But the bold text comes in different forms; some examples are shown below.
Using Beautiful Soup I am able to parse them, but I have to write a lot of if/else branches to handle the different types of bold. Is there an optimal way to find such bold text, and the text in between, without using so many if/else branches?
<div style="line-height:120%;padding-bottom:12px;font-size:10pt;">
<font style="font-family:inherit;font-size:10pt;font-weight:bold;">List 1. Work</font>
</div>
<td style="vertical-align:top;padding-left:2px;padding-top:2px;padding-bottom:2px;padding-right:2px;">
<div style="text-align:left;font-size:10pt;">
<font style="font-family:inherit;font-size:10pt;font-weight:bold;">List 1.</font>
</div>
</td>
<td style="vertical-align:top;padding-left:2px;padding-top:2px;padding-bottom:2px;padding-right:2px;">
<div style="text-align:left;font-size:10pt;">
<font style="font-family:inherit;font-size:10pt;font-weight:bold;">Work.</font>
</div>
</td>
<p style="font-family:times;text-align:justify">
<font size="2">
<a name="de42901_List_1._Work"> </a>
<a name="toc_de42901_2"> </a>
</font>
<font size="2"><b> List 1. Work <br> </b></font>
</p>
<p style="font-family:times;text-align:justify">
<font size="2">
<a name="da18101_List_1._Work"> </a>
<a name="toc_da18101_3"> </a>
</font>
<font size="2"><b> List 1. </b></font>
<font size="2"><b><i>Work <br> </i></b></font>
</p>
Use the split and join functions to remove unwanted \n, \t, and &nbsp; characters:
soup = BeautifulSoup(data, 'html.parser')
data = soup.find_all('b')
for i in data:
    final = ' '.join(i.text.split())
    print(final)
This normalizes all the text to the same format. Note that it only finds <b> tags, not text made bold through a style attribute. Hope it resolves your query.
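To cover the style-based variants from the question as well, one option is to pass a single predicate function to find_all() instead of chaining if/else checks. This is a sketch, not a complete solution: is_bold and BOLD_STYLE are names made up here, the regular expression is an assumption about which inline styles count as bold, and it does not handle bold set through a stylesheet or the text between headings.

```python
import re
from bs4 import BeautifulSoup

# Shortened versions of the variants shown in the question.
html = '''
<div><font style="font-family:inherit;font-size:10pt;font-weight:bold;">List 1. Work</font></div>
<p><font size="2"><b> List 1. Work <br> </b></font></p>
<p><font size="2"><b> List 1. </b></font><font size="2"><b><i>Work <br> </i></b></font></p>
<p>Plain paragraph that is not bold.</p>
'''

# Assumption: inline "font-weight: bold" or a numeric weight of 600-900 means bold.
BOLD_STYLE = re.compile(r"font-weight\s*:\s*(bold|[6-9]00)", re.I)


def is_bold(tag):
    """True for <b>/<strong> tags and for tags with a bold inline style."""
    if tag.name in ("b", "strong"):
        return True
    return bool(BOLD_STYLE.search(tag.get("style", "")))


soup = BeautifulSoup(html, "html.parser")
headings = []
for tag in soup.find_all(is_bold):
    # Skip bold elements nested inside another bold element to avoid duplicates.
    if tag.find_parent(is_bold) is not None:
        continue
    text = " ".join(tag.get_text().split())
    if text:
        headings.append(text)

print(headings)
```

Headings that are split across several bold elements, such as "List 1." and "Work", still come back as separate entries and would need to be merged in a later step.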

lining up text in html/css

I am trying to line up all the #comments so that they all start at the same horizontal position, from the first #comment to the last #comment.
This is what my code looks like http://jsfiddle.net/#&togetherjs=7C48oh5dl7
I have tried making each comment into a span and adding a text-indent, but as you can see this does not seem to work.
I have also tried adding a padding/margin on the span but it distorts the appearance.
HTML code
<p id="var_ex"> x = 2 <span style="display:inline-block; text-indent: 70px;"> # stores the value 2 into x</span> </p>
<p id="var_ex"> x,y = 2,3 <span style="display:inline-block; text-indent: 70px;"> # assigns 2 and 3 to x and y, respectively</span> </p>
<p id="var_ex"> myText = "This is a string" <span style="display:inline-block; text-indent: 70px;"> # assigning a variable to a string</span> </p>
<p id="var_ex"> myList = [2,3,4,5]<span style="display:inline-block; text-indent: 70px; "> # assigning a variable to a list/array</span> </p>
<p id="var_ex"> fahrenheit = celsius*(9.0/5.0) + 32 <span style="display:inline-block; margin-left:300px;"> #using mathematical expresions</span> </p>
You can achieve this with a <table> element, as shown in the following sample adapted to your case:
<table>
  <tr>
    <td width="30%">
      x = 2
    </td>
    <td width="70%">
      # stores the value 2 into x
    </td>
  </tr>
  <tr>
    <td>
      x,y = 2,3
    </td>
    <td>
      # assigns 2 and 3 to x and y, respectively
    </td>
  </tr>
</table>
You can specify the column width either in absolute (px), or relative units (%).
For more information on <table> formatting with CSS3 (in particular, using the header cell tag <th> and the <thead>, <tfoot>, and <tbody> section elements), you can refer to the article:
HTML5 Tables formatting: alternate rows, color gradients, shadows (http://www.codeproject.com/Tips/262546/HTML-Tables-formating-best-practices)