I'm interested in learning web scraping, and I'm currently learning how to scrape a table from a website using BeautifulSoup.
I have a simple HTML table to parse. I'm trying to get the rows in tbody, but I always get the text from thead instead. Could anyone take a look and see what's wrong? Here is the HTML table:
<table id="companyTable" class="table table--zebra table-content-page width-block dataTable no-footer" role="grid" aria-describedby="companyTable_info" style="width: 868px;">
<thead>
<tr role="row">
<th class="sorting_disabled" rowspan="1" colspan="1" style="width: 41px;">No</th>
<th class="sorting_disabled" rowspan="1" colspan="1" style="width: 224px;">Kode/Nama Perusahaan</th>
<th class="sorting_disabled" rowspan="1" colspan="1" style="width: 267px;">Nama</th>
<th class="sorting_disabled" rowspan="1" colspan="1" style="width: 187px;">Tanggal Pencatatan</th>
</tr>
</thead>
<tbody>
<tr role="row" class="odd">
<td class="text-center">1</td>
<td class="text-center">AALI</td>
<td>Astra Agro Lestari Tbk</td>
<td>09 Des 1997</td>
</tr>
<tr role="row" class="even">
<td class="text-center">2</td>
<td class="text-center">ABBA</td>
<td>Mahaka Media Tbk</td>
<td>03 Apr 2002</td>
</tr>
I'm sorry if this is a duplicate. I have already read and tried "Beautifulsoup HTML table parsing--only able to get the last row?", but I still don't get it, and the output is '[ ]'.
Here's the page I want to scrape: https://www.idx.co.id/perusahaan-tercatat/profil-perusahaan-tercatat/
I want to get this row:
<tr role="row" class="odd">
<td class="text-center">1</td>
<td class="text-center">AALI</td>
<td>Astra Agro Lestari Tbk</td>
<td>09 Des 1997</td>
</tr>
I try to get it, but I always get the text from thead instead.
Here's my code:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
url = 'https://www.idx.co.id/perusahaan-tercatat/profil-perusahaan-tercatat/'
uClient = uReq(url)
pageHtml = uClient.read()
uClient.close()
pageSoup = soup(pageHtml, "html.parser")
table = pageSoup.findAll('table', id = "companyTable")
table = table[0]
for row in table.findAll('tr'):
    for cell in row.findAll('th'):
        print(cell.text)
You just need the first tr in the tbody tag. So I'd use this:
first_row = s.find('tbody').find('tr')
Where s is the soup in my case. Here's an example:
>>> html = """<table id="companyTable" class="table table--zebra table-content-page width-block dataTable no-footer" role="grid" aria-describedby="companyTable_info" style="width: 868px;">
... <thead>
... <tr role="row">
... <th class="sorting_disabled" rowspan="1" colspan="1" style="width: 41px;">No</th>
... <th class="sorting_disabled" rowspan="1" colspan="1" style="width: 224px;">Kode/Nama Perusahaan</th>
... <th class="sorting_disabled" rowspan="1" colspan="1" style="width: 267px;">Nama</th>
... <th class="sorting_disabled" rowspan="1" colspan="1" style="width: 187px;">Tanggal Pencatatan</th>
... </tr>
... </thead>
... <tbody>
... <tr role="row" class="odd">
... <td class="text-center">1</td>
... <td class="text-center">AALI</td>
... <td>Astra Agro Lestari Tbk</td>
... <td>09 Des 1997</td>
... </tr>
... <tr role="row" class="even">
... <td class="text-center">2</td>
... <td class="text-center">ABBA</td>
... <td>Mahaka Media Tbk</td>
... <td>03 Apr 2002</td>
... </tr>
... """
>>> from bs4 import BeautifulSoup
>>> s = BeautifulSoup(html, "html.parser")
>>> first_row = s.find('tbody').find('tr')
>>> first_row
<tr class="odd" role="row">
<td class="text-center">1</td>
<td class="text-center">AALI</td>
<td>Astra Agro Lestari Tbk</td>
<td>09 Des 1997</td>
</tr>
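It works because find only returns the first element that matches. Your loop printed only header text because it searches every row for th cells, and the body rows contain td cells instead.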
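The same idea extends from the first row to every row in the body. The following is a minimal sketch, assuming the table markup from the question is already available as a string; the helper name body_rows and the shortened sample markup are illustrative and not part of the original code.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (sample markup for illustration).
html = """
<table id="companyTable">
<thead>
<tr><th>No</th><th>Kode/Nama Perusahaan</th><th>Nama</th><th>Tanggal Pencatatan</th></tr>
</thead>
<tbody>
<tr><td>1</td><td>AALI</td><td>Astra Agro Lestari Tbk</td><td>09 Des 1997</td></tr>
<tr><td>2</td><td>ABBA</td><td>Mahaka Media Tbk</td><td>03 Apr 2002</td></tr>
</tbody>
</table>
"""

def body_rows(markup, table_id):
    """Return the text of every td cell, row by row (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None or table.tbody is None:
        return []  # table missing, or no tbody present in this HTML
    return [
        [cell.get_text(strip=True) for cell in row.find_all("td")]
        for row in table.tbody.find_all("tr")
    ]

rows = body_rows(html, "companyTable")
for row in rows:
    print(row)
```

Here find_all returns every match, whereas find stops at the first one, so restricting the search to table.tbody keeps the header row out of the result.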
Solving the problem
If I understood correctly, you just want to get the table data from this site. However, after inspecting the site and analyzing its requests and responses with the Network tab of Chrome DevTools, I found that the site uses DataTables and fills the table using JavaScript, with the response from this request. That would explain the empty result: the rows are added by the browser after the page loads, so they are probably not in the HTML that urlopen downloads.
In other words, you could simply have done:
import requests
url = "https://www.idx.co.id/umbraco/Surface/Helper/GetEmiten?emitenType=s"
response = requests.get(url)
print(response.json())
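The same request can also be made with urllib from the standard library, which the question already uses. This is only a sketch: the fetch_json helper, the 10-second timeout, and the User-Agent value are assumptions for illustration, and the shape of the JSON returned by the endpoint is not shown here, so inspect it before relying on any particular key.

```python
import json
from urllib.request import Request, urlopen

def fetch_json(url, timeout=10):
    """Download a URL and decode its body as JSON (hypothetical helper)."""
    # Some servers reject clients without a User-Agent; this value is an assumption.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return json.loads(response.read().decode(charset))
```

Calling fetch_json with the URL above should give the same decoded data as response.json(), provided the endpoint is still available and answers this kind of client.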
What you should learn from this
Inspect the page elements and the requests/responses to find the easiest way to get the data. The tool I suggest is Chrome DevTools, but you may use whichever browser suits you best.
Related
Hi all, I would like to extract the value 25.8 from this HTML block using XPath.
The HTML is from a weather website, https://app.weathercloud.net/
<div id="gauge-rainrate"><h3>Intensidad de lluvia</h3><canvas id="rainrate" width="200" height="200"></canvas><div class="summary">
<table>
<tbody><tr>
<th> mm/h</th>
<th class="max"><i class="icon-chevron-up icon-white"></i> Máx </th>
</tr>
<tr>
<td class="grey">Diaria</td>
<td><a id="gauge-rainrate-max-day" rel="tooltip" title="" data-original-title="22/04/2022 00:00">0.0</a></td>
</tr>
<tr>
<td class="grey">Mensual</td>
<td><a id="gauge-rainrate-max-month" rel="tooltip" title="21/04/2022 02:15">25.8</a></td>
</tr>
<tr>
<td class="grey">Anual</td>
<td><a id="gauge-rainrate-max-year" rel="tooltip" title="21/04/2022 02:15">25.8</a></td>
</tr>
</tbody></table>
</div></div>
I use this expression to extract it into a Google Sheets cell:
=IMPORTXML("https://app.weathercloud.net/d5044837546#current";"//a[@id='gauge-rainrate-max-month']")
The expression appears to be correct, but my output is always
-
I don't understand why...
Sorry for being such a newbie; I haven't tried this before. If this has already been answered, could you please provide the link so I can review it? What I'd like to accomplish is to display the items side by side, such as Item 1 | Item 2 | Item 3.
I tried to display them in rows, hoping they would appear beside each other, but it's a no-go. I'm displaying the item list from a Sales Order (transaction) record in the Advanced PDF layout for that record type.
Thank you to anybody who can help. Here's the code block:
<body padding="8mm 13mm 8mm 13mm" size="A4">
<#if record.item?has_content>
<table class="itemTable" width="100%"><!-- start items --><#list record.item as item><#if item_index==0>
<thead>
<tr>
<th colspan="6" class="itemHeader" align="left" padding-bottom="8px">Code</th>
<th colspan="6" class="itemHeader" align="left" padding-bottom="8px" padding-left="10px">Qty</th>
<th colspan="6" class="itemHeader" align="left" padding-bottom="8px">Units</th>
<th colspan="18" class="itemHeader" align="left" padding-bottom="8px" padding-left="15px">Product Description</th>
<th colspan="8" class="itemHeader" align="left" padding-bottom="8px">Unit Price</th>
<th colspan="8" class="itemHeaderEnd" align="left" padding-bottom="8px" padding-left="10px">Amount</th>
</tr>
</thead>
<!-- Print items -->
</#if><tr>
<td colspan="6" class="itemDetail" align="left"><@printCode item.item /></td>
<td colspan="6" class="itemDetail" align="left" padding-left="20px">${item.quantity}</td>
<td colspan="6" class="itemDetail" align="center">${item.units}</td>
<td colspan="18" class="itemDetail" align="left" letter-spacing="0px" padding-left="15px" padding-right="50px">${item.description}</td>
<td colspan="8" class="itemDetail" align="left" padding-left="20px"><#if item.rate?is_number>${item.rate?string("#,##0.00")}<#else>${item.rate}</#if></td>
<td colspan="8" class="itemDetailEnd" align="left" padding-left="30px"><#if item.amount?is_number>${item.amount?string("#,##0.00")}<#else>${item.amount}</#if></td>
</tr>
</#list><!-- end items --></table>
</#if>
I know that the above displays the items in the default layout, from top to bottom. What I would like is to have them shown from left to right.
Thank you in advance.
-Joe
The way to do that in BFO is to build the table rows with the chunk built-in, then fill the last row with the missing cells.
For example, ignoring the header:
<#list record.item?chunk(3) as row>
<tr>
<#list row as item>... </#list>
<#if row?size lt 3 ><td> </td></#if><!-- fill the row -->
<#if row?size lt 2 ><td> </td></#if>
</tr>
</#list>
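To make the chunk-and-fill logic concrete outside the template, here is a small Python sketch of the same idea. The chunk_and_pad helper, the empty-string filler, and the sample item names are hypothetical; in the template itself the chunk built-in does the splitting and the two #if lines do the filling.

```python
def chunk_and_pad(items, size, filler=""):
    """Split items into rows of `size`, padding the last row with `filler`."""
    rows = [list(items[i:i + size]) for i in range(0, len(items), size)]
    if rows and len(rows[-1]) < size:
        rows[-1] += [filler] * (size - len(rows[-1]))
    return rows

items = ["Item 1", "Item 2", "Item 3", "Item 4", "Item 5"]  # sample data
for row in chunk_and_pad(items, 3):
    print(row)
# ['Item 1', 'Item 2', 'Item 3']
# ['Item 4', 'Item 5', '']
```

Padding matters because every table row needs the same number of cells; without it the last row would come up short whenever the item count is not a multiple of three.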
I have a pre-built e-commerce module for a website that stores HTML in an XML column in SQL Server.
When I select data from that column, I get entity-encoded markup instead of my HTML.
How can I decode the XML into HTML in a SELECT statement?
Data stored in the XML column [Overview]:
<locale en-US="&lt;h3&gt;As Shown Details&lt;/h3&gt; &lt;p&gt;6514/1 SN AWH TABLE LAMP AS SHOWN&lt;/p&gt; &lt;h3&gt;Item Details&lt;/h3&gt; &lt;div class=&quot;table-responsive&quot;&gt; &lt;table class=&quot;table table-striped table-condensed&quot; width=&quot;100%&quot; border=&quot;0&quot;&gt; &lt;tbody&gt; &lt;tr&gt; &lt;th scope=&quot;row&quot;&gt;Manufacturer&lt;/th&gt; &lt;td&gt;Holtkotter International&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th scope=&quot;row&quot;&gt;Dimensions&lt;/th&gt; &lt;td&gt;Width 7.25 x Depth 7.25 x Height 18.5&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th scope=&quot;row&quot;&gt;Seat Height&lt;/th&gt; &lt;td&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th scope=&quot;row&quot;&gt;Arm Height&lt;/th&gt; &lt;td&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th scope=&quot;row&quot;&gt;Inside Depth&lt;/th&gt; &lt;td&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th scope=&quot;row&quot;&gt;Fabric Content&lt;/th&gt; &lt;td&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th scope=&quot;row&quot;&gt;Country of Origin&lt;/th&gt; &lt;td&gt;&lt;/td&gt; &lt;/tr&gt; &lt;/tbody&gt; &lt;/table&gt; &lt;/div&gt; &lt;div class=&quot;part1&quot;&gt;This modern table lamp adds style and versatility to virtually any decor. Equipped with a full-range, turn-knob dimmer and a 100 Watt Halogen bulb by Osram. Pair it with the matching wall sconce 9426, floor lamp 6515, or swing-arm floor lamp 9434.&lt;/div&gt;&lt;br&gt;&lt;br&gt;&lt;div class=&quot;part2&quot;&gt;Available in Hand Brushed Old Bronze (shown), Antique Brass, Brushed Brass, Chrome, and Satin Nickel Finishes.&lt;/div&gt;&lt;br&gt;&lt;br&gt;&lt;div class=&quot;part3&quot;&gt;Halogen Line Voltage 100W bulb included.&lt;/div&gt;" />
It should be clean HTML. I've tried using:
select convert(xml, [Overview]) as code
from [mytable]
for xml path (''), type
but it returns
<code><locale en-US="&lt;h3&gt;As Shown Details&lt;/h3&gt; &lt;p&gt;6514/1 SN AWH TABLE LAMP AS SHOWN&lt;/p&gt; &lt;h3&gt;Item Details&lt;/h3&gt; &lt;div class=&quot;table-responsive&quot;&gt; &lt;table class=&quot;table table-striped table-condensed&quot; width=&quot;100%&quot; border=&quot;0&quot;&gt; &lt;tbody&gt; &lt;tr&gt; &lt;th scope=&quot;row&quot;&gt;Manufacturer&lt;/th&gt; &lt;td&gt;Holtkotter International&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th scope=&quot;row&quot;&gt;Dimensions&lt;/th&gt; &lt;td&gt;Width 7.25 x Depth 7.25 x Height 18.5&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th scope=&quot;row&quot;&gt;Seat Height&lt;/th&gt; &lt;td&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th scope=&quot;row&quot;&gt;Arm Height&lt;/th&gt; &lt;td&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th scope=&quot;row&quot;&gt;Inside Depth&lt;/th&gt; &lt;td&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th scope=&quot;row&quot;&gt;Fabric Content&lt;/th&gt; &lt;td&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th scope=&quot;row&quot;&gt;Country of Origin&lt;/th&gt; &lt;td&gt;&lt;/td&gt; &lt;/tr&gt; &lt;/tbody&gt; &lt;/table&gt; &lt;/div&gt; &lt;div class=&quot;part1&quot;&gt;This modern table lamp adds style and versatility to virtually any decor. Equipped with a full-range, turn-knob dimmer and a 100 Watt Halogen bulb by Osram. Pair it with the matching wall sconce 9426, floor lamp 6515, or swing-arm floor lamp 9434.&lt;/div&gt;&lt;br&gt;&lt;br&gt;&lt;div class=&quot;part2&quot;&gt;Available in Hand Brushed Old Bronze (shown), Antique Brass, Brushed Brass, Chrome, and Satin Nickel Finishes.&lt;/div&gt;&lt;br&gt;&lt;br&gt;&lt;div class=&quot;part3&quot;&gt;Halogen Line Voltage 100W bulb included.&lt;/div&gt;" /></code>
There's no need to do any manual re-coding. Reading the attribute value directly from the XML decodes it for you implicitly:
DECLARE @mockup TABLE(Overview XML);
INSERT INTO @mockup(Overview)
VALUES(N'<locale en-US="&lt;h3&gt;As Shown Details&lt;/h3&gt; &lt;p&gt;6514/1 SN AWH TABLE LAMP AS SHOWN&lt;/p&gt; &lt;h3&gt;Item Details&lt;/h3&gt; &lt;div class=&quot;table-responsive&quot;&gt; &lt;table class=&quot;table table-striped table-condensed&quot; width=&quot;100%&quot; border=&quot;0&quot;&gt; &lt;tbody&gt; &lt;tr&gt; &lt;th scope=&quot;row&quot;&gt;Manufacturer&lt;/th&gt; &lt;td&gt;Holtkotter International&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th scope=&quot;row&quot;&gt;Dimensions&lt;/th&gt; &lt;td&gt;Width 7.25 x Depth 7.25 x Height 18.5&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th scope=&quot;row&quot;&gt;Seat Height&lt;/th&gt; &lt;td&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th scope=&quot;row&quot;&gt;Arm Height&lt;/th&gt; &lt;td&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th scope=&quot;row&quot;&gt;Inside Depth&lt;/th&gt; &lt;td&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th scope=&quot;row&quot;&gt;Fabric Content&lt;/th&gt; &lt;td&gt;&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th scope=&quot;row&quot;&gt;Country of Origin&lt;/th&gt; &lt;td&gt;&lt;/td&gt; &lt;/tr&gt; &lt;/tbody&gt; &lt;/table&gt; &lt;/div&gt; &lt;div class=&quot;part1&quot;&gt;This modern table lamp adds style and versatility to virtually any decor. Equipped with a full-range, turn-knob dimmer and a 100 Watt Halogen bulb by Osram. Pair it with the matching wall sconce 9426, floor lamp 6515, or swing-arm floor lamp 9434.&lt;/div&gt;&lt;br&gt;&lt;br&gt;&lt;div class=&quot;part2&quot;&gt;Available in Hand Brushed Old Bronze (shown), Antique Brass, Brushed Brass, Chrome, and Satin Nickel Finishes.&lt;/div&gt;&lt;br&gt;&lt;br&gt;&lt;div class=&quot;part3&quot;&gt;Halogen Line Voltage 100W bulb included.&lt;/div&gt;" />');
SELECT m.Overview.value(N'(/locale/@en-US)[1]','nvarchar(max)')
FROM @mockup AS m;
The result cannot be cast to XML because HTML's rules are less strict (namely the unclosed <br> elements), but it works perfectly. Embedded in a web page, the browser rendered it as formatted HTML (note that your CSS classes are missing, of course).
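The implicit decoding is ordinary XML parser behaviour and is not specific to SQL Server. As a rough illustration, the Python sketch below reads an entity-encoded attribute with the standard library; the stored string is a shortened, made-up stand-in for the real column value, and it assumes the value is entity-encoded, as XML attribute values have to be.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the stored value: HTML entity-encoded inside an attribute.
stored = (
    '<locale en-US="&lt;h3&gt;Item Details&lt;/h3&gt; '
    '&lt;div class=&quot;part3&quot;&gt;Bulb included.&lt;/div&gt;" />'
)

element = ET.fromstring(stored)
html = element.get("en-US")  # the parser decodes &lt;, &gt; and &quot; for us
print(html)
# <h3>Item Details</h3> <div class="part3">Bulb included.</div>
```

In both cases the decoding happens when the attribute value is read through the parser, which is why no manual search-and-replace of entities is needed.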
How do I get all the links in a table based on the table caption?
<table class="wikitable sortable plainrowheaders">
<caption>Film</caption>
<tr>
<th scope="col">Year</th>
<th scope="col">Title</th>
<th scope="col">Role</th>
<th scope="col" class="unsortable">Notes</th>
</tr>
<tr>
<td style="text-align:center;">1997</td>
<th scope="row"><i><span class="sortkey">Ice Storm, The</span><span class="vcard"><span class="fn"><a href="/wiki/The_Ice_Storm_(film)">The Ice Storm</a></span> </span></i></th>
<td>Libbets Casey</td>
<td>First professional role</td>
</tr>
</table>
I tried this:
doc = Nokogiri::HTML(str)
doc.xpath('//table[caption=''Film'']//a/@href').each do |href|
  p href
end
But this doesn't print anything.
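You can write your code as below. In Ruby, doubling a single quote inside a single-quoted string does not escape it: the adjacent literals are simply concatenated, so your predicate became caption=Film, without quotes, and matched nothing.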
require 'nokogiri'
doc = Nokogiri::HTML::Document.parse <<-EOT
<table class="wikitable sortable plainrowheaders">
<caption>Film</caption>
<tr>
<th scope="col">Year</th>
<th scope="col">Title</th>
<th scope="col">Role</th>
<th scope="col" class="unsortable">Notes</th>
</tr>
<tr>
<td style="text-align:center;">1997</td>
<th scope="row"><i><span class="sortkey">Ice Storm, The</span><span class="vcard"><span class="fn"><a href="/wiki/The_Ice_Storm_(film)">The Ice Storm</a></span> </span></i></th>
<td>Libbets Casey</td>
<td>First professional role</td>
</tr>
</table>
EOT
doc.xpath("//table[./caption[text()='Film']]//a").each do |node|
  p node['href']
end
# >> "/wiki/The_Ice_Storm_(film)"
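The caption predicate is not specific to Nokogiri. As a rough sketch, the same selection can be tried in Python with the limited XPath support of xml.etree.ElementTree, assuming the markup is well-formed XML. The two-table sample below is hypothetical, and the /wiki/Some_Show link is made up purely to show that the other table is filtered out.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample with two captioned tables.
page = """<body>
<table>
<caption>Film</caption>
<tr><th><a href="/wiki/The_Ice_Storm_(film)">The Ice Storm</a></th></tr>
</table>
<table>
<caption>Television</caption>
<tr><th><a href="/wiki/Some_Show">Some Show</a></th></tr>
</table>
</body>"""

root = ET.fromstring(page)
# [caption='Film'] keeps only tables that have a caption child with that text.
links = [a.get("href") for a in root.findall(".//table[caption='Film']//a")]
print(links)
# ['/wiki/The_Ice_Storm_(film)']
```

Real-world HTML is rarely well-formed XML, so an HTML-aware parser (as Nokogiri is in the answer above) is usually the safer choice; the sketch only illustrates how the predicate narrows the match to one table.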
I am using Mechanize/Nokogiri and need to parse an HTML page containing many tables like this one:
<table width="100%" onclick="javascript:abredown('c7a8e8041a5031f127d5d27f3f071cbb');" class="buscaDestaque" bgcolor="#F7D36A">
<tr>
<td rowspan="2" scope="col" style="width:5%"><img src="images/gold.gif" border="0"></td>
<td scope="col" style="width:45%" class="mais"><b>Community - 2nd Season</b><br />Community - 2ª Temporada<br/><b>Downloads: </b> 2496 <b>Comentários: </b>17<br><b>Avaliação: </b> 10/10</td>
<td scope="col" style="width:20%">28/03/2011 - 21:07</td>
<td scope="col" style="width:20%">SubsOTF</td>
<td scope="col" style="width:10%"><img src='images/flag_br.gif' border='0'></td>
</tr>
<tr>
<td colspan="4">Release: <span class="brls">Community.S02E19.HDTV.XviD-LOL/DIMENSION</span></td>
</tr>
</table>
I want this output
Community.S02E19.HDTV.XviD-LOL/DIMENSION, ('c7a8e8041a5031f127d5d27f3f071cbb')
Can anyone help me?
require 'nokogiri'
html = Nokogiri::HTML html_with_many_tables
results = html.css('table.buscaDestaque').map do |table|
  jsid = table['onclick'][/'(\w+)'/, 1]  # text between the quotes in onclick
  brls = table.at_css('.brls').text      # release name from the span
  "#{brls}, #{jsid}"
end
p results
#=>["Community.S02E19.HDTV.XviD-LOL/DIMENSION, c7a8e8041a5031f127d5d27f3f071cbb",
#=> "AnotherBRLS, anotherJSID"]
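The string[/regex/, 1] idiom above returns the first capture group of the match. For readers less familiar with it, here is the same extraction as a small Python sketch using the onclick value from the question; the variable names are illustrative.

```python
import re

# The onclick attribute value from the question's table.
onclick = "javascript:abredown('c7a8e8041a5031f127d5d27f3f071cbb');"

# Capture the run of word characters between the single quotes.
match = re.search(r"'(\w+)'", onclick)
jsid = match.group(1) if match else None
print(jsid)
# c7a8e8041a5031f127d5d27f3f071cbb
```

Checking for a failed match before taking the group avoids an exception on tables whose onclick attribute has a different shape.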