Ruby Nokogiri Parsing HTML table III - html

I am using mechanize/nokogiri and need to parse an HTML page that contains many tables like this one:
<table width="100%" onclick="javascript:abredown('c7a8e8041a5031f127d5d27f3f071cbb');" class="buscaDestaque" bgcolor="#F7D36A">
<tr>
<td rowspan="2" scope="col" style="width:5%"><img src="images/gold.gif" border="0"></td>
<td scope="col" style="width:45%" class="mais"><b>Community - 2nd Season</b><br />Community - 2ª Temporada<br/><b>Downloads: </b> 2496 <b>Comentários: </b>17<br><b>Avaliação: </b> 10/10</td>
<td scope="col" style="width:20%">28/03/2011 - 21:07</td>
<td scope="col" style="width:20%">SubsOTF</td>
<td scope="col" style="width:10%"><img src='images/flag_br.gif' border='0'></td>
</tr>
<tr>
<td colspan="4">Release: <span class="brls">Community.S02E19.HDTV.XviD-LOL/DIMENSION</span></td>
</tr>
</table>
I want this output:
Community.S02E19.HDTV.XviD-LOL/DIMENSION, ('c7a8e8041a5031f127d5d27f3f071cbb')
Can anyone help me?

require 'nokogiri'
html = Nokogiri::HTML html_with_many_tables
results = html.css('table.buscaDestaque').map do |table|
  # Capture the quoted id inside the onclick handler.
  jsid = table['onclick'][/'(\w+)'/, 1]
  # Text of the release span inside this table.
  brls = table.at_css('.brls').text
  "#{brls}, #{jsid}"
end
p results
#=> ["Community.S02E19.HDTV.XviD-LOL/DIMENSION, c7a8e8041a5031f127d5d27f3f071cbb",
#    "AnotherBRLS, anotherJSID"]
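For comparison, the same two steps (capture the id from the onclick attribute, then read the text of the .brls span) can be sketched in Python with only the standard library. This is an illustration rather than part of the answer above; the class name ReleaseCollector and the trimmed sample markup are assumptions made for the sketch. It also formats each line with the parentheses and quotes shown in the requested output.

```python
import re
from html.parser import HTMLParser


class ReleaseCollector(HTMLParser):
    """Collects (release name, onclick id) pairs from table.buscaDestaque blocks."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._jsid = None       # id captured from the current table's onclick
        self._in_brls = False   # True while inside <span class="brls">
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "table" and "buscaDestaque" in classes:
            # Same idea as the Ruby regex: first quoted run of word characters.
            match = re.search(r"'(\w+)'", attrs.get("onclick") or "")
            self._jsid = match.group(1) if match else None
        elif tag == "span" and "brls" in classes:
            self._in_brls = True
            self._text = []

    def handle_data(self, data):
        if self._in_brls:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._in_brls:
            self._in_brls = False
            self.results.append(("".join(self._text).strip(), self._jsid))


# Trimmed copy of the table from the question (assumption: one table only).
sample = """
<table width="100%" onclick="javascript:abredown('c7a8e8041a5031f127d5d27f3f071cbb');" class="buscaDestaque">
<tr><td colspan="4">Release: <span class="brls">Community.S02E19.HDTV.XviD-LOL/DIMENSION</span></td></tr>
</table>
"""

collector = ReleaseCollector()
collector.feed(sample)
lines = [f"{brls}, ('{jsid}')" for brls, jsid in collector.results]
print(lines)
```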

Related

Repeat Table Column Headings on Multiple Pages Using BFO, Freemarker and HTML - NetSuite

I'm creating an Advanced PDF/HTML template using NetSuite that is essentially a list of items that spans multiple pages.
The problem I have is that the table column headings only show on the first page.
I would like the table column headings to display on each page where the table spans.
I am stuck and not sure what I need to do.
Any guidance would be much appreciated.
Thanks
You can just put your header in <thead>...</thead> and the BFO engine will repeat the header on each page that the table wraps to. Same with <tfoot>...</tfoot>.
Turned out that the code was correct but there was a defect with the Price List Advanced PDF/HTML template. Still being investigated by NetSuite but glad I wasn't missing something in the code itself.
I have the same issue, even though I do have a thead:
<table class="item-table-orca" cellmargin="0px" style="margin-top: 10px;width:100%;" direction="${direction}" lang="${pdfLang}">
<#list record.item as item>
<#assign itemUnit = customUnitExpr>
<#if item.units?has_content && UNIT_NAMES[item.units]??>
<#assign itemUnit = UNIT_NAMES[item.units][pdfLang]>
</#if>
<#if !item.quantity?? || !item.quantity?has_content>
<#assign itemUnit = "">
</#if>
<#if item_index==0>
<thead>
<tr>
<th align="center" colspan="2">#</th>
<th align="${align}" colspan="8">
<#if pdfLang == 'en'> Activity <#else> פעילות </#if>
</th>
<th align="center" colspan="4">${getTranslatedLabel("item.custcol_rr_start_date", item.custcol_rr_start_date)}</th>
<th align="center" colspan="4">${getTranslatedLabel("item.custcol_rr_end_date", item.custcol_rr_end_date)}</th>
<th align="center" colspan="4">${getTranslatedLabel("item.quantity", item.quantity#label)}</th>
<th align="${oposAlign}" colspan="5">${getTranslatedLabel("item.amount", item.amount#label)}</th>
</tr>
</thead>
</#if>
<tr>
<td valign="middle" align="center" colspan="2"><p> ${item_index + 1} </p></td>
<td valign="middle" align="${align}" colspan="8">${item.item}</td>
<td valign="middle" align="center" colspan="4"><#if item.custcol_rr_start_date?has_content>${item.custcol_rr_start_date?date?string(dateFormat)}</#if></td>
<td valign="middle" align="center" colspan="4"><#if item.custcol_rr_end_date?has_content>${item.custcol_rr_end_date?date?string(dateFormat)}</#if></td>
<td valign="middle" align="center" colspan="4"><#if hasPrintData && printData.soLineData??> ${printData.soLineData[item_index?string].soQty} <#else> ${item.quantity} </#if></td>
<td valign="middle" align="${oposAlign}" colspan="5">${curr_symbol}${item.amount?string(currFormat)}</td>
</tr>
</#list>
</table>

BeautifulSoup HTML scraping, how to get row after thead in tbody

I'm interested in learning how to scrape websites, and I'm currently learning to scrape a table with BeautifulSoup.
I have a simple HTML table to parse. I try to get the rows in tbody, but I always get the text from thead instead. Could anyone take a look and see what's wrong? Here is the HTML table:
<table id="companyTable" class="table table--zebra table-content-page width-block dataTable no-footer" role="grid" aria-describedby="companyTable_info" style="width: 868px;">
<thead>
<tr role="row">
<th class="sorting_disabled" rowspan="1" colspan="1" style="width: 41px;">No</th>
<th class="sorting_disabled" rowspan="1" colspan="1" style="width: 224px;">Kode/Nama Perusahaan</th>
<th class="sorting_disabled" rowspan="1" colspan="1" style="width: 267px;">Nama</th>
<th class="sorting_disabled" rowspan="1" colspan="1" style="width: 187px;">Tanggal Pencatatan</th>
</tr>
</thead>
<tbody>
<tr role="row" class="odd">
<td class="text-center">1</td>
<td class="text-center">AALI</td>
<td>Astra Agro Lestari Tbk</td>
<td>09 Des 1997</td>
</tr>
<tr role="row" class="even">
<td class="text-center">2</td>
<td class="text-center">ABBA</td>
<td>Mahaka Media Tbk</td>
<td>03 Apr 2002</td>
</tr>
I'm sorry, I have already read and tried "Beautifulsoup HTML table parsing--only able to get the last row?", but I still don't get it and only get '[ ]' as output.
Here is the link to the page I want to scrape: https://www.idx.co.id/perusahaan-tercatat/profil-perusahaan-tercatat/
I want to get this row:
<tr role="row" class="odd">
<td class="text-center">1</td>
<td class="text-center">AALI</td>
<td>Astra Agro Lestari Tbk</td>
<td>09 Des 1997</td>
</tr>
I try to get it, but I always get the text from thead instead.
Here's my code:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

url = 'https://www.idx.co.id/perusahaan-tercatat/profil-perusahaan-tercatat/'
uClient = uReq(url)
pageHtml = uClient.read()
uClient.close()
pageSoup = soup(pageHtml, "html.parser")
table = pageSoup.findAll('table', id = "companyTable")
table = table[0]
for row in table.findAll('tr'):
    for cell in row.findAll('th'):
        print(cell.text)
You just need the first tr in the tbody tag. So I'd use this:
first_row = s.find('tbody').find('tr')
Where s is the soup in my case. Here's an example:
>>> html = """<table id="companyTable" class="table table--zebra table-content-page width-block dataTable no-footer" role="grid" aria-describedby="companyTable_info" style="width: 868px;">
... <thead>
... <tr role="row">
... <th class="sorting_disabled" rowspan="1" colspan="1" style="width: 41px;">No</th>
... <th class="sorting_disabled" rowspan="1" colspan="1" style="width: 224px;">Kode/Nama Perusahaan</th>
... <th class="sorting_disabled" rowspan="1" colspan="1" style="width: 267px;">Nama</th>
... <th class="sorting_disabled" rowspan="1" colspan="1" style="width: 187px;">Tanggal Pencatatan</th>
... </tr>
... </thead>
... <tbody>
... <tr role="row" class="odd">
... <td class="text-center">1</td>
... <td class="text-center">AALI</td>
... <td>Astra Agro Lestari Tbk</td>
... <td>09 Des 1997</td>
... </tr>
... <tr role="row" class="even">
... <td class="text-center">2</td>
... <td class="text-center">ABBA</td>
... <td>Mahaka Media Tbk</td>
... <td>03 Apr 2002</td>
... </tr>
... """
>>> from bs4 import BeautifulSoup
>>> s = BeautifulSoup(html, "html.parser")
>>> first_row = s.find('tbody').find('tr')
>>> first_row
<tr class="odd" role="row">
<td class="text-center">1</td>
<td class="text-center">AALI</td>
<td>Astra Agro Lestari Tbk</td>
<td>09 Des 1997</td>
</tr>
It works because find returns only the first element that matches.
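If BeautifulSoup is not at hand, the same idea (ignore thead and read only the rows inside tbody) can be sketched with the standard library's html.parser. The class name BodyRowParser is an assumption for this sketch, and the sample is a shortened copy of the table from the question.

```python
from html.parser import HTMLParser


class BodyRowParser(HTMLParser):
    """Collects the cell text of every <tr> that sits inside <tbody>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_tbody = False
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self._in_tbody = True
        elif tag == "tr" and self._in_tbody:
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == "tbody":
            self._in_tbody = False


sample = """
<table id="companyTable">
<thead>
<tr role="row"><th>No</th><th>Kode/Nama Perusahaan</th><th>Nama</th><th>Tanggal Pencatatan</th></tr>
</thead>
<tbody>
<tr role="row" class="odd"><td>1</td><td>AALI</td><td>Astra Agro Lestari Tbk</td><td>09 Des 1997</td></tr>
<tr role="row" class="even"><td>2</td><td>ABBA</td><td>Mahaka Media Tbk</td><td>03 Apr 2002</td></tr>
</tbody>
</table>
"""

parser = BodyRowParser()
parser.feed(sample)
first_row = parser.rows[0]   # header cells are skipped because they are in thead
print(first_row)
```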
Solving the problem
If I understood correctly, you just want the table data from this site. However, inspecting the site and analyzing its requests and responses in the Network tab of the browser developer tools shows that the site uses DataTables and fills the table with JavaScript, using the response from a separate request (the URL in the code below).
In other words, you could simply have requested that URL directly:
import requests
url = "https://www.idx.co.id/umbraco/Surface/Helper/GetEmiten?emitenType=s"
response = requests.get(url)
print(response.json())
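The answer does not show what the JSON looks like. Assuming, purely for illustration, that the endpoint returns a list of objects, a sketch like the following turns such records into table rows. The field names Code, Name, and ListingDate and the shape of the payload are hypothetical; inspect the real response before relying on them.

```python
import json


def records_to_rows(records, columns):
    """Pick the given columns from each record, using '' for missing keys."""
    return [[str(record.get(column, "")) for column in columns] for record in records]


# Hypothetical payload: the real field names and structure may differ.
payload = json.loads("""
[
  {"Code": "AALI", "Name": "Astra Agro Lestari Tbk", "ListingDate": "09 Des 1997"},
  {"Code": "ABBA", "Name": "Mahaka Media Tbk", "ListingDate": "03 Apr 2002"}
]
""")

rows = records_to_rows(payload, ["Code", "Name", "ListingDate"])
for row in rows:
    print(" | ".join(row))
```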
What you should learn from this
Inspect the page elements and the requests/responses to find the easiest way to get the data. The tool I suggest is Chrome DevTools, but you may use the developer tools of whichever browser suits you best.

NetSuite Advanced PDF > Display items in a Grid format:

Sorry for being such a newbie; I haven't really tried this before. If this has already been answered, could you please provide the link so I can review it? Here's what I'd like to accomplish: display the item list from a Sales Order (or another transaction record) in the Advanced PDF layout for that record type, with the items beside each other, such as Item 1 | Item 2 | Item 3.
I tried displaying them in rows in the hope that they would sit beside each other, but it's a no go.
Thanks to anybody who can help. Here's the code block:
<body padding="8mm 13mm 8mm 13mm" size="A4">
<#if record.item?has_content>
<table class="itemTable" width="100%"><!-- start items --><#list record.item as item><#if item_index==0>
<thead>
<tr>
<th colspan="6" class="itemHeader" align="left" padding-bottom="8px">Code</th>
<th colspan="6" class="itemHeader" align="left" padding-bottom="8px" padding-left="10px">Qty</th>
<th colspan="6" class="itemHeader" align="left" padding-bottom="8px">Units</th>
<th colspan="18" class="itemHeader" align="left" padding-bottom="8px" padding-left="15px">Product Description</th>
<th colspan="8" class="itemHeader" align="left" padding-bottom="8px">Unit Price</th>
<th colspan="8" class="itemHeaderEnd" align="left" padding-bottom="8px" padding-left="10px">Amount</th>
</tr>
</thead>
<!-- Print items -->
</#if><tr>
<td colspan="6" class="itemDetail" align="left"><#printCode item.item /></td>
<td colspan="6" class="itemDetail" align="left" padding-left="20px">${item.quantity}</td>
<td colspan="6" class="itemDetail" align="center">${item.units}</td>
<td colspan="18" class="itemDetail" align="left" letter-spacing="0px" padding-left="15px" padding-right="50px">${item.description}</td>
<td colspan="8" class="itemDetail" align="left" padding-left="20px"><#if item.rate?is_number>${item.rate?string("#,##0.00")}<#else>${item.rate}</#if></td>
<td colspan="8" class="itemDetailEnd" align="left" padding-left="30px"><#if item.amount?is_number>${item.amount?string("#,##0.00")}<#else>${item.amount}</#if></td>
</tr>
</#list><!-- end items --></table>
</#if>
I know that the above displays the items in the default top-to-bottom layout. What I would like to achieve is to have them shown from left to right.
Thank you in advance.
-Joe
The way to do that in BFO is to build the table rows with FreeMarker's chunk built-in, then fill the last row with the missing cells.
For example, ignoring the header:
<#list record.item?chunk(3) as row>
<tr>
<#list row as item>... </#list>
<#if row?size lt 3 ><td> </td></#if><!-- fill the row -->
<#if row?size lt 2 ><td> </td></#if>
</tr>
</#list>
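The logic behind the template above (split the items into rows of three, then pad a short last row with empty cells) can be sketched outside FreeMarker as follows. This is only an illustration of the idea in Python; the function name chunk and the sample item labels are assumptions for the sketch.

```python
def chunk(items, size, filler=None):
    """Split items into rows of `size`; pad the last row with `filler` if given."""
    rows = [list(items[i:i + size]) for i in range(0, len(items), size)]
    if rows and filler is not None:
        # Add as many filler cells as the last row is short of `size`.
        rows[-1] += [filler] * (size - len(rows[-1]))
    return rows


items = ["Item 1", "Item 2", "Item 3", "Item 4", "Item 5"]
grid = chunk(items, 3, filler="")
for row in grid:
    print(" | ".join(row))
```

Padding matters because every table row needs the same number of cells for the columns to line up. FreeMarker's chunk built-in also accepts an optional second argument that serves as a filler item, which may replace the two manual fill lines.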

parsing/escape in Swift

I currently have an HTML string in Swift (part of it is shown here) from which I want to extract one specific part:
<tr style="color:White;background-color:#32B4FA;border-width:1px;border-style:solid;font-weight:normal;">
<th scope="col" style="border-color:Black;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;width:25px;"> </th><th scope="col" style="border-color:Black;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;width:20px;">Park-<br>stätte</th><th scope="col" style="border-color:Black;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;width:25px;">Parkmöglichkeit</th><th scope="col" style="border-color:Black;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;width:25px;">Anzahl Stellplätze</th><th scope="col" style="border-color:Black;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;width:25px;">Freie Stellplätze</th>
</tr>
<tr style="color:#000066;">
<td align="center" style="border-width:1px;border-style:solid;font-size:X-Small;width:25px;">
<span id="GridView1__Id_0" title="Kennzeichen" ReadOnly="true" style="display:inline-block;border-width:0px;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;width:25px;">P1</span>
</td>
<td align="center" style="border-width:1px;border-style:solid;font-size:X-Small;">
<img src="Images/Symbol_Tiefgarage.jpg" style="width:20px;" />
</td>
<td align="left" style="border-width:1px;border-style:solid;font-size:Small;">
<a id="GridView1_HyperLink1_0" href="http://www.paderborn.de/microsite/asp/parken_in_der_city/TG_Koenigsplatz.php" target="_top" style="display:inline-block;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;width:150px;">Königsplatz</a>
</td>
<td align="center" style="border-width:1px;border-style:solid;font-size:X-Small;width:40px;">
<span id="GridView1__AnzahlPlaetze_0" title="Anzahl Plätze" ReadOnly="true" style="display:inline-block;border-width:0px;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;">810</span>
</td>
<td align="center" style="border-width:1px;border-style:solid;font-size:Smaller;width:40px;">
<span id="GridView1__AnzahlFreiePlaetze_0" title="Freie Plätze" ReadOnly="true" style="display:inline-block;border-width:0px;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;">0</span>
</td>
</tr>
The part that interests me is the "810" (it could be a number from 0 to 1000 or a text string) from:
<td align="center" style="border-width:1px;border-style:solid;font-size:X-Small;width:40px;">
<span id="GridView1__AnzahlPlaetze_0" title="Anzahl Plätze" ReadOnly="true" style="display:inline-block;border-width:0px;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;">810</span>
</td>
I tried using a regex, but that did not work out for me.
I suggest you use an XML/HTML parser that supports CSS selectors to retrieve that string. The span that contains it has id="GridView1__AnzahlPlaetze_0", so you can use the query "#GridView1__AnzahlPlaetze_0" to retrieve it.
For example, with a Swift library called Fuzi that wraps libxml2:
import Fuzi

let doc = try? HTMLDocument(string: htmlString)
if let result = doc?.firstChild(css: "#GridView1__AnzahlPlaetze_0") {
    print(result.stringValue)
}
The above code is tested.

How do I get the links in a table based on caption using nokogiri

How do I get all the links in a table based on the table caption?
<table class="wikitable sortable plainrowheaders">
<caption>Film</caption>
<tr>
<th scope="col">Year</th>
<th scope="col">Title</th>
<th scope="col">Role</th>
<th scope="col" class="unsortable">Notes</th>
</tr>
<tr>
<td style="text-align:center;">1997</td>
<th scope="row"><i><span class="sortkey">Ice Storm, The</span><span class="vcard"><span class="fn"><a href="/wiki/The_Ice_Storm_(film)">The Ice Storm</a></span> </span></i></th>
<td>Libbets Casey</td>
<td>First professional role</td>
</tr>
</table>
I tried this:
doc = Nokogiri::HTML(str)
doc.xpath('//table[caption=''Film'']//a/#href').each do |href|
  p href
end
But this doesn't print anything.
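In your XPath, the doubled single quotes do not escape anything: Ruby joins the adjacent string literals, so the quotes around Film are lost. Wrap the XPath in double quotes instead. You can write your code as below: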
require 'nokogiri'
doc = Nokogiri::HTML::Document.parse <<-EOT
<table class="wikitable sortable plainrowheaders">
<caption>Film</caption>
<tr>
<th scope="col">Year</th>
<th scope="col">Title</th>
<th scope="col">Role</th>
<th scope="col" class="unsortable">Notes</th>
</tr>
<tr>
<td style="text-align:center;">1997</td>
<th scope="row"><i><span class="sortkey">Ice Storm, The</span><span class="vcard"><span class="fn"><a href="/wiki/The_Ice_Storm_(film)">The Ice Storm</a></span> </span></i></th>
<td>Libbets Casey</td>
<td>First professional role</td>
</tr>
</table>
EOT
doc.xpath("//table[./caption[text()='Film']]//a").each do |node|
  p node['href']
end
# >> "/wiki/The_Ice_Storm_(film)"
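The same predicate idea (select only the table whose caption is "Film", then collect the links inside it) can be sketched in Python with the standard library's ElementTree, which supports a limited XPath subset. This assumes well-formed markup, which real-world HTML often is not; the second table and its link are hypothetical and exist only to show that they are filtered out.

```python
import xml.etree.ElementTree as ET

# Well-formed sample; the "Television" table and its link are hypothetical.
markup = """
<body>
<table class="wikitable">
<caption>Film</caption>
<tr><th scope="row"><i><a href="/wiki/The_Ice_Storm_(film)">The Ice Storm</a></i></th></tr>
</table>
<table class="wikitable">
<caption>Television</caption>
<tr><th scope="row"><a href="/wiki/Some_Show">Some Show</a></th></tr>
</table>
</body>
""".strip()

root = ET.fromstring(markup)
# [caption='Film'] keeps only tables that have a caption child with that text.
hrefs = [a.get("href") for a in root.findall(".//table[caption='Film']//a")]
print(hrefs)
```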