I currently have an HTML string in Swift (part of it is shown here) from which I want to extract one specific value:
<tr style="color:White;background-color:#32B4FA;border-width:1px;border-style:solid;font-weight:normal;">
<th scope="col" style="border-color:Black;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;width:25px;"> </th><th scope="col" style="border-color:Black;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;width:20px;">Park-<br>stätte</th><th scope="col" style="border-color:Black;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;width:25px;">Parkmöglichkeit</th><th scope="col" style="border-color:Black;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;width:25px;">Anzahl Stellplätze</th><th scope="col" style="border-color:Black;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;width:25px;">Freie Stellplätze</th>
</tr>
<tr style="color:#000066;">
<td align="center" style="border-width:1px;border-style:solid;font-size:X-Small;width:25px;">
<span id="GridView1__Id_0" title="Kennzeichen" ReadOnly="true" style="display:inline-block;border-width:0px;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;width:25px;">P1</span>
</td>
<td align="center" style="border-width:1px;border-style:solid;font-size:X-Small;">
<img src="Images/Symbol_Tiefgarage.jpg" style="width:20px;" />
</td>
<td align="left" style="border-width:1px;border-style:solid;font-size:Small;">
<a id="GridView1_HyperLink1_0" href="http://www.paderborn.de/microsite/asp/parken_in_der_city/TG_Koenigsplatz.php" target="_top" style="display:inline-block;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;width:150px;">Königsplatz</a>
</td>
<td align="center" style="border-width:1px;border-style:solid;font-size:X-Small;width:40px;">
<span id="GridView1__AnzahlPlaetze_0" title="Anzahl Plätze" ReadOnly="true" style="display:inline-block;border-width:0px;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;">810</span>
</td>
<td align="center" style="border-width:1px;border-style:solid;font-size:Smaller;width:40px;">
<span id="GridView1__AnzahlFreiePlaetze_0" title="Freie Plätze" ReadOnly="true" style="display:inline-block;border-width:0px;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;">0</span>
</td>
</tr>
The part I'm interested in is the "810" (it could be a number from 0 to 1000, or a text string) in:
<td align="center" style="border-width:1px;border-style:solid;font-size:X-Small;width:40px;">
<span id="GridView1__AnzahlPlaetze_0" title="Anzahl Plätze" ReadOnly="true" style="display:inline-block;border-width:0px;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;">810</span>
</td>
I tried to get used to regular expressions, but that did not work out for me.
I suggest using an XML/HTML parser that supports CSS selectors to retrieve that string. The span containing it has the id "GridView1__AnzahlPlaetze_0", so the query "#GridView1__AnzahlPlaetze_0" will select it.
For example, with Fuzi, a Swift library that wraps libxml2:
import Fuzi

let doc = try? HTMLDocument(string: htmlString)
if let result = doc?.firstChild(css: "#GridView1__AnzahlPlaetze_0") {
    print(result.stringValue)
}
The above code is tested.
I'm interested in learning web scraping, and right now I'm learning how to scrape a table from a website using BeautifulSoup.
I have a simple HTML table to parse, but when I try to get the rows in tbody I always get the text from thead instead. Could anyone take a look and see what's wrong? Here is the HTML table:
<table id="companyTable" class="table table--zebra table-content-page width-block dataTable no-footer" role="grid" aria-describedby="companyTable_info" style="width: 868px;">
<thead>
<tr role="row">
<th class="sorting_disabled" rowspan="1" colspan="1" style="width: 41px;">No</th>
<th class="sorting_disabled" rowspan="1" colspan="1" style="width: 224px;">Kode/Nama Perusahaan</th>
<th class="sorting_disabled" rowspan="1" colspan="1" style="width: 267px;">Nama</th>
<th class="sorting_disabled" rowspan="1" colspan="1" style="width: 187px;">Tanggal Pencatatan</th>
</tr>
</thead>
<tbody>
<tr role="row" class="odd">
<td class="text-center">1</td>
<td class="text-center">AALI</td>
<td>Astra Agro Lestari Tbk</td>
<td>09 Des 1997</td>
</tr>
<tr role="row" class="even">
<td class="text-center">2</td>
<td class="text-center">ABBA</td>
<td>Mahaka Media Tbk</td>
<td>03 Apr 2002</td>
</tr>
I'm sorry, I have already read and tried "Beautifulsoup HTML table parsing--only able to get the last row?", but I still don't get it and only get '[ ]' as output.
Here is the page I want to scrape: https://www.idx.co.id/perusahaan-tercatat/profil-perusahaan-tercatat/
I want to get this row:
<tr role="row" class="odd">
<td class="text-center">1</td>
<td class="text-center">AALI</td>
<td>Astra Agro Lestari Tbk</td>
<td>09 Des 1997</td>
</tr>
When I try to get it, I always get the text from thead instead.
Here is my code:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

url = 'https://www.idx.co.id/perusahaan-tercatat/profil-perusahaan-tercatat/'
uClient = uReq(url)
pageHtml = uClient.read()
uClient.close()

pageSoup = soup(pageHtml, "html.parser")
table = pageSoup.findAll('table', id="companyTable")
table = table[0]

for row in table.findAll('tr'):
    for cell in row.findAll('th'):
        print(cell.text)
You just need the first tr in the tbody tag. So I'd use this:
first_row = s.find('tbody').find('tr')
Where s is the soup in my case. Here's an example:
>>> html = """<table id="companyTable" class="table table--zebra table-content-page width-block dataTable no-footer" role="grid" aria-describedby="companyTable_info" style="width: 868px;">
... <thead>
... <tr role="row">
... <th class="sorting_disabled" rowspan="1" colspan="1" style="width: 41px;">No</th>
... <th class="sorting_disabled" rowspan="1" colspan="1" style="width: 224px;">Kode/Nama Perusahaan</th>
... <th class="sorting_disabled" rowspan="1" colspan="1" style="width: 267px;">Nama</th>
... <th class="sorting_disabled" rowspan="1" colspan="1" style="width: 187px;">Tanggal Pencatatan</th>
... </tr>
... </thead>
... <tbody>
... <tr role="row" class="odd">
... <td class="text-center">1</td>
... <td class="text-center">AALI</td>
... <td>Astra Agro Lestari Tbk</td>
... <td>09 Des 1997</td>
... </tr>
... <tr role="row" class="even">
... <td class="text-center">2</td>
... <td class="text-center">ABBA</td>
... <td>Mahaka Media Tbk</td>
... <td>03 Apr 2002</td>
... </tr>
... """
>>> from bs4 import BeautifulSoup
>>> s = BeautifulSoup(html)
>>> first_row = s.find('tbody').find('tr')
>>> first_row
<tr class="odd" role="row">
<td class="text-center">1</td>
<td class="text-center">AALI</td>
<td>Astra Agro Lestari Tbk</td>
<td>09 Des 1997</td>
</tr>
It works because find returns only the first element that matches.
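The same "skip thead, take the first tr in tbody" idea can also be sketched without BeautifulSoup. The following is only a minimal illustration that assumes the standard library's html.parser is acceptable; FirstBodyRow and first_body_row are hypothetical names, not part of any library.

```python
from html.parser import HTMLParser


class FirstBodyRow(HTMLParser):
    """Collect the cell texts of the first <tr> found inside <tbody>."""

    def __init__(self):
        super().__init__()
        self.in_tbody = False  # True once <tbody> has been opened
        self.in_cell = False   # True while inside a <td> of the first row
        self.done = False      # True after the first body row is closed
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self.in_tbody = True
        elif tag == "td" and self.in_tbody and not self.done:
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "tr" and self.in_tbody:
            self.done = True  # ignore every later row
        elif tag == "tbody":
            self.in_tbody = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data.strip()


def first_body_row(html):
    """Return the cell texts of the first tbody row, or [] if there is none."""
    parser = FirstBodyRow()
    parser.feed(html)
    return parser.cells


sample = """<table><thead><tr><th>No</th><th>Nama</th></tr></thead>
<tbody>
<tr><td>1</td><td>Astra Agro Lestari Tbk</td></tr>
<tr><td>2</td><td>Mahaka Media Tbk</td></tr>
</tbody></table>"""
print(first_body_row(sample))  # ['1', 'Astra Agro Lestari Tbk']
```

Header cells are never collected because they are th elements outside tbody, which is the same reason the find('tbody').find('tr') call above skips them.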
Solving the problem
If I understood correctly, you just want to get the table data from this site. However, after inspecting the site and analyzing its requests and responses in the Network tab of Chrome DevTools, I found that the site uses DataTables and fills the table with JavaScript from the response to a separate request.
In other words, you could simply request that data directly:
import requests
url = "https://www.idx.co.id/umbraco/Surface/Helper/GetEmiten?emitenType=s"
response = requests.get(url)
print(response.json())
What you should learn from this
Inspect the page elements and the requests/responses to find the easiest way to get the data. I suggest Chrome DevTools, but you can use whichever browser suits you best.
I'm calling an int value from a database to determine the number of stars that should be displayed in my html using thymeleaf and Spring Boot, but using ${#numbers.sequence(1,obj.stars)} doesn't seem to work.
This is my Thymeleaf HTML code:
<tr th:each="obj : ${allObjs}" class="pointer" th:onclick="'javascript:openobj(\'' + ${obj.id} + '\');'">
<td class="text-center" th:text="${obj.id}"></td>
<td class="text-center" th:text="${obj.code}"></td>
<td class="text-center" th:text="${obj.name}"></td>
<td class="text-center" th:text="${obj.contract}"></td>
<td class="text-center" th:text="${obj.difficulty}"></td>
<td class="text-center" th:text="${obj.priority}"></td>
<td class="text-center">
<!--this is the line I can't get to work :(-->
<span class="fa fa-star-o" th:each="star:${#numbers.sequence(1,obj.stars)}"></span>
</td>
<td class="text-center" th:text="${obj.state}"></td>
<td class="text-center" th:text="${obj.percent}"></td>
<td class="text-center" th:text="${obj.term}"></td>
<td class="text-center" th:text="${obj.version}"></td>
<td class="text-center" th:text="${obj.price}"></td>
</tr>
And my controller:
@GetMapping("/Obj")
public ModelAndView index() {
    ModelAndView view = new ModelAndView("/Obj/index");
    view.addObject("title", "Obj");
    List<Obj> allObjs = ObjService.findAll();
    view.addObject("allObjs", allObjs);
    return view;
}
Well, I know it's odd to answer your own question, but thanks to Michael Petch, who tested it, I found that the problem was in the sequence. It started from 1 while some rows had a value of 0 in obj.stars, so the sequence couldn't be created with a step of 1.
Changing it to
<span class="fa fa-star-o" th:each="star:${#numbers.sequence(0,obj.stars)}"></span>
solved the problem.
I know it might be a duplicate but I am not able to extract a value from this HTML source. Any help would be greatly appreciated.
What I am trying to do is get the pid of a project from the page.
The project names are read from a CSV file, and I need to get the pid for each one.
For example, if the project is "AA project" (the project key "AA" can also be used), the pid that needs to be extracted is 10441.
Since the values are not in a label, I cannot figure out how to extract them.
Update: just using pid=(\d....) returns all the pids without any reference to the project name or key.
<table id="project-list" class="aui">
<thead>
<tr>
<th></th>
<th>Name</th>
<th>Key</th>
<th class="project-list-type">Project Type</th>
<th>URL</th>
<th>Project Lead</th>
<th>Default Assignee</th>
<th>Operations</th>
</tr>
</thead>
<tbody>
<tr data-project-key="AA">
<td class="cell-type-icon" data-cell-type="avatar">
<div class="aui-avatar aui-avatar-small aui-avatar-project jira-system-avatar"><span class="aui-avatar-inner"><img src="/secure/projectavatar?pid=10441&avatarId=10011&size=small" alt="Project Avatar for 10441" /></span></div>
</td>
<td data-cell-type="name">
<a id="view-project-10441" href="/plugins/servlet/project-config/AA/summary">AA project</a>
</td>
<td data-cell-type="key">AA</td>
<td class="cell-type-project-type">
<span>Software</span>
</td>
<td class="cell-type-url" data-cell-type="url">
No URL
</td>
<td class="cell-type-user" data-cell-type="lead">
<a class="user-hover" rel="localadmin" id="view_AA_projects_localadmin" href="/secure/ViewProfile.jspa?name=localadmin">Atlassian Administrator</a>
</td>
<td class="cell-type-user" data-cell-type="default-assignee">
Unassigned
</td>
<td data-cell-type="operations">
<ul class="operations-list">
<li><a class="edit-project" id="edit-project-10441" href="/secure/project/EditProject!default.jspa?pid=10441&returnUrl=ViewProjects.jspa">Edit</a></li>
<li><a id="change_project_type_10441" class="change-project-type-link" data-project-id="10441" href="#">Change project type</a></li>
<li><a id="delete_project_10441" href="/secure/project/DeleteProject!default.jspa?pid=10441&returnUrl=ViewProjects.jspa">Delete</a></li>
</ul>
</td>
</tr>
<tr data-project-key="AAL">
<td class="cell-type-icon" data-cell-type="avatar">
<div class="aui-avatar aui-avatar-small aui-avatar-project jira-system-avatar"><span class="aui-avatar-inner"><img src="/secure/projectavatar?pid=10442&avatarId=10011&size=small" alt="Project Avatar for 10442" /></span></div>
</td>
<td data-cell-type="name">
<a id="view-project-10442" href="/plugins/servlet/project-config/AAL/summary">AAL project</a>
</td>
<td data-cell-type="key">AAL</td>
<td class="cell-type-project-type">
<span>Software</span>
</td>
<td class="cell-type-url" data-cell-type="url">
No URL
</td>
<td class="cell-type-user" data-cell-type="lead">
<a class="user-hover" rel="localadmin" id="view_AAL_projects_localadmin" href="/secure/ViewProfile.jspa?name=localadmin">Atlassian Administrator</a>
</td>
<td class="cell-type-user" data-cell-type="default-assignee">
Unassigned
</td>
<td data-cell-type="operations">
<ul class="operations-list">
I wouldn't recommend using regular expressions to parse HTML: they are a headache to develop and maintain, and they are very sensitive to markup changes, hence fragile. See https://stackoverflow.com/a/1732454/2897748 for details.
Go for the XPath Extractor instead. The relevant configuration would be:
Reference Name: anything meaningful, e.g. id
XPath Query: substring-after(//tr[@data-project-key='AA']/td[@data-cell-type='name']/a/@id,'view-project-')
Check Use Tidy if your response is not XHTML-compliant
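To see what that XPath query selects, here is a rough sketch of the same lookup in Python's standard library. It is only an illustration outside the XPath Extractor: SAMPLE is a trimmed, well-formed copy of the table from the question, and project_pid is a hypothetical helper name. ElementTree supports only a subset of XPath, so the substring-after() step is done with plain string slicing.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the table from the question (assumption:
# the real response would first have to be made XML-compliant).
SAMPLE = """
<table id="project-list">
  <tbody>
    <tr data-project-key="AA">
      <td data-cell-type="name"><a id="view-project-10441" href="/plugins/servlet/project-config/AA/summary">AA project</a></td>
      <td data-cell-type="key">AA</td>
    </tr>
    <tr data-project-key="AAL">
      <td data-cell-type="name"><a id="view-project-10442" href="/plugins/servlet/project-config/AAL/summary">AAL project</a></td>
      <td data-cell-type="key">AAL</td>
    </tr>
  </tbody>
</table>
"""


def project_pid(xml_text, project_key):
    """Return the pid for project_key, or None if no matching row exists."""
    root = ET.fromstring(xml_text)
    # Same location path as the XPath query above, minus the @id step.
    link = root.find(
        f".//tr[@data-project-key='{project_key}']/td[@data-cell-type='name']/a"
    )
    if link is None:
        return None
    prefix = "view-project-"
    link_id = link.get("id", "")
    # Equivalent of substring-after(@id, 'view-project-') for ids with that prefix.
    return link_id[len(prefix):] if link_id.startswith(prefix) else None


print(project_pid(SAMPLE, "AA"))   # 10441
print(project_pid(SAMPLE, "AAL"))  # 10442
```

Because the predicate compares the whole attribute value, the key "AA" does not accidentally match the "AAL" row, which is the problem a bare pid=(\d....) pattern runs into.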
References:
XPath Tutorial
XPath Language Reference
I am using mechanize/nokogiri and need to parse an HTML page that contains a lot of tables like this:
<table width="100%" onclick="javascript:abredown('c7a8e8041a5031f127d5d27f3f071cbb');" class="buscaDestaque" bgcolor="#F7D36A">
<tr>
<td rowspan="2" scope="col" style="width:5%"><img src="images/gold.gif" border="0"></td>
<td scope="col" style="width:45%" class="mais"><b>Community - 2nd Season</b><br />Community - 2ª Temporada<br/><b>Downloads: </b> 2496 <b>Comentários: </b>17<br><b>Avaliação: </b> 10/10</td>
<td scope="col" style="width:20%">28/03/2011 - 21:07</td>
<td scope="col" style="width:20%">SubsOTF</td>
<td scope="col" style="width:10%"><img src='images/flag_br.gif' border='0'></td>
</tr>
<tr>
<td colspan="4">Release: <span class="brls">Community.S02E19.HDTV.XviD-LOL/DIMENSION</span></td>
</tr>
</table>
I want this output
Community.S02E19.HDTV.XviD-LOL/DIMENSION, ('c7a8e8041a5031f127d5d27f3f071cbb')
Can anyone help me?
require 'nokogiri'

html = Nokogiri::HTML html_with_many_tables
results = html.css('table.buscaDestaque').map do |table|
  jsid = table['onclick'][/'(\w+)'/, 1]
  brls = table.at_css('.brls').text
  "#{brls}, #{jsid}"
end

p results
#=> ["Community.S02E19.HDTV.XviD-LOL/DIMENSION, c7a8e8041a5031f127d5d27f3f071cbb",
#=>  "AnotherBRLS, anotherJSID"]
How could I use ruby to extract information from a table consisting of these rows? Is it possible to detect the comments using nokogiri?
<!-- Begin Topic Entry 4134 -->
<tr>
<td align="center" class="row2"><image src='style_images/ip.boardpr/f_norm.gif' border='0' alt='New Posts' /></td>
<td align="center" width="3%" class="row1"> </td>
<td class="row2">
<table class='ipbtable' cellspacing="0">
<tr>
<td valign="middle"><alink href='http://www.xxx.com/index.php?showtopic=4134&view=getnewpost'><image src='style_images/ip.boardpr/newpost.gif' border='0' alt='Goto last unread' title='Goto last unread' hspace=2></a></td>
<td width="100%">
<div style='float:right'></div>
<div> <alink href="http://www.xxx.com/index.php?showtopic=4134&hl=">EXTRACT LINK 1</a> </div>
</td>
</tr>
</table>
<span class="desc">EXTRACT DESCRIPTION</span>
</td>
<td class="row2" width="15%"><span class="forumdesc"><alink href="http://www.xxx.com/index.php?showforum=19" title="Living">EXTRACT LINK 2</a></span></td>
<td align="center" class="row1" width='10%'><alink href='http://www.xxx.com/index.php?showuser=1642'>Mr P</a></td>
<td align="center" class="row2"><alink href="javascript:who_posted(4134);">1</a></td>
<td align="center" class="row1">46</td>
<td class="row1"><span class="desc">Today, 12:04 AM<br /><alink href="http://www.xxx.com/index.php?showtopic=4134&view=getlastpost">Last post by:</a> <b><alink href='http://www.xxx.com/index.php?showuser=1649'>underft</a></b></span></td>
</tr>
<!-- End Topic Entry 4134 -->
Try using XPath instead:
html_doc = Nokogiri::HTML("<html><body><!-- Begin Topic Entry 4134 --></body></html>")
html_doc.xpath('//comment()')
You could implement a Nokogiri SAX parser. This is quicker to do than it might seem at first sight. You get events for elements, attributes and comments.
Within your parser, you should remember the state, e.g. @currently_interested = true, to know which parts to remember and which not.
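The state-tracking idea is not specific to Nokogiri. As a rough sketch of it, here is an event-driven parser written with Python's standard html.parser rather than Nokogiri's SAX API; TopicParser and its attribute names are hypothetical, and the sample input assumes ordinary a tags for the links.

```python
from html.parser import HTMLParser


class TopicParser(HTMLParser):
    """Collect link texts found between Begin/End Topic Entry comments."""

    def __init__(self):
        super().__init__()
        self.currently_interested = False  # toggled by the marker comments
        self.in_link = False               # True while inside an <a> element
        self.current_id = None
        self.topics = {}                   # topic id -> list of link texts

    def handle_comment(self, data):
        words = data.split()
        if len(words) == 4 and words[:3] == ["Begin", "Topic", "Entry"]:
            self.current_id = words[3]
            self.topics[self.current_id] = []
            self.currently_interested = True
        elif words[:3] == ["End", "Topic", "Entry"]:
            self.currently_interested = False

    def handle_starttag(self, tag, attrs):
        if self.currently_interested and tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        text = data.strip()
        if self.currently_interested and self.in_link and text:
            self.topics[self.current_id].append(text)


parser = TopicParser()
parser.feed(
    '<a href="#">ignored</a>'
    "<!-- Begin Topic Entry 4134 -->"
    '<tr><td><a href="#">EXTRACT LINK 1</a></td>'
    '<td><a href="#">EXTRACT LINK 2</a></td></tr>'
    "<!-- End Topic Entry 4134 -->"
)
print(parser.topics)  # {'4134': ['EXTRACT LINK 1', 'EXTRACT LINK 2']}
```

A Nokogiri SAX document would follow the same shape: a comment callback flips the flag, and the element and text callbacks only record data while the flag is set.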