Parsing HTML source code using AppleScript - html

I'm trying to parse an HTML file which I have converted to a TXT file inside of Automator.
I previously downloaded the HTML file from a website using Automator, and I am now struggling to parse the source code.
Preferably, I want to take the information of just the table and I need to repeat this action for 1800 different HTML files.
Here is an example of the source code:
</head>
<body>
<div id="header">
<div class="wrapper">
<span class="access">
<div id="fb-root"></div>
<span class="access">
Gold Account: <a class="upgrade" title="Account Details" href="http://www.hedge-professionals.com/account-details.html" >Active </a> Logged in as Edward | Sign Out
</span>
</span>
</div><!-- /wrapper -->
</div><!-- /header -->
<div id="masthead">
<div class="wrapper">
<a href="http://www.hedge-professionals.com" ><img src="http://www.hedge-professionals.com/images/hedgep_logo_white.png" alt="Hedge Professionals Database" width="333" height="46" class="logo" border="0" /></a>
<div id="navigation">
<ul>
<li ><a href='http://www.hedge-professionals.com/dashboard.html' >Dashboard</a></li> <li ><a href='http://www.hedge-professionals.com/people.html'class='current' >People</a></li><li ><a href='http://www.hedge-professionals.com/watchlists.html' >My Watchlists</a></li><li ><a href='http://www.hedge-professionals.com/my-searches.html' >My Searches</a></li><li ><a href='http://www.hedge-professionals.com/my-profile.html' >My Profile</a></li></ul>
</div><!-- /navigation -->
</div><!-- /wrapper -->
</div><!-- /masthead -->
<div id="content">
<div class="wrapper">
<div id="main-content">
<!-- per Project stuff -->
<span class="section">
<img src="http://www.hedge-professionals.com/images/people/noimage_53x53.jpg" alt="Christian Sieling" width="52" height="53" class="profile-pic" id="profile-pic-104947"/>
<h1><span id="profile-name-104947" >Christian Sieling</span></h1>
<ul class="gbutton-group right">
<li><a class="gbutton bold pill" href="http://www.hedge-professionals.com/people.html">« Back </a></li>
<li><a class="gbutton bold pill boxy on-click" href="http://www.hedge-professionals.com/addtoWatchlist.php?usr=114752" id="row-104947" title='Add to Watchlist' >Add to Watchlist</a></li>
</ul>
<div style="float:right;padding:3px 3px;text-align:center;margin-top:5px;" >
<span id="profile-updated-date" >Updated On: 4 Aug, 2010</span><br/>
<a class="gbutton bold pill" href="http://www.hedge-professionals.com/profile/suggest/people/104947/Christian-Sieling" style="margin:5px;" title='Report Inaccurate Data' >Report Inaccurate Data</a>
</div>
<h2><span id="profile-details-104947" > at <a href="http://www.hedge-professionals.com/quicksearch/search/Lumix+Capital+Management+Ltd." ><span title='Lumix Capital Management Ltd.' >Lumix Capital Management Ltd.</span></a></span><input type="hidden" name="sub-id" id="sub-id" value="114752"></h2>
</span>
<table width="100%" border="0" cellspacing="0" cellpadding="0" id="profile-table">
<tr>
<th>Role</th>
<td>
<p>Other</p> </td>
</tr>
<tr>
<th>Organisation Type</th>
<td>
<p>Asset Manager</p> </td>
</tr>
<tr>
<th>Email</th>
<td><a href="mailto:cs#lumixcapital.com" title="cs#lumixcapital.com" >cs#lumixcapital.com</a></td>
</tr>
<tr>
<th>Website</th>
<td><a href="http://www.lumixcapital.com/" target="_new" title="http://www.lumixcapital.com/" >http://www.lumixcapital.com/</a></td>
</tr>
<tr>
<th>Phone</th>
<td>41 78 616 7334</td>
</tr>
<tr>
<th>Fax</th>
<td></td>
</tr>
<tr>
<th>Mailing Address</th>
<td>Birrenstrasse 30</td>
</tr>
<tr>
<th>City</th>
<td>Schindellegi</td>
</tr>
<tr>
<th>State</th>
<td>CH</td>
</tr>
<tr>
<th>Country</th>
<td>Switzerland</td>
</tr>
<tr>
<th class="lastrow" >Zip/ Postal Code</th>
<td class="lastrow" >8834</td>
</tr>
</table>
</div><!-- /main-content -->
<div id="sidebar" >
</div>
<div id="similar_sidebar" class="similar_refine" >
</div>
</div><!-- /wrapper -->
</div><!-- /content -->
<div id="footer">
</div>
My AppleScript attempt that is using text item delimiters to extract the table in a similar fashion:
set p to input
set ex to extractBetween(p, "<table>", "</table>") -- extract the URL
to extractBetween(SearchText, startText, endText)
set tid to AppleScript's text item delimiters
set AppleScript's text item delimiters to startText
set endItems to text of text item -1 of SearchText
set AppleScript's text item delimiters to endText
set beginningToEnd to text of text item 1 of endItems
set AppleScript's text item delimiters to tid
return beginningToEnd
end extractBetween
How can I parse the table from the HTML file?

Rather than make your own HTML parser, you can exploit the HTML parser in Safari via the do javascript command. JavaScript has built-in functionality for working with HTML elements and data.
This script gets the HTML for just the first table in a page:
tell application "Safari"
tell document 1
set theFirstTableHTML to do JavaScript "document.getElementsByTagName('table')[0].innerHTML"
end tell
end tell
You can use this technique to apply basic DOM Scripting to any page and grab out any data that you want to read out. You can get just the values of the table cells, or whatever you want.

You're really close. The problem is your startText variable. The starting table tag is not in the html text so it can't be found. The line that starts the table is actually...
<table width="100%" border="0" cellspacing="0" cellpadding="0" id="profile-table">
So I modified your code to look for that tag in 2 steps. First...
<table
And then this separately...
>
In this way we can ignore all of the code that comes with the table tag (width, border etc.) because I assume it will vary between the files. After doing this we get only the code of the table. Try this...
set p to input
set ex to extractBetween(p, "<table", ">", "</table>")
to extractBetween(SearchText, startText1, startText2, endText)
set tid to AppleScript's text item delimiters
set AppleScript's text item delimiters to startText1
set endItems to text item -1 of SearchText
set AppleScript's text item delimiters to endText
set beginningToEnd to text item 1 of endItems
set AppleScript's text item delimiters to startText2
set finalText to (text items 2 thru -1 of beginningToEnd) as text
set AppleScript's text item delimiters to tid
return finalText
end extractBetween

Try:
set xxx to read alias "Mac OS X:Users:paolo:Desktop:paolo.html"
set yyy to do shell script "echo " & quoted form of xxx & " | grep -o \\<table.*table\\>"

One-line wonder that works:
tell application "Safari" to set sourceCode to characters (offset of <table in (source of document 1 as string)) thru ((offset of "/table" in (source of document 1 as string)) + (count of "/table")) of (source of document 1 as string) as string

Related

Unable to extract value using xpath query

Learning to use xpath queries. I am having an issue were I am unable to extract a value that changes whenever the page is refreshed.
For example, I am trying to extract the value '62804' from the following html code: "canvas.strokeText('Answer: 62804',90,112);" . Any ideas how this can be done. Thanks
<html>
<div id="content" class="large-12 columns">
<div class="example">
<h3>Challenging DOM</h3>
<p>The hardest part in automated web testing is finding the best locators (e.g., ones that well named, unique, and unlikely to change). It's more often than not that the application you're testing was not built with this concept in mind. This example demonstrates that with unique IDs, a table with no helpful locators, and a canvas element.</p>
<hr>
<div class="row">
<div class="large-12 columns large-centered">
<div class="large-2 columns">
<a id="debcda40-b692-0137-457b-2213fbd48497" href="" class="button">qux</a><br>
<a id="debce410-b692-0137-457c-2213fbd48497" href="" class="button alert">baz</a><br>
<a id="debd03d0-b692-0137-457d-2213fbd48497" href="" class="button success">foo</a><br>
</div>
<div class="large-10 columns">
<table>
<thead>
<tr>
<th>Lorem</th>
<th>Ipsum</th>
<th>Dolor</th>
<th>Sit</th>
<th>Amet</th>
<th>Diceret</th>
<th>Action</th>
</tr>
</thead>
<tbody>
<tr>
<td>Iuvaret0</td>
<td>Apeirian0</td>
<td>Adipisci0</td>
<td>Definiebas0</td>
<td>Consequuntur0</td>
<td>Phaedrum0</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret1</td>
<td>Apeirian1</td>
<td>Adipisci1</td>
<td>Definiebas1</td>
<td>Consequuntur1</td>
<td>Phaedrum1</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret2</td>
<td>Apeirian2</td>
<td>Adipisci2</td>
<td>Definiebas2</td>
<td>Consequuntur2</td>
<td>Phaedrum2</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret3</td>
<td>Apeirian3</td>
<td>Adipisci3</td>
<td>Definiebas3</td>
<td>Consequuntur3</td>
<td>Phaedrum3</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret4</td>
<td>Apeirian4</td>
<td>Adipisci4</td>
<td>Definiebas4</td>
<td>Consequuntur4</td>
<td>Phaedrum4</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret5</td>
<td>Apeirian5</td>
<td>Adipisci5</td>
<td>Definiebas5</td>
<td>Consequuntur5</td>
<td>Phaedrum5</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret6</td>
<td>Apeirian6</td>
<td>Adipisci6</td>
<td>Definiebas6</td>
<td>Consequuntur6</td>
<td>Phaedrum6</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret7</td>
<td>Apeirian7</td>
<td>Adipisci7</td>
<td>Definiebas7</td>
<td>Consequuntur7</td>
<td>Phaedrum7</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret8</td>
<td>Apeirian8</td>
<td>Adipisci8</td>
<td>Definiebas8</td>
<td>Consequuntur8</td>
<td>Phaedrum8</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret9</td>
<td>Apeirian9</td>
<td>Adipisci9</td>
<td>Definiebas9</td>
<td>Consequuntur9</td>
<td>Phaedrum9</td>
<td>
edit
delete
</td>
</tr>
</tbody></table>
<div class="row">
<div class="large-10 columns">
<canvas id="canvas" width="599" height="200" style="border:1px dotted;"></canvas>
</div>
</div>
</div>
</div>
</div>
<hr>
</div>
<script>
var canvas_el = document.getElementById('canvas');
var canvas = canvas_el.getContext('2d');
canvas.font = '60px Arial';
canvas.strokeText('Answer: 62804',90,112);
</script>
</div>
</html>
In order to use the XPath query the input document must be a valid XML.
In your case it isn't, because there are some tags that are not properly closed (you can verify it using an XMLLint tool).
E.g.
<hr> and <br> should be replaced with <hr/> and <br/>.
Once the XML is corrected, you can use an XPath query.
The fist step is select the script element:
//script
The output is:
Element='<script>
var canvas_el = document.getElementById('canvas');
var canvas = canvas_el.getContext('2d');
canvas.font = '60px Arial';
canvas.strokeText('Answer: 62804',90,112);
</script>'
Then you have to convert the Element Node in a String and then perform some parsing:
substring-before(substring-after(//script/text(), 'canvas.strokeText(''Answer: ') , ''',90,112)')
The result is the following:
String='62804'
Note: You can do the same operation in a more elastic way using Javascript, for example.
XPath is very good to query an XML (like the first operation that I mentioned) but quite complicated to do String parsing (like the second operation that I mentioned).
Hope it can help.

How to filter url links with criteria via beautifulsoup? is it possible? YES indeed

There are always some new posts in any forum. The one I visited gives a "new" sticker to the post. How do i filter and retrieve the URLs with new stickers? Tricky...
I usually just grabbed off first page. But it seems unprofessional. Actually there are also author and date stickers in each section. Can these be filtering criteria via beautifulsoup? I am feeling so much to learn.
This is the DOM:
<!-- 三級置頂分開 -->
<tbody id="stickthread_10432064">
<tr>
<td class="folder"><img src="images/green001/folder_new.gif"/></td>
<td class="icon">
  </td>
<th class="new">
<label>
<img alt="" src="images/green001/agree.gif"/>
<img alt="本版置顶" src="images/green001/pin_1.gif"/>
 </label>
<em>[痴女]</em> <span id="thread_10432064">(セレブの友)(CESD-???)大槻ひびき</span>
<img alt="附件" class="attach" src="images/attachicons/common.gif"/>
<span class="threadpages"> <img src="images/new2.gif"/></span> ### new sticker
</th>
<td class="author"> ### author sticker
<cite>
新片<img align="absmiddle" border="0" src="images/thankyou.gif"/>12 </cite>
<em>2019-4-23</em> ### date sticker
</td>
<td class="nums"><strong>6</strong> / <em>14398</em></td>
<td class="nums">7.29G / MP4
</td>
<td class="lastpost">
<em>2019-4-25 14:11</em>
<cite>by 22811</cite>
</td>
</tr>
</tbody><!-- 三級置頂分開 -->
Let's put it this way, it seems that I didn't express myself well enough. What i'm saying is this: for example, I wanna find all 'tbody' with either 'author' of 新片, or 'date' of 2019-4-23, or with a sticker called "images/new2.gif". I would get a lists of tbodys presumably, and then, I wanna find the href in them via
blue = soup.find_all('a', style="font-weight: bold;color: blue")
Thanks chiefs!
There is a class new so I am wondering if you could just use that? That would be:
items = soup.select('tbody:has(.new)')
for item in items:
print([i['href'] for i in item.select('a')])
Otherwise, you can use :has and :contains pseudo classes (bs4 4.7.1) to specify those patterns
items = soup.select('tbody:has(.author a:contains("新片")), tbody:has(em:contains("2019-4-23")), tbody:has([src="images/new2.gif"])')
You can then get hrefs with a loop
for item in items:
print([i['href'] for i in item.select('a')])
First you need to find out the parent tag and then need to find the next sibling and then find the respective tag.Hope you will get your answer.try below code.
from bs4 import BeautifulSoup
import re
data='''<tbody id="stickthread_10432064">
<tr>
<td class="folder"><img src="images/green001/folder_new.gif"/></td>
<td class="icon">
</td>
<th class="new">
<label>
<img alt="" src="images/green001/agree.gif"/>
<img alt="本版置顶" src="images/green001/pin_1.gif"/>
</label>
<em>[痴女]</em> <span id="thread_10432064">(セレブの友)(CESD-???)大槻ひびき</span>
<img alt="附件" class="attach" src="images/attachicons/common.gif"/>
<span class="threadpages"> <img src="images/new2.gif"/></span> ### new sticker
</th>
<td class="author"> ### author sticker
<cite>
新片<img align="absmiddle" border="0" src="images/thankyou.gif"/>12 </cite>
<em>2019-4-23</em> ### date sticker
</td>
<td class="nums"><strong>6</strong> / <em>14398</em></td>
<td class="nums">7.29G / MP4
</td>
<td class="lastpost">
<em>2019-4-25 14:11</em>
<cite>by 22811</cite>
</td>
</tr>
</tbody>'''
soup=BeautifulSoup(data,'html.parser')
for item in soup.find_all('img',src=re.compile('images/new')):
parent=item.parent.parent
print(parent.find_next_siblings('td')[0].find('a').text)
print(parent.find_next_siblings('td')[0].find('em').text)

Thymeleaf: From controller to html

I have a table of values with an href.
<tr th:each="note : ${test}">
<td th:text="${note.name}">></td>
<td th:text="${note.lastName}"></td>
<td th:text="${note.studentId}"></td>
<td>
<a th:href="#{'/seeStudent/' + ${note.studentId}}">Ver</a>
</td>
</tr>
This is my controller:
#GetMapping("/seeStudent/{id}")
public String getStudentById(#PathVariable Long id, Model model) {
model.addAttribute("student", repo.findById(id));
return "seeStudent";
}
This is my HTML:
<div>
<h1>Student information</h1>
<ul>
<div th:object="${student}">
<li>
<h4>
<span th:text="${name}"></span>
</h4>
<h4>
<span th:text="${lastName}"></span>
</h4>
</li>
</div>
</ul>
</div>
For some reason the display is comming up blank, it is as if it wasn't finding the student by Id. When I debug it does locate the student and return it. I believe I may be doing something wrong when trying to pull the object data and display it in my html. Any ideas?
If you're using th:object, your expression should be (with an asterisk) *{name} and *{lastName}. If you want to use the equivalent (dollar sign) ${...} expression, it would be ${student.name} and ${student.lastName}.

how display grids using table

Am using grid to diplay data from my data base. I would display grids in a table of 3 columns inthis way :
I try this code
<table style="width:70%" ng-repeat="e in events" >
<tr>
<td>
<div class="col-md-3 ticket-grid" >
<div class="tickets">
<div class="grid-right">
<font color="red"><h3>{{e.name}}</h3></font>
Location: {{e.loc}}<br>
Category: Sport <br>
Start date: <br>
End date:
Description: <span>{{e.description}}</span> <br>
Contact: {{e.contact}}<br>
Confirm
Refuse
</div>
<div class="clearfix"> </div>
</div>
</td>
</tr>
<tr>
</tr>
<tr>
</tr>
</table>
PS: events is an array that contains my data.
But i get this result
Any help please
You need to add the ng-repeat to the element you want repeated. In your version you are creating new <table> with all contents for each record in events. Moving it to the <td> will leave you with same issue as it will just create a new row for each record.
If you are using bootstrap - you should just use the grid system instead of the <table> for layouts, something like.
<div ng-repeat="e in events">
<div class="col-md-4 ticket-grid" >
<div class="tickets">
<div class="grid-right">
<font color="red"><h3>{{e.name}}</h3></font>
Location: {{e.loc}}<br>
Category: Sport <br>
Start date: <br>
End date:
Description: <span>{{e.description}}</span> <br>
Contact: {{e.contact}}<br>
Confirm
Refuse
</div>
<div class="clearfix"> </div>
</div>
</div>
You will need something like, to break out of the ng-repeat loop every 3 records:
<div class="clearfix" ng-if="$index % 3 == 0"></div>
See Plunker examples for more of an idea here

jQuery Mobile append string issues

What does it mean if you cannot expand a DOM node in the Chrome Elements view tool?
I have manually appended a very long string of html containing nested elements, but when I view the rendered page and examine the elements, only the first level element exists, with a ... for the text field. When I try to expand the selection to see the inner contents, the closing tag for the only visible element goes away, and none of the contents appear visible.
Any ideas? Malformed HTML? I really don't think it's incorrectly formatted...
I am using jQuery Mobile for this.
EDIT: Here is how I got the page to render what I wanted.
$('#myID').append(
'<li id="'+id+'" name="chat2" dsid="chat2_'+j+'" class="'+currentPage+'_chat2 ui-li-static ui-body-inherit ui-btn-up-a ui-first-child" data-icon="false">
<div class="ui-li-static-container">
<div class="'+currentPage+'_itemGrid2_wrapper" data-wrapper-for="itemGrid2" dsrefid="chat2" _idx="_'+j+'">
<table id="'+currentPage+'_itemGrid2" class="'+currentPage+'_itemGrid2" dsid="itemGrid2" name="itemGrid2" cellpadding="0" cellspacing="0">
<colgroup>
<col style="width:auto;">
<col style="width:172px;">
</colgroup>
<tbody>
<tr class="'+currentPage+'_itemGrid2_row_0">
<td id="'+currentPage+'_picCell2_'+j+'" name="picCell2" class="'+currentPage+'_picCell2" colspan="1" rowspan="1">
<div class="cell-wrapper"><img class="'+currentPage+'_itemImage2"src="'+image+'" id="'+currentPage+'_itemImage2" dsid="itemImage2" name="itemImage2">
</div>
</td>
<td id="'+currentPage+'_Cell2_'+j+'" name="Cell2" class="'+currentPage+'_Cell2" colspan="1" rowspan="1">
<div class="cell-wrapper">
<div name="sellerName" id="'+currentPage+'_sellerName" dsid="sellerName" data-role="appery_label" class="messagePerson '+currentPage+'_sellerName">'+sellerName+'</div>
<div name="lastMessage2" id="'+currentPage+'_lastMessage2" dsid="lastMessage2" data-role="appery_label" class="'+currentPage+'_lastMessage2">'+message+'</div>
<div name="productID2" id="'+currentPage+'_productID2" dsid="productID2" data-role="appery_label" class="'+currentPage+'_productID2">'+productID+'</div>
<div name="messageID2" id="'+currentPage+'_messageID2" dsid="messageID2" data-role="appery_label" class="'+currentPage+'_messageID2">'+messageID+'</div>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
</li>');
Figured I'de put the whole in there in case of syntax..
EDIT2:
The only reason I am doing this clunky solution is because I can't append a collapsed DOM js variable.
for(var i = 0; i < this.cache[0].length; i++){
var div = Appery("chat1").clone();
div.attr("id",currentPage+"_chat1_"+i);
div.find(Appery("messageID")).text(this.cache[0][i]._id);
div.find(Appery("buyerName")).text(this.cache[0][i].user_name);
div.find(Appery("productID")).text(this.cache[0][i].productID);
Appery("allChats1").append(div);
}