How can I traverse HTML DOM with Swift - html

I have an http POST response which I receive in HTML. Now I want to display the results in my view Controller. How can I parse the DOM of the response to get the elements I want?
This is the response in raw html:
<tr>
<td style="text-align:center;">1</td>
<td>9.99</td>
<td style="text-align:center;" class="show_on_masters hide"></td>
<td style="text-align:center;">1.4</td>
<td>DE GRASSE, ANDRE</td>
<td style="text-align:center;">ON</td>
<td>
<div data-tooltip="Speed Academy Athletics Club">SAAC</div>
</td>
<td>94</td>
<td>2</td>
<!--<td class="rankings_hide_992">UF Tom Jones Invitational (Olympic Development)</td>-->
<!--<td class="rankings_hide_768">Gainesville , FL</td>-->
<td>
<div data-tooltip="UF Tom Jones Invitational (Olympic Development)" style="cursor:default;">Gainesville , FL</div>
</td>
<td>17/04/2021</td>
</tr>
<tr>
<td style="text-align:center;">2</td>
<td>10.08</td>
<td style="text-align:center;" class="show_on_masters hide"></td>
<td style="text-align:center;">1.9</td>
<td>BROWN, AARON</td>
<td style="text-align:center;">ON</td>
<td>
<div data-tooltip="Phoenix Athletics Assoc. of Ontario">PHNX</div>
</td>
<td>92</td>
<td>7</td>
<!--<td class="rankings_hide_992">World Athletics - Miramar</td>-->
<!--<td class="rankings_hide_768">Miramar, FL</td>-->
<td>
<div data-tooltip="World Athletics - Miramar" style="cursor:default;">Miramar, FL</div>
</td>
<td>10/04/2021</td>
</tr>
<tr>
<td style="text-align:center;">3</td>
<td>10.14</td>
<td style="text-align:center;" class="show_on_masters hide"></td>
<td style="text-align:center;">0.7</td>
<td>WARNER, DAMIAN</td>
<td style="text-align:center;">ON</td>
<td>
<div data-tooltip="London Western T.F.C.">LWTF</div>
</td>
<td>89</td>
<td>1dec5</td>
<!--<td class="rankings_hide_992">Hypo-Meeting</td>-->
<!--<td class="rankings_hide_768">Götzis, AUT</td>-->
<td>
<div data-tooltip="Hypo-Meeting" style="cursor:default;">Götzis, AUT</div>
</td>
<td>29/05/2021</td>
</tr>
I'm currently trying to use HTMLKit based on a couple tutorials, but I can't truly traverse the DOM with this library. Any ideas?
HTMLKit Tutorial
HTMLKit Video Tutorial

You can try the SwiftSoup library, which parses HTML and lets you query the resulting document with CSS selectors.
Usage
do {
let html: String = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
let doc: Document = try SwiftSoup.parse(html)
let link: Element = try doc.select("a").first()!
let text: String = try doc.body()!.text(); // "An example link."
let linkHref: String = try link.attr("href"); // "http://example.com/"
let linkText: String = try link.text(); // "example"
let linkOuterH: String = try link.outerHtml(); // "<a href="http://example.com/"><b>example</b></a>"
let linkInnerH: String = try link.html(); // "<b>example</b>"
} catch Exception.Error(let type, let message) {
print(message)
} catch {
print("error")
}

Related

Xpath grep elements

I'm using Scrapy (Python) to try to grep data from a site.
How can I grep this structure with XPath?
<div class="foo">
<h3>Need this text_1</h3>
<table class="thesamename">
<tbody>
<tr>
<td class="tmp_year">
45767
</td>
<td class="tmp_outcome">
<b>Win_1</b><br>
<span class="tmp_category">TEST_1</span>
</td>
</tr>
<tr>
<td class="tmp_year">
1232004
</td>
<td class="tmp_outcome">
<b>Win_2</b><br>
<span class="tmp_category">TEST_2</span>
</td>
</tr>
<tr>
<td class="tmp_year">
122004
</td>
<td class="tmp_outcome">
<b>Win_3</b><br>
<span class="tmp_category">TEST_3</span>
</td>
</tr>
</tbody>
<h3>Need this text_2</h3>
<table class="thesamename">
<tbody>
<td class="tmp_year">
234
</td>
<td class="tmp_outcome">
<b>Win_E</b><br>
<span class="tmp_category">TEST_E</span>
</td>
</tr>
<tr>
<td class="tmp_year">
3476
</td>
<td class="tmp_outcome">
<b>Win_C</b><br>
<span class="tmp_category">TEST_C</span>
</td>
</tr>
</tbody>
<h3>Need this text_3</h3>
<table class="thesamename">
<tbody>
<tr>
<td class="tmp_year">
85567
</td>
<td class="tmp_outcome">
<b>Win_T</b><br>
<span class="tmp_category">TEST_T</span>
</td>
</tr>
<tr>
<td class="tmp_year">
435656
</td>
<td class="tmp_outcome">
<b>Win_A</b><br>
<span class="tmp_category">TEST_A</span>
</td>
</tr>
<tr>
<td class="tmp_year">
980
</td>
<td class="tmp_outcome">
<b>Win_Z</b><br>
<span class="tmp_category">TEST_Z</span>
</td>
</tr>
</tbody>
I would like to have output with this structure:
"Section": {
Need this text_1 :
[45767 : Win_1 : TEST_1]
[1232004 : Win_2 : TEST_2]
[122004: Win_3 : TEST_3]
,
Need this text_2:
[234 : Win_E : TEST_E]
[3476 : Win_C : TEST_C]
,
Need this text_3:
[85567 : Win_T : TEST_T]
[435656 : Win_A : TEST_A]
[980: Win_Z : TEST_Z]
}
How can I write an XPath selector that produces this structure?
I can select all the "h3" elements, all the "a" elements, and all the tags with a given class separately, but how can I match them up?
Grep, you say? Strictly speaking that is the wrong word: what you are doing is parsing, or extracting. Are you new to Scrapy, or to the web-development side of things? Either way, I can't teach you XPath and regular expressions in a single answer; the only way to learn them is to keep at it. Here is my input, though.
First of all, XPath is amazingly useful when it comes to websites that are not necessarily built to standard, which doesn't make them bad per se. The HTML snippet you gave is well structured, though, so I'd recommend CSS selectors instead. These are the values:
year = response.css('td.tmp_year a::text').extract()
outcome = response.css('td.tmp_outcome b::text').extract()
category= response.css('span.tmp_category::text').extract()
Pro tip: whenever you find it necessary, you can save a web page as an HTML file and open it in scrapy shell by passing the file path. I saved your HTML snippet to a file on my desktop and then ran:
scrapy shell file:///home/scriptso/Desktop/letsGREPlol.html
Anyway, as far as XPath goes, since you asked: it's a piece of cake. Compare the XPath with the CSS selector and see whether you can spot the correspondence:
response.css('td.tmp_outcome b::text').extract()
Here td is the tag and tmp_outcome is its class name; the next node is the bold tag, which holds the text, and ::text declares that we want that text.
response.xpath('//td[@class="tmp_outcome"]/b/text()').extract()
The XPath says the same thing: start anywhere in the document with a td tag whose class is tmp_outcome, then take the bold tag. In XPath, /text() selects the text, and /@href selects, you guessed it, the href attribute.
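The selectors above return three flat lists, which does not yet give the per-heading grouping the question asks for. Below is a minimal sketch of one way to group the rows, using Python's standard-library html.parser as a stand-in for Scrapy selectors (an assumption made so the example runs without Scrapy installed). The class name SectionParser and the trimmed SAMPLE markup are hypothetical. An event-based parser is used because the posted HTML never closes its tables and is missing one opening tr tag.

```python
from html.parser import HTMLParser


class SectionParser(HTMLParser):
    """Group [year, outcome, category] rows under the most recent <h3> heading."""

    def __init__(self):
        super().__init__()
        self.sections = {}     # heading text -> list of rows
        self._heading = None   # text of the last <h3> seen
        self._field = None     # field that the next text chunk belongs to
        self._row = {}         # fields collected for the current row

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if tag == "h3":
            self._field = "heading"
        elif tag == "td" and cls == "tmp_year":
            self._field = "year"
        elif tag == "b":
            self._field = "outcome"
        elif tag == "span" and cls == "tmp_category":
            self._field = "category"

    def handle_data(self, data):
        text = data.strip()
        if not text or self._field is None:
            return
        if self._field == "heading":
            self._heading = text
            self.sections[text] = []
        else:
            self._row[self._field] = text
        self._field = None

    def handle_endtag(self, tag):
        # A row is complete once the category <span> closes.
        if tag == "span" and len(self._row) == 3:
            row = [self._row["year"], self._row["outcome"], self._row["category"]]
            self.sections.setdefault(self._heading, []).append(row)
            self._row = {}


# Trimmed copy of the markup from the question, including its unclosed tables.
SAMPLE = """
<div class="foo">
<h3>Need this text_1</h3>
<table class="thesamename">
<tbody>
<tr>
<td class="tmp_year">
45767
</td>
<td class="tmp_outcome">
<b>Win_1</b><br>
<span class="tmp_category">TEST_1</span>
</td>
</tr>
</tbody>
<h3>Need this text_2</h3>
<table class="thesamename">
<tbody>
<td class="tmp_year">
234
</td>
<td class="tmp_outcome">
<b>Win_E</b><br>
<span class="tmp_category">TEST_E</span>
</td>
</tr>
</tbody>
"""

parser = SectionParser()
parser.feed(SAMPLE)
parser.close()
for heading, rows in parser.sections.items():
    print(heading, rows)
```

Inside a Scrapy callback the same idea would presumably be fed response.text; with well-formed markup, iterating over each heading and its following table with selectors would be the more idiomatic route.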

How to parse tables without id on HTML using HtmlAgilityPack

I have a problem getting the values of an HTML table because it doesn't have an id. I need to get all the values in the second column and store them in an array. I am using HtmlAgilityPack, and my problem comes when selecting nodes:
Dim doc As HtmlDocument
Dim web As New HtmlWeb()
Dim str As String
doc = Web.Load("http://www.dietas.net/tablas-y-calculadoras/tabla-de-composicion-nutricional-de-los-alimentos/carnes-y-derivados/aves/pechuga-de-pollo.html#")
Dim nodes_filas As HtmlNode() = doc.DocumentNode.SelectNodes("//table[@id='']//tr").ToArray
Dim nodes_columnas As HtmlNode() = doc.DocumentNode.SelectNodes("//td").ToArray
For Each row As HtmlNode In nodes_filas
For Each column As HtmlNode In nodes_columnas
str = column.InnerHtml & vbCrLf
Next
Next
This is the table:
<table cellspacing="1" cellpadding="3" width="100%" border="0">
<tr>
<td colspan="2" style="font-size:13px;color:#55711C;padding-bottom:5px;">Aporte por ración</td>
</tr>
<tr style="background-color:#EBEBEB">
<td width="125">Energía [Kcal]</td>
<td class="td_right">145,00</td>
</tr>
<tr>
<td>Proteína [g]</td>
<td class="td_right">22,20</td>
</tr>
<tr style="background-color:#EBEBEB">
<td>Hidratos carbono [g]</td>
<td class="td_right">0,00</td>
</tr>
<tr>
<td>Fibra [g]</td>
<td class="td_right">0,00</td>
</tr>
<tr style="background-color:#EBEBEB">
<td>Grasa total [g]</td>
<td class="td_right">6,20</td>
</tr>
<tr>
<td>AGS [g]</td>
<td class="td_right">1,91</td>
</tr>
<tr style="background-color:#EBEBEB">
<td>AGM [g]</td>
<td class="td_right">1,92</td>
</tr>
<tr>
<td>AGP [g]</td>
<td class="td_right">1,52</td>
</tr>
<tr style="background-color:#EBEBEB">
<td>AGP /AGS</td>
<td class="td_right">0,79</td>
</tr>
<tr>
<td>(AGP + AGM) / AGS</td>
<td class="td_right"> 1,80</td>
</tr>
<tr style="background-color:#EBEBEB">
<td>Colesterol [mg]</td>
<td class="td_right">62,00</td>
</tr>
<tr>
<td>Alcohol [g]</td>
<td class="td_right">0,00</td>
</tr>
<tr style="background-color:#EBEBEB">
<td>Agua [g]</td>
<td class="td_right">71,60</td>
</tr>
</table>
Sorry, I don't have VB installed, but the C# version should be enough to give you the idea. The value cells have the td_right class, so you can query them with either a lambda or XPath.
I prefer the lambda/LINQ version because I am familiar with LINQ and don't need to remember XPath syntax.
Lambda:
public static bool HasClass(this HtmlNode node, params string[] classValueArray)
{
var classValue = node.GetAttributeValue("class", "");
var classValues = classValue.Split(' ');
return classValueArray.All(c => classValues.Contains(c));
}
var url = "http://www.dietas.net/tablas-y-calculadoras/tabla-de-composicion-nutricional-de-los-alimentos/carnes-y-derivados/aves/pechuga-de-pollo.html#";
var htmlWeb = new HtmlWeb();
var htmlDoc = htmlWeb.Load(url);
var nodes = htmlDoc.DocumentNode.Descendants("td").Where(_ => _.HasClass("td_right")).Select(_ => _.InnerText);
XPATH:
var nodes2 = htmlDoc.DocumentNode.SelectNodes("//td[@class='td_right']");

How to populate an array with text from html webscraping in ruby

I have used the nokogiri ruby gem to webscrape an html file for only the text under the tableData class. The html code is setup like so:
<div class="table-wrap">
<table class="table">
<tbody>
<tr>
<td class="tableData"> Jane Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
<tr>
<td class="tableData">John Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
</tbody>
</table>
</div>
and the code I used to webscrape looks like this:
vt = page.css("td[class='tableData']").text
puts vt
Which gives this output:
Jane Doe 01/01/201701/09/2017 VacationJohn Doe 01/01/201701/09/2017 Vacation
I want to populate an array within an array with only the 4 text values pertaining to each person. Which should look like this:
[[Jane Doe, 01/01/2017, 01/09/2017, Vacation], [John Doe, 01/01/2017, 01/09/2017, Vacation]]
I am new to coding and I'm not sure how to create a for loop to iterate over either the HTML itself or the vt variable to produce an array of arrays. I know there are some push statements involved following the for loop, but it's the actual structure of the for loop that I am having trouble putting together. If you could explain in your answer how the for loop works in this situation, it would be much appreciated.
This is the basic structure you need. The key method is map:
html=%q(<div class="table-wrap">
<table class="table">
<tbody>
<tr>
<td class="tableData"> Jane Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
<tr>
<td class="tableData">John Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
</tbody>
</table>
</div>)
require 'nokogiri'
doc = Nokogiri::XML(html)
array = doc.xpath('//tr').map do |tr|
tr.xpath('td').map{ |td| td.text }
end
p array
# [[" Jane Doe", " 01/01/2017", "01/09/2017 ", "Vacation"], ["John Doe", " 01/01/2017", "01/09/2017 ", "Vacation"]]
Try parsing the snippet as XML, finding all "tr" elements via XPath, and collecting their "td//text()" children:
require 'nokogiri'
doc = Nokogiri::XML(get_html_snippet)
data = doc.xpath('//tr').map do |tr|
tr.xpath('td').map { |td| td.text.strip }
end
data # => [["Jane Doe", "01/01/2017", "01/09/2017", "Vacation"], ["John Doe", "01/01/2017", "01/09/2017", "Vacation"]]
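Since the question asks how the loop itself is structured, here is the same nested iteration written out with explicit for loops and append calls. This is a sketch in Python rather than Ruby, with the standard library's xml.etree.ElementTree standing in for Nokogiri (both are assumptions made for illustration); each map call above corresponds to one loop here.

```python
import xml.etree.ElementTree as ET

# The snippet from the question; it is well-formed, so it can be parsed as XML.
SNIPPET = """<div class="table-wrap">
<table class="table">
<tbody>
<tr>
<td class="tableData"> Jane Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
<tr>
<td class="tableData">John Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
</tbody>
</table>
</div>"""

root = ET.fromstring(SNIPPET)

rows = []
for tr in root.iter("tr"):            # outer loop: one pass per table row
    cells = []
    for td in tr.findall("td"):       # inner loop: one pass per cell in that row
        cells.append(td.text.strip()) # push the trimmed cell text onto the inner array
    rows.append(cells)                # push the finished row onto the outer array

print(rows)
```

The outer loop builds the array of people and the inner loop builds the four values for one person, which is why the result is an array of arrays.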

Using Reg expressions to search through HTML? [swift 1.2]

I'm trying to perform a screen scrape because I can't find a relevant free API to get the data I need. I've managed to perform the scrape and grab the HTML page, but the part I'm stuck on is getting the relevant information out of the grabbed content. I'm guessing I will need to use regular expressions to search through the HTML, but I'm unsure how to do this. The information I'm after is the MAKE, MODEL, and YEAR of the current car search.
var url = NSURL(string: "https://www.rac.co.uk/buying-a-car/car-passport/report/buyer/purchase/?BuyerVrm=yg06dxt")
if url != nil {
let task = NSURLSession.sharedSession().dataTaskWithURL(url!, completionHandler: { (data, response, error) -> Void in
print(data)
if error == nil {
var urlContent = NSString(data: data, encoding: NSASCIIStringEncoding) as NSString!
print(urlContent)
}
})
task.resume()
}
}
Here's a sample of the returned information:
<p class="CarMiniProfile-caveat u-hidden">*image for illustrative purposes only</p>
<div>
<table class="CarMiniProfile-table">
<tbody>
<tr class="CarMiniProfile-tableFirstRow">
<td class="CarMiniProfile-tableHeader">
Make
</td>
<td>
FIAT
</td>
</tr>
<tr>
<td class="CarMiniProfile-tableHeader">
Model
</td>
<td>
PUNTO SPORTING M-JET
</td>
</tr>
<tr>
<td class="CarMiniProfile-tableHeader">
Colour
</td>
<td>
BLUE
</td>
</tr>
<tr>
<td class="CarMiniProfile-tableHeader">
Year
</td>
<td>
2006
</td>
</tr>
<tr>
<td class="CarMiniProfile-tableHeader">
Engine Size
</td>
<td>
1910 cc
</td>
</tr>
</tbody>
</table>
</div>
<h3 class="CarMiniProfile-subheading">Check this car in 3 simple steps...</h3>
Using regexes for html isn't a good idea, I agree. Sometimes I've had to do some real nasty stuff with regexes and html.
If you absolutely must do it this way then here's one for MAKE:
<td.*?CarMiniProfile-tableHeader.*?\n*(.*?)\n*<\/td>
You should be able to customise this for everything else you need. Using regexes is definitely not a recommended solution for this though.
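To experiment with this kind of pattern outside the app, here is a rough sketch in Python (not Swift) that pairs each header cell with the value cell that follows it. The pattern is a variation written for this illustration, not the exact regex above, and it assumes the markup stays laid out like the sample in the question; the names SAMPLE, ROW and details are hypothetical.

```python
import re

# Trimmed copy of the table from the question.
SAMPLE = """<table class="CarMiniProfile-table">
<tbody>
<tr class="CarMiniProfile-tableFirstRow">
<td class="CarMiniProfile-tableHeader">
Make
</td>
<td>
FIAT
</td>
</tr>
<tr>
<td class="CarMiniProfile-tableHeader">
Model
</td>
<td>
PUNTO SPORTING M-JET
</td>
</tr>
<tr>
<td class="CarMiniProfile-tableHeader">
Year
</td>
<td>
2006
</td>
</tr>
</tbody>
</table>"""

# A header cell followed by the next plain <td>:
# group 1 is the label, group 2 is the value, both without surrounding whitespace.
ROW = re.compile(
    r'<td[^>]*CarMiniProfile-tableHeader[^>]*>\s*(.*?)\s*</td>\s*<td>\s*(.*?)\s*</td>',
    re.DOTALL,
)

details = {label: value for label, value in ROW.findall(SAMPLE)}
print(details["Make"], details["Model"], details["Year"])
# prints: FIAT PUNTO SPORTING M-JET 2006
```

Any change to the site's markup (extra attributes on the value cell, nested tags) would break the pattern, which is the usual argument for an HTML parser over regexes.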

Generate a pdf from html page by using jspdf in angularjs

I am trying to generate a PDF from an HTML table using jsPDF. The PDF is generated, but its layout does not match the original page.
This is my code.
The HTML is:
<div class="invoice" id="customers">
<table ng-repeat="aim in input" id="example">
<tr>
<th class="inv-left"><div align="left"><img src="./images/logo.png" alt=""></div></th>
<th class="inv-right"><div align="right"><br>
101 Convention Center<br>
dr #700, Las Vegas, <br>
NV - 89019
</div></th>
</tr>
<tr >
<th><div cg-busy="{promise:viewPromise}" align="left">
<b>Invoiced to</b><br>
{{aim.user.username}}<br>
{{aim.vendor.address}}
</div></th>
<th class="inv-right">
<div align="right"><b>INVOICE</b><br>
Invoice ID: {{aim.invoiceId}}<br>
Invoice Date: {{aim.invoiceDate.date| dateFormat | date:'MM-dd-yyyy'}}<br>
Due Date: {{aim.dueDate.date| dateFormat | date:'MM-dd-yyyy'}}
</div></th>
</tr>
<div class="invoice-content clearfix" cg-busy="{promise:viewPromise}" >
<tr>
<td class="inv-thours">Total Hours</td>
<td align="center">{{aim.totalHours}}</td>
</tr>
<tr>
<td class="inv-rate">Rate</td>
<td align="center">{{aim.billRate}}</td>
</tr>
<tr>
<td class="inv-rate">Amount</td>
<td align="center">{{(aim.totalHours) * (aim.billRate)}}</td>
</tr>
<tr>
<td class="inv-thours">totalExpenses</td>
<td align="center">{{aim.totalExpenses}}</td>
</tr>
<tr>
<td class="inv-thours">Total Amount</td>
<td align="center">{{aim.amount}}</td>
</tr>
<tr>
<td>
</td>
<td ng-if="aim.status === 'UNCONFIRMED'">
<div align="right" style="margin-right:10px;"><input type="submit" value="Confirm" data-ng-click="confirmStatus(aim)"> |
<button onclick="goBack()">Cancel</button></div>
</td>
<td ng-if="aim.status === 'CONFIRMED'">
<div align="right" style="margin-right:10px;">
<button onclick="goBack()">BACK</button></div>
</td>
<td ng-if="!(aim.status === 'UNCONFIRMED') && !(aim.status === 'CONFIRMED')">
<button onclick="javascript:demoFromHTML();">PDF</button>
</td>
</tr>
</table>
<script type="text/javascript" src="http://mrrio.github.io/jsPDF/dist/jspdf.debug.js"></script>
<script>
function demoFromHTML() {
var pdf = new jsPDF('p', 'pt', 'letter');
var imgData = '.............';
pdf.setFontSize(40);
pdf.addImage(imgData, 'PNG', 12, 30, 130, 40);
pdf.cellInitialize();
pdf.setFontSize(10);
$.each($('#customers tr'), function (i, row) {
$.each($(row).find("th"), function (j, cell) {
var txt = $(cell).text();
var width = (j == 4) ? 300 : 300; //make with column smaller
pdf.cell(10, 30, width, 70, txt, i);
});
$.each($(row).find("td"), function (j, cell) {
var txt = $(cell).text().trim() || " ";
var width = (j == 4) ? 200 : 300; //make with column smaller
pdf.cell(10, 50, width, 30, txt, i);
});
});
pdf.save('sample-file.pdf');
}
I want to generate a PDF in this format:
http://i.stack.imgur.com/nrR7l.png
but the generated PDF looks like this:
http://i.stack.imgur.com/DGSxE.png
Please help me with this problem.
Thank you.
I think the CSS is missing from your generated PDF. I found this comment on a GitHub issue:
diegocr commented on 25 Sep 2014
I'm afraid the fromHTML plugin is kinda limited when it comes to support css styles. Also, we have an addSVG plugin to deal with SVG elements, but the fromHTML does not uses it. So, no, the issue isn't Angular, you may could use the new addHTML (#270) but i dunno if that will deal with SVG. (html2canvas, that is)