I'm trying to perform a screen scrape because I can't find a relevant free API to get the data I need. I've managed to perform the scrape and grab the HTML page, but the part I'm stuck on is getting the relevant information out of the grabbed content. I'm guessing I will need to use regular expressions to search through the HTML, but I'm unsure how to do this. The information I'm after is the MAKE, MODEL and YEAR of the current car search.
if let url = URL(string: "https://www.rac.co.uk/buying-a-car/car-passport/report/buyer/purchase/?BuyerVrm=yg06dxt") {
    let task = URLSession.shared.dataTask(with: url) { data, response, error in
        // Only decode the body when the request succeeded and returned data.
        if error == nil, let data = data {
            let urlContent = String(data: data, encoding: .ascii)
            print(urlContent ?? "")
        }
    }
    task.resume()
}
Here's a sample of the returned information:
<p class="CarMiniProfile-caveat u-hidden">*image for illustrative purposes only</p>
<div>
<table class="CarMiniProfile-table">
<tbody>
<tr class="CarMiniProfile-tableFirstRow">
<td class="CarMiniProfile-tableHeader">
Make
</td>
<td>
FIAT
</td>
</tr>
<tr>
<td class="CarMiniProfile-tableHeader">
Model
</td>
<td>
PUNTO SPORTING M-JET
</td>
</tr>
<tr>
<td class="CarMiniProfile-tableHeader">
Colour
</td>
<td>
BLUE
</td>
</tr>
<tr>
<td class="CarMiniProfile-tableHeader">
Year
</td>
<td>
2006
</td>
</tr>
<tr>
<td class="CarMiniProfile-tableHeader">
Engine Size
</td>
<td>
1910 cc
</td>
</tr>
</tbody>
</table>
</div>
<h3 class="CarMiniProfile-subheading">Check this car in 3 simple steps...</h3>
Using regexes for html isn't a good idea, I agree. Sometimes I've had to do some real nasty stuff with regexes and html.
If you absolutely must do it this way then here's one for MAKE:
<td.*?CarMiniProfile-tableHeader.*?\n*(.*?)\n*<\/td>
You should be able to customise this for everything else you need. Using regexes is definitely not a recommended solution for this though.
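As an alternative to the regex, here is a minimal sketch of the parser-based route. It is written in Python with only the standard library (not Swift), and it assumes the table keeps alternating header cell and value cell as in the sample above. The names CellCollector and car_details are made up for this illustration.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.cells[-1] += data


def car_details(html):
    """Pair each header cell with the value cell that follows it."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    cells = [cell.strip() for cell in parser.cells]
    # Assumption: cells alternate label, value, label, value, ...
    return dict(zip(cells[0::2], cells[1::2]))


sample = """
<table class="CarMiniProfile-table"><tbody>
<tr><td class="CarMiniProfile-tableHeader">
Make
</td><td>
FIAT
</td></tr>
<tr><td class="CarMiniProfile-tableHeader">
Model
</td><td>
PUNTO SPORTING M-JET
</td></tr>
<tr><td class="CarMiniProfile-tableHeader">
Year
</td><td>
2006
</td></tr>
</tbody></table>
"""

details = car_details(sample)
print(details["Make"], details["Model"], details["Year"])
```

Because the lookup is keyed on the label text, extra rows such as Colour or Engine Size do not get in the way.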
So I am having trouble getting the variable values to be shown in an email template. The third-party email templating provider is Postmark, and it uses Mustache. My template is set up like this (I have omitted some of the irrelevant HTML to keep things shorter):
{{#discount_group.delivery_fee}}
<tr>
<td width="30%" class="purchase_footer" valign="middle">
<p class="purchase_total">{{delivery_fee}}</p>
</td>
</tr>
{{/discount_group.delivery_fee}}
{{#discount_group.discount}}
<tr>
<td width="30%" class="purchase_footer" valign="middle">
<p class="purchase_total">{{discount}}</p>
</td>
</tr>
<tr>
<td width="30%" class="purchase_footer" valign="middle">
<p class="purchase_total_bold">{{grandtotal}}</p>
</td>
</tr>
{{/discount_group.discount}}
And my json payload looks like this:
"discount_group": {
"delivery_fee":"delivery_fee_Value",
"discount": "discount_Value",
"grandtotal": "grandtotal_Value"
}
But when I send out the email, the sections render properly but the variable values are blank.
If I remove "delivery_fee" from the JSON payload, that section is not rendered, as expected, but the values are still missing.
I have also tried {{discount_group.delivery_fee}}, {{discount_group.discount}} etc., but the values were still missing.
What am I doing wrong?
Thanks in advance
So I figured it out. I'm not sure why it has to be this way but it does. My problem was in the payload. The payload should be formatted like this:
"discount_group": {
"delivery_fee":{
"delivery_fee":"delivery_fee_Value"
},
"discount": {
"discount":"discount_Value",
"grandtotal": "grandtotal_Value"
}
}
When you wrap a block of your template in a Mustache section, you are stepping into that object in your data, which is meant to make the template more readable. Postmark's documentation calls it 'Scoping'; you can read up on it there.
Therefore, by starting a block with, for example, {{#discount_group.delivery_fee}}, you are already at delivery_fee, and calling it again returns nothing since it doesn't exist at that level.
With how your data was originally structured, you had everything you needed nested in discount_group, so you didn't need to nest any further in your brackets. I know you have found a solution, but in the future, instead of changing your data to match your code, you could consider updating your code as follows:
{{#discount_group}}
<tr>
<td width="30%" class="purchase_footer" valign="middle">
<p class="purchase_total">{{delivery_fee}}</p>
</td>
</tr>
<tr>
<td width="30%" class="purchase_footer" valign="middle">
<p class="purchase_total">{{discount}}</p>
</td>
</tr>
<tr>
<td width="30%" class="purchase_footer" valign="middle">
<p class="purchase_total_bold">{{grandtotal}}</p>
</td>
</tr>
{{/discount_group}}
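To make the scoping rule concrete, here is a small Python sketch of how a Mustache-style renderer resolves names against a stack of contexts. It is a simplified model for illustration only, not Postmark's actual implementation; the helper names lookup and enter are made up.

```python
def lookup(name, stack):
    """Resolve a name against a stack of contexts, innermost first."""
    for context in reversed(stack):
        if isinstance(context, dict) and name in context:
            return context[name]
    return ""  # a name that cannot be resolved renders as an empty string


def enter(path, stack):
    """Push the value at a dotted path, as opening a {{#section}} would."""
    first, *rest = path.split(".")
    value = lookup(first, stack)
    for part in rest:
        value = value.get(part, "") if isinstance(value, dict) else ""
    return stack + [value]


original = {"discount_group": {"delivery_fee": "delivery_fee_Value",
                               "discount": "discount_Value",
                               "grandtotal": "grandtotal_Value"}}

# {{#discount_group.delivery_fee}} ... {{delivery_fee}}: the scope is now a
# plain string, and the root object has no "delivery_fee" key either.
blank = lookup("delivery_fee", enter("discount_group.delivery_fee", [original]))

# {{#discount_group}} ... {{delivery_fee}}: the scope is the object itself.
found = lookup("delivery_fee", enter("discount_group", [original]))

# The reshaped payload works because delivery_fee is an object with its own key.
reshaped = {"discount_group": {"delivery_fee": {"delivery_fee": "delivery_fee_Value"}}}
also_found = lookup("delivery_fee", enter("discount_group.delivery_fee", [reshaped]))

print(repr(blank), repr(found), repr(also_found))
```

In this model the original template comes up blank with the original payload, while both the reshaped payload and the {{#discount_group}} template resolve the value.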
I have an HTTP POST response which I receive as HTML. Now I want to display the results in my view controller. How can I parse the DOM of the response to get the elements I want?
This is the response in raw html:
<tr>
<td style="text-align:center;">1</td>
<td>9.99</td>
<td style="text-align:center;" class="show_on_masters hide"></td>
<td style="text-align:center;">1.4</td>
<td>DE GRASSE, ANDRE</td>
<td style="text-align:center;">ON</td>
<td>
<div data-tooltip="Speed Academy Athletics Club">SAAC</div>
</td>
<td>94</td>
<td>2</td>
<!--<td class="rankings_hide_992">UF Tom Jones Invitational (Olympic Development)</td>-->
<!--<td class="rankings_hide_768">Gainesville , FL</td>-->
<td>
<div data-tooltip="UF Tom Jones Invitational (Olympic Development)" style="cursor:default;">Gainesville , FL</div>
</td>
<td>17/04/2021</td>
</tr>
<tr>
<td style="text-align:center;">2</td>
<td>10.08</td>
<td style="text-align:center;" class="show_on_masters hide"></td>
<td style="text-align:center;">1.9</td>
<td>BROWN, AARON</td>
<td style="text-align:center;">ON</td>
<td>
<div data-tooltip="Phoenix Athletics Assoc. of Ontario">PHNX</div>
</td>
<td>92</td>
<td>7</td>
<!--<td class="rankings_hide_992">World Athletics - Miramar</td>-->
<!--<td class="rankings_hide_768">Miramar, FL</td>-->
<td>
<div data-tooltip="World Athletics - Miramar" style="cursor:default;">Miramar, FL</div>
</td>
<td>10/04/2021</td>
</tr>
<tr>
<td style="text-align:center;">3</td>
<td>10.14</td>
<td style="text-align:center;" class="show_on_masters hide"></td>
<td style="text-align:center;">0.7</td>
<td>WARNER, DAMIAN</td>
<td style="text-align:center;">ON</td>
<td>
<div data-tooltip="London Western T.F.C.">LWTF</div>
</td>
<td>89</td>
<td>1dec5</td>
<!--<td class="rankings_hide_992">Hypo-Meeting</td>-->
<!--<td class="rankings_hide_768">Götzis, AUT</td>-->
<td>
<div data-tooltip="Hypo-Meeting" style="cursor:default;">Götzis, AUT</div>
</td>
<td>29/05/2021</td>
</tr>
I'm currently trying to use HTMLKit based on a couple tutorials, but I can't truly traverse the DOM with this library. Any ideas?
HTMLKit Tutorial
HTMLKit Video Tutorial
You can try the SwiftSoup library, which parses HTML and lets you query it with CSS selectors.
Usage
do {
let html: String = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
let doc: Document = try SwiftSoup.parse(html)
let link: Element = try doc.select("a").first()!
let text: String = try doc.body()!.text() // "An example link."
let linkHref: String = try link.attr("href") // "http://example.com/"
let linkText: String = try link.text() // "example"
let linkOuterH: String = try link.outerHtml() // "<a href="http://example.com/"><b>example</b></a>"
let linkInnerH: String = try link.html() // "<b>example</b>"
} catch Exception.Error(let type, let message) {
print(message)
} catch {
print("error")
}
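For the rankings table in the question, the job boils down to "one list of cell texts per row". Here is that idea as a rough Python sketch using only the standard library (so not SwiftSoup itself; in SwiftSoup the equivalent would presumably start from doc.select("tr")). RowParser and table_rows are illustrative names, and the sample is a shortened version of one row from the question.

```python
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Turn each <tr> into a list holding the text of its <td> cells."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Text inside nested tags (e.g. the <div> tooltips) lands here too.
        if self._in_cell:
            self.rows[-1][-1] += data


def table_rows(html):
    parser = RowParser()
    parser.feed(html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]


sample = """
<tr>
<td style="text-align:center;">1</td>
<td>9.99</td>
<td style="text-align:center;" class="show_on_masters hide"></td>
<td>DE GRASSE, ANDRE</td>
<td>
<div data-tooltip="Speed Academy Athletics Club">SAAC</div>
</td>
<!--<td class="rankings_hide_768">Gainesville , FL</td>-->
<td>17/04/2021</td>
</tr>
"""

rows = table_rows(sample)
print(rows)
```

The commented-out cells are skipped because the parser reports comments separately from tags, which is one reason a real parser is safer here than pattern matching on the raw text.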
I'm using Scrapy (Python) to try to grep data from a site.
How can I grep this structure with XPath?
<div class="foo">
<h3>Need this text_1</h3>
<table class="thesamename">
<tbody>
<tr>
<td class="tmp_year">
45767
</td>
<td class="tmp_outcome">
<b>Win_1</b><br>
<span class="tmp_category">TEST_1</span>
</td>
</tr>
<tr>
<td class="tmp_year">
1232004
</td>
<td class="tmp_outcome">
<b>Win_2</b><br>
<span class="tmp_category">TEST_2</span>
</td>
</tr>
<tr>
<td class="tmp_year">
122004
</td>
<td class="tmp_outcome">
<b>Win_3</b><br>
<span class="tmp_category">TEST_3</span>
</td>
</tr>
</tbody>
<h3>Need this text_2</h3>
<table class="thesamename">
<tbody>
<td class="tmp_year">
234
</td>
<td class="tmp_outcome">
<b>Win_E</b><br>
<span class="tmp_category">TEST_E</span>
</td>
</tr>
<tr>
<td class="tmp_year">
3476
</td>
<td class="tmp_outcome">
<b>Win_C</b><br>
<span class="tmp_category">TEST_C</span>
</td>
</tr>
</tbody>
<h3>Need this text_3</h3>
<table class="thesamename">
<tbody>
<tr>
<td class="tmp_year">
85567
</td>
<td class="tmp_outcome">
<b>Win_T</b><br>
<span class="tmp_category">TEST_T</span>
</td>
</tr>
<tr>
<td class="tmp_year">
435656
</td>
<td class="tmp_outcome">
<b>Win_A</b><br>
<span class="tmp_category">TEST_A</span>
</td>
</tr>
<tr>
<td class="tmp_year">
980
</td>
<td class="tmp_outcome">
<b>Win_Z</b><br>
<span class="tmp_category">TEST_Z</span>
</td>
</tr>
</tbody>
I would like to have output with this structure:
"Section": {
Need this text_1 :
[45767 : Win_1 : TEST_1]
[1232004 : Win_2 : TEST_2]
[122004: Win_3 : TEST_3]
,
Need this text_2:
[234 : Win_E : TEST_E]
[3476 : Win_C : TEST_C]
,
Need this text_3:
[85567 : Win_T : TEST_T]
[435656 : Win_A : TEST_A]
[980: Win_Z : TEST_Z]
}
How can I create the proper XPath selector to get this structure?
I can separately select all the "h3" elements, all the "a" elements, and then all the tags with a class, but how can I match them up?
GREP YOU SAY?! LOL. Well, you would be entirely wrong to name it so, but for the sake of keeping the jargon clean: you're just parsing/extracting. So, new to Scrapy? Or to the web dev side of things? No matter. There's no way I could expect to teach you in one answer how to XPath/regex like a pro; the only way is for you to keep at it, but I'll throw in my input.
First of all, XPath is amazingly useful when it comes to websites that aren't necessarily built to standard, which doesn't make them bad per se. But the HTML snippet you gave is structured all right, so I'd recommend CSS extraction. These are the values:
year = response.css('td.tmp_year a::text').extract()
outcome = response.css('td.tmp_outcome b::text').extract()
category= response.css('span.tmp_category::text').extract()
PRO-TIP: Whenever you deem it necessary, you can save a web page as an HTML file and use scrapy shell by referencing the direct file path to it. So I saved your HTML snippet to a file on my desktop, then ran:
scrapy shell file:///home/scriptso/Desktop/letsGREPlol.html
ANYWAYS... as far as XPath goes, since you asked, lol: cake. Let's compare the XPath with the CSS, and tell me if you can see it.
response.css('td.tmp_outcome b::text').extract()
So it's a td tag, and the class name is tmp_outcome; then the next node is a bold tag, which is where the text is, so we declare that we want the text with ::text.
response.xpath('//td[@class="tmp_outcome"]/b/text()').extract()
So the XPath is basically saying: we start with a pattern, anywhere in the entire page, of a td tag with class="tmp_outcome", then the bold tag; then, in XPath, /text() declares that we want the text. /@href is for... yeah, you guessed it.
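The selectors above return three flat lists, which still leaves the matching question open. One way to keep each row attached to its heading is to walk the document in order and start a new group at every h3. The sketch below does that with Python's standard html.parser instead of Scrapy, partly because the sample markup has unclosed tables; SectionParser is a made-up name, and it assumes every year cell is followed by one b and one span.tmp_category, as in the snippet. Inside Scrapy, a relative XPath from each h3 such as following-sibling::table[1] may achieve the same pairing, depending on how the broken markup gets repaired.

```python
from html.parser import HTMLParser


class SectionParser(HTMLParser):
    """Group [year, outcome, category] rows under the preceding <h3> text."""

    def __init__(self):
        super().__init__()
        self.sections = {}   # heading text -> list of rows
        self._rows = None    # rows of the section currently being filled
        self._field = None   # which piece of text is being read

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "h3":
            self._field = "heading"
        elif tag == "td" and css_class == "tmp_year" and self._rows is not None:
            # A year cell opens a new row in the current section.
            self._rows.append(["", "", ""])
            self._field = 0
        elif tag == "b":
            self._field = 1
        elif tag == "span" and css_class == "tmp_category":
            self._field = 2

    def handle_endtag(self, tag):
        if tag in ("h3", "td", "b", "span"):
            self._field = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._field is None:
            return
        if self._field == "heading":
            self._rows = self.sections.setdefault(text, [])
        elif self._rows:
            self._rows[-1][self._field] += text


sample = """
<div class="foo">
<h3>Need this text_1</h3>
<table class="thesamename"><tbody>
<tr><td class="tmp_year">
45767
</td><td class="tmp_outcome">
<b>Win_1</b><br>
<span class="tmp_category">TEST_1</span>
</td></tr>
</tbody>
<h3>Need this text_2</h3>
<table class="thesamename"><tbody>
<td class="tmp_year">
234
</td><td class="tmp_outcome">
<b>Win_E</b><br>
<span class="tmp_category">TEST_E</span>
</td></tr>
</tbody>
"""

parser = SectionParser()
parser.feed(sample)
parser.close()
sections = parser.sections
print(sections)
```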
I have used the Nokogiri Ruby gem to scrape an HTML file for only the text under the tableData class. The HTML code is set up like so:
<div class="table-wrap">
<table class="table">
<tbody>
<tr>
<td class="tableData"> Jane Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
<tr>
<td class="tableData">John Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
</tbody>
</table>
</div>
and the code I used to webscrape looks like this:
vt = page.css("td[class='tableData']").text
puts vt
Which gives this output:
Jane Doe 01/01/201701/09/2017 VacationJohn Doe 01/01/201701/09/2017 Vacation
I want to populate an array within an array with only the 4 text values pertaining to each person. Which should look like this:
[[Jane Doe, 01/01/2017, 01/09/2017, Vacation], [John Doe, 01/01/2017, 01/09/2017, Vacation]]
I am new to coding and I'm not sure how to create a for loop that iterates over either the HTML code itself or the vt variable to produce an array of arrays. I know there are some push statements involved following the for loop, but it's the actual structure of the for loop that I am having trouble putting together. If you could provide some explanation in your answer of how the for loop works in this situation, it would be much appreciated.
This is the basic structure you need. Instead of a for loop with push statements, use map, which builds the array for you:
html=%q(<div class="table-wrap">
<table class="table">
<tbody>
<tr>
<td class="tableData"> Jane Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
<tr>
<td class="tableData">John Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
</tbody>
</table>
</div>)
require 'nokogiri'
doc = Nokogiri::XML(html)
array = doc.xpath('//tr').map do |tr|
tr.xpath('td').map{ |td| td.text }
end
p array
# [[" Jane Doe", " 01/01/2017", "01/09/2017 ", "Vacation"], ["John Doe", " 01/01/2017", "01/09/2017 ", "Vacation"]]
Try parsing the snippet as XML, finding all "tr" elements via XPath, and collecting their "td//text()" children:
require 'nokogiri'
doc = Nokogiri::XML(get_html_snippet)
data = doc.xpath('//tr').map do |tr|
tr.xpath('td').map { |td| td.text.strip }
end
data # => [["Jane Doe", "01/01/2017", "01/09/2017", "Vacation"], ["John Doe", "01/01/2017", "01/09/2017", "Vacation"]]
I have a structure similar to this:
<table class="superclass">
<tr>
<td>
</td>
<td>
</td>
</tr>
<tr>
<td>
</td>
<td>
</td>
</tr>
</table>
<table cellspacing="0">
<tr>
<td>
</td>
<td>
</td>
</tr>
<tr>
<td>
</td>
<td>
</td>
</tr>
</table>
This is how I get the first table with class:
HtmlNode firstTable = document.DocumentNode.SelectSingleNode("//table[@class=\"superclass\"]");
Then I read the data. However, I don't know how to get straight to the other table and read that data too. Any ideas?
I'd rather avoid counting which table it is and then using index to that table.
XPath has a following-sibling axis, which lets you get an element that follows the current context element at the same level:
HtmlNode firstTable = document.DocumentNode.SelectSingleNode("//table[@class=\"superclass\"]");
HtmlNode nextTable = firstTable.SelectSingleNode("following-sibling::table");
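To show what "following sibling" means structurally, here is a rough Python sketch using the standard library's ElementTree. ElementTree's limited XPath has no following-sibling axis, so the sketch walks the parent's children by hand; the helper next_table and the wrapping root element are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

markup = """
<root>
  <table class="superclass"><tr><td>a</td><td>b</td></tr></table>
  <table cellspacing="0"><tr><td>c</td><td>d</td></tr></table>
</root>
"""


def next_table(root, first):
    """Return the first <table> that follows `first` under the same parent."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    siblings = list(parents[first])
    for node in siblings[siblings.index(first) + 1:]:
        if node.tag == "table":
            return node
    return None


root = ET.fromstring(markup)
first = root.find(".//table[@class='superclass']")
second = next_table(root, first)
cells = [td.text for td in second.iter("td")]
print(second.attrib, cells)
```

Only nodes after the starting table and under the same parent are considered, which is the behaviour the following-sibling axis gives you in one expression.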
If you want to access multiple nodes, consider the SelectNodes(xpath) method instead of SelectSingleNode(xpath).
Here is some sample code for reference; it may not fit your needs exactly.
var tables = htmlDocument.DocumentNode.SelectNodes("//table");
foreach (HtmlNode table in tables)
{
if (table.GetAttributeValue("class", "").Contains("superclass"))
{
//this is the table of class="superclass"
}
else
{
//this is the other table.
}
}