Cheerio not finding table content - cheerio

I need to parse an HTML document containing a table.
<div>
<table id="tableID">
<tr>
<td class="tdClass">
<span id="id1">Some data i need to access</span>
</td>
<td class="tdClass">
<span id="id2">Some data i need to access</span>
</td>
</tr>
</table>
</div>
I'm using cheerio in an NW.js app. I can't figure out how to access the data. I've tried selecting by the spans' ids, but it doesn't work.
The div is contained in the body of the page.
var $ = cheerio.load(html)
alert($('#id1').html())
I'm getting null when I'm trying to alert the content of the span.

Try this:
$ = cheerio.load(html, {normalizeWhitespace: false, xmlMode: true});

Related

Google App script for scraping WSJ and Yahoo Finance

I am trying to pull data from WSJ and Yahoo Finance using Google Sheets Apps Script.
I was able to pull the current price with the code below, matching the HTML <span id="quote_val">(.+?)<\/span> on the page. Note that this element has an id.
Now I am trying to pull the target price, whose HTML is <span class="data_data"><sup>$</sup>50.80</span>. The same approach does not give me the desired result. Note that this element has a class, not an id.
For example, take the URL https://www.wsj.com/market-data/quotes/PCT/research-ratings:
function SAMPLE(url) {
const html = UrlFetchApp.fetch(url).getContentText();
const res = html.match(/<span id="quote_val">(.+?)<\/span>/);
if (!res) throw new Error("Value cannot be retrieved.")
return isNaN(res[1]) ? res[1] : Number(res[1]);
}
Is there a simple solution?
I would tackle this in two steps:
Use a regular expression to extract the subsection of the web page that you are interested in - specifically the "Stock Price Target" table.
Parse that <table>...</table> string into an HTML document and then iterate over the nodes in that document to extract each relevant item from the table.
function scrapeDemo() {
var url = 'https://www.wsj.com/market-data/quotes/PCT/research-ratings';
var html = UrlFetchApp.fetch(url).getContentText();
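var res = html.match(/<div class="cr_data rr_stockprice module">.+?(<table .+?<\/table>)/);
// Fail with a clear message if the page layout changed and nothing matched.
if (!res) throw new Error('Stock Price Target table not found.');
var document = XmlService.parse(res[1]);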
var root = document.getRootElement();
var trNodes = root.getChild('tbody').getChildren();
trNodes.forEach((trNode) => {
var tdNodes = trNode.getChildren();
var fieldName;
var fieldValue;
tdNodes.forEach((tdNode, idx) => {
if (idx % 2 === 0) {
fieldName = tdNode.getValue().trim();
} else {
fieldValue = tdNode.getValue().trim();
console.log( fieldName + " : " + fieldValue );
}
} );
} );
}
The regular expression uses this as its starting point:
<div class="cr_data rr_stockprice module">
This is because we need a reliably unique element which is a parent of the table we want (the table itself does not contain anything which uniquely identifies it).
This gives us the table in the res[1] captured group. Here is that HTML:
<table class="cr_dataTable">
<tbody>
<tr>
<td>
<span class="data_lbl">High</span>
</td>
<td>
<span class="data_data">
<sup>$</sup>48.00</span>
</td>
</tr>
<tr>
<td>
<span class="data_lbl">Median</span>
</td>
<td>
<span class="data_data">
<sup>$</sup>37.50</span>
</td>
</tr>
<tr>
<td>
<span class="data_lbl">Low</span>
</td>
<td>
<span class="data_data">
<sup>$</sup>24.00</span>
</td>
</tr>
<tr class="highlight">
<td>
<span class="data_lbl">Average</span>
</td>
<td>
<span class="data_data">
<sup>$</sup>36.75</span>
</td>
</tr>
<tr>
<td>
<span class="data_lbl">Current Price</span>
</td>
<td>
<span class="data_data">
<sup>$</sup>24.18</span>
</td>
</tr>
</tbody>
</table>
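Step 1 can be tried in isolation. The sketch below is plain JavaScript rather than Apps Script, and captureTargetTable is a hypothetical helper name, not part of the original answer. It uses [\s\S] in place of the dot because, in JavaScript, the dot does not match line breaks unless the s flag is set; this assumes the fetched HTML may span several lines.

```javascript
// Hypothetical helper: capture the first <table> that follows the
// uniquely identifiable parent <div>. [\s\S] matches any character,
// including newlines, so multi-line HTML is handled.
const TABLE_PATTERN =
  /<div class="cr_data rr_stockprice module">[\s\S]+?(<table [\s\S]+?<\/table>)/;

function captureTargetTable(html) {
  const res = html.match(TABLE_PATTERN);
  // Return null instead of throwing so the caller decides how to react.
  return res ? res[1] : null;
}
```

Returning null on a miss keeps the "page structure changed" case explicit for the caller.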
Now we perform step 2 using XmlService.parse() to create a mini-XML document containing our HTML.
Then we iterate over the elements of that document, by drilling down into each level's child nodes.
Each field's value is written to the console, so for this table we get this data:
High : $48.00
Median : $37.50
Low : $24.00
Average : $36.75
Current Price : $24.18
In my experience, doing this type of scraping can be difficult. Any unexpected changes in web page structure, from one page to another, can cause the regular expression to fail, or cause the drill-down into the table to fail. In other words, this type of approach should work, but it may also break unexpectedly.
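The label/value pairing can also be tried outside Apps Script. The following is a minimal sketch in plain JavaScript, assuming the table HTML has already been captured as a string. It swaps XmlService for regular expressions, so it is a different technique from the answer's, and extractTablePairs is a hypothetical name.

```javascript
// Hypothetical helper: turn each <tr> with label/value <td> cells into
// an entry of a plain object. Uses regexes instead of XmlService.
function extractTablePairs(tableHtml) {
  const pairs = {};
  const rowPattern = /<tr\b[^>]*>([\s\S]*?)<\/tr>/g;
  for (const rowMatch of tableHtml.matchAll(rowPattern)) {
    const cellPattern = /<td\b[^>]*>([\s\S]*?)<\/td>/g;
    // Strip inner tags (<span>, <sup>) and collapse whitespace per cell.
    const cells = Array.from(rowMatch[1].matchAll(cellPattern), (m) =>
      m[1].replace(/<[^>]+>/g, '').replace(/\s+/g, ' ').trim()
    );
    // Even-indexed cells are labels, odd-indexed cells are values.
    for (let i = 0; i + 1 < cells.length; i += 2) {
      pairs[cells[i]] = cells[i + 1];
    }
  }
  return pairs;
}
```

Like the original, this breaks if the table layout changes, for example if a row gains a third cell.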

Unable to extract value using xpath query

I am learning to use XPath queries. I am having an issue where I am unable to extract a value that changes whenever the page is refreshed.
For example, I am trying to extract the value '62804' from the following HTML code: "canvas.strokeText('Answer: 62804',90,112);". Any ideas how this can be done? Thanks.
<html>
<div id="content" class="large-12 columns">
<div class="example">
<h3>Challenging DOM</h3>
<p>The hardest part in automated web testing is finding the best locators (e.g., ones that well named, unique, and unlikely to change). It's more often than not that the application you're testing was not built with this concept in mind. This example demonstrates that with unique IDs, a table with no helpful locators, and a canvas element.</p>
<hr>
<div class="row">
<div class="large-12 columns large-centered">
<div class="large-2 columns">
<a id="debcda40-b692-0137-457b-2213fbd48497" href="" class="button">qux</a><br>
<a id="debce410-b692-0137-457c-2213fbd48497" href="" class="button alert">baz</a><br>
<a id="debd03d0-b692-0137-457d-2213fbd48497" href="" class="button success">foo</a><br>
</div>
<div class="large-10 columns">
<table>
<thead>
<tr>
<th>Lorem</th>
<th>Ipsum</th>
<th>Dolor</th>
<th>Sit</th>
<th>Amet</th>
<th>Diceret</th>
<th>Action</th>
</tr>
</thead>
<tbody>
<tr>
<td>Iuvaret0</td>
<td>Apeirian0</td>
<td>Adipisci0</td>
<td>Definiebas0</td>
<td>Consequuntur0</td>
<td>Phaedrum0</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret1</td>
<td>Apeirian1</td>
<td>Adipisci1</td>
<td>Definiebas1</td>
<td>Consequuntur1</td>
<td>Phaedrum1</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret2</td>
<td>Apeirian2</td>
<td>Adipisci2</td>
<td>Definiebas2</td>
<td>Consequuntur2</td>
<td>Phaedrum2</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret3</td>
<td>Apeirian3</td>
<td>Adipisci3</td>
<td>Definiebas3</td>
<td>Consequuntur3</td>
<td>Phaedrum3</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret4</td>
<td>Apeirian4</td>
<td>Adipisci4</td>
<td>Definiebas4</td>
<td>Consequuntur4</td>
<td>Phaedrum4</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret5</td>
<td>Apeirian5</td>
<td>Adipisci5</td>
<td>Definiebas5</td>
<td>Consequuntur5</td>
<td>Phaedrum5</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret6</td>
<td>Apeirian6</td>
<td>Adipisci6</td>
<td>Definiebas6</td>
<td>Consequuntur6</td>
<td>Phaedrum6</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret7</td>
<td>Apeirian7</td>
<td>Adipisci7</td>
<td>Definiebas7</td>
<td>Consequuntur7</td>
<td>Phaedrum7</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret8</td>
<td>Apeirian8</td>
<td>Adipisci8</td>
<td>Definiebas8</td>
<td>Consequuntur8</td>
<td>Phaedrum8</td>
<td>
edit
delete
</td>
</tr>
<tr>
<td>Iuvaret9</td>
<td>Apeirian9</td>
<td>Adipisci9</td>
<td>Definiebas9</td>
<td>Consequuntur9</td>
<td>Phaedrum9</td>
<td>
edit
delete
</td>
</tr>
</tbody></table>
<div class="row">
<div class="large-10 columns">
<canvas id="canvas" width="599" height="200" style="border:1px dotted;"></canvas>
</div>
</div>
</div>
</div>
</div>
<hr>
</div>
<script>
var canvas_el = document.getElementById('canvas');
var canvas = canvas_el.getContext('2d');
canvas.font = '60px Arial';
canvas.strokeText('Answer: 62804',90,112);
</script>
</div>
</html>
To use an XPath query, the input document must be valid XML.
In your case it isn't, because some tags are not properly closed (you can verify this with a tool such as xmllint).
For example, <hr> and <br> should be replaced with <hr/> and <br/>.
Once the XML is corrected, you can use an XPath query.
The first step is to select the script element:
//script
The output is:
Element='<script>
var canvas_el = document.getElementById('canvas');
var canvas = canvas_el.getContext('2d');
canvas.font = '60px Arial';
canvas.strokeText('Answer: 62804',90,112);
</script>'
Then you have to convert the element node into a string and perform some parsing:
substring-before(substring-after(//script/text(), 'canvas.strokeText(''Answer: ') , ''',90,112)')
The result is the following:
String='62804'
Note: You can do the same operation in a more flexible way using JavaScript, for example.
XPath is very good at querying XML (like the first operation above) but quite awkward for string parsing (like the second operation).
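As a rough illustration of the JavaScript route, the sketch below pulls the number out of the script text with a regular expression. It assumes the script text has already been obtained as a string; extractCanvasAnswer is a hypothetical name.

```javascript
// Hypothetical helper: find the digits that follow "Answer: " inside
// the strokeText(...) call, whatever the coordinates after it are.
function extractCanvasAnswer(scriptText) {
  const match = scriptText.match(/strokeText\('Answer:\s*(\d+)'/);
  // The value is returned as a string; null means the call was not found.
  return match ? match[1] : null;
}
```

Unlike the substring-before/substring-after expression, this does not depend on the trailing ",90,112)" staying the same.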
Hope it can help.

Scraping HTML by Class in VBA

I have HTML code as shown:
<div class="property-title visible-xs">
<a href="/property/473902/Office-Lot">
<h2><b> 2nd Floor, Block D5, Solaris Dutamas, No. 1, Jalan Dutamas 1, 50480, Kuala Lumpur</b></h2>
</a>
</div>
<p style="color: #0071ee;">Office Lot</p>
<h4><b>RM 880,000</b></h4>
<div>
<table>
<!-- <tr><td>Office Lot</td></tr> -->
<tr>
<td>Property Code</td><td>:</td><td>PB473902</td>
</tr>
<tr>
<td>Auction Date</td><td>:</td><td>2016-02-26</td>
</tr>
<tr>
<td>Built up </td><td>:</td><td>754 sq.ft </td>
</tr>
<tr>
<td>Tenure</td><td>:</td><td>Freehold</td>
</tr>
and I used the following code to extract the details "2nd Floor, Block D5,...."
objIE1.Document.getElementsByClassName("property-title visible-xs").getElementsByTagName ("a")
but it doesn't seem to give the result I need. Please help.
The HTML shown is repeated in multiple blocks on the page.
This will work:
extract1 = objIE1.Document.getElementsByClassName("property-title visible-xs")(0).getElementsByTagName ("a")(0).innerText
Cells(1,1).Value = extract1
When a function name contains getElementsBy (plural "Elements"), such as getElementsByClassName or getElementsByTagName, it returns a collection of elements, so you need to specify which one you want. In this case it is the first, which has index 0. When a function name contains getElementBy (singular "Element"), such as getElementById, it returns a single element and therefore needs no index, as there is no collection.
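The same collection-versus-element distinction applies to the DOM methods in browser JavaScript. The sketch below is in JavaScript rather than VBA and assumes a doc object exposing the standard getElementsByClassName and getElementsByTagName methods. firstLinkTextByClass is a hypothetical name, and textContent is used here where the VBA line reads innerText.

```javascript
// Hypothetical helper: both getElementsBy... calls return collections,
// so each one is indexed with [0] before moving on.
function firstLinkTextByClass(doc, classNames) {
  const blocks = doc.getElementsByClassName(classNames);
  if (blocks.length === 0) return null; // nothing with those classes
  const links = blocks[0].getElementsByTagName('a');
  if (links.length === 0) return null; // block contains no <a>
  return links[0].textContent.trim();
}
```

In a browser this would be called as firstLinkTextByClass(document, 'property-title visible-xs'). The length checks avoid an error when a collection is empty.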

HTML - Change display order of entire tables on a page

I have a list of tables with various elements on a page. I want to have the display order of the various tables change randomly each time a page is loaded. Any ideas on how to do this? For reference, the code below shows the first two tables. Say I wanted to randomly change their display order - how would I do that?
<table>
<tr>
<td class="lender-logo" width="200" height="168x"><img src="http://www.texaspaceauthority.org/wp-content/uploads/2015/05/CleanFund_LOGO.jpg" alt="Clean Fund LLC" width="200" />
</td>
<td width="15px"></td>
<td width="340px">
<strong>Clean Fund LLC</strong>
<span style="font-size: small;"><strong>Preferred Financing Range:</strong> $500K - $15M
<strong>Types of Projects:</strong> Any
<strong>Contact:</strong> Josh Kagan
www.cleanfund.com
</span>
</td>
</tr>
</table>
<hr />
<table>
<tr>
<td class="lender-logo" width="200" height="168x"><img src="http://www.texaspaceauthority.org/wp-content/uploads/2015/05/Greenworks-Lending-Logo.jpg" alt="Greenworks Lending" width="200" />
</td>
<td width="15px"></td>
<td width="340px">
<strong>Greenworks Lending</strong>
<span style="font-size: small;"><strong>Preferred Financing Range:</strong> $30K - $5M
<strong>Types of Projects:</strong> Any Eligible Technologies and Properties
<strong>Contact:</strong> azech#greenworkslending.com
www.greenworkslending.com
</span>
</td>
</tr>
</table>
Are you able to use javascript on that page? My suggestion would be to write a javascript function that selects the table elements and then appends them back to their parent element in a random order.
Dynamic content like this needs a scripting language such as JavaScript, and jQuery is one option. This page shows how to do it with jQuery:
https://css-tricks.com/snippets/jquery/shuffle-dom-elements/
Here is the code in case the page gets deleted; it was taken from that page:
$.fn.shuffle :
(function($){
$.fn.shuffle = function() {
var allElems = this.get(),
getRandom = function(max) {
return Math.floor(Math.random() * max);
},
shuffled = $.map(allElems, function(){
var random = getRandom(allElems.length),
randEl = $(allElems[random]).clone(true)[0];
allElems.splice(random, 1);
return randEl;
});
this.each(function(i){
$(this).replaceWith($(shuffled[i]));
});
return $(shuffled);
};
})(jQuery);
And the usage is as follows:
// Shuffle all list items within a list:
$('ul#list li').shuffle();
// Shuffle all DIVs within the document:
$('div').shuffle();
// Shuffle all <a>s and <em>s:
$('a,em').shuffle();
In your case, you would:
Include the jQuery js file in your webpage
Save the code for $.fn.shuffle in a js file, and include that in your webpage
Include the JavaScript to call the shuffle: $('table').shuffle();
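Without jQuery, the idea suggested earlier (select the tables, then append them back to their parent in random order) might be sketched as follows. This is a minimal sketch: shuffleInPlace and shuffleTables are hypothetical names, and shuffleTables assumes a browser DOM in which the tables are not nested and share one parent.

```javascript
// Fisher-Yates shuffle. `random` is injectable so the function can be
// tested deterministically; it defaults to Math.random.
function shuffleInPlace(items, random = Math.random) {
  for (let i = items.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [items[i], items[j]] = [items[j], items[i]];
  }
  return items;
}

// Browser-only part: appendChild moves an existing node to the end of
// its parent, so appending in shuffled order reorders the tables.
function shuffleTables(root) {
  const tables = shuffleInPlace(Array.from(root.querySelectorAll('table')));
  tables.forEach((table) => table.parentNode.appendChild(table));
}
```

Calling shuffleTables(document.body) once the page has loaded would reorder the tables. With the markup in the question, the <hr /> separators would stay where they are, so the tables end up after them; wrapping each table and its separator in a container and shuffling the containers would avoid that.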

How to get line from table with Jsoup

I have table without any class or id (there are more tables on the page) with this structure:
<table cellpadding="2" cellspacing="2" width="100%">
...
<tr>
<td class="cell_c">...</td>
<td class="cell_c">...</td>
<td class="cell_c">...</td>
<td class="cell">SOME_ID</td>
<td class="cell_c">...</td>
</tr>
...
</table>
I want to get only one row, which contains <td class="cell">SOME_ID</td> and SOME_ID is an argument.
UPD.
Currently I am doing it this way:
doc = Jsoup.connect("http://www.bank.gov.ua/control/uk/curmetal/detail/currency?period=daily").get();
Elements rows = doc.select("table tr");
Pattern p = Pattern.compile("^.*(USD|EUR|RUB).*$");
for (Element trow : rows) {
Matcher m = p.matcher(trow.text());
if(m.find()){
System.out.println(m.group());
}
}
But why do I need Jsoup if most of the work is done by the regexp? Just to download the HTML?
If you have a generic HTML structure that always is the same, and you want a specific element which has no unique ID or identifier attribute that you can use, you can use the css selector syntax in Jsoup to specify where in the DOM-tree the element you are after is located.
Consider this HTML source:
<html>
<head></head>
<body>
<table cellpadding="2" cellspacing="2" width="100%">
<tbody>
<tr>
<td class="cell">I don't want this one...</td>
<td class="cell">Neither do I want this one...</td>
<td class="cell">Still not the right one..</td>
<td class="cell">BINGO!</td>
<td class="cell">Nothing further...</td>
</tr> ...
</tbody>
</table>
</body>
</html>
We want to select and parse the text from the fourth <td> element.
We select the <td> element that has sibling index 3 (indices are zero-based) by using td:eq(3). In the same way, we can select all <td> elements with an index below 3 by using td:lt(3). As you've probably figured out, eq stands for "equal" and lt for "less than".
Without using first() you will get an Elements object, but we only want the first one so we specify that. We could use get(0) instead too.
So, the following code
Element e = doc.select("td:eq(3)").first();
System.out.println("Did I find it? " + e.text());
will output
Did I find it? BINGO!
Some good reading in the Jsoup cookbook!