Save generated HTML using Canopy - html

Can a website's generated HTML be saved using Canopy? Looking at the documentation under 'Getting Started', I could not find anything related.

You can run arbitrary JavaScript using the js function. document.documentElement.outerHTML returns the current DOM, so
let html = js "return document.documentElement.outerHTML" |> string
does the trick.

Canopy is a wrapper around Selenium that provides some useful helper functions. It also gives you access to the underlying Selenium IWebElement instances in case you need them, via the element function (halfway down the page; that page doesn't seem to have internal anchors, so I couldn't link directly to the function). Once you have the IWebElement object, your problem becomes similar to this one, where the answer seems to be elem.GetAttribute("innerHTML"), where elem is the element whose content you want (which might even be the html element). Note that innerHTML is not a standard DOM attribute, so this won't necessarily work with all Selenium drivers; it depends on which browser you're running in. But it apparently works in all major Web browsers.
See Get HTML Source of WebElement in Selenium WebDriver using Python for a related question using Python, which has more discussion about whether innerHTML will work in all browsers. If it doesn't, Canopy also has the js function, which you could use to run some JavaScript to get the HTML you're looking for. If you're having trouble with that, though, you probably need to ask a JavaScript question rather than an F# question.

Related

How to use chrome dev tools to find elements based on css class or id?

Long-time automation developer here (just for context).
It's been bugging me for quite a while that the Chrome dev tools I use to find elements don't seem to work as I expect. Hopefully someone can point out what I'm doing wrong.
Take, say, this Sauce Labs page: https://saucelabs.com/blog/selenium-tips-finding-elements-by-their-inner-text-using-contains-a-css-pseudo-class
That page has divs and anchors,
and indeed I can do find('a') or find('div'),
but why do I have a problem using classes or ids?
The find() method refers to window.find(), a non-standard API for the browser's built-in Find function. It does not find web elements the same way Selenium or Capybara do, and so it does not parse the input as a selector.
You find elements with selectors in Chrome DevTools using document.querySelector() or document.querySelectorAll(). There are no special methods in Chrome DevTools for this, however it does provide the $() and $$() aliases (respectively) to save you time and keystrokes.
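As a rough sketch of the selector approach described above, the helper below lists the tag names of everything a CSS selector matches. The name listMatches and the root parameter are made up for illustration; in the DevTools console root simply defaults to the page's document.

```javascript
// Return the lowercase tag names of all elements matching a CSS selector.
// `listMatches` and `root` are hypothetical names; `root` can be any object
// that offers querySelectorAll, and defaults to the page's document.
function listMatches(selector, root = document) {
  return Array.from(root.querySelectorAll(selector), function (el) {
    return el.tagName.toLowerCase();
  });
}

// Console usage (class "foo" and id "bar" are placeholders):
//   listMatches('.foo');
//   listMatches('#bar');
```

Calling it with a class selector or an id selector exercises the same parsing that document.querySelectorAll() does, which is exactly what window.find() does not do.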
You can use jQuery code in the Chrome console. For example, to find something with a class of "foo" you can write $('.foo'), or for an id of "bar" you can write $('#bar').
You can read all about it here
Also, you can just google what you want, e.g. "jQuery how to find a div with id".

JSoup Select Tag Recursive Search

I recently tried to work with JSoup to parse HTML documents. I went through the tutorial on JSoup and found that the select method might be what I am looking for.
What I am trying to accomplish is to find all elements in an HTML document that have a certain class. To test that, I tried this with the Amazon web page (idea: find all deals with certain offers).
So I inspected the web page to see which classes and ids are being used, and then I tried to integrate this into a small code snippet. In this example I found the following element:
<span id="dealTitle" class="a-size-base a-color-link dealTitleTwoLine restVisible singleCellTitle autoHeight">PROCAVE Matratzen-Brücke aus Schaumstoff 25 x 200 cm für ...</span>
This element is embedded in other elements and exists multiple times (for each deal of course). So here is my code to read the deal elements:
Document doc = Jsoup.connect("https://www.amazon.de/gp/angebote/ref=gbph_ftr_s-8_cd61_page_1?gb_f_LD=dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL%252CUPCOMING,dealTypes:LIGHTNING_DEAL,page:1,sortOrder:BY_SCORE,dealsPerPage:8&pf_rd_p=425ddcb8-bed4-4e85-ac0f-c1a79d14cd61&pf_rd_s=slot-8&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A3JWKAKR8XB7XF&pf_rd_r=BTHRY008J9N3N5CCMNEN&gb_f_second=dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL,dealTypes:COUPON_DEAL,page:8,sortOrder:BY_SCORE,dealsPerPage:8").timeout(0).get();
Elements deals = doc.select("span.a-size-base.a-color-link.dealTitleTwoLine.restVisible.singleCellTitle.autoHeight");
for (Element deal : deals) {
    if (deal.text().contains("ItemMatch")) {
        System.out.println("Found deal: " + deal.text());
    }
}
Unfortunately I can't get the element I am looking for; deals always has a size of 0. I tried modifying my select to use only some of the classes, I added the id attribute, and so on. Nevertheless, I do not get the elements (in this case they are nested inside some others). If I try an element that is above this element in the DOM hierarchy (e.g. the div with class "a-section a-spacing-none slotContainer"), it is found.
Do I actually need to specify the whole DOM hierarchy (by using ">" in my select expressions)? I expected to be able to define a selector and have JSoup traverse and search the whole DOM tree.
No, you do not have to specify the full DOM hierarchy. Your test should work if the elements are really part of the DOM. I suspect that they are not part of the DOM as it is loaded by JSoup. The reason might be that the inner DOM nodes are filled in by JavaScript through AJAX. JSoup does not run JavaScript, so dynamically loaded parts of the DOM are not accessible. To achieve what you want, you can either look into the AJAX calls directly and analyze them, or move on to another solution such as Selenium WebDriver, which drives a real browser including a working JavaScript engine.

Is it possible to nest one data: URI inside another?

If I use a data URI to construct a src attribute for an HTML element, can it in turn have another data URI inside it?
I know you can't use data URIs for iframes (I'm actually trying to construct an OSDX document and pass it to the browser with an icon encoded in base64, but that's a really niche use case and this is more of a general question), but assuming you could, my use case would look like:
var iframe = document.createElement('iframe');
var icon = document.createElement('img');
var iSrc = '*[REALLY LONG STRING]*/';
iframe.src = 'data:text/html,<html><body><img src="' + iSrc + '" /></body></html>';
document.body.appendChild(iframe);
Basically, what I'm asking is: is there anything in a data URI that would break a parent data URI?
Yes you can. I really thought it was impossible, as did everyone I asked.
Example:
Pasting the following into your browser's URL bar should render a gmail logo in an html page that says hello world.
data:text/html,<html><body><p>hello world</p><img src="" /></body></html>
or for a shorter example courtesy of Pumbaa80:
data:text/html,<script src="data:text/javascript,alert('hello world')"></script>
MSDN explicitly confirms this:
Data URIs can be nested.
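On the question of what might break the parent URI: a minimal sketch, assuming the usual culprits are characters such as '#', '%' and the quote that delimits the attribute, is to percent-encode each layer before wrapping it in the next. The helper name toDataUri is made up for illustration.

```javascript
// Build a data: URI from a MIME type and raw content.
// `toDataUri` is a hypothetical helper, not part of any library.
function toDataUri(mimeType, content) {
  // Percent-encode so characters like '#', '%', '"' and '<' cannot
  // terminate or corrupt the URI that will contain this one.
  return 'data:' + mimeType + ',' + encodeURIComponent(content);
}

// Inner layer: a script, encoded once.
const innerUri = toDataUri('text/javascript', "alert('hello world')");

// Outer layer: an HTML page that references the inner URI.
// The inner URI is encoded a second time as part of the page.
const outerUri = toDataUri('text/html', '<script src="' + innerUri + '"></script>');

console.log(outerUri);
```

Each layer is decoded exactly once when it is loaded, so the inner URI comes out intact after the outer one has been decoded.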
An old blog entry talks a little bit more about embedding images within CSS using data: :
Neither dataURI spec nor any other mentions if dataURI’es can not be nested. So here’s the testcase where dataURI’ed CSS has dataURI’ed image embedded. IE8b1, Firefox3 and Safari applied the stylesheet and showed the image, Opera9.50 (build 9613) applies the stylesheet but doesn’t show the embedded image! So it seems that Opera9 doesn’t expect to get anything embedded inside of an already embedded resource! :D
But funny thing, as IE8b1 supports expressions and also supports nested data URI’es, it has the same potential security flaw as Firefox does (as described in the section above). See the testcase — embedded CSS has the following code: body { background: expression(a()); } which calls function a() defined in the javascript of the main page, and this function is called every time the expression is reevaluated. Though IE8b1 has limited expressions support (which is going to be explained in a separate post) you can’t use any code as the expression value, but you can only call already defined functions or use direct string values. So in order to exploit this feature we need to have a ready javascript function already located on the page and then we can just call it from the expression embedded in the stylesheet. That’s not very trivial obviously, but if you have a website that allows people to specify their own stylesheets and you want to be on the safe side, you have to either make sure you don’t have a javascript function that can cause any potential harm or filter expressions from people’s stylesheets.

Getting started styling JSON search results from DocumentCloud

I'm looking to build a system that styles the search results from DocumentCloud (and allows me to link to a given document).
I know I can query DocumentCloud and return JSON results using a search string like this:
https://www.documentcloud.org/api/search.json?q=obama
I don't know how to:
1. Grab the output of the search and put it on my own page
2. Style the data once I have it on my page
I'd just like to know how to get started with this. I'm experienced with HTML and CSS, but I've never worked with JSON before.
There's more info here but I just don't know where to get started: https://www.documentcloud.org/help/api
It sounds like you're not so familiar with JavaScript, correct? JSON stands for JavaScript Object Notation, so to work with it, you'll have to dive in a bit. I strongly recommend using a JavaScript framework/library, namely jQuery, to handle the heavy lifting. (There are other worthy libraries, but jQuery is by far the most popular and is very friendly, using CSS-like selectors to manipulate the document object model.)
Check this jQuery tutorial: How jQuery Works
Here's a primer on using jQuery's JSONP support to fetch remote results and use them in a page: http://www.ibm.com/developerworks/library/wa-aj-jsonp1/
You might end up with code in a JavaScript file, or in a script tag (placed after the one that loads the jQuery library), that looks like this:
$(document).ready(function () {
    $.getJSON('https://www.documentcloud.org/api/search.json?q=obama&callback=?', null, function (results) {
        // this would append whatever the json returns for 'total'
        // inside an element on your page with an id of 'resultsCount':
        $('#resultsCount').append(results.total);
    });
});
As a result, extra text & markup can be added to elements you already have on your page in whatever form/position you need it, and regular CSS rules from any style block or CSS file linked on your page will apply to them.
Good luck.
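To make the "put it on my page, then style it" step a bit more concrete, here is a small sketch that turns a results object into markup with class names that ordinary CSS can target. Only total appears in the answer above; the documents array and its title and canonical_url fields are assumptions about the response shape, and the sample data is invented, so check the API help page before relying on them.

```javascript
// Escape text before placing it inside markup.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Turn a search response into an HTML string.
// ASSUMPTION: `results.documents` is an array of objects with `title`
// and `canonical_url`; only `results.total` is confirmed by the answer.
function renderResults(results) {
  const items = (results.documents || []).map(function (doc) {
    return '<li class="result"><a href="' + escapeHtml(doc.canonical_url) + '">' +
      escapeHtml(doc.title) + '</a></li>';
  });
  return '<p class="count">' + escapeHtml(results.total) + ' results</p>' +
    '<ul class="results">' + items.join('') + '</ul>';
}

// Invented sample data, for illustration only.
const sampleResults = {
  total: 2,
  documents: [
    { title: 'Budget & Plan', canonical_url: 'https://example.org/doc/1' },
    { title: 'Memo <draft>', canonical_url: 'https://example.org/doc/2' }
  ]
};

console.log(renderResults(sampleResults));
```

Inside the getJSON callback you could pass the returned string to something like $('#results').html(...), where the results id is whatever container you choose, and then style .count, .results and .result from your stylesheet.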

how to create XPATH for a HTML DOM element?

How do I create an XPath for an HTML DOM element?
For example, "/HTML/BODY/DIV[1]/TABLE[1]/TR[2]/TD[1]/INPUT".
Given a DOM element, how can I get this XPath string?
Any ideas?
Thanks,
Dattebayo.
You can create a new DOMDocument and then import the node element (here $DE is presumably the DOMElement you already have):
$DD= new DOMDocument('1.0', 'utf-8');
$DD->loadXML( "<html></html>" );
$DD->documentElement->appendChild($DD->importNode($DE,true));
Then you can use XPath inside the new document:
$xpathe=new DOMXPath($DD);
As I recall, the xpath checker extension to firefox gives you a point-and-click interface for getting the xpath to DOM elements in a HTML document.
After a lot of struggle I found a way to do so.
Along with the DOM path, also use the sourceIndex of each node, like "/Html:1/Body:2/Div:5/Input:6".
But again,
1. This might not work for dynamic pages (where AJAX modifies the content).
2. This might not be unique across browsers, since the sourceIndex might vary depending on how each browser's rendering engine arranges the nodes. (Not sure of this yet though, just a thought.)
In Mozilla an xpath generator component was implemented although it never made it to default builds.
You can find the tests in the "final patch" attached to the bug I linked to to see how it can be used. You can also look up its implementation, might be helpful.
Here is a chrome extension that might help you (ChromyQlip)
https://chrome.google.com/extensions/detail/bkmllkjbfbeephbldeflbnpclgfbjfmn
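If you would rather compute the path yourself, one common approach is to walk from the element up through its parents and count same-named preceding siblings at each level. The sketch below does that in JavaScript; getXPath is a made-up name, and unlike the example in the question it always emits an index, so it produces paths such as /HTML[1]/BODY[1]/DIV[2].

```javascript
// Build an absolute XPath by walking from the element up to the root.
// `getXPath` is a hypothetical helper name.
function getXPath(element) {
  const segments = [];
  // nodeType 1 is an element node; the walk stops at the document node.
  for (let node = element; node && node.nodeType === 1; node = node.parentNode) {
    // Count preceding siblings with the same tag name.
    // XPath positions start at 1.
    let index = 1;
    for (let sib = node.previousSibling; sib; sib = sib.previousSibling) {
      if (sib.nodeType === 1 && sib.tagName === node.tagName) {
        index++;
      }
    }
    segments.unshift(node.tagName + '[' + index + ']');
  }
  return '/' + segments.join('/');
}

// Browser usage (the selector is a placeholder):
//   getXPath(document.querySelector('input'));
```

As noted above for sourceIndex, a path built this way describes the DOM at one moment, so it can go stale on pages whose content is modified by AJAX.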