How to serialize browser HTML DOM to XML? - html

I need to serialize the browser-parsed HTML DOM to well-formed XML.
In Firefox (Gecko), this works:
// serialize body to well-formed XML
var xml = new XMLSerializer().serializeToString(document.body);
But in WebKit, the result is equivalent to document.body.outerHTML, not well-formed XML (for example, <br> does not become <br />).
How can I serialize the browser's HTML DOM to XML in WebKit?
Thanks.

I have a setInnerXHTML method (not the Facebook version) which should work for this. The method is included in the base framework file, hemi.js, available from the Hemi Project Page. It is also included in my older libXmlRequest library.
Example:
var oXml = Hemi.xml.newXmlDocument("Xhtml");
Hemi.xml.setInnerXHTML(oXml.documentElement, document.documentElement, oXml);
var sSerial = Hemi.xml.serialize(oXml);
If you want to test this on a particular browser, navigate to the Hemi Project Page, click the upper-right tool icon, and click the Active Source tab. Copy and paste the sample code into the textarea and click Eval Source (the response will be a node name). Type in sSerial into the input field and hit enter, or click Eval, and you should see the serialized XML of the copied HTML DOM.
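If you would rather not depend on a library, the same idea can be sketched by hand: walk the DOM and emit the markup yourself. This is a minimal sketch, not Hemi's implementation; toXhtml and escapeXml are hypothetical names, and it assumes namespaces, doctypes, CDATA sections and processing instructions can be ignored. A comment whose text contains "--" would still produce invalid XML.

```javascript
// Escape the characters that are special in XML text and attribute values.
function escapeXml(text, isAttribute) {
  var out = String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
  return isAttribute ? out.replace(/"/g, "&quot;") : out;
}

// Recursively serialize a DOM node as XML-style markup.
// Only element, text and comment nodes are handled; everything else is dropped.
function toXhtml(node) {
  switch (node.nodeType) {
    case 1: // ELEMENT_NODE
      var name = node.nodeName.toLowerCase();
      var xml = "<" + name;
      for (var i = 0; i < node.attributes.length; i++) {
        var attr = node.attributes[i];
        xml += " " + attr.name + '="' + escapeXml(attr.value, true) + '"';
      }
      // Childless elements (br, img, empty divs, ...) become self-closing tags.
      if (node.childNodes.length === 0) {
        return xml + " />";
      }
      xml += ">";
      for (var j = 0; j < node.childNodes.length; j++) {
        xml += toXhtml(node.childNodes[j]);
      }
      return xml + "</" + name + ">";
    case 3: // TEXT_NODE
      return escapeXml(node.nodeValue, false);
    case 8: // COMMENT_NODE
      return "<!--" + node.nodeValue + "-->";
    default:
      return "";
  }
}
```

In a browser you would call it as var xml = toXhtml(document.body); it only reads nodeType, nodeName, nodeValue, attributes and childNodes, so it should not depend on how a given engine implements XMLSerializer.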

Related

Jsoup - hidden div class?

I'm trying to scrape a div class, but everything I have tried has failed so far :(
I'm trying to scrape the element(s):
<a href="http://www.bellator.com/events/d306b5/bellator-newcastle-pitbull-vs-scope"><div class="s_buttons_button s_buttons_buttonAlt s_buttons_buttonSlashBack">More info</div></a>
from the website: http://www.bellator.com/events
I tried accessing the list of elements by doing
Elements elements = document.select("div[class=s_container] > li");
but that didn't return anything.
Then I tried accessing just the parent with
Elements elements = document.select("div[class=s_container]");
and that returned two divs with the class name "s_container", neither of which is the one I needed :<
Then I tried accessing that one's parent with
Elements elements = document.select("div[class=ent_m152_bellator module ent_m152_bellator_V1_1_0 ent_m152]");
and that didn't return anything.
I also tried
Elements elements = document.select("div[class=ent_m152_bellator]");
because I wasn't sure about the whitespace, but it didn't return anything either.
Then I tried accessing its parent with
Elements elements = document.select("div#t3_lc");
and that worked, but it returned an element containing
<div id="t3_lc">
<div class="triforce-module" id="t3_lc_promo1"></div>
</div>
which is kind of weird, because I can't see that it has that child when I inspect the website in Chrome :S
Does anyone know what's going on? I feel kind of lost.
What you see in your web browser is not what Jsoup sees. Disable JavaScript and refresh the page to see what Jsoup gets, or press CTRL+U ("Show source", not "Inspect"!) in your browser to see the original HTML document before any JavaScript modifications. The browser's inspector shows the final document after those modifications, so it's not suitable for your needs.
It seems that the whole "UPCOMING EVENTS" section is loaded dynamically by JavaScript.
What's more, this section is loaded asynchronously with AJAX. You can use your browser's debugger (Network tab) to see every request and response.
I found the request, but unfortunately all the data you need is returned as JSON, so you're going to need another library to parse it.
That's not the end of the bad news, because this case is more complicated. You could request the data directly:
http://www.bellator.com/feeds/ent_m152_bellator/V1_1_0/d10a728c-547e-4a6f-b140-7eecb67cff6b
but the URL looks random, and several of these URLs (one per upcoming event?) are embedded in the JavaScript code inside the HTML.
My approach would be to get the URLs of these feeds with something like:
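import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

List<String> feedUrls = new ArrayList<>();
// Assumption: feed URLs appear unescaped in the scripts and end at a quote or whitespace.
Pattern feedPattern = Pattern.compile("http://www\\.bellator\\.com/feeds/[^\"'\\s]+");
// select all the scripts
Elements scripts = document.select("script");
for (Element script : scripts) {
    // script contents are data nodes, so read them with data(); text() would be empty
    Matcher matcher = feedPattern.matcher(script.data());
    while (matcher.find()) {
        feedUrls.add(matcher.group());
    }
}
for (String feedUrl : feedUrls) {
    // download each feed; execute().body() returns the raw response instead of parsing it as HTML
    String json = Jsoup.connect(feedUrl).ignoreContentType(true).execute().body();
    // here use JSON parsing library to get the data you need
}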
An ALTERNATIVE approach would be to stop using Jsoup for fetching the page, because of its limitations, and use Selenium WebDriver instead. It executes the page's JavaScript, so you'd get the HTML of the final result: exactly what you see in the web browser and Inspector.
If anyone finds this in the future: I managed to solve it with Selenium. I don't know if it's a good/correct solution, but it seems to be working.
System.setProperty("webdriver.chrome.driver", "C:\\Users\\PC\\Desktop\\Chromedriver\\chromedriver.exe");
WebDriver driver = new ChromeDriver();
driver.get("http://www.bellator.com/events");
String html = driver.getPageSource();
Document doc = Jsoup.parse(html);
Elements elements = doc.select("ul.s_layouts_lineListAlt > li > a");
for(Element element : elements) {
System.out.println(element.attr("href"));
}
Output:
http://www.bellator.com/events/d306b5/bellator-newcastle-pitbull-vs-scope
http://www.bellator.com/events/ylcu8d/bellator-215-mitrione-vs-kharitonov
http://www.bellator.com/events/yk2djw/bellator-216-mvp-vs-daley
http://www.bellator.com/events/e8rdqs/bellator-217-gallagher-vs-graham
http://www.bellator.com/events/281wxq/bellator-218-sanchez-vs-grimshaw
http://www.bellator.com/events/8lcbdi/bellator-219-koreshkov-vs-larkin
http://www.bellator.com/events/9rqguc/bellator-macdonald-vs-fitch

xpath for HTMLagility pack on site with frames

I'm trying to extract all the station names, which are contained in the left frame of http://www.raws.dri.edu/wraws/orF.html, using HtmlAgilityPack.
My XPath string is currently //frame[@name='list']. At this point it returns the node, but I can't seem to access any of its child nodes. Ultimately I'm trying to return all the attributes that are in frameset[1]/html/body/[@a], which looks something like this:
<a onmouseover="popup('<font color=Black><strong> IDARNG1 RG2 Idaho (RAWS) </strong> </font> ',615,307);update('IDARNG1 RG2 Idaho (RAWS)',615,307,'idIAN1','raw');return true;" onmouseout="removeBox();removedot();" href="/cgi-bin/rawMAIN.pl?idIAN1">
Here is what the browser does:
It opens http://www.raws.dri.edu/wraws/orF.html
It parses the source and performs another request for every <frame> it finds.
HtmlAgilityPack does not make those extra requests, which is why the frame node has no children. That means you need to open the URL the <frame> points to yourself; it can be found in the src attribute. Below is an example:
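// doc is the already loaded frameset page; read the src of the frame named "list"
string src = doc.DocumentNode.SelectSingleNode("//frame[@name='list']").GetAttributeValue("src", "");
// the src is relative, so resolve it against the page's directory
string url = "http://www.raws.dri.edu/wraws/" + src;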
The URL you're looking for is:
http://www.raws.dri.edu/wraws/orlst.html
Open it manually and you will see that only the left sidebar is loaded.
Next time, use an HTTP debugger such as Firebug or Fiddler to see what is happening behind the scenes.

How can I embed a PDF document in a KnockoutJS template without PDF plugins issuing warnings?

I am trying to embed a PDF document in an HTML view, with a Knockout ViewModel providing the URL for the document. I believe that the correct HTML element for this is <object>, so I have the following view:
<div class="documentviewerpdf">
<object data-docType="pdf" data-bind="attr: { 'data': EmbedPDFLink }" type="application/pdf" width="100%" />
</div>
and the following as a view model:
function AppViewModel() {
this.EmbedPDFLink = "http://acroeng.adobe.com/Test_Files/browser_tests/embedded/simple5.pdf";
}
ko.applyBindings(new AppViewModel());
This displays the PDF in Chrome, Chrome Canary (both using native Chrome PDF plugin), and Firefox 27 (Adobe Reader XI plugin), however all three browsers display a warning in a bar across the top of the screen. Chrome's is yellow and states that it Could not load Chrome PDF Viewer, while Firefox's is grey with an information icon and states that this PDF document might not be displayed correctly. The same code loads the plugin empty on IE9.
If I replace the data-bind attribute with a direct data attribute containing the hard coded URL for the PDF document, Chrome and Firefox display correctly, while IE9 displays nothing at all, not even the empty plugin.
I have tried setting the data attribute using a <param> element within the <object> as well, and that did not work at all in any of these browsers.
I have also tried using an <embed> tag, which gives similar results though works in IE9, however this does not seem like it is semantically correct. However, the embed element documentation states that any attributes are passed to the plugin - given that the elements are so similar, is it likely that the data-bind attribute is being passed to the PDF plugins, and causing this problem?
It appears that the only difference in mark-up between the hardcoded and data-bind versions is the presence of a data-bind attribute on the latter, so I think that is causing the problem with the plugins, as the data URL attribute is being set correctly.
Is there a way to set the data attribute on the object using Knockout, without leaving a data-bind attribute there as well? Is there another way that anyone knows to avoid this issue?
I'm not 100% sure, but I think this is what's happening.
Your HTML markup has an <object data-docType='pdf' />, so that element is there immediately upon DOM load. However, its data attribute is set by a KO binding. So when the <object> element is first loaded, the KO bindings haven't been applied yet, the plugin starts without a URL, and you get the error.
I tested this by constructing the <object> markup in JavaScript and then adding it to the DOM, and the error went away. Hope it helps, see fiddle
function AppViewModel() {
this.EmbedPDFLink = "http://acroeng.adobe.com/Test_Files/browser_tests/embedded/simple5.pdf";
this.addPdf = function () {
var html = "<object data-docType=\"pdf\" data=\"" + this.EmbedPDFLink + "\" type=\"application/pdf\" width=\"100%\" />";
$('.documentviewerpdf').append(html);
};
}
ko.applyBindings(new AppViewModel());
and the HTML
<button data-bind="click: addPdf">Load PDF</button>
<div class="documentviewerpdf"></div>
Edit
Here's an updated fiddle that will automatically load the PDF when you get to the page (more in line with what you want your end result to be, I think). I tested it in IE, Firefox, and Chrome (latest versions) and received no errors.
Use bindings to keep all of your attributes hidden until the source path has been evaluated. The plugin sees your other attributes and thinks you have a bad element.
data-bind="attr: { 'each-attribute-here': true }"
http://jsfiddle.net/C8txY/8/
Chrome recognizes your PDF too quickly, before the value is evaluated. Tie all of the properties that the plugin is looking for into your binding.
You could also use a custom binding here to add the attributes and pass in the value of the location of the PDF. This custom binding handler should not directly inject HTML into the DOM.
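A custom binding along those lines might look like the sketch below. It is only an illustration, not code from the fiddles: the binding name pdfObject and the wrapper function registerPdfObjectBinding are made up, and it assumes Knockout 3.0 or later for ko.unwrap (older versions have ko.utils.unwrapObservable). The <object> is built with createElement and only appended once all its attributes are set, so the plugin never sees a half-configured element and no HTML string is injected.

```javascript
// Registers a hypothetical "pdfObject" binding on the given Knockout instance.
function registerPdfObjectBinding(ko) {
  ko.bindingHandlers.pdfObject = {
    update: function (element, valueAccessor) {
      // Works for both observables and plain values.
      var url = ko.unwrap(valueAccessor());

      // Remove any viewer created by a previous update.
      while (element.firstChild) {
        element.removeChild(element.firstChild);
      }
      if (!url) {
        return; // nothing to show yet
      }

      // Configure the <object> completely before it enters the DOM.
      var viewer = element.ownerDocument.createElement("object");
      viewer.setAttribute("type", "application/pdf");
      viewer.setAttribute("width", "100%");
      viewer.setAttribute("data", url);
      element.appendChild(viewer);
    }
  };
}
```

Call registerPdfObjectBinding(ko) before ko.applyBindings, and bind the container rather than the <object> itself: <div class="documentviewerpdf" data-bind="pdfObject: EmbedPDFLink"></div>. That way the data-bind attribute sits on the div and is never passed to the plugin.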

html is blank in a blackberry cascades webview (C++, QT, QML)

In BlackBerry Cascades (C++, Qt, QML), I am trying to read the html of a webview, but it is returning blank. This webview uses "setUrl(url)" to set the url, and does not use "setHtml(html)". Anyway, I have this code:
WebView {
id: loginView
objectName: "loginView"
onMicroFocusChanged: {
console.log("html: " + html);
}
}
And the webview url has two textfields, and when I put my cursor into those text fields or when I type in them, the html of the webview shows up as blank - but I need to see the html, because I am trying to be able to parse that html to get the content of those textfields.
How come the html is blank - and how can I get access to this html?
The html property of the WebView only returns the code that was inserted with setHtml. (Documentation)
Even if it did report code loaded from a web address, I doubt it would be updated with the current values of the text boxes.
To read their content, I recommend you look into the messageReceived signal of the WebView. If you can change the HTML that contains your text boxes, you can use the JavaScript function navigator.cascades.postMessage() to send the data to your application.
If you do not control the HTML, you can still use the evaluateJavaScript method to extract the values of the text boxes with DOM functions from inside your app.
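As a rough sketch of the page-side JavaScript for either route: the helper below gathers the current values of all <input> elements, and the second function posts them to the app. The function names and the JSON message format are my own assumptions; only navigator.cascades.postMessage() comes from the Cascades side, and I have not run this on a device.

```javascript
// Collect the current value of every <input> in the given document,
// keyed by name, then id, then a positional fallback.
function collectFieldValues(doc) {
  var inputs = doc.getElementsByTagName("input");
  var values = {};
  for (var i = 0; i < inputs.length; i++) {
    var key = inputs[i].name || inputs[i].id || "field" + i;
    values[key] = inputs[i].value;
  }
  return values;
}

// Send the values to the hosting app as a JSON string.
// Only meaningful inside a Cascades WebView, where navigator.cascades exists.
function postFieldValues() {
  var payload = JSON.stringify(collectFieldValues(document));
  navigator.cascades.postMessage(payload);
}
```

If you control the page, you could call postFieldValues() from a change or submit handler and parse the string in your messageReceived handler. If you don't, the same source could be passed to evaluateJavaScript instead.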

display an (x)html document structure in a new window

For debugging purposes, I need to create a new XML document popup to display the (X)HTML source structure of my current document.
But the following code does not work:
var w = window.open();
w.document.open('text/xml');
w.document.write(window.document.documentElement.innerHTML);
w.document.close();
It seems that document.open() does not accept a contentType anymore.
Is there any other solution?
Just put a textarea in your existing page and copy the innerHTML into the textarea.
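Something like this minimal sketch would do it. The function name showSourceInTextarea and the textarea size are arbitrary choices of mine. Assigning to value displays the markup as plain text, so nothing needs to be escaped.

```javascript
// Append a read-only textarea containing the document's current markup.
function showSourceInTextarea(doc) {
  var area = doc.createElement("textarea");
  area.rows = 30;
  area.cols = 100;
  area.readOnly = true;
  // The value is shown verbatim, so the markup is not parsed again.
  area.value = doc.documentElement.innerHTML;
  doc.body.appendChild(area);
  return area;
}
```

Call showSourceInTextarea(document) from the console or a debug button. Note that innerHTML reflects the live DOM as the browser serializes it, not the original source file.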