Recovering HTML rendered in a browser using JSoup

Can somebody kindly suggest the proper way to use JSoup on a website like "https://network.axial.net/a/company/business-team-san-francisco/"?
This website relies heavily on JavaScript, and no matter what I try {documentObj.body().data(), documentObj.html(), connectionObj.response().body(), Jsoup.connect(urlStr).userAgent("Mozilla").data("name", "jsoup") etc.}, I am not able to recover the HTML that is rendered in a browser.

This is not possible with JSoup. JSoup only parses the HTML it is given; it does not execute JavaScript, so it never sees content that scripts add after the page loads.
If you are looking for something that can evaluate JavaScript and return the resulting DOM, you might want to look at either Selenium or HtmlUnit.

Related

Automate Web Applications -parsing HTML Data

I want to automate a web application that parses an HTML page and pulls the inner text of every HTML tag that meets some condition. For example, given a span tag whose class="spanclass_1":
This is span tag...
the span has that particular class, so the app parses it and pulls it in.
The main pain point is that I should not use the developers' code to automate the same HTML parsing.
I want to automate checking that the parsing is done correctly, simply by using the parsed data shown in the UI.
Any help would be great.
I appreciate your time reading this.
(Note: the span tag itself is not shown.)
Thanks, buddies.
Not enough details.
Is this HTML page just a file on the local filesystem, or is it a web page on the internet?
Do you have access to the pages? Can you modify them? If yes, then just add JavaScript to the page that extracts the data and posts it to the server.
If not, then it depends on the language you program in.
Find a good framework for parsing HTML, load the page, parse it, and extract the data. There are several possible situations:
Worst scenario: the page is generated on the client side using JS.
Best scenario: the page is in XHTML mode (you are lucky; any XML parser will help you build a DOM and extract the data).
So-so: the page is in plain HTML format (try several HTML parsers to find the one most suitable for you).
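
For the "best scenario" above, a minimal Java sketch using only the JDK's XML parser and XPath could look like the following. It assumes the page is well-formed XML with no namespace or DOCTYPE declaration; the class name SpanTextExtractor and the sample markup are made up for illustration, reusing the spanclass_1 class from the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class SpanTextExtractor {

    // Returns the trimmed text of every <span> whose class attribute equals cssClass.
    // Assumes the input is well-formed XML; ordinary "tag soup" HTML will make parse() throw.
    static List<String> spanTexts(String xhtml, String cssClass) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // cssClass is pasted into the expression as-is, so it must not contain a single quote.
        NodeList nodes = (NodeList) xpath.evaluate(
                "//span[@class='" + cssClass + "']", doc, XPathConstants.NODESET);

        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent().trim());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<span class=\"spanclass_1\">This is span tag...</span>"
                + "<span class=\"other\">ignored</span>"
                + "</body></html>";
        System.out.println(spanTexts(page, "spanclass_1"));
    }
}
```

For the "so-so" and "worst" scenarios this will not be enough: malformed HTML needs a tolerant HTML parser, and client-side generated pages need something that runs the scripts first.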

HTML parsing in Clojure

I'm looking for a good way to parse HTML in Clojure.
Specifically, I'm trying to fetch the content of a web page with a crawler and then get the content of some HTML tags or their attributes.
So I have the URL of the page, and I get the HTML as a String, but how do I get the data I need?
Use https://github.com/cgrand/enlive
It lets you select and retrieve nodes with CSS-like selectors.
Or https://github.com/nathell/clj-tagsoup
I am not experienced with clj-tagsoup, but I can tell you that enlive works well for most scraping.

HTML video detection

I have an HTML document and I want to extract the URL of every video file. What's the best way to do this, given that there are different HTML versions and different ways to embed a video file in an HTML document? For this purpose I'd use the Html Agility Pack (C#).
You should parse the HTML with a regular expression to get the video URLs.
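
As a rough illustration of the regular-expression approach, here is a Java sketch that collects src and href values ending in a video file extension. The class name VideoUrlFinder and the extension list are assumptions made for this example. Regular expressions are brittle on real-world HTML, and this one will miss embeds that do not expose the file URL in a quoted src or href attribute (for example players created by script).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class VideoUrlFinder {

    // Matches src="..." or href="..." (single or double quoted) whose value ends in a
    // common video extension, optionally followed by a query string.
    // The extension list is an assumption; extend it as needed.
    private static final Pattern VIDEO_URL = Pattern.compile(
            "(?:src|href)\\s*=\\s*([\"'])([^\"']+?\\.(?:mp4|webm|ogv|ogg|mov|avi|flv)(?:\\?[^\"']*)?)\\1",
            Pattern.CASE_INSENSITIVE);

    // Returns the matched attribute values in document order.
    static List<String> findVideoUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = VIDEO_URL.matcher(html);
        while (m.find()) {
            urls.add(m.group(2)); // group 1 is the quote character, group 2 the URL
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<video src=\"movie.mp4\"></video>"
                + "<source src='clips/intro.webm?x=1' type='video/webm'>"
                + "<a href=\"page.html\">not a video</a>";
        System.out.println(findVideoUrls(html));
    }
}
```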

parse html in adobe air

I am trying to load and parse HTML in Adobe AIR. The main purpose is to extract the title, meta tags and links. I have been trying HTMLLoader, but I get all sorts of errors, mainly uncaught JavaScript exceptions.
I also tried to load the HTML content directly (using URLLoader) and push the text into HTMLLoader (using loadString(...)), but got the same error. My last resort was to load the text into XML and then use E4X queries or XPath; no luck there, because the HTML is not well formed.
My questions are:
1. Is there a simple and reliable (AIR/ActionScript) DOM component out there (I do not need to display the page, and headless mode will do)?
2. Is there any library to convert (crappy) HTML into well-formed XML so I can use XPath/E4X?
3. Any other suggestions on how to do this?
Thanks
ActionScript is supposed to be a superset of JavaScript, and thankfully, there's...
Pure JavaScript/ActionScript HTML Parser
created by JavaScript guru and jQuery creator John Resig :-)
One approach is to run the HTML through HTMLtoXML() then use E4X as you please :)
AFAIK:
1. No :-(
2. No :-(
I think the easiest way to grab title and meta tags is writing some regular expressions. You can load the page's HTML code into a string and then read out whatever you need like this:
var str:String = ""; // put HTML code in here
var pattern:RegExp = /<title>(.+)<\/title>/i;
trace(pattern.exec(str));
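
The same idea can be sketched in Java (not ActionScript), shown here only to illustrate the pattern. The class name TitleExtractor is made up for the example. Compared with the pattern above, this variant is non-greedy, tolerates attributes on the title tag, and matches a title that spans several lines.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class TitleExtractor {

    // (.*?) is non-greedy so the match stops at the first closing tag;
    // DOTALL lets '.' match line breaks inside the title.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed title text, or an empty Optional if there is no title element.
    static Optional<String> title(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<html><head><TITLE>\n My page </TITLE></head><body></body></html>";
        System.out.println(title(html).orElse("(no title)"));
    }
}
```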

Is it possible to retrieve the contents of a html textarea with XPath?

I've looked all over but I can't find any leads. Is it possible to do something like:
//textarea/<some kind of function?>
or
<function>(//textarea)
I know I can do this using JS, or any number of other techniques, but I'm asking because I'm using WebDriver and Firefox to test a TinyMCE textarea input. Because of JS execution delays, I'd like to wait for the textarea to display a certain string after clicking a formatting control, and the only way I can think of to achieve this in WebDriver is with SlowLoadableComponent and XPath. That or Thread.sleep, but I'd like to avoid that ;)
Thanks in advance.
With TinyMCE, the editor area isn't actually a textarea - it's an iframe. Your best bet would be to use Google Chrome (or a similar browser with developer tools) to have a look at what TinyMCE generates. You can then use XPath with something like this:
//iframe
which will retrieve all iframes on the page - or you can add an attribute predicate:
//iframe[@id="myiframe"]
which will retrieve all iframes on the page whose id attribute is set to 'myiframe'
Hope this helps :)
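
Attribute tests in XPath need the @ prefix; without it, id is read as a child element name. The small Java sketch below shows the difference using the JDK's own XPath engine on a made-up, well-formed snippet. The class name IframeXPathDemo and the markup are assumptions for illustration. In WebDriver the same expression string would be passed to By.xpath.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of any library.
final class IframeXPathDemo {

    // Parses well-formed markup and returns how many nodes the XPath expression selects.
    static int count(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<iframe id=\"myiframe\"></iframe>"
                + "<iframe id=\"other\"></iframe>"
                + "</body></html>";
        System.out.println(count(page, "//iframe"));                  // every iframe
        System.out.println(count(page, "//iframe[@id='myiframe']"));  // attribute test
        System.out.println(count(page, "//iframe[id='myiframe']"));   // child element test, matches nothing here
    }
}
```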