I was wondering what the best approach is on Android to retrieve information from an HTML page hosted on the internet?
For example I'd like to be able to get the text from the following page at the start of each day:
http://www.met.ie/forecasts/sea-area.asp
I have been downloading and parsing XML files, but I have never tried to parse information from an HTML file before.
Is there a native way to parse the information I want?
Or do I need a third party library?
Or do I need to look into screen scraping?
If you are parsing HTML, regardless of how you do it, you are screen scraping. Techniques run the gamut from regular expressions to third-party libraries like jTidy. The only question is whether jTidy works on Android. I don't know; you'll have to research it.
I'd suggest using regular expressions: compile them once and cache the Pattern object for performance.
If you can't get a proper web service API for the data you want, then you always run the risk of the author changing the layout, moving the data on you, and breaking your code. That's why screen scraping is generally frowned upon and only used as a last-ditch effort.
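As a minimal sketch of the "compile once, cache the Pattern" suggestion above: the class name, the sample markup, and the assumption that the forecast text sits in a div with id="content" are all hypothetical, not taken from the real page. The non-greedy match also breaks on nested div elements, which is exactly the kind of fragility described above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ForecastScraper {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    // Assumption: the wanted text lives in <div id="content">...</div> with no nested divs.
    private static final Pattern CONTENT_DIV = Pattern.compile(
            "<div[^>]*\\bid=\"content\"[^>]*>(.*?)</div>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches any tag left inside the captured fragment.
    private static final Pattern TAGS = Pattern.compile("<[^>]+>");

    // Returns the plain text of the content div, or empty if the div is not found.
    static Optional<String> extractContent(String html) {
        Matcher m = CONTENT_DIV.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        String text = TAGS.matcher(m.group(1)).replaceAll(" ");
        return Optional.of(text.trim().replaceAll("\\s+", " "));
    }

    public static void main(String[] args) {
        String page = "<html><body><div id=\"content\">"
                + "<p>Wind: <b>southwest</b> force 4</p></div></body></html>";
        System.out.println(extractContent(page).orElse("(not found)"));
    }
}
```

Returning an Optional makes the "layout changed, nothing matched" case explicit, so the caller has to decide what to do when the scrape fails.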
If you don't want to go the third-party route, you could use a WebView and inject JavaScript into it to extract the information you want.
Example code:
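WebView webView = new WebView(context);
// JavaScript is disabled by default and must be enabled for the injected script to run.
webView.getSettings().setJavaScriptEnabled(true);
webView.addJavascriptInterface(new Object() {
    // The annotation is required on API level 17 and above for methods exposed to JavaScript.
    @JavascriptInterface
    public void parseForecast(String html) {
        // do something with html
    }
}, "Foo");
webView.setWebViewClient(new WebViewClient() {
    @Override
    public void onPageFinished(WebView view, String url) {
        if (url.equals(FORECAST_URL)) {
            view.loadUrl("javascript:window.Foo.parseForecast(document.getElementById('content').innerHTML);");
        }
    }
});
webView.loadUrl(FORECAST_URL);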
Is there a native way to parse the information I want?
No.
Or do I need a third party library?
Yes.
Or do I need to look into screen scraping?
What you are looking to do fits the term "screen scraping" as it is used with respect to Web sites. As I wrote in a previous question on this topic, to parse HTML, you use an HTML parser. There are several open source ones, and it is reasonably likely that one or more will work on Android with few modifications if any.
I want to automate a web application that parses an HTML page and pulls out the inner text of HTML tags based on some condition. For example, suppose we have a span tag with class="spanclass_1":
This is span tag...
which has a particular class id, so the app parses the page and pulls that span in.
The main pain point is that I should not use the developers' code to automate that same HTML parsing.
I want to automate checking that the parsing is done correctly, simply by using the parsed data shown in the UI.
Any help would be great.
I appreciate your time reading this.
(Note: the span tag is not shown.)
Thanks, buddies.
Not enough details.
Is this HTML page just a file on the local filesystem, or is it a web page on the internet?
Do you have access to the pages? Can you modify them? If yes, then just add JavaScript to the page that extracts the data and posts it to the server.
If not, then it depends on the language you program in.
Find a good framework for parsing HTML, load the page, parse it, and extract the data. There are several possible situations:
Worst scenario - the page is generated on the client side using JS.
Best scenario - the page is in XHTML mode (you are lucky: any XML parser will build a DOM you can extract data from).
So-so - the page is plain HTML (try several HTML parsers to find the one most suitable for you).
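For the best scenario above, here is a rough sketch in Java using only the JDK's built-in XML parser. It assumes the markup is well-formed XML with no DOCTYPE to resolve, and it matches the class attribute exactly (a span with several classes would not match). The class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class XhtmlSpanExtractor {
    // Returns the text of every <span> whose class attribute equals cssClass exactly.
    static List<String> spanTextByClass(String xhtml, String cssClass) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            NodeList spans = doc.getElementsByTagName("span");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < spans.getLength(); i++) {
                Element span = (Element) spans.item(i);
                if (cssClass.equals(span.getAttribute("class"))) {
                    result.add(span.getTextContent().trim());
                }
            }
            return result;
        } catch (Exception e) {
            // Plain (non-XHTML) pages usually end up here: they are not well-formed XML.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<span class=\"spanclass_1\">This is span tag...</span>"
                + "<span class=\"other\">skip me</span>"
                + "</body></html>";
        System.out.println(spanTextByClass(page, "spanclass_1"));
    }
}
```

If the parser throws on a real page, that is a sign you are in the "so-so" case and need a lenient HTML parser instead.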
I've tried various approaches, the current is as follows
$(document).ready(function(){
    $('#stage').click(function(){
        jQuery.getJSON('https://mtgox.com/api/1/BTCUSD/ticker?callback=showTick', function(ticker){
            $('div#tickerbox').html(ticker);
        });
    });
});
Losing my mind . . .
I built PHP tools to make this easy, providing pure text tickers, HTML tickers, and even image tickers (and other tools like RSS ticker feeds).
Have a look at the code at:
https://github.com/neofutur/bitcoin_simple_php_tools
More details and examples at:
https://bitcointalk.org/index.php?topic=68205
The tools include a 30-second caching system so you won't hit the API too often, and thus avoid being blacklisted by the anti-DDoS system.
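To illustrate the caching idea only (this is not how the linked PHP tools are implemented, and every name here is hypothetical), a time-based cache can be sketched in Java like this. The fetcher stands in for the real HTTP call, and the clock is injected so the expiry logic can be exercised without waiting.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class CachedTicker {
    private final Supplier<String> fetcher;  // hypothetical: would perform the real API request
    private final LongSupplier clockMillis;  // injectable clock, e.g. System::currentTimeMillis
    private final long ttlMillis;            // how long a fetched value stays fresh
    private String cached;
    private long fetchedAt;

    CachedTicker(Supplier<String> fetcher, long ttlMillis, LongSupplier clockMillis) {
        this.fetcher = fetcher;
        this.ttlMillis = ttlMillis;
        this.clockMillis = clockMillis;
    }

    // Returns the cached value, refetching only when it is missing or older than the TTL.
    synchronized String get() {
        long now = clockMillis.getAsLong();
        if (cached == null || now - fetchedAt >= ttlMillis) {
            cached = fetcher.get();
            fetchedAt = now;
        }
        return cached;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        CachedTicker ticker = new CachedTicker(
                () -> "tick #" + (++calls[0]), 30_000L, System::currentTimeMillis);
        System.out.println(ticker.get());
        System.out.println(ticker.get()); // within 30 seconds, so served from the cache
        System.out.println("fetches: " + calls[0]);
    }
}
```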
I don't think JavaScript is the best way to add an Mt. Gox ticker, but if you really want it in JS, there's at least one JavaScript implementation, the Firefox add-on for those tickers:
https://github.com/joric/mtgox-ticker
https://github.com/joric/mtgox-ticker/blob/master/lib/main.js
Also, note that SE has a dedicated site for Bitcoin-related questions:
http://bitcoin.stackexchange.com
You might have had more answers there, where all the bitcoiners are ;)
Unfortunately, the Mt. Gox API supports neither JSONP nor CORS at the time of this writing. It seems like it would be easy enough for them to add JSONP support, so if they add it in the near future, this answer should help; until then, however, it does not. The rest of this answer assumes now is the future and they support JSONP.
First of all, you'll want to change callback=showTick to callback=? so jQuery knows to put its autogenerated callback name there. Then when your callback is called, ticker will be a decoded JSON object, not a string, so you'll want to pull the information you want out of there. For example, to show the average price:
jQuery.getJSON('https://mtgox.com/api/1/BTCUSD/ticker?callback=?', function(data) {
// We can't use .return because return is a JavaScript keyword.
alert(data['return'].avg.display_short);
});
I know there is a list of similar questions but all handle pages without user interaction (static even though some js may be there).
Let's say we have a page the user can interact with (e.g. an SVG that changes, or HTML tables with drilldown where the content changes). Those interactions will change the page. The same happens on Stack Overflow when entering a question...
The idea is to add a "convert to pdf" button that takes the current state of the HTML and sends a PDF version back to the user (we have a Java server).
Using the print of the browser is not the answer I'm looking for :-).
Is this a stick in the moon ?
You would have to store the parameters that generate the HTML view (i.e. what the user clicks on, what selections they make, etc.). If you have a list of parameters that generate the HTML view, you can write a method which accepts that list (as a JSON POST, perhaps), generates the HTML view, and passes it to your PDF-generating routine. I'm not too familiar with Java libraries for this purpose, but PHP has TCPDF, which can take HTML output and basically generate a PDF for you. There are certainly Java libraries which will allow you to do the same thing, or you can use the parameters to get a list of rows/arrays which can be iterated over and output using the PDF library of your choice.
Both iTextPDF and Aspose.PDF would allow you to do that (I've seen them used in two different projects), but there is no magic and you will have to do some work.
The steps are roughly:
Get (as a string) the part of the document which you want to print with jQuery or innerHTML
Call a service on the server side to convert this to PDF
[Server side] Use a whitelist-based tool to clean up the HTML (unless you want to be hacked). JSoup is great for that.
[Server side] Use the iText or Aspose API to create the PDF from the HTML (this is not trivial; you will have to read the documentation)
Download the document
I'd also recommend DocRaptor, an HTML to PDF API built by my company, Expected Behavior.
DocRaptor uses Prince XML to generate PDFs, and thus produces higher quality results than similar products.
Adding PDF generation to your own web application using our service is as simple as making an HTTP POST request to our server.
Here's a link to DocRaptor's home page:
DocRaptor
And a link to our API documentation:
DocRaptor API documentation
In my WPF project I need to render HTML-based content, where the content is stored in a resource assembly referenced by my WPF project.
I have looked at the WPF Frame and WebBrowser controls. Unfortunately, they both only expose Navigation events (Navigating, Navigated), but not any events that would allow me, based on the requested URL, to return HTML content retrieved from the resource assembly.
I can intercept navigation requests and serve up HTML content using the Navigating event and the NavigateToString() method. But that doesn't work for intercepting load calls for images, CSS files, etc.
Furthermore, I am aware of an HTML to Flowdocument SDK sample application that might be useful, but I would probably have to extend the sample considerably to deal with images and style sheets.
For what it is worth, we also generate the HTML content to be rendered (via Wiki pages), so the source HTML is somewhat predictable (e.g., maybe no JavaScript) in terms of referenced image locations and CSS style sheets used. We are not looking to display random HTML content from the internet.
Update:
There is also the possibility to create an MHT file for each HTML page, which would 'inline' all images as MIME-types and alleviate the need to have finer-grained callbacks.
If you're okay with using a 28 meg DLL, you may want to take a look at BerkeliumSharp, which is a managed wrapper around the awesome Berkelium library. Berkelium uses the chromium browser at its core to provide offscreen rendering and a delegated eventing model. There are tons of really cool things you can do with this, but for your particular problem, in Berkelium there is an interface called ProtocolHandler. The purpose of a protocol handler is to take in a URL and provide the HTTP headers and body back to the underlying rendering engine.
In the BerkeliumSharp test app (one of the projects available in the source), you can see one particular use of this is the FileProtocolHandler -- it handles all the file IO for the "file://" protocol using .NET managed classes (System.IO). You could do the same thing for a made up protocol like "resource://". There's really only one method you have to override called HandleRequest that looks like this:
bool HandleRequest (string url, ref byte[] responseBody, ref string[] responseHeaders)
So you'd take a URL like "resource://path/to/my/html" and do all the necessary Assembly.GetResourceStream etc. in that method. It should be pretty easy to take a look at how FileProtocolHandler is used to adapt your own.
Both Berkelium and BerkeliumSharp are open source under a BSD license.
The WebBrowser control exposes a NavigateToStream(Stream) method that might work for you.
If your content is stored as an embedded resource, you could use:
var browser = new WebBrowser();
var source = Assembly.Load("ResourceAssemblyName");
browser.NavigateToStream(source.GetManifestResourceStream("ResourceNamespace.ResourceName"));
There is also a NavigateToString(string) method that expects the string content of the document.
Note: I have never used this in anger, so I have no idea how much help it will be!
My colleague is extremely 'hot' on properly formatted and indented html being delivered to the client browser. This is so that the page source is easily readable by a human.
Firstly, if I have a partial view that is used in a number of different areas in my site, should the rendering engine be automatically formatting the indentations for me (ala setting the Formatting property on an XmlTextWriter)?
Secondly, my colleague has created a number of HtmlHelper extension methods for writing to the response. These all require a CurrentIndent parameter to be passed to them. This smells wrong to me.
Can anyone help with this?
This sounds difficult to maintain. If someone removed an outer element from the HTML, would anyone bother to update the CurrentIndent values in the code? These days most developers usually view their HTML through Firebug anyway, which formats the markup automatically with indentation.
If you really want to post-process HTML through a formatting filter then try a .NET port of HTML Tidy.
Browsers absolutely don't care how beautiful the HTML indentation is. What's more, deeply nested (and thus heavily indented) HTML adds a slight overhead to the page (in terms of bytes to download). Granted, you can always compress the response, and well-indented HTML is nicer to support.
Even if for some crazy reason it HAS TO be indented "properly", it shouldn't be done the way your colleague suggests.
An HttpModule attached to ReleaseRequestState event of the HttpApplication object should do the trick. And of course, you're going to need to come up with a filter that handles this indenting.
public class IndentingModule : IHttpModule {
    public void Dispose() {
    }

    public void Init(HttpApplication context) {
        context.ReleaseRequestState +=
            new EventHandler(context_ReleaseRequestState);
    }

    void context_ReleaseRequestState(object sender, EventArgs e) {
        HttpApplication app = (HttpApplication)sender;
        // IndentingFilter is the Stream-derived filter you still have to write.
        app.Response.Filter = new IndentingFilter(app.Response.Filter);
    }
}
Rather than waste time implementing a proper indenting solution which would affect all HTTP requests (thus adding CPU and bandwidth overhead), just suggest to your colleague that he use an HTML beautifier. That way the one person that cares about it is the one person that pays the cost of it.
This Firefox plugin is an HTML validator that also includes a beautification function. See the documentation here.