Parse HTML From URL - windows-store-apps

I need to download an HTML file and parse it to extract certain tags. Short of building my own parser, is there a way to do this in C++/CX?
I tried using XmlDocument, but obviously ran into problems thanks to XML having much stricter rules than HTML.
I also tried several libraries which I found online, all of which failed due to differences between C++/CLI and C++/CX.
I would need one of the following:
A built-in class or set of methods to parse HTML
A library which works in C++/CX
Another method of extracting tags and attributes
I realize that roll-your-own would be the fallback, but building an HTML parser can be a daunting task, to say the least.
Any and all ideas would be appreciated.

Related

Parsing HTML with OCaml

I'm looking for a library to parse HTML files in OCaml.
Basically the equivalent of Jsoup/Beautiful Soup.
The main requirement is being able to query the DOM with CSS selectors.
Something in the form of
page.fetch("http://www.url.com")
page.find("#tag")
I had a need for something like this recently, so after seeing this question and reading the recommendations in the comments, I wrote a library "Lambda Soup" over the weekend for fun.
You will want to use a library like ocurl or Cohttp to retrieve the actual HTML. After you have it, you can do
html |> parse $ "#tag"
to do what is asked in the question. For other possibilities and the full signature, see the documentation. You may want to look at the documentation postprocessor or tests for a fairly thorough demonstration of usage and capabilities, including CSS support and extensions.
Per comments: Lambda Soup originally used Ocamlnet's HTML parser, but it now uses Markup.ml. Otherwise, it has no dependencies, except OUnit if you wish to run the tests. I'm happy for any feedback, including about modifying the interface (it is at an early stage) or discussions of adding an HTTP downloader to the library (which seems iffy because it greatly alters the scope of the library as it now is, but I am happy to hear arguments).
The license is BSD.

Rails, HTML to JSON?

Given a static HTML page, is there an automated way to generate JSON?
For a large website that contains a lot of static HTML, I want to generate JSON for RSS feeds and search functionality, and I am looking for a way to convert HTML to JSON.
I could obviously write JSON templates for every page and every language, but that would be unmaintainable. It would double an 800-page website to 1600 pages, and that is not an option.
One approach I thought of would be to write a bot that loops through the routes to index the pages and saves the data to a database. That would give me all the choices I could wish for when it comes to searching, such as Solr, Elasticsearch, Thinking Sphinx, etc.
I could use Capybara to help with this by visiting each path and extracting the text to save to a database, in a rake task run as a background job. I am not sure how that would work in a production environment, though, and it seems that such a common requirement might already have been solved, but for the life of me I can't find a solution.
I would be far happier (I think) if I could find a way to convert HTML text content to JSON.
Any ideas? Has this already been done? Are there any gems that might help? Or is there built-in functionality that I have not thought of, maybe a way to get HTML into a hash that could then be converted into JSON? Whatever the approach, it needs to be automated. I'm just stuck for the best approach.
Basically, HTML looks a lot like XML, but with strong tag meanings, so you could use an XML-to-JSON conversion, since it all ends up as a tree of HTML tags nested inside each other.
And so your question becomes this question. Except you might run into problems with single tags that have no closing tag. So you could find all of these and close each one before trying to load the document as a hash from XML. Oh, early answer. Btw, in general for parsing text data you should look at regular expressions.
I chose to go with a Nokogiri solution in the end and wrote a parser to meet my needs.
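The "HTML into a hash, then into JSON" idea from this thread can be sketched without going through XML at all. The following is only an illustration in Python's standard library, not the Nokogiri parser mentioned above; the names `TreeBuilder`, `html_to_json` and `VOID_TAGS` are made up for this example, and the output shape (tag / attrs / children) is an assumption.

```python
import json
from html.parser import HTMLParser

# Void elements never take a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TreeBuilder(HTMLParser):
    """Builds a nested dict/list tree from HTML (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "attrs": {}, "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1]["children"].append(text)


def html_to_json(html):
    """Return the parsed tree of `html` serialized as a JSON string."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return json.dumps(builder.root, indent=2)
```

Calling `html_to_json(page_source)` on each rendered page would give one JSON document per page. Handling void tags such as `<br>` explicitly is what avoids the "single tags without a closing one" problem raised in the answer.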

Parse HTML to XML

I am trying to figure out how to parse HTML to XML, but I cannot figure it out. I want to use the MSXML2.ServerXMLHTTP object (in an .asp file).
<%
url = "http://www.website.com/file.asp"
set xmlhttp = CreateObject("MSXML2.ServerXMLHTTP")
xmlhttp.open "POST", url, false
xmlhttp.send
Response.write xmlhttp.responseText
set xmlhttp = nothing
%>
This gives me the text, but I really don't know where to go from here.
I think the problem is in the HEAD of the HTML file.
From MSDN: the response should be XML ("text/xml"), but your http://www.website.com/file.asp returns HTML content with the "text/html" MIME type.
Native XML Extensions
I prefer using one of the native XML extensions since they come bundled with PHP, are usually faster than all the 3rd party libs and give me all the control I need over the markup.
DOM
The DOM extension allows you to operate on XML documents through the DOM API with PHP 5. It is an implementation of the W3C's Document Object Model Core Level 3, a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents.
DOM is capable of parsing and modifying real world (broken) HTML and it can do XPath queries. It is based on libxml.
It takes some time to get productive with DOM, but that time is well worth it IMO. Since DOM is a language-agnostic interface, you'll find implementations in many languages, so if you need to change your programming language, chances are you will already know how to use that language's DOM API then.
A basic usage example can be found in grabbing the href attribute of an A element and a general conceptual overview can be found at DOMDocument in PHP.
How to use the DOM extension has been covered extensively on StackOverflow, so if you choose to use it, you can be sure most of the issues you run into can be solved by searching/browsing StackOverflow.
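For readers who want to see the "grab the href attribute of an A element" task in runnable form, here is a rough equivalent. It is written in Python's standard library rather than PHP's DOM extension, so it is not the DOMDocument API described above; `LinkCollector` and `extract_hrefs` are names invented for this sketch.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> start tags, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html):
    """Return all href values found on anchor elements in `html`."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs
```

Like libxml's HTML parser, this does not require the input to be well-formed: unclosed elements, uppercase tag names and unquoted attribute values are all accepted.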
XMLReader
The XMLReader extension is an XML pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way.
XMLReader, like DOM, is based on libxml. I am not aware of how to trigger the HTML Parser Module, so chances are using XMLReader for parsing broken HTML might be less robust than using DOM where you can explicitly tell it to use libxml's HTML parser module.
A basic usage example can be found at getting all values from h1 tags using PHP.
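The forward-only cursor idea can be illustrated with a small streaming sketch. This uses Python's `xml.etree.ElementTree.iterparse` as a stand-in for XMLReader, so it is an analogy, not the PHP API; the function name `h1_values` is hypothetical, and the input is assumed to be well-formed XML without a namespace on the `h1` elements.

```python
import io
import xml.etree.ElementTree as ET


def h1_values(xml_text):
    """Stream through the document and collect the text of every <h1>."""
    values = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    # iterparse yields each element as soon as its end tag has been read,
    # so the whole document never has to be held as one tree by the caller.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "h1":
            values.append("".join(elem.itertext()).strip())
            elem.clear()  # release the element's children once handled
    return values
```

As with XMLReader, broken HTML will raise a parse error here, which is the robustness trade-off mentioned above.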
XML Parser
This extension lets you create XML parsers and then define handlers for different XML events. Each XML parser also has a few parameters you can adjust.
The XML Parser library is also based on libxml, and implements a SAX style XML push parser. It may be a better choice for memory management than DOM or SimpleXML, but will be more difficult to work with than the pull parser implemented by XMLReader.
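To make the push-parser model concrete: the parser drives the loop and calls your handlers for each event, so any state you need has to be tracked by hand. The sketch below uses Python's `xml.sax` module purely as an illustration of the SAX style, not the PHP XML Parser functions; `TitleHandler` and `collect_titles` are invented names, and the `title` element is an arbitrary choice.

```python
import xml.sax


class TitleHandler(xml.sax.ContentHandler):
    """Collects the text of every <title> element via SAX callbacks."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self.in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self.in_title = False


def collect_titles(xml_text):
    """Parse `xml_text` with SAX and return the text of all <title> elements."""
    handler = TitleHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.titles
```

The `in_title` flag and the buffer are exactly the kind of manual bookkeeping that makes a push parser harder to work with than a pull parser or a DOM.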
SimpleXml
The SimpleXML extension provides a very simple and easily usable toolset to convert XML to an object that can be processed with normal property selectors and array iterators.
SimpleXML is an option when you know the HTML is valid XHTML. If you need to parse broken HTML, don't even consider SimpleXml because it will choke.
A basic usage example can be found at A simple program to CRUD node and node values of xml file, and there are lots of additional examples in the PHP manual.
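The "fine for valid XHTML, chokes on broken HTML" behaviour is typical of any strict XML parser. As a hedged illustration (Python's `xml.etree.ElementTree`, not SimpleXML itself; `paragraph_texts` is a made-up helper):

```python
import xml.etree.ElementTree as ET


def paragraph_texts(markup):
    """Return the text of each <p>, or None if the markup is not well-formed XML."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        # A strict XML parser gives up on unclosed or mismatched tags.
        return None
    return [p.text for p in root.iter("p")]
```

Well-formed input such as `<div><p>one</p><p>two</p></div>` parses cleanly, while everyday HTML such as `<div><p>one<br><p>two</div>` is rejected outright because `<br>` and `<p>` are never closed.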
3rd Party Libraries (libxml based)
If you prefer to use a 3rd-party lib, I'd suggest using a lib that actually uses DOM/libxml underneath instead of string parsing.
FluentDom - Repo
FluentDOM provides a jQuery-like fluent XML interface for the DOMDocument in PHP. Selectors are written in XPath or CSS (using a CSS to XPath converter). Current versions extend the DOM implementing standard interfaces and add features from the DOM Living Standard. FluentDOM can load formats like JSON, CSV, JsonML, RabbitFish and others. Can be installed via Composer.
HtmlPageDom
Wa72\HtmlPageDom is a PHP library for easy manipulation of HTML documents using DOM. It requires DomCrawler from Symfony2 components for traversing the DOM tree and extends it by adding methods for manipulating the DOM tree of HTML documents.
phpQuery (not updated for years)
phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on jQuery JavaScript Library written in PHP5 and provides additional Command Line Interface (CLI).
Also see: https://github.com/electrolinux/phpquery
Zend_Dom
Zend_Dom provides tools for working with DOM documents and structures. Currently, we offer Zend_Dom_Query, which provides a unified interface for querying DOM documents utilizing both XPath and CSS selectors.
QueryPath
QueryPath is a PHP library for manipulating XML and HTML. It is designed to work not only with local files, but also with web services and database resources. It implements much of the jQuery interface (including CSS-style selectors), but it is heavily tuned for server-side use. Can be installed via Composer.
fDOMDocument
fDOMDocument extends the standard DOM to use exceptions at all occasions of errors instead of PHP warnings or notices. They also add various custom methods and shortcuts for convenience and to simplify the usage of DOM.
sabre/xml
sabre/xml is a library that wraps and extends the XMLReader and XMLWriter classes to create a simple "XML to object/array" mapping system and design pattern. Writing and reading XML is single-pass and can therefore be fast and require low memory on large XML files.
FluidXML
FluidXML is a PHP library for manipulating XML with a concise and fluent API. It leverages XPath and the fluent programming pattern to be fun and effective.
3rd-Party (not libxml-based)
The benefit of building upon DOM/libxml is that you get good performance out of the box because you are based on a native extension. However, not all 3rd-party libs go down this route. Some of them are listed below.
PHP Simple HTML DOM Parser
An HTML DOM parser written in PHP5+ lets you manipulate HTML in a very easy way!
Require PHP 5+.
Supports invalid HTML.
Find tags on an HTML page with selectors just like jQuery.
Extract contents from HTML in a single line.
I generally do not recommend this parser. The codebase is horrible and the parser itself is rather slow and memory hungry. Not all jQuery Selectors (such as child selectors) are possible. Any of the libxml based libraries should outperform this easily.
PHP Html Parser
PHPHtmlParser is a simple, flexible HTML parser which allows you to select tags using any CSS selector, like jQuery. The goal is to assist in the development of tools which require a quick, easy way to scrape HTML, whether it's valid or not! This project was originally supported by sunra/php-simple-html-dom-parser, but the support seems to have stopped, so this project is my adaptation of his previous work.
Again, I would not recommend this parser. It is rather slow with high CPU usage. There is also no function to clear memory of created DOM objects. These problems scale particularly with nested loops. The documentation itself is inaccurate and misspelled, with no responses to fixes since 14 Apr 16.
Ganon
A universal tokenizer and HTML/XML/RSS DOM parser
Ability to manipulate elements and their attributes
Supports invalid HTML and UTF8
Can perform advanced CSS3-like queries on elements (like jQuery -- namespaces supported)
A HTML beautifier (like HTML Tidy)
Minify CSS and Javascript
Sort attributes, change character case, correct indentation, etc.
Extensible
Parsing documents using callbacks based on current character/token
Operations separated in smaller functions for easy overriding
Fast and easy
Never used it. Can't tell if it's any good.
HTML 5
You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows. So for HTML5 you may want to consider using a dedicated parser, like:
html5lib
Python and PHP implementations of an HTML parser based on the WHATWG HTML5 specification for maximum compatibility with major desktop web browsers.
We might see more dedicated parsers once HTML5 is finalized. There is also a blog post by the W3C titled How-To for html 5 parsing that is worth checking out.
WebServices
If you don't feel like programming PHP, you can also use Web services. In general, I found very little utility for these, but that's just me and my use cases.
ScraperWiki
ScraperWiki's external interface allows you to extract data in the form you want for use on the web or in your own applications. You can also extract information about the state of any scraper.
Regular Expressions
Last and least recommended, you can extract data from HTML with regular expressions. In general using Regular Expressions on HTML is discouraged.
Most of the snippets you will find on the web to match markup are brittle. In most cases they only work for a very particular piece of HTML. Tiny markup changes, like adding whitespace somewhere, or adding or changing attributes in a tag, can make the RegEx fail when it's not properly written. You should know what you are doing before using RegEx on HTML.
HTML parsers already know the syntactical rules of HTML. Regular expressions have to be taught for each new RegEx you write. RegEx are fine in some cases, but it really depends on your use-case.
You can write more reliable parsers, but writing a complete and reliable custom parser with regular expressions is a waste of time when the aforementioned libraries already exist and do a much better job on this.
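The brittleness is easy to demonstrate. The following sketch, in Python rather than PHP, puts a typical copy-pasted pattern next to a real tokenizer; the pattern `NAIVE_IMG_SRC` and the helper names are invented for illustration and are not taken from any particular snippet.

```python
import re
from html.parser import HTMLParser

# A typical copy-pasted pattern: assumes src comes first and is double-quoted.
NAIVE_IMG_SRC = re.compile(r'<img src="([^"]+)"')


class ImgSrcParser(HTMLParser):
    """Collects src attributes from <img> tags using a real HTML tokenizer."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def sources_by_regex(html):
    return NAIVE_IMG_SRC.findall(html)


def sources_by_parser(html):
    parser = ImgSrcParser()
    parser.feed(html)
    parser.close()
    return parser.sources
```

Both functions agree on `<img src="a.png" alt="A">`, but reordering the attributes and switching to single quotes (`<img alt='A'  src='a.png' >`) leaves the regex with no match, while the parser still finds `a.png`.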
Also see Parsing Html The Cthulhu Way
Books
If you want to spend some money, have a look at
PHP Architect's Guide to Webscraping with PHP
I am not affiliated with PHP Architect or the authors.

How can I extract HTML content efficiently with Perl?

I am writing a crawler in Perl, which has to extract contents of web pages that reside on the same server. I am currently using the HTML::Extract module to do the job, but I found the module a bit slow, so I looked into its source code and found out it does not use any connection cache for LWP::UserAgent.
My last resort is to grab HTML::Extract's source code and modify it to use a cache, but I really want to avoid that if I can. Does anyone know any other module that can perform the same job better? I basically just need to grab all the text in the <body> element with the HTML tags removed.
I use pQuery for my web scraping. But I've also heard good things about Web::Scraper.
Both of these along with other modules have appeared in answers on SO for similar questions to yours:
how can i screen scrape with perl
how can i extract xml of a website and save in a file using perls lwp
how do i extract an html title with perl
can you provide an example of parsing html with your favorite parser
how do I extract content from html file using perl
HTML::Extract's features look very basic and uninteresting. If the modules that draegfun mentioned don't interest you, you could do everything that HTML::Extract does using LWP::UserAgent and HTML::TreeBuilder yourself, without requiring very much code at all, and then you would be free to work in caching on your own terms.
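To show how little code the "do it yourself" route needs, here is a sketch of the extraction half: all text inside `<body>` with the tags removed. It is written in Python's standard library, not with LWP::UserAgent or HTML::TreeBuilder, and it assumes the page has already been fetched by whatever caching client you choose. `BodyText` and `body_text` are hypothetical names, and skipping `script` and `style` content is an assumption about what "text" should mean.

```python
from html.parser import HTMLParser


class BodyText(HTMLParser):
    """Accumulates the visible text found inside the <body> element."""

    SKIP = {"script", "style"}  # assumed to be unwanted in the output

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip_depth:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self.parts.append(text)


def body_text(html):
    """Return the text content of <body> with tags removed."""
    parser = BodyText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Because fetching and extraction are separate here, the connection cache (or any other caching) can live entirely in the download step.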
I've been using Web::Scraper for my scraping needs. It's very nice indeed for extracting data, and because you can call ->scrape($html, $originating_uri), it's very easy to cache the result you need as well.
Do you need to do this in real-time? How does the inefficiency affect you? Are you doing the task serially so that you have to extract one page before you move onto the next one? Why do you want to avoid a cache?
Can your crawler download the pages and pass them off to something else? Perhaps your crawler can even run in parallel, or in some distributed manner.

Handling properties in Scala

I'd like to know the most efficient way of handling properties in Scala. I'm tired of having a gazillion property files, XML files and other types of configuration files in Java, and I wonder if there's a "best practice" for handling those more efficiently in Scala.
Why would you have a gazillion property files?
I'm still using the Apache commons Digester, which works perfectly well in Scala. It's basically a very easy way of making a user-defined XML document map to method calls on a user-defined configurator class. I find it extremely useful when I want to parse some configuration data (as opposed to application properties).
For application properties, you might either use a dependency injection framework (like Spring) or just plain old property files. I'd also be interested to see if Scala offers anything on top of this, though.
EDIT: Typesafe config gives you a simple and powerful solution for configuration - https://github.com/typesafehub/config
ORIGINAL (possibly not very useful):
Quoting from "Programming in Scala":
"In Scala, you can configure via Scala code itself."
Scala's runtime linking allows for classes to be swapped at runtime and the general philosophy of these languages tends to favour convention over configuration. If you don't want to deal with gazillion property files, just don't have them.
Check out Configgy, which looks like a neat little library. It includes nesting and change-notification. It also includes a logging library.
Unfortunately, it didn't compile for me on the Mac instances I tried. Let us know if you have better luck and what you think...
Update: solved Mac compilation problems. See this post.