How can I extract HTML content efficiently with Perl?

I am writing a crawler in Perl which has to extract the contents of web pages that reside on the same server. I am currently using the HTML::Extract module to do the job, but I found the module a bit slow, so I looked into its source code and found that it does not enable any connection cache for LWP::UserAgent.
My last resort is to grab HTML::Extract's source code and modify it to use a cache, but I really want to avoid that if I can. Does anyone know any other module that can perform the same job better? I basically just need to grab all the text in the <body> element with the HTML tags removed.

I use pQuery for my web scraping. But I've also heard good things about Web::Scraper.
Both of these, along with other modules, have appeared in answers on SO to questions similar to yours:
How can I screen scrape with Perl?
How can I extract XML of a website and save it in a file using Perl's LWP?
How do I extract an HTML title with Perl?
Can you provide an example of parsing HTML with your favorite parser?
How do I extract content from an HTML file using Perl?

HTML::Extract's features look very basic and uninteresting. If the modules that draegfun mentioned don't interest you, you could do everything that HTML::Extract does using LWP::UserAgent and HTML::TreeBuilder yourself, without requiring much code at all, and then you would be free to work in caching on your own terms.
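A minimal sketch of that approach, with the URL as a placeholder (keep_alive is what gives LWP::UserAgent the connection cache that HTML::Extract never turns on):

use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;

# keep_alive => 10 installs an LWP::ConnCache, so repeated requests
# to the same server reuse the open connection.
my $ua = LWP::UserAgent->new( keep_alive => 10 );

sub body_text {
    my ($url) = @_;
    my $response = $ua->get($url);
    die $response->status_line unless $response->is_success;

    my $tree = HTML::TreeBuilder->new_from_content( $response->decoded_content );
    my $body = $tree->look_down( _tag => 'body' );
    my $text = $body ? $body->as_text : '';   # all <body> text, tags stripped
    $tree->delete;                            # free the parse tree
    return $text;
}

print body_text('http://example.com/page.html');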

I've been using Web::Scraper for my scraping needs. It's very nice indeed for extracting data, and because you can call ->scrape($html, $originating_uri), it's very easy to cache the result you need as well.
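For illustration, a sketch of what that looks like (the selector, the HTML, and the URI are invented for the example):

use strict;
use warnings;
use URI;
use Web::Scraper;

# Declare what to extract: here, the text of the <body> with tags removed.
my $extractor = scraper {
    process 'body', text => 'TEXT';
};

# You can hand scrape() HTML you have already fetched (and perhaps cached),
# together with the URI it came from.
my $html   = '<html><body><p>Hello, world.</p></body></html>';
my $result = $extractor->scrape( $html, URI->new('http://example.com/') );
print $result->{text}, "\n";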

Do you need to do this in real-time? How does the inefficiency affect you? Are you doing the task serially so that you have to extract one page before you move onto the next one? Why do you want to avoid a cache?
Can your crawler download the pages and pass them off to something else? Perhaps your crawler can even run in parallel, or in some distributed manner.
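On the parallel idea, here is a hedged sketch using Parallel::ForkManager; the URL list, the worker count, and what each child does with the page are all assumptions, not anything from the question:

use strict;
use warnings;
use LWP::UserAgent;
use Parallel::ForkManager;

my @urls = qw(http://example.com/a http://example.com/b http://example.com/c);

my $pm = Parallel::ForkManager->new(5);            # at most 5 workers at once
my $ua = LWP::UserAgent->new( keep_alive => 1 );

for my $url (@urls) {
    $pm->start and next;                           # parent moves on to the next URL
    my $response = $ua->get($url);
    # ... hand $response->decoded_content to the extractor or a cache here ...
    $pm->finish;                                   # child exits
}
$pm->wait_all_children;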

Related

Rails, HTML to JSON?

Given a static HTML page, is there an automated way to generate JSON?
For a large website that contains a lot of static HTML, I want to generate JSON for RSS feeds and search functionality, and I am looking for a way to convert HTML to JSON.
I could obviously write JSON templates for every page and every language, but that would be unmaintainable: it would double an 800-page website to 1,600 pages, and that is not an option.
One approach I thought of is to write a bot that loops through the routes to index the pages and saves the data to a database, which would give me all the choices I could wish for in search backends, such as Solr, Elasticsearch, Thinking Sphinx, etc.
I could use Capybara to help with this, visiting each path and extracting text to save to a database in a rake task run as a background job, but I'm not sure how that would work in a production environment. It also seems that such a common requirement might already have been solved, but for the life of me I can't find a solution.
I would be far happier (I think) if I could find a way to convert HTML text content to JSON.
Any ideas? Has this already been done? Are there any gems that might help? Or is there built-in functionality that I have not thought of, maybe a way to get HTML into a hash that could then be converted into JSON? Whatever the approach, it needs to be automated. I'm just stuck on the best approach.
Basically, HTML looks a lot like XML, but with fixed tag meanings, so you could use an XML-to-JSON conversion if the document parses into a tree of HTML tags nested inside each other.
And so your question becomes this question. Except you might get problems with void tags that have no closing tag, such as <br> or <img>; you might find all of those and put a closing bracket after each one before trying to read the document as a hash from XML. By the way, in general, for parsing text data you should look at regular expressions.
I chose to go with a Nokogiri solution in the end and wrote a parser to meet my needs.

Activating HTML with Haskell

I have a large pile of lecture notes in raw HTML format. I would like to add interactive content to these notes, in particular incorporating online exercises. I have some experience implementing online exercises as cgi-bin executables compiled from Haskell code running on the server, interacting with a student record file and sending suitable HTML back to the browser, using Text.Xhtml to generate the content. Now I plan to integrate the notes and the exercises.
The trouble is that I don't want to spend ages manually transforming my raw HTML into Haskell code to generate exactly the raw HTML I started with. Instead, I'd like to put my Haskell code and my HTML in the same source file, with placeholders in the latter for content generated by the former. A suitable tool should then transform this file into Haskell source code for (e.g.) a cgi-bin executable which generates the corresponding page.
Before I go hacking up such a piece of kit, I thought I'd ask if there's better technology out there already. The fixed points are the large legacy lump of HTML, the need to implement the assessment of the exercises in Haskell, and the need to interact with student records on the server. The handicap is that I need to use the departmental web server and I can't reconfigure it (ok, maybe I could ask nicely): that's one of the reasons I currently use cgi-bin executables, which are just fine on our server already, but I'm open to other possibilities.
My current plan is to write a (I mean adapt an existing) preprocessor to support a special syntax for defining functions of type
Html -> ... -> Html -> Html
that looks a lot like raw HTML with splice points. Then what I do with my existing raw HTML is indent it a bit and mark the holes.
But would that be a waste of time? Please, please tell me that this question is a duplicate!
There are Haskell frameworks like Yesod and Happstack which use templating engines like the one you describe.
Have you looked at the Haskell wiki at http://www.haskell.org/haskellwiki/HSP or
http://www.haskell.org/haskellwiki/Web/Libraries/Templating ?
They may do what you need.
You might find something to do the job here: Templating packages for Haskell.
And you should probably look into Snap, Yesod or Happstack for serving the content.
I have a large pile of lecture notes in raw HTML format. I would like to add interactive content to these notes, in particular incorporating online exercises.
There is already a system (called "ActiveHs"), written in Haskell, that allows you to put lecture notes and interactive exercises in one file.
See:
http://pnyf.inf.elte.hu/fp/UsersGuide_en.xml
http://pnyf.inf.elte.hu/fp/Constructive_en.xml
I can really say that it is very well-written code and completely open source!

How can I create a well-formatted PDF?

I'm working on automating our company invoicing system. Currently all data is stored in our local MySQL database; someone manually updates an Excel spreadsheet and then merges this data into an MS Word template. The goal is to automate this process so that the invoice can be generated from our intranet website as a PDF.
My original plan was to create a template in HTML/CSS and use wkhtmltopdf to generate the PDF, but I ran into problems getting a repeatable header and footer on each page. thead and tfoot aren't supported by WebKit, and the fix suggested in this other question does not seem to work either.
So I then stumbled on using XML and XSL-FO, the latter I know nothing about. Is this the best path to take? Are there any libraries or utilities out there that will make converting my HTML+CSS into XML+XSL-FO easier? Are there any other alternatives I'm overlooking?
EDIT
Currently the server is CentOS Linux with a MySQL database. All other code is currently in PHP, but that may change as the whole system is being revamped. Linux and MySQL will almost certainly remain, though.
For your requirement, XSL-FO might just do the trick. It is much cleaner to produce the PDFs directly from the data than to go down the cumbersome HTML path, unless you need to display the HTML as well; in that case you might consider converting from HTML to PDF, but it will always be messy.
You can get XML results from MySQL quite easily (mysql --xml), and then you write one (or several) XSL-FO stylesheets for the data. Then you can produce not only PDFs but also PostScript files or RTFs with some processors.
XSL-FO has its limitations, though, but for your situation it should suffice.
I admit the learning curve can be steep, and maintaining XSLT stylesheets can get very tiring, but as you get to know it better, you end up writing less code.
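As a hedged illustration of that pipeline in Perl (the file names are invented, and the final render step assumes Apache FOP is installed):

use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

# invoice.xml could come straight from `mysql --xml`, as noted above.
my $data  = XML::LibXML->load_xml( location => 'invoice.xml' );
my $style = XML::LibXML->load_xml( location => 'invoice-fo.xsl' );

my $stylesheet = XML::LibXSLT->new->parse_stylesheet($style);
my $fo         = $stylesheet->transform($data);

open my $out, '>', 'invoice.fo' or die $!;
print {$out} $stylesheet->output_as_bytes($fo);
close $out;

# An FO processor then renders the PDF, e.g.:  fop invoice.fo invoice.pdf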
Another possibility is to do the whole thing in, e.g., Java or C#: send SELECT statements, loop over the results, and iteratively build the PDF using a library like iText.
You could try JODReports or Docmosis as less code-intensive options. You supply Word or OpenOffice Writer documents to act as templates and use these engines to manipulate/populate the templates, then spit out the documents in the format(s) you require. This may mean your existing Word templates can be used directly, which should save you some effort/time.
iText is another library that will let you build and pump out PDFs from code. It's pretty good.
If you could use ASP.NET for the web part, you can use the free ReportViewer library and designer for automated publishing of PDFs.
Here are some references:
http://gotreportviewer.com
http://weblogs.asp.net/srkirkland/archive/2007/10/29/exporting-a-sql-server-reporting-services-2005-report-directly-to-pdf-or-excel.aspx
If you're OK using .NET and C#, you could use DotPdf from Atalasoft (obligatory disclaimer: I work for Atalasoft and wrote most of DotPdf). The Generating namespace is geared for exactly what you're trying to do: automate report generation. From the very basics, you could just create docs directly with the toolkit or you can create template documents that have unpopulated text fields that you can reload and fill later (see here and here for examples).

Simple HTML interface to XSD?

I'm writing an app that, at its heart, uses a hierarchical tree of nodes.
In XML, it looks like this:
<node>
  <name>Node1</name>
  <Attribute1>Something</Attribute1>
  <Attribute2>SomethingElse</Attribute2>
  <child>Node2</child>
  <child>Node4</child>
  <child>Node7</child>
</node>
And so on. (All child elements must refer to an existing node, though the node in question doesn't have to precede the first reference to it.)
For a simple structure like this, is there a simple tool to generate an HTML page that will allow a user to enter nodes and dynamically update a server-side XML file?
I'm basically writing a tool that will use such a file, but the people whose job it is to create the file aren't especially techno-literate, so creating the XML by hand is a no-no.
I could hand-crank one fairly quickly, but if I can get a tool to do it, even better (especially as the format may change in the future)...
Xopus is a browser based XML editor that you could use for this. It is designed for the non techno-literate people out there.
Disclaimer: I work at Xopus.
I am pretty sure there is nothing that will do that for you automagically and you'll need to write that bit yourself.
Your options are to create a web-based interface to do it, using HTML POST and writing the output to a file or database (then reloading it on submission), or something more advanced with JavaScript (e.g. something that could do it dynamically with AJAX).
You can't do it in HTML alone - either way you'd need something to handle outputting the existing data and accepting HTTP POST requests, but you don't mention what language or platform you are using to write this. Being clear on that will help people suggest appropriate solutions.
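To make the POST-handling route concrete, here is a minimal sketch assuming Perl CGI purely for illustration (the field name, the file path, and the idea that the nodes hang off a single root element are all invented):

#!/usr/bin/perl
use strict;
use warnings;
use CGI;
use XML::LibXML;

my $q    = CGI->new;
my $name = $q->param('name') // '';

# Load the existing server-side file, append a new <node>, and save it back.
my $doc   = XML::LibXML->load_xml( location => '/var/data/nodes.xml' );
my $node  = $doc->createElement('node');
my $label = $doc->createElement('name');
$label->appendText($name);
$node->appendChild($label);
$doc->documentElement->appendChild($node);
$doc->toFile('/var/data/nodes.xml');

print $q->header('text/plain'), "Saved node '$name'\n";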
You might want to rethink the XML structure... Elements called "attribute{anything}" are ill-advised (as are elements named in the convention foo1, foo2, etc.). The whole <child>Node2</child> thing doesn't seem like a good way to go either. I suggest posting an actual example of the XML in question.
From what you've said, it sounds like there is no specific need for it to be in XML at all. Not that XML is bad (it isn't), but if putting it in an SQL database is a valid option and you have one of those anyway (e.g. you're using a LAMP stack), then that's something to consider.
Would an XML editor like http://www.oxygenxml.com/ suffice? I don't know of any HTML web-based ones, unless you write one yourself and use AJAX to send the data. At least an XML editor can generate a form that you can use to create and edit XML documents. Microsoft does InfoPath as well, which is actually designed more for questionnaires but might do what you need, if the non-tech people would prefer something more Office-like.

best library to do web-scraping [closed]

I would like to get data from different web pages, such as addresses of restaurants or dates of different events for a given location, and so on. What is the best library I can use for extracting this data from a given set of sites?
If you're using Python, take a good look at Beautiful Soup (http://crummy.com/software/BeautifulSoup).
It's an extremely capable library that makes scraping a breeze.
For .NET programmers, the HTML Agility Pack is awesome. It parses web pages into documents that can be queried with XPath.
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// Select every anchor that carries an href attribute.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    att.Value = FixLink(att.Value); // FixLink is the caller's own helper
}
doc.Save("file.htm");
You can find it here: http://www.codeplex.com/htmlagilitypack
I think the general answer here is to use any language + http library + html/xpath parser. I find that using ruby + hpricot gives a nice clean solution:
require 'rubygems'
require 'hpricot'
require 'open-uri'

sites = %w(http://www.google.com http://www.stackoverflow.com)

sites.each do |site|
  doc = Hpricot(open(site))
  # iterate over each div in the document (or use xpath to grab whatever you want)
  (doc/"div").each do |div|
    # do something with divs here
  end
end
For more on Hpricot see http://code.whytheluckystiff.net/hpricot/
I personally like the WWW::Mechanize Perl module for these kinds of tasks. It gives you an object that is modeled after a typical web browser (i.e. you can follow links, fill out forms, or use the "back button" by calling methods on it).
For the extraction of the actual content, you could then hook it up to HTML::TreeBuilder to transform the website you're currently visiting into a tree of HTML::Element objects, and extract the data you want (the look_down() method of HTML::Element is especially useful).
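A short sketch of the two together; the URL, the link text, and the class being searched for are placeholders:

use strict;
use warnings;
use WWW::Mechanize;
use HTML::TreeBuilder;

my $mech = WWW::Mechanize->new;
$mech->get('http://example.com/restaurants');
$mech->follow_link( text_regex => qr/next page/i );    # browse like a user would

my $tree = HTML::TreeBuilder->new_from_content( $mech->content );

# look_down() returns every element matching the given attribute tests.
for my $addr ( $tree->look_down( _tag => 'div', class => 'address' ) ) {
    print $addr->as_text, "\n";
}
$tree->delete;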
I think Watir or Selenium are the best choices. Most of the other libraries mentioned are actually HTML parsers, and that is not what you want... You are scraping; if the owner of the website wanted you to get at his data, he'd put a dump of his database or site on a torrent and avoid all the HTTP requests and expensive traffic.
Basically, you need to parse HTML, but more importantly, automate a browser, to the point of being able to move the mouse and click, really mimicking a user. You need a screen-capture program to get at the CAPTCHAs and send them off to decaptcha.com (which solves them for a fraction of a cent) to circumvent that. Forget about saving the CAPTCHA file by parsing the HTML without rendering it in a browser 'as it is supposed to be seen'. You are screen scraping, not HTTP-request scraping.
Watir did the trick for me in combination with AutoItX (for moving the mouse and entering keys in fields; sometimes this is necessary to set off the right JavaScript events) and a simple screen-capture utility for the CAPTCHAs. This way you will be most successful; it's quite useless writing a great HTML parser only to find out that the owner of the site has turned some of the text into graphics. (Problematic? No, just get an OCR library and feed it the JPEG; the text will come back.) Besides, I have rarely seen them go that far, although on Chinese sites there is a lot of text in graphics.
XPath saved my day all the time; it's a great domain-specific language (IMHO, I could be wrong) and you can get to any tag in the page, although sometimes you need to tweak it.
What I did miss was 'reverse templates' (Selenium's robot framework has this). Perl has this in the CPAN module Template::Extract, which is very handy.
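For the curious, a sketch of Template::Extract running a template "backwards" over a document (the template and the input HTML are invented for the example):

use strict;
use warnings;
use Template::Extract;

my $template = <<'END';
<ul>
[% FOREACH record %]<li><a href="[% url %]">[% name %]</a></li>
[% END %]</ul>
END

my $html = <<'END';
<ul>
<li><a href="/zoo">The Zoo</a></li>
<li><a href="/cafe">The Cafe</a></li>
</ul>
END

# extract() fills in the [% ... %] slots by matching the template
# against the document, returning the captured fields.
my $data = Template::Extract->new->extract( $template, $html );
print "$_->{name} => $_->{url}\n" for @{ $data->{record} };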
The HTML parsing, or the creation of the DOM, I would leave to the browser; yes, it won't be as fast, but it'll work all the time.
Also, libraries that pretend to be user agents are useless; sites are protected against scraping nowadays, and rendering the site on a real screen is often necessary to get beyond the CAPTCHAs, and also to trigger the JavaScript events that need to fire for information to appear, etc.
Watir if you're into Ruby, Selenium for the rest, I'd say. The 'Human Emulator' (or 'Web Emulator' in Russia) is really made for this kind of scraping, but then again it's a Russian product from a company that makes no secret of its intentions.
I also think that one of these weeks Wiley has a new book coming out on scraping; that should be interesting. Good luck...
I personally find http://github.com/shuber/curl/tree/master and http://simplehtmldom.sourceforge.net/ awesome for use in my PHP spidering/scraping projects.
The Perl WWW::Mechanize library is excellent for doing the donkey work of interacting with a website to get to the actual page you need.
I would use LWP (Libwww for Perl). Here's a good little guide: http://www.perl.com/pub/a/2002/08/20/perlandlwp.html
WWW::Scraper has docs here: http://cpan.uwinnipeg.ca/htdocs/Scraper/WWW/Scraper.html
It can be useful as a base; you'd probably want to create your own module that fits your restaurant-mining needs.
LWP would give you a basic crawler for you to build on.
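A sketch of that kind of LWP base, with HTML::LinkExtor (from the same HTML-Parser distribution) doing the link discovery; the seed URL is a placeholder, and a real crawler would also restrict itself to its own host:

use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;

my $ua   = LWP::UserAgent->new;
my @todo = ('http://example.com/');
my %seen;

while ( my $url = shift @todo ) {
    next if $seen{$url}++;
    my $response = $ua->get($url);
    next unless $response->is_success;

    # Extract the data you care about from $response->decoded_content here,
    # then queue every <a href="..."> for the next round.
    my $extor = HTML::LinkExtor->new( undef, $response->base );
    $extor->parse( $response->decoded_content );
    for my $link ( $extor->links ) {
        my ( $tag, %attrs ) = @$link;
        push @todo, $attrs{href} if $tag eq 'a' && $attrs{href};
    }
}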
There have been a number of answers recommending Perl Mechanize, but I think that Ruby Mechanize (very similar to Perl's version) is even better. It handles some things like forms in a much cleaner way syntactically. Also, there are a few frontends which run on top of Ruby Mechanize which make things even easier.
What language do you want to use?
curl with awk might be all you need.
You can use tidy to convert it to XHTML, and then use whatever XML processing facilities your language of choice has available.
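In Perl, for instance, that two-step approach might look like the following; the tidy flags and the file name are assumptions:

use strict;
use warnings;
use XML::LibXML;

# Step 1: let the command-line tidy tool turn tag soup into XHTML.
my $xhtml = qx(tidy -quiet -asxhtml --numeric-entities yes page.html);

# Step 2: use ordinary XML tooling on the result. local-name() sidesteps
# the XHTML namespace that tidy adds to the root element.
my $doc = XML::LibXML->load_xml( string => $xhtml );
for my $href ( $doc->findnodes('//*[local-name()="a"]/@href') ) {
    print $href->value, "\n";
}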
I'd recommend BeautifulSoup. It isn't the fastest, but it performs really well on the not-well-formed (X)HTML pages that most parsers choke on.
What someone said: use ANY LANGUAGE.
As long as you have a good parser library and an HTTP library, you are set.
The tree stuff is slower than just using a good parsing library.