HTML downloading and text extraction

What would be a good tool, or set of tools, to download a list of URLs and extract only the text content?
Spidering is not required, but control over the downloaded file names and threading would be a bonus.
The platform is Linux.

wget -O - URL | html2ascii
Note: html2ascii can also be called html2a or html2text (and I wasn't able to find a proper man page on the net for it).
See also: lynx.

Python Beautiful Soup allows you to build a nice extractor.
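For instance, here is a minimal sketch of that approach, assuming the requests and beautifulsoup4 packages and a urls.txt file with one URL per line; the output naming scheme is just a placeholder:
import requests
from bs4 import BeautifulSoup

with open("urls.txt") as url_list:
    urls = [line.strip() for line in url_list if line.strip()]

for i, url in enumerate(urls):
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):   # drop non-content elements
        tag.decompose()
    text = soup.get_text(separator="\n", strip=True)
    with open("page_%03d.txt" % i, "w", encoding="utf-8") as out:
        out.write(text)
Threading could be layered on top with concurrent.futures if the URL list is long.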

I know that w3m can be used to render an HTML document and put the text content in a text file:
w3m -dump www.google.com > file.txt, for example.
For the remainder, I'm sure that wget can be used.

Look for the Simple HTML DOM parser for PHP on SourceForge. Use it to parse HTML that you have downloaded with cURL. Each DOM element will have a "plaintext" attribute which should give you only the text. I have used this combination successfully in a lot of applications for quite some time.

Perl (Practical Extraction and Report Language) is a scripting language that is excellent for this type of work. http://search.cpan.org/ contains a lot of modules that have the required functionality.

Use wget to download the required HTML and then run html2text on the output files.

Related

xgettext to generate po file from html files

This is something I have been trying very hard to get working. I have tried a bunch of options, including the one described in Extracting gettext strings from Javascript and HTML files (templates). No go.
This is the sample HTML:
<h1 data-bind="text: _loc('translate this')"></h1>
and the command I have tried (with both the PHP and Glade language options):
xgettext -LPHP --force-po -o E:\Samples\poEdit\translated.po --from-code=utf-8 -k_loc E:\Samples\poEdit\html\samplePO.html
Glade seems to look only inside tags and completely skips the keyword. Has anyone solved this problem?
We eventually ended up writing a small .NET application to parse the HTML and create a JSON representation, then used Python with xgettext to create the .po file from the JavaScript.
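For anyone taking the same route, here is a minimal sketch of that idea in Python (not the original .NET tool): it pulls the _loc('...') strings out of the HTML templates and rewrites them as plain JavaScript calls that xgettext can read. The file names and the regex are assumptions:
import re
import sys
from pathlib import Path

# Matches _loc('...') or _loc("...") wherever it appears in the markup.
LOC_CALL = re.compile(r"""_loc\(\s*(['"])(.+?)\1\s*\)""")

def extract_to_js(html_paths, out_path="extracted.js"):
    calls = []
    for path in html_paths:
        text = Path(path).read_text(encoding="utf-8")
        for match in LOC_CALL.finditer(text):
            calls.append('_loc("%s");' % match.group(2).replace('"', '\\"'))
    Path(out_path).write_text("\n".join(calls) + "\n", encoding="utf-8")

if __name__ == "__main__":
    extract_to_js(sys.argv[1:])
    # Then, for example:
    #   xgettext -LJavaScript --force-po -k_loc -o translated.po extracted.js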

How do I use the Perl Text-MediawikiFormat to convert mediawiki to xhtml?

On an Ubuntu platform, I installed the nice little Perl module
libtext-mediawikiformat-perl - Convert Mediawiki markup into other text formats
which is available on CPAN. I'm not familiar with Perl and have no idea how to go about using this library to write a Perl script that would convert a MediaWiki file to an HTML file. For example, I'd like to just have a script I can run, such as
./my_convert_script input.wiki > output.html
(perhaps also specifying the base URL, etc.), but I have no idea where to start. Any suggestions?
I believe @amon is correct that the Perl library I referenced in the question is not the right tool for the task I proposed.
I ended up using the MediaWiki API with action=parse to convert to HTML using the MediaWiki engine, which turned out to be much more reliable than any of the alternative parsers proposed on the list that I tried. (I then used pandoc to convert my HTML to markdown.) The MediaWiki API handles extraction of categories and other metadata too, and I just had to append the base URL to internal image and page links.
Given the page title and base URL, I ended up writing this as an R function:
wiki_parse <- function(page, baseurl, format="json", ...){
  require(httr)
  action <- "parse"
  # Build the API call: <baseurl>/api.php?format=json&action=parse&page=<page>
  addr <- paste(baseurl, "/api.php?format=", format, "&action=", action, "&page=", page, sep="")
  # Identify the client to the wiki and pass through any extra request options
  config <- c(add_headers("User-Agent" = "rwiki"), ...)
  out <- GET(addr, config=config)
  parsed_content(out)
}
The Perl library Text::MediawikiFormat isn't really intended for stand-alone use but rather as a formatting engine inside a larger application.
The documentation on CPAN does show how to use this library, and it notes that other modules might provide better support for one-off conversions.
You could try this (untested) one-liner:
perl -MText::MediawikiFormat -e'$/=undef; print Text::MediawikiFormat::format(<>)' input.wiki >output.html
although that defies the whole point (and customization abilities) of this module.
I am sure that someone has already come up with a better way to convert single MediaWiki files, so here is a list of alternative MediaWiki processors on the MediaWiki site. This SO question could also be of help.
Other markup languages, such as Markdown, provide better support for single-file conversions. Markdown is especially well suited for technical documents and mirrors email conventions. (Also, it is used on this site.)
The libfoo-bar-perl packages in the Ubuntu repositories are precompiled Perl modules. Usually, these would be installed via cpan or cpanm. While some of these libraries do include scripts, most don't, and aren't meant as stand-alone applications.

download links from a web page with renaming

I'm trying to find a way to automatically download all links from a web page, but I also want to rename them. For example:
<a href="fileName.txt">Name I want to have</a>
I want to end up with a file named 'Name I want to have' (I'm not worried about the extension).
I am aware that I could get the page source, then parse all the links, and download them all manually, but I'm wondering if there are any built-in tools for that.
lynx --dump <page-url> | grep http:// | cut -d ' ' -f 4
will print all the links, which can then be batch-fetched with wget - but is there a way to rename the downloaded files on the fly?
I doubt anything does this out of the box. I suggest you write a script in Python or similar to download the page and parse the source (try the Beautiful Soup library for tolerant parsing). Then it's a simple matter of traversing the parsed document to capture the links with their attributes and text, and downloading the files under the names you want. Apart from Beautiful Soup (which you only need if you have to parse sloppy HTML), everything required is built into Python.
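A minimal sketch of that approach, assuming the requests and beautifulsoup4 packages are installed and that the anchor text is safe to use directly as a file name (the URL below is just a placeholder):
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def download_renamed(page_url):
    page = requests.get(page_url)
    soup = BeautifulSoup(page.text, "html.parser")
    for a in soup.find_all("a", href=True):
        name = a.get_text(strip=True) or a["href"]   # fall back to the href if the link has no text
        file_url = urljoin(page_url, a["href"])
        with open(name, "wb") as out:                # no extension handling, as in the question
            out.write(requests.get(file_url).content)

download_renamed("http://example.com/downloads.html")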
I solved the problem by converting the web page entirely to Unicode on the first pass (using Notepad++'s built-in conversion).
Then I wrote a small shell script that used cat, awk and wget to fetch all the data.
Unfortunately, I couldn't fully automate the process, since I didn't find any tool for Linux that would convert an entire page from KOI8-R to Unicode.

Someway of removing internal links from Wikipedia XML files?

If I have downloaded Wikipedia XML dumps, is there any way of removing all of the internal links from within an XML file?
Thanks
One thing you could do, if you are importing them into a local wiki, is to import all the files you want and then use a bot (e.g. pywikipediabot is easy to use) to get rid of all the internal links.
Wikipedia database dumps and information about using them are located here: Wikipedia:Database download. You should do this instead of writing a script to scrape Wikipedia.
I would try to use XSLT to transform the XML file into another XML file.
You could do a search and replace in your favorite text editor, replacing [[ and ]] with nothing.
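A small Python sketch of that idea, under the assumption that you want to keep the visible label of each link rather than delete the markup wholesale (piped links like [[Target|label]] keep only the label):
import re

# [[Target]] -> Target, [[Target|label]] -> label; assumes links are not nested.
INTERNAL_LINK = re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]")

def strip_internal_links(text):
    return INTERNAL_LINK.sub(r"\1", text)

print(strip_internal_links("See [[Python (programming language)|Python]] and [[Linux]]."))
# -> "See Python and Linux."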

Generate single-file HTML code documentation

How can I use Doxygen to create the HTML documentation as a single, very long file? I want something like the RTF output, but as HTML.
The reason: I need my API published as a single, printable, document. Something that can be loaded into Word, converted to PDF, etc.
I think you can use HTMLDOC to convert the generated HTML files into a single HTML file (I have not tried it myself).
The manual includes the following example, which generates a single HTML file from two source HTML files:
htmldoc --book -f output.html file1.html file2.html
There is also a GUI.
I don't think there's an option that will produce the output as a single HTML file, but the RTF output may be suitable if you need an editable format (I haven't tried this myself, so I don't know how well it works).
If you want good quality printable output, then Doxygen can output LaTeX format (set GENERATE_LATEX to YES in your doxygen configuration file). This can then be converted to PDF, although you'll need to install a LaTeX distribution such as MiKTeX.