How do I use the Perl Text::MediawikiFormat to convert MediaWiki to XHTML?

On an Ubuntu platform, I installed the Perl package
libtext-mediawikiformat-perl - Convert Mediawiki markup into other text formats
which packages the Text::MediawikiFormat module from CPAN. I'm not familiar with Perl and have no idea how to go about using this library to write a Perl script that would convert a MediaWiki file to an HTML file. For example, I'd like to just have a script I can run such as
./my_convert_script input.wiki > output.html
(perhaps also specifying the base URL, etc.), but I have no idea where to start. Any suggestions?

I believe @amon is correct that the Perl library I referenced in the question is not the right tool for the task I proposed.
I ended up using the MediaWiki API with action=parse to convert to HTML using the MediaWiki engine, which turned out to be much more reliable than any of the alternative parsers I tried from the list mentioned below. (I then used pandoc to convert my HTML to Markdown.) The MediaWiki API handles extraction of categories and other metadata too, and I just had to append the base URL to internal image and page links.
Given the page title and base URL, I ended up writing this as an R function.
wiki_parse <- function(page, baseurl, format = "json", ...) {
  require(httr)
  action <- "parse"
  addr <- paste(baseurl, "/api.php?format=", format, "&action=", action,
                "&page=", page, sep = "")
  config <- c(add_headers("User-Agent" = "rwiki"), ...)
  out <- GET(addr, config = config)
  parsed_content(out)  # older httr; newer versions spell this content(out, as = "parsed")
}
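For example (the base URL and page title here are illustrative), the rendered HTML sits under parse$text in the JSON response:
resp <- wiki_parse("Main_Page", "https://www.mediawiki.org/w")
html <- resp$parse$text[["*"]]   # the page body as a single HTML string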

The Perl library Text::MediawikiFormat isn't really intended for stand-alone use but rather as a formatting engine inside a larger application.
The documentation on CPAN does show how to use this library, and it notes that other modules might provide better support for one-off conversions.
You could try this (untested) one-liner
perl -MText::MediawikiFormat -e'$/=undef; print Text::MediawikiFormat::format(<>)' input.wiki >output.html
although that defeats the whole point (and the customization abilities) of this module.
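If you do want a small wrapper script along the lines the question asks for, here is an untested sketch of the same call; the options hash and the prefix option for the base URL are assumptions taken from the Text::WikiFormat family of modules, so verify them against the Text::MediawikiFormat documentation on CPAN:
#!/usr/bin/perl
# my_convert_script: untested sketch; run as ./my_convert_script input.wiki > output.html
use strict;
use warnings;
use Text::MediawikiFormat;

$/ = undef;                       # slurp mode: read the whole input at once
my $wiki = <>;                    # file named on the command line, or STDIN
print Text::MediawikiFormat::format(
    $wiki,
    {},                                        # tag overrides (none here)
    { prefix => 'http://example.org/wiki/' },  # assumed option name for the link base URL
);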
I am sure that someone has already come up with a better way to convert single MediaWiki files, so here is a list of alternative MediaWiki processors on the MediaWiki site. This SO question could also be of help.
Other markup languages, such as Markdown, provide better support for single-file conversions. Markdown is especially well suited for technical documents and mirrors email conventions. (Also, it is used on this site.)
The libfoo-bar-perl packages in the Ubuntu repositories are precompiled Perl modules. Normally, such modules would be installed via cpan or cpanm. While some of these libraries do include scripts, most don't, and they aren't meant as stand-alone applications.
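For completeness, either route gets you the module used above (package and module names as in the question):
sudo apt-get install libtext-mediawikiformat-perl   # Ubuntu package
cpanm Text::MediawikiFormat                         # or straight from CPAN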

Related

Erlang: How to include libraries

I'm writing a simple Erlang program that requests a URL and parses the response as JSON.
To do that, I need to use a library called Jiffy. I downloaded and compiled it, and now I have a .beam file along with a .app file. My question is: how do I use it? How do I include this library in my program? I cannot understand why I can't find an answer on the web for something that must be very crucial.
Erlang has an include syntax, but it expects a .hrl file.
Thanks!
You don't need to include the file in your project. In Erlang, code is looked up at run time, so the module you are using just has to be in the code path of the VM that runs your code at the point you call it, that's all.
For this you can add directories to the path when you start Erlang: erl -pa your/path/to/beam (there is also -pz; see the Erlang documentation).
Note that it is also possible to modify the path from the application itself using code:add_path(Dir).
You should have a look at the OTP way of building applications in the Erlang documentation or in Learn You Some Erlang, and also look at rebar, a tool that helps you manage Erlang applications (for example, the "getting started with rebar" guide or the rebar wiki).
To add to Pascal's answer, yes Erlang will search for your files at runtime and you can add extra paths as command line arguments.
However, when you build a project at the scale where you are including other libraries, you should be building an Erlang application. This normally entails using rebar.
When using rebar, your app should have a deps/ directory. To include jiffy in your project, it is easiest to simply clone the repo into deps/jiffy. That is all that needs to be done for you to do something like jiffy:decode(Data) in your project.
Additionally, you can specify extra include directories in your rebar.config file by adding lines such as {erl_opts, [{i, "./Some/path/to/file"}]}. rebar will then add that directory to the compiler's include path.
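For reference, a minimal rebar.config along those lines might look like this; the git URL and branch are illustrative, so check jiffy's README for the exact dependency line (rebar fetches the dependency into deps/ for you):
%% rebar.config (sketch; URL and branch are illustrative)
{deps, [
    {jiffy, ".*", {git, "https://github.com/davisp/jiffy.git", {branch, "master"}}}
]}.
{erl_opts, [{i, "./Some/path/to/file"}]}.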

Different code style settings in PhpStorm for php files based on file extension?

I'm using PhpStorm (and love it!), but the coding standard for my current project uses 4 space indents for .php files and 2 space indents for template files (.phtml). The template files are traditional PHP and HTML. Our code implements a standard Zend Framework MVC setup.
Is there a way to configure PhpStorm to use one set of code style settings for *.php files and a different set of code style settings for *.phtml files?
Settings::File Types didn't work
I've tried associating .phtml files with the HTML file type, but that causes me to lose ALL PHP language assistance (no PHP syntax highlighting, no code assist, etc.).
Settings::Template Data Languages didn't work
I also looked for a solution using the Template Data Languages setting. I set my .phtml files to the HTML file type, but PHP isn't an available setting, so it appears there is no way to add PHP language support for HTML files.
AFAIK it is not possible.
If you want to have PHP support, the file extension has to be associated with the PHP file type. That's the only way to have PHP support, as PHP is not an injectable language in the current PhpStorm version/implementation.
You may utilize the TextMate Bundles support plugin and install a PHP bundle for highlighting there. This will allow you to assign the .phtml extension to another file type. The drawback is that you can only have one language highlighted, so HTML will not be highlighted, and there is no code completion for the actual PHP (that's as far as my simple experiments with other not-yet-supported languages went).

How to use the dart:html library to write html files?

I want to make a program that prepares an HTML file. It would either be on the server side or just running in my local machine.
I think it would be nice to be able to use the dart:html library since it has a lot of methods for manipulating HTML (obviously). But it is meant to be used dynamically on the client side, and I want to use it like this: manipulate an HTML DOM tree with dart:html, and when it's ready, write a static HTML file. For instance using query('body').innerHtml
The problem I'm running into is that if I start a project with the "console application" template, I am not able to make dart:html talk to an HTML file. And if I choose "web application", in which I am able to do this, I cannot load the dart:io library; maybe it has to do with it being tagged as [server] in the SDK?
Of course I could just do:
print(query('body').innerHtml);
and manually copy the output to a file, but I thought maybe there is a more elegant solution.
See html5lib.
html5lib in Pure Dart
This is a pure Dart html5 parser. It's a port of html5lib from Python. Since it's 100% Dart you can use it safely from a script or server side app.
Eventually the parse tree API will be compatible with dart:html, so the same code will work on the client or the server.
It doesn't support much in the way of queries yet.
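If html5lib fits, a rough, untested sketch of the offline workflow could look like this; html5lib was later renamed to the html package, so the import path depends on your SDK era, and output.html is just an illustrative file name:
import 'dart:io';
import 'package:html5lib/parser.dart' show parse;

void main() {
  // build or manipulate a DOM tree on the server side, no browser involved
  var document = parse('<html><body><h1>Hello</h1></body></html>');
  // serialize the tree and write it out as a static file
  new File('output.html').writeAsStringSync(document.outerHtml);
}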

Parsing an HTML file through Emacs Lisp

I need to parse an HTML file in order to store all the fields of a table in a list through Emacs Lisp.
Although a function like libxml-parse-html-region could do that work for me, Emacs has to have been compiled with libxml2 support for it to exist, and as I do not have admin privileges on this machine I cannot use that function.
Therefore, can you share some other options for getting the job done under those constraints?
You can find various XML parsers on the EmacsWiki.
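One libxml2-free option worth mentioning is xml.el, which ships with Emacs; it only copes with well-formed XML/XHTML, so treat this as a sketch rather than a general HTML parser (the file name is illustrative):
(require 'xml)
(with-temp-buffer
  (insert-file-contents "table.html")            ; illustrative file name
  (xml-parse-region (point-min) (point-max)))    ; returns a nested list you can walk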
You could build your own Emacs just for your user?

HTML downloading and text extraction

What would be a good tool, or set of tools, to download a list of URLs and extract only the text content?
Spidering is not required, but control over the downloaded file names and threading would be a bonus.
The platform is Linux.
wget | html2ascii
Note: html2ascii can also be called html2a or html2text (and I wasn't able to find a proper man page on the net for it).
See also: lynx.
Python Beautiful Soup allows you to build a nice extractor.
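A minimal sketch of that approach with requests and bs4 (the URL is illustrative; answers from this era would have used urllib2 and BeautifulSoup 3):
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com").text      # fetch one page
print(BeautifulSoup(html, "html.parser").get_text()) # keep only its text content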
I know that w3m can be used to render an HTML document and put the text content in a text file,
w3m www.google.com > file.txt for example.
For the remainder, I'm sure that wget can be used.
Look for the Simple HTML DOM parser for PHP on SourceForge. Use it to parse HTML that you have downloaded with cURL. Each DOM element will have a "plaintext" attribute, which should give you only the text. I was very successful in a lot of applications using this combination for quite some time.
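A sketch of that combination, assuming simple_html_dom.php sits next to the script and with an illustrative URL:
<?php
// download with cURL, extract text with Simple HTML DOM
include 'simple_html_dom.php';

$ch = curl_init('https://example.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = str_get_html(curl_exec($ch));
curl_close($ch);

echo $html->plaintext;   // text content only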
Perl (Practical Extraction and Report Language) is a scripting language that is excellent for this type of work. http://search.cpan.org/ contains a lot of modules that have the required functionality.
Use wget to download the required html and then run html2text on the output files.
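For example (treat this as a sketch; option names differ between html2text versions, and the URL is illustrative):
wget -O page.html 'http://example.com/'   # -O controls the downloaded file name
html2text page.html > page.txt            # extract the text content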