How to convert modern HTML to PDF

In 2016, the best way to convert HTML files to PDF from the command line was wkhtmltopdf. Unfortunately, it seems that it is no longer actively maintained, and it doesn't support a lot of modern features such as flexbox.
One can use headless chrome/chromium to do it:
chrome --headless --print-to-pdf="path/to/pdf" https://your_url
but that offers no options for things like margins, paper size, control over the header/footer, screen size, etc.
It appears that there is no plan to add those to headless Chrome as command line options (one has to use the DevTools interface instead):
https://bugs.chromium.org/p/chromium/issues/detail?id=603559#c89
How can one convert HTML files to PDF from the command line in a way that gives control over how the document prints (margins and the other options above) and supports modern HTML/CSS? Of course, once the conversion works from the command line, it can also be driven from a programming language of your choice.

Here is a command line tool that you can use to convert HTML pages to PDF just as they would be rendered in Chrome. Once you have it installed, you can use it with whatever programming language you want (Python, Java, PHP, etc.) to automatically generate PDFs from HTML web pages or documents. All of the dependencies should remain well maintained for the foreseeable future, so it shouldn't run into the same problems as tools like wkhtmltopdf that proved difficult to maintain.
URLs:
https://www.npmjs.com/package/chromehtml2pdf
https://github.com/dataverity/chromehtml2pdf
To install it, you'll need npm; then type:
npm install chromehtml2pdf
or, to make it globally available to all users on the system:
npm install -g chromehtml2pdf
Command line usage:
chromehtml2pdf --out=file.pdf --landscape=1 https://www.npmjs.com/package/chromehtml2pdf
For help (to view all possible options), type:
chromehtml2pdf --help
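Since it is a plain command line program, driving it from another language is just a matter of spawning a process. For example, a minimal Python sketch (assuming the package was installed globally with -g, and using the same --out and --landscape flags shown above):
import subprocess

def html_to_pdf(url, out_path, landscape=False):
    # Build the chromehtml2pdf invocation; the flags mirror the command line example above.
    cmd = ["chromehtml2pdf", "--out=" + out_path]
    if landscape:
        cmd.append("--landscape=1")
    cmd.append(url)
    subprocess.run(cmd, check=True)  # raises CalledProcessError if the conversion fails

html_to_pdf("https://www.npmjs.com/package/chromehtml2pdf", "file.pdf", landscape=True)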
Feel free to make pull requests on GitHub.

If all you want to do is get rid of Chrome's default header/footer, and you control the page in question, you can do that with CSS, without needing anything more complex than a simple command line call.
@media print {
@page { margin: 0; }
}
Of course, you probably do want margins on your pages, so you'll need to fake them. The complexity of doing so varies depending on how many pages you intend to emit to PDF. The recommended body margins in the linked answer will work if you're emitting a one-pager. If not, you'll need other methods: for multiple pages, you can add body padding on the left and right, then wrap each page's content in a tag with margins on the top and bottom, as sketched below.
https://stackoverflow.com/a/15150779/176877
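For example, a minimal sketch of the multi-page approach (the .page class name is just illustrative, and exact page-break behaviour may vary between Chrome versions):
<style>
  @media print {
    @page { margin: 0; }                 /* removes Chrome's default header/footer */
    body  { margin: 0; padding: 0 2cm; } /* fake left/right margins */
    .page { margin: 2cm 0;               /* fake top/bottom margins per page */
            page-break-after: always; }
  }
</style>
<div class="page">Page 1 content</div>
<div class="page">Page 2 content</div>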

Project Gotenberg does this and a bit more, including margin control as well as webhooks, timeouts, merging, and other formats.
To try it:
docker run --rm -p 3000:3000 thecodingmachine/gotenberg:6
Example:
curl --request POST \
--url http://localhost:3000/convert/url \
--header 'Content-Type: multipart/form-data' \
--form remoteURL=https://brave.com \
--form marginTop=0 \
--form marginBottom=0 \
--form marginLeft=0 \
--form marginRight=0 \
-o result.pdf
It supports HTML and Markdown conversions using Google Chrome headless.
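If you would rather call it from code than shell out to curl, the same request can be made from Python. A sketch, assuming the third-party requests package and the container started above on localhost:3000:
import requests

# Same multipart form fields as the curl example above.
form = {
    "remoteURL": (None, "https://brave.com"),
    "marginTop": (None, "0"),
    "marginBottom": (None, "0"),
    "marginLeft": (None, "0"),
    "marginRight": (None, "0"),
}
resp = requests.post("http://localhost:3000/convert/url", files=form)
resp.raise_for_status()
with open("result.pdf", "wb") as f:
    f.write(resp.content)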

Related

How to get absolute URLs in Bash

I want to get all URLs from a specific page in Bash.
This problem is already solved here: Easiest way to extract the urls from an html page using sed or awk only
The trick, however, is to parse relative links into absolute ones. So if http://example.com/ contains links like:
<a href="about.html">About us</a>
<script type="text/javascript" src="media/blah.js"></script>
I want the results to have the following form:
http://example.com/about.html
http://example.com/media/blah.js
How can I do so with as little dependencies as possible?
Simply put, there is no simple solution. Having few dependencies leads to unsightly code, and vice versa: code robustness leads to higher dependency requirements.
With this in mind, I describe a few solutions below and sum them up by giving the pros and cons of each.
Approach 1
You can use wget's -k option together with some regular expressions (read more about parsing HTML that way).
From the wget manual page:
-k
--convert-links
After the download is complete, convert the links in the document to
make them suitable for local viewing.
(...)
The links to files that have not been downloaded by Wget will be
changed to include host name and absolute path of the location they
point to.
Example: if the downloaded file /foo/doc.html links to /bar/img.gif
(or to ../bar/img.gif), then the link in doc.html will be modified to
point to http://hostname/bar/img.gif.
An example script:
#wget needs a file in order for -k to work
tmpfil=$(mktemp);
#-k - convert links
#-q - suppress output
#-O - redirect output to given file
wget http://example.com -k -q -O "$tmpfil";
#-o - print only matching parts
#you could use any other popular regex here
grep -o "http://[^'\"<>]*" "$tmpfil"
#remove unnecessary file
rm "$tmpfil"
Pros:
Works out of the box on most systems, assuming you have wget installed.
In most cases, this will be a sufficient solution.
Cons:
Relies on regular expressions, which are bound to break on some exotic pages, since HTML's nested structure sits above regular languages in the Chomsky hierarchy and cannot be fully parsed with regular expressions.
You cannot pass a location in your local file system; you must pass a working URL.
Approach 2
You can use Python together with BeautifulSoup. An example script:
#!/usr/bin/python
import sys
import urllib
import urlparse
import BeautifulSoup

if len(sys.argv) <= 1:
    print >>sys.stderr, 'Missing URL argument'
    sys.exit(1)

content = urllib.urlopen(sys.argv[1]).read()
soup = BeautifulSoup.BeautifulSoup(content)
for anchor in soup.findAll('a', href=True):
    print urlparse.urljoin(sys.argv[1], anchor.get('href'))
And then:
dummy:~$ ./test.py http://example.com
Pros:
It's the correct way to handle HTML, since it properly uses a fully-fledged parser.
Exotic HTML is very likely going to be handled well.
With small modifications, this approach works for files, not URLs only.
With small modifications, you might even be able to give your own base URL.
Cons:
It requires Python.
It requires a third-party Python package (BeautifulSoup).
You need to manually handle tags and attributes like <img src>, <link href>, <script src>, etc., which isn't handled in the script above (see the sketch below).
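For reference, a rough Python 3 sketch of the same idea that also picks up src attributes (it assumes the third-party beautifulsoup4 package, installable with pip):
#!/usr/bin/env python3
import sys
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup  # pip install beautifulsoup4

if len(sys.argv) <= 1:
    print('Missing URL argument', file=sys.stderr)
    sys.exit(1)

base = sys.argv[1]
soup = BeautifulSoup(urlopen(base).read(), 'html.parser')

# Resolve every href and src attribute against the page URL.
for tag in soup.find_all(href=True):
    print(urljoin(base, tag['href']))
for tag in soup.find_all(src=True):
    print(urljoin(base, tag['src']))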
Approach 3
You can use some features of lynx. (This one was mentioned in the answer you provided in your question.) Example:
lynx http://example.com/ -dump -listonly -nonumbers
Pros:
Very concise usage.
Works well with all kinds of HTML.
Cons:
You need Lynx.
Although you can extract links from files as well, you cannot control the base URL, and you end up with file://localhost/ links. You can fix this with ugly hacks like manually inserting a <base href=""> tag into the HTML.
Another option is my Xidel (XQuery/Webscraper):
For all normal links:
xidel http://example.com/ -e '//a/resolve-uri(@href)'
For all links and srcs:
xidel http://example.com/ -e '(//@href, //@src)/resolve-uri(.)'
With rr-'s format:
Pros:
Very concise usage.
Works well with all kinds of HTML.
It's the correct way to handle HTML, since it properly uses a fully-fledged parser.
Works for files and URLs.
You can give your own base URL (with resolve-uri(@href, "baseurl")).
No dependencies except Xidel (except OpenSSL, if you also have HTTPS URLs).
Cons:
You need Xidel, which is not contained in any standard repository.
Why not simply this?
re='(src|href)='
baseurl='example.com'
wget -O- "http://$baseurl" | awk -F'"' -v base="$baseurl" "/$re/ { print \"http://\" base \"/\" \$2 }"
You just need wget and awk.
Feel free to improve the snippet a bit if you have both relative and absolute URLs at the same time.

Running a Perl CGI script from a webpage (HTML)

I have been searching for a tutorial on how to run a Perl program from an HTML webpage. I cannot find a tutorial or even a good starting point that explains clearly how to do this...
What I'm trying to do is use WWW::Mechanize in Perl to fill in some information for me on the back end of a WordPress site. Before I can do that, I'd like to see the retrieved HTML displayed in the browser just as the actual website would be. Here is my Perl:
print "Content-type: text/html\n\n";
use CGI;
use WWW::Mechanize;
my $m = WWW::Mechanize->new();
$url = 'http://www.storagecolumbusohio.com/wp-admin';
$m->post($url);
$m->form_id('loginform');
$m->set_fields('log' => 'username', 'pwd' => 'password');
$page = $m->submit();
$m->add_handler("request_send", sub { shift->dump; return });
$m->add_handler("response_done", sub { shift->dump; return });
print $page->decoded_content;
This code works from the command prompt (actually I'm on a Mac, so Terminal). However, I'd like it to work from a website when the user clicks on a link.
I have learned a few things, but it's confusing for me since I'm a Perl noob. It seems that there are two ways to go about doing this (and I could be wrong, but this is what I've gathered from what I've read). One way people keep talking about is using some kind of "template method" such as Embperl or mod_perl. The other is to run the Perl program as a CGI script. From what I've read on various sites, it seems like CGI is the simplest and most common solution? In order to do that, I'm told I need to change a few lines in the httpd.conf file. Where can I find that file to alter it? I know I'm on an Apache server, but my site is hosted by DreamHost. Can I still access this file, and if so, how?
Any help would be greatly appreciated as you can probably tell I don't have a clue and am very confused.
To use a CGI script on DreamHost, it is sufficient to:
give the script a .cgi extension
put the script somewhere visible to the webserver
give the script the right permissions (at least 0755)
You may want to see if you can get a toy script, say,
#!/usr/bin/perl
print "Content-type: text/plain\n\nHello world\n";
working before you tackle debugging your larger script.
That said, something I don't see in your script is the header. I think you'll want to say something like
print "Content-type: text/html\n\n";
before your other print call.
I would suggest that you test your code first on your local server.
I assume you are using Windows or something similar, so use XAMPP (http://www.apachefriends.org/en/xampp.html) or WAMP (http://www.wampserver.com/en/), or get a real OS like Debian (http://www.debian.org); you can run it in a VM as well.
You should not print the content type like that, but use CGI.pm's "print header"; see this page:
http://perldoc.perl.org/CGI.html#CREATING-A-STANDARD-HTTP-HEADER%3a
Make sure you have your Apache server configured properly for Perl; see also these common problems:
http://oreilly.com/openbook/cgi/ch12_01.html
Also see How can I send POST and GET data to a Perl CGI script via the command line? for testing on the command line.

Convert MediaWiki wikitext format to HTML using command line

I tend to write a good amount of documentation, so the MediaWiki format is easy for me to understand, and it saves me a lot of time compared with writing traditional HTML. However, I also write a blog, and I find that switching from keyboard to mouse all the time to input the correct HTML tags adds a lot of time. I'd like to be able to write my articles in MediaWiki syntax and then convert them to HTML for use on my blog.
I've tried Googling, but I must need better nomenclature, as surprisingly I haven't been able to find anything.
I use Linux and would prefer to do this from the command line.
Any one have any thoughts or ideas?
The best option would be to use the MediaWiki parser itself. The good news is that MediaWiki 1.19 will provide a command line tool just for that!
Disclaimer: I wrote that tool.
The script is maintenance/parse.php; here are some usage examples straight from the source code:
Entering text yourself, ending it with Control + D:
$ php maintenance/parse.php --title foo
''[[foo]]''^D
<p><i><strong class="selflink">foo</strong></i>
</p>
$
The usual file input method:
$ echo "'''bold'''" > /tmp/foo.txt
$ php maintenance/parse.php /tmp/foo.txt
<p><b>bold</b>
</p>$
And of course piping to stdin:
$ cat /tmp/foo | php maintenance/parse.php
<p><b>bold</b>
</p>$
As of today, you can get the script from http://svn.wikimedia.org/svnroot/mediawiki/trunk/phase3/maintenance/parse.php and place it in your maintenance directory. It should work with MediaWiki 1.18.
The script will be made available with MediaWiki 1.19.0.
I looked into this a bit and think that a good route to take here would be to learn a general markup language like reStructuredText or Markdown and then convert from there. I discovered a program called pandoc that can convert either of these to HTML and MediaWiki. Appreciate the help.
Example:
pandoc -f mediawiki -s myfile.mediawiki -o myfile.html
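And if you take the Markdown route suggested above, converting to HTML is just as short (a sketch; pandoc can usually infer the formats from the file extensions, so -f is optional here):
pandoc -f markdown -s myfile.md -o myfile.html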
This page lists tons of MediaWiki parsers that you could try.

How to convert phpBB board to static archive page?

I used to run a phpBB forum for our class in school but we have now graduated and the forum isn't used anymore. I want to remove the phpBB installation but there is a lot written in the forum that is fun to read now and then.
I wonder if there is an easy way to convert the phpBB forum to some kind of static archive page that anyone can browse and read, instead of having the full phpBB installation.
I guess I could create some kind of converter myself using the database tables but I wonder if there already is something like that.
I just used wget to archive a phpBB2 forum completely. Things might be a bit different for phpBB3 or newer versions, but the basic approach is probably useful.
I first populated a file with session cookies (to prevent phpBB from putting sid= in links), then did the actual mirror. This used wget 1.20, since 1.18 messed up the --adjust-extension handling for non-HTML files (e.g. GIFs).
wget https://example.com/forum/ --save-cookies cookies \
--keep-session-cookies
wget https://example.com/forum/ --load-cookies cookies \
--page-requisites --convert-links --mirror --no-parent --reject-regex \
'([&?]highlight=|[&?]order=|posting.php[?]|privmsg.php[?]|search.php[?]|[&?]mark=|[&?]view=|viewtopic.php[?]p=)' \
--rejected-log=rejected.log -o wget.log --server-response \
--adjust-extension --restrict-file-names=windows
This tells wget to recursively mirror the entire site, including requisites (CSS and images). It rejects (skips) certain URLs, mostly because they are no longer useful in a static site (e.g. search) or are just slightly different or even identical views of the same content (e.g. viewtopic.php?p=... just returns the topic containing the given post, so there is no need to mirror that topic for each individual post). The --adjust-extension option makes wget add .html to dynamically generated HTML pages, and --restrict-file-names=windows makes it replace (among other things) the ? with a #, so you can actually put the result on a webserver without that webserver chopping the URLs at the ? (which normally starts the query parameters).
You could write a quick PHP script to query the database and generate a flat HTML file. A rough sketch (the connection details, table, and column names below are placeholders; adjust them to your actual forum schema):
...
<body>
<table>
<tr>
<th>Topic</th>
<th>Author</th>
<th>Content</th>
</tr>
<?php
// Query the forum database table (placeholder names).
$db = new mysqli('localhost', 'user', 'password', 'forum');
$result = $db->query('SELECT topic, author, content FROM tblComment');
while ($row = $result->fetch_assoc()) {
    echo "
<tr>
<td>{$row['topic']}</td>
<td>{$row['author']}</td>
<td>{$row['content']}</td>
</tr>
";
}
?>
</table>
</body>
...
Or you could get a little fancier and generate an HTML file for each subject, and build an index.html page that has links to all the HTML pages created, but I don't think you'll find anything ready-made that does exactly what you need.
Another option would be to use a website copier such as HTTrack (http://www.httrack.com/) to save all the generated HTML files, which can later be served statically from the server.

What to use to check html links in large project, on Linux?

I have a directory with more than 1000 .html files, and I would like to check all of them for bad links, preferably from the console. Is there any tool you can recommend for such a task?
You can use wget, e.g.:
wget -r --spider -o output.log http://somedomain.com
At the bottom of the output.log file, it will indicate whether wget has found broken links. You can parse that using awk/grep.
I'd use checklink (a W3C project).
You can extract links from HTML files using the Lynx text browser. Bash scripting around this should not be difficult.
Try the webgrep command line tools or, if you're comfortable with Perl, the HTML::TagReader module by the same author.