Is there any wkhtmltopdf option to convert html text rather than file? - html

I recently stumbled on wkhtmltopdf and have found it to be an excellent tool for on-the-fly conversion from html to pdf in the browser.
A typical usage (in Windows) would go:
wkhtmltopdf.exe --some-option "<div>Some html <b>formatted</b> text</div>" www.host.com/page_to_print.html file.pdf
My question is: Is there an option to use <html><head></head><body><h1>This is a header</h1></body></html> in place of www.host.com/page_to_print.html?
Thanks for any help.

You can pipe content into wkhtmltopdf using the command line. For Windows, try this:
echo "<h3>blep</h3>" | wkhtmltopdf.exe - test.pdf
This reads as: echo <h3>blep</h3> and send its stdout (standard output stream) to wkhtmltopdf's stdin (standard input stream).
The dash - in the wkhtmltopdf command means that it takes its input from stdin rather than from a file.
You could also echo HTML into a file, feed that file to wkhtmltopdf and delete that file inside a script.

Just a correction to the answer provided by Nenotlep. As Jigar noted in a comment, that command results in quotation marks preceding and following the actual text. On my system (Windows 10) this command is the correct solution:
echo ^<h3^>magical ponies^</h3^> | "C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe" - test.pdf
The echo command needs no quotation marks - but, if you do not put the text between quotation marks, the < and > characters need to be escaped (by ^).
Another way is to write the text into a temporary file first, which on Windows might even be faster, according to some sources:
echo ^<h3^>magical ponies^</h3^> > temp.txt
"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe" - test.pdf < temp.txt
(This can also be written in one line: just put an & between the two commands.)
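For example, the two commands above joined with an & should look like this:
echo ^<h3^>magical ponies^</h3^> > temp.txt & "C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe" - test.pdf < temp.txt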

In addition to the answer provided by pp: if you prefer not to escape the < and > characters, you can also do the following:
echo | set /p="<h3>Magical ponies</h3>" | wkhtmltopdf - test.pdf

Using PowerShell, you can do it like this:
$html = "<h1>Magical Ponies</h1><p>Once upon a time in Horseland, there was a band of miniat
ure creatures..."
$html | .\wkhtmltopdf.exe - C:\temp\test.pdf
Just make sure you're running the code from within the \bin\ directory of wkhtmltopdf; otherwise, you'd have to provide the full path to the executable.

I couldn't get wkhtmltopdf to convert raw HTML to PDF on my end, but I did find a simple microservice on bantam.io that was able to do it with a couple of minutes' work.
Here's how it works:
bantam
  .run('#images/html', {
    html: `
      <h1 style='width: 400px; text-align: center'>TEST</h1>
      <br/>
      <img src='https://images.unsplash.com/photo-1507146426996-ef05306b995a?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1950&q=80' />
    `,
    imageType: 'pdf',
  })
  .then(pdfUrl => {
    // pdf is temporarily hosted on AWS S3 via a secure link at `pdfUrl`
  });
They also have options for taking in a URL rather than raw HTML and you can produce images as well as PDFs. Here are the docs: https://bantam.io/functions/#images/html?link=docs&subLink=0

Related

Using Regex to pull value between html tags

So I know there are easier ways to do this; however, I was given the code and asked to attempt to make it work. Rather than rewrite the entire thing, I'd simply like to get this working.
What it does is download the source code of the web page that displays when a person searches the app store. Once that is done, I attempt to pull the version of the app, which comes across as the first line below.
Once I get the code from the downloaded file, I'd like it to be placed in another file to be called for later use; however, if this is an unnecessary step, I am willing to remove it.
I have a feeling I am missing something simple.
<span class="htlgb">4.72</span>
# connects to iTunes website with Casino/Manufacturers id
curl https://play.google.com/store/apps/details?id=${address[$a]} > json
# puts just the version from the json file into version file
grep -Po '(?<=<span class="htlgb"> ).*?(?=</span>)' json > version
# cuts out some data so we have just a version number
current_Version=`cat version | tr -d '"' | tr -d ',' | tr -d 'version:'`
Please don't use regex to parse HTML! Use a true HTML parser like Xidel instead:
echo '<span class="htlgb">4.72</span>' | xidel -s - -e '//span[#class="htlgb"]'
4.72
I wouldn't use this path expression for the playstore website however, because there are a lot of these. I've used the Spotify page as an example.
xidel -s https://play.google.com/store/apps/details?id=com.spotify.music -e '//div[@class="hAyfc"][div="Current Version"]/outer-html()'
<div class="hAyfc"><div class="BgcNfc">Current Version</div><span class="htlgb"><div class="IQ1z0d"><span class="htlgb">8.5.40.195</span></div></span></div>
The version string 8.5.40.195 can be found within a div (with an attribute class="hAyfc") that has a child div with the text Current Version.
Then it's as simple as selecting the span (text)node:
xidel -s https://play.google.com/store/apps/details?id=com.spotify.music -e '//div[@class="hAyfc"][div="Current Version"]/span'
8.5.40.195
# or with your variable:
xidel -s https://play.google.com/store/apps/details?id=${address[$a]} -e '//div[@class="hAyfc"][div="Current Version"]/span'
I am not a bash pro, but this regex matches 3 groups around your desired html tag. All you need to do now is select the value from group 2; a sed sketch for doing that follows the test link below.
(<span class=\"htlgb\">)(.*?)(</span>)
You can test it here: https://regex101.com/r/9RPycf/1
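A rough sed sketch that keeps only group 2, using the same three groups but with the lazy .*? swapped for [^<]* since sed has no lazy quantifiers (this assumes the page source is already saved in the json file from your curl call):
sed -nE 's@.*(<span class="htlgb">)([^<]*)(</span>).*@\2@p' json > version
As the other answer notes, the Play Store page contains many of these spans, so this may not pick out the version specifically.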

How to only copy lines of code with specific word from perl file into new file

I have a perl file with html code inside of a subroutine. I want to copy the html code into a new file, but ONLY the html code, not the rest of the perl syntax. The HTML code is all inside one subroutine, and every HTML line starts with 'push':
sub getTable {
push @htmlBase, qq(<html>\n);
push @htmlBase, qq(\n);
push @htmlBase, qq(<head>\n);
push @htmlBase, qq(<meta http-equiv="Content-Language" content="en-us">\n);
In essence, how do I ONLY copy lines that start with 'push' into a new file from my current perl file? Thanks in advance.
If you're using a unix-like OS, try using grep. Something like:
$ grep 'push' myfile.pl | grep -Po '(?<=qq\().*(?=\);)' >Newfile.html
The first grep just grabs lines with 'push' on them. The second grep turns on Perl RE mode (the -P) and only returns the matching part. The pattern has two parts: (?<=qq\() matches "qq(" right before the text (but doesn't include it in the result) and (?=\);) requires ");" to follow the match, so the greedy .* runs up to the last ");" on the line.
This won't match multi-line quotes and the output will also include escapes, like the \n for newlines.
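If the trailing literal \n escapes bother you, one possible tweak is to strip them with sed before writing the file (a sketch, not tested against your real script):
grep 'push' myfile.pl | grep -Po '(?<=qq\().*(?=\);)' | sed 's/\\n$//' > Newfile.html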
Using perl to grep the file:
perl -lne'm/push.+qq\((.+)?(\\n)\);/ && print $1' source.pl > target.html
For the output you've shown, this one-liner will work.
If your source script is more complex, e.g. multi-line statements and embedded variables, then you will need to write some temporary code to call getTable, print the contents of @htmlBase, then save that output to the new file.
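A minimal sketch of that idea as a one-liner, assuming your script is named myscript.pl (a hypothetical name), that it can be loaded on its own, and that @htmlBase is a package-level (non-my) array:
perl -e 'do "./myscript.pl"; die $@ if $@; getTable(); print @htmlBase;' > target.html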

batch base64 image decode

I've got a large (117MB!) html file that has thousands of images encoded as base64. I'd like to decode them to JPGs, but my bash-fu isn't up to the task and I haven't been able to find an answer online.
In general, HTML can't be parsed properly with regular expressions, but if you have a specific limited format then it could work.
Given a simple format like
<body>
<img src="data:image/jpeg;base64,DpFDPGOIg3renreGR43LGLJKds==">
<img src="data:image/jpeg;base64,DpFDPGOIg3renreGR43LGLJKds=="><img src="data:image/jpeg;base64,DpFaPGOIg3renreGR43LGLJKds==">
<div><img src="data:image/jpeg;base64,DpFdPGOIg3renreGR43LGLJKds=="></div>
</body>
the following can pull out the data
i=0; awk 'BEGIN{RS="<"} /="data:image\/jpeg;base64,[^\"]*"/ { match($0, /="data:image\/jpeg;base64,([^\"]*)"/, data); print data[1]; }' test.html | while read d; do echo $d | base64 -d > img$i.jpg; i=$(($i+1)); done
To break that down:
i=0 Keep a counter so we can output different filenames for each image.
awk 'BEGIN{RS="<"} Run awk with the Record Separator changed from the default newline to <, so we always treat each HTML element as a separate record.
/="data:image\/jpeg;base64,[^\"]*"/ Only run the following commands on records that have embedded base64 jpeg data.
{ match($0, /="data:image\/jpeg;base64,([^\"]*)"/, data); print data[1]; }' Pull out the data itself, the part matched with parentheses between the comma and the trailing quotation mark, then print it.
test.html Just the input filename.
| while read d; do Pipe the output base64 data to a loop. read will put each line into d until there's no more input.
echo $d | base64 -d > img$i.jpg; Pass the current image through the base64 decoder and store the output to a file.
i=$(($i+1)); Increment to change the next filename.
done Done.
There are a few things that could probably be done better here:
There should be a way to get the line-match regexp to capture the base64 data directly, instead of repeating the regexp in a call to the match() function, but I couldn't get it to work.
I don't like the technique of reading a pipe into the variable d, only to echo it back out to another pipe - it would be nicer to just pipe straight through - but base64 doesn't know to only use one line of the input.
For some reason I have not yet figured out, incrementing the counter directly where it's used (i.e. echo $d | base64 -d > img$((i++)).jpg) only wrote to the first file, even though echo $d > img$((i++)).b64 correctly wrote the encoded data to multiple files. Rather than waiting on working that out, I've just split the increment into its own command.
You can try scraping the encoded strings of the images using Python.
Then check this out for converting the encoded strings to images.
Use a regex to direct the base64 images to separate files.
Write a loop to iterate through your files (see the sketch below).
The Bash command to decode the files will be along the lines of:
cat base64_file1 |base64 -d > file1.jpg
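Putting those steps together, a rough sketch, assuming the split-out encoded strings ended up in files named base64_file1, base64_file2, and so on:
for f in base64_file*; do
  base64 -d "$f" > "$f.jpg"   # e.g. base64_file1 -> base64_file1.jpg
done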

How to add html attributes and values for all lines quickly with vim and plugins?

My OS: Debian 8.
uname -a
Linux debian 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1+deb8u2 (2017-03-07) x86_64 GNU/Linux
Here is my base file.
home
help
variables
compatibility
modelines
searching
selection
markers
indenting
reformatting
folding
tags
makefiles
mapping
registers
spelling
plugins
etc
I want to create an html file as below.
<a href="home.html" id="home">home</a>
<a href="help.html" id="help">help</a>
<a href="variables.html" id="variables">variables</a>
<a href="compatibility.html" id="compatibility">compatibility</a>
<a href="modelines.html" id="modelines">modelines</a>
<a href="searching.html" id="searching">searching</a>
<a href="selection.html" id="selection">selection</a>
<a href="markers.html" id="markers">markers</a>
<a href="indenting.html" id="indenting">indenting</a>
<a href="reformatting.html" id="reformatting">reformatting</a>
<a href="folding.html" id="folding">folding</a>
<a href="tags.html" id="tags">tags</a>
<a href="makefiles.html" id="makefiles">makefiles</a>
<a href="mapping.html" id="mapping">mapping</a>
<a href="registers.html" id="registers">registers</a>
<a href="spelling.html" id="spelling">spelling</a>
<a href="plugins.html" id="plugins">plugins</a>
<a href="etc.html" id="etc">etc</a>
Every line is wrapped in an anchor tag with href and id attributes: the href value is the line content with .html appended, and the id value is the line content itself.
How to add html attributes and values for all lines quickly with vim and plugins?
sed, awk, and Sublime Text 3 solutions are all welcome.
$ sed 's:.*:<a href="&.html" id="&">&</a>:' file
<a href="home.html" id="home">home</a>
<a href="help.html" id="help">help</a>
<a href="variables.html" id="variables">variables</a>
<a href="compatibility.html" id="compatibility">compatibility</a>
<a href="modelines.html" id="modelines">modelines</a>
<a href="searching.html" id="searching">searching</a>
<a href="selection.html" id="selection">selection</a>
<a href="markers.html" id="markers">markers</a>
<a href="indenting.html" id="indenting">indenting</a>
<a href="reformatting.html" id="reformatting">reformatting</a>
<a href="folding.html" id="folding">folding</a>
<a href="tags.html" id="tags">tags</a>
<a href="makefiles.html" id="makefiles">makefiles</a>
<a href="mapping.html" id="mapping">mapping</a>
<a href="registers.html" id="registers">registers</a>
<a href="spelling.html" id="spelling">spelling</a>
<a href="plugins.html" id="plugins">plugins</a>
<a href="etc.html" id="etc">etc</a>
If you want to do this in vi itself, no plug-in necessary:
Open the file, type : and insert this line as the command
%s:.*:<a href="&.html" id="&">&</a>
It will make all the substitutions in the file.
sed is the best solution (simple and pretty fast here) if you are sure of the content; if not, it needs a bit of complexity that is better handled by awk:
awk '
{
# change special char for HTML constraint
Org = URL = HTML = $0
# sample of modification
gsub( / /, "%20", URL)
gsub( /</, "%3C", HTML)
printf( "%s\n", URL, Org, HTML)
}
' YourFile
To complete this easily in Sublime Text, without any plugins added:
Open the base file in Sublime Text
Type Ctrl+Shift+P and in the fuzzy search input type syn html to set the file syntax to HTML.
In the View menu, make sure Word Wrap is toggled off.
Ctrl+A to select all.
Ctrl+Shift+L to break selection into multi-line edit.
Ctrl+C to copy selection into clipboard as multiple lines.
Alt+Shift+W to wrap each line with a tag, then tap a to convert the default <p> tag into an <a> tag (hit Esc to dismiss any context menus that pop up).
Type a space and then href=" and you should see it added to every line, since they all have cursors. Also note that Sublime automatically closes the quote for you, so you have href="" with the cursor sitting between the quotes.
Ctrl+V: this is where the magic happens. Your clipboard contains every line's contents, so the appropriate value is pasted into the quotes where each cursor sits. Then simply type .html to add the extension.
Use the right arrow to move the cursors outside of the href quotes, then follow the two previous steps to add an id attribute with the intended ids pasted in.
Voila! You're done.
Multi-line editing is very powerful as you learn how to combine it with other keyboard shortcuts. It has been a huge improvement in my workflow. If you have any questions please feel free to comment and I'll adjust as needed.
With a bash one-liner:
while read v; do printf '<a href="%s.html" id="%s">%s</a>\n' "$v" "$v" "$v"; done < file
(OR)
while read v; do echo "<a href=\"$v.html\" id=\"$v\">$v</a>"; done < file
Try this -
awk '{print a$1b$1c$1d}' a='<a href="' b='.html" id="' c='">' d='</a>' file
<a href="home.html" id="home">home</a>
<a href="help.html" id="help">help</a>
<a href="variables.html" id="variables">variables</a>
<a href="compatibility.html" id="compatibility">compatibility</a>
<a href="modelines.html" id="modelines">modelines</a>
<a href="searching.html" id="searching">searching</a>
<a href="selection.html" id="selection">selection</a>
<a href="markers.html" id="markers">markers</a>
<a href="indenting.html" id="indenting">indenting</a>
<a href="reformatting.html" id="reformatting">reformatting</a>
<a href="folding.html" id="folding">folding</a>
<a href="tags.html" id="tags">tags</a>
<a href="makefiles.html" id="makefiles">makefiles</a>
<a href="mapping.html" id="mapping">mapping</a>
<a href="registers.html" id="registers">registers</a>
<a href="spelling.html" id="spelling">spelling</a>
<a href="plugins.html" id="plugins">plugins</a>
<a href="etc.html" id="etc">etc</a>
Here I have created 4 variables a, b, c & d which you can edit as per your choice.
OR
while read -r i;do echo "<a href=\"$i.html\" id=\"$i\">"$i"</a>";done < f
<a href="home.html" id="home">home</a>
<a href="help.html" id="help">help</a>
<a href="variables.html" id="variables">variables</a>
<a href="compatibility.html" id="compatibility">compatibility</a>
To execute it directly in vim:
!sed 's:.*:<a href="&.html" id="&">&</a>:' %
In awk, no regex, no nothing, just print strings around $1s, escaping "s:
$ awk '{print "" $1 ""}' file
home
help
If you happen to have empty lines in there just add /./ before the {:
/./{print ...
list=$(cat basefile.txt)
for val in $list
do
echo ""$val"" >> newfile.html
done
Using bash, you can always make a script or type this into the command line.
This vim replacement pattern handles your base file:
s#^\s*\(.\{-}\)\s*$#<a href="\1.html" id="\1">\1</a>#
^\s* matches any leading spaces, then
.\{-} captures everything after that, non-greedily, allowing
\s*$ to match any trailing spaces.
This avoids giving you stuff like <a href="home .html" id="home ">home </a>.
You can also process several base files with vim at once:
vim -c 'bufdo %s#^\s*\(.\{-}\)\s*$#<a href="\1.html" id="\1">\1</a># | saveas! %:p:r.html' some.txt more.txt
bufdo %s#^\s*\(.\{-}\)\s*$#<a href="\1.html" id="\1">\1</a># runs the replacement on each buffer loaded into vim,
saveas! %:p:r.html saves each buffer with an html extension, overwriting if necessary,
vim will open and show you the saved more.html, which you can correct as needed, and
you can use :n and :prev to visit some.html.
Something like sed’s probably best for big jobs, but this lets you tweak the conversions in vim right after it’s made them, use :u to undo, etc. Enjoy!

Which Perl module do I use to convert Pod to HTML?

I need to convert Pod to HTML. There are a number of Pod::HTML and Pod::Simple::* modules. Which one should I use?
The short answer is Pod::Simple::XHTML. It produces useful yet concise HTML output. You can see an example of the output by viewing the html source at http://metacpan.org.
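For a single file, a minimal sketch using Pod::Simple's standard parse_file/output_fh interface (lib/MyModule.pm is a hypothetical input path):
perl -MPod::Simple::XHTML -e 'my $p = Pod::Simple::XHTML->new; $p->output_fh(*STDOUT); $p->parse_file(shift)' lib/MyModule.pm > MyModule.html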
It also works with Pod::Simple::HTMLBatch which you should check out if you are converting more than one file. Note that the default for Pod::Simple::HTMLBatch is Pod::Simple::HTML. But the maintainer of Pod::Simple, David Wheeler, recommends using Pod::Simple::XHTML.
use Pod::Simple::HTMLBatch;
use Pod::Simple::XHTML;
mkdir './html' or die "could not create directory: $!";
my $convert = Pod::Simple::HTMLBatch->new;
$convert->html_render_class('Pod::Simple::XHTML');
$convert->add_css('http://www.perl.org/css/perl.css');
$convert->css_flurry(0);
$convert->javascript_flurry(0);
$convert->contents_file(0);
$convert->batch_convert('./pod', './html');