How to achieve chapter/section-wise page numbering in PDF creation using the WKHTMLTOPDF PDF engine

I am trying to generate a PDF book using WKHTMLTOPDF on Linux, launching it from a Perl program. I pass multiple HTML files, plus a cover page and a footer HTML file, to the WKHTMLTOPDF command line. When printing to PDF, WKHTMLTOPDF treats each file as one section/chapter. I can already number the pages across the full PDF using the variables available to JavaScript in the footer HTML. But I want to print the page number within the chapter/section currently being processed, and also the total number of pages in that chapter/section. In other words, the page counter should reset after each chapter/section is printed. This would let me show the total number of pages in each section and number the pages section by section.
Does anyone know how to achieve this? I am using WKHTMLTOPDF patch QT RC11.
Thanks

There are at least two ways to solve your problem.
You can provide a --page-offset value with each page file object. For instance: wkhtmltopdf page --page-offset 1 section1.html page --page-offset 1 section2.html
You can calculate the number of pages in the current section by using the query string provided to the footer html script

Use [sitepage] and [sitepages], e.g.:
wkhtmltopdf.exe file:///C:/temp/html1.html --footer-right "Page [sitepage] of [sitepages]" file:///C:/temp/html2.html --footer-right "Page [sitepage] of [sitepages]" output.pdf
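Since the question launches wkhtmltopdf from a program, here is a minimal sketch of how the argument list above could be assembled. It is written in Python rather than Perl, and the function name and file names are hypothetical. Only the [sitepage] and [sitepages] footer variables and the --footer-right option come from the answer above.

```python
from typing import List

# Footer text using the per-file counters from the answer above.
FOOTER_TEXT = "Page [sitepage] of [sitepages]"


def build_command(sections: List[str], output_pdf: str) -> List[str]:
    """Return a wkhtmltopdf argument list with the same footer on every section."""
    cmd = ["wkhtmltopdf"]
    for html in sections:
        # Each input file is followed by the footer option that applies to it.
        cmd += [html, "--footer-right", FOOTER_TEXT]
    cmd.append(output_pdf)
    return cmd


# Hypothetical file names, for illustration only.
command = build_command(["section1.html", "section2.html"], "book.pdf")
print(command)
# To actually run it (requires wkhtmltopdf on PATH):
#   import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with the brackets and spaces in the footer text.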


Convert markdown table of contents to HTML

I am trying to convert a markdown document to HTML, using pandoc. I cannot get the HTML output to create the table of contents correctly.
Issue:
I have added a table of contents to the markdown doc, where clicking on each header takes the reader to the relevant section. I am using the format below, where clicking on 'Header Title' will send the reader to the section 'header' in the document:
[Header Title](#header)
I tried to convert this to HTML using the pandoc command
pandoc -i input.md -f markdown -t html -o input.html
This creates a valid HTML file I can open in Firefox, and the items in the table of contents show up as links - but when I click them, nothing happens (I am expecting it to jump to the relevant section)
This happens when I use either markdown or markdown_github as the input format (-f in pandoc).
Question:
How can I get the table of contents to show the expected behavior in HTML?
Or is the concept of 'table of contents' a wrong approach to HTML, and I should change my markdown code?
Apologies if I am going about this the wrong way, I have no experience with HTML / web documents.
I found a couple of similar questions but they seemed to be specific to other programming languages / tools, so any help how I can achieve this with markdown / pandoc is much appreciated.
I am using pandoc 1.19.2.4 on Ubuntu.
Example markdown:
- [Chapter 1](#chapter-1)
- [1. Reading a text file](#1-reading-a-text-file)
## Chapter 1
This post focuses on standard text processing tasks such as reading files and processing text.
### 1. Reading a text file
Reading a file.
Looking at your markdown file, you have used #1-reading-a-text-file as the id for the 1st subheading.
While converting it to HTML, the following line is generated for the subheading:
<h3 id="reading-a-text-file">1. Reading a text file</h3>
The problem is the mismatch of "#1" which is present in the table of contents, but not in the heading.
The reason is that pandoc's automatically generated identifiers cannot begin with a number: everything up to the first letter of the heading is removed.
Changing the table of contents to the following should work:
- [Chapter 1](#chapter-1)
- [1. Reading a text file](#reading-a-text-file)
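To check which anchor a heading will get before writing the table of contents, the rules can be approximated in a few lines. This is only a sketch of the automatic identifier rules described in the pandoc manual, not pandoc's own code: the function name is made up, and it ignores inline formatting, footnotes, duplicate headings, and the GitHub-style identifier extension.

```python
import re


def pandoc_auto_identifier(heading: str) -> str:
    """Approximate the id pandoc derives from a plain-text heading."""
    # Keep letters, digits, underscores, hyphens, periods and whitespace.
    kept = "".join(
        ch for ch in heading
        if ch.isalnum() or ch in "_-." or ch.isspace()
    )
    # Join words with hyphens and lowercase the result.
    slug = "-".join(kept.split()).lower()
    # Remove everything before the first letter (ids may not start with a digit).
    first_letter = re.search(r"[^\W\d_]", slug)
    slug = slug[first_letter.start():] if first_letter else ""
    # Fallback when nothing is left.
    return slug or "section"


for title in ["Chapter 1", "1. Reading a text file"]:
    print(f"[{title}](#{pandoc_auto_identifier(title)})")
```

For the two headings in the question this yields chapter-1 and reading-a-text-file, which matches the corrected table of contents.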

How does header and footer printing work in Puppeteer's page.pdf API?

I've noticed a few inconsistencies when trying to use the headerTemplate and footerTemplate options with page.pdf:
The DPI for headers and footers seems to be lower (72 vs 96 for the main body, I think). So if I'm trying to match the margins, I have to scale by that.
Styles are not shared with the main body so I have to include them in the template.
If I try to use a locally stored font, it works on the main body but not in the header/footer even if I include the same CSS in the header/footer template.
I suspect that this happens because headers and footers are treated as separate documents and converted to image/pdf separately (https://cs.chromium.org/chromium/src/components/printing/resources/print_header_footer_template_page.html also implies something like that). Can someone familiar with the implementation explain how it actually works? Thanks!
Short Answer:
Puppeteer controls Chrome or Chromium over the DevTools Protocol.
Chromium uses Skia for PDF generation.
Skia handles the header, set of objects, and footer separately.
Detailed Answer:
From the Puppeteer Documentation:
page.pdf(options)
options <Object> Options object which might have the following properties:
headerTemplate <string> HTML template for the print header. Should be valid HTML markup with following classes used to inject printing values into them:
date formatted print date
title document title
url document location
pageNumber current page number
totalPages total pages in the document
footerTemplate <string> HTML template for the print footer. Should use the same format as the headerTemplate.
returns: <Promise<Buffer>> Promise which resolves with PDF buffer.
NOTE Generating a pdf is currently only supported in Chrome headless.
NOTE headerTemplate and footerTemplate markup have the following limitations:
Script tags inside templates are not evaluated.
Page styles are not visible inside templates.
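Given those two limitations, a template has to carry its own styling inline and rely only on the injected classes. Below is a small sketch of building such a footer string. It is written in Python purely as string construction; the function name, the font size, and the layout are assumptions, and the resulting string is what would be passed as footerTemplate.

```python
def build_footer_template(font_size_px: int = 10) -> str:
    """Return a self-contained footer template with inline styles only."""
    # Styles must be inline: page stylesheets are not visible in the template.
    style = f"font-size: {font_size_px}px; width: 100%; text-align: center;"
    # pageNumber and totalPages are the documented injection classes.
    return (
        f'<div style="{style}">'
        'Page <span class="pageNumber"></span> of <span class="totalPages"></span>'
        "</div>"
    )


print(build_footer_template())
```

No script tag is used, since scripts inside templates are not evaluated.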
We can learn from the Puppeteer source code for page.pdf() that:
The Chrome DevTools Protocol method Page.printToPDF (along with the headerTemplate and footerTemplate parameters) is sent to page._client.
page._client is an instance of page.target().createCDPSession() (a Chrome DevTools Protocol session).
From the Chrome DevTools Protocol Viewer, we can see that Page.printToPDF contains the parameters headerTemplate and footerTemplate:
Page.printToPDF
Print page as PDF.
PARAMETERS
headerTemplate string (optional)
HTML template for the print header. Should be valid HTML markup with following classes used to inject printing values into them:
date: formatted print date
title: document title
url: document location
pageNumber: current page number
totalPages: total pages in the document
For example, <span class=title></span> would generate span containing the title.
footerTemplate string (optional)
HTML template for the print footer. Should use the same format as the headerTemplate.
RETURN OBJECT
data string
Base64-encoded pdf data.
The Chromium source code for Page.printToPDF shows us that:
The Page.printToPDF parameters are passed to the sendDevToolsMessage function, which issues a DevTools protocol command and returns a promise for the results.
After further digging, we can see that Chromium has a concrete implementation of a class called SkDocument that creates PDF files.
SkDocument comes from the Skia Graphics Library, which Chromium uses for PDF generation.
The Skia PDF Theory of Operation, in the PDF Objects and Document Structure section, states that:
Background: The PDF file format has a header, a set of objects and then a footer that contains a table of contents for all of the objects in the document (the cross-reference table). The table of contents lists the specific byte position for each object. The objects may have references to other objects and the ASCII size of those references is dependent on the object number assigned to the referenced object; therefore we can’t calculate the table of contents until the size of objects is known, which requires assignment of object numbers. The document uses SkWStream::bytesWritten() to query the offsets of each object and build the cross-reference table.
The document explains further down:
The PDF backend requires all indirect objects used in a PDF to be added to the SkPDFObjNumMap of the SkPDFDocument. The catalog is responsible for assigning object numbers and generating the table of contents required at the end of PDF files. In some sense, generating a PDF is a three step process. In the first step all the objects and references among them are created (mostly done by SkPDFDevice). In the second step, SkPDFObjNumMap assigns and remembers object numbers. Finally, in the third step, the header is printed, each object is printed, and then the table of contents and trailer are printed. SkPDFDocument takes care of collecting all the objects from the various SkPDFDevice instances, adding them to an SkPDFObjNumMap, iterating through the objects once to set their file positions, and iterating again to generate the final PDF.
Thanks to the other answer (https://stackoverflow.com/a/51460641/364131) and codesearch, I think I found most of the answers I was looking for.
The printing implementation is in PrintPageInternal. It uses two separate WebFrames: one to render the content, and one to render the header and footer. The rendering for the header and footer is done by creating a special frame, writing the contents of print_header_footer_template_page.html to this frame, calling the setup function with the options provided and then printing to a shared canvas. After this, the rest of the contents of the page are printed on the same canvas within the bounds defined by the margins.
Headers and footers are scaled by a fudge_factor which isn't applied to the rest of the content. There might be something funny going on here with the DPIs (which might explain the fudge_factor of 1.33333333f which is equal to 96/72).
I'm guessing this special frame is what prevents the header and footer from sharing the same resources (styles, fonts etc.) as the contents of the page. It probably isn't setup to load (and wait for) any additional resources requested by the header and footer templates, which is why the requested fonts don't load.
I did a lot of research on this issue and finally implemented a small library that handles it with a small hack:
I create two PDF files. The first is the HTML content without header and footer. The second contains the header and footer, repeated once per page of the first PDF. The two are then merged together.
You can find it here:
https://github.com/PejmanNik/puppeteer-report

Can DITA-OT 3.0 transform one map with a number of .dita files into a single HTML file?

Background: I have a ditamap file that references 7 .dita files. Using the default DITA-OT 3.0 html5 plugin, I get 7 HTML files and one index.html.
Question: Can it output the map with its 7 .dita files as a single HTML file (structured as in the map or index.html)? How?
Thanks in advance.
Note: because of topic-based writing, the information is divided into small blocks, and in this form of output the document feels fragmented. Users need many clicks to switch between topics and can hardly get the big picture of the document. That is why I am asking. Any other suggestions?
What you need is chunking. Simply set the chunk="to-content" attribute on the map root element.
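If the map is generated or maintained by a script, the attribute can also be added programmatically. The following is a minimal sketch using Python's standard XML library; the sample map, its file names, and the function name are all hypothetical. Note that ElementTree drops a DOCTYPE declaration on output, so for a real ditamap editing the root element by hand may be simpler.

```python
import xml.etree.ElementTree as ET

# Hypothetical map content, for illustration only.
SAMPLE_MAP = """<map>
  <title>User guide</title>
  <topicref href="intro.dita"/>
  <topicref href="install.dita"/>
</map>"""


def set_chunk_to_content(ditamap_xml: str) -> str:
    """Return the map XML with chunk="to-content" set on the root element."""
    root = ET.fromstring(ditamap_xml)
    # The attribute goes on the map root, as described in the answer.
    root.set("chunk", "to-content")
    return ET.tostring(root, encoding="unicode")


print(set_chunk_to_content(SAMPLE_MAP))
```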

HTML file to screenshot as .jpg or other image

This has nothing to do with rendering an individual image on a webpage. The goal is to render the entire webpage and save that as a screenshot, so that I can show the user a thumbnail of an HTML file. The HTML file I will be screenshotting will be an HTML part in a MIME email message. Ideally I would like to snapshot the entire MIME file, but if I can do this for an HTML file I'll be in good shape.
An API would be ideal but an executable is also good.
You need html2ps, and convert from the package ImageMagick:
html2ps index.html index.ps
convert index.ps index.png
The second program produces one png per page for a long html page; the page layout is done by html2ps.
I found a program evince-thumbnailer, which was reported as:
apropos postscript | grep -i png
evince-thumbnailer (1) - create png thumbnails from PostScript and PDF documents
but it didn't work in a simple first test.
If you would like to combine multiple pages into a larger image, convert can surely help with that.
Now I see, that convert operates on html directly, so
convert index.html index.png
shall work too. I don't see a difference in the output, and the size of the images is nearly identical.
If you have a multipart mime-type email, you typically have a mail header, maybe some pre-html-text, the html and maybe attachments.
You can extract the html and format it separately, but rendering it embedded might not be that easy.
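As an alternative to cutting the HTML out with text tools, the HTML part can be pulled out of a MIME message with Python's standard email package. This is a minimal sketch; the function names and the sample message are made up, and it assumes the message contains at most one text/html body part.

```python
from email import message_from_string, policy
from email.message import EmailMessage
from typing import Optional


def extract_html_part(raw_message: str) -> Optional[str]:
    """Return the decoded text/html body of a MIME message, or None."""
    msg = message_from_string(raw_message, policy=policy.default)
    # Ask only for an HTML body; a plain-text-only mail yields None.
    body = msg.get_body(preferencelist=("html",))
    return body.get_content() if body is not None else None


def make_sample() -> str:
    """Build a small multipart/alternative message for demonstration."""
    msg = EmailMessage()
    msg["Subject"] = "sample"
    msg.set_content("plain text fallback")
    msg.add_alternative("<html><body><p>Hello</p></body></html>", subtype="html")
    return msg.as_string()


print(extract_html_part(make_sample()))
```

The returned string can then be written to a file and handed to whichever renderer is used for the screenshot.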
Here is a file I tested, which was from Apr. 14, so I extract the one mail from the mailfolder:
sed -n "/From - Sat Apr 14/,/From -/p" /home/stefan/.mozilla-thunderbird/k2jbztqu.default/Mail/Local\ Folders-1/Archives.sbd/sample | \
sed -n '/<html>/,/<\/html>/p' | wkhtmltopdf - - > sample.pdf
then I extract just the html-part of that.
wkhtmltopdf needs - - for reading stdin/writing to stdout. The PDF is rendered, but I don't know how to integrate it into your workflow.
You can replace wkhtml ... with
convert - sample.jpg
I'm going with wkhtmltoimage. This worked once xvfb was set up correctly. The PostScript suggestion did not render correctly, and we need an image, not a PDF.
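For reference, one way to combine the two tools is to wrap wkhtmltoimage in xvfb-run. The sketch below only builds the argument list; the function name and file names are hypothetical, and it assumes both xvfb-run and wkhtmltoimage are installed and on PATH.

```python
from typing import List


def build_screenshot_command(html_path: str, image_path: str) -> List[str]:
    """Return an argument list that runs wkhtmltoimage under a virtual X server."""
    # xvfb-run -a picks a free display number automatically.
    return ["xvfb-run", "-a", "wkhtmltoimage", html_path, image_path]


command = build_screenshot_command("mail_part.html", "thumbnail.jpg")
print(command)
# To actually run it:
#   import subprocess; subprocess.run(command, check=True)
```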

How to create a PDF in reStructuredText?

The documentation for the uncertainties Python package is written in reStructuredText, for the Sphinx documentation system. The HTML looks fine. I would like to create a PDF version. The goal is to have a "chapter" for each of the web pages.
However, what happens is that the PDF generated by the ReST files transforms the (HTML) sections of index.html into individual chapters (which I don't want: the PDF should have them as sections too). Another problem is that all HTML pages after the main page appear in the PDF as subsections of the section where the toctree directive appears (i.e., in the Acknowledgment section of the main page).
So, how should the ReST file be structured so that (1) the web documents look the same as they are now, and (2) each web page corresponds to a PDF chapter. Any help would be much appreciated!
There is a solution. If I remember correctly, the key points were:
Use a special Table of Contents as the master document (I used index_TOC.rst instead of the default index.rst): in conf.py
master_doc = 'index_TOC'
latex_documents = [('index_TOC', 'uncertaintiesPythonpackage.tex',…]
The new Table of Contents file index_TOC.rst contains a ToC like
TOC
===

.. toctree::
   :hidden:
   :maxdepth: 1

   index
   user_guide
   numpy_guide
   tech_guide
Thus, the web version still opens onto the main index.rst text, and the PDF (LaTeX) version has each ReST file in a separate chapter.