I'm building a custom print template closely following the directions in Chuck Ainslie's articles. One thing I'd like to do is generate a table of contents on the fly with the actual page numbers.
Is there any way to find what part of the document a layoutrect instance contains? Basically, I want to scan the original document for specific tags (say <h1> tags), then figure out which layoutrect contains those tags. From there I can figure out which devicerect is the parent and that tells me the page number.
During the layout, when the onLayoutComplete handler is called, there doesn't seem to be any way to get the source of what was actually laid out.
I managed to generate a dynamic table of contents (dynamic in the sense that the page numbers are collected at print time and are not in the static html page). It isn't pretty though.
I couldn't find any way to determine what part of my document was being flowed into the specific device and layout rects. Instead, I break my document up into individual html files, one for each section. To print, I create an html file that looks roughly like this:
<html>
<body>
<h1>Table of Contents</h1>
<table>
<tr><th>Report</th><th>Page</th></tr>
<tr><td>Foo</td><td>0</td></tr>
<tr><td>Bar</td><td>0</td></tr>
<tr><td>Etc</td><td>0</td></tr>
...
</table>
</body>
</html>
Rather than send that document to the printer, my print template pulls out all the section URLs from the document object. It then sends each of these to the printer and for each, it tracks the page number the section is printed on. When printing is complete, it updates the original document and replaces the '0' placeholders with the actual page numbers. Then the table of contents is printed.
It's not very elegant and now I have to add the rest of the UI around my print template code.
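The bookkeeping behind that approach can be sketched in a few lines. This is only an illustration in Python, not the print template code itself: the function names and the per-section page counts are hypothetical, and it assumes the first section starts on page 1 and that the table of contents pages are not counted.

```python
def assign_start_pages(sections):
    """Return (title, first_page) pairs for sections given in print order.

    `sections` is a list of (title, page_count) tuples; the page counts
    stand in for whatever the print template records while printing.
    """
    toc = []
    next_page = 1  # assumption: numbering starts at 1 with the first section
    for title, page_count in sections:
        toc.append((title, next_page))
        next_page += page_count
    return toc


def fill_toc(toc_html, toc):
    """Replace the '0' placeholder in each table row with the real page number."""
    for title, page in toc:
        placeholder = f"<tr><td>{title}</td><td>0</td></tr>"
        filled = f"<tr><td>{title}</td><td>{page}</td></tr>"
        toc_html = toc_html.replace(placeholder, filled)
    return toc_html


toc = assign_start_pages([("Foo", 3), ("Bar", 2), ("Etc", 1)])
rows = "<tr><td>Foo</td><td>0</td></tr><tr><td>Bar</td><td>0</td></tr>"
updated_rows = fill_toc(rows, toc)
```

Matching on the whole row rather than on the bare '0' keeps the replacement from touching a zero that happens to appear elsewhere in the page.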
I am trying to automate a workflow for automatically creating HTML newsletters based on information stored in a spreadsheet.
Currently, I am using a newsletter drag and drop tool, in which several pre-programmed blocks are available (e.g. full column block, 2 column block etc). When creating a newsletter, I drag and drop a block and fill in my content (e.g. uploading an image, inserting a url). This is all well and good; however, since I have to create the same newsletter in 10 different languages, this process is quite time-consuming and prone to human error. While all newsletters are the same in terms of layout, the images and urls differ.
To solve this issue, I would like to get rid of the drag and drop process, and instead automate the workflow in some other way.
One idea that I have already tried, but that doesn't seem like the perfect option to me, is to dynamically create the needed HTMLs in Excel. Basically, the idea is to take the existing block template structure, and put it into Excel with some formulas.
I could then copy and paste the links to the images (in a simple format, such as EN1.jpg, ES1.jpg, etc.), as well as to urls (url.com, url.es).
This is some example block:
<img alt="" align="center" width="700" style="max-width:700px;" class="resetWidth" border="0" src="IMAGE" />
My final expected result is something like this:
I define the layout in a very quick manner (e.g. writing fullcolumn, half column, fullcolumn). The corresponding code is taken from the template. I then provide the attributes (image url, link url) in the form of a list or so. The end result should then be 10 html files that I simply have to upload to the newsletter software.
I would appreciate it very much if anyone had any ideas on this.
Another option for translating the page is to do something like this https://www.w3schools.com/howto/howto_google_translate.asp
it adds a selection for languages to translate into.
As for automating the images, you could set up a folder for each language and reuse the same image file names in every folder, so each image ends up in the correct location.
All you'll have to do is replace the images with files of the same names and swap the default language on the Google Translator.
That way the html stays the same with regard to the image names.
For the link variables you may be able to write some JS or another language to take advantage of the
<html lang="">
and based on which lang is set, insert a set of links to the file.
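The workflow from the question (a layout list plus per-language image and link values) could also be scripted outside the browser. Below is a minimal sketch in Python, not a finished tool: the `BLOCKS` table, the function names, the `<a>` wrapper around the image, and the output file naming are all assumptions made for illustration. Only the `<img>` markup is taken from the example block above.

```python
from pathlib import Path
from string import Template

# Hypothetical block templates keyed by layout name. The <img> markup is the
# example block from the question; wrapping it in a link is an assumption.
BLOCKS = {
    "fullcolumn": Template(
        '<a href="$link"><img alt="" align="center" width="700" '
        'style="max-width:700px;" class="resetWidth" border="0" '
        'src="$image" /></a>'
    ),
}


def build_newsletter(lang, layout, content):
    """Fill one block per layout entry with that language's image and link."""
    if len(layout) != len(content):
        raise ValueError("layout and content must have the same length")
    parts = [BLOCKS[name].substitute(values) for name, values in zip(layout, content)]
    body = "\n".join(parts)
    return f'<html lang="{lang}">\n<body>\n{body}\n</body>\n</html>\n'


def write_all(layout, content_by_lang, out_dir):
    """Write one HTML file per language into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for lang, content in content_by_lang.items():
        page = build_newsletter(lang, layout, content)
        (out / f"newsletter_{lang}.html").write_text(page, encoding="utf-8")


example = build_newsletter(
    "es", ["fullcolumn"], [{"image": "ES1.jpg", "link": "https://url.es"}]
)
```

The per-language values could just as well be read from a spreadsheet export (for instance with the standard `csv` module) instead of being typed in as dictionaries.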
I've noticed a few inconsistencies when trying to use the headerTemplate and footerTemplate options with page.pdf:
The DPI for headers and footers seems to be lower (72 vs 96 for the main body, I think). So if I'm trying to match the margins, I have to scale by that.
Styles are not shared with the main body so I have to include them in the template.
If I try to use a locally stored font, it works on the main body but not in the header/footer even if I include the same CSS in the header/footer template.
I suspect that this happens because headers and footers are treated as separate documents and converted to image/pdf separately (https://cs.chromium.org/chromium/src/components/printing/resources/print_header_footer_template_page.html also implies something like that). Can someone familiar with the implementation explain how it actually works? Thanks!
Short Answer:
Puppeteer controls Chrome or Chromium over the DevTools Protocol.
Chromium uses Skia for PDF generation.
Skia handles the header, set of objects, and footer separately.
Detailed Answer:
From the Puppeteer Documentation:
page.pdf(options)
options <Object> Options object which might have the following properties:
headerTemplate <string> HTML template for the print header. Should be valid HTML markup with following classes used to inject printing values into them:
date formatted print date
title document title
url document location
pageNumber current page number
totalPages total pages in the document
footerTemplate <string> HTML template for the print footer. Should use the same format as the headerTemplate.
returns: <Promise<Buffer>> Promise which resolves with PDF buffer.
NOTE Generating a pdf is currently only supported in Chrome headless.
NOTE headerTemplate and footerTemplate markup have the following limitations:
Script tags inside templates are not evaluated.
Page styles are not visible inside templates.
We can learn from the Puppeteer source code for page.pdf() that:
The Chrome DevTools Protocol method Page.printToPDF (along with the headerTemplate and footerTemplate parameters) is sent to page._client.
page._client is an instance of page.target().createCDPSession() (a Chrome DevTools Protocol session).
From the Chrome DevTools Protocol Viewer, we can see that Page.printToPDF contains the parameters headerTemplate and footerTemplate:
Page.printToPDF
Print page as PDF.
PARAMETERS
headerTemplate string (optional)
HTML template for the print header. Should be valid HTML markup with following classes used to inject printing values into them:
date: formatted print date
title: document title
url: document location
pageNumber: current page number
totalPages: total pages in the document
For example, <span class=title></span> would generate span containing the title.
footerTemplate string (optional)
HTML template for the print footer. Should use the same format as the headerTemplate.
RETURN OBJECT
data string
Base64-encoded pdf data.
The Chromium source code for Page.printToPDF shows us that:
The Page.printToPDF parameters are passed to the sendDevToolsMessage function, which issues a DevTools protocol command and returns a promise for the results.
After further digging, we can see that Chromium has a concrete implementation of a class called SkDocument that creates PDF files.
SkDocument comes from the Skia Graphics Library, which Chromium uses for PDF generation.
The Skia PDF Theory of Operation, in the PDF Objects and Document Structure section, states that:
Background: The PDF file format has a header, a set of objects and then a footer that contains a table of contents for all of the objects in the document (the cross-reference table). The table of contents lists the specific byte position for each object. The objects may have references to other objects and the ASCII size of those references is dependent on the object number assigned to the referenced object; therefore we can’t calculate the table of contents until the size of objects is known, which requires assignment of object numbers. The document uses SkWStream::bytesWritten() to query the offsets of each object and build the cross-reference table.
The document explains further down:
The PDF backend requires all indirect objects used in a PDF to be added to the SkPDFObjNumMap of the SkPDFDocument. The catalog is responsible for assigning object numbers and generating the table of contents required at the end of PDF files. In some sense, generating a PDF is a three step process. In the first step all the objects and references among them are created (mostly done by SkPDFDevice). In the second step, SkPDFObjNumMap assigns and remembers object numbers. Finally, in the third step, the header is printed, each object is printed, and then the table of contents and trailer are printed. SkPDFDocument takes care of collecting all the objects from the various SkPDFDevice instances, adding them to an SkPDFObjNumMap, iterating through the objects once to set their file positions, and iterating again to generate the final PDF.
Thanks to the other answer (https://stackoverflow.com/a/51460641/364131) and codesearch, I think I found most of the answers I was looking for.
The printing implementation is in PrintPageInternal. It uses two separate WebFrames — one to render the content, and one to render the header and footer. The rendering for the header and footer is done by creating a special frame, writing the contents of print_header_and_footer_template_page.html to this frame, calling the setup function with the options provided and then printing to a shared canvas. After this, the rest of the contents of the page are printed on the same canvas within the bounds defined by the margins.
Headers and footers are scaled by a fudge_factor which isn't applied to the rest of the content. There might be something funny going on here with the DPIs (which might explain the fudge_factor of 1.33333333f which is equal to 96/72).
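If the fudge_factor really is a 96/72 DPI ratio, matching a body margin in the header comes down to one division. The following sketch only does that arithmetic in Python; the function name is hypothetical, and the direction of the correction (header content drawn 96/72 times larger than body content) is an assumption based on the observation in the question, not something verified against the Chromium source.

```python
BODY_DPI = 96
HEADER_DPI = 72
FUDGE_FACTOR = BODY_DPI / HEADER_DPI  # 1.3333..., the value seen in the source


def header_length_for(body_px):
    """Length to use in the header/footer template so that it matches
    `body_px` in the main document, assuming the header is scaled up
    by FUDGE_FACTOR when it is drawn."""
    return body_px / FUDGE_FACTOR


header_margin = header_length_for(96)  # roughly 72
```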
I'm guessing this special frame is what prevents the header and footer from sharing the same resources (styles, fonts etc.) as the contents of the page. It probably isn't set up to load (and wait for) any additional resources requested by the header and footer templates, which is why the requested fonts don't load.
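Since the header and footer live in their own frame, a template has to carry its own styling. A minimal sketch of building such a template string is shown below in Python. The function name and the default style values are hypothetical, and the resulting string would still have to be passed as headerTemplate by whatever client issues Page.printToPDF. Only the class names (title, pageNumber, totalPages) come from the documentation quoted earlier.

```python
def build_header_template(font_family="sans-serif", font_size_px=10, margin_px=20):
    """Return header markup with all styling inline, because page styles
    are not visible inside the template."""
    style = (
        f"font-family: {font_family}; font-size: {font_size_px}px; "
        f"width: 100%; margin: 0 {margin_px}px; "
        "display: flex; justify-content: space-between;"
    )
    return (
        f'<div style="{style}">'
        '<span class="title"></span>'
        '<span><span class="pageNumber"></span> / '
        '<span class="totalPages"></span></span>'
        "</div>"
    )


header_template = build_header_template()
```

This does not address the local font problem described above; it only covers the styles that can be expressed inline.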
I did a lot of research on this issue and finally implemented a small library that handles it with a small hack:
I create two PDF files. The first one is the HTML content without header and footer. The second one is the header and footer repeated once for each page of the content PDF. The library then merges the two together.
You can find it here:
https://github.com/PejmanNik/puppeteer-report
I'm currently working on a blog using Django and SQLite for the back end. In my setup, I stored my articles in the database in this sort of form:
<p> <strong>The Time/Money Tradeoff</strong> </p> <p> As we flesh out High Life, Low Price, you will notice that sometimes we will suggest deals and solutions that may cost slightly more than their alternatives. We won’t always suggest the cheapest laptop...
On the page itself, I have this code for where I use the session data:
<p>{{request.session.article.0.blog_article}}</p>
I had assumed that the web browser would be able to read the HTML tags. However, it prints on the page in that form, with the visible <p> tags and the like. I think this is because it's stored as a Unicode string in the database and is put onto the page between two quotation marks. If I paste the HTML code onto the page, the format looks like I wanted it to look, but I want it to be an automated process (tell Django which article ID I want, it plugs the elements of the page into the template and everything looks great).
How can I get the stored article in a form where the page can see the HTML tags?
By default, Django autoescapes every string rendered in a template, so HTML stored in a variable shows up as literal markup instead of being interpreted. You can use the safe filter to turn this off:
<p>{{request.session.article.0.blog_article|safe}}</p>
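To see why the tags became visible, it helps to look at what escaping does to the stored string. The sketch below uses Python's standard `html.escape` as a stand-in; it is not Django's own code, but it performs the same kind of substitution on the angle brackets.

```python
from html import escape

# A shortened version of the stored article text from the question.
stored = "<p> <strong>The Time/Money Tradeoff</strong> </p>"

# Every < and > becomes an entity, so the browser displays the tags as text
# instead of interpreting them.
escaped = escape(stored)
# escaped == "&lt;p&gt; &lt;strong&gt;The Time/Money Tradeoff&lt;/strong&gt; &lt;/p&gt;"
```

The safe filter skips this step, which is fine for articles you wrote yourself but should not be applied to text submitted by untrusted users.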
I have an HTML page which needs to display some HTML generated by the user on the Administration area (like a blog, for instance). The problem is that the user sometimes needs to copy-paste tables and other "garbage" content from Word/Excel to the WYSIWYG editor (that has the proper "paste from Word" function). This causes the resulting HTML code to be very dirty.
It wouldn't be a problem unless some of these pages are shown totally wrong: other divs AFTER user's HTML code are not in their supposed position, floats are not respected... etc...
I tried putting a div:
<div style="clear: both;"></div>
without success. I even tried with iFrames, but iFrames accept only external webpages (if applicable...).
The question is: is there any tag or method to put a part of an HTML code inside a webpage discarding all formatting AFTER this code?
Thank you.
To my knowledge, you simply want to end all divs. HTML is a very simple language, with very simple options. Your example doesn't work because HTML isn't that advanced. You can either open a tag <...> or close a tag </...>.
Ideally what you want is a piece of code that puts their work in a separate frame entirely, so as soon as the page passes their code, it goes back to the correct formatting.
Or, you could be really sloppy and put one hundred </div>'s in, just in case.
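Rather than guessing how many closing tags are needed, the number of divs left open by the pasted fragment can be counted and exactly that many closers appended. This is a rough sketch in Python using the standard `html.parser` module; the class and function names are hypothetical, and it only balances div elements. Stray closing tags and other unclosed elements are left untouched.

```python
from html.parser import HTMLParser


class DivCounter(HTMLParser):
    """Track how many <div> elements are still open after parsing."""

    def __init__(self):
        super().__init__()
        self.open_divs = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.open_divs += 1

    def handle_endtag(self, tag):
        # Ignore closing tags that have no matching opener.
        if tag == "div" and self.open_divs > 0:
            self.open_divs -= 1


def close_open_divs(fragment):
    """Append one </div> for every div the fragment leaves open."""
    counter = DivCounter()
    counter.feed(fragment)
    counter.close()
    return fragment + "</div>" * counter.open_divs


fixed = close_open_divs("<div><div><p>pasted from Word</p></div>")
```

A dedicated HTML sanitizer would be the more robust choice for real user content; this only shows the counting idea.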
I have two .png files that I would like to display along with some text in an html page that is being dynamically generated by Perl. In a perfect world this would be in a 3 column layout with the text in the first column and the two .png files in the second and third columns. The problem I keep coming across is the need for one of these tags:
print "Content-type: image/png\n\n";
print "Content-type: text/html\n\n";
I can only include one in the html and it means I can only display the text or the images but not both.
Obviously there must be a way to do this but so far Googling has gotten me nowhere.
Any advice is appreciated.
Regards.
Here are the print statements that I am using to generate the dynamic html:
print "<html>";
print "<head>";
print "Content-type: text/html\n\n";
print "</head>";
print "<body>";
print "<table>";
print "<tr><td>Number of Sentences = $numSentences</td><td rowspan=\"4\"><img src=\"/home/kfarmer/public_html/kevin/charFreq.png\" /></td><td rowspan=\"4\"><img src=\"/home/kfarmer/public_html/kevin/charFreq.png\" /></td></tr>";
print "<tr><td>Number of Words = $numWords</td></tr>";
print "<tr><td>Number of Unique Words = $numUniqueWords</td></tr>";
print "<tr><td>Number of English Characters = $numEnglishCharacters</td></tr>";
print "</table></body></html>";
However, now when I generate this page I get a popup asking me how I would like to open the file -- and the file is just what I've printed above.
One HTTP response sends one thing, and each thing gets its own content header. You don’t need to include everything that will be in the page in one response.
From Perl, just link to the image files using the HTML features for that. In the HTML, you’d include:
<img src="url to PNG"/>
If there’s a CGI script serving up a dynamic image, you’d just link to it in the HTML:
<img src="/cgi-bin/imager.png?key=value;..."/>
That CGI script has to return the proper content type for what it’s sending back, whether HTML or image data.
The browser will interpret the HTML, see the links for the images, and make additional requests to get the images.
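The order of the output is what matters: the content type header and a blank line come first, then the HTML, and the images are only referenced by URL. A minimal sketch of that response is given below in Python rather than Perl, purely to show the shape of the output. The function name, the page title and the image URLs are hypothetical.

```python
def build_page(stats, image_urls):
    """Return a complete CGI response: header, blank line, then HTML.

    The text goes in the first table column and each image in its own
    column. The images are fetched by the browser in separate requests.
    """
    text_cell = "<br />".join(f"{label} = {value}" for label, value in stats)
    image_cells = "".join(f'<td><img src="{url}" /></td>' for url in image_urls)
    html = (
        "<html><head><title>Text statistics</title></head><body>"
        f"<table><tr><td>{text_cell}</td>{image_cells}</tr></table>"
        "</body></html>"
    )
    # The header must be the very first thing in the output, before <html>.
    return "Content-type: text/html\n\n" + html


response = build_page(
    [("Number of Sentences", 12), ("Number of Words", 240)],
    ["/kevin/charFreq.png", "/kevin/wordFreq.png"],  # hypothetical URLs
)
```

In a CGI script the returned string would simply be printed to standard output.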
Your problem is more likely that your image paths are absolute paths on the drive, rather than paths relative to your web server's root.
I noticed that your img src is '/home/kfarmer/public_html/kevin/charFreq.png'
So, if your webserver's root is '/home/kfarmer/public_html/kevin' and that is what will load when you go to someurl.com, then your img src should just be '/charFreq.png'
If the root of your server is '/home/kfarmer/public_html' then your img src should be 'kevin/charFreq.png'
This is more complicated than that, as urls are relative to where the page is loaded from.
So, if the page is www.someurl.com/kevin/anotherfolder/mypage.html and you want to load an image in 'kevin/someotherplace/charFreq.png' then you need your img src to be either '/kevin/someotherplace/charFreq.png' OR '../someotherplace/charFreq.png'
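The way a browser resolves those two forms can be checked with Python's standard `urllib.parse.urljoin`, which applies the same resolution rules. The page URL below is the one from the example above, with an `http://` scheme added as an assumption.

```python
from urllib.parse import urljoin

page = "http://www.someurl.com/kevin/anotherfolder/mypage.html"

# A path starting with "/" is resolved against the server root.
from_root = urljoin(page, "/kevin/someotherplace/charFreq.png")

# A path starting with "../" is resolved against the page's own folder.
from_page = urljoin(page, "../someotherplace/charFreq.png")

# Both give http://www.someurl.com/kevin/someotherplace/charFreq.png
```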
I am afraid you are mixing up some definitions/technologies.
Content-type is part of the HTTP header. When your browser fetches a web page it normally uses HTTP, which consists of multiple header fields and a data block (for example containing the data which got requested). If you use a (decent) web server you normally do not need to worry about HTTP headers.
In order to show a 3 column page with two (or any other number of) images and some text, you need to generate an HTML page. On this page you tell the browser that it should download the images and present them alongside the text.
In order to do so, you have to generate the HTML page in a pre-defined format containing so-called elements:
<html>
<head><title>mypage</title></head>
<body>
This is the visible part of the page
</body>
</html>
Now in the body element, you can generate your text and images. The images can be placed onto the page using the img element.
Check out w3schools for more extensive tutorials on these subjects. Once you have gained some more basic knowledge, you can try to generate an HTML page using Perl (or any other scripting language).