Proper way to convert HTML to PDF


I want to convert an HTML page to PDF. There are several options, but each has problems:
Print the HTML page in IE through PDFCreator (too cumbersome)
Use wkhtmltopdf (low quality)
Use PhantomJS (low quality)
Could I combine them, for example print with PhantomJS through PDFCreator, or improve the output quality of wkhtmltopdf? Or is there something else?

Maybe you can try with Amyuni WebkitPDF. It's not open source, but it's free for commercial use, and it can be used from C#.
Sample code for C# from the documentation:
static private void SaveToFile(string url, string file)
{
    // Store the WebkitPDFContext returned value in an IntPtr
    IntPtr context = IntPtr.Zero;
    // Open the URL. The WebkitPDFContext returned value will be stored in
    // the passed in IntPtr
    int ret = WKPDFOpenURL(url, out context, 0, false);
    if (ret == 0)
    {
        // if ret is 0, then we succeeded in opening the URL.
        // Save the result as PDF to a file. Use the obtained context value
        ret = WKPDFSaveToFile(file, context);
    }
    if (ret != 0)
        Debug.WriteLine("Failed to run SaveToFile on '" + url + "' to generate file '" + file + "'");
    // Make sure to close the WebkitPDFContext because otherwise the
    // internal PDFCreator as well as other objects will not be released
    WKPDFCloseContext(context);
}
Usual disclaimer applies.

You can properly convert HTML to PDF using GroupDocs.Conversion for .NET API.
Have a look at the code:
// Set up the conversion configuration
ConversionConfig config = new ConversionConfig();
config.StoragePath = "source file storage path";
// Initialize the ConversionHandler
ConversionHandler conversionHandler = new ConversionHandler(config);
// Convert and save the converted document
var convertedDocumentPath = conversionHandler.Convert("sample.html", new PdfSaveOptions{});
convertedDocumentPath.Save("result-" + Path.GetFileNameWithoutExtension("sample.html") + ".pdf");
Disclosure: I work as Developer Evangelist at GroupDocs.

Patched wkhtmltopdf (a very good, fast WebKit-based command-line tool) with the --print-media-type and --no-stop-slow-scripts options
chromium --headless --no-zygote --single-process ... --print-to-pdf= ... (slower, Portrait orientation only)
Chromium headless via the DevTools Protocol (slow; only a few programming languages have bindings for it)
A wrapper around the Blink engine (e.g., Qt5 https://code.qt.io/cgit/qt/qtwebengine.git/tree/examples/webenginewidgets/html2pdf?h=5.15)
If you believe in containers: https://github.com/thecodingmachine/gotenberg (internally, Chromium headless via the DevTools Protocol)

Use Google Chrome's Save as PDF.
The output looks exactly the same as the page rendered by Chrome.
Here, I use Puppeteer to automate the process, for a single file or for a whole folder:
https://github.com/FuPeiJiang/puppeteer-pdf

Related

Selenium 4 + geckodriver: printing html5 webpage to PDF with Page.printToPDF

With Selenium 4 and chromedriver, I succeeded printing websites to PDF with custom page sizes (see Python code below). I would like to know the equivalent to do this with geckodriver/firefox.
import base64
import json
from selenium import webdriver
def send_devtools(driver, cmd, params=None):
    # Send a raw DevTools command through chromedriver's vendor-specific endpoint
    resource = "/session/%s/chromium/send_command_and_get_result" % driver.session_id
    url = driver.command_executor._url + resource
    body = json.dumps({'cmd': cmd, 'params': params or {}})
    response = driver.command_executor._request('POST', url, body)
    # Returns None when the response carries no 'value'
    return response.get('value')
def save_as_pdf(driver, path, options=None):
    result = send_devtools(driver, "Page.printToPDF", options)
    if result is not None:
        # The PDF arrives base64-encoded in the 'data' field
        with open(path, 'wb') as file:
            file.write(base64.b64decode(result['data']))
        return True
    return False
options = webdriver.ChromeOptions()
# headless mode is mandatory, otherwise saving to PDF won't work
options.add_argument("--headless")
driver = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver', options=options)
driver.get(r'https://example.my')
send_devtools(driver, "Emulation.setEmulatedMedia", {'media': 'screen'})
pdf_options = { 'paperHeight': 92, 'paperWidth': 8, 'printBackground': True }
save_as_pdf(driver, 'myfilename.pdf', pdf_options)

Did you try wkhtmltopdf?
wkhtmltopdf and wkhtmltoimage are open source (LGPLv3) command line tools to render HTML into PDF and various image formats using the Qt WebKit rendering engine. These run entirely "headless" and do not require a display or display service.
Example usage:
wkhtmltopdf http://google.com google.pdf
If you want to do it with python, after installation you can invoke with:
import os
number = iter(range(100))
def html_to_pdf(link, name="test"):
    if os.path.isfile(f"{name}.pdf"):  # a PDF with this name already exists
        name = name + str(next(number))
    os.system(f"wkhtmltopdf {link} {name}.pdf")
Additionally, you can use subprocess.run if you want to call wkhtmltopdf with more parameters; html_to_pdf becomes more useful as you add them. You can check the full documentation with:
wkhtmltopdf -H
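The subprocess.run variant mentioned above could look roughly like the sketch below. The helper names (unique_pdf_path, build_command) and the "-1", "-2" suffix scheme are illustrative assumptions, not part of wkhtmltopdf; the only flag used is --print-media-type, which another answer on this page mentions. Passing the arguments as a list avoids going through a shell, so the URL is not re-interpreted.

```python
import subprocess
from pathlib import Path


def unique_pdf_path(name: str, directory: str = ".") -> Path:
    """Return '<name>.pdf', or '<name>-1.pdf', '<name>-2.pdf', ... if that is taken."""
    candidate = Path(directory) / f"{name}.pdf"
    counter = 1
    while candidate.exists():
        candidate = Path(directory) / f"{name}-{counter}.pdf"
        counter += 1
    return candidate


def build_command(link: str, output: Path, extra_args=()) -> list:
    """Build the argument list: options first, then the input URL, then the output file."""
    return ["wkhtmltopdf", *extra_args, link, str(output)]


def html_to_pdf(link: str, name: str = "test", extra_args=()) -> Path:
    """Render `link` to a PDF that does not overwrite an existing file."""
    output = unique_pdf_path(name)
    # check=True raises CalledProcessError if wkhtmltopdf exits with a non-zero status
    subprocess.run(build_command(link, output, extra_args), check=True)
    return output
```

For example, html_to_pdf("http://google.com", "google", ["--print-media-type"]) would write google.pdf, or google-1.pdf if that file already exists, assuming wkhtmltopdf is installed and on the PATH.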

To print a page as PDF there is a specific WebDriver command that can be used for cross-browser automation. That means that there is no need to write custom code, which utilizes the Chrome DevTools protocol, as done above for Chrome.
For both Chrome and Firefox this command is already available in Selenium 3.141, and should also work without modifications for Selenium 4.
The command will return the base64 encoded PDF data in the response's payload, and would require you to save it to a file yourself.
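As a rough sketch of that approach, assuming the Python bindings of Selenium 4 (where, to my knowledge, the command is exposed as driver.print_page together with a PrintOptions class) and geckodriver on the PATH: the function names and the A4 page size below are illustrative choices, not something taken from the question.

```python
import base64
from pathlib import Path


def save_base64_pdf(data: str, path: str) -> Path:
    """Decode the base64 payload returned by the print command and write it to disk."""
    target = Path(path)
    target.write_bytes(base64.b64decode(data))
    return target


def print_with_firefox(url: str, path: str) -> Path:
    """Open `url` in headless Firefox and save the printed page as a PDF."""
    # Imported here so the helper above stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.print_page_options import PrintOptions

    options = webdriver.FirefoxOptions()
    options.add_argument("--headless")
    driver = webdriver.Firefox(options=options)  # assumes geckodriver is on PATH
    try:
        driver.get(url)
        print_options = PrintOptions()
        print_options.background = True
        # Page size in centimetres; A4 is an arbitrary example
        print_options.page_width = 21.0
        print_options.page_height = 29.7
        return save_base64_pdf(driver.print_page(print_options), path)
    finally:
        driver.quit()
```

The same print_page call should work with a Chrome driver as well, since the command belongs to the WebDriver protocol rather than to a browser-specific DevTools endpoint.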

Issue:
With Firefox and geckodriver, the code from the question apparently fails when writing to the file, so the target document is not saved.
Solution:
So I tweaked the code. It now opens the website using geckodriver on Firefox, takes a screenshot of the body element using find_element_by_tag_name(), converts the screenshot to RGB mode at its original dimensions, and saves it as a PDF document using Pillow.
Code:
from PIL import Image
from io import BytesIO
from selenium import webdriver
driverOptions = webdriver.FirefoxOptions()
# Uncomment the below line and change the path according to your configurations if you encounter an error like "Expected browser binary location ..."
# driverOptions.binary_location = '/Applications/Firefox.app/Contents/MacOS/firefox'
driverOptions.add_argument("--headless")
webDriver = webdriver.Firefox(executable_path = '/usr/local/bin/geckodriver', options = driverOptions)
webDriver.get(f'https://stackoverflow.com')
websiteScreenshot = Image.open(BytesIO(webDriver.find_element_by_tag_name('body').screenshot_as_png))
rgbImage = Image.new('RGB', websiteScreenshot.size, (255, 255, 255))
rgbImage.paste(websiteScreenshot, mask=websiteScreenshot.split()[3])
rgbImage.save('fileName.pdf', "PDF", resolution=100)
webDriver.quit()
References:
Browser Binary Location Issue
Converting a screenshot to PDF
Additional:
You can download the Geckodriver for Firefox based on your configurations from here, happy coding! 😊

How to programmatically read-write scripts for offline usage in chrome extension?

I need predefined scripts, accessible from a Chrome content script, that can be updated automatically from a given URL.
Exactly what I do:
I have content_script.js. Inside it, I'd like to create an iframe for the current page from predefined HTML+CSS+JS. Sometimes the HTML, CSS, or JS changes. I want to avoid updating the extension; instead, whenever the user has an internet connection, the extension should load the fresh HTML+CSS+JS for later offline use.
So, how can I read and write internal files within the extension from a content script (or delegate this task to a background script)?

You can use HTML5 Filesystem to have a read/write place for files, or just store it as strings in chrome.storage (with "unlimitedStorage" permission as needed) for later reuse.
This code can then be executed in a content script using executeScript, or, if you enable 'unsafe-eval' for the extension CSP, in the main script (which is dangerous, and should be avoided in most cases).
Note that this Filesystem API carries a warning that it's only supported in Chrome, but that shouldn't be a problem (the Firefox / WebExtensions platform explicitly rejects self-update mechanisms).

You can read extension file contents, but you can't write to the extension folder since it is sandboxed.
To read an extension file, you can send an Ajax request using chrome.runtime.getURL("filepath") as the URL:
var xhr = new XMLHttpRequest();
xhr.open('GET', chrome.runtime.getURL('your file path'), true);
xhr.onreadystatechange = function() {
    if (xhr.readyState === XMLHttpRequest.DONE && xhr.status === 200) {
        var text = xhr.responseText;
        // Do what you want using text
    }
};
xhr.send();

Posting Documents to OneNote via new REST API

For some reason, any document I upload to OneNote via the new REST API is corrupt when viewed from OneNote. Everything else is fine, but the file (for example a Word document) isn't clickable, and if you try to open it, it shows as corrupt.
This is similar to what may happen when there is a problem with the byte array, or when it's in memory, but that doesn't seem to be the case. I use essentially the same process to upload the file bytes to SharePoint, OneDrive, etc. It's only in OneNote that the file seems to be corrupt.
Here is a simplified version of the C#:
HttpRequestMessage createMessage = null;
HttpResponseMessage response = null;
using (var streamContent = new ByteArrayContent(fileBytes))
{
    streamContent.Headers.ContentType = new MediaTypeHeaderValue("application/vnd.openxmlformats-officedocument.wordprocessingml.document");
    streamContent.Headers.ContentDisposition = new ContentDispositionHeaderValue("form-data");
    streamContent.Headers.ContentDisposition.Name = fileName;
    createMessage = new HttpRequestMessage(HttpMethod.Post, authorizationUrl)
    {
        Content = new MultipartFormDataContent
        {
            {
                new StringContent(simpleHtml,
                    System.Text.Encoding.UTF8, "text/html"), "Presentation"
            },
            {streamContent}
        }
    };
    response = await client.SendAsync(createMessage);
    var stream = await response.Content.ReadAsStreamAsync();
    successful = response.IsSuccessStatusCode;
}
Does anyone have any thoughts or working code uploading an actual binary document via the OneNote API via a Windows Store app?

The WinStore code sample contains a working example (method: CreatePageWithAttachedFile) of how to upload an attachment.
The slight differences I can think of between the above code snippet and the code sample are that the code sample uploads a pdf file (instead of a document) and the sample uses StreamContent (while the above code snippet uses ByteArrayContent).
I downloaded the code sample and locally modified it to use a document file and ByteArrayContent. I was able to upload the attachment and view it successfully. Used the following to get a byte array from a given stream:
using (BinaryReader br = new BinaryReader(stream))
{
    // Read the whole stream into a byte array
    byte[] b = br.ReadBytes(Convert.ToInt32(stream.Length));
}
The rest of the code looks pretty similar to the above snippet and overall worked successfully for me.
Here are a few more things to consider while troubleshooting the issue:
Verify the attachment file itself isn't corrupt in the first place. For example, can it be opened without the OneNote API being in the mix?
Verify the API returned a 201 HTTP status code and that the resulting page contains the attachment icon and allows downloading/viewing the attached file.

So, the issue was (strangely) the addition of the content type in the object tag sent over in the HTML content, which isn't shown above. The documentation refers to adding a type=[mime type] attribute in the object tag, and since the WinStore example didn't do this (it only adds the MIME type to the MediaTypeHeaderValue), I removed it and it worked perfectly.
Just changing it to this worked:
<object data-attachment=\"" + fileName + "\" data=\"name:" + attachmentPartName + "\" />
Thanks for pointing me in the right direction with the sample code!

Detecting folders/directories in javascript FileList objects

I have recently contributed some code to Moodle which uses some of the capabilities of HTML5 to allow files to be uploaded in forms via drag and drop from the desktop (the core part of the code is here: https://github.com/moodle/moodle/blob/master/lib/form/dndupload.js for reference).
This is working well, except for when a user drags a folder / directory instead of a real file. Garbage is then uploaded to the server, but with the filename matching the folder.
What I am looking for is an easy and reliable way to detect the presence of a folder in the FileList object, so I can skip it (and probably return a friendly error message as well).
I've looked through the documentation on MDN, as well as a more general web search, but not turned up anything. I've also looked through the data in the Chrome developer tools and it appears that the 'type' of the File object is consistently set to "" for folders. However, I'm not quite convinced this is the most reliable, cross-browser detection method.
Does anyone have any better suggestions?

You cannot rely on file.type. A file without an extension will have a type of "". Save a text file with a .jpg extension and load it into a file control, and its type will display as image/jpeg. And, a folder named "someFolder.jpg" will also have its type as image/jpeg.
Instead, try to read the first byte of the file. If you are able to read the first byte, you have a file. If an error is thrown, you probably have a directory:
try {
    await file.slice(0, 1).arrayBuffer();
    // it's a file!
}
catch (err) {
    // it's a directory!
}
If you are in the unfortunate position of supporting IE11, the file will not have the arrayBuffer method. You have to resort to the FileReader object:
// use this code if you support IE11
var reader = new FileReader();
reader.onload = function (e) {
    // it's a file!
};
reader.onerror = function (e) {
    // it's a directory!
};
reader.readAsArrayBuffer(file.slice(0, 1));

I also ran into this problem and below is my solution. Basically, I took a two-pronged approach:
(1) Check whether the File object's size is large, and consider it to be a genuine file if it is over 1MB (I'm assuming folders themselves are never that large).
(2) If the File object is smaller than 1MB, then I read it using FileReader's 'readAsArrayBuffer' method. Successful reads call 'onload' and I believe this indicates the file object is a genuine file. Failed reads call 'onerror' and I consider it a directory. Here is the code:
var isLikelyFile = null;
// Anything over 1MB is assumed to be a genuine file
if (f.size > 1048576){ isLikelyFile = true; }
else{
    var reader = new FileReader();
    reader.onload = function (result) { isLikelyFile = true; };
    reader.onerror = function(){ isLikelyFile = false; };
    reader.readAsArrayBuffer(f);
}
//wait for reader to finish : should be quick as file size is < 1MB ;-)
var interval = setInterval(function() {
    if (isLikelyFile != null){
        clearInterval(interval);
        console.log('finished checking File object. isLikelyFile = ' + isLikelyFile);
    }
}, 100);
I tested this in FF 26, Chrome 31, and Safari 6, and all three browsers call 'onerror' when attempting to read directories. Let me know if anyone can think of a use case where this fails.

I propose calling FileReader.readAsBinaryString on the File object. In Firefox, this will raise an exception when the File is a directory. I only do this if the File meets the conditions proposed by gilly3.
Please see my blog post at http://hs2n.wordpress.com/2012/08/13/detecting-folders-in-html-drop-area/ for more details.
Also, version 21 of Google Chrome now supports dropping folders. You can easily check if the dropped items are folders, and also read their contents.
Unfortunately, I don't have any (client-side) solution for older Chrome versions.

One other note is that type is "" for any file that has an unknown extension. Try uploading a file named test.blah and the type will be empty. And try dragging and dropping a folder named test.jpg: type will be set to "image/jpeg". To be 100% correct, you can't depend solely on type (or at all, really).
In my testing, folders have always been of size 0 (on FF and Chrome on 64-bit Windows 7 and under Linux Mint, which is essentially Ubuntu). So my folder check just tests whether size is 0, and it seems to work for us in our environment. We also don't want 0-byte files uploaded, so if the size is 0 bytes the message comes back as "Skipped - 0 bytes (or folder)".

FYI, this post will tell you how to use dataTransfer API in Chrome to detect file type: http://updates.html5rocks.com/2012/07/Drag-and-drop-a-folder-onto-Chrome-now-available

The best option is to use both the 'progress' and 'load' events on a FileReader instance.
var fr = new FileReader();
var type = '';
// Early terminate reading files.
fr.addEventListener('progress', function(e) {
    console.log('progress - valid file');
    fr.abort();
    type = 'file';
});
// The whole file loads before a progress event happens.
fr.addEventListener('load', function(e) {
    console.log('load - valid file');
    type = 'file';
});
// Not a file. Possibly a directory.
fr.addEventListener('error', function(e) {
    console.log('error - not a file or is not readable by the web browser');
});
fr.readAsArrayBuffer(thefile);
This fires the error handler when presented with a directory and most files will fire the progress handler after reading just a few KB. I've seen both events fire. Triggering abort() in the progress handler stops the FileReader from reading more data off disk into RAM. That allows for really large files to be dropped without reading all of the data of such files into RAM just to determine that they are files.
It may be tempting to say that if an error happens that the File is a directory. However, a number of scenarios exist where the File is unreadable by the web browser. It is safest to just report the error to the user and ignore the item.

An easy method is the following:
Check if the file's type is an empty string: type === ""
Check if the file's size is 0, 4096, or a multiple of it: size % 4096 === 0.
if (file.type === "" && file.size % 4096 === 0) {
    // The file is a folder
} else {
    // The file is not a folder
}
Note: Just by chance, there could be files without a file extension that have the size of some multiple of 4096. Even though this will not happen very often, be aware of it.
For reference, please see the great answer from user Marco Bonelli to a similar topic. This is just a short summary of it.

Is there any way to get command line parameters in Google Chrome extension?

I need to launch Chrome from the command line with a custom parameter that contains the path to a JS file. This path will then be used in an extension.
I went carefully through all the related documentation and clicked through every node in the Chrome debugger, but found nothing resembling command-line parameters. Is it possible to get these parameters at all, or do I need to write a more complex NPAPI extension? (In theory, such an NPAPI extension could get its own process through the Win API, read that process's command line, and so on.)

Hack alert: this post suggests passing a fake URL to open that has all the command-line parameters as query string parameters, e.g.,
chrome.exe http://fakeurl.com/?param1=val1&param2=val2

Perhaps pass the path to your extension in a custom user agent string set via the command line. For example:
chrome.exe --user-agent='Chrome 43. My path is:/path/to/file'
Then, in your extension:
var path = navigator.userAgent.split(":");
console.log(path[1])

Basically I use the technique given in @dfrankow's answer, but I open 127.0.0.1:0 instead of a fake URL. This approach has two advantages:
The name resolution attempt is skipped. OK, if I've chosen the fake URL carefully to avoid opening an existing URL, the name resolution would fail for sure. But there is no need for it, so why not just skip this step?
No server listens on TCP port 0. Using simply 127.0.0.1 is not enough, since it is possible that a web server runs on the client machine, and I don't want the extension to connect to it accidentally. So I have to specify a port number, but which one? Port 0 is the perfect choice: according to RFC 1700, this port number is "reserved", that is, servers are not allowed to use it.
Example command line to pass arguments abc and xyz to your extension:
chrome "http://127.0.0.1:0/?abc=42&xyz=hello"
You can read these arguments in background.js this way:
chrome.windows.onCreated.addListener(function (window) {
    chrome.tabs.query({}, function (tabs) {
        var args = { abc: null, xyz: null }, argName, regExp, match;
        for (argName in args) {
            regExp = new RegExp(argName + "=([^\&]+)");
            match = regExp.exec(tabs[0].url);
            if (!match) return;
            args[argName] = match[1];
        }
        console.log(JSON.stringify(args));
    });
});
Console output (in the console of the background page of the extension):
{"abc":"42","xyz":"hello"}
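If the launcher side is scripted, the URL can be assembled programmatically so that the values are percent-encoded correctly. The following is only a sketch of that idea: the function names are made up for illustration, and the executable name "chrome" is an assumption that depends on the platform and installation.

```python
import subprocess
from urllib.parse import quote, urlencode


def build_args_url(args: dict) -> str:
    """Encode the arguments as a query string on the 127.0.0.1:0 address."""
    # quote_via=quote encodes spaces as %20 rather than '+'
    return "http://127.0.0.1:0/?" + urlencode(args, quote_via=quote)


def launch_chrome(args: dict, executable: str = "chrome") -> subprocess.Popen:
    """Start Chrome with the arguments packed into the start URL."""
    # "chrome" is an assumed executable name; adjust it for your system
    return subprocess.Popen([executable, build_args_url(args)])
```

With args = {"abc": 42, "xyz": "hello"} this produces the same URL as the command line above. Note that the background.js snippet reads the raw matched text, so values containing encoded characters would still need decodeURIComponent on the extension side.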

You could try:
var versionPage = "chrome://version/strings.js";
$.post(versionPage, function(data){
    data = data.replace("var templateData = ", "");
    data = data.slice(0, -1);
    var jsonOb = $.parseJSON(data);
    alert(jsonOb.command_line);
});
This assumes you are using jQuery in your loading sequence; you could always substitute any other AJAX method.

Further to the answers above about using the URL to pass parameters in, note that only Extensions, not Apps, can do this. I've published a Chrome Extension that just intercepts the URL and makes it available to another App.
https://chrome.google.com/webstore/detail/gafgnggdglmjplpklcfhcgfaeehecepg/
The source code is available at:
https://github.com/appazur/kiosk-launcher
for Wi-Fi Public Display Screens

We Keep Coding
