Modern HTML to PDF conversion
Many web apps require some sort of PDF functionality. And as a web developers, we already know one great way to lay out documents — HTML!
But there’s a confusing array of options available for converting HTML to PDF. In this article I’m going to consider the pros and cons of:
- Established tools like wkhtmltopdf that have been in use for years
- Google Chrome in “headless” mode
- Specialist software built specifically for converting HTML to PDF
I should state upfront that you’re reading this on the Paperplane blog. Paperplane is a cloud API for generating PDFs, but it’s just one of many options available. I’ve tried to present a fair and accurate comparison of the trade offs involved in each of the options presented in this article.
The tried and tested approaches
Things changed in April 2017 with the release of Google Chrome 59 which included a “headless” mode. In conjunction with Chrome’s “devtools” API, headless mode allows you to use Chrome in a server environment and script it to perform tasks — like creating PDFs!
You can get started by using the “print-to-pdf” command line option, but for more control over the PDF you’ll need to communicate with Chrome’s devtools API.
Controlling Chrome with Puppeteer
Since the release of headless mode, a strong ecosystem of tools has emerged for interacting with the devtools API. Foremost amongst these is “Puppeteer”, a Node.js library that’s been built by the Google Chrome team themselves. You could use Puppeteer to script the browser to perform countless tasks, but we’ll be focusing on how it can be used to create PDFs.
Using Puppeteer with Chrome on your own server
When you install Puppeteer on your server or in your development environment, it will automatically download it’s own copy of Chrome for you. Surprisingly simple, right? Bear in mind though that a Chrome install can be 250MB or more, and running Chrome has the potential to use a significant amount of your server’s resources.
Using Puppeteer via browserless.io
If you don’t want to install Chrome on your own infrastructure, but do like the idea of using Puppeteer then it’s worth checking out a service called browserless.io. It lets you connect to a browser running in the cloud, freeing you from the server administration involved in installing, running and updating Chrome yourself. You can also run up to 20 browser sessions in parallel.
Using Puppeteer via Google Cloud Functions
One interesting new option is the ability to run headless Chrome on Google Cloud’s “serverless” platform — Cloud Functions. This feature was added to Cloud Functions in August 2018 and should provide a low-cost and highly scalable way of generating PDFs. Google’s announcement post has a good walk-through that explains how to set it all up.
Creating your own PDF microservice
If you want to generate PDFs on your own servers, but keep all PDF-related concerns out of your main application, you should check out pdf-bot. It’s a Node.js microservice that can receive URLs via it’s API, add them to a queue, and then notify you via webhooks when the URL has been converted to PDF. It also supports storing your PDF files on Amazon S3.
Generating PDFs using a cloud API
If you don’t mind paying a small amount to outsource your PDF infrastructure and focus on more important features, consider using a cloud API.
This is probably the simplest way to get started, as you'll be interacting with a relatively simple REST API and you won't need to handle installing Chrome on servers or connecting directly to the Chrome devtools API.
Paperplane is one option here — you send URLs to it’s API and it takes care of converting them to PDF using Chrome. The finished PDFs are uploaded to your own Amazon S3 storage. You can set options like page size and include headers and footers just as you would if you were using Puppeteer. Finally, if you have lots of documents to create, Paperplane can generate up to 20 PDFs in parallel. For more information, you can check out the full list of features or the documentation.
Fine-grained control — advanced typesetting features
If Chrome isn’t able to produce the PDF output you need, then there are a few commercial software packages you might want to investigate. These all have great support for CSS paged media — a CSS module for controlling print or PDF output. CSS paged media support in Chrome is reasonable but currently incomplete in some areas.
PrinceXML is capable of creating extremely well-formatted output (check out the samples on their website) but at a steep price of $3800 for a 1-server licence. However, if you require some of the features that only it can offer, such as automatic hyphenation, footnotes or print crop marks, then the cost may be worthwhile.
Docraptor is a cloud API backed by PrinceXML that lets you get started with PrinceXML at a much lower price point.
PDFreactor is a competitor to PrinceXML with a similar price tag and a similar focus on producing print-quality PDF output.
Comparing the options
I hope this helps you choose the right option for your project!
If you’ve got comments or can suggest improvements to this post, let me know on Twitter 😀.