HTML to PDF conversion seems like the obvious choice until it spectacularly isn't. You've got HTML content, browsers can print to PDF, so why not automate it? The harsh reality is that HTML and CSS were designed for infinite-scroll web pages, not paginated documents—and this fundamental mismatch will eventually bite you.
The Three Paths (And Their Hidden Costs)
Every developer faces the same three options, each with expensive tradeoffs that aren't obvious upfront.
PDF Libraries like wkhtmltopdf and jsPDF process HTML inside your own stack, with no external service involved. They're appealing because they're self-hosted and seem cost-effective. But they tend to lag well behind current browsers: wkhtmltopdf, for example, is built on an old Qt WebKit engine that predates CSS Grid and has only limited Flexbox support. Build your layout on either feature and you'll get mangled output that looks nothing like your web page.
```python
# This works fine in development...
import pdfkit  # Python wrapper; needs the wkhtmltopdf binary installed

# Placeholder markup; in practice this comes from your template engine
html_content = "<h1>Monthly report</h1>"

options = {
    'page-size': 'A4',
    'margin-top': '0.75in'
}

pdfkit.from_string(html_content, 'output.pdf', options=options)

# ...until you hit 8,100 records, the run takes 95 seconds, and the request times out
```
Headless Browsers (Chrome, Firefox) render pages exactly as users see them. Full support for modern CSS, JavaScript execution, the works. The catch? They're resource monsters. Each conversion spins up a browser instance that can consume 100MB+ of memory and significant CPU cycles.
Cloud APIs eliminate infrastructure headaches but create new dependencies. You're now relying on external services for a core feature, potentially sending sensitive data to third parties, and dealing with per-conversion costs that can explode with usage.
The fundamental problem stems from browser engines neglecting the CSS Paged Media specifications, the standards that define how to handle different page styles, margins, and layouts.
Why Conversion Fails: The Standards Gap
The real issue isn't technical complexity. It's that web browsers were never built around document pagination. The CSS Paged Media specifications exist to handle things like different headers on title pages versus content pages, but browsers implement them only partially.
This creates cascading problems:
- Page breaks happen randomly instead of at logical content boundaries
- Headers and footers either don't work or display inconsistently
- External assets (fonts, images, stylesheets) dramatically slow processing and can cause timeouts
- Vector graphics get converted to pixelated images, destroying scalability
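Some of the page-break problems can be reduced by leaning on the print rules that engines do tend to honor. The sketch below injects a small print stylesheet into a document before conversion. The helper name `add_print_styles` and the `.record` class are assumptions made for this example, and how faithfully each rule is applied still depends on the rendering engine you use.

```python
# Print rules: page size and margins via @page, plus hints that keep
# each record's markup together and start every h1 on a new page.
PRINT_CSS = """
@page { size: A4; margin: 2cm; }
.record { break-inside: avoid; page-break-inside: avoid; }
h1 { break-before: page; page-break-before: always; }
"""


def add_print_styles(html: str, css: str = PRINT_CSS) -> str:
    """Insert a <style> block just before </head>, or prepend it if there is no head."""
    style_tag = f"<style>{css}</style>"
    index = html.lower().find("</head>")
    if index == -1:
        # No head element found: put the styles in front of the markup
        return style_tag + html
    return html[:index] + style_tag + html[index:]
```

Both the modern `break-*` properties and the older `page-break-*` aliases are included because older engines may only recognize the legacy names.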
One real-world example: a developer tried converting 8,100 database records to PDF and hit Azure's 90-second gateway timeout. What seemed like a simple bulk operation became an architecture problem requiring job queues and background processing.
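One way to take that work off the request path is to hand the batch to a background worker pool with a fixed concurrency limit, so the HTTP request can return right away and the client can poll for progress. The sketch below uses Python's standard `concurrent.futures`. `convert_record` is a hypothetical stand-in for your real render-and-convert step, and a production system would more likely use a persistent job queue than an in-process pool.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# A shared pool caps how many conversions run at the same time.
_pool = ThreadPoolExecutor(max_workers=4)


def convert_record(record: dict) -> bytes:
    """Hypothetical stand-in for rendering a template and converting it to PDF."""
    return f"PDF for record {record['id']}".encode()


def start_report_job(records: list[dict]) -> list[Future]:
    """Queue one conversion per record and return immediately."""
    return [_pool.submit(convert_record, record) for record in records]


def job_progress(job: list[Future]) -> tuple[int, int]:
    """Return (finished, total) so a status endpoint can report progress."""
    finished = sum(1 for future in job if future.done())
    return finished, len(job)


def collect_results(job: list[Future]) -> list[bytes]:
    """Block until every conversion finishes; results keep the input order."""
    return [future.result() for future in job]
```

A request handler would call `start_report_job`, store the returned list under a job ID, and let the client poll a status endpoint backed by `job_progress`.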
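```typescript
// Placeholder declarations; swap in your own record type, template, and converter.
type ReportRecord = { id: number };
declare function renderTemplate(record: ReportRecord): string;
declare function htmlToPdf(html: string): Promise<Uint8Array>;

// This innocent-looking code becomes a bottleneck
const generateReports = async (records: ReportRecord[]) => {
  // Starts every conversion at once, with no limit on concurrency
  const promises = records.map(record =>
    htmlToPdf(renderTemplate(record))
  );

  // 8,100 records = 95 seconds = timeout
  return Promise.all(promises);
};
```
Security: Your PDF Endpoint is an Attack Vector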
HTML to PDF libraries can expose Server-Side Request Forgery (SSRF) vulnerabilities that most developers miss. If users can inject URLs into your conversion process, attackers can probe internal networks or exfiltrate data.
```javascript
// Dangerous: user-controlled HTML can contain malicious URLs
const userHtml = `
  <img src="http://internal-server:8080/admin/users">
  <link rel="stylesheet" href="file:///etc/passwd">
`;

// The converter will attempt to fetch these resources
convertToPdf(userHtml); // SSRF vulnerability
```
Mitigation requires disabling JavaScript execution and local file access, but only some libraries support these controls. Testing shows wkhtmltopdf allows JavaScript to be disabled, while WeasyPrint and Flying Saucer don't execute JavaScript at all, a critical difference for security-conscious applications.
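Converter settings are one layer of defense. Another is to reject untrusted markup before it ever reaches the converter. The sketch below is one possible pre-check, assuming a hypothetical allowlist with a single host, `cdn.example.com`: it collects every `src` and `href` value and reports any that are not HTTPS URLs on an allowed host. It is deliberately incomplete, since it ignores CSS `url()` references, `srcset`, and redirects, so treat it as a starting point and not a full SSRF defense.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"https"}
ALLOWED_HOSTS = {"cdn.example.com"}  # hypothetical allowlist for this example


class ResourceUrlCollector(HTMLParser):
    """Collects every src and href attribute value found in the markup."""

    def __init__(self) -> None:
        super().__init__()
        self.urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.urls.append(value)


def is_allowed(url: str) -> bool:
    """Allow only absolute HTTPS URLs whose host is on the allowlist."""
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES:
        return False  # rejects file://, http://, and relative URLs
    return parsed.hostname in ALLOWED_HOSTS


def find_blocked_urls(html: str) -> list[str]:
    """Return every referenced URL that fails the allowlist check."""
    collector = ResourceUrlCollector()
    collector.feed(html)
    return [url for url in collector.urls if not is_allowed(url)]
```

If `find_blocked_urls` returns anything, the safest response is usually to refuse the conversion outright instead of trying to repair the markup.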
Choosing Your Poison: A Decision Framework
Use PDF libraries when:
- You control the HTML completely (no user input)
- Your layouts are simple and static
- You need self-hosted solutions for compliance
- Volume is moderate (hundreds, not thousands of conversions)
Choose headless browsers when:
- Visual fidelity is critical
- You use modern CSS features
- You can absorb the infrastructure costs
- You need JavaScript execution for dynamic content
Consider APIs when:
- You can tolerate external dependencies
- Data privacy isn't a primary concern
- You want to avoid infrastructure complexity
- Cost per conversion fits your model
Performance Optimization: Beyond the Obvious
The performance killers aren't always obvious. External asset loading is often the biggest bottleneck—not the PDF generation itself. Every external font, stylesheet, or image adds network round trips and processing time.
```html
<!-- This innocent HTML creates multiple network requests -->
<html>
<head>
  <link href="https://fonts.googleapis.com/css2?family=Inter" rel="stylesheet">
  <link href="https://cdn.jsdelivr.net/npm/tailwindcss@2/dist/tailwind.min.css" rel="stylesheet">
</head>
<body>
  <img src="https://api.example.com/user/123/avatar.png">
</body>
</html>
```
Optimization strategies:
- Embed critical CSS instead of linking to external stylesheets
- Use base64-encoded images for small graphics
- Preprocess HTML to strip unnecessary markup
- Implement timeouts for external resource loading
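Two of these strategies, base64-encoded images and load timeouts, can be combined in a preprocessing step. The sketch below is a minimal version using only the standard library. The function names are made up for this example, the regex only handles double-quoted `src` attributes on `<img>` tags, and the five-second default timeout is an arbitrary choice.

```python
import base64
import mimetypes
import re
import urllib.request
from typing import Callable

# Matches <img ... src="http(s)://..."> and captures the URL in group 2
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")(https?://[^"]+)(")', re.IGNORECASE)


def fetch_with_timeout(url: str, timeout: float = 5.0) -> bytes:
    """Download a resource, giving up after `timeout` seconds."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def inline_images(html: str, fetch: Callable[[str], bytes] = fetch_with_timeout) -> str:
    """Replace remote <img> sources with base64 data URIs."""

    def to_data_uri(match: re.Match) -> str:
        url = match.group(2)
        mime, _ = mimetypes.guess_type(url)
        try:
            payload = base64.b64encode(fetch(url)).decode("ascii")
        except OSError:
            # Network errors and timeouts: leave the original tag untouched
            return match.group(0)
        mime = mime or "application/octet-stream"
        return f"{match.group(1)}data:{mime};base64,{payload}{match.group(3)}"

    return IMG_SRC.sub(to_data_uri, html)
```

Run this once when the template is prepared, not on every conversion, so the converter never has to wait on the network. Only fetch URLs you trust, since this step makes outbound requests on the server's behalf.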
Why This Matters
PDF generation seems like a solved problem until you hit scale, security requirements, or complex layouts. Your choice of approach locks you into specific capabilities and constraints that become expensive to change later.
The key insight: match your tool to your actual requirements, not your initial assumptions. That simple invoice generator might work fine with a basic library, but if you're building a document-heavy application, invest in understanding the tradeoffs upfront.
Test with realistic data volumes early. Implement security controls immediately. And remember that the "simple" HTML-to-PDF approach is often anything but simple once real-world requirements emerge.

