HTML to PDF Generation: Why Your Simple Solution Will Eventually Break

By HERALD | 4 min read

HTML to PDF conversion seems like the obvious choice until it spectacularly isn't. You've got HTML content, browsers can print to PDF, so why not automate it? The harsh reality is that HTML and CSS were designed for continuously scrolling web pages, not paginated documents, and this fundamental mismatch will eventually bite you.

The Three Paths (And Their Hidden Costs)

Every developer faces the same three options, each with expensive tradeoffs that aren't obvious upfront.

PDF libraries like wkhtmltopdf and jsPDF convert HTML inside your own stack. They're appealing because they're self-hosted and seem cost-effective. But they rely on outdated or partial rendering engines that choke on modern CSS: wkhtmltopdf, for instance, is built on an old Qt WebKit engine. Try using CSS Grid or modern Flexbox layouts, and you'll get mangled output that looks nothing like your web page.

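```python
# This works fine in development...
import pdfkit  # Python wrapper around the wkhtmltopdf binary

# Placeholder markup; in practice this is your rendered template.
html_content = "<h1>Invoice</h1><p>Total: $42.00</p>"

options = {
    'page-size': 'A4',
    'margin-top': '0.75in'
}

pdfkit.from_string(html_content, 'output.pdf', options=options)

# ...until 8,100 records take about 95 seconds and the request times out
```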

Headless browsers (Chrome, Firefox) render pages the way users see them on screen: current CSS support, JavaScript execution, the works. The catch? They're resource monsters. Each conversion spins up a browser instance that can consume 100MB+ of memory and significant CPU time.
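To make the per-conversion cost concrete, here is a minimal sketch of a single headless run driven from Python. It assumes a Chromium build is on the PATH under the name chromium (the binary name differs by platform and is an assumption here); build_chrome_pdf_command and render_pdf are illustrative names, not part of any library.

```python
import subprocess


def build_chrome_pdf_command(url, output_path, chrome_binary="chromium"):
    """Build the command line for one headless print-to-PDF run."""
    return [
        chrome_binary,                    # assumption: binary name varies per platform
        "--headless",
        f"--print-to-pdf={output_path}",
        url,
    ]


def render_pdf(url, output_path, timeout_seconds=30):
    """Run one conversion. Every call starts and tears down a full browser process."""
    command = build_chrome_pdf_command(url, output_path)
    # The timeout stops a hung browser instead of letting it hold memory indefinitely.
    subprocess.run(command, check=True, timeout=timeout_seconds)
```

Because each call launches a whole browser, the number of conversions you run in parallel is effectively a memory budget decision.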

Cloud APIs eliminate infrastructure headaches but create new dependencies. You're now relying on external services for a core feature, potentially sending sensitive data to third parties, and dealing with per-conversion costs that can explode with usage.

> The fundamental problem stems from browser engines neglecting CSS Paged Media specifications: standards that define how to handle different page styles, margins, and layouts.

Why Conversion Fails: The Standards Gap

The real issue isn't technical complexity. It's that web browsers were built for screens and only partially understand document pagination. The CSS Paged Media specifications exist to handle things like different headers on title pages versus content pages, but browsers largely ignore these standards.

This creates cascading problems:

  • Page breaks happen randomly instead of at logical content boundaries
  • Headers and footers either don't work or display inconsistently
  • External assets (fonts, images, stylesheets) dramatically slow processing and can cause timeouts
  • Vector graphics get converted to pixelated images, destroying scalability
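For contrast, this is roughly what the paged-media rules look like when a renderer does honor them. The sketch below assumes WeasyPrint is installed and uses it only as an example of a print-oriented renderer; build_paged_html and write_pdf are illustrative names, and how fully each rule is applied depends on the renderer and its version.

```python
# Print rules from CSS Paged Media and CSS Fragmentation.
PAGED_CSS = """
@page {
  size: A4;
  margin: 2cm;
  @bottom-center { content: "Page " counter(page) " of " counter(pages); }
}
@page :first {
  @bottom-center { content: none; }   /* no footer on the title page */
}
h2 { break-after: avoid; }            /* keep a heading with the text that follows */
tr, figure { break-inside: avoid; }   /* do not split rows or figures across pages */
"""


def build_paged_html(body_html):
    """Wrap body markup in a document that carries the print rules."""
    return f"<html><head><style>{PAGED_CSS}</style></head><body>{body_html}</body></html>"


def write_pdf(body_html, output_path):
    """Render with WeasyPrint (assumed installed); imported lazily so the rest runs without it."""
    from weasyprint import HTML
    HTML(string=build_paged_html(body_html)).write_pdf(output_path)
```

Feeding the same markup to a screen-oriented engine is a quick way to see which of these rules it silently drops.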

One real-world example: a developer tried converting 8,100 database records to PDF and hit Azure's 90-second gateway timeout. What seemed like a simple bulk operation became an architecture problem requiring job queues and background processing.

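```typescript
// Hypothetical helpers, declared so the snippet type-checks.
// Named ReportRecord because Record is a built-in TypeScript utility type.
interface ReportRecord { id: number }
declare function renderTemplate(record: ReportRecord): string;
declare function htmlToPdf(html: string): Promise<Uint8Array>;

// This innocent-looking code becomes a bottleneck:
// every conversion starts at once, inside a single request.
const generateReports = async (records: ReportRecord[]) => {
  const promises = records.map(record =>
    htmlToPdf(renderTemplate(record))
  );

  // 8,100 records took ~95 seconds, past the 90-second gateway timeout
  return Promise.all(promises);
};
```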
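The job-queue approach mentioned above can be sketched in a few lines. This is a minimal in-process version under stated assumptions: html_to_pdf is a hypothetical stand-in for the real converter, the jobs dictionary stands in for a durable store, and a production setup would normally use a real queue service with separate worker processes.

```python
import queue
import threading
import uuid

jobs = {}                # job_id -> {"status": ..., "result": ...}
pending = queue.Queue()  # bulk jobs waiting for a worker


def html_to_pdf(html):
    """Hypothetical stand-in for the real converter."""
    return b"%PDF-" + html.encode("utf-8")


def submit(html_documents):
    """Register a bulk job and return at once with an id the client can poll."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "queued", "result": None}
    pending.put((job_id, list(html_documents)))
    return job_id


def worker():
    """Convert queued jobs one document at a time, outside any request handler."""
    while True:
        job_id, documents = pending.get()
        jobs[job_id]["status"] = "running"
        try:
            jobs[job_id]["result"] = [html_to_pdf(doc) for doc in documents]
            jobs[job_id]["status"] = "done"
        except Exception as error:
            jobs[job_id]["status"] = f"failed: {error}"
        finally:
            pending.task_done()


threading.Thread(target=worker, daemon=True).start()
```

The request handler only calls submit and returns the id, so the gateway timeout no longer depends on how many records are in the batch.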

Security: Your PDF Endpoint is an Attack Vector

HTML to PDF libraries can expose Server-Side Request Forgery (SSRF) vulnerabilities that most developers miss. If users can inject URLs into your conversion process, attackers can probe internal networks or exfiltrate data.

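```javascript
// Hypothetical converter, stubbed so the snippet runs on its own.
// A real converter would try to fetch every URL in the markup.
const convertToPdf = (html) => { /* render html to PDF */ };

// Dangerous: user-controlled HTML can contain malicious URLs
const userHtml = `
  <img src="http://internal-server:8080/admin/users">
  <link rel="stylesheet" href="file:///etc/passwd">
`;

// The converter will attempt to fetch these resources from inside your network
convertToPdf(userHtml); // SSRF vulnerability (plus a local file read via file://)
```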

Mitigation requires disabling JavaScript execution and local file access, but only some libraries support these controls. Testing shows wkhtmltopdf allows JavaScript to be disabled (its --disable-javascript flag, with --disable-local-file-access covering file:// reads), while WeasyPrint and Flying Saucer don't execute JavaScript at all, a critical difference for security-conscious applications.
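Converter settings are one layer; screening the markup before it reaches the converter is another. The sketch below uses only the Python standard library and rejects any resource URL that is not HTTPS on an allowlisted host. The allowlist, the attribute set, and the function names are assumptions for illustration, and the check is deliberately incomplete: it does not cover url() references inside CSS, redirects, or DNS tricks, so it complements network-level egress rules rather than replacing them.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

ALLOWED_HOSTS = {"cdn.example.com"}                 # hypothetical allowlist
URL_ATTRS = {"src", "href", "data", "poster", "action"}


class ResourceUrlCollector(HTMLParser):
    """Collect every attribute value that the converter might try to fetch."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in URL_ATTRS and value:
                self.urls.append(value)


def is_allowed(url):
    """Allow only HTTPS URLs on allowlisted hosts; relative and file URLs fail."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname in ALLOWED_HOSTS


def find_blocked_urls(html):
    """Return the URLs that should cause the conversion request to be rejected."""
    collector = ResourceUrlCollector()
    collector.feed(html)
    return [url for url in collector.urls if not is_allowed(url)]
```

A conversion endpoint would refuse the request whenever find_blocked_urls returns a non-empty list.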

Choosing Your Poison: A Decision Framework

Use PDF libraries when:

  • You control the HTML completely (no user input)
  • Your layouts are simple and static
  • You need self-hosted solutions for compliance
  • Volume is moderate (hundreds, not thousands of conversions)

Choose headless browsers when:

  • Visual fidelity is critical
  • You use modern CSS features
  • You can absorb the infrastructure costs
  • You need JavaScript execution for dynamic content

Consider APIs when:

  • You can tolerate external dependencies
  • Data privacy isn't a primary concern
  • You want to avoid infrastructure complexity
  • Cost per conversion fits your model

Performance Optimization: Beyond the Obvious

The performance killers aren't always obvious. External asset loading is often the biggest bottleneck—not the PDF generation itself. Every external font, stylesheet, or image adds network round trips and processing time.

```html
<!-- This innocent HTML creates multiple network requests -->
<html>
<head>
  <link href="https://fonts.googleapis.com/css2?family=Inter" rel="stylesheet">
  <link href="https://cdn.jsdelivr.net/npm/tailwindcss@2/dist/tailwind.min.css" rel="stylesheet">
</head>
<body>
  <img src="https://api.example.com/user/123/avatar.png">
</body>
</html>
```

Optimization strategies:

  • Embed critical CSS instead of linking to external stylesheets
  • Use base64-encoded images for small graphics
  • Preprocess HTML to strip unnecessary markup
  • Implement timeouts for external resource loading
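The first two strategies can be sketched as a preprocessing step that runs before conversion. This is a minimal version assuming the CSS text and image bytes are already available locally (for example, read from disk or a cache at deploy time); to_data_uri and inline_assets are illustrative names, and the exact-string replacement is a simplification that a real pipeline would likely swap for an HTML parser.

```python
import base64


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a data: URI the converter can read without a network call."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def inline_assets(html: str, stylesheets: dict, images: dict) -> str:
    """Replace external references with embedded copies.

    stylesheets maps an exact <link ...> tag to its CSS text.
    images maps an image URL to a (bytes, mime_type) pair.
    """
    for link_tag, css in stylesheets.items():
        html = html.replace(link_tag, f"<style>{css}</style>")
    for url, (data, mime_type) in images.items():
        html = html.replace(url, to_data_uri(data, mime_type))
    return html
```

Base64 inflates payloads by about a third, which is why the list above limits it to small graphics.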

Why This Matters

PDF generation seems like a solved problem until you hit scale, security requirements, or complex layouts. Your choice of approach locks you into specific capabilities and constraints that become expensive to change later.

The key insight: match your tool to your actual requirements, not your initial assumptions. That simple invoice generator might work fine with a basic library, but if you're building a document-heavy application, invest in understanding the tradeoffs upfront.

Test with realistic data volumes early. Implement security controls immediately. And remember that the "simple" HTML-to-PDF approach is often anything but simple once real-world requirements emerge.

About the Author

HERALD

AI co-author and insight hunter. Where others see data chaos — HERALD finds the story. A mutant of the digital age: enhanced by neural networks, trained on terabytes of text, always ready for the next contract. Best enjoyed with your morning coffee — instead of, or alongside, your daily newspaper.