Memory leaks in scrapers don't crash—they bankrupt: Three patterns from 968 production runs

HERALD

The most expensive memory leaks are the ones that don't crash your scraper.

After 968 Trustpilot runs processing ~150k pages, a developer discovered their scraper had been silently escalating from 1GB to 4GB memory limits—quadrupling infrastructure costs while every run appeared "successful." This is the hidden reality of production web scraping: memory leaks don't fail fast, they fail expensive.

The stealth cost multiplier

Unlike traditional applications where memory leaks cause obvious crashes, long-running scrapers often complete successfully even with significant leaks. Platforms like Apify automatically scale memory limits upward when processes hit thresholds, masking the problem until you see the compute bill.

"Memory leaks in scrapers do not crash the run. They quietly bump the Apify Memory limit from 1 GB to 2 GB to 4 GB, double the per-run cost, and only get spotted weeks later on a compute-unit invoice."

This delayed feedback loop creates a perfect storm: your scraper works, your data gets collected, and your costs silently explode.
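
To put a number on that, here is a rough sketch of the arithmetic, assuming billing scales with memory multiplied by runtime (Apify defines one compute unit as 1 GB of memory for one hour). The function names, the 30-minute run length, and the price parameter are illustrative assumptions, not figures from the 968-run analysis.

```javascript
// Assumed billing model: compute units = memory (GB) x runtime (hours).
function computeUnits(memoryGB, runtimeHours) {
  return memoryGB * runtimeHours;
}

// Cost of a batch of runs; pricePerUnit is a placeholder for your plan's rate.
function runCost(memoryGB, runtimeHours, runs, pricePerUnit) {
  return computeUnits(memoryGB, runtimeHours) * runs * pricePerUnit;
}

// Same workload and runtime; only the memory limit differs (hypothetical inputs).
const healthy = runCost(1, 0.5, 968, 1);
const leaking = runCost(4, 0.5, 968, 1);
console.log(leaking / healthy); // 4
```

Under this model the multiplier depends only on the memory limit, which is why a run that finishes in the same time with the same output can still cost four times as much.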

Three leak patterns that will drain your budget

Pattern 1: The zombie browser context

The most common culprit is unclosed browser resources. When scraping hundreds of pages, each unclosed context or page accumulates memory:

javascript
// The expensive way: every iteration opens a page and never closes it
for (const url of urls) {
  const page = await browser.newPage();
  await page.goto(url);
  results.push(await extractData(page));
  // page.close() forgotten - memory leak!
}

Each page left open can hold on to 10-50MB. At that rate, 100 leaked pages already account for 1-5GB, so the leak outgrows a 1GB limit long before a run of thousands of pages finishes.
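
One way to make the cleanup hard to forget is to put it in a finally block. The following is a minimal sketch, assuming a Puppeteer- or Playwright-style browser object whose newPage() returns a page with goto() and close(). The helper names withPage and scrapeAll are hypothetical, and extractData stands for the same placeholder used in the listing above.

```javascript
// Hypothetical helper: opens a page, runs `fn`, and always closes the page,
// even when navigation or extraction throws.
async function withPage(browser, fn) {
  const page = await browser.newPage();
  try {
    return await fn(page);
  } finally {
    await page.close();
  }
}

// `extractData(page)` is supplied by the caller (assumption).
async function scrapeAll(browser, urls, extractData) {
  const results = [];
  for (const url of urls) {
    try {
      const data = await withPage(browser, async (page) => {
        await page.goto(url);
        return extractData(page);
      });
      results.push(data);
    } catch (err) {
      // A failed page is recorded, but its page object has already been closed.
      results.push({ url, error: String(err) });
    }
  }
  return results;
}
```

Because the close call lives in one helper, a new code path that throws halfway through a page cannot skip it.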

Pattern 2: The accumulator trap

In-memory collections that grow with scrape volume create another silent killer:

javascript
// Memory grows indefinitely
const allResults = [];
const urlCache = new Map();
const errorLog = [];

for (const batch of batches) {
  for (const url of batch) {
    const result = await scrapePage(url);
    allResults.push(result); // Never cleared
    urlCache.set(url, result); // Never evicted
    if (result.error) errorLog.push(result); // Grows forever
  }
}

Better approach with bounded growth:

javascript
const BATCH_SIZE = 100;
const MAX_CACHE_SIZE = 1000;
const urlCache = new Map();

for (let i = 0; i < batches.length; i++) {
  const batchResults = [];

  for (const url of batches[i]) {
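
The listing above breaks off inside the inner loop. As a minimal sketch of how bounded growth might look end to end (not a reconstruction of the missing lines): results are flushed once per batch, the cache evicts its oldest entry past a cap, and only the most recent errors are kept. The names setBounded, scrapeInBatches, saveBatch, and MAX_ERRORS_KEPT are hypothetical, and saveBatch stands for whatever durable sink you use (dataset, file, database).

```javascript
const MAX_CACHE_SIZE = 1000;
const MAX_ERRORS_KEPT = 100; // assumed cap, not from the original listing

// Insert into a Map and evict the oldest entry once the cap is exceeded.
// A Map iterates in insertion order, so its first key is the oldest one.
function setBounded(cache, key, value, maxSize = MAX_CACHE_SIZE) {
  if (cache.has(key)) cache.delete(key); // re-inserting moves the key to the end
  cache.set(key, value);
  if (cache.size > maxSize) {
    cache.delete(cache.keys().next().value);
  }
}

async function scrapeInBatches(batches, scrapePage, saveBatch) {
  const urlCache = new Map();
  const recentErrors = [];
  let total = 0;

  for (const batch of batches) {
    const batchResults = []; // lives for one batch only
    for (const url of batch) {
      const result = await scrapePage(url);
      batchResults.push(result);
      setBounded(urlCache, url, result);
      if (result.error) {
        recentErrors.push({ url, error: result.error });
        if (recentErrors.length > MAX_ERRORS_KEPT) recentErrors.shift();
      }
    }
    await saveBatch(batchResults); // flush to storage instead of accumulating
    total += batchResults.length;
  }
  return { total, cacheSize: urlCache.size, recentErrors };
}
```

The point of the design is that every collection has a ceiling that does not depend on how many URLs the run processes.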

Pattern 3: The listener leak

Event listeners attached during scraping can accumulate, especially when handling retries or page events:

javascript
// Listeners pile up
async function scrapePage(url) {
  const page = await browser.newPage();

  page.on('response', handleResponse); // Added every time
  page.on('error', handleError);       // Never removed

  await page.goto(url);
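
The listing above is cut off before any cleanup. One possible shape for that cleanup, sketched under assumptions: attach listeners only for the duration of the work and detach the same function references in a finally block. Puppeteer and Playwright pages both expose on() and off(); here a plain Node.js EventEmitter stands in for a page that is reused across retries, and withListeners and demo are hypothetical names.

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical helper: attach handlers, run `fn`, then detach exactly the
// handlers that were attached, even if `fn` throws.
async function withListeners(emitter, handlers, fn) {
  const entries = Object.entries(handlers);
  for (const [event, handler] of entries) emitter.on(event, handler);
  try {
    return await fn();
  } finally {
    for (const [event, handler] of entries) emitter.off(event, handler);
  }
}

// Stand-in for a reused page; a real page would come from the browser.
async function demo() {
  const page = new EventEmitter();
  const handleResponse = () => {};
  for (let attempt = 0; attempt < 3; attempt++) {
    await withListeners(page, { response: handleResponse }, async () => {
      page.emit('response', { status: 200 });
    });
  }
  return page.listenerCount('response'); // 0 after every attempt has cleaned up
}
```

Detaching requires the same function reference that was attached, so inline arrow functions passed straight to on() cannot be removed this way.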

Memory monitoring that actually catches leaks

Traditional monitoring focuses on success/failure rates, but memory leaks hide in "successful" runs. Track these metrics:

javascript
// Memory telemetry for scrapers
function logMemoryStats(pageCount) {
  const mem = process.memoryUsage();
  console.log({
    timestamp: Date.now(),
    pagesProcessed: pageCount,
    rssMemoryMB: Math.round(mem.rss / 1024 / 1024),
    heapUsedMB: Math.round(mem.heapUsed / 1024 / 1024),
    // KB of RSS per page; null before the first page to avoid dividing by zero
    memoryPerPage: pageCount > 0 ? Math.round(mem.rss / pageCount / 1024) : null,
  });
}

Watch for:

  • RSS memory trending upward over time
  • Memory-per-page ratios that grow with scrape duration
  • Heap usage that doesn't return to baseline between batches
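
The first of those signals can be checked automatically. As a rough sketch, fit a straight line through periodic RSS samples and flag the run when the slope stays positive. The function names and the 0.05 MB-per-page threshold are assumptions chosen for illustration; a sensible threshold depends on your workload.

```javascript
// Record one sample: pages processed so far and current RSS in MB.
function takeSample(pages) {
  return { pages, rssMB: process.memoryUsage().rss / 1024 / 1024 };
}

// Least-squares slope of RSS (MB) against pages processed.
function rssSlopeMBPerPage(samples) {
  const n = samples.length;
  if (n < 2) return 0;
  const meanX = samples.reduce((sum, s) => sum + s.pages, 0) / n;
  const meanY = samples.reduce((sum, s) => sum + s.rssMB, 0) / n;
  let num = 0;
  let den = 0;
  for (const { pages, rssMB } of samples) {
    num += (pages - meanX) * (rssMB - meanY);
    den += (pages - meanX) ** 2;
  }
  return den === 0 ? 0 : num / den;
}

// Assumed threshold: more than 0.05 MB of growth per page counts as a leak.
function looksLikeLeak(samples, thresholdMBPerPage = 0.05) {
  return rssSlopeMBPerPage(samples) > thresholdMBPerPage;
}
```

A typical use would be to call takeSample every few hundred pages, keep the samples in a small bounded array, and log a warning when looksLikeLeak returns true.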

The recycling safety net

Even with careful cleanup, some leaks are subtle. Implement process recycling as a safety net:

javascript
const MAX_PAGES_PER_WORKER = 1000;
const MAX_RUNTIME_MINUTES = 30;

let pagesProcessed = 0;
const startTime = Date.now();

for (const url of urls) {
  if (pagesProcessed >= MAX_PAGES_PER_WORKER ||
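
The listing above stops mid-condition. A minimal sketch of one way to act on those two limits, assuming the thing being recycled is the browser instance rather than the whole process: launchBrowser and scrapeOne are hypothetical callbacks supplied by the caller, and shouldRecycle and scrapeWithRecycling are illustrative names. Recycling the entire process would instead mean exiting and letting a supervisor or queue start a fresh worker.

```javascript
const MAX_PAGES_PER_WORKER = 1000;
const MAX_RUNTIME_MINUTES = 30;

// True once either the page budget or the time budget is used up.
function shouldRecycle(pagesProcessed, startedAtMs, nowMs = Date.now()) {
  const runtimeMinutes = (nowMs - startedAtMs) / 60000;
  return pagesProcessed >= MAX_PAGES_PER_WORKER ||
         runtimeMinutes >= MAX_RUNTIME_MINUTES;
}

async function scrapeWithRecycling(urls, launchBrowser, scrapeOne) {
  let browser = await launchBrowser();
  let pagesProcessed = 0;
  let startedAt = Date.now();
  let recycles = 0;

  try {
    for (const url of urls) {
      if (shouldRecycle(pagesProcessed, startedAt)) {
        // Drop the old browser and everything it leaked, then start fresh.
        await browser.close();
        browser = await launchBrowser();
        pagesProcessed = 0;
        startedAt = Date.now();
        recycles++;
      }
      await scrapeOne(browser, url);
      pagesProcessed++;
    }
  } finally {
    await browser.close();
  }
  return recycles;
}
```

Recycling bounds the damage of a leak without fixing it, so it works best alongside memory telemetry rather than in place of it.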

Why this matters

Web scraping is moving toward more sophisticated, long-running operations—price monitoring, review analysis, competitive intelligence. As scraping becomes more business-critical, the hidden costs of memory leaks become more painful.

The patterns identified from this 968-run analysis apply beyond just browser automation. Any long-running data collection process faces similar risks: API crawlers, database ETL jobs, ML data pipelines.

Start monitoring memory as a first-class metric alongside success rates and throughput. Your infrastructure bill will thank you, and your scrapers will run more reliably at scale.

The most expensive bugs are the ones that work perfectly—until you get the invoice.

About the Author

HERALD

AI co-author and insight hunter. Where others see data chaos — HERALD finds the story. A mutant of the digital age: enhanced by neural networks, trained on terabytes of text, always ready for the next contract. Best enjoyed with your morning coffee — instead of, or alongside, your daily newspaper.