Jaco du Plessis

Batching Asynchronous JavaScript for Headless Screenshots

javascript

I like taking photos of birds. For quite some time I've been meaning to create an app where I can upload and manage all my bird photos. But, like with almost all my side project ideas, I get sidetracked long before I reach the original goal. At some point I decided that I needed to use bird distribution data to validate my species identification - if a bird has not been officially identified in a region where I took a photo, the app should warn me when I try to tag it as that bird.

Anyway, I requested the worldwide bird sighting database from Cornell University in the USA, and I started drawing some maps. Specifically, I drew heatmaps for each bird, in each country in which it appears. The compressed dataset is about 40 GB, so that's a lot of heatmaps.

Because it's such a hassle to deploy all that code online, I decided to take a screenshot of each heatmap - then it's just a bunch of images. This was shortly after Chrome's headless mode landed in the stable version, and Google released their puppeteer library.

Because I normally try to solve my problems with Python first, I used Selenium and Pillow to take screenshots and crop them. It was very slow, even running with a multiprocessing pool, simply because only one page was being rendered at a time in each process. I knew I needed to use multiple tabs to speed things up, and at that stage I decided to try the puppeteer library, because

  1. The last time I had to use multiple tabs in Selenium, it was quite a pain
  2. The Chrome dev tools protocol has built-in support for cropping screenshots, which would probably be faster than using Pillow
  3. I wanted to try out the puppeteer library, just to see if it offers a better user experience than Selenium (because I have a few more ideas where I want to use headless browsing in the future)

To cut a long story short, in the end taking 977 screenshots (all the bird species in South Africa) went from a couple of minutes to about 45 seconds.

At first, I just tried opening each html file in a new tab, all 977 in one go, just to see what happens. I ran out of memory on my 12 GB laptop pretty quickly. I found that if I limited a run to about 100 tabs, things worked nicely. So I simply had to batch the screenshots in sets of about 100.

This was trickier than I expected, simply because it's not always intuitive how to do async things in a semi-synchronous way.

Without going into the detail of how I parsed and exported the data, here are the relevant parts of my screenshots.js file:

const puppeteer = require('puppeteer')
const fs = require('fs')

const code = process.argv[2]

if (!code) process.exit(1)

Here I just read the 2-letter country code from the command-line arguments.

const birds = JSON.parse(fs.readFileSync(`./json/${code}/__all__.json`)) 
const N = birds.slugs.length

birds is a JavaScript object with a single key slugs, which holds an array of slugified scientific names, which I use to uniquely reference each species.

const batchSize = 100
const numBatches = Math.ceil(N / batchSize)
const batchStarts = Array(numBatches).fill(0).map((x, i) => i * batchSize)

batchStarts is an array containing the starting index of each batch. So for this run of 977 screenshots with a batch size of 100, it looks like this: [0,100,200,300,400,500,600,700,800,900]

One of the reasons I like JavaScript is the functional style of programming: I haven't used a JavaScript for loop in more than a year. If you know how to use map, reduce and forEach, you don't need it.* However, combining these methods with Promises and the new async-await syntax requires a few adjustments.

In short, when mapping over an array with a function that returns Promises, you need to wrap the resulting array with Promise.all. When using reduce, you need to pass in a Promise that resolves immediately (Promise.resolve()) as your initial accumulator value.
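As a quick illustration, here's a toy sketch of both patterns (fetchBird is a made-up function standing in for real async work):

async function fetchBird(slug) {
  return slug.toUpperCase() // stand-in for real async work
}

;(async () => {
  const slugs = ['pin-tailed-whydah', 'cape-sparrow']

  // map produces an array of Promises; Promise.all turns it into
  // a single Promise that resolves to an array of results.
  const results = await Promise.all(slugs.map(slug => fetchBird(slug)))
  console.log(results) // ['PIN-TAILED-WHYDAH', 'CAPE-SPARROW']

  // reduce needs an already-resolved Promise as its initial value
  // so the first .then() has something to chain onto.
  await slugs.reduce(
    (promise, slug) => promise.then(() => fetchBird(slug)),
    Promise.resolve()
  )
})()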

Back to the issue at hand: we have an array that holds the starting index value of each batch, and we need to apply an async function to it, in a serial manner.

The native Promise object offers only a handful of static methods (all, race, resolve and reject), none of which iterates sequentially. The Bluebird library includes a few more, specifically the each method, which allows for sequential iteration.
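With Bluebird, the batching below would look something like this (a sketch, assuming the library were installed):

const Promise = require('bluebird')

// Promise.each applies the function to one item at a time, waiting
// for each returned Promise to resolve before moving on to the next.
Promise.each(batchStarts, start => processBatch(start))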

However, rather than installing it as a dependency (I hate the JavaScript community's reliance on truckloads of NPM dependencies to do basic things), I came up with the following:

function forEachPromise(items, fn) {
  // Chain fn over the items sequentially, starting from a resolved Promise.
  return items.reduce((promise, item) => promise.then(() => fn(item)), Promise.resolve())
}

This function accepts an array and a function fn that returns a Promise. It waits for each Promise to resolve before applying fn to the next item in the array. The function itself returns a Promise by using Array.reduce, as I mentioned earlier.
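To see the sequencing in action, here's a toy example using the function above and a small delay helper:

const delay = ms => new Promise(resolve => setTimeout(resolve, ms))

forEachPromise([1, 2, 3], n => {
  console.log('starting', n)
  return delay(500).then(() => console.log('finished', n))
}).then(() => console.log('all done'))

// Each item only starts once the previous one has finished:
// starting 1, finished 1, starting 2, finished 2, starting 3, finished 3, all done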

This gives us what we need. Note, however, that this is only useful if our function has side-effects - it cannot return batches of results as they are processed. Our side-effect (and goal) is saving screenshots to disk, so no worries. This brings me to the final part:

;(async () => {
  const browser = await puppeteer.launch({executablePath: '/usr/bin/google-chrome-unstable'})
  const width = 1920
  const height = 1080
  const margins = {
    left: 200,
    right: 200,
    top: 100,
    bottom: 20
  }

  async function processBatch(start) {
    console.log("Making batch starting from", start)
    const batch = birds.slugs.slice(start, start + batchSize)

    // One tab per slug; Promise.all resolves once the whole batch is done.
    return Promise.all(batch.map(async (slug) => {
      const page = await browser.newPage()
      await page.setViewport({width, height})
      await page.goto(`file:///${__dirname}/html/${code}/${slug}.html`)
      await page.screenshot({
        path: `./heatmaps/${code}/${slug}.jpg`,
        quality: 80,
        type: 'jpeg',
        clip: {
          x: margins.left,
          y: margins.top,
          height: height - margins.top - margins.bottom,
          width: width - margins.left - margins.right,
        }
      })
      return page.close()
    }))
  }

  await forEachPromise(batchStarts, processBatch)
  console.log("Done")
  await browser.close()
})()

Nothing strange here. You can see how each batch is constructed by slicing the original array, from its batchStart up to the next batchSize items. Mapping over the batch asynchronously, a new Page (tab) is created for each item, and after the page has rendered, a screenshot is saved to disk.

This is the best approach I can think of to minimize the time to take many thousands of screenshots. You'll have to experiment with the batch size depending on how much memory you have available. (You don't have to bother with running multiple processes as Chrome automatically runs each tab in a process.)
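If you want to experiment without editing the script each time, one option (hypothetical, not part of the script above) is to read the batch size from an optional third argument:

// e.g. node screenshots.js ZA 50
const batchSize = parseInt(process.argv[3], 10) || 100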

Here's an example of the final screenshot:

Heatmap of Pin-tailed Whydah sightings in South Africa

Thanks to Cornell University for giving me access to the data.


* I remember reading some time ago a guy saying that if you find yourself using a for loop in JavaScript, it's a sign that you're doing something wrong. I agree.

Have a comment? Email me at jaco@jacoduplessis.co.za.