Jaco du Plessis

Bypassing Paywalls With One Line of Code

paywalls google bash curl pup lynx

Let's not talk about whether paywalls are a good or bad idea. But if you've ever spent some time looking into how most paywalls are implemented, you'll know that they are often very easy to bypass.

Publishers are slaves to Google - if they cut off their content from the big G, their traffic will drop severely. So most add a backdoor to their paywall to allow Google to still index their content. This is done by checking the User-Agent header of every incoming request.

Therefore you can often bypass a paywall simply by simulating the Googlebot user agent.

curl -s -H "User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" $URL

Making a request like the one above will print the entire HTML response to your terminal.
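
A quick way to check whether a site actually whitelists Googlebot is to compare the size of the response with and without the header - if the plain request returns far less, the backdoor is probably there. (This is just a rough heuristic; some sites ship the full article and hide it with JavaScript instead.)

curl -s "$URL" | wc -c
curl -s -H "User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "$URL" | wc -c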

As most sites have ridiculously bloated markup, you'll want to strip out most of it. Using the pup tool, this is very easy - simply identify the relevant CSS selectors.
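
If you don't have pup installed: it's a small Go program, so assuming you have a Go toolchain, something like the command below should work (this is just the usual Go install pattern - check the pup project page for current instructions):

go install github.com/ericchiang/pup@latest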

As an example, let's take this page from the Financial Times. Note you can't read the article when visiting that page in your browser.
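
To see exactly what Googlebot gets served, you can dump the raw HTML to a file first (the filename here is just an example):

curl -s \
    -H "User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
    "https://www.ft.com/content/23bdde0a-5ca4-11e8-9334-2218e7146b04" > ft-article.html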

By looking at the markup (use your browser's dev tools or look at the raw HTML), we see that the relevant selector is .article__content, which we pass into pup:

... | pup ".article__content"

(If you have more than one selector, pass it as a comma-separated list.)
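
For example, if you also wanted to keep a (hypothetical) byline block alongside the body, you could write:

... | pup ".article__byline, .article__content"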

Now we have the content, but it's in HTML and not easy to read.

The final part of the solution is the Lynx browser. We can pipe the HTML into lynx and use its ability to transform HTML into readable text:

... | lynx -stdin -dump

All together, it looks like this:

curl -s \
    -H "User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
    "https://www.ft.com/content/23bdde0a-5ca4-11e8-9334-2218e7146b04" \
    | pup ".article__content" \
    | lynx -stdin -dump

Because each site requires different content selectors, I add a bash function to my .bashrc that looks like the following:

function read_example {
    SELECTORS=".article-meta, .content"
    curl -s -H "User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "$1" | pup "$SELECTORS" | lynx -stdin -dump
}

Then you just run read_example $URL to read an article. The function can easily be copied and edited for different sites.
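
For example, with a made-up URL:

read_example "https://example.com/2018/05/some-paywalled-article"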

So for the Financial Times, it would look like this:

function read_ft {
    SELECTORS=".article__content"
    curl -s -H "User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "$1" | pup "$SELECTORS" | lynx -stdin -dump
}
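
Calling it with the article from earlier - piping into less helps with longer pieces:

read_ft "https://www.ft.com/content/23bdde0a-5ca4-11e8-9334-2218e7146b04" | less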

Lastly, I've made a general-purpose program to read content while simulating the Googlebot user agent. It's called goggles and can be found here.

Happy reading.

Have a comment? Email me at jaco@jacoduplessis.co.za.