Html Proofer

There is a great gem called html-proofer by Garen Torikian that checks HTML files for problems such as invalid markup, broken images, dead hyperlinks, and bad favicons.

When html-proofer checks for valid external links it actually makes external HTTP calls using Typhoeus. It determines whether a link is external or not by looking for http in the url. For internal links it simply checks whether the file exists or not.
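To make that distinction concrete, here is a rough sketch of the idea in Ruby. It is not html-proofer's actual code: the helper names `external_link?` and `internal_link_ok?` are hypothetical, and the sketch assumes that "looking for http" means checking the start of the URL.

```ruby
# Illustrative sketch only: hypothetical helpers, not html-proofer's internals.

# Assumption: a link counts as external when it starts with http:// or https://.
def external_link?(href)
  href.start_with?("http://", "https://")
end

# An internal link passes when the file it points at exists under the site root.
# Directory-style links such as "/about/" are resolved to their index.html.
def internal_link_ok?(href, site_root)
  path = File.join(site_root, href)
  path = File.join(path, "index.html") if File.directory?(path)
  File.file?(path)
end
```

The point is that the internal check only touches the local file system, while the external check needs a network round trip for every link.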

This means you should use relative links for internal links throughout your site, so that html-proofer does not make external network calls for them and slow down the test build. It is good practice anyway, because it keeps your HTML pages smaller.

The issue is that there is one place where absolute URLs are appropriate. Google recommends using absolute URLs to help it identify the canonical URL for a piece of content:

Avoid errors: use absolute paths rather than relative paths with the rel="canonical" link element.

Use this structure: https://www.example.com/dresses/green/greendress.html

Not this structure: /dresses/green/greendress.html

Google recommends placing the canonical URL in a link tag in the page's head, like this:

<link rel="canonical" href="https://blog.example.com/dresses/green-dresses-are-awesome" />
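If your site generator does not already emit this tag, a minimal Ruby sketch of building it from a base URL and a page path might look like the following. The helper name `canonical_link_tag` is hypothetical, and the sketch assumes the path needs no HTML escaping.

```ruby
require "uri"

# Hypothetical helper: builds an absolute canonical link tag.
# Assumes page_path contains no characters that need HTML escaping.
def canonical_link_tag(base_url, page_path)
  absolute = URI.join(base_url, page_path).to_s
  %(<link rel="canonical" href="#{absolute}" />)
end

puts canonical_link_tag("https://blog.example.com", "/dresses/green-dresses-are-awesome")
```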

If you do what Google recommends, html-proofer reports an error for any post that has not yet been published on the live site, because the canonical URL does not exist there yet. Here is an example:

- ./_site/2015/08/31/getting-html-proofer-to-work-with-canonical-url-for-google-seo/index.html
  *  External link http://tongueroo.com/2015/08/31/getting-html-proofer-to-work-with-canonical-url-for-google-seo/ failed: 404 No error

Workaround Fix

There is a closed GitHub issue covering this canonical URL problem. Understandably, support will likely not be added for this edge case. Fortunately, there is a workaround: the href_swap option.

require 'html/proofer'

HTML::Proofer.new(
  "./_site",
  # rewrite links that point at our own domain so they are checked as local files
  href_swap: {
    'http://tongueroo.com' => '', # canonical link in head
  }
).run

The href_swap option swaps out the domain of the site, in my case tongueroo.com, and replaces it with a blank string. This makes html-proofer treat these links as relative links, so it only checks that the file exists instead of making an actual HTTP request.
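To see what the swap does to the canonical link from the error above, here is a small stand-alone sketch. The helper `apply_href_swap` is hypothetical and only mimics the substitution. It is not html-proofer's API or implementation.

```ruby
# Illustrative sketch only: `apply_href_swap` is a hypothetical helper.

# Apply each pattern => replacement pair to the link in turn.
def apply_href_swap(href, swaps)
  swaps.reduce(href) { |link, (pattern, replacement)| link.gsub(pattern, replacement) }
end

swaps = { "http://tongueroo.com" => "" }
canonical = "http://tongueroo.com/2015/08/31/getting-html-proofer-to-work-with-canonical-url-for-google-seo/"

swapped = apply_href_swap(canonical, swaps)
puts swapped
# prints "/2015/08/31/getting-html-proofer-to-work-with-canonical-url-for-google-seo/"
```

After the swap the link no longer contains the domain, so it is handled like any other internal link and matched against the files in ./_site.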