Testing the health of a website

Even though my notes are mostly static, it has happened more than once that the resulting document was malformed without me noticing it.

There are a lot of online tools for validating websites. Tools like the W3C Markup Validation Service or Unicorn, W3C’s Unified Validator, are the first to come to mind, but searching online turns up plenty of other websites offering similar services.

The main validations I’m interested in are

  • if the HTML documents are well-formed

  • if the linked pages are reachable

The first test is important as it ensures compatibility with conforming programs. The second helps to avoid errors like mislinked sources (or mistyped internal links), and to notice if some external resource is not available anymore (AFAIK there is no technique for verifying whether an external resource has changed), so that an alternate link can be found as early as possible.

Online testing is a problematic approach

However, there are multiple disadvantages to relying on external websites and services for verifying our work and daily tasks.

The service might disappear

The company hosting the website could decide at any moment to shut down the service (especially if free).

This is extremely problematic if we are not aware of other services with similar features; there is nothing we can do except search for something else.

The validation service might change behavior

This is problematic if we want reproducible results and have built a system where ignoring errors is hard. If from one day to the next the pages we are testing change their status from invalid to valid (or vice versa), we might have an issue.

The validation service might be temporarily down

It happens, for example, during maintenance.

The validation service might limit the number of tested pages

This is especially true for free services: if someone points them at a gigantic website, like for example Wikipedia, validating all pages and links for free would cost quite some resources.

The validation service needs to access the pages

Obviously.

This means that we have to publish the content somewhere, while I would normally like to test the changes for validity before publishing them. Some services make it possible to copy-paste the content or upload files, but such a task is hard to automate, and of course validating a page should be automated as much as possible.

Self-hosting a validation service

The W3C Markup Validator is open-source software, and is even already packaged for systems like Debian GNU/Linux.
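For instance, on Debian installing the package should be enough to get started (the package name w3c-markup-validator is an assumption, verify it with apt search before installing):

# install the packaged validator (package name assumed, check your release)
apt install w3c-markup-validator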

I’ve tried configuring it, but realized that it is too complex a piece of software for what I wanted to do.

Even when I got the service running:

  • I needed to open the webpage and point it to an address

  • It would not validate pages on localhost

  • It had the same restrictions as the public validator, so I could not recursively validate a complete domain

Tools from the command line

I’m not the first one who wants to validate HTML documents and check whether URLs are correct, so I searched for what packages are available in the Debian repositories.

Generally, command-line tools are easier to

  • automate

  • integrate with other tools

  • adapt
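For instance, such a tool can be dropped into a script or a git hook so that it runs without manual intervention; a minimal sketch (the hook location and the served address are assumptions, and linkchecker is described just below):

#!/bin/sh
# hypothetical .git/hooks/pre-push: abort the push if the locally served site has broken links
set -e
linkchecker 'http://localhost:1313'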

linkchecker

As the name implies, LinkChecker is a program for checking HTML documents and websites for broken links.

For example

linkchecker --no-robots 'https://example.com'

This would recursively parse the website https://example.com and search for links.

Links that do not return a 2xx status code, like 404 or 418, are reported to the user:

URL        `/non-existent/'
Name       `here'
Parent URL https://example.com/non-existent/
Real URL   https://example.com/non-existent/
Check time 3.996 seconds
Size       787B
Result     Error: 404 Not Found


URL        `https://fekir.info/teapot/'
Parent URL https://fekir.info/sitemap.xml
Real URL   https://fekir.info/teapot/
Check time 2.973 seconds
Size       776B
Result     Error: 418 I'm A Teapot

It also works on localhost and accepts files and directories as parameters, which makes it perfect for testing unpublished pages.

linkchecker 'http://localhost:1313'

linkchecker index.html
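It can also be pointed at a directory, for example the output directory of a static site generator (public/ is just a placeholder here):

linkchecker public/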

But read the manual, as it has a lot of other useful features (that I admit I’m not using), like skipping pages based on patterns, and support for passwords, encodings, cookies, user agents, and more.

In particular, use --check-extern if you also want to test external links, and pay attention not to cause too many requests to external domains.
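A possible example combining both hints (--ignore-url is assumed to take a regular expression; check linkchecker(1) for the exact syntax and for ways to throttle requests):

# also check external links, but skip a domain we do not want to hammer (example.org is a placeholder)
linkchecker --check-extern --ignore-url='^https://example\.org/' 'http://localhost:1313'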

tidy

tidy can validate HTML documents (it can also pretty-print and correct them, but it’s out of scope for my current use case).

Contrary to linkchecker, it does not accept a URL or file as a parameter and then recursively verify all reachable documents; it verifies only the files specified as parameters.

Fortunately

  • when developing, I normally already have the HTML files I want to push to the server

  • even if I do not have those files, it is possible to download them all with wget

Thus it is possible to use tidy both for testing local files and already published pages.

# download all pages
wget --recursive --no-verbose --level inf -- 'https://example.com';

# execute tidy for every HTML file and save the output to a corresponding report file
find . -name '*.html' -type f -exec tidy -errors -quiet -file {}.report.txt {} \;

# Remove empty reports
find . -name '*.html.report.txt' -type f -empty -delete;

# open all remaining reports
find . -name '*.html.report.txt' -type f -exec vim -p {} +;

A more interactive approach would be downloading a single page and validating it immediately

curl 'https://example.com' | tidy -errors -quiet

As a website does not consist only of HTML documents, but also of CSS, images, XML, audio and video files, and many other resources, it is possible, once the content is available offline, to use other programs, like csstidy, jpeginfo, mp3val, ffmpeg, and many others, to validate those resources:

# validate mp3 files
find . -name '*.mp3' -type f -exec mp3val -f {} +;

# validate jpg files
find . -name '*.jpg' -type f -exec jpeginfo -c {} +;


# validate video files, like mp4
find . -name '*.mp4' -type f -exec sh -c 'ffmpeg -v error -i "$1" -f null - 2> "$1.stderr"' sh {} \;
Note 📝
You might want to parallelize those operations for bigger websites.
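csstidy is mentioned above but not shown; a possible sketch, assuming the usual csstidy invocation with the input file as the first argument and an output path as the second (check the manual of your version):

# run csstidy on every CSS file; the messages it prints depend on version and options
find . -name '*.css' -type f -exec csstidy {} {}.tidied.css \;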

Just as with tidy, those tools can be used for verifying the content while building the website, as well as on existing webpages.

Obviously, the testing possibilities during development and on a finished page are different.

For example, if the page is generated dynamically server-side with something like PHP, with wget we can download one dynamically generated (and hopefully representative) page and validate it. While developing the PHP code, it is generally not possible to validate all possible generated pages, but there are other tools (linters, tests, …) for reducing the risk of producing something invalid.
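As a sketch (the index.php path is just a placeholder, not a real endpoint):

# download one dynamically generated page and validate it offline
wget --quiet --output-document=page.html -- 'https://example.com/index.php';
tidy -errors -quiet page.html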

If the content is generated dynamically client-side with, for example, JavaScript, my experience is more limited: since the JavaScript code can manipulate the HTML structure, one generally needs to execute it and validate the result at least after the execution.
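One possible approach is to let a headless browser execute the JavaScript and dump the resulting DOM, and then validate that output; a sketch, assuming a Chromium binary that supports --headless and --dump-dom:

# execute the client-side JavaScript in a headless browser, then validate the resulting DOM
chromium --headless --disable-gpu --dump-dom 'https://example.com' | tidy -errors -quiet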


Do you want to share your opinion? Or is there an error, or some part that is not clear enough?

You can contact me anytime.