Testing health of a website


4 - 6 minutes read, 1103 words
Categories: web
Keywords: css html mp3val php web

Even though my notes are mainly static, it has happened more than once that a resulting document was malformed without me noticing.

There are a lot of online tools for validating websites. The W3C Markup Validation Service and Unicorn, the W3C’s unified validator, are the first that come to mind, but a search turns up many other websites offering similar services.

The main validations I’m interested in are

  • if the HTML documents are well-formed

  • if the linked pages are reachable

The first check matters because well-formed documents are handled consistently by conforming programs. The second catches mistakes such as linking to the wrong source or mistyping an internal link, and it reveals when an external resource is no longer available, so that I can look for an alternative link as early as possible (AFAIK there is no technique for verifying whether an external resource has changed).

Online testing is a problematic approach

But there are multiple disadvantages to relying on external websites/services for verifying our work/daily tasks.

The service might disappear

The company hosting the website could decide at any moment to shut down the service (especially if free).

This makes it extremely problematic if we are not aware of other services with similar features, and there is nothing we can do except searching for something else.

The validation service might change behavior

This is problematic if we want reproducible results and have built a system where ignoring errors is hard. If the pages we are testing change status from invalid to valid (or vice versa) from one day to the next, we might have an issue.

The validation service might be temporarily down

It happens, for example for maintenance.

Limit on the number of tested pages

This applies especially to free services: if someone pointed one at a gigantic website such as Wikipedia, validating every page and link for free would consume considerable resources.

The validation service needs to access the pages

Obviously.

This means that we have to publish the content somewhere, whereas I would normally like to test changes for validity before publishing them. Some services make it possible to copy-paste the content or upload files, but such a task is hard to automate. And of course, validating a page should be automated as much as possible.

Self-hosting a validation service

The W3C Markup Validator is open-source software, and it is even already packaged for systems like Debian GNU/Linux.

I’ve tried configuring it but realized that it is a piece of software too complex for what I wanted to do.

Even when I got the service running:

  • I needed to open the webpage and point it to an address

  • It would not validate pages on localhost

  • It had the same restrictions as the public validator, so I could not recursively validate a complete domain

Tools from the command line

Obviously, I’m not the first one who wants to validate HTML documents and check that URLs are correct, so I searched for suitable packages in the Debian repository.

Generally, command-line tools are

  • easier to automate

  • easier to integrate with other tools

  • easier to adapt

linkchecker

As the name implies, LinkChecker is a program for checking HTML documents and websites for broken links.

For example

linkchecker --no-robots 'https://example.com'

This recursively parses the website https://example.com and searches for links.

Links that do not return a 2xx status code, such as 404 or 418, are reported to the user:

URL        `/non-existent/'
Name       `here'
Parent URL https://example.com/non-existent/
Real URL   https://example.com/non-existent/
Check time 3.996 seconds
Size       787B
Result     Error: 404 Not Found


URL        `https://fekir.info/teapot/'
Parent URL https://fekir.info/sitemap.xml
Real URL   https://fekir.info/teapot/
Check time 2.973 seconds
Size       776B
Result     Error: 418 I'm A Teapot

It also works on localhost and accepts files and directories as parameters, which makes it perfect for testing unpublished pages:

linkchecker 'http://localhost:1313'

linkchecker index.html

But read the manual, as it also has a lot of useful features (that I admit I’m not using), like skipping pages based on patterns, support for passwords, encodings, cookies, user-agent, …​
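Because linkchecker works on local files, it can also act as a gate before publishing. The following is only a minimal sketch: the function name `check_links` and the default directory `public` are assumptions for illustration, and it relies on linkchecker exiting with a non-zero status when it finds invalid links, as its manual describes.

```shell
# check_links DIR: run linkchecker starting from DIR/index.html.
# "check_links" and the default directory "public" are illustrative names.
check_links() {
    site_dir=${1:-public}
    # linkchecker exits with a non-zero status when invalid links were found
    if linkchecker "$site_dir/index.html"; then
        echo "links ok"
    else
        echo "broken links found, not publishing" >&2
        return 1
    fi
}
```

It could then be combined with whatever command publishes the site, for example `check_links public && echo "safe to publish"`.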

tidy

tidy can validate HTML documents (it can also pretty-print and correct them, but it’s out of scope for my current use-case).

Unlike linkchecker, it does not take a URL or a file as a starting point and then verify all documents recursively; it verifies only the documents specified as parameters.

Fortunately

  • if developing, I normally have the HTML files I want to push on the server

  • even if I do not have those files, it is possible to download them all with wget

Thus it is possible to use tidy both for testing locally and already public pages.

# download all pages
wget --recursive --no-verbose --level inf -- "https://example.com";

# execute tidy for every html file and save output to a corresponding report file
find . -name '*.html' -type f -exec tidy -errors -quiet -file {}.report.txt {} \;

# remove empty reports
find . -name '*.html.report.txt' -type f -empty -delete;

# open all remaining reports
find . -name '*.html.report.txt' -type f -exec nvim -p {} +;
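When the check should run unattended, for example before publishing, opening reports in an editor is less practical than a plain exit status. The sketch below is one possible variation, not a replacement for the commands above: it relies on tidy exiting with 0 when there is nothing to report, 1 when there are only warnings, and 2 when there are errors. The function name `validate_html` is an assumption for illustration, and file names containing newlines are not handled.

```shell
# validate_html DIR: run tidy on every *.html file below DIR (default: .).
# Prints the files that contain errors and returns 1 if there is at least one.
validate_html() {
    failed=0
    while IFS= read -r file; do
        [ -n "$file" ] || continue
        status=0
        tidy -errors -quiet "$file" >/dev/null 2>&1 || status=$?
        # tidy exits with 0 (no issue), 1 (warnings only) or 2 (errors)
        if [ "$status" -ge 2 ]; then
            echo "invalid: $file"
            failed=1
        fi
    done <<EOF
$(find "${1:-.}" -name '*.html' -type f)
EOF
    return "$failed"
}
```

Warnings are deliberately ignored here; changing the comparison to `-ge 1` would make them fatal too.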

A webpage does not consist only of HTML documents but also of CSS, images, XML, audio and video files, and many other resources. Once wget has made the website available offline, other programs such as csstidy, jpeginfo, mp3val, ffmpeg, and many others can be used in a similar way:

# validate mp3 files (note: -f also makes mp3val try to fix the errors it finds)
find . -name '*.mp3' -type f -exec mp3val -f {} +;

# validate jpg files
find . -name '*.jpg' -type f -exec jpeginfo -c {} +;


# validate video files, like mp4
find . -name '*.mp4' -type f -exec sh -c 'ffmpeg -v error -i "$1" -f null - 2> "$1.stderr"' sh {} \;

Note: you might want to parallelize those operations for bigger websites.
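One way to parallelize is to let xargs start several processes at once. This is a sketch under assumptions: the function name `check_parallel` and the `JOBS` variable are made up for illustration, and `-print0`, `-0`, `-r` and `-P` are extensions provided by GNU findutils rather than POSIX options.

```shell
# check_parallel PATTERN COMMAND [ARGS...]
# Run COMMAND on all files below the current directory whose name matches
# PATTERN, 8 files per invocation, with up to JOBS processes at a time
# (4 unless JOBS is set). Requires find -print0 and xargs -0 -r -P.
check_parallel() {
    pattern=$1
    shift
    find . -name "$pattern" -type f -print0 |
        xargs -0 -r -n 8 -P "${JOBS:-4}" "$@"
}
```

For example, `check_parallel '*.jpg' jpeginfo -c` would check the JPEG files in batches. The output of the different processes can be interleaved, so it is less readable than a sequential run.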

Just as with tidy, those tools can be used to verify the content while building the website as well as for already published webpages.

Obviously, during development and on a finished page the test possibilities are different.

For example, if the page is generated dynamically server-side with something like PHP, wget can download one dynamically generated (and hopefully representative) page, which can then be validated. While developing the PHP code, it is generally not possible to validate a priori every page it could generate, but other tools (linters, tests…​) reduce the risk of producing something invalid.
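Checking a single generated page does not require saving it first, since tidy reads standard input when no file is given. A minimal sketch, where the function name `check_page` is an assumption for illustration:

```shell
# check_page URL: download one page and let tidy report its problems.
# wget writes the page to standard output, tidy reads it from standard input.
# The exit status is the one of tidy; a failed download is not detected.
check_page() {
    wget --quiet --output-document=- -- "$1" |
        tidy -errors -quiet
}
```

Calling it for a few representative URLs, for example `check_page 'https://example.com/'`, gives at least a spot check of what the server-side code produces.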

If the content is generated dynamically client-side, for example with JavaScript, my experience is more limited. Since the JavaScript code can manipulate the HTML structure, one generally needs to execute it and validate the result afterwards.
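One possible approach, which is an assumption on my side and not something tested extensively, is to let a headless browser execute the scripts and print the resulting DOM. The sketch below assumes a Chromium build installed under the name `chromium`; the function name `check_rendered_page` is made up for illustration.

```shell
# check_rendered_page URL: validate the DOM as it looks after scripts ran.
# Assumes a Chromium build installed as "chromium"; --headless --dump-dom
# makes it print the serialized DOM to standard output.
check_rendered_page() {
    chromium --headless --dump-dom "$1" 2>/dev/null |
        tidy -errors -quiet
}
```

Keep in mind that the browser has already parsed and repaired the markup at this point, so this checks what the scripts produced rather than the HTML source that was served.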


Do you want to share your opinion? Or is there an error, or are some parts not clear enough?

You can contact me here.