Testing the health of a website
Even though my notes are mostly static, it has happened more than once that the resulting document was malformed without me noticing it.
There are a lot of online tools for validating websites. Tools like the W3C Markup Validation Service or Unicorn, W3C’s Unified Validator, are the first that come to mind, but searching online one can find lots of other websites offering similar services.
The main validations I’m interested in are:
- if the HTML documents are well-formed
- if the linked pages are reachable
The first test is important as it ensures compatibility between conforming programs. The second helps avoid errors like mislinked sources (or mistyped internal links) and shows when some external resource is not available anymore (AFAIK there is no technique for verifying whether an external resource has changed), so that an alternative link can be found as early as possible.
Online testing is a problematic approach
However, there are multiple disadvantages to relying on external websites and services for verifying our work and daily tasks.
The service might disappear
The company hosting the website could decide at any moment to shut down the service (especially if free).
This is extremely problematic if we are not aware of other services with similar features: there is nothing we can do except search for something else.
The validation service might change behavior
This is problematic if we want reproducible results and have built a system where ignoring errors is hard. If, from one day to the next, the pages we are testing change their status from invalid to valid (or vice versa), we might have an issue.
The service might limit the number of tested pages
This is especially true for free services: if someone points the validator at a gigantic website, like for example Wikipedia, it will cost considerable resources to validate all pages and links free of charge…
The validation service needs to access the pages
Obviously.
This means that we have to publish the content somewhere, while I would normally like to test changes for validity before publishing them. Some services provide the possibility to copy-paste the content or upload files, but such a task is hard to automate, and of course validating a page should be automated as much as possible.
Self-hosting a validation service
The W3C Markup Validator is open-source software, and it is even already packaged for systems like Debian.
I’ve tried configuring it, but realized that it is too complex a piece of software for what I wanted to do.
Even when I got the service running:
- I needed to open the webpage and point it to an address
- it would not validate pages on localhost
- it had the same restrictions as the public validator, so I could not recursively validate a complete domain
Tools from the command line
I’m not the first one who wants to validate HTML documents and check whether URLs are correct, so I’ve searched for what packages are available in the Debian repository.
Generally, command-line tools are easier to
- automate
- integrate with other tools
- adapt
linkchecker
As the name implies, LinkChecker is a program for checking HTML documents and websites for broken links.
For example
linkchecker --no-robots 'https://example.com'
would recursively parse the website https://example.com and search for links.
Links that do not return a 2xx status code, like 404 or 418, are reported to the user:
URL `/non-existent/'
Name `here'
Parent URL https://example.com/non-existent/
Real URL https://example.com/non-existent/
Check time 3.996 seconds
Size 787B
Result Error: 404 Not Found
URL `https://fekir.info/teapot/'
Parent URL https://fekir.info/sitemap.xml
Real URL https://fekir.info/teapot/
Check time 2.973 seconds
Size 776B
Result Error: 418 I'm A Teapot
It also works on localhost and accepts files and directories as parameters, which makes it perfect for testing unpublished pages:
linkchecker 'http://localhost:1313'
linkchecker index.html
But read the manual, as it also has a lot of useful features (that I admit I’m not using), like skipping pages based on patterns, and support for passwords, encodings, cookies, user agents, …
In particular, use --check-extern if you also want to test external links, and pay attention not to cause too many requests to external domains.
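For example, a run that also checks external links might look like the following sketch; if I remember correctly, --recursion-level limits the crawl depth and --ignore-url skips URLs matching a regular expression (the depth of 2 and the /drafts/ pattern are made-up values for illustration):
# also verify external links, limit the depth and skip URLs matching a pattern
linkchecker --no-robots --check-extern --recursion-level=2 --ignore-url='/drafts/' 'https://example.com'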
tidy
tidy can validate HTML documents (it can also pretty-print and correct them, but that is out of scope for my current use case).
Contrary to linkchecker, it does not accept a URL or file as a parameter and recursively verify all documents; it verifies only those specified as parameters.
Fortunately:
- when developing, I normally have the HTML files I want to push to the server
- even if I do not have those files, it is possible to download them all with wget
Thus it is possible to use tidy both for testing local files and already published pages.
# download all pages
wget --recursive --no-verbose --level inf -- 'https://example.com';
# execute tidy for every HTML file and save the output to a corresponding report file
find . -name '*.html' -type f -exec tidy -errors -quiet -file {}.report.txt {} \;
# Remove empty reports
find . -name '*.html.report.txt' -type f -empty -delete;
# open all remaining reports
find . -name '*.html.report.txt' -type f -exec vim -p {} +;
A more interactive approach would be downloading a single page and validating it immediately:
curl 'https://example.com' | tidy -errors -quiet
As a webpage does not only consist of HTML documents but also CSS, images, XML, audio and video files, and many other resources, it is possible, once the content is available offline, to use other programs, like csstidy, jpeginfo, mp3val, ffmpeg, and many others, to validate the resources:
# validate mp3 files
find . -name '*.mp3' -type f -exec mp3val -f {} +;
# validate jpg files
find . -name '*.jpg' -type f -exec jpeginfo -c {} +;
# validate video files, like mp4
find . -name '*.mp4' -type f -exec sh -c 'ffmpeg -v error -i "$1" -f null - 2> "$1.stderr"' sh {} \;
Note 📝: You might want to parallelize those operations for bigger websites.
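As a minimal sketch, assuming GNU xargs (or another implementation supporting the -P option) is available, the jpeg check could be run with several processes in parallel; the 4 parallel jobs are an arbitrary choice:
# validate jpg files, running up to 4 jpeginfo processes in parallel
find . -name '*.jpg' -type f -print0 | xargs -0 -n 16 -P 4 jpeginfo -c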
Just as with tidy, those tools can be used for verifying the content while building the website, as well as for existing webpages.
Obviously, the testing possibilities during development and on a finished page are different.
For example, if the page is generated dynamically server-side with something like PHP, with wget we can download one dynamically generated (and hopefully representative) page and validate it. While developing the PHP code, it is generally not possible to validate all possible generated pages, but there are other tools (linters, tests, …) for reducing the risk of producing something invalid.
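As a sketch of this approach, assuming the generated page is served at the hypothetical address https://example.com/index.php:
# download one generated page and validate the resulting HTML
wget --quiet -O - 'https://example.com/index.php' | tidy -errors -quiet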
If the content is generated dynamically client-side with, for example, javascript, my experience is more limited: since the javascript code can manipulate the HTML structure, one generally needs to execute it and validate the document at least after the execution.
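One possible sketch is to let a headless browser execute the javascript and validate the resulting DOM; this assumes a headless-capable Chromium is installed (the binary might be called chromium, chromium-browser or google-chrome depending on the system):
# execute the javascript, dump the resulting DOM and validate it
chromium --headless --dump-dom 'https://example.com' | tidy -errors -quiet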
Do you want to share your opinion? Or did you find an error, or some parts that are not clear enough?
You can contact me anytime.