Verify the size of a webpage


I wanted to verify how big most of the web pages I am hosting are.

Not the single HTML file, not the single style sheet, but the whole page. This requires taking into account images and any other resources loaded by the browser.

With wget and linkchecker (which is very practical for testing all internal and external links), it is trivial to verify every single page.

First, thanks to linkchecker it is possible to collect all links:

linkchecker 'https://example.com' --output=csv --verbose --no-warnings --no-status > links.txt

After that, use grep to filter only the pages we want to test, and, if needed, sed to normalize them, for example to remove the trailing / or to replace the / with another character:

awk -F ';' '{ print $8 }' links.txt | grep post | sed -e 's|/$||' > posts.txt # keep only the posts and remove the trailing /
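
Since the same page can be linked from several places, the same URL may appear more than once in posts.txt; a sort -u pass deduplicates the list before downloading:

sort -u posts.txt -o posts.txt # remove duplicate URLs in place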

Then, create a folder for every page and download all the data necessary for viewing it with wget and the parameter --page-requisites:

while read -r link; do
  newdir=${link##*/}    # use the last part of the URL as the folder name
  mkdir "$newdir"
  (cd "$newdir" && wget -e robots=off --page-requisites "$link")
done < posts.txt

After that, with du or ncdu, it is trivial to find the biggest pages and files.
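
For example, with du one could list the biggest page folders directly (a minimal sketch; the -h flag of sort assumes GNU coreutils or a recent BSD sort):

du -sh -- */ | sort -rh | head -n 10 # ten biggest page folders, largest first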

Note that this method completely ignores the fact that the server can send compressed data. This is a feature: compression only helps when the browser supports the same compression methods as the server, so the uncompressed size is what a client without that support will actually download.

It would not be the first time that a webpage loads fast only on certain versions of certain browsers, because those have implemented a new compression method and the others have not.

Note 📝
This method does not work if your website uses JavaScript for loading content on the fly (which should be discouraged, especially on flaky/unstable/capped connections). The browser's developer tools will give you the right answer, but measuring all pages one by one might not be a good plan.

Get the compressed size of pages with wget

In case one wants to verify how big a compressed webpage is, it is sufficient to add the required header, for example --header='Accept-Encoding: gzip', to the wget command:

while read -r link; do
  newdir=${link##*/}    # use the last part of the URL as the folder name
  mkdir "$newdir"
  (cd "$newdir" && wget -e robots=off --header='Accept-Encoding: gzip' --page-requisites "$link")
done < posts.txt

The downloaded files will be the compressed resources; wget will not decompress them.

This means, for example, that index.html won't be an HTML file, but gzip-compressed data.
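
A quick way to confirm this is the file utility (assuming it is available):

file index.html # expected to report something like "gzip compressed data" instead of "HTML document"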

The easiest way to decompress it is to rename the file and use gzip:

mv index.html index.html.gz
gzip -d index.html.gz
# index.html is now the decompressed file
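
If many pages were downloaded this way, the same rename-and-decompress step can be applied in one pass; a minimal sketch, assuming every downloaded index.html really is gzip data:

find . -type f -name 'index.html' | while read -r f; do
  mv "$f" "$f.gz"
  gzip -d "$f.gz" # recreates index.html, now decompressed
done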

As wget will not decompress the pages, it cannot parse the HTML and therefore will not be able to download all the resources necessary for displaying the page correctly.


Do you want to share your opinion? Or have you found an error, or something that is not clear enough?

You can contact me anytime.