Verify the size of a webpage
I wanted to verify how big most of the web pages I am hosting are.
Not the single HTML file, not the single style sheet, but the whole page. This requires taking into account images and any other resources loaded by the browser.
With wget and linkchecker (which is very practical for testing all internal and external links), it is trivial to verify every single page.
First, thanks to linkchecker it is possible to collect all links:
linkchecker 'https://example.com' --output=csv --verbose --no-warnings --no-status > links.txt
After that, use grep to filter only the pages we want to test, and possibly sed to normalize them, for example to remove the trailing /, or to replace the / with another character:
awk -F ';' '{ print $8 }' links.txt | grep post | sed -e 's|/$||' > posts.txt # keep only the posts and strip the trailing /
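The column that holds the checked URL may differ between linkchecker versions, so $8 above is an assumption; one way to check which semicolon-separated field is which is to number the fields of the header row (assuming the CSV output starts with comment lines followed by a header row):
grep -v '^#' links.txt | head -n 1 | tr ';' '\n' | cat -n # print the column names with their index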
Then, create a folder for every page, and download all the data necessary for viewing the page with wget and the parameter --page-requisites:
while read -r link; do
    newdir=${link##*/};
    mkdir "$newdir";
    (cd "$newdir" && wget -e robots=off --page-requisites "$link");
done < posts.txt
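Since the folder name is only the last path segment of the URL, two pages ending with the same segment would end up in the same directory. A quick sanity check, just a sketch:
sed 's|.*/||' posts.txt | sort | uniq -d # prints duplicated folder names, the output should be empty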
After that, with du or ncdu, it is trivial to find the biggest pages and files.
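For example, with a sort that supports human-readable sizes, a one-liner that lists the heaviest pages first could look like this:
du -sh -- */ | sort -rh | head -n 20 # the 20 biggest page folders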
Note that this method completely ignores the fact that the server can send compressed data. This is a feature rather than a limitation: not every browser supports the same compression methods, so the uncompressed size is the safest baseline.
It wouldn’t be the first time that a webpage loads fast on some versions of some browsers because they’ve implemented a new compression method and the others have not.
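To see which compression method a server actually negotiates for a given page, one possibility is to look at the response headers, for example with curl (the Accept-Encoding values are only an example, and some servers answer HEAD requests differently from GET, so take the result with a grain of salt):
curl -sI -H 'Accept-Encoding: gzip, br' 'https://example.com' | grep -i '^content-encoding'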
Note 📝: This method does not work if your website uses JavaScript for loading content on the fly (which should be discouraged, especially on flaky, unstable, or capped connections). The developer tools of the browser will give you the right answer, but measuring all pages one by one might not be a good plan.
Get the compressed size of pages with wget
In case one wants to verify how big a compressed webpage is, it is sufficient to add the required header, for example --header='Accept-Encoding: gzip', to the wget command:
while read -r link; do
    newdir=${link##*/};
    mkdir "$newdir";
    (cd "$newdir" && wget -e robots=off --header='Accept-Encoding: gzip' --page-requisites "$link");
done < posts.txt
The downloaded files will be the compressed resources; wget will not decompress them. This means, for example, that index.html won't be an HTML file, but gzip-compressed data.
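A quick way to verify what was actually downloaded is the file utility:
file index.html # should report something like "gzip compressed data" instead of "HTML document"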
The easiest way to decompress it is to rename the file and use gzip:
mv index.html index.html.gz
gzip -d index.html.gz
# index.html is now the decompressed file
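Alternatively, gzip -dc (also available as zcat) writes the decompressed data to standard output and does not care about the file suffix, so the rename can be skipped (index.decoded.html is just an example name):
gzip -dc index.html > index.decoded.html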
As wget will not decompress the pages, it will not be able to download all the resources necessary for displaying the page correctly.
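If only the compressed size of the HTML itself is of interest, a rough sketch is to request every page with the compression header and count the bytes, without saving the page requisites at all (note that the server is free to ignore the header and send uncompressed data):
while read -r link; do
    printf '%s\t%s\n' "$(curl -s -H 'Accept-Encoding: gzip' "$link" | wc -c)" "$link";
done < posts.txt | sort -rn # compressed bytes of the HTML only, biggest page first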
Do you want to share your opinion? Or is there an error, or some parts that are not clear enough?
You can contact me anytime.