Backup webpage
It happened again. One of the many web pages saved in my bookmarks "died".
The page still exists but is empty; there is not even a 404 error.
I knew the content that was there, so I searched for it and found it again on the same domain, but at a different location. Why did they not set up a redirect from the old page to the new one?
The page was not only moved, it was also slightly modified; it had at least a different styling. All internal links were broken, and the images were missing too.
Fortunately, the old page had been saved multiple times on the Internet Archive. I’ve now pointed my bookmark at the archived copy of the old page; the new page, without images and with broken links, is nearly useless.
But what should I do?
I prefer my bookmarks to point to the original resource, as it has some added value.
On the same website, there might be other content worth exploring.
During a discussion, when I’m trying to make a point, I can at least show that my idea is backed up by some data not created by me (assuming the source is trustworthy). So if you do not trust me, just click on the link and read the reference I gave you (and possibly search for other sources; you should not trust unknown people on the internet…).
This also holds for me. Sometimes I get some facts wrong, but being able to double- and triple-check the original resource makes me aware of my errors.
Thus pointing to the original resource should be the way to go, but it is error-prone and resource-intensive.
It is error-prone as:
- some URLs contain tracking elements or other metainformation, so I have to clean them up first (see the sketch after this list)
- in case of redirects, I need to choose between different URLs pointing to the same resource: do I save the most readable one, or the one that is not a redirect? Which one lasts longer?
- if the content appears on a much bigger page, I need to verify whether there is an internal anchor that points directly to the relevant piece
and so on. And what to do if the content appears on a page with infinite scroll, or on a dynamic page whose content changes periodically?
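Cleaning up tracking elements can at least be partially automated. Here is a minimal sketch in Python; which query parameters count as "tracking" is an assumption on my part (the usual utm_* ones plus a small hand-picked blocklist), and the example URL is a placeholder.
# Minimal sketch: strip common tracking parameters from a URL before bookmarking it.
# The set of "tracking" parameters is an assumption (utm_*, fbclid, gclid, ...).
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PARAMS = {"fbclid", "gclid", "igshid", "mc_cid", "mc_eid"}

def clean_url(url: str) -> str:
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS and not key.startswith("utm_")
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))

print(clean_url("https://example.com/page.html?utm_source=feed&id=42"))
# prints https://example.com/page.html?id=42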
It is resource-intensive, as one needs to check not only that the page is still available, but also that the content is still there!
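This check can also be scripted, at least partially. A minimal sketch, assuming that a known phrase from the bookmarked content is enough to decide whether it is still there; both the URL and the phrase below are placeholders.
# Minimal sketch: check that a bookmarked page still responds and still
# contains a known phrase. The URL and the phrase are placeholders.
import urllib.request

def content_still_there(url: str, expected_phrase: str) -> bool:
    request = urllib.request.Request(url, headers={"User-Agent": "bookmark-checker"})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            body = response.read().decode("utf-8", errors="replace")
    except OSError:
        # network errors and HTTP errors (404, 500, ...) end up here
        return False
    return expected_phrase in body

print(content_still_there("https://example.com/page.html", "a phrase I bookmarked the page for"))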
Creating a screenshot or saving the webpage locally would be so much easier…
But simply replacing a link to an external resource with an image or a copy has some issues too.
First, can I host the copied content? Even assuming that the content can be copied and hosted, how can a third party be sure that I did not change the content to my liking?
Granted, as long as the original content is still online, the two versions can be compared; but depending on how the copy was saved and whether there were other changes, that is not necessarily a trivial task.
Using a third-party provider (like the Internet Archive) should at least ensure that a webpage has been saved as-is.
But there is also the risk that such a third-party provider disappears, or is unavailable 🗄️.
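If you want to push a bookmark to such a provider proactively, the Internet Archive exposes a "Save Page Now" endpoint that can be triggered with a plain HTTP request. A minimal sketch; the target URL is a placeholder, and the endpoint may rate-limit requests or change its behaviour.
# Minimal sketch: ask the Wayback Machine to capture a page via its
# "Save Page Now" endpoint. The target URL is a placeholder; the service
# may rate-limit requests or require authentication for heavy use.
import urllib.request

def save_to_wayback(url: str) -> str:
    request = urllib.request.Request(
        "https://web.archive.org/save/" + url,
        headers={"User-Agent": "bookmark-archiver"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        # return the URL we ended up on after any redirects
        return response.url

print(save_to_wayback("https://example.com/page.html"))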
Considering how volatile most websites are these days, I’ll save webpages offline more often, at least for personal use.
How to make an offline copy
There are multiple ways and multiple formats for storing an offline copy.
Download the resources
The simplest one is to visit a webpage and download it, together with all its resources, locally.
This can be accomplished with wget and other programs:
wget --recursive -np -k -p https://example.com/page.html
The parameters, all described in man wget, are the following:
- --recursive to recurse into links, retrieving those pages too (the default maximum depth is 5; it can be changed with -l)
- -np to never ascend into a parent directory (so that, for example, following a "home" link does not end up mirroring the whole site)
- -k to convert links so that they point to the local copy
- -p to get page requisites, for example stylesheets
All resources will be saved unmodified as separate files on the hard drive.
mhtml, warc, …
While wget works well enough for many use cases, having a bunch of files instead of a single document has its disadvantages, especially if you want to move the copy around.
MHTML, MAFF and webarchive are some of the file formats for storing a webpage offline in a single file.
warc is probably the format to be preferred when archiving a webpage.
Since newer versions[1] of wget support warc out of the box, the barrier to entry for creating such archives should be pretty low.
# save a single page in warc
wget "https://example.com/page.html" --warc-file="example-page"
# save multiple pages in warc
wget "https://example.com/" --mirror --warc-file="example"
The main issue with all these formats is that none of them is supported by all major browsers (Firefox, browsers based on Chrome, and Safari), thus a third-party program is generally necessary.
Single HTML file
SingleFile proposes an alternative approach: download a webpage and embed everything into a single HTML file.
The main advantage over the other archive formats is that browsers support HTML files… it is thus compatible with all browsers out of the box.
Compared to downloading all files with wget, the main advantage is that, as long as one wants to save a single page, it creates only one file.
The extension also does some additional cleanups. For example, it removes the dynamic parts; otherwise, a webpage could still depend on resources loaded at a later time. Without JavaScript, we know that the offline copy is truly offline.
wget, on the other hand, would download the JavaScript code that the browser would execute every time the copy is opened, so the page might still depend on external resources.
Thus SingleFile creates, by default, a static page, even for dynamic ones.
This has multiple advantages, but it is not ideal for some pages with dynamic content that could still be saved offline.
Custom solutions for custom websites
For some websites, there are other alternatives for managing offline copies.
Wikipedia, for example, offers standalone copies to download.
The Kiwix Reader has a built-in download manager that supports not only Wikipedia, but other wikis too.
Zeal is more focused on documentation. It supports, out of the box, pages like cppreference, the Python documentation, the PHP Manual, the Oracle Java docs and many others.
Conclusion
SingleFile is probably the easiest way to go for most bookmarks.
Having an offline copy means I can look at the resource whenever I want.
For most use cases, I am not even interested in preserving/archiving the page as-is. If there are some differences, that’s fine for me, as long as the content is there and it more or less looks like the original page (otherwise I would just copy the textual content).
Do you want to share your opinion? Or is there an error, or some part that is not clear enough?
You can contact me anytime.