The sad state of the web

Like many people, I tend to visit a lot of websites.

Sometimes I do not have enough time to read the whole article. Sometimes I simply want to read it at another moment, or when I actually need the described information.

Most of the time I save the page as a bookmark in my browser. Other times I send the link via email to share it with other people, or I add it to my notes, like on this website.

This makes it easier to access the information again, especially if it was not easy to find in the first place.

Sometimes years later, other times the very next day, the resource I am or was interested in is not available anymore!

And this is not even the main issue I have with the web as of today!

Everything is lazy, nothing is validated

Invalid HTML

Most webpages are not valid HTML. Many use browser-specific extensions, and in most cases the HTML is malformed.

Wikipedia, for example, is one of the most visited websites worldwide; this is the result of running the Nu Html Checker on its main page: 11 errors and 12 warnings.

And this is not even that bad: consider the Netflix page, which does not have nearly as much content but has more than 100 errors. The same holds for the Microsoft landing page.
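Incidentally, checking a page does not require the web form: the Nu Html Checker also exposes a JSON output mode. The sketch below (assuming the public instance at validator.w3.org is reachable and that its JSON format has not changed; for anything serious you would run the checker locally) counts the errors it reports for a given page.

    import json
    import urllib.request

    def count_validation_errors(page_url: str) -> int:
        # Download the page, then post its HTML to the checker's JSON endpoint.
        html = urllib.request.urlopen(page_url).read()
        req = urllib.request.Request(
            "https://validator.w3.org/nu/?out=json",
            data=html,
            headers={
                "Content-Type": "text/html; charset=utf-8",
                "User-Agent": "validation-sketch",
            },
        )
        with urllib.request.urlopen(req) as resp:
            messages = json.load(resp).get("messages", [])
        # The checker reports both errors and warnings; count only the errors.
        return sum(1 for m in messages if m.get("type") == "error")

    print(count_validation_errors("https://en.wikipedia.org/"))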

For most people, even developers, as long as the content is shown as they would expect it to be rendered, this is not an issue.

This poses a major burden on everything that wants to interact with HTML documents. Not only browsers, which contain God knows how many quirks to ensure compatibility with the majority of websites, but also other tools, like parsers that consume HTML and produce some output based on it.

Generating HTML is not always easy either. Better said, it is not easy to ensure that the generated HTML is correct. Most programs in the toolchain are not as strict as they should be, and when composing HTML snippets, most tools simply paste strings together.

As long as the generated HTML "works" as intended, there is little incentive to improve the tooling.

This leads to a lot of web pages that do not work correctly on all browsers. Most of the time, there are only minor quirks, but sometimes there are bigger issues.

Some websites will even ask the user to switch to another browser by querying the user agent, or show alternate content, instead of improving the website.

Note 📝
One might argue that some browsers do not comply with the HTML standard and thus it is necessary to create invalid HTML pages. This is not true anymore: as of today most browsers are based on either Firefox, Chromium, or Safari, and all of them can parse HTML correctly. There are some differences in how browsers render content based on CSS rules, especially if one takes Internet Explorer into account, but most validation errors are not there because they are required for supporting a particular browser.

External resources

Many websites depend on external content. Prominent examples are fonts, CSS files, advertisement banners, and JavaScript frameworks.

Because validating the content of a website is of so little practical use, and testing websites interactively is not that easy, they all tend to break in minor ways.

If you open the developer tools on your browser (F12 on Firefox and Chrome-based browsers), you might notice some requests return 404 even if the webpage seems to look fine.

Another unwanted behavior is that external resources might download more slowly.

For example, it recently happened that loading a webpage was really slow. After opening the developer tools, I noticed that the browser had already completed all requests except the one for https://pro.fontawesome.com/releases/v5.15.2/css/all.css. After some time I decided to block the URL with the developer tools and refresh the page. The page loaded instantly, no functionality was broken, and I could not detect any element out of place. This is not me blaming the Font Awesome project; I am sure that normally the content of https://pro.fontawesome.com/releases/ is downloaded fast enough. I am more irritated by the fact that this and similar issues could be prevented if the author of the website simply copied the CSS file to their own domain once.
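Copying the file is a one-off operation. Something like the sketch below is all it takes (the local path is just illustrative, and Font Awesome's CSS also references webfont files that would need the same treatment), after which the page can reference the local copy instead of the third-party host.

    import pathlib
    import urllib.request

    # Hypothetical local path; the page would then use
    # <link rel="stylesheet" href="/static/fontawesome.css">.
    CSS_URL = "https://pro.fontawesome.com/releases/v5.15.2/css/all.css"
    local_copy = pathlib.Path("static/fontawesome.css")

    local_copy.parent.mkdir(parents=True, exist_ok=True)
    local_copy.write_bytes(urllib.request.urlopen(CSS_URL).read())
    print(f"Saved {CSS_URL} to {local_copy}")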

But most importantly, external resources can change behavior. What can be done if something like that happens? Is there any chance to get the old behavior back?

For example, people noticed that Twitter scripts changed their behavior and now hide tweets that have been deleted on Twitter.

In this case, restoring the expected behavior is trivial, as the tweeted content is copied onto the website. Just removing the snippet of JavaScript will show the quoted content. But normally it is not as easy, because one does not have a copy of the linked resource.

It’s important to understand why so many website owners chose such a fragile method for quoting tweets in the first place.

Was it not clear they were loading and executing external code, code that they do not control?

Is it unexpected that external resources can change? Are most people not experiencing (or noticing) bitrot when visiting websites?

Or were those issues known, but the authors decided that the risk was so low that it should not bother them?

Cool URIs do not change

One of the main reasons I am writing this is that last week I noticed that I was not able to reach http://gcc.1065356.n8.nabble.com/Global-variable-in-static-library-double-free-or-corruption-error-td657603.html anymore. Fortunately, I was able to find the same information on another website (https://gcc.gnu.org/legacy-ml/gcc-help/2010-10/msg00255.html), even if the first link had a better representation of the data.

In this particular case, the domain http://gcc.1065356.n8.nabble.com is not reachable anymore: if you try to run wget http://gcc.1065356.n8.nabble.com, you currently get wget: unable to resolve host address 'gcc.1065356.n8.nabble.com' as output.

In all other cases, when I notice that a link is not valid anymore, it’s because the owner of the website decided to completely remove the content or to move it somewhere else. The W3C wrote, back in 1998, that Cool URIs do not change.

The article shows some practices that can be used to minimize the need to change URIs, and also explains why one should avoid changing them.

The TLDR would be that if you need to change a URI, you should at least provide a redirect to the new page. Letting the old URL respond with an HTTP 404 is problematic because if I, as an end user, have a link to that page in my bookmarks (or in some email or text document), the first time I visit it I will have no idea where the content has been moved to.

The worst-case scenario is that instead of getting a redirect or a page not found, I still get a valid page (thus an HTTP 200 code), but the content is not the previous one, because I landed on a completely different page.

In practice, most website authors do not have cool URIs, and it happens far too often that I do not get a 404 response for something that does not exist anymore.

If a website returns a 404 (or another error, for that matter), I at least have the possibility of checking programmatically whether a URL is still valid, or of recognizing that the content is missing, even if I do not know what content I should be looking at.

I use linkchecker periodically to check if the links on my websites and bookmarks are still valid. But if the page returns a 200, I won’t notice that the resource I linked to does not exist anymore.
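What I would like is a check for exactly this kind of "soft 404". A rough sketch of the idea, using only the Python standard library: a bookmark is flagged when it returns an error status, and also when it silently ends up at a different address than the one I saved. The heuristic is crude, and many legitimate redirects would trip it, but it catches the "generic landing page" case.

    import urllib.error
    import urllib.request

    def check_bookmark(url: str) -> str:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                final_url = resp.geturl()  # URL after following redirects
                if final_url.rstrip("/") != url.rstrip("/"):
                    return f"MOVED? now ends up at {final_url}"
                return f"OK ({resp.status})"
        except urllib.error.HTTPError as e:
            return f"BROKEN ({e.code})"
        except OSError as e:  # DNS failures, timeouts, refused connections, ...
            return f"UNREACHABLE ({e})"

    # Example with a link from this article.
    print(check_bookmark(
        "https://gcc.gnu.org/legacy-ml/gcc-help/2010-10/msg00255.html"))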

The biggest offender is probably the Microsoft website.

For example, in 2009 Microsoft released EMET, a mitigation toolkit for Windows (as far as I remember, everything EMET provided has been integrated and enhanced in Windows 10), and http://www.microsoft.com/downloads/en/details.aspx?FamilyID=c6f0a6ee-05ac-4eb6-acd0-362559fd2f04 was the webpage from which to download it.

I find those URLs especially problematic.

First of all, they are not user-friendly. It is not possible to tell where such a link points; if I had not written down where it pointed to, I would have no clue. I can only see that it is from Microsoft and that it points to a download page.

If you follow that link, you’ll get an HTTP 301 (moved permanently) as a response, and be redirected to /en-us/download/details.aspx. This site responds with HTTP 302 (found) and https://www.microsoft.com/en-us/download/ as the new location.

Thus if you saved that link in your bookmarks without an appropriate description (as I do far too often for things to look at later), you’ll wonder why you saved it, because the browser will load a generic download site. At least if it returned a 404, you would know that the content is not there anymore.

The site currently has "Top download categories" as the title, which includes Windows, Office, Browsers, Developer Tools, and so on. There is nothing about "EMET".

Even the link saved on Wikipedia as the official website, http://microsoft.com/emet, simply redirects to https://www.bing.com/?linkid=509835.

I am currently not able to find any official page from which to download EMET, only the documentation: https://www.microsoft.com/en-us/download/details.aspx?id=48240.

Again, this is such a user-unfriendly link that I expect it to redirect to some generic download page in the future.

One might argue that since EMET is a discontinued product (since 2018), Microsoft has no reason to "spend resources" to keep those links alive. I would, in fact, not expect to find active links for Windows 2000. But if I had some old link, I would expect it to return 404 instead of redirecting to some generic or unrelated page.

Also, the Win32 API documentation has the same issue, and it is still currently supported, maintained, and updated. This is why I generally stopped saving links to any Microsoft website, and it’s a shame.

The first reason is that it is impossible to know what a link is supposed to point to: the URL is not descriptive, just some inscrutable number, and the content is going to be moved somewhere else, while the current link will point to some generic page.

Maintaining old URIs, especially if they are generated dynamically, as seems to be the case with Microsoft, might not be easy. It is just curious that Windows is an operating system that gives a lot of guarantees about backward compatibility, while the Microsoft website follows a completely different policy.

I have to admit that I do not pay anything for the resources provided on the Microsoft website (or most other websites, for that matter), the content is provided as-is, and expecting an "old" URI to keep working is probably asking too much.

Some websites have a different policy; for example, WordPress.com, a blogging platform, keeps the content alive even if the author does not maintain the blog anymore. If the content is deleted, it prevents the subdomain that used to be owned by someone else from being reused.

But even if it were possible to convince all website owners to care about their links, there are situations where it is not realistic to expect a website to maintain backward compatibility.

For example, when the owner of a domain changes.

This is also the case for bigger companies when one is acquired by another, or simply when someone does not pay for the domain name anymore.

There is also a lot of relevant information on independent websites, websites of smaller companies, and content written by independent people on bigger websites, like blogging platforms and social media. If a domain is bought by someone else, even if the new owner could host the old content as-is (because, for example, it is a simple static page), why should they?

In this case, whose fault is it?

Generally, no one’s. Again, I do not pay anything for most websites I visit, the content is provided as-is, and if the owner of a website stops maintaining it (because they lost interest, can no longer afford the domain, or something else), why should someone else, who does not find the content interesting and wants to reuse those URLs, maintain it?

Another case is if the webpage hosts content generated by its users.

Consider a provider like GitHub, GitLab, Gitea, and so on. What should the provider do if a user decides to delete a repository or completely change its content? Should such an operation be prohibited?

Generally, no: some users might have submitted sensitive data or malware (as in the case of windowtoolbox), and there are also rules like the GDPR and so on.

It is also hard to detect whether someone "completely changed" the content or only made minor corrections. One might argue, for example, that whitespace-only changes can be ignored: something like a project-wide formatting operation would not count as "completely changing the content" if, for example, only the indentation is changed. But if you are writing in Whitespace, whitespace is significant.

And what if one changes the encoding? Converting from UTF-16 to UTF-8 changes the byte representation of every single character, even if the content does not change.

Detecting and validating such changes automatically is hard.

Another example, less common, is when a top-level domain expires, or when the requirements for owning a domain change. If you have, for example, a .cs, .an, or .yu link in your bookmarks, you can bet it is not valid anymore. .su domains might suffer a similar fate, and in general this happens every time a country changes its name or ceases to exist.

Seven million .ga domains will also change ownership during 2023.

Content quality

This might be only my impression, but it seems that it is getting harder every day to find quality content.

I remember reading some good articles that were easy to find (and I did not save a link to those), and today I am unable to find them again.

Of course, in recent years it has become easier and easier to put something online. Forums, social media, website aggregators, sharing buttons, … all those technologies give everyone the possibility to upload content from different devices with ease and for free.

Thus there is much more information online, but also a lot of duplicates, and I find the quality often disappointing.

Then there are websites whose sole role seems to be cloning other websites, thus leading to a lot of low-quality content.

SEO

SEO stands for "Search Engine Optimization", and many times it seems to be more important than the posted content, because if no one can find your page, then the content is worthless.

Most SEO guides explain how important it is to measure X and Y with frameworks from Google, Bing, or other providers.

How many keywords and descriptions there should be on every page, and that, even though HTML documents already embed metadata, you should also add metadata for the Open Graph protocol and JSON-LD, which duplicates most of the already provided metadata.

Speed is of extreme importance, as it can affect the user experience (but, most importantly, the ranking of the website), thus everything that is slow should be loaded after the page is rendered.

I’m surely out of touch with reality when it comes to maintaining a website, but… "Content is King", or at least it should be.

Search engines do not scrape only for metadata, and if they do, they use crappy bots and should know better.

Bots can access the whole content of a webpage, and search engines can (and should) use the words of the article to promote the results of a webpage.

Note 📝
Some bots, rightfully, might not fully parse a webpage but only a subset, like the first megabyte, or only what it can parse in a given timeframe. This might be the reason why speed is important for ranking on some search engines, but the correct solution is to make slimmer websites and avoid loading unnecessary content, not to load some content asynchronously.

In some cases, the content could be unavailable to Bots, for example, because it is loaded dynamically with JavaScript.

An example would be an online office suite, where documents are loaded into an editor with JavaScript. A possible solution would be to automatically create a static, read-only page with the content in plain HTML and provide a link that opens the editor.

But in most cases, this is not necessary. Excluding web apps, most web pages show static content for reading.

In the case of something that is not text, like Video and Audio, adding relevant metadata is of course more important.

In case the content is not available to bots, for example, because the content is supposed to be readable to users only after logging in or paying a subscription, then of course keywords and description play a different role.

If the reason is that the content is not available because it is loaded unnecessarily dynamically, then the first thing to fix would be to make the content more accessible.

For all other cases, I do not expect them to make a big difference. Actually, one could (and probably should) programmatically create the <meta name="keywords">, <meta property="og:article:tag">, and <script type="application/ld+json"> elements by parsing the content of the site and add them to <head>, or at least generate them from a common source.
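As a rough sketch of what "generate them from a common source" could look like: one description of the article, from which the duplicated <meta>, Open Graph, and JSON-LD snippets are produced. The article data below is made up, and real code should of course HTML-escape the values.

    import json

    article = {
        "title": "The sad state of the web",
        "description": "Notes about broken links, invalid HTML, and bloat.",
        "tags": ["web", "html", "urls"],
    }

    def head_metadata(a: dict) -> str:
        # The same information, expressed three times for three consumers.
        json_ld = {
            "@context": "https://schema.org",
            "@type": "Article",
            "headline": a["title"],
            "description": a["description"],
            "keywords": ", ".join(a["tags"]),
        }
        lines = [
            f'<meta name="description" content="{a["description"]}">',
            f'<meta name="keywords" content="{", ".join(a["tags"])}">',
            f'<meta property="og:title" content="{a["title"]}">',
            f'<meta property="og:description" content="{a["description"]}">',
        ]
        lines += [f'<meta property="og:article:tag" content="{t}">' for t in a["tags"]]
        lines.append(
            f'<script type="application/ld+json">{json.dumps(json_ld)}</script>')
        return "\n".join(lines)

    print(head_metadata(article))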

As already written, most of this metadata should not be needed, but some platforms/companies (Google 🗄️, Facebook 🗄️, …) promote it, so it is a sensible choice to add it to a website; it just should not be the focus when writing a new page.

Website Obesity

Because of "reasons", websites are getting bigger and bigger, and probably contain less and less content.

Other people have already written about it, and no one is happy about that, but it is (apparently) not sufficient to change this trend.

Offenders are unnecessary Banners/Hero images, custom fonts, and all other things that have nothing to do with the content of a webpage.

If possible, the situation got even worse. More than once I landed on a page with some content I was interested in, only to find out that the article had been artificially split over multiple pages.

This practice forces end-users to click on "Next Page" or other links and load a lot of unnecessary data (mainly ads).

Because of the GDPR, many websites also load an additional JavaScript banner for requesting permissions to use tracking technologies, sometimes without the option to simply opt-out.

Note 📝
I believe the GDPR is a step in the right direction; I just wish that it had been implemented through DNT, instead of having every website implement a dialog (which is not strictly necessary, but whatever). A browser plugin that handles those dialogs is far from perfect, as it will not work on every website, even if it gets the job done most of the time.
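Honoring DNT server-side would have been trivial. A wishful-thinking sketch, as a tiny WSGI application (the analytics snippet is hypothetical), that only serves tracking code when the visitor has not sent DNT: 1:

    from wsgiref.simple_server import make_server

    def app(environ, start_response):
        # The DNT request header shows up as HTTP_DNT in the WSGI environ.
        do_not_track = environ.get("HTTP_DNT") == "1"
        body = "<html><body><p>The actual content.</p>"
        if not do_not_track:
            body += '<script src="/analytics.js"></script>'  # hypothetical tracker
        body += "</body></html>"
        start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
        return [body.encode("utf-8")]

    if __name__ == "__main__":
        make_server("localhost", 8000, app).serve_forever()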

Websites are like executables

For simplicity, suppose you are on a Windows system, you know one or two things about security, and someone sends you an image, only that the file extension is not .jpg but .exe.

You should at least get suspicious because .exe files are executables, not images.

Of course, an executable could show an image and thus be completely genuine. In practice, receiving a .jpg would be much better, because you know that the chances that such a file affects your security are much lower (not zero, because of possible bugs in the image viewer).

This is something everyone who uses a computer should know; what surprises me every time I think about it is that, when navigating the web, we are constantly downloading and executing unknown code.

Most webpages depend on JavaScript, and generally (as JavaScript is Turing complete) it is not possible to know what the page looks like without executing it. We could read the code, to be sure that it is not malicious, but it would not work for most pages. First, the amount of JavaScript one needs to read is simply too much. Second, as there is so much JavaScript, some websites minify it to reduce download time, making it harder to understand.

Most of the time it is also not trivial to see if a given feature can be misused or not. For example, browsers made multiple changes to make the :visited pseudo-class more privacy-friendly, but it is still possible to misuse it.

Reproducibility

Even if the content does not change, external resources, CDNs, and computations are handled dynamically (and possibly cached for some time).

This means that closing and reopening the browser, restarting the operating system, or pressing F5 because some elements did not load correctly might make it impossible to read the content of a website again.

Missing timestamps

When searching for information, most of the time I want to know what the status quo is.

Technology changes over time and many words and names are reused in different contexts, so it is not always easy to recognize if an article is up-to-date.

Sometimes I also want to search for outdated information, for example, to understand better how something worked, how certain things changed, and so on.

For this reason, I find it very valuable if an article reports when it has been published.

Unfortunately, most do not, and some websites replace timestamps with something like "published 1 year ago", which is less precise than an actual date (was it published more like 12 or 20 months ago?), and it is also not clear if the information is up-to-date. Even an imprecise "published in 2002" is much better.
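Pages that do expose a machine-readable date make this painless: a <time datetime="…"> element or an article:published_time meta property can be extracted automatically. A small sketch using only the standard library (the sample HTML is made up):

    from html.parser import HTMLParser

    class DateFinder(HTMLParser):
        """Collects machine-readable publication dates from a page."""

        def __init__(self):
            super().__init__()
            self.dates = []

        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if tag == "time" and "datetime" in a:
                self.dates.append(a["datetime"])
            if tag == "meta" and a.get("property") == "article:published_time":
                self.dates.append(a.get("content"))

    finder = DateFinder()
    finder.feed('<article><time datetime="2002-07-15">15 July 2002</time></article>')
    print(finder.dates)  # ['2002-07-15']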

Ads

Many websites are riddled with targeted ads, even if they might not generate enough revenue to maintain the website.

True, if you have a lot of visitors you could, like Vizio, make a considerable amount of money, but static, non-tracking ads could be good enough.

Link shorteners

Unless you are typing URLs by hand, dictating them, or writing them in an SMS message, link shorteners generally do not provide any advantage.

In fact, third-party link shorteners have so many disadvantages that they should be prohibited on most occasions, but some platforms (probably mainly Twitter because of the 140/280 character limitation 🗄️) made them widespread.

The first issue is that the URL is obfuscated. Shortened URLs are generally a random sequence of characters, thus it is generally not possible to know where they point.

Another issue is that many URL shorteners are malicious.

Even if the URL shortener is not malicious, it might have a different privacy policy; considering regulations like the GDPR, it is a gamble to rely on an external service to provide alternate links that bring no advantages…

Too-short URLs can also be a security issue, as described in this paper.

Last but not least, a URL shortener is another layer of indirection. Even if today it is working correctly, tomorrow it might not, even if the original resource is still there, and because URLs are pretty much obfuscated, it is much harder to find the original resource again.
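When I really want to know what is behind a shortened link, the only option is to follow the redirects myself. A sketch (the short URL below is made up; some shorteners also refuse HEAD requests, in which case a normal GET is needed):

    import urllib.request

    def expand(short_url: str) -> str:
        # HEAD is enough: we only care where the redirects end up, not the body.
        req = urllib.request.Request(short_url, method="HEAD")
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.geturl()  # the URL after all redirects were followed

    print(expand("https://example.org/abc123"))  # hypothetical shortened link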

TLDR: do not use a third-party link shortener; in the case of Twitter, it is not even necessary.

Note 📝
Microsoft Safelink 🗄️ (an Outlook feature) shares many issues with URL shorteners, even though it does not even make URLs shorter. Worst of all, it makes the links harder to read and cannot be disabled.

Ever-changing content by design

Some sites always change, by design.

This is not an issue per se; for a webpage that shows the current weather forecast, for example, it is expected that the information is updated periodically.

But for many other websites, which do not show content that needs to be updated frequently, this is unnecessary.

A technique I particularly dislike is the infinite scroll.

For example, it makes it unnecessarily hard, sometimes impossible, to bookmark content.

Also, the navigation is sometimes broken: the browser’s back and forward buttons might not work, just like the Home, PgUp, or End keys.

Another overused technique is the carousel.

Scope creep

It seems that browsers have evolved into "small" virtual machines for running applications. This comes at the expense of making every other task incredibly slow and complicated.

I would prefer to download/use a "real" virtual machine (like a VirtualBox image), a portable program, or a separate "app browser", rather than making browsers as complicated as they are in order to cover every use case, instead of keeping them a much simpler platform for documents.

It might seem extreme to propose downloading small virtual machines or executables instead of adding "just one more little feature" to an already complex standard, but it is already very hard to write even a half-complete browser "from scratch" that is compatible with all web specifications.

The number of W3C specifications grows at an average rate of 200 new specs per year, or about 4 million words, or about one POSIX every 4 to 6 months.

Most websites do not need many features, but (major) browsers cannot simply ignore them if they want to stay competitive.

Other formats, like PDF, might have similar issues (at a smaller scale). PDF supports images, audio, video 🗄️, JavaScript 🗄️, portfolios 🗄️, attachments, multiple types of metadata, encryption, signatures 🗄️, timestamps, DRM, encoding, fonts, forms and surely many other things I am not even aware of.

As many of those features are not used by most documents, it is still possible to create independent PDF readers (integrated into web browsers too), which are sufficiently complete for most use cases, even if they do not support all (optional) features.

Alternate technologies

Some "alternate" technologies emerged or are periodically proposed to improve the status quo.

Google AMP and Instant Articles

Google AMP and Instant Articles were two proposals to make websites somehow faster.

As far as I see, the generated pages still use JavaScript, HTML, and CSS, thus both for the browser and end-user, they are like normal web pages, with all advantages and disadvantages.

How can I be sure that the executed JavaScript will not harm my PC?

What would have been interesting, but is not what happened, is if they had created a different format, and possibly a browser plugin to render such new file types.

This would have made it possible to avoid some issues (like embedding a Turing-complete language, by providing an API for common operations instead), but adoption would have been much slower (who wants to install a plugin or an alternate browser just to visit some websites?).

Note 📝
Something similar happened with MathML, except that it is already standardized. Chrome/Chromium users can download a plugin for rendering MathML. Obviously, the browser could implement the functionality like other browsers do, or embed a JavaScript library for rendering it.

Unfortunately, there is much more interest in adding new capabilities to websites, like accessing USB devices 😕, even if it surely has interesting use-cases 🗄️.

Gopher and Gemini

Those are truly alternate technologies, but not alternate formats.

Gopher is an alternative to HTTP(S) with constraints on the served content. Instead of defining a "better" format, users are limited to basic ASCII text.

For most websites, lacking the possibility to embed images, math formulas, and tables is way too limiting.

Also, requiring a new protocol is not necessary: it is already possible to serve plain text, Markdown, AsciiDoc, and simple HTML files over HTTP(S).

Gemini is a successor to Gopher, but it still has too many limitations compared to plain HTML.

Gemtext files, like Gopher menus, do not support tables, math, or images. The protocol does not support compression either, which, especially for textual content, can easily reduce the size of a page by 50%.

Unless there are some particular requirements, it is probably easier to serve Gemtext/text files over HTTP(S) instead of using another protocol.
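To illustrate how little it takes: below is a rough sketch of a tiny HTTP server that serves .gmi files from the current directory and gzip-compresses the response when the client allows it. There is no path sanitization, so it is not meant to face the internet, and the text/gemini media type is an assumption; text/plain would work just as well for most readers.

    import gzip
    import pathlib
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class GemtextHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            path = pathlib.Path("." + self.path)
            if path.suffix != ".gmi" or not path.is_file():
                self.send_error(404)
                return
            body = path.read_bytes()
            headers = [("Content-Type", "text/gemini; charset=utf-8")]
            if "gzip" in self.headers.get("Accept-Encoding", ""):
                body = gzip.compress(body)  # the part Gemini itself cannot do
                headers.append(("Content-Encoding", "gzip"))
            self.send_response(200)
            for name, value in headers:
                self.send_header(name, value)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("localhost", 8080), GemtextHandler).serve_forever()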


Do you want to share your opinion? Or is there an error, or some part that is not clear enough?

You can contact me anytime.