You've probably been frustrated by the short lifespan of most web pages, but you may not know about the team trying to do something about it. Jill Lepore, Harvard professor, award-winning author and staff writer for The New Yorker, has just written a wonderful and insightful article about the people behind the Internet Archive. It's titled The Cobweb: Can the Internet be Archived? and it's in the Jan. 26, 2015 issue of The New Yorker.
As a librarian and a computer scientist, we feel this topic is very important, both to us as entrepreneurs in the personal history space and to society as a whole.
Every day we're reminded how fragile the Internet is. We regularly look for things that turn out to no longer be there. Even though we know better, when we see the message “Page Not Found”, we're still a bit surprised.
For a librarian, it is very disturbing to know that sources can’t be traced or verified in today’s world of link rot. It’s nearly impossible to know the validity of a story when you realize that it can be changed, edited or deleted within seconds. This is really creepy, and it is one reason why the Internet Archive (https://archive.org) and its Wayback Machine exist.
As just one example, one of our previous companies, Glassbook, was sold to Adobe in 2000. If you enter glassbook.com in a browser, you get Adobe’s home page, where there is no mention of Glassbook. You can then do a search on Adobe.com, where you'll get a list of press releases. Further sleuthing on Google for Glassbook will turn up random occurrences of the word glassbook and some links related to our company. If you want more, the Wayback Machine is the place to look.
According to the Wayback Machine, the Glassbook URL http://glassbook.com was "Saved 131 times between January 25, 1999 and December 17, 2014." You can then see on a calendar the dates those saves took place.
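For readers who would rather script this kind of lookup than click through the calendar, here is a minimal Python sketch. It assumes the Internet Archive's public availability endpoint at https://archive.org/wayback/available and the JSON shape it is documented to return (an "archived_snapshots" object with a "closest" entry). The function names build_query, closest_snapshot and lookup are our own, made up for illustration, and this is only one way to do it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the Internet Archive's public availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(url, timestamp=None):
    """Build the request URL for a page, optionally near a YYYYMMDD date."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(payload):
    """Return (snapshot_url, timestamp) from a decoded response, or None.

    Assumes the response looks like
    {"archived_snapshots": {"closest": {"available": ..., "url": ..., "timestamp": ...}}}.
    """
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest.get("url"), closest.get("timestamp")


def lookup(url, timestamp=None):
    """Ask the Wayback Machine for the saved copy closest to the given date."""
    with urlopen(build_query(url, timestamp), timeout=10) as response:
        return closest_snapshot(json.load(response))
```

Calling lookup("glassbook.com", "19990125") would ask for the saved copy closest to the first capture date mentioned above. It needs a network connection, and it returns None when no copy is reported.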
We love the Wayback Machine and are grateful to Brewster Kahle and his team for taking on the challenge of archiving more than 452 billion Web pages over time. With pages being added daily, that number has surely been surpassed.
Recapping a few details from the New Yorker article:
- "The average life of a Web page is about a hundred days."
- "Social media, public records, junk: in the end, everything goes."
- "The Web dwells in a never-ending present. It is—elementally—ethereal, ephemeral, unstable, and unreliable."
- Link rot (Page Not Found), an updated Web page (original has been overwritten), content drift (a page has been moved and something else is where it used to be), reference rot (combination of link rot and content drift) all interfere with knowing what is or was even remotely true on the Internet.
- "According to a 2014 study conducted at Harvard Law School, 'more than 70% of the URLs within the Harvard Law Review and other journals, and 50% of the URLs within United States Supreme Court opinions, do not link to the originally cited information.'"
- "A team of digital library researchers based at Los Alamos National Laboratory reported the results of an exacting study of three and a half million scholarly articles published in science, technology, and medical journals between 1997 and 2012: one in five links provided in the notes suffers from reference rot."
The motto of the Internet Archive is “Universal Access to All Knowledge.” It's a great goal and a much-needed service to society.
One of our biggest goals for Timebox has been to provide a way for people to archive their own valuable digital data, which for many of us tends to be our digital photos and the stories that go with them. Like the Internet Archive, Timebox is still a work in progress, but we are all working to ensure that our digital lives have a long existence.