I am beginning the slow, painful process of recovering the website from web crawler caches.

There are a few automated tools, such as Warrick, for recovering a website from search engine (Yahoo, Bing, Google, etc.) caches, but I had poor results using it:

  • My IP address was quickly banned from Google for using it
  • I got lots of 500 and 503 errors and “waiting 5 minutes…” messages
  • Ultimately, I could recover the text content faster by hand

I’ve had much better luck working from a list of all blog posts, clicking through to the Google cache of each one, and saving the page as HTML. There are a lot of blog posts, but not an unmanageable number, and I figure I deserve some self-flagellation for not having a better backup strategy. Anyway, the important thing is that this approach works: based on what I’ve done so far, I’m confident I can recover all the lost blog post text and comments from the Internet caches.
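Since automated crawling of the cache got my IP banned, I’m generating the cache links and clicking through by hand. A minimal sketch of the link generation (the cache URL pattern and the post list are assumptions; verify the pattern before relying on it):

```python
from urllib.parse import quote

def cache_url(post_url: str) -> str:
    """Build a Google cache lookup URL for a page (pattern assumed, may change)."""
    return "https://webcache.googleusercontent.com/search?q=cache:" + quote(post_url, safe="")

# Hypothetical list of post URLs; in practice, load it from your sitemap or archive index.
posts = [
    "http://example.com/2006/01/some-post.html",
]
for url in posts:
    print(cache_url(url))
```

Each printed link can then be opened manually and saved as HTML, keeping request volume low enough to avoid tripping rate limits.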

However, the images that go with each blog post are proving…more difficult.

Any general tips for recovering website pages from Internet caches, and in particular, any good places to recover the archived images?

(And, again, please, no backup lectures. You’re totally, completely, utterly right! But being right isn’t solving my immediate problem… Unless you have a time machine…)

Here’s my wild stab in the dark: configure your web server to return 304 Not Modified for every image request, then crowd-source the recovery by posting a list of URLs somewhere and asking on the podcast for all your readers to load each URL and harvest any images that load from their local caches. A 304 response tells the browser its cached copy is still valid, so any image a reader’s browser still has cached will render normally. (This can only work after you restore the HTML pages themselves, complete with the <img …> tags, which your question seems to imply you will be able to do.)
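A minimal sketch of that blind-304 server, using only Python’s standard library (extensions and port are placeholders; a real deployment would more likely do this in the Apache or nginx config):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")  # extend as needed

class NotModifiedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.lower().endswith(IMAGE_EXTENSIONS):
            # "Not Modified": the browser falls back to its locally cached
            # copy, if it has one -- exactly what we want readers to harvest.
            self.send_response(304)
            self.end_headers()
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep the console quiet

def serve(port: int = 8080) -> None:
    """Run the blind-304 server until interrupted."""
    HTTPServer(("", port), NotModifiedHandler).serve_forever()
```

Call `serve()` behind the site’s image URLs and every `<img>` request gets a 304 with no body, leaving each reader’s browser to fill in whatever it still has cached.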

This is basically a fancy way of saying, “get it from your readers’ web browser caches.” You have many readers and podcast listeners, so you can effectively mobilize a large number of people who are likely to have viewed your web site recently. But manually finding and extracting images from various web browsers’ caches is difficult, and the entire approach works best if it’s easy enough that many people will try it and succeed. Thus the 304 approach. All it requires of readers is that they click through a series of links, drag off any images that do load (or right-click and save as, etc.), and then email them to you or upload them to a central location you set up. The main drawback of this approach is that web browser caches don’t go back that far in time. But it only takes one reader who happened to load a post from 2006 in the past few days to rescue even a very old image. With a big enough audience, anything is possible.