I have a hoarding problem. Not, I hasten to add, in the physical world. There, with the exception of kitchen equipment, I tend to the minimal. My hoarding problem is in the virtual world of information. I hate the idea of losing content. It’s so easy to copy around – it happens for every visit on every page for every image – yet we do such a terrible job of preserving old digital sources.
When tumblr announced it was banning all adult content I was therefore particularly perturbed, and set about locally archiving what I could. The initial idea was to just save sites I liked to use as image sources for this blog. I figured I could sort and catalog it all later. That seemed like a fun way to spend the holiday period. However, once I had the pipeline set up and ticking along, things got maybe a little out of hand.
When I finally pulled it all together and de-duplicated, I discovered I’d got 5.1 million images from about 400 sites occupying over 3TB of space. Oops. The cataloging process might therefore be a touch more time-consuming than I first thought. Let’s say I want to run through all images just once and spend just 4 seconds per image. With a lot of animated gifs involved, that seems like a pretty fast average pace. If I devote 2 hours every day, 7 days a week, 52 weeks a year, I should be done in a little over 7 years and 9 months. Alternatively, I could quit my job entirely, put in a solid 8 hours a day, 5 days a week, and be done in a bit over 2 years and 8 months. Piece of cake.
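For anyone who wants to check the sums, here is a rough sketch of that back-of-envelope calculation in Python. The function name is made up for illustration, and it assumes a 52-week year split into 12 equal months, so the results are approximate.

```python
IMAGES = 5_100_000        # total images after de-duplication
SECONDS_PER_IMAGE = 4     # assumed average viewing time


def years_and_months(hours_per_day, days_per_week, weeks_per_year=52):
    """Return (whole years, remaining months) needed to view every image once."""
    total_hours = IMAGES * SECONDS_PER_IMAGE / 3600
    hours_per_year = hours_per_day * days_per_week * weeks_per_year
    years = total_hours / hours_per_year
    whole_years = int(years)
    # Assumption: a year is treated as 12 equal months.
    months = (years - whole_years) * 12
    return whole_years, months


if __name__ == "__main__":
    for label, hours, days in [("evenings", 2, 7), ("full time", 8, 5)]:
        whole_years, months = years_and_months(hours, days)
        print(f"{label}: {whole_years} years and {months:.1f} months")
```

Both schedules come out in line with the figures above: a little over 7 years and 9 months for the 2-hours-a-day plan, and a bit over 2 years and 8 months for the full-time one.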
I guess the good news is that there’s no danger of me running out of images to post. The bad news is that I’m not sure how I’ll be able to sort through them all to find the good stuff.
Talking of archiving content, here’s an old image from the Leda / NuWest company. I’d guess it’s from the mid to late 80s. I found it via the now-unavailable ‘x ray blue eyes’ tumblr. That was the single largest tumblr site I archived, with 355,247 images. That’s a lot of femdom.
Hmmm.
I have been sorting out my Tumblr copies too, but a tiny collection compared to yours.
Yes, I remember xray blue eyes …
The funny thing is, you can still copy entire sites back even now, complete with all the blocked content. You just have to know the URL. I’ve had a couple of situations recently where I had an old link but could no longer see the original post. So I just ripped the entire site back! Crazy.
-paltego
A public service! Would you consider posting the raw collected data publicly for others to trawl and post on whichever site emerges as the new tumblr?
I’ve thought about it, but there are a few challenges.
Firstly, it’s enormous. Like I said, it’s over 3TB of data. There’s no easy way to share that. Even moving it all around on my fast local network takes forever.
Secondly, I’m sure there’s a large amount of copyright material. Tumblr had lawyers and a mechanism for yanking posts if they got complaints. I can’t handle that kind of thing.
It’s possible I could wrap some of the images from certain sites up into a torrent and share that. Even then the size might be a challenge. For example, Alternative Femdom, which was a site I always liked to browse, is about 18GB of data. That’s a pretty big torrent.
I’ll have a think about possible solutions, see if there are any good options.
-paltego
A bit late, Titia and I wish you and your readers the very very best for 2019.
Question: Could you “just” make available a list of those 400-odd URLs? Then it’s – after copy/paste/save on our hard drives – up to us to find out if they still “work”.
Regards from the Netherlands
Marga & Titia – Happy New Year! Hope you both have a good 2019!
I don’t have the full 400 tumblr list. It’s a bit of a mixed grab bag of stuff and I’ve already started cleaning up and amalgamating some sites. However, I can email you a copy of my old image page. That has a lot of the tumblrs on it that I grabbed, and a good number of them are still ‘available’ (provided you use something like TumblThree rather than a web browser).
-paltego
You are not alone! Read the article in the link below.
https://gizmodo.com/delete-never-the-digital-hoarders-who-collect-tumblrs-1832900423
—
As for weeding duplicates, try visipics. The look and feel of it make you think you are in some kind of time machine but it does the job rather well:
http://www.visipics.info
Interesting link. Obviously I’m not alone in this. Although the people building their own home servers and measuring storage in petabytes rather than terabytes are a lot more serious about it than I’m prepared to be.
I tried various image de-dupers including (I think) that one. The problem is they’re relatively slow and just can’t handle huge batches of images. Or at least not before I got bored and killed them. I ended up writing my own super simple de-dupe code that simply checked file size and then, if that matched, compared the contents byte for byte. It doesn’t work if an image has been cropped/modified/shrunk/etc. but it removed all the re-posts of the same thing and some of the standard tumblr images. That was enough to remove 10-15% of the original images.
-paltego
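For the curious, the size-then-bytes approach described in the comment above can be sketched in a few lines of Python. This is not the original code, just a minimal illustration of the same idea; the function name `find_duplicates` and the folder layout are assumptions.

```python
import filecmp
import os
from collections import defaultdict


def find_duplicates(root):
    """Return (kept, duplicate) path pairs for byte-identical files under root."""
    # Step 1: group every file by its size. Files of different sizes
    # cannot be identical, so most files are never opened at all.
    by_size = defaultdict(list)
    for folder, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(folder, name)
            by_size[os.path.getsize(path)].append(path)

    # Step 2: within each size group, compare actual contents.
    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        unique = []  # one representative per distinct content
        for path in paths:
            for kept in unique:
                # shallow=False forces a real byte-for-byte comparison.
                if filecmp.cmp(kept, path, shallow=False):
                    duplicates.append((kept, path))
                    break
            else:
                unique.append(path)
    return duplicates


if __name__ == "__main__":
    # Hypothetical folder name; this only reports, it deletes nothing.
    for kept, dup in find_duplicates("archive"):
        print(f"{dup} duplicates {kept}")
```

Because the size check is so cheap, this scales reasonably well to very large collections. As noted above, it only catches exact copies; a cropped, resized, or re-encoded image will have different bytes and slip through.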