I was handed 5,300 PDFs of medical manuals and now I'm going to put them into the Archive: A thread.
First, I'm keeping the original 29 GB .tar.bzip2 they gave it to me in, because I know there are folks who just want one big pile and don't want my clever little uploads. Keep the originals around, if you can - let someone else have the same chance to start that you did.
Next, the metadata is partially in the directory tree. I am writing a custom script that turns the directory structure into keywords.
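Here is a rough sketch of what "directory structure into keywords" could look like. It is not the actual script from this thread. The function name, the example path, and the rule of swapping underscores for spaces are all assumptions made up for illustration.

```python
from pathlib import PurePosixPath


def keywords_from_path(pdf_path: str, root: str) -> list[str]:
    """Turn the directories between `root` and the PDF into keywords.

    Hypothetical helper: the real directory layout is not described in
    the thread, so this only assumes that folder names carry meaning.
    """
    relative = PurePosixPath(pdf_path).relative_to(PurePosixPath(root))
    keywords: list[str] = []
    # Walk only the directory components, not the filename itself.
    for part in relative.parent.parts:
        cleaned = part.replace("_", " ").strip()
        # Skip empty names and avoid repeating a keyword.
        if cleaned and cleaned not in keywords:
            keywords.append(cleaned)
    return keywords


if __name__ == "__main__":
    # Example path is invented for illustration.
    print(keywords_from_path("manuals/Welch_Allyn/Imaging/lci100.pdf", "manuals"))
```

The resulting list could then go into the item's subject/keyword metadata at upload time.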
I can make collections because I'm an admin. The collection will be named "manuals_medicaldevices" and will live inside the "manuals" collection.
I'm now rewriting my longtime uploading script to do a little extra work since the metadata is there in the file directory. I'll then test with a single item.

The collection is now waiting for me: https://archive.org/details/manuals_medicaldevices
These things almost never go right the first time, so I'm running just one iteration of my script, on a single item: a Welch Allyn LCI 100 & 200 Imaging System Service Manual. I see it got uploaded, possibly with useful metadata.
Now we're going to run into an interesting situation - the Archive has a massive queue system running, with hundreds of thousands of "jobs" a day. My manual upload will fall into place over the course of a few minutes, and then generate a readable version. It's not instant.
Note that how it looks depends on when you see it. If you're following this thread at this exact moment, it's going to look very incomplete. But over time it will pull in a thumbnail, generate an online readable version, and add OCR text to the search function.
Looking over this item, I already see an interesting situation: I happened to choose an item where the creators of this collection put two identical copies into the directories!
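One way to catch identical copies like these before uploading is to hash every file and group the matches. This is a minimal sketch, not something the thread says was done. It assumes the files have been extracted to a local directory, that they end in a lowercase ".pdf", and that byte-for-byte identical content is the definition of a duplicate. The function name is made up.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(root: str) -> list[list[Path]]:
    """Group PDFs under `root` whose contents are byte-for-byte identical."""
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(Path(root).rglob("*.pdf")):
        if not path.is_file():
            continue
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large manuals don't fill memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        by_digest[digest.hexdigest()].append(path)
    # Keep only the digests that more than one file shares.
    return [paths for paths in by_digest.values() if len(paths) > 1]


if __name__ == "__main__":
    for group in find_duplicates("."):
        print("identical:", ", ".join(str(p) for p in group))
```

Each group could then be uploaded once, with the extra paths still contributing their directory keywords.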
You can follow @textfiles.