Thread by @textfiles, I was handed 5,300 PDFs of medical manuals and now I'm going [...]

Jason Scott

textfiles

I was handed 5,300 PDFs of medical manuals and now I'm going to put them into the Archive: A thread.

First, I'm keeping the original 29 GB .tar.bzip2 they gave it to me in, because I know there are folks who want just one big pile and don't want my clever little uploads. Keep the originals around, if you can - let someone else have the chance you did to start.

Next, the metadata is partially in the directory tree. I am writing a custom script to turn the directory structure into keywords.
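The thread doesn't show the script itself, so here is only a minimal sketch of the idea, assuming every folder between the top of the dump and a PDF is worth keeping as a keyword. The function name `keywords_from_path` and the example path are hypothetical, not taken from the actual collection.

```python
from pathlib import PurePosixPath


def keywords_from_path(pdf_path: str, root: str) -> list[str]:
    """Turn the directories between `root` and the PDF into keywords."""
    relative = PurePosixPath(pdf_path).relative_to(root)
    keywords = []
    # relative.parts ends with the filename; everything before it is a directory.
    for part in relative.parts[:-1]:
        cleaned = part.replace("_", " ").strip()
        # Skip empty names and avoid repeating a keyword.
        if cleaned and cleaned not in keywords:
            keywords.append(cleaned)
    return keywords


if __name__ == "__main__":
    # Hypothetical layout: dump/<maker>/<device type>/<manual>.pdf
    print(keywords_from_path("dump/Welch_Allyn/Imaging/Service_Manual.pdf", "dump"))
```

Keeping the directory order means the broadest category comes first, which is usually how these trees are organized.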

I can make collections because I'm an admin. The collection name will be "manuals_medicaldevices" and be in the "manuals" collection.

I'm now rewriting my longtime uploading script to do a little extra work since the metadata is there in the file directory. I'll then test with a single item.
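The uploading script isn't shown in the thread. As a rough sketch of what one upload could look like, the code below only builds (and does not run) a command line for the Internet Archive's `ia` command-line tool, using its `ia upload <identifier> <file> --metadata="key:value"` form. The identifier scheme (a `manual_` prefix plus the title with runs of non-alphanumeric characters collapsed to underscores) is a guess; it happens to reproduce the identifier of the test item linked later in the thread. The helper names, the local file path, and the keywords are hypothetical.

```python
import re
import shlex


def make_identifier(title: str, prefix: str = "manual_") -> str:
    """Guess at an identifier scheme: prefix + title with punctuation collapsed."""
    slug = re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_")
    return prefix + slug


def build_upload_command(pdf_path: str, title: str, keywords: list[str],
                         collection: str) -> list[str]:
    """Build an `ia upload` argument list; nothing is executed here."""
    command = ["ia", "upload", make_identifier(title), pdf_path]
    command.append(f"--metadata=title:{title}")
    command.append("--metadata=mediatype:texts")
    command.append(f"--metadata=collection:{collection}")
    # Assumption: one subject flag per keyword, taken from the directory names.
    for keyword in keywords:
        command.append(f"--metadata=subject:{keyword}")
    return command


if __name__ == "__main__":
    command = build_upload_command(
        "dump/Welch_Allyn/Service_Manual.pdf",  # hypothetical local path
        "Welch Allyn LCI 100 & 200 Imaging System Service Manual",
        ["Welch Allyn"],  # hypothetical keywords
        "manuals_medicaldevices",
    )
    # Print a shell-quoted dry run so one item can be checked before the rest.
    print(shlex.join(command))
```

Printing the command first, rather than running it, fits the "test with a single item" approach: the metadata can be inspected before anything reaches the queue.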

The collection is now waiting for me: https://archive.org/details/manuals_medicaldevices

These things almost never go right the first time, so I'm running just one iteration of my script, on a single item: a Welch Allyn LCI 100 & 200 Imaging System Service Manual. I see it got uploaded, possibly with useful metadata.

Now we're going to run into an interesting situation - the archive has a massive queue system running, with hundreds of thousands of "jobs" a day. My manual upload will fall into place, over the course of a few minutes, and then generate a readable version. It's not instant.

You can now see the item here: https://archive.org/details/manual_Welch_Allyn_LCI_100_200_Imaging_System_Service_Manual

Note that how it looks depends on when you see it. If you're following this thread this exact moment, then it's going to be very incomplete. But then, over time, it will pull in a thumbnail, generate an online readable version, and it'll add OCR to the search function.
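One way to see how far along an item is, sketched here under some assumptions: the public metadata endpoint at `https://archive.org/metadata/<identifier>` returns JSON with a `files` list, and each entry carries a `source` field such as `original` or `derivative`. The function names are made up for this example, and treating "at least one derivative file" as "the queue has gotten to it" is a simplification, not how the Archive itself defines completion.

```python
import json
import urllib.request


def fetch_metadata(identifier: str) -> dict:
    """Fetch an item's metadata record (makes a network request)."""
    url = f"https://archive.org/metadata/{identifier}"
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def count_by_source(metadata: dict) -> dict:
    """Count the item's files by their `source` field."""
    counts = {}
    for entry in metadata.get("files", []):
        source = entry.get("source", "unknown")
        counts[source] = counts.get(source, 0) + 1
    return counts


def has_derivatives(metadata: dict) -> bool:
    """Rough heuristic: derived files exist once the queue has processed the item."""
    return count_by_source(metadata).get("derivative", 0) > 0


if __name__ == "__main__":
    item = "manual_Welch_Allyn_LCI_100_200_Imaging_System_Service_Manual"
    record = fetch_metadata(item)
    print(count_by_source(record), has_derivatives(record))
```

Running it a few minutes apart right after an upload should show the derivative count growing as the thumbnail, readable version, and OCR text arrive.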

Looking over this item, I can already see an interesting situation: I happened to choose an item where the creators of this collection put two identical copies into the directories!
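The thread doesn't say how the duplicates will be handled. One common way to find byte-for-byte identical copies before uploading is to group files by a content hash; the sketch below does that with SHA-256 over every `.pdf` under a directory. The function names and the directory passed on the command line are illustrative only.

```python
import hashlib
import sys
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: str) -> list[list[Path]]:
    """Return groups of PDFs under `root` whose contents are identical."""
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob("*.pdf")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    # Only groups with more than one file are duplicates.
    return [paths for paths in by_digest.values() if len(paths) > 1]


if __name__ == "__main__":
    for group in find_duplicates(sys.argv[1] if len(sys.argv) > 1 else "."):
        print("identical:", *group)
```

This only catches exact copies; two scans of the same manual with different bytes would not be grouped.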

