[kwlug-disc] digitizing books
Insurance Squared Inc.
gcooke at insurancesquared.com
Thu Oct 28 16:58:05 EDT 2010
As some of you know, I've got tens of thousands of pages digitized, and
most of them are already online. It is, quite frankly, an absolutely
horrible business to be in. (For one of my dead, non-commercial projects,
see fishing buddies dot ca; I've got one book online there.)
Scanning is a pita. The problem with a regular flatbed is that you have
to bend the book, so the pages don't scan well. I use an opticbookpro book
scanner, but it's manual and one page at a time, which is extremely time
consuming and beats up the books as well. USB scanning technology means
9 seconds per page.
Because I post my books online, all my stuff has to be old enough to be
out of copyright, so my books are delicate. Handling gets to be a
consideration, both for the time it takes and for the condition of the
books.
Commercial-quality book scanners cost $10K and up. I actually found a
used one once; it was like Christmas that day. Drove to Barrie to buy
it. Turned out to be a paperweight. I guess I got coal in my stocking
that year.
The other site, similar to the one Richard pointed out, is
diybookscanners.com. They have some home brews there along with
directions. The basic trick is to have a V shape so you can image two
pages at once without bending the book, and then to use cameras instead
of scanner technology. Cameras allow you to take an image in a second
or less, instead of the 9-10 seconds you'd see on a scanner.
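To put rough numbers on that difference, here is a back-of-the-envelope
sketch in Python. The per-shot timings are the ones quoted above; the
helper name, the 500-page book, and the choice to ignore page turning and
handling time are assumptions made only for illustration.

```python
import math

def capture_minutes(pages, seconds_per_shot, pages_per_shot=1):
    """Pure capture time in minutes, ignoring page turning and handling."""
    shots = math.ceil(pages / pages_per_shot)
    return shots * seconds_per_shot / 60

# Flatbed-style scanner: one page per scan at about 9 seconds each.
flatbed = capture_minutes(500, seconds_per_shot=9)

# V-shaped cradle with cameras: two pages per shot at about 1 second each.
camera = capture_minutes(500, seconds_per_shot=1, pages_per_shot=2)

print(f"flatbed: {flatbed:.0f} min, camera: {camera:.1f} min")
# prints "flatbed: 75 min, camera: 4.2 min"
```

In practice the page turning dominates either way, so treat these as
lower bounds rather than real throughput.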
So now you've got your 500-page book scanned. The next issue for some
of us is OCRing it. That's a whole other ballgame that's again very
labour-intensive. There's pretty much nothing on Linux that does a good
job of it. As soon as you run into nonstandard page layout, or
pictures, or tables, things go bad fast with OCR software.
Spend the time OCRing it and the next problem is indexing and
organizing. The physical book's already organized into chapters, with a
table of contents and index. After you image and OCR you've lost that
metadata. I'll leave it to your imagination to figure out how to
automate the replacement of a table of contents :) (hint: it's not
really something you can automate).
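Since the table of contents has to be re-keyed by hand anyway, one way
to keep it attached to the scanned pages is a small lookup table. This
is only a sketch: the chapter titles, page numbers, and function name
below are made up for illustration, and it does nothing to automate
building the table itself.

```python
from bisect import bisect_right

# Hand-keyed table of contents: (first scanned page number, title).
# These entries are hypothetical examples, not from a real book.
TOC = [
    (1, "Preface"),
    (9, "Chapter 1"),
    (41, "Chapter 2"),
    (87, "Index"),
]

def chapter_for_page(page, toc=TOC):
    """Return the title of the TOC entry that contains the given page."""
    starts = [start for start, _ in toc]
    # Find the last entry whose first page is <= the requested page.
    position = bisect_right(starts, page) - 1
    if position < 0:
        raise ValueError(f"page {page} comes before the first TOC entry")
    return toc[position][1]

print(chapter_for_page(40))  # prints "Chapter 1"
print(chapter_for_page(41))  # prints "Chapter 2"
```

The entries must be sorted by page number for the lookup to work.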
If you want to go to the next step and put the book online, now you've
got to find a way to import all these pages of poorly OCR'ed text :)
into a web page and, again, keep the metadata like indexes and table of
contents. I've had to have custom software written to handle that task.
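As a rough idea of what such an import step involves (this is not the
custom software mentioned above, just a minimal sketch), the Python
below turns a list of per-page OCR text plus a hand-entered table of
contents into one HTML page with links to each chapter's first page.
The function name, the "page-N" anchor scheme, and the sample data are
all assumptions for illustration.

```python
from html import escape

def render_book(pages, toc):
    """Build an HTML fragment from OCR text and a hand-entered TOC.

    pages: list of OCR text strings, one per page, page 1 first.
    toc:   list of (page number, title) tuples.
    """
    parts = ["<h1>Contents</h1>", "<ul>"]
    for page_no, title in toc:
        # Each TOC entry links to the anchor of the page it starts on.
        parts.append(f'<li><a href="#page-{page_no}">{escape(title)}</a></li>')
    parts.append("</ul>")
    for page_no, text in enumerate(pages, start=1):
        # Escape the OCR text so stray <, > and & don't break the markup.
        parts.append(f'<div id="page-{page_no}"><pre>{escape(text)}</pre></div>')
    return "\n".join(parts)

html_doc = render_book(
    ["Preface text", "Trout & salmon"],
    [(1, "Preface"), (2, "Chapter 1")],
)
print(html_doc)
```

A real site would also need the index, images, and corrections to the
OCR text, which is where most of the manual work ends up.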
In the end, if you're looking to just image books and have a file of
images you can flip through on your computer, that machine will work
reasonably well. If you want to take it to the next step things get
very manual very fast.
Nevertheless, projects like the one Richard pointed out are moving us
forward fast. I'm excited, and hope one day to be able to digitize
hundreds more books that are sitting in boxes in my house.
On 28/10/10 04:10 PM, Richard Weait wrote:
> I know that this topic and hardware will find a small and enthusiastic
> crowd here as well.
>
> If you had to scan a book, what would you use?
>
> http://bookliberator.com/doku.php