There are three key steps to putting a sizable document on the Web: making sure you have the right to publish it, getting the text onto the computer, and converting that text to HTML.
The first step is copyright. For the documents I worked on, the court decisions were no problem, as these are all public record. In the case of the Q&A book, the author is an officer in the organization sponsoring the Web site, and he was happy to let us use his book without compensation.
Note that copyrights do expire -- generally fifty to seventy years after the death of the author, depending on the country -- so if you want to post Shakespeare or the Babylonian Talmud or something, that should not be a problem. But even in books by long-dead authors, watch out for any notes or commentary added by a modern editor! They will still be copyrighted. Also, as a reader has courteously pointed out, a modern translation of an old text may be copyrighted. That is why, for example, it is perfectly legal to distribute copies of the King James Version of the Bible, but not the New International Version. While both are translations of a book that was written long before current copyright laws existed, the KJV was translated long ago and so its copyright has already expired, while the NIV is quite recent.
Also note that illustrations may be copyrighted separately from text.
And let me hastily add that I am not a lawyer, so please do not construe the above as legal advice.
The next step is to get the text of the document on the computer. There are three ways to do this. (That I can think of, anyway. Somebody else may have other creative ideas.)
Method One: If you are lucky, the text may already be on a computer. In the case of Q&A, the publisher sent us a copy of the book on diskette. (I think they sent us the file they used to run some computerized printing machine when they printed the book in the first place, but I didn't ask for details.)
Method Two: You can scan in the document and OCR it. For the court cases, to the best of my knowledge these are not available on computer files anywhere. (Supreme Court decisions since late 1989 are available on computer through the Supreme Court's Hermes project. A couple of state courts have similar projects. For older decisions, other states, and lower courts, though, I think you're out of luck. If somebody knows something to the contrary, please let me know.) So I got printed copies and scanned them in page by page with an OCR program. I'm sure that the capabilities of OCR programs vary, and as this was our first attempt at OCR and we are a poor non-profit organization, we bought a low-cost package. Well, I found that the documents we had that were original printed copies on clean white paper scanned very well. But the documents I had xeroxed at the law library did not scan well at all. I spent a great deal of time cleaning up mistakes the OCR program made. The next time I do one like this, I think I'll find a clean printed copy. I didn't do this because it seemed like unnecessary delay and expense to order them. Ha ha ha.
By the way, on some smaller documents I've scanned, text on colored paper also proves to be a problem. I've found that telling the scanning software to lighten the image slightly helps a lot, but does not solve the problem. I tried using color dropout on the scanner and it didn't seem to help. Maybe it would with suitable tinkering.
Update: Since I wrote this, I've done some scanning using much more expensive scanners and OCR programs. Interestingly enough, I've found that these give only minor improvements in the accuracy of the scan. Not that they didn't have other nice features, but in my humble opinion, the most cost-effective thing to buy if you have some extra money in your scanning budget is a sheet feeder: then at least you can go away and do something else while the scanner and OCR software grind through the process.
If you haven't yet done any scanning, let me warn you: It is not a quick and easy task. Besides doing Right-to-Life work I actually work for a living to pay the bills, and in the course of that employment I happened to work with a government organization that was involved in a very large scanning project. They found that they had gotten in way over their heads. In one meeting they commented that they had thought that scanning documents was essentially a "lights out operation". That is, they thought somebody could just back a truck load of books up to the building with the computer in it, shovel them into the scanner, and come back the next day to pick up the disks with the computer versions of the documents. Well, as they quickly found out, in real life even the most successful scanning efforts require a great deal of manual clean-up after the computer has done its best. If you're planning to scan in an entire book, expect to spend a few days at it.
Method Three: You could type in the text manually. I've done this with some small documents. It does have the advantage that you can type in the HTML tags as you go, do any reformatting you think necessary and useful, etc.
I type things in manually if they are very short. It takes some time to fire up the scanner and load the OCR software, so for a very short document I can type it in faster than I could scan it. I also sometimes type it in when I know I'm going to be doing a lot of reformatting, like converting complex tables.
But for anything longer than a couple of pages, you'd better be VERY patient if you're thinking of doing this. If you're paying somebody to type in documents, some cheap OCR software would pay for itself very quickly.
The last step is converting the text to HTML. If the document came to you (or you typed it in) in MS Word format or one of a few other popular word processor formats, there are programs available to translate it automatically to HTML. They'll translate Word's code for italics to HTML's, etc.
Even if no programs are available to convert the format you have to HTML, if the format is reasonably uncomplicated you may be able to write a simple program. If the formatting codes actually show up as text in the file -- like files created with Unix's troff or some such -- you may be able to do all your HTMLizing with a word processor. Just do a search-and-replace on the incoming code for italics and change it to the HTML code for italics, etc.
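As a rough illustration of that kind of search-and-replace, here is a minimal Python sketch. It assumes the incoming file uses troff's font escapes (`\fI` for italic, `\fB` for bold, `\fR` or `\fP` to switch back) and that every italic or bold run is closed before the next one opens. The function name `troff_fonts_to_html` is made up for this example; real troff files contain many other codes this does not touch.

```python
import re

def troff_fonts_to_html(text: str) -> str:
    """Replace troff font escapes with HTML tags (hypothetical helper).

    Assumes each \\fI or \\fB run is ended by \\fR or \\fP and that
    runs are not nested.
    """
    # \fI ... \fR (or \fP)  ->  <i> ... </i>
    text = re.sub(r"\\fI(.*?)\\f[RP]", r"<i>\1</i>", text, flags=re.DOTALL)
    # \fB ... \fR (or \fP)  ->  <b> ... </b>
    text = re.sub(r"\\fB(.*?)\\f[RP]", r"<b>\1</b>", text, flags=re.DOTALL)
    return text

if __name__ == "__main__":
    print(troff_fonts_to_html(r"This is \fIvery\fR important, and \fBthis\fP too."))
```

The same pattern should carry over to other formats whose codes appear as visible text: one substitution per incoming code, with the HTML equivalent on the other side.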
I had no such luck. Scanned files come out plain ascii -- no formatting information aside from spacing. For the court decisions, which I did first, I just went through with a word processor and added HTML tags by hand. Court decisions don't have a lot of fancy formatting to them, so for the most part this just meant putting in the paragraph breaks. This was simple, though tedious in a long document. I just marked the paragraph break sign and pasted it over and over at the end of each paragraph.
When I got to the Q&A book, though, there was a lot of complex formatting, none of which was reflected in the file I was given. As the title implies, the book consists of a series of questions followed by answers. The questions should be set off from the answers by some typographical distinction. There are numerous charts and tables, block quotes, italics, emphasized words, etc. All this makes for a much more visually interesting and readable book, but it took me hours to get through just the first chapter adding all the tags. After going through it all, all over again, with the second chapter, I concluded there had to be a better way.
The obvious thing to do was write a program to insert the tags automagically. There is one gigantic catch to writing such a program: The reason I wanted a program was to add formatting information to the text. But by definition the formatting information wasn't there when I started -- if it was I wouldn't need the program. So how is the program supposed to create this formatting information out of thin air?
My conclusion was: it has to guess.
Some things were easy. For example, if you see a blank line, you can figure that this probably divides two paragraphs, and insert an end-paragraph tag. The file I had used asterisks to mark items in a bullet list, so if you see a line that begins with an asterisk, turn it into an entry in a bullet list.
And that was about the end of the obvious and easy ones. Even those rules were not valid 100% of the time. But beyond that the program just had to make plausible guesses. If it sees a line with blank lines before and after, it figures that this is a section heading and puts heading tags around it. Of course sometimes it's just a short paragraph, but usually this makes a good guess. The first non-blank line in a file is assumed to be the title. A line with long strings of blanks between the text is assumed to be part of a table. And so on.
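The guessing rules just described can be sketched in a few lines of Python. This is not the program offered on this site, only a minimal illustration under some assumptions: "long strings of blanks" is taken to mean three or more spaces, headings become `<h2>`, the title becomes `<h1>`, and a bullet item is assumed to fit on one line. The names `split_blocks`, `guess_block` and `guess_html` are invented for the example.

```python
import html
import re

# Assumption: "a long string of blanks" means three or more spaces between text.
GAP = re.compile(r"\S {3,}\S")

def split_blocks(text):
    """Group consecutive non-blank lines; blank lines separate the groups."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line.rstrip())
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks

def guess_block(lines):
    """Guess what one block of text is and wrap it in plausible tags."""
    # Every line starts with an asterisk: treat it as a bullet list.
    if all(l.lstrip().startswith("*") for l in lines):
        items = "".join(
            f"<li>{html.escape(l.lstrip()[1:].strip())}</li>\n" for l in lines
        )
        return f"<ul>\n{items}</ul>"
    # Some line has a long run of blanks inside it: treat it as a table.
    if any(GAP.search(l) for l in lines):
        rows = []
        for l in lines:
            cells = re.split(r" {3,}", l.strip())
            tds = "".join(f"<td>{html.escape(c)}</td>" for c in cells)
            rows.append(f"<tr>{tds}</tr>")
        return "<table>\n" + "\n".join(rows) + "\n</table>"
    # One line with blank lines before and after: probably a section heading.
    if len(lines) == 1:
        return f"<h2>{html.escape(lines[0].strip())}</h2>"
    # Anything else is an ordinary paragraph.
    return "<p>" + html.escape(" ".join(l.strip() for l in lines)) + "</p>"

def guess_html(text):
    """Produce a first-draft HTML body; a human is expected to fix it up."""
    blocks = split_blocks(text)
    out = []
    if blocks:
        # The first non-blank line in the file is assumed to be the title.
        first = blocks[0]
        out.append(f"<h1>{html.escape(first[0].strip())}</h1>")
        blocks = ([first[1:]] if len(first) > 1 else []) + blocks[1:]
    out.extend(guess_block(b) for b in blocks)
    return "\n".join(out)
```

Calling `guess_html(open("chapter1.txt").read())` on a plain-text chapter (the file name is only a placeholder) would give a draft to correct by hand. The order of the checks matters: a one-line paragraph will be tagged as a heading, which is exactly the sort of wrong guess mentioned above.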
For a few chapters this program successfully converted the text 100%. (Those were fun!) For most it got about 80% to 90% of the tags right. Not perfect, but at least I could then just fix up the remaining problems. For some chapters it guessed wrong too often and I had to do a lot of clean-up. Maybe a smarter program could have done better, though at some point you hit diminishing returns on the effort required to write the program versus the effort saved on translations.
While it was at it, the program could easily throw some boiler-plate around the text. I wanted a header at the top of each file with the title of the book and a couple of links -- table of contents, home page, etc. -- and a footer at the bottom with those same links, the date the file was created, a copyright notice, a mailto: link, etc. It was easy to throw this stuff into a template and have the program wrap it around the body of text as it created each file.
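A template like that might look something like the following Python sketch, using the standard library's `string.Template`. Everything specific in it is a placeholder chosen for illustration, not taken from the actual site: the book title, the link targets `contents.html` and `index.html`, the copyright line, and the address `webmaster@example.org`.

```python
from datetime import date
from string import Template

# All of these values are hypothetical placeholders.
BOOK_TITLE = "Example Book"
LINKS = '<a href="contents.html">Table of Contents</a> | <a href="index.html">Home</a>'
COPYRIGHT = "Copyright by the author."
EMAIL = "webmaster@example.org"

PAGE = Template("""<html>
<head><title>$book: $chapter</title></head>
<body>
<p>$links</p>
<h1>$book</h1>
$body
<hr>
<p>$links</p>
<p>File created $created. $copyright
<a href="mailto:$email">$email</a></p>
</body>
</html>
""")

def wrap_page(body, chapter, created=None):
    """Wrap an already-converted HTML body in the shared header and footer."""
    created = created or date.today()
    return PAGE.substitute(
        book=BOOK_TITLE,
        chapter=chapter,
        links=LINKS,
        body=body,
        created=created.isoformat(),
        copyright=COPYRIGHT,
        email=EMAIL,
    )

if __name__ == "__main__":
    print(wrap_page("<p>Chapter text goes here.</p>", "Chapter 1"))
```

Keeping the boiler-plate in one template means a change to the links or the copyright notice only has to be made once, and every file picks it up the next time the program runs.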
Anyway, I'm putting the program on this Web site, so if you want to download it and try it out, go ahead. See the download and installation instructions.
I found this program philosophically interesting in one way: Normally when I write a program, I intend it to produce complete and correct results. But in this case I was writing a program which I freely admitted up front would only produce a "first draft", which a human being would then have to fix up by hand. I felt a little like I was designing a car that I knew would throw a rod before you got to your destination, but at least it's better to only have to walk PART of the distance rather than the WHOLE distance ...
Anyone out there who has experience in "importing" documents for the Web, I'd be happy to hear what you've learned. Maybe we can help each other.