(Via International Herald Tribune) Google's experimental service Google Scholar scours the Web for academic papers. But while the site may prove an asset for students and those who work in areas like science, neither it nor any other search engine is very useful for historians whose research involves plowing through documents from the time when handwriting ruled. Even after handwritten documents have been scanned and made available through the Web, it is not possible for Google or any other widely used searching technology to read them. The only method in use for searching such documents involves first having them typed into standard computer text, a costly and time-consuming process.
"There is an enormous amount of handwritten stuff locked away in many archives, libraries and museums," said R. Manmatha, an assistant professor with the Center for Intelligent Information Retrieval at the University of Massachusetts at Amherst. "Most of the time when people do research they just ignore this stuff because it's not accessible. "Handwritten manuscripts are just images," he said. Some sophisticated handwriting recognition systems are in use. The U.S. Postal Service and postal agencies around the world use them to read addresses at sorting stations.But Manmatha said the experience developed from those systems was not particularly useful when he and two graduate students, Toni Rath and Victor Lavrenko, began work on their manuscript-searching project. The current systems have to cope with only a limited range of material - for example, names and addresses - written in a consistent format.
To develop their system, Manmatha and his students obtained about 1,000 pages of George Washington's correspondence that had been scanned from microfilm by the Library of Congress. They began by working on a variation of an approach used to search digital photographs, trying to match specific typewritten letters with digital images of their handwritten counterparts. To develop software that would take a holistic approach, Manmatha turned to an idea developed to let search engine users enter queries in their own language to find Web pages written in another language. Rather than mapping words between the two languages one for one, he said, those systems rely on software that is trained to spot common ground between them. Manmatha had a portion of Washington's papers converted into computer text. His software is given some help as it makes its learning comparisons. For example, the system is designed to eliminate the slanting in the handwriting and to rescale words to make them a consistent size. Even after training, the resulting software lacks the accuracy of programs used to read and digitize printed books, he said. "That's the difference between having to recognize a thing and having to search it," he said. "You don't have to get every word right."
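The article does not say how the Amherst system removes slant or rescales words, so the following is only a rough sketch of what such a normalization step might look like, not the researchers' method. It assumes each word has already been cut out as a small binary NumPy array (1 for ink, 0 for paper). The function names `shear_rows`, `deslant` and `rescale_height`, the range of candidate slopes and the target height of 32 pixels are all hypothetical choices made for illustration.

```python
import numpy as np

def shear_rows(img, slope):
    """Shift each row sideways by slope * (distance from the bottom row)."""
    h, w = img.shape
    pad = int(np.ceil(abs(slope) * (h - 1)))      # room for the largest shift
    out = np.zeros((h, w + 2 * pad), dtype=img.dtype)
    for y in range(h):
        shift = int(round(slope * (h - 1 - y)))
        out[y, pad + shift : pad + shift + w] = img[y]
    return out

def deslant(img, slopes=np.linspace(-1.0, 1.0, 21)):
    """Try several shears and keep the one with the most peaked column profile.

    Assumption: upright strokes pile their ink into few columns, so the sum
    of squared column totals is largest when the slant has been removed.
    """
    def score(slope):
        cols = shear_rows(img, slope).sum(axis=0).astype(float)
        return float((cols ** 2).sum())
    best = max(slopes, key=score)
    return shear_rows(img, best), float(best)

def rescale_height(img, target_h=32):
    """Nearest-neighbour resize to a fixed height, keeping the aspect ratio."""
    h, w = img.shape
    target_w = max(1, round(w * target_h / h))
    ys = np.arange(target_h) * h // target_h      # source row for each output row
    xs = np.arange(target_w) * w // target_w      # source column for each output column
    return img[ys][:, xs]
```

Calling `deslant(word)` followed by `rescale_height(...)` on the result would give every word image the same height and a roughly upright stroke direction, which is the kind of help the article describes the software receiving before it compares handwritten words with their typed transcriptions.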
Manmatha said he believed the system was currently about 65 percent accurate. Though no library or archive has yet approached Manmatha about the system, he will brief Google on it early in 2005. With sufficient funds for software development and document scanning, he said, it might be possible within a decade for people to search historical manuscripts as easily as they now locate anything else on the Web. "A tool like this will help people access such material and make possible new discoveries," he said. Handwritten archives are certainly voluminous, and building a search engine that can crawl handwritten repositories everywhere is a great challenge.