Newspapers are full of information, but finding it can be a challenge. Newspaper volumes are often large and heavy, and microfilm, where copies exist, is an irrelevant and unfamiliar medium for many. From the outset we therefore decided to add significant value to the newspaper collection by undertaking both digital imaging (essentially taking a digital ‘photograph’ of each page) and Optical Character Recognition (OCR; making each word machine-readable and therefore searchable by the user).
Newspaper OCR is a complex process and we have recently completed the lengthy procurement of an external contractor, France-based Jouve, to undertake this task on our behalf. In addition to defining an efficient workflow and clarifying logistical arrangements for the exchange of large sets of data over the next year or so, we are also working closely with Jouve to define detailed page analysis and OCR capture rules. How should article boundaries be identified, bearing in mind the inconsistent typographical layout and complex arrangement of newspaper content? What differentiates an ‘article’ from a ‘section’? Are particular fonts indicative of a particular category of information or type of newspaper content? Is there any significance to capitalised words? Can an article carry more than one subtitle and how can these be identified by a machine?
I, for one, have learned to recognise and appreciate the richness and structural complexity of the newspaper page, and often find myself scrutinising the alignment of words and the readability of elaborate fonts for objective clues to help me decipher the authorship or likely topic of an article. Attempting to impose consistency on the highly inconsistent is challenging.
Alan Vaughan Hughes