
Newspapers and OCR

Digitisation - Posted 09-05-2011

Newspapers are full of information, but finding it can be a challenge. Newspaper volumes are often large and heavy, and microfilm, where copies exist, is an unfamiliar and, for many, irrelevant medium. From the outset we therefore decided to add significant value to the newspaper collection by undertaking both digital imaging (essentially taking a digital ‘photograph’ of each page) and Optical Character Recognition (OCR; making each word machine-readable and therefore searchable by the user).

Newspaper OCR is a complex process, and we have recently completed the lengthy procurement of an external contractor, France-based Jouve, to undertake this task on our behalf. In addition to defining an efficient workflow and clarifying logistical arrangements for the exchange of large data sets over the next year or so, we are also working closely with Jouve to define detailed page analysis and OCR capture rules. How should article boundaries be identified, bearing in mind the inconsistent typographical layout and complex arrangement of newspaper content? What differentiates an ‘article’ from a ‘section’? Are particular fonts indicative of a particular category of information or type of newspaper content? Is there any significance to capitalised words? Can an article carry more than one subtitle, and how can these be identified by a machine?

I, for one, have learned to recognise and appreciate the richness and structural complexity of the newspaper page, and often find myself scrutinising the alignment of words and the readability of elaborate fonts for objective clues to help me decipher the authorship or likely topic of an article. Attempting to impose consistency on the highly inconsistent is challenging.

Alan Vaughan Hughes


About this blog

A blog about the work and collections of the National Library of Wales.

Due to the more personal nature of blogs, it is the Library's policy to publish postings in the original language only. An equal number of blog posts are published in Welsh and in English, but they are not the same postings. For a translation of the blog, readers may wish to try facilities such as Google Translate.
