The text we read when we view a web page, a blog or a journal article is full of rich and valuable information. Our brains are very good at processing and making sense of words in the context in which they are presented. We can tell when a word is a placename because we understand the sentence around it, and are expecting to see a place name. Also, we often already know the name of the place and could describe it in further detail from memory.
If computers could understand text as we do then they could be super useful in helping us find and understand information better. Technology such as Named Entity Recognition (NER), where machines are trained to recognise things like people, places and organizations by analyzing a whole text, is increasingly being used to turn plain text into a structured network of ‘things’, and this means machines can make a more complex analysis of text, much as we do.
As part of our ongoing Welsh Place Names project, which is funded by the Welsh Government, we were keen to explore how these new technologies and methodologies might be applied to Welsh language texts and to our own collections. With millions of pages of journals, newspapers and books already digitised, how might this technology help us improve our services for better research, discovery and interpretation?
Named Entity Recognition
The Dictionary of Welsh Biography was chosen for this experiment, as a (fairly) manageable corpus of about 5000 articles, packed with information about people and places. Most placenames have actually already been tagged as such in the mark-up for each page, which gives us a good benchmark for NER models to aim for, and a big corpus of place names for further analysis.
Identifying which words are placenames is the first step in this process. Those names then need to be reconciled against a database of names, which can give us access to a deeper, multilingual understanding of the place.
English language NER tools struggle to identify places in Welsh text for a number of reasons. Firstly, they are not trained to understand the grammatical mutations present in the Welsh language. For example, ‘Tregaron’ is the name of a town in both English and Welsh. However, if the text reads ‘yn Nhregaron’, an English model will not recognise the name because of the mutation (treiglo) of the first letter. Secondly, many placenames are different in Welsh (e.g. Cardiff is Caerdydd), so models trained on English text simply won’t have the word in their vocabulary. Several English models were tested, and many either didn’t recognise names or assumed they were names of people.
Extracting named entities from digital text using ‘Cymrie’
Cymrie was able to extract a number of Welsh placenames, including many with mutations. The text of 5 articles was analyzed in detail. On average the tool extracted approximately 67% of the placenames. Of the names it identified as places, only 2% were not in fact places.
Some of the placenames it was unable to recognise were tagged as people or organizations, though this was at a lower rate than the English language model.
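The two figures above correspond to what are usually called recall (the share of tagged placenames the tool found, about 67%) and precision (the share of extracted names that really were places, about 98%). As a minimal sketch of how such figures can be calculated, assuming we have the tagged placenames and the tool's output as simple lists of strings; the function name and the sample data are invented for illustration and are not taken from the experiment:

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of place-name strings."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty inputs so we never divide by zero
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Illustrative data only: three tagged places, three names returned by a tool,
# one of which ("Dafydd") is a person rather than a place.
gold = {"Tregaron", "Caerdydd", "Aberystwyth"}
predicted = {"Tregaron", "Caerdydd", "Dafydd"}
precision, recall = precision_recall(predicted, gold)
```

Comparing sets of names like this ignores how often each name occurs; counting individual mentions would need positions in the text as well.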
Reconciling the Data
Knowing what words are names of people or places is useful only to a point, because we still know nothing more than ‘it’s a place’. For the data to be really useful we need access to more information about each place, such as its name in other languages, its location on a map and the county, country or continent it is part of. We can then apply a unique identifier to each place and they become unique data entities.
To do this we need to take our long list of place names and attempt to reconcile them against a database which holds more information about them. In our case we are using Wikidata, which is home to one of the largest corpora of Welsh place names available. Wikidata is free for anyone to reuse and is structured as linked data.
The Dictionary of Welsh Biography contains around 80,000 instances of place names. Due to the practicalities of working with such a large dataset, I opted to work with the first 46,000 tagged places.
The tags in the Welsh Biography code often contained more than just the placename. They commonly included a Grid reference, the type of place (city, village etc) and the relation to that place being discussed in the article.
Obviously, having all this information to hand makes the reconciliation process far more likely to succeed. As NER technology improves, it should be able to infer much of this information by understanding the wider context in which the place name appears, but for now we must accept that without this additional information the process would have a far lower success rate.
Using OpenRefine’s reconciliation tool we were able to compare our list of placenames to Wikidata. The software’s algorithm looks for similarities in spelling but also considers the likelihood of a match based on the popularity of its content. By transforming the grid references from our data into coordinates we were also able to instruct OpenRefine to score matches based on their proximity. Places with matching names and a proximity of less than a kilometre were mostly matched automatically. Our data on the type of place was also used to help the software make a judgement.
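The proximity idea can be illustrated with a short sketch. This is not OpenRefine's own scoring algorithm. It is only an assumed, simplified rule that accepts a candidate when the names agree and the two points lie less than a kilometre apart. It assumes the grid references have already been converted to latitude and longitude, and the function names and the `max_km` parameter are hypothetical:

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def is_confident_match(name, coords, candidate_name, candidate_coords, max_km=1.0):
    """Accept a candidate only if the names agree and the points are close."""
    same_name = name.casefold() == candidate_name.casefold()
    return same_name and haversine_km(*coords, *candidate_coords) < max_km
```

A real reconciliation service weighs several signals together rather than applying a hard cut-off, so this rule is best read as the simplest possible version of "same name, same spot".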
In order to give the reconciliation process the best chance of success, some initial cleaning was done to remove mutations from the text. Much of this could be done using a series of transformations such as:
Nghaer – Caer
Nhre – Tre
Others require knowledge of the language and human input in order to avoid the corruption of other names. For example ‘Lan’ cannot be automatically changed to ‘Llan’ without corrupting other names such as ‘Lanishan’.
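The safe transformations described above can be sketched as a small table of prefix rewrites. This is a minimal illustration that covers only the two examples given in the text; the rule list and the function name are assumptions, and ambiguous cases such as ‘Lan’ are deliberately left out so that they can be reviewed by a person:

```python
import re

# Prefix rewrites taken from the two examples in the text.
# A real list would be longer and would need review by a Welsh speaker.
PREFIX_RULES = [
    (re.compile(r"^Nghaer"), "Caer"),
    (re.compile(r"^Nhre"), "Tre"),
]


def undo_mutation(name):
    """Return the name with a known mutated prefix restored, else unchanged."""
    for pattern, replacement in PREFIX_RULES:
        if pattern.match(name):
            return pattern.sub(replacement, name, count=1)
    return name
```

Anchoring each rule to the start of the name keeps the rewrite from touching the same letters in the middle of a word, and any name that matches no rule is returned untouched.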
Other issues included the use of English language names in the Welsh text:
New England (Lloegr Newydd)
Saint Brides (Sant y Brid)
There were also a number of placenames which had suggested matches, but had a high chance of also being the name of a property. For example:
Trawscoed (house, estate and community)
Cilgwyn (village in Powys, Gwynedd, Carmarthenshire AND a gentry house)
Ty-coch (area near Swansea and common house name)
Short of reading each article in order to make a decision, there is currently no way to match such places with any certainty. However, such a manual process could easily be gamified as a crowd-sourcing task. Undertaking such tasks would also create training data for improving NER in the future.
Reconciling the data to Wikidata using OpenRefine
The result was an initial match of 25,000 names, to which a further 2,000 were quickly added following a human review of high-scoring match suggestions. These matches include 2,208 unique place names. Beyond this, an increasing amount of time would be required to match entries manually.
Matching placenames to unique identifiers allows us to examine the frequency of specific places in the text with greater accuracy
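A small sketch shows why counting by identifier is more accurate than counting by spelling. The data below is invented for illustration, and the identifiers are placeholders rather than real Wikidata Qids:

```python
from collections import Counter

# (name as written in the text, identifier) pairs.
# The identifiers are placeholders, not real Wikidata Qids.
mentions = [
    ("Cilgwyn", "Q_PLACEHOLDER_1"),
    ("Cilgwyn", "Q_PLACEHOLDER_2"),  # same spelling, a different place
    ("Gilgwyn", "Q_PLACEHOLDER_1"),  # mutated spelling, same place as the first
    ("Tregaron", "Q_PLACEHOLDER_3"),
    ("Nhregaron", "Q_PLACEHOLDER_3"),
]

by_name = Counter(name for name, _ in mentions)
by_entity = Counter(qid for _, qid in mentions)
```

Counting by spelling both merges two different places called Cilgwyn and splits Tregaron across its mutated form, while counting by identifier does neither.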
Utilizing the enriched data
Now that we have aligned our placenames to Wikidata entries for those places, we have access to a wealth of additional information. This extra information can be summarized in several categories:
Persistent ID – Being able to assign a unique Qid to each placename means we can treat each one as a unique entity, even if there are examples of multiple places with the same name.
External IDs – Wikidata collects persistent IDs from other institutions which hold information about the subject. This helps align and enrich data across multiple datasets.
Contextual information – This includes links to Wikipedia articles, openly licensed images and references to other authoritative works.
Structured Data – Wikidata contains a linked, structured ontology about its items, so places are linked to their administrative hierarchy and to every other item in the dataset with a statement about that place.
This allows us to better understand the connections between people and place. In the example below a computer is able to understand that two people are connected to several common places through reference to these places in their Welsh Biography articles. The colour and thickness of the connecting strands also indicate the frequency of these references within each article.
When this approach is scaled up to the whole corpus we can see a hugely complex web of interconnections between people and places.
And since we now have access to coordinates for all our places, we can visualize these connections on a map. Below we see visualisations for an individual and for the whole collection using people’s birthplace as a starting point, connected to all other places mentioned in their articles.
Using the contextual information in place name tags we can make more granular queries, such as links between the place of birth and places of education mentioned in their articles. This highlights clear correlations to major centres of learning and further demonstrates the research potential of the data.
In conclusion, existing technology can accurately identify around 60-70% of Welsh place names in digital text. Training more advanced A.I. algorithms using larger place name vocabularies and a bigger corpus of training data may help to increase this percentage even further. Undertaking this process at scale would allow for further research and reconciliation work to take place and would also help to improve search and discovery functionality, but it does not identify unique places, only the instance of a place name.
In order to create notable benefits, the data must be reconciled against a database with data about specific places. With many duplications in place names in Wales and around the world this step is vital in creating connections to the correct places. It would seem that we don’t yet have the technology to automate this, in any language, with a high level of certainty. Several examples of pipelines being developed in order to identify entities in text and reconcile directly against Wikidata or other large datasets do exist, including a project by a colleague here at the National Library (link). However, they have faced the same kind of challenges.
Where additional supporting data already exists, as in our Dictionary of Welsh Biography example, it is possible to automate this to some degree, but there is still a significant margin for error without human input.
Whilst accurate and complete identification of entities from a text is not yet possible, these processes offer value, as a standalone activity or as part of a multidisciplinary approach, as a way of improving understanding of a text and improving search and discovery services for users.
Importantly, the ability to undertake this work on Welsh language texts is only possible with the continued development, adaptation and improvement of new technologies, and the availability of Open Access data sources such as Wikidata and OpenStreetMap as well as large corpora of Welsh language text for training machine learning algorithms.
After 64 long years, the Welsh football team finally managed to qualify for their second World Cup tournament, this time held in Qatar. Now that the tournament has ended, I thought that I’d look back at their exploits via the Library’s updated Newsbank subscription, which now includes full image versions for certain titles. To access Newsbank, it is necessary to be an online member of the Library in Wales. See here for more information and here to register. Online members can access Newsbank and the other external resources through the Library’s A-Z of external resources page. They can do so either by being in the Library building or by logging in with their reader’s ticket.
Excitement and expectations were understandably high after such a long absence from the biggest competition in football. Having beaten Ukraine in the play-off finals, Welsh fans could finally look forward to seeing their team perform at the highest stage. In the lead up to the tournament, Dafydd Iwan’s iconic song “Yma o Hyd” was adopted as Wales’ World Cup anthem, and The Guardian interviewed him and other fans to discuss how everyone felt before the tournament.
Here it was, our first World Cup game since 1958! Thousands of Welsh fans had made the trip to be part of the Red Wall, and they and the fans here in Wales were raring for the game to start. However, it looked like the occasion got to the team, and the Americans took a deserved lead midway through the first half. A change was clearly needed in the second half, and the introduction of Kieffer Moore helped get Wales back into the game. With 10 minutes to go, Wales won a penalty after Gareth Bale was clumsily fouled. Bale calmly converted, and Welsh fans went wild. The game ended in a draw, and we had our first point!
After Iran conceded 6 goals in their opening game, Wales fans were quietly confident that they could get a result in this game. With excitement levels growing, the game was shown in schools and workplaces across Wales, due to the 10am kick off. Unfortunately, Iran had other ideas. They were clearly the better side, and they were only denied a goal by a combination of the woodwork and VAR. The situation got worse for Wales after Wayne Hennessey was sent off for clattering into Taremi, suffering the indignity of being the first player of the tournament to receive a red card. It was now a matter of damage limitation, and hanging on for a draw. Wales almost succeeded, but Iran scored 2 quickfire goals at the death to break Welsh hearts.
Having progressed from the group stages in the last 2 European Championships, the chances of doing so in Qatar were hanging by a thread. Any hopes of progressing to the knockout stages were dashed by their English neighbours, and just like that, it was over. Although things didn’t go to plan, this group of players will always be remembered as the team that finally got us back to where all Welsh football fans wanted to be. Diolch bois.
Legal Deposit, Electronic and Acquisitions Librarian
Aberystwyth University, in partnership with the National Library, is launching a new research centre on Friday, 11 November, the Literature and History of Medicine Research Centre. The centre will make use of the research sources in the Library’s medicine collections as a foundation for new academic research in the field. A one-day conference has been arranged for the launch on 11 November. It’s free and you can book a ticket to the event here. The conference will be held in person and online.
The Library’s medicine-related collection is extensive, and includes print material, archival material, manuscript material, architectural material, drawings and photographs. As a result of the Library’s Medicine and Health in Wales before the NHS project, the medicine-related material that is part of the Welsh and Celtic Print Collection is now available on the online catalogue in its entirety, with the items that are out of copyright also digitized and available remotely. The print collection includes a number of important research sources, including the reports of the Medical Officer of Health for the rural and urban district councils across Wales, hospital reports and psychiatric hospital reports.
The psychiatric hospital reports offer a good example of the type of information and data that is included in these print sources. If we look at the example of the annual reports of psychiatric hospitals, in this case the reports of the Joint Counties Asylum at Carmarthen (see above for the embedded digital version or click here to see it on the Library’s digital viewer), we can see the feast of core data that the reports offer to researchers. The reports contain data on a large number of aspects of the life of the hospital and its patients including statistics regarding where patients came from, their work, the nature of their illnesses, mortality rates, the patients’ diet, the patients’ ages, readmission levels, the patients’ relationship status, and the institution’s financial statistics.
Such data is fundamental to research in this field, and it is hoped that establishing the Centre in partnership with Aberystwyth University will be a means of strengthening the relationship between the Library, our collections and the research community. If you want to learn more about the partnership, or if you’re interested in the latest research in the field of literature and the history of medicine, book a ticket to the conference!
Our digitisation work has continued behind the scenes and a number of new items and collections are now available to view from home on the Library’s website and/or the catalogue. Find out what’s new in our blog.
The work on digitising a series of meteorological registers of thermometer, barometer and rain gauge readings in ‘The Chain’ has been completed. They will be available on ‘Torf’ in due course:
C 2/6: Meteorological register. Including enclosures C 2/6/1-40, 1901, Jan. 1-1906, July 7
C 2/7: Meteorological register. Including enclosures C 2/7/1-73, 1906, July 1-1911, July 1
C 2/10: Meteorological register. Including enclosures C 2/10/1-9, 1918, Dec. 29-1923, Feb. 3
C 2/11: Meteorological register. Including enclosures C 2/11/1-6, 1923, Feb. 4-1927, Feb. 12
C 2/12: Meteorological register. Including enclosures C 2/12/1-13, 1927, Feb. 13-1931, Feb. 21
C 2/13: Meteorological register. Including enclosures C 2/13/1-69, 1931, Feb. 22-1935, March 2
C 2/14: Meteorological register. Including enclosures C 2/14/1-32, 1935, March 3-1939, March 11
C 2/15: Meteorological register. Including enclosures C 2/15/1-26, 1939, March 12-1943, March 20
C 2/16: Meteorological register. Including enclosures C 2/16/1-78. The meteorological readings continue to 29 Dec. 1945 only, 1943, March 21-1947, Feb. 8
The Digital Preservation Awards are presented by the Digital Preservation Coalition every two years to celebrate the most significant achievements by individuals and organisations in ensuring the sustainability of digital content. Following a rigorous assessment process, the winners were announced at a glittering presentation ceremony in Glasgow, attended by organisations and practitioners of digital preservation from around the world. The Library was delighted to win the Dutch Digital Heritage Network Award for Teaching and Communications for its project: Learning through doing: building digital preservation skills in Wales, https://www.dpconline.org/news/dpa2022-winners.
Learning through doing was a programme of interactive training delivered by Library staff on the Teams platform to extend digital preservation skills and increase capacity for staff working in organisations across Wales. Resources to support the training are available on the Archives Wales website at https://archives.wales/staff-toolkit/saving-the-bits-programme/.
The Library also contributed to winning another prestigious award. The Archives and Records Association’s award for the New Professional of the Year was won by Gemma Evans. Gemma was employed by the Library to lead the Records at Risk project for the Archives and Records Council Wales. The project was funded by The National Archives Covid-19 Archives Fund, which was established to help archives secure records that were in danger of being lost as a result of the economic impact of the pandemic, which threatened the continuing operation of businesses, charities and organisations across Wales. Gemma developed a Records at Risk Toolkit to enable the identification and preservation of at-risk records, which is available for download on the Archives Wales website at https://archives.wales/records-at-risk/.
Another new year is on the horizon! Let us reflect on the Library’s collection of almanacs and how they were used in the past. These almanacs included dates of fairs and agricultural shows which would be of interest to country folk when planning their year.
Thomas Jones (1648?-1713) was one of the most prominent figures responsible for publishing and writing almanacs. He was born in Merionethshire, the son of a tailor. After moving to London as a young man to start his training there, he changed his career and became a printer and publisher. By 1693, he had moved to Shrewsbury and had established the first Welsh printing press. The main work of the press was to publish books, but it became famous throughout Wales for publishing almanacs. Thomas Jones won a royal patent for the press in 1679 to publish yearly Welsh almanacs, and he did so from 1680 to the year of his death in 1713. The almanacs were very popular in much the same way as we use calendars and year planners today.
In the example shown of Thomas Jones’s almanac, as well as a calendar, we have a short description of typical weather on each day of every month. Thomas Jones, it appears, wanted to warn and entertain his readers at the same time. Some of the days in January are described as windy, others as frosty, others as rainy. Obviously, these are fruits of the imagination rather than a scientific analysis of the climate! But Thomas Jones also included cloudy prophecies in the almanacs with references to complex conditions he himself suffered (he was said to be a hypochondriac!).
His readers were delighted to read the almanacs for practical purposes, but the contents also proved to be a welcome escape from the harsh reality of their lives.
This year marks the centenary of the publication by J. Gwenogvryn Evans of his monochrome facsimile of the contents of the Black Book of Chirk (notwithstanding the 1909 imprinted on the title-page!). Through the generosity of a patron, and to mark the occasion, the National Library has published new digital images of the manuscript on our website.
This manuscript – Peniarth 29 – was once believed to be the earliest written in Welsh. Today, it is regarded as among the earliest, sharing a birthdate, as it were, with another Black Book, the rather more famous one from Carmarthen. Both were produced in the mid-thirteenth century, one in the South, and the other in North Wales.
The Chirk manuscript was written in Welsh, on parchment, by six scribes, in regular and professional style, although their familiarity with written Welsh may not have been fluent.
The volume contains legal texts relating to the king and his court, according to the ‘Venedotian’ or ‘Iorwerth’ code, associated with Gwynedd. The ‘king’ is a native ruler, one such as the young Llywelyn ap Gruffudd, known as ‘the last native Prince of Wales’, whose influence was becoming apparent at the time when the manuscript was written. Following the Law of the Court (reminiscent of those fine images in Peniarth 28, a contemporary Latin law manuscript), the scribes record laws that were relevant to ordinary inhabitants, including elements such as the values of wild and tame animals. A summary, text and translation is available on the Cyfraith Hywel website.
The manuscript also contains non-legal additions, such as proverbs, and Dafydd Benfras’s elegy on the death of Llywelyn ab Iorwerth (Llywelyn the Great) in 1240, harking back perhaps to the ‘golden age’ of native law in the Gwynedd tradition.
But why is the volume associated with Chirk, in Denbighshire? The contents suggest affiliation with medieval North Wales, and by 1615, it was owned by John Edwards of Plas Newydd, Chirk, a scholar and recusant who lost many belongings by sequestration before his death in 1625. Llanstephan MS 68 is a copy of the manuscript, made by Francis Tate whilst the Black Book was owned by Edwards. Subsequently, probably via John Jones of Gellilyfdy, it became part of Robert Vaughan’s library at Hengwrt, and on the upper part of page 114 is part of his ornate inscription identifying the work as ‘Y llyfr du or Waun’ (the Black Book of Chirk).
The original black covers are long gone, but the remains of the binding leaves survive at the end of the manuscript.
Since the beginning of the year work has continued on digitising our collections and the following items and collections are now available to view from home on the Library’s website and/or the catalogue:
33 Ystrad Marchell charters have also been made available and can be accessed via the catalogue.
A selection of volumes relating to King Arthur was chosen for digitization in 2019. The following 13 volumes are already available and the work of digitizing the remaining items will continue over the coming months:
As much of medieval life was centered around religious belief, the daily services of the church (Matins, Lauds, Prime, Terce, Sext, None, Vespers, and Compline) helped to mark the passing of time, particularly for those in holy orders. Consequently, one of the most common types of manuscript to be found in medieval homes was the kind that allowed the laity to observe these services, known as the ‘book of hours’.
For those who could afford them, books of hours were often richly illustrated, and could serve just as much of a decorative purpose as a religious one. But for the average lay person, life was more concerned with the farming year and the passing of the seasons. Many books of hours included illustrations of agricultural tasks which were carried out at various times of the year, such as sowing crops, harvest time, or tree felling, often associated with the various feast days across the year.
The De Grey Hours: [mid. 15th cent.]. A task for midsummer – an illustration of scything in June, with the symbol of the zodiac denoting Cancer, the crab (f. 6r)
In a legal sense, these holy and saints’ days were also commonly used in medieval charters to record the date. Hundreds of examples of this practice can be seen in the collection of the charters of Margam Abbey, Glamorgan, part of the Penrice and Margam Estate Records at NLW.
Margam Abbey was founded in 1147 as a daughter-house of the Cistercian order at Clairvaux and was endowed with a large amount of land by Robert, earl of Gloucester (charter 1). By the late 13th century, Margam was Wales’ richest monastery, owning land and granges in both Wales and England, and Gerald of Wales wrote of Margam in his Itinerarium Cambriae (c.1191) that it was ‘by far the most renowned for alms and charity’. As a result, the Margam Abbey charters, including those of the Penrice and Mansel families, comprise one of the largest and most complete monastic collections in Britain. The majority of its records consist of sealed land grants to and from many of the ruling families of Glamorgan, ranging from the 12th to the 16th centuries. As well as being a source of local history for Glamorgan, Margam’s charters also help to place it in a wider European context – not only containing royal charters and letters patent, but also a number of 13th-century papal bulls (charters 82-84, 141, 171, 173-4, 185, 245) confirming the importance of Margam to the Cistercian order.
Typically, each charter records the day upon which it was signed or sealed, usually given as a feast day or saints’ day, and the year of the reigning monarch. Midsummer Day or Canol Haf – usually celebrated on 21st June but also known as Gŵyl Ifan due to the feast day of St John the Baptist falling on the 24th June – was a significant date in the farming year as it marked the longest day and the turning of seasons as the days shortened and harvest time was nearing. In Margam’s charters, Midsummer is used as a dating clause in several instances. A quit-claim by a William de Marle to Margam Abbey (charter 227, 1354) is dated Midsummer Day, while charters 193 (1312) and 228 (1357), also quit-claims to the Abbey, are dated at Margam ‘the Sunday after Midsummer’ and ‘the Saturday after Midsummer’ respectively. It is not only within land grants that this dating occurs. Charter 233 (1366), which detailed assizes recovering the Abbot of Margam’s salmon fishery from one Res [Rhys] and one Howel, stated that for their piscine thievery each was fined threepence in damages on ‘the Monday before Midsummer Day’.
This theme of agriculture is abundant when looking at the rent requirements in some of Margam’s charters, which stipulate what is given in exchange for each piece of land. Rents could include livestock, crops, or spices, as well as money, and could stipulate a nominal amount in order to make a legal exchange. Charter 302 (1315) asks for just ‘a rose at Midsummer’ in exchange for the rent of half an acre of land; a rose is also given in charter 329 (1383) for a burgage. Charter 306 (1315) more generously specifies a garland of roses to be given annually at Midsummer in exchange for six and three-quarter acres. Symbolically, the only time roses are stipulated to be given is at Midsummer, and they do not appear as an exchange at any other date in Margam’s charters.
Of course, these dates were not always reliable. Margam may have been the wealthiest Abbey in Wales but news in the medieval period travelled more slowly than today and could be hampered by events of the time. Charter 336, for example, issued during the Wars of the Roses, was dated at Oxwich, Gower, on 4th April, yet supplies the year (1461) as the reign of Henry VI, rather than that of Edward IV whose accession had been on the 4th of March previously. Evidently the announcement of Edward’s accession had not yet reached Gower at the time.
Margam Abbey was a prominent landmark in south Wales for nearly four centuries, but it did not survive Henry VIII’s dissolution. In 1540 the Abbey and its lands, including its church, bell-tower, fisheries, cemetery, water-mill, and a large number of its granges were sold to the Mansel family for £938, six shillings and eightpence (charter 359). Incidentally, the charter granting Margam’s dissolution was dated at Westminster on 22nd June. It appears that the Abbey saw its final day at Midsummer.
Although our building is closed at the moment a great deal of work has continued behind the scenes and since June the following items and collections have been made available to view from home on the Library’s website and/or the catalogue:
Almost 10,000 images of personal papers and papers relating to the public offices of members of the Wynn family of Gwydir, Caernarfonshire have been made available. 2,786 items from the Sir John Williams Group, 1519-1683 (NLW MSS 463-470) and the Panton Group, 1515- [c. 1699] (NLW MSS 9051-9069) can be found in the catalogue.
Sir John Herbert Lewis Papers
8 diaries in the Sir John Herbert Lewis Papers from the period 1925-1933 are now available:
A blog about the work and collections of the National Library of Wales.
Due to the more personal nature of blogs it is the Library's policy to publish postings in the original language only. An equal number of blog posts are published in both Welsh and English, but they are not the same postings. For a translation of the blog readers may wish to try facilities such as Google Translate.