Elisabeth Heigl


Posted by Elisabeth Heigl on

Ground Truth is the Alpha and Omega

Release 1.7.1

Ground Truth (GT) is the basis for the creation of HTR models. It is simply a typewritten copy of the historical manuscript, a classic literal or diplomatic transcription that is 100% correct – plainly “Groundt Truth”.

Any mistake in this training material will cause “the machine” to learn – among many correct things – something wrong. That’s why quality management is so important when creating GT. But don’t panic, not every mistake in the GT has devastating consequences. It simply must not be repeated too often; otherwise it becomes “chronic” for the model.

In order to ensure the quality of the GT within our project, we have set up a few fixed transcription guidelines, as you may know them from edition projects. It is worthwhile to strive for a literal, character-accurate transcription. Regularizations of any kind must be avoided, e.g. normalizations such as the vocalic or consonantal use of “u” and “v”, or the resolution of complex abbreviations.

If the material contains only one or two different hands, about 100 pages of transcribed text are sufficient for a first training. This creates a basic model that can be used for further work. In our experience, the number of languages used in the text is irrelevant, since the HTR models usually work without dictionaries.

In addition to conventional transcription, Ground Truth can also be created semi-automatically. Transkribus offers a special tool – Text2Image – which is presented in another post.


How to transcribe – Basic Decisions

In Transkribus, we create transcripts primarily to produce training material for our HTR models, the so-called “Ground Truth”. There are already a number of recommendations in the How-tos for simple and advanced requirements.

We do not intend to create a critical edition. Nonetheless, we need some sort of guidelines, especially if we want to be successful in a team where several transcribers work on the same texts. Unlike classical edition guidelines, ours are not based on the needs of the scholarly reader. Instead, we focus on the needs of the ‘machine’ and the usability of the HTR result for a future full text search. We are well aware that this can only lead to a compromise in the result.

The training material should help the machine to recognize what we see as well. So it has to be accurate and not falsified by interpretation. This is the only way the machine can learn to read in the ‘right’ way. This principle is the priority and a kind of guideline for all our further decisions regarding transcriptions.

Many questions familiar to us from edition projects must also be decided here. In our project we generally use a literal or diplomatic transcription, meaning that we transcribe the characters exactly as we see them. This applies to the entire range of letters and punctuation marks. To give just one example: we don’t regularize the consonantal and vocalic use of the letters “v” and “u”. If the writer meant “und” (and) but wrote “vnndt”, we take it literally and transcribe the latter.

The perfection of the training data has a high priority for us, but other considerations also influence the creation of the GT. We would like to make the HTR results accessible via a full-text search. This means that users must first formulate the word they are searching for before receiving an answer. Since certain characters of the Kurrent script, such as the long “ſ” (s), will hardly ever appear in a search term, we do regularize the transcription in such and similar cases.
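To illustrate the idea, here is a minimal sketch of such a search-oriented regularization. The mapping and the helper function are purely illustrative assumptions, not part of Transkribus or of our actual guideline table:

```python
# Minimal sketch: map characters that users are unlikely to type into a
# search box onto their modern equivalents before indexing.
# The mapping below is an illustrative assumption, not a project guideline.
SEARCH_REGULARIZATIONS = {
    "\u017f": "s",  # long s (ſ) -> round s
}

def regularize_for_search(text: str) -> str:
    """Replace hard-to-type historical characters before indexing."""
    for old, new in SEARCH_REGULARIZATIONS.items():
        text = text.replace(old, new)
    return text

print(regularize_for_search("Gewiſſen"))  # -> Gewissen
```

In practice the same mapping would be applied to the query string as well, so that both sides of the comparison use the regularized form.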

In the case of punctuation marks – allowing ourselves a certain amount of leeway here – we regularize only some. We regularize brackets, for example, which are represented quite differently in the manuscripts. The same applies to word separators at the end of a line.
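Such a line-end separator rule matters once the line transcriptions are merged into one searchable text. A minimal sketch, assuming for illustration that the separator is transcribed as “=” (the actual sign varies by hand and is not prescribed here):

```python
def join_lines(lines):
    """Join transcribed lines into running text, merging words that were
    split at a line end. Assumes "=" marks the split (an illustrative
    convention, not a fixed project rule)."""
    text = ""
    for line in lines:
        line = line.strip()
        if line.endswith("="):
            text += line[:-1]      # drop separator; the word continues
        else:
            text += line + " "
    return text.strip()

print(join_lines(["Rechtsgut=", "achten der", "Facultät"]))
# -> Rechtsgutachten der Facultät
```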

The usual “[…]” is never used for illegible passages. Instead, the text area in question is tagged as “unclear”.


WebUI & Expert Client

As we said before, this blog is almost exclusively about the Expert Client of Transkribus. It offers a variety of possibilities, but handling them requires a certain level of knowledge.

The tools of the WebUI are much more limited, but also easier to work with. In the WebUI it is not possible to perform an automatic layout analysis or to start an HTR, let alone to train a model or to manage users. But that’s not what it’s meant for.

The WebUI is the ideal interface for crowd projects with a lot of volunteers who mainly transcribe or comment and tag content. And this is exactly what it is used for most of the time. The coordination of such a crowd project is done via the Expert Client.

The WebUI’s advantage is that it can be used without any prerequisites. It is a web application called from the browser; no installation, no updates, etc. Moreover, it is almost intuitive and can be used by anyone without previous knowledge.


Tips & Tools
The WebUI also has a version management – somewhat adapted for crowd projects. When transcribers are done with a page, they set its edit status to “ready for review” so that the supervisor knows it is ready to be checked.



Workflow and Information System

The journey from a file in the archive to its digital and HTR-based presentation on an online platform passes through several stations. These steps make up the overall project workflow. They are based on a broad technical infrastructure. The workflow of our project, which is geographically spread over three locations, consists roughly of six main stations:

  1. Preparation of the files and the process (restorative, archival, digital)
  2. Scanning
  3. Enrichment with structural and metadata
  4. Providing the files for Transkribus
  5. Automatic Handwritten Text Recognition (HTR) with Transkribus
  6. Online presentation in the Digital Library Mecklenburg-Vorpommern

It proved helpful that we not only strictly defined the individual steps in advance but also determined the responsible persons from the beginning, including experts for the individual tasks as well as coordinators for the steps across stations and locations. This ensures that all parties involved always know the respective contact person. Open questions can thus be answered more easily and any problems that may arise can be solved more efficiently.

Especially with the scanning of the Spruchakten, we have not proceeded chronologically from the start. We did not scan the inventory ‘from top to bottom’. Instead, we first selected and edited individual representative volumes from between 1580 and 1675, because we wanted to create powerful HTR models first. Only then did we ‘fill up the gaps’. This procedure showed us how crucial it is to continuously document the progress of the project, with all its individual areas and stages, so that it does not become confusing. There are many ways to do this.

We keep – by now very colourful – spreadsheets on the progress of the various collections and files. However, they only depict partial processes and are only accessible to the coordinators. These spreadsheets have to be maintained, and for this purpose the progress in the various areas has to be closely monitored.

Another possibility is the Goobi workflow. Some tasks take place on the Goobi server anyway. In addition to those, we can freely add tasks to the Goobi workflow that do not have to be related to Goobi itself. All of them can be ‘accepted’ and ‘completed’ on that platform to reflect the progress of the project. However, the condition here is that all project contributors must be familiar with this workflow system. Where this is not the case, an “external” information system must be selected that everyone can access and handle.

The different partners of our project therefore jointly keep a wiki (e-collaboration).


Scanning and Structural Data

The Spruchakten of the Greifswald Law Faculty are scanned on Bookeye4 book scanners (Image Access) in combination with the scanning software UCC (Universal Capturing Client) by Intranda. UCC not only allows the capturing of structural data while scanning, but is also directly connected to the Goobi server (also by Intranda), where the digital processes of our project (except for handwritten text recognition) are controlled. Tasks already created in Goobi (Goobi-Vorgang) can thus be accessed in UCC, ‘filled up’ with image-files and associated structural data and then be exported to the Goobi server.

We scan consistently at 400 dpi and 24-bit color depth. The original files created are saved as uncompressed TIF files. For further processing and presentation at the Digital Library Mecklenburg-Vorpommern however they are copied as compressed JPG files.

UCC enables you to capture structural data during the scanning process. This means that the scan operator can already set a structural element for related pages of the file while scanning. Every single “Responsum” (meaning each legal case in the file) receives the structural element “Vorgang”. In the later editing of the metadata, we only have to add a descriptive main title.


Knowing what you want

A digitization project with Handwritten Text Recognition can have very different goals. They can range from a critical digital edition, to the provision of manuscripts as full texts, to the indexing of large text corpora via Key Word Spotting. All three objectives allow different approaches, which have a great influence on the technical and personnel effort required.

In this project, only the last two objectives are of interest. A critical edition is not intended, even if the full texts generated in this project could serve as the basis for one.

We aim at a complete indexing of the manuscripts by automatic text recognition. The results will then be made publicly available online in the Digital Library Mecklenburg-Vorpommern. A search is available there that shows the hits in the image itself. Users with sufficient palaeographic knowledge can explore the context of a hit in the image, switch to a modern full-text view, or rely on the latter alone.
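The mechanics behind “hits shown in the image” can be sketched in a few lines: the recognized text is stored line by line together with pixel coordinates, so a search can return both the image and the region to highlight. The XML below is a simplified, PAGE-like illustration (real PAGE XML uses namespaces and far more metadata; file name and coordinates are invented):

```python
import xml.etree.ElementTree as ET

# Simplified, illustrative PAGE-like XML. Real PAGE XML exports carry
# namespaces and additional metadata; this sketch only mirrors the idea.
SAMPLE = """
<PcGts>
  <Page imageFilename="spruchakte_0001.jpg">
    <TextLine>
      <Coords points="100,200 600,200 600,240 100,240"/>
      <TextEquiv><Unicode>Rechtsgutachten der Facultät</Unicode></TextEquiv>
    </TextLine>
  </Page>
</PcGts>
"""

def find_hits(xml_text, term):
    """Return (image, coords) for every text line containing the term."""
    hits = []
    root = ET.fromstring(xml_text)
    for page in root.iter("Page"):
        image = page.get("imageFilename")
        for line in page.iter("TextLine"):
            text = line.findtext("TextEquiv/Unicode") or ""
            if term.lower() in text.lower():
                coords = line.find("Coords").get("points")
                hits.append((image, coords))
    return hits

print(find_hits(SAMPLE, "Facultät"))
```

A viewer can then draw the returned polygon onto the page image, which is exactly the kind of in-image highlighting the search described above offers.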


Why HTR will change it all

For some years now, archives and libraries have been dedicating more and more of their time to the digitisation of historical manuscripts. The strategies are quite different. Some would like to present their “treasures” in a contemporary manner, others would like to make more extensive collections available for use in an appropriate digital form. The advantages of digitisation are obvious. The original sources are preserved and the interested researchers and non-experts can access the material independently of place and time without having to spend days or weeks in reading rooms. Considering the practice of the 20th century, this is an enormous step forward.

Initially, such digital services provide no more than a digital image of the original historical source. They are developed and maintained at great expense, both financially and in terms of staff. If you look at the target groups of these services, you can see that they are mainly aimed at the very same people who also visit archives and libraries. These addressees, however, usually already have the ability to decipher such historical manuscripts. Optimistically speaking, we are talking about one or two percent of the population. For everyone else, these digital copies are just beautiful to look at.

Keep this picture in mind if you want to understand why Handwritten Text Recognition (HTR) is opening a whole new chapter in the history of digital indexing and use of historical documents. In a nutshell: HTR allows us to move from simple digitisation to the digital transformation of historical sources. Thanks to HTR, not only the digital image of a manuscript but also its content is made available in a form that can be read by everyone and searched by machines – over hundreds of thousands of pages.

Thus the contents of historical manuscripts can be opened up to a public to whom they have so far remained closed or at least not easily accessible. This does not only address non-professional researchers. Access to the contents of the sources will also be much easier for academic experts from disciplines that do not have historical auxiliary sciences like palaeography as part of their classical educational canon. This makes new constellations of interdisciplinary research possible. Ultimately, since the contents of the manuscripts can now be evaluated by machine, questions and methods of the Digital Humanities can be applied to the material more easily than before.

Tips & Tools
Recommendation for further reading: Mühlberger, Archiv 4.0 oder warum die automatisierte Texterkennung alles verändern wird, in: Massenakten – Massendaten. Rationalisierung und Automatisierung im Archiv (Tagungsdokumentationen zum Deutschen Archivtag, Band 22), hg. v. VdA, Fulda 2018, S. 145–156.