Posted by Dirk Alvermann on

Ground Truth is the Alpha and Omega

Ground Truth (GT) is the basis for the creation of HTR models. It is simply a typewritten copy of the historical manuscript, a classic literal or diplomatic transcription that is 100% correct: plainly “Ground Truth”.

Any mistake in this training material will cause “the machine” to learn – among many correct things – something wrong. That’s why quality management is so important when creating GT. But don’t panic, not every mistake in the GT has devastating consequences. It simply must not be repeated too often; otherwise it becomes “chronic” for the model.

To ensure the quality of the GT within our project, we have set up a few fixed transcription guidelines of the kind familiar from edition projects. It is worthwhile to strive for a literal, character-accurate transcription. Normalizations of any kind must be avoided, e.g. regularizing the vocalic or consonantal usage of “u” and “v”, or the encoding of complex abbreviations.

If the material contains only one or two different hands, about 100 pages of transcribed text are sufficient for a first training session. This creates a basic model that can be used for further work. In our experience, the number of languages used in the text is irrelevant, since the HTR models usually work without dictionaries.

In addition to conventional transcription, Ground Truth can also be created semi-automatically. Transkribus offers a special tool – Text2Image – which is presented in another post.

Posted by Elisabeth Heigl on

How to transcribe – Basic Decisions

In Transkribus, we create transcripts primarily to produce training material for our HTR models, the so-called “Ground Truth”. There are already a number of recommendations in the How-to’s for simple and advanced requirements.

We do not intend to create a critical edition. Nonetheless, we need some sort of guidelines, especially if we want to be successful in a team where several transcribers work on the same texts. Unlike classical edition guidelines, ours are not based on the needs of the scholarly reader. Instead, we focus on the needs of the ‘machine’ and the usability of the HTR result for a future full text search. We are well aware that this can only lead to a compromise in the result.

The training material should help the machine to recognize what we see as well. So it has to be accurate and not falsified by interpretation. This is the only way the machine can learn to read in the ‘right’ way. This principle is the priority and a kind of guideline for all our further decisions regarding transcriptions.

Many questions that are familiar to us from edition projects must also be decided here. In our project we generally use the literal or diplomatic transcription, meaning that we transcribe the characters exactly as we see them. This applies to the entire range of letters and punctuation marks. To give just one example: we don't normalize the consonantal and vocalic usage of the letters “v” and “u”. If the writer meant “und” (and) but wrote “vnndt”, we take it literally and transcribe “vnndt”.

The quality of the training data has a high priority for us. But other considerations also influence the creation of the GT. We would like to make the HTR results accessible via a full-text search. This means that a user must first enter a search term before receiving an answer. Since certain characters of the Kurrent script, such as the long “ſ” (s), will hardly be part of a search term, we do normalize the transcription in these and similar cases.
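The distinction between what stays literal and what is adapted for search can be illustrated with a small Python sketch. It is not part of Transkribus or of our tooling; the function name and the mapping table are hypothetical, and only the long “ſ” is taken from the text above.

```python
# Hypothetical mapping: characters a search user is unlikely to type,
# mapped to the form transcribed instead. Only "ſ" comes from the post;
# anything else added here would be a project decision.
SEARCH_NORMALIZATION = {"ſ": "s"}


def normalize_for_search(text: str) -> str:
    """Replace only the characters listed in SEARCH_NORMALIZATION.

    Everything else, including the u/v usage, is left exactly as written.
    """
    return text.translate(str.maketrans(SEARCH_NORMALIZATION))


print(normalize_for_search("vnndt"))  # u/v usage is untouched
print(normalize_for_search("diſes"))  # long s becomes a round s
```

Keeping the mapping in one explicit table means every exception to the literal transcription is a visible, deliberate decision that the whole team can follow.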

With punctuation marks we allow ourselves some leeway and normalize only a few. For example, we normalize brackets, which are represented quite differently in the manuscripts. The same applies to word separators at the end of a line.
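As a rough illustration of such a rule, the following Python sketch maps several manuscript variants to one standard form. The concrete variants (“/:” and “:/” for brackets, “=” and “⸗” as line-end separators) and the target characters are assumptions chosen for the example; the guidelines described here do not list them.

```python
# Hypothetical variants: the post does not say which forms occur
# in the manuscripts or which standard form they are mapped to.
BRACKET_VARIANTS = {"/:": "(", ":/": ")"}
LINE_END_SEPARATORS = ("=", "⸗")


def normalize_line(line: str, separator: str = "-") -> str:
    """Unify bracket variants and a trailing word separator in one line.

    Trailing whitespace is removed so that the separator check looks at
    the last visible character of the line.
    """
    for variant, standard in BRACKET_VARIANTS.items():
        line = line.replace(variant, standard)
    line = line.rstrip()
    if line.endswith(LINE_END_SEPARATORS):
        line = line[:-1] + separator
    return line


print(normalize_line("/: wie oben :/ vnndt ge="))
```

The letters themselves are not touched: “vnndt” stays “vnndt”, only the bracket forms and the final separator change.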

The usual “[…]” is never used for illegible passages. Instead this text area is tagged as “unclear”.
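When several transcribers work on the same texts, a rule like this can be checked mechanically. The sketch below is a hypothetical helper, not a Transkribus feature: it assumes the transcript is available as a list of plain-text lines and reports where a bracketed ellipsis was typed, so that the passage can be tagged as “unclear” instead.

```python
import re

# Matches "[…]" as well as "[...]", with optional spaces inside the brackets.
ELLIPSIS_PLACEHOLDER = re.compile(r"\[\s*(?:…|\.{3})\s*\]")


def find_placeholders(lines):
    """Return (line number, column) pairs for every bracketed ellipsis.

    Line numbers start at 1, columns at 0.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        for match in ELLIPSIS_PLACEHOLDER.finditer(line):
            hits.append((number, match.start()))
    return hits


sample = ["vnndt […] herr", "ohne", "[...]"]
print(find_placeholders(sample))
```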