How to transcribe – Basic Decisions
In Transkribus, we create transcripts primarily to produce training material for our HTR-models; the so called „Ground Truth“. There are already a number of recommendations in the How-to’s for simple and advanced requirements.
We do not intend to create a critical edition. Nonetheless, we need some sort of guidelines, especially if we want to be successful in a team where several transcribers work on the same texts. Unlike classical edition guidelines, ours are not based on the needs of the scholarly reader. Instead, we focus on the needs of the ‘machine’ and the usability of the HTR result for a future full text search. We are well aware that this can only lead to a compromise in the result.
The training material should help the machine to recognize what we see as well. So it has to be accurate and not falsified by interpretation. This is the only way the machine can learn to read in the ‘right’ way. This principle is the priority and a kind of guideline for all our further decisions regarding transcriptions.
Many questions that are familiar to us from edition projects must also be decided here. In our project we generally use the literal or diplomatic transcription, meaning that we transcribe the characters exactly as we see them. This applies to the entire range of letters and punctuation marks. To give just an example: we don´t regulate the consonantal and vocal usage of the letters “v” and “u”. If the writer meant “und” (and) but wrote “vnndt”, we take it as literal and transcribe as the latter.
The perfection of the training data has a high priority for us. But there are also some other considerations influencing the creation of the GT. We would like to make the HTR results accessible via a full-text search. This means that a user must first phrase a word he is searching before receiving an answer. Since certain characters of the Kurrents font, such as the long „ſ“ (s) will hardly be part of a search term, we do regulate the transcription in such and similar cases.
In the case of sentence characters – using a certain amount of leeway here – we only regulate some. We regulate the bracket character for example, which is represented quite differently in the manuscripts. The same applies to word separators at the end of a line.
The usual “[…]” is never used for illegible passages. Instead this text area is tagged as “unclear”.