Monthly Archives

2 Articles

Posted by Dirk Alvermann on

Breaking the rules – the problem with concept writings

Release 1.10.1

Concept scripts are were used when a scribe quickly creates created a draft that is later on “written in the clean”. In the case of the Spruchakten, these are mainly the drafts to the judgments that were sent away later. The concept scripts were usually written very quickly and “sloppy”. Often letters are omitted or word endings “swallowed”. Even for humans conceptual writings are not easy to decipher  – for the machine they are a particular challenge.

To train an HTR model for reading concept scripts, you proceed in a similar way to training a model that is to interpret abbreviations. In both cases, the HTR model must be enabled to read something that is not really there – namely missing letters and syllables. To achieve this we must break our first rule: “We transcribe as ground truth only what is really written on paper”. Instead, we have to include all skipped letters and missing word endings etc. in our transcription. Otherwise we will not get a sensible and searchable HTR result in the end.

In our experiments with concept writings we tried at first to train special HTR models for concept scripts. The success was rather small. Finally, we decided to train concept scripts – similar to abbreviations – directly within our generic model. In doing so, we checked again and again whether the “wrong ground truth” that we produce in the process worsened the overall result of our HTR model. Surprisingly, the breaking of the transcription rule had no negative effect on the quality of the model. But this could also happen due to the sheer amount of ground truth used in our case (about 400,000 words).

HTR models are therefore able to distinguish concept writings from fair copies and interpret them accordingly – within certain limits. Below you can see a comparison of the HTR result with the GT for a typical concept script from our material.

Posted by Elisabeth Heigl on

Language Models

Release 1.10.1

We talked about the use of dictionaries in an earlier post and mentioned that the better an HTR model is (CER below 7%), the less useful a dictionary is for the HTR result.
This is different when using Language Models, which are available in Transkribus since December 2019. Like dictionaries, Language Models are generated from the ground truth used in each HTR training. Unlike dictionaries, Language Models do not aim at identifying individual words. Instead, they determine the probability of a word sequence or the frequent combination of words and expressions in a particular context.
Unlike dictionaries, the use of Language Models always leads to much better HTR results. In our tests, the average CER improved by as much as 1% compared to the HTR result without the Language Model – consistently, on all test sets.

Tips & Tools: The Language Model can be selected when configuring the HTR. Unlike dictionaries, Language Models and HTR model cannot be freely combined. Each HTR model has its uniquely genereated Language Model and only this one can be used.