
Use Case: Extend and improve existing HTR models

Release 1.10.1

In the last post we described that a base model can pass on everything it has “learned” to the new HTR model. With additional ground truth, the new model can then extend and improve its capabilities.

Here is a typical use case: in our subproject on the Assessor Relations of the Wismar Tribunal, we trained a model with eight different writers’ hands. The train set contained 150,000 words, and the CER after the last training was 4.09%. However, the average CER for some writers was much higher than for others.

So we decided to do an experiment. We added 10,000 words of new GT for two of the conspicuous writers (Balthasar and Engelbrecht) and used the base model, together with its train and validation sets, for the new training.

As a result, the new model had an average CER of 3.82% – it had improved. What is remarkable is that the CER did not improve only for the two writers for whom we had added new GT (in both cases by up to 1%): the reliability of the model on the other writers did not suffer either, but their error rates dropped slightly as well.


On the Shoulders of Giants: Training with Base Models

Release 1.10.1

If you want to develop generic HTR models, there is no way around working with base models. When training with base models, each training session builds on an existing model – the base model. This is usually the last HTR model that was trained in the corresponding project.

Base models “remember” what they have already “learned”. Each new training session should therefore (at least in theory) improve the quality of the model: the new model learns from its predecessor and becomes better and better. Training with base models is thus particularly suitable for large generic models that are developed continuously over a long period of time.

To carry out a training with a base model, you simply select a specific base model in the training tool – in addition to the usual settings. Then, from the HTR Model Data tab, insert the train set and the validation set (called test set in earlier Transkribus versions) of the base model, as well as the new train and validation sets. You can also add further new ground truth before starting the training.
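Behind the scenes this is the familiar machine-learning pattern of fine-tuning from a pre-trained checkpoint. Transkribus handles all of this internally; the following toy PyTorch sketch merely illustrates the principle, and every name, shape and dataset in it is invented for the example:

```python
import torch
from torch import nn, optim

def make_model():
    # toy stand-in for an HTR network (real ones are CNN/LSTM/CTC stacks)
    return nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

def train(model, data, epochs=3, lr=1e-3):
    opt = optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for x, y in data:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

# toy "ground truth": random feature/target pairs standing in for line images
old_gt = [(torch.randn(32, 8), torch.randn(32, 4)) for _ in range(10)]
new_gt = [(torch.randn(32, 8), torch.randn(32, 4)) for _ in range(3)]

base = make_model()
train(base, old_gt)                                     # earlier training session
torch.save(base.state_dict(), "base_model.pt")

successor = make_model()
successor.load_state_dict(torch.load("base_model.pt"))  # "remembers" the base model
train(successor, old_gt + new_gt, lr=1e-4)              # old sets plus new ground truth
```

The decisive line is the load_state_dict call: the successor starts from the base model’s weights instead of from scratch, which is why nothing that was already learned is lost.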


Validation possibilities

Release 1.10.1

There are different ways to measure the accuracy of our HTR models in Transkribus. Three Compare tools calculate the results and present them in different ways. In all three cases, the hypothesis (the HTR version) of a text is compared with a corresponding reference (the correct version, i.e. the GT) of the same text.

The first tool, which shows the most immediate result, is “Compare Text Versions”. It visualizes the validation for the currently opened page in the text itself. Here we can see exactly which mistakes the HTR has made at which points.

The standard “Compare” gives us these same validation results as numerical values. Among other things, it calculates the average word error rate (WER), the character error rate (CER) and the respective accuracy rates. (If anyone knows what the bag tokens are about, feel free to leave us a comment.) From the “Compare” we can also run the “Advanced Compare”, which performs the corresponding calculations for the whole document or only for selected pages.
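Both error rates boil down to the Levenshtein (edit) distance between reference and hypothesis – counted over characters for the CER and over words for the WER. Here is a minimal sketch of the calculation (our own illustration, not Transkribus code):

```python
def levenshtein(ref, hyp):
    # minimum number of insertions, deletions and substitutions
    # needed to turn hyp into ref (classic dynamic programming)
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(gt, htr):
    # character error rate: edits per GT character
    return levenshtein(gt, htr) / max(len(gt), 1)

def wer(gt, htr):
    # word error rate: edits per GT word
    return levenshtein(gt.split(), htr.split()) / max(len(gt.split()), 1)

print(cer("Wismar Tribunal", "Wisrnar Tribunal"))  # 2 edits / 15 chars ~ 0.13
print(wer("Wismar Tribunal", "Wisrnar Tribunal"))  # 1 of 2 words wrong -> 0.5
```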

We have already briefly presented the validation tool “Compare Sample” in another post, showing how to create test samples. The actual Sample Compare then predicts how a model is likely to read on a test sample created for this purpose.


Generic Models and what they do

Release 1.10.1

In a previous post we talked about the differences between special models and generic models. Special models should always be the first choice if your material includes a limited number of writers. If your material is very diverse – for example, if the writer changes frequently within a bundle of manuscripts – it makes sense to train a generic model.

The following articles are based on our experiences with the training of a generic model for the Responsa of the Greifswald Law Faculty, in which about 1,000 different writers’ hands were trained.

But first: what should a generic HTR model be able to do? The most important point has already been mentioned: it should be able to handle a variety of different writers’ hands. But it should also be able to “read” different scripts (alphabets) and languages, and be able to interpret abbreviations. Below are a few typical examples of such challenges from our collection.

Different writers’ hands in one script:

Abbreviations:

Different languages in one script:


Breaking the rules – the problem with concept writings

Release 1.10.1

Concept scripts were used when a scribe quickly created a draft that was later written out fair. In the case of the Spruchakten, these are mainly the drafts of the judgments that were later sent out. Concept scripts were usually written very quickly and “sloppily”; often letters are omitted or word endings “swallowed”. Even for humans, concept scripts are not easy to decipher – for the machine they are a particular challenge.

To train an HTR model to read concept scripts, you proceed in a similar way as when training a model to interpret abbreviations. In both cases, the HTR model must learn to read something that is not really there – namely missing letters and syllables. To achieve this, we must break our first rule: “We transcribe as ground truth only what is really written on paper.” Instead, we have to include all skipped letters, missing word endings etc. in our transcription. Otherwise we will not get a sensible and searchable HTR result in the end.

In our experiments with concept scripts, we first tried to train special HTR models for them, with little success. Finally, we decided to train concept scripts – similar to abbreviations – directly within our generic model. In doing so, we checked again and again whether the “wrong ground truth” we produced in the process worsened the overall result of our HTR model. Surprisingly, breaking the transcription rule had no negative effect on the quality of the model. This may, however, also be due to the sheer amount of ground truth used in our case (about 400,000 words).

HTR models are therefore able to distinguish concept writings from fair copies and interpret them accordingly – within certain limits. Below you can see a comparison of the HTR result with the GT for a typical concept script from our material.


Language Models

Release 1.10.1

We talked about the use of dictionaries in an earlier post and mentioned that the better an HTR model is (CER below 7%), the less useful a dictionary is for the HTR result.
This is different when using Language Models, which have been available in Transkribus since December 2019. Like dictionaries, Language Models are generated from the ground truth used in each HTR training. Unlike dictionaries, Language Models do not aim at identifying individual words. Instead, they determine the probability of a word sequence, i.e. of frequent combinations of words and expressions in a particular context.
Unlike dictionaries, the use of Language Models always leads to much better HTR results. In our tests, the average CER improved by as much as 1% compared to the HTR result without the Language Model – consistently, on all test sets.
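Transkribus does not document the internals, but conceptually such a word-sequence model behaves like an n-gram model over the training ground truth. A minimal bigram sketch (our own illustration; the data and smoothing are invented for the example):

```python
from collections import Counter

def train_bigram(sentences):
    # count words and word pairs in the ground-truth text
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, prev, word, alpha=1.0):
    # add-alpha smoothing keeps unseen pairs at a small non-zero probability
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * len(unigrams))

gt = ["der Rat der Stadt", "der Rat beschliesst"]
uni, bi = train_bigram(gt)
print(bigram_prob(uni, bi, "der", "Rat"))    # 0.375 - "Rat" is likely after "der"
print(bigram_prob(uni, bi, "der", "Stadt"))  # 0.25  - "Stadt" is less likely
```

During recognition, such probabilities can tip the scales between two visually similar readings in favour of the more plausible word sequence.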

Tips & Tools: The Language Model can be selected when configuring the HTR. Unlike dictionaries, Language Models and HTR model cannot be freely combined. Each HTR model has its uniquely genereated Language Model and only this one can be used.


“between the lines” – Handling of inserts

Overwritings and text passages inserted between the lines occur at least as often as deletions or blackenings. It is useful – for both layout and transcription – to clarify at the beginning of a project how these cases should be handled.

In this simple example you can see how we handle such cases.

Since we include erasures and blackenings in both layout and text, it is only logical to treat overwritings and insertions in the same way. Usually, such passages already receive separate baselines during the automatic layout analysis; every now and then corrections are necessary. In any case, we treat each insertion as a separate line and take it into account accordingly in the reading order.

Under no circumstances should you transcribe overwritings or insertions above the line in place of the deleted text. This would falsify the training material, even though the presentation of the text would of course be more pleasing to the human eye.


Treatment of erasures and blackenings

HTR models treat erased text the same as any other. They make no distinction and therefore always provide a reading result. In fact, they are amazingly good at this and read useful content even where a transcriber would have given up long ago.

The simple example shows an erased text that is still legible to the human eye and which the HTR model has read almost without error.

Because we have already seen this ability of the HTR models in other projects, we decided from the beginning to transcribe as much of the erased and blacked out text as possible in order to use the potential of the HTR. The corresponding passages are simply tagged as text-style “strike through” in the text and thus remain recognizable for possible internal search operations.

If you don’t want to make this effort, you can intervene in the layout at the text passages that contain erasures or blackenings and, for example, delete or shorten the baselines accordingly. Then the HTR does not “see” such passages and cannot recognize them. But this is no less laborious than the approach we have chosen.

Under no circumstances should you render erasures with the “popular” placeholder “[…]”. Transcribers know what this deletion sign means, but the HTR learns something completely wrong here if many such passages end up in the training material.


Test samples – the impartial alternative

If a project concept does not allow for the strategic planning and organization of test sets, there is a simple alternative: automatically generated samples. Samples are also a valuable addition to manually created test sets, because here it is the machine that decides which material goes into the test and which does not: Transkribus selects individual lines from a set of pages that you have previously provided, so the selection is more or less random. This is worthwhile for projects that have a lot of material at their disposal – including ours.

We use samples as a cross-check on our manually created test sets, and because samples are comparable to the statistical method of random sampling, which the DFG also recommends in its practical guidelines for testing the quality of OCR.

Because HTR – unlike OCR – is line-based, we modify the DFG’s recommendation somewhat. Let me explain with an example: for our model Spruchakten_M-3K, we wanted to check the reading accuracy achieved. For this purpose we created a sample exclusively from untrained material covering the whole period for which the model works (1583–1627). We selected every 20th page from the entire data set, obtaining a subset of 600 pages – approximately 3.7% of the total 16,500 pages of material available for this period. All of this is done on your own network drive, before anything is uploaded. After uploading this subset to Transkribus and processing it with the CITlab Advanced LA (16,747 lines were detected), we let Transkribus create a sample from it, containing 900 randomly selected lines – about 5% of the subset. This sample was then provided with GT and used as a test set to check the model.

And this is how it works in practice: call up the “Sample Compare” function in the “Tools” menu. Select your subset in the collection and add it to the sample set by pressing the “add to sample” button. Then specify the number of lines Transkribus should select from the subset. You should choose at least as many lines as there are pages in the subset, so that each page contributes one test line on average.

In our case, we decided to use a factor of 1.5 just to be sure. The program now selects the lines on its own and compiles a sample from them, which is saved as a new document. This document does not contain pages, only lines. These must now be transcribed as usual to create GT. Afterwards, any model can be tested on this test set using the Compare function.
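Outside Transkribus, the same sampling logic can be expressed in a few lines. A sketch with invented toy data (the real selection happens inside Transkribus):

```python
import random

def make_sample(pages, page_step=20, factor=1.5, seed=0):
    """Take every page_step-th page, pool their lines, then draw
    factor * (number of subset pages) lines at random."""
    subset = pages[::page_step]                      # e.g. every 20th page
    lines = [ln for page in subset for ln in page]   # pool after layout analysis
    k = int(len(subset) * factor)                    # e.g. 600 pages -> 900 lines
    return random.Random(seed).sample(lines, min(k, len(lines)))

# toy data: 2,000 "pages" of 25 dummy line IDs each
pages = [[f"page{p}_line{l}" for l in range(25)] for p in range(2000)]
sample = make_sample(pages)
print(len(pages[::20]), "pages in the subset,", len(sample), "lines in the sample")
# -> 100 pages in the subset, 150 lines in the sample
```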


Image resolution

A technical parameter that needs to be determined uniformly at the beginning of the scanning process is the resolution of the digital images, i.e. how many dots per inch (dpi) the scanned image should have.

The DFG Code of Practice for Digitisation generally recommends 300 dpi (p. 15). For “text documents with the smallest significant character” of up to 1.5 mm, however, a resolution of 400 dpi can be selected (p. 22). Indeed, in the case of manuscripts – especially concept writings – of the early modern period, the smallest character components can result in different readings and should therefore be as clearly identifiable as possible. We have therefore chosen 400 dpi.
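A quick back-of-the-envelope calculation shows what the 1.5 mm threshold means in practice (our own illustration of the arithmetic):

```python
# How many pixels does a 1.5 mm character occupy at a given resolution?
MM_PER_INCH = 25.4

def pixels(size_mm, dpi):
    return size_mm * dpi / MM_PER_INCH

for dpi in (150, 200, 300, 400):
    print(f"{dpi} dpi: {pixels(1.5, dpi):.1f} px")
# 150 dpi:  8.9 px
# 200 dpi: 11.8 px
# 300 dpi: 17.7 px
# 400 dpi: 23.6 px
```

At 400 dpi, the smallest significant character is thus rendered with roughly 24 pixels instead of 18, which gives the recognition noticeably more detail to work with.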

In addition to the advantages for deciphering the scripts, however, the significantly larger storage requirements of the 400 dpi files (around 40,000 KB per image) compared to the 300 dpi files (around 30,000 KB per image) must be taken into account and planned for!

The selected dpi number also has an impact on the process of automatic handwritten text recognition: different resolutions lead to different results in the layout analysis and the HTR. To verify this thesis, we selected three pages from a Spruchakte (judgment file) of 1618, scanned each of them at 150, 200, 300 and 400 dpi, processed all pages in Transkribus and determined the following CERs (in %):

Page/dpi    150     200     300     400
2           3.99    3.50    3.63    3.14
5           2.10    2.37    2.45    2.11
9           6.73    6.81    6.52    6.37

Generally speaking, a higher resolution therefore tends to mean a lower CER – though within a range of less than one percentage point. To be honest, such comparisons of HTR results are somewhat misleading. The very basis of the HTR – the layout analysis – produces subtly different results at different resolutions, which in turn influences the HTR results (apparently cruder analyses achieve worse HTR results) as well as the GT production itself (e.g. through cut-off words).

In our example you can see the same image at different resolutions. The result of the CITlab Advanced LA changes as the resolution increases: the initial “V” of the first line is no longer recognized at higher resolutions, while line 10 is increasingly “torn apart”. With increasing resolution, the LA becomes more sensitive – which can be an advantage and a disadvantage at the same time.

It is therefore important to have a uniform dpi number for the entire project so that the average CER values and the entire statistical material can be reliably evaluated.