Dirk Alvermann


Posted by Dirk Alvermann on

Our first public model for german current (17th century)

Today we proudly present our HTR-model “Acta 17” as a public model.

The model was trained on the base of more than 500,000 words from about a 1000 different writers during the period of 1580-1705. It can handle the languages German, Lower German and Latin and is able to decipher simple german and latin abbreviations. Besides the usual chancellery lettering, the training material also contained a selection of concept writings and printed material of the period.

The entire training material is based on legal texts or court writings from the Responsa of the Greifswald Law Faculty. Validation sets are based on a chronological selection of the years: 1580 – 1705. GT & validation set was produced by Dirk Alvermann, Elisabeth Heigl, Anna Brandt.

Due to some problems with creating a large series of base-model-trainings for HTR+ Models in the last couple of weeks, we decided to launch an HTR+ Model trained from the scratch.

It is accompanied by a PyLaia model, which is based on the same training and validation sets and was also trained without using a base model.

For the validation set we choose pages representing single years of the total set of documents. All together they represent 48 selected years, five pages of each year.

How the models do perform in the several time periods of the validation set, you can check in the comparison below. Both models did run without language model.

Posted by Dirk Alvermann on

textual tagging

Like everything else, tagging can be integrated into your work on very different levels and with different requirements. In Transkribus, a large number of tags are available for a wide range of applications, some of which are described here.

We decided to try it with only two tags, namely “person” and “place”. These tags will later allow systematic access to the corresponding text passages.

When tagging, Transkribus automatically adopts the term under the cursor as “value” or “label” for the specific case. So if I mark “Wolgast” as in the example below and tag it as “place”, then two important pieces of information are already recorded. The same is true for the name of the person further down.

Transkribus offers the possibility to assign properties to each tagged element, e.g. to display the historical place name in modern spelling or to assign a gnd number to the person’s name. You can also create additional properties, geodata for places etc.

Given the amount of text we process, we have decided not to assign properties to our tags. Only the place names are identified as best as possible. The aim is to be able to display the tag-values separately for people and places next to the respective document when presenting them in the viewer of the Digital Library M-V, thus enabling the user to navigate systematically through the document.

Posted by Dirk Alvermann on

Tagging: what for? – when and why tagging makes sense

Tagging allows – in addition to content indexing by HTR – systematic indexing of the text by the later user. In contrast to an HTR model that does its work independently, tagging has to be done mostly by hand, which means that it requires a lot of effort. Therefore, a realistic effort analysis should be carried out before developing far-reaching plans regarding tagging.

Due to the amount of material processed in our project, we primarily use tagging where it helps us in the practical work on the text. This is the case with structure tagging, where the layout analysis is improved with the help of the tagging and the P2PaLA developed from it, and then of course also with the tagging of textstyles in case of deletions and blackening. This is where tagging is basically used “area-wide” by us. A fixed component of our transcription rules is also the use of the “unclear” tag for passages that cannot be read correctly by the transcriber. In this case, the tag is used more for internal team communication.

For the systematic preparation of texts for which an HTR has already been performed, we are experimenting with the “person” and “place” tags in order to offer systematic indexing, at least in this limited form.

Posted by Dirk Alvermann on

Use Case: “Model Booster”

Release 1.10.1

In our example we want to improve our HTR model for the Responsa. This is an HTR model that can read 17th century Kurrent documents. In the search for a possible base model, you can find two candidates in the “public models” of Transkribus: “German Kurrent M1+” from the Transkribus Team and “German_Kurrent_XVI-XVIII_M1” from Tobias Hodel. Both could fit. But the test on the Sample Compare shows that “German_Kurrent_XVI-XVIII_M1” performed better with a predicted average CER of 9.3% on our sample set.

Therefore “German_Kurrent_XVI-XVIII_M1” was chosen as the base model for the training. Afterwards the Ground Truth of the Responsa (108.000 words) and also the Validation Set of our old model was added. The average CER of our HTR model has improved considerably after the Base Model Training, from 7.3% to 6.6%.As you can see in the graph, the base model on the test set reads much worse than the original model, but the hybrid of the two is better than either one. The improvement of the model can be seen in each of the years tested and is up to 1%.

Posted by Dirk Alvermann on

Combining Models

Release 1.10.1

The longer you train HTR models yourself, the more you will be interested in the possibility of combining models. For example, you may want to combine several special models for individual writers or models that are specialized in particular fonts or languages.

To achieve a combination of models there are different possibilities. Here I would like to introduce a technique that works in my experience especially well for very large generic models – the “Model Booster“.

You start a base model training and use a powerful, foreign HTR model as base model and your own ground truth as train set. But before you start, two recommendations:

a) take a close look at the characteristics of the base model you are using (for how long is it trained, for which font style and which language?) – they have to match those of your own material as much as possible.

b) if possible try to predict the performance of the base model on your own material and then choose the base model with the best performance. Such a prediction can be made quite easily using the Sample Compare function. Another possibility is to test the basemodel with the Andvanced Compare on your own test set.

Posted by Dirk Alvermann on

Use Case: Extend and improve existing HTR-Models

Release 1.10.1

In the last post we described that a base model can pass on everything it has “learned” to the new HTR model. With additional ground truth, the new model can then extend and improve its capabilities.

Here is a typical use case: In our subproject on the Assessor Relations of the Wismar Tribunal we train a model with eight different writers. The train set contains 150,000 words, the CER was 4.09% in the last training. However, the average CER for some writers was much higher than for others.

So we decided to do an experiment. We added 10,000 words of new GT for two of the obvious writers (Balthasar and Engelbrecht)and used the Base Model as well as its Training and Validation Set for the new training.

As a result, the new model had an average CER of 3.82% – it had improved. But what is remarkable is that not only the CER of the two writers for which we had added new GT was improved – in both cases up to 1%. Also the reliability of the model applied to the other writers did not suffer, but was reduced as well.

Posted by Dirk Alvermann on

On the Shoulders of Giants: Training with Base Models

Release 1.10.1

If you want to develop generic HTR models, there is no way around working with base models. When training with base models, each training session for a model is based on an existing model, i.e. a base model. This is usually the last HTR model that was trained in the corresponding project.

Base models “remember” what they have already “learned”. Therefore each new training session improves the quality of the model (theoretically). The new model learns from its predecessor and thus becomes better and better. Therefore, training with Base Models is also particularly suitable for large generic models that are continuously developed over a long period of time.

To carry out training with Base Model, you simply select a specific Base Model in the training tool – in addition to the usual settings. Then, from the HTR Model Data tab, insert the Train Set and the Validation Set (called Test Set in earlier Trankribus versions) of the base model, as well as the new Training and Validation Set. Additionally you can add more new Ground Truth and then start the training.

Posted by Dirk Alvermann on

Generic Models and what they do

Release 1.10.1

In a previous post we talked about the differences between special models and generic models. Special models should always be the first choice if your material includes a limited number of writers. If your material is very diverse – for example, if the writer changes frequently in a bundle of handwritings – it makes sense to train a generic model.

The following articles are based on our experiences with the training of a generic model for the Responsa of the Greifswald Law Faculty, in which about 1000 different writer’s hands were trained.

But first: What should a generic HTR model be able to do? The most important point has already been said: It should be able to handle a variety of different writer’s hands. But it should also be able to “read” different fonts (alphabets) and languages and be able to interpret abbreviations. Below are a few typical examples of such challenges from our collection.

Different writer’s hands in one script:

Abbreviations:

Different languages in one script:

Posted by Dirk Alvermann on

Breaking the rules – the problem with concept writings

Release 1.10.1

Concept scripts are were used when a scribe quickly creates created a draft that is later on “written in the clean”. In the case of the Spruchakten, these are mainly the drafts to the judgments that were sent away later. The concept scripts were usually written very quickly and “sloppy”. Often letters are omitted or word endings “swallowed”. Even for humans conceptual writings are not easy to decipher  – for the machine they are a particular challenge.

To train an HTR model for reading concept scripts, you proceed in a similar way to training a model that is to interpret abbreviations. In both cases, the HTR model must be enabled to read something that is not really there – namely missing letters and syllables. To achieve this we must break our first rule: “We transcribe as ground truth only what is really written on paper”. Instead, we have to include all skipped letters and missing word endings etc. in our transcription. Otherwise we will not get a sensible and searchable HTR result in the end.

In our experiments with concept writings we tried at first to train special HTR models for concept scripts. The success was rather small. Finally, we decided to train concept scripts – similar to abbreviations – directly within our generic model. In doing so, we checked again and again whether the “wrong ground truth” that we produce in the process worsened the overall result of our HTR model. Surprisingly, the breaking of the transcription rule had no negative effect on the quality of the model. But this could also happen due to the sheer amount of ground truth used in our case (about 400,000 words).

HTR models are therefore able to distinguish concept writings from fair copies and interpret them accordingly – within certain limits. Below you can see a comparison of the HTR result with the GT for a typical concept script from our material.

Posted by Dirk Alvermann on

“between the lines” – Handling of inserts

At least as often as deletions or blackenings, there are overwritings or inserted text passages written between the lines. It is useful in two aspects to clarify at the beginning of a project how these cases should be handled.

In this simple example you can see how we handle such cases.

Since we include erasures and blackening in both layout and text, it is a logical step to treat overwrites and insertions in the same way. Usually, such passages are already provided with separate baselines during the automatic layout analysis. Every now and then you have to make corrections. In any case, each insertion is treated by us as a separate line and is also taken into account accordingly in the reading order.

Under no circumstances should you transcribe overwrites or insertions above the line instead of deletions. This would falsify the training material, even though the presentation of the text would of course be more pleasing to the human eye.