Posted by Dirk Alvermann on

How to train PyLaia models

Release 1.12.0

Since version 1.12.0 it has been possible to train PyLaia models in Transkribus alongside the proven HTR+ models. We have gained some experience with them over the last few months and are quite impressed by their performance.

PyLaia models can be trained like HTR or HTR+ models using the usual training tool, but there are some differences.

As with a normal HTR+ model, you have to enter the model's name, a description and the languages it can be used for. Unlike HTR+ training, the number of iterations (epochs) is limited to 250, which in our experience is sufficient. You can also train PyLaia models with base models, i.e. you can design longer training series that build on each other. In contrast to the usual training, PyLaia has an "Early Stopping" setting, which determines when the training may be stopped once a good result is achieved. At the beginning of your training attempts you should always set this value to the same number of epochs as you have chosen for the whole training. So if you train with 250 epochs, the "Early Stopping" setting should also be 250. Otherwise you risk the training being stopped too early.

The most important difference is that in PyLaia training you can choose whether to train with the original images or with compressed images. Our recommendation: train with compressed images. PyLaia training with original images can, in the worst case (with a correspondingly large amount of GT), take weeks. With compressed images, a PyLaia training is finished within hours or days (if you train with about 500,000 words).
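The interplay of epochs and "Early Stopping" described above can be sketched in a few lines of Python. This is a simplified illustration of the general patience mechanism, not Transkribus internals; the function and variable names are our own:

```python
def train(epochs: int, early_stopping: int, val_cer_per_epoch) -> int:
    """Return the epoch at which training stops.

    Training halts when the validation CER has not improved
    for `early_stopping` consecutive epochs.
    """
    best = float("inf")
    since_improvement = 0
    for epoch, cer in enumerate(val_cer_per_epoch[:epochs], start=1):
        if cer < best:
            best, since_improvement = cer, 0
        else:
            since_improvement += 1
        if since_improvement >= early_stopping:
            return epoch  # patience exhausted: stop early
    return epochs

# A CER curve that only improves again after a long plateau:
curve = [30, 20, 15, 15, 15, 15, 14, 12, 10, 9]

print(train(10, early_stopping=3, val_cer_per_epoch=curve))   # stops at epoch 6
print(train(10, early_stopping=10, val_cer_per_epoch=curve))  # runs all 10 epochs
```

With a small patience the run above stops during the plateau and misses the later improvement; setting the patience equal to the total number of epochs, as recommended, lets the training run through.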

Tips & Tools
For more detailed information, especially for setting specific training parameters, we recommend the tutorial by Annemieke Romein and the READ Coop guidelines.

Posted by Anna Brandt on

Searching and editing tags

Release 1.11.0

If you tag large amounts of historical text, as we have tried to do with place and person names, you will sooner or later run into a problem: the spelling varies a lot – or in other words, the tag values are not identical.
Let's take the places and a simple example. "Rosdogk", "Rosstok" and "Rosdock" all refer to the same place: the city of Rostock. To make this recognizable, you use the properties. But if you do this over more than ten thousand pages with hundreds or thousands of places (we set about 15,000 place tags in our attempt), you easily lose track. Besides, tagging takes much longer if you also assign properties.

Fortunately, there is an alternative. You can search in the tags, not only in the document you are working on but in the whole collection. To do this, you just have to select the "binoculars" icon in the menu, similar to starting a full-text search or KWS, only that you now select the submenu "Tags".

Here you can select the search area (collection, document, page) and the level on which you want to search (line or word). Then select the corresponding tag and, if you want to narrow the search, enter the tagged word. The search results can also be sorted. This way we can quickly find all "Rostocks" in our collection and enter the desired additional information in the properties, such as the current name, geodata and the like. These properties can then be assigned to all selected tagged words. In this way, tagging and the enrichment of data can be separated from each other and carried out efficiently.

The same is possible with tags like “Person” or “Abbrev” (where you would put the resolution/expansion in the properties).

Posted by Dirk Alvermann on

Textual tagging

Like everything else, tagging can be integrated into your work on very different levels and with different requirements. In Transkribus, a large number of tags are available for a wide range of applications, some of which are described here.

We decided to try it with only two tags, namely “person” and “place”. These tags will later allow systematic access to the corresponding text passages.

When tagging, Transkribus automatically adopts the term under the cursor as the "value" or "label" for the specific case. So if I mark "Wolgast" as in the example below and tag it as "place", two important pieces of information are already recorded. The same is true for the person's name further down.

Transkribus offers the possibility to assign properties to each tagged element, e.g. to display a historical place name in its modern spelling or to assign a GND number to a person's name. You can also create additional properties, such as geodata for places.
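Under the hood, Transkribus records such tags and their properties in the `custom` attribute of a text line in the exported PAGE XML. A minimal sketch of reading them back out – the attribute layout is abridged and the property names are our own example, not the full specification:

```python
import re

# Simplified example of a PAGE XML "custom" attribute as written by
# Transkribus (abridged; offsets/properties here are illustrative):
custom = "readingOrder {index:0;} place {offset:3; length:7; placeName:Wolgast;}"

def parse_custom(value):
    """Split a custom attribute into (tag, {property: value}) pairs."""
    tags = []
    for name, body in re.findall(r"(\w+)\s*\{([^}]*)\}", value):
        pairs = (p.strip() for p in body.split(";") if p.strip())
        props = dict(p.split(":", 1) for p in pairs)
        tags.append((name, props))
    return tags

for tag, props in parse_custom(custom):
    print(tag, props)
```

The `offset`/`length` pair anchors the tag to a span of the line's text, which is why the tagged word itself becomes the tag's value without any extra typing.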

Given the amount of text we process, we have decided not to assign properties to our tags. Only the place names are identified as best as possible. The aim is to be able to display the tag-values separately for people and places next to the respective document when presenting them in the viewer of the Digital Library M-V, thus enabling the user to navigate systematically through the document.

Posted by Anna Brandt on

Tagging in WebUI

For tasks like tagging already-transcribed documents, the WebUI, which is designed especially for crowdsourcing projects, is very well suited.

Tagging in the WebUI works slightly differently than in the Expert Client; there are different tools and settings.

If you have selected your collection and document in the WebUI and want to tag something, you have to select "Annotation", not "plain text", for the page you want to edit. Both modes are similar, except that in Annotation you can additionally tag. To do this, select the words and right-click on them to pick the appropriate tag. Always save when you leave the page, even if you only switch to layout mode. The program doesn't prompt you to save tags as the Expert Client does, and without saving your tags will be lost.

All tags appear to the left of the text field when you click on the word. Tags set in the Expert Client are also displayed there. The whole annotation mode is still in beta at the moment.

Posted by Anna Brandt on

Tagging Tools

Release 1.11.0

In a previous post we already wrote about our experiences with structure tagging and described the tools that go with it. But for most users (e.g. in edition projects), enriching texts with additional content information is even more important. To add tags to a transcription you can use the tagging tools in the "Metadata" / "Textual" tab in Transkribus.

Here you can see the available tags as well as those that have already been applied to the text of the page. With the Customize button you can create your own tags or add shortcuts to existing tags, just as with structure tagging. The shortcuts allow for easier and faster tagging in the transcript. If you want to do without shortcuts, you have to mark the respective words in the text (not in the image) and select the desired tag with a right click. Of course, a word can be tagged several times.

These tags should not be confused with the so-called TextStyles (for example, crossed-out or superscript words). Those are accessible not among the tags but via the toolbar at the bottom of the text window.

Posted by Dirk Alvermann on

Tagging: what for? – when and why tagging makes sense

Tagging allows – in addition to content indexing by HTR – systematic indexing of the text by the later user. In contrast to an HTR model that does its work independently, tagging has to be done mostly by hand, which means that it requires a lot of effort. Therefore, a realistic effort analysis should be carried out before developing far-reaching plans regarding tagging.

Due to the amount of material processed in our project, we primarily use tagging where it helps us in the practical work on the text. This is the case with structure tagging, where the layout analysis is improved with the help of the tagging and the P2PaLA developed from it, and then of course also with the tagging of textstyles in case of deletions and blackening. This is where tagging is basically used “area-wide” by us. A fixed component of our transcription rules is also the use of the “unclear” tag for passages that cannot be read correctly by the transcriber. In this case, the tag is used more for internal team communication.

For the systematic preparation of texts for which an HTR has already been performed, we are experimenting with the “person” and “place” tags in order to offer systematic indexing, at least in this limited form.

Posted by Elisabeth Heigl on

Compare Samples

Release 1.10.1

As the name suggests, the Compare Samples tool tests the capabilities of an HTR model on a sample rather than on a manually selected test set. We explained in an earlier post how to create such samples, why they represent an objective alternative to conventional test sets, and why they can be created with much less effort.

"Compare Samples" may look like a validation tool, but strictly speaking it isn't one. You can use it to validate an HTR model, but Advanced Compare is better suited for that. The real function of "Compare Samples" is to predict how well an HTR model will perform on a given material.

You may remember the Model Booster. For that, you need a suitable HTR model to serve as a base model for a planned HTR training. With the numerous public models available, it is a good idea to first check with "Compare Samples" which model fits your project.

To create such a prediction for a sample, you first run the selected HTR models over the entire sample (having, of course, already created the GT for the sample). Then open the Samples tab of the "Compare Samples" tool, which lists all samples of your active collection. Select the sample that will serve as the basis for the prediction. In the middle, select the model whose text version should be compared against the GT. Start "Compute" and you're done.

The tool now calculates, over all lines of the sample, an average Character Error Rate together with an upper and a lower bound. For 95% of your material, the CER at which the selected HTR model works should lie between these bounds. In our example below, between 2.9% and 4.7%.
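How such bounds come about is easy to sketch: if the CER is computed per line, a mean and an interval follow from standard statistics. The normal-approximation formula below is our assumption about the method, not a documented Transkribus internal:

```python
def sample_bounds(line_cers, z=1.96):
    """Mean CER over sample lines with an approximate 95% interval.

    Uses the sample standard deviation and the normal approximation
    (z = 1.96 for 95%); an illustration, not the tool's actual code.
    """
    n = len(line_cers)
    mean = sum(line_cers) / n
    var = sum((c - mean) ** 2 for c in line_cers) / (n - 1)
    half = z * (var / n) ** 0.5
    return mean - half, mean, mean + half

# Hypothetical per-line CER values (in %) from a small sample:
lower, mean, upper = sample_bounds([2.0, 5.5, 3.0, 4.5, 1.5, 6.0, 3.5, 4.0])
print(f"expected CER between {lower:.1f}% and {upper:.1f}% (mean {mean:.1f}%)")
```

The more lines the sample contains, the narrower the interval becomes, which is why even a modest random sample yields a usable prediction.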

This way you can compare as many models for your material as you like. The tool also allows a few other things. For example, you can easily check how an HTR model works on your material with or without a language model or dictionary, and whether it is worth using one or the other. This is, of course, especially useful for checking your own models.


Tips & Tools
Create several smaller samples rather than one giant sample for all your material. You can separate them chronologically or by writer’s hands, for example. This will allow you to make a differentiated prediction for the use of HTR models on all your material or parts of it.

Posted by Elisabeth Heigl on

CER? Don’t Worry!

Release 1.10.1

The Character Error Rate (CER) compares, for a given page, the total number of characters (n), including spaces, to the minimum number of insertions (i), substitutions (s) and deletions (d) of characters that are required to obtain the GT result. If that was not mathematical enough for you:

CER = [ (i + s + d) / n ]*100

This means that even all the little mistakes count statistically as full errors. Every missing comma, a "u" instead of a "v", an extra space or an uppercase letter instead of a lowercase one enters the CER as a "whole error". Yet such small details neither disturb the reading and understanding of the text nor prevent a search engine from finding a term.
So don't look only at the numbers but also at the text comparison. Your model is usually better than the CER (and especially the WER) suggests.
To illustrate this, we have calculated an example:
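The formula above can be implemented directly. A minimal sketch, using a standard dynamic-programming edit distance; the function name and the example strings are our own illustration:

```python
def cer(prediction: str, ground_truth: str) -> float:
    """Character Error Rate: minimum edits / length of GT * 100."""
    n, m = len(ground_truth), len(prediction)
    # prev[j] = edit distance between ground_truth[:i-1] and prediction[:j]
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if ground_truth[i - 1] == prediction[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[m] / n * 100

# A "v" instead of a "u" and a missing comma are two full errors,
# even though the line remains perfectly readable and searchable:
print(round(cer("Hans Mveller von Wolgast", "Hans Mueller, von Wolgast"), 1))
```

Two trivial slips in a 25-character line already produce a CER of 8% – which is exactly why the numbers alone understate the usability of a transcription.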

Posted by Dirk Alvermann on

Use Case: “Model Booster”

Release 1.10.1

In our example we want to improve our HTR model for the Responsa, an HTR model that can read 17th-century Kurrent documents. In the search for a possible base model, you can find two candidates among the "public models" of Transkribus: "German Kurrent M1+" from the Transkribus team and "German_Kurrent_XVI-XVIII_M1" from Tobias Hodel. Both could fit. But a test with Sample Compare shows that "German_Kurrent_XVI-XVIII_M1" performed better, with a predicted average CER of 9.3% on our sample set.

Therefore "German_Kurrent_XVI-XVIII_M1" was chosen as the base model for the training. The Ground Truth of the Responsa (108,000 words) and the validation set of our old model were then added. The average CER of our HTR model improved considerably after the base model training, from 7.3% to 6.6%. As you can see in the graph, the base model reads much worse on the test set than the original model, but the hybrid of the two is better than either one. The improvement can be seen in each of the years tested and amounts to up to 1%.

Posted by Dirk Alvermann on

Combining Models

Release 1.10.1

The longer you train HTR models yourself, the more you will be interested in the possibility of combining models. For example, you may want to combine several special models for individual writers or models that are specialized in particular fonts or languages.

There are different ways to combine models. Here I would like to introduce a technique that in my experience works especially well with very large generic models – the "Model Booster".

You start a base model training, using a powerful foreign HTR model as the base model and your own Ground Truth as the training set. But before you start, two recommendations:

a) take a close look at the characteristics of the base model you are using (how long was it trained, on which font style and which language?) – they should match those of your own material as closely as possible.

b) if possible, try to predict the performance of the base model on your own material and then choose the base model with the best performance. Such a prediction can be made quite easily using the Sample Compare function. Another possibility is to test the base model with Advanced Compare on your own test set.