
Posted by Dirk Alvermann

“between the lines” – Handling of inserts

Overwritten words and text passages inserted between the lines occur at least as often as deletions or blackenings. It is useful in two respects to clarify at the beginning of a project how these cases should be handled.

In this simple example you can see how we handle such cases.

Since we include erasures and blackenings in both layout and text, it is only logical to treat overwrites and insertions in the same way. Usually, such passages are already given separate baselines during the automatic layout analysis; every now and then you have to make corrections. In any case, we treat each insertion as a separate line and take it into account accordingly in the reading order.

Under no circumstances should you transcribe overwrites or insertions above the line in place of the deleted text. This would falsify the training material, even though such a presentation of the text would of course be more pleasing to the human eye.

Posted by Dirk Alvermann

Treatment of erasures and blackenings

HTR models treat erased text like any other. They make no distinction and therefore always deliver a reading result. In fact, they are astonishingly good at this and read useful content even where a transcriber would have given up long ago.

The simple example shows an erased text that is still legible to the human eye and which the HTR model has read almost without error.

Because we had already seen this ability of HTR models in other projects, we decided from the start to transcribe as much of the erased and blackened text as possible in order to exploit the potential of the HTR. The corresponding passages are simply tagged with the text style “strike through” and thus remain recognizable for possible internal search operations.
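
A nice side effect is that the tag survives in the PAGE XML export, so struck passages can also be found outside Transkribus. The following Python sketch pulls them out; it assumes the export stores text styles in the TextLine “custom” attribute in the form textStyle {offset:4; length:7; strikethrough:true;} – verify this against your own export, since the exact syntax may vary between versions:

```python
# Sketch: list struck-through passages from a Transkribus PAGE XML export.
# Assumption (check your own export): styles live in the TextLine "custom"
# attribute, e.g. custom="... textStyle {offset:4; length:7; strikethrough:true;}"
import re
import xml.etree.ElementTree as ET

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}
STYLE = re.compile(r"textStyle\s*\{([^}]*)\}")

def struck_passages(page_xml_path):
    tree = ET.parse(page_xml_path)
    for line in tree.iter("{%s}TextLine" % NS["pc"]):
        unicode_el = line.find(".//pc:Unicode", NS)
        text = unicode_el.text or "" if unicode_el is not None else ""
        for match in STYLE.finditer(line.get("custom", "")):
            pairs = match.group(1).replace(" ", "").rstrip(";").split(";")
            props = dict(pair.split(":", 1) for pair in pairs if pair)
            if props.get("strikethrough") == "true":
                offset, length = int(props["offset"]), int(props["length"])
                yield line.get("id"), text[offset:offset + length]

# "page_0001.xml" is a hypothetical file name from an exported document
for line_id, passage in struck_passages("page_0001.xml"):
    print(line_id, "->", passage)
```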

If you do not want to make this effort, you can intervene in the layout at the passages containing erasures or blackenings and, for example, delete the baselines or shorten them accordingly. The HTR then does not “see” such passages and cannot recognize them. But this is no less laborious than the approach we have chosen.

Under no circumstances should you render erasures with the “popular” placeholder “[…]”. Human transcribers know what the deletion sign means, but the HTR learns something completely wrong here when many such passages end up in the training material.

Posted by Elisabeth Heigl

Test samples – the impartial alternative

If a project concept does not allow the strategic planning and organization of test sets, there is a simple alternative: automatically generated samples. Samples are also a valuable addition to manually created test sets, because here it is the machine that decides which material goes into the test and which does not. Transkribus can select individual lines from a set of pages that you have previously provided, so it is a more or less random selection. This is worthwhile for projects that have a lot of material at their disposal – including ours.

We use samples as a cross-check on our manually created test sets, and because samples correspond to the statistical method of random sampling that the DFG also recommends in its practical guidelines for testing the quality of OCR.

Because HTR – unlike OCR – is line-based, we modify the DFG’s recommendation somewhat. Let me explain with an example: for our model Spruchakten_M-3K, the reading accuracy achieved was to be checked. For this purpose we created a sample exclusively from untrained material covering the whole period for which the model works (1583-1627). We selected every 20th page from the entire data set, obtaining a subset of 600 pages – approximately 3.7% of the 16,500 pages of material available for this period. All this was done on our own network drive. After uploading this subset to Transkribus and processing it with the CITlab Advanced LA (16,747 lines were detected), we let Transkribus create a sample from it. It contains 900 randomly selected lines, about 5% of the subset. This sample was then provided with GT and used as a test set to check the model.
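
If you want to retrace the proportions mentioned above, the arithmetic is quickly checked. A tiny Python sketch with the reported figures (the actual page selection happened on our network drive, the line sampling in Transkribus itself):

```python
# The sampling proportions from the paragraph above, recomputed.
total_pages    = 16_500   # material available for 1583-1627
subset_pages   = 600      # every 20th page, kept as subset
detected_lines = 16_747   # lines found in the subset by CITlab Advanced LA
sample_lines   = 900      # lines Transkribus drew at random for the sample

print(f"subset: {subset_pages / total_pages:.1%} of all pages")     # ~3.6%
print(f"sample: {sample_lines / detected_lines:.1%} of all lines")  # ~5.4%
```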

And this is how it works in practice: call up the “Sample Compare” function in the “Tools” menu. Select your subset in the collection and add it to the sample set by pressing the “add to sample” button. Then specify the number of lines Transkribus should select from the subset. You should select at least as many lines as there are pages in the subset, so that each page contributes one test line.

In our case, we decided on a factor of 1.5 just to be safe, i.e. 900 lines for 600 pages. The program now selects the lines independently and compiles a sample from them, which is saved as a new document. This document does not contain pages, only lines. These must now be transcribed as usual to create GT. Afterwards, any model can be tested on this test set using the Compare function.
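
Stated as code, the rule of thumb from the two paragraphs above looks like this; the factor of 1.5 is our own choice, not a Transkribus default:

```python
# At least one sample line per subset page, padded by a safety factor.
def sample_size(n_pages: int, factor: float = 1.5) -> int:
    return int(n_pages * factor)

print(sample_size(600))  # 900 lines for our 600-page subset
```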

Posted by Dirk Alvermann

Structural tagging – what else you might do with it (Layout and beyond)

In one of our last posts you read how we use structural tagging; how the whole toolbox of structural tagging works in general is explained here. In our project it was mainly used to create an adapted LA model for the mixed layouts. But there is even more potential in it.
Who doesn’t know the problem?

There are several very different hands on one page, and it becomes difficult to get consistently good HTR results. This happens most often when a ‘clean’ handwriting has been annotated in concept handwriting by another writer. Here is an example:

The real reason for the problem is that, until now, HTR could only be executed at page level. This means that one or more pages can be read with either one HTR model or another, but it was not possible to read a single page with two different models, each adapted to one of the hands.

Since version 1.10 it has been possible to apply HTR models at the level of text regions instead of just assigning them to pages. This allows the contents of specific text regions on a page to be read with different HTR models. Structure tagging plays an important role here, for example in the case of text regions whose script style differs from the main text: these are tagged with a specific structure tag, to which a special HTR model is then assigned. Reason enough, therefore, to take a closer look at structure tagging.

Posted by Anna Brandt

P2PaLA – line detection and HTR

Release 1.9.1

As already mentioned in a previous post, we noticed in the course of our project that the CITlab Advanced LA does not identify the layout of our material optimally. This happens not only on the ‘bad’ pages with mixed layouts, but also on simple ones, i.e. pages without marginalia at the edges, large deletions in the text or the like. Here the automatic LA recognizes the TRs correctly, but the baselines are often faulty.

This is not only confusing when the full text is displayed later; an insufficient LA also influences the result of the HTR. No matter how good your HTR model is: if the LA does not offer adequate quality, it is a problem.

The HTR does not read single characters; it works line-based and recognizes patterns. If the line detection has not identified the lines correctly – for instance, when letters or words were missed by the LA – this often produces wrong HTR results. That can have dramatic effects on the accuracy rate of a page or an entire document, as our example shows.
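
To make “accuracy rate” concrete: it is usually measured as the character error rate (CER), i.e. the edit distance between HTR output and ground truth divided by the length of the ground truth. A minimal, Transkribus-independent reference implementation:

```python
# CER = Levenshtein distance / length of the ground truth.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(htr_output: str, ground_truth: str) -> float:
    return levenshtein(htr_output, ground_truth) / len(ground_truth)

# A line dropped by the line detection inflates the CER of the whole page:
print(cer("quod erat demonstrandum", "quod erat demonstrandum"))  # 0.0
print(cer("quod erat", "quod erat demonstrandum"))                # ~0.61
```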


1587, page 41

For this reason we trained a P2PaLA model that also detects BLs, which proved very helpful. It is not possible to calculate statistics such as CERs for these layout models, but visually it seems to work almost error-free on ‘simple’ pages. In addition, post-processing is no longer necessary in many cases.

The training material for such a model is created in a similar way as for models that are to recognize TRs only. The individual baselines do not have to be tagged manually for the structure analysis, even though the model will later assign them to the tagged TRs. With the support of the Transkribus team and training material of 2,500 pages, we were able to train the structure model that we use today instead of the standard LA.

Posted by Anna Brandt

P2PaLA – Postprocessing

Release 1.9.1

Especially at the beginning of the development of a structure model, we noticed that it recognized every irregularity in the layout as a TR. This leads to an excessive – and unnecessary – number of text regions, many of them extremely small.

The more training material you invest, the smaller the problem becomes. In our case these mini TRs disappeared after we had trained our model with about 1,000 pages. Until then, they are annoying, because removing them all by hand is tedious.

To reduce this labour you have two options. Firstly, when starting the P2PaLA you can determine how large the smallest TR is allowed to be. To do this, select the corresponding value (“Min area”) in the “P2PaLA structure analysis tool” before starting the job.

If this option does not bring the expected success, there is the option “remove small textregions”. You will find it in the left toolbar, under the item “other segmentation tools”. In the menu you can set the pages on which the filter should run as well as the size of the TRs to be removed. The size is given as a “threshold percentage of image size”, so the value can be set more finely here than with the option mentioned above. If the images, as with our material, often contain small notes – for example marginalia consisting of only a single word in a TR – the smallest or second smallest possible value should be chosen. We usually use a threshold percentage of 0.005.
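
To picture what such a filter does, here is the principle in a few lines of Python: compute each region’s polygon area (shoelace formula) and flag it if it falls below a fraction of the page area. This only illustrates the logic and is not the Transkribus implementation:

```python
# Flag text regions smaller than a fraction of the page area.
def polygon_area(points):
    """Shoelace formula; points = [(x, y), ...] of the region outline."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0

def too_small(points, img_w, img_h, threshold=0.005):
    return polygon_area(points) < threshold * img_w * img_h

# A 60x60 px mini TR on a 2000x3000 px scan falls below the 0.005 threshold:
mini_tr = [(100, 100), (160, 100), (160, 160), (100, 160)]
print(too_small(mini_tr, 2000, 3000))  # True: 3600 < 30000
```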

Even with a good structure model, it may still happen that individual TRs have to be merged, split or removed manually, but to a much lesser extent than the standard LA would require.

Tips & Tools
Important: If you want to be sure that you don’t remove too many TRs, you can start with a “dry run”, which only lists the number of potentially removable TRs. As soon as you uncheck the box, the affected TRs are deleted immediately.

Posted by Anna Brandt

P2PaLA – Training for Textregions

Release 1.9.1

Elsewhere in this blog you can find information and tips on structure tagging. This kind of tagging can be useful for many things – the following is about its use for an improved layout analysis, because structure tagging is an important part of training P2PaLA models.

With our mixed layouts, the standard LA was bound to fail, and the material was far too extensive to create the layout manually. So we decided to try P2PaLA. For this we created training material from pages that were particularly “difficult” but at the same time “typical” – pages that contained, in addition to the actual main text, comments, additions and the like.


coll: UAG Strukturtagging, doc. UAG 1618-1, image 12

For the training material, only correctly drawn and tagged text regions matter; no additional line detection or HTR is required. However, these do no harm either, so you can include pages that have already been completely edited in the training. But if you take new pages on which only the TRs have to be drawn and tagged, you will be faster: you can then prepare eighty to one hundred pages for training in an hour.

While we had tagged seven different structure types in our first model, we later reduced the number to five. In our experience, too fine a differentiation of the structure types has a rather negative effect on the training.

Of course, the success of the training also depends on the amount of training material you invest. In our experience (and with our material), you can make a good start with 200 pages; with 600 pages you get a model you can already work with; from 2,000 pages on it is very reliable.

Tips & Tools
When you create the material for structure training, it is initially hard to accept that this is not about content. That means: no matter what the content is, the TR in the middle is always the “paragraph” – even if it contains only a short note and the concept next to it is much longer and more important. Only in this way can the necessary patterns really be recognized during training.

Posted by Dirk Alvermann

P2PaLA vs. standard LA

Release 1.9.1

In the previous post we described how the standard LA in Transkribus does not always provide good results when document layouts are very complicated. But for a perfect HTR result you need a perfect LA.

Especially in the documents of the 16th and early 17th centuries, the CITlab Advanced LA could not convince us. It was clear to us from the beginning that the standard LA would not identify the more complex layouts (text regions) in a differentiated way. However, it was the line detection that ultimately failed to meet our demands in these documents.

An example of how (in the worst case) the line detection of the standard LA worked on our material can be seen here:


1587, page 41

This may be an isolated case. However, if you process large quantities of documents in Transkribus, such cases occur more frequently. In order to assess the problem properly, we therefore recorded representative error statistics for two bundles of our material. It turned out that the standard LA produced an average of 12 line-detection errors per page (see graph below, 1598). This of course has undesirable effects on the HTR result, which we will describe in more detail in the next post.
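
The bookkeeping behind such a statistic is nothing more than counting and averaging. A sketch with invented placeholder counts (not our actual data):

```python
# Hand-counted line-detection errors (missed, split or truncated lines)
# per page of one bundle -- placeholder values for illustration only.
errors_per_page = [15, 9, 12, 14, 8, 13, 11, 14]

mean_errors = sum(errors_per_page) / len(errors_per_page)
print(f"average line-detection errors per page: {mean_errors:.1f}")  # 12.0
```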

Posted by Dirk Alvermann

P2PaLA or structure training

Release 1.9.1

Page-to-page layout analysis (P2PaLA) is a form of layout analysis for which individual models can be trained – similar to the HTR. You can train structure models to recognize either text regions only or text regions with baselines. They therefore fulfill the same functions as the standard layout analysis (CITlab Advanced). P2PaLA is particularly suitable if a document has many pages with mixed layouts, where the standard layout analysis usually recognizes just one TR – which can lead to problems with the reading order of the text.

With the help of a structure training, the layout analysis can learn where and how many TRs it should recognize.

The CITlab Advanced LA often had problems identifying the text regions of our material in a differentiated way. That is why we experimented with P2PaLA early on in our project. First we tried structure models that exclusively set text regions (main text, marginal notes, footnotes etc.). Within the TRs generated this way, we worked with the usual line detection. However, the results were not always satisfactory.

The BLs were often too short (at the beginning or end of a line) or torn apart many times – even on pages with simple layouts. We therefore trained another model, based on our already working P2PaLA model, that also recognizes the BLs. Our newest model recognizes all ‘simple’ pages almost without errors. For pages with very complex layouts, the results still have to be corrected, but with much less effort than before.

Posted by Anna Brandt

Structural Tagging

Exactly how structural tagging is done is explained in this wiki. In contrast to “textual” tagging, you can tag all layout structures, for example text regions, baselines or tables. In our case, only the text regions are tagged, because we use structural tagging to train a P2PaLA model.

When you create your training material and decide where to position the specific structural elements, you should stick to your choices. For example: for us, a “paragraph” is always the TR at the top in the middle, the core so to speak; “marginalia” are all the notes on the left side of the image, separate from the “paragraph”. In this way you can divide the images into ‘types’, i.e. groups of images in which all TRs with the same tags always lie in a certain coordinate area of the page.
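
To illustrate what such a positional convention amounts to, here is a hypothetical rule expressed in Python. The cut-off values are invented for illustration; the real assignment is learned by the P2PaLA model from consistently tagged training pages, not hard-coded:

```python
# Classify a TR by where its centre lies on the page (illustrative only).
def classify_region(cx, cy, img_w, img_h):
    x, y = cx / img_w, cy / img_h   # normalized centre coordinates
    if x < 0.25:
        return "marginalia"         # notes on the left edge of the image
    if y > 0.85:
        return "footnote"           # bottom of the page
    return "paragraph"              # the core TR, top and middle

print(classify_region(300, 1500, 2000, 3000))   # marginalia (x = 0.15)
print(classify_region(1000, 1500, 2000, 3000))  # paragraph
```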

Tips & Tools
There are three ways to set the corresponding tag. First, by right-clicking on the marked area and then assigning a tag via “assign structure type”. Second, via the “Structural” section of the “Metadata” tab, where the existing structure types are displayed. There you can also define shortcuts for tags that you use a lot – the third way: click on the button “Customize” and enter a number from one to nine in the “Shortcut” column. The shortcut is then displayed in the tab; it is always Ctrl+Alt+Number.