Posted by Dirk Alvermann on

First volumes with decisions of the Wismar High Court online

Last week we were able to put online the first volumes containing the opinions of the assessors of the High Royal Tribunal at Wismar – the final court of appeal in the German territories of the Swedish Crown. "Assessors" is what the judges at the Tribunal were called. Since the Great Northern War there had been a panel of only four judges instead of eight. The Deputy President assigned them the cases in which they were to form a legal opinion. As at the Reichskammergericht in Wetzlar, a referee and a co-referee were appointed for each case; they formulated their opinions in writing and discussed them with their colleagues. If the votes of the two judges agreed, the consensus of the remaining colleagues was only formally requested in the court session. In addition, all relations had to be checked and confirmed by the Deputy President. If the case was more complicated, all assessors expressed their opinion on the verdict. These reasons for the verdicts are recorded in the collection of so-called "Relationes".

These relations are a first-class source for legal history, since they first recount the course of the conflict in a narrative and then propose a judgment. Here we can trace both the legal reasoning in the justifications and the everyday life of the people in the narratives.

The text recognition was realized with an HTR model that was trained on the manuscripts of nine different judges of the Royal Tribunal. The training set consisted of 600,000 words. Accordingly, the accuracy of the handwritten text recognition is good – in this case about 99%.

The results can be seen here. How to navigate in our documents and how the full text search works is explained here.

Who were the judges?

In the second half of the 18th century a new generation of judges took office. At the end of the 1750s and the beginning of the 1760s, justice at the Tribunal was administered by:

Hermann Heinrich von Engelbrecht (1709-1760), Assessor since 1745, Deputy President since 1750
Bogislaw Friedrich Liebeherr (1695-1761), Assessor since 1736
Anton Christoph Gröning (1695-1773), Assessor since 1749
Christoph Erhard von Corswanten (about 1708-1777), Assessor since 1751, Deputy President since 1761
Carl Hinrich Möller (1709-1759), Assessor since 1751
Joachim Friedrich Stemwede (about 1720-1787), Assessor since 1760
Johann Franz von Boltenstern (1700-1763), Assessor since 1762
Johann Gustav Friedrich von Engelbrechten (1733-1806), Assessor between 1762 and 1775
Augustin von Balthasar (1701-1786), Assessor since 1763, Deputy President since 1778

Posted by Dirk Alvermann on

Transkribus in Chicago

Transkribus will be presented at this year's meeting of the Social Science History Association (SSHA) in Chicago. Günter Mühlberger will present not only the potential of Transkribus but also first results and experiences, drawn from the processing of the cadastral protocols of the Tiroler Landesarchiv and from our digitization project. He will pay special attention to the training of HTR models and the chances of keyword spotting. The lecture, entitled 'Handwritten Text Recognition and Keyword Spotting as Research Tools for Social Science and History', will take place on 21.11. at 11:00 am in Session 31 (Emerging Methods: Computation/Spatial Econometrics).

Posted by Dirk Alvermann on

How to create test sets and why they are important, #2

Release 1.7.1

What is the best procedure for creating test sets?
In the end, everyone will find their own way. In our project, the pages for the test sets are already selected while the GT is created. They receive a special edit status (Final) and are later collected in separate documents. This ensures that they will not accidentally end up in the training material. Whenever new GT is created for future training, the material for the test set is extended at the same time, so both sets grow in proportion.

For the systematic training we create several Documents, which we call "test sets" and which each relate to a single Spruchakte (one year). For example, we create a "test set 1594" for the Document of the Spruchakte 1594. There we place representatively selected images, which should reflect the variety of hands as exactly as possible. In the "mother document" we mark the pages selected for the test set as "Final" to make sure that they will not be edited there in the future. We have not created a separate test set for every single record or year, but have proceeded in five-year steps.

Since a model is often trained over many rounds, this procedure also has the advantage that the test set always remains representative. The CERs of the different versions of a model can therefore always be compared and observed during development, because the test is always executed on the same (or extended) set. This makes it easier to evaluate the progress of a model and to adjust the training strategy accordingly.

Transkribus also independently stores the test set used for each training session in the collection concerned, so you can always fall back on it.
It is also possible to select a test set just before the training by simply assigning individual pages of the documents from the training material to the test set. This may be a quick and pragmatic solution in individual cases, but it is not suitable for the planned development of powerful models.

Posted by Dirk Alvermann on

How to create test sets and why they are important, #1

Release 1.7.1

If we want to know how much a model has learned in training, we have to test it. We do this with precisely defined test sets. Test sets – like the training set – contain exclusively Ground Truth. However, we make sure that this GT has never been used to train the model, so the model does not "know" this material. This is the most important characteristic of test sets. A text page that has already been used as training material will always be read better by the model than one it is not yet "familiar" with. This can easily be demonstrated experimentally. So if you want valid statements about CER and WER, you need "uncorrupted" test sets.

It is also important that a test set is representative. As long as you train an HTR model for a single writer or an individual handwriting, it’s not difficult – after all, it’s always the same hand. As soon as there are several writers involved, you have to make sure that all the individual handwritings used in the training material are also included in the test set. The more different handwritings are trained in a model, the larger the test sets will be.
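Outside Transkribus, the same principle – every hand in the training material must also appear in the test set – is what a stratified split does. The following is a minimal sketch; the page dictionaries and the "writer" field are purely illustrative, not a Transkribus data structure:

```python
import random
from collections import defaultdict

def stratified_test_split(pages, test_share=0.1, seed=42):
    """Pick a test set in which every writer is represented
    in roughly the same proportion as in the training material."""
    by_writer = defaultdict(list)
    for page in pages:
        by_writer[page["writer"]].append(page)

    rng = random.Random(seed)
    train, test = [], []
    for writer_pages in by_writer.values():
        rng.shuffle(writer_pages)
        # at least one page per writer goes into the test set
        n_test = max(1, round(len(writer_pages) * test_share))
        test.extend(writer_pages[:n_test])
        train.extend(writer_pages[n_test:])
    return train, test
```

The `max(1, …)` guard is the crucial detail: even a writer with only a handful of pages still contributes at least one page to the test set, which keeps the set representative as more hands are added.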

The size of the test set is another factor that influences representativity. As a rule, a test set should contain 5-10% of the training material. However, this rule of thumb should always be adapted to the specific requirements of the material and the training objectives.

To illustrate this with two examples: our model for the Spruchakten from 1580 to 1627 was trained with a training set of almost 200,000 words; the test set contains 44,000 words. This is of course a very high proportion of about 20%, owing to the fact that material from about 300 different hands was trained into this model, all of which must also be represented in the test set. In our model for the judges' opinions of the Wismar Tribunal, there are about 46,000 words in the training set, while the test set contains only 2,500 words, i.e. a share of about 5%. However, here we are dealing with only five different hands, so this material is sufficient for a representative test set.

Posted by Dirk Alvermann on

Word Error Rate & Character Error Rate – How to evaluate a model

Release 1.7.1

The Word Error Rate (WER) and Character Error Rate (CER) indicate the amount of text in a manuscript that the applied HTR model did not read correctly. A CER of 10% means that every tenth character (and these are not only letters, but also punctuation marks, spaces, etc.) was not correctly identified. The accuracy rate would therefore be 90%. A good HTR model should recognize 95% of a manuscript correctly, i.e. have a CER of no more than 5%. This is roughly the value that is achieved today with "dirty" OCR for Fraktur typefaces. Incidentally, an accuracy rate of 95% also corresponds to the expectations formulated in the DFG's Practical Guidelines on Digitisation.

Even with a good CER, the word error rate can be high. The WER shows how accurately the words in the text were reproduced. As a rule, the WER is three to four times higher than the CER and roughly proportional to it. The WER is not particularly meaningful for the quality of the model because, unlike characters, words vary in length and do not allow a clear comparison (a word already counts as incorrectly recognized if just one letter in it is wrong). That is why the WER is rarely used to characterize the quality of a model.

The WER, however, points to an important aspect. If I perform text recognition with the aim of later running a full-text search over my document, the WER tells me the success rate I can expect in that search, because the search looks for words or parts of words. So no matter how good my CER is: with a WER of 10%, potentially every tenth search term cannot be found.
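Both rates are standard edit-distance measures: the minimum number of insertions, deletions and substitutions needed to turn the recognized text into the Ground Truth, divided by the length of the Ground Truth – counted in characters for the CER, in words for the WER. A minimal sketch (not the internal implementation of the Transkribus Compare tool):

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character Error Rate against the Ground Truth reference."""
    return levenshtein(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    """Word Error Rate against the Ground Truth reference."""
    ref, hyp = reference.split(), hypothesis.split()
    return levenshtein(ref, hyp) / len(ref)
```

The example `wer("the cat sat", "the hat sat")` also shows why the WER runs ahead of the CER: a single wrong character (CER about 9% here) already spoils a whole word (WER about 33%).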

Tips & Tools
The easiest way to display the CER and WER is to use the Compare function under Tools. Here you can compare one or more pages of a Ground Truth version with an HTR text to estimate the quality of the model.

Posted by Dirk Alvermann on

Ground Truth is the Alpha and Omega

Release 1.7.1

Ground Truth (GT) is the basis for the creation of HTR models. It is simply a typewritten copy of the historical manuscript, a classic literal or diplomatic transcription that is 100% correct – in short, "Ground Truth".

Any mistake in this training material will cause “the machine” to learn – among many correct things – something wrong. That’s why quality management is so important when creating GT. But don’t panic, not every mistake in the GT has devastating consequences. It simply must not be repeated too often; otherwise it becomes “chronic” for the model.

In order to ensure the quality of the GT within our project, we have set up a few fixed transcription guidelines, as you know them from edition projects. It is worth striving for a literal, character-accurate transcription. Interventions of any kind must be avoided – e.g. normalizations such as the vocalic or consonantal use of "u" and "v", or the resolution of complex abbreviations.

If the material contains only one or two different handwritings, about 100 pages of transcribed text are sufficient for a first training session. This creates a basic model that can be used for further work. In our experience, the number of languages used in the text is irrelevant, since the HTR models usually work without dictionaries.

In addition to conventional transcription, Ground Truth can also be created semi-automatically. Transkribus offers a special tool – Text2Image – which is presented in another post.

Posted by Dirk Alvermann on

WebUI & Expert Client

As we said before, this blog is almost exclusively about the Expert Client of Transkribus. It offers a variety of possibilities, but handling them requires a certain level of knowledge.

The tools of the WebUI are much more limited, but also easier to work with. In the WebUI it is not possible to perform an automatic layout analysis or to start an HTR, let alone to train a model or to interfere in the user management. But that’s not what it’s meant for.

The WebUI is the ideal interface for crowd projects with a lot of volunteers who mainly transcribe or comment and tag content. And this is exactly what it is used for most of the time. The coordination of such a crowd project is done via the Expert Client.

The WebUI's advantage is that it can be used without any prerequisites. It is a web application called from the browser; no installation, no updates, etc. Moreover, it is almost intuitive and can be used by anyone without prior knowledge.


Tips & Tools
The WebUI also has a version management – somewhat adapted for crowd projects. When transcribers are done with the page they are editing, they set the edit status to "ready for review", so that the supervisor knows it is time to review.


Posted by Dirk Alvermann on

Knowing what you want

A digitization project with Handwritten Text Recognition can have very different goals. They can range from a critical digital edition, to the provision of manuscripts as full texts, to the indexing of large text corpora via keyword spotting. All three objectives allow different approaches, which have a great influence on the technical and personnel effort.

In this project, only the last two objectives are of interest. A critical edition is not intended, even if the full texts generated in this project could serve as the basis for one.

We aim at a complete indexing of the manuscripts by automatic text recognition. The results will then be made publicly available online in the Digital Library Mecklenburg-Vorpommern. A search is available there which shows the hits in the image itself. Users with sufficient palaeographic knowledge can explore the context of a hit in the image themselves, switch to a modern full-text view, or use only the latter.

Posted by Dirk Alvermann on

Why HTR will change it all

For some years now, archives and libraries have been dedicating more and more of their time to the digitisation of historical manuscripts. The strategies are quite different. Some would like to present their “treasures” in a contemporary manner, others would like to make more extensive collections available for use in an appropriate digital form. The advantages of digitisation are obvious. The original sources are preserved and the interested researchers and non-experts can access the material independently of place and time without having to spend days or weeks in reading rooms. Considering the practice of the 20th century, this is an enormous step forward.

Initially, such digital services provide no more than a digital image of the original historical source. They are developed and maintained at great expense, both financially and in terms of staff. If you look at the target groups of these services, you can see that they are mainly aimed at the very same people who also visit archives and libraries – addressees who usually already have the ability to decipher such historical manuscripts. Optimistically speaking, we are talking about one or two percent of the population. For everyone else, these digital copies are just beautiful to look at.

Keep this picture in mind if you want to understand why Handwritten Text Recognition (HTR) is opening a whole new chapter in the history of the digital indexing and use of historical documents. In a nutshell: HTR allows us to move from simple digitization to the digital transformation of historical sources. Thanks to HTR, not only the digital image of a manuscript but also its content is made available in a form that can be read by everyone and searched by machines – across hundreds of thousands of pages.

Thus the contents of historical manuscripts can be opened up to a public to whom they have so far remained closed, or at least not easily accessible. This does not only address non-professional researchers. Access to the contents of the sources will also become much easier for academic experts from disciplines that do not have historical auxiliary sciences such as palaeography in their classical educational canon. This makes new constellations of interdisciplinary research possible. Ultimately, since the contents of the manuscripts can now be evaluated by machine, questions and methods of the Digital Humanities can be applied to the material more easily than before.

Tips & Tools
Recommendation for further reading: Günter Mühlberger, Archiv 4.0 oder warum die automatisierte Texterkennung alles verändern wird, in: Massenakten – Massendaten. Rationalisierung und Automatisierung im Archiv (Tagungsdokumentationen zum Deutschen Archivtag, Band 22), hg. v. VdA, Fulda 2018, S. 145-156.