
Structural tagging – what else you might do with it (Layout and beyond)

In one of our last posts you read how we use structure tagging. Here you can find out how the whole toolbox of structure tagging works in general. In our project it was mainly used to create a layout analysis (LA) model adapted to our mixed layouts. But there is even more potential in it.
Who doesn’t know the problem?

There are several very different handwritings on one page, and it becomes difficult to get consistently good HTR results. This happens most often when a ‘clean’ handwriting has been annotated by another writer in concept handwriting. Here is an example:

The real reason for the problem is that, until now, HTR could only be executed at page level. This means that one page or several pages can be read either with one HTR model or with another – but not with two different models, each adapted to one of the handwritings.

Since version 1.10 it has been possible to apply HTR models at the level of text regions instead of just assigning them to pages. This allows the contents of specific text regions on a page to be read with different HTR models. Structure tagging plays an important role here, for example for text regions whose script style differs from the main text: these are tagged with a specific structure tag, to which a special HTR model is then assigned. Reason enough, therefore, to take a closer look at structure tagging.
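To give an idea of what such a routing could look like outside the Transkribus interface, here is a minimal Python sketch: it reads a PAGE XML export, groups the text regions by their structure tag and prints which HTR model each group would be sent to. Only the PAGE namespace and the 'structure {type:…;}' notation come from the actual export format; the file name, the model mapping and the fallback to “paragraph” are assumptions for illustration.

```python
# Minimal sketch (not the Transkribus API): group the text regions of a PAGE XML
# export by their structure tag, so that each group could be read with a
# different HTR model. File name and model mapping are made up.
import re
import xml.etree.ElementTree as ET

NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"

def regions_by_structure_type(page_xml_path):
    """Return {structure type: [region ids]} for one PAGE XML file."""
    tree = ET.parse(page_xml_path)
    groups = {}
    for region in tree.getroot().iter("{%s}TextRegion" % NS):
        custom = region.get("custom", "")
        # Transkribus stores structure tags as e.g. 'structure {type:marginalia;}'
        match = re.search(r"structure\s*\{[^}]*type:([^;}]+)", custom)
        structure_type = match.group(1).strip() if match else "paragraph"
        groups.setdefault(structure_type, []).append(region.get("id"))
    return groups

# Hypothetical routing: which HTR model should read which kind of region.
MODEL_FOR_TYPE = {"paragraph": "Spruchakten_M_4-1", "marginalia": "Concept_writings_model"}

for structure_type, ids in regions_by_structure_type("page_0041.xml").items():
    model = MODEL_FOR_TYPE.get(structure_type, "Spruchakten_M_4-1")
    print(f"{structure_type}: {len(ids)} region(s) -> {model}")
```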


P2PaLA vs. standard LA

Release 1.9.1

In the previous post we described that if the document layouts are very complicated, the standard LA in Transkribus does not always provide good results. But for a perfect HTR result you need a perfect LA.

Especially in the documents of the 16th and early 17th centuries, the CITlab Advanced LA did not convince us. It was clear to us from the beginning that the standard LA would not identify the more complex layouts (text regions) in a differentiated way. However, it was the line detection that ultimately failed to meet our demands in these documents.

An example of how (in the worst case) the line detection of the standard LA worked on our material can be seen here:


(Figure: 1587, page 41)

This may be an isolated case. However, if you process large quantities of documents in Transkribus, such cases occur more frequently. To be able to assess the problem properly, we therefore recorded representative error statistics for two bundles of our material. We found that the standard LA here produced an average of 12 line-detection errors per page (see graph below, 1598). This of course has undesirable effects on the HTR result, which we will describe in more detail in the next post.


P2PaLA or structure training

Release 1.9.1

Page-to-page layout analysis (P2PaLA) is a form of layout analysis for which individual models can be trained – similar to HTR. You can train structure models to recognize either text regions only or text regions together with baselines. They therefore fulfill the same functions as the standard layout analysis (CITlab Advanced). P2PaLA is particularly suitable if a document has many pages with mixed layouts. The standard layout analysis usually recognizes just one TR – and this can lead to problems with the reading order of the text.

With the help of structure training, the layout analysis can learn where and how many TRs it should recognize.

The CITlab Advanced LA often had problems identifying the text regions on our material in a differentiated way. That is why we experimented with P2PaLA early on in our project. First, we tried out structure models that only set text regions (main text, marginal notes, footnotes, etc.). Within the TRs generated this way, we worked with the usual line detection. However, the results were not always satisfactory.

The BLs were often too short (at the beginning or end of a line) or broken into many fragments – even on pages with simple layouts. Therefore, building on our already working P2PaLA model, we trained another one that also recognizes the BLs. Our newest model recognizes all ‘simple’ pages almost without errors. For pages with very complex layouts, the results still have to be corrected, but with much less effort than before.


Abbreviations

Release 1.9.1

Medieval and early modern manuscripts are usually full of abbreviations in all possible variations. These can be contractions (omissions within a word) and suspensions (omissions at the end of a word) as well as a wide variety of special characters. So if we want to transcribe old manuscripts, we must first consider how to reproduce the abbreviations: do we reproduce everything as it appears in the text, do we resolve everything – or do we adapt to the capacities of the HTR?

Basically there are three different ways to deal with abbreviations in Transkribus:

– You can try to reproduce abbreviation characters as Unicode characters. Many of the abbreviation characters used in 15th- and 16th-century Latin and German manuscripts can be found in the Unicode block “Latin Extended-D”. For special characters in medieval Latin texts, check the Medieval Unicode Font Initiative. Whether and when this path makes sense depends entirely on the goals of your own project – it is quite laborious in any case.

– If you don’t want to work with Unicode characters, you can also use the “basic letter” of the abbreviation from the regular alphabet – like a literal transcription. Such a “placeholder” can then be provided with a textual tag that marks the word as an abbreviation (“abbrev”). How the tagged abbreviation is to be resolved can then be entered for each tag as its “expansion” (see the sketch after this list).

Thus the resolution of the abbreviation becomes part of the metadata. This approach offers the most possibilities for further use of the material. But it is also very laborious, because each and every abbreviation has to be tagged.

– Or you simply resolve the abbreviations. If you want to make large quantities of full text searchable, as we do, it makes sense to resolve the abbreviations consistently, because it makes searching easier: who would search for “pfessores” instead of “professores”? In our experience, the HTR can handle abbreviations quite well – the classic Latin and German abbreviations as well as currency symbols and other special characters. This is why we resolve most abbreviations during transcription and use them as part of the Ground Truth in HTR training.
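Since the second option stores the resolution in the tag’s “expansion” property, it can also be evaluated automatically afterwards. Here is a minimal sketch of how such tags could be read out of the 'custom' attribute of a PAGE XML text line; the example line and offsets are invented, only the 'abbrev {offset:…; length:…; expansion:…;}' notation corresponds to what Transkribus writes for textual tags.

```python
# Minimal sketch, not part of Transkribus itself: expand "abbrev" tags stored in
# the PAGE XML 'custom' attribute of a text line into their "expansion" values.
import re

def expand_abbreviations(text, custom):
    """Replace tagged abbreviations in `text` with their 'expansion' values."""
    tags = []
    for match in re.finditer(r"abbrev\s*\{([^}]*)\}", custom):
        props = dict(
            (k.strip(), v.strip())
            for k, v in (p.split(":", 1) for p in match.group(1).split(";") if ":" in p)
        )
        if {"offset", "length", "expansion"} <= props.keys():
            tags.append((int(props["offset"]), int(props["length"]), props["expansion"]))
    # Replace from right to left so that earlier offsets stay valid.
    for offset, length, expansion in sorted(tags, reverse=True):
        text = text[:offset] + expansion + text[offset + length:]
    return text

line = "die pfessores der universitet"   # invented example line
custom = "readingOrder {index:0;} abbrev {offset:4; length:9; expansion:professores;}"
print(expand_abbreviations(line, custom))  # -> "die professores der universitet"
```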

The models we train have learned some abbreviations very well. Abbreviations that occur frequently in the manuscripts, such as the suffix “-en”, can be resolved by an HTR model – if it has been taught them consistently.

But more complex abbreviations, especially the contractions, do cause difficulties for the HTR. In our project we have therefore decided to reproduce such abbreviations only in literal form.

In our Collection of Abbreviations we present the many different abbreviations that we find in our material from the 16th to 18th century. We also show how we (and later the HTR models) resolve them. This Collection will be updated by us from time to time – work in progress!


Dictionaries

Release 1.7.1

HTR does not require dictionaries. However, they are available and can be selected when you run full-text recognition.

With each HTR training, a dictionary can be generated from the GT in the training set. It is therefore possible to create a suitable dictionary for each model or for the type of text you are working with.
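To make this more concrete, here is a minimal sketch of how such a word list could be derived from GT transcripts outside Transkribus – Transkribus builds its dictionaries internally during training, so the folder, the file names and the crude tokenization below are assumptions for illustration only.

```python
# Minimal sketch: build a simple word list ("dictionary") with frequencies from
# plain-text GT transcripts, similar in spirit to the dictionary that can be
# generated from the GT of a training set.
from collections import Counter
from pathlib import Path
import re

words = Counter()
for gt_file in Path("ground_truth").glob("*.txt"):   # assumed folder of GT transcripts
    text = gt_file.read_text(encoding="utf-8")
    # Very rough tokenization; a real dictionary would need more careful rules.
    words.update(re.findall(r"[^\W\d_]+", text))

with open("spruchakten_dictionary.txt", "w", encoding="utf-8") as out:
    for word, count in words.most_common():
        out.write(f"{word}\t{count}\n")

print(f"{len(words)} distinct words collected from the GT")
```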

However, dictionaries are rarely used in Transkribus. In our project they are sometimes used at the beginning of the work on new models. As long as the model to be improved still has a CER of more than 8%, correcting the texts recognized by the HTR is very time-consuming. If a dictionary is used at this point, the CER can often be reduced to 5%. If the model already has a CER below 8%, the use of dictionaries is counterproductive, because the reading result often becomes worse again: in such cases the HTR, “against its better judgement”, replaces its own reading result with a recommendation from the dictionary.

We use dictionaries only to support very weak models, mainly to help the transcriber with particularly difficult writings. For example, we used a dictionary to create the GT for the really hard-to-read concept writings. Of course, the results had to be corrected in every case, but the “reading suggestions” produced by the HTR with a dictionary were a great help. As soon as our model was able to recognize concept writings with less than 8% CER, we stopped using the dictionary.


Languages

HTR does not require dictionaries and works regardless of the language in which a text is written – as long as it uses the character system the model is trained for.
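A rough way to check in advance whether new material stays within the character system a model knows is to compare its characters against the model's alphabet. The following sketch assumes the transcript is available as a plain-text file and the alphabet as a simple set of characters – both the file name and the alphabet are invented for illustration.

```python
# Minimal sketch: list the characters of a transcript that a model has never
# seen in training. File name and alphabet are assumptions.
from collections import Counter

def out_of_alphabet(transcript_path, model_alphabet):
    """Count characters in the transcript that are not in the model's alphabet."""
    with open(transcript_path, encoding="utf-8") as f:
        text = f.read()
    return Counter(ch for ch in text if ch not in model_alphabet and not ch.isspace())

# Hypothetical alphabet of a model trained on German and Latin texts.
alphabet = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZäöüÄÖÜß0123456789.,;:-()/")
for char, count in out_of_alphabet("transcript_1598.txt", alphabet).most_common():
    print(f"{char!r} (U+{ord(char):04X}) occurs {count} times and is unknown to the model")
```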

For the training strategy in our project, this means that we do not distinguish between Latin and German texts or Low German and High German texts when selecting the training material. So far, we have not found any serious differences in the quality of the HTR results between texts in both languages.

This observation is important for historical manuscripts from German-speaking countries. Usually, the language used within a document also affects the script: most writers of the 16th to 18th centuries, when they switched from German to Latin, also changed mid-text from Kurrent to Antiqua. In contrast to OCR, where the mixed use of Gothic and Antiqua typefaces in modern printing is very difficult to handle, HTR – if it is trained for it – has no problem with this change.

A very typical case from our material, with a comparison of the HTR result and the GT, illustrates the point: the error rates in the linguistically different sections of the page are quite comparable. The models used were Spruchakten_M_2-8 and Spruchakten_M_3-1. The first is a generic model, the second is specialized for writings from 1583 to 1627.


Mixed layouts

Release 1.7.1

The CITlab Advanced Layout Analysis handles most “ordinary” layouts well – in 90% of the cases. Let’s talk about the remaining 10%.

We have already discussed how to proceed in order to avoid trouble with the reading order. But what happens if we have to deal with really mixed – crazy – layouts, such as concept writings?

With complicated layouts, you will quickly notice that the manually drawn TRs overlap. That is not good, because automatic line detection does not work reliably in overlapping text regions. The problem is easily solved, because TRs do not have to be rectangles: they can be drawn as polygons and can therefore be separated cleanly from each other.

If there are many text regions, it makes sense to add structure tags in order to be able to distinguish them better. You can also assign them to specific processing routines later on. This is a small effort with great benefit, because structure tagging is no more complex than textual tagging.

Tips & Tools
Automatic line detection can be a real challenge here. Sections where (with a little experience) you can already predict that it will not work properly are best handled manually. For automatic line detection, CITlab Advanced should be configured so that the default setting is replaced by “heterogeneous”. The LA will then take horizontal as well as vertical, skewed and oblique lines into account. This takes a little longer, but the result is better.

If such complicated layouts are a constant feature of your material, it is worth setting up a P2PaLA training. This will give you your own layout analysis model, tailored to the specific challenges of your material. By the way, structure tagging is the basic requirement for such a training.


First volumes with decisions of the Wismar High Court online

Last week we were able to put online the first volumes with the opinions of the assessors of the Royal High Tribunal at Wismar – the final court of appeal in the German territories of the Swedish Crown. Assessors is what the judges at the Tribunal were called. Since the Great Northern War there was only a panel of four judges instead of eight. The Deputy President assigned them the cases in which they were to form a legal opinion. As at the Reichskammergericht in Wetzlar, a rapporteur and a co-rapporteur were appointed for each case, who formulated their opinions in writing and discussed them with their colleagues. If the votes of these two judges agreed, the consensus of the remaining colleagues was only formally requested in the court session. In addition, all relations had to be checked and confirmed by the Deputy President. If a case was more complicated, all assessors expressed their opinion on the verdict. These reasons for the verdicts are recorded in the collection of so-called “Relationes”.

These relations are a first-class source for legal history, since they first recount the course of the conflict and then propose a judgment. Here we can trace both the legal foundations in the justifications and the everyday life of the people in the narratives. The text recognition was realized with an HTR model trained on the manuscripts of nine different judges of the Royal Tribunal. The training set consisted of 600,000 words. Accordingly, the accuracy of the handwritten text recognition is good – in this case about 99%.

The results can be seen here. How to navigate in our documents and how the full text search works is explained here.

Who were the judges?

In the second half of the 18th century there was a new generation of judges. At the end of the 1750s and the beginning of the 1760s, justice at the Tribunal was administered by:

– Hermann Heinrich von Engelbrecht (1709-1760), Assessor since 1745, Deputy President since 1750
– Bogislaw Friedrich Liebeherr (1695-1761), Assessor since 1736
– Anton Christoph Gröning (1695-1773), Assessor since 1749
– Christoph Erhard von Corswanten (about 1708-1777), Assessor since 1751, Deputy President since 1761
– Carl Hinrich Möller (1709-1759), Assessor since 1751
– Joachim Friedrich Stemwede (about 1720-1787), Assessor since 1760
– Johann Franz von Boltenstern (1700-1763), Assessor since 1762
– Johann Gustav Friedrich von Engelbrechten (1733-1806), Assessor between 1762 and 1775
– Augustin von Balthasar (1701-1786), Assessor since 1763, Deputy President since 1778


Generic vs. specialized model

Release 1.7.1

Did you notice in the graph of the model development that the character error rate (CER) of the last model got slightly worse again – even though we had significantly increased the GT input? We had about 43,000 more words in training, but the average CER deteriorated from 2.79% to 3.43%. We could not really explain that.
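For readers who wonder what exactly is behind these CER figures: the character error rate is the edit (Levenshtein) distance between the recognized text and the ground truth, divided by the length of the ground truth. A minimal sketch – the two example strings are invented:

```python
# Minimal sketch of a character error rate (CER) computation.
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]

def cer(recognized, ground_truth):
    return levenshtein(recognized, ground_truth) / len(ground_truth)

ground_truth = "die professores der universitet zu Greifswald"
recognized = "die pfessores der universitet zu Greifwald"
print(f"CER: {cer(recognized, ground_truth):.2%}")
```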

At this point we couldn’t get any further with more and more GT. So we had to change our training strategy. So far we had trained large models, with writings from a total period of 70 years and more than 500 writers.

Our first suspicion fell on the concept writings, with which we already knew the machine (LA and HTR) – just like ourselves – had its problems. In the next training we excluded these concept writings and trained exclusively with “clean” office writings. But that did not lead to a noticeable improvement: the test set CER dropped from 3.43% to just 3.31%.

In the following trainings, we additionally focused on a chronological sequencing of the models. We split our material and created two different models: Spruchakten_M_3-1 (Spruchakten 1583-1627) and Spruchakten_M_4-1 (Spruchakten 1627-1653).

With these new specialized models we actually achieved an improvement of the HTR where the generic model was no longer sufficient. In the test sets, several pages showed an error rate of less than 2%. In the case of model M_4-1, the CERs of many individual pages remained below 1%, and two pages were, at 0%, even completely error-free.

Whether a generic or a specialized model will help and produce better results depends a lot on the size and composition of the material. In the beginning, when you want to progress as quickly as possible (the more, the better), a generic model is useful. However, once it reaches its limits, you should not “overburden” the HTR any further, but specialize your models instead.


Transkribus as an instrument for students and professors

In this year’s 24-hour lecture of the University of Greifswald, Transkribus and our digitization project will be presented. Elisabeth Heigl, who works on the project as an academic assistant, will present some of the exciting cases from the rulings of the Greifswald law faculty. If you are interested in legal history, join the lecture in lecture hall 2, Audimax (Rubenowstraße 1), on 16.11.2019 at 12:00.
Read the whole program of the 24-hour lecture here.