
Posted by Dirk Alvermann on

dictionaries

Release 1.7.1

HTR does not require dictionaries. However, dictionaries are available and can be selected when you run full-text recognition.

With each HTR training, a dictionary can be generated from the GT in the training set. It is therefore possible to create a suitable dictionary for each model or for the type of text you are working with.
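To make this concrete: such a dictionary is essentially a list of the word forms occurring in the GT, together with their frequencies. A minimal sketch of how one could be built (our own simplified illustration, not the actual Transkribus implementation):

```python
from collections import Counter

def build_dictionary(gt_lines, min_freq=2):
    """Collect a word list with frequencies from GT transcriptions.

    Words below min_freq are dropped so that one-off typos or
    rare abbreviations do not end up in the dictionary.
    """
    counts = Counter()
    for line in gt_lines:
        for word in line.split():
            # Strip simple punctuation; real GT needs more careful tokenizing.
            word = word.strip(".,;:()")
            if word:
                counts[word] += 1
    return {w: n for w, n in counts.items() if n >= min_freq}

gt = ["Anno 1594 ist erschienen der Bauer,",
      "der Bauer hat bekannt"]
print(build_dictionary(gt, min_freq=2))  # {'der': 2, 'Bauer': 2}
```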

However, dictionaries are rarely used in Transkribus. In our project they are sometimes used at the beginning of the work on new models. As long as the model to be improved still has a CER of more than 8%, correcting the texts recognized by the HTR is very time-consuming. If a dictionary is used at this point, the CER can often be reduced to 5%. Once the model has a CER below 8%, however, using a dictionary becomes counterproductive, because the reading result often gets worse again: the HTR then replaces its own correct reading with a recommendation from the dictionary, against its better judgment.

We use dictionaries only to support very weak models, mainly to help the transcriber with particularly difficult writings. For example, we used a dictionary to create the GT for the really hard-to-read concept writings. Of course, the results had to be corrected in every case, but the “reading recommendations” produced by the HTR with the dictionary were a good help. As soon as our model was able to recognize concept writings with less than 8% CER, we stopped using the dictionary.

Posted by Dirk Alvermann on

Languages

HTR does not require dictionaries and works regardless of the language in which a text is written – as long as it uses the character system the model is trained for.

For the training strategy in our project, this means that we do not distinguish between Latin and German texts, or between Low German and High German texts, when selecting the training material. So far, we have not found any serious differences in the quality of the HTR results between texts in these languages.

This observation is important for historical manuscripts from the German-speaking countries. Usually, the language used within a document also affects the script: most writers of the 16th to 18th centuries change in the middle of the text from Kurrent to Antiqua when they switch from German to Latin. Unlike OCR, for which the mixed use of Gothic and Antiqua typefaces in modern printing is very difficult, HTR – if it is trained for it – has no problem with this change.

A very typical case in our material, shown here in a comparison of the HTR result and the GT, illustrates the problem. The error rates in the linguistically different text sections of the page are quite comparable. The models used were Spruchakten_M_2-8 and M_3-1. The first is a generic model, the second is specialized for writings from 1583 to 1627.
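For reference: the CER compared here is the Levenshtein (edit) distance between the HTR output and the GT, divided by the number of characters in the GT. A minimal sketch of the computation (the function names are ours, not a Transkribus API):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(htr: str, gt: str) -> float:
    """Character error rate: edit distance relative to GT length."""
    return levenshtein(htr, gt) / len(gt)

print(f"{cer('Spruchacten', 'Spruchakten'):.2%}")  # one substitution -> ~9%
```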

Posted by Anna Brandt on

layout tab

Release 1.7.1

If you correct the automatic layout analysis, you can either do this directly in the image or navigate via the layout tab on the left side. There, all shapes, such as the text regions and the baselines, are displayed with their position in the image and their structural tags. Shapes can be deleted or moved. In the image you can always see which element is currently marked, and thus what you can change.

If you want to merge two baselines, just mark them in the layout tab instead of trying to hit the thin line in the image.

Navigating in the tab is especially useful if you want to see the complete image in the right window. This way you keep a better overview, because everything changes in the image and in the tab at the same time.

Tips & Tools
You can change the reading order of the baselines in the layout tab either by moving the lines or by clicking and changing the number in the column “Reading Order”.

Posted by Dirk Alvermann on

mixed layouts

Release 1.7.1

The CITlab Advanced Layout Analysis handles most “ordinary” layouts well – in 90% of the cases. Let’s talk about the remaining 10%.

We already discussed how to proceed in order to avoid trouble with the Reading Order. But what happens if we have to deal with really mixed – crazy – layouts, e.g. concept writings?

With complicated layouts, you’ll quickly notice that the manually drawn TRs overlap. That’s not good, because in overlapping text regions the automatic line detection doesn’t work reliably. The problem is easily solved, because TRs can have shapes other than rectangles: they can be drawn as polygons and are therefore easy to separate from each other.
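Since Transkribus exports each shape to PAGE XML as a string of “x,y” points, overlaps can also be checked outside the interface. A small sketch using the third-party shapely library; the coordinates and helper names are invented for illustration:

```python
from itertools import combinations
from shapely.geometry import Polygon

def parse_coords(points: str) -> Polygon:
    """Parse a PAGE-XML style points string like '10,10 120,15 110,80'."""
    pts = [tuple(map(int, p.split(","))) for p in points.split()]
    return Polygon(pts)

regions = {
    "tr_1": parse_coords("10,10 300,10 300,200 10,200"),           # rectangle
    "tr_2": parse_coords("290,40 500,20 520,240 310,260 250,150"), # polygon
}

# Report overlapping region pairs; these would confuse line detection.
for (n1, p1), (n2, p2) in combinations(regions.items(), 2):
    inter = p1.intersection(p2)
    if not inter.is_empty:
        print(f"{n1} and {n2} overlap (area {inter.area:.0f} px²)")
```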

If there are many text regions, it makes sense to add structural tags in order to distinguish them better. They can also be assigned to specific processing routines later on. This is a small effort with great benefits, because structural tagging is no more complex than tagging in the text itself.

Tips & Tools
Automatic line detection can be a real challenge here. Sections where you can already predict (with a little experience) that the automatic detection will fail are best handled manually. For automatic line detection, CITlab Advanced should be configured so that the default setting is replaced by “heterogeneous”. The LA will then take horizontal as well as vertical, skewed, and oblique lines into account. This takes a little longer, but the result is better.

If such complicated layouts are a continuous characteristic of your material, then it is worth setting up a P2PaLA training. This creates your own layout analysis model, tailored to the specific challenges of your material. By the way, structure tagging is the basic requirement for such a training.

Posted by Dirk Alvermann on

First volumes with decisions of the Wismar High Court online

Last week we were able to make available online the first volumes containing the opinions of the assessors of the High Royal Tribunal at Wismar, the final court of appeal in the German territories of the Swedish Crown. “Assessors” is the name given to the judges at the tribunal. Since the Great Northern War there was only a panel of four judges instead of eight. The Deputy President assigned them the cases in which they were to form a legal opinion. As at the Reichskammergericht in Wetzlar, a rapporteur and a co-rapporteur were appointed for each case; they formulated their opinions in writing and discussed them with their colleagues. If the votes of these two judges were in agreement, the consensus of the remaining colleagues was only formally requested in the court session. In addition, all relations had to be checked and confirmed by the Deputy President. If a case was more complicated, all assessors expressed their opinion on the verdict. The reasons for the verdicts are recorded in the collection of so-called “Relationes”.

These relations are a first-class source for the history of law, since they first recount the course of the conflict in a narrative and then propose a judgment. Here we can study both the legal bases in the justifications and the everyday life of the people in the narratives.

The text recognition was realized with an HTR model trained on the manuscripts of 9 different judges of the royal tribunal. The training set consisted of 600,000 words. Accordingly, the accuracy of the handwritten text recognition is good, in this case about 99%.

The results can be seen here. How to navigate in our documents and how the full text search works is explained here.

Who were the judges?

In the second half of the 18th century a new generation of judges took over. At the end of the 1750s and the beginning of the 1760s, justice at the tribunal was administered by:

Hermann Heinrich von Engelbrecht (1709-1760), Assessor since 1745, Deputy President since 1750
Bogislaw Friedrich Liebeherr (1695-1761), Assessor since 1736
Anton Christoph Gröning (1695-1773), Assessor since 1749
Christoph Erhard von Corswanten (c. 1708-1777), Assessor since 1751, Deputy President since 1761
Carl Hinrich Möller (1709-1759), Assessor since 1751
Joachim Friedrich Stemwede (c. 1720-1787), Assessor since 1760
Johann Franz von Boltenstern (1700-1763), Assessor since 1762
Johann Gustav Friedrich von Engelbrechten (1733-1806), Assessor between 1762 and 1775
Augustin von Balthasar (1701-1786), Assessor since 1763, Deputy President since 1778

Posted by Elisabeth Heigl on

Generic vs. specialized model

Release 1.7.1

Did you notice in the graph of the model development that the character error rate (CER) of the last model got slightly worse again, despite the fact that we had significantly increased the GT input? We had about 43,000 more words in training, but the average CER deteriorated from 2.79% to 3.43%. We couldn’t really explain that.

At this point we couldn’t get any further with more and more GT. So we had to change our training strategy. So far we had trained large models, with writings from a total period of 70 years and more than 500 writers.

Our first suspicion fell on the concept writings, which we already knew caused problems for the machine (LA and HTR), just as they do for us. In the next training we excluded these concept writings and trained exclusively with “clean” office writings. But that did not lead to a noticeable improvement: the test set CER dropped from 3.43% to just 3.31%.

In the following trainings, we additionally focused on a chronological sequencing of the models. We split our material and created two different models: Spruchakten_M_3-1 (Spruchakten 1583-1627) and Spruchakten_M_4-1 (Spruchakten 1627-1653).

With these new specialized models we actually achieved an improvement of the HTR where the generic model was no longer sufficient. In the test sets, several pages showed an error rate of less than 2%. In the case of model M_4-1, many CERs of individual pages remained below 1%, and two pages were even completely error-free at 0%.

Whether a generic or a specialized model will help and produce better results depends a lot on the size and composition of the material. In the beginning, when you want to progress as quickly as possible (the more, the better), a generic model is useful. However, once it reaches its limits, you shouldn’t “overburden” the HTR any further, but instead specialize your models.

Posted by Elisabeth Heigl on

Transkribus as an instrument for students and professors

In this year’s 24-hour lecture of the University of Greifswald, Transkribus and our digitization project will be presented. Elisabeth Heigl, who is involved in the project as an academic assistant, will present some of the exciting cases from the rulings of the law faculty of Greifswald. If you are interested in the history of law, join the lecture in lecture hall 2, Audimax (Rubenowstraße 1), on 16.11.2019 at 12:00.
Read the whole program of the 24-hour lecture here.

Posted by Dirk Alvermann on

Transkribus in Chicago

Transkribus will be presented at this year’s meeting of the Social Science History Association (SSHA) in Chicago. Günter Mühlberger will not only present the potential of Transkribus, but also first results and experiences from the processing of the cadastral protocols of the Tiroler Landesarchiv and from our digitization project. He will pay special attention to the training of HTR models and the chances of keyword spotting. The lecture, ‘Handwritten Text Recognition and Keyword Spotting as Research Tools for Social Science and History’, will take place on 21.11. at 11:00 am in Session 31 (Emerging Methods: Computation/Spatial Econometrics).

Posted by Anna Brandt on

Feedback

The blog “Rechtsgeschiedenis” (Otto Vervaart, Utrecht) has published a detailed discussion of the project ‘Rechtssprechung im Ostseeraum’ and of our blog. It describes our work with Transkribus, the project itself, and the page where we present the results, giving a good overview from a user’s perspective.

Posted by Dirk Alvermann on

How to create test sets and why they are important, #2

Release 1.7.1

What is the best procedure for creating test sets?
In the end, everyone can find their own way. In our project, the pages for the test sets are already selected during the creation of the GT. They receive a special edit status (Final) and are later collected in separate documents. This ensures that they will not accidentally end up in training. Whenever new GT is created for future training, the material for the test set is extended at the same time, so both sets grow in proportion.
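The proportional bookkeeping could look like the following sketch (our own illustration, not a Transkribus feature; in practice we select the test pages representatively by hand, as described below, rather than at random):

```python
import random

def split_new_gt(pages, test_share=0.1, seed=42):
    """Divert a fixed share of newly transcribed pages into the test set,
    so that training and test material keep growing in proportion."""
    rng = random.Random(seed)          # fixed seed: reproducible split
    pages = list(pages)
    rng.shuffle(pages)
    n_test = max(1, round(len(pages) * test_share))
    return pages[n_test:], pages[:n_test]   # (training pages, test pages)

new_pages = [f"Spruchakte_1594_p{i:03d}" for i in range(1, 41)]
train, test = split_new_gt(new_pages)
print(len(train), "pages to training,", len(test), "to the test set")
```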

For the systematic training we create several Documents, which we call “test sets” and which each relate to a single Spruchakte (one year). For example, we create a “test set 1594” for the Document of the Spruchakte 1594. In it we place representatively selected images, which should reflect the variety of writers as closely as possible. In the “mother document” we mark the pages selected for the test set as “Final” to make sure that they will not be edited there in the future. We have not created a separate test set for every single record or year, but have proceeded in five-year steps.

Since a model is often trained over many rounds, this procedure also has the advantage that the test set always remains representative. The CERs of the different versions of a model can therefore always be compared and observed during development, because the test is always executed on the same (or extended) set. This makes it easier to evaluate the progress of a model and to adjust the training strategy accordingly.

Transkribus also stores the test set used for each training session separately in the collection concerned, so you can always fall back on it.
It is also possible to select a test set just before the training and simply assign individual pages from the training material to the test set. This may be a quick and pragmatic solution in individual cases, but it is not suitable for the planned development of powerful models.