Dirk Alvermann


Posted by Dirk Alvermann on

Exclude single lines from training

Some of you will know this from practical experience: you are transcribing a particularly difficult page in Transkribus and you cannot decipher everything with the best will in the world. What do you do? If the page is set to the edit status “ground truth”, the transcription goes into training with the obvious errors (or what you could not read). This is not what is intended. But you don’t want to simply “throw away” the page either.

We have already mentioned the use of tags in another post. We used the “unclear” tag in the project from the beginning. Others also like to use the “gap” tag for such reading problems.

This is now proving to be a great advantage. For some months now, the Transkribus training tool has offered the function “omit lines at tag”

The tool ensures that on all pages that are taken into the training or validation set, the lines that have a tag “unclear” or “gap” are automatically excluded from the training. This means that pages that are not perfectly transcribed, but where the parts that could not be deciphered are marked by tags, can be trained without hesitation.

Posted by Dirk Alvermann on

Merge small Base Lines

This tool is – like “Remove small text lines” – distributed with version 1.12.0 of Transkribus. The idea behind it is very interesting.

Maybe you have had problems with “torn” lines in the automatic line detection (Citlab Advanced Layout Analysis). We have mentioned in an earlier post how annoying this problem can be.

So the expectations for such a great thing were of course high. But after a short time we realized that its use needs some practice and that it cannot be used everywhere without problems.

Here we show a simple example:

The Citlab Advanced Layout Analysis detected five “superfluous” text regions on the page and just as many “torn” base lines. In such a case you should first remove the redundant text regions with “remove small text regions” and then start the automatic merge tool.

Tips & Tools
Be careful with complicated layouts. You must always check the result of “merge small text lines”, because often base lines are merged that do not belong together (from lines with different reading order).

Posted by Dirk Alvermann on

Remove small text lines

Release 1.12.0

Many of you probably know the tool “Remove small text regions“, which has been available at Transkribus for the last year. Now his little brother “Remove small text lines” is coming. Finally – a tool that many users have been hoping for for a long time.

With the Citlab Advanced Layout Analysis (even on quite “normal” pages) it happens again and again that textregions or baselines are recognized where we don’t need or want them.

Often “mini-baselines” are recognized in decorated initials or between the individual lines. The HTR model of course can’t do anything with these during text recognition and the transcript will contain “empty” lines. With this tool you can easily and automatically delete these baselines

Try it yourself. We have had the best experience with this if we set the threshold to 0.05.

Posted by Dirk Alvermann on

Automatic Selection of Validation Set

About validation and the different ways to create a Validation Set you can already find some posts in this blog.

Since the last version of Transkribus (1.12.0) there is a new way to create Validation Sets. Transkribus takes a certain amount (2%, 5% or 10%) of the Ground Truth from the Train Set during the compilation of the training data and creates a Validation Set automatically. This set consists of randomly selected pages.

These Validation Sets are created in the Transkribus training tool. You start as usual by entering the training parameters of the model. But before you add the Ground Truth to the Train Set, you select the desired percentage for the Validation Set. This order is important. Every time you add a new document to the Train Set, Transkribus will extract the corresponding pages for the Validation Set.

The new tool is very well suited for large models with a lot of ground truth, especially if you don’t care about setting up special Validation Sets or if you find it difficult for representative models

Posted by Dirk Alvermann on

HTR+ versus Pylaia part 2

Release 1.12.0

Some weeks ago we reported about our first experiences with PyLaia while training a generic model (600.000 words GT).

Today we want to make another attempt to compare PyLaia and HTR+. This time we have a larger model (German_Kurrent_17th-18th; 1,8 million words GT) available. The model was trained as both PyLaia and HTR+ model, with identical ground truth and the same conditions (from the scratch).

Our hypothesis that PyLaia can show its advantages over HTR+ in larger generic models has been fully confirmed. In the case shown PyLaia is superior to HTR+ in all aspects. Both with and without the Language Model, the PyLaia model scored about one percentage point (in the CER) better than HTR+ on all our test sets.

By the way, in the last weeks the performance of PyLaia for “curved” textlines has also improved significantly.

Posted by Dirk Alvermann on

Tag Export II

In the last post we presented a benefit of tags. As an example we showed the visualization of results by displaying place tags on a map. But there are other possibilities.

Tags can not only be exported separately, as in the form of an Excel table. Some tags (place or person) are also output in the ALTO files. These files are among other things responsible for the fact that we can display the hits of the full text search in our viewer/presenter. To do this, simply select “Export ALTO (Split lines into words)” when exporting the METS files.

In our presenter in the Digital Library Mecklenburg-Vorpommern the tags are then displayed as “named entities” separated by places and persons for the respective document. The whole thing is still in the experimental phase and will be further developed in the near future so that you can jump directly to the corresponding places in the document via an actual tag cloud.

Posted by Dirk Alvermann on

New public model – German Kurrent 17th-18th

Today we are proud to present our second publicly available HTR model.

“German_Kurrent_17th-18th” is an HTR model for current scripts of the 17th and 18th century. For this model we used ground truth from our various larger and smaller projects of the last four years.

It is a generic model that includes material from about 2000 individual writer’s hands. About 35% of the Ground Truth comes from 17th century manuscripts and 50% from 18th century manuscripts. The remaining 15% are spread over the decades before and after. The documents selected for the training consist mainly of official minutes and legal documents, but also private records and letters of the time. In addition, a few contemporary printed materials (Fraktur), which appear in the records from time to time, were also used for the training. The language of the texts used is predominantly German and Latin. In addition, some Low German and French texts were also used.

Have fun trying it out. Please use the comments to let us know how the model works for you.

Posted by Dirk Alvermann on

Tag Exports I

Once you have taken the trouble to tag one or more documents, there are several ways to use this “added value” outside of Transkribus. The tags can be easily exported as an Excel spreadsheet using Transkribus’ export tool.

From there you have many options. We had conducted our “tagging experiment” to see if this would be a good way to visualize the geographical distribution of our documents. At the same time, this map should allow access to the digitized documents in the presenter.

All in all we are satisfied with the result of the experiment. You can select specific years or periods of time, search for locations and use the points on the map to access the documents.

In the end, however, the effort for this kind of tagging has proven to be so high that we cannot afford it within the scope of this project. But there are other ways to use tags in export, which we will write about in the next post.

Posted by Dirk Alvermann on

HTR+ versus Pylaia

Version 1.12.0

As you have probably noticed, since last summer there is a second recognition engine available in Transkribus besides HTR+ – PyLaia. (link)

We have been experimenting with PyLaia models in the last weeks and would like to document our first impressions and experiences about the different aspects of HTR+ and PyLaia. First of all: considering the economic point of view, PyLaia is definitely a little bit cheaper than HTR+, as you can see on the price list of the Read Coop. Does cheaper also mean worse? – Definitely not! In terms of accuracy rate PyLaia can easily compete with HTR+. It is often slightly better. The following graphic compares an HTR+ and a PyLaia model trained with identical ground truth (approx. 600,000 words) under the same conditions (from the scratch). The performance with and without the Language Model is compared.

Perhaps the most notable difference is that the results of PyLaia models cannot be improved as much by using a Language Model as is the case with HTR+. This is not necessarily a disadvantage, but rather indicates a high basic reliability of these models. In other words: PyLaia does not necessarily need a Language Model to achieve very good results.

There is also an area where PyLaia performs worse than HTR+. PyLaia has more difficulties to read “curved” lines correctly. For vertical text lines the result is even worse.

In training PyLaia is a bit slower than HTR+, which means that the training takes longer. On the other hand, PyLaia is much faster in “acceleration”. It needs relatively few training epochs or iterations to achieve good results. You can compare this quite well with the two learning curves.

Our observations are of course not exhaustive. So far they only refer to generic models that have been trained with a high level of ground truth. Overall we have the impression that PyLaia can fully exploit its advantages with such large generic models.

Posted by Dirk Alvermann on

How to train PyLaia models

Release 1.12.0

Since version 1.12.0 it is possible to train PyLaia models in Transkribus besides the proven HTR+ models. We have gained some experience with this in the last months and are quite impressed by the performance of the models.

PyLaia models can be trained like HTR or HTR+ models using the usual training tool. But there are some differences.

Like a normal HTR+ model you have to enter the name of the model, a description and the languages the model can be used for. Unlike the training of HTR+ models, the number of iterations (epochs) is limited to 250. This is sufficient in our experience. You can also train PyLaia models with base models, i.e. you can design longer training series that are based on each other. In contrast to the usual training, PyLaia has an “Early Stopping” setting. It determines when the training can be stopped, if a good result is achieved. At the beginning of your training attempts you should always set this value to the same number of iterations as you have chosen for the whole training. So if you train with 250 epochs, the setting for “Early Stopping” should be the same. Otherwise you risk that the training will be stopped too early.

The most important difference is that in PyLaia Training you can choose whether you want to train with the original images or with compressed images. Our recommendation is: train with compressed images. The PyLaia training with original images can take weeks in the worst case (with a correspondingly large amount of GT). With compressed images a PyLaia training is finished within hours or days (if you train with about 500.000 words).

Tips & Tools
For more detailed information, especially for setting specific training parameters, we recommend the tutorial by Annemieke Romein and the READ Coop guidelines.