Monthly Archives

2 Articles

Posted by Dirk Alvermann on

Remove small text lines

Release 1.12.0

Many of you probably know the tool “Remove small text regions“, which has been available at Transkribus for the last year. Now his little brother “Remove small text lines” is coming. Finally – a tool that many users have been hoping for for a long time.

With the Citlab Advanced Layout Analysis (even on quite “normal” pages) it happens again and again that textregions or baselines are recognized where we don’t need or want them.

Often “mini-baselines” are recognized in decorated initials or between the individual lines. The HTR model of course can’t do anything with these during text recognition and the transcript will contain “empty” lines. With this tool you can easily and automatically delete these baselines

Try it yourself. We have had the best experience with this if we set the threshold to 0.05.

Posted by Dirk Alvermann on

Automatic Selection of Validation Set

About validation and the different ways to create a Validation Set you can already find some posts in this blog.

Since the last version of Transkribus (1.12.0) there is a new way to create Validation Sets. Transkribus takes a certain amount (2%, 5% or 10%) of the Ground Truth from the Train Set during the compilation of the training data and creates a Validation Set automatically. This set consists of randomly selected pages.

These Validation Sets are created in the Transkribus training tool. You start as usual by entering the training parameters of the model. But before you add the Ground Truth to the Train Set, you select the desired percentage for the Validation Set. This order is important. Every time you add a new document to the Train Set, Transkribus will extract the corresponding pages for the Validation Set.

The new tool is very well suited for large models with a lot of ground truth, especially if you don’t care about setting up special Validation Sets or if you find it difficult for representative models