
Posted by Anna Brandt on

Region Grouping

Since version 1.14.0 there is a new option for configuring the layout analysis: ‘Region grouping’, which controls how the text regions are arranged. You can now choose whether the text regions should be grouped around the baselines or whether all lines should be placed in a single TR.

With the first setting, many small TRs may appear at the edge of the image or in the middle of it, even if there is actually only one text block. This can be fixed in a further step with ‘Remove small text regions’.

If, on the other hand, only one text region is set, then really all baselines end up in this text region, even those that would otherwise be marginal, and even vertical baselines. As long as ‘Heterogeneous’ is selected for ‘Text orientation’, the layout analysis recognizes the vertical lines in the same TR as the horizontal ones, even where the LA would normally detect several TRs. The reading order of the lines, however, is still divided up as if the lines sat in their own text regions. The main paragraph is usually TR 1, so the RO starts there. The other baselines are placed at the end, even if they stand beside the main text and could therefore be ordered between its lines.
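To make this reading-order effect easier to picture, here is a small conceptual sketch in Python. It is not Transkribus code; the line coordinates and the implicit grouping are invented for illustration.

```python
# Conceptual sketch (not Transkribus code): why marginal lines end up at the
# end of the reading order when all baselines share a single text region.
# Coordinates and the grouping heuristic are invented for illustration.

lines = [
    {"text": "main paragraph, line 1", "group": 1, "y": 100},
    {"text": "main paragraph, line 2", "group": 1, "y": 130},
    {"text": "marginal note beside line 1", "group": 2, "y": 105},
    {"text": "vertical note in the margin", "group": 3, "y": 110},
]

# Even though all lines sit in one TR, the reading order is assigned per
# implicit group: the main paragraph (group 1) comes first, everything
# else is appended afterwards, regardless of its position on the page.
reading_order = sorted(lines, key=lambda l: (l["group"], l["y"]))

for i, line in enumerate(reading_order, start=1):
    print(i, line["text"])
```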

To decide which setting is better for you, you have to try it out. For pages with only one text block, the second setting is clearly advantageous, because none of the small TRs appear. You may even have to choose different settings within a single document.

Posted by Elisabeth Heigl on

Edit multiple documents simultaneously

Version 1.15.1

Until now, we were used to starting the layout analysis and the HTR for the document we were currently working in. Now, however, it is possible to trigger both steps for all documents of the collection we are in. We will describe how this works in a moment, but first why we are so happy about it:

In order to check the results of our freshly trained models, we have created a separate collection of Spruchakten test sets. How exactly, and why, you can read elsewhere. This means that for each document whose GT we used in training, there is a corresponding test-set document in the test-set collection.

When a new HTR model has finished training and we are curious to see how it compares with the previous models, we run it over each of the test sets and then calculate the CER. After more than two years of training, our test-set collection is now quite full, with almost 70 test sets in it.
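For those who like to see the arithmetic: the CER is simply the edit distance between the reference text and the recognized text, divided by the length of the reference. Here is a minimal sketch of that bookkeeping; the tiny example strings are invented, and in practice Transkribus computes the CER for you (e.g. via the Compute Accuracy tool).

```python
# Minimal sketch of the CER calculation described above.

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

# Example with two tiny invented "test sets":
test_sets = {
    "testset_01": ("Sonnabend den 12. Martii", "Sonnabent den 12. Marti"),
    "testset_02": ("Anno 1623", "Anno 1628"),
}
for name, (ref, hyp) in test_sets.items():
    print(f"{name}: CER = {cer(ref, hyp):.2%}")
```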

Imagine how time-consuming it used to be to open each test set individually in order to start the new HTR. Even with only 40 test sets, you had to be very curious. And now imagine how much easier it is to trigger the HTR (and the LA as well) for all documents at once, with one click. This should please everyone who processes many small documents, such as index cards, in one collection.

And how does it work? You can see it immediately under the layout analysis tools: in red, under “Document Selection”, there is the new option “Current collection”, which can be selected for the following step.

However, it is not enough to simply select “Current collection” and then start the LA; first you must always enter the selection via “Choose docs…”. Either you simply confirm the preselection (all docs in the collection) or you select individual docs.

For the HTR, the same option appears in the selection window for “Text Recognition”. Here, too, you can select the “Current collection” for the following step, and here, too, you must confirm the selection via “Choose docs…”.
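Conceptually, the new option saves you a loop like the following. The sketch below is only an illustration of the idea; list_docs and start_htr are hypothetical placeholders, not the real Transkribus interface.

```python
# What the new "Current collection" option does for you, shown as a
# conceptual batch loop. list_docs() and start_htr() are hypothetical
# placeholders, not the real Transkribus interface; the point is simply
# that one trigger now covers every selected document instead of one.

def list_docs(collection_id):
    """Hypothetical helper: return the IDs of all docs in a collection."""
    ...

def start_htr(collection_id, doc_id, model_id):
    """Hypothetical helper: queue an HTR job for one document."""
    ...

def run_htr_on_collection(collection_id, model_id, selected_docs=None):
    # The "Choose docs..." step: confirm the preselection (all docs)
    # or pass an explicit list of individual docs.
    docs = selected_docs if selected_docs is not None else list_docs(collection_id)
    for doc_id in docs:
        start_htr(collection_id, doc_id, model_id)
```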

Posted by Dirk Alvermann on

Exclude single lines from training

Some of you will know this from practical experience: you are transcribing a particularly difficult page in Transkribus, and with the best will in the world you cannot decipher everything. What do you do? If the page is set to the edit status “Ground Truth”, the transcription goes into training with its obvious errors (or with whatever you could not read). That is not what is intended. But you don’t want to simply “throw away” the page either.

We have already mentioned the use of tags in another post. We used the “unclear” tag in the project from the beginning. Others also like to use the “gap” tag for such reading problems.

This is now proving to be a great advantage. For some months now, the Transkribus training tool has offered the function “omit lines at tag”.

The tool ensures that, on all pages taken into the training or validation set, lines tagged “unclear” or “gap” are automatically excluded from training. This means that pages which are not perfectly transcribed, but where the parts that could not be deciphered are marked with tags, can be used for training without hesitation.
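The underlying filtering idea can be pictured like this (a sketch of our reading of the function, not Transkribus source code):

```python
# Sketch of the filtering behind "omit lines at tag": drop every line
# carrying an "unclear" or "gap" tag before the rest goes into training.
# Data layout here is invented for illustration.

OMIT_TAGS = {"unclear", "gap"}

def training_lines(pages):
    """Yield (image, transcript) samples, skipping tagged lines."""
    for page in pages:
        for line in page["lines"]:
            if OMIT_TAGS & set(line.get("tags", [])):
                continue  # undecipherable line: keep the page, skip the line
            yield line["image"], line["text"]

pages = [{
    "lines": [
        {"image": "l1.png", "text": "wir haben den 12. tag", "tags": []},
        {"image": "l2.png", "text": "???", "tags": ["unclear"]},
    ],
}]
print(list(training_lines(pages)))  # only the first line survives
```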

Posted by Anna Brandt on

Undo Job

Version 1.14.0

Since version 1.12 there has been a practical tool for correcting careless mistakes. It has certainly happened to some of you: you start a large job in a document and then realize that the parameters were set incorrectly or that you did not want to run this job at all, for example a layout analysis or an HTR with the wrong model. To fix such errors quickly and easily, especially when they affect several pages, the function ‘Undo Job’ was added to the job list window. With it you can delete an entire job that has gone wrong.

If, for example, a layout analysis has run over pages that were already finished because you forgot to set the checkbox to ‘Current Page’ (a mistake that happens often), you don’t have to select each page individually and delete the wrong version; you can simply undo the whole job with this function.

This only works if the job created the most recent version on the pages concerned. If another version is more recent, Transkribus indicates this and the job is not deleted on that page; on the pages where the job's output is the latest version, it is deleted. This means that you can continue working first and then delete the version created by the wrong job only on the pages where it should not have run (e.g. GT), while it remains on the pages you have since continued working on.
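Our understanding of the per-page logic, as a small sketch (not the actual implementation):

```python
# How we understand 'Undo Job' per page: the job's version is removed
# only where it is still the newest version of that page.

def undo_job(pages, job_id):
    skipped = []
    for page in pages:
        versions = page["versions"]          # oldest ... newest
        if versions and versions[-1]["job"] == job_id:
            versions.pop()                   # job output is newest: remove it
        else:
            skipped.append(page["name"])     # newer version exists: keep it
    return skipped  # pages where the job's version was left in place

pages = [
    {"name": "p1", "versions": [{"job": 1}, {"job": 42}]},
    {"name": "p2", "versions": [{"job": 42}, {"job": 99}]},  # edited afterwards
]
print(undo_job(pages, job_id=42))  # -> ['p2']
```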


Tips & Tools
1) Even if the job is deleted on all pages, it does not disappear from the list of executed jobs. So, to be sure, you should always double-check one or two pages.
2) It only works if you are in the document where the job was executed.

Posted by Dirk Alvermann on

Merge small Base Lines

This tool is – like “Remove small text lines” – distributed with version 1.12.0 of Transkribus. The idea behind it is very interesting.

Maybe you have had problems with “torn” lines in the automatic line detection (Citlab Advanced Layout Analysis). We have mentioned in an earlier post how annoying this problem can be.

So, of course, expectations for this tool were high. But after a short time we realized that using it takes some practice and that it cannot be applied everywhere without problems.

Here we show a simple example:

The Citlab Advanced Layout Analysis detected five “superfluous” text regions on the page and just as many “torn” baselines. In such a case you should first remove the redundant text regions with “Remove small text regions” and then start the automatic merge tool.
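As far as we can tell, the merge tool works with a geometric heuristic of this kind; the sketch below uses invented thresholds and coordinates, and the real tool's parameters are unknown to us.

```python
# Rough sketch of a merge heuristic for torn baselines: join two fragments
# when they lie at about the same height and the horizontal gap between
# them is small. Thresholds and coordinates are invented for illustration.

def maybe_merge(fragments, max_gap=40, max_dy=10):
    """fragments: list of (x_start, x_end, y), sorted left to right."""
    merged = [fragments[0]]
    for x0, x1, y in fragments[1:]:
        px0, px1, py = merged[-1]
        if x0 - px1 <= max_gap and abs(y - py) <= max_dy:
            merged[-1] = (px0, x1, (py + y) // 2)   # extend previous fragment
        else:
            merged.append((x0, x1, y))
    return merged

# Two torn pieces of one line, plus a separate line further down:
print(maybe_merge([(10, 200, 100), (215, 420, 102), (10, 150, 300)]))
# -> [(10, 420, 101), (10, 150, 300)]
```

A heuristic like this also shows why the check in the tips below matters: two fragments that happen to sit at the same height can belong to lines with a different reading order.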

Tips & Tools
Be careful with complicated layouts. You must always check the result of “Merge small base lines”, because baselines that do not belong together (from lines with a different reading order) are often merged.

Posted by Dirk Alvermann on

Remove small text lines

Release 1.12.0

Many of you probably know the tool “Remove small text regions”, which has been available in Transkribus for the last year. Now its little brother, “Remove small text lines”, is coming: finally, a tool that many users have been hoping for for a long time.

With the Citlab Advanced Layout Analysis it happens again and again (even on quite “normal” pages) that text regions or baselines are recognized where we don’t need or want them.

Often, “mini-baselines” are recognized in decorated initials or between individual lines. The HTR model, of course, can do nothing with these during text recognition, and the transcript will contain “empty” lines. With this tool you can easily and automatically delete such baselines.

Try it yourself. We have had the best experience when setting the threshold to 0.05.
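The principle behind the threshold can be sketched like this. We assume the threshold relates a baseline's length to a reference length such as the image width; the tool's exact definition may differ.

```python
# Sketch of the thresholding idea behind "Remove small text lines".
# Assumption: the threshold compares baseline length to the image width.

def remove_small_baselines(baselines, image_width, threshold=0.05):
    """Keep only baselines longer than threshold * image_width."""
    return [bl for bl in baselines if bl["length"] > threshold * image_width]

baselines = [
    {"length": 850},   # a real text line
    {"length": 30},    # "mini-baseline" inside a decorated initial
]
print(remove_small_baselines(baselines, image_width=1200))  # drops the short one
```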

Posted by Dirk Alvermann on

Automatic Selection of Validation Set

You can already find several posts in this blog about validation and the different ways to create a Validation Set.

Since the last version of Transkribus (1.12.0) there is a new way to create Validation Sets. While the training data is being compiled, Transkribus takes a certain share (2%, 5% or 10%) of the Ground Truth out of the Train Set and creates a Validation Set automatically. This set consists of randomly selected pages.

These Validation Sets are created in the Transkribus training tool. You start, as usual, by entering the training parameters of the model. But before you add the Ground Truth to the Train Set, you select the desired percentage for the Validation Set. This order is important: every time you then add a new document to the Train Set, Transkribus extracts the corresponding share of its pages for the Validation Set.
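The selection principle in miniature (a sketch, not Transkribus code): each time a document is added, a random share of its pages goes into the Validation Set instead of the Train Set.

```python
# Sketch of the automatic split: per added document, a random fraction
# of pages is diverted into the Validation Set.

import random

train_set, validation_set = [], []

def add_document(pages, val_fraction=0.10, seed=None):
    rng = random.Random(seed)
    pages = pages[:]
    rng.shuffle(pages)
    n_val = max(1, round(val_fraction * len(pages)))
    validation_set.extend(pages[:n_val])
    train_set.extend(pages[n_val:])

add_document([f"doc1_p{i}" for i in range(1, 21)], val_fraction=0.10, seed=0)
print(len(train_set), len(validation_set))  # 18 2
```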

The new tool is very well suited for large models with a lot of Ground Truth, especially if you don’t want to go to the trouble of setting up dedicated Validation Sets, or if you find it difficult to compile a representative one.

Posted by Dirk Alvermann on

HTR+ versus Pylaia part 2

Release 1.12.0

Some weeks ago we reported on our first experiences with PyLaia while training a generic model (600,000 words of GT).

Today we want to make another attempt at comparing PyLaia and HTR+. This time we have a larger model available (German_Kurrent_17th-18th; 1.8 million words of GT). The model was trained both as a PyLaia and as an HTR+ model, with identical ground truth and under the same conditions (from scratch).

Our hypothesis that PyLaia can show its advantages over HTR+ in larger generic models has been fully confirmed. In the case shown, PyLaia is superior to HTR+ in all respects. Both with and without the Language Model, the PyLaia model scored about one percentage point better (in CER) than HTR+ on all our test sets.

By the way, the performance of PyLaia on “curved” text lines has also improved significantly in recent weeks.

Posted by Dirk Alvermann on

Tag Export II

In the last post we presented one benefit of tags: as an example, we showed how results can be visualized by displaying place tags on a map. But there are other possibilities.

Tags cannot only be exported separately, for example in the form of an Excel table; some tags (place and person) are also written into the ALTO files. These files are, among other things, what allows the hits of the full-text search to be displayed in our viewer/presenter. To include the tags, simply select “Export ALTO (Split lines into words)” when exporting the METS files.
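If you want to pull the tags back out of the exported files yourself, a starting point might look like the sketch below. We assume the export writes NamedEntityTag entries in the ALTO Tags section and references them from String elements via TAGREFS, as the ALTO schema (v2.1 and later) allows; the exact layout of the Transkribus export may differ.

```python
# Hedged sketch: collect named-entity tags from an exported ALTO file,
# assuming <NamedEntityTag> entries in <Tags> referenced via TAGREFS.

import xml.etree.ElementTree as ET
from collections import defaultdict

def local(tag):
    """Strip the XML namespace so the sketch works for any ALTO version."""
    return tag.rsplit('}', 1)[-1]

def named_entities(alto_path):
    root = ET.parse(alto_path).getroot()
    labels = {}  # tag id -> label, e.g. "person" or "place"
    for el in root.iter():
        if local(el.tag) == "NamedEntityTag":
            labels[el.get("ID")] = el.get("LABEL")
    hits = defaultdict(list)
    for el in root.iter():
        if local(el.tag) == "String" and el.get("TAGREFS"):
            for ref in el.get("TAGREFS").split():
                if ref in labels:
                    hits[labels[ref]].append(el.get("CONTENT"))
    return dict(hits)

# print(named_entities("page_0001.xml"))  # e.g. {'place': ['Greifswald'], ...}
```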

In our presenter in the Digital Library Mecklenburg-Vorpommern, the tags are then displayed as “named entities”, separated into places and persons, for the respective document. The whole thing is still in an experimental phase and will be developed further in the near future, so that you will be able to jump directly to the corresponding places in the document via an actual tag cloud.

Posted by Dirk Alvermann on

New public model – German Kurrent 17th-18th

Today we are proud to present our second publicly available HTR model.

“German_Kurrent_17th-18th” is an HTR model for Kurrent scripts of the 17th and 18th centuries. For this model we used ground truth from our various larger and smaller projects of the last four years.

It is a generic model that includes material from about 2,000 individual writers’ hands. About 35% of the Ground Truth comes from 17th-century manuscripts and 50% from 18th-century manuscripts; the remaining 15% is spread over the decades before and after. The documents selected for training consist mainly of official minutes and legal documents, but also private records and letters of the period. In addition, a few contemporary prints (Fraktur), which appear in the records from time to time, were used in the training. The language of the texts is predominantly German and Latin; some Low German and French texts were also included.

Have fun trying it out. Please use the comments to let us know how the model works for you.