Posted by Anna Brandt on

Region Grouping

Since version 1.14.0 there is a new option for configuring the layout analysis. It is called ‘Region grouping’ and controls how the text regions (TRs) are arranged. You can now choose whether text regions are grouped around baselines or whether all lines are placed in a single TR.

With the first setting, many small TRs can appear at the edge of the image or in the middle of it, even if there is actually only one text block. This problem can be solved in a further step with the ‘Remove small text regions’ tool.

With the second setting, all baselines (BL) really do end up in this one text region, even those that would otherwise be marginal, and even vertical ones. As long as ‘Heterogeneous’ is selected for ‘Text orientation’, the layout analysis (LA) also recognizes the vertical lines in the same TR as the horizontal ones. You can still see that the LA would normally have recognized multiple TRs: the reading order (RO) of the lines is still divided as if they were in their own text regions. The main paragraph is usually TR 1, so the RO starts there. The other baselines are placed at the end, even if they are at the side of the main text and could therefore be placed between its lines.

To decide which setting works better for you, you have to try it out. For pages with only one text block, the second setting is of course advantageous, because none of the small TRs appear. You may also have to choose different settings within one document.

Posted by Dirk Alvermann on

Exclude single lines from training

Some of you will know this from practical experience: you are transcribing a particularly difficult page in Transkribus and you cannot decipher everything with the best will in the world. What do you do? If the page is set to the edit status “ground truth”, the transcription goes into training with the obvious errors (or what you could not read). This is not what is intended. But you don’t want to simply “throw away” the page either.

We have already mentioned the use of tags in another post. We used the “unclear” tag in the project from the beginning. Others also like to use the “gap” tag for such reading problems.

This is now proving to be a great advantage. For some months now, the Transkribus training tool has offered the function “omit lines at tag”.

The tool ensures that on all pages that are taken into the training or validation set, the lines that have a tag “unclear” or “gap” are automatically excluded from the training. This means that pages that are not perfectly transcribed, but where the parts that could not be deciphered are marked by tags, can be trained without hesitation.
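The filtering rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Transkribus implements it: the `Line` class, the `lines_for_training` helper and the sample lines are invented for this example.

```python
from dataclasses import dataclass

# Hypothetical in-memory model of a transcribed line (not a Transkribus format).
@dataclass
class Line:
    text: str
    tags: frozenset[str] = frozenset()

# Tags that exclude a line from training, as named in the post.
OMIT_TAGS = frozenset({"unclear", "gap"})

def lines_for_training(lines, omit_tags=OMIT_TAGS):
    """Keep only the lines that carry none of the omit tags."""
    return [line for line in lines if not (line.tags & omit_tags)]

# Invented sample page: two clean lines, one "unclear" line, one "gap" line.
page = [
    Line("first line, fully readable"),
    Line("second line, partly guessed", frozenset({"unclear"})),
    Line("third line with a person tag", frozenset({"person"})),
    Line("fourth line with a hole", frozenset({"gap"})),
]

kept = lines_for_training(page)
print([line.text for line in kept])
```

Lines with other tags (here “person”) stay in the training data. Only the tagged lines are dropped, so the rest of the page still counts.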

Posted by Anna Brandt on

Undo Job

Version 1.14.0

Since version 1.12. there has been a practical tool for correcting careless mistakes. It has certainly happened to some of you that you started a large job in a document and then realized that the parameters were set incorrectly or that you did not want to run this job at all. This could be a layout analysis or an HTR with the wrong model. To fix such errors quickly and easily, especially if they affect several pages, the function ‘Undo Job’ was added to the job list window. With it you can delete a whole job that has gone wrong.

If, for example, a layout analysis has run on pages that were already finished because you forgot to set the checkbox to ‘Current Page’ (a mistake that happens often), you don’t have to select each page individually and delete the wrong version. You can simply undo the whole job with this function.

This only works on pages where the job created the latest version. If another version is the latest one, Transkribus will tell you so and the job will not be undone on that page. On the pages where the job’s version is the latest, it will be deleted. This means that you can continue working first and then delete the version created by the wrong job on the pages where it should not have run (e.g. GT), while it remains on the pages you have continued working on.
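The rule “undo only where the job created the latest version” can be sketched as follows. This is a simplified model under our own assumptions, not the Transkribus implementation: the `Page` class, the `undo_job` function and the version labels are invented for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical model of a page and its version history (oldest first).
# Each entry names the job or manual save that created the version.
@dataclass
class Page:
    number: int
    versions: list[str] = field(default_factory=list)

def undo_job(pages, job_id):
    """Delete the version created by job_id wherever it is the latest one.

    Returns two lists of page numbers: pages where the version was
    deleted, and pages where it was kept because a newer version exists.
    """
    undone, skipped = [], []
    for page in pages:
        if page.versions and page.versions[-1] == job_id:
            page.versions.pop()
            undone.append(page.number)
        elif job_id in page.versions:
            skipped.append(page.number)
    return undone, skipped

pages = [
    Page(1, ["manual-gt", "la-job-7"]),    # job ran over a finished GT page
    Page(2, ["la-job-7", "manual-edit"]),  # work continued after the job
    Page(3, ["manual-gt"]),                # job never touched this page
]
undone, skipped = undo_job(pages, "la-job-7")
print(undone, skipped)  # [1] [2]
```

Page 1 gets its ground truth back as the latest version, while page 2 keeps everything because a newer version sits on top of the job’s result.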

Tips & Tools
1) Even if the job is deleted on all pages, it does not disappear from the list of executed jobs. So you should always check one or two pages again to be sure.
2) It works only if you are in the document where the job was executed.

Posted by Dirk Alvermann on

Merge small Base Lines

This tool is – like “Remove small text lines” – distributed with version 1.12.0 of Transkribus. The idea behind it is very interesting.

Maybe you have had problems with “torn” lines in the automatic line detection (Citlab Advanced Layout Analysis). We have mentioned in an earlier post how annoying this problem can be.

So the expectations for such a great thing were of course high. But after a short time we realized that its use needs some practice and that it cannot be used everywhere without problems.

Here we show a simple example:

The Citlab Advanced Layout Analysis detected five “superfluous” text regions on the page and just as many “torn” base lines. In such a case you should first remove the redundant text regions with “remove small text regions” and then start the automatic merge tool.

Tips & Tools
Be careful with complicated layouts. You must always check the result of “merge small text lines”, because often base lines are merged that do not belong together (from lines with different reading order).

Posted by Dirk Alvermann on

Remove small text lines

Release 1.12.0

Many of you probably know the tool “Remove small text regions”, which has been available in Transkribus for the last year. Now its little brother, “Remove small text lines”, has arrived: a tool that many users have long been hoping for.

With the Citlab Advanced Layout Analysis it happens again and again (even on quite “normal” pages) that text regions or baselines are recognized where we don’t need or want them.

Often “mini-baselines” are recognized in decorated initials or between the individual lines. The HTR model of course can’t do anything with these during text recognition and the transcript will contain “empty” lines. With this tool you can easily and automatically delete these baselines.

Try it yourself. We have had the best experience with this if we set the threshold to 0.05.

Posted by Dirk Alvermann on

Automatic Selection of Validation Set

You can already find some posts in this blog about validation and the different ways to create a Validation Set.

Since the last version of Transkribus (1.12.0) there is a new way to create Validation Sets. Transkribus takes a certain amount (2%, 5% or 10%) of the Ground Truth from the Train Set during the compilation of the training data and creates a Validation Set automatically. This set consists of randomly selected pages.

These Validation Sets are created in the Transkribus training tool. You start as usual by entering the training parameters of the model. But before you add the Ground Truth to the Train Set, you select the desired percentage for the Validation Set. This order is important. Every time you add a new document to the Train Set, Transkribus will extract the corresponding pages for the Validation Set.
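The selection can be pictured as a random split that is repeated for every document you add. The sketch below only illustrates this idea. How Transkribus rounds the page count or draws the pages is not documented in this post, so the rounding, the seeded random generator and all names (`split_document`, `doc_a`, `doc_b`) are our assumptions.

```python
import random

# The percentages offered in the training tool, as described in the post.
ALLOWED_PERCENTAGES = (2, 5, 10)

def split_document(pages, percent, rng):
    """Randomly move `percent` of a document's pages to the validation set.

    Assumption: the number of validation pages is rounded to the nearest
    integer; the real tool may round differently.
    """
    if percent not in ALLOWED_PERCENTAGES:
        raise ValueError(f"percent must be one of {ALLOWED_PERCENTAGES}")
    n_val = round(len(pages) * percent / 100)
    val_idx = set(rng.sample(range(len(pages)), n_val))
    train = [p for i, p in enumerate(pages) if i not in val_idx]
    validation = [p for i, p in enumerate(pages) if i in val_idx]
    return train, validation

rng = random.Random(42)  # fixed seed only to make the example repeatable
train_set, validation_set = [], []

# Each added document contributes its own share of validation pages.
for doc_name, n_pages in [("doc_a", 50), ("doc_b", 120)]:
    pages = [f"{doc_name}/p{i}" for i in range(1, n_pages + 1)]
    train, val = split_document(pages, 10, rng)
    train_set += train
    validation_set += val

print(len(train_set), len(validation_set))  # 153 17
```

Because the split happens per document, every document in the Train Set is represented in the Validation Set in proportion to its size.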

The new tool is very well suited for large models with a lot of ground truth, especially if you don’t care about setting up special Validation Sets or if you find it difficult to put together representative ones.

Posted by Dirk Alvermann on

HTR+ versus PyLaia part 2

Release 1.12.0

Some weeks ago we reported on our first experiences with PyLaia while training a generic model (600,000 words GT).

Today we want to make another attempt to compare PyLaia and HTR+. This time we have a larger model (German_Kurrent_17th-18th; 1.8 million words GT) available. The model was trained both as a PyLaia and as an HTR+ model, with identical ground truth and under the same conditions (from scratch).

Our hypothesis that PyLaia can show its advantages over HTR+ in larger generic models has been fully confirmed. In the case shown PyLaia is superior to HTR+ in all aspects. Both with and without the Language Model, the PyLaia model scored about one percentage point (in the CER) better than HTR+ on all our test sets.
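For readers who want to see what is behind the CER figure: the character error rate is commonly computed as the edit distance between the recognized text and the ground truth, divided by the length of the ground truth. The following is a minimal sketch of that standard definition, not the code Transkribus uses for its own evaluation. The function names and the sample strings are ours.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions and substitutions
    needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of a recognized text against its ground truth."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)

# Invented example: one wrong character in an 11-character word.
print(f"{cer('Transkribus', 'Transkrlbus'):.1%}")  # 9.1%
```

A difference of one percentage point in the CER therefore means roughly one character error more or less per 100 characters of ground truth.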

By the way, in the last weeks the performance of PyLaia for “curved” textlines has also improved significantly.

Posted by Dirk Alvermann on

Tag Export II

In the last post we presented a benefit of tags. As an example we showed the visualization of results by displaying place tags on a map. But there are other possibilities.

Tags can not only be exported separately, for example as an Excel table. Some tags (place or person) are also written to the ALTO files. Among other things, these files make it possible to display the hits of the full text search in our viewer/presenter. To include the tags, simply select “Export ALTO (Split lines into words)” when exporting the METS files.

In our presenter in the Digital Library Mecklenburg-Vorpommern the tags are then displayed as “named entities” separated by places and persons for the respective document. The whole thing is still in the experimental phase and will be further developed in the near future so that you can jump directly to the corresponding places in the document via an actual tag cloud.

Posted by Dirk Alvermann on

Tag Exports I

Once you have taken the trouble to tag one or more documents, there are several ways to use this “added value” outside of Transkribus. The tags can be easily exported as an Excel spreadsheet using Transkribus’ export tool.

From there you have many options. We conducted our “tagging experiment” to see if this would be a good way to visualize the geographical distribution of our documents. At the same time, the map was meant to provide access to the digitized documents in the presenter.

All in all we are satisfied with the result of the experiment. You can select specific years or periods of time, search for locations and use the points on the map to access the documents.

In the end, however, the effort for this kind of tagging has proven to be so high that we cannot afford it within the scope of this project. But there are other ways to use tags in export, which we will write about in the next post.

Posted by Dirk Alvermann on

HTR+ versus PyLaia

Version 1.12.0

As you have probably noticed, since last summer a second recognition engine, PyLaia, has been available in Transkribus alongside HTR+.

We have been experimenting with PyLaia models in the last weeks and would like to document our first impressions and experiences of the different aspects of HTR+ and PyLaia. First of all, from an economic point of view PyLaia is definitely a little bit cheaper than HTR+, as you can see on the price list of the Read Coop. Does cheaper also mean worse? Definitely not! In terms of accuracy PyLaia can easily compete with HTR+. It is often slightly better. The following graphic compares an HTR+ and a PyLaia model trained with identical ground truth (approx. 600,000 words) under the same conditions (from scratch). The performance with and without the Language Model is compared.

Perhaps the most notable difference is that the results of PyLaia models cannot be improved as much by using a Language Model as is the case with HTR+. This is not necessarily a disadvantage, but rather indicates a high basic reliability of these models. In other words: PyLaia does not necessarily need a Language Model to achieve very good results.

There is also an area where PyLaia performs worse than HTR+. PyLaia has more difficulty reading “curved” lines correctly. For vertical text lines the result is even worse.

In training PyLaia is a bit slower than HTR+, which means that the training takes longer. On the other hand, PyLaia is much faster in “acceleration”: it needs relatively few training epochs or iterations to achieve good results. You can compare this quite well with the two learning curves.

Our observations are of course not exhaustive. So far they only refer to generic models that have been trained with a large amount of ground truth. Overall we have the impression that PyLaia can fully exploit its advantages with such large generic models.