Monthly Archives

3 Articles

Posted by Dirk Alvermann on

dictionaries

Release 1.7.1

HTR does not require dictionaries. However, they are also available and can be selected if you perform full-text recognition.

With each HTR training, a dictionary can be generated out of the GT in the training set. It is therefore possible to create a suitable dictionary for each model or for the type of text you are working with.

However, dictionaries are rarely used in Transkribus. In our project they are sometimes used at the beginning of the work on new models. As long as the model to be improved still has a CER of more than 8%, correcting the texts recognized by the HTR is very time-consuming. If a dictionary is used at this point, the CER can often be reduced to 5%. If the model already has a CER below 8%, the use of dictionaries is counterproductive because the reading result often becomes worse again. In such cases, the HTR “contrary to better knowledge” replaces its own reading result with a recommendation from the dictionary.

We use dictionaries just to support very weak models. And we do this rather to help the transcriber with particularly difficult writings. So we used a dictionary to create the GT for the really hard to read concept writings. Of course, the results had to be corrected in every case. But the “reading recommendations”, which were based on the HTR with dictionary, were a good help. As soon as our model was able to recognize concept writings with less than 8% CER, we decided not to use the dictionary any longer.

Posted by Dirk Alvermann on

Languages

HTR does not require dictionaries and works regardless of the language in which a text is written – as long as it uses the character system the model is trained for.

For the training strategy in our project, this means that we do not distinguish between Latin and German texts or Low German and High German texts when selecting the training material. So far, we have not found any serious differences in the quality of the HTR results between texts in both languages.

This observation is important for historical manuscripts from the German-speaking countries. Usually, the language used within a document also affects the script. Most writers of the 16th to 18th centuries, when they switch from German to Latin, change in the middle of the text from Kurrent to Antiqua. In contrast to OCR, where the mixed use of gothic and antiqua typefaces in modern printing is very difficult, HTR – if it is trained for it – has no problem with this change.

A very typical case in our material, here with a comparison of the  HTR result and GT, can illustrate the problem. The error rate in the linguistically different text sections of the page is quite comparable. The models used were the Spruchakten_M 2-8 and M 3-1. The first is an generic model, the second is specialized for writings from 1583 to 1627.

Posted by Anna Brandt on

layout tab

Release 1.7.1

If you correct the automatic layout analysis, you can do this directly in the image or navigate via the layout tab on the left side. There are all shapes, like the textregions and the baselines, displayed with their position in the image and their structural tags. It is possible to delete or move shapes. In the image you can always see where you are at the moment, which element is currently marked – thus what you can change.

If you want to merge two baselines, just mark them in the layout tab instead of trying to hit the thin line in the image.

The navigation in the tab is especially useful if you want to see the complete image in the right window. This way you keep a better overview, because everything in the image and in the tab will be changed at the same time.

Tips & Tools
You can change the reading order of the baselines in the layout tab either by moving the lines or by clicking and changing the number in the column “Reading Order”.