
Posted by Elisabeth Heigl on

Abbreviations

Release 1.9.1

Medieval and early modern manuscripts are usually full of abbreviations in all possible variations. These can be contractions (letters omitted within a word) and suspensions (letters omitted at the end of a word) as well as a wide variety of special characters. So if we want to transcribe old manuscripts, we must first decide how to reproduce the abbreviations: Do we reproduce everything as it appears in the text, do we resolve everything – or do we adapt to the capacities of the HTR?

Basically there are three different ways to deal with abbreviations in Transkribus:

– You can try to reproduce abbreviation characters as Unicode characters. Many of the abbreviation characters used in 15th- and 16th-century Latin and German manuscripts can be found in the Unicode block “Latin Extended-D”. For special characters in medieval Latin texts, check the Medieval Unicode Font Initiative. Whether and when this path makes sense depends entirely on the goals of your own project – in any case, it is quite laborious.

– If you don’t want to work with Unicode characters, you can also use the “basic letter” of the abbreviation from the regular alphabet – as in a literal transcription. Such a “placeholder” can then be provided with a textual tag that marks the word as an abbreviation (“abbrev”). How the tagged abbreviation is to be resolved can then be entered for each tag as its “expansion”.

The resolution of the abbreviation thus becomes part of the metadata. This approach offers the most possibilities for further use of the material, but it is also very laborious, because each and every abbreviation has to be tagged.
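To give an impression of what this metadata looks like in practice, here is a minimal sketch for evaluating such tags outside Transkribus. It assumes the custom-attribute syntax that Transkribus PAGE XML exports typically use (e.g. abbrev {offset:4; length:9; expansion:professores;}); the file name is hypothetical and the exact syntax may differ between versions:

```python
import re
import xml.etree.ElementTree as ET

# PAGE XML namespace commonly used by Transkribus exports (2013-07-15 schema)
NS = {"p": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

# Matches tag entries like: abbrev {offset:4; length:9; expansion:professores;}
ABBREV = re.compile(r"abbrev\s*\{([^}]*)\}")

def abbreviations(page_xml_path):
    """Yield (abbreviated form, expansion) pairs from one PAGE XML file."""
    root = ET.parse(page_xml_path).getroot()
    for line in root.iter(f"{{{NS['p']}}}TextLine"):
        text_el = line.find("p:TextEquiv/p:Unicode", NS)
        text = (text_el.text or "") if text_el is not None else ""
        for match in ABBREV.finditer(line.get("custom", "")):
            # Parse the "key:value;" pairs inside the curly braces
            props = dict(part.strip().split(":", 1)
                         for part in match.group(1).split(";") if ":" in part)
            offset = int(props.get("offset", 0))
            length = int(props.get("length", 0))
            yield text[offset:offset + length], props.get("expansion", "").strip()

# Hypothetical export path – adjust to your own collection
for short_form, expansion in abbreviations("page/0001.xml"):
    print(f"{short_form} -> {expansion}")
```

A list built this way can later serve, for example, as a glossary of the abbreviations in your material or as a check that expansions were entered consistently.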

– Or you simply resolve the abbreviations. If you want to make large quantities of full text searchable, as we do, it makes sense to resolve the abbreviations consistently, because that makes the search easier: who searches for “pfessores” instead of “professores”? Our experience is that the HTR can handle abbreviations quite well – the classic Latin and German abbreviations as well as currency signs and other special characters. This is why we resolve most abbreviations during transcription and use them as part of the Ground Truth in HTR training.

The models we train have learned some abbreviations very well. The abbreviations frequently used in the manuscripts, such as the suffix “-en”, can be resolved by an HTR model – if it has been taught consistently.

But more complex abbreviations, especially the contractions, do cause difficulties for the HTR. In our project we have therefore decided to reproduce such abbreviations only in literal form.

In our Collection of Abbreviations we present the many different abbreviations that we find in our material from the 16th to 18th century. We also show how we (and later the HTR models) resolve them. This Collection will be updated by us from time to time – work in progress!

Posted by Dirk Alvermann on

dictionaries

Release 1.7.1

HTR does not require dictionaries. However, they are available and can be selected when you perform full-text recognition.

With each HTR training, a dictionary can be generated out of the GT in the training set. It is therefore possible to create a suitable dictionary for each model or for the type of text you are working with.

However, dictionaries are rarely used in Transkribus. In our project they are sometimes used at the beginning of the work on new models. As long as the model to be improved still has a CER of more than 8%, correcting the texts recognized by the HTR is very time-consuming. If a dictionary is used at this point, the CER can often be reduced to 5%. Once the model has a CER below 8%, however, using a dictionary is counterproductive, because the reading result often becomes worse again: the HTR then replaces its own reading result, “against better knowledge”, with a recommendation from the dictionary.

We use dictionaries only to support very weak models, mostly to help the transcriber with particularly difficult scripts. For example, we used a dictionary to create the GT for the really hard-to-read concept writings. Of course, the results had to be corrected in every case, but the “reading recommendations” generated by the HTR with the dictionary were a great help. As soon as our model was able to recognize concept writings with less than 8% CER, we stopped using the dictionary.

Posted by Dirk Alvermann on

Languages

HTR does not require dictionaries and works regardless of the language in which a text is written – as long as it uses the character system the model is trained for.

For the training strategy in our project, this means that we do not distinguish between Latin and German texts, or between Low German and High German texts, when selecting the training material. So far, we have not found any serious differences in the quality of the HTR results between texts in these languages.

This observation is important for historical manuscripts from German-speaking regions. Usually, the language used within a document also affects the script: most writers of the 16th to 18th centuries switch in mid-text from Kurrent to Antiqua when they change from German to Latin. Unlike OCR, where the mixed use of Gothic and Antiqua typefaces in modern printing causes great difficulty, HTR – if it is trained for it – has no problem with this change.

A very typical case in our material, here with a comparison of the HTR result and the GT, can illustrate this. The error rates in the linguistically different sections of the page are quite comparable. The models used were Spruchakten_M_2-8 and M_3-1. The first is a generic model, the second is specialized for writings from 1583 to 1627.

Posted by Anna Brandt on

layout tab

Release 1.7.1

If you correct the automatic layout analysis, you can either do this directly in the image or navigate via the layout tab on the left side. The tab lists all shapes, such as the text regions and the baselines, with their position in the image and their structure tags. Shapes can be deleted or moved there. In the image you can always see where you are at the moment, i.e. which element is currently selected – and thus what you can change.

If you want to merge two baselines, just mark them in the layout tab instead of trying to hit the thin line in the image.

The navigation in the tab is especially useful if you want to see the complete image in the right window. This way you keep a better overview, because everything you select or change is marked simultaneously in the image and in the tab.

Tips & Tools
You can change the reading order of the baselines in the layout tab either by moving the lines or by clicking and changing the number in the column “Reading Order”.

Posted by Dirk Alvermann on

mixed layouts

Release 1.7.1

The CITlab Advanced Layout Analysis handles most “ordinary” layouts well – in 90% of the cases. Let’s talk about the remaining 10%.

We already discussed how to proceed in order to avoid trouble with the Reading Order. But what happens if we have to deal with really mixed – crazy – layouts, e.g. concept writings?

With complicated layouts, you’ll quickly notice that the manually drawn TRs overlap. That’s not good, because automatic line detection doesn’t work reliably in overlapping text regions. The problem is easily solved, since TRs can have shapes other than rectangles: they can be drawn as polygons and can therefore be separated cleanly from each other.

If there are many text regions, it makes sense to add structure tags in order to distinguish them better. The tags can also be used to assign the regions to certain routines during later processing. This is a small effort with great benefit, because structure tagging is no more complex than tagging in context.

Tips & Tools
Automatic line detection can be a real challenge here. Sections where you can already predict (with a little experience) that it won’t succeed are best handled manually. For automatic line detection, CITlab Advanced should be configured so that the default setting is replaced by “heterogeneous”. The LA will then take horizontal as well as vertical, skewed and oblique lines into account. This takes a little longer, but the result is better.

If such complicated layouts are a persistent characteristic of your material, then it is worth setting up a P2PaLA training. This creates a layout analysis model of your own that is tailored to the specific challenges of your material. By the way, structure tagging is the basic requirement for such a training.

Posted by Elisabeth Heigl on

Generic vs. specialized model

Release 1.7.1

Did you notice in the graph of the model development that the character error rate (CER) of the last model got slightly worse again, even though we had significantly increased the GT input? We had about 43,000 more words in training, but the average CER deteriorated from 2.79% to 3.43%. We couldn’t really explain that.

At this point, more and more GT no longer got us any further, so we had to change our training strategy. Until then we had trained large models, with writings spanning a total period of 70 years and more than 500 writers.

Our first suspicion fell on the concept writings, which we already knew caused problems for the machine (LA and HTR) – just as they do for us. In the next training we excluded these concept writings and trained exclusively with “clean” office writings. But that didn’t lead to a noticeable improvement: the test set CER dropped from 3.43% to just 3.31%.

In the following trainings, we additionally focused on a chronological sequencing of the models. We split our material and created two different models: Spruchakten_M_3-1 (Spruchakten 1583-1627) and Spruchakten_M_4-1 (Spruchakten 1627-1653).

With these new specialized models we actually achieved an improvement of the HTR where the generic model was no longer sufficient. In the test sets, several pages showed an error rate of less than 2%. For model M_4-1, the CERs of many single pages remained below 1%, and two pages even reached 0%, i.e. were entirely free of errors.

Whether a generic or a specialized model will help and produce better results depends a lot on the size and composition of the material. In the beginning, when you are keen to progress as quickly as possible (the more, the better), a generic model is useful. However, once it reaches its limits, you shouldn’t “overburden” the HTR any further, but specialize your models instead.

Posted by Dirk Alvermann on

How to create test sets and why they are important, #2

Release 1.7.1

What is the best procedure for creating test sets?
In the end, everyone can find their own way. In our project, the pages for the test sets are already selected during the creation of the GT. They receive a special edit status (“Final”) and are later collected in separate documents. This ensures that they will not accidentally end up in training. Whenever new GT is created for future training, the material for the test set is extended at the same time, so both sets grow in proportion.

For the systematic training we create several documents, which we call “test sets” and which each relate to a single Spruchakte (one year). For example, we create a “test set 1594” for the document of the Spruchakte 1594. There we place representatively selected images, which should reflect the variety of writers as exactly as possible. In the “mother document” we mark the pages selected for the test set as “Final” to make sure that they will not be edited there in the future. We have not created a separate test set for every single record or year, but have proceeded in five-year steps.

Since a model is often trained over many rounds, this procedure also has the advantage that the test set always remains representative. The CERs of the different versions of a model can therefore always be compared and observed during development, because the test is always executed on the same (or extended) set. This makes it easier to evaluate the progress of a model and to adjust the training strategy accordingly.

Transkribus also stores the test set used for each training session separately in the collection concerned, so you can always fall back on it.
It is also possible to select a test set just before the training and simply assign individual pages of the documents from the training material to it. This may be a quick and pragmatic solution for the individual case, but it is not suitable for the planned development of powerful models.

Posted by Dirk Alvermann on

How to create test sets and why they are important, #1

Release 1.7.1

If we want to know how much a model has learned in training, we have to test it. We do this with precisely defined test sets. Test sets – like the training set – contain exclusively Ground Truth. However, we make sure that this GT has never been used to train the model, so the model does not “know” this material. This is the most important characteristic of test sets: a text page that has already been used as training material will always be read better by the model than one it is not yet “familiar” with. This can easily be shown experimentally. So if you want valid statements about CER and WER, you need “uncorrupted” test sets.

It is also important that a test set is representative. As long as you train an HTR model for a single writer or an individual handwriting, it’s not difficult – after all, it’s always the same hand. As soon as there are several writers involved, you have to make sure that all the individual handwritings used in the training material are also included in the test set. The more different handwritings are trained in a model, the larger the test sets will be.

The size of the test set is another factor that influences representativity. As a rule, a test set should contain 5-10% of the training material. However, this rule of thumb should always be adapted to the specific requirements of the material and the training objectives.
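To make the two rules of thumb concrete – representativity per writer and a share of 5-10% – here is a small sketch of how such a selection could be scripted. This is not how Transkribus does it internally (there you work with documents and the edit status); the pages structure and all names are assumptions for illustration:

```python
import random
from collections import defaultdict

def split_train_test(pages, share=0.1, seed=42):
    """pages: list of (page_id, writer_id) pairs from the Ground Truth.
    Reserves roughly `share` of each writer's pages for the test set,
    so that every hand in the training material is also represented there."""
    by_writer = defaultdict(list)
    for page_id, writer in pages:
        by_writer[writer].append(page_id)

    rng = random.Random(seed)           # fixed seed: reproducible selection
    test = set()
    for writer_pages in by_writer.values():
        k = max(1, round(share * len(writer_pages)))  # at least one page per hand
        test.update(rng.sample(writer_pages, k))

    train = {page_id for page_id, _ in pages} - test
    assert not train & test, "a page must never end up in both sets"
    return sorted(train), sorted(test)
```

Sampling per writer rather than across the whole set is what keeps the test set representative: a hand with many pages contributes more test pages, but no hand is left out.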

To illustrate this with two examples: our model for the Spruchakten from 1580 to 1627 was trained with a training set of almost 200,000 words; the test set contains 44,000 words. This is of course a very high proportion of about 20%. It is due to the fact that material from about 300 different writers was trained in this model, all of whom must also be represented in the test set. In our model for the judges’ opinions of the Wismar Tribunal, there are about 46,000 words in the training set, while the test set contains only 2,500 words, i.e. a share of about 5%. However, here we are dealing with only 5 different writers, so this material is sufficient for a representative test set.

Posted by Dirk Alvermann on

Word Error Rate & Character Error Rate – How to evaluate a model

Release 1.7.1

The Word Error Rate (WER) and the Character Error Rate (CER) indicate how much of a handwritten text the applied HTR model did not read correctly. A CER of 10% means that every tenth character (and these are not only letters, but also punctuation marks, spaces, etc.) was not correctly identified. The accuracy rate would therefore be 90%. A good HTR model should recognize 95% of a handwriting correctly, i.e. have a CER of no more than 5%. This is roughly the value achieved today with “dirty” OCR for Fraktur typefaces. Incidentally, an accuracy rate of 95% also corresponds to the expectations formulated in the DFG’s Practical Rules on Digitization.

Even with a good CER, the word error rate can be high. The WER shows how accurately the words in the text are reproduced. As a rule, the WER is three to four times higher than the CER and roughly proportional to it. The WER is not particularly meaningful for the quality of the model because, unlike characters, words have different lengths and do not allow a clear comparison (a word already counts as incorrectly recognized if just one letter in it is wrong). That is why the WER is rarely used to characterize the quality of a model.

The WER, however, is revealing in one important respect: if I perform text recognition with the aim of later running a full-text search in my document, the WER shows me the success rate I can expect in my search. The search looks for words or parts of words. So no matter how good my CER is: with a WER of 10%, potentially every tenth search term cannot be found.

Tips & Tools
The easiest way to display the CER and WER is to use the Compare function under Tools. Here you can compare one or more pages of a Ground Truth version with an HTR text to estimate the quality of the model.
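If you want to understand what the Compare function calculates, both metrics are easy to reproduce: the edit distance between GT and HTR text, divided by the length of the GT. A minimal sketch with invented example strings (not Transkribus code):

```python
def levenshtein(ref, hyp):
    """Minimum number of single-element edits that turn ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(gt, htr):
    return levenshtein(gt, htr) / len(gt)                           # per character

def wer(gt, htr):
    return levenshtein(gt.split(), htr.split()) / len(gt.split())   # per word

gt  = "die professores der universitet"
htr = "die pfessores der universitet"
print(f"CER: {cer(gt, htr):.1%}")   # 2 of 31 characters wrong -> 6.5%
print(f"WER: {wer(gt, htr):.1%}")   # 1 of 4 words wrong -> 25.0%
```

Note how a single misread word drives the WER to 25% while the CER stays low – exactly the ratio of roughly one to four described above.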

Posted by Elisabeth Heigl on

The more, the better – how to generate more and more GT?

Release 1.7.1

To make sure that the model can reproduce the content of the handwriting as accurately as possible, learning requires a lot of Ground Truth; the more, the better. But how do you get as much GT as possible?

It takes some time to produce a lot of GT. At the beginning of our project, when no models were available yet, it took us one hour to transcribe 1 to 2 pages – an average of 150 to 350 words per hour.

Five months later, however, we had almost 250,000 words in training. We neither had a legion of transcribers, nor did one person have to write GT day and night. It was the exponential improvement of the models themselves that enabled us to produce more and more GT:

The more GT you invest, the better your model will be. The better your model reads, the easier it becomes to write GT: you no longer have to write everything yourself, you just correct the HTR. With models that have an average error rate of less than 8%, we produced about 6 pages of GT per hour.

The better the model reads, the more GT can be produced and the more GT there is, the better the model will be. What is the opposite of a vicious circle?