
Posted by Dirk Alvermann on

Word Error Rate & Character Error Rate – How to evaluate a model

Release 1.7.1

The Word Error Rate (WER) and Character Error Rate (CER) indicate what proportion of the text in a handwriting the applied HTR model did not read correctly. A CER of 10% means that every tenth character (and these are not only letters, but also punctuation marks, spaces, etc.) was not correctly identified. The accuracy rate would therefore be 90%. A good HTR model should recognize 95% of a handwriting correctly, i.e. have a CER of no more than 5%. This is roughly the value achieved today with “dirty” OCR for Fraktur typefaces. Incidentally, an accuracy rate of 95% also corresponds to the expectations formulated in the DFG’s Practical Guidelines on Digitisation.

Even with a good CER, the word error rate can be high. The WER shows how well the words in the text are reproduced exactly. As a rule, the WER is three to four times higher than the CER and roughly proportional to it. The WER itself is not particularly meaningful for judging the quality of a model because, unlike characters, words have different lengths and do not allow a clear comparison (a word already counts as incorrectly recognized if a single letter in it is wrong). That is why the WER is rarely used to characterize the quality of a model.

The WER, however, points to an important aspect. If I perform text recognition with the aim of later running a full-text search over my document, the WER shows me the success rate I can expect in that search, because the search targets words or parts of words. So no matter how good my CER is: with a WER of 10%, potentially every tenth search term cannot be found.
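To make these measures tangible, here is a minimal sketch of how CER and WER can be computed from a Ground Truth reference and an HTR hypothesis. It uses a plain Levenshtein edit distance; the exact normalization that Transkribus applies in its Compare function may differ, so treat this as an illustration of the principle, not a re-implementation.

```python
# Minimal sketch: CER and WER from Ground Truth and HTR output,
# based on a plain Levenshtein edit distance.

def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    return levenshtein(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)

gt  = "the sentence was read correctly"
htr = "the sentense was read corectly"
print(f"CER: {cer(gt, htr):.1%}")  # 2 of 31 characters wrong -> ~6.5%
print(f"WER: {wer(gt, htr):.1%}")  # 2 of 5 words wrong -> 40.0%
```

The toy example also shows why the WER is always the higher number: a single wrong letter spoils the whole word.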

Tips & Tools
The easiest way to display the CER and WER is to use the Compare function under Tools. Here you can compare one or more pages of a Ground Truth version with an HTR text to estimate the quality of the model.

Posted by Elisabeth Heigl on

The more, the better – how to generate more and more GT?

Release 1.7.1

For the model to learn to reproduce the content of the handwriting as accurately as possible, it needs a lot of Ground Truth; the more, the better. But how do you get as much GT as possible?

Producing a lot of GT takes time. At the beginning of our project, when we had no models available yet, it took us one hour to transcribe one to two pages – an average of 150 to 350 words per hour.

Five months later, however, we had almost 250,000 words in training. We neither had a legion of transcribers, nor did one person write GT day and night. It was the steady improvement of the models themselves that enabled us to produce more and more GT:

The more GT you invest, the better your model will be. The better your model reads, the easier it becomes to produce GT: you no longer have to transcribe everything yourself, you just correct the HTR output. With models that have an average error rate of less than 8%, we produced about six pages of GT per hour.
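A rough back-of-the-envelope comparison of what this means in throughput, assuming about 175 words per page (a figure derived from the numbers above; real pages vary considerably):

```python
# Rough throughput comparison based on the figures in this post.
# Words per page is an assumption derived from "1-2 pages = 150-350 words".
words_per_page = 175
from_scratch = (150 + 350) / 2      # words of GT per hour, transcribing manually
correcting   = 6 * words_per_page   # words of GT per hour, correcting HTR output
print(f"from scratch:   ~{from_scratch:.0f} words/h")
print(f"correcting HTR: ~{correcting:.0f} words/h (~{correcting / from_scratch:.1f}x faster)")
```

Under these assumptions, correcting a decent model’s output is roughly four times faster than transcribing from scratch.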

The better the model reads, the more GT can be produced; and the more GT there is, the better the model becomes. What is the opposite of a vicious circle?

Posted by Anna Brandt on

Collaboration – Version Management

Release 1.7.1

The second important element of organized collaboration is the version management of Transkribus. It looks rather inconspicuous in the toolbar, but it is enormously important. Each time a page is saved, Transkribus stores a version of it, containing the current status of the layout work and content processing.

These versions are provided with an “edit status” so that they can be distinguished more easily. A newly uploaded document contains only pages with the edit status “new”. As soon as you edit a page, the edit status automatically changes to “in progress”. The three other status options – “done”, “final” and “Ground Truth” – can only be set manually.

When to set such a “higher” status depends on the agreements within the team. We use version management mostly during the production of training material, i.e. Ground Truth. All pages with a finished layout analysis are set to “done”, so that the transcribers and editors know that these pages can now be completed. This status is not changed until the page has a fully reliable transcription; then it is set to “Ground Truth” or “final”. All pages with the status “GT” are later used as training material for HTR models, while pages with the edit status “final” are used to create the test sets.

Every collaborator can access, edit, or delete all versions of a page at any time. The edit status helps them find the desired version faster. In addition to the edit status, the last editor and the time of saving are displayed for each version. If a version was produced by an automatic process (layout analysis or HTR), this is also noted. The processing steps are thus traceable in detail.

Tips & Tools
You can have multiple versions with the same status.
You can set any version to any other status – except to “New”.
You can delete single or multiple versions – except final versions, which cannot be deleted.
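These rules can be summed up in a small state model. The following sketch only illustrates the conventions described in this post; the class and method names are invented and are not part of any Transkribus API.

```python
# Hypothetical model of the edit-status rules described above;
# the status names come from the post, the class itself is invented.
STATUSES = {"New", "In Progress", "Done", "Final", "Ground Truth"}

class PageVersion:
    def __init__(self):
        self.status = "New"          # only freshly uploaded pages carry "New"

    def set_status(self, new_status):
        if new_status == "New":
            raise ValueError('versions cannot be set back to "New"')
        if new_status not in STATUSES:
            raise ValueError(f"unknown status: {new_status}")
        self.status = new_status

    def can_delete(self):
        return self.status != "Final"   # final versions cannot be deleted

v = PageVersion()
v.set_status("In Progress")    # happens automatically on the first edit
v.set_status("Ground Truth")   # a manual decision by the team
print(v.can_delete())          # True: only "Final" versions are protected
```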

Posted by Elisabeth Heigl on

The more, the better – how much GT do I have to put in?

Release 1.7.1

As I said before: Ground Truth is the key factor when creating HTR models.

GT is the correct, machine-readable transcription of the handwriting from which the machine learns to “read”. The more the machine can “practice”, the better it becomes. The more Ground Truth we have, the lower the error rate.

Of course, the required quantity always depends on the specific use case. If we work with just a few, easy-to-read hands, little GT is usually enough to train a solid model. However, if the writing varies greatly because we are dealing with a large number of different writers, the effort is higher. In such cases we need to provide more GT to produce good HTR models.

In the Spruchakten we find many different writers, which is why a lot of GT was created to train the models. Our HTR models (Spruchakten_M_2-1 to 2-11) clearly show how quickly the error rate drops when as much GT as possible is invested. As a rough rule of thumb: doubling the amount of GT in the training (words in the train set) halves the error rate (CER) of the model.

In our examples we observed that models need to be trained with at least 50,000 words of GT in order to deliver good results. With 100,000 words in training, you can already create excellent HTR models.
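As a back-of-the-envelope illustration of this rule of thumb, the following sketch projects the CER under the assumption that every doubling of the train set halves the error rate. The starting values are invented for illustration, not our measured results.

```python
# Illustrative projection of the rule of thumb from this post:
# doubling the GT in training roughly halves the CER.
# The 16% CER at 25,000 words is an assumed starting point, not a measurement.
words, cer = 25_000, 0.16
while words <= 200_000:
    print(f"{words:>7,} words in train set -> ~{cer:.0%} CER")
    words *= 2
    cer /= 2
```

Under these assumptions the 50,000-word mark lands in the “good” range and 100,000 words in the “excellent” range mentioned above. Real learning curves flatten out eventually, so the rule of thumb cannot hold indefinitely.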

Posted by Anna Brandt on

Train sets & test sets (for Beginners)

Release 1.7.1

When we train an HTR model, we create training sets and test sets, both based on Ground Truth. In the next posts on this topic you will learn more about them, especially that the two sets must not be mixed. But what exactly is the difference between them, and what are they used for?

Training and test sets are very similar in the choice of material they contain. The material in both sets should come from the same handwritings and have the same status (GT). The difference is how Transkribus uses them to create a new model: the program works through the training set in a hundred (or more) rounds, called epochs. Imagine writing a test a hundred times, for practice purposes, so to speak. Every time you have written the test and gone through all the pages, you get the solution and can look at your mistakes; then you start the same exercise again. Of course you get better and better. In the same way, Transkribus learns a bit more with each pass.

After each round through the training set, the learned skills are checked against the test set. Imagine your test again: this time you write it and receive a grade, but nobody tells you what you did wrong. Transkribus likewise goes through the same pages many times, but never gets to see the correct solution. The model has to fall back on what it learned in training, and you can see how well it has studied.

So if the test set contained the same pages as the training set, Transkribus could “cheat”: it would already know the pages, having practised on them a hundred times and seen the solution a hundred times. This is why the CER (Character Error Rate) on the training set is almost always lower than on the test set. It is best seen in the “learning curve” of a model.
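A minimal sketch of the principle, with invented page IDs (Transkribus assembles the sets for you when you configure a training; this only makes the “no cheating” rule tangible):

```python
import random

# Invented page IDs standing in for Ground Truth pages.
pages = [f"page_{i:03d}" for i in range(1, 101)]
random.seed(42)               # reproducible shuffle
random.shuffle(pages)

test_set  = set(pages[:10])   # held out: the model never sees its solutions
train_set = set(pages[10:])   # practised on for a hundred or more epochs

# The crucial rule: the two sets must not overlap,
# otherwise the model has already "seen the solution".
assert not train_set & test_set
print(f"{len(train_set)} training pages, {len(test_set)} test pages")
```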

Posted by Elisabeth Heigl on

Collaboration – User Management

Release 1.7.1

The Transkribus platform is designed for collaboration, so many users can work on a collection, and even on a single document, at the same time. With a little organization, collisions are easily avoided.

The two most important elements enabling organized collaboration are User Management and Version Management in Transkribus. User Management refers explicitly to the collections. The person who creates a collection is always its “owner” and has full rights, including the right to delete the entire collection. Owners can grant other users access to the collection and assign them roles that correspond to different rights:

Owner – Editor – Transcriber

It is wise if more than one member of the team is an “owner” of a collection. All the rest of us are “editors”. The role “transcriber” is especially useful in crowd projects where volunteers do nothing but transcribe or tag texts. For such “transcribers”, access via the WebUI, with its range of functions adapted to this role, is ideally suited.
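To summarize the role model described here, a small sketch; the rights listed are a simplified reading of this post, not Transkribus’s exact permission matrix.

```python
# Simplified reading of the collection roles described in this post;
# the rights sets are assumptions for illustration, not the real matrix.
ROLE_RIGHTS = {
    "Owner":       {"transcribe", "edit", "manage_users", "delete_collection"},
    "Editor":      {"transcribe", "edit"},
    "Transcriber": {"transcribe"},
}

def may(role: str, action: str) -> bool:
    return action in ROLE_RIGHTS.get(role, set())

print(may("Owner", "delete_collection"))   # True
print(may("Transcriber", "edit"))          # False: transcribing via the WebUI suffices
```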