Dirk Alvermann


Posted by Dirk Alvermann on

How to create test sets and why they are important, #1

Release 1.7.1

If we want to know how much a model has learned in training, we have to  test it. We do this with precisely defined test sets. Test sets – like the training set – contain exclusively Ground Truth. However, we make sure that this GT has never been used to train the model. So the model does not “know” this material. This is the most important characteristic of test sets. A text page that has already been used as training material will always be better read by the model than one it is not yet “familiar” with. This can easily be proved experimentally. So if you want to get valid statements about CERs and WER, you need “non-corrupted” test sets.

It is also important that a test set is representative. As long as you train an HTR model for a single writer or an individual handwriting, it’s not difficult – after all, it’s always the same hand. As soon as there are several writers involved, you have to make sure that all the individual handwritings used in the training material are also included in the test set. The more different handwritings are trained in a model, the larger the test sets will be.

The size of the test set is another factor that influences representativity. As a rule, a test set should contain 5-10% of the training material. However, this rule of thumb should always be adapted to the specific requirements of the material and the training objectives.

To illustrate this with two examples: Our model for the Spruchakten from 1580 to 1627 was trained with a training set of almost 200,000 words. The test set contains 44,000 words. This is of course a very high proportion of about 20%. It is due to the fact that material of about 300 different writers was trained in this model, which must also be represented in the test set. – In our model for the judges’ opinions of the Wismar Tribunal, there are about 46,000 words in the training set, the test set contains only 2,500 words, i.e. a share of about 5%. However, we only have to deal with 5 different writers. In order to have a representative test set, the material is sufficient.

Posted by Dirk Alvermann on

Word Error Rate & Character Error Rate – How to evaluate a model

Release 1.7.1

The Word Error Rate (WER) and Character Error Rate (CER) indicate the amount of text in a handwriting that the applied HTR model did not read correctly. A CER of 10% means that every tenth character (and these are not only letters, but also punctuations, spaces, etc.) was not correctly identified. The accuracy rate would therefore be 90 %. A good HTR model should recognize 95% of a handwriting correctly, the CER is not more than 5%. This is roughly the value that is achieved today with “dirty” OCR for fracture fonts. Incidentally, an accuracy rate of 95% also corresponds to the expectations formulated in the DFG’s Practical Rules on Digitization.

Even with a good CER, the word error rate can be high. The WER shows how good the exact reproduction of the words in the text is. As a rule, the WER is three to four times higher than the CER and is proportional to it. The value of the WER is not particularly meaningful for the quality of the model, because unlike characters, words are of different lengths and do not allow a clear comparison (a word is already incorrectly recognized if just one letter in it is not correct). That is why the WER is rarely used to characterize the value of a model.

The WER, however, gives clues to an important aspect. Because when I perform a text recognition with the aim of later performing a full text search in my document, the WER shows me the exact success rate that I can expect in my search. The search is for words or parts of words. So no matter how good my CER is: with a WER of 10%, potentially every tenth search term cannot be found.

Tips & Tools
The easiest way to display the CER and WER is to use the Compare function under Tools. Here you can compare one or more pages of a Ground Truth version with an HTR text to estimate the quality of the model.

Posted by Dirk Alvermann on

Ground Truth is the Alpha and Omega

Release 1.7.1

Ground Truth (GT) is the basis for the creation of HTR models. It is simply a typewritten copy of the historical manuscript, a classic literal or diplomatic transcription that is 100% correct – plainly “Groundt Truth”.

Any mistake in this training material will cause “the machine” to learn – among many correct things – something wrong. That’s why quality management is so important when creating GT. But don’t panic, not every mistake in the GT has devastating consequences. It simply must not be repeated too often; otherwise it becomes “chronic” for the model.

In order to ensure the quality of the GT within our project, we have set up a few fixed transcription guidelines, as you know them from edition projects. It is worthwhile to strive for a literal, character-accurate transcription. Regulations of any kind must be avoided; e.g. normalizations, such as the vocal or consonant usage of “u” and “v” or the encoding of complex abbreviations.

If the material contains only one or two different handwritings, about 100 pages of transcribed text are sufficient for a first training session. This creates a basic model that can be used for further work. In our experience, the number of languages used in the text is irrelevant, since the HTR models usually work without dictionaries.

In addition to conventional transcription, Ground Truth can also be created semi-automatically. Transkribus offers a special tool – Text2Image – which is presented in another post.

Posted by Dirk Alvermann on

WebUI & Expert Client

As we said before, this blog is almost exclusively about the Expert Client of Transkribus. It offers a variety of possibilities. To handle them it requires a certain level of knowledge.

The tools of the WebUI are much more limited, but also easier to work with. In the WebUI it is not possible to perform an automatic layout analysis or to start an HTR, let alone to train a model or to interfere in the user management. But that’s not what it’s meant for.

The WebUI is the ideal interface for crowd projects with a lot of volunteers who mainly transcribe or comment and tag content. And this is exactly what it is used for most of the time. The coordination of such a crowd project is done via the Expert Client.

The WebUI’s advantages are that it can be used without any requirements. It is a web application called from the browser; no installation, no updates, etc. Moreover, it is almost intuitive and can be used by anyone without any previous knowledge.

 

Tips & Tools
The WebUI has also a version management – somewhat adapted for crowd projects. When a transcriber is done with the page to be edited, he sets the edit status to “ready for review”, so that his supervisor knows that now it’s his turn.

 

Posted by Dirk Alvermann on

Knowing what you want

A digitization project with Handwritten Text Recognition can have very different goals. They can range from the critical digital edition to the provision of manuscripts as full texts to the indexing of large text corpora via Key Word Spotting. All three objectives allow different approaches, which have a great influence on the technical and personnel efforts.

In this project, only the last two target definitions are of interest.  A critical edition is not intended, even if the full texts generated in this project could serve as the basis of such.

We aim at a complete indexing of the manuscripts by automatic text recognition. The results will then be made public online in the Digital Library Mecklenburg-Vorpommern. A search is available there, which shows the hits in the image itself. The user, who has sufficient palaeographic knowledge, can explore the context of the hit in the image himself or switch to a modern full text view, or even only use the latter.

Posted by Dirk Alvermann on

Why HTR will change it all

For some years now, archives and libraries have been dedicating more and more of their time to the digitisation of historical manuscripts. The strategies are quite different. Some would like to present their “treasures” in a contemporary manner, others would like to make more extensive collections available for use in an appropriate digital form. The advantages of digitisation are obvious. The original sources are preserved and the interested researchers and non-experts can access the material independently of place and time without having to spend days or weeks in reading rooms. Considering the practice of the 20th century, this is an enormous step forward.

Initially, such digital services provide no more than a digital image of the original historical source. They are developed and maintained at gerat expense, both financially and in terms of staff. If you look at the target groups of these services, you can see that they are mainly aimed at the very same people who also visit archives and libraries. However, the addressees usually have the ability to decipher such historical manuscripts. Optimistically speaking, we are talking about one or two percent of the population. For everyone else, these digital copies are just beautiful to look at.

Keep this picture in mind if you want to understand why the Handwritten Text Recognition (HTR) is opening a whole new chapter in the history of digital indexing and use of historical documents. In a nutshell: HTR allows us to move from simple digitalization to the digital transformation of historical sources. Thanks to the HTR, not only the digital image of a manuscript but also its content is made available in a form that can be read by everyone and searched by machines – over hundreds of thousands of pages.

Thus the contents of historical handwritings can be opened up to a public to whom it has so far remained closed or at least not easily accessible. This does not only adress the non-professional researchers. Access to the contents of the sources will also be much easier for academic experts from disciplines that do not have historical auxiliary sciences like palaeography as part of their classical educational canon. This makes new constellations of inderdisciplinary research possible. Ultimately, since the contents of the manucsripts can now be evaluated by machine, questions and methods of the Digital Humanities can be more easily applied to the material than before.

Tips & Tools
Recommendation for further reading: Mühlberger, Archiv 4.0 oder warum die automatisierte Texterkennung alles verändern wird Tagungsband Archivtag Wolfsburg, in: Massenakten – Massendaten. Rationalisierung und Automatisierung im Archiv (Tagungsdokumentationen zum Deutschen Archivtag, Band 22), hg. v. VdA, Fulda 2018, S. 145-156.