
Posted by Elisabeth Heigl on

Transkribus in Chicago

Transkribus will be presented at this year’s meeting of the Social Science History Association (SSHA) in Chicago. Günter Mühlberger will present not only the potential of Transkribus but also initial results and experiences from the processing of the cadastral protocols of the Tiroler Landesarchiv and from our digitization project. He will pay special attention to the training of HTR models and the opportunities of keyword spotting. The lecture, ‘Handwritten Text Recognition and Keyword Spotting as Research Tools for Social Science and History’, will take place on 21 November at 11:00 am in Session 31 (Emerging Methods: Computation/Spatial Econometrics).

Posted by Elisabeth Heigl on

How to create test sets and why they are important, #2

Release 1.7.1

What is the best procedure for creating test sets?
In the end, everyone can find their own way. In our project, the pages for the test sets are already selected while the GT is created. They receive a special edit status (Final) and are later collected in separate documents. This ensures that they cannot accidentally end up in training. Whenever new GT is created for future training, the material for the test set is extended at the same time, so both sets grow in proportion.

For systematic training we create several Documents, which we call “test sets” and which each relate to a single Spruchakte (one year). For example, we create a “test set 1594” for the Document of the Spruchakte of 1594. In it we place representatively selected images that reflect the variety of hands as closely as possible. In the “mother document” we mark the pages selected for the test set as “Final” to make sure that they will not be edited there in the future. We have not created a separate test set for every single record or year, but have proceeded in five-year steps.

Since a model is often trained over many rounds, this procedure also has the advantage that the test set always remains representative. The CERs of the different versions of a model can therefore always be compared and observed during development, because the test is always executed on the same (or extended) set. This makes it easier to evaluate the progress of a model and to adjust the training strategy accordingly.

Transkribus also stores the test set used for each training session separately in the collection concerned, so you can always fall back on it.
It is also possible to select a test set just before training by simply assigning individual pages from the training material to the test set. This can be a quick and pragmatic solution in individual cases, but it is not suitable for the planned development of powerful models.
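The procedure can be sketched in a few lines. The page identifiers, the 10% ratio and the fixed seed are illustrative assumptions, not Transkribus functionality; the point is only that the split is reproducible and the held-out pages never enter training:

```python
import random

def split_pages(pages, test_ratio=0.1, seed=42):
    """Hold out a fixed, reproducible share of pages as a test set.

    pages: list of page identifiers (e.g. "1594_p001") -- hypothetical names.
    Returns (train_pages, test_pages); the held-out pages must never
    be used for training, only for evaluation.
    """
    rng = random.Random(seed)   # fixed seed -> the same split on every run
    shuffled = sorted(pages)    # deterministic base order before shuffling
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_ratio))
    test = sorted(shuffled[:n_test])
    train = sorted(shuffled[n_test:])
    return train, test

# e.g. 50 pages of the Spruchakte 1594, 10% held out
train, test = split_pages([f"1594_p{i:03d}" for i in range(1, 51)], test_ratio=0.1)
```

In practice you would keep the resulting test list fixed and only ever add to it, exactly as described above, so that the set stays uncorrupted across training rounds.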

Posted by Elisabeth Heigl on

Search and Browse documents and HTR-Results in the Digital Library MV

We present our results in the Digital Library Mecklenburg-Vorpommern. Here you will find the digital versions with their corresponding transcriptions.

If you have selected a document – as here, for example, the Spruchakte of 1586 – you will see its first page in the centre of the display. The box above allows you to switch to the next, the previous or any other page of your choice (1.). You can rotate the image (3.), zoom in or out (5.), choose two-page mode (2.) and switch to full-screen mode (4.).

On the left side you can select different view options („Ansicht“). Here you can, for example, display all images at once instead of just one page („Seitenvorschau“) or you can read the transcribed text right away („Volltext“).

If you want to navigate in the structure of the file, first open the structure tree of the file in the bottom left box using the small plus symbol. Then you can select any given date.

Are you looking for a certain name, a place or some other term? Simply enter it in the search box on the left („Suche in: Spruchakte 1586“). If the term occurs in the file, the full-text hits („Volltexttreffer“), meaning all places where your search term occurs in the text, are indicated.

If you select one of the hits here, your search term will be marked yellow on the digital image. For now, highlighting the search result will only work on the digitized page, not yet in full text.

Tips & Tools
Open the full-text hits in a new tab (right mouse button). Navigating back and forth in the Digital Library is still a bit tricky; this way you can be sure to always return to your previous selection.

Posted by Elisabeth Heigl on

How to create test sets and why they are important, #1

Release 1.7.1

If we want to know how much a model has learned in training, we have to test it. We do this with precisely defined test sets. Test sets – like the training set – contain exclusively Ground Truth. However, we make sure that this GT has never been used to train the model, so the model does not “know” this material. This is the most important characteristic of test sets. A page that has already been used as training material will always be read better by the model than one it is not yet “familiar” with. This can easily be shown experimentally. So if you want valid statements about CER and WER, you need “uncorrupted” test sets.

It is also important that a test set is representative. As long as you train an HTR model for a single writer or an individual handwriting, it’s not difficult – after all, it’s always the same hand. As soon as there are several writers involved, you have to make sure that all the individual handwritings used in the training material are also included in the test set. The more different handwritings are trained in a model, the larger the test sets will be.

The size of the test set is another factor that influences representativity. As a rule, a test set should contain 5-10% of the training material. However, this rule of thumb should always be adapted to the specific requirements of the material and the training objectives.

To illustrate this with two examples: our model for the Spruchakten from 1580 to 1627 was trained with a training set of almost 200,000 words; the test set contains 44,000 words. This is of course a very high proportion of about 20%. It is due to the fact that material from about 300 different hands was trained into this model, all of which must also be represented in the test set. In our model for the judges’ opinions of the Wismar Tribunal, there are about 46,000 words in the training set, while the test set contains only 2,500 words, i.e. a share of about 5%. However, here we only have to deal with 5 different hands, so this material is sufficient for a representative test set.

Posted by Elisabeth Heigl on

Word Error Rate & Character Error Rate – How to evaluate a model

Release 1.7.1

The Word Error Rate (WER) and Character Error Rate (CER) indicate the proportion of text in a handwriting that the applied HTR model did not read correctly. A CER of 10% means that every tenth character (and these are not only letters, but also punctuation marks, spaces, etc.) was not correctly identified. The accuracy rate would therefore be 90%. A good HTR model should recognize 95% of a handwriting correctly, i.e. have a CER of no more than 5%. This is roughly the value achieved today with “dirty” OCR for Fraktur fonts. Incidentally, an accuracy rate of 95% also corresponds to the expectations formulated in the DFG’s Practical Guidelines on Digitisation.

Even with a good CER, the word error rate can be high. The WER shows how accurately the words in the text were reproduced. As a rule, the WER is three to four times higher than the CER and roughly proportional to it. The WER is not particularly meaningful for the quality of the model because, unlike characters, words are of different lengths and do not allow a clear comparison (a word already counts as incorrectly recognized if a single letter in it is wrong). That is why the WER is rarely used to characterize the value of a model.

The WER, however, points to an important aspect. If I perform text recognition with the aim of later running a full-text search on my document, the WER indicates the success rate I can expect in that search, because the search is for words or parts of words. So no matter how good my CER is: with a WER of 10%, potentially every tenth search term cannot be found.
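Both rates can be computed from the edit distance between the HTR output and the Ground Truth. The sketch below uses a plain Levenshtein distance and only illustrates the principle; inside Transkribus, the Compare tool does this measurement for you:

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    turning sequence a into sequence b (works on strings or word lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def cer(truth, hypothesis):
    """Character Error Rate: character edit distance / length of the GT."""
    return levenshtein(truth, hypothesis) / len(truth)

def wer(truth, hypothesis):
    """Word Error Rate: word edit distance / number of GT words."""
    t, h = truth.split(), hypothesis.split()
    return levenshtein(t, h) / len(t)

gt  = "Vrtheill der Vniuersitet"
htr = "Vrtheil der Vniuersitet"   # one character dropped
# 1 wrong character out of 24 (CER ~4.2%), but 1 wrong word out of 3 (WER ~33%)
```

The example also shows why the WER is typically several times the CER: a single bad character already spoils the whole word.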

Tips & Tools
The easiest way to display the CER and WER is to use the Compare function under Tools. Here you can compare one or more pages of a Ground Truth version with an HTR text to estimate the quality of the model.

Posted by Elisabeth Heigl on

The more, the better – how to generate more and more GT?

Release 1.7.1

To make sure that the model can reproduce the content of the handwriting as accurately as possible, learning requires a lot of Ground Truth; the more, the better. But how do you get as much GT as possible?

It takes some time to produce a lot of GT. When we were at the beginning of our project and had no models available yet, it took us one hour to transcribe 1 to 2 pages. That’s an average of 150 to 350 words per hour.

Five months later, however, we had almost 250,000 words in training. We neither had a legion of transcribers, nor did one person have to write GT day and night. It was the rapid improvement of the models themselves that enabled us to produce more and more GT:

The more GT you invest, the better your model will be. The better your model reads, the easier it becomes to write GT: you no longer have to write everything yourself, you just correct the HTR output. With models that have an average error rate below 8%, we have produced about 6 pages of GT per hour.
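With the rough figures from this post, the gain is easy to quantify. The words-per-page average is an assumption made only for this back-of-the-envelope calculation:

```python
WORDS_PER_PAGE = 175          # assumed average, implied by 150-350 words at 1-2 pages/hour

scratch_pages_per_hour = 1.5  # transcribing without a model
correct_pages_per_hour = 6    # correcting HTR output (model below 8% CER)

speedup = correct_pages_per_hour / scratch_pages_per_hour

# hours needed to produce the ~250,000 words that went into training
hours_scratch = 250_000 / (scratch_pages_per_hour * WORDS_PER_PAGE)
hours_correct = 250_000 / (correct_pages_per_hour * WORDS_PER_PAGE)
```

Under these assumptions, correcting HTR output is about four times faster than transcribing from scratch.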

The better the model reads, the more GT can be produced and the more GT there is, the better the model will be. What is the opposite of a vicious circle?

Posted by Elisabeth Heigl on

The more, the better – how much GT do I have to put in?

Release 1.7.1

As I said before: Ground Truth is the key factor when creating HTR models.

GT is the correct and machine-readable copy of the handwriting that the machine uses to learn to “read”. The more the machine can “practice”, the better it will be. The more Ground Truth we have, the lower the error rate.

Of course, the quantity always depends on the specific use case. If we work with just a few easy-to-read hands, a little GT is usually enough to train a solid model. However, if the writing varies greatly because we are dealing with a large number of different writers, the effort will be higher: in such cases we need to provide more GT to produce good HTR models.

In the Spruchakten we find many different writers. That’s why a lot of GT was created to train the models. Our HTR-models (Spruchakten_M_2-1 to 2-11) clearly show how quickly the error rate actually decreases if as much GT as possible is invested. We can roughly say that doubling the amount of GT in training (words in trainset) will halve the error rate (CER page) of the model.

In our examples we could observe that we have to train the models with at least 50,000 words of GT in order to get good results. With 100,000 words in training, you can already create excellent HTR models.
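The rule of thumb “doubling the GT halves the CER” amounts to an inverse-proportional model. The anchor point below (10% CER at 50,000 words) is an assumed example value, not a measured result:

```python
def projected_cer(words, base_words=50_000, base_cer=0.10):
    """Project the CER under the rule of thumb that doubling the GT
    halves the error rate, i.e. CER ~ 1/words.
    base_words/base_cer are an assumed anchor point, not measured values."""
    return base_cer * base_words / words

projected_cer(100_000)  # doubling the GT halves the anchor CER
```

Such a projection only describes the rough trend of the Spruchakten models; real error rates level off at some point.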

Posted by Elisabeth Heigl on

Collaboration – User Management

Release 1.7.1

The Transkribus platform is designed for collaboration: many users can work on a collection, and even on a single document, at the same time. With a little organizational skill, collisions can easily be avoided.

The two most important elements enabling organized collaboration in Transkribus are User Management and Version Management. User Management refers explicitly to collections. The person who creates a collection is always its “owner” and has full rights, including the right to delete the entire collection. The owner can grant other users access to the collection and assign them roles that correspond to different rights:

Owner – Editor – Transcriber

It is wise for more than one member of the team to be an “owner” of a collection; the rest of us are “editors”. Assigning the “transcriber” role is especially useful for crowd projects in which volunteers do nothing but transcribe or tag texts. For such “transcribers”, access via the WebUI, with its range of functions adapted to this role, is ideal.

Posted by Elisabeth Heigl on

Transcription guidelines

For the Ground Truth transcripts, a literal or diplomatic transcription is used. This means that we do not regularize the characters in the transcription wherever possible. The machine must learn from a transcription that is as accurate as possible so that it can later reproduce exactly what is written on the sheet. For example, we consistently adopt the vocalic and consonantal use of “u” and “v” from the original. You get used to the “Vrtheill” (sentence) and the “Vniuersitet” (university) quite quickly.

We made only the following exceptions to the literal transcription and regularized certain characters. The handling of abbreviations is dealt with separately.

We cannot literally transcribe the so-called long s (“ſ”) and the final s, because we depend on the antiqua character set. We therefore render both forms as “s”.

We reproduce umlauts as they appear. Diacritical signs are adopted unless the modern character set does not allow this, as in the case of the “a” with a diacritical “e”, which becomes “ä”. Diphthong ligatures are resolved; for example, “æ” becomes “ae”.

In many manuscripts the ypsilon is written as “ÿ”. However, we usually transcribe it as a simple “y”. In clear cases we distinguish between “y” and the similarly used “ij” in the transcription.

There are also some exceptions to the literal transcription with regard to punctuation and special characters. Brackets are represented in very different ways in the manuscripts; we uniformly use modern brackets (…). Hyphenation at the end of a line is indicated by various characters; we transcribe them exclusively with “¬”. The common modern linking character, the hyphen, hardly occurs in the manuscripts. Instead, when two words are linked, we often find “=”, which we reproduce with a simple hyphen.

We adopt commas and full stops as they appear – if they exist at all. If a sentence does not end with a full stop, we do not add one.

Upper and lower case are adopted unchanged from the original. However, it is not always possible to distinguish strictly between upper- and lower-case letters; this applies in particular to D/d, V/v and Z/z, regardless of the writer. In cases of doubt, we compare the letter in question with its usual appearance in the text. In compounds, capital letters can occur within a word; these, too, are transcribed exactly according to the original.
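The character rules above can be collected in a small normalization helper. The replacement table mirrors only the exceptions listed in this post; everything else stays untouched, as the diplomatic transcription requires. Applying the “=” rule globally is a simplification, since in the sources it only stands between linked words:

```python
REPLACEMENTS = {
    "ſ": "s",   # long s -> s (the final s is already "s")
    "æ": "ae",  # diphthong ligature resolved
    "Æ": "Ae",
    "ÿ": "y",   # usually transcribed as a simple y
    "=": "-",   # linking sign between two words -> hyphen (crude: applied everywhere)
}

def normalize(transcript):
    """Apply only the explicit exceptions to the literal transcription;
    all other characters (u/v use, umlauts, "¬", case) are kept as written."""
    for old, new in REPLACEMENTS.items():
        transcript = transcript.replace(old, new)
    return transcript

normalize("Vniuerſitet")  # -> "Vniuersitet"
```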

Posted by Elisabeth Heigl on

Use case Spruchakten

The surfaces of file sheets from the early modern period are usually uneven. We therefore always scan using a glass plate, so that at least rough folds and creases can be smoothed and the writing straightened a little.

Contrary to the usual procedure for scanning books, we scan each page of a file individually, deliberately ruling out subsequent layout processing of the scans. Earlier digitization projects have shown that such post-scan layout editing can be laborious and error-prone and can easily disrupt the workflow. But because subsequent layout processing was ruled out, the scans have to be as presentable as possible right from the start.

This is why we scan in the so-called “Crop Mode” (UCC project settings). It automatically detects the edges of the original sheet and sets them as the frame of the scanned image. The result is an image with barely any black border. A misalignment of the sheet of up to 40° can be compensated automatically. This produces reliably aligned images and also makes it easier to turn pages during scanning.

For the “Crop Mode” to recognize a page and scan it as such, only this page may be visible. Everything else, both the opposite page and the pages beneath, must be covered in black. For this purpose we use two ordinary sheets of black photo cardboard (A3 or A2).

In the Spruchakten there are often sheets from which the lock seals have been cut out. These pages must additionally be underlaid with a sheet as close as possible to the original colour; the “Crop Mode” then completes the border so that no parts of the sheet are cut off during the scan.

During the scanning of the Spruchakten we therefore cannot simply “leaf through” and trigger scans; basically, every single image has to be prepared. The average scanning speed with this procedure is 100 pages per hour. In return, we avoid potentially costly post-processing of the images.