

Posted by Elisabeth Heigl on

Generic vs. specialized model

Release 1.7.1

Did you notice in the graph of the model development that the character error rate (CER) of the last model got slightly worse again, even though we had significantly increased the GT input? We had about 43,000 more words in training, yet the average CER deteriorated from 2.79% to 3.43%. We couldn’t really explain that.
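As a reminder: the CER is the minimum number of character edits (insertions, deletions, substitutions) needed to turn the HTR output into the correct text, divided by the length of the correct text. Transkribus computes this value for you; the following minimal Python sketch merely illustrates the definition (the function names are our own):

    def levenshtein(a: str, b: str) -> int:
        """Minimum number of single-character edits turning a into b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def cer(hypothesis: str, reference: str) -> float:
        """Character error rate: edit distance relative to reference length."""
        return levenshtein(hypothesis, reference) / len(reference)

    print(f"{cer('Vniversitet', 'Vniuersitet'):.2%}")   # one substitution -> 9.09%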

At this point, more and more GT alone no longer got us any further, so we had to change our training strategy. Until then we had trained large models with writings from a total period of 70 years and more than 500 writers.

Our first suspicion fell on the concept writings, which – as we already knew – the machine (LA and HTR), just like ourselves, had its problems with. During the next training we therefore excluded these concept writings and trained exclusively with “clean” office writings. But that didn’t lead to a noticeable improvement: the test set CER dropped from 3.43% to just 3.31%.

In the following trainings, we additionally focused on a chronological sequencing of the models. We split our material and created two different models: Spruchakten_M_3-1 (Spruchakten 1583-1627) and Spruchakten_M_4-1 (Spruchakten 1627-1653).

With these new specialized models we actually achieved an improvement of the HTR where the generic model had no longer been sufficient. In the test sets, several pages showed an error rate of less than 2%. With the model M_4-1, the CER of many individual pages remained below 1%, and two pages were even entirely error-free at 0%.

Whether a generic or a specialized model will produce better results depends a lot on the size and composition of the material. In the beginning, when you are keen to progress as quickly as possible (the more, the better), a generic model is useful. However, once it reaches its limits, you shouldn’t “overburden” the HTR any further but specialize your models instead.

Posted by Elisabeth Heigl on

Transkribus as an instrument for students and professors

In this year’s 24-hour lecture of the University of Greifswald, Transkribus and our digitization project will be presented. Elisabeth Heigl, who is involved in the project as an academic assistant, will present some of the exciting cases from the rulings of the law faculty of Greifswald. If you are interested in the history of law, join the lecture in lecture hall 2, Audimax (Rubenowstraße 1), on 16.11.2019 at 12:00.
Read the whole program of the 24-hour lecture here.

Posted by Elisabeth Heigl on

Search and browse documents and HTR results in the Digital Library MV

We present our results in the Digital Library Mecklenburg-Vorpommern. Here you will find the digital versions with their corresponding transcriptions.

If you have selected a document – as here, for example, the Spruchakte of 1586 – you will see its first page in the centre of the display. The box above allows you to switch to the next, the previous or any other page of your choice (1.). You can rotate the image (3.), zoom in or out (5.), choose the two-page mode (2.) and switch to full-screen mode (4.).

On the left side you can select different view options („Ansicht“). Here you can, for example, display all images at once instead of just one page („Seitenvorschau“) or you can read the transcribed text right away („Volltext“).

If you want to navigate within the structure of the file, first open its structure tree in the bottom left box using the small plus symbol. Then you can select any given date.

Are you looking for a certain name, a place or some other term? Simply enter it in the search box on the left („Suche in: Spruchakte 1586“). If the term occurs in the file, the full-text hits („Volltexttreffer“) – all the places where your search term occurs in the text – are listed.

If you select one of these hits, your search term will be highlighted in yellow on the digital image. For now, this highlighting only works on the digitized page, not yet in the full-text view.

Tips & Tools
Open the full-text hits in a new tab (right mouse button). Navigating back and forth in the Digital Library is still a bit tricky; this way you can be sure to always return to your previous selection.

Posted by Elisabeth Heigl on

The more, the better – how to generate more and more GT?

Release 1.7.1

To make sure that the model can reproduce the content of the handwriting as accurately as possible, training requires a lot of Ground Truth (GT); the more, the better. But how do you get as much GT as possible?

Producing a lot of GT takes time. At the beginning of our project, when we had no models available yet, it took us one hour to transcribe 1 to 2 pages. That’s an average of 150 to 350 words per hour.

Five months later, however, we had almost 250,000 words in training. We neither had a legion of transcribers, nor did one person have to write GT day and night. It was the rapid improvement of the models themselves that enabled us to produce more and more GT:

The more GT you invest, the better your model will be. And the better your model reads, the easier it becomes to produce GT: you no longer have to write everything yourself, you just correct the HTR output. With models that have an average error rate of less than 8%, we have produced about 6 pages of GT per hour.
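A quick back-of-the-envelope calculation, using only the figures mentioned in this series (the words-per-page estimate is our own inference: 150 to 350 words per hour at 1 to 2 pages per hour suggests roughly 150 to 175 words per page):

    # Rough GT throughput comparison, derived from the figures in the text.
    manual_words_per_hour = (150, 350)    # transcribing from scratch
    words_per_page = (150, 175)           # implied by 150-350 words/h at 1-2 pages/h
    pages_per_hour_correcting = 6         # correcting HTR output (models below 8% CER)

    low, high = (w * pages_per_hour_correcting for w in words_per_page)
    print(f"correcting HTR: roughly {low}-{high} words of GT per hour")
    # -> roughly 900-1050 words/h, about three to seven times the manual rate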

The better the model reads, the more GT can be produced; and the more GT there is, the better the model becomes. What is the opposite of a vicious circle?

Posted by Elisabeth Heigl on

The more, the better – how much GT do I have to put in?

Release 1.7.1

As I said before: Ground Truth is the key factor when creating HTR models.

GT is the correct, machine-readable transcription of the handwriting, from which the machine learns to “read”. The more the machine can “practice”, the better it will be. The more Ground Truth we have, the lower the error rate.

Of course, the required quantity always depends on the specific use case. If we work with a few easy-to-read hands, a little GT is usually enough to train a solid model. However, if the writings vary greatly because we are dealing with a large number of different writers, the effort is higher. In such cases we need to provide more GT to produce good HTR models.

In the Spruchakten we find many different writers, which is why a lot of GT was created to train the models. Our HTR models (Spruchakten_M_2-1 to 2-11) clearly show how quickly the error rate actually decreases when as much GT as possible is invested. Roughly speaking, doubling the amount of GT in training (words in the train set) halves the error rate (CER per page) of the model.
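Expressed as a formula – our own formalization of this rule of thumb, not an exact law – the observation amounts to an inverse proportionality between the amount of GT and the error rate. With N the number of words in the train set and N_0 some reference amount:

    \mathrm{CER}(N) \;\approx\; \mathrm{CER}(N_0) \cdot \frac{N_0}{N}

Doubling the training data (N = 2N_0) then yields about half the previous error rate; our actual training results follow this curve only roughly, of course.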

In our examples we observed that the models have to be trained with at least 50,000 words of GT in order to deliver good results. With 100,000 words in training, you can already create excellent HTR models.

Posted by Elisabeth Heigl on

Collaboration – User Management

Release 1.7.1

The Transkribus platform is designed for collaboration, so many users can work on a collection – and even on the same document – at the same time. With a little organizational skill, collisions can easily be avoided.

The two most important elements enabling organized collaboration in Transkribus are User Management and Version Management. User Management refers explicitly to collections. The person who creates a collection is always its “owner”, meaning that they have full rights, including the right to delete the entire collection. The owner can grant other users access to the collection and assign them roles that correspond to different rights:

Owner – Editor – Transcriber

It is wise for more than one member of the team to be an “owner” of a collection. The rest of us are “editors”. Assigning the “transcriber” role is especially useful if you run crowd projects in which volunteers do nothing but transcribe or tag texts. For such “transcribers”, access via the WebUI, with its range of functions adapted to this role, is ideally suited.

Posted by Elisabeth Heigl on

Transcription guidelines

In the transcripts for the Ground Truth, literal or diplomatic transcription is used. This means that, as far as possible, we do not regularize the characters in the transcription. The machine must learn from a transcription that is as accurate as possible, so that it can later reproduce exactly what is written on the sheet. For example, we consistently adopt the vocalic and consonantal use of “u” and “v” from the original. You get used to the “Vrtheill” (sentence) and the “Vniuersitet” (university) quite quickly.

We made only the following exceptions from the literal transcription and regularized certain characters. The handling of abbreviations is dealt with separately.

We cannot literally transcribe the so-called “long s” (“ſ”) and the “final s” because we are dependent on the antiqua sign system. We therefore render both forms as “s”.

We reproduce umlauts as they appear. Diacritical signs are adopted unless the modern sign system does not allow it, as in the case of the “a” with a diacritical “e”, which becomes “ä”. Diphthongs are resolved; the “æ”, for example, becomes “ae”.

In many manuscripts, the ypsilon is written as “ÿ”. We usually transcribe it as a simple “y”. In clear cases, we differentiate between “y” and the similarly used “ij” in the transcription.

There are also some exceptions to the literal transcription with regard to punctuation and special characters. Brackets are represented in very different ways in the manuscripts, but we use the modern brackets (…) uniformly. Hyphenation at the end of a line is indicated by various characters; we transcribe all of them with a “¬”. The common modern linking sign – the hyphen – hardly ever occurs in the manuscripts. Instead, when two words are linked, we often find the “=”, which we reproduce with a simple hyphen.

We adopt commas and full stops as they appear – if they appear at all. If a sentence does not end with a full stop, we do not add one.

Upper and lower case is adopted unchanged from the original. However, it is not always possible to strictly distinguish between upper- and lower-case letters. This applies to a large extent to D/d, V/v and also Z/z, regardless of the writer. In cases of doubt, we compare the letter in question with its usual appearance elsewhere in the text. In compounds, capital letters can occur within a word – these, too, are transcribed accurately according to the original.
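For a compact overview, the context-free character regularizations described above can be summarized as a simple substitution table. The following sketch is purely illustrative – our transcribers apply these rules by hand, and the script is not part of any project tooling:

    # Illustrative summary of our character regularizations (see the guidelines above).
    # Context-dependent rules - "a with diacritical e" -> "ä", the uniform modern
    # brackets (…), the line-end hyphenation sign "¬" - are applied by hand.
    CHAR_RULES = {
        "ſ": "s",    # long s and final s are both rendered as "s"
        "æ": "ae",   # diphthongs are resolved
        "Æ": "Ae",
        "ÿ": "y",    # the ypsilon is usually transcribed as a simple "y"
        "=": "-",    # "=" linking two words becomes a simple hyphen
    }

    def regularize(text: str) -> str:
        """Apply the context-free character rules from our guidelines."""
        for old, new in CHAR_RULES.items():
            text = text.replace(old, new)
        return text

    print(regularize("Vniuerſitet"))   # -> Vniuersitet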

Posted by Elisabeth Heigl on

Use case Spruchakten

The surfaces of file sheets from the early modern period are usually uneven. We therefore always use a scanning glass plate. In this way, at least rough folds and creases can be smoothed out and the writing straightened a little.

Contrary to the usual procedure for scanning books, we scan each page of a file individually. In doing so, we deliberately ruled out any subsequent layout processing of the scans: earlier digitization projects had shown that such post-scan layout editing can be laborious and error-prone and can easily disrupt the workflow. But because subsequent layout processing was ruled out, the scans have to be as presentable as possible right from the start.

This is why we scan in the so-called “Crop Mode” (UCC project settings). It automatically detects the edges of the original sheet and sets them as the frame of the scanned image. The result is an image with barely any black border. A skewed sheet can be straightened automatically by up to 40°. This yields reliably aligned images and also makes it easier to turn the pages during scanning.

For the “Crop Mode” to recognize a page and scan it as such, only this one page may be visible. Everything else – both the opposite page and the pages beneath – must be covered in black. For this purpose we use two ordinary black photo cardboard sheets (A3 or A2).

In the Spruchakten there are often paper sheets from which the lock seals have been cut out. These pages must additionally be underlaid with a sheet as close as possible to the original colour. The “Crop Mode” then completes the border so that no parts of the sheet are cut off during the scan.

Thus, when scanning the Spruchakten, we cannot simply “browse” and trigger scans; essentially, we have to prepare every single image. The average scanning speed with this procedure is 100 pages per hour. In return, we save what could otherwise be costly post-processing of the images.

Posted by Elisabeth Heigl on

How to transcribe – Basic Decisions

In Transkribus, we create transcripts primarily to produce training material for our HTR models, the so-called “Ground Truth”. There are already a number of recommendations in the How-to’s for simple and advanced requirements.

We do not intend to create a critical edition. Nonetheless, we need some sort of guidelines, especially if we want to work successfully as a team in which several transcribers work on the same texts. Unlike classical edition guidelines, ours are not based on the needs of the scholarly reader. Instead, we focus on the needs of the ‘machine’ and on the usability of the HTR results for a future full-text search. We are well aware that the result can only be a compromise.

The training material should help the machine to recognize what we see as well. So it has to be accurate and not falsified by interpretation. Only in this way can the machine learn to read ‘correctly’. This principle has priority and serves as a guideline for all our further decisions regarding transcription.

Many questions familiar from edition projects must also be decided here. In our project we generally use literal or diplomatic transcription, meaning that we transcribe the characters exactly as we see them. This applies to the entire range of letters and punctuation marks. To give just one example: we don’t regularize the consonantal and vocalic usage of the letters “v” and “u”. If the writer meant “und” (and) but wrote “vnndt”, we take it literally and transcribe the latter.

The perfection of the training data has a high priority for us, but other considerations also influence the creation of the GT. We would like to make the HTR results accessible via a full-text search. This means that a user must first type the word he is searching for before receiving an answer. Since certain characters of the Kurrent script, such as the long “ſ” (s), will hardly ever be part of a search term, we do regularize the transcription in such and similar cases.

In the case of punctuation and special characters – allowing ourselves a certain amount of leeway here – we regularize only some. We regularize the bracket characters, for example, which are represented quite differently in the manuscripts. The same applies to word separators at the end of a line.

The usual “[…]” is never used for illegible passages. Instead, such a text area is tagged as “unclear”.
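In the exported PAGE XML, such a tag ends up in the “custom” attribute of the line concerned, addressed by character offset and length. A small Python sketch of what this looks like – the attribute format is quoted from memory and may vary between Transkribus versions:

    # Hypothetical example line and its Transkribus tag annotation.
    # In the PAGE XML export, tags sit in the line's "custom" attribute,
    # e.g.: custom="readingOrder {index:2;} unclear {offset:13; length:6;}"
    line_text = "das Vrtheill gnedig ergangen"

    # offset/length address the unclear span within the line's text:
    offset, length = 13, 6
    print(line_text[offset:offset + length])   # -> gnedig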

Posted by Elisabeth Heigl on

Workflow and Information System

The journey of a file from the archive to its digital, HTR-based presentation on an online platform passes through several stations. These steps make up the overall project workflow and rest on a broad technical infrastructure. The workflow of our project, which is geographically spread over three locations, consists roughly of six main stations:

  1. Preparation of the files and the process (restorative, archival, digital)
  2. Scanning
  3. Enrichment with structural and metadata
  4. Providing the files for Transkribus
  5. Automatic Handwritten Text Recognition (HTR) with Transkribus
  6. Online presentation in the Digital Library Mecklenburg-Vorpommern

It proved helpful that we had not only strictly defined the individual steps in advance but had also determined the responsible persons from the beginning, i.e. experts for the individual tasks as well as coordinators for the steps across stations and locations. This ensures that everyone involved always knows the respective contact person. Open questions can thus be answered more easily, and any problems that arise can be solved more efficiently.

Especially with the scanning of the Spruchakten, we did not proceed chronologically from the start; we did not scan the inventory ‘from top to bottom’. Instead, we first selected and processed individual representative volumes from the period between 1580 and 1675, because we wanted to create powerful HTR models first. Only then did we ‘fill in the gaps’. This procedure showed us how crucial it is to continuously document the progress of the project, with all its individual areas and stages, so that it does not become confusing. There are many ways to do this.

We keep – by now very colourful – spreadsheets on the progress of the various collections and files. However, they only depict partial processes and are only accessible to the coordinators. These spreadsheets also have to be maintained, and for this purpose the progress in the various areas has to be closely monitored.

Another possibility is the Goobi workflow. Some tasks take place on the Goobi server anyway, and beyond those we can freely add tasks to the Goobi workflow that need not be related to Goobi itself. All of them can be ‘accepted’ and ‘completed’ on that platform to reflect the progress of the project. The condition, however, is that all project contributors are familiar with this workflow system. Where this is not the case, an „external“ information system must be chosen that everyone can access and handle.

That is why the different partners of our project jointly keep a wiki (e-collaboration).