Yearly Archives

Posted by Elisabeth Heigl on

Search and Browse documents and HTR-Results in the Digital Library MV

We present our results in the Digital Library Mecklenburg-Vorpommern. Here you will find the digital versions with their corresponding transcriptions.

If you have selected a document, as here for example the Spruchakte of 1586, you will see its first page in the centre of the display. The box above allows you to switch to the next, the previous or any other page of your choice (1.). You can rotate the image (3.), zoom in or out (5.), choose two-page mode (2.) and switch to full-screen mode (4.).

On the left side you can select different view options („Ansicht“). Here you can, for example, display all images at once instead of just one page („Seitenvorschau“) or you can read the transcribed text right away („Volltext“).

If you want to navigate in the structure of the file, first open the structure tree of the file in the bottom left box using the small plus symbol. Then you can select any given date.

Are you looking for a certain name, a place or some other term? Simply enter it in the search box on the left („Suche in: Spruchakte 1586“). If the term occurs in the file, the full-text hits („Volltexttreffer“), meaning all places where your search term occurs in the text, are indicated.

If you select one of the hits here, your search term will be marked yellow on the digital image. For now, highlighting the search result will only work on the digitized page, not yet in full text.

Tips & Tools
Display the found full text hits in a new tab (right mouse button). Navigating forwards and backwards in the Digital Library is still a bit tricky. This way you can be sure that you will always return to your previous selection.

Posted by Dirk Alvermann on

How to create test sets and why they are important, #1

Release 1.7.1

If we want to know how much a model has learned in training, we have to test it. We do this with precisely defined test sets. Test sets – like the training set – contain exclusively Ground Truth. However, we make sure that this GT has never been used to train the model. So the model does not “know” this material. This is the most important characteristic of test sets. A text page that has already been used as training material will always be read better by the model than one it is not yet “familiar” with. This can easily be proved experimentally. So if you want to get valid statements about CER and WER, you need “non-corrupted” test sets.

It is also important that a test set is representative. As long as you train an HTR model for a single writer or an individual handwriting, it’s not difficult – after all, it’s always the same hand. As soon as there are several writers involved, you have to make sure that all the individual handwritings used in the training material are also included in the test set. The more different handwritings are trained in a model, the larger the test sets will be.

The size of the test set is another factor that influences representativity. As a rule, a test set should contain 5-10% of the training material. However, this rule of thumb should always be adapted to the specific requirements of the material and the training objectives.

To illustrate this with two examples: Our model for the Spruchakten from 1580 to 1627 was trained with a training set of almost 200,000 words. The test set contains 44,000 words. This is of course a very high proportion of about 20%. It is due to the fact that material from about 300 different writers was trained in this model, all of which must also be represented in the test set. In our model for the judges’ opinions of the Wismar Tribunal, there are about 46,000 words in the training set, while the test set contains only 2,500 words, i.e. a share of about 5%. Here, however, we only have to deal with 5 different writers, so this material is sufficient for a representative test set.
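
The two requirements above (no overlap with the training material, every writer represented) can be sketched in a few lines of Python. This is only an illustration under assumptions: the page list, the writer labels and the function names `split_by_writer` and `share_of_training` are hypothetical and not part of Transkribus, where the sets are assembled by hand.

```python
import random
from collections import defaultdict


def split_by_writer(pages, test_fraction=0.1, seed=0):
    """Split (page_id, writer) pairs into train and test page ids.

    Roughly `test_fraction` of each writer's pages is held out,
    so every writer with more than one page appears in the test set.
    """
    rng = random.Random(seed)
    by_writer = defaultdict(list)
    for page_id, writer in pages:
        by_writer[writer].append(page_id)

    train, test = [], []
    for writer in sorted(by_writer):
        ids = sorted(by_writer[writer])
        rng.shuffle(ids)
        # At least one test page per writer, but a writer's only page
        # stays in training (it cannot be in both sets).
        n_test = max(1, round(len(ids) * test_fraction)) if len(ids) > 1 else 0
        test.extend(ids[:n_test])
        train.extend(ids[n_test:])
    return train, test


def share_of_training(train_words, test_words):
    """Size of the test set relative to the training set."""
    return test_words / train_words


# Word counts quoted in the post (rounded figures).
print(f"{share_of_training(200_000, 44_000):.0%}")  # 22%
print(f"{share_of_training(46_000, 2_500):.0%}")    # 5%
```

With many writers, the "at least one page per writer" rule is what drives the test set above the 5-10% rule of thumb.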

Posted by Dirk Alvermann on

Word Error Rate & Character Error Rate – How to evaluate a model

Release 1.7.1

The Word Error Rate (WER) and Character Error Rate (CER) indicate how much of the text in a handwritten document the applied HTR model did not read correctly. A CER of 10% means that every tenth character (and these are not only letters, but also punctuation marks, spaces, etc.) was not correctly identified. The accuracy rate would therefore be 90%. A good HTR model should recognize 95% of a handwriting correctly, i.e. its CER should not exceed 5%. This is roughly the value that is achieved today with “dirty” OCR for Fraktur fonts. Incidentally, an accuracy rate of 95% also corresponds to the expectations formulated in the DFG’s Practical Rules on Digitization.

Even with a good CER, the word error rate can be high. The WER shows how good the exact reproduction of the words in the text is. As a rule, the WER is three to four times higher than the CER and is proportional to it. The value of the WER is not particularly meaningful for the quality of the model, because unlike characters, words are of different lengths and do not allow a clear comparison (a word is already incorrectly recognized if just one letter in it is not correct). That is why the WER is rarely used to characterize the value of a model.

The WER, however, gives clues to an important aspect. Because when I perform a text recognition with the aim of later performing a full text search in my document, the WER shows me the exact success rate that I can expect in my search. The search is for words or parts of words. So no matter how good my CER is: with a WER of 10%, potentially every tenth search term cannot be found.
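
To make the two measures concrete, here is a minimal sketch that computes CER and WER from a Ground Truth line and an HTR line using the edit (Levenshtein) distance. It is an assumption that this matches how error rates are usually defined; the Compare function in Transkribus may normalise or count differently, and the example strings are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character errors relative to the length of the (non-empty) reference."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word errors relative to the number of words in the reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


ground_truth = "in the name of god"  # invented example line
htr_output = "in the nane of god"    # one wrong character

print(f"CER: {cer(ground_truth, htr_output):.1%}")  # CER: 5.6%
print(f"WER: {wer(ground_truth, htr_output):.1%}")  # WER: 20.0%
```

A single wrong character costs one of 18 characters but one of only five words, which is why the WER comes out several times higher than the CER.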

Tips & Tools
The easiest way to display the CER and WER is to use the Compare function under Tools. Here you can compare one or more pages of a Ground Truth version with an HTR text to estimate the quality of the model.

Posted by Elisabeth Heigl on

The more, the better – how to generate more and more GT?

Release 1.7.1

To make sure that the model can reproduce the content of the handwriting as accurately as possible, learning requires a lot of Ground Truth; the more, the better. But how do you get as much GT as possible?

It takes some time to produce a lot of GT. When we were at the beginning of our project and had no models available yet, it took us one hour to transcribe 1 to 2 pages. That’s an average of 150 to 350 words per hour.

Five months later, however, we had almost 250,000 words in training. We neither had a legion of transcribers nor did one person have to write GT day and night. Just the exponential improvement of the models themselves enabled us to produce more and more GT:

The more GT you invest, the better your model will be. The better your model reads, the easier it will be to write GT. You don’t have to write by yourself anymore, you just correct the HTR. With models that have an average error rate of less than 8%, we’ve produced about 6 pages of GT per hour.

The better the model reads, the more GT can be produced and the more GT there is, the better the model will be. What is the opposite of a vicious circle?

Posted by Anna Brandt on

Collaboration – Versions Management

Release 1.7.1

The second important element for organized collaboration is the version management of Transkribus. In the toolbar it seems rather inconspicuous, but it is enormously important. Transkribus stores a version of the currently edited page each time it is saved. It contains the current status of the layout work and content processing.

These versions are provided with an “edit status” so that they can be distinguished more easily. A newly uploaded document contains only pages with the edit status “new”. As soon as you edit a page, the edit status automatically changes to “in progress”. The three other status options – “done”, “final” and “Ground Truth” – can only be set manually.

The logical time to set such a “higher” status depends on the agreements within the team. We use version management mostly during the production of training material – Ground Truth. All pages that have a finished layout analysis are set to “done” so that the transcribers and editors know that these pages can now be finished by them. This status will not be changed until the page has a 100% secure transcription. Then it will be set to “Ground Truth” or “final”. All pages with the status “GT” will later be used as training material for HTR models, while the pages with edit status “final” will be used to create the test sets.
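
Our team convention can be written down as a small lookup. The sketch below is purely illustrative: `EditStatus` and `select_sets` are hypothetical names, the page ids are invented, and nothing here calls the Transkribus API. It only shows how the status of the latest version decides where a page ends up.

```python
from enum import Enum


class EditStatus(Enum):
    NEW = "new"
    IN_PROGRESS = "in progress"
    DONE = "done"                  # layout finished, ready for transcription
    FINAL = "final"                # checked transcription, reserved for test sets
    GROUND_TRUTH = "Ground Truth"  # checked transcription, used for training


def select_sets(latest_status):
    """Map {page_id: EditStatus} to (training pages, test pages)."""
    train = sorted(p for p, s in latest_status.items()
                   if s is EditStatus.GROUND_TRUTH)
    test = sorted(p for p, s in latest_status.items()
                  if s is EditStatus.FINAL)
    return train, test


pages = {
    "page_001": EditStatus.GROUND_TRUTH,
    "page_002": EditStatus.FINAL,
    "page_003": EditStatus.DONE,
    "page_004": EditStatus.GROUND_TRUTH,
}
train_pages, test_pages = select_sets(pages)
print(train_pages)  # ['page_001', 'page_004']
print(test_pages)   # ['page_002']
```

Because a page carries exactly one status at a time, a page can never land in both sets by accident.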

Each collaborator can access and edit or delete all versions of a page at any time. The edit status helps them to find the desired version faster. In addition to the edit status, the last editor and the save time are displayed for each version. If the version was created by an automatic process (layout analysis or HTR), this is also noted. Thus, the processing steps are traceable in detail.

Tips & Tools
You can have multiple versions with the same status.
You can set any version to any other status – except to “New”.
You can delete single or multiple versions – except final versions, which cannot be deleted.

Posted by Elisabeth Heigl on

The more, the better – how much GT do I have to put in?

Release 1.7.1

As I said before: Ground Truth is the key factor when creating HTR models.

GT is the correct and machine-readable copy of the handwriting that the machine uses to learn to “read”. The more the machine can “practice”, the better it will be. The more Ground Truth we have, the lower the error rate.

Of course, the quantity always depends on the specific use case. If we work with a few easy-to-read hands, little GT is usually enough to train a solid model. However, if the writings are very different because we are dealing with a large number of different writers, the effort will be higher. This means that in such cases we need to provide more GT to produce good HTR models.

In the Spruchakten we find many different writers. That’s why a lot of GT was created to train the models. Our HTR-models (Spruchakten_M_2-1 to 2-11) clearly show how quickly the error rate actually decreases if as much GT as possible is invested. We can roughly say that doubling the amount of GT in training (words in trainset) will halve the error rate (CER page) of the model.
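
The rule of thumb "double the GT, halve the CER" can be turned into a rough estimator. The sketch below is only that: the function name `estimated_cer` is hypothetical, and the reference point of 25,000 words at 16% CER is an invented example, not a measured value from the Spruchakten models.

```python
import math


def estimated_cer(words, ref_words, ref_cer):
    """Extrapolate the CER, assuming it halves with every doubling of GT."""
    doublings = math.log2(words / ref_words)
    return ref_cer * 0.5 ** doublings


# Hypothetical reference point: 25,000 words of GT give a CER of 16%.
for n_words in (25_000, 50_000, 100_000):
    print(n_words, f"{estimated_cer(n_words, 25_000, 0.16):.1%}")
```

Such an estimate only helps to plan how much GT is still worth producing; the rule is a rough observation and the real learning curve will flatten at some point.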

In our examples we could observe that we have to train the models with at least 50,000 words of GT in order to get good results. With 100,000 words in training, you can already create excellent HTR models.

Posted by Anna Brandt on

Train sets & test sets (for Beginners)

Release 1.7.1

When we train an HTR model, we create training sets and test sets, all based on Ground Truth. In the next posts on this topic you will learn more about it, especially that the two sets must not be mixed together. But what exactly is the difference between the two and what are they used for?

Training and test sets are very similar in the choice of material they contain. The material in both sets should come from the same handwritings and be at the same status (GT). The difference is how Transkribus uses it to create a new model: The training set is learned by the program in a hundred (or more) rounds (epochs). Imagine writing a test a hundred times – for practice purposes, so to speak. Every time you write the test, after going through all the pages, you get the solution and can look at your mistakes. Then you start again with the same exercise. Of course you’ll get better and better. In the same way, Transkribus learns a bit more with each pass.

After each round in the training set, the learned skills are checked on the test set. Imagine your test again. This time you write the test, get the grade, but they don’t tell you what you did wrong. So Transkribus goes through the same pages many times, but can never see the right solution. The model has to fall back on the previously learned training and you can see how well it has studied.

So if there were the same pages in the test set as in training, Transkribus could “cheat”. It would already know the pages, have practised on them a hundred times and seen the solution a hundred times. This is the reason why the CER (Character Error Rate) in the training set is almost always lower than in the test set. This is best seen in the “learning curve” of a model.
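
Before starting a training, it is worth checking that no page sits in both sets. A minimal sketch, assuming you keep your own lists of page ids (the ids and the function name `find_leakage` are hypothetical and not a Transkribus feature):

```python
def find_leakage(train_pages, test_pages):
    """Return the page ids that occur in both the training and the test set."""
    return sorted(set(train_pages) & set(test_pages))


train_pages = ["1586_p001", "1586_p002", "1586_p003"]
test_pages = ["1586_p003", "1586_p010"]

overlap = find_leakage(train_pages, test_pages)
print(overlap)  # ['1586_p003']
```

An empty result means the model cannot "cheat" on the test set. Any page listed here should be removed from one of the two sets.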

Posted by Elisabeth Heigl on

Collaboration – User Management

Release 1.7.1

The Transkribus platform is designed for collaboration. Many users can therefore work on a collection, and even on the same document, at the same time. Collisions can easily be avoided with a little organizational skill.

The two most important elements enabling organized collaboration are User Management and Version Management in Transkribus. User Management refers explicitly to the collections. The person who creates a collection is always its “owner”, meaning that he has full rights, including the right to delete the entire collection. He can grant other users access to the collection and assign them roles that correspond to different rights:

Owner – Editor – Transcriber

It is wise if more than one member of the team is the “owner” of a collection. All the rest of us are “editors”. Assigning the role “transcriber” is especially useful if you run crowd-projects where volunteers do nothing but transcribe or tag texts. For such “transcribers”, access via the WebUI, with its range of functions adapted to this role, is ideally suited.

Posted by Anna Brandt on

Toolbar – the most important tools and how to use them #2

Release 1.7.1

Correcting layouts

If the basic text regions are drawn, they can be edited. If you select one of the text regions, the other tools on the toolbar will be enabled.

With 1 you can add one or more points to the selected shape (TR or BL!). All shapes consist of dots and straight lines connecting them. You can edit the shape by moving the dots. You can use this tool to make a polygon out of the basic text region, whatever fits best to the text block. Press 2 to remove a dot from the selected shape. This tool is especially useful for correcting or shortening baselines.

This is especially useful if you have split elements. With 3, 4 and 5 it is possible to cut a selected shape. This works for both text regions and baselines: 3 cuts horizontally, 4 vertically. With 5 you draw your own line, which does not necessarily have to be horizontal or vertical.

The last important tool (red circle) is the Merge tool. This is especially important if the automatic LA has split baselines in the image. You can use Merge to reassemble shapes of the same type: baselines with baselines and text regions with text regions. To do this you have to mark the corresponding shapes, which you can do directly in the image or in the layout tab.

Tips & Tools
When splitting, note that the TR and BL can only be cut where they have lines. It is not possible to cut through the dots.
Be aware that when you split a shape Transkribus will automatically change the Reading Order. For example, if two TRs are made from one, a new reading order is started in each TR.

Posted by Anna Brandt on

Reading Order

Release 1.7.1

The Reading Order displays the order in which the HTR will read the lines in an image. This RO is created automatically during the layout analysis, but can also be changed manually later. With the automatic LA, the RO is determined by the coordinates of the lines in the image: the top line, which is furthest to the left, is number one, and so on.
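
The coordinate-based ordering can be imitated with a simple sort. This is a deliberately simplified sketch, not the actual Transkribus algorithm: the `Baseline` class, its fields and the pixel values are assumptions for illustration. The example also shows what happens when a line that rises from left to right has been split in two.

```python
from dataclasses import dataclass


@dataclass
class Baseline:
    line_id: str
    x: int  # left end of the baseline, in pixels
    y: int  # vertical position of the left end, 0 = top of the image


def reading_order(baselines):
    """Number the lines from top to bottom, and from left to right on ties."""
    ordered = sorted(baselines, key=lambda b: (b.y, b.x))
    return {b.line_id: number for number, b in enumerate(ordered, start=1)}


lines = [
    Baseline("heading", x=100, y=300),
    # One rising line, split into two halves by the layout analysis:
    Baseline("left_half", x=100, y=520),
    Baseline("right_half", x=900, y=500),  # sits slightly higher on the page
]
print(reading_order(lines))
# {'heading': 1, 'right_half': 2, 'left_half': 3}
```

Because the right half starts a little higher in the image, it is numbered before the left half, so the text of that line would be read in the wrong order.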

If the writing in the image is not completely horizontal or if baselines are split, this can cause errors in the Reading Order. If you correct the LA, you should always look at the RO again, otherwise the transcribed text gets confused and makes little sense. To change the RO, you can either click on the circles at the lines where the line numbers appear and correct them directly, or select the corresponding line in the layout tab and move it with the mouse. If the later full text is to make sense at first glance, such corrections are essential. After all, the RO determines the context of the content. If the HTR result of the document is only to be used for a full-text search and is not to be displayed as structured full text, the RO is less relevant.

Tips & Tools
If you want to move a line forward or backward, the numbers of the following lines will change automatically. Sometimes it is necessary to calculate a bit beforehand which number will be the correct one.
Very important: When the author writes a line that rises from left to right – which happens very, very often – and the baseline is split by the LA, the second half of the split BL will have the smaller number. If you want to merge these baselines with the Merge Tool, you have to look at the RO first. If the RO is wrong, Transkribus will merge them with a loop according to their coordinates. Such a baseline can no longer be interpreted by the HTR.
Edit: This problem was solved with the version 1.8.0. The problem now only occurs with vertically recognized lines.
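
The renumbering that happens when you move a line can be tried out on a plain list. A small sketch, assuming the Reading Order is simply a list of line ids (the function name `move_line` and the ids are hypothetical):

```python
def move_line(order, line_id, new_number):
    """Move `line_id` to position `new_number` (1-based) in the reading order.

    All other lines keep their relative order; lines after the new
    position are shifted by one, as in the layout tab.
    """
    remaining = [line for line in order if line != line_id]
    remaining.insert(new_number - 1, line_id)
    return remaining


order = ["a", "b", "c", "d", "e"]
new_order = move_line(order, "e", 2)
print(new_order)  # ['a', 'e', 'b', 'c', 'd']
```

Line "e" becomes number 2, and the former lines 2 to 4 move down to 3 to 5. This is the calculation you sometimes have to do in your head before typing a new number.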