Category Archives

67 Articles

Posted by Elisabeth Heigl on

transcription guidelines

In the transcripts for the Ground Truth, the litteral or diplomatic transcription is used. This means that we do not regulate the characters in the transcription, if possible. The machine must learn from a transcription that is as accurate as possible so that it can later reproduce exactly what is written on the sheet. For example, we consequently adopt the vocal and consonant use of “u” and “v” of the original. You can get used to the “Vrtheill” (sentence) and the “Vniuersitet” (university) quite quickly.

We made only the following exceptions from the literal transcription and regulated characters. The handling of abbreviations is dealt with separately.

We cannot literally transcribe the so-called “long-s” (“ſ”) and the “final-s” because we are dependent on the antiqua sign system. Therefore we transfer both forms as “s”.

We reproduce umlauts as they appear. Diacritical signs are adopted, unless the modern sign system does not allow this; as in the case of the “a” with ‘diacritical e’, which becomes the “ä”. Diphthongs are replaced, for example the “æ″ becomes “ae″.

The Ypsilon is written in many manuscripts as “ÿ″. However, we usually transcribe it as a simple “y″. In clear cases, we differentiate between “y” and the similarly used “ij” in the transcription.

There are also some exceptions to the literal transcription with regard to the punctuation and special characters: In the manuscripts, brackets are represented in very different ways. But here we use the modern brackets (…) uniformly. The hyphenation at the end of the lines is indicated by different characters. We transcribe them exclusively with a “¬”. The common linkage sign in modern use – the hyphen – hardly occurs in the manuscripts. Instead, when two words are linked, we often find the “=”, which we reproduce with a simple hyphen.

We take the comma and point setting as it appears – if it exists at all. If the sentence does not end with a dot, we do not set a dot.

Upper and lower case will be adopted unchanged according to the original. However, it is not always possible to strictly distinguish between upper and lower case letters. This applies to a large extent to the D/d, the V/v and also the Z/z, regardless of the writer. In case of doubt, we compare the letter in question with its usual appearance in the text. In composites, capital letters can occur within a word – they are also transcribed accurately according to the original.

Posted by Dirk Alvermann on

Ground Truth is the Alpha and Omega

Release 1.7.1

Ground Truth (GT) is the basis for the creation of HTR models. It is simply a typewritten copy of the historical manuscript, a classic literal or diplomatic transcription that is 100% correct – plainly “Groundt Truth”.

Any mistake in this training material will cause “the machine” to learn – among many correct things – something wrong. That’s why quality management is so important when creating GT. But don’t panic, not every mistake in the GT has devastating consequences. It simply must not be repeated too often; otherwise it becomes “chronic” for the model.

In order to ensure the quality of the GT within our project, we have set up a few fixed transcription guidelines, as you know them from edition projects. It is worthwhile to strive for a literal, character-accurate transcription. Regulations of any kind must be avoided; e.g. normalizations, such as the vocal or consonant usage of “u” and “v” or the encoding of complex abbreviations.

If the material contains only one or two different handwritings, about 100 pages of transcribed text are sufficient for a first training session. This creates a basic model that can be used for further work. In our experience, the number of languages used in the text is irrelevant, since the HTR models usually work without dictionaries.

In addition to conventional transcription, Ground Truth can also be created semi-automatically. Transkribus offers a special tool – Text2Image – which is presented in another post.

Posted by Elisabeth Heigl on

How to transcribe – Basic Decisions

In Transkribus, we create transcripts primarily to produce training material for our HTR-models; the so called „Ground Truth“. There are already a number of recommendations in the How-to’s for simple and advanced requirements.

We do not intend to create a critical edition. Nonetheless, we need some sort of guidelines, especially if we want to be successful in a team where several transcribers work on the same texts. Unlike classical edition guidelines, ours are not based on the needs of the scholarly reader. Instead, we focus on the needs of the ‘machine’ and the usability of the HTR result for a future full text search. We are well aware that this can only lead to a compromise in the result.

The training material should help the machine to recognize what we see as well. So it has to be accurate and not falsified by interpretation. This is the only way the machine can learn to read in the ‘right’ way. This principle is the priority and a kind of guideline for all our further decisions regarding transcriptions.

Many questions that are familiar to us from edition projects must also be decided here. In our project we generally use the literal or diplomatic transcription, meaning that we transcribe the characters exactly as we see them. This applies to the entire range of letters and punctuation marks. To give just an example: we don´t regulate the consonantal and vocal usage of the letters “v” and “u”. If the writer meant “und” (and) but wrote “vnndt”, we take it as literal and transcribe as the latter.

The perfection of the training data has a high priority for us. But there are also some other considerations influencing the creation of the GT. We would like to make the HTR results accessible via a full-text search. This means that a user must first phrase a word he is searching before receiving an answer. Since certain characters of the Kurrents font, such as the long „ſ“ (s) will hardly be part of a search term, we do regulate the transcription in such and similar cases.

In the case of sentence characters – using a certain amount of leeway here – we only regulate some. We regulate the bracket character for example, which is represented quite differently in the manuscripts. The same applies to word separators at the end of a line.

The usual “[…]” is never used for illegible passages. Instead this text area is tagged as “unclear”.

Posted by Anna Brandt on

Elements

Release 1.7.1

For handwritten text recognition (HTR), automatic layout analysis is essential – no text recognition without layout analysis.
The layout analysis ensures that the image is divided into different areas, those that do not need further attention and others that contain the text to be recognized. These areas are called “Text Regions” (TR, green in the image). Transkribus needs “Baselines” (BL, red in the image) to recognize characters or letters within the text regions.
They are drawn underneath each text line. Baselines are surrounded by their own region, which is called “Line” (blue in the image). It has no practical relevance for the user. The three elements Text Region – Line – Baseline have a parent-child relationship to each other and cannot exist without the respective parent element – no baseline without line and no line without text region. One should know these elements, their functions and their relationship to each other, especially if you have to work on the layout manually.

Manual layouts should rather be an exception than the rule. For most use cases, Transkribus has an extremely powerful tool – the “CITlab Advances Layout Analysis”. It is the standard Transkribus model that has been used successfully since 2017. In most cases it delivers great results in automatic segmentation. This automatic layout analysis can be used for a single page, a selection of pages or an entire document.

All elements for segmentation can also be set, modified and edited manually, which is recommended in more complex layouts. An extensive toolbar is available for this purpose.

Posted by Anna Brandt on

Material

Release 1.7.1

Successful handwriting text recognition depends on four factors:

– Quality of Originals
– Quality of digital copies
– Reliable layout analysis and segmentation of image areas containing the text to be recognized
– Performance of the HTR models, “reading” the handwriting

Our blog will provide regular field reports on all these points. First of all, here are some general remarks.
Basically you can edit all handwritten documents with the tools available in Transkribus. Neither the used character system (Latin, Greek, Hebrew, Russian, Serbian etc.) nor the language is a criterion – the “models” can “learn” almost everything.
However, the quality of the originals has a big effect on the result. In other words – heavily soiled, completely faded or blackened documents have less chances for automatic text recognition than clean, strong writings.
Completely muddled  text layouts, i.e. with horizontal and vertical or diagonal lines, numerous marginal notes or insertions and text between the lines, cause more problems for the automatic layout analysis than chancellery copies. And more problems means more work for the editors.
When selecting the material, one should therefore consider the challenges it poses for the available tools and the individual work areas. This can only be done with a little experience.

In our project, multilingual documents from the 16th to 20th centuries are processed with varying degrees of difficulty. We are glad to share our experience.

Posted by Dirk Alvermann on

WebUI & Expert Client

As we said before, this blog is almost exclusively about the Expert Client of Transkribus. It offers a variety of possibilities. To handle them it requires a certain level of knowledge.

The tools of the WebUI are much more limited, but also easier to work with. In the WebUI it is not possible to perform an automatic layout analysis or to start an HTR, let alone to train a model or to interfere in the user management. But that’s not what it’s meant for.

The WebUI is the ideal interface for crowd projects with a lot of volunteers who mainly transcribe or comment and tag content. And this is exactly what it is used for most of the time. The coordination of such a crowd project is done via the Expert Client.

The WebUI’s advantages are that it can be used without any requirements. It is a web application called from the browser; no installation, no updates, etc. Moreover, it is almost intuitive and can be used by anyone without any previous knowledge.

 

Tips & Tools
The WebUI has also a version management – somewhat adapted for crowd projects. When a transcriber is done with the page to be edited, he sets the edit status to “ready for review”, so that his supervisor knows that now it’s his turn.

 

Posted by Dirk Alvermann on

Knowing what you want

A digitization project with Handwritten Text Recognition can have very different goals. They can range from the critical digital edition to the provision of manuscripts as full texts to the indexing of large text corpora via Key Word Spotting. All three objectives allow different approaches, which have a great influence on the technical and personnel efforts.

In this project, only the last two target definitions are of interest.  A critical edition is not intended, even if the full texts generated in this project could serve as the basis of such.

We aim at a complete indexing of the manuscripts by automatic text recognition. The results will then be made public online in the Digital Library Mecklenburg-Vorpommern. A search is available there, which shows the hits in the image itself. The user, who has sufficient palaeographic knowledge, can explore the context of the hit in the image himself or switch to a modern full text view, or even only use the latter.