
Posted by Anna Brandt on

Textregions

Release 1.7.1

Usually, the automatic CITlab Advanced Layout Analysis in its standard setting will recognize a single Text Region (TR) on an image with the corresponding baselines.
However, there are also simple layouts where the use of several TRs is recommended, e.g. if there are marginal notes or footnotes and similar recurring elements. As long as these text areas, which differ in content and structure, are contained in a single TR, the layout analysis simply counts the lines from top to bottom.

This “Reading Order” does not take into account where a text actually belongs in terms of content (e.g. an insertion), but only where it is graphically located on the page. Correcting an automatically generated but unsatisfactory Reading Order is tedious and often time-consuming. The problem can easily be avoided by creating several text regions, each of which holds its related texts and lines together as if in a box.
To do this, create the TRs manually at the appropriate places. Afterwards, start the line detection with CITlab Advanced to add the baselines automatically.
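The effect of separate text regions on the Reading Order can be illustrated with a small sketch. It is not the CITlab implementation; the `Line` class, the region labels and the pixel positions are hypothetical and only model the behaviour described above: within one region, lines are counted from top to bottom.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    y: int        # assumed vertical position of the baseline (pixels from the top)
    region: str   # hypothetical region label, e.g. "main" or "margin"


def reading_order_single_region(lines):
    """One TR for the whole page: lines are ordered purely by position."""
    return [line.text for line in sorted(lines, key=lambda line: line.y)]


def reading_order_by_region(lines, region_order):
    """Several TRs: regions come first, then top to bottom inside each region."""
    rank = {name: index for index, name in enumerate(region_order)}
    ordered = sorted(lines, key=lambda line: (rank[line.region], line.y))
    return [line.text for line in ordered]


page = [
    Line("main text, line 1", 100, "main"),
    Line("marginal note, line 1", 150, "margin"),
    Line("main text, line 2", 200, "main"),
    Line("marginal note, line 2", 250, "margin"),
    Line("main text, line 3", 300, "main"),
]

print(reading_order_single_region(page))
print(reading_order_by_region(page, ["main", "margin"]))
```

With a single region the marginal note is interleaved with the main text; with two regions the main text stays together and the note follows as a block.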

Tips & Tools
If you have drawn the TRs manually and want to have the baselines drawn automatically by CITlab Advanced LA, you should first uncheck the box “Find Textregions”. Otherwise the manually drawn TRs will be replaced immediately. You should also make sure that none of the individual text regions is activated; otherwise only the activated ones will be processed.

Posted by Elisabeth Heigl on

transcription guidelines

In the transcripts for the Ground Truth, we use literal, or diplomatic, transcription. This means that, wherever possible, we do not normalize the characters in the transcription. The machine must learn from a transcription that is as accurate as possible so that it can later reproduce exactly what is written on the sheet. For example, we consistently adopt the original's vocalic and consonantal use of “u” and “v”. You can get used to the “Vrtheill” (sentence) and the “Vniuersitet” (university) quite quickly.

We made only the following exceptions to the literal transcription, in which characters are normalized. The handling of abbreviations is dealt with separately.

We cannot transcribe the so-called “long s” (“ſ”) and the “final s” literally because we depend on the Antiqua character set. Therefore we render both forms as “s”.

We reproduce umlauts as they appear. Diacritical marks are adopted unless the modern character set does not allow this, as in the case of the “a” with a ‘diacritical e’, which becomes “ä”. Diphthongs are replaced; for example, “æ” becomes “ae”.

The Ypsilon is written in many manuscripts as “ÿ”. However, we usually transcribe it as a simple “y”. In clear cases, we differentiate between “y” and the similarly used “ij” in the transcription.

There are also some exceptions to the literal transcription with regard to the punctuation and special characters: In the manuscripts, brackets are represented in very different ways. But here we use the modern brackets (…) uniformly. The hyphenation at the end of the lines is indicated by different characters. We transcribe them exclusively with a “¬”. The common linkage sign in modern use – the hyphen – hardly occurs in the manuscripts. Instead, when two words are linked, we often find the “=”, which we reproduce with a simple hyphen.

We adopt commas and full stops as they appear, if they exist at all. If a sentence does not end with a full stop, we do not add one.

Upper and lower case are adopted unchanged from the original. However, it is not always possible to distinguish strictly between upper and lower case letters. This applies to a large extent to D/d, V/v and Z/z, regardless of the writer. In case of doubt, we compare the letter in question with its usual appearance in the text. In compounds, capital letters can occur within a word; they are also transcribed exactly as in the original.
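The character-level exceptions described above can be summarized in a short sketch. It is only an illustration, not a tool used in the project: the function name, the rule table and the set of raw line-end marks (`=` and `-`) are assumptions. The use of “u” and “v”, capitalization and the existing punctuation are deliberately left untouched, and the “y”/“ij” distinction is not modelled because it needs a human decision.

```python
import re

# Exceptions to the literal transcription named in the guidelines.
CHAR_RULES = {
    "\u017f": "s",        # long s -> "s"
    "a\u0364": "\u00e4",  # "a" with combining small e above -> "ä"
    "æ": "ae",
    "ÿ": "y",
}

# Assumption: marks a raw transcript might use for line-end hyphenation.
LINE_END_MARKS = "=-"


def normalize_line(line: str) -> str:
    """Apply the exceptions to one transcribed line; leave everything else as is."""
    for source, target in CHAR_RULES.items():
        line = line.replace(source, target)
    line = line.rstrip()
    # Hyphenation at the end of the line is written exclusively as "¬".
    if line and line[-1] in LINE_END_MARKS:
        line = line[:-1] + "¬"
    # "=" linking two words inside a line becomes a simple hyphen.
    return re.sub(r"(?<=\w)=(?=\w)", "-", line)


for raw in ["Vniuer\u017fitet", "Hof=Gericht", "Vrthe=", "be\u00ff"]:
    print(normalize_line(raw))
```

The line-end rule runs before the word-linking rule so that a trailing “=” is treated as hyphenation and not as a link between two words.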

Posted by Elisabeth Heigl on

Use case Spruchakten

The surfaces of file sheets from the early modern period are usually uneven. Therefore, we always use a glass scanning plate. In this way, at least rough folds and creases can be smoothed and the writing can be straightened a little.

In contrast to the usual procedure for scanning books, we scan each page of a file individually. We thereby deliberately ruled out any subsequent layout processing of the scans. Earlier digitization projects have shown that such post-scan layout editing can be laborious and error-prone and can easily disrupt the workflow. Because subsequent layout processing was ruled out, the scans have to be as presentable as possible right from the start.

This is why we use the so-called “Crop Mode” (UCC project settings) for scanning. It automatically detects the sheet edge of the original and sets it as the frame of the scanned image. The result is an image with barely any black border. A misalignment of the sheet of up to 40° can be compensated automatically. This leads to reliably aligned images and also makes it easier to change pages during scanning.

For the “Crop Mode” to recognize a page and to scan it as such, this page alone must be visible. This means that everything else, both the opposite page and the pages beneath, must be covered in black. For this purpose we use two ordinary sheets of black photo cardboard (A3 or A2).

In the Spruchakten there are often sheets from which the lock seals have been cut out. These pages must additionally be underlaid with a sheet as close as possible to the colour of the original. The “Crop Mode” will then complete the border so that no parts of the sheet are cut off during the scan.

Thus, when scanning the Spruchakten, we cannot simply “browse” and trigger scans; we basically have to prepare every single image. The average scanning speed with this procedure is 100 pages per hour. In return, we avoid potentially costly post-processing of the images.
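For planning purposes, the stated average can be turned into a rough time estimate. This is a minimal sketch: the function name and the example page count are hypothetical, and the figure of 100 pages per hour is an average, so actual times will vary.

```python
PAGES_PER_HOUR = 100  # average scanning speed stated above


def scanning_hours(pages: int, pages_per_hour: int = PAGES_PER_HOUR) -> float:
    """Estimate the scanning time in hours for a given number of pages."""
    if pages < 0:
        raise ValueError("pages must not be negative")
    return pages / pages_per_hour


# Hypothetical example: a file of 250 pages.
print(scanning_hours(250))
```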