Digitization

Posted by Elisabeth Heigl on 3. March 2020

Image resolution

A technical parameter that must be set uniformly at the start of the scanning process is the resolution of the digital images, i.e. how many dots per inch (dpi) the scanned image should have.

The DFG Code of Practice for Digitisation generally recommends 300 dpi (p. 15). For “text documents with the smallest significant character” of up to 1.5 mm, however, a resolution of 400 dpi can be selected (p. 22). Indeed, in early modern manuscripts, and especially in drafts, the smallest components of a character can lead to different readings and should therefore be as clearly identifiable as possible. We have therefore chosen 400 dpi.

Alongside the advantages for deciphering the scripts, however, the significantly larger file size of the 400 dpi images (around 40,000 KB per image) compared with the 300 dpi images (around 30,000 KB per image) must be taken into account and planned for.
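The storage difference can be estimated in advance with simple arithmetic. The following is a minimal sketch that uses the approximate per-image sizes quoted above; the page count of 10,000 and the function name `total_storage_gb` are assumptions made for illustration, not figures or tools from the project.

```python
# Approximate per-image sizes in KB, taken from the text above.
KB_PER_IMAGE = {300: 30_000, 400: 40_000}


def total_storage_gb(pages: int, dpi: int) -> float:
    """Estimate total storage in GB (1 GB = 1,000,000 KB) for `pages` images."""
    return pages * KB_PER_IMAGE[dpi] / 1_000_000


if __name__ == "__main__":
    pages = 10_000  # hypothetical project size, chosen only for illustration
    for dpi in (300, 400):
        print(f"{dpi} dpi: about {total_storage_gb(pages, dpi):.0f} GB")
```

For the assumed 10,000 pages this gives roughly 300 GB at 300 dpi and 400 GB at 400 dpi, before any backups or derivative copies.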

The selected resolution also has an impact on automatic handwritten text recognition (HTR). Different resolutions lead to different results in the layout analysis and in the HTR. To test this assumption, we selected three pages from a Spruchakte of 1618, scanned each of them at 150, 200, 300 and 400 dpi, processed all pages in Transkribus and determined the following character error rates (CER, in percent):

Page/dpi	150	200	300	400
2	3.99	3.50	3.63	3.14
5	2.10	2.37	2.45	2.11
9	6.73	6.81	6.52	6.37

Generally speaking, a lower resolution therefore tends to mean a worse (higher) CER, though the differences stay below one percentage point. To be honest, such comparisons of HTR results are somewhat misleading. The very basis of the HTR, the layout analysis, produces subtly different results at different resolutions. This in turn influences the HTR results (cruder analyses apparently achieve worse HTR results) and also the GT production itself (e.g. through cut-off words).
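The figures in the table can be summarised per resolution. The sketch below only re-uses the CER values listed above; the helper names `mean_cer_by_dpi` and `spread_by_page` are made up for this example, and with only three pages the averages are an illustration rather than a statistically robust result.

```python
from statistics import mean

# CER values in percent from the table above, keyed by page number, then dpi.
CER_TABLE = {
    2: {150: 3.99, 200: 3.50, 300: 3.63, 400: 3.14},
    5: {150: 2.10, 200: 2.37, 300: 2.45, 400: 2.11},
    9: {150: 6.73, 200: 6.81, 300: 6.52, 400: 6.37},
}


def mean_cer_by_dpi(table):
    """Average the CER over all pages for each resolution."""
    dpis = sorted(next(iter(table.values())))
    return {dpi: mean(row[dpi] for row in table.values()) for dpi in dpis}


def spread_by_page(table):
    """Difference between the worst and best CER of each page."""
    return {page: max(row.values()) - min(row.values())
            for page, row in table.items()}


if __name__ == "__main__":
    for dpi, value in mean_cer_by_dpi(CER_TABLE).items():
        print(f"{dpi} dpi: mean CER {value:.2f} %")
    for page, spread in spread_by_page(CER_TABLE).items():
        print(f"page {page}: spread {spread:.2f} percentage points")
```

The averages come out at about 4.27, 4.23, 4.20 and 3.87 percent for 150, 200, 300 and 400 dpi, and the spread on each page stays below one percentage point.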

In our example you see the same image in different resolutions. The result of the CITlab Advanced LA changes as the resolution increases. The initial “V” of the first line is no longer recognized at higher resolution, while line 10 is increasingly “torn apart” at higher resolution. With increasing resolution, the LA becomes more sensitive – this can have advantages and disadvantages at the same time.

It is therefore important to have a uniform dpi number for the entire project so that the average CER values and the entire statistical material can be reliably evaluated.
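For readers unfamiliar with the metric: the CER is commonly defined as the edit distance between the recognised text and the ground truth, divided by the length of the ground truth. The following sketch implements that common definition in plain Python. It is not the code used by the recognition platform, which may normalise whitespace or punctuation differently; the function names and the sample strings are chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions and substitutions
    needed to turn string `a` into string `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100 * levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Made-up sample line, not taken from the Spruchakten.
    print(f"{cer('Responsum', 'Respousum'):.2f} %")
```

Because the CER is normalised by the reference length, values from different pages are only comparable when the pages were scanned and segmented under the same conditions.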

Posted by Elisabeth Heigl on 4. September 2019

Use case Spruchakten

The surfaces of file sheets from the early modern period are usually uneven. We therefore always use a glass scanning plate, so that at least rough folds and creases are smoothed out and the writing is straightened a little.

Unlike the usual procedure for scanning books, we scan each page of a file individually. In doing so, we deliberately ruled out any subsequent layout processing of the scans. Earlier digitization projects have shown that such post-scan layout editing can be laborious and error-prone and can easily disrupt the workflow. But because subsequent layout processing was ruled out, the scans have to be as presentable as possible right from the start.

This is why we use the so-called “Crop Mode” (UCC project settings) for scanning. It automatically detects the sheet edge of the original and sets it as the frame of the scanned image. The result is an image with hardly any black border. A misalignment of the sheet of up to 40° can be compensated automatically. This leads to images that are reliably aligned, and it also makes it easier to change pages during scanning.

For the “Crop Mode” to recognize a page and scan it as such, this page alone must be visible. This means that everything else, both the opposite side and the pages beneath, must be covered in black. For this purpose we use two ordinary sheets of black photo cardboard (A3 or A2).

The Spruchakten often contain sheets from which the closing seals have been cut out. These pages must additionally be underlaid with a sheet that is as close as possible to the original colour. The “Crop Mode” will then complete the border so that no parts of the sheet are cut off during the scan.

Thus, when scanning the Spruchakten, we cannot simply “browse” and trigger scans. Basically, we have to prepare every single image. The average scanning speed with this procedure is 100 pages per hour. In return, we save ourselves potentially costly post-processing of the images.

Posted by Elisabeth Heigl on 30. August 2019

Scanning and Structural Data

The Spruchakten of the Greifswald Law Faculty are scanned on Bookeye4 book scanners (Image Access) in combination with the scanning software UCC (Universal Capturing Client) by Intranda. UCC not only allows the capturing of structural data while scanning, but is also directly connected to the Goobi server (also by Intranda), where the digital processes of our project (except for handwritten text recognition) are controlled. Tasks already created in Goobi (Goobi-Vorgang) can thus be accessed in UCC, ‘filled up’ with image-files and associated structural data and then be exported to the Goobi server.

We scan consistently at 400 dpi and 24-bit color depth. The original files are saved as uncompressed TIF files. For further processing and presentation in the Digital Library Mecklenburg-Vorpommern, however, they are copied as compressed JPG files.
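The size of an uncompressed image follows directly from sheet size, resolution and color depth. The sketch below shows the calculation under stated assumptions: the sheet dimensions of 21 × 33 cm are a hypothetical example, the function name `uncompressed_size_mb` is made up, and the small overhead of the TIF header and metadata is ignored.

```python
CM_PER_INCH = 2.54


def uncompressed_size_mb(width_cm: float, height_cm: float, dpi: int,
                         bits_per_pixel: int = 24) -> float:
    """Raw pixel data size in MB (1 MB = 1,000,000 bytes) for one scan."""
    width_px = round(width_cm / CM_PER_INCH * dpi)
    height_px = round(height_cm / CM_PER_INCH * dpi)
    return width_px * height_px * bits_per_pixel / 8 / 1_000_000


if __name__ == "__main__":
    # Hypothetical sheet of 21 x 33 cm; real sheets vary and are cropped
    # to the detected sheet edge.
    for dpi in (300, 400):
        size = uncompressed_size_mb(21, 33, dpi)
        print(f"{dpi} dpi, 24 bit: about {size:.1f} MB uncompressed")
```

Since the images are cropped to the sheet edge, the actual file size differs from page to page, so such a calculation is only a planning aid.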

UCC enables you to capture structural data during the scanning process. This means that the scan operator can already set a structural element for related pages of the file while scanning. Every single “Responsum” (meaning each legal case in the file) receives the structural element “Vorgang”. In the later editing of the metadata, we only have to add a descriptive main title.

Rechtsprechung im Ostseeraum
