Monthly Archives

8 Articles

Posted by Anna Brandt on

Elements

Release 1.7.1

For handwritten text recognition (HTR), automatic layout analysis is essential – no text recognition without layout analysis.
The layout analysis ensures that the image is divided into different areas, those that do not need further attention and others that contain the text to be recognized. These areas are called “Text Regions” (TR, green in the image). Transkribus needs “Baselines” (BL, red in the image) to recognize characters or letters within the text regions.
They are drawn underneath each text line. Baselines are surrounded by their own region, which is called “Line” (blue in the image). It has no practical relevance for the user. The three elements Text Region – Line – Baseline have a parent-child relationship to each other and cannot exist without the respective parent element: no baseline without line and no line without text region. You should know these elements, their functions and their relationship to each other, especially if you have to work on the layout manually.
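The nesting of the three elements can be sketched as a small data structure. This is only an illustration of the parent-child relationship described above; the class names, the `add_line` helper and the coordinates are assumptions and not part of any Transkribus API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinates on the page image


@dataclass
class Baseline:
    """Polyline drawn underneath one text line."""
    points: List[Point]


@dataclass
class Line:
    """Region that surrounds exactly one baseline."""
    baseline: Baseline


@dataclass
class TextRegion:
    """Image area that contains text to be recognized."""
    outline: List[Point]
    lines: List[Line] = field(default_factory=list)

    def add_line(self, baseline_points: List[Point]) -> Line:
        # Hypothetical helper: a baseline is created together with its line,
        # and the line is attached to this region straight away, so neither
        # child is left without its parent.
        line = Line(baseline=Baseline(points=baseline_points))
        self.lines.append(line)
        return line


# Example values are invented for illustration.
region = TextRegion(outline=[(0, 0), (800, 0), (800, 600), (0, 600)])
region.add_line([(20, 100), (780, 104)])
region.add_line([(20, 160), (780, 163)])
```

In this sketch, deleting a text region also discards its lines and baselines, which matches the behaviour to expect when editing a layout by hand.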

Manual layouts should be the exception rather than the rule. For most use cases, Transkribus has an extremely powerful tool – the “CITlab Advanced Layout Analysis”. It is the standard Transkribus model that has been used successfully since 2017. In most cases it delivers great results in automatic segmentation. This automatic layout analysis can be used for a single page, a selection of pages or an entire document.

All elements for segmentation can also be set, modified and edited manually, which is recommended in more complex layouts. An extensive toolbar is available for this purpose.

Posted by Anna Brandt on

Material

Release 1.7.1

Successful handwriting text recognition depends on four factors:

– Quality of Originals
– Quality of digital copies
– Reliable layout analysis and segmentation of image areas containing the text to be recognized
– Performance of the HTR models, “reading” the handwriting

Our blog will provide regular field reports on all these points. First of all, here are some general remarks.
Basically you can edit all handwritten documents with the tools available in Transkribus. Neither the writing system used (Latin, Greek, Hebrew, Russian, Serbian etc.) nor the language is a criterion – the “models” can “learn” almost everything.
However, the quality of the originals has a big effect on the result. In other words, heavily soiled, completely faded or blackened documents have fewer chances for automatic text recognition than clean, strong writings.
Completely muddled text layouts, i.e. with horizontal and vertical or diagonal lines, numerous marginal notes or insertions and text between the lines, cause more problems for the automatic layout analysis than chancellery copies. And more problems mean more work for the editors.
When selecting the material, one should therefore consider the challenges it poses for the available tools and the individual work areas. This can only be done with a little experience.

In our project, multilingual documents from the 16th to 20th centuries are processed with varying degrees of difficulty. We are glad to share our experience.

Posted by Dirk Alvermann on

WebUI & Expert Client

As we said before, this blog is almost exclusively about the Expert Client of Transkribus. It offers a variety of possibilities, and handling them requires a certain level of knowledge.

The tools of the WebUI are much more limited, but also easier to work with. In the WebUI it is not possible to perform an automatic layout analysis or to start an HTR, let alone to train a model or to intervene in the user management. But that’s not what it’s meant for.

The WebUI is the ideal interface for crowd projects with a lot of volunteers who mainly transcribe or comment and tag content. And this is exactly what it is used for most of the time. The coordination of such a crowd project is done via the Expert Client.

The WebUI’s advantage is that it can be used without any prerequisites. It is a web application called from the browser; no installation, no updates, etc. Moreover, it is almost intuitive and can be used by anyone without any previous knowledge.

Tips & Tools
The WebUI also has a version management, somewhat adapted for crowd projects. When a transcriber is done with the page to be edited, he sets the edit status to “ready for review”, so that his supervisor knows that it is now his turn.

Posted by Elisabeth Heigl on

Workflow and Information System

The journey from a file in the archive to its digital and HTR-based presentation on an online platform passes through several stations. These steps make up the overall project workflow. They are based on a broad technical infrastructure. The workflow of our project, which is geographically spread over three locations, consists roughly of six main stations:

  1. Preparation of the files and the process (restorative, archival, digital)
  2. Scanning
  3. Enrichment with structural data and metadata
  4. Providing the files for Transkribus
  5. Automatic Handwritten Text Recognition (HTR) with Transkribus
  6. Online presentation in the Digital Library Mecklenburg-Vorpommern
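To make the sequence of stations concrete, here is a minimal sketch of how progress through the six stations could be recorded per volume. It is an illustration only and not the tracking system used in the project; the enum member names, the `next_station` and `still_waiting_for` helpers and the volume identifiers are all assumptions.

```python
from enum import IntEnum
from typing import Dict, List, Optional


class Station(IntEnum):
    """The six main stations, in workflow order."""
    PREPARATION = 1
    SCANNING = 2
    ENRICHMENT = 3
    PROVISION_FOR_TRANSKRIBUS = 4
    HTR = 5
    ONLINE_PRESENTATION = 6


def next_station(current: Station) -> Optional[Station]:
    """Return the station that follows `current`, or None after the last one."""
    if current is Station.ONLINE_PRESENTATION:
        return None
    return Station(current + 1)


def still_waiting_for(progress: Dict[str, Station], station: Station) -> List[str]:
    """List the volumes whose last completed station lies before `station`."""
    return sorted(vol for vol, done in progress.items() if done < station)


# Hypothetical volume identifiers mapped to their last completed station.
progress = {
    "volume-A": Station.SCANNING,
    "volume-B": Station.HTR,
    "volume-C": Station.PREPARATION,
}

to_enrich = still_waiting_for(progress, Station.ENRICHMENT)
```

Because the stations are ordered, a simple comparison is enough to see which volumes have not yet reached a given step, even when the volumes are not processed in chronological order.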

It proved to be helpful that we had not only strictly defined the individual steps in advance but also determined the persons responsible from the beginning, among them experts for the individual tasks as well as coordinators for the steps across stations and locations. This ensures that all parties involved always know the respective contact person. Open questions can thus be answered more easily and any problems that may arise can be solved more efficiently.

Especially with the scanning of the Spruchakten, we have not proceeded chronologically from the start. We did not scan the inventory ‘from top to bottom’. Instead, we first selected and edited individual representative volumes between 1580 and 1675. We wanted to create powerful HTR models first. Only then did we ‘fill up the gaps’. This procedure showed us how crucial it is to continuously document the progress of the project with all its individual areas and stages so that it does not become confusing. There are many ways to do this.

We keep spreadsheets, by now very colourful, on the progress of the various collections and files. However, they only depict partial processes and are only accessible to the coordinators. These spreadsheets also have to be maintained, and for this purpose the progress in the various areas has to be closely monitored.

Another possibility is the Goobi workflow. Some tasks take place on the Goobi server anyway. In addition to those, we can freely add tasks to the Goobi workflow which do not have to be related to Goobi itself. All of them can be ‘accepted’ and ‘completed’ on that platform to reflect the progress of the project. However, the condition here is that all project contributors must be familiar with this workflow system. Where this is not the case, an “external” information system must be selected that everyone can access and handle.

The different partners of our project therefore jointly keep a wiki (e-collaboration).

Posted by Elisabeth Heigl on

Scanning and Structural Data

The Spruchakten of the Greifswald Law Faculty are scanned on Bookeye4 book scanners (Image Access) in combination with the scanning software UCC (Universal Capturing Client) by Intranda. UCC not only allows the capturing of structural data while scanning, but is also directly connected to the Goobi server (also by Intranda), where the digital processes of our project (except for handwritten text recognition) are controlled. Tasks already created in Goobi (Goobi-Vorgang) can thus be accessed in UCC, ‘filled up’ with image-files and associated structural data and then be exported to the Goobi server.

We scan consistently at 400 dpi and 24-bit color depth. The original files created are saved as uncompressed TIF files. For further processing and presentation at the Digital Library Mecklenburg-Vorpommern, however, they are copied as compressed JPG files.
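The scanning parameters translate directly into storage requirements, which is one reason for keeping separate compressed copies. The following sketch estimates the raw pixel data of one uncompressed scan at 400 dpi and 24-bit color depth. The page size (A4, 210 × 297 mm) is purely an assumption for illustration, the function names are hypothetical, and TIF header overhead is ignored.

```python
MM_PER_INCH = 25.4


def pixels(length_mm: float, dpi: int = 400) -> int:
    """Number of pixels needed to cover `length_mm` at the given resolution."""
    return round(length_mm / MM_PER_INCH * dpi)


def uncompressed_bytes(width_mm: float, height_mm: float,
                       dpi: int = 400, bits_per_pixel: int = 24) -> int:
    """Raw pixel data in bytes for one scanned page, without file headers."""
    return pixels(width_mm, dpi) * pixels(height_mm, dpi) * bits_per_pixel // 8


# Assumed page size: A4. The actual format of the files may differ.
size = uncompressed_bytes(210, 297)
print(f"{size / 1_000_000:.1f} MB per page")
```

For the assumed A4 page this comes to roughly 46 MB per image, so a collection of several hundred thousand images quickly reaches the terabyte range in its uncompressed form.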

UCC enables you to capture structural data during the scanning process. This means that the scan operator can already set a structural element for related pages of the file while scanning. Every single “Responsum” (meaning each legal case in the file) receives the structural element “Vorgang”. In the later editing of the metadata, we only have to add a descriptive main title.

Posted by Dirk Alvermann on

Knowing what you want

A digitization project with Handwritten Text Recognition can have very different goals. They can range from the critical digital edition to the provision of manuscripts as full texts to the indexing of large text corpora via Key Word Spotting. All three objectives allow different approaches, which have a great influence on the technical and personnel efforts.

In this project, only the last two objectives are of interest. A critical edition is not intended, even if the full texts generated in this project could serve as the basis for one.

We aim at a complete indexing of the manuscripts by automatic text recognition. The results will then be made public online in the Digital Library Mecklenburg-Vorpommern. A search is available there, which shows the hits in the image itself. The user, who has sufficient palaeographic knowledge, can explore the context of the hit in the image himself or switch to a modern full text view, or even only use the latter.

Posted by Dirk Alvermann on

Why HTR will change it all

For some years now, archives and libraries have been dedicating more and more of their time to the digitisation of historical manuscripts. The strategies are quite different. Some would like to present their “treasures” in a contemporary manner, others would like to make more extensive collections available for use in an appropriate digital form. The advantages of digitisation are obvious. The original sources are preserved and the interested researchers and non-experts can access the material independently of place and time without having to spend days or weeks in reading rooms. Considering the practice of the 20th century, this is an enormous step forward.

Initially, such digital services provide no more than a digital image of the original historical source. They are developed and maintained at great expense, both financially and in terms of staff. If you look at the target groups of these services, you can see that they are mainly aimed at the very same people who also visit archives and libraries. These addressees usually have the ability to decipher such historical manuscripts. Optimistically speaking, we are talking about one or two percent of the population. For everyone else, these digital copies are just beautiful to look at.

Keep this picture in mind if you want to understand why Handwritten Text Recognition (HTR) is opening a whole new chapter in the history of digital indexing and use of historical documents. In a nutshell: HTR allows us to move from simple digitisation to the digital transformation of historical sources. Thanks to HTR, not only the digital image of a manuscript but also its content is made available in a form that can be read by everyone and searched by machines – over hundreds of thousands of pages.

Thus the contents of historical handwritings can be opened up to a public to whom they have so far remained closed or at least not easily accessible. This does not only address the non-professional researchers. Access to the contents of the sources will also be much easier for academic experts from disciplines that do not have historical auxiliary sciences like palaeography as part of their classical educational canon. This makes new constellations of interdisciplinary research possible. Ultimately, since the contents of the manuscripts can now be evaluated by machine, questions and methods of the Digital Humanities can be more easily applied to the material than before.

Tips & Tools
Recommendation for further reading: Mühlberger, Archiv 4.0 oder warum die automatisierte Texterkennung alles verändern wird Tagungsband Archivtag Wolfsburg, in: Massenakten – Massendaten. Rationalisierung und Automatisierung im Archiv (Tagungsdokumentationen zum Deutschen Archivtag, Band 22), hg. v. VdA, Fulda 2018, S. 145-156.

Posted by Anna Brandt on

What you find here and what you don’t

This blog mainly reports about our work with Transkribus. In addition, we also present the project workflow and our experience with the scanning processes, the applied parameters, the creation of structural and metadata and the presentation of the project results in the Viewer of the Digital Library Mecklenburg-Vorpommern.

This blog is not a manual. So don’t expect us to give step-by-step instructions for individual tasks that can be done in Transkribus (although we sometimes do). But there are a lot of good and proven How-To’s, which the Transkribus team and users have developed over the past years. Here you can read about practical experiences and some tips & tricks.

Transkribus now has two interfaces: the “Expert Client”, which you can download here, and the Web User Interface (WebUI), which you can reach at this address. This blog is almost exclusively about the Expert Client, because it provides the full functionality needed to handle challenging projects. We explain here under which circumstances and why the use of the WebUI is nevertheless useful and appropriate.

Our experience is based on a large-scale project of medium size, in which approx. 250,000 images are processed. Our focus is aligned accordingly. We use the possibilities of Transkribus to open up large quantities of documents through handwritten text recognition (HTR), to enrich them with content and to make them available online. Searchability is to be made possible by means of full text search or keyword spotting (KWS). The type of methods used and the demands placed on the results are aligned to this goal. Projects on a smaller scale may use more differentiated and subtle methods; nevertheless, there are some useful experiences for them as well.

Tips & Tools
Recommendation for further reading: Günter Mühlberger, Tamara Terbul: Handschriftenerkennung für historische Schriften. Die Transkribus Plattform