Posted by Elisabeth Heigl on

Edit multiple documents simultaneously

Version 1.15.1

Until now, we could only start the layout analysis and the HTR for the document we were currently in. Now, however, it is possible to run both steps for all documents of the collection we are currently working in. We will describe how this works in a moment, but first, here is why we are so happy about it:

In order to check the results of our freshly trained models, we have created a separate collection of Spruchakten test sets. How and why we did this is described elsewhere. This means that for each document from which we have used GT in training, there is a separate test set document in the test set collection.

When a new HTR model has finished training and we are curious to see how it compares to the previous models, we run it on each of the test sets and then calculate the CER. After more than two years of training, our test set collection is now quite full, with almost 70 test sets already in it.
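For readers who have not met the CER (character error rate) before, here is a minimal sketch of how it is commonly defined: the number of character insertions, deletions and substitutions needed to turn the HTR output into the ground truth, divided by the length of the ground truth. This is not the Transkribus implementation, which may normalise text differently. The function names and the `test_sets` dictionary are hypothetical and only serve as illustration.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character edits that turn hyp into ref."""
    previous = list(range(len(hyp) + 1))
    for i, ref_char in enumerate(ref, start=1):
        current = [i]
        for j, hyp_char in enumerate(hyp, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing in hyp
                current[j - 1] + 1,      # extra character in hyp
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of one HTR output against its ground truth."""
    if not ref:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ref, hyp) / len(ref)


def collection_cer(test_sets: dict[str, tuple[str, str]]) -> float:
    """Pooled CER: total edits divided by total ground-truth characters.

    test_sets maps a (hypothetical) test set name to a pair
    (ground truth text, HTR output text).
    """
    total_edits = sum(levenshtein(gt, out) for gt, out in test_sets.values())
    total_chars = sum(len(gt) for gt, _ in test_sets.values())
    return total_edits / total_chars


if __name__ == "__main__":
    # Invented sample data, not taken from our test sets.
    sample = {
        "testset_a": ("abcd", "abxd"),
        "testset_b": ("abcdef", "abcdef"),
    }
    for name, (gt, out) in sample.items():
        print(name, f"{cer(gt, out):.2%}")
    print("pooled", f"{collection_cer(sample):.2%}")
```

Pooling the edits over all test sets weights long documents more heavily than short ones. Averaging the per-document CER values instead is an equally valid choice, as long as the same method is used for every model being compared.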

Imagine how time-consuming it used to be to open each test set individually to start the new HTR. Even with only 40 test sets, you had to be very curious. And now imagine how much easier it is now that we can trigger the HTR (and also the LA) for all documents at the same time with one click. This should please everyone who processes a lot of small documents, such as index cards, in one collection.

And how does that work now? You can see it immediately under the layout analysis tools: In red, under “Document Selection”, there is the new option “Current collection” to be selected for the following step.

However, it is not enough to simply select “Current collection” and then start the LA; first you must always enter the selection via “Choose docs…”. Either you simply confirm the preselection (all docs in the collection) or you select individual docs.

For the HTR, the same option only appears in the selection window for “Text Recognition”. Here, too, you can select “Current collection” for the following step. And here, too, you must confirm the selection again via “Choose docs…”.

Posted by Dirk Alvermann on

Why HTR will change it all

For some years now, archives and libraries have been dedicating more and more of their time to the digitisation of historical manuscripts. The strategies differ considerably. Some would like to present their “treasures” in a contemporary manner, others would like to make more extensive collections available for use in an appropriate digital form. The advantages of digitisation are obvious. The original sources are preserved, and interested researchers and non-experts can access the material independently of place and time without having to spend days or weeks in reading rooms. Compared with the practice of the 20th century, this is an enormous step forward.

Initially, such digital services provide no more than a digital image of the original historical source. They are developed and maintained at great expense, both financially and in terms of staff. If you look at the target groups of these services, you can see that they are mainly aimed at the very same people who also visit archives and libraries. These users, however, are the ones who are usually able to decipher such historical manuscripts. Optimistically speaking, we are talking about one or two percent of the population. For everyone else, these digital copies are just beautiful to look at.

Keep this picture in mind if you want to understand why Handwritten Text Recognition (HTR) is opening a whole new chapter in the history of digital indexing and use of historical documents. In a nutshell: HTR allows us to move from simple digitisation to the digital transformation of historical sources. Thanks to HTR, not only the digital image of a manuscript but also its content is made available in a form that can be read by everyone and searched by machines, over hundreds of thousands of pages.

Thus the contents of historical manuscripts can be opened up to a public to whom they have so far remained closed or at least not easily accessible. This does not only address non-professional researchers. Access to the contents of the sources will also be much easier for academic experts from disciplines that do not have historical auxiliary sciences like palaeography as part of their classical educational canon. This makes new constellations of interdisciplinary research possible. Ultimately, since the contents of the manuscripts can now be evaluated by machine, questions and methods of the Digital Humanities can be applied to the material more easily than before.

Tips & Tools
Recommendation for further reading: Mühlberger, Archiv 4.0 oder warum die automatisierte Texterkennung alles verändern wird Tagungsband Archivtag Wolfsburg, in: Massenakten – Massendaten. Rationalisierung und Automatisierung im Archiv (Tagungsdokumentationen zum Deutschen Archivtag, Band 22), hg. v. VdA, Fulda 2018, S. 145-156.

Posted by Anna Brandt on

What you find here and what you don’t

This blog mainly reports on our work with Transkribus. In addition, we present the project workflow and our experience with the scanning processes, the applied parameters, the creation of structural data and metadata, and the presentation of the project results in the Viewer of the Digital Library Mecklenburg-Vorpommern.

This blog is not a manual. So don’t expect us to give step-by-step instructions for individual tasks that can be done in Transkribus (although we sometimes do). But there are a lot of good and proven How-To’s, which the Transkribus team and users have developed over the past years. Here you can read about practical experiences and some tips & tricks.

Transkribus now has two interfaces: the “Expert Client”, which you can download here, and the Web User Interface (WebUI), which you can reach at this address. This blog is almost exclusively about the Expert Client, because it provides the full functionality needed to handle challenging projects. We explain here under which circumstances and why the use of the WebUI is nevertheless useful and appropriate.

Our experience is based on a large-scale project of medium size, in which approx. 250,000 images are processed. Our focus is aligned accordingly. We use the possibilities of Transkribus to open up large quantities of documents through handwritten text recognition (HTR), to enrich them with content and to make them available online. The documents are to be made searchable by means of full text search or keyword spotting (KWS). The methods we use and the demands we place on the results are geared to this goal. Projects on a smaller scale may use more differentiated and subtle methods; nevertheless, they will find some useful experience here as well.

Tips & Tools
Recommendation for further reading: Günter Mühlberger, Tamara Terbul: Handschriftenerkennung für historische Schriften. Die Transkribus Plattform