Posted by Dirk Alvermann on

Our first public model for German Kurrent (17th century)

Today we proudly present our HTR model “Acta 17” as a public model.

The model was trained on more than 500,000 words from about 1,000 different writers, covering the period 1580-1705. It can handle German, Low German and Latin and is able to decipher simple German and Latin abbreviations. Besides the usual chancellery hands, the training material also contained a selection of draft writings and printed material of the period.

The entire training material is based on legal texts or court writings from the Responsa of the Greifswald Law Faculty. The validation sets are based on a chronological selection from the years 1580 – 1705. Ground truth and validation sets were produced by Dirk Alvermann, Elisabeth Heigl and Anna Brandt.

Because of problems over the last couple of weeks with running a large series of base-model trainings for HTR+ models, we decided to release an HTR+ model trained from scratch.

It is accompanied by a PyLaia model, which is based on the same training and validation sets and was also trained without using a base model.

For the validation set we chose pages representing single years from the total set of documents. Altogether they cover 48 selected years, with five pages from each year.
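The selection scheme described above (a fixed number of pages per selected year) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the post does not say how the individual pages were picked, so the random sampling, the function name `pick_validation_pages`, the page identifiers and the placeholder list of years are all hypothetical.

```python
import random
from collections import defaultdict


def pick_validation_pages(pages, pages_per_year=5, seed=0):
    """Pick up to `pages_per_year` pages for every year present in `pages`.

    `pages` is an iterable of (year, page_id) tuples. Random sampling is an
    assumption made for this sketch; the original selection method is unknown.
    """
    by_year = defaultdict(list)
    for year, page_id in pages:
        by_year[year].append(page_id)

    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    selection = {}
    for year in sorted(by_year):
        candidates = sorted(by_year[year])
        k = min(pages_per_year, len(candidates))
        selection[year] = sorted(rng.sample(candidates, k))
    return selection


# Placeholder data: 48 arbitrary years between 1580 and 1705, 20 pages each.
# These are NOT the years or pages actually used for the model.
years = sorted(random.Random(1).sample(range(1580, 1706), 48))
corpus = [(year, f"{year}_p{n:03d}") for year in years for n in range(20)]

validation = pick_validation_pages(corpus)
total_pages = sum(len(page_ids) for page_ids in validation.values())
print(len(validation), "years,", total_pages, "validation pages")
```

Keeping the selection grouped by year makes it straightforward to report results per time period later on, rather than only as one overall figure.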

You can check how the models perform across the different time periods of the validation set in the comparison below. Both models were run without a language model.

Posted by Anna Brandt on

Searching and editing tags

Release 1.11.0

If you tag large amounts of historical text, as we have tried to do with place and person names, you will sooner or later have a problem: the spelling varies a lot – or in other words, the tag values are not identical.
Let’s take places as a simple example. “Rosdogk”, “Rosstok” and “Rosdock” all refer to the same place: the city of Rostock. To make this recognizable, you use the properties. But if you do this over more than ten thousand pages with hundreds or thousands of places (we set about 15,000 tags for places in our attempt), you easily lose track. Besides, tagging takes much longer if you also assign properties as you go.

Fortunately, there is an alternative. You can search the tags, not only in the document you are working on, but in the whole collection. To do this, select the “binoculars” in the menu, just as you would to start a full-text search or KWS, but then choose the submenu “Tags”.

Here you can select the search area (Collection, Document, Page) and the level on which you want to search (Line or Word). Then you select the corresponding tag and, if you want to limit the search, enter the tagged word. The search results can also be sorted. This way we can quickly find all “Rostocks” in our collection and enter the desired additional information in the properties, such as the current name, geodata and similar. These “properties” can then be assigned to all selected tagged words. In this way, tagging and data enrichment can be separated from each other and each carried out efficiently.

The same is possible with tags like “Person” or “Abbrev” (where you would put the resolution/expansion in the properties).
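The idea of tagging first and enriching later can be illustrated with a small Python sketch. It does not use or imitate any Transkribus API: the `Tag` class, the functions `search_tags` and `assign_properties`, the property name `current_name` and the page numbers are all hypothetical and exist only to show the workflow of searching tag values across a collection and then assigning one set of properties to every selected hit.

```python
from dataclasses import dataclass, field


@dataclass
class Tag:
    tag_type: str   # e.g. "place", "person", "abbrev"
    value: str      # the tagged word exactly as it is spelled in the text
    page: int       # made-up page number within the collection
    properties: dict = field(default_factory=dict)


def search_tags(tags, tag_type, value=None):
    """Return all tags of one type, optionally limited to one tagged word.

    The comparison ignores case; results are sorted by spelling, then page.
    """
    hits = [t for t in tags if t.tag_type == tag_type]
    if value is not None:
        hits = [t for t in hits if t.value.casefold() == value.casefold()]
    return sorted(hits, key=lambda t: (t.value.casefold(), t.page))


def assign_properties(selected, **props):
    """Write the same properties to every selected tag."""
    for tag in selected:
        tag.properties.update(props)


# Made-up collection: three spellings of Rostock plus unrelated tags.
tags = [
    Tag("place", "Rosdogk", 12),
    Tag("place", "Rosstok", 87),
    Tag("place", "Rosdock", 301),
    Tag("place", "Stralsund", 15),
    Tag("person", "Rosdock", 40),
]

# Step 1: search each spelling variant among the place tags.
variants = ["Rosdogk", "Rosstok", "Rosdock"]
selected = [t for v in variants for t in search_tags(tags, "place", v)]

# Step 2: enrich all hits in one go.
assign_properties(selected, current_name="Rostock")

for t in tags:
    print(t.tag_type, t.value, t.page, t.properties)
```

The same two steps would apply to a tag such as “Abbrev”, with the expansion stored as the shared property instead of a place name.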