
Posted by Dirk Alvermann

P2PaLA vs. standard LA

Release 1.9.1

In the previous post we described that the standard LA in Transkribus does not always provide good results when the document layouts are very complicated. But for a perfect HTR result you need a perfect LA.

Especially in documents of the 16th and early 17th centuries, the CITlab Advanced LA could not convince us. It was clear to us from the beginning that the standard LA would not identify the more complex layouts (text regions) in a differentiated way. But it was the line detection that ultimately failed to meet our demands in these documents.

An example of how (in the worst case) the line detection of the standard LA worked on our material can be seen here:


[Figure: 1587, page 41]

This may be an isolated case. However, if you process large quantities of documents in Transkribus, such cases can occur more frequently. In order to evaluate the problem properly, we therefore recorded representative error statistics on two bundles of our material. It turned out that the standard LA worked here with an average of 12 line-detection errors per page (see graph below, 1598). This of course has undesirable effects on the HTR result, which we will describe in more detail in the next post.
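Such figures are simple to aggregate once the errors have been counted by hand. A minimal Python sketch of the calculation (the counts below are invented for illustration, not our actual data):

    # Manually counted line-detection errors for a handful of pages
    # (placeholder numbers, not our real counts).
    errors_per_page = [14, 9, 12, 15, 10]
    average = sum(errors_per_page) / len(errors_per_page)
    print(f"average line-detection errors per page: {average:.1f}")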

Posted by Dirk Alvermann

P2PaLA or structure training

Release 1.9.1

Page-to-page layout analysis (P2PaLA) is a form of layout analysis for which individual models can be trained – similar to the HTR. You can train structure models either to recognize text regions only or text regions with baselines. They therefore fulfill the same functions as the standard layout analysis (CITlab Advanced). P2PaLA is particularly suitable if a document has many pages with mixed layouts. The standard layout analysis usually recognizes just one TR – and this can lead to problems with the reading order of the text.

With the help of structure training, the layout analysis can learn where and how many TRs it should recognize.
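To make this concrete: below is a shortened fragment of the kind of PAGE XML Transkribus produces. A structure model is trained to deliver several typed TRs like these, in reading order, instead of a single undifferentiated region (ids, coordinates and the image name are made up for illustration):

    <Page imageFilename="0041.jpg" imageWidth="2000" imageHeight="3000">
      <TextRegion id="tr_1" custom="readingOrder {index:0;} structure {type:paragraph;}">
        <Coords points="300,200 1500,200 1500,2600 300,2600"/>
      </TextRegion>
      <TextRegion id="tr_2" custom="readingOrder {index:1;} structure {type:marginalia;}">
        <Coords points="60,400 280,400 280,900 60,900"/>
      </TextRegion>
    </Page>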

The CITlab Advanced LA often had problems identifying the text regions in a differentiated way on our material. That is why we experimented with P2PaLA early on in our project. First, we tried structure models that exclusively set text regions (Main text, marginal notes, footnotes etc.). Within the TRs generated this way, we worked with the usual line detection. However, the results were not always satisfactory.

The BLs were often too short (at the beginning or the end of the line) or torn apart many times – even on pages with simple layouts. We therefore trained another model, based on our already working P2PaLA model, that also recognizes the BLs. Our newest model recognizes all ‘simple’ pages almost without errors. For pages with very complex layouts, the results still have to be corrected, but with much less effort than before.

Posted by Anna Brandt

Structural Tagging

How exactly structural tagging is done is explained in this Wiki. In contrast to “textual” tagging, you can tag all kinds of structural elements, for example text regions, baselines or tables. In our case, only the text regions are tagged, because we use structure tagging to train a P2PaLA model.

When you create your training material and decide where to position the specific structural elements, you should stick to your choices. For example: for us a “paragraph” is always the TR at the top in the middle, the core so to speak; “marginalia” are all the notes on the left side of the image, separated from the “paragraph”. This way you can divide the images into ‘types’, i.e. groups of images in which all TRs with the same tags always lie in a certain coordinate area of the page.
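Whether the training material really follows these choices can be checked mechanically: in the exported PAGE XML, Transkribus stores the structure tag in the custom attribute of each TextRegion. A minimal sketch in Python (the file name is hypothetical):

    import re
    import xml.etree.ElementTree as ET

    # Namespace of the PAGE schema used in Transkribus exports.
    PAGE_NS = "{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}"

    def region_types(page_xml_path):
        """List (region id, structure type) for every TextRegion on a page."""
        result = []
        for region in ET.parse(page_xml_path).iter(PAGE_NS + "TextRegion"):
            # custom looks like: "readingOrder {index:0;} structure {type:paragraph;}"
            match = re.search(r"structure\s*\{type:([^;}]+);", region.get("custom", ""))
            result.append((region.get("id"), match.group(1) if match else None))
        return result

    print(region_types("page/0041.xml"))  # e.g. [('tr_1', 'paragraph'), ('tr_2', 'marginalia')]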

Tips & Tools
There are three ways to set the corresponding tag. First, right-click on the marked area and assign a tag via “assign structure type”. Second, open the section “Structural” in the “Metadata” tab, where the existing structure types are displayed. Third, define shortcuts there for tags that you use a lot: click the button “Customize” and enter a number from one to nine in the column “Shortcut”. The shortcut is then displayed in the tab; it is always Ctrl+Alt+Number.

Posted by Elisabeth Heigl

Abbreviations

Release 1.9.1

Medieval and early modern manuscripts are usually full of abbreviations in all possible variations. These can be contractions (omissions inside the word) and suspensions (omissions at the end of the word) as well as a wide variety of special characters. So if we want to transcribe old manuscripts, we must first consider how to reproduce the abbreviations: do we reproduce everything as it appears in the text, do we resolve everything – or do we adapt to the capacities of the HTR?

Basically there are three different ways to deal with abbreviations in Transkribus:

– You can try to reproduce abbreviation characters as Unicode characters. Many of the abbreviation characters used in 15th and 16th century Latin and German manuscripts can be found in the Unicode block “Latin Extended-D”. For special characters written in medieval Latin texts, check the Medieval Unicode Font Initiative. Whether and when this path makes sense depends entirely on the goals of your own project – it is quite laborious in any case (a few such code points are shown in the first sketch after this list).

– If you don’t want to work with Unicode characters, you could also use the “basic letter” of the abbreviation from the regular alphabet – like a literal transcription. Such a “placeholder” can then be provided with a textual tag that marks the word as an abbreviation (“abbrev”). How the tagged abbreviation is to be resolved can then be entered for each tag as “expansion”.

Thus the resolution of the abbreviation becomes part of the metadata (the second sketch after this list shows how such tags can be collected from the exported pages). This approach offers the most possibilities for further use of the material. But it is also very laborious, because each and every abbreviation has to be tagged.

– Or you just resolve the abbreviations. If you want to make large quantities of full text searchable, as we do, it makes sense to resolve the abbreviations consistently, because it makes the search easier: who is looking for “pfessores” instead of “professores”? In our experience, the HTR can handle abbreviations quite well – the classic Latin and German abbreviations as well as currency symbols and other special characters. This is why we resolve most abbreviations during transcription and use them as part of the Ground Truth in HTR training.
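For the first approach, a few of the Latin Extended-D code points can be inspected directly in Python. The conventional expansions in the comments are our reading of the signs, not part of the Unicode data:

    import unicodedata

    # Some abbreviation signs from the Unicode block "Latin Extended-D".
    signs = {
        "\uA751": "per/par",  # p with stroke through descender
        "\uA753": "pro",      # p with flourish
        "\uA76F": "con",      # the medieval "con" sign
    }
    for char, expansion in signs.items():
        print(f"{char} U+{ord(char):04X} {unicodedata.name(char)} -> {expansion}")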
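For the second approach, the “abbrev” tag and its “expansion” end up in the custom attribute of the TextLine when the pages are exported as PAGE XML. Under that assumption (the regex and the file layout are ours), the pairs can be collected afterwards:

    import re
    import xml.etree.ElementTree as ET

    PAGE_NS = "{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}"

    def abbreviations(page_xml_path):
        """Collect (abbreviated form, expansion) pairs from tagged lines."""
        pairs = []
        for line in ET.parse(page_xml_path).iter(PAGE_NS + "TextLine"):
            text = ""
            text_el = line.find(PAGE_NS + "TextEquiv/" + PAGE_NS + "Unicode")
            if text_el is not None and text_el.text:
                text = text_el.text
            # e.g. custom="readingOrder {index:3;} abbrev {offset:4; length:9; expansion:professores;}"
            for offset, length, expansion in re.findall(
                    r"abbrev\s*\{offset:(\d+);\s*length:(\d+);\s*expansion:([^;}]+);",
                    line.get("custom", "")):
                pairs.append((text[int(offset):int(offset) + int(length)], expansion))
        return pairs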

The models we train have learned some abbreviations very well. The abbreviations frequently used in the manuscripts, such as the suffix “-en”, can be resolved by an HTR model – if it has been taught consistently.
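“Consistently” can even be checked mechanically. A toy sketch (our own illustration, not a Transkribus feature) that flags Ground Truth lines still containing a form the project decided to resolve everywhere:

    # Project-specific table of forms we resolve everywhere (assumed example).
    RESOLVED = {"pfessores": "professores"}

    def flag_unresolved(lines):
        """Print Ground Truth lines that still contain an unresolved form."""
        for number, line in enumerate(lines, start=1):
            for raw, resolved in RESOLVED.items():
                if raw in line:
                    print(f"line {number}: found '{raw}', expected '{resolved}'")

    flag_unresolved(["die pfessores der universitet", "die professores alhier"])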

But more complex abbreviations, especially the contractions, do cause difficulties for the HTR. In our project we have therefore decided to reproduce such abbreviations only in literal form.

In our Collection of Abbreviations we present the many different abbreviations that we find in our material from the 16th to the 18th century. We also show how we (and later the HTR models) resolve them. We will update this collection from time to time – work in progress!