Category Archives

Posted by Anna Brandt on

Transcribing without layout analysis?

Release 1.10.1

We have emphasized in previous posts how important layout analysis (LA) is. Without it, an HTR model, no matter how good it is, has no chance of transcribing a text properly. The steps of automatic LA (or a P2PaLA model) and HTR are usually initiated separately. We have now noticed that when an HTR model runs over a completely new or unedited page, the program automatically executes an LA.

This LA runs with the default settings of CITLab-Advanced LA. On pure pages, fewer lines have to be merged and sometimes more than one text region is recognized.

But it also means that only horizontal text is recognized. We had the same problem with our P2PaLA models. Anything that is slanted or vertical cannot be recognized this way. To capture such text, the LA must be initiated manually, with the setting ‘Text Orientation’ set to ‘Heterogeneous’.

Interestingly, the HTR results are better with this method than with an HTR that has been run over a corrected layout analysis. We have calculated the CER for some pages to show this.
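The CER (character error rate) is commonly computed as the edit distance between a reference transcription and the HTR output, divided by the length of the reference. The following is a minimal sketch of that standard definition, not the Transkribus implementation; the function names are chosen for illustration only.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance relative to the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# One wrong character in a four-character reference gives a CER of 0.25.
print(cer("abcd", "abed"))
```

Comparing two workflows then means computing this value against the same reference transcription for each of them, page by page.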

Thus this method is a very good alternative, especially for pages with an uncomplicated layout. You save time, because you only have to initiate one process, and in the end you have a better result.

Posted by Anna Brandt on

Tools in the Layout tab

Release 1.10.

The layout tab has two more tools, which we did not mention in our last post. They are especially useful for correcting the layout and save you from annoying detail work.
The first one corrects the Reading Order. If one or more text regions are selected, this tool automatically arranges the child shapes, in this case the lines and baselines, according to their position in the coordinate system of the page. So the reading order starts at the top left and continues counting from there to the bottom right. In the example below, a text region (TR) was split but the reading order (RO) of the marginal notes got mixed up. This tool now saves you from renaming each baseline (BL) individually.
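The idea of ordering lines by their position on the page can be sketched in a few lines. This is only an illustration under our own assumptions: the `Line` class is hypothetical, image coordinates have their origin at the top left, and lines are sorted by the mean height of their baseline first and their left edge second. How Transkribus breaks ties internally is not something we know.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Line:
    line_id: str                      # hypothetical identifier
    baseline: list[tuple[int, int]]   # (x, y) points, origin at top left


def sort_reading_order(lines: list[Line]) -> list[Line]:
    """Order lines from top left to bottom right."""
    def key(line: Line) -> tuple[float, int]:
        mean_y = sum(y for _, y in line.baseline) / len(line.baseline)
        left_x = min(x for x, _ in line.baseline)
        return (mean_y, left_x)
    return sorted(lines, key=key)


lines = [
    Line("c", [(10, 300), (200, 305)]),
    Line("a", [(10, 100), (200, 102)]),
    Line("b", [(10, 200), (200, 198)]),
]
print([line.line_id for line in sort_reading_order(lines)])  # ['a', 'b', 'c']
```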

The second tool (“assign child shapes”) helps to assign the BLs to the correct TR. This becomes necessary when text regions are cut or when baselines run through multiple TRs. Without the tool, each BL has to be marked in the layout tab and moved to the correct TR. To assign them automatically, just select the TR to which the BLs belong and start the tool.
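To illustrate what assigning baselines to regions involves, here is a small sketch. It rests on an assumption of ours, namely that a baseline belongs to the region whose polygon contains the baseline's midpoint; the function names and data layout are hypothetical and do not reflect how Transkribus decides internally.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: is the point inside the polygon (list of (x, y))?"""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_baselines(regions, baselines):
    """Map each baseline id to the first region containing its midpoint."""
    result = {}
    for bl_id, pts in baselines.items():
        mid = (sum(x for x, _ in pts) / len(pts),
               sum(y for _, y in pts) / len(pts))
        result[bl_id] = next(
            (name for name, poly in regions.items()
             if point_in_polygon(mid, poly)),
            None,  # no region contains this baseline
        )
    return result


regions = {
    "main": [(100, 0), (500, 0), (500, 400), (100, 400)],
    "margin": [(0, 0), (90, 0), (90, 400), (0, 400)],
}
baselines = {"bl1": [(120, 50), (480, 52)], "bl2": [(5, 60), (85, 60)]}
print(assign_baselines(regions, baselines))
```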

Posted by Anna Brandt on

P2PaLA – line detection and HTR

Release 1.9.1

As already mentioned in a previous post, we noticed in the course of our project that the CITlab Advanced LA does not optimally identify the layout in our material. This happens not only on the ‘bad’ pages with mixed layouts, but also on simple layouts, i.e. on pages without any marginalia at the edge, large deletions in the text or similar. Here the automatic LA recognizes the TR correctly, but the baselines are often faulty.

This is not only confusing when the full text is displayed later; an insufficient LA also influences the result of the HTR. No matter how good your HTR model is: if the LA does not offer adequate quality, the transcription suffers.

The HTR does not read single characters; it works line by line and recognizes patterns. If the line detection has not identified the lines correctly (for example because letters or words were missed by the LA), this often produces wrong HTR results. This can have dramatic effects on the accuracy rate of a page or an entire document, as our example shows.


1587, page 41

For this reason we have trained a P2PaLA model which also detects BLs. That was very helpful. It is not possible to calculate statistics like CERs for these layout models, but judging visually, it seems to work almost error-free on ‘simple’ pages. In addition, postprocessing is no longer necessary in many cases.

The training material for such a model is created in a similar way to that for models that should recognize TRs only. The individual baselines do not have to be tagged manually for the structural analysis, even if the model does so later in order to assign them to the tagged TR. With the support of the Transkribus team and training material of 2500 pages, we were able to train the structural model that we use today instead of the standard LA.

Posted by Anna Brandt on

P2PaLA – Postprocessing

Release 1.9.1

Especially at the beginning of the development of a structure model, we noticed that the model recognized every irregularity in the layout as a TR. This leads to an excessive and unnecessary number of text regions. Many of these TRs were also extremely small.

The more training material you invest, the smaller the problem. In our case these mini TRs disappeared after we had trained our model with about 1000 pages. Until then, they are annoying because removing them all by hand is tedious.

To reduce this labour you have two options. First, when starting P2PaLA, you can determine how large the smallest TR is allowed to be. For this you have to select the corresponding value in the “P2PaLA structure analysis tool” before starting the job (“Min area”).

If this option does not bring the expected success, there is the option “remove small textregions”. You will find this on the left toolbar, under the item “other segmentation tools”. In the menu you can set the pages on which the filter should run as well as the size of the TR to be removed. The size is given as “Threshold percentage of image size”. Here the value can be set more finely than with the option mentioned above. If the images, as with our material, often contain small notes, for example marginalia where there is only a single word in a TR, then the smallest or second smallest possible value should be chosen. We usually use a “Threshold percentage” of 0.005.

Even with a good structural model, individual TRs may still have to be manually merged, split or removed, but to a much lesser extent than the standard LA would require.

Tips & Tools
Important: If you want to be sure that you don’t remove too many TRs, you can start with a “dry run”. Then the number of potentially removable TRs will be listed. As soon as you uncheck the box, the affected TRs will be deleted immediately.
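The logic of such a size filter with a dry run can be sketched as follows. This is an illustration, not the Transkribus code. In particular we assume that the threshold is read literally as a percentage of the image area (so 0.005 means 0.005 % of all pixels) and that the region size is the polygon area; both are assumptions on our part, as are all names used.

```python
def polygon_area(points):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


def remove_small_regions(regions, image_width, image_height,
                         threshold_percent, dry_run=True):
    """Return (regions, names_of_small_regions).

    With dry_run=True the regions are returned unchanged and the small
    ones are only listed; with dry_run=False they are actually dropped.
    Assumption: threshold_percent is a percentage of the image area.
    """
    min_area = image_width * image_height * threshold_percent / 100.0
    small = [name for name, poly in regions.items()
             if polygon_area(poly) < min_area]
    if dry_run:
        return regions, small
    kept = {name: poly for name, poly in regions.items() if name not in small}
    return kept, small


regions = {
    "main": [(100, 100), (900, 100), (900, 1600), (100, 1600)],
    "note": [(0, 0), (100, 0), (100, 30), (0, 30)],
    "speck": [(0, 0), (5, 0), (5, 5), (0, 5)],
}
_, candidates = remove_small_regions(regions, 1000, 2000, 0.005, dry_run=True)
print(candidates)  # ['speck']
```

Running with `dry_run=True` first and inspecting the listed candidates mirrors the cautious workflow described in the tip above.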

Posted by Anna Brandt on

P2PaLA – Training for Textregions

Release 1.9.1

Elsewhere in this blog you can find information and tips for structure tagging. This kind of tagging can be good for a lot of things – the following is about its use for an improved layout analysis, because structure tagging is an important part of training P2PaLA models.

With our mixed layouts the standard LA simply had to fail. The material was too extensive for a manual creation of the layout. So we decided to try P2PaLA. For this we created training material for which we selected particularly “difficult” but at the same time “typical” pages. These were pages that contained, in addition to the actual main text, comments and additions and the like.


coll: UAG Strukturtagging, doc. UAG 1618-1, image 12

For the training material only the correctly drawn and tagged text regions are important. No additional line detection or HTR is required. However, it does no harm either, so you can include pages that have already been completely edited in the training. If you take new pages on which only the TRs have to be drawn and tagged, you’ll be faster. Then you can prepare eighty to one hundred pages for training in one hour.

While we had tagged seven different structure types with our first model, we later reduced the number to five. In our experience, too fine a differentiation of the structure types has a rather negative effect on the training.

Of course, the success of the training also depends on the amount of training material you invest. According to our experience (and based on our material) you can make a good start with 200 pages, with 600 pages you get a model you can already work with; from 2000 pages on it is very reliable.

Tips & Tools
When you create the material for structure training, it is initially difficult to realize that this is not about content. That means no matter what the content is, the TR in the middle is always the paragraph. Even if there is only one note in the middle and the concept is much longer and more important. This is the only way to really recognize the necessary patterns during training.

Posted by Dirk Alvermann on

P2PaLA vs. standard LA

Release 1.9.1

In the previous post we described that if the document layouts are very complicated, the standard LA in Transkribus does not always provide good results. But for a perfect HTR result you need a perfect LA.

Especially in the documents of the 16th and early 17th century the CITlab Advanced LA could not convince us. It was clear to us from the beginning that the standard LA wouldn’t identify the more complex layouts (text regions) in a differentiated way. However, it was the line detection that ultimately failed to meet our demands in these documents.

An example of how (in the worst case) the line detection of the standard LA worked on our material can be seen here:


1587, page 41

This may be an isolated case. However, if you process large quantities of documents in Transkribus, such cases may occur more frequently. In order to be able to evaluate the problem correctly, we have therefore recorded representative error statistics on two bundles of our material. We found that the standard LA produced an average of 12 line detection errors per page here (see graph below, 1598). This of course has undesirable effects on the HTR result, which we will describe in more detail in the next post.

Posted by Dirk Alvermann on

P2PaLA or structure training

Release 1.9.1

Page-to-page layout analysis (P2PaLA) is a form of layout analysis that can be trained as individual models, similar to the HTR. You can train structure models either to recognize text regions only or text regions with baselines. They therefore fulfill the same functions as the standard layout analysis (CITlab Advanced). P2PaLA is particularly suitable if a document has many pages with mixed layouts. The standard layout analysis usually recognizes just one TR – and this can lead to problems with the reading order of the text.

With the help of a structure training, the layout analysis can learn where and how many TRs it should recognize.

The CITlab Advanced LA often had problems identifying the text regions in a differentiated way on our material. That’s why we experimented with P2PaLA early on in our project. First, we tried out structural models that exclusively set text regions (main text, marginal notes, footnotes etc.). Within the TRs thus generated, we worked with the usual line detection. However, the results were not always satisfactory.

The BLs were often too short (at the beginning or the end of the line) or were broken up many times – even on pages with simple layouts. Therefore we trained another model that also recognizes the BLs, based on our already working P2PaLA model. Our newest model recognizes all ‘simple’ pages almost without any errors. For pages with very complex layouts, the results still have to be corrected, but with much less effort than before.

Posted by Anna Brandt on

layout tab

Release 1.7.1

If you correct the automatic layout analysis, you can do this directly in the image or navigate via the layout tab on the left side. There, all shapes, like the text regions and the baselines, are displayed with their position in the image and their structural tags. It is possible to delete or move shapes. In the image you can always see where you are at the moment and which element is currently marked, and thus what you can change.

If you want to merge two baselines, just mark them in the layout tab instead of trying to hit the thin line in the image.

The navigation in the tab is especially useful if you want to see the complete image in the right window. This way you keep a better overview, because everything in the image and in the tab will be changed at the same time.

Tips & Tools
You can change the reading order of the baselines in the layout tab either by moving the lines or by clicking and changing the number in the column “Reading Order”.

Posted by Dirk Alvermann on

mixed layouts

Release 1.7.1

The CITlab Advanced Layout Analysis handles most “ordinary” layouts well – in 90% of the cases. Let’s talk about the remaining 10%.

We already discussed how to proceed in order to avoid trouble with the Reading Order. But what happens if we have to deal with really mixed – crazy – layouts, e.g. concept writings?

With complicated layouts, you’ll quickly notice that the manually drawn TRs overlap. That’s not good – because in such overlapping text regions the automatic line detection doesn’t work reliably. This problem is easily solved because TRs can have shapes other than rectangles. They can be drawn as polygons and are therefore easily separated from each other.
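Whether simple rectangular regions overlap can be checked mechanically. The sketch below is purely illustrative and assumes that each region is given as a rectangle `(x_min, y_min, x_max, y_max)`; the names and the data layout are hypothetical. Pairs reported here are candidates for redrawing as polygons.

```python
from itertools import combinations


def rects_overlap(a, b):
    """True if two rectangles (x_min, y_min, x_max, y_max) share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def overlapping_pairs(regions):
    """List all pairs of region names whose rectangles overlap."""
    return [(name_a, name_b)
            for (name_a, rect_a), (name_b, rect_b)
            in combinations(regions.items(), 2)
            if rects_overlap(rect_a, rect_b)]


regions = {
    "main": (100, 100, 900, 1200),
    "margin": (50, 300, 150, 500),    # reaches into the main text block
    "footer": (100, 1300, 900, 1400),
}
print(overlapping_pairs(regions))  # [('main', 'margin')]
```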

It makes sense to add structural tags if there are many text regions in order to be able to distinguish them better. You can also assign them to certain processing routines during later processing. This is a small effort with great benefits, because the structural tagging is not more complex than the tagging in context.

Tips & Tools
Automatic line detection can be a real challenge here. Sections where you can already predict (with a little experience) that it won’t work are best handled manually. For automatic line detection, CITlab Advanced should be configured so that the default text orientation setting is replaced by “heterogeneous”. The LA will now take both horizontal and vertical or skewed and oblique lines into account. This will take a little longer, but the result will be better.

If such complicated layouts are a continuous characteristic of your material, then it is worth setting up a P2PaLA training. This will create your own layout analysis model that is tailored to the specific challenges of your material. By the way, structure tagging is the basic requirement for such training.

Posted by Anna Brandt on

Toolbar – the most important tools and how to use them #2

Release 1.7.1

Correcting layouts

If the basic text regions are drawn, they can be edited. If you select one of the text regions, the other tools on the toolbar will be enabled.

With 1 you can add one or more points to the selected shape (TR or BL!). All shapes consist of dots and straight lines connecting them. You can edit the shape by moving the dots. You can use this tool to make a polygon out of the basic text region, whatever fits the text block best. Press 2 to remove a dot from the selected shape. This tool is especially useful for correcting or shortening baselines, for example after you have split elements.

With 3, 4 and 5 it is possible to cut a selected shape. This works for both text regions and baselines: 3 cuts horizontally, 4 vertically. With 5 you draw your own line, which does not necessarily have to be horizontal or vertical.

The last important tool (red circle) is the Merge tool. This is especially important if the automatic LA has split baselines in the image. You can use Merge to reassemble shapes of the same kind: baselines with baselines and text regions with text regions. To do this you have to mark the corresponding shapes, which you can do directly in the image or in the layout tab.
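Conceptually, merging baselines means joining their dots into one shape. A minimal sketch, assuming roughly horizontal text so that the dots can simply be ordered from left to right; the function name is hypothetical and this is not the way Transkribus necessarily does it.

```python
def merge_baselines(*baselines):
    """Join several baselines (lists of (x, y) dots) into a single one.

    Assumes roughly horizontal text: dots are ordered by x (then y),
    and exact duplicates are dropped.
    """
    points = {point for baseline in baselines for point in baseline}
    return sorted(points)


left_part = [(10, 100), (280, 101)]
right_part = [(300, 102), (500, 101)]
print(merge_baselines(right_part, left_part))
# [(10, 100), (280, 101), (300, 102), (500, 101)]
```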

 

Tips & Tools
When splitting, note that the TR and BL can only be cut where they have lines. It is not possible to cut through the dots.
Be aware that when you split a shape Transkribus will automatically change the Reading Order. For example, if two TRs are made from one, a new reading order is started in each TR.
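The rule that a cut must cross a line between two dots can be illustrated with a vertical cut through a baseline. The sketch below is an assumption-laden illustration: the function name is hypothetical, the baseline is treated as running from left to right, and the new dot at the cut position is placed in both halves, which may differ from what Transkribus does.

```python
def split_baseline_at_x(points, x_cut):
    """Split a baseline (list of (x, y) dots) with a vertical cut at x_cut.

    The cut must cross a segment between two dots; cutting exactly
    through a dot or outside the baseline raises a ValueError.
    """
    pts = sorted(points)
    for i, ((x1, y1), (x2, y2)) in enumerate(zip(pts, pts[1:])):
        if x1 < x_cut < x2:
            # Interpolate the height of the baseline at the cut position.
            y_cut = y1 + (y2 - y1) * (x_cut - x1) / (x2 - x1)
            cut = (x_cut, y_cut)
            return pts[:i + 1] + [cut], [cut] + pts[i + 1:]
    raise ValueError("the cut must cross a line between two dots")


left, right = split_baseline_at_x([(0, 100), (100, 100), (200, 110)], 150)
print(left)   # [(0, 100), (100, 100), (150, 105.0)]
print(right)  # [(150, 105.0), (200, 110)]
```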