P2PaLA – line detection and HTR
As already mentioned in a previous post, we noticed in the course of our project that the CITLabAdvanced-LA does not optimally identify the layout in our material. This happens not only on the ‘bad’ pages with mixed layouts, but also on simple layouts, i.e. on pages without any marginalias at the edge, great deletions in the text or similar. Here the automatic LA recognizes the TR correctly, but the baselines are often faulty.
This is not only confusing when the full text is displayed later; an insufficient LA also influences the result of the HTR. No matter how good your HTR model is: if the LA does not offer adequate quality, it is a problem.
The HTR does not read the single characters, but works line based and should recognize patterns. But if the line detection did not identify the lines correctly (in case letters or words were not recognized by the LA) this often produces wrong HTR results. This can have dramatic effects on the accuracy rate of a page or an entire document, as our example shows.
For this reason we have trained a P2PaLA model which also detects BLs. That was very helpful. It is not possible to calculate statistics like CERs for these layout models, but from the visual point it seems to work almost error-free on ‘simple’ pages. In addition, a postprocessing is no longer necessary in many cases.
The training material for such a model is created in a similar way to models that should recognize TRs only. The individual baselines do not have to be tagged manually for the structural analysis, even if the model does so later in order to assign them to the tagged TR. With the support of the Transkribus team and a training material of 2500 pages, we were able to train the structural model that we use today instead of the standard LA.