Training sets & test sets (for Beginners)

When we train an HTR model, we create training sets and test sets, both based on Ground Truth. In the next posts on this topic you will learn more about them, in particular that the two sets must not be mixed. But what exactly is the difference between the two, and what are they used for?

Training and test sets are very similar in the material they contain. The material in both sets should come from the same hands and have the same status (GT). The difference lies in how Transkribus uses them to create a new model: the program works through the training set in a hundred (or more) rounds (epochs). Imagine taking a test a hundred times, for practice, so to speak. Each time, after you have gone through all the pages, you get the solution and can look at your mistakes. Then you start again with the same exercise. Of course you get better and better. In the same way, Transkribus learns a bit more with each pass.

After each round on the training set, what has been learned is checked on the test set. Imagine your test again. This time you take the test and get the grade, but nobody tells you what you got wrong. So Transkribus goes through the same test pages many times, but never sees the right solution. The model has to rely on what it learned from the training set, and you can see how well it has studied.
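For readers who like to see this in code, here is a minimal Python sketch of how a collection of Ground Truth pages could be divided into a training set and a test set that share no pages. It is only an illustration, not how Transkribus does it internally: the function name `split_pages`, the page IDs and the 10% test share are all assumptions made up for this example.

```python
import random


def split_pages(page_ids, test_fraction=0.1, seed=42):
    """Split page IDs into two lists that have no page in common.

    test_fraction and seed are illustrative assumptions, not
    values prescribed by Transkribus.
    """
    pages = sorted(set(page_ids))      # drop accidental duplicates
    rng = random.Random(seed)          # fixed seed: same split every run
    rng.shuffle(pages)
    n_test = max(1, round(len(pages) * test_fraction))
    test = pages[:n_test]              # only used for checking the model
    train = pages[n_test:]             # the model practises on these
    return train, test


# Hypothetical collection of 50 Ground Truth pages.
pages = [f"page_{i:03d}" for i in range(1, 51)]
train_set, test_set = split_pages(pages)

# No page may appear in both sets.
assert set(train_set).isdisjoint(test_set)
print(len(train_set), "training pages,", len(test_set), "test pages")
```

Sorting and seeding before the shuffle means the same pages end up in the test set every time, which keeps results comparable between training runs.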

So if the test set contained the same pages as the training set, Transkribus could “cheat”. It would already know the pages, having practised on them a hundred times and seen the solution a hundred times. This is also why the CER (Character Error Rate) on the training set is almost always lower than on the test set. This is best seen in the “learning curve” of a model.
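To give a feeling for what the CER measures, here is a small Python sketch using the common definition: the number of character edits (insertions, deletions, substitutions) needed to turn the model's output into the Ground Truth, divided by the length of the Ground Truth. Whether Transkribus computes its CER in exactly this way is an assumption here; the function names and the example strings are invented for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))     # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                     # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,           # delete ca
                curr[j - 1] + 1,       # insert cb
                prev[j - 1] + cost,    # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate of hypothesis measured against reference."""
    if not reference:
        raise ValueError("reference (Ground Truth) must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


ground_truth = "Ground Truth"          # what a person transcribed
recognised = "Gronnd Trutb"            # hypothetical model output
print(f"CER: {cer(ground_truth, recognised):.1%}")
```

In this made-up example two of the twelve characters are wrong. Computing the same figure once on training pages and once on test pages after every epoch gives the two lines of a learning curve.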