Ancient manuscripts are a rich source of information about history and civilization. They contain several distinct patterns, or layers, of information.

Unfortunately, these documents are often affected by various age- and storage-related degradations that reduce their readability and information content. The damage usually appears as damp spots, mold, or ink seeping through from the reverse side, obscuring the main text.

In a new study published in PLoS ONE, scientists proposed a document restoration method that removes the unwanted, disruptive degradation patterns found in old color manuscripts. The method works on single-sided RGB images of manuscripts, avoiding the need for recto-verso alignment, and analyzes their content by individually detecting and locating the constituent patterns.

Virtual restoration is then obtained by filling in the unwanted interfering content with the background texture.

Scientists noted, “It is clear that such an approach can facilitate the performance of other tasks in addition to virtual restoration, for example document binarization or text extraction, and analysis of geometric and logical page layout.”

“Unlike binary restoration, the main emphasis is on restoring the aesthetic appearance of the manuscript, which is important when processing old documents. So we combine three different color space information to create a feature space that can capture all the necessary information to distinguish the classes (foreground text, background medium, and various other different information/degradation patterns) by the differences, even the smallest ones, in their spectral reactions.”

Specifically, the scientists associated each pixel with its representations in the RGB, CIELUV, and CIELAB color spaces, along with its spatial location in the image. The spatial smoothness constraint imposed by the pixel location information is particularly well suited to describing the color homogeneity commonly observed in typical manuscript patterns.
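As a rough illustration of the feature construction described above, the following Python sketch builds one 11-dimensional vector per pixel from its RGB, CIELUV, and CIELAB values plus its row and column position. The sRGB/D65 conversion constants, the scaling of the spatial coordinates to the range 0 to 1, and all function names are assumptions made for this example, not details taken from the study.

```python
import numpy as np

# Assumption: input is sRGB with a D65 reference white.
WHITE = np.array([0.95047, 1.0, 1.08883])
M_RGB2XYZ = np.array([[0.4124564, 0.3575761, 0.1804375],
                      [0.2126729, 0.7151522, 0.0721750],
                      [0.0193339, 0.1191920, 0.9503041]])

def srgb_to_xyz(rgb):
    """rgb: float array (..., 3) in [0, 1] -> CIE XYZ."""
    lin = np.where(rgb <= 0.04045, rgb / 12.92,
                   ((rgb + 0.055) / 1.055) ** 2.4)
    return lin @ M_RGB2XYZ.T

def xyz_to_lab(xyz):
    """CIE XYZ -> CIELAB (L*, a*, b*)."""
    t = xyz / WHITE
    d = 6 / 29
    f = np.where(t > d ** 3, np.cbrt(t), t / (3 * d ** 2) + 4 / 29)
    L = 116 * f[..., 1] - 16
    a = 500 * (f[..., 0] - f[..., 1])
    b = 200 * (f[..., 1] - f[..., 2])
    return np.stack([L, a, b], axis=-1)

def xyz_to_luv(xyz):
    """CIE XYZ -> CIELUV (L*, u*, v*)."""
    X, Y, Z = xyz[..., 0], xyz[..., 1], xyz[..., 2]
    denom = X + 15 * Y + 3 * Z
    denom = np.where(denom == 0, 1.0, denom)   # avoid 0/0 on black pixels
    u_p, v_p = 4 * X / denom, 9 * Y / denom
    wd = WHITE[0] + 15 * WHITE[1] + 3 * WHITE[2]
    un, vn = 4 * WHITE[0] / wd, 9 * WHITE[1] / wd
    yr = Y / WHITE[1]
    L = np.where(yr > (6 / 29) ** 3, 116 * np.cbrt(yr) - 16,
                 (29 / 3) ** 3 * yr)
    return np.stack([L, 13 * L * (u_p - un), 13 * L * (v_p - vn)], axis=-1)

def pixel_features(img):
    """img: uint8 array (H, W, 3) -> (H*W, 11) feature matrix.

    Columns: R, G, B, L*, u*, v*, L*, a*, b*, row, col.
    """
    h, w, _ = img.shape
    rgb = img.astype(np.float64) / 255.0
    xyz = srgb_to_xyz(rgb)
    rows, cols = np.mgrid[0:h, 0:w]
    # Assumption: spatial coordinates are normalized to [0, 1].
    pos = np.stack([rows / max(h - 1, 1), cols / max(w - 1, 1)], axis=-1)
    feats = np.concatenate([rgb, xyz_to_luv(xyz), xyz_to_lab(xyz), pos],
                           axis=-1)
    return feats.reshape(h * w, 11)
```

In practice the color and position columns would probably need some relative weighting or standardization before clustering, since they live on very different scales; the sketch leaves that choice open.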

Gaussian mixture model (GMM) based clustering is then used for pixel-based classification. To improve and accelerate the GMM clustering, the scientists performed a principal component analysis (PCA) of the initial data space to decorrelate it and reduce its dimension without losing information.
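The PCA-then-GMM idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the study's implementation: the variance threshold `var_keep`, the diagonal-covariance simplification, the quantile-based initialization, and the function names are all hypothetical choices made for the example.

```python
import numpy as np

def pca_decorrelate(X, var_keep=0.99):
    """Project X (n, d) onto its leading principal components.

    var_keep is an illustrative threshold, not a value from the study.
    """
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]           # re-sort to descending
    evals, evecs = evals[order], evecs[:, order]
    ratio = np.cumsum(evals) / evals.sum()
    k = min(int(np.searchsorted(ratio, var_keep)) + 1, X.shape[1])
    return Xc @ evecs[:, :k]                  # decorrelated, reduced data

def gmm_diag(Z, n_classes, n_iter=50):
    """EM for a diagonal-covariance GMM; returns one hard label per row."""
    n, d = Z.shape
    # Assumption: deterministic init at quantiles of the first component.
    pos = np.linspace(0, n - 1, n_classes + 2)[1:-1].astype(int)
    means = Z[np.argsort(Z[:, 0])[pos]].copy()
    var = np.tile(Z.var(axis=0) + 1e-6, (n_classes, 1))
    weights = np.full(n_classes, 1.0 / n_classes)
    for _ in range(n_iter):
        # E-step: responsibilities, computed in the log domain for stability
        diff = Z[:, None, :] - means[None, :, :]
        log_p = -0.5 * (diff ** 2 / var + np.log(2 * np.pi * var)).sum(axis=2)
        log_p += np.log(weights)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and per-dimension variances
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = (resp.T @ Z) / nk[:, None]
        diff = Z[:, None, :] - means[None, :, :]
        var = (resp[:, :, None] * diff ** 2).sum(axis=0) / nk[:, None] + 1e-6
    return resp.argmax(axis=1)
```

Because the PCA output is decorrelated, a diagonal covariance per class is a less restrictive simplification than it would be on the raw features, and the lower dimension shortens each EM iteration. On a real page, the per-row labels would be reshaped to the image height and width to obtain the segmentation map.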

The team found that PCA is especially beneficial for K-means clustering, removing correlations among the data and improving the quality of the segmentation. The principal components captured the essential information and organized it more coherently.

After segmentation, a virtually restored image of the manuscript with all its informative content is generated by selectively replacing the detected degradation pixels with appropriate fill-in pixels reproducing the textured background.
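The selective replacement step can be pictured with a deliberately simple stand-in: each pixel labelled as degradation is overwritten with a color drawn at random from the pixels labelled as background. The study's actual fill-in procedure for reproducing the textured background is not described here, so the sampling strategy, the label values, and the function name below are assumptions for illustration only.

```python
import numpy as np

def fill_degradation(img, labels, background, degradation, seed=0):
    """Return a copy of img with degradation pixels replaced.

    img:         uint8 array (H, W, 3)
    labels:      int array (H, W), one class label per pixel
    background:  label of the background class
    degradation: list of labels treated as unwanted content
    """
    rng = np.random.default_rng(seed)
    out = img.copy()                              # leave the input untouched
    bg_pixels = img[labels == background]         # (n_bg, 3) candidate colors
    if bg_pixels.shape[0] == 0:
        raise ValueError("no background pixels to sample from")
    mask = np.isin(labels, degradation)
    # Assumption: independent random sampling stands in for texture synthesis.
    picks = rng.integers(0, bg_pixels.shape[0], size=int(mask.sum()))
    out[mask] = bg_pixels[picks]
    return out
```

Foreground text and any other class not listed in `degradation` pass through unchanged, which matches the goal of keeping all informative content. Random sampling preserves the background's color distribution but not its spatial structure, so a patch-based fill would likely look more natural on a real page.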

Scientists noted, “The results show that the proposed method can be satisfactorily used to remove the interference commonly found in ancient manuscripts and to extract typical salient features.”

In their experiments, the scientists evaluated the new method on a series of old color document images. They compared the results with those of a recently published method for removing bleed-through, one of the most damaging degradations in ancient manuscripts.

For comparison, the scientists used images from the well-known Database of Ancient Documents, which contains 25 pairs of recto-verso images of ancient manuscripts affected by various levels of bleed-through, along with hand-created binary ground truth images of the foreground text.

Scientists noted, “It is worth noting that although this database mainly focuses on show-through effects, our method can also be used to remove other document degradations, such as smudges, folding marks, etc.”

Journal reference:

  1. Hanif M, Tonazzini A, Hussain SF, Khalil A, Habib U (2023) Restoration and content analysis of ancient manuscripts via color space-based segmentation. PLoS ONE 18(3): e0282142. DOI: 10.1371/journal.pone.0282142