01/08/1995

(Chem)DeTeX: automatic generation of a markup language description of (chemical) documents from bitmap images

Third International Conference on Document Analysis and Recognition
ICDAR 1995, IEEE Computer Society, 1995, 1, pp 458-462
Montreal, Canada

Aniko Simon, Jean-Christophe Pret, A. Peter Johnson

This paper presents a novel view of document processing as the reverse process to TeX. This concept simplifies the analysis of the physical structure of documents and also suggests the use of a style file for layout recognition. An algorithm is given for each of the two phases, layout analysis and layout recognition. The bottom-up layout analysis method is based on Kruskal’s algorithm and uses the distances between components to construct the physical page structure. The algorithm is linear with respect to the number of connected components. For layout recognition, a document style description language (DSDL) is introduced. This helps a fault-tolerant, recursive parsing algorithm to label the blocks of the document. The presented methods were designed for scientific publications (papers, reports, books), but could be applied to a broader range of documents.
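
The bottom-up grouping idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes connected components are already available as bounding boxes `(x0, y0, x1, y1)`, and the names `box_distance`, `group_components` and the `max_gap` threshold are hypothetical. It also compares every pair of boxes, so it does not reproduce the linear running time stated in the abstract; it only shows the Kruskal-style step of merging the closest components first.

```python
from itertools import combinations


def box_distance(a, b):
    """Gap between two axis-aligned boxes (x0, y0, x1, y1); 0 if they touch or overlap."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return (dx * dx + dy * dy) ** 0.5


def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def group_components(boxes, max_gap):
    """Merge boxes Kruskal-style: shortest edges first, stopping at max_gap (assumed threshold)."""
    parent = list(range(len(boxes)))
    # All pairwise edges, sorted by distance (quadratic; for illustration only).
    edges = sorted(
        (box_distance(boxes[i], boxes[j]), i, j)
        for i, j in combinations(range(len(boxes)), 2)
    )
    for dist, i, j in edges:
        if dist > max_gap:
            break  # remaining edges are longer still
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:
            parent[rj] = ri  # join the two clusters
    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(parent, i), []).append(i)
    return sorted(groups.values())
```

Calling `group_components` with a small `max_gap` would group characters into words or lines, while a larger value would merge those into blocks, which is one way to read the "physical page structure" built from component distances.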
