metadata extraction; page segmentation; zone classification; Hidden Markov Model
In this work-in-progress report we propose a workflow for extracting metadata from articles in digital form. We decompose the problem into clearly defined sub-tasks and outline possible implementations of each. We report on the progress of implementation and testing, and describe planned future work.