Workflow of Metadata Extraction from Retro-Born Digital Documents

Tkaczyk, Dominika; Bolikowski, Łukasz


Article

Workflow of Metadata Extraction from Retro-Born Digital Documents. (English). In: Sojka, Petr and Bouche, Thierry (eds.): Towards a Digital Mathematics Library. Bertinoro, Italy, July 20-21st, 2011. Masaryk University Press, Brno, Czech Republic, 2011. pp. 39-44

MSC: 68-06, 68U10, 68U15, 68U99


Keywords:
metadata extraction; page segmentation; zone classification; Hidden Markov Model

Summary:
In this work-in-progress report we propose a workflow for metadata extraction from articles in a digital form. We decompose the problem into clearly defined sub-tasks and outline possible implementations of the sub-tasks. We report the progress of implementation and tests, and state future work.

References:

1. iText. http://itextpdf.com/

2. MARG. http://marg.nlm.nih.gov/ Zbl 1143.68407

3. PDFBox. http://pdfbox.apache.org/

4. Automating the production of bibliographic records for MEDLINE. Tech. rep. (2001).

5. Cui, B., Chen, X.: An improved hidden Markov model for literature metadata extraction. Advanced Intelligent Computing Theories and Applications. pp. 205–212 (2010).

6. Hetzner, E.: A simple method for citation metadata extraction using Hidden Markov Models. In: JCDL ’08: Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries. pp. 280–284. ACM, New York, NY, USA (2008).

7. Marinai, S.: Metadata Extraction from PDF Papers for Digital Library Ingest. 10th International Conference on Document Analysis and Recognition. pp. 251–255 (2009).

8. Nagy, G., Seth, S., Viswanathan, M.: A prototype document image analysis system for technical journals. Computer 25(7), 10–22 (1992).

9. O’Gorman, L.: The document spectrum for page layout analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(11), 1162–1173 (1993).

10. Sojka, P.: An Experience with Building Digital Open Access Repository DML-CZ. In: Proceedings of CASLIN 2009. pp. 74–78 (2009).

11. Sutton, C., McCallum, A.: An Introduction to Conditional Random Fields for Relational Learning. (2006).
