Title:
|
Workflow of Metadata Extraction from Retro-Born Digital Documents (English) |
Author:
|
Tkaczyk, Dominika |
Author:
|
Bolikowski, Łukasz |
Language:
|
English |
Journal:
|
Towards a Digital Mathematics Library. Bertinoro, Italy, July 20-21st, 2011 |
Volume:
|
|
Issue:
|
2011 |
Year:
|
|
Pages:
|
39-44 |
. |
Category:
|
math |
. |
Summary:
|
In this work-in-progress report we propose a workflow for metadata extraction from articles in a digital form. We decompose the problem into clearly defined sub-tasks and outline possible implementations of the sub-tasks. We report the progress of implementation and tests, and state future work. (English) |
Keyword:
|
metadata extraction |
Keyword:
|
page segmentation |
Keyword:
|
zone classification |
Keyword:
|
Hidden Markov Model |
MSC:
|
68-06 |
MSC:
|
68U10 |
MSC:
|
68U15 |
MSC:
|
68U99 |
. |
Date available:
|
2011-07-15T09:26:55Z |
Last updated:
|
2012-08-27 |
Stable URL:
|
http://hdl.handle.net/10338.dmlcz/702601 |
. |
Reference:
|
1. iText. http://itextpdf.com/ |
Reference:
|
2. MARG. http://marg.nlm.nih.gov/. Zbl 1143.68407 |
Reference:
|
3. PDFBox. http://pdfbox.apache.org/ |
Reference:
|
4. Automating the production of bibliographic records for MEDLINE. Tech. rep. (2001). |
Reference:
|
5. Cui, B., Chen, X.: An improved hidden Markov model for literature metadata extraction. Advanced Intelligent Computing Theories and Applications. pp. 205–212 (2010). |
Reference:
|
6. Hetzner, E.: A simple method for citation metadata extraction using Hidden Markov Models. In: JCDL ’08: Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries. pp. 280–284. ACM, New York, NY, USA (2008). |
Reference:
|
7. Marinai, S.: Metadata Extraction from PDF Papers for Digital Library Ingest. 10th International Conference on Document Analysis and Recognition. pp. 251–255 (2009). |
Reference:
|
8. Nagy, G., Seth, S., Viswanathan, M.: A prototype document image analysis system for technical journals. Computer 25(7), 10–22 (1992). |
Reference:
|
9. O’Gorman, L.: The document spectrum for page layout analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(11), 1162–1173 (1993). |
Reference:
|
10. Sojka, P.: An Experience with Building Digital Open Access Repository DML-CZ. In: Proceedings of CASLIN 2009. pp. 74–78 (2009). |
Reference:
|
11. Sutton, C., McCallum, A.: An Introduction to Conditional Random Fields for Relational Learning. (2006). |
. |