
Article

Title: Text document classification based on mixture models (English)
Author: Novovičová, Jana
Author: Malík, Antonín
Language: English
Journal: Kybernetika
ISSN: 0023-5954
Volume: 40
Issue: 3
Year: 2004
Pages: [293]-304
Summary lang: English
Category: math
Summary: Finite mixture modelling of class-conditional distributions is a standard method in statistical pattern recognition. Using the bag-of-words vector representation of documents, this paper explores a mixture of multinomial distributions as a model of the class-conditional distribution for the multiclass text document classification task. The proposed model was compared experimentally with the standard Bernoulli and multinomial models, as well as with a model based on a mixture of multivariate Bernoulli distributions, on the Reuters-21578 and Newsgroups data sets. Preliminary experimental results indicate the effectiveness of the proposed model for text classification. (English)
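For orientation only, the class-conditional model referred to in the summary can be sketched in standard multinomial-mixture notation; the symbols below ($M_c$, $\alpha_{cm}$, $\theta_{cmw}$, $n_w(d)$) are illustrative and are not taken from the paper. For a document $d$ with word counts $n_w(d)$ over a vocabulary $V$, each class $c$ is modelled by a mixture of $M_c$ multinomial components,
$$
p(d \mid c) \;=\; \sum_{m=1}^{M_c} \alpha_{cm}\,
\frac{\bigl(\sum_{w \in V} n_w(d)\bigr)!}{\prod_{w \in V} n_w(d)!}\,
\prod_{w \in V} \theta_{cmw}^{\,n_w(d)},
\qquad
\alpha_{cm} \ge 0,\;\; \sum_{m=1}^{M_c} \alpha_{cm} = 1,\;\; \sum_{w \in V} \theta_{cmw} = 1,
$$
and a document is assigned to the class maximizing the posterior,
$$
\hat{c}(d) \;=\; \arg\max_{c}\; P(c)\, p(d \mid c).
$$
In such models the mixture weights $\alpha_{cm}$ and word probabilities $\theta_{cmw}$ would typically be estimated by maximum likelihood via the EM algorithm (cf. reference [2] below); setting $M_c = 1$ recovers the standard multinomial naive Bayes model used as a baseline in the comparison.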
Keyword: text classification
Keyword: multinomial mixture model
MSC: 62G05
MSC: 62H30
MSC: 68T10
idZBL: Zbl 1248.62107
idMR: MR2103933
Date available: 2009-09-24T20:01:27Z
Last updated: 2015-03-23
Stable URL: http://hdl.handle.net/10338.dmlcz/135596
Reference: [1] Battiti R.: Using mutual information for selecting features in supervised neural net learning. IEEE Trans. Neural Networks 5 (1994), 537–550. 10.1109/72.298224
Reference: [2] Dempster A. P., Laird N. M., Rubin D. B.: Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39 (1977), 1–38. Zbl 0364.62022, MR 0501537
Reference: [3] Forman G.: An experimental study of feature selection metrics for text categorization. J. Mach. Learning Res. 3 (2003), 1289–1305
Reference: [4] Joachims T.: Text categorization with support vector machines: Learning with many relevant features. In: Proc. 10th European Conference on Machine Learning (ECML'98), 1998, pp. 137–142
Reference: [5] Juan A., Vidal E.: On the use of Bernoulli mixture models for text classification. Pattern Recognition 35 (2002), 2705–2710. 10.1016/S0031-3203(01)00242-4
Reference: [6] Kwak N., Choi C.: Improved mutual information feature selector for neural networks in supervised learning. In: Proc. Internat. Joint Conference on Neural Networks (IJCNN '99), 1999, pp. 1313–1318
Reference: [7] McCallum A., Nigam K.: A comparison of event models for naive Bayes text classification. In: Proc. AAAI-98 Workshop on Learning for Text Categorization, 1998
Reference: [8] McLachlan G. J., Peel D.: Finite Mixture Models. Wiley, New York 2000. Zbl 0963.62061, MR 1789474
Reference: [9] Mladenic D., Grobelnik M.: Feature selection for unbalanced class distribution and Naive Bayes. In: Proc. Sixteenth Internat. Conference on Machine Learning, 1999, pp. 258–267
Reference: [10] Nigam K., McCallum A., Thrun S., Mitchell T.: Text classification from labeled and unlabeled documents using EM. Mach. Learning 39 (2000), 103–134. Zbl 0949.68162, 10.1023/A:1007692713085
Reference: [11] Novovičová J., Pudil P., Kittler J.: Divergence based feature selection for multimodal class densities. IEEE Trans. Pattern Anal. Machine Intell. 18 (1996), 218–223. 10.1109/34.481557
Reference: [12] Novovičová J., Malík A.: Text Document Classification Using Finite Mixtures. Research Report No. 2063, Institute of Information Theory and Automation, Prague 2002
Reference: [13] Novovičová J., Malík A.: Application of multinomial mixture model to text classification. In: Pattern Recognition and Image Analysis (Lecture Notes in Computer Science 2652), Springer-Verlag, Berlin 2003, pp. 646–653
Reference: [14] Novovičová J., Malík A., Pudil P.: Feature selection using improved mutual information for text classification. In: Structural, Syntactic and Statistical Pattern Recognition (Lecture Notes in Computer Science), Springer-Verlag, Berlin 2004 (in press). Zbl 1104.68663
Reference: [15] Pudil P., Novovičová J., Kittler J.: Feature selection based on approximation of class densities by finite mixtures of special type. Pattern Recognition 28 (1995), 1389–1398. 10.1016/0031-3203(94)00009-B
Reference: [16] Ueda N., Saito K.: Parametric mixture models for multi-labeled text. In: Proc. Neural Information Processing Systems, 2003
Reference: [17] Yang Y., Pedersen J. O.: A comparative study on feature selection in text categorization. In: Proc. Internat. Conference on Machine Learning, 1997, pp. 412–420
Reference: [18] Yang Y., Liu X.: A re-examination of text categorization methods. In: Proc. 22nd Internat. ACM SIGIR Conference on Research and Development in Inform. Retrieval, 1999, pp. 42–49
Reference: [19] Yang Y.: An evaluation of statistical approaches to text categorization. J. Inform. Retrieval 1 (1999), 67–88. 10.1023/A:1009982220290
Reference: [20] Yang Y., Zhang J., Kisiel B.: A scalability analysis of classifiers in text categorization. In: Proc. 26th ACM SIGIR Conference on Research and Development in Inform. Retrieval, 2003
