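Abstract (translated from the Chinese original)
In an era of information explosion, how to organize and manage vast amounts of information
effectively, and how to retrieve the needed information quickly and accurately, remains a
pressing problem. Automatic text classification is an effective solution: it can process large
volumes of text, goes a long way toward resolving information disorder, and helps users find the
information they need conveniently and accurately.
The support vector machine (SVM) is a machine learning algorithm built on the structural risk
minimization principle and VC theory. Because it is insensitive to feature correlation and
sparsity, it has a considerable advantage on high-dimensional problems, which makes it promising
for text classification. However, SVM classification consistently shows low accuracy for samples
that lie near the separating surface.
To address this shortcoming, this work proposes a support vector machine algorithm combined with
an improved K-nearest-neighbor (KNN) method. The optimal threshold is determined automatically by
evaluating how a set of samples with known classes are classified under different thresholds. In
addition, an improved weighted KNN algorithm is integrated into the SVM, with the aim of reducing
the misclassification rate of samples near the separating hyperplane without increasing the time
complexity of the SVM algorithm. Finally, the improved algorithm is applied to a news
classification system to categorize news texts, making it easier for users to read and browse
news.
Keywords: support vector machine; text classification; K-nearest-neighbor algorithm; news
classification system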
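The combination described in the abstract can be illustrated with a minimal sketch. The abstract does not give the weighting scheme or the exact threshold-selection rule, so everything below is an assumption made for illustration: a hand-set linear decision function stands in for a trained SVM, the KNN vote uses inverse-distance weights, and the threshold is chosen by maximizing accuracy on labelled samples over a small candidate grid. All function names and the toy data are hypothetical.

```python
import math

def decision_value(w, b, x):
    """Signed score f(x) = w.x + b of an already-trained linear SVM (assumed given)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def weighted_knn(train, x, k=3):
    """Distance-weighted KNN vote over (features, label) pairs, labels in {-1, +1}."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], x))[:k]
    # Assumed weighting: closer neighbours count more (inverse distance).
    vote = sum(label / (math.dist(feats, x) + 1e-9) for feats, label in nearest)
    return 1 if vote >= 0 else -1

def hybrid_predict(w, b, train, x, threshold, k=3):
    """Trust the SVM far from the hyperplane; defer to weighted KNN inside the band."""
    f = decision_value(w, b, x)
    if abs(f) >= threshold:
        return 1 if f > 0 else -1
    return weighted_knn(train, x, k)

def pick_threshold(w, b, train, labelled, candidates, k=3):
    """Return the first candidate threshold with the highest accuracy on labelled samples."""
    def accuracy(t):
        hits = sum(hybrid_predict(w, b, train, x, t, k) == y for x, y in labelled)
        return hits / len(labelled)
    return max(candidates, key=accuracy)

# Toy data (hypothetical): the hyperplane x1 = 0 misplaces a few samples close to it.
w, b = (1.0, 0.0), 0.0
train = [
    ((2.0, 0.0), 1), ((1.5, 1.0), 1), ((-0.2, 1.0), 1), ((-0.1, 1.2), 1),
    ((-2.0, 0.0), -1), ((-1.5, -1.0), -1), ((0.2, -1.0), -1), ((0.1, -1.2), -1),
]
labelled = [((-0.15, 1.1), 1), ((0.15, -1.1), -1), ((1.8, 0.5), 1), ((-1.8, -0.5), -1)]

best = pick_threshold(w, b, train, labelled, candidates=[0.0, 0.5, 1.0])
```

A threshold of 0 reduces the sketch to the plain SVM decision. With a positive threshold, only samples inside the band around the hyperplane pay the cost of a neighbour search, which is one plausible reading of the goal of not increasing the SVM's time complexity.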
Abstract
In the era of information explosion, how to organize and manage vast amounts of information,
and how to obtain the necessary information quickly and accurately, is still a serious problem.
Automatic text classification is an effective solution: it can handle large volumes of text and
largely resolve information disorder, helping users grasp the information they need easily and
accurately.
The Support Vector Machine (SVM) is a machine learning algorithm built on the principle of
structural risk minimization and on VC theory. Because it is insensitive to feature correlation
and sparsity, it has a considerable advantage on high-dimensional problems. Thus, SVMs have great
potential in text categorization. However, SVM classification accuracy is often low for samples
near the separating surface.
To address this shortcoming, an improved KNN-SVM algorithm is proposed. The optimal threshold is
determined automatically by evaluating how samples of known categories are classified under
different thresholds. At the same time, an improved weighted KNN algorithm is merged into the
SVM, striving to reduce the misclassification rate of samples near the separating hyperplane
without increasing the time complexity of the SVM. Finally, the improved algorithm is applied to
a news classification system to categorize news texts, making it easier for users to read and
browse news.
Title: Research on an Improved K-Nearest-Neighbor Support Vector Machine Algorithm for Text Classification