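Thesis classification code: TP181
Capital Normal University Bachelor's Thesis
Research on Web-Based Text Classification Mining
Department: College of Information Engineering
Major: Computer Science and Technology (Teacher Education)
Year: 2001
Student ID: 1011000035
Advisor: Liu Li-zhen
Author: Xu Ying
Completion date: June 6, 2005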
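Chinese Abstract
The Internet has become an enormous source of information. How to make this information serve people better, and how to obtain the information one needs quickly and accurately, are important problems we face. As a result, Web-based information processing has become an active research area, and text classification on the Web is one of the main topics in Web data mining.
This thesis introduces the theory of data mining, Web mining, and text classification; analyzes the characteristics of Web data; compares HTML with traditional data; and examines several text classification algorithms, focusing on the naive Bayes classification algorithm and the concrete process of improving it. It attempts to use HTML tag weights to compensate for the weakness of the conditional independence assumption in naive Bayes. It briefly reviews existing knowledge on filtering tags from web pages, and uses the useful information in the tags together with a text classification algorithm to classify text. Finally, since the precision of the improved classifier is not ideal, the thesis summarizes the directions for further work on this topic and offers some of the author's own views.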
Keywords
Web mining; naive Bayes; data mining; text classification; web page tags
Research on Web-Based Text Classification Mining
ABSTRACT
The Internet has become a vast source of information. How to make this information serve people better, and how to obtain it quickly and accurately, are important problems we face. Information processing based on the Web is now an active research area, and text categorization on the Web is one of the main topics in Web mining.
This thesis introduces the theory of data mining, Web mining, and text classification, analyzes the features of Web data, and compares HTML with traditional data. It examines several text categorization algorithms, with emphasis on the naive Bayes classifier and the concrete process of improving it. The thesis tries to use HTML tags to compensate for the main weakness of the naive Bayes classifier, its conditional independence assumption. In building the classifier, the thesis summarizes methods for filtering HTML tags, then uses the information in the tags together with the text categorization algorithm to classify text.
Finally, because the precision of the improved classifier is not ideal, the thesis summarizes the next steps for this work and presents some of the author's own views.
Xu Ying
Directed by Liu Li-zhen
Keywords
Web mining; naive Bayes; data mining; text categorization; HTML tags
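Contents
Chinese Abstract
English Abstract
Chapter 1 Introduction
Background and Significance of the Topic
Data Mining
Web Mining
Current State and Development of Web Mining Research
Main Research Content of This Thesis and ...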