Text Similarity Computation System
Abstract
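In Chinese information processing, text similarity computation is widely used in information retrieval, machine translation, automatic question answering, text mining, and related fields. It is a basic yet critical problem and has long been both a research hotspot and a difficult one. The design goal of this graduation project is to implement text similarity computation using two methods.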
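This thesis adopts conventional methods. The first is the cosine algorithm, which is easy to understand and whose results are easy to interpret. It computes the similarity between texts quickly, and its result (a value between 0 and 1) indicates the degree of similarity. Because cosine computation is built on the vector space model, the texts must first be converted into vector space representations, and that conversion requires term weighting. Before the vector space model is built, the texts must undergo stop-word removal and feature selection. The second is the BM25 algorithm, which this thesis implements with the most basic loops; the aim is to observe whether, and by how much, the inverted index used in the cosine algorithm improves efficiency.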
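As an illustration of BM25 computed with plain loops, the following is a minimal Python sketch and not the thesis's own implementation. It assumes the documents are already segmented into word lists (Chinese text would need a word segmenter first). The parameter values `k1=1.5` and `b=0.75` and the smoothed, non-negative IDF form are common choices assumed here, and the function name `bm25_scores` is hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`.

    Assumptions (not from the thesis): k1 and b defaults, and the smoothed
    IDF log((N - df + 0.5) / (df + 0.5) + 1), which is never negative.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:  # plain loop over every document, no index
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: longer documents are penalized via b.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    ["text", "similarity", "cosine"],
    ["bm25", "ranking", "text"],
    ["weather", "today"],
]
print(bm25_scores(["bm25", "text"], docs))
```

Because every document is visited for every query, the cost grows with the size of the whole collection, which is the baseline an inverted index is meant to improve on.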
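The main work of this text similarity computation system comprises stop-word removal, text feature selection, and weighting. After weighting, the cosine algorithm computes the similarity of the texts; after feature selection, BM25 computes the similarity. To improve the system's efficiency, the program makes extensive use of containers, together with inner products and an inverted index.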
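The pipeline of stop-word removal, weighting, and cosine similarity over an inverted index can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the system's actual code: the stop-word list is a hypothetical placeholder, the weighting is a smoothed TF-IDF chosen for the example, documents are assumed to be already segmented into words, and the feature selection step is omitted.

```python
import math
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "of", "is"}  # hypothetical stop-word list


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per tokenized document."""
    filtered = [[t for t in d if t not in STOP_WORDS] for d in docs]
    n_docs = len(filtered)
    df = Counter()
    for d in filtered:
        df.update(set(d))
    vectors = []
    for d in filtered:
        tf = Counter(d)
        # Smoothed IDF (an assumption): log((1 + N) / (1 + df)) + 1 > 0.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
             for t, c in tf.items()}
        )
    return vectors


def build_inverted_index(vectors):
    """Map each term to the list of (doc_id, weight) pairs containing it."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(vectors):
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index


def cosine_to_all(query_id, vectors, index):
    """Cosine similarity of one document to every document sharing a term."""
    norms = [math.sqrt(sum(w * w for w in v.values())) for v in vectors]
    dots = defaultdict(float)
    # Accumulate inner products only along the query's posting lists.
    for term, q_weight in vectors[query_id].items():
        for doc_id, weight in index[term]:
            dots[doc_id] += q_weight * weight
    return {
        doc_id: dot / (norms[query_id] * norms[doc_id])
        for doc_id, dot in dots.items()
        if doc_id != query_id
    }


docs = [["the", "cat", "sat"], ["the", "cat", "sat"], ["a", "dog", "ran"]]
vectors = tfidf_vectors(docs)
index = build_inverted_index(vectors)
print(cosine_to_all(0, vectors, index))
```

Documents that share no term with the query never appear in any posting list that is visited, so they are skipped entirely rather than scored as zero one by one; this is where the efficiency gain over a plain loop comes from.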
Keywords: text similarity; cosine; BM25; containers
Text Similarity Algorithm Research
Abstract
In Chinese information processing, text similarity computation is widely used in information retrieval, machine translation, automatic question answering, and text mining. It is an essential and important problem that has long been both a research hotspot and a difficulty. Most text similarity algorithms are based on the vector space model (VSM). However, these methods cause problems of high dimensionality and [...]. These methods also do not effectively solve the natural language problems that exist in text [...]. These natural language problems are synonymy and [...]. These problems disturb the efficiency and accuracy of text similarity algorithms and degrade the performance of text similarity computation.
This paper uses a new idea that brings semantic similarity computation into traditional text similarity computation to improve the performance of text similarity [...]. The paper discusses the existing text similarity algorithms and semantic similarity computation in depth and gives a Chinese text similarity algorithm based on semantic [...] is an online information management system used to manage students' graduation design papers [...] are used to calculate similarity with the algorithm in order to validate it.
The main work of this text similarity computing system is stop-word removal, text feature selection, and weighting; after weighting, the cosine algorithm is used to calculate the similarity of the text. After the text feature sel