Research on Key Technologies of Data Quality and Data Cleaning

Abstract

Data quality is crucial for information management systems in all domains. However, data is often dirty because of various data quality problems, which range from simple data entry errors to complex inconsistencies. Data cleaning deals with detecting and removing errors and inconsistencies from data to improve the quality of data. With the rapid growth of the information [...], organizations today depend more and more on various information systems, and the need for data cleaning increases significantly because of the "garbage in, garbage out" principle.

After providing a classification of data quality problems and a survey of data cleaning, this thesis presents a specification of an extensible data-cleaning framework and introduces an integrated approach for detecting approximately duplicate records in multi-language data. The contributions of this thesis are as follows:

1. A specification of an open and extensible data-cleaning framework. The framework gains its extensibility by employing innovative features such as the term model, the processing description file, and the rule & Dic base.

2. An integrated approach for detecting approximately duplicate records in multi-language data. The approach presents an efficient algorithm for sorting multi-language data and an efficient edit-distance-based pair-comparison method for multi-language data, and it employs a priority queue of duplicate clusters together with a representative-records strategy to respond adaptively to the data scale.

3. An extensible data cleaning system. The system provides a visual environment to define the data cleaning [...]. Typical data cleaning task modules are also implemented according to our cleaning framework, [...] data quality problems can be attacked by the system.

Keywords: Data quality, Data cleaning, Duplicate detection, Extensible data-cleaning framework

Chapter 1 Introduction

This chapter briefly introduces the concepts, research topics, and current state of research in data quality and data cleaning, and it is the starting point for the work in this thesis. Section 1 sets out the definition of data quality. Section 2 describes the tasks of data cleaning. Section 3 reviews the development of data cleaning research. Section 4 introduces the work of this thesis and its [...]