
Abstract: Information extraction has become an active research topic, and the so-called "Rich Data, Poor Information" problem in the results returned by search engines also urgently needs a solution. Combining the two is both interesting and of practical value. This thesis combines familiar, widely used search engines with information extraction techniques to develop a search-engine-based email address extraction system. The system addresses problems common to existing email address search tools: low accuracy, little user control, and repeated extraction of the same results across consecutive runs.

The main work and contributions of the thesis are as follows.

First, source data is obtained from the results returned by the major search engines by means of URL splicing. After the user submits a keyword and the starting results page to process, the system assembles the URL of the first results page according to the URL structure that the search engine uses. Unlike earlier work, the system pages through results automatically by obtaining the address of the "next page" link. To give users more control, the range of result pages to process can also be limited.

Second, the HTMLParser package parses the HTML pages, and regular expressions extract the email addresses. To gather more complete information, HTMLParser is also used to extract the URL links embedded in each page so that they can be followed to deeper levels. Users choose how many levels of pages to process.

Third, to further increase user control, users can choose to filter the final search results by mail server domain name. To prevent addresses extracted in one run from being extracted again in the next, results are stored in an Access database. Results can also be saved manually as a text file.

Finally, the system was tested, the problems found were fixed, and the results were analyzed and evaluated. The system proved stable, running normally for 15 hours (8:00 to 23:00), which is enough for practical use. Recall and precision both exceeded 94%, higher than the results achieved by existing email address search tools.

Contents:

Abstract (Chinese)
Abstract (English)
Chapter 1 Introduction
    1.1 Background and significance of the research
    1.2 History and current state of research
        1.2.1 Research in China
        1.2.2 Research abroad
        1.2.3 Common email address search tools
    1.3 Main content of this thesis
    1.4 Organization of the thesis
Chapter 2 Search engine technology and email information extraction from the Web
    2.1 Search engines
        2.1.1 Basic concepts and working principles of search engines
        2.1.2 Classification of search engines
        2.1.3 Search engine APIs
    2.2 Composition of a web page
        2.2.1 Overview of web pages
        2.2.2 Introduction to HTML and common tags
    2.3 Common web extraction algorithms
        2.3.1 Ontology-based information extraction
        2.3.2 Wrapper-induction-based information extraction
        2.3.3 Web-query-based information extraction
        2.3.4 HTMLParser-based information extraction
        2.3.5 Regular-expression-based information extraction
    2.4 Evaluating email information extraction from web pages
    2.5 Chapter summary
Chapter 3 A web information extraction algorithm based on regular expressions and HTMLParser
    3.1 Using HTMLParser
        3.1.1 Testing the HTMLParser package
        3.1.2 HTMLParser in the email address extraction system
    3.2 Using regular expressions
        3.2.1 Java APIs that support regular expressions
        3.2.2 Regular expressions in the email address extraction system
    3.3 Combining HTMLParser and regular expressions
    3.4 Chapter summary
Chapter 4 Implementation of the automatic email address extraction system
    4.1 System structure analysis
    4.2 Basic approach to the implementation
    4.3 Implementation of the system modules
        4.3.1 Fetching search engine result pages
        4.3.2 Web page encoding conversion
        4.3.3 Automatic extraction of deep in-site URLs and email addresses
        4.3.4 Avoiding repeated search and extraction
        4.3.5 Filtering by email address type and storing the results
    4.4 Chapter summary
Chapter 5 Features and evaluation of the automatic email address extraction system
    5.1 User interface construction and development environment setup
        5.1.1 Eclipse-based environment setup for the email address search tool
        5.1.2 Building the user interface
    5.2 Problems found during system testing and their solutions
    5.3 The improved automatic email address extraction system
    5.4 Evaluation of system effectiveness
    5.5 Chapter summary
Chapter 6 Conclusions and outlook
    6.1 Conclusions
    6.2 Future work
References
Acknowledgements
Papers published or accepted during the master's program
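The steps summarized in the abstract (splicing result-page URLs, extracting addresses with a regular expression, filtering by domain, and skipping addresses already stored) can be illustrated with a minimal Java sketch. It is not the thesis's code. The URL template, the class and method names, and the simplified email pattern are all assumptions made for illustration. The thesis parses pages with HTMLParser and stores results in an Access database; here a plain regex scan over the raw HTML and an in-memory set stand in for them so that the example stays self-contained.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailExtractionSketch {

    // Simplified address pattern (an assumption); it does not accept every valid address.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    /**
     * Splices result-page URLs for pages firstPage..lastPage (1-based, inclusive).
     * The template is hypothetical: %s receives the encoded keyword, %d the result offset.
     */
    public static List<String> buildPageUrls(String template, String keyword,
                                             int firstPage, int lastPage, int resultsPerPage) {
        String encoded = URLEncoder.encode(keyword, StandardCharsets.UTF_8);
        List<String> urls = new ArrayList<>();
        for (int page = firstPage; page <= lastPage; page++) {
            int offset = (page - 1) * resultsPerPage;
            urls.add(String.format(template, encoded, offset));
        }
        return urls;
    }

    /** Returns the distinct addresses found in the page text, lower-cased, in order of appearance. */
    public static Set<String> extractEmails(String html) {
        Set<String> found = new LinkedHashSet<>();
        Matcher matcher = EMAIL.matcher(html);
        while (matcher.find()) {
            found.add(matcher.group().toLowerCase(Locale.ROOT));
        }
        return found;
    }

    /**
     * Drops addresses whose domain is excluded or that were stored by an earlier run,
     * and records the remaining ones in alreadyStored (a stand-in for the database).
     */
    public static Set<String> keepNew(Set<String> candidates, Set<String> excludedDomains,
                                      Set<String> alreadyStored) {
        Set<String> fresh = new LinkedHashSet<>();
        for (String email : candidates) {
            String domain = email.substring(email.indexOf('@') + 1);
            // Set.add returns false when the address was already stored.
            if (!excludedDomains.contains(domain) && alreadyStored.add(email)) {
                fresh.add(email);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        // Hypothetical search engine; real engines use different parameter names.
        String template = "https://search.example.com/s?q=%s&start=%d";
        System.out.println(buildPageUrls(template, "email list", 1, 2, 10));

        String html = "<a href=\"mailto:Alice@example.org\">Alice</a> bob@mail.example.com";
        Set<String> stored = new HashSet<>();
        Set<String> excluded = Set.of("mail.example.com");
        System.out.println(keepNew(extractEmails(html), excluded, stored));
    }
}
```

Passing the same `stored` set to a later call is what keeps an address from being reported twice. In a real system that set would be backed by persistent storage, as the thesis does with its database.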



