Do Python crawlers now have to run behind an accelerator (proxy)? How do you do Python crawling work well?

Commonly used third-party libraries for Python crawlers
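This list covers libraries for web scraping and data processing.
– Network library (stdlib).
– Network library (based on pycurl).
– Network library (bindings).
– Python HTTP library with safe connection pooling, file POST support, and high usability.
– Network library.
– A simple, Pythonic library for browsing the web without a standalone browser.
– A Python library for automating interaction with websites.
– Stateful, programmable web browsing library.
– Low-level network interface (stdlib).
– Unirest is a set of lightweight HTTP libraries available for multiple languages.
– HTTP/2 client for Python.
– An updated, actively maintained version of SocksiPy, including bug fixes and some additional features. Acts as a drop-in replacement for the socket module.
– A requests-like API (based on twisted).
– HTTP client/server for asyncio (PEP-3156).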
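Web crawler frameworks
Full-featured crawlers
– Web crawler framework (based on pycurl/multicurl).
– Web crawler framework (based on twisted). It did not support Python 3 when this list was compiled; if this entry refers to Scrapy, releases since 1.1 do.
– A powerful crawler system.
– A distributed crawler framework.
– A visual crawler based on Scrapy.
– An HTTP resource kit for Python. It lets you access HTTP resources easily and build objects around them.
– A crawler micro-framework based on PyQuery.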
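HTML/XML parsers
– An efficient HTML/XML processing library written in C. Supports XPath.
– Parses the DOM tree with CSS selectors.
– Parses the DOM tree with jQuery-style selectors.
– A slow HTML/XML processing library, implemented in pure Python.
– Builds the DOM of HTML/XML documents according to the WHATWG specification, which all modern browsers follow.
– Parses RSS/ATOM feeds.
– Provides safely escaped strings for XML/HTML/XHTML.
– A Python module that makes working with XML feel like working with JSON.
– Converts HTML/CSS to PDF.
– Easily converts XML files into Python objects.
– Cleans up HTML (requires html5lib).
– Brings sanity to messy data.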
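Libraries for parsing and manipulating plain text.
– (Python standard library) Helps compute differences between sequences.
– Fast computation of Levenshtein distance and string similarity.
– Fuzzy string matching.
– Regular expression accelerator.
– Automatically cleans up Unicode text and reduces breakage.
– Converts Unicode text to ASCII.
– Prints readable characters instead of escaped strings.
– A character encoder compatible with Python 2/3.
– A library for converting Chinese characters to Pinyin.
– Formats the spacing between CJK and alphanumeric characters in text.
– A Python slugify library that preserves Unicode.
– A Python slugify library that converts Unicode to ASCII.
– A tool for generating Unicode slugs.
– Simple tools for processing Russian strings (including pytils.translit.slugify).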
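The standard-library entry in the list above ("helps compute differences") is presumably `difflib`; the name itself did not survive in the source, so treat that as an assumption. A minimal sketch of how it can score string similarity, with hypothetical helper names `similarity` and `closest`:

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


def closest(word: str, candidates: list) -> str:
    """Pick the candidate most similar to `word` (hypothetical helper)."""
    return max(candidates, key=lambda c: similarity(word, c))


if __name__ == "__main__":
    print(similarity("abc", "abc"))
    print(closest("reqests", ["requests", "urllib", "scrapy"]))
```

For large candidate sets, the Levenshtein-based libraries listed above are likely to be faster; `difflib` is convenient because it needs no installation.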
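General parsers
– Python implementation of the lex and yacc parsing tools.
– A general framework for generating parsers.
– A component for parsing human names.
– Parses, formats, stores, and validates international phone numbers.
User-agent strings
– Browser user-agent parser.
– HTTP user-agent parser for Python.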
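Specific file format processing
Libraries for parsing and processing specific text formats.
– A module for exporting data to XLS, CSV, JSON, YAML, and other formats.
– Extracts text from various files such as Word, PowerPoint, and PDF.
– A tool for parsing messy tabular data.
– A common data interface supporting many formats (currently CSV, HTML, XLS, TXT, with more to come).
– Reads, queries, and modifies Microsoft Word docx files.
– Reads and writes data and formatting information in Excel files.
– A Python module for creating Excel .xlsx files.
– A BSD-licensed library that makes it easy to call Python from Excel and vice versa.
– A library for reading and writing Excel 2010 XLSX/XLSM/XLTX/XLTM files.
– Takes Python data structures and converts them into spreadsheets.
– A tool for extracting information from PDF documents.
– A library for splitting, merging, and transforming PDF pages.
– Allows rich PDF documents to be created quickly.
– Extracts tables directly from PDF files.
– A Python implementation of John Gruber's Markdown.
– Billed as the fastest full-featured Markdown parser in pure Python.
– A fast Markdown implementation written entirely in Python.
– A YAML parser for Python.
– A CSS library for Python.
– Universal feed parser.
– A non-validating SQL parser.
– HTTP request/response message parser implemented in C.
– A Python module for parsing Open Graph protocol tags.
Portable Executable
– A multi-platform module for parsing and working with Portable Executable (PE) files.
– Reads Adobe Photoshop PSD files into Python data structures.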
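Natural language processing
Libraries for working with human language.
– A leading platform for writing Python programs that work with human language data.
– A web mining module for Python, with tools for natural language processing, machine learning, and more.
– Provides a consistent API for diving into natural language processing tasks. Built on the shoulders of NLTK and Pattern.
– Chinese word segmentation tool.
– Chinese text processing library.
– Another Chinese word segmentation library.
– Chinese word segmentation based on conditional random fields.
– Standalone language identification system.
– A Korean morphology library.
– Russian morphological analyzer (part-of-speech tagging + inflection engine).
– A distributed natural language processing pipeline written in Python. The project aims to provide a simple way to process large corpora with NLTK through a web interface.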
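Browser automation and emulation
– Automates real browsers (Chrome, Firefox, Opera, IE).
– A wrapper around PyQt's WebKit (requires PyQt).
– A wrapper around PyQt's WebKit (requires PyQt).
– A generic API for browser emulators (Selenium WebDriver, Django client, Zope).
– Threads from the Python standard library. Effective for I/O-bound tasks, but of little use for CPU-bound tasks because of the Python GIL.
– Multiprocessing from the Python standard library.
– An asynchronous task queue/job queue based on distributed message passing.
– The concurrent.futures module provides a high-level interface for asynchronously executing callables.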
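Asynchronous network programming libraries
– (In the Python standard library since Python 3.4) Asynchronous I/O, event loop, coroutines, and tasks.
– An event-driven networking engine framework.
– A web framework and asynchronous networking library.
– An event-driven concurrency framework for Python.
– A greenlet-based event I/O framework for Python.
– A coroutine-based Python networking library.
– An asynchronous framework with WSGI support.
– Magic decorator syntax for asynchronous code.
– An asynchronous task queue/job queue based on distributed message passing.
– A small multi-threaded task queue.
– Mr. Queue – A distributed worker task queue in Python using Redis & Gevent.
– A lightweight task queue manager based on Redis.
– A simple, infinitely scalable queue based on Amazon SQS.
– Python API for Gearman.
– Executes Python code in the cloud.
– Executes R, Python, and MATLAB code in the cloud.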
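The first entry in the asynchronous list (event loop, coroutines, tasks in the standard library) matches `asyncio`. A minimal sketch of why it suits crawling, assuming `asyncio.sleep` as a stand-in for a network wait; `fake_fetch`, `crawl_all`, and `run_demo` are hypothetical names, and no real HTTP request is made:

```python
import asyncio
import time


async def fake_fetch(name: str, delay: float) -> str:
    """Pretend to download a page; the sleep yields control to the event loop."""
    await asyncio.sleep(delay)
    return f"{name} done"


async def crawl_all() -> list:
    # gather() runs the three coroutines concurrently and keeps their order.
    return await asyncio.gather(
        fake_fetch("page-1", 0.2),
        fake_fetch("page-2", 0.2),
        fake_fetch("page-3", 0.2),
    )


def run_demo():
    """Run the crawl and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(crawl_all())
    return results, time.perf_counter() - start


if __name__ == "__main__":
    print(run_demo())
```

The three waits overlap, so the total time should be close to one delay rather than the sum of all three. A real crawler would swap the sleep for an asynchronous HTTP client such as the asyncio-based one listed earlier.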
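Email parsing libraries
– Email address and MIME parsing library.
– A Mailgun library for extracting quotations and signatures from messages.
URL and network address manipulation
Libraries for parsing and modifying URLs and network addresses.
– A small Python library that makes manipulating URLs simple.
– A simple immutable URL with a clean API for inspection and manipulation.
– Breaks a Uniform Resource Locator (URL) string into components (addressing scheme, network location, path, and so on), combines components back into a URL string, and converts a "relative URL" into an absolute URL given a "base URL".
– Accurately separates the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.
– A Python library for displaying and manipulating network addresses.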
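The entry above that splits a URL into components and resolves relative URLs against a base describes what Python 3 ships as `urllib.parse` (the library name is missing from the source, so this mapping is an assumption). A short sketch with a made-up `example.com` URL and hypothetical helper names:

```python
from urllib.parse import urljoin, urlsplit

BASE = "https://example.com/a/b/page.html?x=1#top"  # illustrative URL only


def describe(url: str) -> dict:
    """Split a URL into its addressing scheme, network location, path, etc."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "netloc": parts.netloc,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


def absolute(base: str, link: str) -> str:
    """Turn a relative link found on `base` into an absolute URL."""
    return urljoin(base, link)


if __name__ == "__main__":
    print(describe(BASE))
    print(absolute(BASE, "../c/other.html"))
    print(absolute(BASE, "?start=25"))
```

Resolving links with `urljoin` tends to be safer in a crawler than concatenating strings, because it handles `..`, query-only links, and absolute links uniformly.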
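Web content extraction
Libraries for extracting web page content.
Text and metadata of HTML pages
– News extraction, article extraction, and content curation in Python.
– Converts HTML to Markdown-formatted text.
– HTML content/article extractor.
– A web content retrieval tool for humans.
– A small library for extracting rich content from URLs.
– A module for automatically summarizing text documents and HTML pages.
– An extensible image crawler.
– A fast Python interface to arc90's readability tool.
– A library for extracting structured data from HTML pages. Given some example web pages and the data to extract, scrapely builds a parser for all similar pages.
– A small command-line program for downloading videos from YouTube.
– A YouTube/Youku/Niconico video downloader for Python 3.
– A tool for downloading and saving wikis.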
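Libraries for WebSocket.
– An open-source application messaging router (a Python implementation of WebSocket and WAMP for Autobahn).
– Provides open-source Python implementations of the WebSocket and WAMP protocols.
– WebSocket client and server library for Python 2 and 3 and PyPy.
– Checks your DNS against more than 1,500 DNS servers worldwide.
– An interface to c-ares, a C library for making DNS requests and resolving names asynchronously.
Computer vision
– Open-source computer vision library.
– A concise, readable interface (based on OpenCV) for cameras, image manipulation, feature extraction, and format conversion.
– Fast computer vision algorithms (implemented entirely in C++), using numpy arrays as its data type.
Proxy servers
– A fast tunnel proxy that helps you get through firewalls (supports TCP and UDP, TFO, multiple users, graceful restart, and a destination IP blacklist).
– tproxy is a simple TCP routing proxy (layer 7) built on Gevent and configured in Python.
Other lists of Python tools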
Life is short, I use Python. That is more than a slogan: Python's rich library ecosystem often lets you write code faster than in many other languages. Python is also used in a wide range of areas, such as web development, crawlers, automated testing and operations, test and operations tooling, big data, data analysis, artificial intelligence, and machine learning. If you want to scrape data from the web, Python is a strong choice.
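Anyone learning Python has probably heard of 廖雪峰 (Liao Xuefeng) and read the tutorials on his site, but reading them in a browser can be inconvenient. Today we will use Python to turn the 廖雪峰 tutorial into a PDF so that it can be read offline.
We will use the two modules most commonly used for Python crawling: requests and beautifulsoup.
First, install both modules:
pip install requests
pip install beautifulsoup4
Since the HTML of each page has to be converted to PDF, the pdfkit module is also needed; install it the same way. Note that pdfkit relies on the wkhtmltopdf tool being installed on the system.
Next comes the basic crawling workflow.
Using the browser's developer tools (F12), find the div tag that holds the main text, fetch the whole page with requests, and then extract the main content with beautifulsoup.
The table of contents on the left can be located the same way.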
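The "find the div, then extract its content" step can be sketched as follows. This is a hedged illustration rather than the article's code: to stay runnable without extra installs it swaps BeautifulSoup for the standard library's `html.parser`, and the class name `main-body`, the sample HTML, and the helper names are all made up for the example.

```python
from html.parser import HTMLParser

# Hypothetical page; a real one would come from requests.get(url).text
SAMPLE_HTML = (
    '<html><body><div class="sidebar">Contents</div>'
    '<div class="main-body"><h1>Title</h1>'
    "<div><p>First paragraph.</p></div></div></body></html>"
)


class DivTextExtractor(HTMLParser):
    """Collect the text inside the first <div> carrying a given class."""

    def __init__(self, target_class: str):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # > 0 while inside the target div
        self.done = False   # True once the target div has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1  # a nested div inside the target
        elif not self.done:
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_div_text(html: str, target_class: str) -> str:
    parser = DivTextExtractor(target_class)
    parser.feed(html)
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    print(extract_div_text(SAMPLE_HTML, "main-body"))
```

With BeautifulSoup installed, the same lookup is usually a single `find` call on the parsed document; the depth counter here only exists because the stdlib parser is event-based.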
This article summarizes ways to do parallel or concurrent computing in Python and introduces three commonly used packages: ipyparallel, threading, and multiprocessing.

1. ipyparallel

1.1 Installation

```text
conda install ipyparallel
ipcluster nbextension enable
```

After that, you can start a cluster from the IPython Clusters tab in Jupyter, or from the command line, where -n sets the number of engines (4 here, matching the four engine IDs shown below):

```text
ipcluster start -n 4
```

1.2 Basic concepts

```python
from ipyparallel import Client
```

A client can connect to the engines of different clusters; these engines can be on the same machine or on different machines.

```python
rc = Client()
rc.ids
# [0, 1, 2, 3]
```

A view provides a way to access the engines, and tasks are submitted to the engines through a view. A direct view lets the user send specific tasks to specific engines, while a load-balanced view is somewhat like the Pool object in multiprocessing.

Direct View

```python
dv = rc[:]
dv.map_sync(lambda x, y, z: x + y + z, range(10), range(10), range(10))
# [0, 3, 6, 9, 12, 15, 18, 21, 24, 27]
```

Load Balanced View

```python
import numpy as np

lv = rc.load_balanced_view()
lv.map_sync(lambda x: sum(x), np.random.random((10, 100000)))
```

Using apply

Besides map, we can also use apply to dispatch tasks:

```python
rc[1:3].apply_sync(lambda x, y: x**2 + y**2, 3, 4)
```

Synchronous and asynchronous tasks

So far we have used map and apply in their sync forms, which run synchronous tasks. Asynchronous tasks are available through map_async and apply_async:

```python
res = dv.map_async(lambda x, y, z: x + y + z, range(10), range(10), range(10))
# <AsyncMapResult: <lambda>>
```

Unlike a synchronous call, res here is a result object. We have to check whether it has finished and call get() to obtain the actual values:

```python
res.done()
res.get()
# [0, 3, 6, 9, 12, 15, 18, 21, 24, 27]
```

2. Threading

threading is the Python module for multithreading. The example below shows its basic usage:

```python
import threading

# Function executed by each thread
def counter(n):
    for i in range(n):
        for j in range(i):
            pass  # the original loop body did not survive extraction

if __name__ == '__main__':
    jobs = []  # threads started so far
    for i in range(5):
        th = threading.Thread(target=counter, args=(i,))
        jobs.append(th)
        th.start()
    for th in jobs:
        th.join()  # wait for every thread to finish
```

Here counter is the function we defined, and the goal is to run several counter calls at the same time (each with a different argument) to gain speed.

This code can be copied and adapted directly. Thread is the class that provides multithreading, the target argument is the function to run concurrently, and args holds that function's input arguments.

Python threads have an annoying limitation, however: the global interpreter lock (GIL). The lock means that only one thread can use the interpreter at any moment, so the threads effectively share a single CPU. This is "concurrency", not "parallelism". The consequence is that if one compute-intensive thread holds the CPU, all other threads have to wait. With such a thread in the mix, multithreaded code can end up running almost serially.

3. multiprocessing

Compared with threading, multiprocessing uses multiple processes. Its usage is almost identical: replace threading.Thread with multiprocessing.Process.

```python
import multiprocessing

# Function executed by each process
def worker(num):
    print('Worker:', num)

if __name__ == '__main__':
    jobs = []  # processes started so far
    for i in range(5):
        p = multiprocessing.Process(target=worker, args=(i,))
        jobs.append(p)
        p.start()
```

multiprocessing also offers the concept of a process pool:

```python
from multiprocessing import Pool
import os, time

def long_time_task(name):
    print('Run task %s (%s)...' % (name, os.getpid()))
    start = time.time()
    time.sleep(3)
    end = time.time()
    print('Task %s runs %0.2f seconds.' % (name, (end - start)))

if __name__ == '__main__':
    print('Parent process %s.' % os.getpid())
    p = Pool()
    for i in range(5):
        p.apply_async(long_time_task, args=(i,))
    print('Waiting for all subprocesses done...')
    p.close()  # no more tasks will be submitted
    p.join()   # block until all workers have finished
    print('All subprocesses done.')
```

Finally

To learn more about R, Python, data science, and machine learning, follow my column Data Science with R&Python (https://zhuanlan.zhihu.com/rdatamining) and my Zhihu account 文兄 (https://www.zhihu.com/people/wen-yi-yang-81).
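The GIL limitation described above matters mainly for CPU-bound work; blocking waits release the lock, which is why threads still help crawlers. A minimal sketch of that effect, assuming `time.sleep` as a stand-in for a blocking network call and using hypothetical function names throughout:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_io_task(n: int) -> int:
    """Stand-in for a blocking request; sleeping releases the GIL."""
    time.sleep(0.2)
    return n * n


def run_serial(items):
    return [slow_io_task(n) for n in items]


def run_threaded(items, workers: int = 5):
    # pool.map preserves input order, like the built-in map.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(slow_io_task, items))


def timed(func, *args):
    """Return (result, elapsed seconds) for one call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    print(timed(run_serial, range(5)))
    print(timed(run_threaded, range(5)))
```

The threaded run should finish in roughly the time of one task, while the serial run takes about five times as long. Replacing the sleep with a pure-Python computation would largely erase that gap, which is the case where multiprocessing is the better fit.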
Although I have not been working on crawlers for long, every project I have handled has been a special case: from collecting data at the hundred-million-record scale to various spoofing and concealment techniques, from cracking Geetest CAPTCHAs to getting past the anti-crawler defenses of Taobao, Baidu, and others, from deploying distributed architectures to several IP rotation techniques, and from plain requests to JS reverse engineering and automated browser simulation. I have hands-on experience with all of these mainstream techniques, so I am attempting to write this technical guide.

My company needs to train newcomers starting from crawler technology, so I prepared this tutorial series. Learning a technology calls for both breadth and depth, so I first survey the crawler technology stack and then give worked examples for the techniques that require analysis and flexibility.

Skill tree overview:

Items in red are commonly used. The diagram 爬虫技能树-总览图.graffle (download link on www.urlteam.org) was created with OmniGraffle on a Mac.

In summary, the commonly used tools are:

Analysis tools:
– XPath testing: the Chrome extension XPath Helper
– Request header forging: the Chrome extension Modify Headers for Google Chrome
– POST and parameter tuning: Postman
– The scrapy shell
– Browser developer tools

Request tools:
– requests, network package
– urllib2, network package (Python 2; urllib.request in Python 3)

Distributed tools:
– redis, in-memory database
– mysql, database
– docker, deployment tool

Data extraction tools:
– re, regular expressions
– lxml, XPath extraction

Browser simulation:
– phantomjs
– selenium
– ghost

Asynchronous:
– threading
– Twisted (https://twistedmatrix.com/trac/)

IP rotation techniques:
– Proxies, ADSL, Tor, VPN, accelerators

The slides mostly enumerate items and were meant to accompany a live talk and demo, so they lack detailed explanations. Future posts will focus on technical content rather than a stack overview like this one.

Appendix:
Slides and mind map download: 采集技术分享第一期 (www.urlteam.org)
Blog post: 数据采集技术指南 第一篇 技术栈总览-附总图和演讲ppt (www.urlteam.org)
GitHub code: luyishisi/Anti-Anti-Spider (https://github.com/luyishisi/Anti-Anti-Spider)
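Two items from the tool list above, request headers and proxies, can be sketched with the standard library alone. This is an illustrative sketch, not the author's setup: the User-Agent string, the `example.com` URL, and the local proxy address are all placeholders, and nothing is actually sent over the network.

```python
import urllib.request


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Create a request carrying custom headers (not sent yet)."""
    headers = {
        "User-Agent": user_agent,
        "Accept-Language": "zh-CN,zh;q=0.9",
    }
    return urllib.request.Request(url, headers=headers)


def build_proxy_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Create an opener that would route HTTP and HTTPS through one proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


if __name__ == "__main__":
    req = build_request("https://example.com/", "demo-crawler/0.1")
    opener = build_proxy_opener("http://127.0.0.1:8080")  # placeholder proxy
    # opener.open(req) would perform the request through the proxy.
    print(req.header_items())
```

Rotating IPs then amounts to building the opener from a different proxy address per request. Whether that is appropriate depends on the target site's terms, so identifying the crawler honestly and rate-limiting it is usually the safer default.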
&figure&&img src=&https://pic1.zhimg.com/v2-3cf229eedc59aad2fca5_b.jpg& data-rawwidth=&601& data-rawheight=&203& class=&origin_image zh-lightbox-thumb& width=&601& data-original=&https://pic1.zhimg.com/v2-3cf229eedc59aad2fca5_r.jpg&&&/figure&&p&本文将通过一些例子来讲述作为Python开发者有哪些常用的方式来实现异步编程,以及分享个人对异步编程的理解,如有错误,欢迎指正。&br&&/p&&p&先从一个例子说起。&/p&&p&小梁是一个忠实的电影好爱者,有一天,小梁看到豆瓣这个网站,发现了很多自己喜欢的内容,恰好小梁是个程序猿,于是心血来潮的他决定写个程序,把豆瓣Top250的电影列表给爬下来。小梁平时是个Python发烧友,做起这些事情来自然是得心应手,于是他欣喜地撸起袖子就是干!果不其然,不到十分钟,小梁就写好了第一个程序。&/p&&br&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&c1&&#-*- coding:utf-8 -*-&/span&
&span class=&kn&&import&/span& &span class=&nn&&urllib.request&/span&
&span class=&kn&&import&/span& &span class=&nn&&ssl&/span&
&span class=&kn&&from&/span& &span class=&nn&&lxml&/span& &span class=&kn&&import&/span& &span class=&n&&etree&/span&
&span class=&n&&url&/span& &span class=&o&&=&/span& &span class=&s1&&'https://movie.douban.com/top250'&/span&
&span class=&n&&context&/span& &span class=&o&&=&/span& &span class=&n&&ssl&/span&&span class=&o&&.&/span&&span class=&n&&SSLContext&/span&&span class=&p&&(&/span&&span class=&n&&ssl&/span&&span class=&o&&.&/span&&span class=&n&&PROTOCOL_TLSv1_1&/span&&span class=&p&&)&/span&
&span class=&k&&def&/span& &span class=&nf&&fetch_page&/span&&span class=&p&&(&/span&&span class=&n&&url&/span&&span class=&p&&):&/span&
&span class=&n&&response&/span& &span class=&o&&=&/span& &span class=&n&&urllib&/span&&span class=&o&&.&/span&&span class=&n&&request&/span&&span class=&o&&.&/span&&span class=&n&&urlopen&/span&&span class=&p&&(&/span&&span class=&n&&url&/span&&span class=&p&&,&/span& &span class=&n&&context&/span&&span class=&o&&=&/span&&span class=&n&&context&/span&&span class=&p&&)&/span&
&span class=&k&&return&/span& &span class=&n&&response&/span&
&span class=&k&&def&/span& &span class=&nf&&parse&/span&&span class=&p&&(&/span&&span class=&n&&url&/span&&span class=&p&&):&/span&
&span class=&n&&response&/span& &span class=&o&&=&/span& &span class=&n&&fetch_page&/span&&span class=&p&&(&/span&&span class=&n&&url&/span&&span class=&p&&)&/span&
&span class=&n&&page&/span& &span class=&o&&=&/span& &span class=&n&&response&/span&&span class=&o&&.&/span&&span class=&n&&read&/span&&span class=&p&&()&/span&
&span class=&n&&html&/span& &span class=&o&&=&/span& &span class=&n&&etree&/span&&span class=&o&&.&/span&&span class=&n&&HTML&/span&&span class=&p&&(&/span&&span class=&n&&page&/span&&span class=&p&&)&/span&
&span class=&n&&xpath_movie&/span& &span class=&o&&=&/span& &span class=&s1&&'//*[@id=&content&]/div/div[1]/ol/li'&/span&
&span class=&n&&xpath_title&/span& &span class=&o&&=&/span& &span class=&s1&&'.//span[@class=&title&]'&/span&
&span class=&n&&xpath_pages&/span& &span class=&o&&=&/span& &span class=&s1&&'//*[@id=&content&]/div/div[1]/div[2]/a'&/span&
&span class=&n&&pages&/span& &span class=&o&&=&/span& &span class=&n&&html&/span&&span class=&o&&.&/span&&span class=&n&&xpath&/span&&span class=&p&&(&/span&&span class=&n&&xpath_pages&/span&&span class=&p&&)&/span&
&span class=&n&&fetch_list&/span& &span class=&o&&=&/span& &span class=&p&&[]&/span&
&span class=&n&&result&/span& &span class=&o&&=&/span& &span class=&p&&[]&/span&
&span class=&k&&for&/span& &span class=&n&&element_movie&/span& &span class=&ow&&in&/span& &span class=&n&&html&/span&&span class=&o&&.&/span&&span class=&n&&xpath&/span&&span class=&p&&(&/span&&span class=&n&&xpath_movie&/span&&span class=&p&&):&/span&
&span class=&n&&result&/span&&span class=&o&&.&/span&&span class=&n&&append&/span&&span class=&p&&(&/span&&span class=&n&&element_movie&/span&&span class=&p&&)&/span&
&span class=&k&&for&/span& &span class=&n&&p&/span& &span class=&ow&&in&/span& &span class=&n&&pages&/span&&span class=&p&&:&/span&
&span class=&n&&fetch_list&/span&&span class=&o&&.&/span&&span class=&n&&append&/span&&span class=&p&&(&/span&&span class=&n&&url&/span& &span class=&o&&+&/span& &span class=&n&&p&/span&&span class=&o&&.&/span&&span class=&n&&get&/span&&span class=&p&&(&/span&&span class=&s1&&'href'&/span&&span class=&p&&))&/span&
&span class=&k&&for&/span& &span class=&n&&url&/span& &span class=&ow&&in&/span& &span class=&n&&fetch_list&/span&&span class=&p&&:&/span&
&span class=&n&&response&/span& &span class=&o&&=&/span& &span class=&n&&fetch_page&/span&&span class=&p&&(&/span&&span class=&n&&url&/span&&span class=&p&&)&/span&
&span class=&n&&page&/span& &span class=&o&&=&/span& &span class=&n&&response&/span&&span class=&o&&.&/span&&span class=&n&&read&/span&&span class=&p&&()&/span&
&span class=&n&&html&/span& &span class=&o&&=&/span& &span class=&n&&etree&/span&&span class=&o&&.&/span&&span class=&n&&HTML&/span&&span class=&p&&(&/span&&span class=&n&&page&/span&&span class=&p&&)&/span&
&span class=&k&&for&/span& &span class=&n&&element_movie&/span& &span class=&ow&&in&/span& &span class=&n&&html&/span&&span class=&o&&.&/span&&span class=&n&&xpath&/span&&span class=&p&&(&/span&&span class=&n&&xpath_movie&/span&&span class=&p&&):&/span&
&span class=&n&&result&/span&&span class=&o&&.&/span&&span class=&n&&append&/span&&span class=&p&&(&/span&&span class=&n&&element_movie&/span&&span class=&p&&)&/span&
&span class=&k&&for&/span& &span class=&n&&i&/span&&span class=&p&&,&/span& &span class=&n&&movie&/span& &span class=&ow&&in&/span& &span class=&nb&&enumerate&/span&&span class=&p&&(&/span&&span class=&n&&result&/span&&span class=&p&&,&/span& &span class=&mi&&1&/span&&span class=&p&&):&/span&
&span class=&n&&title&/span& &span class=&o&&=&/span& &span class=&n&&movie&/span&&span class=&o&&.&/span&&span class=&n&&find&/span&&span class=&p&&(&/span&&span class=&n&&xpath_title&/span&&span class=&p&&)&/span&&span class=&o&&.&/span&&span class=&n&&text&/span&
&span class=&k&&print&/span&&span class=&p&&(&/span&&span class=&n&&i&/span&&span class=&p&&,&/span& &span class=&n&&title&/span&&span class=&p&&)&/span&
&span class=&k&&def&/span& &span class=&nf&&main&/span&&span class=&p&&():&/span&
&span class=&n&&parse&/span&&span class=&p&&(&/span&&span class=&n&&url&/span&&span class=&p&&)&/span&
&span class=&k&&if&/span& &span class=&n&&__name__&/span& &span class=&o&&==&/span& &span class=&s1&&'__main__'&/span&&span class=&p&&:&/span&
&span class=&n&&main&/span&&span class=&p&&()&/span&
&/code&&/pre&&/div&&p&程序也不出意外地正常运行。&/p&&figure&&img src=&https://pic3.zhimg.com/v2-3bb705d22dda5e3166c96_b.jpg& data-rawwidth=&2048& data-rawheight=&1280& class=&origin_image zh-lightbox-thumb& width=&2048& data-original=&https://pic3.zhimg.com/v2-3bb705d22dda5e3166c96_r.jpg&&&/figure&&br&&p&但是,这个程序让人感觉比较慢,有多慢呢?小梁在主函数中加了下面一段代码。&br&&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&k&&def&/span& &span class=&nf&&main&/span&&span class=&p&&():&/span&
&span class=&kn&&from&/span& &span class=&nn&&time&/span& &span class=&kn&&import&/span& &span class=&n&&time&/span&
&span class=&n&&start&/span& &span class=&o&&=&/span& &span class=&n&&time&/span&&span class=&p&&()&/span&
&span class=&k&&for&/span& &span class=&n&&i&/span& &span class=&ow&&in&/span& &span class=&nb&&range&/span&&span class=&p&&(&/span&&span class=&mi&&5&/span&&span class=&p&&):&/span&
&span class=&n&&parse&/span&&span class=&p&&(&/span&&span class=&n&&url&/span&&span class=&p&&)&/span&
&span class=&n&&end&/span& &span class=&o&&=&/span& &span class=&n&&time&/span&&span class=&p&&()&/span&
&span class=&k&&print&/span& &span class=&p&&(&/span&&span class=&s1&&'Cost {} seconds'&/span&&span class=&o&&.&/span&&span class=&n&&format&/span&&span class=&p&&((&/span&&span class=&n&&end&/span& &span class=&o&&-&/span& &span class=&n&&start&/span&&span class=&p&&)&/span& &span class=&o&&/&/span& &span class=&mi&&5&/span&&span class=&p&&))&/span&
&/code&&/pre&&/div&&p&发现总共耗时7.6秒!!&br&&/p&&div class=&highlight&&&pre&&code class=&language-bash&&&span&&/span&python movie.py
Cost 7.583 seconds
&/code&&/pre&&/div&&p&小梁不禁陷入了沉思...&br&&/p&&p&&figure&&img src=&https://pic2.zhimg.com/v2-e01da4b607e_b.jpg& data-rawwidth=&224& data-rawheight=&261& class=&content_image& width=&224&&&/figure&小梁突然想起了两天前小张同学给他安利的一个库,叫&strong&requests&/strong&,比那urllib,urllib2,urllib3,urllibn...不知高到哪里去了!小梁兴致勃勃地修改程序,用requests代替了标准库urllib。&br&&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&kn&&import&/span& &span class=&nn&&requests&/span&
&span class=&kn&&from&/span& &span class=&nn&&lxml&/span& &span class=&kn&&import&/span& &span class=&n&&etree&/span&
&span class=&kn&&from&/span& &span class=&nn&&time&/span& &span class=&kn&&import&/span& &span class=&n&&time&/span&
&span class=&n&&url&/span& &span class=&o&&=&/span& &span class=&s1&&'https://movie.douban.com/top250'&/span&
&span class=&k&&def&/span& &span class=&nf&&fetch_page&/span&&span class=&p&&(&/span&&span class=&n&&url&/span&&span class=&p&&):&/span&
&span class=&n&&response&/span& &span class=&o&&=&/span& &span class=&n&&requests&/span&&span class=&o&&.&/span&&span class=&n&&get&/span&&span class=&p&&(&/span&&span class=&n&&url&/span&&span class=&p&&)&/span&
&span class=&k&&return&/span& &span class=&n&&response&/span&
&span class=&k&&def&/span& &span class=&nf&&parse&/span&&span class=&p&&(&/span&&span class=&n&&url&/span&&span class=&p&&):&/span&
&span class=&n&&response&/span& &span class=&o&&=&/span& &span class=&n&&fetch_page&/span&&span class=&p&&(&/span&&span class=&n&&url&/span&&span class=&p&&)&/span&
&span class=&n&&page&/span& &span class=&o&&=&/span& &span class=&n&&response&/span&&span class=&o&&.&/span&&span class=&n&&content&/span&
&span class=&n&&html&/span& &span class=&o&&=&/span& &span class=&n&&etree&/span&&span class=&o&&.&/span&&span class=&n&&HTML&/span&&span class=&p&&(&/span&&span class=&n&&page&/span&&span class=&p&&)&/span&
&span class=&n&&xpath_movie&/span& &span class=&o&&=&/span& &span class=&s1&&'//*[@id=&content&]/div/div[1]/ol/li'&/span&
&span class=&n&&xpath_title&/span& &span class=&o&&=&/span& &span class=&s1&&'.//span[@class=&title&]'&/span&
&span class=&n&&xpath_pages&/span& &span class=&o&&=&/span& &span class=&s1&&'//*[@id=&content&]/div/div[1]/div[2]/a'&/span&
&span class=&n&&pages&/span& &span class=&o&&=&/span& &span class=&n&&html&/span&&span class=&o&&.&/span&&span class=&n&&xpath&/span&&span class=&p&&(&/span&&span class=&n&&xpath_pages&/span&&span class=&p&&)&/span&
&span class=&n&&fetch_list&/span& &span class=&o&&=&/span& &span class=&p&&[]&/span&
&span class=&n&&result&/span& &span class=&o&&=&/span& &span class=&p&&[]&/span&
&span class=&k&&for&/span& &span class=&n&&element_movie&/span& &span class=&ow&&in&/span& &span class=&n&&html&/span&&span class=&o&&.&/span&&span class=&n&&xpath&/span&&span class=&p&&(&/span&&span class=&n&&xpath_movie&/span&&span class=&p&&):&/span&
&span class=&n&&result&/span&&span class=&o&&.&/span&&span class=&n&&append&/span&&span class=&p&&(&/span&&span class=&n&&element_movie&/span&&span class=&p&&)&/span&
&span class=&k&&for&/span& &span class=&n&&p&/span& &span class=&ow&&in&/span& &span class=&n&&pages&/span&&span class=&p&&:&/span&
&span class=&n&&fetch_list&/span&&span class=&o&&.&/span&&span class=&n&&append&/span&&span class=&p&&(&/span&&span class=&n&&url&/span& &span class=&o&&+&/span& &span class=&n&&p&/span&&span class=&o&&.&/span&&span class=&n&&get&/span&&span class=&p&&(&/span&&span class=&s1&&'href'&/span&&span class=&p&&))&/span&
&span class=&k&&for&/span& &span class=&n&&url&/span& &span class=&ow&&in&/span& &span class=&n&&fetch_list&/span&&span class=&p&&:&/span&
&span class=&n&&response&/span& &span class=&o&&=&/span& &span class=&n&&fetch_page&/span&&span class=&p&&(&/span&&span class=&n&&url&/span&&span class=&p&&)&/span&
&span class=&n&&page&/span& &span class=&o&&=&/span& &span class=&n&&response&/span&&span class=&o&&.&/span&&span class=&n&&content&/span&
&span class=&n&&html&/span& &span class=&o&&=&/span& &span class=&n&&etree&/span&&span class=&o&&.&/span&&span class=&n&&HTML&/span&&span class=&p&&(&/span&&span class=&n&&page&/span&&span class=&p&&)&/span&
&span class=&k&&for&/span& &span class=&n&&element_movie&/span& &span class=&ow&&in&/span& &span class=&n&&html&/span&&span class=&o&&.&/span&&span class=&n&&xpath&/span&&span class=&p&&(&/span&&span class=&n&&xpath_movie&/span&&span class=&p&&):&/span&
&span class=&n&&result&/span&&span class=&o&&.&/span&&span class=&n&&append&/span&&span class=&p&&(&/span&&span class=&n&&element_movie&/span&&span class=&p&&)&/span&
&span class=&k&&for&/span& &span class=&n&&i&/span&&span class=&p&&,&/span& &span class=&n&&movie&/span& &span class=&ow&&in&/span& &span class=&nb&&enumerate&/span&&span class=&p&&(&/span&&span class=&n&&result&/span&&span class=&p&&,&/span& &span class=&mi&&1&/span&&span class=&p&&):&/span&
&span class=&n&&title&/span& &span class=&o&&=&/span& &span class=&n&&movie&/span&&span class=&o&&.&/span&&span class=&n&&find&/span&&span class=&p&&(&/span&&span class=&n&&xpath_title&/span&&span class=&p&&)&/span&&span class=&o&&.&/span&&span class=&n&&text&/span&
&span class=&c1&&# print(i, title)&/span&
&/code&&/pre&&/div&&p&结果一测,6.5秒!虽然比用urllib快了1秒多,但是总体来说,他们基本还是处于同一水平线的,程序并没有快很多,这一点的差距或许是requests对请求做了优化导致的。&br&&/p&&div class=&highlight&&&pre&&code class=&language-bash&&&span&&/span&python movie_requests.py
Cost 6.677 seconds
&/code&&/pre&&/div&&p&小梁不禁暗想:是我的程序写的太挫了吗?会不会是lxml这个库解析的速度太慢了,用正则表达式会不会好一些?&/p&&p&于是小梁把lxml库换成了标准的re库。&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&c1&&#-*- coding:utf-8 -*-&/span&
&span class=&kn&&import&/span& &span class=&nn&&requests&/span&
&span class=&kn&&from&/span& &span class=&nn&&time&/span& &span class=&kn&&import&/span& &span class=&n&&time&/span&
&span class=&kn&&import&/span& &span class=&nn&&re&/span&
&span class=&n&&url&/span& &span class=&o&&=&/span& &span class=&s1&&'https://movie.douban.com/top250'&/span&
&span class=&k&&def&/span& &span class=&nf&&fetch_page&/span&&span class=&p&&(&/span&&span class=&n&&url&/span&&span class=&p&&):&/span&
&span class=&n&&response&/span& &span class=&o&&=&/span& &span class=&n&&requests&/span&&span class=&o&&.&/span&&span class=&n&&get&/span&&span class=&p&&(&/span&&span class=&n&&url&/span&&span class=&p&&)&/span&
&span class=&k&&return&/span& &span class=&n&&response&/span&
&span class=&k&&def&/span& &span class=&nf&&parse&/span&&span class=&p&&(&/span&&span class=&n&&url&/span&&span class=&p&&):&/span&
&span class=&n&&response&/span& &span class=&o&&=&/span& &span class=&n&&fetch_page&/span&&span class=&p&&(&/span&&span class=&n&&url&/span&&span class=&p&&)&/span&
&span class=&n&&page&/span& &span class=&o&&=&/span& &span class=&n&&response&/span&&span class=&o&&.&/span&&span class=&n&&content&/span&
&span class=&n&&fetch_list&/span& &span class=&o&&=&/span& &span class=&nb&&set&/span&&span class=&p&&()&/span&
&span class=&n&&result&/span& &span class=&o&&=&/span& &span class=&p&&[]&/span&
&span class=&k&&for&/span& &span class=&n&&title&/span& &span class=&ow&&in&/span& &span class=&n&&re&/span&&span class=&o&&.&/span&&span class=&n&&findall&/span&&span class=&p&&(&/span&&span class=&n&&rb&/span&&span class=&s1&&'&a href=.*\s.*&span class=&title&&(.*)&/span&'&/span&&span class=&p&&,&/span& &span class=&n&&page&/span&&span class=&p&&):&/span&
&span class=&n&&result&/span&&span class=&o&&.&/span&&span class=&n&&append&/span&&span class=&p&&(&/span&&span class=&n&&title&/span&&span class=&p&&)&/span&
&span class=&k&&for&/span& &span class=&n&&postfix&/span& &span class=&ow&&in&/span& &span class=&n&&re&/span&&span class=&o&&.&/span&&span class=&n&&findall&/span&&span class=&p&&(&/span&&span class=&n&&rb&/span&&span class=&s1&&'&a href=&(\?start=.*?)&'&/span&&span class=&p&&,&/span& &span class=&n&&page&/span&&span class=&p&&):&/span&
&span class=&n&&fetch_list&/span&&span class=&o&&.&/span&&span class=&n&&add&/span&&span class=&p&&(&/span&&span class=&n&&url&/span& &span class=&o&&+&/span& &span class=&n&&postfix&/span&&span class=&o&&.&/span&&span class=&n&&decode&/span&&span class=&p&&())&/span&
&span class=&k&&for&/span& &span class=&n&&url&/span& &span class=&ow&&in&/span& &span class=&n&&fetch_list&/span&&span class=&p&&:&/span&
&span class=&n&&response&/span& &span class=&o&&=&/span& &span class=&n&&fetch_page&/span&&span class=&p&&(&/span&&span class=&n&&url&/span&&span class=&p&&)&/span&
&span class=&n&&page&/span& &span class=&o&&=&/span& &span class=&n&&response&/span&&span class=&o&&.&/span&&span class=&n&&content&/span&
&span class=&k&&for&/span& &span class=&n&&title&/span& &span class=&ow&&in&/span& &span class=&n&&re&/span&&span class=&o&&.&/span&&span class=&n&&findall&/span&&span class=&p&&(&/span&&span class=&n&&rb&/span&&span class=&s1&&'&a href=.*\s.*&span class=&title&&(.*)&/span&'&/span&&span class=&p&&,&/span& &span class=&n&&page&/span&&span class=&p&&):&/span&
&span class=&n&&result&/span&&span class=&o&&.&/span&&span class=&n&&append&/span&&span class=&p&&(&/span&&span class=&n&&title&/span&&span class=&p&&)&/span&
&span class=&k&&for&/span& &span class=&n&&i&/span&&span class=&p&&,&/span& &span class=&n&&title&/span& &span class=&ow&&in&/span& &span class=&nb&&enumerate&/span&&span class=&p&&(&/span&&span class=&n&&result&/span&&span class=&p&&,&/span& &span class=&mi&&1&/span&&span class=&p&&):&/span&
&span class=&n&&title&/span& &span class=&o&&=&/span& &span class=&n&&title&/span&&span class=&o&&.&/span&&span class=&n&&decode&/span&&span class=&p&&()&/span&
&span class=&c1&&# print(i, title)&/span&
&/code&&/pre&&/div&&p&再一跑,咦,又足足提升了将近一秒!&br&&/p&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&python movie_regex.py
Cost 5.069 seconds
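&/code&&/pre&&/div&&p&Xiao Liang was quietly pleased: the program was shorter and it ran faster, and success felt closer. Then he frowned, because he quickly spotted a problem. The shorter program chased &strong&speed&/strong& blindly and had no &strong&extensibility&/strong& to speak of. It is adequate for simple cases, but once the logic grows more complex, a program built on raw regular expressions becomes much harder to maintain; a dedicated parsing library is what keeps such code clear. More importantly, the bottleneck of a network application like this is usually I/O, so cutting the time spent waiting on reads and writes pays off far more than speeding up text parsing. Xiao Liang remembered the processes and threads his operating systems lecturer had covered the day before, which were just the tools for this real problem.&br&&/p&&figure&&img src=&https://pic1.zhimg.com/v2-fc11bf4c2daa_b.jpg& data-rawwidth=&506& data-rawheight=&482& class=&origin_image zh-lightbox-thumb& width=&506& data-original=&https://pic1.zhimg.com/v2-fc11bf4c2daa_r.jpg&&&/figure&&br&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&c1&&#-*- coding:utf-8 -*-&/span&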
import requests
from lxml import etree
from time import time
from threading import Thread

url = 'https://movie.douban.com/top250'


def fetch_page(url):
    response = requests.get(url)
    return response


def parse(url):
    # Fetch and parse the first page
    response = fetch_page(url)
    page = response.content
    html = etree.HTML(page)

    xpath_movie = '//*[@id="content"]/div/div[1]/ol/li'
    xpath_title = './/span[@class="title"]'
    xpath_pages = '//*[@id="content"]/div/div[1]/div[2]/a'

    pages = html.xpath(xpath_pages)
    fetch_list = []
    result = []

    for element_movie in html.xpath(xpath_movie):
        result.append(element_movie)

    # Pagination hrefs are relative, so prefix them with the base URL
    for p in pages:
        fetch_list.append(url + p.get('href'))

    # Nested so that it can reach result and xpath_movie through the closure
    def fetch_content(url):
        response = fetch_page(url)
        page = response.content
        html = etree.HTML(page)
        for element_movie in html.xpath(xpath_movie):
            # list.append is atomic in CPython, but the order across pages
            # depends on which thread finishes first
            result.append(element_movie)

    # One thread per remaining page: the requests overlap instead of queueing
    threads = []
    for url in fetch_list:
        t = Thread(target=fetch_content, args=(url,))
        t.start()
        threads.append(t)

    # Wait for every download to finish before reading result
    for t in threads:
        t.join()

    for i, movie in enumerate(result, 1):
        title = movie.find(xpath_title).text
        # print(i, title)
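&/code&&/pre&&/div&&p&The effect was immediate! Multithreading effectively removes the blocking waits: the crawl of the movie list now finishes in about 1.5 seconds, roughly 70% less time than the previous version's 5.069 seconds.&br&&/p&&div class=&highlight&&&pre&&code class=&language-bash&&&span&&/span&python movie_multithread.py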
Cost 1.506 seconds
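&/code&&/pre&&/div&&p&But Xiao Liang still was not satisfied. Python threads are constrained by the GIL, so why not use multiple processes instead? Without further ado he knocked out a multiprocess version that uses a pool of 4 worker processes to handle the network data in parallel.&br&&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&c1&&#-*- coding:utf-8 -*-&/span&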
import requests
from lxml import etree
from time import time
from concurrent.futures import ProcessPoolExecutor

url = 'https://movie.douban.com/top250'


def fetch_page(url):
    response = requests.get(url)
    return response


# Defined at module level so that it can be pickled and sent to worker processes
def fetch_content(url):
    response = fetch_page(url)
    page = response.content
    return page


def parse(url):
    page = fetch_content(url)
    html = etree.HTML(page)

    xpath_movie = '//*[@id="content"]/div/div[1]/ol/li'
    xpath_title = './/span[@class="title"]'
    xpath_pages = '//*[@id="content"]/div/div[1]/div[2]/a'

    pages = html.xpath(xpath_pages)
    fetch_list = []
    result = []

    for element_movie in html.xpath(xpath_movie):
        result.append(element_movie)

    for p in pages:
        fetch_list.append(url + p.get('href'))

    # Workers only download; the raw bytes come back to the parent for parsing.
    # On platforms that spawn workers (e.g. Windows), call parse() from under
    # an `if __name__ == '__main__':` guard.
    with ProcessPoolExecutor(max_workers=4) as executor:
        # map() yields the pages in the same order as fetch_list
        for page in executor.map(fetch_content, fetch_list):
            html = etree.HTML(page)
            for element_movie in html.xpath(xpath_movie):
                result.append(element_movie)

    for i, movie in enumerate(result, 1):
        title = movie.find(xpath_title).text
        # print(i, title)
&/code&&/pre&&/div&&p&The result was about 2 seconds, slower than even the multithreaded version.&br&&/p&&div class=&highlight&&&pre&&code class=&language-bash&&&span&&/span&python movie_multiprocess.py
Cost 2.037 seconds
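&/code&&/pre&&/div&&em&(Note: ThreadPoolExecutor and ProcessPoolExecutor, added in Python 3.2 as part of concurrent.futures, wrap a thread pool and a process pool respectively. On Python 2.x you need to install the &b&futures&/b& package to use them.)&/em&&p&Xiao Liang was dumbfounded; this was not at all what he had expected.&br&&/p&&figure&&img src=&https://pic3.zhimg.com/v2-7b020d801e8005eabf3e81f859ad6b86_b.jpg& data-rawwidth=&220& data-rawheight=&150& class=&content_image& width=&220&&&/figure&&p&The advantage of multiple processes (parallel CPU work) never showed up here. &strong&Instead, the cost of creating and scheduling processes outweighed the benefit&/strong& and dragged the program down. Even so, the multiprocess version is still far faster than the original single-process, single-threaded model.&br&&/p&&figure&&img src=&https://pic1.zhimg.com/v2-8e925f8b22f1f1aac5b8c297e2e09244_b.jpg& data-rawwidth=&300& data-rawheight=&187& class=&content_image& width=&300&&&/figure&&p&While Xiao Liang was racking his brain for other ways to improve performance, he came across an article describing the advantages of coroutines over processes and threads (&strong&&em&besides being expensive to create, processes and threads share a hard-to-cure weakness: coordination. A program built on multiple processes or threads is generally unpredictable unless it uses locks, whereas coroutines switch only at points the program chooses, so the user decides how they are scheduled and most of the coordination problem goes away.&/em&&/strong&). Never one to be left behind, Xiao Liang looked up some material and thought about how to use coroutines to strengthen his program.&/p&&p&He soon found a coroutine-based networking library called gevent. Better still, he had heard that once gevent's monkey patch is applied, the whole program becomes asynchronous!&/p&&figure&&img src=&https://pic2.zhimg.com/v2-805b60d4680fac025dca75d_b.jpg& data-rawwidth=&256& data-rawheight=&256& class=&content_image& width=&256&&&/figure&&p&Could it really be that magical? Xiao Liang could not wait to see what this black magic was, and promptly wrote a gevent-based example:&br&&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&c1&&#-*- coding:utf-8 -*-&/span&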
# Patch the standard library first, before requests (and with it ssl/socket)
# is imported, so that the network calls it makes are cooperative
from gevent import monkey
monkey.patch_all()

import gevent
import requests
from lxml import etree
from time import time

url = 'https://movie.douban.com/top250'


def fetch_page(url):
    response = requests.get(url)
    return response


def fetch_content(url):
    response = fetch_page(url)
    page = response.content
    return page


def parse(url):
    page = fetch_content(url)
    html = etree.HTML(page)

    xpath_movie = '//*[@id="content"]/div/div[1]/ol/li'
    xpath_title = './/span[@class="title"]'
    xpath_pages = '//*[@id="content"]/div/div[1]/div[2]/a'

    pages = html.xpath(xpath_pages)
    fetch_list = []
    result = []

    for element_movie in html.xpath(xpath_movie):
        result.append(element_movie)

    for p in pages:
        fetch_list.append(url + p.get('href'))

    # Spawn one greenlet per page, then wait for all of them to finish
    jobs = [gevent.spawn(fetch_content, url) for url in fetch_list]
    gevent.joinall(jobs)

    # job.value holds the return value of fetch_content for that greenlet
    for page in [job.value for job in jobs]:
        html = etree.HTML(page)
        for element_movie in html.xpath(xpath_movie):
            result.append(element_movie)

    for i, movie in enumerate(result, 1):
        title = movie.find(xpath_title).text
        # print(i, title)
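&/code&&/pre&&/div&Only about 1.1 seconds, fast indeed! Looking over the whole program, there is hardly a trace of asynchronous handling. &strong&gevent lets us write asynchronous programs with synchronous logic.&/strong& The call to monkey.patch_all() is the black magic that makes the program asynchronous: once the monkey patch is applied, Python replaces blocking standard-library modules (such as socket and thread) at runtime with cooperative, non-blocking versions, so the program's network operations run asynchronously and throughput improves accordingly.&div class=&highlight&&&pre&&code class=&language-bash&&&span&&/span&python movie_gevent.py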
Cost 1.1425 seconds
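&/code&&/pre&&/div&&p&The program was now very fast, but Xiao Liang was thoroughly bewildered. gevent's magic left him confused, he found it hard to learn, and it fell short of the clear, elegant Pythonic style he had in mind. The Python community had also recognized that Python needed a standard library of its own for coroutines, which is how asyncio later came about.&/p&&p&Xiao Liang swapped the synchronous requests library for aiohttp, which supports asyncio, and wrote a coroutine version of the example using the async/await syntax of Python 3.5 (&em&before 3.5, use @asyncio.coroutine and yield from instead&/em&).&/p&&div class=&highlight&&&pre&&code class=&language-python&&&span&&/span&&span class=&c1&&#-*- coding:utf-8 -*-&/span&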
from lxml import etree
from time import time
import asyncio
import aiohttp

url = 'https://movie.douban.com/top250'


async def fetch_content(url):
    # One session per request keeps the example short; reusing a single
    # ClientSession across requests is the usual recommendation
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()


async def parse(url):
    page = await fetch_content(url)
    html = etree.HTML(page)

    xpath_movie = '//*[@id="content"]/div/div[1]/ol/li'
    xpath_title = './/span[@class="title"]'
    xpath_pages = '//*[@id="content"]/div/div[1]/div[2]/a'

    pages = html.xpath(xpath_pages)
    fetch_list = []
    result = []

    for element_movie in html.xpath(xpath_movie):
        result.append(element_movie)

    for p in pages:
        fetch_list.append(url + p.get('href'))

    # Create one coroutine per page and run them concurrently;
    # gather() returns the page bodies in the order of fetch_list
    tasks = [fetch_content(url) for url in fetch_list]
    page_bodies = await asyncio.gather(*tasks)

    for page in page_bodies:
        html = etree.HTML(page)
        for element_movie in html.xpath(xpath_movie):
            result.append(element_movie)

    for i, movie in enumerate(result, 1):
        title = movie.find(xpath_title).text
        # print(i, title)
&span class=&k&&def&/span& &span

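The `asyncio.gather` pattern used in the coroutine version can be tried without touching the network. What follows is only a rough sketch under stated assumptions, not the author's code: the HTTP request is replaced by a simulated wait built on `asyncio.sleep`, and the names `fake_fetch`, `crawl`, and `main`, as well as the `example.com` URLs, are hypothetical stand-ins. It also assumes Python 3.7 or later for `asyncio.run`.

```python
import asyncio
from time import perf_counter


async def fake_fetch(url, delay=0.1):
    # Hypothetical stand-in for an HTTP request: wait without blocking the loop
    await asyncio.sleep(delay)
    return 'page for ' + url


async def crawl(urls):
    # Schedule every fetch at once; gather() returns results in input order
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


def main():
    # Five made-up page URLs, mirroring the '?start=N' pagination style
    urls = ['https://example.com/?start={}'.format(n) for n in range(0, 125, 25)]
    start = perf_counter()
    pages = asyncio.run(crawl(urls))
    elapsed = perf_counter() - start
    print('Fetched {} pages, cost {:.3f} seconds'.format(len(pages), elapsed))
    return pages, elapsed


if __name__ == '__main__':
    main()
```

Because the five simulated waits overlap on a single thread, the total should be close to one delay rather than five, which is the same effect the threaded and gevent versions get from overlapping real network waits.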