2017-8-27 python 协程加速

情景描述:

上周,由于产品嫌报告生成太慢,经过使用profile/gprof2dot研究后,发现主要时间耗费在接口网络请求上,于是我决定在项目中大量处理I/O网络请求的地方使用gevent,以缓解报告生成压力。

实现代码:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
import gevent
from gevent import monkey, pool
monkey.patch_socket()
p = pool.Pool(300)

def requests_parse(self, tel_tuple):
"""
主要处理requests请求
"""
print('Size of pool', len(p))
……
……

def generator_label(self, tel_data):
"""
使用gevent实现协程处理I/O网络请求
"""
jobs = [p.spawn(self.requests_parse, tel) for tel in tel_data]
gevent.joinall(jobs)

tlist = [x.value for x in jobs]
if None in tlist:
message_list = [x.exception for x in jobs]
self.logger.error(tlist)
self.logger.error(message_list)
raise Exception(message_list)
return tlist

def update_data_step(self, tel):
"""
主要将请求回来处理好的数据写入数据库(mongo)
"""
res = self.db.update()
……
……

tel_data = [……] # 一大堆待请求参数list
tlist = self.generator_label(set(tel_data))
map(lambda x: self.update_data_step(x[0])(x[1],x[2],x[3],x[4],x[5],x[6]), tlist)

问题:

上线后发现,代码运行一段时间后,请求一上来,任务数直线上升,一直增加:
enter description here
但是,pool的数量是正常的:
enter description here
之后log里报错信息:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
  File "run.py", line 42, in update_data
File "report/generate/calls_sa_by_tel.py", line 396, in run
File "report/generate/calls_sa_by_tel.py", line 38, in base_call
File "venv/lib/python2.7/site-packages/pymongo/cursor.py", line 729, in count
File "venv/lib/python2.7/site-packages/pymongo/collection.py", line 1344, in _count
File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
File "venv/lib/python2.7/site-packages/pymongo/mongo_client.py", line 904, in _socket_for_reads
File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
File "venv/lib/python2.7/site-packages/pymongo/mongo_client.py", line 870, in _get_socket
File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
File "venv/lib/python2.7/site-packages/pymongo/server.py", line 168, in get_socket
File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
File "venv/lib/python2.7/site-packages/pymongo/pool.py", line 844, in get_socket
File "venv/lib/python2.7/site-packages/pymongo/pool.py", line 881, in _get_socket_no_auth
File "venv/lib/python2.7/site-packages/pymongo/pool.py", line 817, in connect
File "venv/lib/python2.7/site-packages/pymongo/pool.py", line 263, in _raise_connection_failure
AutoReconnect: xxx.xxx.xxx.117:27017: [Errno 24] Too many open files

解决办法:

经过分析,解决办法:monkey.patch_socket()换为monkey.patch_all(),或者,在使用完gevent后使用reload(socket)将socket初始化。
原因:应该是mongo写数据是阻塞的,请求快于写操作,导致写操作堆积越来越多,最终导致程序抛出Too many open files错误。
最终代码,如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
import gevent
from gevent import monkey, pool
monkey.patch_all()

def generator_label(self, tel_data):
p = pool.Pool(10)
jobs = [p.spawn(self.requests_parse, tel) for tel in tel_data]
gevent.joinall(jobs)
# print [x.value for x in jobs]
tlist = [x.value for x in jobs]
if None in tlist:
message_list = [x.exception for x in jobs]
self.logger.error(tlist)
self.logger.error(message_list)
raise Exception(message_list)
return tlist


# 或者使用
p = pool.Pool(10)
jobs = [p.spawn(self.requests_parse, tel) for tel in tel_data]
gevent.joinall(jobs)
import socket
reload(socket)

另一个用例:

给传递两个参数,直接后面跟着就行,逗号分开;
返回如是多个情况的,value是一个以tuple为元素的list。

1
2
3
4
5
6
7
8
9
10
11
p = pool.Pool(10)
jobs = []
# date ('2017-07-01', '2017-07-14')
jobs = [p.spawn(self.crawl_a_call_log, date[0], date[1]) for date in dates]
gevent.joinall(jobs)
data_list = [x.value for x in jobs]
print data_list
if None in data_list:
message_list = [x.exception for x in jobs]
self.log('crawler', data_list, message_list)
raise Exception(message_list)

总结

协程(gevent)是把双刃剑,monkey.patch 是一个邪恶的东西;
提升效果不要太好,耗时足足降了60%,而且,请求越多,效果越明显。

支持小徐?賞一賞!