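A crawler requests pages far faster than a person does, so it can easily exceed the request limit a site sets. When that happens, the site stops serving it or bans its IP. With a set of proxy IPs to rotate through, the requests appear to the site to come from different machines, and your own IP behind the proxies is unaffected. Even with proxies, do not request pages too frequently; the load you put on the site deserves some consideration.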
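The advice about not requesting pages too frequently can be enforced in code. The following is a minimal sketch, not part of the original scripts: `Throttle`, its `min_interval` parameter, and the injectable `clock` and `sleep` arguments are hypothetical names chosen for illustration.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between two requests
        self._clock = clock               # injectable so the class can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request

    def wait(self):
        """Sleep just long enough to respect min_interval; return the pause in seconds."""
        now = self._clock()
        pause = 0.0
        if self._last is not None:
            pause = max(0.0, self.min_interval - (now - self._last))
            if pause > 0:
                self._sleep(pause)
        self._last = now + pause
        return pause
```

Calling `throttle.wait()` immediately before each request keeps the request rate bounded no matter how fast the rest of the crawler runs.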
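1. Get free proxy IPs from http://www.xicidaili.com/nn/1. Open the page, view its source, work out how the markup is structured, locate the data you need, and extract it with a regular expression. The regular expression is
r'<td>(([0-9]\.|[1-9][0-9]\.|1[0-9]{2}\.|2[0-4][0-9]\.|25[0-5]\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))</td>\s+<td>(\d{2,5})</td>'.
Each octet is limited to the range 0 to 255. The first capture group is the full IP address and the fourth is the port.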
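2. Save the proxy IPs to a file for later use. Proxy IPs go stale quickly, though, and one may stop working within minutes, so it is enough to fetch a fresh list whenever you need one. The list can be stored in a file or in a database.
3. Check that each proxy IP still works. This can be done before every page fetch: if a proxy fails, switch to another one and remove the failed proxy from the list.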
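Steps 2 and 3 can be sketched on their own as a small proxy pool. This is a minimal sketch rather than the approach the scripts take: `ProxyPool`, `fetch_with_rotation`, and the `max_tries` parameter are hypothetical names, and `fetch` stands for whatever function performs the actual request through a proxy.

```python
import random


class ProxyPool:
    """Hold 'host:port' proxies and drop the ones that fail (hypothetical helper)."""

    def __init__(self, proxies):
        # Strip whitespace, skip blank lines, and de-duplicate while keeping order.
        self.proxies = list(dict.fromkeys(p.strip() for p in proxies if p.strip()))

    def pick(self):
        """Return a random proxy, or None when the pool is empty."""
        return random.choice(self.proxies) if self.proxies else None

    def discard(self, proxy):
        """Remove a proxy that turned out to be unusable."""
        if proxy in self.proxies:
            self.proxies.remove(proxy)


def fetch_with_rotation(pool, fetch, max_tries=3):
    """Call fetch(proxy) with proxies from the pool until one succeeds.

    fetch is any callable that raises OSError on failure; urllib.error.URLError
    and socket timeouts are both OSError subclasses.
    """
    for _ in range(max_tries):
        proxy = pool.pick()
        if proxy is None:
            break
        try:
            return fetch(proxy)
        except OSError:
            pool.discard(proxy)  # dead proxy: drop it and try another
    raise RuntimeError('no working proxy found')
```

Because the constructor accepts any iterable of lines, the pool can be filled straight from a saved list, for example `ProxyPool(open('d:/tmp/xici_ips.txt', encoding='utf-8'))`.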
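The code is as follows, split into two files: one fetches the proxy IPs and the other checks whether they work (a multiprocess version of the check is at <https://blog.csdn.net/uvyoaa/article/details/81069033>).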
# -*- coding: utf-8 -*-
'''
Fetch proxy IPs from www.xicidaili.com and save them to a file
'''
import urllib.request as req
import time
import re
import random

text_html = r'd:/tmp/xici_html.txt'  # raw HTML of the fetched pages
text_ips = r'd:/tmp/xici_ips.txt'    # extracted proxies, one ip:port per line


class Getxi():
    def __init__(self, page):
        self.page = page  # number of list pages to fetch
        self.url = r'http://www.xicidaili.com/nn/{}'

    def request_method(self, p):
        # Debug output: current time in seconds and in milliseconds
        curr_time = time.time()
        sec = int(curr_time)
        millisec = int(round(curr_time * 1000))
        print(sec, ' == ', millisec)
        # Browser-like headers. Accept-Encoding is left out on purpose:
        # urllib does not decompress gzip responses, so get_html() could
        # not decode a compressed body.
        headers = {
            'Cache-Control': 'max-age=0',
            'Connection': 'Keep-Alive',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                           '(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'),
            'Host': 'www.xicidaili.com',
            'Referer': 'http://www.xicidaili.com/',
            'Pragma': 'no-cache',
            'Upgrade-Insecure-Requests': '1',
        }
        url_com = self.url.format(p)
        reqs = req.Request(url_com, headers=headers)
        return reqs

    def get_html(self, p):
        reqss = self.request_method(p)
        with req.urlopen(reqss) as conn:
            html = conn.read().decode('utf-8')
        return html

    def save_html(self, ip_html):
        # The with block closes the file; the explicit encoding avoids
        # errors from the platform default encoding on Windows.
        with open(text_html, 'a', encoding='utf-8') as f:
            f.write(ip_html)

    def save_ips(self, ips):
        with open(text_ips, 'a', encoding='utf-8') as f:
            f.write(ips)

    def parse_html(self, ip_html):
        # Group 1 is the full IP address, group 4 is the port.
        pattern = re.compile(r'<td>(([0-9]\.|[1-9][0-9]\.|1[0-9]{2}\.|2[0-4][0-9]\.|25[0-5]\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))</td>\s+<td>(\d{2,5})</td>', re.S)
        tds = pattern.findall(ip_html)
        str1 = ''
        for td in tds:
            str1 += '{}:{}\n'.format(td[0].strip(), td[3].strip())
        # print(str1)
        self.save_ips(str1)

    def crawler(self):
        for i in range(self.page):
            html = self.get_html(i + 1)
            self.save_html(html)
            self.parse_html(html)
            # Pause 5 to 15 seconds between pages to go easy on the site
            time.sleep(random.randint(5, 15))


def xixi():
    page = 2
    xi = Getxi(page)
    xi.crawler()


if __name__ == '__main__':
    xixi()
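To see what `parse_html` extracts, the pattern can be tried on a small fragment. The sample markup below is invented for illustration and only mimics an IP cell followed by a port cell; it is not copied from the site. The pattern follows the one used in `parse_html`, written so that an octet may also be 0, and `extract_proxies` is a hypothetical helper name.

```python
import re

# Same shape as the pattern in parse_html: group 1 is the full IP, group 4 the port.
PATTERN = re.compile(
    r'<td>(([0-9]\.|[1-9][0-9]\.|1[0-9]{2}\.|2[0-4][0-9]\.|25[0-5]\.){3}'
    r'([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))</td>\s+<td>(\d{2,5})</td>'
)

# Invented sample rows (an assumption about the table layout, not real data).
SAMPLE = '''
<tr><td>10.0.3.7</td>
    <td>8080</td><td>HTTP</td></tr>
<tr><td>192.168.1.20</td>
    <td>3128</td><td>HTTPS</td></tr>
<tr><td>999.1.1.1</td>
    <td>80</td></tr>
'''


def extract_proxies(html):
    """Return 'ip:port' strings for every IP/port cell pair found in html."""
    # findall yields one tuple per match:
    # (ip, last dotted octet, final octet, port)
    return ['{}:{}'.format(m[0], m[3]) for m in PATTERN.findall(html)]


print(extract_proxies(SAMPLE))
# ['10.0.3.7:8080', '192.168.1.20:3128']
```

The third row is skipped because 999 is not a valid octet, which is the reason for spelling out the 0 to 255 range instead of using `\d{1,3}`.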
Validity check. The page requested through each proxy is http://2018.ip138.com/ic.asp
# -*- coding: utf-8 -*-
'''
Check whether the proxy IPs still work
'''
from urllib import request
import urllib.error
import http.client
import time
import random
import socket

ips_ok_file = r'd:/tmp/xici_1_ok.txt'  # working proxies are written here after the check
ips_file = r'd:/tmp/xici_ips.txt'      # list of proxies to check
url = 'http://2018.ip138.com/ic.asp'   # page that reports the visiting IP
User_Agent = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36')


class CheckProxyIp():
    def __init__(self):
        self.ok_ips = []  # proxies that passed the check

    def read_ips_file(self):
        with open(ips_file, 'r', encoding='utf-8') as f:
            ips = f.readlines()
        for ip in ips:
            i = ip.strip()
            if not i:
                continue  # skip blank lines
            self.check_ips(i)
            time.sleep(random.randint(1, 5))

    def check_ips(self, ip):
        proxy = {'http': ip, 'https': ip}
        print(proxy)
        proxy_handler = request.ProxyHandler(proxy)
        opener = request.build_opener(proxy_handler)
        opener.addheaders = [('User-Agent', User_Agent)]
        try:
            # Open through this opener directly instead of installing it globally
            response = opener.open(url, timeout=3)
            if response.getcode() == 200:
                html = response.read().decode('gbk')
                print(len(html))
                self.ok_ips.append(ip)
            else:
                print('no')
        except (UnicodeDecodeError,
                urllib.error.HTTPError,
                urllib.error.URLError,
                socket.timeout,
                http.client.RemoteDisconnected,
                ConnectionResetError) as e:
            # Any of these means the proxy is unusable; report it and move on
            print(e)

    def save_ok_ip(self):
        print('save ....')
        print(self.ok_ips)
        with open(ips_ok_file, 'w', encoding='utf-8') as f:
            f.write(''.join(ip + '\n' for ip in self.ok_ips))


def check():
    chcip = CheckProxyIp()
    chcip.read_ips_file()
    chcip.save_ok_ip()


if __name__ == '__main__':
    check()
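Checking proxies one at a time, with a timeout and a pause for each, gets slow as the list grows. The sketch below runs the checks in parallel. It uses a thread pool from `concurrent.futures`, which is a different technique from the multiprocess version linked earlier, and `proxy_works`, `filter_working`, and the `workers` parameter are hypothetical names. The test URL is the one the check script uses.

```python
import http.client
from concurrent.futures import ThreadPoolExecutor
from urllib import request

TEST_URL = 'http://2018.ip138.com/ic.asp'  # check page taken from the script


def proxy_works(proxy, url=TEST_URL, timeout=3):
    """Return True if url answers with HTTP 200 through the 'host:port' proxy."""
    handler = request.ProxyHandler({'http': proxy, 'https': proxy})
    opener = request.build_opener(handler)
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.getcode() == 200
    except (OSError, http.client.HTTPException):
        # URLError, timeouts and connection resets are all OSError subclasses.
        return False


def filter_working(proxies, check=proxy_works, workers=10):
    """Check proxies in parallel threads and return those that passed, in order."""
    proxies = list(proxies)
    with ThreadPoolExecutor(max_workers=workers) as executor:
        results = list(executor.map(check, proxies))
    return [p for p, ok in zip(proxies, results) if ok]
```

Threads suit this job because each check spends nearly all of its time waiting on the network. The `check` argument is there so the filtering logic can be exercised with a stand-in function and no network access.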