Bored this morning, I scraped some novels from 笔趣阁 (biquge) and stored them in MongoDB, figuring they might come in handy if I ever build a novel site. Unfortunately my connection was terrible: an hour of crawling only got me a few hundred chapters, so scraping the whole site would take until who knows when. As for the Scrapy crawler framework, you really don't know until you try it: it is astonishingly good, and far simpler than rolling your own scraper with requests and BeautifulSoup. Enough chatter, on to the code.
I won't go over creating a virtual environment with virtualenv and pip-installing Scrapy. After generating the project (named clawer here), the project directory contains the usual Scrapy layout: scrapy.cfg, plus items.py, middlewares.py, pipelines.py, settings.py and the spiders/ folder inside the clawer package.
First open items.py and define the fields that will hold the scraped data; this makes the later steps easier. Here name is the novel's title, author is the author, and content is the chapter text.
import scrapy


class ClawerItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()
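As a quick aside (this snippet is only an illustration, not part of the project code), a Scrapy item behaves like a dict, but only the fields declared above are accepted; assigning anything else raises a KeyError:

from clawer.items import ClawerItem

item = ClawerItem(name='示例小说', author='某作者')
item['content'] = '章节正文……'
print(dict(item))             # prints all three declared fields

try:
    item['publisher'] = 'x'   # 'publisher' is not declared in ClawerItem
except KeyError as err:
    print('unknown field:', err)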
With the fields defined, we write the spider itself in the spiders folder. rules can be read as a set of instructions that tell the crawler which pages to follow on its own: the regular expressions match any URL under http://www.biquge.com.tw/, i.e. the whole of biquge, and the rule with callback='parse_item' hands the matched chapter pages to the parse_item method. For the XPath expressions inside parse_item, just open the browser's developer tools, find the element you need and right-click to copy its XPath. If you don't know what XPath is, the runoob tutorial is worth a look.
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from clawer.items import ClawerItem


class NovelSpider(CrawlSpider):
    name = 'novel'
    allowed_domains = ['www.biquge.com.tw']
    start_urls = ['http://www.biquge.com.tw/']

    rules = (
        # category listing pages (a single lowercase path segment)
        Rule(LinkExtractor(allow=(r'http://www.biquge.com.tw/[a-z]+/$',))),
        # book index pages (a <digits>_<digits> path segment)
        Rule(LinkExtractor(allow=(r'http://www.biquge.com.tw/\d+_\d+/$',))),
        # chapter pages, handed to parse_item
        Rule(LinkExtractor(allow=(r'http://www.biquge.com.tw/\d+_\d+/\d+\.html$',)),
             callback='parse_item'),
    )

    def parse_item(self, response):
        sel = Selector(response)
        item = ClawerItem()
        item['name'] = sel.xpath('//div[@class="bookname"]/div/a[3]/text()').extract_first()
        item['author'] = sel.xpath('//*[@id="newscontent"]/div[1]/ul/li[1]/span[3]/text()').extract()
        # join the non-blank text nodes of the chapter body
        contents = response.xpath('//*[@id="content"]/text()')
        s = ''
        for content in contents:
            if len(content.re(r'\S+')) > 0:
                s += content.re(r'\S+')[0]
        item['content'] = s
        return item
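The slightly odd-looking loop at the end deserves a word: each text node of the chapter body is padded with &nbsp; characters, and .re(r'\S+') strips that padding; since the chapter text is Chinese with no internal spaces, the first match is the whole paragraph. Here is a minimal, self-contained sketch of the same logic on a made-up HTML fragment (not real biquge markup):

from scrapy.selector import Selector

html = '<div id="content">&nbsp;&nbsp;第一段正文<br /><br />&nbsp;&nbsp;第二段正文</div>'
contents = Selector(text=html).xpath('//*[@id="content"]/text()')

s = ''
for content in contents:
    # .re(r'\S+') returns the non-whitespace runs of this text node;
    # the &nbsp; padding (\xa0) counts as whitespace and is dropped
    if len(content.re(r'\S+')) > 0:
        s += content.re(r'\S+')[0]

print(s)  # 第一段正文第二段正文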
Next, pipelines.py handles persisting the data to MongoDB.
import logging

import pymongo
from scrapy.exceptions import DropItem

logger = logging.getLogger(__name__)


class ClawerPipeline(object):

    def open_spider(self, spider):
        # read the MongoDB connection details from settings.py
        settings = spider.settings
        self.db_name = settings['MONGODB_DB']
        self.collection_name = settings['MONGODB_COLLECTION']
        self.connection = pymongo.MongoClient(settings['MONGODB_SERVER'],
                                              settings['MONGODB_PORT'])
        self.collection = self.connection[self.db_name][self.collection_name]

    def close_spider(self, spider):
        self.connection.close()

    def process_item(self, item, spider):
        # drop items with a missing field instead of writing bad records
        for field in item.fields:
            if not item.get(field):
                raise DropItem("Missing %s in item from %s" % (field, spider.name))
        new_novel = {
            'name': item['name'],
            'author': item['author'],
            'content': item['content'],
        }
        self.collection.insert_one(new_novel)
        logger.debug("Item written to MongoDB %s/%s",
                     self.db_name, self.collection_name)
        return item
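Once a few items have gone through, you can check what actually landed in MongoDB with a few lines of pymongo. This is just a sketch using the connection values from the settings below; count_documents needs pymongo 3.7 or newer:

import pymongo

client = pymongo.MongoClient('47.106.144.34', 27017)
collection = client['xuanhuan']['novel']

print(collection.count_documents({}))                # chapters stored so far
for doc in collection.find().limit(3):
    print(doc['name'], len(doc.get('content', '')))  # title and chapter length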
Finally, adjust the configuration in settings.py:
BOT_NAME = 'clawer'

SPIDER_MODULES = ['clawer.spiders']
NEWSPIDER_MODULE = 'clawer.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = True

MONGODB_SERVER = '47.106.144.34'
MONGODB_PORT = 27017
MONGODB_DB = 'xuanhuan'
MONGODB_COLLECTION = 'novel'

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'clawer.middlewares.ClawerSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'clawer.middlewares.ClawerDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'clawer.pipelines.ClawerPipeline': 300,
}

LOG_LEVEL = 'DEBUG'

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Now just run scrapy crawl novel from the command line and the spider will start crawling novels into MongoDB.
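If you prefer launching the crawl from Python instead of the command line, a small runner script also works. This is just a sketch, assuming it sits in the project root next to scrapy.cfg so that get_project_settings picks up the settings above:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('novel')   # the spider's name attribute
process.start()          # blocks until the crawl finishes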