Last night I read up on saving files with Scrapy and went to bed early; this morning, working from the material I had found online, I pieced it back together myself. After a whole morning the crawl finally worked. Happy!!! OK, let's get started.
As usual, screenshots first; everyone, get moving too.
[Screenshot: the downloaded example files, grouped into per-category folders]
[Screenshot: the .py files inside one of the category folders]
1.matlib.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from Matlib.items import MatlibItem


class MatlibSpider(scrapy.Spider):
    name = 'matlib'
    allowed_domains = ['matplotlib.org']
    start_urls = ['https://matplotlib.org/examples/index.html']

    def parse(self, response):
        # Follow each category link (li.toctree-l1) on the index page,
        # skipping the index page itself.
        le = LinkExtractor(restrict_css='div.toctree-wrapper.compound li.toctree-l1',
                           deny='/index.html$')
        for link in le.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse_url)

    def parse_url(self, response):
        # Each example page carries an external "source code" link to the .py file.
        sel = response.css('a.reference.external::attr(href)').extract()[0]
        url = response.urljoin(sel)
        mpl = MatlibItem()
        # Encode to bytes here, otherwise the download request in the
        # pipeline fails (see "Problems I ran into" below).
        mpl['files_url'] = url.encode('utf-8')
        yield mpl
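A quick way to sanity-check the two CSS selectors before running the full crawl is scrapy shell; a minimal sketch (the selector strings are the ones used above, everything else is standard scrapy shell usage):

# In a terminal:  scrapy shell https://matplotlib.org/examples/index.html
# then, at the prompt:
from scrapy.linkextractors import LinkExtractor
le = LinkExtractor(restrict_css='div.toctree-wrapper.compound li.toctree-l1',
                   deny='/index.html$')
[link.url for link in le.extract_links(response)][:3]  # first few category URLs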
2.items.py
import scrapy


class MatlibItem(scrapy.Item):
    files_url = scrapy.Field()  # set by the spider: URL of the example's .py file
    files = scrapy.Field()      # filled in by FilesPipeline after the download
3.pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
import scrapy
from scrapy.pipelines.files import FilesPipeline
import urlparse  # Python 2; on Python 3 this lives in urllib.parse
from os.path import basename, dirname, join


class MyFilePipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        # This method must be overridden, or the spider never saves any files!!
        yield scrapy.Request(item['files_url'])

    def file_path(self, request, response=None, info=None):
        # Override the stored file name and path: keep the category folder
        # and the original file name, e.g. 'animation/bayes_update.py'.
        # (An equivalent version can be written by splitting request.url on '/'
        # and joining the last two pieces.)
        path = urlparse.urlparse(request.url).path
        return join(basename(dirname(path)), basename(path))
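To make the path logic concrete, here is what file_path produces for one of the example URLs; a quick check you can run in a plain Python 2 shell (the URL is taken from the crawl output shown further down):

import urlparse
from os.path import basename, dirname, join

url = 'https://matplotlib.org/examples/animation/bayes_update.py'
path = urlparse.urlparse(url).path   # '/examples/animation/bayes_update.py'
print(join(basename(dirname(path)), basename(path)))   # animation/bayes_update.py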
4.settings.py (everything else is the same as in my earlier post, which you can look up)
ITEM_PIPELINES = {
    # 'Matlib.pipelines.MatlibPipeline': 200,
    'Matlib.pipelines.MyFilePipeline': 2,
    # 'scrapy.pipelines.files.FilesPipeline': 1,  # replaced by the subclass above
}
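One more thing settings.py has to contain for any FilesPipeline subclass to work is FILES_STORE, the root directory downloads are written into (the relative paths returned by file_path are created beneath it). A minimal sketch, assuming a hypothetical ./matplotlib_examples directory:

# settings.py (excerpt)
FILES_STORE = './matplotlib_examples'   # hypothetical download root; pick your own

With the pipeline enabled and FILES_STORE set, the crawl is started as usual with scrapy crawl matlib.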
Problems I ran into:
1.matlib.py
1.url = response.urljoin(sel): the joined URL has to be encoded (url.encode('utf-8')) before scrapy.Request(item['files_url']) in pipelines.py can download it
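A minimal sketch of the fix, assuming Python 2 (which this project uses, given the urlparse import); on Python 3 the plain str returned by urljoin() can be passed straight through:

url = response.urljoin(sel)              # unicode string on Python 2
mpl['files_url'] = url.encode('utf-8')   # bytes, safe for scrapy.Request in the pipeline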
2.pipelines.py:
1. If file_path is not overridden, the files are saved under automatically generated hash names rather than their real names; with the override in place, each downloaded file is recorded in the item like this:
'files': [{'checksum': '715610c4375a1d749bc26b39cf7e7199',
           'path': 'animation/bayes_update.py',
           'url': 'https://matplotlib.org/examples/animation/bayes_update.py'}],
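For reference, the stock FilesPipeline builds the stored path from a hash of the request URL, which is why everything lands flat in a full/ folder with opaque names when file_path is not overridden. Roughly the idea (a simplified sketch, not scrapy's exact code):

import hashlib

def default_style_path(url):
    # Simplified: scrapy hashes the URL (bytes) and keeps the file extension,
    # storing everything flat under 'full/'.
    media_guid = hashlib.sha1(url).hexdigest()
    return 'full/%s.py' % media_guid   # e.g. full/<sha1-of-url>.py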
2.def get_media_requests(self, item, info): this method has to generate the requests, or nothing gets downloaded!!! Other people's posts about overriding the path never showed this method, which misled me for a long time.
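The reason the override is mandatory here: the inherited get_media_requests only reads a field named file_urls (configurable through FILES_URLS_FIELD) and expects it to be a list, so with this project's single-string files_url field the stock version would generate no requests at all. Roughly, what the parent class does:

# Simplified sketch of the inherited behaviour:
def get_media_requests(self, item, info):
    # Iterates item['file_urls'] -- a field this project does not use,
    # so without the override nothing would ever be requested.
    return [scrapy.Request(x) for x in item.get('file_urls', [])]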
If any of you hit other problems, leave a comment and we can all improve together.