scrapy – 湫落

This article was last updated 1134 days ago, so some of the information may be out of date. If you find an error, please email 2192492965@qq.com

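Creating a project

Create the project:
```
scrapy startproject <project_name>
```
Create a spider file:
```
cd <project_name>
scrapy genspider <spider_name> <domain>
```
Running the commands above creates the spider file in the project's spiders directory.

The generated spider file: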

import scrapy

class NewsSpider(scrapy.Spider):
    name = 'news'  # spider name
    allowed_domains = ['www.cnr.cn']  # domains the spider is allowed to crawl
    start_urls = ['http://www.cnr.cn/']  # starting URLs; Scrapy requests each URL in this list automatically

    def parse(self, response):  # parse the response
        pass

Running the project
```
scrapy crawl <spider_name>
```

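Configuring settings

Find the settings.py file in the project's package directory (the inner folder named after the project).

Set ROBOTSTXT_OBEY to False so that robots.txt rules are not enforced, and set the log level (LOG_LEVEL) to "ERROR".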
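As a minimal sketch, the two settings described above might look like this in settings.py. `ROBOTSTXT_OBEY` and `LOG_LEVEL` are Scrapy's standard setting names; the values are the ones suggested in this post.

```python
# settings.py (fragment)

# Do not enforce robots.txt rules while crawling
ROBOTSTXT_OBEY = False

# Only print log messages at ERROR level or above
LOG_LEVEL = "ERROR"
```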

Set the user agent:

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36 Edg/107.0.1418.26'

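Parsing data

Scrapy has built-in XPath support, so you can call response.xpath() directly to parse the response.

    def parse(self, response):
        text = response.xpath('//div[@class="article-main"]/div//p/text()').extract()  # extract() pulls the matched text out of the Selector objects as a list of strings
        print(''.join(text))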

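Persisting data

Scrapy offers two ways to persist data:

Via the command line

This can only store what the parse method returns in a local file.
Via item pipelines

Via the command line

Command-line persistence supports only the following output formats:

('json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle')

scrapy crawl <spider_name> -o <output_file_path>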

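Via item pipelines

Workflow:

Parse the data
Define the corresponding fields in the Item class
Wrap the parsed data in an Item object
Submit the Item object to the pipeline for persistence
In the pipeline class's process_item method, persist the data carried by the item it receives
Enable the pipeline in settings.py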

Defining the fields in the Item class

In items.py:

import scrapy

class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    text = scrapy.Field()

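Wrapping the parsed data in an Item and submitting it to the pipeline

# First, import the Item class in the spider file; this example uses the tutorial project
from tutorial.items import TutorialItem

# Then, in the parse method, instantiate a TutorialItem, assign the data dict-style, and yield it to the pipeline
    def parse(self, response):
        title = response.xpath('//div[@class="article-header"]/h1/text()').extract()
        text = response.xpath('//div[@class="article-main"]/div//p/text()').extract()  # extract() pulls the matched text out of the Selector objects as a list of strings
        item = TutorialItem()
        # extract() returns lists, so join each one into a string before concatenating
        item['text'] = ''.join(title) + '\n' + ''.join(text)
        yield item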

Storing the data in process_item

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

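class TutorialPipeline:
    fp = None
    filename = 'a.txt'  # output file name

    # open the file once, when the spider starts
    def open_spider(self, spider):
        self.fp = open('./' + self.filename, 'w', encoding='utf-8')

    # write each item to the file
    def process_item(self, item, spider):
        self.fp.write(ItemAdapter(item)['text'] + '\n')
        return item  # hand the item on to the next pipeline, if any

    # close the file when the spider finishes
    def close_spider(self, spider):
        self.fp.close()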
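
Scrapy calls open_spider once when the spider starts, process_item once per yielded item, and close_spider once at the end. The sketch below drives a pipeline by hand to show that order. The class name `TextFilePipeline`, the temporary file, and the plain-dict items are assumptions for illustration; in a real project Scrapy makes these calls itself.

```python
import os
import tempfile


class TextFilePipeline:
    """Hypothetical pipeline that writes each item's text to a file."""

    def __init__(self, path):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.fp.write(item["text"] + "\n")
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        self.fp.close()


# Simulate the calls Scrapy would make during a crawl
path = os.path.join(tempfile.mkdtemp(), "a.txt")
pipeline = TextFilePipeline(path)
pipeline.open_spider(spider=None)
for item in ({"text": "first"}, {"text": "second"}):
    pipeline.process_item(item, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as fp:
    saved = fp.read()
print(saved)
```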

Enabling the pipeline

In settings.py:

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'tutorial.pipelines.TutorialPipeline': 300,
}

Uncomment this block; it is commented out in the generated settings.py.
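
The number assigned to each pipeline sets its order: items pass through pipelines from the lowest value to the highest, and values are conventionally kept in the 0 to 1000 range. A sketch with a second pipeline, where `TutorialJsonPipeline` is a hypothetical class name that does not exist in this project:

```python
# settings.py (fragment)
ITEM_PIPELINES = {
    "tutorial.pipelines.TutorialPipeline": 300,      # lower value, runs first
    "tutorial.pipelines.TutorialJsonPipeline": 400,  # hypothetical; runs second
}
```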
