How can I use different pipelines for different spiders in a single Scrapy project - 瀚海星空

2012-11-28

Hi vitsin, you can’t override settings from inside a spider the way your code does:

class FirstSpider(CrawlSpider):
    settings.overrides['ITEM_PIPELINES'] = ...

And you can’t customize the item pipelines per spider.

What you could do is check the spider in the process_item() of your pipeline, and ignore certain ones. For example:

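def process_item(self, item, spider):
    # Pass items through untouched for spiders this pipeline should ignore.
    if spider.name not in ['myspider1', 'myspider2', 'myspider3']:
        return item

    # ... pipeline code here ...
    return item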

Hope this helps, Pablo.

Not for the moment. But there are some nice alternatives for achieving that functionality. For example, you can choose a spider attribute to define which pipelines will be enabled for each spider, and then check that attribute in your pipelines.

Here’s how your spiders would look:

class SomeSpider(CrawlSpider):
    pipelines = ['first']

class AnotherSpider(CrawlSpider):
    pipelines = ['first', 'second']

And your pipelines:

class FirstPipeline(object):
    def process_item(self, item, spider):
        # Skip spiders that did not opt in to this pipeline.
        if 'first' not in getattr(spider, 'pipelines', []):
            return item

        # ... pipeline code here ...
        return item


class SecondPipeline(object):
    def process_item(self, item, spider):
        # Skip spiders that did not opt in to this pipeline.
        if 'second' not in getattr(spider, 'pipelines', []):
            return item

        # ... pipeline code here ...
        return item

Btw, this code can easily be made more performant by using sets instead of lists for the pipelines attribute, and by caching the pipelines per spider.

Pablo.
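The remark about sets and caching could be sketched roughly as follows. This is only an illustration: `SpiderStub` is a hypothetical stand-in for a real Scrapy spider so the snippet runs without Scrapy, and the `key`, `_enabled`, and `processed` names are assumptions made for the example, not part of Scrapy's API.

```python
class SpiderStub:
    """Hypothetical stand-in for a Scrapy spider (only what the pipeline reads)."""

    def __init__(self, name, pipelines=()):
        self.name = name
        # A set gives constant-time membership checks.
        self.pipelines = set(pipelines)


class FirstPipeline:
    key = 'first'  # assumed label that spiders list in their `pipelines` attribute

    def __init__(self):
        # Cache: spider name -> whether this pipeline is enabled for it.
        self._enabled = {}
        # Kept only so the example has something observable.
        self.processed = []

    def is_enabled(self, spider):
        try:
            return self._enabled[spider.name]
        except KeyError:
            enabled = self.key in getattr(spider, 'pipelines', ())
            self._enabled[spider.name] = enabled
            return enabled

    def process_item(self, item, spider):
        if not self.is_enabled(spider):
            return item  # not our spider: pass the item through unchanged
        # ... pipeline code here ...
        self.processed.append(item)
        return item


pipeline = FirstPipeline()
some_spider = SpiderStub('some', {'first'})
other_spider = SpiderStub('other', {'second'})
pipeline.process_item({'id': 1}, some_spider)
pipeline.process_item({'id': 2}, other_spider)
```

After the two calls, only the item from `some_spider` has gone through the pipeline body, and the lookup for each spider has been computed once and stored in the cache.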

On Thu, Nov 25, 2010 at 07:14:12AM -0800, vitsin wrote:

hi, are you planning may be to add support for custom pipeline per spider? 10x, –vs

--------

I can think of at least four approaches:

1. Use a different scrapy project per set of spiders+pipelines (might be appropriate if your spiders are different enough to warrant being in different projects).

2. On the scrapy tool command line, change the pipeline setting with scrapy settings in between each invocation of your spider.

3. Isolate your spiders into their own scrapy tool commands, and define the default_settings['ITEM_PIPELINES'] on your command class to the pipeline list you want for that command. See line 6 of this example.

4. In the pipeline classes themselves, have process_item() check what spider it’s running against, and do nothing if it should be ignored for that spider. See the example using resources per spider to get you started. (This seems like an ugly solution because it tightly couples spiders and item pipelines. You probably shouldn’t use this one.)
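If the last approach is used anyway, the repeated spider check could be pulled out into a decorator so each pipeline states its spiders in one place. The following is a minimal sketch, not something from the answers above: `only_for_spiders`, `TaggingPipeline`, and `SpiderStub` are hypothetical names, and the stub replaces a real Scrapy spider so the snippet runs on its own.

```python
import functools


def only_for_spiders(*names):
    """Run the wrapped process_item only for spiders whose name is listed."""
    allowed = frozenset(names)

    def decorator(process_item):
        @functools.wraps(process_item)
        def wrapper(self, item, spider):
            if spider.name not in allowed:
                return item  # ignored spider: pass the item through unchanged
            return process_item(self, item, spider)
        return wrapper
    return decorator


class TaggingPipeline:
    """Hypothetical pipeline that records which spider produced an item."""

    @only_for_spiders('myspider1', 'myspider2')
    def process_item(self, item, spider):
        item['tagged_by'] = spider.name
        return item


class SpiderStub:
    """Hypothetical stand-in for a Scrapy spider (only the name is needed)."""

    def __init__(self, name):
        self.name = name


pipeline = TaggingPipeline()
kept = pipeline.process_item({}, SpiderStub('myspider1'))
skipped = pipeline.process_item({}, SpiderStub('myspider9'))
```

The coupling between spiders and pipelines that the answer warns about is still there; the decorator only makes it explicit and keeps it out of the pipeline body.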

from: http://stackoverflow.com/questions/8372703/how-can-i-use-different-pipelines-for-different-spiders-in-a-single-scrapy-proje
