Trying out scrapy - 瀚海星空

2012-10-31

周海汉

http://abloz.com The previous post, "Installing scrapy", solved the problem of openssl failing to compile. This post takes scrapy for a trial run.

[zhouhh@Hadoop48 python]$ scrapy startproject test

[zhouhh@Hadoop48 test]$ find .
.
./scrapy.cfg
./test
./test/items.py
./test/spiders
./test/spiders/__init__.py
./test/spiders/test_spider.py
./test/__init__.py
./test/settings.py
./test/pipelines.py

[zhouhh@Hadoop48 test]$ cat test/spiders/test_spider.py
from scrapy.spider import BaseSpider

class TestSpider(BaseSpider):
    name = "hadoop48"
    allowed_domains = ["hadoop48"]
    start_urls = [
        "http://hadoop48/index.php"
    ]

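    def parse(self, response):
        # Name the output file after the second-to-last "/"-separated part of the URL
        filename = response.url.split("/")[-2]
        # Write the raw response body; the with block closes the file afterwards
        with open(filename, 'wb') as f:
            f.write(response.body)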

[zhouhh@Hadoop48 test]$ cat test/items.py

from scrapy.item import Item, Field

class TestItem(Item):
    # define the fields for your item here like:
    title = Field()
    link = Field()
    desc = Field()

[zhouhh@Hadoop48 test]$ scrapy crawl test
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 4, in <module>
    execute()
  File "/usr/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 96, in execute
    settings = get_project_settings()
  File "/usr/local/lib/python2.7/site-packages/scrapy/utils/project.py", line 56, in get_project_settings
    settings_module = __import__(settings_module_path, {}, {}, [''])
ImportError: No module named settings

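The name test conflicts with a name that already exists on the system (most likely Python's own test package, which gets imported instead of the project), so use something more distinctive: test1.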
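
A clash like this can be checked before a project is created. The following is a minimal sketch, assuming Python 3's importlib (the session above ran Python 2.7); the helper name shadows_existing_module is hypothetical and is not part of scrapy:

```python
import importlib.util


def shadows_existing_module(project_name):
    """Return True if a top-level module with this name is already importable."""
    # find_spec returns None when no top-level module of that name can be found.
    return importlib.util.find_spec(project_name) is not None


if __name__ == "__main__":
    # Candidate project names; the result depends on the local Python installation.
    for candidate in ("test", "test1"):
        print(candidate, shadows_existing_module(candidate))
```

If the helper reports True for a candidate name, picking a different project name avoids the import confusion shown in the traceback.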

[zhouhh@Hadoop48 python]$ scrapy startproject test1

Copy the corresponding source files into test1.

[zhouhh@Hadoop48 test1]$ scrapy crawl test1

KeyError: 'Spider not found: test1'

It turns out that the name attribute in test1_spider.py must match the name passed to crawl.

Change the name of class Test1Spider from hadoop48 to test1.
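
The lookup behind this error can be pictured as a dictionary keyed by each spider's name attribute. The sketch below is a toy model under that assumption, not scrapy's actual implementation; SpiderRegistry and its methods are hypothetical names used only for illustration:

```python
class SpiderRegistry:
    """Toy name-to-class lookup; not scrapy's real implementation."""

    def __init__(self):
        self._spiders = {}

    def register(self, spider_cls):
        # The key is the class attribute `name`, not the class name or file name.
        self._spiders[spider_cls.name] = spider_cls
        return spider_cls

    def get(self, spider_name):
        try:
            return self._spiders[spider_name]
        except KeyError:
            raise KeyError("Spider not found: %s" % spider_name) from None


class Test1Spider:
    name = "hadoop48"  # value carried over from the first project


registry = SpiderRegistry()
registry.register(Test1Spider)
```

With this model, registry.get("hadoop48") returns the class, while registry.get("test1") raises a KeyError until the name attribute is changed to test1.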

[zhouhh@Hadoop48 test1]$ scrapy crawl test1
2012-10-31 13:49:24+0800 [scrapy] INFO: Scrapy 0.16.1 started (bot: test1)

…

2012-10-31 13:49:24+0800 [test1] INFO: Spider closed (finished)

A file named hadoop48 has now appeared in the scrapy project root directory; it is exactly the home page I wanted to crawl.
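
The file name comes from the spider's parse method, which takes the second-to-last "/"-separated piece of the URL. A small standalone sketch of that rule, with output_filename as a hypothetical helper name:

```python
def output_filename(url):
    """Mirror the spider's naming rule: the second-to-last '/'-separated piece."""
    return url.split("/")[-2]


parts = "http://hadoop48/index.php".split("/")
print(parts)                                          # ['http:', '', 'hadoop48', 'index.php']
print(output_filename("http://hadoop48/index.php"))   # hadoop48
```

For this start URL the second-to-last piece happens to be the host name. For a deeper, made-up URL such as http://hadoop48/a/b.php the same rule would give a, so the rule only suits simple cases like this one.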

[zhouhh@Hadoop48 test1]$ ls
hadoop48 scrapy.cfg test1
[zhouhh@Hadoop48 test1]$ cat hadoop48

list tables demo of zhouhh Get all table names
167094287 10.28 JSON detail records
100004458 10.29 table

References:
Official site: http://www.scrapy.org
Tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.
Overview: http://doc.scrapy.org/en/latest/intro/overview.html

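Unless marked as a repost, all content is original. This site follows the Creative Commons (CC) license; please credit the source when reposting.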
