scrapy 试用

abloz 2012-10-31
2012-10-31

周海汉

http://abloz.com 上一篇文章讲了《scrapy 安装》,解决了openssl编译不通过的问题。本篇对scrapy进行试用。

[zhouhh@Hadoop48 python]$ scrapy startproject test

[zhouhh@Hadoop48 test]$ find .
.
./scrapy.cfg
./test
./test/items.py
./test/spiders
./test/spiders/__init__.py
./test/spiders/test_spider.py
./test/__init__.py
./test/settings.py
./test/pipelines.py






[zhouhh@Hadoop48 test]$ cat test/spiders/test_spider.py
from scrapy.spider import BaseSpider

class TestSpider(BaseSpider):
    name = "hadoop48"
    allowed_domains = ["hadoop48"]
    start_urls = [
        "http://hadoop48/index.php"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)

[zhouhh@Hadoop48 test]$ cat test/items.py

from scrapy.item import Item, Field

class TestItem(Item):
    # define the fields for your item here like:
    title = Field()
    link = Field()
    desc = Field()






[zhouhh@Hadoop48 test]$ scrapy crawl test
Traceback (most recent call last):
File "/usr/local/bin/scrapy", line 4, in <module>
execute()
File "/usr/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 96, in execute
settings = get_project_settings()
File "/usr/local/lib/python2.7/site-packages/scrapy/utils/project.py", line 56, in get_project_settings
settings_module = __import__(settings_module_path, {}, {}, [''])
ImportError: No module named settings

test和系统的名字有冲突,改特殊一点,用test1

[zhouhh@Hadoop48 python]$ scrapy startproject test1

将相应的源码复制到test1

[zhouhh@Hadoop48 test1]$ scrapy crawl test1

KeyError: ‘Spider not found: test1’

原来test1_spider.py的name,要设为和crawl一致。

将class Test1Spider 的name由hadoop48改为test1

[zhouhh@Hadoop48 test1]$ scrapy crawl test1 2012-10-31 13:49:24+0800 [scrapy] INFO: Scrapy 0.16.1 started (bot: test1)

2012-10-31 13:49:24+0800 [test1] INFO: Spider closed (finished)

此时scrapy项目根目录下多了hadoop48文件,正是我要抓取的首页。

[zhouhh@Hadoop48 test1]$ ls hadoop48 scrapy.cfg test1 [zhouhh@Hadoop48 test1]$ cat hadoop48

list tables demo of zhouhh 获取全部表名
167094287 10.28 json详单
100004458 10.29表格

参考: 官网:http://www.scrapy.org 教学:http://doc.scrapy.org/en/latest/intro/tutorial. 简介:http://doc.scrapy.org/en/latest/intro/overview.html


如非注明转载, 均为原创. 本站遵循知识共享CC协议,转载请注明来源