周海汉
http://abloz.com
The previous article, "Installing Scrapy", resolved an OpenSSL compilation failure. This article takes Scrapy for a trial run.
[zhouhh@Hadoop48 python]$ scrapy startproject test
[zhouhh@Hadoop48 test]$ find .
.
./scrapy.cfg
./test
./test/items.py
./test/spiders
./test/spiders/__init__.py
./test/spiders/test_spider.py
./test/__init__.py
./test/settings.py
./test/pipelines.py
[zhouhh@Hadoop48 test]$ cat test/spiders/test_spider.py
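from scrapy.spider import BaseSpider


class TestSpider(BaseSpider):
    name = "hadoop48"
    allowed_domains = ["hadoop48"]
    start_urls = [
        "http://hadoop48/index.php"
    ]

    def parse(self, response):
        # Name the output file after the second-to-last URL segment (here, the host name).
        filename = response.url.split("/")[-2]
        # Save the raw response body; the with block closes the file afterwards.
        with open(filename, 'wb') as f:
            f.write(response.body)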
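To see why the saved file ends up being named after the host, here is a small standalone sketch of the filename expression used in parse. It does not need Scrapy. The helper name filename_for is hypothetical and only wraps the same split expression.

```python
def filename_for(url):
    """Return the second-to-last "/"-separated piece of the URL,
    using the same expression as the parse method above."""
    return url.split("/")[-2]


# "http://hadoop48/index.php" splits into
# ["http:", "", "hadoop48", "index.php"], so index -2 is the host name.
print(filename_for("http://hadoop48/index.php"))
```

Note that for a deeper path the same expression picks the last directory rather than the host, so a spider with several start URLs under one directory would write them all to the same file.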
[zhouhh@Hadoop48 test]$ cat test/items.py
from scrapy.item import Item, Field


class TestItem(Item):
    # define the fields for your item here like:
    title = Field()
    link = Field()
    desc = Field()
[zhouhh@Hadoop48 test]$ scrapy crawl test
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 4, in <module>
    execute()
  File "/usr/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 96, in execute
    settings = get_project_settings()
  File "/usr/local/lib/python2.7/site-packages/scrapy/utils/project.py", line 56, in get_project_settings
    settings_module = __import__(settings_module_path, {}, {}, [''])
ImportError: No module named settings
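The project name test conflicts with a name that already exists on the system (Python's standard library ships a package called test, so test.settings is looked up there instead of in the project). Pick something more distinctive: test1.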
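A collision like this can be spotted before creating the project. The following is a minimal sketch, assuming Python 3 (the session in this article runs Python 2.7, where importlib.util.find_spec is not available). The helper name_is_taken is hypothetical and is not part of Scrapy.

```python
import importlib.util


def name_is_taken(name):
    """Return True if `name` can already be imported as a top-level module."""
    return importlib.util.find_spec(name) is not None


# A standard-library module is reported as taken ...
print(name_is_taken("os"))
# ... while a made-up name is free to use as a project name.
print(name_is_taken("test1_project_that_does_not_exist"))
```

If the check returns True for the name you have in mind, choosing a different project name avoids the ImportError shown above.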
[zhouhh@Hadoop48 python]$ scrapy startproject test1
Copy the corresponding source files into test1.
[zhouhh@Hadoop48 test1]$ scrapy crawl test1
KeyError: 'Spider not found: test1'
It turns out that the name attribute in test1_spider.py must match the argument passed to scrapy crawl.
Change name in class Test1Spider from hadoop48 to test1.
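The lookup behaviour can be illustrated with a toy model. This is a simplified sketch, not Scrapy's actual implementation: find_spider is a hypothetical helper, and the Test1Spider class here is a plain stand-in that only carries a name attribute.

```python
class Test1Spider:
    # Stand-in for the real spider class; only the name attribute matters here.
    name = "hadoop48"


def find_spider(spiders, requested):
    """Look up a spider class by its name attribute, not by class or file name."""
    by_name = {cls.name: cls for cls in spiders}
    if requested not in by_name:
        raise KeyError("Spider not found: %s" % requested)
    return by_name[requested]


# With name = "hadoop48", asking for "test1" fails.
try:
    find_spider([Test1Spider], "test1")
except KeyError as err:
    print(err)

# After renaming, the same request succeeds.
Test1Spider.name = "test1"
print(find_spider([Test1Spider], "test1").__name__)
```

The point of the sketch is that neither the file name test1_spider.py nor the class name Test1Spider takes part in the lookup; only the value of name does.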
[zhouhh@Hadoop48 test1]$ scrapy crawl test1
2012-10-31 13:49:24+0800 [scrapy] INFO: Scrapy 0.16.1 started (bot: test1)
…
2012-10-31 13:49:24+0800 [test1] INFO: Spider closed (finished)
The Scrapy project root now contains a new file named hadoop48, which is exactly the home page I wanted to crawl.
[zhouhh@Hadoop48 test1]$ ls
hadoop48 scrapy.cfg test1
[zhouhh@Hadoop48 test1]$ cat hadoop48
167094287 10.28 json详单
100004458 10.29表格
References:
Official site: http://www.scrapy.org
Tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.
Overview: http://doc.scrapy.org/en/latest/intro/overview.html
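Unless marked as a repost, all content is original. This site follows a Creative Commons (CC) license; please credit the source when reposting.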