xpath - Scrapy can't get data when following links
I have asked a question before about Scrapy not getting data. Now I have a new problem when using the spider. I've paid attention to the XPath, but there seems to be the same kind of error in the program.

Here is my spider's code:
```python
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy import Item, Field
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from db_connection import db_con


class Uniparc(Item):
    database = Field()
    identifier = Field()
    version = Field()
    organism = Field()
    first_seen = Field()
    last_seen = Field()
    active = Field()
    source = Field()


class UniparcSpider(CrawlSpider):
    name = "uniparc"
    allowed_domains = ["uniprot.org"]
    start_urls = ["http://www.uniprot.org/uniparc/?query=rna&offset=25&sort=score&columns=id%2Corganisms%2Ckb%2Cfirst-seen%2Clast-seen%2Clength"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//*[@id="results"]/tr/td[2]/a',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = Selector(response)
        sites = hxs.xpath('//*[@id="results"]/tr')
        db = db_con()
        collection = db.getcollection(self.term)
        for site in sites:
            item = Uniparc()
            item["database"] = map(unicode.strip, site.xpath("td[1]/text()").extract())
            item["identifier"] = map(unicode.strip, site.xpath("td[2]/a/text()").extract())
            item["version"] = map(unicode.strip, site.xpath("td[3]/text()").extract())
            item["organism"] = map(unicode.strip, site.xpath("td[4]/a/text()").extract())
            item["first_seen"] = map(unicode.strip, site.xpath("td[5]/text()").extract())
            item["last_seen"] = map(unicode.strip, site.xpath("td[6]/text()").extract())
            item["active"] = map(unicode.strip, site.xpath("td[7]/text()").extract())
            item['source'] = self.name
            collection.update({"identifier": item['identifier']}, dict(item), upsert=True)
            yield item
```
I used the rules to extract the links I want to follow and scrape data from, but it seems no URLs get extracted from the start_url at all.

Here is the log:
```
2016-05-28 22:28:54 [scrapy] INFO: Enabled item pipelines:
2016-05-28 22:28:54 [scrapy] INFO: Spider opened
2016-05-28 22:28:54 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-05-28 22:28:54 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-05-28 22:28:55 [scrapy] DEBUG: Crawled (200) <GET http://www.uniprot.org/uniparc/?query=rna&offset=25&sort=score&columns=id%2Corganisms%2Ckb%2Cfirst-seen%2Clast-seen%2Clength> (referer: None)
2016-05-28 22:28:55 [scrapy] INFO: Closing spider (finished)
2016-05-28 22:28:55 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 314,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 12263,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 5, 28, 14, 28, 55, 638618),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2016, 5, 28, 14, 28, 54, 645490)}
```
So can anyone tell me what's wrong with the code? Is there something wrong with the XPath? I've checked it many times.
In order to fix the link-following step, fix the XPath expression. Most likely the tr rows are not direct children of the element with id="results" (for instance, they sit inside a tbody), so the direct-child step /tr matches nothing, while the descendant axis //tr does. Replace:
//*[@id="results"]/tr/td[2]/a
with:
//*[@id="results"]//tr/td[2]/a
And, as a side note, you should not be inserting the extracted items into the database directly in the spider. For that, Scrapy offers item pipelines. In the case of MongoDB, check out scrapy-mongodb.
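For illustration, here is a minimal sketch of such a pipeline using pymongo directly. The MONGO_URI and MONGO_DATABASE setting names and their defaults are assumptions for this example, not part of Scrapy or scrapy-mongodb:

```python
import pymongo


class MongoDBPipeline(object):
    """Upsert scraped items into MongoDB, keyed on the "identifier" field."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read connection details from the project settings (assumed names).
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'uniparc'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # The same upsert the spider did inline, moved out of the parsing code.
        self.db[spider.name].update(
            {"identifier": item["identifier"]}, dict(item), upsert=True)
        return item
```

You would then enable it in settings.py (the module path here is hypothetical) and drop the db_con calls from the spider:

```python
ITEM_PIPELINES = {'myproject.pipelines.MongoDBPipeline': 300}
```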