xpath - Scrapy can't get data when following links
I have asked a question before about Scrapy not getting data. Now I have a new problem when using the spider. I've paid attention to the XPath, but there seems to be the same kind of error in the program.

Here is my spider's code:
```python
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy import Item, Field
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from db_connection import db_con


class Uniparc(Item):
    database = Field()
    identifier = Field()
    version = Field()
    organism = Field()
    first_seen = Field()
    last_seen = Field()
    active = Field()
    source = Field()


class UniparcSpider(CrawlSpider):
    name = "uniparc"
    allowed_domains = ["uniprot.org"]
    start_urls = ["http://www.uniprot.org/uniparc/?query=rna&offset=25&sort=score&columns=id%2Corganisms%2Ckb%2Cfirst-seen%2Clast-seen%2Clength"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//*[@id="results"]/tr/td[2]/a',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = Selector(response)
        sites = hxs.xpath('//*[@id="results"]/tr')
        db = db_con()
        collection = db.getcollection(self.term)
        for site in sites:
            item = Uniparc()
            item["database"] = map(unicode.strip, site.xpath("td[1]/text()").extract())
            item["identifier"] = map(unicode.strip, site.xpath("td[2]/a/text()").extract())
            item["version"] = map(unicode.strip, site.xpath("td[3]/text()").extract())
            item["organism"] = map(unicode.strip, site.xpath("td[4]/a/text()").extract())
            item["first_seen"] = map(unicode.strip, site.xpath("td[5]/text()").extract())
            item["last_seen"] = map(unicode.strip, site.xpath("td[6]/text()").extract())
            item["active"] = map(unicode.strip, site.xpath("td[7]/text()").extract())
            item['source'] = self.name
            collection.update({"identifier": item['identifier']}, dict(item), upsert=True)
            yield item
```
I used the rules to extract the links I want to follow and scrape data from, but it seems no URLs get extracted from the start_url at all.

Here is the log:
```
2016-05-28 22:28:54 [scrapy] INFO: Enabled item pipelines:
2016-05-28 22:28:54 [scrapy] INFO: Spider opened
2016-05-28 22:28:54 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-05-28 22:28:54 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-05-28 22:28:55 [scrapy] DEBUG: Crawled (200) <GET http://www.uniprot.org/uniparc/?query=rna&offset=25&sort=score&columns=id%2Corganisms%2Ckb%2Cfirst-seen%2Clast-seen%2Clength> (referer: None)
2016-05-28 22:28:55 [scrapy] INFO: Closing spider (finished)
2016-05-28 22:28:55 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 314,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 12263,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 5, 28, 14, 28, 55, 638618),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2016, 5, 28, 14, 28, 54, 645490)}
```
So can anyone tell me what's wrong with the code? Is there something wrong with the XPath? I've checked it many times.
In order to fix the link-following step, fix the XPath expression. Most likely the tr rows are not direct children of the element with id="results" (for instance, they sit inside a tbody), so the direct-child step /tr matches nothing, while the descendant axis //tr does. Replace:
//*[@id="results"]/tr/td[2]/a
with:
//*[@id="results"]//tr/td[2]/a
And, as a side note, you should not be inserting the extracted items into the database directly in the spider. For that, Scrapy offers item pipelines. In the case of MongoDB, check out scrapy-mongodb.
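For illustration, here is a minimal sketch of such a pipeline using pymongo directly. The MONGO_URI and MONGO_DATABASE setting names and their defaults are assumptions for this example, not part of Scrapy or scrapy-mongodb:

```python
import pymongo


class MongoDBPipeline(object):
    """Upsert scraped items into MongoDB, keyed on the "identifier" field."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read connection details from the project settings (assumed names).
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'uniparc'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # The same upsert the spider did inline, moved out of the parsing code.
        self.db[spider.name].update(
            {"identifier": item["identifier"]}, dict(item), upsert=True)
        return item
```

You would then enable it in settings.py (the module path here is hypothetical) and drop the db_con calls from the spider:

```python
ITEM_PIPELINES = {'myproject.pipelines.MongoDBPipeline': 300}
```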