python - Pandas: Append rows to a DataFrame already being run through pandas.DataFrame.apply


Brief: Using Selenium WebDriver and pandas with Python 2.7, I am making a web scraper that goes through a sequence of URLs and scrapes the URLs it finds on each page. If it finds URLs there, I want them added to the running sequence. How can I do that using pandas.DataFrame.apply?


Code:

import pandas as pd
from selenium import webdriver
import re

# driver is assumed to be set up elsewhere, e.g. driver = webdriver.Firefox()
df = pd.read_csv("spreadsheet.csv", delimiter=",")

def crawl(use):
    url = use["url"]
    driver.get(url)
    # "element" is the page element whose text is searched for URLs
    scraped_urls = re.findall(r"(www.+)", element.text)
    something_else = "foobar"

    # ideally the scraped_urls list would be unpacked here
    return pd.Series([scraped_urls, something_else])

df[["url", "something else"]] = df.apply(crawl, axis=1)

df.to_csv("result.csv", sep=",")

The above scraper uses the column "url" in spreadsheet.csv to navigate to each new URL. It scrapes the strings on the page that match the regex www.+ to find URLs, and puts the results in the list scraped_urls.

It also gets the string something_else = "foobar".

When it has processed all the cells in "url", it writes a new file, result.csv.
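For reference, here is a minimal, self-contained sketch of the apply pattern the code above relies on; the toy data and the parse function are my own illustration, not part of the original question:

import pandas as pd

df = pd.DataFrame({"url": ["www.yahoo.com", "www.altavista.com"]})

def parse(row):
    # Returning a Series whose index names the target columns lets apply()
    # fill several columns at once.
    return pd.Series([row["url"], "foobar"], index=["url", "something else"])

df[["url", "something else"]] = df.apply(parse, axis=1)
print(df)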


My problem:

I have had difficulties finding a way to add the scraped URLs in the list scraped_urls to the column "url" – inserted below the "active" URL (the one retrieved with use["url"]).

If the column in the source spreadsheet looks like this:

["url"] "www.yahoo.com" "www.altavista.com" "www.geocities.com" 

and on www.yahoo.com the scraper finds these strings via the regex:

"www.angelfire.com" "www.gamespy.com" 

I want to add these as rows to the column "url" below www.yahoo.com, so that they become the next keywords the scraper searches:

["url"] "www.yahoo.com"         #this 1 done "www.angelfire.com"     #go here  "www.gamespy.com"       #then here "www.altavista.com"     #then here "www.geocities.com"     #... 

Is this possible? Can I append on the fly to the DataFrame df while it is being run through apply()?

I don't think there is a way to use apply the way you envision. And even if there were a way,

  • it would require keeping track of how many items have already been iterated over, so you would know where to insert the new items into df['url']

  • inserting into the middle of df['url'] would require copying all the data from the current DataFrame into a new, larger DataFrame. Copying the whole DataFrame (potentially) once for every row would make the code unnecessarily slow (see the sketch after this list).
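To make that cost concrete, here is a hedged sketch of what such a mid-frame insert would look like; insert_urls, the position i, and the use of pd.concat are my own illustration of the approach being argued against, not code from the question or answer:

import pandas as pd

def insert_urls(df, i, scraped_urls):
    # Build a small frame of the newly scraped rows.
    new_rows = pd.DataFrame({"url": scraped_urls})
    # Every call slices the existing frame and re-concatenates it,
    # copying all rows each time a page yields new URLs.
    return pd.concat([df.iloc[:i + 1], new_rows, df.iloc[i + 1:]],
                     ignore_index=True)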

Instead, a simpler and better way is to use a stack. A stack can be implemented with a plain list. You can push df['url'] onto the stack, then pop a URL off the stack and process it. When new scraped URLs are found, they can be pushed onto the stack and become the next items popped off:

import pandas as pd

def crawl(url_stack):
    url_stack = list(url_stack)
    result = []
    while url_stack:
        url = url_stack.pop()
        driver.get(url)
        scraped_urls = ...
        url_stack.extend(scraped_urls)

        something_else = "foobar"
        result.append([url, something_else])
    return pd.DataFrame(result, columns=["url", "something else"])

df = pd.read_csv("spreadsheet.csv", delimiter=",")
df = crawl(df['url'][::-1])  # reversed so the first URL in the CSV is popped first
df.to_csv("result.csv", sep=",")
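As a follow-up, here is a hedged sketch of how the scraped_urls = ... placeholder might be filled in, reusing the regex from the question; the Firefox driver setup, the body-element lookup, and the visited set are my own assumptions, added to make the loop self-contained and to keep it from revisiting pages:

import re
import pandas as pd
from selenium import webdriver

driver = webdriver.Firefox()  # assumed setup; any WebDriver would do

def crawl(url_stack):
    url_stack = list(url_stack)
    visited = set()   # added so the crawl does not revisit the same URL forever
    result = []
    while url_stack:
        url = url_stack.pop()
        if url in visited:
            continue
        visited.add(url)
        driver.get(url)
        # Reuse the question's regex on the visible page text.
        body_text = driver.find_element_by_tag_name("body").text
        scraped_urls = re.findall(r"(www.+)", body_text)
        url_stack.extend(scraped_urls)

        something_else = "foobar"
        result.append([url, something_else])
    return pd.DataFrame(result, columns=["url", "something else"])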
