python - Pandas: Append rows to DataFrame already running through pandas.DataFrame.apply
Brief: Using Selenium WebDriver and pandas with Python 2.7, I am making a web scraper that goes through a sequence of URLs and scrapes the URLs found on each page. If it finds URLs there, I want them added to the running sequence. How can I do this using pandas.DataFrame.apply?
code:
import pandas as pd
from selenium import webdriver
import re

driver = webdriver.Firefox()  # assumed setup: any Selenium driver will do
df = pd.read_csv("spreadsheet.csv", delimiter=",")

def crawl(use):
    url = use["url"]
    driver.get(url)
    # assuming the page text is read from the body element
    element = driver.find_element_by_tag_name("body")
    scraped_urls = re.findall(r"(www.+)", element.text)
    something_else = "foobar"
    # ideally the scraped_urls list would have to be unpacked here
    return pd.Series([scraped_urls, something_else])

df[["url", "something else"]] = df.apply(crawl, axis=1)
df.to_csv("result.csv", sep=",")
The above scraper uses the column "url" in spreadsheet.csv to navigate to each new URL. It scrapes all strings on the page that match the regex www.+ to find URLs and puts the results in the list scraped_urls. It also gets the string something_else = "foobar". When it has processed all the cells in "url", it writes a new file, result.csv.
My problem:
I have had difficulty finding a way to add the scraped URLs in the list scraped_urls to the column "url", so that they are inserted below the "active" URL (the one retrieved with use["url"]).
If the column in the source spreadsheet looks like this:
["url"]
"www.yahoo.com"
"www.altavista.com"
"www.geocities.com"
and on www.yahoo.com the scraper finds these strings via the regex:
"www.angelfire.com"
"www.gamespy.com"
then I want to add these rows to the column "url" just below www.yahoo.com, so that they become the next keywords for the scraper to search:
["url"]
"www.yahoo.com"      #this one is done
"www.angelfire.com"  #go here next
"www.gamespy.com"    #then here
"www.altavista.com"  #then here
"www.geocities.com"  #...
Is this possible? Can I append on the fly to a DataFrame df that is being run through apply()?
I don't think there is a way to use apply the way you envision. And even if there were, it would require keeping track of how many items have been iterated over so you would know where to insert the new items into df['url']. Inserting into the middle of df['url'] would also require copying all the data from the current DataFrame into a new, larger DataFrame. Copying the whole DataFrame (potentially) once for every row would make the code unnecessarily slow.
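To make the cost concrete, here is a minimal sketch (my own illustration, not part of the original answer, with no Selenium involved) of what splicing new rows into the middle of a DataFrame looks like; every splice slices the frame and concatenates the pieces into a brand-new, larger frame:

import pandas as pd

df = pd.DataFrame({"url": ["www.yahoo.com", "www.altavista.com", "www.geocities.com"]})

# Splice two newly scraped URLs in directly below row 0 ("www.yahoo.com").
new_rows = pd.DataFrame({"url": ["www.angelfire.com", "www.gamespy.com"]})
df = pd.concat([df.iloc[:1], new_rows, df.iloc[1:]], ignore_index=True)

# Every such splice copies all the existing data into a new DataFrame,
# which is what makes doing this once per processed row slow.
print(df)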
Instead, a simpler, better way is to use a stack. The stack can be implemented with a simple list. You can push the contents of df['url'] onto the stack, then pop a URL off the stack and process it. When new scraped URLs are found, they can be pushed onto the stack and become the next items to be popped off:
import pandas as pd

def crawl(url_stack):
    url_stack = list(url_stack)
    result = []
    while url_stack:
        url = url_stack.pop()
        driver.get(url)
        scraped_urls = ...
        url_stack.extend(scraped_urls)
        something_else = "foobar"
        result.append([url, something_else])
    return pd.DataFrame(result, columns=["url", "something else"])

df = pd.read_csv("spreadsheet.csv", delimiter=",")
# Reverse the URLs so the first one in the spreadsheet ends up on top of the stack.
df = crawl(df['url'][::-1])
df.to_csv("result.csv", sep=",")
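As a usage sketch (again my own illustration, with the Selenium scraping stubbed out by a hypothetical scrape_page function), the stack gives exactly the visiting order asked for in the question: newly found URLs are processed immediately after the page they were found on:

def scrape_page(url):
    # Hypothetical stand-in for driver.get(url) plus the regex scrape.
    found = {"www.yahoo.com": ["www.angelfire.com", "www.gamespy.com"]}
    return found.get(url, [])

def visit_order(urls):
    stack = list(urls)[::-1]  # reverse so the first URL is on top of the stack
    visited = []
    while stack:
        url = stack.pop()
        # push scraped URLs (reversed) so the first one found is popped next
        stack.extend(scrape_page(url)[::-1])
        visited.append(url)
    return visited

print(visit_order(["www.yahoo.com", "www.altavista.com", "www.geocities.com"]))
# ['www.yahoo.com', 'www.angelfire.com', 'www.gamespy.com', 'www.altavista.com', 'www.geocities.com']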