java - How to check the size() or isEmpty() for ConcurrentLinkedQueue
I am trying to prototype a simple web crawler structure in Java. So far the prototype just does the following:
- Initialize a queue with a list of starting URLs
- Take a URL out of the queue and submit it to a new thread
- Do some work and then add that URL to a set of already visited URLs

For the queue of starting URLs, I am using a ConcurrentLinkedQueue for the synchronization. I spawn the new threads using an ExecutorService.
But while creating a new thread, the application needs to check whether the ConcurrentLinkedQueue is empty or not. I tried using .size() and .isEmpty(), but neither seems to return the true state of the ConcurrentLinkedQueue.
The problem is in the block below:

    while (!crawler.getUrl_horizon().isEmpty()) {
        workers.submitNewWorkerThread(crawler);
    }

Because of this, the ExecutorService creates threads up to its limit even if the input has only 2 URLs.
Is there a problem with the way multi-threading is implemented here? If not, is there a better way to check the state of the ConcurrentLinkedQueue?
The starting class of the application:

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Properties;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.TimeUnit;

    public class CrawlerApp {

        private static Crawler crawler;

        public static void main(String[] args) {
            crawler = new Crawler();
            initializeApp();
            startCrawling();
        }

        private static void startCrawling() {
            crawler.setUrl_visited(new HashSet<URL>());
            WorkerManager workers = WorkerManager.getInstance();
            while (!crawler.getUrl_horizon().isEmpty()) {
                workers.submitNewWorkerThread(crawler);
            }
            try {
                workers.getExecutor().shutdown();
                workers.getExecutor().awaitTermination(10, TimeUnit.MINUTES);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }

        private static void initializeApp() {
            Properties config = new Properties();
            try {
                config.load(CrawlerApp.class.getClassLoader().getResourceAsStream("url-horizon.properties"));
                String[] horizon = config.getProperty("urls").split(",");
                ConcurrentLinkedQueue<URL> url_horizon = new ConcurrentLinkedQueue<>();
                for (String link : horizon) {
                    URL url = new URL();
                    url.setURL(link);
                    url_horizon.add(url);
                }
                crawler.setUrl_horizon(url_horizon);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
Crawler.java, which maintains the queue of URLs and the set of visited URLs:
    import java.util.Collections;
    import java.util.Set;
    import java.util.concurrent.ConcurrentLinkedQueue;

    public class Crawler implements Runnable {

        private ConcurrentLinkedQueue<URL> url_horizon;

        public void setUrl_horizon(ConcurrentLinkedQueue<URL> url_horizon) {
            this.url_horizon = url_horizon;
        }

        public ConcurrentLinkedQueue<URL> getUrl_horizon() {
            return url_horizon;
        }

        private Set<URL> url_visited;

        public void setUrl_visited(Set<URL> url_visited) {
            this.url_visited = url_visited;
        }

        public Set<URL> getUrl_visited() {
            return Collections.synchronizedSet(url_visited);
        }

        @Override
        public void run() {
            URL url = nextURLFromHorizon();
            scrap(url);
            addURLToVisited(url);
        }

        private URL nextURLFromHorizon() {
            if (!getUrl_horizon().isEmpty()) {
                URL url = url_horizon.poll();
                if (getUrl_visited().contains(url)) {
                    return nextURLFromHorizon();
                }
                System.out.println("Horizon URL:" + url.getURL());
                return url;
            }
            return null;
        }

        private void scrap(URL url) {
            new Scrapper().scrap(url);
        }

        private void addURLToVisited(URL url) {
            System.out.println("Adding to visited set:" + url.getURL());
            getUrl_visited().add(url);
        }
    }
URL.java is a class with just a private String url, and with overridden hashCode() and equals().
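Since URL.java itself is not shown, the following is only a minimal sketch of what such a class might look like based on that description; the getURL()/setURL() accessor names are assumed from how the other classes use it, not taken from the question.

    import java.util.Objects;

    // Hypothetical reconstruction of URL.java: a single String field with
    // equals() and hashCode() overridden so visited-set lookups compare by value.
    public class URL {

        private String url;

        public String getURL() {
            return url;
        }

        public void setURL(String url) {
            this.url = url;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof URL)) return false;
            return Objects.equals(url, ((URL) o).url);
        }

        @Override
        public int hashCode() {
            return Objects.hashCode(url);
        }
    }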
Also, Scrapper.scrap() has only a dummy implementation for now:
    public void scrap(URL url) {
        System.out.println("Done scrapping:" + url.getURL());
    }
And WorkerManager, which creates the threads:
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class WorkerManager {

        private static final Integer WORKER_LIMIT = 10;
        private final ExecutorService executor = Executors.newFixedThreadPool(WORKER_LIMIT);

        public ExecutorService getExecutor() {
            return executor;
        }

        private static volatile WorkerManager instance = null;

        private WorkerManager() {
        }

        public static WorkerManager getInstance() {
            if (instance == null) {
                synchronized (WorkerManager.class) {
                    if (instance == null) {
                        instance = new WorkerManager();
                    }
                }
            }
            return instance;
        }

        public Future submitNewWorkerThread(Runnable run) {
            return executor.submit(run);
        }
    }
Problem

The reason why you end up creating more threads than there are URLs in the queue is that it is possible (and in fact likely) that none of the Executor's threads start until you have gone through the while loop a lot of times.

Whenever you work with threads, you should keep in mind that the threads are scheduled independently and run at their own pace, except when you explicitly synchronize them. In this case, the threads can start at any time after the submit() call, even though it seems you would want each one to start and get past nextURLFromHorizon before the next iteration of the while loop.
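To make this concrete, here is a small self-contained sketch that reproduces the effect; the SubmitRaceDemo class name, example URLs, and pool size are made up, not from the question. The submit loop usually iterates far more times than there are queue elements, because the condition is re-checked before any pool thread has had a chance to poll.

    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class SubmitRaceDemo {

        public static void main(String[] args) throws InterruptedException {
            ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
            queue.add("http://example.com/a");
            queue.add("http://example.com/b");

            ExecutorService executor = Executors.newFixedThreadPool(10);
            int submissions = 0;

            // Same pattern as in the question: keep submitting while the queue
            // looks non-empty. The pool threads typically have not started
            // polling yet, so this loop runs many more times than there are URLs.
            while (!queue.isEmpty()) {
                executor.submit(() -> {
                    String url = queue.poll(); // may be null once the queue is drained
                    if (url != null) {
                        System.out.println("crawling " + url);
                    }
                });
                submissions++;
            }

            executor.shutdown();
            executor.awaitTermination(1, TimeUnit.MINUTES);
            System.out.println("URLs in queue: 2, tasks submitted: " + submissions);
        }
    }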
Solution

Consider dequeuing the URL from the queue before submitting the Runnable to the Executor. I would also suggest defining a CrawlerTask that is submitted to the Executor once per URL, rather than a Crawler that is submitted repeatedly. In such a design you wouldn't even need a thread-safe container for the URLs to-be-scraped.
    class CrawlerTask implements Runnable {
        URL url;

        CrawlerTask(URL url) {
            this.url = url;
        }

        @Override
        public void run() {
            scrape(url);
            // add url to visited?
        }
    }

    class Crawler {
        ExecutorService executor;
        Queue<URL> urlHorizon;
        // ...

        private void startCrawling() {
            while (!urlHorizon.isEmpty()) {
                executor.submit(new CrawlerTask(urlHorizon.poll()));
            }
            // ...
        }
    }
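For completeness, here is one way that design could be wired up end to end as a self-contained sketch. It assumes plain String URLs; the CrawlerMain class name, the ConcurrentHashMap.newKeySet() visited set, and the inline lambda standing in for Scrapper.scrap() are illustrative choices, not part of the original code.

    import java.util.ArrayDeque;
    import java.util.Queue;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class CrawlerMain {

        public static void main(String[] args) throws InterruptedException {
            // Only the main thread touches the horizon, so a plain queue is enough.
            Queue<String> urlHorizon = new ArrayDeque<>();
            urlHorizon.add("http://example.com/a");
            urlHorizon.add("http://example.com/b");

            // Worker threads write to the visited set concurrently, so it must be thread-safe.
            Set<String> visited = ConcurrentHashMap.newKeySet();

            ExecutorService executor = Executors.newFixedThreadPool(10);

            // Dequeue before submitting: exactly one task is created per URL.
            while (!urlHorizon.isEmpty()) {
                String url = urlHorizon.poll();
                executor.submit(() -> {
                    System.out.println("Done scrapping:" + url); // stand-in for Scrapper.scrap()
                    visited.add(url);
                });
            }

            executor.shutdown();
            executor.awaitTermination(10, TimeUnit.MINUTES);
            System.out.println("Visited: " + visited);
        }
    }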