java - How to check the size() or isEmpty() for ConcurrentLinkedQueue
I am trying to prototype a simple web crawler structure in Java. So far the prototype just does the following:
- Initialize a queue with a list of starting URLs
- Take a URL out of the queue and submit it to a new thread
- Do some work and then add that URL to a set of already visited URLs

For the queue of starting URLs, I am using a ConcurrentLinkedQueue for the synchronization. I spawn the new threads using an ExecutorService.
But while creating a new thread, the application needs to check whether the ConcurrentLinkedQueue is empty or not. I tried using .size() and .isEmpty(), but neither seems to return the true state of the ConcurrentLinkedQueue.
The problem is in the block below:

    while (!crawler.getUrl_horizon().isEmpty()) {
        workers.submitNewWorkerThread(crawler);
    }

Because of this, the ExecutorService creates threads up to its limit even if the input has only 2 URLs.
Is there a problem with the way multi-threading is implemented here? If not, is there a better way to check the state of the ConcurrentLinkedQueue?
The starting class of the application:

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Properties;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.TimeUnit;

    public class CrawlerApp {

        private static Crawler crawler;

        public static void main(String[] args) {
            crawler = new Crawler();
            initializeApp();
            startCrawling();
        }

        private static void startCrawling() {
            crawler.setUrl_visited(new HashSet<URL>());
            WorkerManager workers = WorkerManager.getInstance();
            while (!crawler.getUrl_horizon().isEmpty()) {
                workers.submitNewWorkerThread(crawler);
            }
            try {
                workers.getExecutor().shutdown();
                workers.getExecutor().awaitTermination(10, TimeUnit.MINUTES);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }

        private static void initializeApp() {
            Properties config = new Properties();
            try {
                config.load(CrawlerApp.class.getClassLoader().getResourceAsStream("url-horizon.properties"));
                String[] horizon = config.getProperty("urls").split(",");
                ConcurrentLinkedQueue<URL> url_horizon = new ConcurrentLinkedQueue<>();
                for (String link : horizon) {
                    URL url = new URL();
                    url.setURL(link);
                    url_horizon.add(url);
                }
                crawler.setUrl_horizon(url_horizon);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
Crawler.java, which maintains the queue of URLs and the set of visited URLs:
    import java.util.Collections;
    import java.util.Set;
    import java.util.concurrent.ConcurrentLinkedQueue;

    public class Crawler implements Runnable {

        private ConcurrentLinkedQueue<URL> url_horizon;

        public void setUrl_horizon(ConcurrentLinkedQueue<URL> url_horizon) {
            this.url_horizon = url_horizon;
        }

        public ConcurrentLinkedQueue<URL> getUrl_horizon() {
            return url_horizon;
        }

        private Set<URL> url_visited;

        public void setUrl_visited(Set<URL> url_visited) {
            this.url_visited = url_visited;
        }

        public Set<URL> getUrl_visited() {
            return Collections.synchronizedSet(url_visited);
        }

        @Override
        public void run() {
            URL url = nextURLFromHorizon();
            scrap(url);
            addURLToVisited(url);
        }

        private URL nextURLFromHorizon() {
            if (!getUrl_horizon().isEmpty()) {
                URL url = url_horizon.poll();
                if (getUrl_visited().contains(url)) {
                    return nextURLFromHorizon();
                }
                System.out.println("Horizon URL:" + url.getURL());
                return url;
            }
            return null;
        }

        private void scrap(URL url) {
            new Scrapper().scrap(url);
        }

        private void addURLToVisited(URL url) {
            System.out.println("Adding to visited set:" + url.getURL());
            getUrl_visited().add(url);
        }
    }
URL.java is a class with just a private String url, and with overridden hashCode() and equals().
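Since URL.java itself is not shown, the following is only a minimal sketch of what such a class might look like based on that description; the getURL()/setURL() accessor names are assumed from how the other classes use it, not taken from the question.

    import java.util.Objects;

    // Hypothetical reconstruction of URL.java: a single String field with
    // equals() and hashCode() overridden so visited-set lookups compare by value.
    public class URL {

        private String url;

        public String getURL() {
            return url;
        }

        public void setURL(String url) {
            this.url = url;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof URL)) return false;
            return Objects.equals(url, ((URL) o).url);
        }

        @Override
        public int hashCode() {
            return Objects.hashCode(url);
        }
    }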
Also, Scrapper.scrap() has only a dummy implementation for now:
    public void scrap(URL url) {
        System.out.println("Done scrapping:" + url.getURL());
    }
And WorkerManager, which creates the threads:
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class WorkerManager {

        private static final Integer WORKER_LIMIT = 10;
        private final ExecutorService executor = Executors.newFixedThreadPool(WORKER_LIMIT);

        public ExecutorService getExecutor() {
            return executor;
        }

        private static volatile WorkerManager instance = null;

        private WorkerManager() {
        }

        public static WorkerManager getInstance() {
            if (instance == null) {
                synchronized (WorkerManager.class) {
                    if (instance == null) {
                        instance = new WorkerManager();
                    }
                }
            }
            return instance;
        }

        public Future submitNewWorkerThread(Runnable run) {
            return executor.submit(run);
        }
    }
Problem

The reason why you end up creating more threads than there are URLs in the queue is that it is possible (and in fact likely) that none of the Executor's threads start until you have gone through the while loop a lot of times.

Whenever you work with threads, you should keep in mind that the threads are scheduled independently and run at their own pace, except when you explicitly synchronize them. In this case, the threads can start at any time after the submit() call, even though it seems you would want each one to start and get past nextURLFromHorizon before the next iteration of the while loop.
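To make this concrete, here is a small self-contained sketch that reproduces the effect; the SubmitRaceDemo class name, example URLs, and pool size are made up, not from the question. The submit loop usually iterates far more times than there are queue elements, because the condition is re-checked before any pool thread has had a chance to poll.

    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class SubmitRaceDemo {

        public static void main(String[] args) throws InterruptedException {
            ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
            queue.add("http://example.com/a");
            queue.add("http://example.com/b");

            ExecutorService executor = Executors.newFixedThreadPool(10);
            int submissions = 0;

            // Same pattern as in the question: keep submitting while the queue
            // looks non-empty. The pool threads typically have not started
            // polling yet, so this loop runs many more times than there are URLs.
            while (!queue.isEmpty()) {
                executor.submit(() -> {
                    String url = queue.poll(); // may be null once the queue is drained
                    if (url != null) {
                        System.out.println("crawling " + url);
                    }
                });
                submissions++;
            }

            executor.shutdown();
            executor.awaitTermination(1, TimeUnit.MINUTES);
            System.out.println("URLs in queue: 2, tasks submitted: " + submissions);
        }
    }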
Solution

Consider dequeuing the URL from the queue before submitting the Runnable to the Executor. I would also suggest defining a CrawlerTask that is submitted to the Executor once per URL, rather than a Crawler that is submitted repeatedly. In such a design you wouldn't even need a thread-safe container for the URLs to-be-scraped.
    class CrawlerTask implements Runnable {
        URL url;

        CrawlerTask(URL url) {
            this.url = url;
        }

        @Override
        public void run() {
            scrape(url);
            // add url to visited?
        }
    }

    class Crawler {
        ExecutorService executor;
        Queue<URL> urlHorizon;
        // ...

        private void startCrawling() {
            while (!urlHorizon.isEmpty()) {
                executor.submit(new CrawlerTask(urlHorizon.poll()));
            }
            // ...
        }
    }
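For completeness, here is one way that design could be wired up end to end as a self-contained sketch. It assumes plain String URLs; the CrawlerMain class name, the ConcurrentHashMap.newKeySet() visited set, and the inline lambda standing in for Scrapper.scrap() are illustrative choices, not part of the original code.

    import java.util.ArrayDeque;
    import java.util.Queue;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class CrawlerMain {

        public static void main(String[] args) throws InterruptedException {
            // Only the main thread touches the horizon, so a plain queue is enough.
            Queue<String> urlHorizon = new ArrayDeque<>();
            urlHorizon.add("http://example.com/a");
            urlHorizon.add("http://example.com/b");

            // Worker threads write to the visited set concurrently, so it must be thread-safe.
            Set<String> visited = ConcurrentHashMap.newKeySet();

            ExecutorService executor = Executors.newFixedThreadPool(10);

            // Dequeue before submitting: exactly one task is created per URL.
            while (!urlHorizon.isEmpty()) {
                String url = urlHorizon.poll();
                executor.submit(() -> {
                    System.out.println("Done scrapping:" + url); // stand-in for Scrapper.scrap()
                    visited.add(url);
                });
            }

            executor.shutdown();
            executor.awaitTermination(10, TimeUnit.MINUTES);
            System.out.println("Visited: " + visited);
        }
    }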