Exploiting Location-aware Mechanism for Distributed Web Crawling over DHTs
Journal Title Journal of Computers
Journal Abbreviation jcp
Publisher Group Academy Publisher
Website http://ojs.academypublisher.com
Title Exploiting Location-aware Mechanism for Distributed Web Crawling over DHTs
Authors Fang, Binxing; Zhang, Hongli; Zhang, Weizhe; Xu, Xiao
Abstract Inspired by the concept of Internet computing, a DHT-based distributed Web crawling model is proposed to solve the bottlenecks of traditional Web crawling systems. Based on this model, we propose optimizations that reduce the download time of Web crawling tasks and thereby increase the efficiency of the system. The download time is improved by shortening the network distance between crawler and crawlee. By using the mapping mechanism of a Content Addressable Network (CAN) over a Network Coordinate System (NC), the issue can be mapped onto the problem of minimizing the distances between peers and resources on the DHT overlay. This paper focuses on reducing such distances, seeking to provide an improved location-aware infrastructure for distributed Web crawling. A new DHT-based distributed Web crawling model is proposed first. Then, under this model, a new method based on CAN's splitting schemes is proposed, which shows a significant decrease in crawler-crawlee distance compared with existing schemes. In addition, the issue of load balancing is solved by combining the new method with existing ones.
Publisher ACADEMY PUBLISHER
Date 2010-11-01
Source Journal of Computers Vol 5, No 11 (2010)
Rights Copyright © ACADEMY PUBLISHER - All Rights Reserved. To request permission, see http://www.academypublisher.com/copyrightpermission.html.
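
The abstract describes mapping a CAN overlay onto a network coordinate space so that each crawl target is handled by a nearby peer. The following is a minimal sketch of that general idea only, not the splitting scheme proposed in the paper. It assumes a two-dimensional unit coordinate space, distinct peer coordinates, and a simple split rule (cut halfway between the two peers on the axis where they differ most). The names `Zone`, `join`, and `assign` are hypothetical and chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Zone:
    """Axis-aligned region [lo, hi) of the coordinate space owned by one crawler."""
    owner: str
    coord: tuple  # owner's network coordinate (assumed to lie in [0, 1) x [0, 1))
    lo: list      # lower corner, inclusive
    hi: list      # upper corner, exclusive

    def contains(self, p):
        return all(l <= v < h for l, v, h in zip(self.lo, p, self.hi))


def join(zones, name, coord):
    """Add a peer by splitting the zone that contains its network coordinate.

    Assumed split rule (not the paper's): cut on the axis where the two peers
    are farthest apart, halfway between them, so each keeps its own coordinate.
    """
    if not zones:
        zones.append(Zone(name, coord, [0.0, 0.0], [1.0, 1.0]))
        return
    old = next(z for z in zones if z.contains(coord))
    axis = max(range(2), key=lambda d: abs(old.coord[d] - coord[d]))
    if old.coord[axis] == coord[axis]:
        raise ValueError("peers must have distinct coordinates")
    cut = (old.coord[axis] + coord[axis]) / 2
    new = Zone(name, coord, list(old.lo), list(old.hi))
    if coord[axis] < old.coord[axis]:
        new.hi[axis] = cut   # new peer takes the lower part
        old.lo[axis] = cut
    else:
        new.lo[axis] = cut   # new peer takes the upper part
        old.hi[axis] = cut
    zones.append(new)


def assign(zones, target):
    """Return the zone (and thus the crawler) responsible for a target coordinate."""
    return next(z for z in zones if z.contains(target))


# Usage: three crawlers join, then a target host is mapped to its crawler.
zones = []
for name, coord in [("A", (0.2, 0.2)), ("B", (0.8, 0.3)), ("C", (0.7, 0.9))]:
    join(zones, name, coord)

target = (0.6, 0.8)  # hypothetical network coordinate of a Web server
crawler = assign(zones, target)
print(crawler.owner, round(math.dist(crawler.coord, target), 3))
```

Because each zone always contains its owner's coordinate in this sketch, a target that falls into a zone is crawled by a peer in the same region of the coordinate space, which is the crawler-crawlee distance the abstract aims to reduce. Load balancing, which the abstract also mentions, is not modelled here.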

 
