Directory Spider and data collector

Cancelled Posted Jun 28, 2005 Paid on delivery

I need a very small script which does the following procedure:

SPIDER-PART

- go to the Google Directory (<http://www.g**[url removed]>)

- spider **all** pages from the directory

- collect each listed URL and save it into a database
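The traversal described in the steps above could be sketched roughly as follows. This is only an illustration in Python, not the requested fast implementation; `crawl`, `fetch_links`, and `max_pages` are hypothetical names, and `fetch_links` is assumed to download one page and return the links found on it.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def crawl(start_url, fetch_links, max_pages=None):
    """Breadth-first walk over all directory pages reachable from start_url.

    fetch_links(url) is a hypothetical callable that downloads one page and
    returns the href values found on it. It is passed in so the traversal
    can be exercised without network access.
    """
    start_host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        if max_pages is not None and len(visited) >= max_pages:
            break
        for href in fetch_links(url):
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            # Links to other hosts are listings to record, not pages to crawl.
            if urlparse(absolute).netloc != start_host:
                continue
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

The `seen` set keeps the spider from visiting a category twice when several pages link to it, which matters in a directory with many cross-references.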

For every URL collected, also save the following data:

1. Anchor Text

2. Category it is listed in

3. Width of the green bar shown beside the URL (the width of the green image "[url removed]")
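As a rough illustration of collecting these fields from one page, here is a minimal Python sketch. The directory's real markup is not given in this posting, so everything about the HTML is an assumption: the sketch supposes that each listing is an `<img>` with a `width` attribute followed by the listing's `<a>` tag, that the image file is called `bar.gif` (a placeholder, since the real name is elided above), and that the category is known from the page being parsed.

```python
from html.parser import HTMLParser

BAR_IMAGE = "bar.gif"  # hypothetical file name; the posting elides the real one


class ListingParser(HTMLParser):
    """Collects (url, anchor_text, category, bar_width) tuples from one page."""

    def __init__(self, category):
        super().__init__()
        self.category = category
        self.entries = []
        self._pending_width = None  # width of the most recent bar image
        self._href = None           # href of the listing link being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and (attrs.get("src") or "").endswith(BAR_IMAGE):
            width = attrs.get("width") or ""
            self._pending_width = int(width) if width.isdigit() else None
        elif tag == "a" and self._pending_width is not None and attrs.get("href"):
            # Assumption: the first link after a bar image is the listing.
            self._href = attrs["href"]
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            anchor = " ".join("".join(self._text).split())  # collapse whitespace
            self.entries.append(
                (self._href, anchor, self.category, self._pending_width)
            )
            self._href = None
            self._pending_width = None


def parse_listings(html, category):
    parser = ListingParser(category)
    parser.feed(html)
    parser.close()
    return parser.entries
```

Links that are not preceded by a bar image, such as navigation links, are skipped under this assumption.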

Once again, the script should do this for *all* directory entries. Since the whole spidering and data collection must finish within one day, or two at most, it has to be very fast.
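To make the time limit concrete, the required throughput can be estimated with a little arithmetic. The Python sketch below is illustrative only; the function names are hypothetical, and the number of pages in the directory and the time per request are not stated in this posting, so both must be supplied by whoever runs the estimate.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def required_pages_per_second(total_pages, days):
    """Average fetch rate needed to finish total_pages within the time limit."""
    return total_pages / (days * SECONDS_PER_DAY)


def required_connections(pages_per_second, seconds_per_request):
    """Parallel requests needed to sustain the rate.

    Uses Little's law: requests in flight = rate x time per request.
    """
    return math.ceil(pages_per_second * seconds_per_request)
```

For example, `required_pages_per_second(n, 2)` gives the rate for the two-day limit for any assumed page count `n`, and passing that rate to `required_connections` shows how many downloads would have to run in parallel.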

ADMIN-PART

I'll need some password-protected HTML pages where I can do the following:

- start/stop the script

- have a "live monitor" which shows which pages the script is crawling at the moment, which pages have already been crawled, and any errors that occur (can be done in Java, Flash or any other dynamic technology that can display the live status).

- a page where I can sort all data from the database by the following criteria:

1. pixel-width of the green bar

2. Category (alphabetically ordered)

- a search page where I can search for pages which belong to a certain category and/or which have a certain pixel width of the green bar.
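The sort and search pages above could sit on top of a single table. The following Python sketch uses SQLite purely for illustration; the table name `listing`, the column names, and the choice to show the widest bars first are all assumptions, and the actual database is up to the bidder.

```python
import sqlite3

# Whitelisted ORDER BY clauses; sort direction for the bar width is an assumption.
SORT_ORDERS = {
    "bar_width": "bar_width DESC, category COLLATE NOCASE ASC, url ASC",
    "category": "category COLLATE NOCASE ASC, bar_width DESC, url ASC",
}


def open_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS listing (
               url         TEXT    NOT NULL,
               anchor_text TEXT    NOT NULL,
               category    TEXT    NOT NULL,
               bar_width   INTEGER NOT NULL,
               PRIMARY KEY (url, category)
           )"""
    )
    # Indexes on the two columns used for sorting and searching.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_width ON listing (bar_width)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_category ON listing (category)")
    return conn


def search(conn, category=None, bar_width=None, sort_by="bar_width"):
    """Filter by category and/or bar width; with no filter, list everything."""
    clauses, params = [], []
    if category is not None:
        clauses.append("category = ?")
        params.append(category)
    if bar_width is not None:
        clauses.append("bar_width = ?")
        params.append(bar_width)
    where = "WHERE " + " AND ".join(clauses) if clauses else ""
    order = SORT_ORDERS[sort_by]  # KeyError for anything not whitelisted
    sql = (
        "SELECT url, anchor_text, category, bar_width "
        f"FROM listing {where} ORDER BY {order}"
    )
    return conn.execute(sql, params).fetchall()
```

Calling `search(conn, sort_by="category")` would back the sort page, while `search(conn, category=..., bar_width=...)` would back the search page. Search values are passed as bound parameters, and the sort key is looked up in a fixed table, so input from the web form never ends up in the SQL text.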

After accepting and paying for your work I may use, edit and resell the script free of further charge.

As far as I know, C++ should be the fastest language to put this into practice. I need an absolutely fast spider. None of the PHP spiders I know are able to spider the Google Directory within a day.

## Deliverables

1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.

2) Deliverables must be in ready-to-run condition, as follows (depending on the nature of the deliverables):

a) For web sites or other server-side deliverables intended to only ever exist in one place in the Buyer's environment--Deliverables must be installed by the Seller in ready-to-run condition in the Buyer's environment.

b) For all others including desktop software or software the buyer intends to distribute: A software installation package that will install the software in ready-to-run condition on the platform(s) specified in this bid request.

3) All deliverables will be considered "work made for hire" under U.S. Copyright law. Buyer will receive exclusive and complete copyrights to all work purchased. (No GPL, GNU, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site per the coder's Seller Legal Agreement).

## Platform

It will run on my dedicated Linux server.

Administration of the script must be possible through a web interface, though.

Skills: C Programming, JavaScript, JSP, Perl, PHP, XML

Project ID: #3784004


Remote project Active Jun 28, 2005