
build web spider

$300-1500 USD

Closed
Posted almost 17 years ago

Paid on delivery
Hi,

Looking to have a web spider built. The spider must adhere to the following guidelines:

1) Completely obey robots.txt files and meta tags in web pages.
2) Only request the robots.txt file once when indexing a website, i.e. if spidering [login to view URL], request that site's robots.txt only once for all pages in the site, and store it in table at_Robot_Txt (a sketch of this once-per-site cache follows the list).
2a) Check whether my spider name is blocked by the website; if it is not, continue to index the pages.
2b) Insert the robots.txt into database table at_Robot_Txt and use that information to determine which pages can and cannot be indexed. Columns:
• URL_Robot_Idx INT Primary Key
• BaseURL VarChar(100)
• RobotTxt VarChar(7500)
If no robots.txt file is found, enter "No text file found in Site".
3) Allow me to enter my own user-agent name, i.e. "Spidername".
4) Read from a list of banned words and permitted words.
5) If the spider finds any banned words, ignore the page.
6) If it finds any permitted words, index the page.
7) If it finds neither of the above, ignore the page (see the word-filter sketch after this list).
8) Must be able to index 60,000+ pages a day.
9) Must run on any Windows platform from Windows 2000 Professional, XP or Server.
10) The user interface must be easy to use, and I should be able to see how the spider is progressing, similar to Visual Web Spider.
11) Take the list of URLs from SQL Server 2005 table at_URLsToIndex. Columns:
• URLID INT Primary Key
• URL VarChar(300)
12) When indexing a page, insert data into table at_SpideredWebsites. Columns:
• PageURL VarChar(300)
• BaseURL VarChar(100)
• PageTitle VarChar(200), maximum 20 words
• PageParagraph VarChar(6000)
• PageSize VarChar(6), in KB
• PageLastUpdated VarChar(10), format: 23 May 07
• ServerIpAddr VarChar(50)
• PageLevel INT, i.e. [login to view URL] = 100, [login to view URL] = 75, [login to view URL] = 50, [login to view URL] = 25, [login to view URL] = 0 (see the depth sketch below)
• PageSpidered SmallDateTime, format: 23 May 07
13) Only index URLs that begin with http://.
14) Remove all HTML tags before inserting into the database.
15) Ignore URLs that are invalid, i.e. [login to view URL]://[login to view URL] etc.
16) For body text, take all text except text in drop-down lists (see the extractor sketch below).

I cannot state strongly enough that the spider must obey robots.txt files and request each file only once per session, no matter how many threads are running. If the spider is stopped and restarted later, again request the robots.txt file only once and update the at_Robot_Txt table in the database.

If you cannot achieve the above, please do not apply for this project, as you would just be wasting my time and yours. The budget for this project is $500, but get this right and I'll use you for the crawler that needs building.

George
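The posting does not name an implementation language (the Windows plus SQL Server 2005 target suggests .NET would be a natural fit), so the sketches below use Python purely for illustration. First, a minimal sketch of item 2's once-per-session robots.txt handling; the user-agent default, the cache structure, and the at_Robot_Txt comments are assumptions.

```python
import threading
import urllib.robotparser
from urllib.parse import urlparse

# One robots.txt request per site per session, shared across all threads.
_robots_cache = {}
_cache_lock = threading.Lock()

def can_fetch(page_url, user_agent="Spidername"):
    """Item 2: fetch a site's robots.txt at most once, then answer
    allow/deny questions from the cached copy."""
    parts = urlparse(page_url)
    base = f"{parts.scheme}://{parts.netloc}"
    with _cache_lock:  # simplification: the lock is held across the fetch
        if base not in _robots_cache:
            rp = urllib.robotparser.RobotFileParser(f"{base}/robots.txt")
            try:
                rp.read()  # the single robots.txt request for this site
            except OSError:
                # here "No text file found in Site" would be written
                # to at_Robot_Txt (item 2b)
                rp = None
            _robots_cache[base] = rp
        rp = _robots_cache[base]
    # No robots.txt reachable: treat the site as allow-all.
    return rp is None or rp.can_fetch(user_agent, page_url)
```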
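Items 4 through 7 amount to a three-way gate on each page. A sketch of that gate, under the assumption that "finds a word" means a whole-word, case-insensitive match against the page's body text (the posting doesn't say whether substring matches count):

```python
def classify_page(body_text, banned_words, permitted_words):
    """Items 5-7: any banned word -> ignore; else any permitted
    word -> index; neither -> ignore. The word lists are assumed
    to be supplied in lowercase."""
    words = set(body_text.lower().split())
    if words & set(banned_words):
        return "ignore"   # item 5: banned word found
    if words & set(permitted_words):
        return "index"    # item 6: permitted word found
    return "ignore"       # item 7: neither list matched
```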
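Items 14 and 16 together mean: strip every tag, keep the body text, but discard text that sits inside drop-down lists. One way to do that with the standard-library HTML parser (the class name is mine, and dropping script/style content is an added assumption since it is never visible body text):

```python
from html.parser import HTMLParser

class BodyTextExtractor(HTMLParser):
    """Items 14/16: strip all HTML tags, keep body text, but drop
    anything nested inside <select> drop-down lists."""
    SKIP = {"select", "script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # nesting depth inside skipped elements
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)

# Usage: feed the raw page, then truncate to PageParagraph's 6,000 chars.
# extractor = BodyTextExtractor()
# extractor.feed(html)
# page_paragraph = extractor.text()[:6000]
```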
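The PageLevel example URLs in item 12 were redacted by the site, so the mapping here is a guess: the 100/75/50/25/0 scale is read as URL path depth, with example.com standing in for the redacted domain.

```python
from urllib.parse import urlparse

def page_level(url):
    """Item 12's PageLevel, as read above: the home page scores 100 and
    each path segment below the root drops the score by 25, floored at 0."""
    path = urlparse(url).path.strip("/")
    depth = len(path.split("/")) if path else 0
    return max(100 - 25 * depth, 0)

# e.g. http://example.com/ -> 100, http://example.com/a/ -> 75,
#      http://example.com/a/b/ -> 50, four or more levels deep -> 0
```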
Project ID: 150624

About the project

14 proposals
Remote project
Active 17 yrs ago


About the client

Washington, United Kingdom
0.0 rating (0 reviews)
Member since Feb 22, 2006

