MUST follow the coding instructions laid out below (no deviations or substitutions).
I have attached sample data and details for the 8 sites to scrape. The scraper definition is also attached so you can see proper formatting for JSON.
Note: I will have many more of these for developers who do a good job in a timely and cost-effective manner.
Thanks,
Scott
Scraping Specs
- Written in Ruby, NO TABS (2 spaces instead).
- Run from the command line taking two arguments - the first should be an integer for the scrape ID, the second should be the URL for the VENUE where the scrape starts:
./[login to view URL] <ID:integer> <URL:string>
./[login to view URL] 111 [login to view URL]
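The two-argument invocation above can be handled with a short sketch like this (the real script filename is hidden behind the site's login wall, so the `scrape.rb` in the usage string is only a stand-in):

```ruby
# Sketch of the required argument handling. The script name in the usage
# message is a placeholder -- the real filename is not visible in the spec.

def parse_args(argv)
  unless argv.size == 2
    abort "usage: ./scrape.rb <ID:integer> <URL:string>"
  end
  id  = Integer(argv[0]) # raises ArgumentError if the ID is not an integer
  url = argv[1]
  [id, url]
end
```

`Integer()` is used instead of `to_i` so that a non-numeric ID fails loudly rather than silently becoming `0`.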
- Must use Curl for GET-ing URLs
GEM: curb
- Must only use standard Ruby regex for parsing, OR hpricot OR nokogiri as an alternative
GEM: hpricot
GEM: nokogiri
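The fetch-and-parse steps above might be split into two small methods, one per allowed gem (a sketch only: the requires are kept inside the methods so the file loads even where curb/nokogiri are not installed, and the `.event-title` CSS selector is a made-up example, not anything from the target sites):

```ruby
# Fetch a page with curb, per the spec's GET requirement.
def get_page(url)
  require "curb"
  Curl::Easy.perform(url).body_str
end

# Parse event titles out of the HTML with nokogiri.
def extract_titles(html)
  require "nokogiri"
  doc = Nokogiri::HTML(html)
  doc.css(".event-title").map { |node| node.text.strip } # selector is hypothetical
end
```

Keeping the gem requires local to each method also makes it obvious that nothing outside the permitted four gems is being loaded.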
- Must output JSON as a finished product, sample data included below
GEM: json
- Must *NOT* use any other GEMS outside of these three: curb, hpricot, nokogiri, json
- The script must output exactly one of two things, formatted as JSON: an ERROR, or the scraped data if everything works.
- If there is any kind of error, it needs to output JSON as defined below with a specific error code and message, or at least the generic error code and a message:
{"scrape": {
  "id": <SCRAPE_ID_FROM_INITIAL_ARGUMENT_1>,
  "url": "<URL_FROM_INITIAL_ARGUMENT_2>",
  "success": <BOOLEAN: true/false>,
  "error": {
    "code": <VALID_ERROR_CODE>,
    "description": "<TEXT_WITH_WHATEVER_ERROR_MESSAGE_YOU_WANT>"
  }
}}
VALID ERROR CODES ARE:
10: (Generic error of any kind)
20: (URL GET error - any error involving GET-ing a URL)
30: (PARSE error - any error involving parsing the data)
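The three codes above can be centralized in one builder so every failure path emits the same JSON shape (a sketch using only the stdlib `json` gem; the symbolic names for the codes are my own, not from the spec):

```ruby
require "json" # json ships with Ruby, so this sketch needs no extra gems

# Map the spec's error codes to readable names (names are hypothetical).
ERROR_CODES = { generic: 10, get: 20, parse: 30 }

# Build the error document exactly as the spec defines it.
def error_json(id, url, code, description)
  { "scrape" => {
      "id"      => id,
      "url"     => url,
      "success" => false,
      "error"   => { "code" => code, "description" => description }
  } }.to_json
end
```

A top-level `begin/rescue` can then rescue GET failures to code 20, parse failures to code 30, and everything else to code 10.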
SAMPLE ERROR RETURN:
{"scrape": {
  "id": 111,
  "url": "http://foo.com/calendar",
  "success": false,
  "error": {
    "code": 10,
    "description": "Problem doing something in the foo function."
  }
}}
- If it succeeds, it needs to output JSON as defined below, including at least the fields marked REQUIRED, in the proper format:
{"scrape": {
  "id": <SCRAPE_ID_FROM_INITIAL_ARGUMENT_1>,
  "url": "<URL_FROM_INITIAL_ARGUMENT_2>",
  "success": <BOOLEAN: true/false>,
  "events": [
    {
      "title": "<STRING: Name of the event REQUIRED>",
      "start_date": "<DATE: date of the event, or date the event starts (MM/DD/YYYY) REQUIRED>",
      "start_time": "<DATETIME: date/time the event starts in *24 HOUR LOCAL TIME* (MM/DD/YYYY HH:MM) OPTIONAL>",
      "end_date": "<DATE: date the event ends (MM/DD/YYYY) OPTIONAL>",
      "end_time": "<DATETIME: date/time the event ends in *24 HOUR LOCAL TIME* (MM/DD/YYYY HH:MM) OPTIONAL>",
      "repeating": <INTEGER: 0 if the event happens once, 1 if the event repeats weekly REQUIRED>,
      "repeats_on": "<STRING: *full* name of the day of week the event repeats on (Thursday, Friday, etc.) OPTIONAL>",
      "repeats_until": "<DATE: date the event repeats until (MM/DD/YYYY) OPTIONAL>",
      "image_url": "<STRING: url for an image associated with this event OPTIONAL>",
      "ticket_url": "<STRING: url to buy tickets for this event OPTIONAL>",
      "ticket_prices": "<STRING: descriptional text about the ticket price OPTIONAL>",
      "description": "<STRING: any freeform descriptive text about the event OPTIONAL>",
      "bands": [
        { "name": "<STRING: band name>" },
        { "name": "<STRING: band name>" }
      ]
    }
  ]
}}
SAMPLE DATA:
{"scrape": {
  "id": 111,
  "url": "http://foo.com/calendar",
  "success": true,
  "events": [
    {
      "title": "2$ off Lone Star!",
      "start_date": "01/01/2010",
      "repeating": 1,
      "repeats_on": "Tuesday",
      "repeats_until": "01/01/2011",
      "image_url": "http://pictures.com/of/lone_star.jpg"
    },
    {
      "title": "Rock Your Mom's House",
      "start_date": "01/10/2010",
      "start_time": "01/10/2010 19:00",
      "end_time": "01/10/2010 22:00",
      "repeating": 0,
      "image_url": "http://yourmoms.com/house.gif",
      "ticket_url": "http://buytix.to/yourmoms",
      "ticket_prices": "$8.00 all ages",
      "description": "These people really know how to stick it to you.",
      "bands": [
        { "name": "Buttcheeck Falcons" },
        { "name": "Foo Fighters" }
      ]
    }
  ]
}}
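A success document like the sample above can be assembled with a small builder (a sketch using only the stdlib `json` gem; the event values are the spec's own sample data, and optional keys are simply left out when a site doesn't provide them):

```ruby
require "json"

# Assemble the success output; only title, start_date, and repeating are
# REQUIRED, so optional keys are omitted from the hash when absent.
def success_json(id, url, events)
  { "scrape" => {
      "id"      => id,
      "url"     => url,
      "success" => true,
      "events"  => events
  } }.to_json
end

event = {
  "title"      => "Rock Your Mom's House",
  "start_date" => "01/10/2010",
  "repeating"  => 0
}
puts success_json(111, "http://foo.com/calendar", [event])
```

Building a plain hash and calling `to_json` at the very end keeps the output guaranteed-valid JSON, which hand-concatenated strings do not.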
NOTES:
- All TIMES / DATETIMES should be in the LOCAL TIME of whatever VENUE is being scraped. Usually this will match the times posted on the site, but BE SURE.
- ALWAYS return a valid error code if anything goes wrong, even if it's just the generic error code (10) with a message.