I require a web scraper, written in Python using Scrapy, with multiple spiders for scraping multiple news websites and retrieving articles, Scrapy pipelines for filtering those articles by keyword matching, and storage of all relevant articles in a PostgreSQL database.
Spiders
- The specific websites that I would like scraped will be provided at project commencement.
- The spiders should scrape each news website's RSS feeds (where possible).
- The spiders should store the following information for each article:
* title
* author
* publication date
* publication name
* article URL
* article text (including all HTML formatting)
* keywords (either from the article itself or from HTML meta tags)
- The spiders should be as generic as possible, extending a common base spider class so that support for further sites can be added easily (see the sketch after this list).
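A minimal sketch of how the item and base spider might be structured. All names here (ArticleItem, BaseNewsSpider, the feed_urls attribute) are illustrative, not part of the requirements, and the feed parsing assumes standard RSS 2.0 `<item>`/`<link>` elements:

```python
import scrapy


class ArticleItem(scrapy.Item):
    # Field names are illustrative; they mirror the list above.
    title = scrapy.Field()
    author = scrapy.Field()
    date = scrapy.Field()
    publication = scrapy.Field()
    url = scrapy.Field()
    text = scrapy.Field()      # article body, HTML formatting preserved
    keywords = scrapy.Field()  # from the article or its HTML meta tags


class BaseNewsSpider(scrapy.Spider):
    """Shared RSS handling. Site-specific spiders subclass this, set
    name/publication/feed_urls, and implement parse_article()."""

    publication = None  # human-readable publication name
    feed_urls = []      # RSS feed URLs for the site

    def start_requests(self):
        for url in self.feed_urls:
            yield scrapy.Request(url, callback=self.parse_feed)

    def parse_feed(self, response):
        # RSS is XML: each <item><link> points at a full article page.
        for link in response.xpath("//item/link/text()").getall():
            yield scrapy.Request(link.strip(), callback=self.parse_article)

    def parse_article(self, response):
        # Site-specific extraction: build and yield an ArticleItem.
        raise NotImplementedError
```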
Pipelines
- A pipeline should filter articles by matching each article's keywords or text against a list of "interesting" keywords.
- A second pipeline should write all "interesting" articles to a PostgreSQL database. (Sketches of both pipelines follow this list.)
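Minimal sketches of both pipelines, assuming the ArticleItem fields from the spider sketch above, a keyword list supplied through an INTERESTING_KEYWORDS setting (a name chosen here for illustration), and psycopg2 for database access. The table schema in the INSERT is likewise illustrative (a unique url column, keywords stored as text[]):

```python
import psycopg2
from scrapy.exceptions import DropItem


class KeywordFilterPipeline:
    """Drop any article whose keywords and text contain none of the
    'interesting' keywords; everything else passes through unchanged."""

    def __init__(self, keywords):
        self.keywords = [kw.lower() for kw in keywords]

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist("INTERESTING_KEYWORDS"))

    def process_item(self, item, spider):
        haystack = " ".join(item.get("keywords", [])).lower()
        haystack += " " + (item.get("text") or "").lower()
        if any(kw in haystack for kw in self.keywords):
            return item
        raise DropItem(f"no interesting keywords: {item.get('url')}")


class PostgresPipeline:
    """Write each surviving article to PostgreSQL."""

    def open_spider(self, spider):
        # Real connection parameters would come from Scrapy settings.
        self.conn = psycopg2.connect(dbname="news", user="scraper",
                                     password="secret", host="localhost")
        self.cur = self.conn.cursor()

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()

    def process_item(self, item, spider):
        # Assumes a unique index on url; psycopg2 adapts the Python list
        # in 'keywords' to a PostgreSQL array (text[] column).
        self.cur.execute(
            """INSERT INTO articles
                   (title, author, date, publication, url, text, keywords)
               VALUES (%s, %s, %s, %s, %s, %s, %s)
               ON CONFLICT (url) DO NOTHING""",
            (item.get("title"), item.get("author"), item.get("date"),
             item.get("publication"), item.get("url"), item.get("text"),
             item.get("keywords")),
        )
        self.conn.commit()
        return item
```

Both pipelines would be enabled in settings.py with the filter running first, so only "interesting" articles reach the database (the project module name here is a placeholder):

```python
ITEM_PIPELINES = {
    "newsscraper.pipelines.KeywordFilterPipeline": 100,
    "newsscraper.pipelines.PostgresPipeline": 200,
}
```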