The group project is your opportunity to further explore an area of data mining you're interested in and also to gain practical experience in the domain. For your proposal, you must select team members and a topic. A custom rubric will be created for grading your project based on your topic choice, group size, and project difficulty.
You have complete freedom in the domain you choose, the technologies you use (programming languages, databases, etc.), and the type of analysis you perform. The only restriction is that the instructor needs full access to these tools in order to grade your work, so don't choose proprietary software unless the instructor has a license for it.
Your choices must also be difficult enough for you to be able to earn an A on the assignment. There must be enough work to share among group members, so a larger group's proposal should include multiple significantly difficult steps in the knowledge discovery process. Every project should add significant difficulty to at least one step in the process. As a simple rule of thumb, each additional group member calls for at least one additional step with significant difficulty.
Please submit your proposal early. If it is accepted, you will receive full credit. If it is not, you will have as many opportunities to resubmit for full credit as you need up until the due date. After the due date, you will receive partial credit once your submission is accepted.
What to Submit
Include the following in your submission:
data source
data cleaning plan
data integration plan
data selection plan
data mining plan
pattern evaluation plan
For each of these steps, also indicate what tools will be used, whether any significant programming will be needed, and if so, what language will be used. You may also assign tasks to group members in your proposal, but this is not required at this stage (it will be required in your final report). Every member of the group needs to submit the proposal, but all members may submit the same copy. This way I will know that everyone is on board.
Domain Choice
Your proposal should include a choice of domain. This essentially means identifying the type of data you will use in your project and, to some extent, the type of analysis you'll be able to do. Here are some examples:
social network analysis
financial data analysis
product recommendation
sports predictions
There are many more possible domains -- feel free to discuss any ideas you have with me in office hours or on the project discussion board on Blackboard.
Data Source
Determine where, specifically, you will retrieve the data you will use for your project. There are two ways you can add difficulty to your assignment in this step: choose a less typical type of data (text mining, linked data, etc.) or create your own data set (e.g. crawling the web, manual collection from surveys, etc.). Finding an existing but seldom-used data set will also earn some credit for difficulty.
Data Cleaning Plan
If your data requires significant cleaning before it is used, this may be a more significant step. Indicate and explain how much work you expect for this step.
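One way to back up your estimate is to profile the data before you write the plan. The following is only a minimal sketch in Python, assuming your records are already loaded as a list of dictionaries and that empty strings, "NA", and "?" mark missing values. The function name, the markers, and the sample records are all hypothetical, so adjust them to your own data set.

```python
from collections import Counter

# Assumed markers for missing values; your data set may use different ones.
MISSING_MARKERS = {"", "NA", "?"}

def missing_fraction(rows):
    """Return, per attribute, the fraction of rows whose value is missing."""
    counts = Counter()
    for row in rows:
        for key, value in row.items():
            if value is None or str(value).strip() in MISSING_MARKERS:
                counts[key] += 1
    total = len(rows)
    keys = {key for row in rows for key in row}
    # Counter returns 0 for attributes that never had a missing value.
    return {key: counts[key] / total for key in keys} if total else {}

# Hypothetical sample records, for illustration only.
sample_rows = [
    {"age": "34", "income": "52000"},
    {"age": "", "income": "48000"},
    {"age": "29", "income": "?"},
    {"age": "NA", "income": "61000"},
]

report = missing_fraction(sample_rows)
```

An attribute with a high missing fraction usually signals that cleaning will be one of your more significant steps.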
Data Integration Plan
If you are using multiple data sources, or if you intend to move your data set into a database or data warehouse, explain your plan here. Otherwise, indicate that this step is not applicable.
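As an illustration of what an integration step can look like, here is a small sketch that loads two sources into an in-memory SQLite database and joins them on a shared key. It uses Python's built-in sqlite3 module. The table names, column names, and sample rows are hypothetical, and your own plan may call for a different database entirely.

```python
import sqlite3

def integrate(players, scores):
    """Load two sources into SQLite and return total points per player name."""
    conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
    conn.execute("CREATE TABLE players (player_id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE scores (player_id INTEGER, points INTEGER)")
    conn.executemany("INSERT INTO players VALUES (?, ?)", players)
    conn.executemany("INSERT INTO scores VALUES (?, ?)", scores)
    # Join the two sources on the shared key and aggregate per player.
    rows = conn.execute(
        "SELECT p.name, SUM(s.points) FROM players p "
        "JOIN scores s ON s.player_id = p.player_id "
        "GROUP BY p.player_id, p.name ORDER BY p.name"
    ).fetchall()
    conn.close()
    return rows

# Hypothetical sample data standing in for two separate sources.
integrated = integrate(
    [(1, "Ada"), (2, "Grace")],
    [(1, 10), (1, 5), (2, 7)],
)
```

Your proposal does not need code like this, but it should name the key you will join on and where the combined data will live.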
Data Selection Plan
At the very least, indicate how you will select attributes from your data set for performing further analysis. Indicate if you intend to use data mining tools to determine what the subset of attributes should be, or if you intend to use a more complex technique to transform the data, such as principal component analysis.
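For a sense of the simplest end of this spectrum, the sketch below drops attributes that never vary, since a constant attribute cannot help distinguish records. This is only one basic filter, not a recommended method for every project, and it is far simpler than principal component analysis. The function name, threshold, and sample columns are hypothetical.

```python
from statistics import pvariance

def select_by_variance(columns, threshold=0.0):
    """Keep attributes whose population variance is strictly above the threshold."""
    return [name for name, values in columns.items() if pvariance(values) > threshold]

# Hypothetical numeric attributes, stored column by column.
columns = {
    "height": [1.6, 1.7, 1.8, 1.9],
    "constant_flag": [1, 1, 1, 1],
    "weight": [55, 70, 65, 80],
}

selected = select_by_variance(columns)
```

Here `constant_flag` is removed because its variance is zero. Note that variance depends on the scale of each attribute, so a nonzero threshold only makes sense after the attributes are normalized.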
Data Mining Plan
Determine what type of analysis you will perform (classification, clustering, outlier analysis, regression, mining association rules, etc.). Give some detail on what you will be looking for, how you plan to do this analysis (e.g. using WEKA, writing a classification program in Python, etc.), and what your expectations are for results. Again, use the discussion forum to share any tools you think might be interesting for other students.
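To give a sense of scale for "writing a classification program in Python", here is a deliberately tiny sketch of a one-nearest-neighbor classifier. It is an illustration under simple assumptions (numeric features, Euclidean distance), not a suggested approach for your project. The function name and the training points are hypothetical.

```python
import math

def nearest_neighbor_predict(train, query):
    """Return the label of the training example closest to the query point.

    train is a list of (features, label) pairs, where features is a tuple of numbers.
    """
    _, label = min(train, key=lambda item: math.dist(item[0], query))
    return label

# Hypothetical training data: two well-separated groups.
train = [
    ((1.0, 1.0), "a"),
    ((1.2, 0.8), "a"),
    ((5.0, 5.0), "b"),
    ((5.5, 4.5), "b"),
]

prediction_near_a = nearest_neighbor_predict(train, (1.1, 0.9))
prediction_near_b = nearest_neighbor_predict(train, (5.2, 4.9))
```

A program this small would not count as a significantly difficult step by itself; the difficulty comes from the data, the features, and how you evaluate the results.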
Pattern Evaluation Plan
If you plan to analyze the results of the data mining step (for instance, with cross-validation, visualization, etc.), describe that analysis here.
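If you mention cross-validation, it helps to be clear about how the data will be split. The sketch below shows one way to generate the train/test index sets for k-fold cross-validation. It assumes the examples have already been shuffled, and the function name is hypothetical; many tools, WEKA included, provide this for you.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over n examples."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    # Spread the remainder over the first n % k folds so sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

# Example: 5 examples split into 2 folds.
folds = list(k_fold_indices(5, 2))
```

Each example appears in exactly one test fold, so every record is used for evaluation once and for training in the remaining folds.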