Final project
Deadlines
| Deliverable | Date | Time |
|---|---|---|
| Group creation | Thursday 10/30 | Beginning of class |
| Proposal for data collection | Wednesday 11/05 | 11:59pm |
| Revised proposal for data collection (if applicable) | Sunday 11/09 | 11:59pm |
| Collect your data | Anytime between proposal approval and the data set deadline on Monday 12/01 | |
| Excel/GoogleSheets of your data | Monday 12/01 | Beginning of class |
| Peer review draft | Wednesday 12/03 | 12:00pm |
| Peer review feedback* | Thursday 12/04 | Beginning of class |
| Rough draft of Methods | Sunday 12/07 | 11:59pm |
| Presentation | Thursday 12/11 | 7:00-10:00pm (slides due one hour before) |
| Report (rendered version and .qmd) and Finalized data set + dictionary | Sunday 12/14 (tentative) | 11:59pm |
| Reflection* | Sunday 12/14 (tentative) | 11:59pm |
*Reflection should be submitted individually, and peer review feedback should be prepared individually. All other deliverables are one-per-group.
Description
As a final project, you will work (optionally in groups) to collect and analyze data from fellow Middlebury students! The final project will give you a chance to demonstrate and celebrate your mastery of the statistics and data science concepts you have learned this entire semester! You and your group will come up with a list of research questions that you will subsequently answer using appropriate statistical methods.
In particular, your group will:
- come up with a list of inferential research questions you are interested in answering
- determine your target population, create a list of variables you’d like to observe, and devise an appropriate sampling strategy
- have a finalized data set that anyone can access and use, accompanied by a “data dictionary”
- present your project to the class
- submit a well-written, professional report
- submit a reflection on the project, which will include a rubric for how much each team member contributed
You should have fun and be creative with this project! Your research questions can be serious or funny (so long as they are PG-13 or below!).
Week 12 of the semester will be mostly devoted to project work time.
Requirements
The following looks like a lot, but it is meant to provide you with structure for a strong final project that you are proud of!
Group creation
Each group will ideally contain three people.
By the due date listed in the table above, you should either:
- form a group of three on your own and then e-mail Prof. Tang your decision, cc-ing everyone in the group
- form a group of two and then e-mail Prof. Tang your decision, cc-ing the other person in the group.
- e-mail Prof. Tang if you would like to work alone AND/OR if you wouldn’t mind being paired up with another person
The larger the group, the more research questions you are required to answer.
Proposal for data collection
By the due date listed in the table above, your group should submit to Canvas a document detailing the following:
- Your target population(s) of interest
- A list of research questions your group is interested in exploring/answering. You should have at least one question per group member, but more is better!
  - Note: we have not learned about regression yet, but we will soon learn linear regression, which answers questions about how changes in a variable \(x\) might lead to changes in a continuous variable \(y\).
- A list of variables you will observe/collect from each observational unit that may be useful and/or required for answering your research question(s).
  - You should plan to collect at least four numerical variables (at least two of which are continuous), and at least two categorical variables.
- A detailed description of how you plan to collect your data that touches on each of the following:
  - How many people will you aim to include in your study?
  - How will you collect the data? (e.g. have participants fill out a paper form or Google Form, or record the data yourself)
  - IMPORTANT: you will either conduct an observational study or an experiment.
    - If you conduct an observational study, you will have to implement some sort of random sampling at some point in your study. In your proposal, detail exactly how you plan to obtain your sample using random sampling. Remember, this could be SRS, cluster, systematic, or stratified.
    - If you conduct an experiment, you will have to do a randomized experiment, but you do not need to do random sampling to recruit people to your study (though it would be great if you do!). In your proposal, clearly define what your response variable(s) is/are, as well as the different treatment(s). Also clearly define how you will implement the random assignment.
- A rough description of the inferential methods you might use to answer each of the research questions you listed above.
  - This might be along the lines of “hypothesis test for a single mean” or “confidence interval for a proportion” or “linear regression”, along with a one- to two-sentence justification for why the method is appropriate.
  - It is possible that we haven’t yet learned a method to answer some of your questions. If so, please indicate this on your proposal along with a brief justification for why you think what we have currently learned doesn’t apply.
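To make the random sampling and random assignment requirements above more concrete, here is a minimal R sketch. The roster, the `dorm` variable, the sample sizes, the seed, and the treatment labels are all made up for illustration; your own sampling frame and design will likely look different.

```r
# Hypothetical sampling frame: 200 students across 4 dorms (made-up data)
set.seed(218)  # arbitrary seed so the random draws can be reproduced
roster <- data.frame(
  id   = 1:200,
  dorm = rep(c("A", "B", "C", "D"), each = 50)
)

# Simple random sample (SRS): 40 students drawn without replacement
srs_ids <- sample(roster$id, size = 40)

# Stratified sample: an SRS of 10 students within each dorm
strat_ids <- unlist(lapply(split(roster$id, roster$dorm),
                           function(ids) sample(ids, size = 10)))

# Randomized experiment: shuffle a balanced set of treatment labels
# across the 40 recruited participants
participants <- data.frame(id = srs_ids)
participants$treatment <- sample(rep(c("control", "treatment"), each = 20))
```

Shuffling a balanced vector of labels guarantees equal group sizes; assigning each participant by an independent coin flip would also be random assignment, but the group sizes could then differ.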
Once you have submitted your proposal, Prof. Tang will give you feedback. Based on the feedback, your group will either have to submit a revised proposal, or you will be given the okay to go forth and collect your data.
Data set
Once you have collected your data, you should compile it all into a single Excel or GoogleSheets file. Please give each variable an informative one-word name (use underscores to make longer names). This should be submitted to Canvas by the beginning of Week 12. If you link a GoogleSheets file, please ensure that you give me access!
Report
Each group will write a final report created in a .qmd file. You can download the template here:
Your report will be submitted both as a PDF and as a .qmd file. The code in your report should be as reproducible as possible (e.g. not hard coding specific values, setting a seed when appropriate). All tables should be beautiful!
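As a rough illustration of what "reproducible" can look like in practice, the sketch below sets a seed before a random step and computes summary quantities from the data instead of typing them in. The `survey` data frame and `sleep_hours` variable are made up for this example.

```r
# Made-up data standing in for a collected data set
set.seed(1)  # set a seed before any random step so results can be reproduced
survey <- data.frame(sleep_hours = round(rnorm(35, mean = 7, sd = 1.2), 1))

# Compute quantities from the data rather than typing them in by hand
n     <- nrow(survey)              # rather than: n <- 35
x_bar <- mean(survey$sleep_hours)  # rather than a hand-typed mean
s     <- sd(survey$sleep_hours)
```

If the data set later changes (for example, after removing an invalid response), quantities computed this way update automatically when the report is re-rendered.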
Additionally, only the output/execution from the code relevant for the report should be visible to the reader, not the code itself. This means that as the viewer, I should not see any code or extraneous output. The former is achieved by setting the chunk header option echo = F. That being said, if there is some code that you believe is useful/important for the reader to see, you are welcome to make that code visible by setting echo = T.
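As one possible sketch, the chunk body below hides its code while still showing its output. It uses Quarto's `#|` comment syntax for chunk options, which is an alternative way of writing the header option described above; the built-in `cars` data set is only a placeholder.

```r
#| echo: false
#| message: false
#| warning: false

# With echo: false, the reader sees only this chunk's output, not the code
speed_summary <- summary(cars$speed)
speed_summary
```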
The report does not have a minimum length, but should contain each of the following sections. The report should also have an informative, academically appropriate title along with each group member’s name.
An example of reports from previous years: note that they are not perfect, but they might give you an idea of what I’m looking for!
Introduction
- The introduction should introduce your research questions and the population(s) of interest. Your report should have at least one research question per team member. In the introduction, you should briefly motivate the research questions (i.e. explain why they were of interest to your group)!
- It could be helpful to identify/label your research questions as “Question 1”, “Question 2”, etc. for easier reference moving forward.
Data collection
This section should describe your data and the sampling process: how the data were collected and a summary description of the variables you collected. In this section, you should clearly explain how you incorporated random sampling. You should also identify your original target/desired sample size, along with the number of samples you ended up collecting.
Methods
For each research question, you should:
- Identify the specific variables used to address the research question. Also describe whether you had to create/mutate new variables or filter/subset the data to focus on certain observations for this research question. For each analysis, please tell us the sample size you end up working with for that question.
- Include at least one visualization or table of summary statistics that will be useful/informative for answering the research question. Briefly interpret the visualization/summary table.
  - It is okay if one visualization is complex enough to address multiple research questions!
- Describe and justify the statistical method(s) that you will use to answer your research questions. Part of the “justification” component includes conditions being met, if applicable. For linear regression, you should only verify the “L” and the “I” at this point.
- Be specific about different groups/populations, direction of differences, explanatory vs response variable, etc.
- If you are conducting a hypothesis test, don’t forget to define your hypotheses and set the significance level. If you are creating a confidence interval, be sure to specify the level of confidence.
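For the table of summary statistics mentioned above, one possible base R sketch is shown here. The data, the variable names `class_year` and `sleep_hours`, and the group sizes are invented for illustration.

```r
# Made-up data: hours of sleep for students in two class years
set.seed(2)
survey <- data.frame(
  class_year  = rep(c("first_year", "senior"), times = c(18, 22)),
  sleep_hours = round(rnorm(40, mean = 7, sd = 1), 1)
)

# Sample size, mean, and standard deviation of sleep_hours within each group
groups <- split(survey$sleep_hours, survey$class_year)
summary_tab <- data.frame(
  n    = sapply(groups, length),
  mean = sapply(groups, mean),
  sd   = sapply(groups, sd)
)
summary_tab
```

Reporting `n` for each group alongside the mean and standard deviation also documents the sample size used for that research question.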
Results
- Using appropriate code, conduct the statistical methods you described in the Methods section for each research question. Then interpret your results in context (i.e. use your interval/p-value/model to actually answer the question).
- If you do linear regression, you should check the “N” and “E” conditions here.
- Do not use R functions that will automatically run the tests for you (with the exception of lm()). Using functions like t.test() will not yield credit for this section. Demonstrate to Prof. Tang that you understand the concepts in this course!
- The goal is not to do an exhaustive data analysis (i.e., do not calculate every statistic and procedure you have learned for every variable), but rather let me know that you are proficient at asking meaningful questions and answering them with results of data analysis, that you are proficient in using R, and that you are proficient at interpreting and presenting the results.
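To illustrate what carrying out a test without an automatic function might look like, here is a minimal sketch of a hypothesis test and confidence interval for a single mean, computed step by step. The data, the null value of 7 hours, and the significance level are all invented, and the sketch assumes the conditions for a t-based procedure have already been checked; the methods appropriate for your own questions may differ.

```r
# Made-up sample: nightly hours of sleep for 30 students
set.seed(3)
sleep_hours <- rnorm(30, mean = 6.8, sd = 1.1)

# Hypotheses (illustrative): H0: mu = 7 versus HA: mu != 7, at alpha = 0.05
mu_0  <- 7
alpha <- 0.05

n     <- length(sleep_hours)
x_bar <- mean(sleep_hours)
se    <- sd(sleep_hours) / sqrt(n)   # standard error of the sample mean

# Test statistic and two-sided p-value from the t distribution with n - 1 df
t_stat  <- (x_bar - mu_0) / se
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

# Matching (1 - alpha) confidence interval for mu
t_crit <- qt(1 - alpha / 2, df = n - 1)
ci     <- c(lower = x_bar - t_crit * se, upper = x_bar + t_crit * se)
```

The interpretation in context (what `p_value` and `ci` say about the research question) still has to be written out in the report; the code only produces the numbers.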
Discussion
This section is a conclusion and discussion. This will require:
- A summary of what you have learned about your research questions along with statistical arguments supporting your conclusions.
- A critique of your own methods, along with suggestions for improving your analysis. Issues pertaining to the reliability and validity of your data/data collection and the appropriateness of the statistical analysis should also be discussed here.
- A paragraph on what your group would do differently if you were able to start over with the project or what you would do next if you were going to continue work on the project should also be included.
Presentation
Each group will present their work during the exam period. Presentation duration lengths (before questions) are:
- Groups of one: 4-5 minutes
- Groups of two: 5-7 minutes
- Groups of three: 7-9 minutes
Part of your presentation grade will be sticking close to the presentation duration. Your presentation will be followed by a few minutes for questions from the audience.
Your presentation should be accompanied by slides that summarize and showcase your project. Introduce your research questions and data/data collection, showcase visualizations, and provide some conclusions. These slides should serve as a brief visual accompaniment and will be graded for content and quality. I recommend roughly one slide per minute of your presentation.
The slides are due to Canvas one hour before our final exam/presentation period.
- You do not need to discuss all of your research questions if you don’t want to! You are not expected to have finished all your analyses by the time you present, but you should have some results (whether preliminary or finalized).
- It’s most important that your audience understands the research questions you present, how you obtained your data, and how you did/plan to answer the research questions.
- Pictures and summary tables are always preferred over sentences!
- A general recommendation is to have one slide per minute of the presentation.
- All presenters should speak roughly an equal amount of time.
- Make sure the font is large enough. Also, less is more!
Part of the presentation component is participation. You should be an engaged audience member. You will be asked to give feedback to a few groups and you are expected to ask questions.
Peer review
By Wednesday 12/03 at 12:00pm, please submit to Canvas your team’s rough draft of the Introduction and Data Collection sections as a single PDF or Word document. You will then be individually assigned another team’s draft for peer review. In class the following day (Thursday 12/04), be prepared to discuss your comments and hand in a physical copy of them; the comments should address some criteria (provided soon). This will count towards the Participation component of the course.
Misc. submission items
Finalized data set + dictionary
You will be asked to submit your finalized, clean dataset. This may take the form of a .csv file or a link to a GoogleSheet. Please give each variable an informative one-word name. Alongside the submission, please include a separate file for the data dictionary: a bulleted list or table containing each variable name, a brief description of the variable, and the variable type. Include units if appropriate. If the variable is categorical, please provide all the levels as they appear in the cleaned data.
The purpose of the data dictionary is to enable future use of your data.
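One way to get started on the dictionary is to generate a skeleton from the cleaned data and then fill in the descriptions by hand. The sketch below is only an illustration: the `survey` data frame and its variables are made up, and the R class reported in `type` is a stand-in that you would translate into the variable type (numerical or categorical) yourself.

```r
# Made-up cleaned data set with one numerical and one categorical variable
survey <- data.frame(
  sleep_hours = c(7.5, 6.0, 8.2, 5.5),
  class_year  = factor(c("first_year", "senior", "senior", "sophomore"))
)

# One row per variable: name, R class, and (for factors) the levels
data_dictionary <- data.frame(
  variable    = names(survey),
  type        = sapply(survey, function(col) class(col)[1]),
  levels      = sapply(survey, function(col) {
    if (is.factor(col)) paste(levels(col), collapse = ", ") else NA_character_
  }),
  description = "",   # to be filled in by hand, including units
  row.names   = NULL
)
data_dictionary
```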
Reflection
You will be asked to write a brief reflection about the final project. There will be a few questions/prompts pertaining to your specific contribution, learning, and experience, along with an opportunity to comment on the group dynamics and working methods. The prompts/questions will be provided on the associated Canvas assignment.
Grading
General grading structure
Grading of the project will take into account the following:
- Content - What is the quality of research and/or policy question and relevancy of data to those questions?
- Correctness - Are statistical procedures carried out and explained correctly?
- Writing and Presentation - What is the quality of the statistical presentation, writing, and explanations?
- Creativity and Critical Thought - Is the project carefully thought out? Are the limitations carefully considered? Does it appear that time and effort went into the planning and implementation of the project?
- Contribution to group - Did you contribute meaningfully to all aspects of the project?
A general breakdown of scoring for the final report and presentation is as follows:
- 90%-100%: Outstanding effort. Student understands how to apply all statistical concepts, can put the results into a cogent argument, can identify weaknesses in the argument, and can clearly communicate the results to others.
- 80%-89%: Good effort. Student understands most of the concepts, puts together an adequate argument, identifies some weaknesses of their argument, and communicates most results clearly to others.
- 70%-79%: Passing effort. Student has misunderstanding of concepts in several areas, has some trouble putting results together in a cogent argument, and communication of results is sometimes unclear.
- 60%-69%: Struggling effort. Student is making some effort, but has misunderstanding of many concepts and is unable to put together a cogent argument. Communication of results is unclear.
- Below 60%: Student is not making a sufficient effort.
| Component | Section | Points |
|---|---|---|
| Project proposal | | 15 points |
| Initial data set submission | | 1 point |
| Rough draft | Methods** | 5 points |
| Presentation | Presentation style: well-practiced, equal distribution of speaking time, within time limit | 8 points |
| | Slides: content, quality | 5 points |
| | Participation | 2 points |
| Final report | Introduction | 5 points |
| | Data collection description | 5 points |
| | Methods | 25 points |
| | Results | 20 points |
| | Discussion | 10 points |
| | Clean code + reproducibility | 3 points |
| | Submission of both PDF and .qmd report | 1 point |
| Reflection | | 3 points |
| Misc. items | Finalized data set with accompanying dictionary | 2 points |
| Contribution to group | | 10 points |
| Total | | 120 points |
\(^{**}\) Even if you include more sections, Prof. Tang will only provide feedback on the Methods section!