Goals

The goals of this milestone are as follows:

Discuss topics you’re interested in investigating and find data sets on those topics.
Identify 3 data sets you’re interested in potentially using for the project.
Get these datasets into Python.
Write up reasons and justifications for why you want to work with these datasets.
Review your team contract.

Important

You must use one of the data sets in the proposal for the final project, unless instructed otherwise when given feedback.

Finding a dataset

Criteria for datasets

The data sets should meet the following criteria:

At least 500 observations
At least 8 columns
At least 6 of the columns must be useful and unique explanatory variables.
- Identifier variables such as “name”, “social security number”, etc. are not useful explanatory variables.
- If you have multiple columns with the same information (e.g. “state abbreviation” and “state name”), then they are not unique explanatory variables.
You may not use data that has previously been used in any course materials, or any derivation of data that has been used in course materials.
You can curate one of your datasets via web scraping.

Please ask a member of the teaching team if you’re unsure whether your data set meets the criteria.

If you set your hearts on a dataset that has fewer observations or variables than what’s suggested here, that might still be ok; use these numbers as guidance for a successful proposal, not as minimum requirements.

Resources for datasets

You can find data wherever you like, but here are some recommendations to get you started. You shouldn’t feel constrained to datasets that are already in a tidy format, you can start with data that needs cleaning and tidying, scrape data off the web, or collect your own data.

Components

For each data set, include the following:

Introduction and data

For each data set:

Identify the source of the data.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
Write a brief description of the observations.
Address ethical concerns about the data, if any.

Research question

Your research question should contain at least three variables, and should be a mix of categorical and quantitative variables. When writing a research question, please think about the following:

What is your target population?
Is the question original?
Can the question be answered?

For each data set, include the following:

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
Statement on why this question is important.
A description of the research topic along with a concise statement of your hypotheses on this topic.
Identify the types of variables in your research question. Categorical? Quantitative?

Glimpse of data

For each data set:

Place the file containing your data in the data folder of the project repo.
Use the .head() and .info() functions to provide a glimpse of the data set.

Grading

Total	10 pts
Introduction and data	3
Research question	3
Glimpse of data	3
Workflow and formatting	1

Each component will be graded as follows:

Meets expectations (full credit): All required elements are completed and are accurate. The narrative is written clearly, all tables and visualizations are nicely formatted, and the work would be presentable in a professional setting.
Close to expectations (half credit): There are some elements missing and/or inaccurate. There are some issues with formatting.
Does not meet expectations (no credit): Major elements missing. Work is not neatly formatted and would not be presentable in a professional setting.

It is critical to check feedback on your project proposal. Even if you earn full credit, it may not mean that your proposal is perfect.