Go over lecture materials and application exercises
Review labs and feedback you’ve received so far
Do the exercises at the end of readings from both books
Do the exam review (to be posted on Friday)
Frequently asked question
Is there a limit to a DataFrame size?
No, a DataFrame can have any number of rows and columns, limited only by available memory. However, when you print a large one, pandas shows only the first and last few rows and the columns that fit across the screen.
If you want to see more rows and columns, you can:
Display the first n rows as a table with df.head(n) (in a notebook, as the last expression in a cell)
Explicitly print more rows with, e.g., print(df.head(25))
        county  state  percbelowpoverty  percollege
0        ADAMS     IL         13.151443   19.631392
1    ALEXANDER     IL         32.244278   11.243308
2         BOND     IL         12.068844   17.033819
3        BOONE     IL          7.209019   17.278954
4        BROWN     IL         13.520249   14.475999
..         ...    ...               ...         ...
432   WAUKESHA     WI          3.121060   35.396784
433    WAUPACA     WI          8.488697   16.549869
434   WAUSHARA     WI         13.786985   15.064584
435  WINNEBAGO     WI          8.804031   24.995504
436       WOOD     WI          8.525831   21.666382

[437 rows x 4 columns]
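Another way to see more is to change pandas' display options. The following is a minimal sketch, assuming a small stand-in DataFrame; the name `df` and its contents are made up for illustration and are not the data printed above.

```python
import pandas as pd

# Hypothetical stand-in data: 100 rows, 3 columns.
df = pd.DataFrame({
    "a": range(100),
    "b": range(100, 200),
    "c": range(200, 300),
})

# Default printing truncates long frames to the first and last few rows.
default_text = repr(df)

# Temporarily lift the row and column limits inside this block only.
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    full_text = repr(df)
    print(df)

# to_string() renders every row and column regardless of display options.
all_rows = df.to_string()
```

Using `pd.option_context` keeps the change local, so printing elsewhere still uses the compact default.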
Is there a relationship between
- number of DS courses taken
- motivation for taking course
- …
and performance in this course?
Each of these would require joining class performance data with an outside data source so we can have all relevant information (columns) in a single data frame.
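As a preview of what such a join could look like, here is a minimal sketch. The data frames, column names, and values (`performance`, `survey`, `student_id`, `exam_score`, `ds_courses_taken`) are all hypothetical and stand in for the real class data and outside source.

```python
import pandas as pd

# Hypothetical class performance data (illustrative values only).
performance = pd.DataFrame({
    "student_id": [101, 102, 103],
    "exam_score": [88, 92, 75],
})

# Hypothetical survey data from an outside source; student 103 did not respond.
survey = pd.DataFrame({
    "student_id": [101, 102, 104],
    "ds_courses_taken": [0, 2, 1],
})

# Keep every student in the performance data and attach survey columns
# where a matching student_id exists.
combined = pd.merge(performance, survey, on="student_id", how="left")
print(combined)
```

Students without a survey response keep their row, with a missing value in the survey column.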
Setup
For the next few slides…
import pandas as pd
x = pd.DataFrame({'id': [1, 2, 3], 'value_x': ['x1', 'x2', 'x3']})
print(x)
   id value_x
0   1      x1
1   2      x2
2   3      x3
y = pd.DataFrame({'id': [1, 2, 4], 'value_y': ['y1', 'y2', 'y4']})
print(y)
   id value_y
0   1      y1
1   2      y2
2   4      y4
Left join
left_merged = pd.merge(x, y, on='id', how='left')
print(left_merged)
   id value_x value_y
0   1      x1      y1
1   2      x2      y2
2   3      x3     NaN
Right join
right_merged = pd.merge(x, y, on='id', how='right')
print(right_merged)
   id value_x value_y
0   1      x1      y1
1   2      x2      y2
2   4     NaN      y4
Outer (full) join
outer_merged = pd.merge(x, y, on='id', how='outer')
print(outer_merged)
   id value_x value_y
0   1      x1      y1
1   2      x2      y2
2   3      x3     NaN
3   4     NaN      y4
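To check which table each row of an outer join came from, `pd.merge` accepts `indicator=True`. The sketch below reuses the `x` and `y` frames from the setup; the names `outer_flagged` and `unmatched` are introduced here for illustration.

```python
import pandas as pd

x = pd.DataFrame({'id': [1, 2, 3], 'value_x': ['x1', 'x2', 'x3']})
y = pd.DataFrame({'id': [1, 2, 4], 'value_y': ['y1', 'y2', 'y4']})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'.
outer_flagged = pd.merge(x, y, on='id', how='outer', indicator=True)
print(outer_flagged)

# Rows that found no match in the other table.
unmatched = outer_flagged[outer_flagged['_merge'] != 'both']
print(unmatched)
```

This can be a quick way to spot ids that are missing from one side before deciding which join to use.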
Inner join
inner_merged = pd.merge(x, y, on='id', how='inner')
print(inner_merged)
   id value_x value_y
0   1      x1      y1
1   2      x2      y2
Data sets can’t be labeled as wide or long, but they can be made wider or longer for a certain analysis that requires a certain format.
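A minimal sketch of making the same data longer and then wider again, assuming a small made-up table of exam scores (the names `wide`, `longer`, `wide_again`, and all values are hypothetical):

```python
import pandas as pd

# Hypothetical wide data: one row per student, one column per exam.
wide = pd.DataFrame({
    "student": ["A", "B"],
    "exam_1": [80, 90],
    "exam_2": [85, 70],
})

# Make it longer: one row per student-exam pair.
longer = wide.melt(id_vars="student", var_name="exam", value_name="score")
print(longer)

# Make it wider again: one column per exam.
wide_again = longer.pivot(index="student", columns="exam", values="score").reset_index()
print(wide_again)
```

Neither shape is the "right" one; the longer form suits grouped summaries and many plotting tools, while the wider form is easier to read side by side.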
When pivoting longer, variable names that turn into values are stored as strings by default. If you need them in another type, such as integers or dates, you need to make that transformation explicitly. melt() has no argument for this, so convert the resulting column afterward, for example with astype().
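One way to make that transformation explicit is to convert the new column right after melting. This is a minimal sketch, assuming made-up data whose column names are years stored as strings; `wide`, `longer`, `region`, `year`, and `rate` are hypothetical names.

```python
import pandas as pd

# Hypothetical wide data with years as (string) column names.
wide = pd.DataFrame({
    "region": ["A", "B"],
    "2019": [10.5, 12.1],
    "2020": [11.0, 11.8],
})

longer = wide.melt(id_vars="region", var_name="year", value_name="rate")
print(longer["year"].dtype)  # not numeric yet: the values are still strings

# Convert the former column names to integers as a separate, explicit step.
longer["year"] = longer["year"].astype(int)
print(longer.dtypes)
```

After the conversion, `year` can be sorted, filtered, and plotted as a number rather than as text.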