Data wrangling

Lecture 7

Dr. Greg Chism

University of Arizona
INFO 511 - Fall 2024

Study tips for the exam

  • Go over lecture materials and application exercises
  • Review labs and feedback you’ve received so far
  • Do the exercises at the end of readings from both books
  • Do the exam review (to be posted on Friday)

Frequently asked question

Is there a limit to a DataFrame size?

No, a DataFrame can be any number of rows or columns. However, when you print it, it will only print the first few rows and the columns that fit across the screen.

If you want to see more rows and columns, you can:

  • Open it in the data viewer with df.head(n)

  • Explicitly print more rows with, e.g., print(df.head(25))

  • Explicitly select or rearrange columns

Options for a DataFrame

import pandas as pd

df = pd.read_csv('data/midwest.csv')
print(df)
      PID     county state   area  poptotal   popdensity  popwhite  popblack  \
0     561      ADAMS    IL  0.052     66090  1270.961540     63917      1702   
1     562  ALEXANDER    IL  0.014     10626   759.000000      7054      3496   
2     563       BOND    IL  0.022     14991   681.409091     14477       429   
3     564      BOONE    IL  0.017     30806  1812.117650     29344       127   
4     565      BROWN    IL  0.018      5836   324.222222      5264       547   
..    ...        ...   ...    ...       ...          ...       ...       ...   
432  3048   WAUKESHA    WI  0.034    304715  8962.205880    298313      1096   
433  3049    WAUPACA    WI  0.045     46104  1024.533330     45695        22   
434  3050   WAUSHARA    WI  0.037     19385   523.918919     19094        29   
435  3051  WINNEBAGO    WI  0.035    140320  4009.142860    136822       697   
436  3052       WOOD    WI  0.048     73605  1533.437500     72157        90   

     popamerindian  popasian  ...  percollege  percprof  poppovertyknown  \
0               98       249  ...   19.631392  4.355859            63628   
1               19        48  ...   11.243308  2.870315            10529   
2               35        16  ...   17.033819  4.488572            14235   
3               46       150  ...   17.278954  4.197800            30337   
4               14         5  ...   14.475999  3.367680             4815   
..             ...       ...  ...         ...       ...              ...   
432            672      2699  ...   35.396784  7.667090           299802   
433            125        92  ...   16.549869  3.138596            44412   
434             70        43  ...   15.064584  2.620907            19163   
435            685      1728  ...   24.995504  5.659847           133950   
436            481       722  ...   21.666382  4.583725            72685   

     percpovertyknown  percbelowpoverty  percchildbelowpovert  \
0           96.274777         13.151443             18.011717   
1           99.087145         32.244278             45.826514   
2           94.956974         12.068844             14.036061   
3           98.477569          7.209019             11.179536   
4           82.505140         13.520249             13.022889   
..                ...               ...                   ...   
432         98.387674          3.121060              3.785820   
433         96.330036          8.488697             10.071411   
434         98.854785         13.786985             20.050708   
435         95.460376          8.804031             10.592031   
436         98.750085          8.525831             11.162997   

     percadultpoverty  percelderlypoverty  inmetro  category  
0           11.009776           12.443812        0       AAR  
1           27.385647           25.228976        0       LHR  
2           10.852090           12.697410        0       AAR  
3            5.536013            6.217047        1       ALU  
4           11.143211           19.200000        0       AAR  
..                ...                 ...      ...       ...  
432          2.590061            4.085479        1       HLU  
433          6.953799           10.338641        0       AAR  
434         11.695784           11.804558        0       AAR  
435          8.660587            6.661094        1       HAU  
436          7.375656            7.882918        0       AAR  

[437 rows x 28 columns]
print(df.head(13))
    PID     county state   area  poptotal   popdensity  popwhite  popblack  \
0   561      ADAMS    IL  0.052     66090  1270.961540     63917      1702   
1   562  ALEXANDER    IL  0.014     10626   759.000000      7054      3496   
2   563       BOND    IL  0.022     14991   681.409091     14477       429   
3   564      BOONE    IL  0.017     30806  1812.117650     29344       127   
4   565      BROWN    IL  0.018      5836   324.222222      5264       547   
5   566     BUREAU    IL  0.050     35688   713.760000     35157        50   
6   567    CALHOUN    IL  0.017      5322   313.058824      5298         1   
7   568    CARROLL    IL  0.027     16805   622.407407     16519       111   
8   569       CASS    IL  0.024     13437   559.875000     13384        16   
9   570  CHAMPAIGN    IL  0.058    173025  2983.189660    146506     16559   
10  571  CHRISTIAN    IL  0.042     34418   819.476190     34176        82   
11  572      CLARK    IL  0.030     15921   530.700000     15842        10   
12  573       CLAY    IL  0.028     14460   516.428571     14403         4   

    popamerindian  popasian  ...  percollege   percprof  poppovertyknown  \
0              98       249  ...   19.631392   4.355859            63628   
1              19        48  ...   11.243308   2.870315            10529   
2              35        16  ...   17.033819   4.488572            14235   
3              46       150  ...   17.278954   4.197800            30337   
4              14         5  ...   14.475999   3.367680             4815   
5              65       195  ...   18.904624   3.275891            35107   
6               8        15  ...   11.917388   3.209601             5241   
7              30        61  ...   16.197121   3.055727            16455   
8               8        23  ...   14.107649   3.206799            13081   
9             331      8033  ...   41.295808  17.757448           154934   
10             51        89  ...   13.567226   3.089998            33788   
11             26        36  ...   15.110863   2.776225            15615   
12             17        29  ...   13.683010   2.788432            14248   

    percpovertyknown  percbelowpoverty  percchildbelowpovert  \
0          96.274777         13.151443             18.011717   
1          99.087145         32.244278             45.826514   
2          94.956974         12.068844             14.036061   
3          98.477569          7.209019             11.179536   
4          82.505140         13.520249             13.022889   
5          98.372002         10.399635             14.158819   
6          98.478016         15.149781             13.787761   
7          97.917287         11.710726             17.225462   
8          97.350599         13.875086             17.994784   
9          89.544286         15.572437             14.132234   
10         98.169562         11.708299             16.320612   
11         98.078010         12.007685             15.321547   
12         98.533887         16.774284             20.582578   

    percadultpoverty  percelderlypoverty  inmetro  category  
0          11.009776           12.443812        0       AAR  
1          27.385647           25.228976        0       LHR  
2          10.852090           12.697410        0       AAR  
3           5.536013            6.217047        1       ALU  
4          11.143211           19.200000        0       AAR  
5           8.179287           11.008586        0       AAR  
6          12.932331           21.085271        0       LAR  
7          10.027037            9.525052        0       AAR  
8          11.914343           13.660180        0       AAR  
9          17.562728            8.105017        1       HAU  
10          9.569700           11.490641        0       AAR  
11         10.131775           12.595420        0       AAR  
12         14.464114           17.670078        0       LAR  

[13 rows x 28 columns]
selected_columns = df[['county', 'state', 'percbelowpoverty', 'percollege']]
print(selected_columns)
        county state  percbelowpoverty  percollege
0        ADAMS    IL         13.151443   19.631392
1    ALEXANDER    IL         32.244278   11.243308
2         BOND    IL         12.068844   17.033819
3        BOONE    IL          7.209019   17.278954
4        BROWN    IL         13.520249   14.475999
..         ...   ...               ...         ...
432   WAUKESHA    WI          3.121060   35.396784
433    WAUPACA    WI          8.488697   16.549869
434   WAUSHARA    WI         13.786985   15.064584
435  WINNEBAGO    WI          8.804031   24.995504
436       WOOD    WI          8.525831   21.666382

[437 rows x 4 columns]
relocated_columns = df[['county', 'state', 'percbelowpoverty', 'percollege', *df.columns.difference(['county', 'state', 'percbelowpoverty', 'percollege'])]]
print(relocated_columns)
        county state  percbelowpoverty  percollege   PID   area category  \
0        ADAMS    IL         13.151443   19.631392   561  0.052      AAR   
1    ALEXANDER    IL         32.244278   11.243308   562  0.014      LHR   
2         BOND    IL         12.068844   17.033819   563  0.022      AAR   
3        BOONE    IL          7.209019   17.278954   564  0.017      ALU   
4        BROWN    IL         13.520249   14.475999   565  0.018      AAR   
..         ...   ...               ...         ...   ...    ...      ...   
432   WAUKESHA    WI          3.121060   35.396784  3048  0.034      HLU   
433    WAUPACA    WI          8.488697   16.549869  3049  0.045      AAR   
434   WAUSHARA    WI         13.786985   15.064584  3050  0.037      AAR   
435  WINNEBAGO    WI          8.804031   24.995504  3051  0.035      HAU   
436       WOOD    WI          8.525831   21.666382  3052  0.048      AAR   

     inmetro  percadultpoverty  percamerindan  ...  percwhite  popadults  \
0          0         11.009776       0.148283  ...  96.712059      43298   
1          0         27.385647       0.178807  ...  66.384340       6724   
2          0         10.852090       0.233473  ...  96.571276       9669   
3          1          5.536013       0.149322  ...  95.254171      19272   
4          0         11.143211       0.239890  ...  90.198766       3979   
..       ...               ...            ...  ...        ...        ...   
432        1          2.590061       0.220534  ...  97.899020     195837   
433        0          6.953799       0.271126  ...  99.112875      30109   
434        0         11.695784       0.361104  ...  98.498839      13316   
435        1          8.660587       0.488170  ...  97.507127      88960   
436        0          7.375656       0.653488  ...  98.032742      46796   

     popamerindian  popasian  popblack   popdensity  popother  \
0               98       249      1702  1270.961540       124   
1               19        48      3496   759.000000         9   
2               35        16       429   681.409091        34   
3               46       150       127  1812.117650      1139   
4               14         5       547   324.222222         6   
..             ...       ...       ...          ...       ...   
432            672      2699      1096  8962.205880      1935   
433            125        92        22  1024.533330       170   
434             70        43        29   523.918919       149   
435            685      1728       697  4009.142860       388   
436            481       722        90  1533.437500       155   

     poppovertyknown  poptotal  popwhite  
0              63628     66090     63917  
1              10529     10626      7054  
2              14235     14991     14477  
3              30337     30806     29344  
4               4815      5836      5264  
..               ...       ...       ...  
432           299802    304715    298313  
433            44412     46104     45695  
434            19163     19385     19094  
435           133950    140320    136822  
436            72685     73605     72157  

[437 rows x 28 columns]

Data wrangling

Data wrangling
AKA Munging 🤮

Joining datasets

Why join?

Suppose we want to answer questions like:

Is there a relationship between
- number of DS courses taken
- motivation for taking course
- …
and performance in this course?”

Each of these would require joining class performance data with an outside data source so we can have all relevant information (columns) in a single data frame.

Setup

For the next few slides…

x = pd.DataFrame({
    'id': [1, 2, 3],
    'value_x': ['x1', 'x2', 'x3']
})
print(x)
   id value_x
0   1      x1
1   2      x2
2   3      x3
y = pd.DataFrame({
    'id': [1, 2, 4],
    'value_y': ['y1', 'y2', 'y4']
})

print(y)
   id value_y
0   1      y1
1   2      y2
2   4      y4

Left join

left_merged = pd.merge(x, y, on='id', how='left')
print(left_merged)
   id value_x value_y
0   1      x1      y1
1   2      x2      y2
2   3      x3     NaN

Right join

right_merged = pd.merge(x, y, on='id', how='right')
print(right_merged)
   id value_x value_y
0   1      x1      y1
1   2      x2      y2
2   4     NaN      y4

Outer (full) join

outer_merged = pd.merge(x, y, on='id', how='outer')
print(outer_merged)
   id value_x value_y
0   1      x1      y1
1   2      x2      y2
2   3      x3     NaN
3   4     NaN      y4

Inner join

inner_merged = pd.merge(x, y, on='id', how='inner')
print(inner_merged)
   id value_x value_y
0   1      x1      y1
1   2      x2      y2

Semi-join

semi_merged = x[x['id'].isin(y['id'])]
print(semi_merged)
   id value_x
0   1      x1
1   2      x2

Anti-join

anti_merged = x[~x['id'].isin(y['id'])]
print(anti_merged)
   id value_x
2   3      x3

🤮

Pivoting (.melt())

  • Data sets can’t be labeled as wide or long, but they can be made wider or longer for a certain analysis that requires a certain format.

  • When pivoting longer, variable names that turn into values are characters by default. If you need them to be in another format, you need to explicitly make that transformation, which you can do within the melt() function.

ae-05-majors-wrangling