Pandas is widely regarded as one of the most popular Python libraries for data science. The first and most important step is understanding the syntax of the package.

The pandas package provides a number of aggregating functions that reduce the dimension of the initial dataset. It ships with a set of SQL-like aggregation functions you can apply when grouping data during the feature engineering step. Here’s a quick example of how to group on multiple columns and summarise data by applying multiple aggregation functions with pandas.

Create a dataset

import pandas as pd

data = {
    "State": ["Alabama", "Alabama", 
              "Arizona", "Arizona", 
              "California", "California", 
              "Colorado", "Colorado", 
              "Florida", "Florida"],
    "City": ["Montgomery", "Birmingham", 
             "Phoenix", "Tucson", 
             "Los Angeles", "Sacramento", 
             "Denver", "Colorado Springs",
             "Tallahassee", "Miami"],
    "Population": [198218, 209880, 1660272, 545975, 3990456, 
                   508529, 716492, 472688, 193551, 470914],
    "Real-Estate Tax": [0.42, 0.42, 0.69, 0.69, 0.76, 
                        0.76, 0.53, 0.53, 0.93, 0.93]}
df = pd.DataFrame(data)


        State              City  Population  Real-Estate Tax
0     Alabama        Montgomery      198218             0.42
1     Alabama        Birmingham      209880             0.42
2     Arizona           Phoenix     1660272             0.69
3     Arizona            Tucson      545975             0.69
4  California       Los Angeles     3990456             0.76
5  California        Sacramento      508529             0.76
6    Colorado            Denver      716492             0.53
7    Colorado  Colorado Springs      472688             0.53
8     Florida       Tallahassee      193551             0.93
9     Florida             Miami      470914             0.93

Grouping by specific columns with aggregation functions

To group data in pandas, use the .groupby() method.

The following code groups by 'State' and 'Real-Estate Tax'. To apply aggregation functions, pass a dictionary of column: function pairs to the .agg() method.

I’d recommend adding a descriptive prefix to the resulting columns to avoid possible duplicates and make your code more coherent.

Don't forget to reset the index – MultiIndex notation isn't sklearn friendly.

grouped_df = df.groupby(['State', 'Real-Estate Tax']).agg({'Population': ['mean', 'min', 'max']})
# .agg() with a dict of lists produces MultiIndex columns such as
# ('Population', 'mean'); flatten them into prefixed names
grouped_df.columns = ['population_' + stat for _, stat in grouped_df.columns]
grouped_df = grouped_df.reset_index()
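As a side note – assuming pandas 0.25 or newer – the same result can be expressed with named aggregation, where each keyword becomes a flat output column, so no prefixing or column flattening is needed afterwards. A minimal sketch on a trimmed-down version of the dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Arizona", "Arizona"],
    "Population": [198218, 209880, 1660272, 545975],
    "Real-Estate Tax": [0.42, 0.42, 0.69, 0.69],
})

# Named aggregation: keyword name -> (column, function);
# as_index=False keeps the grouping keys as regular columns
grouped_df = df.groupby(["State", "Real-Estate Tax"], as_index=False).agg(
    population_mean=("Population", "mean"),
    population_min=("Population", "min"),
    population_max=("Population", "max"),
)
```

The trade-off is verbosity: you spell out one keyword per output column, but you get clean, unambiguous names with no extra renaming step.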

The full list of aggregation functions

mean(): Compute mean of groups
sum(): Compute sum of group values
size(): Compute group sizes (rows per group, including NaN values)
count(): Compute count of non-NA values in each group
std(): Standard deviation of groups
var(): Compute variance of groups
sem(): Standard error of the mean of groups
describe(): Generates descriptive statistics
first(): Compute first of group values
last(): Compute last of group values
nth() : Take nth value, or a subset if n is a list
min(): Compute min of group values
max(): Compute max of group values
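The difference between size() and count() is worth illustrating: size() counts all rows in a group, while count() skips missing values. A small sketch with a toy column containing a NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Arizona"],
    "Population": [198218, np.nan, 1660272],
})

g = df.groupby("State")["Population"]
sizes = g.size()    # rows per group, NaN included
counts = g.count()  # non-NA values per group
```

Here Alabama has a size of 2 but a count of 1, because one of its Population values is missing.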

In Conclusion

In light of the above, use the pandas .groupby() method and apply as many aggregation functions as are useful. You can easily drop highly correlated columns afterwards.

If you are interested in another example of group by, check this guide on custom aggregation functions for pandas.
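To give a taste of what custom aggregation looks like, .agg() also accepts plain Python functions alongside the built-in names. Here is a minimal sketch with a hypothetical population_range function (max minus min per group) applied to a trimmed-down version of the dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Arizona", "Arizona"],
    "Population": [198218, 209880, 1660272, 545975],
})

# A custom aggregation receives each group's values as a Series
# and must return a single scalar
def population_range(s):
    return s.max() - s.min()

result = df.groupby("State").agg(population_range=("Population", population_range))
```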