Pandas is widely regarded as one of the most popular Python libraries for data science, and the first and most important step is understanding the syntax of the package.
The pandas package provides a number of aggregation functions that reduce the dimension of the initial dataset. They work much like SQL aggregation functions and are handy when grouping data during the feature engineering step. Here’s a quick example of how to group on multiple columns and summarise data by applying multiple aggregation functions with pandas.
Create a dataset
import pandas as pd

# Sample dataset: two cities per state, with population and the state's real-estate tax rate
data = {
    "State": ["Alabama", "Alabama",
              "Arizona", "Arizona",
              "California", "California",
              "Colorado", "Colorado",
              "Florida", "Florida"],
    "City": ["Montgomery", "Birmingham",
             "Phoenix", "Tucson",
             "Los Angeles", "Sacramento",
             "Denver", "Colorado Springs",
             "Tallahassee", "Miami"],
    "Population": [198218, 209880, 1660272, 545975, 3990456,
                   508529, 716492, 472688, 193551, 470914],
    "Real-Estate Tax": [0.42, 0.42, 0.69, 0.69, 0.76,
                        0.76, 0.53, 0.53, 0.93, 0.93],
}
df = pd.DataFrame(data)
print(df)
Output:
|   | State      | City             | Population | Real-Estate Tax |
|---|------------|------------------|------------|-----------------|
| 0 | Alabama    | Montgomery       | 198218     | 0.42            |
| 1 | Alabama    | Birmingham       | 209880     | 0.42            |
| 2 | Arizona    | Phoenix          | 1660272    | 0.69            |
| 3 | Arizona    | Tucson           | 545975     | 0.69            |
| 4 | California | Los Angeles      | 3990456    | 0.76            |
| 5 | California | Sacramento       | 508529     | 0.76            |
| 6 | Colorado   | Denver           | 716492     | 0.53            |
| 7 | Colorado   | Colorado Springs | 472688     | 0.53            |
| 8 | Florida    | Tallahassee      | 193551     | 0.93            |
| 9 | Florida    | Miami            | 470914     | 0.93            |
Grouping by specific columns with aggregation functions
To group data in pandas, use the .groupby() method.
The following code groups by 'State' and 'Real-Estate Tax'. To apply aggregation functions, pass a dictionary of {column: [functions]} pairs to the .agg() method.
Because .agg() with a dictionary returns MultiIndex columns, I’d recommend flattening them and adding a descriptive prefix to the resulting columns to avoid possible duplicates and keep your code coherent.
Don't forget to reset the index – a row MultiIndex isn't scikit-learn friendly.
grouped_df = df.groupby(['State', 'Real-Estate Tax']).agg({'Population': ['mean', 'min', 'max']})
# .agg() with a dict returns MultiIndex columns such as ('Population', 'mean');
# keep only the function names, then add a descriptive prefix
grouped_df.columns = grouped_df.columns.droplevel(0)
grouped_df = grouped_df.add_prefix('population_')
grouped_df = grouped_df.reset_index()  # turn the group keys back into regular columns
print(grouped_df)
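As a side note, pandas 0.25 and newer also support named aggregation, which yields flat, prefixed column names directly, so the droplevel and add_prefix steps are not needed. The sketch below is just an alternative to the code above; grouped_named is an illustrative variable name.

grouped_named = df.groupby(['State', 'Real-Estate Tax'], as_index=False).agg(
    population_mean=('Population', 'mean'),
    population_min=('Population', 'min'),
    population_max=('Population', 'max'),
)
print(grouped_named)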
The most common aggregation functions
mean(): Compute mean of group values
sum(): Compute sum of group values
size(): Compute group sizes (counts rows, including missing values)
count(): Compute count of non-null values in each group
std(): Compute standard deviation of group values
var(): Compute variance of group values
sem(): Compute standard error of the mean of group values
describe(): Generate descriptive statistics for each group
first(): Compute first of group values
last(): Compute last of group values
nth(): Take nth value, or a subset if n is a list
min(): Compute min of group values
max(): Compute max of group values
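To give a feel for how these work together, here is a short sketch applied to the example DataFrame above; the variable names summary, population_stats and second_city are purely illustrative.

# Mix several of the listed functions in a single .agg() call
summary = df.groupby('State').agg({
    'Population': ['count', 'sum', 'std', 'first', 'last'],
    'Real-Estate Tax': ['mean', 'max'],
})
# describe() and nth() are usually called on the grouped object directly
population_stats = df.groupby('State')['Population'].describe()
second_city = df.groupby('State').nth(1)  # the second row of each group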
In Conclusion
In light of the above, use the pandas .groupby() method, apply as many aggregation functions as possible, and drop highly correlated columns afterwards, as sketched below.
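For instance, dropping one column from every highly correlated pair can look like the sketch below; the 0.9 threshold and the variable names are arbitrary choices for illustration.

import numpy as np

features = grouped_df.select_dtypes('number')  # numeric feature columns only
corr = features.corr().abs()                   # absolute pairwise correlations
# keep the upper triangle so each pair is inspected once, then collect columns
# that correlate above the threshold with an earlier column
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced_df = grouped_df.drop(columns=to_drop)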
If you are interested in another groupby example, check this guide on custom aggregation functions for pandas.