Pandas is widely regarded as one of the most popular Python libraries for data science, and the first and most important step is understanding the syntax of the package.
The pandas package provides a number of aggregation functions that reduce the dimension of the initial dataset. They work much like SQL aggregation functions and are handy when grouping data during the feature engineering step. Here’s a quick example of how to group on multiple columns and summarise data by applying multiple aggregation functions with pandas.
Create a dataset
import pandas as pd

# Sample dataset: two cities per state, with population and the state's real-estate tax rate
data = {
    "State": ["Alabama", "Alabama",
              "Arizona", "Arizona",
              "California", "California",
              "Colorado", "Colorado",
              "Florida", "Florida"],
    "City": ["Montgomery", "Birmingham",
             "Phoenix", "Tucson",
             "Los Angeles", "Sacramento",
             "Denver", "Colorado Springs",
             "Tallahassee", "Miami"],
    "Population": [198218, 209880, 1660272, 545975, 3990456,
                   508529, 716492, 472688, 193551, 470914],
    "Real-Estate Tax": [0.42, 0.42, 0.69, 0.69, 0.76,
                        0.76, 0.53, 0.53, 0.93, 0.93],
}
df = pd.DataFrame(data)
print(df)
Output:
|   | State      | City             | Population | Real-Estate Tax |
|---|------------|------------------|------------|-----------------|
| 0 | Alabama    | Montgomery       | 198218     | 0.42            |
| 1 | Alabama    | Birmingham       | 209880     | 0.42            |
| 2 | Arizona    | Phoenix          | 1660272    | 0.69            |
| 3 | Arizona    | Tucson           | 545975     | 0.69            |
| 4 | California | Los Angeles      | 3990456    | 0.76            |
| 5 | California | Sacramento       | 508529     | 0.76            |
| 6 | Colorado   | Denver           | 716492     | 0.53            |
| 7 | Colorado   | Colorado Springs | 472688     | 0.53            |
| 8 | Florida    | Tallahassee      | 193551     | 0.93            |
| 9 | Florida    | Miami            | 470914     | 0.93            |
Grouping by specific columns with aggregation functions
To group data in pandas, use the .groupby() method.
The following code groups by 'State' and 'Real-Estate Tax'. To apply aggregation functions, pass a dictionary of {column: [functions]} pairs to the .agg() method.
Because .agg() with a dictionary returns MultiIndex columns, I’d recommend flattening them and adding a descriptive prefix to the resulting columns to avoid possible duplicates and keep your code coherent.
Don't forget to reset the index – a row MultiIndex isn't scikit-learn friendly.
grouped_df = df.groupby(['State', 'Real-Estate Tax']).agg({'Population': ['mean', 'min', 'max']})
# .agg() with a dict returns MultiIndex columns such as ('Population', 'mean');
# keep only the function names, then add a descriptive prefix
grouped_df.columns = grouped_df.columns.droplevel(0)
grouped_df = grouped_df.add_prefix('population_')
grouped_df = grouped_df.reset_index()  # turn the group keys back into regular columns
print(grouped_df)
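As a side note, pandas 0.25 and newer also support named aggregation, which yields flat, prefixed column names directly, so the droplevel and add_prefix steps are not needed. The sketch below is just an alternative to the code above; grouped_named is an illustrative variable name.

grouped_named = df.groupby(['State', 'Real-Estate Tax'], as_index=False).agg(
    population_mean=('Population', 'mean'),
    population_min=('Population', 'min'),
    population_max=('Population', 'max'),
)
print(grouped_named)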
The most common aggregation functions
mean(): Compute mean of group values
sum(): Compute sum of group values
size(): Compute group sizes (counts rows, including missing values)
count(): Compute count of non-null values in each group
std(): Compute standard deviation of group values
var(): Compute variance of group values
sem(): Compute standard error of the mean of group values
describe(): Generate descriptive statistics for each group
first(): Compute first of group values
last(): Compute last of group values
nth(): Take nth value, or a subset if n is a list
min(): Compute min of group values
max(): Compute max of group values
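To give a feel for how these work together, here is a short sketch applied to the example DataFrame above; the variable names summary, population_stats and second_city are purely illustrative.

# Mix several of the listed functions in a single .agg() call
summary = df.groupby('State').agg({
    'Population': ['count', 'sum', 'std', 'first', 'last'],
    'Real-Estate Tax': ['mean', 'max'],
})
# describe() and nth() are usually called on the grouped object directly
population_stats = df.groupby('State')['Population'].describe()
second_city = df.groupby('State').nth(1)  # the second row of each group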
In Conclusion
In light of the above, use the pandas .groupby() method, apply as many aggregation functions as possible, and drop highly correlated columns afterwards, as sketched below.
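For instance, dropping one column from every highly correlated pair can look like the sketch below; the 0.9 threshold and the variable names are arbitrary choices for illustration.

import numpy as np

features = grouped_df.select_dtypes('number')  # numeric feature columns only
corr = features.corr().abs()                   # absolute pairwise correlations
# keep the upper triangle so each pair is inspected once, then collect columns
# that correlate above the threshold with an earlier column
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced_df = grouped_df.drop(columns=to_drop)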
If you are interested in another groupby example, check this guide on custom aggregation functions for pandas.