The many ways to rename columns in polars

Dataframes that have column names with spaces in them can be annoying to work with. I like to remove any spaces or unusual characters from column names. This makes it easier to perform operations on the dataframe. The main advantage you get with columns that don’t have spaces is that you can access them using the dot method. Let me show you what I mean.

import polars as pl
pl.Config(tbl_rows=6) #this displays only 6 rows of dataframe
from pathlib import Path

df = pl.read_parquet(f"{Path('../../../')}/datasets/regional_sales.parquet")
df

shape: (20, 5)

Customer ID	Customer Name	Region Name	Segment	Sales In Dollars
i64	str	str	str	i32
740150	"Barton LLC"	"US"	"Gold"	215000
714466	"Trantow-Barrows"	"EMEA"	"Silver"	430000
218895	"Kulas Inc"	"EMEA"	"Platinum"	410000
…	…	…	…	…
257198	"Cronin, Oberbrunner and Spence…	"US"	"Platinum"	425000
604255	"Halvorson, Crona and Champlin"	"US"	null	430000
163416	"Purdy-Kunde"	"APAC"	"Silver"	410000

The only column without a space in the dataframe above is Segment. I can access this column using the dot method as shown below.

df.select(pl.col.Segment)

shape: (20, 1)

Segment
str
"Gold"
"Silver"
"Platinum"
…
"Platinum"
null
"Silver"

However, I cannot use the dot method to access any of the columns with spaces, for instance, Customer ID. To access it I need to enclose the column name in square quotation marks as shown below.

df.select(pl.col('Customer ID'))

shape: (20, 1)

Customer ID
i64
740150
714466
218895
…
257198
604255
163416

Removing spaces in column names is the first data-cleaning step I take when I’m doing that analysis. I’ll show you three ways to clean up the column names in our dataframe.

Method 1

This is the manual way because it involves copying every column name with a space and giving it a name without a space. I don’t advise you to use this method if you have a dataset with many columns.

(df
 .rename({'Customer ID':'Customer_ID',
          'Customer Name':'Customer_Name',
          'Region Name':'Region_Name',
          'Sales In Dollars':'Sales_In_Dollars'})
)

shape: (20, 5)

Customer_ID	Customer_Name	Region_Name	Segment	Sales_In_Dollars
i64	str	str	str	i32
740150	"Barton LLC"	"US"	"Gold"	215000
714466	"Trantow-Barrows"	"EMEA"	"Silver"	430000
218895	"Kulas Inc"	"EMEA"	"Platinum"	410000
…	…	…	…	…
257198	"Cronin, Oberbrunner and Spence…	"US"	"Platinum"	425000
604255	"Halvorson, Crona and Champlin"	"US"	null	430000
163416	"Purdy-Kunde"	"APAC"	"Silver"	410000

Method 2

This may be considered the idiomatic polars way, although it uses many aspects of the library. As such, it may be difficult to remember.

(df
 .select(pl.all().name.map(lambda col: col.replace(' ', '_')))
 )

shape: (20, 5)

Customer_ID	Customer_Name	Region_Name	Segment	Sales_In_Dollars
i64	str	str	str	i32
740150	"Barton LLC"	"US"	"Gold"	215000
714466	"Trantow-Barrows"	"EMEA"	"Silver"	430000
218895	"Kulas Inc"	"EMEA"	"Platinum"	410000
…	…	…	…	…
257198	"Cronin, Oberbrunner and Spence…	"US"	"Platinum"	425000
604255	"Halvorson, Crona and Champlin"	"US"	null	430000
163416	"Purdy-Kunde"	"APAC"	"Silver"	410000

Method 3

This is my personal favorite. It’s short and easy to remember.

(df
 .rename(lambda col: col.replace(' ', '_'))
 )

shape: (20, 5)

Customer_ID	Customer_Name	Region_Name	Segment	Sales_In_Dollars
i64	str	str	str	i32
740150	"Barton LLC"	"US"	"Gold"	215000
714466	"Trantow-Barrows"	"EMEA"	"Silver"	430000
218895	"Kulas Inc"	"EMEA"	"Platinum"	410000
…	…	…	…	…
257198	"Cronin, Oberbrunner and Spence…	"US"	"Platinum"	425000
604255	"Halvorson, Crona and Champlin"	"US"	null	430000
163416	"Purdy-Kunde"	"APAC"	"Silver"	410000

Check out Polars course to learn more about this new and fast data processing library.