Python lists are handy for performing data science tasks, especially for data cleaning and manipulation. In this article, I will briefly introduce Python list comprehensions and guide you through some practical use cases.
In simple terms, Python list comprehensions let you build a new list by applying an operation, and optionally a filter, to each element of an existing list or any other iterable. In my experience, they also reduce the number of lines in your code.
Anatomy of a List Comprehension
Concept
Let’s say you have a list of integers:
num_list = [1, 2, 3, 4, 5]
Suppose you want to multiply each number in this list by 2 and create a new list. With the conventional method, we'd use a for loop and do something like this:
num_list = [1, 2, 3, 4, 5]
# initialize an empty list to store the multiplied values
multiplied_nums = []
# iterate through the initial list
for num in num_list:
    # append each multiplied value to the new list
    multiplied_nums.append(num * 2)
print(multiplied_nums)
With a list comprehension, the same result takes just one line of code:
multiplied_nums = [x * 2 for x in num_list]
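The expression on the left of a comprehension is not limited to arithmetic. As a small sketch of a typical cleaning step, the example below trims and lowercases some text labels; the raw_labels values are made up for illustration.

```python
# Made-up raw labels, as they might arrive from a messy export
raw_labels = ["  Apple", "BANANA ", " cherry  "]

# The expression can chain method calls: trim whitespace, then lowercase
clean_labels = [label.strip().lower() for label in raw_labels]
print(clean_labels)  # ['apple', 'banana', 'cherry']

# It can also call a function on each element
label_lengths = [len(label) for label in clean_labels]
print(label_lengths)  # [5, 6, 6]
```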
Now let's look at some data wrangling use cases for list comprehensions.
Data Wrangling Use Cases
1. Renaming a selected set of columns from a large dataset
Imagine you are working with a large dataset containing hundreds of columns. Suppose you want to collect all column names that start with “time” into a list called time_cols. Here’s how you can accomplish this using a Python list comprehension:
time_cols = [col for col in df.columns if col.startswith('time')]
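The example above extends the basic pattern with a filter, which gives a more complete anatomy for a list comprehension:
[expression for item in iterable if condition]
Here, both the expression and the item are col, the iterable is df.columns, and the condition is col.startswith('time'). The if clause is optional.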
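As a minimal sketch of this anatomy, the example below labels each part of a comprehension and spells out the equivalent loop. The column names are hypothetical and stand in for df.columns, so the snippet runs without a DataFrame.

```python
# Hypothetical column names standing in for df.columns
columns = ["time_start", "time_end", "user_id", "amount"]

# [ expression   for item in iterable   if condition ]
time_cols = [col for col in columns if col.startswith("time")]

# The equivalent explicit loop
time_cols_loop = []
for col in columns:                  # for item in iterable
    if col.startswith("time"):       # if condition
        time_cols_loop.append(col)   # expression

print(time_cols)  # ['time_start', 'time_end']
```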
Now, if we want to rename the columns in time_cols by prefixing them with “T_”, we can pass a dictionary comprehension, a close relative of the list comprehension, to df.rename:
df.rename(columns={col: 'T_' + col for col in time_cols}, inplace=True)
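The renaming step relies on a dictionary comprehension, which follows the same anatomy but produces key-value pairs. The sketch below uses a hypothetical list of column names in place of a real DataFrame to show the mapping that df.rename would receive.

```python
# Hypothetical column names standing in for df.columns
columns = ["time_start", "time_end", "user_id"]

# Collect the columns of interest with a list comprehension
time_cols = [col for col in columns if col.startswith("time")]

# Build the old-name -> new-name mapping with a dict comprehension
rename_map = {col: "T_" + col for col in time_cols}
print(rename_map)  # {'time_start': 'T_time_start', 'time_end': 'T_time_end'}

# Apply the mapping to every name; names without an entry stay unchanged
new_columns = [rename_map.get(col, col) for col in columns]
print(new_columns)  # ['T_time_start', 'T_time_end', 'user_id']
```

Passing rename_map as the columns argument of df.rename has the same effect on a DataFrame, since columns that are missing from the mapping are left as they are.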
2. Filtering a DataFrame based on a criterion
Assume that we have a dataset related to sales.
import pandas as pd

# Create the DataFrame
sales_data = [
    {"item": "A", "amount": 120},
    {"item": "B", "amount": 90},
    {"item": "C", "amount": 150},
    {"item": "D", "amount": 80},
]
df = pd.DataFrame(sales_data)
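To keep only the sales with an amount greater than 100, boolean indexing is the usual choice on a DataFrame, while a list comprehension does the same job on the raw records in sales_data.
# Filter the DataFrame with boolean indexing
filtered_df = df[df["amount"] > 100]
# Filter the raw records with a list comprehension
filtered_sales = [sale for sale in sales_data if sale["amount"] > 100]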
3. Performing summary calculations on a column
In the same sales dataset, suppose some records have a missing amount, stored as None, and we want the mean of the remaining values. A list comprehension can drop the missing entries before the mean is computed.
import numpy as np

# Calculate the mean amount, ignoring None values
amounts = [sale["amount"] for sale in sales_data if sale["amount"] is not None]
mean_amount = np.mean(amounts)
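Since the sample sales_data shown earlier has no missing values, here is a small self-contained sketch with gaps added. The sales_with_gaps records are hypothetical, and statistics.mean from the standard library is swapped in for np.mean so the snippet has no third-party dependencies.

```python
from statistics import mean

# Hypothetical records; the None entries stand in for missing amounts
sales_with_gaps = [
    {"item": "A", "amount": 120},
    {"item": "B", "amount": None},
    {"item": "C", "amount": 150},
    {"item": "D", "amount": None},
]

# Keep only the amounts that are present
amounts = [sale["amount"] for sale in sales_with_gaps if sale["amount"] is not None]

# Guard against an empty list, for which mean() would raise an error
mean_amount = mean(amounts) if amounts else None
print(mean_amount)  # mean of 120 and 150
```

If the data already lives in a DataFrame, df["amount"].mean() skips missing values by default, so the comprehension is mainly useful for raw Python records.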