Robert Johns | 19 Mar, 2024
Fact checked by Jim Markus

10 Vital Python Concepts for Data Science

Let's talk about Python concepts used for data science. This is a valuable and growing field in 2024, and there are many things you'll need to know if you want to use this programming language to analyze data.

Below, I'll share 10 Python concepts I wish I knew earlier in my data science career. I included detailed explanations for each, including code examples. This will help introduce and reinforce Python concepts that you'll use again and again.

1. Boolean Indexing & Multi-Indexing

When it comes to data science and Python, Pandas is the name of the game! And one of the things that sets Pandas apart is its powerful indexing capabilities.

Sure, basic slicing is intuitive for Pandas users, but there’s much more you can do with advanced indexing methods, like boolean indexing and multi-indexing.

What is boolean indexing, though? Well, this is an elegant way to filter data based on criteria.

So rather than explicitly specifying index or column values, you pass a condition, and Pandas returns rows and columns that meet it.

Cool, but what is multi-indexing? Sometimes known as hierarchical indexing, this is especially useful for working with higher-dimensional data.

This lets you work with data in a tabular format (which is 2D by nature) while preserving the dataset’s multi-dimensional nature.

I bet you’re already itching to add these ideas to your Python projects!

The real benefit of these methods is the flexibility they bring to data extraction and manipulation. After all, this is one of the major activities of data science!

'''
Hackr.io: 10 Python Concepts I Wish I Knew Earlier
Advanced Indexing & Slicing: General Syntax
'''
# Boolean Indexing
df[boolean_condition]

# Multi-Indexing (setting)
df.set_index(['level_1', 'level_2'])

Let’s dive into an example to see these concepts in action. 

Consider a dataset of students with individual scores in multiple subjects. Now, let’s say you want to extract the records of students who scored more than 90 in Mathematics.

Importantly, you want a hierarchical view based on Class, then Student names.

No problem, just use boolean indexing to find the students, then multi-indexing to set the indexing hierarchy, as shown below.

What I really like about this approach is that it not only streamlines data extraction, but it also helps me to organize data in a structured and intuitive manner. Win-win!

Once you get the hang of advanced indexing, you'll find data extraction and manipulation much quicker and more efficient.

'''
Hackr.io: 10 Python Concepts I Wish I Knew Earlier
Hackr.io: Advanced Indexing & Slicing - Example
'''
import pandas as pd

# Sample dataset
data = {
  'Class': ['10th', '10th', '10th', '11th', '11th'],
  'Student': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
  'Mathematics': [85, 93, 87, 90, 95],
  'Physics': [91, 88, 79, 94, 88]
}

df = pd.DataFrame(data)

# Boolean Indexing: Extract records where Mathematics score > 90
high_scorers = df[df['Mathematics'] > 90]

# Multi-Indexing: Setting a hierarchical index on Class and then Student
df_multi_index = df.set_index(['Class', 'Student'])
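
So how do you actually use the hierarchy once it's set? Here's a quick sketch that builds on the example above, combining both techniques and then pulling records out with .loc (the labels are just the ones from our sample data):

'''
Hackr.io: 10 Python Concepts I Wish I Knew Earlier
Advanced Indexing & Slicing - Using the Hierarchy
'''
# Combine both ideas: filter first, then index hierarchically
high_scorers_multi = high_scorers.set_index(['Class', 'Student'])

# Look up a single student with a tuple of index labels
print(df_multi_index.loc[('10th', 'Bob')])

# Or slice out an entire class in one go
print(df_multi_index.loc['10th'])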

2. Regular Expressions

Ask any data scientist, and they'll probably have a tale about wrestling with messy or unstructured data.

This is where the magical power of those cryptic-looking regular expressions comes into play!

Regex is an invaluable tool for text processing, as we can use it to find, extract, and even replace patterns in strings.

And yes, I know that learning regular expressions can seem daunting at first, given the cryptic-looking patterns that they use.

But trust me, when you understand the basic building blocks and rules, it becomes an extremely powerful tool in your toolkit. It’s almost like you’ve learned to read The Matrix!

That said, it always helps to have a regex cheat sheet handy if you can’t quite remember how to formulate an expression.

When it comes to Python, the re module provides the interface you need to harness regular expressions. 

You can match and manipulate string data in diverse and complex ways by defining specific patterns.

'''
Hackr.io: 10 Python Concepts I Wish I Knew Earlier
Regular Expressions: General Syntax
'''
import re

# Basic match
re.match(pattern, string)

# Search throughout a string
re.search(pattern, string)

# Find all matches
re.findall(pattern, string)

# Replace patterns
re.sub(pattern, replacement, string)

As a practical example, consider a scenario where you need to extract email addresses from text. Regular expressions to the rescue!

These provide a straightforward approach to capturing these patterns, as shown below.

'''
Hackr.io: 10 Python Concepts I Wish I Knew Earlier
Regular Expressions Example
'''
import re

text = "Contact Alice at alice@example.com and Bob at bob@example.org for more details."
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
emails = re.findall(email_pattern, text)
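
And since re.sub() appeared in the general syntax above, here's a quick sketch showing the same pattern used to mask the addresses instead of extracting them:

'''
Hackr.io: 10 Python Concepts I Wish I Knew Earlier
Regular Expressions - re.sub() Sketch
'''
# Replace each matched address with a placeholder
masked_text = re.sub(email_pattern, '[email hidden]', text)
print(masked_text)
# Contact Alice at [email hidden] and Bob at [email hidden] for more details.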

3. String Methods

Whether you're working with text data, filenames, or data cleaning tasks, string processing is ubiquitous in data science.

In fact, if you’ve taken a Python course, you probably found yourself working with strings a lot!

Thankfully, Python strings come with a host of built-in methods that make these tasks significantly simpler. 

So whether you want to change case, check prefixes/suffixes, split, join, and more, there’s a built-in method that does just that. Awesome!

Generally speaking, string methods are straightforward, but their real power shines when you learn how and when to combine them effectively.

And, because Python's string methods are part of the string object, you can easily chain them together, resulting in concise and readable code. Pythonic indeed!

'''
Hackr.io: 10 Python Concepts I Wish I Knew Earlier
String Methods: Commonly Used Methods
'''
# Change case
string.upper()
string.lower()
string.capitalize()

# Check conditions
string.startswith(prefix)
string.endswith(suffix)

# Splitting and joining
string.split(delimiter)
delimiter.join(list_of_strings)

Let’s dive into an example to show the efficacy of these methods, focusing on a common use case where we need to process user input to ensure it's in a standard format.

So, imagine that you want to capture the names of people, ensuring they start with a capital letter, regardless of how the user enters them.

Let’s use string methods to take care of it!

You’ll see that we’ve combined the lower() and capitalize() methods within a list comprehension to process the list of names quickly and Pythonically.

Of course, this is a simple example, but you get the picture!

'''
Hackr.io: 10 Python Concepts I Wish I Knew Earlier
String Methods Example
'''
# User input
raw_names = ["ALICE", "bOB", "Charlie", "DANIEL"]

# Process names to have the first letter capitalized
processed_names = [name.lower().capitalize() for name in raw_names]
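
And because the general syntax above also covered splitting and joining, here's a small sketch (with made-up input) that chains split(), strip(), lower(), and join() to tidy a comma-separated string:

'''
Hackr.io: 10 Python Concepts I Wish I Knew Earlier
String Methods - Split/Join Sketch
'''
# Made-up raw input with inconsistent spacing and case
raw_line = " ALICE , bOB,Charlie "

# Split on commas, clean each part, then join back together
cleaned_line = ", ".join(part.strip().lower().capitalize() for part in raw_line.split(","))
print(cleaned_line)  # Alice, Bob, Charlie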

4. Lambda Functions

Python lambda functions are one of those techniques that you need to have in your toolkit when it comes to data science!

The TL;DR is that they provide a quick and concise way to declare small functions on the fly. Yep, no need for the def keyword or a function name here!

And, when you pair these with functions like map() and filter(), lambda functions really shine for data science. Pick up any good Python book, and you’ll see this in action!

If you’re not quite sure why, no problem! Let’s take a quick detour. 

With map(), you can apply a function to all items in an input sequence (like a list or tuple).

The filter() function also operates on sequences, but it constructs an iterator from the input sequence elements that return True for a given function. 

The TL;DR: it filters elements based on a function that returns True or False.

Put both of those tidbits in your back pocket as you never know when they might come in handy for a Python interview!

That said, the best way to show the power of lambda functions with map() and filter() is with a practical example.

So, let’s look at a simple scenario where we want to double the numbers in a list before filtering out those that are not divisible by 3.

Sure, we could do this with list comprehensions or traditional for-loops, but combining lambda functions with map() and filter() offers a neat and Pythonic alternative.

I think you’ll agree that the beauty of this approach lies in its brevity. 

It is worth noting that while lambda functions are powerful, they're really best for short and simple operations. 

For complex operations, stick to traditional functions.

'''
Hackr.io: 10 Python Concepts I Wish I Knew Earlier
Lambda with map() and filter() Example
'''
# Original list of numbers
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Double each number using map() and lambda
doubled_numbers = list(map(lambda x: x*2, numbers))

# Filter numbers not divisible by 3 using filter() and lambda
filtered_numbers = list(filter(lambda x: x % 3 == 0, doubled_numbers))
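
For comparison, here's the list comprehension version I mentioned: a one-liner that produces the exact same result, so which style you prefer is largely a matter of taste.

'''
Hackr.io: 10 Python Concepts I Wish I Knew Earlier
List Comprehension Equivalent
'''
# Double each number, keeping only results divisible by 3
filtered_numbers_lc = [x * 2 for x in numbers if (x * 2) % 3 == 0]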

5. Pandas Method Chaining

If you’re using Python for data science, you’re using Pandas! Take any data science course, and it will include Pandas!

And without a doubt, one of the best things about Pandas is the huge range of methods to process data. 

When it comes to using Pandas methods, two common styles include method chaining and employing intermediate DataFrames.

Each approach has pros and cons, and understanding them can be crucial for code readability and efficiency.

But what is method chaining? Simple, really: it’s just calling multiple methods sequentially in a single statement.

This eliminates the need for temporary variables, which is always nice!

The net result can be concise code, but make sure you don’t compromise readability by overusing chained method calls.

By all means, feel free to continue using intermediate DataFrames, as they can be helpful for storing the results of each step into separate variables, not to mention debugging.

But when possible, it can be cleaner to chain Pandas methods. Let’s take a look at a practical example by firing up our Python IDE.

Suppose we want to read a CSV file, rename a column, and then compute the mean of that column. We have two ways to do this: with chained methods or with intermediate DataFrames.

As you can see, both approaches achieve the same outcome, but I think the chained method approach feels more Pythonic when it doesn’t sacrifice readability.

'''
Hackr.io: 10 Python Concepts I Wish I Knew Earlier
Pandas Method Chaining Example
'''
import pandas as pd

# Using Method Chaining
mean_value = (pd.read_csv('data.csv')
              .rename(columns={'column2': 'new_column'})
              ['new_column']
              .mean())

# Using Intermediate DataFrames
df = pd.read_csv('data.csv')
renamed_df = df.rename(columns={'column2': 'new_column'})
mean_value = renamed_df['new_column'].mean()
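
And to show how chaining scales, here's a slightly longer, hypothetical chain (the 'data.csv' file and its columns are assumed, and query()/assign() are just two more chain-friendly Pandas methods):

'''
Hackr.io: 10 Python Concepts I Wish I Knew Earlier
Longer Method Chain Sketch
'''
# Hypothetical: read, rename, filter, derive a column, then aggregate
summary = (pd.read_csv('data.csv')
           .rename(columns={'column2': 'new_column'})
           .query('new_column > 0')
           .assign(doubled=lambda d: d['new_column'] * 2)
           ['doubled']
           .mean())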

6. Pandas Missing Data Functions

Handling missing data is an essential skill for data scientists, and thankfully, the Pandas library offers simple but powerful tools to manage missing data effectively. 

The two most commonly used functions for handling missing data are fillna() and dropna().

I have a feeling that you can work out what they both do, but let’s explore the basic syntax and functionalities of these two methods, starting with fillna().

The TL;DR here is that it’s used to fill NA/NaN values with a specified method or value. If you’re not sure what I mean by NaN, this is just shorthand for Not a Number!

'''
Hackr.io: 10 Python Concepts I Wish I Knew Earlier
fillna(): General Syntax
'''
df.fillna(value=None, axis=None, inplace=False)
# Note: the older method= argument (e.g., method='ffill') is deprecated
# in recent pandas versions in favor of df.ffill() and df.bfill()

Now, let’s consider a simple use case when we have a dataset with missing values. Our goal is to replace all NaNs with the mean value of the column. 

Pandas makes this really easy, as you can see below!

'''
Hackr.io: 10 Python Concepts I Wish I Knew Earlier
fillna() Example
'''
import pandas as pd

data = {'A': [1, 2, None, 4], 'B': [5, None, 7, 8]}
df = pd.DataFrame(data)
df['A'] = df['A'].fillna(df['A'].mean())
print(df)

Now, let’s take a look at dropna(), which is used to remove missing values. Depending on how you use this function, you can drop entire rows or columns.

'''
Hackr.io: 10 Python Concepts I Wish I Knew Earlier
dropna(): General Syntax
'''
df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

Let’s look at a simple example where we want to drop any row in our dataset that contains at least one NaN value.

'''
Hackr.io: 10 Python Concepts I Wish I Knew Earlier
dropna() Example
'''
import pandas as pd

data = {'A': [1, 2, None, 4], 'B': [5, None, 7, 8]}
df = pd.DataFrame(data)
df.dropna(inplace=True)
print(df)
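
I mentioned that dropna() can remove columns as well as rows, so here are a couple of hedged variations on the same toy data (rebuilt so the drops start from scratch):

'''
Hackr.io: 10 Python Concepts I Wish I Knew Earlier
dropna() Variations
'''
import pandas as pd

df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [5, None, 7, 8]})

# Drop columns (instead of rows) containing any missing values
df_no_missing_cols = df.dropna(axis=1)

# Keep only rows with at least 2 non-missing values
df_thresh = df.dropna(thresh=2)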

Overall, when it comes to working with real-world data, missing values are generally a given.

And unless you know how to handle these, you might encounter errors or even produce unreliable analyses. 

By understanding how to manage and handle these missing values efficiently, we can ensure our analysis remains robust and insightful. Win!
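
One last quick tip before we move on: it often pays to measure how much data is actually missing before choosing between fillna() and dropna(). Here's a minimal sketch on the same toy data:

'''
Hackr.io: 10 Python Concepts I Wish I Knew Earlier
Counting Missing Values
'''
import pandas as pd

df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [5, None, 7, 8]})

# Count missing values per column to inform your strategy
print(df.isna().sum())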

7. Pandas Data Visualization

Sure, data scientists need to spend a lot of time (A LOT!) manipulating data, but the ability to produce data visualizations is perhaps just as important, if not more so!

After all, data science is about storytelling, and what better way to do that than with pictures?

Yes, you might need to produce beautiful plots to share with stakeholders and customers, but it’s also super helpful to create quick visualizations to better understand your data.

From experience, there have been a ton of occasions when I spotted an underlying trend, pattern, or characteristic of a dataset that I would not have been able to see without a plot.

Once again, Pandas comes to the rescue here, as it makes it super easy to visualize data with the integrated plot() function. 

Don’t worry, this uses Matplotlib under the hood, so you’re in safe hands!

Let's delve into the basic mechanics of this function. 

The most important thing to remember is that plot() is highly versatile (just see the docs to get a feel for how much you can do with it!).

By default, it generates a line plot, but you can easily change the type, along with a host of other formatting features. 

In fact, if you’ve spent any time working with Matplotlib, you’ll know just how much you can control, tweak, and customize plots.

Let’s take a look at a concrete example where we have a dataset with monthly sales figures. Our goal is to plot a bar graph to visualize monthly trends. 

As you can see, it doesn’t get much easier than calling the plot() function and passing in some basic parameters to tweak the output.

'''
Hackr.io: 10 Python Concepts I Wish I Knew Earlier
Pandas plot() Example
'''
import pandas as pd
import matplotlib.pyplot as plt

# Sample data: Monthly sales figures
data = {'Month': ['Jan', 'Feb', 'Mar', 'Apr'], 'Sales': [200, 220, 250, 275]}
df = pd.DataFrame(data)

# Bar plot using Pandas plot()
df.plot(x='Month', y='Sales', kind='bar', title='Monthly Sales Data', grid=True, legend=False)
plt.show()
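
And because plot() defaults to a line plot, switching types is as simple as changing the kind parameter. Here's a quick sketch of two variations on the same df from above:

'''
Hackr.io: 10 Python Concepts I Wish I Knew Earlier
Pandas plot() Variations
'''
# Same data as above, different plot types
df.plot(x='Month', y='Sales', kind='line', marker='o', title='Monthly Sales Trend')
df.plot(x='Month', y='Sales', kind='barh', title='Monthly Sales (Horizontal)')
plt.show()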

8. Numpy Broadcasting

When it comes to data science with Python, Pandas and NumPy are the two pillars that have helped propel Python’s popularity.

When the time comes to work with arrays in NumPy, we can often find ourselves needing to perform operations between arrays of different shapes. No bueno!

On the surface, this seems problematic, and you might have even found yourself implementing manual reshaping and looping with various Python operators.

But there is a simpler way! By using NumPy’s broadcasting feature, these operations become incredibly streamlined. 

But what is broadcasting? Great question!

This is a powerful NumPy concept that allows you to perform arithmetic operations on arrays of different shapes without explicit looping or reshaping. I know, what a dream!

In simple terms, you can think of this as NumPy's method of implicitly handling element-wise binary operations with input arrays of different shapes. That’s a mouthful! 

But to understand broadcasting, it's important to grasp the rules that NumPy uses to decide if two arrays are compatible for broadcasting. 

Rule 1: If the two arrays have different shapes, the array with fewer dimensions is padded with 1s on its left side.

For example: Shape of A: (5, 4), Shape of B: (4,) = Broadcasted shape of B: (1, 4)

Rule 2: If the shapes of the two arrays do not match in any dimension, the array with a size of 1 in that dimension is stretched to match the other.

For example: Shape of A: (5, 4), Shape of B: (1, 4) = Broadcasted shape of both A and B: (5, 4)

Rule 3: If any dimension sizes disagree and neither is equal to 1, an error is raised.

For example: Shape of A: (5, 4), Shape of B: (6, 4) = This will raise an error.

So, as you can see, if two arrays are compatible, they can be broadcasted.
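
To make those rules concrete, here's a tiny sketch using the (5, 4) and (4,) shapes from the examples above:

'''
Hackr.io: 10 Python Concepts I Wish I Knew Earlier
Broadcasting Rules Sketch
'''
import numpy as np

A = np.ones((5, 4))   # shape (5, 4)
B = np.arange(4)      # shape (4,) -> padded to (1, 4), stretched to (5, 4)

print((A + B).shape)  # (5, 4)

# Rule 3 in action: shapes (5, 4) and (6, 4) are incompatible
C = np.ones((6, 4))
# A + C  # ValueError: operands could not be broadcast together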

Let's look at a classic example to grasp this idea. 

Imagine you have an array of data, and you want to normalize it by subtracting the mean and then dividing by the standard deviation. Simple stuff, right?

Well, for starters, you need to remember that the mean and standard deviation are scalar values while the data is a 3x3 array. 

But, thanks to broadcasting, NumPy allows us to subtract a scalar from an array and divide an array by a scalar. This is the magic of broadcasting!

'''
Hackr.io: 10 Python Concepts I Wish I Knew Earlier
NumPy Broadcasting Example
'''
import numpy as np

# Data array
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Compute mean and standard deviation
mean = np.mean(data)
std_dev = np.std(data)

# Normalize the data
normalized_data = (data - mean) / std_dev

9. Pandas groupby()

You’ve probably spotted a heavy Pandas theme in this article, but trust me, it really is the backbone of data science with Python!

That said, one of the most powerful tools you can use with Pandas is the groupby() method.

This allows you to split data into groups based on criteria and then apply a function to each group, such as aggregation, transformation, or filtering.

If you’ve spent any time working with SQL commands, this Python concept should be somewhat familiar to you as it’s inspired by the SQL grouping syntax and the split-apply-combine strategy. 

Just remember the clue is in the name here! You're grouping data by some criterion, and then you're able to apply various operations to each group.

Let’s take a look at the basic approach.

  • Split: Divide data into groups.
  • Apply: Perform an operation on each group, such as aggregation (sum or average), transformation (filling NAs), or filtration (discarding data based on group properties).
  • Combine: Put the results back together into a new data structure.

As always, the best way to understand this Python concept is to look at an example. 

So, suppose you have a dataset of sales in a store and want to find out the total sales for each product. Seems reasonable enough!

As you can see, we call the groupby() method on the dataframe column containing Products. 

We then use the dot notation to access the Sales column, and we apply the sum() method to get the total sales per product.

The resultant series contains products as indices with their respective total sales as values.

The more I’ve used the groupby() method, the more I’ve come to appreciate how powerful it is for producing concise representations of aggregated data with minimal code. 

'''
Hackr.io: 10 Python Concepts I Wish I Knew Earlier
Pandas groupby() Example
'''
import pandas as pd

# Sample data
data = {
  'Product': ['A', 'B', 'A', 'C', 'B', 'C', 'A'],
  'Sales': [100, 150, 200, 50, 300, 25, 75]
}

df = pd.DataFrame(data)

# Group by product and sum up sales
total_sales_per_product = df.groupby('Product').Sales.sum()
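
And if you need more than one statistic per group, agg() slots straight into the same pattern. Here's a quick sketch extending the example above:

'''
Hackr.io: 10 Python Concepts I Wish I Knew Earlier
Pandas groupby() with agg()
'''
# Multiple aggregations per product in a single pass
sales_summary = df.groupby('Product')['Sales'].agg(['sum', 'mean', 'count'])
print(sales_summary)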

10. Vectorization Vs. Iteration

Anyone who's worked with large datasets in Python will have stumbled upon the dilemma of performance, especially when you need to traverse the data. Yep, we’re on the subject of runtime!

Well, allow me to introduce you to something special called vectorization! 

But what is that, I hear you ask?

No problem. Vectorization leverages low-level optimizations to allow operations to be applied on whole arrays rather than individual elements. 

Libraries like NumPy in Python have perfected this.

But why does this matter, and how does it differ from traditional iteration?

Well, you probably know that iteration involves going through elements one by one. 

And sure, this is super intuitive for us programmers, but it can be much slower and thus more computationally expensive with bigger datasets.

To make this clearer, let’s look at the general syntax for the two approaches.

'''
Hackr.io: 10 Python Concepts I Wish I Knew Earlier
Vectorization vs Iteration: Syntax
'''
import numpy as np

# Iteration: apply the function one element at a time
result = []
for item in data:
  result.append(some_function(item))

# Vectorization: apply the function to the whole array at once
# (some_function is a placeholder for a NumPy ufunc like np.square)
result = np.some_function(data)

And yes, I do love how concise the NumPy code is, but the real gains are hidden away from us.

The whole point of using vectorization is to boost time performance, so let’s look at a simple example to illustrate this.

To start with, we’ve populated a list with 100,000 elements before creating two simple functions.

The iterative function uses a list comprehension to iterate over each item in the list and compute its square.

The vectorized function converts the list to a NumPy array to take advantage of NumPy's vectorized operations to compute the square of each number in the array all at once. 

We’ve then used the timeit module to run these functions ten times and compute the average run time.

If you run this example on your own machine, the actual time in seconds will vary, but you should see that the vectorized operation is significantly faster! 

On my machine, the average time over 10 runs is nearly 7.5x faster for the vectorized function than it is with the iterative function. 

And remember, this gain becomes even more pronounced as your data grows in size.

So, when you're working with huge datasets or doing extensive computations, vectorization can save not only valuable coding time but also computational time.

'''
Hackr.io: 10 Python Concepts I Wish I Knew Earlier
Vectorization vs Iteration Example
'''
import numpy as np
import timeit

# Sample data
data = list(range(1, 100001))

# Timing function for Iteration
def iterative_approach():
  return [item**2 for item in data]

# Timing function for Vectorization
def vectorized_approach():
  data_np = np.array(data)
  return data_np**2

# Using Iteration
iterative_time = timeit.timeit(iterative_approach, number=10) / 10  

print(f"Iterative Approach Time: {iterative_time:.5f} seconds")
# Output on my machine: Iterative Approach Time: 0.03872 seconds

# Using Vectorization
vectorized_time = timeit.timeit(vectorized_approach, number=10) / 10  

print(f"Vectorized Approach Time: {vectorized_time:.5f} seconds")
# Output on my machine: Vectorized Approach Time: 0.00514 seconds

Conclusion

As we move through 2024, Python is still a top 3 language with huge demand in data science.

And with the Bureau of Labor Statistics reporting an average salary of over $115K for data scientists, learning essential Python concepts to land a job can be highly rewarding.

Even if you’re new to the data science job market, learning these Python concepts can help you succeed and stand out from the crowd.

And there you have it: the 10 Python concepts I wish I knew earlier for data science, complete with detailed explanations and code examples.

Whether you’re looking to land your first data science job or you’re fresh off a Python course and branching into data, mastering these 10 concepts will serve you again and again!

Frequently Asked Questions

1. What Python Concepts Are Required For Data Science?

In general, you should be familiar with Python essentials like data structures, control structures, functions, exception handling, and key Python libraries like NumPy, Pandas and Matplotlib. I’d also recommend checking out the various concepts we’ve outlined above.

2. How Long Does It Take To Learn Python Data Science?

This depends on your current skill and education level. If you’re a beginner, learning data manipulation may take 1-3 months. You can then aim for an intermediate level by adding statistics and machine learning skills over 3-6 months. Advanced proficiency, including skills like deep learning, will likely require 12+ months.


By Robert Johns

Technical Editor for Hackr.io | 15+ Years in Python, Java, SQL, C++, C#, JavaScript, Ruby, PHP, .NET, MATLAB, HTML & CSS, and more... 10+ Years in Networking, Cloud, APIs, Linux | 5+ Years in Data Science | 2x PhDs in Structural & Blast Engineering



