Python RegEx (regular expression) is a powerful tool for pattern matching and text manipulation.
The re module provides built-in support for regex, allowing you to search, extract, and modify text efficiently. Regex patterns can match digits, special characters, and even Unicode text. Because some characters (such as ., *, and \) have a special meaning in a pattern, understanding how to escape them is essential.
Importing the re Module
Before using regex in Python, you need to import the re module:
import re # The regex module in Python
Why Use Regular Expressions?
- Pattern Matching: Find specific patterns in strings, such as email addresses or phone numbers.
- Text Validation: Ensure that strings conform to expected formats (e.g., validating user input).
- Data Extraction: Extract parts of a string based on patterns.
- String Replacement: Modify text efficiently using search-and-replace operations.
A regex pattern is a sequence of characters that defines a search pattern. Word boundaries (\b) restrict a match to whole words, square brackets ([...]) define character sets, and parentheses create groups within a pattern, which can be accessed separately.
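To make these three building blocks concrete, here is a minimal sketch. The sample strings and variable names are made up for illustration.

```python
import re

text = "cat concatenate category"

# Without word boundaries, "cat" is also found inside longer words.
loose = re.findall(r'cat', text)
# \b restricts the match to the standalone word.
strict = re.findall(r'\bcat\b', text)
print(loose)   # ['cat', 'cat', 'cat']
print(strict)  # ['cat']

# Square brackets define a set: here, any single lowercase vowel.
vowels = re.findall(r'[aeiou]', "regex")
print(vowels)  # ['e', 'e']

# Parentheses create groups that can be read separately from the full match.
date_match = re.search(r'(\d{4})-(\d{2})-(\d{2})', "Released on 2024-03-15.")
if date_match:
    print(date_match.group(0))  # 2024-03-15 (the whole match)
    print(date_match.group(1))  # 2024 (the first group)
    print(date_match.groups())  # ('2024', '03', '15')
```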
How to Get a RegEx Match in Python?
The primary functions provided by the re module to get a match include:
re.search(pattern, string) # Searches for the first occurrence of the pattern
re.match(pattern, string) # Checks if the pattern matches at the start of the string
re.findall(pattern, string) # Returns a list of all occurrences of the pattern
re.fullmatch(pattern, string) # Ensures the entire string matches the pattern
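The difference between these four functions is easiest to see by running the same pattern against the same string. The following is a small sketch with an invented sample string:

```python
import re

text = "abc 123 def 456"
pattern = r'\d+'

first = re.search(pattern, text)        # first run of digits anywhere in the string
at_start = re.match(pattern, text)      # None: the string starts with "abc", not a digit
everything = re.findall(pattern, text)  # every non-overlapping run of digits
whole = re.fullmatch(pattern, text)     # None: the string is not made up of digits only

print(first.group())  # 123
print(at_start)       # None
print(everything)     # ['123', '456']
print(whole)          # None
```

Note that re.search(), re.match(), and re.fullmatch() return a match object or None, while re.findall() always returns a list (possibly empty).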
What is the match Function in Python RegEx?
The re.match() function checks if the beginning of a string matches a pattern. It is useful when you need to confirm that the search pattern occurs at the start of the string.
A match object is returned when a successful match is found; its span() method returns a tuple containing the start and end indices of the match. None is returned if no match is found.
Let's take a look at the basic syntax to check for a match at the start of a string:
text = "Python is powerful!"
match = re.match(r'Python', text)
if match:
    print("Match found!", "Span:", match.span())
else:
    print("No match.")
Output:
Match found! Span: (0, 6)
Understanding Span in Regex
The span in regex refers to the range of indices in the input string where a match occurs. The span() method of a match object returns a tuple containing the start and end indices of the match. The end index is exclusive, as in Python slicing.
For example:
text = "Find the number 42 in this text."
match = re.search(r'\d+', text)
if match:
    print("Matched number:", match.group(), "Span:", match.span())
Output:
Matched number: 42 Span: (16, 18)
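Because the span uses the same indexing as string slicing, it can be used to cut the matched text back out of the original string. A short sketch, with a made-up sample string:

```python
import re

text = "Order 66 shipped"
match = re.search(r'\d+', text)

if match:
    start, end = match.span()
    # start() and end() return the same two values individually.
    print(match.start(), match.end())  # 6 8
    # Slicing with the span reproduces the matched text.
    print(text[start:end])             # 66
    print(text[:start] + "#" * (end - start) + text[end:])  # Order ## shipped
```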
How to Check if a String Matches a RegEx Pattern in Python?
To check if a string matches a regex pattern, you can use re.match(), re.search(), or re.fullmatch(). Use re.fullmatch() when the entire string must conform to the pattern.
Matching is case-sensitive by default; to make it case-insensitive, pass a flag such as re.IGNORECASE.
pattern = r'^[a-z]+$' # Only lowercase letters allowed
text = "python"
if re.fullmatch(pattern, text):
    print("Valid format")
else:
    print("Invalid format")
Output:
Valid format
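For validation, the choice of function matters. The sketch below uses the same lowercase-letter pattern without the ^ and $ anchors and shows how re.match() accepts a string that re.fullmatch() rejects. The input strings are invented for illustration.

```python
import re

pattern = r'[a-z]+'

# re.match() only needs the start of the string to fit the pattern.
prefix_only = re.match(pattern, "python3")
# re.fullmatch() requires every character to fit, so the trailing "3" fails.
whole_string = re.fullmatch(pattern, "python3")
# With re.IGNORECASE, uppercase letters are accepted as well.
relaxed = re.fullmatch(pattern, "Python", flags=re.IGNORECASE)

print(prefix_only.group())   # python
print(whole_string)          # None
print(relaxed is not None)   # True
```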
How to Match a Pattern in RegEx?
The re.search() function looks for the first occurrence of a pattern in a string and returns a match object if found.
Ordinary characters in a pattern match themselves literally. A backslash escapes a special character so that it is matched literally (for example, \. matches a period). Writing the pattern as a raw string (r'...') stops Python from interpreting the backslashes before the regex engine sees them.
text = "The price is $25.99"
match = re.search(r'\d+\.\d+', text)
if match:
    print("Found price:", match.group(), "Span:", match.span())
Output:
Found price: 25.99 Span: (14, 19)
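To see why the escape matters, the sketch below compares an escaped and an unescaped period on an invented sample string. It also uses re.escape(), a standard re function that escapes special characters for you when you need to match a literal string.

```python
import re

text = "Total: $25.99 (approx. 2599 cents)"

# Unescaped, "." matches any character, so "2599" is accepted too.
unescaped = re.findall(r'\d+.\d+', text)
# Escaped, "\." only matches a literal period.
escaped = re.findall(r'\d+\.\d+', text)
print(unescaped)  # ['25.99', '2599']
print(escaped)    # ['25.99']

# re.escape() builds a pattern that matches the given string literally,
# including the "$" and "." that would otherwise be special.
literal = re.search(re.escape("$25.99"), text)
print(literal.group())  # $25.99
```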
Special Characters and Character Classes
Regex supports special characters like . (wildcard), ^ (beginning of the string), $ (end of the string), and predefined character classes:
- \d – Matches any digit (equivalent to [0-9] for ASCII text).
- \w – Matches any word character (letters, digits, and the underscore _).
- \s – Matches whitespace characters.
- \S – Matches any non-whitespace character.
- [] – Square brackets define a set of characters to match.
Using regex to detect an alphanumeric character can be useful when validating user input.
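The character classes above can be sketched as follows. The helper is_alphanumeric is a hypothetical name introduced for this example, and it assumes that "alphanumeric" means ASCII letters and digits only; \w is not used for the check because it also accepts the underscore.

```python
import re

print(re.findall(r'\d', "Room 42B"))             # ['4', '2']
print(re.findall(r'\w+', "user_name: alice!"))   # ['user_name', 'alice']
print(re.findall(r'\S+', "a  b\tc"))             # ['a', 'b', 'c']

def is_alphanumeric(value):
    """Return True if value consists only of ASCII letters and digits."""
    # Hypothetical helper: an explicit set is used instead of \w,
    # which would also allow "_".
    return re.fullmatch(r'[A-Za-z0-9]+', value) is not None

print(is_alphanumeric("abc123"))   # True
print(is_alphanumeric("abc_123"))  # False
```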
Case-Insensitive Matching and Flags
Regex allows case-insensitive matching using flags like re.IGNORECASE. This ensures a match regardless of the case of the characters in the input string.
pattern = r'hello'
text = "HELLO world"
match = re.search(pattern, text, flags=re.IGNORECASE)
print("Found:", match.group())
Output:
Found: HELLO
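Flags can also be attached to a compiled pattern or written inline. The sketch below shows both forms on invented sample text. It also guards against a missing match, since re.search() returns None when nothing is found and calling group() on None raises an error.

```python
import re

text = "Hello, HELLO, hello!"

# Compile once with a flag, then reuse the pattern object.
greeting = re.compile(r'hello', flags=re.IGNORECASE)
all_greetings = greeting.findall(text)
print(all_greetings)  # ['Hello', 'HELLO', 'hello']

# The inline flag (?i) at the start of the pattern has the same effect.
inline = re.search(r'(?i)hello', "HELLO world")

# Check for None before calling group().
missing = re.search(r'goodbye', text, flags=re.IGNORECASE)
print(inline.group() if inline else "No match")    # HELLO
print(missing.group() if missing else "No match")  # No match
```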
Substitution Using re.sub()
The re.sub() function allows for substitution of matched patterns with a replacement string. It is useful for data cleaning and formatting.
text = "Replace newline character with a space\n"
new_text = re.sub(r'\n', ' ', text)
print(new_text)
Output:
Replace newline character with a space
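Beyond plain replacement strings, re.sub() can refer back to groups in the pattern and can limit how many replacements it makes. The following sketch uses invented sample data to show both, plus a common whitespace cleanup.

```python
import re

# \1, \2, and \3 in the replacement refer to the captured groups,
# which lets you reorder parts of each match.
dates = "2024-03-15 and 2023-12-01"
swapped = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', dates)
print(swapped)  # 15/03/2024 and 01/12/2023

# count limits the number of replacements (here, only the first one).
first_only = re.sub(r'\d+', '#', "a1 b22 c333", count=1)
print(first_only)  # a# b22 c333

# Collapse any run of whitespace into a single space.
cleaned = re.sub(r'\s+', ' ', "too   many\n\nspaces").strip()
print(cleaned)  # too many spaces
```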
Common Regex Patterns
Here are some common regex patterns and their usage that are definitely worth adding to your Python regex cheat sheet:
- \d+ – Matches one or more digits.
- \w+ – Matches one or more word characters.
- ^abc – Matches abc at the start of a string.
- abc$ – Matches abc at the end of a string.
- a{2,4} – Matches a repeated 2 to 4 times.
- [^abc] – Matches any character except a, b, or c.
- (abc|def) – Matches either abc or def.
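Each of these patterns can be tried out directly. The sketch below runs them against short, made-up strings so you can compare the results with the descriptions above.

```python
import re

print(re.findall(r'\d+', "a1 b22"))              # ['1', '22']
print(re.findall(r'\w+', "hi there_1!"))         # ['hi', 'there_1']
print(bool(re.search(r'^abc', "abcdef")))        # True
print(bool(re.search(r'^abc', "xabc")))          # False
print(bool(re.search(r'abc$', "xyzabc")))        # True
# The quantifier is greedy: five a's give one match of four,
# and the single a left over is too short to match.
print(re.findall(r'a{2,4}', "a aa aaaaa"))       # ['aa', 'aaaa']
print(re.findall(r'[^abc]', "abcd-e"))           # ['d', '-', 'e']
print(re.findall(r'(abc|def)', "abc xyz def"))   # ['abc', 'def']
```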
Key Takeaways
- Regular expressions are powerful for searching, extracting, and modifying text in Python projects.
- Use re.search() to find the first occurrence of a pattern in a string.
- re.match() checks if a pattern occurs at the start of a string.
- re.findall() returns a list of all occurrences of a pattern.
- Use re.sub() for efficient text substitution.
- Flags like re.IGNORECASE help with case-insensitive matching.
- The span() method returns the start and end indices of a match.
- Raw strings (r'') help avoid escape sequence conflicts in regex patterns.
Practice Exercise
To reinforce your understanding of regex in Python, try solving the following problem in your Python editor:
Write a Python script that extracts all valid email addresses from a given text and replaces them with [EMAIL REDACTED]. The script should handle various email formats and domain extensions.
import re
text = "Contact us at support@example.com or sales@my-company.org for inquiries."
pattern = r'[\w.-]+@[\w.-]+\.[a-zA-Z]{2,}'
redacted_text = re.sub(pattern, '[EMAIL REDACTED]', text)
print("Redacted text:", redacted_text)
Expected Output:
Redacted text: Contact us at [EMAIL REDACTED] or [EMAIL REDACTED] for inquiries.
Wrapping Up
Regular expressions are an essential tool for working with text data. Whether validating user input, searching for patterns, or extracting structured data, mastering regex will significantly enhance your Python programming skills. Understanding how to use word boundaries, escape sequences, raw strings, and character classes will help you match complex patterns effectively. Happy coding!