Regular Expressions — Grouping and Backreferences (42/100 Days of Python)
In this tutorial, we will be exploring the concepts of grouping and backreferences in regular expressions, including how to search and modify strings using these features. Regular expressions, also known as “regex” or “regexp”, are a powerful tool for pattern matching and string manipulation. With the help of grouping and backreferences, you can make your regular expressions even more versatile and efficient.
Grouping in Regular Expressions
Grouping in regular expressions is a way to treat several characters or symbols as a single unit. A group is written with parentheses ( ). To see what grouping adds, start with a pattern that has no groups:
import re
text = 'John Doe: john.doe@example.com, Jane Doe: jane.doe@example.com'
# Find: word boundary (\b) + word (\w+) + dot (\.) + word (\w+)
# + at (@) + word (\w+) + dot (\.) + word (\w+) + word boundary (\b)
emails = re.findall(r'\b\w+\.\w+@\w+\.\w+\b', text)
print(emails)
# ['john.doe@example.com', 'jane.doe@example.com']
In this example, the regular expression r'\b\w+\.\w+@\w+\.\w+\b' matches all the email addresses in the text variable. Each \b is a word boundary, which marks the start or end of a word. Each \w+ matches one or more word characters (letters, digits, or underscores), and \. matches a literal dot. Because this pattern contains no groups, re.findall() returns each match as a whole string. By wrapping parts of the pattern in parentheses, we can capture the two parts of the username, the domain, and the top-level domain as separate units:
import re
text = 'John Doe: john.doe@example.com, Jane Doe: jane.doe@example.com'
# boundary + word in one group (\b\w+) + dot (\.) + word in another group (\w+)
# @ + another group for (\w+) + dot (\.) + group for extension (\w+) + \b
email_pattern = r'(\b\w+)\.(\w+)@(\w+)\.(\w+)\b'
emails = re.findall(email_pattern, text)
print(emails)
# [('john', 'doe', 'example', 'com'), ('jane', 'doe', 'example', 'com')]
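In this example, the regular expression r'(\b\w+)\.(\w+)@(\w+)\.(\w+)\b' matches the same email addresses in the text variable, but re.findall() now returns one tuple of captured groups per match. The four groups are the two dot-separated parts of the username (here, the first and last name), the domain, and the top-level domain. The full email address is not one of the groups; it is group 0 of a match object. This makes it easier to extract and manipulate specific parts of the email addresses.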
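If you need both the full address and its parts, one option is re.finditer(), which yields match objects instead of tuples. The following is a minimal sketch that reuses the text and email_pattern from the example above; the list names full_emails and parts are only illustrative:

```python
import re

text = 'John Doe: john.doe@example.com, Jane Doe: jane.doe@example.com'
email_pattern = r'(\b\w+)\.(\w+)@(\w+)\.(\w+)\b'

full_emails = []  # whole matches (group 0)
parts = []        # tuples of the four captured groups

for match in re.finditer(email_pattern, text):
    full_emails.append(match.group(0))  # the entire matched text
    parts.append(match.groups())        # groups 1 to 4 as a tuple

print(full_emails)
# ['john.doe@example.com', 'jane.doe@example.com']
print(parts)
# [('john', 'doe', 'example', 'com'), ('jane', 'doe', 'example', 'com')]
```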
Backreferences in Regular Expressions
Backreferences in regular expressions allow you to reuse a captured group. Inside a pattern, a backreference matches an exact copy of the text that the group captured; inside a replacement string, it inserts that text. This is useful when you want to match repeated text within a string or rewrite a string based on parts of a match. A backreference is written as a backslash followed by the group number, such as \1. In a replacement string passed to re.sub(), you can also write \g<num>, where num is the group number:
import re
text = 'John has a cat named Mittens'
animal_pattern = r'(\w+) has a (\w+) named (\w+)'
match = re.search(animal_pattern, text)
if match:
    animal_type = match.group(2)  # cat
    animal_name = match.group(3)  # Mittens
    # The rf prefix fills in the braces but keeps \g<1> as a raw backreference
    replacement = rf'{animal_type.capitalize()} named {animal_name.capitalize()} belongs to \g<1>'
    new_text = re.sub(animal_pattern, replacement, text)
    print(new_text)
    # Cat named Mittens belongs to John
In this example, animal_pattern matches the text "John has a cat named Mittens" and uses parentheses to capture the owner, the animal type, and the animal's name. Groups 2 and 3 are read from the match object with match.group() and capitalized in Python before being formatted into the replacement string. Group 1 is inserted by re.sub() itself through the \g<1> backreference. The result replaces the original match with a string built from all three captured values.
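The example above uses a backreference only in the replacement string. To illustrate a backreference inside the pattern itself, here is a small sketch that finds accidentally doubled words; the sample sentence and the name repeated_word are made up for this illustration:

```python
import re

text = 'This is is a test of the the backreference feature'

# (\w+) captures a word; \1 then requires the same word again after a space
repeated_word = r'\b(\w+) \1\b'

doubles = re.findall(repeated_word, text)
print(doubles)
# ['is', 'the']

# Replace each doubled word with a single copy of the captured word
fixed = re.sub(repeated_word, r'\1', text)
print(fixed)
# This is a test of the backreference feature
```

Note that \1 matches the same text the group captured, not just any text that fits \w+, which is what distinguishes a backreference from simply repeating the sub-pattern.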
Modifying Strings with Regular Expressions
Regular expressions can also be used to modify strings. The re.sub() function replaces matches in a string with a specified value, and it is especially useful in combination with backreferences:
import re
# Example 1
text = 'John Doe: john.doe@example.com, Jane Doe: jane.doe@example.com'
email_pattern = r'(\b\w+)\.(\w+)@(\w+)\.(\w+)\b'
new_text = re.sub(email_pattern, r'\1_\2@\3.\4', text)
print(new_text)
# John Doe: john_doe@example.com, Jane Doe: jane_doe@example.com
# Example 2
text = 'John was born in 1980 and Jane was born in 1985'
birth_pattern = r'(\w+) was born in (\d+)'
new_text = re.sub(birth_pattern, r'\2: \1', text)
print(new_text)
# 1980: John and 1985: Jane
# Example 3
text = 'John: 123, Jane: 456, Bob: 789'
numbers_pattern = r'(\w+): (\d+)'
new_text = re.sub(numbers_pattern, r'\2: \1', text)
print(new_text)
# 123: John, 456: Jane, 789: Bob
In these examples, re.sub() replaces every match of the pattern in the text variable with a replacement string that contains backreferences to the captured groups. The resulting strings are then printed.
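The examples above use the short \1 form, which works as long as the backreference is not followed by a digit. When it is, the \g<num> form avoids ambiguity: \10 is read as a reference to group 10, not group 1 followed by a 0. The sketch below shows the difference on a made-up input string:

```python
import re

text = 'item7 item8'
pattern = r'item(\d)'

# \g<1>0 means: the text of group 1, followed by a literal 0
result = re.sub(pattern, r'item\g<1>0', text)
print(result)
# item70 item80

# \10 is parsed as group 10, which this pattern does not have
try:
    re.sub(pattern, r'item\10', text)
    ambiguous_failed = False
except re.error:
    ambiguous_failed = True
print(ambiguous_failed)
# True
```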
By using grouping and backreferences in regular expressions, you can extract parts of a match, reorder them, and rewrite strings in a single re.sub() call. Experimenting with small patterns like the ones above is a good way to get comfortable with them.