Regular Expressions — Grouping and Backreferences (42/100 Days of Python)

Martin Mirakyan
4 min read · Feb 12, 2023

Day 42 of the “100 Days of Python” blog post series covering grouping and backreferences in regexp

In this tutorial, we will be exploring the concepts of grouping and backreferences in regular expressions, including how to search and modify strings using these features. Regular expressions, also known as “regex” or “regexp”, are a powerful tool for pattern matching and string manipulation. With the help of grouping and backreferences, you can make your regular expressions even more versatile and efficient.

Grouping in Regular Expressions

Grouping lets you treat several characters or symbols in a pattern as a single unit and capture the text they match. Groups are written with parentheses ( ). As a starting point, here is a pattern that finds email addresses without using any groups:

import re

text = 'John Doe: john.doe@example.com, Jane Doe: jane.doe@example.com'

# Find: word boundary (\b) + word (\w+) + dot (\.) + word (\w+)
# + at (@) + word (\w+) + dot (\.) + word (\w+) + word boundary (\b)
emails = re.findall(r'\b\w+\.\w+@\w+\.\w+\b', text)
print(emails)
# ['john.doe@example.com', 'jane.doe@example.com']

In this example, the regular expression r'\b\w+\.\w+@\w+\.\w+\b' matches every email address of the form first.last@domain.tld in the text variable. \b is a word boundary, which anchors the match at the start or end of a word. \w+ matches one or more word characters (letters, digits, or underscores), and \. matches a literal dot. By wrapping parts of the pattern in parentheses, we can capture the two halves of the username, the domain name, and the top-level domain as separate units:

import re

text = 'John Doe: john.doe@example.com, Jane Doe: jane.doe@example.com'

# boundary + word in one group (\b\w+) + dot (\.) + word in another group (\w+)
# @ + another group for (\w+) + dot (\.) + group for extension (\w+) + \b
email_pattern = r'(\b\w+)\.(\w+)@(\w+)\.(\w+)\b'
emails = re.findall(email_pattern, text)
print(emails)
# [('john', 'doe', 'example', 'com'), ('jane', 'doe', 'example', 'com')]

In this example, the regular expression r'(\b\w+)\.(\w+)@(\w+)\.(\w+)\b' matches the same email addresses, but because the pattern contains groups, re.findall() returns one tuple per match instead of the full matched string. The four groups are the two parts of the username (before and after the dot), the domain name, and the top-level domain. This makes it easier to extract and manipulate specific parts of the email addresses.
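
To see how the numbered groups relate to the whole match, here is a small sketch that walks over the matches with re.finditer() using the same pattern. Group 0 is always the entire matched text, while groups 1 through 4 are the parenthesized parts. The variable names full_matches and parts are only illustrative:

```python
import re

text = 'John Doe: john.doe@example.com, Jane Doe: jane.doe@example.com'
email_pattern = r'(\b\w+)\.(\w+)@(\w+)\.(\w+)\b'

full_matches = []
parts = []
for match in re.finditer(email_pattern, text):
    full_matches.append(match.group(0))  # group 0 is the entire matched text
    parts.append(match.groups())         # tuple with groups 1 through 4

print(full_matches)
# ['john.doe@example.com', 'jane.doe@example.com']
print(parts)
# [('john', 'doe', 'example', 'com'), ('jane', 'doe', 'example', 'com')]
```

Using match objects instead of re.findall() keeps both the full address and its parts available at the same time.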

Backreferences in Regular Expressions

Backreferences in regular expressions allow you to reuse the text captured by a group. Inside a pattern, a backreference matches an exact copy of the text that the group captured earlier in the same match, which is useful for finding repeated text. In a replacement string, a backreference inserts the captured text, which is useful for modifying a string based on a pattern match. Inside a pattern, a backreference is written as a backslash followed by the group number (\1, \2, and so on). In a replacement string passed to re.sub(), you can use the same \1 form or \g<num>, where num is the group number. The following example uses \g<1> in a replacement string:

import re

text = 'John has a cat named Mittens'
animal_pattern = r'(\w+) has a (\w+) named (\w+)'
match = re.search(animal_pattern, text)

if match:
    animal_type = match.group(2)  # cat
    animal_name = match.group(3)  # Mittens
    # rf-string: {...} is formatted by Python, \g<1> is left for re.sub() to expand
    replacement = rf'{animal_type.capitalize()} named {animal_name.capitalize()} belongs to \g<1>'
    new_text = re.sub(animal_pattern, replacement, text)
    print(new_text)
    # Cat named Mittens belongs to John

In this example, the animal_pattern defines a pattern that matches the text "John has a cat named Mittens". The pattern uses parentheses to capture the owner, the animal type, and the animal name as groups 1, 2, and 3. Groups 2 and 3 are read from the match object with match.group() and capitalized in Python, while group 1 is inserted by re.sub() itself through the \g<1> backreference in the replacement string. The replacement string has to be a raw string (or escape the backslash) so that Python passes \g<1> to re.sub() unchanged. The result is a modified string in which the original match is replaced by text built from the captured values.
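
Backreferences can also appear inside the pattern itself, which is the "exact copy" case this section opens with. The sketch below is an illustrative example (the sample sentence and variable names are made up for this purpose) that finds words accidentally typed twice in a row and collapses each pair into a single word:

```python
import re

text = 'This is is a test of of the the parser'

# (\w+) captures a word; \1 then requires the very same text again
# after one or more whitespace characters
repeated_pattern = r'\b(\w+)\s+\1\b'

repeated = re.findall(repeated_pattern, text)
print(repeated)
# ['is', 'of', 'the']

# In the replacement string, \1 inserts the captured word once
deduplicated = re.sub(repeated_pattern, r'\1', text)
print(deduplicated)
# This is a test of the parser
```

The \b anchors matter here: without them, the "is" at the end of "This" followed by " is" would also count as a repeated word.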

Modifying Strings with Regular Expressions

Regular expressions can also be used to modify strings. The re.sub() function is a useful tool for replacing matches in a string with a specified value. It’s especially useful in combination with backreferences:

import re

# Example 1
text = 'John Doe: john.doe@example.com, Jane Doe: jane.doe@example.com'
email_pattern = r'(\b\w+)\.(\w+)@(\w+)\.(\w+)\b'
new_text = re.sub(email_pattern, r'\1_\2@\3.\4', text)
print(new_text)
# John Doe: john_doe@example.com, Jane Doe: jane_doe@example.com

# Example 2
text = 'John was born in 1980 and Jane was born in 1985'
birth_pattern = r'(\w+) was born in (\d+)'
new_text = re.sub(birth_pattern, r'\2: \1', text)
print(new_text)
# 1980: John and 1985: Jane

# Example 3
text = 'John: 123, Jane: 456, Bob: 789'
numbers_pattern = r'(\w+): (\d+)'
new_text = re.sub(numbers_pattern, r'\2: \1', text)
print(new_text)
# 123: John, 456: Jane, 789: Bob

In these examples, re.sub() is used to replace all the matches of a specified pattern in the text variable with a value that contains backreferences to the captured groups in the match. The resulting modified strings are then printed.
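
One case where the \g<num> form is needed rather than \1 is when the backreference is followed directly by a digit. The following sketch uses a made-up input string and assumes the goal is to append a literal 0 to each captured number:

```python
import re

text = 'item7, item12'
pattern = r'item(\d+)'

# r'\10' is read as a reference to group 10, which does not exist here,
# so re.sub() raises re.error for this replacement string
try:
    re.sub(pattern, r'item\10', text)
    ambiguous_failed = False
except re.error:
    ambiguous_failed = True
print(ambiguous_failed)
# True

# \g<1>0 is unambiguous: group 1 followed by the character '0'
new_text = re.sub(pattern, r'item\g<1>0', text)
print(new_text)
# item70, item120
```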

By using grouping and backreferences in regular expressions, you can perform complex string manipulations with ease. The possibilities are endless, and with a little creativity and experimentation, you can achieve great results.


Written by Martin Mirakyan

Software Engineer | Machine Learning | Founder of Profound Academy (https://profound.academy)
