Regular Expressions in Python — Deep Dive (41/100 Days of Python)

Martin Mirakyan
4 min read · Feb 11, 2023


Day 41 of the “100 Days of Python” blog post series covering regular expressions

Regular expressions (regex) are a powerful tool for data processing and analysis. They allow you to search, match, and extract patterns in text, making them a valuable addition to any Python developer’s toolkit. In this guide, we’ll cover the basics of regular expressions in Python, including metacharacters, special sequences, and quantifiers, along with examples that show how to apply them in practice.

What are Regular Expressions in Python?

A regular expression is a sequence of characters that defines a search pattern. Regular expressions are often used to perform operations on strings, such as searching for specific patterns, replacing substrings, and validating data. Python ships with a built-in module called re that provides functions for working with them.
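To make those three operations concrete, here is a minimal sketch using four functions from the re module: re.search, re.findall, re.sub, and re.fullmatch. The sample string and the variable names are made up for illustration and do not come from any real dataset.

```python
import re

# Hypothetical sample text, used only for illustration.
text = 'Order 66 shipped on 2023-02-11'

# Searching: re.search returns a Match object for the first hit, or None.
match = re.search(r'\d+', text)
first_number = match.group() if match else None    # '66'

# Extracting: re.findall returns every non-overlapping match as a list of strings.
all_numbers = re.findall(r'\d+', text)              # ['66', '2023', '02', '11']

# Replacing: re.sub swaps each match for the replacement string.
masked = re.sub(r'\d', '#', text)                   # 'Order ## shipped on ####-##-##'

# Validating: re.fullmatch succeeds only if the whole string fits the pattern.
is_date = re.fullmatch(r'\d{4}-\d{2}-\d{2}', '2023-02-11') is not None   # True
```

Note that the patterns are written as raw strings (the r prefix), so Python passes the backslashes to the regex engine unchanged.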

Using Metacharacters in Python

Metacharacters are characters that carry a special meaning in a pattern instead of matching themselves literally. Some of the most commonly used ones in Python, together with a few backslash sequences that come up again in the special sequences section, include:

  • .: Matches any single character except a newline
  • ^: Matches the start of the string
  • $: Matches the end of the string (or just before a trailing newline)
  • []: Matches any one of the characters listed inside the square brackets
  • [^ ]: Matches any one character not listed inside the square brackets
  • \w: Matches any word character (a letter, a digit, or an underscore)
  • \d: Matches any decimal digit
  • \s: Matches any whitespace character
  • \b: Matches a word boundary
  • ( ): Groups the expression within the parentheses and captures what it matched
  • |: Matches either the expression before or the expression after the | symbol

Here’s an example of how metacharacters can be used in real-world applications:

import re

text = 'Hello 123 World 456 Hello World'

# match any character except a newline
if re.search(r'. World', text):
    print('Match found!')
else:
    print('No match found.')

# match the start of a string
if re.search(r'^Hello', text):
    print('Match found!')
else:
    print('No match found.')

# match the end of a string
if re.search(r'Hello World$', text):
    print('Match found!')
else:
    print('No match found.')

# match any character within the square brackets
if re.search(r'[0123456789]', text):
    print('Match found!')
else:
    print('No match found.')

# match any character not within the square brackets
if re.search(r'[^0123456789]', text):
    print('Match found!')
else:
    print('No match found.')

# match one or more word characters (letters, digits, underscore)
if re.search(r'\w+', text):
    print('Match found!')
else:
    print('No match found.')

# match one or more decimal digits
if re.search(r'\d+', text):
    print('Match found!')
else:
    print('No match found.')

# match one or more whitespace characters
if re.search(r'\s+', text):
    print('Match found!')
else:
    print('No match found.')

# match a word boundary
if re.search(r'\bHello\b', text):
    print('Match found!')
else:
    print('No match found.')

# match either the expression before or after the | symbol
if re.search(r'Hello World|Hello 123', text):
    print('Match found!')
else:
    print('No match found.')

As you can see, metacharacters provide a powerful way to search, match, and extract patterns in text. Every pattern above occurs somewhere in text, so each of the snippets prints Match found!.
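The snippets above only check whether a match exists. To pull the matched text out, one option is to combine the same metacharacters with re.findall and with the group() and span() methods of the Match object. The sketch below reuses the same text; the non-capturing group (?:...) is an extra piece of syntax not covered in the list above, used here so that re.findall returns whole matches rather than captured groups.

```python
import re

text = 'Hello 123 World 456 Hello World'

# [0-9] is shorthand for [0123456789]; + repeats the class one or more times.
numbers = re.findall(r'[0-9]+', text)                   # ['123', '456']

# Parentheses capture part of the match so it can be read back with group(1).
match = re.search(r'Hello (\d+) World', text)
captured = match.group(1) if match else None            # '123'

# | inside a group picks between alternatives; (?:...) groups without capturing.
greetings = re.findall(r'Hello (?:World|\d+)', text)    # ['Hello 123', 'Hello World']

# A Match object also reports where the match was found (start, end).
end_match = re.search(r'World$', text)
span = end_match.span() if end_match else None          # (26, 31)
```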

Special Sequences in Python

Special sequences consist of a backslash followed by a character, and stand either for a position in the string or for a predefined set of characters. Some of the most commonly used special sequences include:

  • \A: Matches the start of the string
  • \b: Matches a word boundary
  • \B: Matches a position that is not a word boundary
  • \d: Matches any decimal digit
  • \D: Matches any non-digit character
  • \s: Matches any whitespace character
  • \S: Matches any non-whitespace character
  • \w: Matches any word character (a letter, a digit, or an underscore)
  • \W: Matches any non-word character

Here’s an example of how special sequences can be used:

import re

text = 'Hello 123 World 456 Hello World'

# match the start of the string
if re.search(r'\AHello', text):
    print('Match found!')
else:
    print('No match found.')

# match a word boundary
if re.search(r'\bHello\b', text):
    print('Match found!')
else:
    print('No match found.')

# match a non-word boundary
# (prints 'No match found.': every 'Hello' in text starts and ends at a word boundary)
if re.search(r'\BHello\B', text):
    print('Match found!')
else:
    print('No match found.')

# match one or more decimal digits
if re.search(r'\d+', text):
    print('Match found!')
else:
    print('No match found.')

# match one or more non-digit characters
if re.search(r'\D+', text):
    print('Match found!')
else:
    print('No match found.')

# match one or more whitespace characters
if re.search(r'\s+', text):
    print('Match found!')
else:
    print('No match found.')

# match one or more non-whitespace characters
if re.search(r'\S+', text):
    print('Match found!')
else:
    print('No match found.')

# match one or more word characters (letters, digits, underscore)
if re.search(r'\w+', text):
    print('Match found!')
else:
    print('No match found.')

# match one or more non-word characters
if re.search(r'\W+', text):
    print('Match found!')
else:
    print('No match found.')
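The \B case deserves a closer look, because with this text it is the one pattern that does not match. Here is a small sketch that contrasts \B inside a word with \B around a whole word, and shows how the upper-case sequences complement the lower-case ones. The 'snake_case name' string is a made-up example.

```python
import re

text = 'Hello 123 World 456 Hello World'

# \B requires that a position is NOT a word boundary, so it only matches inside a word.
inside = re.search(r'\Bell\B', text)        # matches 'ell' between 'H' and 'o'
# Every 'Hello' in text starts and ends at a boundary, so this finds nothing.
whole = re.search(r'\BHello\B', text)       # None

# Upper-case sequences are the complements of their lower-case counterparts.
digit_runs = re.findall(r'\d+', text)       # ['123', '456']
non_digit_runs = re.findall(r'\D+', text)   # ['Hello ', ' World ', ' Hello World']
words = re.findall(r'\S+', text)            # ['Hello', '123', 'World', '456', 'Hello', 'World']

# \w also matches the underscore, not only letters and digits.
identifiers = re.findall(r'\w+', 'snake_case name')   # ['snake_case', 'name']
```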

Quantifiers in Python

Quantifiers specify how many times the preceding character, character class, or group must occur for the pattern to match. Some of the most commonly used quantifiers include:

*: Matches zero or more occurrences of the preceding character or pattern:

import re

text = 'Hello 123 World 456 Hello World'

# ' *' allows zero or more spaces between 'Hello' and 'World'
if re.search(r'Hello *World', text):
    print('Match found!')
else:
    print('No match found.')

+: Matches one or more occurrences of the preceding character or pattern:

import re

text = 'Hello 123 World 456 Hello World'

# ' +' requires one or more spaces between 'Hello' and 'World'
if re.search(r'Hello +World', text):
    print('Match found!')
else:
    print('No match found.')

?: Matches zero or one occurrence of the preceding character or pattern:

import re

text = 'Hello 123 World 456 Hello World'

# 'o?' makes the final 'o' optional, so both 'Hell World' and 'Hello World' would match
if re.search(r'Hello? World', text):
    print('Match found!')
else:
    print('No match found.')

{m,n}: Matches from m to n occurrences of the preceding character or pattern:

import re

text = 'Hello 123 World 456 Hello World'

# 'o{1,3}' accepts one to three 'o' characters: 'Hello', 'Helloo', or 'Hellooo'
if re.search(r'Hello{1,3} World', text):
    print('Match found!')
else:
    print('No match found.')
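In all four snippets the quantifier applies to a single character (a space or the letter o). As a rough sketch of the "or pattern" part, the example below applies the same quantifiers to a group and to a character class. The sample strings are invented for illustration.

```python
import re

# A quantifier binds to the single item before it: a character, a class, or a group.
only_b = re.fullmatch(r'ab+', 'abbb') is not None            # True: + repeats only 'b'
whole_group = re.fullmatch(r'(ab)+', 'ababab') is not None   # True: + repeats the group 'ab'
mismatch = re.fullmatch(r'ab+', 'abab') is not None          # False: 'abab' is not 'a' plus b's

# {m,n} on a character class: runs of two to three digits.
runs = re.findall(r'\d{2,3}', 'a1 b22 c333 d4444')           # ['22', '333', '444']

# ? makes the preceding item optional.
spellings = re.findall(r'colou?r', 'color colour')           # ['color', 'colour']

# * allows zero occurrences, so this matches with or without spaces.
gaps = re.findall(r'a *b', 'ab a b a   b')                   # ['ab', 'a b', 'a   b']
```

The \d{2,3} result shows that quantifiers are greedy by default: in '4444' the engine takes three digits, and the single digit left over is too short to match again.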


Written by Martin Mirakyan

Software Engineer | Machine Learning | Founder of Profound Academy (https://profound.academy)
