Python regexp get group

Table of contents
What is Group in Regex?
Example to Capture Multiple Groups
Access Each Group Result Separately
Regex Capture Group Multiple Times
Extract Range of Groups Matches

Python Regex Capturing Groups

In this article, you will learn how to capture regex groups in Python. By capturing groups, we can match several distinct patterns inside the same target string and read each one back separately.

What is Group in Regex?

A group is a part of a regex pattern enclosed in parentheses. We create a group by placing a pattern between ( and ). For example, the regular expression (cat) creates a single group containing the letters ‘c’, ‘a’, and ‘t’.

In a real-world case, you might want to capture both emails and phone numbers. You would then write two groups: the first matches the email, and the second matches the phone number.

Capturing groups are also a way to treat multiple characters as a single unit, so that a quantifier or alternation applies to the whole group.

For example, the expression ((\w)(\s\d)) contains three such groups:

Group 1: ((\w)(\s\d))
Group 2: (\w)
Group 3: (\s\d)

We can specify as many groups as we wish. Each sub-pattern inside a pair of parentheses will be captured as a group. Capturing groups are numbered by counting their opening parentheses from left to right.
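As a small sketch of that numbering rule, the snippet below runs the nested pattern on a made-up sample string. The input "a 1" is an assumption, chosen only so that every group has something to match.

```python
import re

# Illustrative input: one word character, then whitespace and a digit.
match = re.search(r"((\w)(\s\d))", "a 1")

# Groups are numbered by the position of their opening parenthesis.
outer = match.group(1)   # outermost pair of parentheses
first = match.group(2)   # (\w)
second = match.group(3)  # (\s\d)
print(outer, first, second, sep="|")  # a 1|a| 1

# A compiled pattern reports how many capturing groups it defines.
group_count = re.compile(r"((\w)(\s\d))").groups
print(group_count)  # 3
```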

Capturing groups are a handy feature of regular expression matching that allows us to query the Match object to find out the part of the string that matched against a particular part of the regular expression.

Anything you put in plain parentheses () becomes a capturing group. Using the group(group_number) method of the regex Match object, we can extract the value matched by each group.
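To make this concrete, here is a minimal sketch that reads three groups back from a Match object. The sample sentence and the date pattern are assumptions picked for illustration; they are not part of the article's running example.

```python
import re

# Hypothetical input: an ISO-style date embedded in a sentence.
text = "Invoice issued on 2024-03-15 for order 7"
match = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)

# Each pair of parentheses can be queried by its number.
year = match.group(1)
month = match.group(2)
day = match.group(3)
print(year, month, day)  # 2024 03 15
```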

We will see how to capture single as well as multiple groups.

Example to Capture Multiple Groups

Let’s assume you have the following string:

target_string = "The price of PINEAPPLE ice cream is 20"

Suppose you want to extract two things from it: the uppercase word and the number. To do that, we first write two regular expression patterns:

Pattern to match the uppercase word (PINEAPPLE)
Pattern to match the number (20)

The first group pattern searches for an uppercase word: [A-Z]+

[A-Z] is a character class. It matches any single uppercase letter from A to Z.
The + metacharacter means one or more occurrences of the preceding item, so [A-Z]+ matches a run of uppercase letters.

The second group pattern searches for the price: \d+

\d matches any decimal digit, such as 0 to 9.
The + metacharacter means the number must contain at least one digit and may contain any number of them.
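The two building blocks can be tried on their own before they are combined. This sketch also shows, as a side observation, why the full example wraps the first pattern in \b word boundaries: without them, the capital T of "The" would count as an uppercase run too.

```python
import re

target_string = "The price of PINEAPPLE ice cream is 20"

# Without word boundaries, the capital T of "The" also matches.
loose = [m.group() for m in re.finditer(r"[A-Z]+", target_string)]
print(loose)  # ['T', 'PINEAPPLE']

# \b on both sides keeps only words made entirely of capitals.
strict = [m.group() for m in re.finditer(r"\b[A-Z]+\b", target_string)]
print(strict)  # ['PINEAPPLE']

# \d+ picks up each run of digits.
digits = [m.group() for m in re.finditer(r"\d+", target_string)]
print(digits)  # ['20']
```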

Extract matched group values

Finally, we can use the groups() and group() methods of the Match object to get the matched values.

Now let’s move to the example.

import re

target_string = "The price of PINEAPPLE ice cream is 20"

# two groups, each enclosed in its own pair of parentheses
result = re.search(r"(\b[A-Z]+\b).+(\b\d+)", target_string)

# Extract matching values of all groups
print(result.groups())
# Output ('PINEAPPLE', '20')

# Extract match value of group 1
print(result.group(1))
# Output PINEAPPLE

# Extract match value of group 2
print(result.group(2))
# Output 20

Let’s understand the above example

First of all, I used a raw string to specify the regular expression pattern. In an ordinary string literal, the backslash starts an escape sequence (for example, "\b" is a backspace character), so Python would alter the pattern before the re module ever sees it. A raw string, written r"...", leaves the backslashes untouched.
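The difference is easy to check. The sketch below compares the same pattern written as an ordinary string and as a raw string; the sample text "a cat" is an assumption used only for the demonstration.

```python
import re

# In an ordinary string literal, "\b" is a single backspace character.
plain = "\bcat\b"
# In a raw string, r"\b" stays backslash + b, which re reads as a word boundary.
raw = r"\bcat\b"

print(len(plain), len(raw))  # 5 7

plain_match = re.search(plain, "a cat")
raw_match = re.search(raw, "a cat")
print(plain_match)        # None
print(raw_match.group())  # cat
```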

Now let’s take a closer look at the regular expression syntax to define and isolate the two patterns we are looking to match. We need two things.

First, we need to enclose each of the two patterns inside a pair of parentheses. So (\b[A-Z]+\b) is the first group, and (\b\d+) is the second group. Each pair of parentheses is a group.

The parentheses are not matched literally. They only mark where a group starts and ends.
The \b indicates a word boundary: a zero-width position between a word character and a non-word character (or the start or end of the string).

Secondly, we need to consider the larger context in which these groups reside. This means that we also care about the location of each of these groups inside the entire target string and that’s why we need to provide context or borders for each group.

Next, I have added .+ between the two groups. The dot matches any character except a newline, and the plus sign means the preceding item repeats one or more times. Together they skip over the characters that separate the uppercase word from the number, which we do not need to capture.

No .+ is needed before the first group, because re.search() scans the string for the first position where the pattern matches. The word boundaries around [A-Z]+ ensure that only a word made up entirely of uppercase letters is taken: the T in "The" is rejected because it is followed by another letter rather than a boundary, so the first group matches PINEAPPLE. The second group, (\b\d+), takes the digits that start at a word boundary, so it matches 20.

Next, we passed the combined pattern to the re.search() method to find the match.

The groups() method

Finally, using the groups() method of a Match object, we can extract all the group matches at once. It returns them as a tuple, in group order.
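One detail worth knowing is how groups() reports a group that did not take part in the match. The sketch below uses a hypothetical pattern with an optional second group; the pattern and the sample strings are assumptions, not part of the article's example.

```python
import re

# Hypothetical pattern: a whole number with an optional fractional part.
pattern = re.compile(r"(\d+)(\.\d+)?")

with_fraction = pattern.search("price 24.50").groups()
print(with_fraction)  # ('24', '.50')

# A group that did not participate in the match is reported as None...
without_fraction = pattern.search("price 24").groups()
print(without_fraction)  # ('24', None)

# ...unless a replacement value is passed through the default argument.
defaulted = pattern.search("price 24").groups(default="")
print(defaulted)  # ('24', '')
```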

Access Each Group Result Separately

We can use the group() method to extract each group result separately by specifying a group index in between parentheses. Capturing groups are numbered by counting their opening parentheses from left to right. In our case, we used two groups.

Please note that unlike string indexing, which always starts at 0, group numbering always starts at 1.

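The group with the number 0 is always the entire match, that is, the part of the target string that the whole pattern matched. If you call the group() method with no arguments at all, or with 0 as an argument, you will get that entire match, which is not necessarily the whole target string.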
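A quick check of what group(0) returns for the pattern used in this article. The snippet repeats the earlier search so that it runs on its own.

```python
import re

target_string = "The price of PINEAPPLE ice cream is 20"
result = re.search(r"(\b[A-Z]+\b).+(\b\d+)", target_string)

whole = result.group(0)
print(whole)                    # PINEAPPLE ice cream is 20
print(whole == result.group())  # True: no argument means group 0
print(whole == target_string)   # False: "The price of " is not part of the match
print(result.span())            # (13, 38)
```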

To get access to the text matched by each regex group, pass the group’s number to the group(group_number) method.

So the first group is group 1, the second group is group 2, and so on.

# Extract first group
print(result.group(1))

# Extract second group
print(result.group(2))

# Entire match (not the whole target string)
print(result.group(0))

So this is the simple way to access each of the groups as long as the patterns were matched.
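When the pattern is not found, re.search() returns None, and calling group() on it raises AttributeError. One way to guard against that is sketched below; the helper name extract_word_and_number is hypothetical and only wraps the pattern from this article.

```python
import re

def extract_word_and_number(text):
    """Return (uppercase_word, number), or None when the pattern is absent."""
    result = re.search(r"(\b[A-Z]+\b).+(\b\d+)", text)
    if result is None:
        # No match: there is no Match object to call group() on.
        return None
    return result.group(1), result.group(2)

found = extract_word_and_number("The price of PINEAPPLE ice cream is 20")
missing = extract_word_and_number("no capitals or digits here")
print(found)    # ('PINEAPPLE', '20')
print(missing)  # None
```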

Regex Capture Group Multiple Times

In the earlier examples, we used the search() method. It returns only the first match of the pattern, and therefore only one value per group. But what if a string contains multiple occurrences of a regex group and you want to extract all of them?

To capture all matches of a regex group, we need to use the finditer() method.

The finditer() method finds all matches and returns an iterator yielding match objects matching the regex pattern. Next, we can iterate each Match object and extract its value.

Note: findall() is not a substitute here. It returns a list of strings (or a list of tuples when the pattern has more than one group) rather than Match objects, so the group() and groups() methods are not available. If you call groups() on the result of findall(), you will get AttributeError: 'list' object has no attribute 'groups'.

So use finditer() whenever you need a Match object for every occurrence.
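The difference between the two return types can be seen side by side. This is a sketch using the pattern and target string from the example in this section; the variable names are chosen here for illustration.

```python
import re

target_string = "The price of ice-creams PINEAPPLE 20 MANGO 30 CHOCOLATE 40"
pattern = re.compile(r"(\b[A-Z]+\b).(\b\d+\b)")

# findall() gives plain tuples of group values: no Match methods available.
pairs = pattern.findall(target_string)
print(pairs)  # [('PINEAPPLE', '20'), ('MANGO', '30'), ('CHOCOLATE', '40')]

# finditer() yields Match objects, so extra details such as positions are available.
positions = [(m.group(1), m.start(1)) for m in pattern.finditer(target_string)]
print(positions[0][0])  # PINEAPPLE
```

If only the captured text is needed, the list from findall() may be enough; finditer() is the option that keeps group(), groups(), start() and span() available for each match.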

import re

target_string = "The price of ice-creams PINEAPPLE 20 MANGO 30 CHOCOLATE 40"

# two groups enclosed in separate ( and ) brackets
# group 1: find an uppercase word
# group 2: find the number that follows it
# you can compile a pattern or pass it directly to the finditer() method
pattern = re.compile(r"(\b[A-Z]+\b).(\b\d+\b)")

# find all matches to groups
for match in pattern.finditer(target_string):
    # extract words
    print(match.group(1))
    # extract numbers
    print(match.group(2))

Output:

PINEAPPLE
20
MANGO
30
CHOCOLATE
40

Extract Range of Groups Matches

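One more thing you can do with the group() method is pass several group numbers at once. The matches are then returned as a tuple, in the order the numbers were given. This is useful when we want to extract several groups in one call.

Note that the arguments are individual group numbers, not the ends of a range: group(1, 5) returns the matches of groups 1 and 5 only. To get the first five groups, pass every number, as in group(1, 2, 3, 4, 5), or slice the tuple returned by groups().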

import re

target_string = "The price of PINEAPPLE ice cream is 20"

# two patterns enclosed in separate ( and ) brackets
result = re.search(r".+(\b[A-Z]+\b).+(\b\d+)", target_string)

print(result.group(1, 2))
# Output ('PINEAPPLE', '20')
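With more than two groups, the difference between picking individual groups and taking a contiguous range becomes visible. The sketch below uses a hypothetical log-like line with four fields; the line and the pattern are assumptions for illustration only.

```python
import re

# Hypothetical input with four whitespace-separated fields.
line = "2024-03-15 ERROR disk 91"
m = re.search(r"(\S+) (\S+) (\S+) (\d+)", line)

# group() with several numbers returns exactly those groups.
picked = m.group(1, 4)
print(picked)  # ('2024-03-15', '91')

# For a contiguous range, slice the tuple returned by groups().
first_three = m.groups()[:3]
print(first_three)  # ('2024-03-15', 'ERROR', 'disk')
```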
