Redundant character escape in regexp python

Escaping metacharacters

This chapter shows how to match metacharacters literally, in both manually and programmatically constructed patterns. You'll also learn about the escape sequences supported by the re module.

Escaping with backslash

You have seen a few metacharacters and escape sequences that help to compose a RE. To match the metacharacters literally, i.e. to remove their special meaning, prefix those characters with a \ (backslash) character. To indicate a literal \ character, use \\ . This assumes you are using raw strings and not normal strings.

>>> bool(re.search(r'b^2', 'a^2 + b^2 - C*3'))
False
# escaping will work
>>> bool(re.search(r'b\^2', 'a^2 + b^2 - C*3'))
True
# match ( or ) literally
>>> re.sub(r'\(|\)', '', '(a*b) + c')
'a*b + c'
# note that the input string is also a raw string here
>>> re.sub(r'\\', '/', r'\learn\by\example')
'/learn/by/example'
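The raw-string assumption matters because a normal string literal processes backslashes before the re module ever sees them. The following is a small sketch of the difference; the variable names are chosen only for illustration.

```python
import re

# both spellings produce the same two-character pattern: \\
raw_pat = r'\\'
normal_pat = '\\\\'
print(raw_pat == normal_pat)

path = r'\learn\by\example'
print(re.sub(raw_pat, '/', path))
print(re.sub(normal_pat, '/', path))

# escaping ^ in a normal string needs a doubled backslash as well
print(bool(re.search('b\\^2', 'a^2 + b^2 - C*3')))
```

With normal strings, every backslash meant for the re module has to be doubled, which is why raw strings are usually the more readable choice.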

As emphasized earlier, regular expressions are just another tool to process text. Some examples and exercises presented in this book can be solved using normal string methods as well. It is good practice to reason out whether regular expressions are needed for a given problem.
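As one illustration, removing parentheses needs no regular expression at all. The sketch below compares a re.sub() call with two plain string approaches; the variable names are made up for this example.

```python
import re

s = '(a*b) + c'

# regular expression version
with_re = re.sub(r'\(|\)', '', s)

# plain string methods: chain replace() calls, or delete characters with translate()
with_replace = s.replace('(', '').replace(')', '')
with_translate = s.translate(str.maketrans('', '', '()'))

print(with_re, with_replace, with_translate, sep='\n')
```

All three produce the same result here, and the string-method versions avoid any need to escape metacharacters.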

re.escape()

Okay, what if you have a string variable that must be used to construct a RE — how to escape all the metacharacters? Relax, the re.escape() function has got you covered. No need to manually take care of all the metacharacters or worry about changes in future versions.

>>> expr = '(a^b)'
>>> print(re.escape(expr))
\(a\^b\)

# replace only at the end of string
>>> eqn = 'f*(a^b) - 3*(a^b)'
>>> re.sub(re.escape(expr) + r'\Z', 'c', eqn)
'f*(a^b) - 3*c'

Recall that in the Alternation section, join was used to dynamically construct a RE pattern from an iterable of strings. However, that didn't handle metacharacters. Here are some examples of how you can use re.escape() so that the resulting pattern will match the strings from the input iterable literally.

>>> terms = ['a_42', '(a^b)', '2|3']
# using 're.escape' and 'join' to construct the pattern
>>> pat1 = re.compile('|'.join(re.escape(s) for s in terms))
# using only 'join' to construct the pattern
>>> pat2 = re.compile('|'.join(terms))

>>> print(pat1.pattern)
a_42|\(a\^b\)|2\|3
>>> print(pat2.pattern)
a_42|(a^b)|2|3

>>> s = 'ba_423 (a^b)c 2|3 a^b'
>>> pat1.sub('X', s)
'bX3 Xc X a^b'
>>> pat2.sub('X', s)
'bXX (a^b)c X|X a^b'

Escape sequences

Certain characters like tab and newline can be expressed using escape sequences as \t and \n respectively. These are similar to how they are treated in normal string literals. However, \b is for word boundaries as seen earlier, whereas it stands for the backspace character in normal string literals.
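The difference between \b in a pattern and \b in a normal string literal can be seen in a short sketch. The sample strings below are invented for illustration.

```python
import re

# in a raw string, the re module receives \b and treats it as a word boundary
boundary = re.sub(r'\bpar\b', 'X', 'par spar apparent par')
print(boundary)

# in a normal string, '\b' is already the backspace character (\x08)
print('\b' == '\x08')
backspace = re.sub('\b', ' ', 'a\bb')
print(backspace)

# \t gives the same result either way
tab_raw = re.sub(r'\t', ':', 'a\tb')
tab_normal = re.sub('\t', ':', 'a\tb')
print(tab_raw, tab_normal)
```

For \t, the raw string hands the escape to the re module while the normal string hands over an actual tab character; both end up matching a tab.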


The full list is mentioned at the end of docs.python: Regular Expression Syntax section as \a \b \f \n \N \r \t \u \U \v \x \\ . Do read the documentation for details as well as how it differs for byte data.

If an escape sequence is not defined, you’ll get an error.
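A rough way to check this is to compile a pattern and catch re.error. The helper name is_valid_pattern below is hypothetical, not part of the re module.

```python
import re

def is_valid_pattern(pattern):
    # hypothetical helper: report whether the re module accepts the pattern
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

print(is_valid_pattern(r'\e'))   # undefined escape made of a backslash and a letter
print(is_valid_pattern(r'\"'))   # escaped punctuation: redundant, but accepted
print(is_valid_pattern(r'\t'))   # defined escape sequence
```

Only a backslash followed by an ASCII letter that has no defined meaning is rejected; escaping a punctuation character that is not special is merely redundant.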

You can also represent a character using hexadecimal escape of the format \xNN where NN are exactly two hexadecimal characters. If you represent a metacharacter using escapes, it will be treated literally instead of its metacharacter feature.

>>> re.sub(r'\x20', '', 'h e l l o')
'hello'

# \x7c is '|' character
>>> re.sub(r'2\x7c3', '5', '12|30')
'150'
>>> re.sub(r'2|3', '5', '12|30')
'15|50'

See ASCII code table for a handy cheatsheet with all the ASCII characters and their hexadecimal representations.

Octal escapes will be discussed in the Backreference section. The Codepoints and Unicode escapes section will discuss escapes for unicode characters using \u and \U .

Cheatsheet and Summary

This short chapter discussed how to match metacharacters literally. re.escape() helps if you are using input strings sourced from elsewhere to build the final RE. You also saw how to use escape sequences to represent characters and how they differ from normal string literals.

Exercises

a) Transform the given input strings to the expected output using the same logic on both strings.

>>> str2 = '(qty+4)/2-(9-2)*5+pq/4'

##### add your solution here for str1
'35+qty/3-(9-2)*7'
##### add your solution here for str2
'(qty+4)/2-35+pq/4'

b) Replace (4)\| with 2 only at the start or end of the given input strings.

>>> s2 = r'(4)\|42 - (4)\|3'
>>> s3 = 'two - (4)\\|\n'

>>> pat = re.compile()      ##### add your solution here

>>> pat.sub('2', s1)
'2.3/(4)\\|6 foo 5.3-2'
>>> pat.sub('2', s2)
'242 - (4)\\|3'
>>> pat.sub('2', s3)
'two - (4)\\|\n'

c) Replace any matching element from the list items with X for the given input strings. Match the elements from items literally. Assume no two elements of items will result in any matching conflict.

>>> pat = re.compile()      ##### add your solution here

>>> pat.sub('X', '0a.bcd')
'0Xcd'
>>> pat.sub('X', 'EAMPLE')
'EXAMPLE'
>>> pat.sub('X', r'43+n2 ax\y\ze')
'4X2 aXe'

d) Replace the backspace character \b with a single space character for the given input string.

>>> ip
'123\x08456'
>>> print(ip)
12456

>>> re.sub()      ##### add your solution here
'123 456'

e) Replace all occurrences of \e with e .

>>> re.sub()      ##### add your solution here
'there are common aspects among the alternations'

f) Replace any matching item from the list eqns with X for the given string ip . Match the items from eqns literally.

>>> eqns = ['(a^b)', '(a/b)', '(a^b)+2']

##### add your solution here

>>> pat.sub('X', ip)
'3-X*X-X+3'


Escape regex special characters in a Python string

re.escape(string)

Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.

As of Python 3.7 re.escape() was changed to escape only characters which are meaningful to regex operations.
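Whatever the exact set of characters escaped by a given Python version, the escaped pattern should still match the original text literally. The sketch below checks that property on a few made-up sample strings.

```python
import re

samples = ['a_42', '(a^b)', '2|3', 'it\'s "this"', r'x\y\z']

for s in samples:
    escaped = re.escape(s)
    # the escaped pattern must match the original string in full
    assert re.fullmatch(escaped, s)
    print(s, '->', escaped)
```

On Python 3.7 and later, characters such as quotes are left alone, while metacharacters like ( ^ ) | and the backslash are prefixed with a backslash.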

I’m surprised no one has mentioned using regular expressions via re.sub() :

import re

print(re.sub(r'([\"])', r'\\\1', 'it\'s "this"'))     # it's \"this\"
print(re.sub(r"([\'])", r'\\\1', 'it\'s "this"'))     # it\'s "this"
print(re.sub(r'([\" \'])', r'\\\1', 'it\'s "this"'))  # it\'s\ \"this\"
  • In the search pattern, list the character(s) you're looking for inside a character class. Note that \" inside the class matches only the " character. If backslashes already present in the input should be escaped as well, add \\ to the class, e.g. ([\\"]) .
  • Put parentheses around the search pattern, e.g. ([\"]) , so that the substitution pattern can use the found character when it adds \ in front of it. (That's what \1 does: uses the value of the first parenthesized group.)
  • The r in front of r'([\"])' means it's a raw string. Raw strings use different rules for escaping backslashes. To write ([\"]) as a plain string, you'd need to double all the backslashes and write '([\\"])' . Raw strings are friendlier when you're writing regular expressions.
  • In the substitution pattern, you need to escape \ to distinguish it from a backslash that precedes a substitution group, e.g. \1 , hence r'\\\1' . To write that as a plain string, you'd need '\\\\\\1' , and nobody wants that.
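The points above can be folded into a small reusable function. This is only a sketch: backslash_escape is a hypothetical name, and it uses a function as the replacement instead of \1 so that no extra backslash doubling is needed.

```python
import re

def backslash_escape(text, chars):
    # hypothetical helper: put a backslash in front of every character listed in chars
    if not chars:
        return text
    # re.escape() makes characters such as ] ^ - and \ safe inside the character class
    pattern = '[' + re.escape(chars) + ']'
    return re.sub(pattern, lambda m: '\\' + m.group(0), text)

print(backslash_escape('it\'s "this"', '"'))
print(backslash_escape('it\'s "this"', '"\''))
print(backslash_escape('a\\b"c', '\\"'))
```

Passing the backslash itself in chars covers the case where existing backslashes in the input need escaping too.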

Use repr()[1:-1]. In this case, the double quotes don't need to be escaped. The [1:-1] slice removes the single quote from the beginning and the end.

>>> x = input()
I'm "stuck" :\
>>> print(x)
I'm "stuck" :\
>>> print(repr(x)[1:-1])
I\'m "stuck" :\\
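One caveat worth checking before relying on this: repr() chooses its quote style based on the content of the string, so which quotes end up escaped varies from input to input. A quick sketch with made-up samples:

```python
samples = ["I'm", 'say "hi"', 'I\'m "stuck" :\\']

# strip the surrounding quotes that repr() adds
results = [repr(s)[1:-1] for s in samples]

for s, r in zip(samples, results):
    print(s, '->', r)
```

A single quote is escaped only when the string also contains a double quote; otherwise repr() switches to double quotes and leaves the single quote as it is.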

Or maybe you just want to escape a phrase to paste into your program? If so, do this:

>>> input()
I'm "stuck" :\
'I\'m "stuck" :\\'
