Regular Expressions and the Java Programming Language

Applications frequently require text processing for features like word searches, email validation, or XML document integrity checks. This often involves pattern matching. Languages like Perl, sed, or awk improve pattern matching with regular expressions: strings of characters that define patterns used to search for matching text. Previously, pattern matching in the Java programming language required the StringTokenizer class along with many charAt and substring calls to read through the characters or tokens and process the text. This often led to complex or messy code.

The Java 2 Platform, Standard Edition (J2SE), version 1.4, contains a new package called java.util.regex, which enables the use of regular expressions. The functionality now includes meta characters, which give regular expressions their versatility.

This article provides an overview of the use of regular expressions, and details how to use regular expressions with the java.util.regex package, using the following common scenarios as examples:

  • Simple word replacement
  • Email validation
  • Removal of control characters from a file
  • File searching

To compile the code in these examples and to use regular expressions in your applications, you’ll need to install J2SE version 1.4. [Editor’s note: The latest version of Java SE is available here.]

Regular Expression Constructs

A regular expression is a pattern of characters that describes a set of strings. You can use the java.util.regex package to find, display, or modify some or all of the occurrences of a pattern in an input sequence.

The simplest form of a regular expression is a literal string, such as "Java" or "programming." Regular expression matching also allows you to test whether a string fits into a specific syntactic form, such as an email address.

Regular expressions are built from ordinary and special characters. The special characters include \ ^ $ . | ? * + ( ) [ and {. Any other character appearing in a regular expression is ordinary, unless a \ precedes it.

Special characters serve a special purpose. For instance, the . matches any character except a line terminator. A regular expression like s.n matches any three-character string that begins with s and ends with n, including sun and son.
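As a small illustration of the dot and of backslash escaping, here is a minimal sketch. The class name DotDemo and the helper matchesAll are made up for this example; Pattern.matches is the standard convenience method that tests a whole input against an expression.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original article.
class DotDemo {
    // True when the entire input matches the given regular expression.
    static boolean matchesAll(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesAll("s.n", "sun"));    // true
        System.out.println(matchesAll("s.n", "son"));    // true
        System.out.println(matchesAll("s.n", "soon"));   // false: four characters
        // A preceding backslash makes the dot ordinary.
        // In Java source the backslash itself must be doubled.
        System.out.println(matchesAll("s\\.n", "sun"));  // false
        System.out.println(matchesAll("s\\.n", "s.n"));  // true
    }
}
```

Note that the escaped form matches only a literal dot, which is why "sun" no longer matches.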

There are many special characters used in regular expressions to find words at the beginning of lines, words that ignore case or are case-specific, and special characters that give a range, such as a-e, meaning any letter from a to e.
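The sketch below shows one way these three ideas might look in code: a line-start anchor, case-insensitive matching, and a character range. The class name, the helper count, and the sample text are assumptions made for illustration; the flags Pattern.MULTILINE and Pattern.CASE_INSENSITIVE are part of java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original article.
class SpecialCharsDemo {
    // Counts how many times the pattern is found in the input.
    static int count(Pattern p, String input) {
        Matcher m = p.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "Java is fun\njava is everywhere\nI like Java";

        // With MULTILINE, ^ matches at the beginning of every line.
        Pattern lineStart = Pattern.compile("^java", Pattern.MULTILINE);
        // CASE_INSENSITIVE makes the same expression ignore case.
        Pattern anyCase = Pattern.compile("^java",
                Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);
        // [a-e] matches any single letter from a to e.
        Pattern range = Pattern.compile("[a-e]");

        System.out.println(count(lineStart, text)); // 1
        System.out.println(count(anyCase, text));   // 2
        System.out.println(count(range, "badge"));  // 4
    }
}
```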

Regular expression usage with this new package is Perl-like, so if you are familiar with regular expressions in Perl, you can use the same expression syntax in the Java programming language. If you’re not familiar with regular expressions, here are a few constructs to get you started:

Construct        Matches

Characters
x                The character x
\\               The backslash character
\0n              The character with octal value 0n (0 <= n <= 7)
\0nn             The character with octal value 0nn (0 <= n <= 7)
\0mnn            The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh             The character with hexadecimal value 0xhh
\uhhhh           The character with hexadecimal value 0xhhhh
\t               The tab character ('\u0009')
\n               The newline (line feed) character ('\u000A')
\r               The carriage-return character ('\u000D')
\f               The form-feed character ('\u000C')
\a               The alert (bell) character ('\u0007')
\e               The escape character ('\u001B')
\cx              The control character corresponding to x

Character Classes
[abc]            a, b, or c (simple class)
[^abc]           Any character except a, b, or c (negation)
[a-zA-Z]         a through z or A through Z, inclusive (range)
[a-z&&[^bc]]     a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]    a through z, except for m through p: [a-lq-z] (subtraction)
[a-z&&[def]]     d, e, or f (intersection)

Predefined Character Classes
.                Any character (may or may not match line terminators)
\d               A digit: [0-9]
\D               A non-digit: [^0-9]
\s               A whitespace character: [ \t\n\x0B\f\r]
\S               A non-whitespace character: [^\s]
\w               A word character: [a-zA-Z_0-9]
\W               A non-word character: [^\w]

Check the documentation about the Pattern class for more specific details and examples.
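To make the predefined character classes more concrete, here is a minimal sketch using \d, \s, and \W. The class and method names are invented for this example; it relies only on Pattern.matches and String.replaceAll, both of which accept the regular expression syntax described above.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original article.
class PredefinedClassesDemo {
    // True when the input consists of one or more digits and nothing else.
    static boolean isAllDigits(String s) {
        return Pattern.matches("\\d+", s);
    }

    // Replaces every run of whitespace with a single space.
    static String collapseWhitespace(String s) {
        return s.replaceAll("\\s+", " ");
    }

    // Removes every character that is not a word character [a-zA-Z_0-9].
    static String stripNonWord(String s) {
        return s.replaceAll("\\W", "");
    }

    public static void main(String[] args) {
        System.out.println(isAllDigits("2024"));              // true
        System.out.println(isAllDigits("20a4"));              // false
        System.out.println(collapseWhitespace("a \t\n b"));   // a b
        System.out.println(stripNonWord("user_name-42!"));    // user_name42
    }
}
```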

Classes and Methods

The following classes match character sequences against patterns specified by regular expressions.

Pattern Class

An instance of the Pattern class represents a regular expression that is specified in string form in a syntax similar to that used by Perl.

A regular expression, specified as a string, must first be compiled into an instance of the Pattern class. The resulting pattern is used to create a Matcher object that matches arbitrary character sequences against the regular expression. Many matchers can share the same pattern because it is stateless.

The compile method compiles the given regular expression into a pattern, then the matcher method creates a matcher that will match the given input against this pattern. The pattern method returns the regular expression from which this pattern was compiled.
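The compile, matcher, and pattern methods can be sketched as follows. The class name PatternBasics, the constant WORD, and the expression [a-z]+ are assumptions chosen for illustration. Compiling the pattern once and creating a new matcher per input reflects the point that many matchers can share one pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original article.
class PatternBasics {
    // Compiled once; a Pattern can be shared by many matchers.
    static final Pattern WORD = Pattern.compile("[a-z]+");

    // Creates a fresh matcher for each input and tests the whole input.
    static boolean isLowercaseWord(String input) {
        Matcher m = WORD.matcher(input);
        return m.matches();
    }

    public static void main(String[] args) {
        // pattern() returns the expression the Pattern was compiled from.
        System.out.println(WORD.pattern());            // [a-z]+
        System.out.println(isLowercaseWord("regex"));  // true
        System.out.println(isLowercaseWord("Regex"));  // false
    }
}
```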

The split method is a convenience method that splits the given input sequence around matches of this pattern. The following example demonstrates:

    /*
     * Uses split to break up a string of input separated by
     * commas and/or whitespace.
     */
    import java.util.regex.*;

    public class Splitter {
        public static void main(String[] args) throws Exception {
            // Create a pattern to match breaks
            Pattern p = Pattern.compile("[,\\s]+");
            // Split input with the pattern
            String[] result = p.split("one,two, three four , five");
            // Print each piece on its own line
            for (int i = 0; i < result.length; i++) {
                System.out.println(result[i]);
            }
        }
    }

Matcher Class

Instances of the Matcher class are used to match character sequences against a given pattern. Input is provided to matchers using the CharSequence interface to support matching against characters from a wide variety of input sources.

A matcher is created from a pattern by invoking the pattern's matcher method. Once created, a matcher can be used to perform three different kinds of match operations:

  • The matches method attempts to match the entire input sequence against the pattern.
  • The lookingAt method attempts to match the input sequence, starting at the beginning, against the pattern.
  • The find method scans the input sequence looking for the next sequence that matches the pattern.

Each of these methods returns a boolean indicating success or failure. More information about a successful match can be obtained by querying the state of the matcher.
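The difference between the three match operations can be sketched like this. The class name MatchKinds, its helper methods, and the expression \d+ are assumptions made for this example; matches, lookingAt, find, group, start, and end are the Matcher methods themselves.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original article.
class MatchKinds {
    static final Pattern DIGITS = Pattern.compile("\\d+");

    // The entire input must match.
    static boolean whole(String s) {
        return DIGITS.matcher(s).matches();
    }

    // The match must start at the beginning, but need not cover everything.
    static boolean prefix(String s) {
        return DIGITS.matcher(s).lookingAt();
    }

    // The match may occur anywhere in the input.
    static boolean anywhere(String s) {
        return DIGITS.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(whole("123abc"));     // false
        System.out.println(prefix("123abc"));    // true
        System.out.println(prefix("abc123"));    // false
        System.out.println(anywhere("abc123"));  // true

        // After a successful match, the matcher's state describes it.
        Matcher m = DIGITS.matcher("abc123");
        if (m.find()) {
            // prints: 123 at 3-6
            System.out.println(m.group() + " at " + m.start() + "-" + m.end());
        }
    }
}
```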

This class also defines methods for replacing matched sequences by new strings whose contents can, if desired, be computed from the match result.

The appendReplacement method appends everything up to the next match, followed by the replacement for that match. The appendTail method appends whatever input remains after the last match.

For instance, when replacing cat with dog in the string blahcatblahcatblah, the first appendReplacement call appends blahdog, the second appendReplacement call appends blahdog, and appendTail appends blah, resulting in blahdogblahdogblah.
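The blahcatblahcatblah walk-through might be coded as in the sketch below, which also shows a replacement computed from the match result. The class name AppendDemo and both helper methods are assumptions for illustration. Matcher.quoteReplacement is used in the second helper so that a computed replacement is treated literally even if it happens to contain $ or \ characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original article.
class AppendDemo {
    static final Pattern CAT = Pattern.compile("cat");

    // Replaces every "cat" with the fixed string "dog".
    static String replaceCats(String input) {
        Matcher m = CAT.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Appends the text before this match, then the replacement.
            m.appendReplacement(sb, "dog");
        }
        // Appends the input that follows the last match.
        m.appendTail(sb);
        return sb.toString();
    }

    // Replaces every match with a value computed from the match itself.
    static String shoutCats(String input) {
        Matcher m = CAT.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String computed = m.group().toUpperCase();
            m.appendReplacement(sb, Matcher.quoteReplacement(computed));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceCats("blahcatblahcatblah")); // blahdogblahdogblah
        System.out.println(shoutCats("blahcatblahcatblah"));   // blahCATblahCATblah
    }
}
```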

CharSequence Interface

The CharSequence interface provides uniform, read-only access to many different types of character sequences. You supply the data to be searched from different sources. String, StringBuffer and CharBuffer implement CharSequence, so they are easy sources of data to search through. If you don't care for one of the available sources, you can write your own input source by implementing the CharSequence interface.
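Writing your own input source could look roughly like the following sketch, which wraps a char array so that a matcher can read it without first copying it into a String. The class name CharArraySequence and the helper firstNumber are hypothetical, and bounds checking is omitted to keep the example short.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical read-only view over a char array; not part of the original article.
class CharArraySequence implements CharSequence {
    private final char[] data;
    private final int start;
    private final int end;

    CharArraySequence(char[] data) {
        this(data, 0, data.length);
    }

    private CharArraySequence(char[] data, int start, int end) {
        this.data = data;
        this.start = start;
        this.end = end;
    }

    public int length() {
        return end - start;
    }

    public char charAt(int index) {
        return data[start + index];
    }

    // Returns a view over the same array; no characters are copied.
    public CharSequence subSequence(int from, int to) {
        return new CharArraySequence(data, start + from, start + to);
    }

    public String toString() {
        return new String(data, start, end - start);
    }

    // Finds the first run of digits in the array, or null if there is none.
    static String firstNumber(char[] chars) {
        Matcher m = Pattern.compile("\\d+").matcher(new CharArraySequence(chars));
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstNumber("order 42 shipped".toCharArray())); // 42
    }
}
```

A real implementation would likely validate the index arguments, since the CharSequence contract expects an IndexOutOfBoundsException for invalid ranges.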

Example Regex Scenarios

The following code samples demonstrate the use of the java.util.regex package for various common scenarios:

Simple Word Replacement

    /*
     * This code writes "one dog, two dogs in the yard"
     * to the standard-output stream.
     */
    import java.util.regex.*;

    public class Replacement {
        public static void main(String[] args) throws Exception {
            // Create a pattern to match cat
            Pattern p = Pattern.compile("cat");
            // Create a matcher with an input string
            Matcher m = p.matcher("one cat," + " two cats in the yard");
            StringBuffer sb = new StringBuffer();
            boolean result = m.find();
            // Loop through and create a new String
            // with the replacements
            while (result) {
                m.appendReplacement(sb, "dog");
                result = m.find();
            }
            // Add the last segment of input to
            // the new String
            m.appendTail(sb);
            System.out.println(sb.toString());
        }
    }

Email Validation

The following code samples a few checks on characters that should or should not appear in an email address. It is not a complete email validation program that covers all possible email scenarios, but it can be extended as needed.

    /*
     * Checks for invalid characters
     * in email addresses
     */
    import java.util.regex.*;

    public class EmailValidation {
        public static void main(String[] args) throws Exception {
            String input = "@sun.com";

            // Checks for email addresses starting with
            // inappropriate symbols like dots or @ signs.
            Pattern p = Pattern.compile("^\\.|^\\@");
            Matcher m = p.matcher(input);
            if (m.find()) {
                System.err.println("Email addresses don't start"
                                   + " with dots or @ signs.");
            }

            // Checks for email addresses that start with
            // www. and prints a message if it does.
            p = Pattern.compile("^www\\.");
            m = p.matcher(input);
            if (m.find()) {
                System.out.println("Email addresses don't start"
                                   + " with \"www.\", only web pages do.");
            }

            // Matches runs of characters that are not allowed
            p = Pattern.compile("[^A-Za-z0-9\\.\\@_\\-~#]+");
            m = p.matcher(input);
            StringBuffer sb = new StringBuffer();
            boolean result = m.find();
            boolean deletedIllegalChars = false;

            while (result) {
                deletedIllegalChars = true;
                m.appendReplacement(sb, "");
                result = m.find();
            }

            // Add the last segment of input to the new String
            m.appendTail(sb);
            input = sb.toString();

            if (deletedIllegalChars) {
                System.out.println("It contained incorrect characters,"
                                   + " such as spaces or commas.");
            }
        }
    }

Removing Control Characters from a File

    /*
     * This class removes control characters from a named
     * file.
     */
    import java.util.regex.*;
    import java.io.*;

    public class Control {
        public static void main(String[] args) throws Exception {
            // Create file objects for the input and output files
            File fin = new File("fileName1");
            File fout = new File("fileName2");

            // Open an input and an output stream
            FileInputStream fis = new FileInputStream(fin);
            FileOutputStream fos = new FileOutputStream(fout);

            BufferedReader in = new BufferedReader(
                new InputStreamReader(fis));
            BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(fos));

            // The pattern matches control characters
            Pattern p = Pattern.compile("\\p{Cntrl}");
            Matcher m = p.matcher("");
            String aLine = null;
            while ((aLine = in.readLine()) != null) {
                m.reset(aLine);
                // Replaces control characters with an empty
                // string.
                String result = m.replaceAll("");
                out.write(result);
                out.newLine();
            }
            in.close();
            out.close();
        }
    }

File Searching

    /*
     * Prints out the comments found in a .java file.
     */
    import java.util.regex.*;
    import java.io.*;
    import java.nio.*;
    import java.nio.charset.*;
    import java.nio.channels.*;

    public class CharBufferExample {
        public static void main(String[] args) throws Exception {
            // Create a pattern to match comments
            Pattern p = Pattern.compile("//.*$", Pattern.MULTILINE);

            // Get a Channel for the source file
            File f = new File("Replacement.java");
            FileInputStream fis = new FileInputStream(f);
            FileChannel fc = fis.getChannel();

            // Get a CharBuffer from the source file
            ByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY,
                                   0, (int)fc.size());
            Charset cs = Charset.forName("8859_1");
            CharsetDecoder cd = cs.newDecoder();
            CharBuffer cb = cd.decode(bb);

            // Run some matches
            Matcher m = p.matcher(cb);
            while (m.find()) {
                System.out.println("Found comment: " + m.group());
            }
        }
    }

Conclusion

Pattern matching in the Java programming language is now as flexible as in many other programming languages. Regular expressions can be put to use in applications to ensure data is formatted correctly before being entered into a database, or sent to some other part of an application, and they can be used for a wide variety of administrative tasks. In short, you can use regular expressions anywhere in your Java programming that calls for pattern matching.

About the Authors

Dana Nourie is a JDC technical writer. She enjoys exploring the Java platform, especially creating interactive web applications using servlets and JavaServer Pages technologies, such as the JDC Quizzes and Learning Paths and Step-by-Step pages. She is also a scuba diver and is looking for the Pacific Cold Water Seahorse.

Mike McCloskey is a Sun engineer, working in Core Libraries for J2SE. He has made contributions in java.lang, java.util, java.io and java.math, as well as the new packages java.util.regex and java.nio. He enjoys playing racquetball and writing science fiction.

Lesson: Regular Expressions

This lesson explains how to use the java.util.regex API for pattern matching with regular expressions. Although the syntax accepted by this package is similar to the Perl programming language, knowledge of Perl is not a prerequisite. This lesson starts with the basics, and gradually builds to cover more advanced techniques.

  • Introduction: Provides a general overview of regular expressions. It also introduces the core classes that comprise this API.
  • Test Harness: Defines a simple application for testing pattern matching with regular expressions.
  • String Literals: Introduces basic pattern matching, metacharacters, and quoting.
  • Character Classes: Describes simple character classes, negation, ranges, unions, intersections, and subtraction.
  • Predefined Character Classes: Describes the basic predefined character classes for whitespace, word, and digit characters.
  • Quantifiers: Explains greedy, reluctant, and possessive quantifiers for matching a specified expression x number of times.
  • Capturing Groups: Explains how to treat multiple characters as a single unit.
  • Boundary Matchers: Describes line, word, and input boundaries.
  • Methods of the Pattern Class: Examines other useful methods of the Pattern class, and explores advanced features such as compiling with flags and using embedded flag expressions.
  • Methods of the Matcher Class: Describes the commonly-used methods of the Matcher class.
  • Methods of the PatternSyntaxException Class: Describes how to examine a PatternSyntaxException.
  • Additional Resources: To read more about regular expressions, consult this section for additional resources.
