Some concepts

metacharacters

<([{\^-=$!|]})?*+.> These characters have a special meaning that affects how a pattern is matched. To match one of them literally, escape it with a backslash.
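A minimal sketch of what "special meaning" implies in practice, assuming we want to match a literal dot or plus sign. The class name EscapeDemo and the helper fullMatch are made up for illustration; Pattern.quote is the standard way to escape a whole literal at once.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Returns true when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Unescaped '.' is a metacharacter, so it also matches 'x'.
        System.out.println(fullMatch("1.5", "1x5"));                 // true
        // Escaped '.' matches only a literal dot.
        System.out.println(fullMatch("1\\.5", "1x5"));               // false
        // Unescaped '+' is a quantifier, so "a+b" does not match the text "a+b".
        System.out.println(fullMatch("a+b", "a+b"));                 // false
        // Pattern.quote treats every character of the literal as plain text.
        System.out.println(fullMatch(Pattern.quote("a+b"), "a+b"));  // true
    }
}
```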

character classes

a set of characters enclosed within square brackets.

Construct Description
[abc] a, b, or c (simple class)
[^abc] Any character except a, b, or c (negation)
[a-zA-Z] a through z, or A through Z, inclusive (range)
[a-d[m-p]] a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]] d, e, or f (intersection)
[a-z&&[^bc]] a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]] a through z, and not m through p: [a-lq-z] (subtraction)
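The union, intersection, and subtraction rows above can be checked with a few one-character inputs. This is only a sketch; the class name CharClassDemo and the helper fullMatch are hypothetical.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Returns true when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // union: a-d or m-p
        System.out.println(fullMatch("[a-d[m-p]]", "n"));    // true
        System.out.println(fullMatch("[a-d[m-p]]", "e"));    // false
        // intersection: only d, e, or f
        System.out.println(fullMatch("[a-z&&[def]]", "e"));  // true
        System.out.println(fullMatch("[a-z&&[def]]", "a"));  // false
        // subtraction: a-z without b and c
        System.out.println(fullMatch("[a-z&&[^bc]]", "b"));  // false
        System.out.println(fullMatch("[a-z&&[^bc]]", "d"));  // true
    }
}
```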

predefined character classes

convenient shorthands for commonly used regular expressions

Construct Description
. Any character (may or may not match line terminators)
\d A digit: [0-9]
\D A non-digit: [^0-9]
\s A whitespace character: [ \t\n\x0B\f\r]
\S A non-whitespace character: [^\s]
\w A word character: [a-zA-Z_0-9]
\W A non-word character: [^\w]
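To see which characters each shorthand covers, one option is to replace every match with a marker. The sketch below does that with String.replaceAll; the class PredefinedDemo and the helper mask are hypothetical names.

```java
class PredefinedDemo {
    // Replaces every match of the regex with '#'.
    static String mask(String regex, String input) {
        return input.replaceAll(regex, "#");
    }

    public static void main(String[] args) {
        System.out.println(mask("\\d", "room 42"));   // room ##
        System.out.println(mask("\\s", "a b\tc"));    // a#b#c
        System.out.println(mask("\\w", "hi, you!"));  // ##, ###!
        System.out.println(mask("\\W", "hi, you!"));  // hi##you#
    }
}
```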

boundary matchers

Boundary Construct Description
^ The beginning of a line
$ The end of a line
\b A word boundary
\B A non-word boundary
\A The beginning of the input
\G The end of the previous match
\Z The end of the input but for the final terminator, if any
\z The end of the input
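A small sketch of how some of these boundaries behave, assuming a two-line sample text chosen for illustration. The class BoundaryDemo and the helper count are hypothetical. Note that ^ only matches at the start of every line when Pattern.MULTILINE is set; without the flag it matches at the start of the input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Counts how many times the regex is found in the input.
    static int count(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "cat concat\ncat";
        // \b: "cat" inside "concat" has no word boundary in front of it.
        System.out.println(count("\\bcat\\b", 0, text));               // 2
        // ^ without MULTILINE: start of input only.
        System.out.println(count("^cat", 0, text));                    // 1
        // ^ with MULTILINE: start of every line.
        System.out.println(count("^cat", Pattern.MULTILINE, text));    // 2
        // \A ignores MULTILINE and always means start of input.
        System.out.println(count("\\Acat", Pattern.MULTILINE, text));  // 1
        // \z: the very end of the input.
        System.out.println(count("cat\\z", 0, text));                  // 1
    }
}
```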

quantifier

the number of occurrences to match against

Quantifier Meaning
X? once or not at all
X* zero or more times
X+ one or more times
X{n} exactly n times
X{n,} at least n times
X{n,m} at least n but not more than m times

Quantifiers can be applied to a character, a character class, or a capturing group.
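Here is a short sketch of the repetition operators from the table applied to a single character, a character class, and a capturing group. The class QuantifierDemo and its helper are hypothetical, and the sample patterns are my own.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns true when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // ? on a single character: the 'u' is optional.
        System.out.println(fullMatch("colou?r", "color"));   // true
        System.out.println(fullMatch("colou?r", "colour"));  // true
        // {n,m} on a predefined class: two to four digits.
        System.out.println(fullMatch("\\d{2,4}", "7"));      // false
        System.out.println(fullMatch("\\d{2,4}", "123"));    // true
        System.out.println(fullMatch("\\d{2,4}", "12345"));  // false
        // * on a character class: zero occurrences is fine.
        System.out.println(fullMatch("[xyz]*", ""));         // true
        // + on a capturing group: the whole "ab" unit repeats.
        System.out.println(fullMatch("(ab)+", "ababab"));    // true
    }
}
```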

capturing groups

treat multiple characters as a single unit

Capturing groups are numbered by counting their opening parentheses from left to right.

There is also a special group, group 0, which always represents the entire expression. This group is not included in the total reported by groupCount.


s(\d\w),f(dfd(\Sgd)er)3=pr
 --1---  ------2------
             ---3--

Group 1: (\d\w)
Group 2: (dfd(\Sgd)er)
Group 3: (\Sgd)
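The numbering can be confirmed by running the same pattern against an input that matches it. The input string "s1a,fdfdXgder3=pr" is my own assumption, built to fit the pattern; the class GroupNumberDemo and the helper groups are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberDemo {
    static final Pattern P = Pattern.compile("s(\\d\\w),f(dfd(\\Sgd)er)3=pr");

    // Returns groups 0..groupCount() for a full match, or null if there is no match.
    static String[] groups(String input) {
        Matcher m = P.matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] g = groups("s1a,fdfdXgder3=pr");
        // groupCount() is 3, so the array holds group 0 plus three groups.
        for (int i = 0; i < g.length; i++) {
            System.out.printf("Group %d: %s%n", i, g[i]);
        }
        // Group 0: s1a,fdfdXgder3=pr
        // Group 1: 1a
        // Group 2: dfdXgder
        // Group 3: Xgd
    }
}
```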

backreferences

In a regular expression, you can refer to a previous capturing group with a backslash (\) followed by a digit giving the number of the group to recall.

(\d\d)\1(abc)dfd\2-ellre

Here, \1 refers to the content matched by (\d\d) and \2 refers to the content matched by (abc). Note that a backreference does not re-apply the group's sub-pattern; it matches the exact text that the group captured.
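The difference between recalling the captured text and repeating the sub-pattern is easiest to see side by side. A minimal sketch, with made-up inputs and a hypothetical class name BackrefDemo:

```java
import java.util.regex.Pattern;

class BackrefDemo {
    // Returns true when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // \1 must repeat exactly what group 1 captured ("12").
        System.out.println(fullMatch("(\\d\\d)\\1", "1212"));        // true
        System.out.println(fullMatch("(\\d\\d)\\1", "1234"));        // false
        // Repeating the sub-pattern instead accepts any two further digits.
        System.out.println(fullMatch("(\\d\\d)\\d\\d", "1234"));     // true
    }
}
```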

Pattern

Pattern pattern = Pattern.compile("(\\d\\d).(abc)");

Matcher

Matcher matcher = pattern.matcher("string-to-match");

find() vs. matches() vs. lookingAt()

  • matches() tries to match the entire string, starting from index 0.
  • lookingAt() also starts from index 0, but does not require the entire string to match.
  • find() tries to find the next substring that matches the regex. Each call resumes after the previous match, so calling find() several times may give different results.
Pattern pattern = Pattern.compile("\\d\\d");
Matcher matcher = pattern.matcher("2345");
matcher.matches(); //false
matcher.lookingAt(); // true
matcher.find(); // true

If you need the whole string to match, use matches(); if only the beginning of the string needs to match, use lookingAt(). Use find() when you want to locate multiple occurrences, calling it repeatedly.
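To illustrate the "call it repeatedly" part, here is a sketch that collects every match find() returns for the same pattern and input used above. The class FindAllDemo and the helper findAll are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAllDemo {
    // Collects every non-overlapping match, in order of appearance.
    static List<String> findAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> hits = new ArrayList<>();
        // Each successful find() resumes right after the previous match.
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll("\\d\\d", "2345"));  // [23, 45]
    }
}
```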

query state

Once a match succeeds, we can query its state with the methods group(), start(), and end().

Because find() can be called repeatedly to visit every matched substring, we can query the state after each call, and get per-group details with group(i), start(i), and end(i).

while (matcher.find()) {
    System.out.printf("Found the text \"%s\" from %d to %d%n",
              matcher.group(),
              matcher.start(),
              matcher.end()
    );

    // further group details
    // although group 0 is excluded from groupCount(), we can safely get it
    // group() is equivalent to group(0) here
    for (int i = 0; i <= matcher.groupCount(); ++i) {
        System.out.printf("Group %d: %s from %d to %d%n",
                   i,
                   matcher.group(i),
                   matcher.start(i),
                   matcher.end(i)
        );
    }
}

replacement

replaceAll(..) and replaceFirst(..) are useful for replacing whole matches. In the replacement string, you can use $0, $1, $2, .. to refer to the respective group matches.
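As a quick sketch of group references in a replacement string, the snippet below reorders a date. The date format and the class ReplaceDemo with its helpers are my own illustration, not part of the notes above.

```java
class ReplaceDemo {
    static final String ISO_DATE = "(\\d{4})-(\\d{2})-(\\d{2})";

    // Rewrites every yyyy-MM-dd date as dd/MM/yyyy using $1..$3.
    static String swapAll(String s) {
        return s.replaceAll(ISO_DATE, "$3/$2/$1");
    }

    // Same rewrite, but only for the first date found.
    static String swapFirst(String s) {
        return s.replaceFirst(ISO_DATE, "$3/$2/$1");
    }

    public static void main(String[] args) {
        String text = "2020-01-31 and 2021-12-05";
        System.out.println(swapAll(text));    // 31/01/2020 and 05/12/2021
        System.out.println(swapFirst(text));  // 31/01/2020 and 2021-12-05
        // $0 stands for the whole match.
        System.out.println("a1b22".replaceAll("\\d+", "[$0]"));  // a[1]b[22]
    }
}
```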

But the more powerful way is to use appendReplacement(..) and appendTail(..).

In the following example, I need to replace the domain part of emails that are spread over multiple lines.

String emails = "[email protected]\n" +
                "[email protected]";
String EMAIL_PATTERN = "^([a-zA-Z0-9._]+)@((\\w+)\\.(\\w+))$";
StringBuffer sbf = new StringBuffer();
Matcher m = Pattern.compile(EMAIL_PATTERN, Pattern.MULTILINE).matcher(emails);
while(m.find()) {
    String domain = m.group(3);
    if("googlemail".equals(domain)) {
        m.appendReplacement(sbf, "$1@gmail.$4");
    }
}
m.appendTail(sbf);

System.out.printf("%s", sbf);
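For a self-contained run of the same idea, here is a sketch with hypothetical sample addresses (alice, bob, and carol are invented for illustration). It also shows that matches we skip are not lost: appendReplacement and appendTail copy all the text since the last append position, so untouched emails come through unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendReplacementDemo {
    static final Pattern EMAIL =
            Pattern.compile("^([a-zA-Z0-9._]+)@((\\w+)\\.(\\w+))$", Pattern.MULTILINE);

    // Rewrites googlemail addresses to gmail and leaves all other lines as they are.
    static String normalize(String emails) {
        StringBuffer sb = new StringBuffer();
        Matcher m = EMAIL.matcher(emails);
        while (m.find()) {
            if ("googlemail".equals(m.group(3))) {
                // Copies the text before this match, then the replacement.
                m.appendReplacement(sb, "$1@gmail.$4");
            }
        }
        // Copies whatever is left after the last replacement.
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample data.
        String emails = "alice@googlemail.com\nbob@yahoo.com\ncarol@googlemail.org";
        System.out.println(normalize(emails));
        // alice@gmail.com
        // bob@yahoo.com
        // carol@gmail.org
    }
}
```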

Examples

String EMAIL_PATTERN = "([a-zA-Z0-9._]+)@((\\w+\\.)+\\w+)";
Matcher m = Pattern.compile(EMAIL_PATTERN).matcher("richd.yang@gmail.com");
while(m.find()) {
    System.out.printf("Found \"%s\" (%d - %d) %n", m.group(), m.start(), m.end());
    for(int i=0; i <= m.groupCount(); ++i) {
        System.out.printf("Group %d: %s (%d - %d) %n", i, m.group(i), m.start(i), m.end(i));
    }
}

Found "richd.yang@gmail.com" (0 - 20)
Group 0: richd.yang@gmail.com (0 - 20)
Group 1: richd.yang (0 - 10)
Group 2: gmail.com (11 - 20)
Group 3: gmail. (11 - 17)

Now we can get the email domain from group 2.

// each octet: 0-9, 10-99, 100-199, 200-249, or 250-255
String IPADDRESS_PATTERN =
        "^([0-9]|[1-9]\\d|1\\d\\d|2[0-4]\\d|25[0-5])\\." +
         "([0-9]|[1-9]\\d|1\\d\\d|2[0-4]\\d|25[0-5])\\." +
         "([0-9]|[1-9]\\d|1\\d\\d|2[0-4]\\d|25[0-5])\\." +
         "([0-9]|[1-9]\\d|1\\d\\d|2[0-4]\\d|25[0-5])$";
Matcher m = Pattern.compile(IPADDRESS_PATTERN).matcher("10.233.29.5");
while(m.find()) {
    System.out.printf("Found \"%s\" (%d - %d) %n", m.group(), m.start(), m.end());
    for(int i=0; i <= m.groupCount(); ++i) {
        System.out.printf("Group %d: %s (%d - %d) %n", i, m.group(i), m.start(i), m.end(i));
    }
}

Found "10.233.29.5" (0 - 11)
Group 0: 10.233.29.5 (0 - 11)
Group 1: 10 (0 - 2)
Group 2: 233 (3 - 6)
Group 3: 29 (7 - 9)
Group 4: 5 (10 - 11)

reference: how to validate ip address

Further references