Regular expressions in bash

Regular expressions (REs) in bash work much like REs in other common languages (C#, Java, PHP, Python...). The IEEE POSIX standard defines two RE syntaxes, basic and extended. The basic syntax is the default mode for tools such as grep and sed, while bash's own [[ string =~ regex ]] test uses the extended syntax.

The metacharacters most commonly used in bash are:

  • [ ]

    Bracket expression. Matches any single character from the enclosed set.

  • .

    The dot matches any single character.

  • *, +, ?

    Match the preceding item zero or more times, one or more times, or zero or one time, respectively.

  • ^, $

    Match the empty string at the start and at the end of a line, respectively.

  • {m}, {n,}, {m,n}

    Match the preceding item exactly m times, at least n times, or between m and n times, respectively.

  • |

    The alternation operator matches either the expression before it or the expression after it.
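
The metacharacters above can be tried out with a small sketch like the following. It assumes a POSIX-compatible grep; count_matches and the sample word list are hypothetical names made up for illustration.

```shell
# count_matches is a hypothetical helper, not a standard command.
# It prints how many lines of the input text match an extended RE.
count_matches() {
    # $1: the pattern, $2: the input text (one candidate per line)
    printf '%s\n' "$2" | grep -E -c -- "$1"
}

words='cat
cot
coat
ct
cart'

count_matches 'c[ao]t' "$words"     # bracket expression: cat, cot -> 2
count_matches 'c.t' "$words"        # any single character: cat, cot -> 2
count_matches 'co*t' "$words"       # zero or more "o": cot, ct -> 2
count_matches 'co+t' "$words"       # one or more "o": cot -> 1
count_matches '^c.{2}t$' "$words"   # anchors and interval: coat, cart -> 2
count_matches 'cat|cart' "$words"   # alternation: cat, cart -> 2
```

The patterns are single-quoted so that the shell passes them to grep unchanged.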

Note that the precedence of operators matters here:

parentheses > repetition > concatenation > alternation
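
A short sketch of how this ordering changes what a pattern matches. It assumes a POSIX-compatible grep, whose -x option requires the whole line to match; matches_whole is a hypothetical helper name.

```shell
# matches_whole is a hypothetical helper: it succeeds only when the
# whole string $2 matches the extended RE $1.
matches_whole() {
    printf '%s\n' "$2" | grep -E -x -q -- "$1"
}

# Repetition binds tighter than concatenation: 'ab*' means 'a(b*)'.
matches_whole 'ab*' 'abbb' && echo "ab* matches abbb"
matches_whole 'ab*' 'abab' || echo "ab* does not match abab"
matches_whole '(ab)*' 'abab' && echo "(ab)* matches abab"

# Concatenation binds tighter than alternation: 'ab|cd' means '(ab)|(cd)'.
matches_whole 'ab|cd' 'acd' || echo "ab|cd does not match acd"
matches_whole 'a(b|c)d' 'acd' && echo "a(b|c)d matches acd"
```

Parentheses are the way to override the default grouping, as the third and fifth calls show.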

In basic mode, the metacharacters ( ), { }, ?, + and | act as operators only when escaped with \ (POSIX defines \( \) and \{ \}; \?, \+ and \| are GNU extensions). Heavy escaping is tedious to write and reduces readability in long REs. In such cases, we may prefer the extended syntax, in which these metacharacters work without escaping.
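
The difference in escaping can be seen side by side in a sketch like this one. It uses only grouping and intervals, which POSIX defines for both syntaxes; the sample input is made up for illustration.

```shell
sample='abab
(ab)
aa
a{2}'

# Basic syntax: \( \) group and \{ \} repeat; bare ( ) { } are literal.
printf '%s\n' "$sample" | grep -x '\(ab\)\{2\}'    # prints: abab
printf '%s\n' "$sample" | grep -x '(ab)'           # prints: (ab)
printf '%s\n' "$sample" | grep -x 'a{2}'           # prints: a{2}

# Extended syntax: bare ( ) { } are operators; \( and \) are literal.
printf '%s\n' "$sample" | grep -E -x '(ab){2}'     # prints: abab
printf '%s\n' "$sample" | grep -E -x '\(ab\)'      # prints: (ab)
printf '%s\n' "$sample" | grep -E -x 'a{2}'        # prints: aa
```

The same text 'a{2}' is a literal string in basic mode and an interval in extended mode, so it is worth checking which mode a command runs in before reusing a pattern.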

An example of RE in basic and extended syntax:

  # basic syntax (\| is a GNU extension; \( \) group both alternatives between ^ and $)
  grep '^\([0-9]\{3\}-[0-9]\{3\}-[0-9]\{4\}\|([0-9]\{3\}) [0-9]\{3\}-[0-9]\{4\}\)$' file.txt

  # extended syntax
  grep -E '^(\([0-9]{3}\) [0-9]{3}-[0-9]{4}|[0-9]{3}-[0-9]{3}-[0-9]{4})$' file.txt

In this example, the single-quoted expressions passed to grep match each line of file.txt that consists of a telephone number in the form 123-456-7890 or (123) 456-7890.

Here, the option -E enables the extended syntax for grep. We can see that in the extended syntax, we escape metacharacters such as ( and ) only to get their literal meaning. A hyphen is already literal outside a bracket expression, so it needs no backslash. Note also that [ ], *, ^ and $ act as metacharacters without a backslash in both the basic and the extended syntax.
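
The same extended pattern can also check a single string rather than a file. The following is a minimal sketch, assuming a POSIX-compatible grep; is_phone is a hypothetical helper name and the candidate strings are made up.

```shell
# is_phone is a hypothetical helper: it succeeds when $1 is exactly
# a number of the form 123-456-7890 or (123) 456-7890.
is_phone() {
    printf '%s\n' "$1" |
        grep -E -q '^(\([0-9]{3}\) [0-9]{3}-[0-9]{4}|[0-9]{3}-[0-9]{3}-[0-9]{4})$'
}

for candidate in '123-456-7890' '(123) 456-7890' '123 456 7890' 'call 123-456-7890'; do
    if is_phone "$candidate"; then
        echo "match:    $candidate"
    else
        echo "no match: $candidate"
    fi
done
```

Only the first two candidates match. The last one is rejected because the group sits between ^ and $, so extra text around the number is not allowed.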

The abilities of bash REs are limited compared with Perl REs. As Perl REs have become a de facto standard for REs, many modern languages have adopted a Perl-style syntax for its ease of reading and expressive power.


Common questions:

  1. How to match a space character?

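    ' ', '\s' or '[[:space:]]'. A quoted literal space matches only the space character itself. [:space:] is a POSIX character class that matches any whitespace character (including tabs) and must be written inside a bracket expression. \s is a GNU shorthand for [[:space:]].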

    Other commonly used character classes are [[:upper:]], [[:lower:]], [[:alnum:]] (for alphanumeric characters) and [[:alpha:]] (for alphabetic characters).
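
    A quick sketch of these classes in use, assuming a POSIX-compatible grep and plain ASCII input. The sample lines are made up for illustration.

```shell
# Four sample lines; the third one contains a tab character.
input=$(printf 'hello world\nHelloWorld\ntab\tseparated\nABC123\n')

# Count the lines that contain any whitespace (space or tab).
printf '%s\n' "$input" | grep -c '[[:space:]]'        # prints: 2

# Lines made up only of alphabetic characters.
printf '%s\n' "$input" | grep -E -x '[[:alpha:]]+'    # prints: HelloWorld

# Lines made up only of alphanumeric characters.
printf '%s\n' "$input" | grep -E -x '[[:alnum:]]+'    # prints: HelloWorld and ABC123

# Lines that start with an uppercase letter followed by a lowercase one.
printf '%s\n' "$input" | grep '^[[:upper:]][[:lower:]]'   # prints: HelloWorld
```

    Which characters belong to each class depends on the current locale, so results may differ for non-ASCII text.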

References:

http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_04_01.htm
http://en.wikipedia.org/wiki/Regular_expression#Standards