Regular Expressions
Whazzit this regular expressions thang?
Regular expressions are a tool used at the Unix command line, in many programs --like BBEdit for the Mac-- and in various programming languages --such as Java, Perl and JavaScript-- to search for simple phrases or complex patterns.
The idea of regular expressions traces back to American mathematician Stephen Kleene, who developed them as a way of describing "the algebra of regular sets." Via early work on search algorithms, they found their way into early Unix command-line tools used to manipulate text.
Keeping it simple
Regular expressions can be thought of as templates. A template lets the computer find everything that matches it. The beauty of regular expressions is that they can be as simple or almost as complex as you'd like. If you wanted to find every instance of regular in a file called foobar, you'd search this way:
> egrep "regular" foobar
Change...
Say you wanted to change every instance of regular to constipated --it's been a bad month, you see-- in the mysterious foobar file. You could use sed to do this. Let's put the results in a new file, though, called barf:
> sed 's/regular/constipated/g' foobar > barf
As in most tools that support regular expression substitution, the s at the beginning indicates substitution, while the g at the end indicates global: replace every match on each line rather than only the first.
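Here is a minimal sketch of that substitution that you can run safely. The contents of foobar and the extra file barf_first_only are made up for illustration. It runs sed twice, once with the g and once without, so you can see the difference.

```shell
# Work in a scratch directory so nothing real gets overwritten.
tmp=$(mktemp -d) && cd "$tmp"

# Hypothetical contents for foobar; the first line has two matches.
printf 'regular coffee, regular hours\nnothing to see here\n' > foobar

# With g, every match on each line is replaced.
sed 's/regular/constipated/g' foobar > barf

# Without g, only the first match on each line is replaced.
sed 's/regular/constipated/' foobar > barf_first_only

all_matches=$(cat barf)
first_only=$(cat barf_first_only)
printf '%s\n---\n%s\n' "$all_matches" "$first_only"

# Tidy up the scratch directory.
cd / && rm -rf "$tmp"
```

Lines with no match, like the second one here, pass through unchanged either way.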
Getting a Little More Complex...
Regular expressions can include all sorts of little characters --metacharacters-- that do special things, like standing for any character, any digit or some position on a line of text.
Anything | . | Matches any one character |
Anything in | [...] | Matches any character listed between the brackets |
Anything but | [^...] | Matches any character except those listed between [^ and ] |
Anything between | [.-.] | Matches any character in the range |
Now, let's say you want to find every instance of EXPORT followed by a number (EXPORT1, EXPORT2, etc.) in our mysterious foobar file. You could do it this way --again, we're dumping the results into barf:
> egrep "EXPORT." foobar > barf
This trick will find EXPORT1 and EXPORT2, along with anything else where EXPORT is followed by at least one more character.
But then we find we've got lots of things we don't want, like EXPORTA and EXPORTO. Whoops! A lot of Spanish-language stuff here, when all we want is EXPORT followed by a number. So we can do this:
> egrep "EXPORT[0-9]" foobar > barf
Or we might decide we want all the ones not ending in A or O:
> egrep "EXPORT[^AO]" foobar > barf
Or, mixing it all up, let's say we want EXPORT followed by anything from B to N or from 0 to 9:
> egrep "EXPORT[B-Nb-n0-9]" foobar > barf
We don't include a space inside the brackets, because then we'd match every EXPORT followed by a space. Not good. Also, note that we've included separate listings for upper and lower case!
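To see how those four searches differ, here is a small sketch run against a made-up foobar (the file contents are an assumption, one candidate per line). It uses grep -E, which is the same thing as egrep, and sets the C locale so that ranges like B-N follow plain ASCII order.

```shell
# Plain ASCII ordering for ranges such as [B-N].
export LC_ALL=C

tmp=$(mktemp -d) && cd "$tmp"

# Hypothetical foobar contents, one candidate per line.
printf '%s\n' EXPORT1 EXPORT2 EXPORTA EXPORTO EXPORTc EXPORT > foobar

# . needs one character after EXPORT, so the bare EXPORT line is skipped.
any=$(grep -E "EXPORT." foobar | paste -s -d ' ' -)

# [0-9] keeps only the lines where a digit follows.
digits=$(grep -E "EXPORT[0-9]" foobar | paste -s -d ' ' -)

# [^AO] keeps lines where the next character is anything except A or O.
not_ao=$(grep -E "EXPORT[^AO]" foobar | paste -s -d ' ' -)

# Several ranges can share one pair of brackets.
ranges=$(grep -E "EXPORT[B-Nb-n0-9]" foobar | paste -s -d ' ' -)

echo "any:    $any"
echo "digits: $digits"
echo "not AO: $not_ao"
echo "ranges: $ranges"

cd / && rm -rf "$tmp"
```

Note that the line holding only EXPORT never shows up: every one of these patterns demands one more character after it.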
But I want a Baker's Dozen
0-1 Times | ? | Matches the preceding element zero or one time |
0+ Times | * | Matches the preceding element zero or more times |
1+ Times | + | Matches the preceding element one or more times |
X Times | {#} | Matches the preceding element # times |
X to Y Times | {min,max} | Matches the preceding element from min to max times |
Say we've got a bad keyboard, and the r key sticks a little. So, we replace the key, but we still run across problems, like exporrt. Let's say we want to make sure we find every export, however many r's it has, followed by a letter from A to Z. We'd do this:
> egrep "expo[r]+[A-Za-z]" foobar > barf
As I type this, I'm in the process of mauling large batches of text files with names like archive_1 and so on. And I've got to bash those against another set of text files. Let's say I want to find every archive up to 99. I'll need to look for names beginning with archive_ and ending with one or two digits. So, I could:
> egrep "archive_[0-9]{1,2}" foobar > barf
But only at the End!
Line start | ^ | Matches at the start of the line |
Line end | $ | Matches at the end of the line |
Word start | \< | Matches at the beginning of a word |
Word end | \> | Matches at the end of a word |
Word Edge | \b | Matches at the beginning or the end of a word |
Word Middle | \B | Matches at any position that is not the beginning or end of a word |
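A quick sketch of the two line anchors, using a made-up foobar with three lines. Only ^ and $ are shown, since they work in every grep; the word-boundary escapes in the table are extensions that GNU grep supports, so check your own grep before relying on them.

```shell
tmp=$(mktemp -d) && cd "$tmp"

# Hypothetical foobar contents.
printf '%s\n' 'export now' 'now export' 'export' > foobar

# ^ pins the match to the start of the line.
starts=$(grep -E '^export' foobar | paste -s -d ',' -)

# $ pins the match to the end of the line.
# Single quotes keep the shell from touching the $.
ends=$(grep -E 'export$' foobar | paste -s -d ',' -)

# Both together: the line must be exactly export and nothing else.
whole=$(grep -E '^export$' foobar | paste -s -d ',' -)

echo "starts with: $starts"
echo "ends with:   $ends"
echo "whole line:  $whole"

cd / && rm -rf "$tmp"
```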
Odds and Ends
Special Characters: To search for special characters --asterisks, slashes, periods, etc.-- you use what's called the escape character. Usually, this is the backslash: put it in front of the special character to search for that character itself.
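For instance, here is a small sketch of what escaping a period buys you. The file contents are made up for illustration.

```shell
tmp=$(mktemp -d) && cd "$tmp"

# Hypothetical foobar contents: one real file name and one near miss.
printf '%s\n' 'foo1.html' 'foo1xhtml' > foobar

# Unescaped, the . matches any character, so both lines match.
unescaped=$(grep -E 'foo1.html' foobar | paste -s -d ' ' -)

# Escaped, \. matches only a literal period.
escaped=$(grep -E 'foo1\.html' foobar | paste -s -d ' ' -)

echo "unescaped: $unescaped"
echo "escaped:   $escaped"

cd / && rm -rf "$tmp"
```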
Alternation, or this or that: To search for one thing or another, put the pipe symbol ( "|" ) between the two choices. So, to search for foo or bar in a file called bubba.html, you'd do something like:
> egrep "foo|bar" bubba.html
This gets really useful in searching for alternate spellings, like gray and grey:
> egrep "gr(a|e)y" Important.files
Backreferences: Backreferences allow you to store things you've matched and reuse them later, particularly in replacements:
> sed -E 's/foo([0-9]\.html)/archive\1/g' foobar >> fraggled_files
This will take every instance of foo followed by a number followed by .html and change it to archive and then the number and then .html. So foo1.html would become archive1.html. Unlike most search tools, this allows us to change according to a template. With a simpler tool, we'd have to change foo1.html to archive1.html, then foo2.html to archive2.html and so on. Ugh.
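Here is a runnable sketch of that template-style replacement. It assumes a sed that accepts -E for extended regular expressions (GNU and BSD sed both do), and the input lines in foobar are made up for illustration.

```shell
tmp=$(mktemp -d) && cd "$tmp"

# Hypothetical foobar contents: two names that fit the template, one that doesn't.
printf '%s\n' 'see foo1.html and foo2.html' 'fooX.html stays put' > foobar

# The parentheses remember the digit and the .html; \1 pastes them back in.
sed -E 's/foo([0-9]\.html)/archive\1/g' foobar > fraggled_files

renamed=$(cat fraggled_files)
echo "$renamed"

cd / && rm -rf "$tmp"
```

The second line comes through untouched, because X is not a digit and so the template never matches.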
Metacharacters and Gew-Gaws
In each section, the most common metacharacters are set off from less common ones by an extra line break.
Simple Characters
- . Matches any character except newline
- \d Matches a digit; same as [0-9]
- \D Matches a non-digit, same as [^0-9]
- \w Matches an alphanumeric (word) character [a-zA-Z0-9_]
- \W Matches a non-word character [^a-zA-Z0-9_]
Odd Characters
- \s Matches a whitespace char (space, tab, newline...)
- \S Matches a non-whitespace character
- \n Matches newline
- \r Matches a carriage return
- \t Matches a tab
- \f Matches a formfeed
- \b Matches a backspace (inside [] only)
- \0 Matches a null character
- \000 Also matches a null character
- \nnn Matches an ASCII character of that octal value
- \xnn Matches an ASCII character of that hexadecimal value
- \cX Matches an ASCII control character
- \metachar Matches the character itself (to override the normal meaning of special characters, e.g. ^ matches beginning of line, but \^ actually looks for the '^' symbol)
Multiple Characters
- abc Matches a, b and c, in that order
- [a-z0-9] Matches any single character of set.
- [^a-z0-9] Matches any single character not in set (in this context ^ means 'not').
- x? Matches 0 or 1 x's, where x is any of above
- x* Matches 0 or more x's
- x+ Matches 1 or more x's
- x{m,n} Matches at least m x's but no more than n
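The repetition operators in this list can be compared side by side. This is only a sketch: the two sample files, words and archives, are hypothetical, built from the sticky-r and archive_ examples used earlier.

```shell
tmp=$(mktemp -d) && cd "$tmp"

# Hypothetical sample files.
printf '%s\n' expot export exporrt > words
printf '%s\n' archive_1 archive_42 archive_100 > archives

# r? allows zero or one r.
zero_or_one=$(grep -E 'expor?t' words | paste -s -d ' ' -)
# r* allows any number of r's, including none.
zero_or_more=$(grep -E 'expor*t' words | paste -s -d ' ' -)
# r+ needs at least one r.
one_or_more=$(grep -E 'expor+t' words | paste -s -d ' ' -)
# r{2} needs exactly two r's.
exactly_two=$(grep -E 'expor{2}t' words | paste -s -d ' ' -)

# {1,2} asks for one or two digits, but without anchors the first
# two digits of archive_100 are enough for a match.
loose=$(grep -E 'archive_[0-9]{1,2}' archives | paste -s -d ' ' -)
# Anchoring both ends keeps the count honest.
strict=$(grep -E '^archive_[0-9]{1,2}$' archives | paste -s -d ' ' -)

echo "?      : $zero_or_one"
echo "*      : $zero_or_more"
echo "+      : $one_or_more"
echo "{2}    : $exactly_two"
echo "loose  : $loose"
echo "strict : $strict"

cd / && rm -rf "$tmp"
```

The last two lines are the ones to watch: a repeat count limits what the pattern consumes, not what surrounds the match.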
Or
- fee|fie|foe Matches any of fee, fie, or foe
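One thing worth trying with alternation is leaving out the parentheses. In this sketch (the file name colors and its contents are made up), the grouped pattern finds the two spellings of one word, while the ungrouped one means "gra or ey" and picks up a stray.

```shell
tmp=$(mktemp -d) && cd "$tmp"

# Hypothetical sample file.
printf '%s\n' gray grey groy they > colors

# Grouped: gr, then a or e, then y.
grouped=$(grep -E 'gr(a|e)y' colors | paste -s -d ' ' -)

# Ungrouped: the | splits the whole pattern into gra OR ey.
ungrouped=$(grep -E 'gra|ey' colors | paste -s -d ' ' -)

echo "grouped:   $grouped"
echo "ungrouped: $ungrouped"

cd / && rm -rf "$tmp"
```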
Boundaries
- \< At the start of a word
- \> At the end of a word
- \b At either end of a word (outside [] only)
- \B Inside a word
- ^ At the beginning of a line or string
- $ At the end of a line or string
Remembering
- (abc) Remembers the match for later backreferences.
- \1 Use whatever was in the first backreference
- \2 Use what's in the second backreference
- \3 and so on...
Resources
Regular Expressions for Poets: a good piece that tries to introduce regular expressions to the non-technical.
Steve Ramsay's Guide to Regular Expressions
Mastering Regular Expressions: Powerful Techniques for Perl and Other Tools (Nutshell Handbook) by Jeffrey E. F. Friedl, edited by Andy Oram, published by O'Reilly & Associates