2009-07-27

Regular Expressions in Vim

Regular expressions are a fantastic thing to keep in your arsenal. To get a rough idea of what regular expressions are for, check out this xkcd comic on regular expressions if you haven't already seen it. Basically, regular expressions allow you to find and act on patterns in a file or group of files. I'm going to look at how you can use regular expressions with Vim, but they are certainly available in many other tools throughout Linux.

When learning a new programming language or programming concept, I find it useful to see examples of code that other people have written. Unfortunately, this approach is probably not quite as helpful for regular expressions. Reading them can be quite tricky, and deciphering one may take a fair amount of time and energy if you don't have any notes handy to explain it. Just a hint from my personal experience: you should probably leave a comment in any program or script you write to explain each regular expression it contains. If you don't, you may find yourself looking back at the regular expressions in your code and feeling like you're reading brainfuck. The best way to learn regular expressions, in my experience, is to dive right in and start writing them yourself.

That all said, let's get started with where the bulk of the work is done in regular expressions: metacharacters. Metacharacters, most of which are written as a backslash followed by a letter, are special characters that stand for a whole class of characters rather than for themselves. That maybe doesn't make a whole lot of sense as you read it, but, hopefully, it will be clearer with the following list of metacharacters:
  • . - Represents any character except the "new line" character

  • \n - Represents the "new line" character

  • \s - Represents any whitespace character (e.g. space, tab, etc.)

  • \S - Represents any non-whitespace character

  • \d - Represents any numerical digit (0-9)

  • \D - Represents any non-digit character

  • \x - Represents any hex digit (0-9, a-f), case insensitive

  • \X - Represents any non-hex digit

  • \o - Represents any octal digit (0-7)

  • \O - Represents any non-octal digit

  • \h - Represents any head of word character (a-z,A-Z,_)

  • \H - Represents any non-head of word character

  • \p - Represents any printable character

  • \P - Represents any non-digit printable character

  • \w - Represents any word character

  • \W - Represents any non-word character

  • \a - Represents any alphabetic character

  • \A - Represents any non-alphabetic character

  • \l - Represents any lowercase character

  • \L - Represents any non-lowercase character

  • \u - Represents any uppercase character

  • \U - Represents any non-uppercase character
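
To see a few of these in action without touching a file, here's a small sketch using Vim's built-in matchstr() function, which returns the first piece of a string that matches a pattern. The sample text and the variable name are made up for illustration. Save the lines to a file and :source it, or type them one at a time at the : prompt.

```vim
" Made-up sample text to search in
let sample = 'Room 42B costs 0x1F dollars'

" \d\d : two digits in a row
echo matchstr(sample, '\d\d')

" \u\l\l\l : an uppercase letter followed by three lowercase letters
echo matchstr(sample, '\u\l\l\l')

" 0x\x\x : the literal text 0x followed by two hex digits
echo matchstr(sample, '0x\x\x')
```

These should print 42, Room and 0x1F in that order.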

I know, that's quite a list, but once you start writing regular expressions, referring to it isn't too bad, and before long you'll find you don't need it at all. Now that we've covered the metacharacters, the next thing to look at is how to say how many times something should be repeated. The answer is another class of special characters, called quantifiers. A quantifier applies to the single item immediately before it (that's what "the preceding characters" means in the list below). Quantifiers can be either greedy or non-greedy. Greedy quantifiers try to match as many times as possible. Non-greedy quantifiers try to match as few times as possible. This should be clear once I run down the list of quantifiers and use them in a couple of examples:
  • * - Matches 0 or more of the preceding characters, as many as possible (greedy)

  • \{-} - Matches 0 or more of the preceding characters, as few as possible (non-greedy)

  • \+ - Matches 1 or more of the preceding characters, as many as possible (greedy)

  • \= - Matches 0 or 1 of the preceding characters

  • \{n} - Matches the preceding characters exactly n times

  • \{n,m} - Matches the preceding characters at least n times and at most m times, as many as possible (greedy)

  • \{-n,m} - Matches the preceding characters at least n times and at most m times, as few as possible (non-greedy)

  • \{n,} - Matches the preceding characters at least n times, as many as possible (greedy)

  • \{-n,} - Matches the preceding characters at least n times, as few as possible (non-greedy)

  • \{,m} - Matches the preceding characters at most m times, as many as possible (greedy)

  • \{-,m} - Matches the preceding characters at most m times, as few as possible (non-greedy)
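
The greedy versus non-greedy difference is easiest to see side by side. This is just a sketch using Vim's built-in matchstr() function (it returns the first match in a string); the sample strings are made up.

```vim
" Made-up sample text with two pairs of angle brackets
let text = '<b>bold</b> and <i>italic</i>'

" Greedy: .* runs all the way to the last > in the text
echo matchstr(text, '<.*>')

" Non-greedy: .\{-} stops at the first > it can
echo matchstr(text, '<.\{-}>')

" \{2,3} takes three digits when it can; \{-2,3} settles for two
echo matchstr('12345', '\d\{2,3}')
echo matchstr('12345', '\d\{-2,3}')
```

The first call returns the whole string, the second returns only <b>, and the last two return 123 and 12.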

We have enough to go through some common examples.

Find dates in YYYY-MM-DD format (two equivalent expressions):

/\d\d\d\d-\d\d-\d\d
/\d\{4}-\d\{2}-\d\{2}
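
If you'd like to convince yourself that the two forms really are equivalent, here's a quick check using Vim's built-in matchstr() function on a made-up line of text (the variable names are mine, purely for illustration):

```vim
" Made-up sample line containing one date
let line = 'Posted on 2009-07-27 at noon'

" The spelled-out pattern and the counted pattern
let long_form = matchstr(line, '\d\d\d\d-\d\d-\d\d')
let short_form = matchstr(line, '\d\{4}-\d\{2}-\d\{2}')

" Show what was found, then 1 if both patterns found the same text
echo long_form
echo short_form ==# long_form
```

Both patterns pick out 2009-07-27, so the last line prints 1.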

There is another point that I should make: the / character marks the start and end of the pattern in Vim's search and substitute commands, so a literal / inside a pattern has to be escaped as \/. For example, if you want to find dates in MM/DD/YYYY format:

/\d\{1,2}\/\d\{1,2}\/\d\{4}
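
The escaping is only needed because / delimits the pattern in a search command. As a rough illustration, Vim's built-in matchstr() function takes the pattern as a quoted string, so there a plain / works, and the escaped form still matches the same text. The sample line is made up.

```vim
" Made-up sample line with two dates
let line = 'Due 7/4/2009 or 07/27/2009'

" Inside a quoted pattern string, / is not a delimiter, so no escape is needed
echo matchstr(line, '\d\{1,2}/\d\{1,2}/\d\{4}')

" The escaped form used by the search command finds the same text
echo matchstr(line, '\d\{1,2}\/\d\{1,2}\/\d\{4}')
```

Both calls should return 7/4/2009, the first date on the line.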

Before we can do much with finding and replacing text within Vim, there is one more thing we should go over. A lot of times when finding and replacing text, we will want to keep part of the pattern that we find and put it in a different place (or even keep it in the same place). This will probably be easier to show with an example. The following will convert all dates from MM/DD/YYYY format to YYYY-MM-DD format:

:%s/\(\d\{2}\)\/\(\d\{2}\)\/\(\d\{4}\)/\3-\1-\2/g

There are a couple of things to explain in the above example. The \( and \) characters are not actually searched for; they mark off the parts of the pattern that you want to keep and reuse in the replacement text. In the replacement, \1, \2, \3 and so on refer to those marked-off parts in order (not to be confused with \n, the "new line" metacharacter). In other words, the first thing in the "find" portion of the command surrounded by \( and \) maps to \1, the second to \2, and so on. Note that this pattern expects two-digit months and days, so a date like 7/4/2009 is left alone.
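
Here's the same find-and-replace as a small sketch you can run on a string instead of a file, using Vim's built-in substitute() function. Its arguments are the text, the pattern, the replacement and the flags. The sample line is made up.

```vim
" Made-up sample line with two MM/DD/YYYY dates
let line = 'Start 07/27/2009, end 12/31/2009'

" \1 = month, \2 = day, \3 = year, numbered by the order of the \( \) groups
echo substitute(line, '\(\d\{2}\)\/\(\d\{2}\)\/\(\d\{4}\)', '\3-\1-\2', 'g')
```

This should print: Start 2009-07-27, end 2009-12-31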

What if you want to turn a list with each value on its own line into a comma-separated list? Here's how:

:%s/\n/,/
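
One thing to watch for: as far as I can tell, the last line's "new line" gets replaced too, so you end up with a trailing comma. The sketch below models a three-line file as a single string (that modelling is my assumption, not how Vim stores a buffer) and uses the built-in substitute() and join() functions to show the difference. The g flag is needed here only because the whole text is one string.

```vim
" A three-line file modelled as one string, with a newline ending each line
let buffer_text = "apple\nbanana\ncherry\n"

" Replacing every newline leaves a comma after the last item
echo substitute(buffer_text, '\n', ',', 'g')

" join() on a list of lines puts commas only between the items
echo join(['apple', 'banana', 'cherry'], ',')
```

The first call gives apple,banana,cherry, (with the extra comma) and the second gives apple,banana,cherry.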

What about that problem from the xkcd comic linked above? To find text formatted as an address, I will assume the following about the address:
  1. The address is of the form that you would use to mail a letter in the United States

  2. The name is two words (i.e. first and last name only)

  3. The house number is no more than 4 digits

  4. The street name ends with a common abbreviation (st., ave., blvd., etc.), none longer than four characters

  5. The zip code is in extended form: #####-####

Before the example, I should explain that since the . character is a metacharacter, you need to escape it with a backslash (\.) if you want to search for a literal period. Also, Vim uses the ^ character to denote the beginning of a line and the $ character to denote the end of a line. Here's my example of a search command to find text formatted like an address:

/^\w*\s\w*\n\d\{,4}\P*\w\{2,4}\.\n\P*,\s\u\{2}\s\d\{5}-\d\{4}$

I know that's kind of long, but let me break this example down one part at a time:
  • ^ - Makes sure that it starts finding the address at the beginning of a line

  • \w* - This is to find the first name

  • \s - This is to find the space between the first and last name

  • \w* - This is to find the last name

  • \n - This is to make sure that there is nothing else on this line

  • \d\{,4} - This looks for a house number of at most 4 digits (note that \{,4} also accepts no digits at all)

  • \P* - This looks for the text of the street name, and takes into account street names with hyphens or multiple words

  • \w\{2,4}\. - This looks for the street ending (e.g. st. or blvd.) ending with a period character

  • \n - This makes sure that there is nothing else on this line

  • \P*, - This looks for the city name ending with a comma character, and takes into account cities with hyphens or multiple words

  • \s - Makes sure there is a space between the city and state abbreviation

  • \u\{2} - This looks for the two-letter state abbreviation

  • \s - This makes sure there is a space between the state abbreviation and the zip code

  • \d\{5} - This looks for the first five digits of the extended zip code

  • - - This makes sure there is a hyphen separating the two sections of the zip code

  • \d\{4} - This looks for the last four digits of the extended zip code

  • $ - This makes sure that there is nothing else on this line
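
If you want to try the pattern without typing an address into a file, here's a rough sketch that tests it against two made-up addresses stored as strings, with \n standing in for the line breaks. It uses Vim's =~# operator, which gives 1 when the text matches the pattern (case-sensitively) and 0 when it doesn't. The names and addresses are invented for the test.

```vim
" The address pattern from above, without the leading / of the search command
let pattern = '^\w*\s\w*\n\d\{,4}\P*\w\{2,4}\.\n\P*,\s\u\{2}\s\d\{5}-\d\{4}$'

" Invented address that follows all five assumptions
let good = "John Smith\n1234 Main St.\nSpringfield, IL 62704-1234"

" Same address, but with a plain five-digit zip code
let short_zip = "John Smith\n1234 Main St.\nSpringfield, IL 62704"

" 1 means the text matches the pattern, 0 means it does not
echo good =~# pattern
echo short_zip =~# pattern
```

The first address should print 1 and the second 0, since the pattern insists on the extended zip code.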

I know that this may all seem rather daunting, especially if you've never really worked with regular expressions, but the best advice I can give is what I said before: start trying to write your own regular expressions. Sure, you'll make mistakes at first; heck, I still make mistakes when I write regular expressions. But, at least you can learn from your mistakes.

Well, that pretty much does it for tonight's post. Have fun writing regular expressions. See you next time.
