Searching strings is a common task in programming. At one time or another, we've all programmatically looked for a substring within another string. One plausible example is searching for "gmail" within an email address. But what if we want to validate an email address? A candidate for a valid email address must follow a certain strict pattern: a local part, an @ symbol, and then a domain. As valid email addresses vary widely, from dictatortots12@yahoo.com to team.edward4evar@hotmail.com, it's clear that simple substring searches will not suffice. But a specific pattern laying out which characters are acceptable, in what particular order, and how many, can be used to validate complex, widely varying strings like email addresses.
Enter regular expressions, or regex for short. Regular expressions are a language for defining such patterns. They are used for matching patterns in strings and extracting the matching text.
The following is a regular expression for a valid email address (applied case-insensitively): \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b. If that looks like intimidating, cryptic gibberish to you, you're not alone. More than a few people find regular expressions esoteric. But like any other language, it may come in short, simple patterns, or long, intricate, and wildly complex ones. Someone new to English starts learning with simple patterns, not the prose of Nabokov. So too we'll begin with conceptually the simplest regular expression pattern: literal characters.
In Ruby (as well as other programming languages), two forward slashes denote a regular expression pattern. The =~ operator in Ruby tests a pattern against a string. The example below tests the string "tidepools" for the literal character 'd':
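A minimal sketch of that test, as you might type it into irb (the variable name position is just for illustration):

```ruby
# =~ returns the index of the first match, or nil when nothing matches.
position = "tidepools" =~ /d/
p position   # prints 2
```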
You might have guessed that the return value, 2, is the zero-based position in the string where the pattern was first found. Now let's see a non-matching pattern:
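Here is a small sketch of a failed match. The letter 'x' is only an assumed example of a character that "tidepools" does not contain:

```ruby
# "tidepools" has no 'x', so there is nothing for the pattern to match.
result = "tidepools" =~ /x/
p result   # prints nil
```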
nil is returned if the string does not match the regular expression pattern. Since every number in Ruby (even 0) is treated as true in a conditional, and nil is treated as false, we can branch control flow based on whether a regular expression matches, like so:
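A sketch of such a branch might look like this (the email variable and the printed messages are made up for illustration):

```ruby
email = "tangerine176@aol.com"

# =~ returns an index (truthy) on a match and nil (falsy) otherwise,
# so it can be used directly as the condition.
if email =~ /aol/
  puts "#{email} matches /aol/"
else
  puts "#{email} does not match /aol/"
end
```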
The string "tangerine176@aol.com" was found to match the pattern /aol/, meaning an 'a' directly followed by an 'o', directly followed by an 'l'.
Let's increase the flexibility of our patterns, which will begin to show the power of regular expressions. The /[bB]at/ pattern will match any string with "bat" or "Bat" in it. The bracket notation gives your regular expression choices: the pattern has the option to match either 'b' or 'B' characters.
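To see the bracket notation in action, here is a quick sketch. The test strings are our own picks, not anything special:

```ruby
pattern = /[bB]at/

p("bat" =~ pattern)      # 0
p("Batman" =~ pattern)   # 0
p("wombat" =~ pattern)   # 3   ("bat" starts at index 3)
p("cat" =~ pattern)      # nil ('c' is not one of the choices)
```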
To give our pattern even greater flexibility, we can use a range of possibilities: /[b-f]at/. That pattern will match any string with the substrings "bat", "cat", "dat", "eat", or "fat" because the first character can now be any character in the range from 'b' to 'f'. (Yes, you can use a range of numbers as well!)
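The sketch below tries a letter range and a digit range. The word list and variable names are assumptions chosen for the demo:

```ruby
letter_range = /[b-f]at/

# Keep only the words whose first letter falls in the range b..f.
matches = %w[bat cat dat eat fat gat].select { |word| word =~ letter_range }
p matches                  # ["bat", "cat", "dat", "eat", "fat"]

# Ranges work for digits too.
digit_range = /[0-9]at/
p("5at" =~ digit_range)    # 0
p("bat" =~ digit_range)    # nil
```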
We can take flexibility further by accepting any single character with the . pattern. The pattern /.at/ will match any single character (except a newline, by default), followed directly by an 'a', followed directly by a 't'. Matching strings could be: "Rat", "5at", or "=at".
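A short sketch of the dot at work. The last line is our own addition, showing that the dot still needs some character to be there:

```ruby
any_char = /.at/

p("Rat" =~ any_char)   # 0
p("5at" =~ any_char)   # 0
p("=at" =~ any_char)   # 0
p("at" =~ any_char)    # nil (no character before the 'a')
```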
We can also control how many times part of a pattern may repeat using quantifiers:
- * (zero or more times)
- + (one or more times)
- {n} (exactly n times)
- {n,} (at least n times)
- {n,m} (at least n times, but no more than m times)
/[bB]+at/ will match one or more b's or B's, followed by an 'a' and a 't'. The following are strings that would match such a pattern: "bat", "Bat", "bbat", "bBat", "bbbat", "BbBat", "BBBBBBBBBBBBBBbat", and so on.
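To get a feel for the quantifiers, the sketch below uses Ruby's String#[] with a pattern, which returns the matched portion of the string (or nil). That method isn't covered above; it's used here only so we can see exactly which characters each pattern grabbed:

```ruby
# + : one or more b's or B's
p("bbbat"[/[bB]+at/])       # "bbbat"
p("at"[/[bB]+at/])          # nil (needs at least one b)

# * : zero or more
p("at"[/[bB]*at/])          # "at"

# {n} : exactly n, so the match starts at the second 'b'
p("bbbat"[/[bB]{2}at/])     # "bbat"

# {n,m} : at least n, at most m
p("bbbat"[/[bB]{1,2}at/])   # "bbat"

# {n,} : at least n
p("bbbat"[/[bB]{4,}at/])    # nil (only three b's available)
```

Notice that the patterns are not tied to the start of the string, so {2} is free to skip the first 'b' and match the remaining "bbat".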
As you can see, regular expressions can cast an ever wider net, ensnaring more strings of our choosing. Hopefully, this post has given you a glimpse of the power of regular expressions. What is shown here is just the tip of the regular expression iceberg, however. See the References below to investigate further, or google for resources. Because regular expressions are such a useful and powerful tool, there are numerous tutorials, articles, and associated tools for them on the internet. Good luck and happy learning!