INTRODUCTION Regular expressions (regex) are very useful for programmers. With this tool you can describe any string that has a certain regularity inside it. We don't want to talk about formal languages or formal grammars; instead we are going to give you some examples that show how regex works. Think about a web page with a form that has the following fields:
- Name
- Surname
- Email
- Phone number
Each of these fields has a specific regularity:
- Name: one word only, made of alphabetical letters. For us this is not a compulsory field.
- Surname: one or more words made only of alphabetical letters.
- Email: made of 3 parts. The first is made of alphanumeric characters, underscores (_) and periods (.), followed by an @. The second is made of alphanumeric characters and dashes, followed by a period. The third is always made of 2 to 4 alphabetical letters. This field is compulsory.
- Phone number: made of 2 parts of digits, divided by a dash.
There are specific expressions that define them. These are the expressions:
- name: [a-zA-Z]*
- surname: [a-zA-Z' ]+
- email: [a-zA-Z0-9_\.]+@[a-zA-Z0-9-]+\.[a-zA-Z]{2,4}
- phone number: [0-9]+\-[0-9]+
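The four expressions above can be tried out with a short script. This is only a sketch in Python, assuming its standard re module; the names FIELD_PATTERNS and validate_form and the sample values are hypothetical, and the email quantifier is written as {2,4} to follow the "2 to 4 letters" description.

```python
import re

# Patterns from the list above. The email ending uses {2,4},
# following the "2 to 4 alphabetical letters" description.
FIELD_PATTERNS = {
    "name": r"[a-zA-Z]*",
    "surname": r"[a-zA-Z' ]+",
    "email": r"[a-zA-Z0-9_\.]+@[a-zA-Z0-9-]+\.[a-zA-Z]{2,4}",
    "phone": r"[0-9]+\-[0-9]+",
}


def validate_form(form):
    """Return the names of the fields whose value does not match its pattern."""
    return [
        field
        for field, pattern in FIELD_PATTERNS.items()
        # fullmatch requires the whole value to match, not just a part of it
        if re.fullmatch(pattern, form.get(field, "")) is None
    ]


# Hypothetical sample submissions
good = {"name": "", "surname": "De Luca",
        "email": "phil_99@example.com", "phone": "0004-578907"}
bad = {"name": "Phil Bob", "surname": "",
       "email": "phil@example", "phone": "0004578907"}

print(validate_form(good))
print(validate_form(bad))
```

The empty name is accepted because the star also matches an empty value, while the empty surname is rejected because the plus needs at least one character.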
- - -
CLASSES The class operator [] is made of two square brackets. Several characters can be inserted inside this metacharacter, either written as normal characters or by using constants, and it identifies a single occurrence of any one of them. The set of characters defined through this operator is called a class. For example, the class [a] represents a single occurrence of the character a: it lets you verify whether a is inside a string and, in that case, execute some operation on it. Likewise, the class [abcd] represents a single occurrence of any one of the four characters inside it, so you can verify whether they are present in a string and execute operations on them.
- - -
RANGE OPERATOR The dash - is an operator that identifies a range inside a class, for example:
- a-z for all the lower case letters
- A-Z for all the upper case letters
- 0-9 for all the digits
Apart from these classic ranges, you can always define your own. For example, a-f contains all the lower case letters from a to f, which is very useful for hexadecimal numbers: the class [a-fA-F0-9] identifies the digits and the letters from a to f (lower case and upper case), that is, all the characters that can appear in a hexadecimal number.
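To see a class and a range at work, here is a small sketch in Python (an assumption, since the article names no language). The sample strings are made up for the illustration.

```python
import re

text = "Color #1aF09c, cab 42"  # hypothetical sample string

# [abcd] matches a single occurrence of any one of the four characters
abcd_hits = re.findall(r"[abcd]", text)

# [a-fA-F0-9] matches a single hexadecimal digit
hex_hits = re.findall(r"[a-fA-F0-9]", "x1Fg")

print(abcd_hits)  # ['a', 'c', 'c', 'a', 'b']
print(hex_hits)   # ['1', 'F']
```

Each element of the result is one character, because a class without a repetition operator always selects a single occurrence.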
- - -
CLASS REPETITION Now we are going to describe the class repetition operators. The first one is the star *. It matches a class repeated zero or more times, so it selects all the consecutive occurrences of the class in a string. For example, the regular expression [a-z]* selects every run of consecutive lower case letters in the following string: I have got 7 telephone numbers, but this is my cell-phone: 0004578907
The star accepts the empty string as a valid match, so we use it to verify the NAME field: the field can be empty, but if it is not, it must be made of one word only. The regular expression for it is [a-zA-Z]*, a class that contains all the alphabetical letters, a-z and A-Z.
Very similar to the star is the plus + operator. It works in the same way, but it requires the class to be repeated one or more times. We use it for the SURNAME field, which can contain one or more words separated by spaces. The regular expression for it is [a-zA-Z' ]+, where the class also contains the apostrophe and the space.
Another operator is made of 2 braces {}, which contain either a number, as in {3}, or a numerical range, as in {12,58}. The first matches exactly 3 repetitions of the class. The second matches from 12 to 58 repetitions of the class. For example, [0-9]{3,4}\-[0-9]{7} identifies all the telephone numbers with an area code of 3 or 4 digits and a suffix of 7 digits.
- - -
BACKSLASH In the last example we also met another operator, the backslash \. Placed before a character that is an operator, it makes the regex treat that character as a normal character; placed before a letter, it forms a constant. The dash is used to indicate a range, so if we want to use it as a character we have to write it this way: \- Now you can understand the regex that we used to verify the email:
- [a-zA-Z0-9_\.]+@[a-zA-Z0-9-]+\.[a-zA-Z]{2,4}
And the one for the telephone number:
- [0-9]+\-[0-9]+
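The repetition operators and the escaped dash can be checked together. The following is a sketch in Python, assuming its standard re module. The variable names are made up, and the phone values are hypothetical.

```python
import re

sentence = "I have got 7 telephone numbers, but this is my cell-phone: 0004578907"

# + needs at least one letter, so findall returns each run of consecutive letters
words = re.findall(r"[a-z]+", sentence)

# * also accepts the empty string, so an empty NAME field is valid
empty_name_ok = re.fullmatch(r"[a-zA-Z]*", "") is not None

# {3,4} and {7} fix the number of digits; \- is a literal dash
phone = re.compile(r"[0-9]{3,4}\-[0-9]{7}")
valid_phone = phone.fullmatch("0004-5789070") is not None
short_prefix = phone.fullmatch("00-5789070") is not None

print(words)
print(empty_name_ok, valid_phone, short_prefix)  # True True False
```

The plus is used with findall here because the star would also report an empty match at every position where no letter is found.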
- - -
REPETITION OPERATOR'S SPECIFIC CHARACTERISTIC One characteristic of the repetition operators is that they select as much text as the expression allows. This can sometimes be counterproductive. If we want to remove all the tags from an HTML page, we might use the following regular expression:
- <.+>
This regex selects a series of consecutive characters inside a string: a < followed by one or more characters of any kind, followed by a >. Because the repetition takes as much as it can, within a line it selects everything between the first < and the last >, not each single tag. If this doesn't satisfy our needs, we can use one of the following methods:
- 1. <.+?>
- 2. <[^<>]+>
The first one makes the repetition operator less greedy, so that it stops at the first closing character. The second one identifies a series of characters that starts with <, followed by any characters other than < and >, followed by a >. Both select each tag separately.
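The difference between the three expressions is easy to see on a single line of HTML. This is a sketch in Python, assuming its standard re module. The sample line is hypothetical.

```python
import re

html = "<p>Hello <b>world</b></p>"  # hypothetical sample line

# Greedy: from the first < to the last > of the line
greedy = re.findall(r"<.+>", html)

# Lazy repetition: stops at the first > it finds
lazy = re.findall(r"<.+?>", html)

# Class that excludes < and >: cannot run past the end of a tag
negated = re.findall(r"<[^<>]+>", html)

# Removing the tags with the second method
text_only = re.sub(r"<[^<>]+>", "", html)

print(greedy)     # ['<p>Hello <b>world</b></p>']
print(lazy)       # ['<p>', '<b>', '</b>', '</p>']
print(negated)    # ['<p>', '<b>', '</b>', '</p>']
print(text_only)  # Hello world
```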
- - -
CLASS NEGATION Let's focus on a different problem. Suppose we have a story and we need to identify all the sentences in it. If the period is used only at the end of sentences, the easiest way to identify a sentence is to negate a class:
- [^\.]+
The ^ sign, when placed immediately after the opening bracket of a class, negates the class. In our case the regex identifies a consecutive repetition of all the characters that are not a period. Basically, it identifies a sentence.
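A quick sketch of the negated class in Python (standard re module assumed). The story is a made-up sample, and the strip call is only there to drop the space that follows each period.

```python
import re

story = "The sun rose. Birds sang. Nobody noticed"  # hypothetical sample text

# [^\.]+ selects each run of characters that are not a period
sentences = [s.strip() for s in re.findall(r"[^\.]+", story)]

print(sentences)  # ['The sun rose', 'Birds sang', 'Nobody noticed']
```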
- - -
THE PERIOD The period is a special character: in a regex it is equivalent to a class that contains all the characters except the new line. Here is an example to better understand how the period works:
- c.s.
This regex identifies all the 4-character sequences that start with c, followed by any character, followed by an s, followed by any character. It matches combinations such as:
- case
- casa
- cosa
- cose
- c%s9
- c£sl
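The behaviour of the period can be verified on a few candidates. This sketch assumes Python's standard re module. Some candidates come from the list above, and the last three are made-up counterexamples.

```python
import re

candidates = ["case", "casa", "cosa", "c%s9", "cs12", "c\nsx", "cases"]
pattern = re.compile(r"c.s.")

# fullmatch keeps only the candidates that are exactly c, any character, s, any character
matches = [word for word in candidates if pattern.fullmatch(word)]

print(matches)  # ['case', 'casa', 'cosa', 'c%s9']
```

"cs12" fails because its third character is not an s, "c\nsx" fails because the period does not match the new line, and "cases" fails because it has five characters.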
- - -
ALTERNATION OPERATOR Another very useful operator is the pipe |, which has the same function as OR. For example, the regex george|stuart identifies the word george or the word stuart inside a string:
- Both george and stuart are two famous seo, but george has a forum, stuart has a web agency.
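Applied to the sentence above, the alternation can be sketched like this in Python (standard re module assumed):

```python
import re

text = ("Both george and stuart are two famous seo, "
        "but george has a forum, stuart has a web agency.")

# The pipe matches either the word on its left or the word on its right
found = re.findall(r"george|stuart", text)

print(found)  # ['george', 'stuart', 'george', 'stuart']
```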
- - -
ANCHORS Another problem arises when you need to modify one or more elements inside a CSV (comma-separated values) database, a textual database in which fields are separated by commas and records are separated by new lines. The following database is an example that represents the daily AdSense earnings of three friends:
- 12€,50€,70€
- 30€,46€,68€
- 15€,52€,73€
- 16€,30€,85€
If one day one of the friends were banned from AdSense, his data would no longer be useful and it could be necessary to remove it. This example has very little data, so a manual change is easy. With thousands of rows, a regex would be the fastest solution. If the data of the banned friend is in the third column, the fastest way to remove it is to delete all the matches of the following regex:
- ,[0-9]*€$
The $ character doesn't identify a character but a position: the end of a line. So this regex finds every series of consecutive characters that starts with a comma, followed by some digits, followed by €, followed by the end of a line. In the same way, the ^ character identifies the beginning of a line. It has to be used carefully, because inside a class it negates the class, so always remember to use it outside a class. The $ operator must also be used outside a class: inside a class it is treated as a normal character.
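Removing a column with an anchor can be sketched as follows in Python (standard re module assumed). The sketch assumes the rows contain no spaces after the commas, and it needs the re.MULTILINE flag, which in Python makes $ and ^ work on every line instead of only on the whole text. The second substitution, with ^, is an extra illustration that is not part of the original example.

```python
import re

csv_data = "12€,50€,70€\n30€,46€,68€\n15€,52€,73€\n16€,30€,85€"

# $ anchors the match to the end of each line: only the last column is removed
without_third = re.sub(r",[0-9]*€$", "", csv_data, flags=re.MULTILINE)

# ^ anchors the match to the beginning of each line: only the first column is removed
without_first = re.sub(r"^[0-9]*€,", "", csv_data, flags=re.MULTILINE)

print(without_third)
print(without_first)
```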
- - -
GROUPS We can treat a series of characters as a single group and operate on it with the regex operators. Suppose we need to find inside a text a code whose length we don't know, made of 5 digits followed by a letter, then 5 more digits followed by a letter, and so on, until it ends with a new line. To find this code we need a group. In this example the group is made of a class of digits repeated five times, followed by a class of letters. The group has to be repeated at least once and must end at a new line. It can be written as:
- ([0-9]{5}[a-zA-Z])+$
Applied to the following lines, the regex selects the code in the first two, where it ends the line, and nothing in the third:
- My secret code is 12345T45345R12343F34567j
- Phil's secret code is 34526g54638j92725K63723H72829D12345I
- 12345T45345R12343F34567j is not Phil's code.
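The effect of the group on the three lines above can be reproduced with a short sketch in Python (standard re module assumed; the variable names are made up):

```python
import re

lines = [
    "My secret code is 12345T45345R12343F34567j",
    "Phil's secret code is 34526g54638j92725K63723H72829D12345I",
    "12345T45345R12343F34567j is not Phil's code.",
]

# The group (5 digits and a letter) is repeated at least once and must end the line
code = re.compile(r"([0-9]{5}[a-zA-Z])+$")

results = []
for line in lines:
    match = code.search(line)
    # group(0) is the whole selected text, None means no match
    results.append(match.group(0) if match else None)

print(results)
```

The third line gives no match because the code is not at the end of the line.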
- - -
BACKREFERENCES We may need to change the position of different parts of the text inside a string. For example, suppose we have a CSV database with 5 columns and 10000 rows that contains an error: the second column is in the position of the fourth one, and the fourth is in the position of the second. Changing the positions manually would take hours and hours, but with a regex we can solve the problem in less than 5 seconds. One property of groups is that they memorize the text they select in a variable, so that it can be used in a substitution. In our case we need to create 5 groups that select the fields of a row of our CSV. Suppose the database is structured as follows:
- 1,45,589,phil,bob
- 2,56,79,mary,bob
- 3,57,89,phil,frank
- ..,..,..,..,..
We can use the following regex to select each of the fields inside a row:
- ([^,]+),([^,]+),([^,]+),([^,]+),([^,]+)$
With this regex each field is memorized in a variable: the first variable holds the first field, the second holds the second field, and so on. We only need to replace the selected text with the new structure (1,4,3,2,5) to get the result we want. There is a small complication, though, because the way to retrieve the variables differs from tool to tool. Htaccess, Dreamweaver and Perl retrieve the variables with the $ character: $1 retrieves the first one, $2 the second one. Furthermore, $0 retrieves the match of the whole regex. In our example we would replace each row with:
- $1,$4,$3,$2,$5
EditPad Pro and PowerGREP retrieve the variables with the \ character: \1 retrieves the first one, \2 the second one. Furthermore, \0 retrieves the match of the whole regex. In our example we would replace each row with:
- \1,\4,\3,\2,\5
.NET, JavaScript, PHP and others each retrieve the variables in their own way, and we advise you to read their guides.
WARNING: if you use a repetition to repeat a whole group, the variable refers to a single repetition of the group and not to the whole repeated sequence. If you use the regex ([0-9]{5}[a-zA-Z])+$ to select the code in the text My secret code is 12345T45345R12343F34567j, variable 1 will correspond only to the last repetition, 34567j, and not to the whole code. This happens because the repetition is outside the backreference. To solve the problem, turn the group we have to repeat into a group without backreference (so that it is not saved) and set up the backreference on the whole repetition:
- ((?:[0-9]{5}[a-zA-Z])+)$
In this regex you can notice the structure (?: ), which contains the two classes. This structure is a group without backreference: we can apply a repetition to it and memorize the result in the outer group. Now variable 1 contains the whole code: 12345T45345R12343F34567j
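Both the column swap and the warning about repeated groups can be tried in a short sketch. Python is assumed here. Its re.sub function retrieves the variables with the \ character, like the second family of tools above. The rows are the sample rows of the database, without spaces after the commas.

```python
import re

rows = ["1,45,589,phil,bob", "2,56,79,mary,bob", "3,57,89,phil,frank"]

# Five groups, one per field, separated by commas and anchored to the end of the row
row_pattern = re.compile(r"([^,]+),([^,]+),([^,]+),([^,]+),([^,]+)$")

# \1 ... \5 retrieve the memorized fields; fields 2 and 4 trade places
swapped = [row_pattern.sub(r"\1,\4,\3,\2,\5", row) for row in rows]
print(swapped)

text = "My secret code is 12345T45345R12343F34567j"

# Repetition outside the group: variable 1 keeps only the last repetition
last_only = re.search(r"([0-9]{5}[a-zA-Z])+$", text).group(1)

# Group without backreference inside, backreference on the whole repetition
whole_code = re.search(r"((?:[0-9]{5}[a-zA-Z])+)$", text).group(1)

print(last_only)   # 34567j
print(whole_code)  # 12345T45345R12343F34567j
```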
- - -
QUESTION MARK In groups, the question mark can be used to avoid memorizing the match. We have already seen that it can also be used to restrict repetitions. Now we will see that this simple character has many other functions. The first one makes a group optional, as you can see in the following example:
- michael( owen)?
In this regex the group ( owen) is optional, so it selects both the single word michael and the pair of words michael owen. The second function is to act as an anchor. As we have already seen, operators such as ^ and $ are anchors: they identify a position inside a string. The question mark can also be used in a group to turn it into an anchor, so that the group identifies a position inside the text. Example:
- michael(?= owen)
This regex selects the word michael only if it is followed by the group ( owen), which is not selected. Examples:
- michael owen
- michael
- today owen scored a goal
- yesterday michael owen didn't score
You can also use the question mark to identify the absence of something at a position. For example, the following regex selects the word michael only if it is not followed by the group ( owen):
- michael(?! owen)
Examples:
- michael owen
- michael
- today owen scored a goal
- yesterday michael owen didn't score
The two constructs that we have just described work only when the anchor follows the text (or group or class) that has to be selected. If the anchor is placed before the selected text, we have to use two other structures, the first to verify the presence of the anchor and the second to verify its absence:
- (?<=owen) michael
- (?<!owen) michael
Basically, the character < is inserted after the question mark.
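The uses of the question mark described in this section can be compared side by side. This is a sketch in Python, assuming its standard re module. The helper first_match is hypothetical, the lookahead groups include the space that separates the two words in the sample lines, and the string "owen michael" is a made-up sample for the last two structures.

```python
import re

samples = [
    "michael owen",
    "michael",
    "today owen scored a goal",
    "yesterday michael owen didn't score",
]


def first_match(pattern, text):
    """Return the text selected by the first match, or None if there is none."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


# Optional group: selects michael with or without owen
optional = [first_match(r"michael( owen)?", s) for s in samples]

# Anchor after the text: owen must follow, but it is not selected
followed = [first_match(r"michael(?= owen)", s) for s in samples]

# Absence after the text: owen must not follow
not_followed = [first_match(r"michael(?! owen)", s) for s in samples]

# Anchor before the text: presence and absence of owen before " michael"
after_owen = first_match(r"(?<=owen) michael", "owen michael")
not_after_owen = first_match(r"(?<!owen) michael", "owen michael")

print(optional)
print(followed)
print(not_followed)
print(repr(after_owen), not_after_owen)
```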
