Substitution and Translation


As well as identifying regular expressions, Perl can make substitutions based on those matches. The way to do this is to use the s function, which is designed to mimic the way substitution is done in the vi text editor. Once again the match operator is used, and once again if it is omitted then the substitution is assumed to take place with the $_ variable.
To replace an occurrence of london by London in the string $sentence we use the expression
$sentence =~ s/london/London/
and to do the same thing with the $_ variable just

s/london/London/
Notice that the two regular expressions (london and London) are surrounded by a total of three slashes. The result of this expression is the number of substitutions made, so it is either 0 (false) or 1 (true) in this case.
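As a minimal sketch of the return value described above (the sample strings and the variable names $changed, $other and $none are made up for illustration), the result of s can be captured and tested like this:

```perl
use strict;
use warnings;

# Hypothetical sample sentence, chosen only for illustration.
my $sentence = "I live in london, not far from london bridge.";

# Without any options only the first occurrence is replaced.
my $changed = ($sentence =~ s/london/London/);

print "$sentence\n";
print "A substitution was made\n" if $changed;

# When nothing matches, the result is false and the string is left alone.
my $other = "paris";
my $none  = ($other =~ s/london/London/);
print "No substitution was made in '$other'\n" unless $none;
```

The parentheses around the substitution are not strictly needed, but they make it clear that the count, not the string, is being assigned.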

Options

This example only replaces the first occurrence of the string, and it may be that there will be more than one such string we want to replace. To make a global substitution the last slash is followed by a g as follows:

s/london/London/g
which of course works on the $_ variable. Again the expression returns the number of substitutions made, which is 0 (false) or something greater than 0 (true).
If we want to also replace occurrences of lOndon, lonDON, LoNDoN and so on then we could use
s/[Ll][Oo][Nn][Dd][Oo][Nn]/London/g
but an easier way is to use the i option (for "ignore case"). The expression

s/london/London/gi
will make a global substitution ignoring case. The i option is also used in the basic /.../ regular expression match.
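The difference between the options can be seen side by side in a short sketch. The sample text and variable names here are assumptions made for the example, not part of the tutorial's running program:

```perl
use strict;
use warnings;

my $text = "london, LONDON and LoNDoN";   # hypothetical sample text

# g alone: every lower-case "london" is replaced, other spellings are not.
my $global_only  = $text;
my $global_count = ($global_only =~ s/london/London/g);

# g and i together: every spelling is replaced, whatever its case.
my $any_case       = $text;
my $any_case_count = ($any_case =~ s/london/London/gi);

print "$global_count: $global_only\n";
print "$any_case_count: $any_case\n";
```

Working on copies keeps the original $text intact, so both variants start from the same string.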

Remembering Patterns

It's often useful to remember patterns that have been matched so that they can be used again. It just so happens that anything matched in parentheses gets remembered in the variables $1,...,$9. These strings can also be used in the same regular expression (or substitution) by using the special RE codes \1,...,\9. For example

$_ = "Lord Whopper of Fibbing";
s/([A-Z])/:\1:/g;
print "$_\n";
will replace each upper case letter by that letter surrounded by colons. It will print :L:ord :W:hopper of :F:ibbing. The variables $1,...,$9 are read-only variables; you cannot alter them yourself.
As another example, the test
if (/(\b.+\b) \1/) {
    print "Found $1 repeated\n";
}
will identify any words repeated. Each \b represents a word boundary and the .+ matches any non-empty string, so \b.+\b matches anything between two word boundaries. This is then remembered by the parentheses and stored as \1 for regular expressions and as $1 for the rest of the program.
The following swaps the first and last characters of a line in the $_ variable:
s/^(.)(.*)(.)$/\3\2\1/
The ^ and $ match the beginning and end of the line. The \1 code stores the first character; the \2 code stores everything else up to the last character, which is stored in the \3 code. Then that whole line is replaced with \1 and \3 swapped round.
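The three remembered-pattern examples can be tried together in one small script. This is only a sketch: the sample strings and variable names are invented, and the replacement side is written with $1, $2, $3 rather than \1, \2, \3, because Perl warns about the backslash form in a replacement when warnings are switched on. Inside the pattern itself \1 is still the right form.

```perl
use strict;
use warnings;

# Surround each upper-case letter with colons.
my $title = "Lord Whopper of Fibbing";
(my $marked = $title) =~ s/([A-Z])/:$1:/g;
print "$marked\n";

# Swap the first and last characters of a line.
my $line = "abcde";                         # hypothetical sample line
(my $swapped = $line) =~ s/^(.)(.*)(.)$/$3$2$1/;
print "$swapped\n";

# Detect a repeated word; \1 refers back to the first parenthesised group.
my $repeat = "it was the the best of times"; # hypothetical sample text
my $found  = ($repeat =~ /(\b.+\b) \1/) ? $1 : undef;
print "Found $found repeated\n" if defined $found;
```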
After a match, you can use the special read-only variables $` and $& and $' to find what was matched before, during and after the search. So after
$_ = "Lord Whopper of Fibbing";
/pp/;
all of the following are true. (Remember that eq is the string-equality test.)

$` eq "Lord Who";
$& eq "pp";
$' eq "er of Fibbing";
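A short sketch that copies the three special variables into ordinary ones straight after the match; the names $before, $matched and $after are chosen here purely for illustration:

```perl
use strict;
use warnings;

$_ = "Lord Whopper of Fibbing";

my ($before, $matched, $after);
if (/pp/) {
    # Copy the special variables at once; a later match would overwrite them.
    ($before, $matched, $after) = ($`, $&, $');
}

print "before:  <$before>\n";
print "matched: <$matched>\n";
print "after:   <$after>\n";
```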
Finally on the subject of remembering patterns it's worth knowing that inside of the slashes of a match or a substitution variables are interpolated. So
$search = "the";
s/$search/xxx/g;
will replace every occurrence of the with xxx. If you want to replace every occurrence of there then you cannot do s/$searchre/xxx/ because this will be interpolated as the variable $searchre. Instead you should put the variable name in curly braces so that the code becomes

$search = "the";
s/${search}re/xxx/;
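To see the two forms of interpolation at work, here is a small sketch on an invented sample string. The variable names $all and $there are assumptions for the example:

```perl
use strict;
use warnings;

my $search = "the";
my $text   = "the theatre is over there";   # hypothetical sample text

# $search is interpolated, so this replaces every "the".
(my $all = $text) =~ s/$search/xxx/g;

# The braces end the variable name, so the pattern becomes "there".
(my $there = $text) =~ s/${search}re/xxx/;

print "$all\n";
print "$there\n";
```

With use strict in force, writing s/$searchre/xxx/ by mistake is reported at compile time as an undeclared variable rather than silently matching nothing useful.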

Translation

The tr function allows character-by-character translation. The following expression replaces each a with e, each b with d, and each c with f in the variable $sentence. The expression returns the number of substitutions made.

$sentence =~ tr/abc/edf/
Most of the special RE codes do not apply in the tr function. For example, the statement here counts the number of asterisks in the $sentence variable and stores that in the $count variable.
$count = ($sentence =~ tr/*/*/);
However, the dash is still used to mean "between". This statement converts $_ to upper case.

tr/a-z/A-Z/;
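The three uses of tr described above can be combined in one sketch. The sample sentence and the variable names are made up for illustration:

```perl
use strict;
use warnings;

my $sentence = "a cab * a bed *";   # hypothetical sample text

# Counting: translating * to itself changes nothing but returns the count.
my $stars = ($sentence =~ tr/*/*/);

# Character-by-character translation: a becomes e, b becomes d, c becomes f.
my $translated = $sentence;
my $changes    = ($translated =~ tr/abc/edf/);

# A range: convert a copy to upper case.
(my $upper = $translated) =~ tr/a-z/A-Z/;

print "$stars asterisks\n";
print "$changes characters translated: $translated\n";
print "$upper\n";
```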

Exercise

Your current program should count lines of a file which contain a certain string. Modify it so that it counts lines with double letters (or any other double character). Modify it again so that these double letters appear also in parentheses. For example your program would produce a line like this among others:

023 Amp, James Wa(tt), Bob Transformer, etc. These pion(ee)rs conducted many
Try to get it so that all pairs of letters are in parentheses, not just the first pair on each line.
For a slightly more interesting program you might like to try the following. Suppose your program is called countlines. Then you would call it with
./countlines
However, if you call it with several arguments, as in

./countlines first second etc
then those arguments are stored in the array @ARGV. In the above example we have $ARGV[0] is first and $ARGV[1] is second and $ARGV[2] is etc. Modify your program so that it accepts one argument and counts only those lines with that string. It should also put occurrences of this string in parentheses. So

./countlines the
will output something like this line among others:

019 But (the) greatest Electrical Pioneer of (the)m all was Thomas Edison, who

Split


A very useful function in Perl is split, which splits up a string and places it into an array. The function uses a regular expression and as usual works on the $_ variable unless otherwise specified.
The split function is used like this:

$info = "Caine:Michael:Actor:14, Leafy Drive";
@personal = split(/:/, $info);
which has the same overall effect as
@personal = ("Caine", "Michael", "Actor", "14, Leafy Drive");
If we have the information stored in the $_ variable then we can just use this instead
@personal = split(/:/);
If the fields are divided by any number of colons then we can use the RE codes to get round this. The code

$_ = "Capes:Geoff::Shot putter:::Big Avenue";
@personal = split(/:+/);
is the same as
@personal = ("Capes", "Geoff", "Shot putter", "Big Avenue");
But this:
$_ = "Capes:Geoff::Shot putter:::Big Avenue";
@personal = split(/:/);
would be like
@personal = ("Capes", "Geoff", "", "Shot putter", "", "", "Big Avenue");
A word can be split into characters, a sentence split into words and a paragraph split into sentences:

@chars = split(//, $word);
@words = split(/ /, $sentence);
@sentences = split(/\./, $paragraph);
In the first case the null string is matched between each character, and that is why the @chars array is an array of characters - ie an array of strings of length 1.
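The effect of the different patterns is easiest to see by printing the resulting arrays. The following is a sketch only; the array names and the use of join with a | separator are choices made for this example:

```perl
use strict;
use warnings;

my $info = "Capes:Geoff::Shot putter:::Big Avenue";

# One or more colons count as a single separator.
my @collapsed = split(/:+/, $info);

# Each colon is a separator, so neighbouring colons give empty fields.
my @with_empty = split(/:/, $info);

# The empty pattern splits between every pair of characters.
my @chars = split(//, "frog");

print scalar(@collapsed),  " fields: ", join("|", @collapsed),  "\n";
print scalar(@with_empty), " fields: ", join("|", @with_empty), "\n";
print scalar(@chars),      " chars:  ", join("|", @chars),      "\n";
```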

Exercise

A useful tool in natural language processing is concordance. This allows a specific string to be displayed in its immediate context wherever it appears in a text. For example, a concordance program identifying the target string the might produce some of the following output. Notice how the occurrences of the target string line up vertically.
discovered (this is the truth) that when he
t kinds of metal to the leg of a frog, an e
rrent developed and the frog's leg kicked,
 longer attached to the frog, which was dea
normous advances in the field of amphibian
ch it hop back into the pond -- almost.  Bu
ond -- almost.  But the greatest Electrical
ectrical Pioneer of them all was Thomas Edi
This exercise is to write such a program. Here are some tips:



  • Read the entire file into an array (this obviously isn't useful in general because the file may be extremely large, but we won't worry about that here). Each item in the array will be a line of the file.
  • When the chop function is used on an array it chops off the last character of every item in the array.
  • Recall that you can join the whole array together with a statement like $text = "@lines";
  • Use the target string as delimiter for splitting the text. (Ie, use the target string in place of the colon in our previous examples.) You should then have an array of all the strings between the target strings.
  • For each array element in turn, print it out, print the target string, and then print the next array element.
  • Recall that the last element of an array @food has index $#food.

As it stands this would be a pretty good program, but the target strings won't line up vertically. To tidy up the strings you'll need the substr function. Here are three examples of its use.
substr("Once upon a time", 3, 4); # returns "e up"
substr("Once upon a time", 7); # returns "on a time"
substr("Once upon a time", -6, 5); # returns "a tim"
The first example returns a substring of length 4 starting at position 3. Remember that the first character of a string has index 0. The second example shows that missing out the length gives the substring right to the end of the string. The third example shows that you can also index from the end using a negative index. It returns the substring that starts at the 6th character from the end and has length 5.
If you use a negative index that extends beyond the beginning of the string then Perl will return nothing or give a warning. To avoid this happening you can pad out the string by using the x operator mentioned earlier. The expression (" "x30) produces 30 spaces, for example.
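The padding idea can be sketched as follows. The width of 20 and the variable names are assumptions for the example, and this shows only the padding step, not a full solution to the exercise:

```perl
use strict;
use warnings;

my $story = "Once upon a time";
print substr($story, 3, 4),  "\n";   # length 4 starting at position 3
print substr($story, 7),     "\n";   # from position 7 to the end
print substr($story, -6, 5), "\n";   # length 5, starting 6 from the end

my $width = 20;       # hypothetical width of the left-hand context
my $left  = "frog";   # a context string shorter than $width

# Pad on the left first, so the negative index never reaches
# past the beginning of the string.
my $padded  = (" " x $width) . $left;
my $context = substr($padded, -$width);

print "[$context]\n";
```

Because every left-hand context is cut to the same width, whatever is printed after it starts in the same column.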

