In the last post I discussed a script that, given a URL to a tarball with a name in a given form, translates the URL to a directory and a filename as follows:
http://ftp.drupal.org/files/projects/nodewords-6.x-1.11.tar.gz is parsed into nodewords-6.x-1.11.tar.gz (the filename) and nodewords (a directory to be).
The script parsed the URL like this:
MODULE_DIR=$(for f in $(echo $URL|tr '/' ' ');do true;done;echo $f | cut -d '-' -f1)
TAR_BALL=$(for f in $(echo $URL|tr '/' ' ');do true;done;echo $f)
The first line, MODULE_DIR=$(for f in $(echo $URL|tr '/' ' ');do true;done;echo $f | cut -d '-' -f1), simply translates every “/” to a space and then loops over the resulting words. The loop body does nothing, but the loop variable f is overwritten on each pass, so after the loop it holds the last word. Finally, cut splits that word at each “-” and returns the first field:
1 http://ftp.drupal.org/files/projects/nodewords-6.x-1.11.tar.gz -> http: ftp.drupal.org files projects nodewords-6.x-1.11.tar.gz
2 Loop through each word landing on nodewords-6.x-1.11.tar.gz
3 Cut nodewords-6.x-1.11.tar.gz on "-" -> nodewords 6.x 1.11.tar.gz (but return only the first, nodewords )
The second line, TAR_BALL=$(for f in $(echo $URL|tr '/' ' ');do true;done;echo $f), does the same thing but stops after the loop: it is only interested in the word after the last “/”, which is nodewords-6.x-1.11.tar.gz.
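To make the three steps easier to follow, here is a small sketch that spells them out one at a time. It is only an illustration: the intermediate variable WORDS is my own addition and does not appear in the original script.

```shell
URL="http://ftp.drupal.org/files/projects/nodewords-6.x-1.11.tar.gz"

# Step 1: turn every "/" into a space so the URL becomes a list of words.
# WORDS is a hypothetical helper variable, used here only for readability.
WORDS=$(echo "$URL" | tr '/' ' ')

# Step 2: loop over the words. The body does nothing, but f is overwritten
# on every pass, so it holds the last word once the loop is done.
for f in $WORDS; do true; done
TAR_BALL=$f

# Step 3: split the last word on "-" and keep only the first field.
MODULE_DIR=$(echo "$f" | cut -d '-' -f1)

echo "$TAR_BALL"
echo "$MODULE_DIR"
```

The unquoted $WORDS in the for statement is deliberate: the shell has to split it into separate words for the loop to work.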
Now this may seem like a lengthy way of splitting and parsing out a substring from a piece of text. So are there any other ways of achieving the same thing without strange loops and stuff?
Luckily, there are: the external expr command, and a few ways of manipulating strings that are built into bash itself. I’ll show a few here.
1. Using the expr(1) command.
expr is a neat little command for evaluating, you guessed it, expressions. GNU expr has a match operator which lets you pass in a regexp and parse out substrings in a detailed way (the portable spelling of expr match STRING REGEXP is expr STRING : REGEXP).
Consider:
URL="http://ftp.drupal.org/files/projects/nodewords-6.x-1.11.tar.gz"
TARBALL=`expr match "$URL" '.*/\(.*$\)'`
What would we get and how?
The regexp given to expr match is '.*/\(.*$\)'. Basically it means: skip everything up to and including a “/”, then capture whatever follows up to the end of the line. Because “.*” is greedy it swallows as much as it can, so the “/” it stops at is the last one. Which is to say: return the substring that follows the last “/”.
The escaped parentheses tell expr match that we are interested in a substring, and the regexp in front of the opening parenthesis describes what must come before that substring. A simpler example using expr match for substrings will follow.
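For systems where expr is not the GNU version, the same extraction can be sketched with the colon operator. This is a minimal example that assumes the URL does not start with a “-” (which expr could mistake for an option or operator).

```shell
URL="http://ftp.drupal.org/files/projects/nodewords-6.x-1.11.tar.gz"

# "expr STRING : REGEXP" is the portable form of "expr match STRING REGEXP".
# The regexp is anchored at the start of the string, and the greedy ".*/"
# consumes everything up to and including the last "/".
TARBALL=$(expr "$URL" : '.*/\(.*\)')

echo "$TARBALL"
```

When the regexp does not match, expr prints an empty string and exits with a non-zero status, so a script can test for failure.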
2. Bash built-in string manipulation tricks
Now consider:
URL="http://ftp.drupal.org/files/projects/nodewords-6.x-1.11.tar.gz"
TARBALL=`expr match "$URL" '.*/\(.*$\)'`
(same as before)
echo ${TARBALL%-*gz}
Where the last example relied on the external command expr, we’re now using bash’s built-in feature for stripping the shortest match of a pattern from the end of a string: ${string%pattern} (where string is a variable and pattern is a shell glob, not a regexp).
With TARBALL containing nodewords-6.x-1.11.tar.gz, echo ${TARBALL%-*gz} drops the shortest trailing piece that starts with “-” and ends with “gz”. That piece is -1.11.tar.gz, so the result is nodewords-6.x, which is close but still carries part of the version.
To get the directory name we need the flavor that strips the longest match from the end, which looks like this: ${string%%pattern} (two “%” rather than one).
Then echo ${TARBALL%%-*z} gives us nodewords, since the longest match from “-” to “z” in nodewords-6.x-1.11.tar.gz runs from the first occurrence of “-” to the end of the string.
The counterpart of ${string%%pattern} is ${string##pattern}, which means “strip the longest match of a pattern from the start of a string”. That allows the following statement for extracting the tarball name from the URL: echo ${URL##http*/} (drop the longest match, from “http” through the last “/”, from the front of the URL).
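The difference between the shortest and longest flavors is easiest to see side by side. Here is a small sketch applying all four operators to the same tarball name; the variable names on the left are hypothetical and only there to label the results.

```shell
TARBALL="nodewords-6.x-1.11.tar.gz"

# % and %% strip from the end, # and ## strip from the start.
# A single character means shortest match, a doubled one means longest.
SHORT_END=${TARBALL%-*}     # removes "-1.11.tar.gz"
LONG_END=${TARBALL%%-*}     # removes "-6.x-1.11.tar.gz"
SHORT_START=${TARBALL#*-}   # removes "nodewords-"
LONG_START=${TARBALL##*-}   # removes "nodewords-6.x-"

echo "$SHORT_END"     # nodewords-6.x
echo "$LONG_END"      # nodewords
echo "$SHORT_START"   # 6.x-1.11.tar.gz
echo "$LONG_START"    # 1.11.tar.gz
```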
So now we could shorten the script part for extracting “nodewords” and “nodewords-6.x-1.11.tar.gz” from the URL significantly:
TAR_BALL=${URL##http*/}
MODULE_DIR=${TAR_BALL%%-*z}
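The two patterns above only work if the URL starts with “http” and the filename ends in “z”. A slightly looser variant, sketched here as one possible alternative rather than as the original script, drops those assumptions by matching on the separators alone.

```shell
URL="http://ftp.drupal.org/files/projects/nodewords-6.x-1.11.tar.gz"

# Strip the longest prefix ending in "/" to keep the filename.
TAR_BALL=${URL##*/}

# Strip the longest suffix starting with "-" to keep the module name.
MODULE_DIR=${TAR_BALL%%-*}

echo "$TAR_BALL"
echo "$MODULE_DIR"
```

Like the original, this cuts at the first “-”, so a module whose own name contains a hyphen would be truncated.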
Some simpler examples
Using expr match for substrings
Consider the string FILE="Flower-30x25.jpg". Let’s say we want to extract the 30x25 part and we know the string will look like Name-NNxNN.jpg. Then we would want to find the substring between the “-” and the “.”. This is what it looks like:
SIZE=`expr match "$FILE" '.*-\(.*\)\.'`
So, here, the regexp says: ignore everything up to and including “-”, then give me the substring of everything up to the dot. In a regexp, the “.” has the special meaning “any character” and the “*” means “zero or more of the preceding item”. If you want to reference an actual dot, you need to escape it: “\.”.
If you didn’t know the size substring would follow a “-”, you’d have to tell expr exactly what it looks like. That could look like this:
SIZE=`expr match "$FILE" '.*\([0-9][0-9]x[0-9][0-9]\)'`
Here you are saying: Skip everything but match the substring that consists of two digits, an “x”, and then two digits again.
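Both regexps can be tried out with the short sketch below. It uses the portable colon form of expr, and the variable names and the file name Flower.jpg are made up for the illustration. The last part shows what happens when the name contains no size at all.

```shell
FILE="Flower-30x25.jpg"

# Variant 1: rely on the "-" in front of the size and the "." after it.
SIZE_BY_DELIM=$(expr "$FILE" : '.*-\(.*\)\.')

# Variant 2: describe the size itself as two digits, an "x", two digits.
SIZE_BY_SHAPE=$(expr "$FILE" : '.*\([0-9][0-9]x[0-9][0-9]\)')

echo "$SIZE_BY_DELIM"
echo "$SIZE_BY_SHAPE"

# With no match, expr prints an empty string and exits non-zero,
# so the assignment can be used directly as the condition of an if.
if SIZE=$(expr "Flower.jpg" : '.*\([0-9][0-9]x[0-9][0-9]\)'); then
  RESULT="size: $SIZE"
else
  RESULT="no size found"
fi
echo "$RESULT"
```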
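The parentheses are escaped with a backslash because expr uses basic regular expressions, in which \( and \) mark the group to capture while a bare parenthesis is just a literal character. The single quotes around the regexp keep bash from interpreting any of it, so it reaches expr verbatim.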
Feel free to comment or to leave suggestions!