Those who are familiar with Unix commandline tools like grep, sed and cut will know about the enormous power they provide. They make it a breeze to mangle, transform and retrieve information in and from text files. Unfortunately, they mostly depend on row- and column-based information. That is, they expect each line in a file to contain one row, and the columns in that row to be separated by a certain character (usually a space or a tab). Take, for instance, some lines from a simple Apache logfile:
22.214.171.124 - - [26/Jun/2005:09:57:55 +0200] "GET /st..
192.168.1.7 - - [26/Jun/2005:10:03:56 +0200] "GET / HTTP..
126.96.36.199 - - [26/Jun/2005:10:14:27 +0200] "GET /imag..
192.168.1.7 - - [26/Jun/2005:10:21:36 +0200] "GET / HTTP..
188.8.131.52 - - [26/Jun/2005:10:23:53 +0200] "GET /ima..
If I wanted to list every unique IP in that logfile, I'd simply issue the following command at the shell:
[todsah@jib]~$ cut -d" " -f1 access.log | sort | uniq
192.168.1.7
184.108.40.206
220.127.116.11
18.104.22.168
'Cut' strips away every column except for the first. 'Sort' sorts the list of IPs so that all duplicates end up directly under each other. 'Uniq' then removes the duplicate IPs, and I'm left with a list of all unique IPs in the log. Writing this small 'script' took about 15 seconds. Now, that's a pretty powerful way to do some quick statistical analysis.
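The same pipeline stretches a little further with hardly any extra typing. As a minimal sketch: adding `uniq -c` and a second sort gives the number of hits per IP instead of only the unique addresses. The log lines below are made up for the example (the addresses come from a documentation range), and the temporary file merely stands in for a real access.log.

```shell
# Hypothetical sample log, standing in for a real access.log.
log=$(mktemp)
cat > "$log" <<'EOF'
192.0.2.10 - - [26/Jun/2005:09:57:55 +0200] "GET /a HTTP/1.1" 200 512
192.0.2.20 - - [26/Jun/2005:10:03:56 +0200] "GET / HTTP/1.1" 200 1024
192.0.2.10 - - [26/Jun/2005:10:14:27 +0200] "GET /b HTTP/1.1" 200 256
192.0.2.10 - - [26/Jun/2005:10:21:36 +0200] "GET / HTTP/1.1" 200 1024
EOF

# cut keeps the first column, sort groups identical addresses together,
# uniq -c prefixes each distinct address with its count,
# and sort -rn lists the busiest address first.
hits=$(cut -d" " -f1 "$log" | sort | uniq -c | sort -rn)
printf '%s\n' "$hits"

rm -f "$log"
```

Each output line holds a count followed by an address. The width of the padding in front of the count differs between uniq implementations, so it is best not to rely on it.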
Unfortunately, XML takes that power away almost completely. It doesn't work on a row/column basis, its syntax is loose (for example, you can spread a single element with its attributes over multiple lines) and you can nest elements inside other elements.
There is hope, however. A toolset called XMLStarlet offers a powerful XML commandline tool which can do XPath selects, transformations and more.
Take the following example XML file:
<?xml version='1.0' encoding='UTF-8'?>
<dataq port="50000" daemon="false" verbose="true">
  <access>
    <host>127.0.0.1</host>
  </access>
  <access>
    <username>john</username>
    <password>johnspw</password>
  </access>
  <access>
    <host>192.168.1.5</host>
    <username>pete</username>
    <password>petespw</password>
  </access>
  <queue name='backup' />
  <queue name='mp3' type='fifo' size='1' overflow='pop' />
  <queue name='restricted' type='fifo' size='5' overflow='deny'>
    <access sense="deny">
      <username>john</username>
    </access>
  </queue>
</dataq>
Suppose I wanted to get all the usernames in this XML file. Using the traditional Unix commandline utilities, I'd have to do this:
[todsah@jib]~$ grep "<username>" dataq.xml | cut -d'>' -f2 | cut -d'<' -f1
john
pete
john
As you can see, this works. But what if we changed the last queue element to be completely on one line?
<queue name='restricted' type='fifo' size='5' overflow='deny'><access sense="deny"><username>john</username></access></queue>
It's the exact same, valid, XML and should yield the same results, but it does this instead:
[todsah@jib]~$ grep "<username>" dataq.xml | cut -d'>' -f2 | cut -d'<' -f1
john
pete
The problem is that you can't assume the layout to be the same from one XML file to the next. The XML specification simply doesn't prescribe where line breaks and indentation go.
Using the XMLStarlet commandline tool, we can work around these problems. For instance, selecting all usernames from the XML file works like this:
[todsah@jib]~$ xmlstarlet sel -t -m "//username" -v 'node()' -n dataq.xml
john
pete
john
This commandline basically says to use Select mode (sel) with a commandline template (-t) to match all <username> elements (-m "//username"), to show the Value (-v) of each match, and to follow each value with a newline (-n).
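The same kind of template works for attributes too. As a rough sketch, the commands below print the name attribute of every <queue> element, and then only the names of the queues whose overflow attribute is 'deny'. The temporary file is a cut-down copy of the example above, with the last queue deliberately squeezed onto one line, and it assumes the tool is installed under the name xmlstarlet (some systems may install it under a different name).

```shell
# Hypothetical scratch file: only the <queue> elements from the example.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<?xml version='1.0' encoding='UTF-8'?>
<dataq port="50000" daemon="false" verbose="true">
  <queue name='backup' />
  <queue name='mp3' type='fifo' size='1' overflow='pop' />
  <queue name='restricted' type='fifo' size='5' overflow='deny'><access sense="deny"><username>john</username></access></queue>
</dataq>
EOF

# Match every <queue> element and print the value of its name attribute.
queue_names=$(xmlstarlet sel -t -m "//queue" -v "@name" -n "$sample")
printf '%s\n' "$queue_names"

# Same template, but the XPath predicate keeps only the queues
# that have overflow='deny'.
deny_queues=$(xmlstarlet sel -t -m "//queue[@overflow='deny']" -v "@name" -n "$sample")
printf '%s\n' "$deny_queues"

rm -f "$sample"
```

Because the match is done on the parsed document rather than on lines of text, the one-line queue element is treated no differently from the others.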
XMLStarlet can also transform XML files completely (using XSLT), as well as validate, format and edit them. You can, for instance, use XMLStarlet to delete or insert certain parts of an XML file that match an XPath expression. It can also convert XML to the PYX format, which can then be more easily used with traditional Unix commandline tools.
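To give an impression of the edit and PYX modes, here is a small sketch under a few assumptions: the temporary file is a trimmed-down stand-in for dataq.xml, the first command deletes every <password> element, and the second one pulls the usernames out of the PYX stream with plain sed. The sed expression is just one possible way to pick the text line that follows an opening username tag.

```shell
# Hypothetical scratch file: a trimmed-down copy of the dataq.xml example.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<?xml version='1.0' encoding='UTF-8'?>
<dataq port="50000">
  <access>
    <username>john</username>
    <password>johnspw</password>
  </access>
  <access><username>pete</username><password>petespw</password></access>
</dataq>
EOF

# Edit mode: delete every node that matches the XPath expression //password.
# The result is written to stdout; the input file itself is not changed.
stripped=$(xmlstarlet ed -d "//password" "$sample")
printf '%s\n' "$stripped"

# PYX mode: one XML event per line, so line-based tools work again.
# "(username" opens the element and the "-..." line after it holds its text.
pyx_users=$(xmlstarlet pyx "$sample" | sed -n '/^(username$/{n;s/^-//p;}')
printf '%s\n' "$pyx_users"

rm -f "$sample"
```

The PYX route works the same whether an element is spread over several lines or crammed onto one, which is exactly where the grep and cut approach fell over.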