Last week I mentioned that I was going to take a look at some interesting Unix utilities that you can also run on Windows by way of GnuWin (or some similar GNU on Windows library).
The first item that I want to look at is the “join” program. Join can perform operations very similar to a SQL join, except that it works with text files. Let’s say I have two text files, one named people, and one named colors. The contents of the people file is a unique ID followed by a name, and the colors file is everyone’s favorite color – where the columns are the person’s ID and then their favorite color.
Here’s the people file:
1 Jason 2 Cameron 3 Celvin 4 Michael 5 Opal 6 Amy
Here’s the colors file:
1 Orange 2 Pink 3 Purple 4 Red 5 Blue 6 Hot Pink
As a very simple case, we can run join with just the names of these two files:
join people colors
And we’ll get this output:
1 Jason Orange 2 Cameron Pink 3 Celvin Purple 4 Michael Red 5 Opal Blue 6 Amy Hot Pink
In this case, the join program just assumed that the the first columns of each file were the same thing, and join accordingly. Again by default, the output was all of the columns from each file.
Let’s change things up and use a different people file, this time named people-small and containing fewer entries, like this:
1 Jason 4 Michael 5 Opal
Now if we run join with this file but the same colors file, like this:
join people-small colors
This will be our output:
1 Jason Orange 4 Michael Red 5 Opal Blue
If you come from the SQL world at all, you will notice that by default, join performed an inner join with respect to the data – that is, the three rows in the “left” data store (people-small) were paired up with the matching rows from the colors file, but the colors that didn’t match anything were just discarded. Similarly, if we run the full people file with a smaller colors file (named colors-small), we’ll see something similar. Consider this colors-small file:
2 Pink 3 Purple 6 Hot Pink
And running this command:
join people color-small
Then we’ll get this:
2 Cameron Pink 3 Celvin Purple 6 Amy Hot Pink
Again, an inner join was performed, in that only matched up rows (from one side or another) were paired up and put into the output.
Note that this behavior (assuming the first column is the ID to match, also, performing an inner join) is just the default behavior for join. As with most Unix tools, join has a plethora of command-line switches that define its behavior. You can choose different join methods, fields, and more.
The Hyperion Connection
I wanted to mention this tool specifically because I think for many of us working with ostensibly relational data (in text files), it can come in handy every now and then. I actually had a perfect use case for this the other day were I just wanted to quickly load some data but I didn’t want to mess with ODI to load it up or deal with importing it to a database. I had to sort the records in the input file first, but after that the join program worked like a charm and generated a file that I could easily use with a load rule to load up to a cube.