Cool tool for generating data

I came across a really useful tool the other day for generating sample relational data. It’s a web service called generatedata.com (inventively enough). It lets you define the columns, data types, and how to generate the data (random names, states, numbers, numbers in a range, and more), then generates it for you. You can get the data in delimited form or even directly as INSERT statements for your database of choice. It generates up to 100 rows for free; a paid premium service can generate more rows if you need them. Fortunately, in my case I was good with 100 rows.

Of course, if you need to generate data for your cubes, you might want to use a tool designed specifically for doing so, such as the Cubedata tool I built some time ago (version 2 is in the works!).

GNU CoreUtils for Windows

Did you know that many Unix core programs are available as Windows executables? Because the code for these utilities is open source, there are a couple of different organizations that have taken the liberty of compiling them for the Windows platform and making them available. One such project is GnuWin32. If you want to, say, use common Unix text processing programs on Windows as part of your Hyperion automation, you definitely can. You would just install the programs on your server, probably add their folder to your system PATH, and then be able to call them like any other program.

Why would you want to do this? There are some really great programs available on Unix platforms that might be useful for some one-off processing you do as part of your automation. In the coming days I’m going to post a few interesting text processing examples using these utilities that are inspired by Essbase/Hyperion. Of course, for those of you already running Linux/AIX, you should already be set (and of course, as a smug Mac user myself, I am good to go at least with my development machine…).

Another project that makes Unix tools available on Windows is Cygwin, which is also worth checking out if you need this functionality on a Windows machine.

Code This, Not That: Python date string formatting

Continuing on in the Code This, Not That series, I’m going to point out some convenient functions for formatting dates. Since Jython runs in the JVM, we get easy access to all of the libraries that the rich Java platform provides.

Working with FDMEE and custom scripting, it’s likely that you’re going to need to do some work with dates. This will often involve the Java “Date” object. You might find that you need to do something with the date – get the year, get the name of the month, create a string based on the day of the month, and so on. First off, there are a few things you definitely don’t want to do.

One, you don’t want to convert the date object to a string and then start doing string manipulation on it to get the value you want. If you are converting some VBScript code you might be tempted to do something like Year = RIGHT(someDate, 4). This can work but it’s a bit of a hack. What you want to do is work with the date object directly and let some tried and true code do the heavy lifting for you.

There are other ways to achieve this, but a pretty simple and robust one is the SimpleDateFormat class. The SimpleDateFormat class allows us to specify a formatting code, then supply it a Date object, and it’ll generate a string for us.

For example, the formatting code for the abbreviation for the month of a date object is “MMM”. Therefore, we can create a SimpleDateFormat object with this code, pass it a date, and the resulting string is Mar or Apr or whatever the current month is:

import java.text.SimpleDateFormat as Sdf
import java.lang.System as System
import java.sql.Date as Date

date = Date(System.currentTimeMillis())

monthShortName = Sdf('MMM')
print monthShortName.format(date)
# prints "Mar" (or whatever the current month is)

There are plenty of other codes that take care of formatting the minutes, hours, seconds, year, and more for us. We can even combine codes in a single string, like the following:

dateFolder = Sdf('yyyyMMdd')
print dateFolder.format(date)

In this case, the format is the four-digit year, followed immediately by the two-digit month (including a leading zero if it’s less than 10), and the two-digit day. You can refer to Oracle’s SimpleDateFormat documentation for more codes. You can also put other characters into the format string, such as hyphens between codes, colons, spaces, and more; these aren’t format codes themselves and are simply passed through to the output as-is.
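
For example, a timestamp-style format with literal separators in it might look like this (a quick sketch – the printed value is just for illustration):

timestamp = Sdf('yyyy-MM-dd HH:mm:ss')
print timestamp.format(date)
# prints something like "2015-03-17 09:45:30"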

Also keep in mind that you can create the formatting object once and use it as many times as you want to format dates into strings. In other words, if for some reason you need to format many dates into strings using the same format, you only have to create one SimpleDateFormat object and can then use it over and over again.
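
Here’s a quick sketch of what that reuse looks like, using a Calendar just to cook up a few different dates for the sake of illustration:

import java.util.Calendar as Calendar

dateFolder = Sdf('yyyyMMdd')
cal = Calendar.getInstance()

for i in range(3):
    # one formatter, three different dates
    print dateFolder.format(cal.getTime())
    cal.add(Calendar.DATE, 1)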

As I mentioned, there are plenty of ways to go about formatting dates in Java (Jython) but this is a particularly flexible and useful approach. With just a few lines of code you can generate all sorts of formats that you might need. Consider this an FDMEE/Jython pattern – and anti-patterns to this might include the following code situations you can clean up and make more readable:

  • Treating dates like strings and extracting data via substrings
  • Prefixing “20” on a two-digit year to get a four-digit year
  • Keeping a table of month names in an array and using the month number to get the proper month name
  • Keeping a table of month names and then taking the first three characters to get the “short” month name

Lastly, note that the fdmContext["PERIODKEY"] object (which is where you are likely to be doing some date processing) comes in as a Java Date object. In other words, if you are treating it as a string, you are actually relying on the Jython interpreter silently converting it to a string for you, when you could just be working with it directly. For example, to get the numeric four-digit year from the PERIODKEY object, you could do this:

# Note the explicit conversion to an integer
year = int(Sdf('yyyy').format(fdmContext["PERIODKEY"]))
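
And to replace the month-name-lookup-table anti-patterns mentioned above, the same approach gets you month names directly (a quick sketch):

print Sdf('MMMM').format(fdmContext["PERIODKEY"])
# prints the full month name, such as "March"
print Sdf('MMM').format(fdmContext["PERIODKEY"])
# prints the short month name, such as "Mar"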

Happy coding!

Code This, Not That: Python string interpolation/concatenation

That title is a bit of a mouthful, ain’t it?

I often see VB code in FDM jobs that is creating SQL statements – concatenating things together, adding in parameters from variables, lots of quotes, single quotes, plus signs, underscores, and more all over the place. It’s hideous.

For example, check some of this garbage out:

strSQL = "INSERT INTO " & TEMP & " (ENTITY,ACCOUNT,COSTCENTER,AMOUNT) " & _
" SELECT Entity.label, ACCOUNT.label, Entity.label+'_CC', " & Period & _
" from " & DATABASE & " as FACT"

For the record, I’m not saying if you see this or write this that you’re a bad developer. What I will say, however, is that you and/or your predecessors were hobbled by a clunky programming language and now you can atone for your syntactic sins with Python (Jython).

Python offers some nice facilities for handling strings. If you are looking for a straight translation of the normal VBScript string concatenation, you could do this:

account = "170100"
select = "SELECT Amount FROM Transactions WHERE Account = " + account

Python uses the plus operator (+) instead of the ampersand (&) to concatenate strings. This is straightforward and works well for simple situations. Of course, things get a little harder to read when we need to surround our value with single quotes (since it’s a SQL query after all):

select = "SELECT Amount FROM Transactions WHERE Account = '" + account + "'"

This isn’t terribly hard to read but we’re on a slippery slope. One thing we can try is simple “printf” style string interpolation:

select = "SELECT Amount FROM Transactions WHERE Account = '%s'" % account

In this case we have a %s placeholder, meaning that we will place a string in it. Then we are using the modulo (%) operator to place the value of account into the string. This probably looks funny if you’re not familiar with it. What’s going on with this syntax?

Remember how you used the ampersand in VB to concatenate strings but in Python you use a normal plus? Well, the fact that we can use the plus in Python is due to operator overloading. This is a programming term that essentially means we can program in our own behavior for math operators. You’re really used to this with integers (think 2 + 3 = 5), but maybe not so used to it for strings. It’s a clever trick and a nice facility that lets us work with strings (or indeed, other types of objects) without having to resort to a lot of boilerplate code like newStr = StringUtilities.concatenate(query1, query2) or something similarly awkward.
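
To make the idea of overloading concrete, here’s a toy sketch (purely illustrative, nothing FDMEE-specific) that teaches the plus operator a new behavior for a little class of our own:

class Money:
    def __init__(self, amount):
        self.amount = amount
    def __add__(self, other):
        # Python calls this whenever someone writes money1 + money2
        return Money(self.amount + other.amount)

total = Money(100) + Money(50)
print total.amount
# prints 150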

Along those lines, whereas the modulo operator in math means “give me the remainder when dividing these two numbers” (such as 5 % 3 = 2), in Python this operator has been overloaded to provide convenient operations with strings (after all, how on earth would you ever perform modulo arithmetic on strings?). So when we write string1 % string2 (which is what we had above with account), Python calls the appropriate handler for the string object, which in this case is a fancy string replacement functionality.
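
You can see both behaviors of the same operator side by side; note that with more than one placeholder, the values go on the right-hand side in a tuple (a throwaway example, not from any real script):

print 5 % 3
# prints 2 (plain old modulo on numbers)
print "Account %s in period %d" % ("0170100", 2)
# prints: Account 0170100 in period 2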

This string replacement facility, as it turns out, is pretty handy – and this leads me to the final example of this post.

Python supports objects known as dictionaries. For those from the Java world, this is essentially the same as a Map (or, more specifically, a HashMap). VBScript has these objects too but they aren’t part of the language itself – they come by way of the Scripting.Dictionary object and have a clunky syntax. In Python we can make a dictionary simply like this:

params = {
    "ACCOUNT" : "0170200",
    "PERIOD"  : 2
}

Next, Python also lets us create multi-line strings (and we don’t even need the line continuation operator from VB where the lines end with & _). We tell Python we have a multi-line string simply by using three quotes:

select = """
    SELECT 
        Amount
    FROM
        Transactions
    WHERE
        Account = '%(ACCOUNT)s'
        AND Period = %(PERIOD)d
""" 

Note that this string has been parameterized with ACCOUNT and PERIOD placeholders. The syntax is %(KEYNAME)s for strings and %(KEYNAME)d for integers (there are other codes as well but these are the most common).
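
As one example of those other codes, you can add a zero-padding width flag to %d, which is handy for building period strings with leading zeros (a hypothetical snippet, not part of the query above):

print "Period%(PERIOD)02d" % { "PERIOD" : 2 }
# prints: Period02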

Now we have a multi-line string with placeholders that we can replace with values from a dictionary. We could print this out easily:

print select % params

And we get this code:

SELECT 
    Amount
FROM
    Transactions
WHERE
    Account = '0170200'
    AND Period = 2   

This works exceptionally well for crafting SQL queries (also, yes, I know about placeholders with ? that are specific to SQL query parameters, let’s save that for another post). The wins here are that you get to format the query how you want, you don’t have quotes and ampersands and plus symbols all over the place, it’s easy to read, easy to maintain, and even faster to write.

Remember, code for readability. Also keep in mind that if you are converting VB to Python/Jython, you are not supposed to slavishly translate line by line – you are translating the solution as a whole.

Code This, Not That: From VBScript to Jython

I recently completed a fairly complex VBScript to Jython import integration script conversion. For those that are unfamiliar, Jython is the preferred scripting language of FDMEE (the successor to FDM). Jython offers numerous advantages over VBScript. Chief among those is that the language is more expressive and offers a lot of syntactic sugar, particularly compared to VBScript.

People that know me know that I have a passion for programming. I love writing code in many languages – Java is by far my strongest language, but I am also quite fond of Objective-C (with Swift on my list of things to play with), Python, and a few others. That said, I don’t just like to write good, working code – I challenge myself to write clean, elegant, high-performance, easily maintainable, easy to understand, deceptively simple, and fluid code that reads like prose.

One of the greatest compliments I ever got from someone was on an Essbase automation system I had designed. The compliment was, “You made it look simple. And I know it’s not simple – it’s just that good.” I was beaming for days.

But what does all this have to do with FDMEE and Jython?

Think about this: code is read many more times than it is written. Therefore, it behooves us to write it with as much expressiveness and conciseness as we can, to add useful comments, and to be idiomatic with respect to the language we are writing in. Along these lines, in my recent conversion project I was reminded of all the opportunities for writing better code – more expressive code – that Jython (Python) affords us over VBScript. So over the coming weeks I thought I would write about some particular code examples comparing VBScript and Jython that I think are worth pointing out, along with some general code recommendations I might have.

So let’s kick things off.

Checking if a value is one of a list of values

It’s common in an import integration script to check if some value, say, the current member from the Accounts dimension, is a particular account. The most straightforward way to handle this condition is with the following code:

if account == '0170100' or account == '0170200' or account == '0170300':
    print "Matched account!"

Thinking back to the notion that code is read more times than it is written, the English language interpretation of this code might be “If the account is 0170100, or if the account is 0170200, or if the account is 0170300, then do this thing.” You might call this the “brute force” approach – to just literally test each condition we might handle.

Can we come up with something a little more… refined? Python offers a convenient syntax that can make this code a little more readable. In Python we can use tuples. A tuple is like a variable that is a bag of other values or variables. Think of it as a simple collection. Using the “in” operator, we can test for membership in a tuple. The following code works the exact same way as the above code:

if account in ('0170100', '0170200', '0170300'):
    print "Matched account using tuple!"

Now the English interpretation of the code might be “Is the account one of these accounts?” The code is shorter, simpler, and reads a little more easily. While this example is somewhat trivial, it’s a good example of leveraging this great tool now at our disposal. Let’s take it one step further. Let’s declare a variable with these values ahead of time:

gross_profit_accounts = ('0170100', '0170200', '0170300')

And then we’ll check if the account is one of these:

if account in gross_profit_accounts:
    print "Matched account against account list!"

Again, this code works the exact same way, but we’ve introduced a variable name. Think of the variable as a bit of built-in documentation. We easily could have done this too:

# This is for example purposes, the previous example is the best one (in my opinion)
# Check if the account is one of the gross profit accounts
if account in ('0170100', '0170200', '0170300'):
    print "Matched account using tuple!"

But now without even using a comment, the variable name serves as an expressive explanation of what the code is doing. Again, code is read more than it is written, so I think this is a better solution than using the variable name “gp_accounts” or “gpa” or something that requires a decoder ring.

Think about the English translation of this code now: “If the account is a gross profit account, do this thing.” Also, what if we wanted to test the opposite case (not being a gross profit account)? We could do this:

if account not in gross_profit_accounts:
    print "Account not a gross profit account!"

In English: “If the account is not a gross profit account, do this thing.” It’s like the code documents itself – and that is the best kind of code. Think of code as prose!

Advanced Drillbridge tokens: Protecting against unexpected members

Here’s another quick tip for setting up Drillbridge tokens. I recently created a Drillbridge report that needs to map a member from the Cost Centers dimension over to multiple columns in a relational database table. For example, a cost center may have the format 111-222-333-44, where the codes 222, 333, and 44 each correspond to their own column in the same relational table.

Because the incoming member on the POV comes across as a Java string, we can perform normal Java string operations on it. In this case we can easily extract the “222” by doing the following (assuming the cost center dimension is named “Cost Centers”):

#Cost_Centers.substring(4, 7)

Recall that in Java, string character offsets start at 0, and that the substring we extract does not include the character at the ending offset (this is all spelled out nicely in the Java String documentation). So effectively, in the above function we are saying “give me the substring of the #Cost_Centers variable from characters 4 through 7 (but not including 7).”

A problem with this approach is that members with fewer characters could potentially cause a Java StringIndexOutOfBoundsException. For example, a level one member or parent to our cost center might be “Other”. When the substring method is called on “Other”, we are asking Java to give us characters that do not exist, and an exception is thrown.
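
You can see both cases for yourself from Jython, since it can call the very same Java String methods (a quick illustrative sketch):

import java.lang.String as JString

print JString("111-222-333-44").substring(4, 7)
# prints "222"
print JString("Other").substring(4, 7)
# throws StringIndexOutOfBoundsException - "Other" is only 5 characters long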

I thought for a while about the best way to handle this. There are a lot of ways I could go in terms of the code, but I decided that the best approach is one that won’t really affect the Drillbridge code much at all right now.

The decision to use a full-fledged expression language as the basis of Drillbridge token expressions has turned out to be quite fortuitous, and one of the things it affords us inside of Drillbridge expressions is the ability to write complex scripts that can handle this situation for us. One such way to guard against member names that are too short is to use the ternary operator.

Many programming languages support the ternary operator. It’s a compact way of representing an “if-else” construct. For example, consider this simple code:

if (stopped) {
    return "Red";
} else {
    return "Green";
}

In this case, the “stopped” variable is a boolean that is always either true or false. If the variable is true, our function returns the text “Red”, and if the variable is false, our function returns the text “Green” (assume that the if block is contained within a function that returns a String value).

This construct is so prevalent in computer programming that many languages support a way to compactly write this, using the ternary operator:

String colorText = stopped ? "Red" : "Green";

In the above example, the variable stopped gets evaluated; if it’s true, the whole expression evaluates to “Red”, and if it’s false, to “Green”. The String variable named colorText then gets that value.
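
(As an aside for the Jython crowd: Python has its own spelling of this construct, shown below, but inside a Drillbridge token you use the Java-style ? : syntax.)

color_text = "Red" if stopped else "Green"
# Python's conditional expression, equivalent to the Java ternary above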

We can use this exact same construct inside of a Drillbridge token expression, thanks to the powerful expression language being used. Now consider this enhanced query example:

SELECT * FROM TRANSACTIONS
WHERE
    SEGMENT2 = '{{
        "name":"Cost Centers", "expression" : "#Cost_Centers.length() >= 13 ? #Cost_Centers.substring(4, 7) : '~~ Cost Center name not long enough to parse ~~'", "sampleValue":"111-222-333-44"
    }}'

Now we are checking the value of an expression first – checking the length of the incoming cost center. If it is at least 13 characters, then the return value of the expression can safely be figured out using the substring method. If the member fails the length test, then we return an arbitrary string – one that we have determined cannot match anything at all in the database, so the query returns nothing.

This little trick can be used in situations where we need to parse out sub-strings from member names but might have some members that don’t fit the format we need. Also, thanks to Drillbridge’s inventive handling of multiple members with the drill-to-bottom feature, we can use this to effectively filter members out of a list, keeping the ones we want and discarding any others that we don’t need (for example, it’s not uncommon to have “non-conforming” member names for input members).

Drillbridge White Paper now available

I am excited to say that the official Drillbridge white paper is now generally available. It is titled “Drillbridge: Easy Hyperion Drill-through with No Redevelopment”. In my first official white paper, I have tried to concisely capture the essence of this useful software tool that is now, amazingly, approaching 20 production deployments.

In other news, the next version of Drillbridge is still in progress, slowly but surely. The focus continues to be on polish, performance, and incremental feature additions. I have a few small but exciting tidbits that I will hopefully have a chance to post later this week.

ODI journal table performance tweak with an index

The other day I provided a crash course on simple journalization in ODI with an Oracle database. I wanted to set the stage for this performance tweak that I put in a while ago on a few tables.

I mentioned that ODI stores journalized rows in a special table it creates that tracks the subscriber, processing status, and date of each row. The possible performance issue has to do with what ODI does when it processes rows out of that table. If you have two different systems (subscribers) for a given table, when an interface runs, it is going to process data for just one subscriber at a time (usually). During processing, ODI needs to mark the rows it’s working with by changing the JRN_CONSUMED column to 1 on each row with a matching subscriber.

In other words, if our interface processing data out of the CUSTOMERS table (by way of the journal table J$CUSTOMERS) is for subscriber ‘MARKETING’, then at some point in the ODI interface there will be a step like this:

UPDATE J$CUSTOMERS SET JRN_CONSUMED = '1' WHERE JRN_SUBSCRIBER = 'MARKETING'

If there are a lot of rows in this table – and particularly if there are a lot of rows AND multiple subscribers – then the WHERE clause on this UPDATE statement can be a bottleneck. This is because the JRN_SUBSCRIBER field is not indexed, meaning that Oracle has to do a full table scan to find every row with a matching value. On a few interfaces I had at a client, this step was taking an inordinate amount of time. So I added an index on the JRN_SUBSCRIBER column (and eventually added one for the combination of JRN_SUBSCRIBER and JRN_CONSUMED), and performance increased dramatically, as Oracle was now able to avoid a full table scan.

The first time I made this change was over a year ago and I haven’t seen any adverse effects or otherwise had any problems, and it dropped processing time for this one step from over 30 minutes to well under a minute.

If you have large, multi-subscriber journal tables being worked on by performance-critical interfaces, you might want to open up Operator and see how long they are taking and if a carefully placed index could benefit them.


ODI journalization crash course

I do seem to love me some ODI journalization. It has been a really great feature that I have implemented countless times to great effect. For those not familiar, journalization in ODI is where you tell ODI that you want to record what rows get updated and inserted in a particular table, such that when you load data out of that table, you can choose to work with just those updated rows.

This is useful for many obvious reasons: only processing data that’s new or changed is usually faster than processing a whole table. When you are dealing with millions or billions of rows, it’s a necessity. Journalization is also a nice replacement for processing data based on a time window.

For example, I have reworked many systems where the data to be processed was everything since sysdate - 1 or some other time period, and the automation runs daily. This can work. But what if something goes wrong and you don’t catch it for a day? Well, you go in and tweak things and run a one-off job to fix it. Or you have other coping mechanisms. What if data from a month ago is updated? Many times just processing anything that changed is the more effective strategy.

It’s how journalization works under the hood, though, that is the focus of this post (for performance reasons). When you journalize a table (at least for the simple Oracle implementation), ODI installs a trigger on that table so that updates and inserts are recorded somewhere.

In this case, that somewhere is your work schema. If you were to journalize a table called CUSTOMERS whose primary key was CUSTOMER_ID, then you would end up with a journal table called J$CUSTOMERS that has the following columns:

  • JRN_SUBSCRIBER
  • JRN_CONSUMED
  • JRN_FLAG
  • JRN_DATE
  • CUSTOMER_ID

The structure will always be the four JRN_ columns and the primary key (which can be multiple columns but is often just one) of the table being recorded. This simple table records everything you need to know about the new and updated rows.

Here’s a quick crash course in these columns:

JRN_SUBSCRIBER: You supply a text name indicating what the consumer of the data will be. For example, you may have two systems that independently want to process new and updated customers, so you might have two subscribers: MARKETING and CRM. Rather than creating two separate journal tables, we just have a column in here to differentiate the subscribers.

JRN_CONSUMED: This defaults to 0 and stays 0 until data gets processed out of the journal. This column provides a really elegant solution to making sure you don’t lose data when processing journalized data. When an ODI interface runs to process data out of a journal, it sets the JRN_CONSUMED column to 1 for the rows it is going to process. This does a couple of things. One, any rows that are added to the journal after a job starts do not get lost or otherwise suffer from some timing issue; they will just get picked up on the next processing job. Two, if the ODI interface fails for some reason before it completes successfully and you need to re-run the job, you won’t have lost the journalized rows: they are only dropped when the job completes successfully (it’s one of the last cleanup steps in the job).

JRN_FLAG: Indicates whether the row was inserted or updated (‘I’ or ‘U’), although in practice you will notice that the value is always ‘I’, indicating an inserted row (even if it was actually just an update). This is because later on in the job, ODI will mark rows that are actually updates with a ‘U’ and then handle the processing accordingly.

JRN_DATE: The date (time) that the row was recorded into the journal.

With all of this infrastructure in place (plus a couple of views that ODI puts in for you), journalization is ready to go. You design your interface as normal, and if you want to process just the new/updated rows, all you have to do is check a single checkbox on the Mapping tab (Use Journalized Data). ODI handles the rest for you.

This is a really killer feature of ODI and can clean up countless home-brew solutions that are fragile, complex, and code-intensive.

That being said, the original purpose of this post was to actually talk about a performance issue with ODI journals, but it was necessary to provide some context with an overview of journalization first. With that out of the way, check back tomorrow for a tweak you can make to your journal tables that might speed things up if you are processing large amounts of data!

ODI Oracle Merge Incremental Update burns sequence values

I use Oracle Data Integrator for quite a bit of ETL work. Most of it tends to be with Oracle databases, but also a fair bit with MySQL, SQL Server, and others. A common integration is to update data in a table based on a primary key, and insert new data if existing data doesn’t exist.

It’s also common to insert a sequence value into a row that you populate in a table somewhere. Sequences in Oracle are this mechanism where you basically ask them “Hey, give me the next number in something we’re counting!” and then plug that value in. They are commonly used to generate unique values for rows that get inserted somewhere.

Sequences aren’t random numbers – they start off at 1 (or whatever you configure them to) and then count up, and up, and up. Like up to the number 999999999999999999999999999 (that’s huge!). In fact, there are so many possible sequence values that if you happen to use a few by mistake, it’s not that big of a deal. Just use a new sequence value! It’s not uncommon to see gaps in the sequence values on a table’s rows – this just happens due to the way data is processed and how the Oracle server will return blocks of sequence values to be used in a given operation, then discard any that don’t get used (this might seem wasteful, but it’s not).

In any case, I was alerted to something interesting in a client’s ETL job a while ago: sequence values in a table being updated by ODI were starting to seem abnormally large. In other words, this table generally has about 100,000 rows in it, and they are used to the sequence values being less than a hundred million or so. All of a sudden the sequence values were trending much higher.

It turns out that the job in question uses the Oracle MERGE IKM to update rows. This is a strategy where existing rows (identified by a particular primary key) are updated, and new rows are inserted. MERGE is kind of an elegant way of writing this without having to have separate UPDATE and INSERT steps. In fact, MERGE is thought of as an “upsert” due to being a combination of the two.

Under the hood, it seems that Oracle rewrites MERGE queries into something resembling two separate steps. And the way that Oracle rewrites the query is such that it burns or otherwise needs to use a sequence value for every row in the table, on every run. So if there are 100,000 rows in the table, about 100,000 sequence values are getting used on each run, even if only one row is being updated. Couple this with the fact that the job runs hourly and almost always processes data, and you have a recipe for sequence values to go up pretty fast.

Of course, we could literally burn one trillion sequence values on every run of the job, and schedule the job to run a trillion times a day, and still not have to worry about running out of sequence values. But it does seem kind of wasteful.

For this reason, you might want to use one of the other Oracle Incremental Update IKMs, such as the traditional INSERT/UPDATE.

Late note: I wrote all this up and it seems that the IKM has since been modified so that the MERGE statement filters out rows that already exist unchanged in the target, thus reducing the number of rows in the MERGE data set, and thus lowering the rate at which sequence values get burned. So I think the stock MERGE IKM is still fine to use with sequences in the target. It’s still true that MERGE will use sequence values on non-updated rows; it’s just that ODI tries to be smart and reduces the update recordset to not include those rows in the first place.