Last time I gave sort of a high-level look at pounding the rigid financial page definitions from the corporate overlords upstream folks, and generating the outlines based on this information. So now I have some nice fresh outlines, but I don’t have any processes for data just yet. I have a couple of issues I need to address:
- History is going to be restated each period so I need to go out and fetch all new data each period
- I need to load all of the data as fast as I can
- I want the data load process that runs each period to be as robust as possible.
Well, let’s start with the basics. The data I want is upstream in a DB2 server. One way to go would be to talk to the ETL group and see if they can drop off files for me each period. This isn’t really a bad option because they already deliver some pretty similar information in a similar fashion. But I don’t really feel like doing all the paperwork and cutting through the red tape right now. Since I already have a working ODBC connection, I’m just going to use this. I know, I know, ODBC is gross. I can’t really use JDBC at the moment because this particular DB2 server is z/OS and you apparently need a special .jar file to get JDBC connections to work (some kind of licensing issue). My emails to the DB2 folks about this have apparently been routed to /dev/null, and I can’t just go to the IBM download site and grab it. So that’s out for now. So let’s go with ODBC.
I determined earlier that the raw data for each period is just south of 200 megabytes. That adds up to a few gigs when you take in to account all the periods I want to look at. During the night, when network activity is relatively quiet, I can pull this down in about 20 minutes for a single period. For some reason I think this should be faster (a lot faster), but the speed seems constant regardless if I use JDBC or ODBC, so at the moment it seems like more of a network thing which I can’t change (I’m pretty sure that there is fairly extensive packet shaping going on… it’d be nice if I could have this port loosened up a bit). Maybe that’s just as fast as the DB (or my user ID on the DB) is allowed to go at. In any case, it’s tolerable, because I basically have the entire period (or one week, at the least) to pull down the history. The server has a tendency to kill my connections though, so I can’t get it all at once. I wouldn’t really prefer that option anyway, since I’m just throwing this into a text file, so I’ve decided that I would like one period per text file.
I wrote a small Python script that will read a configuration file, take some command-line parameters, and go grab all the data I want and put it into the folder I want with a particular naming convention (one period of data per file). It’s pretty nice. Source is available if anyone wants it (it’s a few pages long — nothing fancy). I used pyodbc to connect against the already setup DB2 ODBC connection. This same Python script is also flexible enough so that I can use it to pull down a single period of data — this will come in handy for scripting the period load process.
Each row of data consists of a division, location number, year, period, account, and amount. Recall that there are four different cubes in this instance that are modeled after different financial pages. The thing with this approach is that an account that is on one page may also be on another page. And based on just the account number, I don’t have a way to determine which pages the account belong to. So loading this data straight to the cubes with a load rule is out. And yes, I know that you can load data with EIS, but in this case I prefer to script the process in MaxL. Another thing about the databases is that they split out different departments in them, so I will need a way to associate the account to a department. This means that I will be loading the data files to my SQL Server database and massaging it a little bit. The massaging will be fairly minimal, because I have setup a view for each database that will do a LEFT JOIN of the ‘raw’ data to a table that associates accounts to departments to pages. I could do the JOIN in the load rule itself but that seems a bit messy, and since there are only four databases, this seems sensible.
Keeping in mind that my goal here is to be able to load all four cubes as fast as I possibly can, my process that runs at the close of each period looks something like this:
- Pull down current period of data
- Load data to SQL server
- Process Cube 1
- Process Cube 2
- Process Cube 3
- Process Cube 4
I’m using the word Process here rather than Load because these as ASO cubes, part of the process will be to design the aggregations on them that will make queries nice and snappy. The process that needs to run sometime during closing week and can be done ahead of time looks something like this:
- Pull down all of the historical data
- Load all of this data to SQL server
- Pull down the latest cross-reference/locations data
- Update outlines in staging cubes
- Load current data to staging cubes for validation (so we don’t get caught by any surprises if the cross-reference changed in some kind of wacky way).
What are the staging cubes? They are duplicates of the four databases but suffixed with an underscore. My original strategy with regard to these cubes was to load them, agg them, and then copy them on top of the other cubes so that the period load process would not result in any downtime for the ‘real’ cubes, but along the way I have discovered that I can load these babies up fast and in a scripted fashion, and that I am better off just loading direct. Also, the MaxL process ‘create or replace…’ seems to really shake the server to its knees and doesn’t have the best performance. Right now I keep the “_” cubes around for testing purposes. This hasn’t resulted in a duplicate of scripting functionality though, I just have a text file with a list of app/db names that I pass to a batch file, so I can choose which cubes to work with just by feeding a different configuration file.
The load process for each cube, then, simply clears the data, loads it (directly from SQL Server), and aggs it. I have tweaked the essbase.cfg to try and use 16 threads (the servers are quad-proc Xeons) and the speed seems much improved even from specifying 8, and definitely faster than none at all. The connection between the SQL server and the Essbase server is all gigabit with a theoretical 125 MB per second of throughput, although in actuality the throughput is much less than this. Altogether, just under a billion or so facts are loaded up and calculated (aggregrated) in about 40 minutes.
It’s not necessary right now, but a few areas that I could shave time off of would be to pre-stage the historical data that is in SQL to a text file on the Essbase server itself, then load this to a buffer, then load the new data. I could also use a different load rule that doesn’t need to prefix the various members with our naming convention (Lo. for a location, etc), so that the load rule has to do less ‘thinking,’ but again, this hasn’t been necessary. If it were, I’d be very curious what kind of performance we could squeak out by attaching three more drives to the storage array and setting up a partition on them so that it was separate than the drives with the Essbase data, then the disk wouldn’t be needing to read and write from the disks at the same time.