BI and Custom Development

What You Should Expect from Your Data Warehouse

Posted by admin on October 01, 2013  /   Posted in BI and Custom Development, Business, Data Best Practices

When it comes to the benefits of having a Data Warehouse (DW from here on), most of our clients are already cognizant of its critical role as part of a business system.  What many still don't realize is that there are good DWs and not-so-good ones.  Of course everyone is aware that both kinds exist, but not everyone has taken the time to think through how to tell them apart.

A Good Data Warehouse Goes (Far) Beyond a Database

The number one question I get when I talk to anyone about the services we provide to our customers is the difference between a Database and a Data Warehouse.  A lot of people associate a data warehouse with storing data (just like a physical warehouse)  — and that's about it.

Truth is, a good DW serves the business in at least five areas in addition to hosting business data:

  1. Automatically applies business rules
  2. Automatically generates reports (both on-demand and periodic)
  3. Automatically notifies interested parties of data events
  4. Allows data owners to look at their data in multiple dimensions
  5. Allows for data error detection, investigation, and correction

When you are building a DW, the above functions must be present in the planning, implementation, and testing phases.  Otherwise, you'd be left with basically just a database.
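To make function 1 concrete, here is a minimal sketch of a business rule applied automatically in batch against the warehouse. The `dw_orders` table and the "late shipment" rule are hypothetical, chosen purely for illustration:

```python
import sqlite3

def apply_late_shipment_rule(conn):
    """Apply one business rule in batch: mark orders shipped past their
    promised date with status 'LATE'. Returns the number of rows updated."""
    cur = conn.execute(
        """
        UPDATE dw_orders
           SET status = 'LATE'
         WHERE ship_date > promised_date
           AND status <> 'LATE'
        """
    )
    conn.commit()
    return cur.rowcount

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dw_orders "
                 "(id INTEGER, promised_date TEXT, ship_date TEXT, status TEXT)")
    conn.executemany(
        "INSERT INTO dw_orders VALUES (?, ?, ?, ?)",
        [
            (1, "2013-09-01", "2013-09-03", "SHIPPED"),  # shipped late
            (2, "2013-09-01", "2013-08-30", "SHIPPED"),  # shipped on time
        ],
    )
    print(apply_late_shipment_rule(conn))  # 1 row flagged
```

The point is not the specific rule but the shape: the rule lives in the DW as a scheduled batch step, not buried in the transactional application.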

A Good Data Warehouse Grows into Maturity

In the next blog entry, I'll discuss the different levels of DW maturity in detail. There is a lot of misconception out there regarding how to assess the reliability and usefulness of a DW. Most people expect a DW to be mature at the end of the initial implementation project. This is not realistic given the typical complexity of the metadata, data, and business rules surrounding it. The correct way to proceed with a DW-building project is to realize that a DW goes through certain phases that bring it along the path to maturity.

A Good Data Warehouse Is Redundant

An experienced BI consultant will build one or more copies of the DW that can be switched at any given time.  Not only does this provide a good development and testing environment for analytic outputs (reports, dashboards, ETLs), but it also lets you switch to a known-good copy when you have to re-run an ETL or recover from a failure in the middle of a busy day.

A Good Data Warehouse Highlights Data Problems

Due to its proximity to the people who do the accounting and forecasting in the organization, a good DW is usually the first place where these people notice data inconsistencies.  A good way to take advantage of this characteristic is to put in place monitoring ETLs that run periodically, executing queries designed to check the consistency of the data.
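As an illustration of such a check (the table and column names here are hypothetical), a monitoring query might compare each order header's stored total against the sum of its line items and flag every mismatch:

```python
import sqlite3

# Flag every order whose stored header total disagrees with the sum of
# its line items. A scheduled monitoring ETL would run this periodically
# and notify the data owners of any rows it returns.
CONSISTENCY_CHECK = """
SELECT h.order_id, h.total, SUM(l.amount) AS line_total
  FROM dw_order_header h
  JOIN dw_order_line l ON l.order_id = h.order_id
 GROUP BY h.order_id, h.total
HAVING h.total <> SUM(l.amount)
"""

def find_inconsistent_orders(conn):
    """Return (order_id, stored_total, computed_total) for each mismatch."""
    return conn.execute(CONSISTENCY_CHECK).fetchall()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dw_order_header (order_id INTEGER, total REAL)")
    conn.execute("CREATE TABLE dw_order_line (order_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO dw_order_header VALUES (?, ?)",
                     [(1, 30.0), (2, 99.0)])
    conn.executemany("INSERT INTO dw_order_line VALUES (?, ?)",
                     [(1, 10.0), (1, 20.0), (2, 50.0), (2, 40.0)])
    print(find_inconsistent_orders(conn))  # [(2, 99.0, 90.0)]
```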

A Good Data Warehouse Reduces the Load on the Transactional System

A lot of the systems that maintain transactional data input from users are inundated with processes that shouldn't belong there in the first place.  These processes typically check data, apply business rules, and monitor data validity, which takes resources away from what Transactional Systems should mainly be about: collecting data and preserving its integrity.

Now the follow-up question is: how do we know which parts of the Transactional System should be offloaded to the DW?

A good rule of thumb: whenever a process can be executed against the data in batch mode, that is an indication it should be taken off the Transactional System and moved to the DW. And by the way, it is not taboo for the DW to pump data back into the Transactional System when a valid use case can be made.

In Summary
There are many other ways to assess whether a DW is "good" or "not good enough," but all of them revolve around the relationship between the DW and the existing Transactional Systems on one side and, on the flip side of the coin, the decision-maker team. A good DW should always bring clarity and bridge the communication barriers caused by incongruent views of the data held by different parts of the business. Hopefully this gets you started in the correct line of thinking as you embark on your DW projects.

ZK + TimelineJS + JQuery = Visualizing Business Data Along The Time Dimension : Part Two

Posted by admin on August 13, 2013  /   Posted in BI and Custom Development, Business, Web Development, ZK

by Cancan Gunadi

In our previous blog, Visualizing Business Data Along The Time Dimension, we showed one of the ways to visualize business data along the time dimension in a meaningful and engaging way.

In this blog, we'll show you the technical details behind the implementation. We used TimelineJS, an interactive web-based timeline tool, and integrated it into the system that we're building for our client. The system itself is built using the ZK Framework.

First, in case you haven't already, you can visit the actual Live Demo of this timeline on our server here (log on using the provided credentials).

We assume some familiarity with the ZK Framework and TimelineJS libraries; we recommend visiting their respective documentation.


The timeline area is contained inside a ZK iframe tag in one of the ZK .zul files, like so:

<window id="mainWin" apply="${mainCtrl}">
    <iframe id="iframe" src="jsp/timeline.jsp"/>
</window>

The iframe loads a JSP file containing the TimelineJS script, shown below. Some details of the JSP and zul files have been removed for the sake of simplicity.

<%-- normal JSP imports here... --%>

   <!-- jQuery -->
   <script type="text/javascript" src="../timelinelib/jquery-min.js"></script>
   <!-- BEGIN TimelineJS -->
   <script type="text/javascript" src="../timelinejs/storyjs-embed.js"></script>

   <script type="text/javascript">
       $(document).ready(function() {
       <% if (docs != null && docs.size() > 0) { %>
           var dataObject = {
               "timeline": {
                   "headline":"Progress Info",
                   "type":"default",
                   "text":"<%= user.getFullName() %>",
                   "date": [
                   <% for (Document doc : docs) { %>
                       {
                           "startDate":"<%= doc.getDate() %>",
                           "headline":"<a class='timeline-doc-link'
                                          doc_id='<%= doc.getId() %>'
                                          href='#'><%= doc.getTitle() %></a>",
                           "text":"<%= doc.getText() %>"
                       },
                   <% } %>
                   ]
               }
           };

           createStoryJS({
               type:       'timeline',
               source:     dataObject,
               embed_id:   'timelineContainer',
               font:       'Arvo-PTSans',
               start_at_end: true
           });

           // div with id "timelineContainer" is the div that wraps
           // around the entire timeline area, defined in the <body>
           // section.
           $('div#timelineContainer').click(function(event) {
               var clickedElem =
                   document.elementFromPoint(event.pageX, event.pageY);
               if (clickedElem.nodeName == 'A'
                       && clickedElem.className == 'timeline-doc-link') {
                   // parent is [object Window] according to firebug
                   parent.zAu.send(
                       new parent.zk.Event(parent.zk.Widget.$('$mainWin'),
                           'onDocRequestedFromIframe',
                           clickedElem.getAttribute('doc_id')));
               }
               return event;
           });
       <% } else { // if no item to be shown ... %>
           $('div#timelineContainer').html(
               '<span style="font-size:75%;color:grey;">' +
               'No progress information submitted yet.</span>');
       <% } %>
       });
   </script>
    <!-- END TimelineJS -->

    <div id="timelineContainer"></div>

The bulk of the JSP is the TimelineJS script, which is dynamically generated using a JSP scriptlet for-loop to insert the available items (or "documents" in this case) into the timeline.

Making the Timeline More Useful

To make the timeline more useful, hyperlinks were added that can be clicked to open new ZK Tabs containing the details of a timeline item. To achieve this, the JavaScript code in the timeline script needs to be able to send an event to the enclosing ZK container so the ZK application can respond and open a new ZK Tab.

Inside the loop, the <a> element is tagged with the CSS class "timeline-doc-link" and a custom attribute "doc_id" which carries the id of the document.

Next, jQuery code adds a click handler to the "div" element with id "timelineContainer", which is a wrapper around the timeline (the div is defined in the <body> of the HTML).

The handler first retrieves the element that received the click and checks whether it is really an anchor element with the special CSS class.

The event-sending code is the real key to enabling interaction between the timeline and the rest of the application; it takes advantage of a very useful ZK feature for sending an Event to a parent component. Using the "parent.zAu.send" method, we send an event (called "onDocRequestedFromIframe") to the parent ZK component "mainWin" (refer to the ZUL snippet shown earlier). From there, the ZK main controller (Java code) attached to the main Window component can respond by opening a new ZK Tab to show the requested document.

The result is a nicely integrated timeline that can interact with the rest of the ZK application via ZK Events, as shown in the following figure. Even though the timeline script resides within an iframe, this approach allows it to invoke Tabs, Menus, and other components within the containing ZK application.

Figure 1 – Seamless Integration of TimelineJS into ZK Application

In a Nutshell

We have shown one of the ways to visualize business data along the time dimension in a meaningful and engaging way, and how to implement it using the ZK Framework, TimelineJS, and a little bit of jQuery and JavaScript. We have also shown how to enable seamless interaction between the TimelineJS script and the containing ZK application using ZK Events.

We hope this post can be beneficial for anyone who may be looking to implement similar functionality.

Please also check out our main blog page where you can find useful information regarding Data Management, Business Intelligence, and more (click here).

Visualizing Business Data Along The Time Dimension

Posted by admin on June 18, 2013  /   Posted in BI and Custom Development, Business, Web Development, ZK

by Cancan Gunadi

From time to time businesses are faced with the need to visualize their business data along the time dimension. It is very valuable for the businesses to have a way to visualize information such as milestones or key events in a meaningful way. This visualization answers questions such as “When was the last Sales Milestone achieved?”, “How do I see the progress of XYZ?”, or “What was the last step completed by a training class participant, and when?”

One way we have found to be both effective and engaging is to use a "timeline": a sequence of related events arranged in chronological order and displayed along a line (usually drawn left to right or top to bottom). Data can be aggregated through a nightly ETL job, and key events made available to be plotted on the timeline by the same job.


In our implementation, we used TimelineJS, an interactive web-based timeline tool, and integrated it into the system that we’re building for our client. The system itself is built using ZK Framework. I will write more about the technical detail of how this is done in my next blog.

Here's an example of what the timeline visualization looks like. Please visit the actual live demo on our server here (log on using the provided credentials).


In this mock-up example, we show sales milestones on the timeline. The timeline is interactive: the user can slide the timeline bar (the bottom half of the timeline widget) with the mouse to the left or right, or even swipe with a finger on a touch screen. Clicking each "flag/event" on the timeline brings up a related bar chart above the timeline bar. If the user needs more detail, they can click the hyperlink next to the bar chart to open a new tab with all the details regarding the milestone (see my next blog post for interesting technical details about the interaction between TimelineJS, the containing HTML inline frame, and the ZK Framework). The user can also navigate to the previous milestone by using the arrow button to the left of the bar chart.

In a Nutshell

We have shown one of the ways to visualize business data along the time dimension in a meaningful and engaging way. Using this timeline, businesses can quickly see when a key event occurred in relation to other and previous key events.

Stay tuned for my next blog post, where I will describe the technical details of how to implement this timeline visualization using TimelineJS and the ZK Framework.

Please also check out our main blog page where you can find useful information regarding Data Management, Business Intelligence, and more (click here).

Seven Tips to Successfully Start Your BI System

Posted by admin on July 31, 2012  /   Posted in BI and Custom Development, Business, Data Best Practices

Given a mountain of data that comes from different sources in different flavors, it can be a daunting prospect to get it all in line.  Here are seven simple steps that can help you wade through the first phase of implementing a successful BI process that will not only solve the immediate problems, but also grow with your company.

  1. Trace your steps backwards: begin with the reports that you must have in order to run your business.  Do not start with the data itself.  A lot of BI implementations make this mistake, which soon overwhelms everyone involved and contributes to a slow start.
  2. Identify two types of reports: those that are very important for strategic decision making (annual bank loan reports, ISO certifications, etc.) and those that are important for tactical everyday decisions (daily status reports, daily manufacturing job schedules, warehouse inventory, customer returns, etc.)
  3. Split your attention between the two types of report you identified above.  Pick one report from each type and work on both at the same time.  By working on two different sets of data types simultaneously, you will gain more insight into how to transform the existing data to satisfy both types of reports.  Failing to do this will cause you to backtrack a lot, and that means wasted effort.
  4. Make use of your Time Dimension table. This is the lookup table consisting of different facts about one measure of time (usually a day).  In a lot of BI projects, people neglect this very useful table, opting instead to create complex SQL queries, which end up being error prone, difficult to maintain, and a big time waster.
  5. Be mindful of your database's idiosyncrasies.  Know what NULL values mean in a field, especially the ones that will be part of a calculation. Know what it means to say BETWEEN two dates.  And no, non-SQL databases are not immune to these quirks; they are just as susceptible to unfamiliar and untested usages.
  6. Prepare two identical databases that contain your BI tables, and set up two instances of your ETL, each building one database separately. This way you can switch back and forth when you are testing or have to re-run the ETL in the middle of a busy day.
  7. You will not catch every bug, and there is no perfect BI system. Test, and re-check your tests; that's why you have two identical databases.  BI is an ongoing process, therefore it is important to come up with a good system and an implementer who understands this.
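To illustrate tip 4, here is a small sketch of how a Time Dimension table can be pre-computed, one row per day, so reports can join against plain columns instead of re-deriving date facts in every query. The column choices are illustrative, not prescriptive:

```python
from datetime import date, timedelta

def build_time_dimension(start, end):
    """Yield one row of date facts per day, from start to end inclusive."""
    d = start
    while d <= end:
        yield {
            "date_key": d.isoformat(),
            "year": d.year,
            "quarter": (d.month - 1) // 3 + 1,
            "month": d.month,
            "day_of_week": d.isoweekday(),   # 1 = Monday .. 7 = Sunday
            "is_weekend": d.isoweekday() >= 6,
        }
        d += timedelta(days=1)

rows = list(build_time_dimension(date(2012, 1, 1), date(2012, 12, 31)))
print(len(rows))  # 366 (2012 is a leap year)
```

Loaded once into a table, these rows turn fragile date arithmetic in report queries into simple joins and WHERE clauses.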

By following these simple steps and precautions, you will minimize backtracking in the middle of the project.  More importantly, by starting the project the right way, you will avoid the burnout and frustration that would otherwise have an adverse impact on the quality of the data at the end of the process.

Enterprise Resource Planning (ERP) and Why You Should Care

Posted by admin on July 17, 2012  /   Posted in BI and Custom Development, Business

No matter the size of your organization, if its revenue is growing, sooner or later you will benefit from a more consolidated approach to managing your business data.

An ERP system is designed to provide a consistent user interface to an integrated database backend where data from different functional departments within your organization will be processed and stored.

A typical ERP system is configured to handle typical business functions (usually handled by personnel in each department) such as: Accounting, Orders and Invoices management, Customer and Vendor Relations, Manufacturing, Shipping and Logistics, Warehouse and Inventory management, etc.

An exceptional ERP system, however, is also marked by its ease of use and, just as important, how easy it is to customize and modify. Not all businesses operate the same way; in fact, no two really do.

So to summarize some obvious benefits:

  • Lower software license costs
  • Less time spent importing and exporting data between applications
  • Reduced redundant data entry
  • Increased speed and clarity in data flow and personnel workflow
  • Increased data accessibility among different functional departments
  • System-wide upgrades and enhancements
  • Simplified training for new personnel

To illustrate the above benefits, here is an excellent presentation by OpenERP.

In summary, for a healthy organization (and especially for a not-so-healthy one), a move to adopt an ERP system can lead to what every organization needs: cost reduction and increased productivity.

There is but one thing between these benefits and your company: the effort it takes to choose and implement an ERP system.

Business Owners, Are You Taking Care of Your Data?

Posted by admin on March 19, 2012  /   Posted in BI and Custom Development, Business

by Will Gunadi


Regardless of the size of your business, once it has been running for a while, you will have at hand what all businesses accumulate over time. No, in this case we are not talking about profits or debts; rather, it is about data.

Just by operating, businesses gather various types of data. More importantly, business data is the lifeblood of every business decision made. The higher the data quality, the better the decisions you can make.

The problem is, when it's not sufficiently reviewed and monitored, accumulated data starts to lose its integrity and accuracy, resulting in misleading indicators and measurements, which yield two of the most damaging factors for a business: loss of opportunity and hidden costs.

How does business data go bad?

Over time, any computerized business system will start to accumulate bad data. This is not mysterious; in fact, as a savvy business owner, you should expect it. Here are some of the ways bad data creeps into a system:

  • Bad data entered by users that is not caught by the existing validation rules
  • Policy or business rule changes
  • Program bugs introduced while implementing those policy changes
  • Structural changes to external data (e.g., tax rates, zip codes, area codes, ISO specifications)

In essence, your computer systems change as your business changes due to modifications to internal policies or to external regulations. It is inevitable that during these modifications, bad data is entered and processed into the systems. Sure, we could minimize the negative impacts with good software quality testing practices, but we would not be able to catch them all.

The only effective way to combat bad data caused by these unavoidable changes is to have a system in place that allows you to see the data comprehensively, so you can locate bad data early, before it becomes a real problem.

Meet OLTP and OLAP systems

In this day and age, Online Transaction Processing (OLTP) systems have become the norm, even for small to medium businesses. Most people nowadays rely on some kind of computer-based ERP (Enterprise Resource Planning) system to run their businesses, maybe a consolidated email/calendar package, an accounting package, some combinations of order entry, invoicing, shipment, warehousing modules, all of which are backed by a database system. These are the typical components of an OLTP system.

This is where most small and medium businesses stop, which is fine apart from the fact that you then have only one way to view your business data, and a non-intuitive one at that, making data quality maintenance harder than it should be.

An OLAP (On-Line Analytic Processing) system was designed to provide you with an alternate way to view and analyze your business data; more accurately, a much more effective way.

Large corporations and big businesses spend billions of dollars every year to develop, utilize, and maintain OLAP systems to complement their OLTP ones. They know that without a constant effort to monitor and measure business data, it can get out of hand very quickly.

Unfortunately, for most small to medium businesses, the value of setting up an OLAP system has not yet become clear. Why? Partly a mindset issue. To understand the value, you have to:

  1. Realize the two distinct approaches in handling the same business data, one is geared towards data entry and processing (OLTP), the other, data analysis and decision-making (OLAP).
  2. Accept that analyzing your own data is not a waste of time and resources; rather, it is a vital part of not only running your business but – more importantly – improving it.

To add insult to injury, implementing a typical OLAP system tends to be a long-term, expensive, and serious commitment. While there is absolutely nothing wrong with a long-term and serious commitment to improving data quality, the expensive part prevents a lot of businesses from even trying to set up and utilize an OLAP system. Which is a pity.

Why can’t I just use one system?

The difference between OLTP and OLAP starts with the underlying structure that holds your business data. An OLTP system's main purpose is to capture your business data in real time: order entries, invoice generation, shipment records, warehouse inventory movement, accounting journal entries, commission calculations, sales tax, etc.

A good OLTP system has a solid data model that dictates how the above data is stored and processed. Unfortunately, this very strength makes the data extremely efficient for computers to process but not easily observable by humans, at least not without a lot of lookups and remembering bits and pieces of information all over the place; again, something computers are designed to deal with.

Data quality decreases when no one reviews the business data. Any statistician will tell you that the accuracy of a record is proportional (up to a certain point) to the number of reviews that record receives.

The same principle applies here. An OLAP system is designed to allow you to review your business data in a manner that is effective for humans, rather than computers, to analyze. And because it is a computerized system, it also provides the automation and number-crunching facilities we can use to generate meaningful reports.
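To make the contrast concrete, here is a small sketch (with hypothetical star-schema tables and toy rows) of the kind of human-readable aggregate question an OLAP-style schema answers with one simple join, where a normalized OLTP schema would require chaining through order, line-item, product, and customer tables and re-deriving the date facts inline:

```python
import sqlite3

def star_schema_demo():
    """Build a tiny star schema and answer a business question with one
    readable query: January 2012 sales for customers in the 'West' region."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_time (date_key TEXT PRIMARY KEY, year INT, month INT);
        CREATE TABLE dim_customer (customer_key INT PRIMARY KEY,
                                   name TEXT, region TEXT);
        CREATE TABLE fact_sales (date_key TEXT, customer_key INT,
                                 sales_amount REAL);

        INSERT INTO dim_time VALUES ('2012-01-15', 2012, 1),
                                    ('2012-02-10', 2012, 2);
        INSERT INTO dim_customer VALUES (1, 'Acme', 'West'),
                                        (2, 'Globex', 'East');
        INSERT INTO fact_sales VALUES ('2012-01-15', 1, 100.0),
                                      ('2012-01-15', 2, 40.0),
                                      ('2012-02-10', 1, 60.0);
    """)
    total = conn.execute("""
        SELECT SUM(f.sales_amount)
          FROM fact_sales f
          JOIN dim_customer dc ON dc.customer_key = f.customer_key
          JOIN dim_time dt     ON dt.date_key = f.date_key
         WHERE dc.region = 'West' AND dt.year = 2012 AND dt.month = 1
    """).fetchone()[0]
    return total

print(star_schema_demo())  # 100.0
```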

Okay, so what can OLAP systems do for my business?

In a nutshell, a fully operational OLAP system should help you to:

  1. Discover hidden information
    • Your business data resides everywhere, not just in your OLTP system database, and it's not always obvious how the pieces relate to each other.
    • Discovering hidden data may give you insight into aspects of your business that need attention.
  2. Identify and fix data integrity problems
    • Keep your business data accurate and up to date
    • Verify data relationships over long periods of time
  3. Plan for data expansion, archiving, and storage
    • As your business grows, so will your data
    • Fulfill requirements to keep complete records (for example, for legal or tax auditing purposes)
    • Provide a staging area for server moves, upgrades, and failure recovery
  4. View your data history, trends, and movement with improved clarity and accessibility
    • Discover new opportunities revealed in existing data
    • Keep track of aggregate data that is not stored by the OLTP system
    • Forecast based on trends
    • Make better decisions

Can my business afford one?

As mentioned above, you should view an OLAP system as a tool to grow your business. It is the natural extension of the OLTP system you already use.

The good news is that with open source OLAP systems such as the Pentaho BI Suite, you now have a good alternative to high-cost systems from SAS, Oracle, Microsoft, or IBM. Open source systems are characterized by freely available installations without licensing restrictions.

Is it any good? As with OLTP systems, it depends on the implementation and, subsequently, the implementor. A well-implemented Pentaho system is a great fit for small to medium businesses, not only because of the zero entrance cost but because of the complete set of tools that are customizable down to the source-code level.

Of course nothing prevents big businesses from using Pentaho, but with a bigger available budget come options to use the other systems.


If your goal is to run a healthy business, the importance of the data quality surrounding it can no longer be dismissed as overhead. A well-planned implementation of an OLAP system should give you easy access to information that may be hidden in your OLTP systems.

Large corporations have known this for a long time. OLAP systems are not merely considered; they are a prominent part of their plans and budgets. Today, with the advent of open source systems such as the Pentaho BI Suite, the benefits of OLAP systems have been made available to a much wider range of business sizes.

There is not a single reason for not considering one. Really.

User-sensitive Dashboard in Pentaho

Posted by admin on January 06, 2011  /   Posted in BI and Custom Development

This time I’d like to address a common task that most of us will encounter at one time or another when we deal with this Business Intelligence (BI) stuff.

The task: Creating info dashboard(s) that presents relevant information based on who the user is.

Implicit in that task is the importance of 1) presenting information for that user and 2) presenting *only* information for that user.  A good example would be a dashboard for a salesperson: it should contain information about contacts and customers as it pertains to said salesperson. No less, no more.

Depending on which way you choose to build your dashboard, the method to accomplish the task differs a bit.  Now you may ask, what are the ways to create dashboards in Pentaho?  As of this writing, there are at least two common ways: CDF and CDE.  CDF is the mainstay method which has been around for a while; CDE is a newer interactive tool produced and contributed by the talents at Webdetails (a company in Portugal).

NOTE: To read on, it's best to familiarize yourself with one of the building blocks of Pentaho, the xaction.  In a nutshell, an xaction is an XML-based declarative mini-language that lets us fetch data either from a data source or from a Mondrian cube, and output the data someplace else (the session object is a good example).

The Question of Identity

To tailor the content of the dashboard with user-sensitive information, we have to get the currently logged-in user from somewhere.  From within Pentaho, you can get this information (and more) by tapping into the built-in objects provided by Spring Security, which is the backbone of the Pentaho authentication system.  The next question is: how do we access this information?

The Road Map

Before we answer that question, let's lay out the steps to accomplish this task.  First, we need to decide whether we need a lookup to map the username (the name users log in with) to a people-friendly name, for example username=jdoe to 'John Doe'.  If your Mondrian schema requires this (as does mine), there is an extra step you need to do.  Otherwise, skip the next section, or read it anyway since it can also be useful in other scenarios.

Seeding the Session

While your login username is already contained inside the 'security' object, we need a place to put the people-friendly name.  A good one is the user session, which retains this information as long as the session is valid.  Fortunately for us, Pentaho already has a hook for running an xaction at this point.  Let me introduce you to pentaho-solutions/system/sessionStartupAction.xml.

This file allows us to register an xaction that is guaranteed to be executed when a user logs in.

Insert the following snippet:

<bean class="org.pentaho.platform.engine.core.system.SessionStartupAction">
    <property name="sessionType" value="org.pentaho.platform.web.http.session.PentahoHttpSession"/>
    <property name="actionPath" value="Analysis/rules/salesman-username-to-name.xaction"/>
    <property name="actionOutputScope" value="session"/>
</bean>

Explanation: We want the Pentaho BI server to execute the actions defined in salesman-username-to-name.xaction, which is located in a path starting from the solution called Analysis.  The third line states that the output of the xaction will be accessible in the 'session' object.  This bean definition represents one action; if you need two, define two beans.

Next, we need to write the .xaction file.  By the way, the best way to learn the mini-language in which these files are scripted is to study the good, working examples under pentaho-solutions/bi-developers.

Here, following that example, I created a directory called 'rules' under my main solution path (which itself is right under the pentaho-solutions directory).

I will highlight the important parts (pentaho-solutions/Analysis/rules/salesman-username-to-name.xaction):

<inputs>
    <user type="string">
        <sources>
            <security>principalName</security>
        </sources>
    </user>
</inputs>

Explanation: This section defines what will be available in the .xaction as the variable 'user', which is seeded with the current value of Spring Security's principalName object.  The tag surrounding the principalName input defines the scope in which the input will be searched.

<outputs>
    <userFullName type="string">
        <destinations>
            <session>salesmanName</session>
        </destinations>
    </userFullName>
</outputs>

Explanation: This section tells us that at the end of the execution of this .xaction, the session object will contain an attribute called 'salesmanName', whose value is determined by the actions below.

<action-definition>
    <component-name>SQLLookupRule</component-name>
    <action-type>fetch full name</action-type>
    <action-inputs>
        <user type="string"/>
    </action-inputs>
    <action-outputs>
        <query-result type="string" mapping="rsFullName"/>
    </action-outputs>
    <component-definition>
        <!-- data source definition omitted -->
        <query>SELECT name FROM some_transactional_table WHERE user_login_id='{user}'</query>
    </component-definition>
</action-definition>

Explanation: This action defines a SQL query that accepts an input called 'user' and passes it into the query; the result set is made available to the rest of the .xaction as 'rsFullName'.

<action-definition>
    <component-name>JavascriptRule</component-name>
    <action-type>Extract full name</action-type>
    <action-inputs>
        <rsFullName type="result-set"/>
    </action-inputs>
    <action-outputs>
        <userFullName type="string"/>
    </action-outputs>
    <component-definition>
        <script><![CDATA[
            var userFullName = rsFullName.getValueAt(0, 0);
            userFullName = userFullName + '';  // coerce to a string
        ]]></script>
    </component-definition>
</action-definition>

Explanation: Unlike the previous action, which is a SQL query, an .xaction can also contain JavaScript code that acts on the available variables.  The one above simply fetches the value out of the SQL query result and assigns it to a declared variable, which has to have the same name as the <output> section above.

There you have it: because we invoke this .xaction file at the start of a user session, any dashboard created can use any values it initializes.

Using the Session variables in Dashboards

Now we are ready to create the user-sensitive dashboards.  I am going to illustrate this using CDE, which has a clean separation between layout, behavior, and data sources.

Using CDA (yet another project from Webdetails) to manage the data sources, we can simply use the session variables initialized by the above .xaction within the queries that make up the CDA contents.

If you are using MDX, it will look something like this:

select {[Measures].[Total Month Sales]} ON COLUMNS,
{[Region].[All Regions]} ON ROWS
from [Sales]
where [Salesperson].[${fullName}]

The resulting dashboard will dutifully display information that is relevant to the logged-in user.  It’s pretty impressive when you think about it.
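For context, here is a sketch of what the CDA data access definition carrying that MDX might look like.  The element layout follows CDA conventions, but the connection ids, the Jndi/Catalog values, and the fullName parameter wiring are illustrative assumptions, not taken from a real deployment:

```xml
<CDADescriptor>
  <DataSources>
    <Connection id="salesOlap" type="mondrian.jndi">
      <Catalog>mondrian:/Sales</Catalog>
      <Jndi>SampleData</Jndi>
    </Connection>
  </DataSources>
  <DataAccess id="monthSales" connection="salesOlap" type="mdx" access="public">
    <!-- seeded from the session variable initialized by the startup .xaction -->
    <Parameters>
      <Parameter name="fullName" type="String" default=""/>
    </Parameters>
    <Query>
      select {[Measures].[Total Month Sales]} ON COLUMNS,
             {[Region].[All Regions]} ON ROWS
      from [Sales]
      where [Salesperson].[${fullName}]
    </Query>
  </DataAccess>
</CDADescriptor>
```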

Tackling Expiration Dates in Pentaho

Posted by admin on December 16, 2010  /   Posted in BI and Custom Development, Data Best Practices

The title may be a little misleading because this article covers a wider scope than just expiration dates.  But I deliberately chose that title because the expiration date is probably the most commonly found case of date-triggered state (or status) changes, which is what this article is about.

Almost all business entities have a state associated with them at any given time.  Sometimes these states are explicitly recorded via some kind of Status field (column), sometimes they are implicit.  But all of them undergo transitions from state to state that are governed by business rules.

This concept is extremely important in the world of BI, where transactional information is processed further to derive meaning, or even to drive decisions.  In this article, we’ll see one way to represent it.

Pre-requisite: This article requires some familiarity with MDX as it is utilized in the Pentaho BI stack (the Mondrian OLAP engine, to be exact).

The Schema

Every MDX query runs against a schema, so let’s cover some ground rules that are not always easy to find online:

  1. Refresh this string of terminology in your mind: [Dimension.Hierarchy].[Level].<member>.<member-property>.  This is the naming hierarchy of a dimension in Mondrian-flavored MDX (the flavor we use in this article).
  2. A schema may have more than one Date/Time dimension.  I didn’t know this and had to find it out through necessity.  It actually is quite a useful feature.
  3. A schema may have multiple dimensions that share one single database table.  Sometimes you *have to have* these in order to display different columns of the table in the result using CrossJoin().  SIDE NOTE: I had an interesting conversation about this with Roland Bouman, who regards it as a peculiar limitation not imposed by any technical reason.
  4. Members of two Time dimensions can be used as a condition specifier in functions that accept one.  For example: Filter( [TimeDim1].[Level].Members, [TimeDim1].CurrentMember.Name > CurrentDateMember([TimeDim2], "<some-formatting>").Lag(30).Name )
    IMPORTANT: Notice that the comparison is made between the ‘name’ property of the two Time dimension members.  You can’t seem to make the comparison at the member level directly.

The Problem At Hand

Let’s say your Customers have different states that they transition between based on business rules.  For example, customers that haven’t ordered any gadgets from your company for more than two years are given INACTIVE status — which, by the way, is a red flag for your sales force that it’s time to go after them again.

Let’s say you want to show this on your salespeople’s Pentaho dashboards, so they can be reminded of this fact every day they log into your system.

One Possible Solution

One way to tackle this is to create an ETL script that attaches a date column to the Customer dimension table.  Why the customer dimension?  Why not the fact table itself?  Two reasons:

  1. The Customer dimension may be used by more than one fact table; if we record when a customer switches status in the dimension, we only have to do it once.
  2. The Customer dimension has the correct granularity.  The status changes for a customer, not at the orders level, not at any other level.  So the dimension it is.

After we attach the status-change date onto the Customer dimension, we can use it in an MDX query, which makes sense whether we are talking about reports, cubes, or dashboards.
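A minimal sketch of such an ETL step, in MySQL-flavored SQL.  The table and column names here are invented for illustration, and a real ETL job would of course handle incremental loads rather than a blanket UPDATE:

```sql
-- One-time DDL: add a status-change date to the customer dimension.
ALTER TABLE dim_customer ADD COLUMN status_switch_date CHAR(8);

-- Stamp customers whose status just flipped, using the YYYYMMDD string
-- format so the value sorts ordinally in MDX string comparisons.
UPDATE dim_customer
SET    status             = 'INACTIVE',
       status_switch_date = DATE_FORMAT(CURRENT_DATE, '%Y%m%d')
WHERE  status = 'ACTIVE'
  AND  last_order_date < CURRENT_DATE - INTERVAL 2 YEAR;
```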

How do we use it?  Consider the following query:

select NON EMPTY {[Measures].[Sales]} ON COLUMNS,

NON EMPTY Crossjoin([Customer].[Name].Members, Filter([Status.Switch].[Date].Members, [Status].CurrentMember.Name > CurrentDateMember([Date.Timeline], "[""Date.Timeline""].[""Date""].[yyyymmdd]").Lag(60).Name)) ON ROWS

from [Sales]

Things to notice in the above MDX query:

  • The goal is to display customer names whose status has changed in the past 60 days.
  • The date when the customer status changes is recorded in the Customer dimension table, and defined in the schema as the [Status] dimension; the hierarchy name is ‘Switch’, and the level is [Date], which is encoded as a string in YYYYMMDD format.  NOTE: I chose this format because it sorts ordinally.  That’s why it works for <, >, or = conditions.
  • The powerful Filter() is used to evaluate the condition.
  • I defined two dimensions: [Customer] and [Status] that are keyed against the same customer dimension table.  This way I can use the [Status] dimension in the Filter() and use the [Customer] dimension to display the name of the customer.
  • It is very important to specify the correct Level in the first parameter to the Filter() function, otherwise you’ll never get a match.  Read this sentence again; it took me almost one full day to figure this out.
  • I use the equally powerful CurrentDateMember() to get the member of yet another dimension [Date] that has the same format as the [Status] member.  To get to know this very useful function, read this primer by Diethard Steiner.
  • Then I use the .Name property of the two “YYYYMMDD” members in the condition.  This Member.Name facility is extremely powerful and flexible, but only after you understand what is going on.
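The reason the YYYYMMDD trick works is not MDX-specific: lexicographic comparison of fixed-width, zero-padded date strings agrees with chronological order.  A quick sketch in JavaScript (the same language .xactions embed); the helper name is mine, not anything from Pentaho:

```javascript
// Fixed-width YYYYMMDD strings sort the same way the dates do, which is
// why string <, >, = comparisons are safe in the Filter() condition above.
function toYyyymmdd(date) {
  const y = date.getFullYear();
  const m = String(date.getMonth() + 1).padStart(2, "0");
  const d = String(date.getDate()).padStart(2, "0");
  return `${y}${m}${d}`;
}

const a = toYyyymmdd(new Date(2010, 9, 5));   // "20101005" (Oct 5, 2010)
const b = toYyyymmdd(new Date(2010, 11, 16)); // "20101216" (Dec 16, 2010)
console.log(a < b); // true: string order matches chronological order
```

The zero padding is the whole trick: without it, "2010105" and "20101216" would no longer compare correctly.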

The Result

Goal accomplished: now I have a list of customers whose status has changed in the past 60 days.  A very useful list to show a salesperson.  Of course, I still have to wire this query into a dashboard or report, but that’s beyond the scope of this article.

And now you have the full know-how to tackle various reporting and query needs associated with date-sensitive state changes.  Again, the expiration date is just one of the most common instances of this problem; in reality, there are tons of other scenarios that fit the same pattern.

SSO With Pentaho Community Edition

Posted by admin on September 03, 2010  /   Posted in BI and Custom Development


Pentaho is an amazing system. Built upon countless man-hours from all over the world, it is a testament to the effectiveness of the open-source SDLC paradigm.

But here’s the rub, those of us Community Edition users — who for various reasons cannot use the Enterprise Edition — are left on our own when it comes to the more “advanced” features … such as SSO integration.

After messing with this for the last week, with a lot of help from my colleague and from probably one of the most useful and fun (yes, fun!) online user communities that I’ve dealt with (the #pentaho IRC channel), I finally cracked the proverbial nut.

So with my boss’ blessing, I decided to document what I had to do to make this work, in the spirit of giving back to the community.  Plus, with the rising awareness of the benefits of BI even for small and medium corporations, I have no doubt that this information will be useful to someone somewhere.

Due to variance in SSO setups, I am not implying that the way I set it up will work for yours. That’s all for the obligatory mini-DISCLAIMER.

The Need

If your organization does not have Single Sign-On implemented for your enterprise applications, then this writeup is irrelevant. The fact is, SSO is a useful, productivity-boosting feature for users (and developers too) that, while almost always a major pain to set up, is usually worth the hassle.

In this writeup, my scenario revolves around trying to fit Pentaho (Community Edition), which I’ll refer to as PCE from here on, into an existing SSO implementation.

Version Information:
Pentaho Business Intelligence Server – 3.6.0-stable
Microsoft IIS – version 7.0.6000

Now, for those who are familiar with setting up an SSO system, the next question will be a basic one:

Which SSO implementation did you use?

The SSO Setup

The one that we set up at work utilizes Microsoft Active Directory to authenticate users coming from the website.

While there is some documentation on the Pentaho Wiki about plugging Pentaho into SiteMinder or CAS, much less can be found when you search for Microsoft Active Directory. That’s a shame because, despite being a Linux guy myself, I have to admit that when configured correctly, Microsoft’s implementation of SSO works fairly well from the users’ perspective.

With MSAD, once it has authenticated you via the usual login screen, it pushes information through the AJP protocol (version 1.3) to the Tomcat server that hosts the Pentaho biserver-ce.

For the sake of brevity and clarity, we won’t discuss how to set up AJP to work with IIS.  Suffice it to say that it uses one of the ISAPI Filter extensions, isapi_redirect.dll, to accomplish this.

The first thing to do is to modify the two files under conf/.  The way these work is routing by URL pattern.  They should be self-explanatory to modify; contact me if you need more info.
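For illustration only, the ISAPI redirector’s URL-routing file typically contains mappings along these lines.  The file name, worker name, and patterns below are my assumptions about a common setup, not taken from the configuration above:

```properties
# uriworkermap.properties (illustrative): route Pentaho URLs to the AJP worker
/pentaho=ajp13w
/pentaho/*=ajp13w
```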

Gotcha(TM) #1: On my Ubuntu development server, somehow the Tomcat AJP listener isn’t really listening to requests coming via tcp (that’s TCP for IPv4 clients); rather, it waits on tcp6.  And this is not obvious, especially when you use netstat: netstat will show tcp6 whether Tomcat is listening on tcp *and* tcp6 OR just on tcp6.

So now what to do? After searching for answers online, I came across a tip to specify the actual address the server listens on.   Somehow this forces Tomcat to listen on both tcp and tcp6 on port 8009 (the AJP protocol).  To be specific: add an attribute address="your.server.ip.address" to the <Connector> tag that configures port 8009 in Tomcat’s server.xml.
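A sketch of the relevant server.xml fragment; the address value is a placeholder, and the redirectPort shown is just the Tomcat default:

```xml
<!-- Tomcat server.xml: bind the AJP connector to an explicit address so it
     answers on both tcp (IPv4) and tcp6 -->
<Connector port="8009" protocol="AJP/1.3" redirectPort="8443"
           address="your.server.ip.address" />
```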

After being sidetracked by this, I was finally able to receive AJP requests from IIS on Tomcat, which in turn dutifully re-routes them to /pentaho, where the PCE lives.

The Switch

At this point, all the PCE can do is throw a fit because it does not know what to do with the AJP request coming from IIS (i.e., from the user), plus it has no idea that there is authenticated user information within the request.

So we need to initiate the switch from the pre-installed JDBC-based authentication/authorization to the one based on LDAP, of which the MS Active Directory is an implementation.

To do this, you can follow the information in the link that I’m about to give you. But come back here after you read it, because while it lists the steps, it does not give you a clue as to what those modifications are really for.  Well, unless you are an LDAP and Spring Security expert.

Here’s the link.

In summary, here is the list of files you need to touch and modify:

Under the biserver-ce/pentaho-solutions/system directory:

  • pentaho-spring-beans.xml – the big switch, this is where you tell Pentaho to use LDAP instead of JDBC authentication/authorization system.
  • the LDAP .properties file – basically the center of the modification; we will talk about this file in depth in the next section.
  • applicationContext-spring-security.xml – this is where the ACL (Access Control List) is set up at the URL level.  Search for <property name="objectDefinitionSource">.  Scarily, the actual URL patterns and permitted roles are defined within a hardcoded CDATA block (!!).  What’s wrong with another .properties file, guys?  All you need to do here is substitute the default Pentaho roles such as Admin, Authenticated, etc. with the new ones from LDAP.
  • applicationContext-spring-security-ldap.xml – this is where the majority of the values in the above .properties file are used.  As far as I recall, I didn’t change this file at all, which is always a good thing.
  • applicationContext-pentaho-security-ldap.xml – this file contains the two queries that populate the Pentaho UI when we assign permissions to Users or Roles.  See “The Exception” section below.
  • pentaho.xml – this file governs who can do what to the solutions in the repositories.  All you need to modify in this file is to replace the default roles with the ones you define in LDAP.  IMPORTANT: Anytime you modify the default settings in this file, drop these tables from the hibernate database: pro_acls_list and pro_files (in that order), then restart the biserver; this will rebuild the two tables with the new default permissions for the solutions.

Under the biserver-ce/pentaho-solutions/system/data-access-plugin directory:

  • settings.xml – this file governs who can do what to the defined data sources (you know, the all-important sources of data for your ad-hoc reports and cubes).  All you need to modify in this file is to replace the default roles with the ones you define in LDAP.  A #pentaho IRC channel community member (pstoellberger) helped me out on this one; without his quick source sleuthing, there’s no telling how many more hours I’d have spent figuring this out.

Under biserver-ce/tomcat/webapps/pentaho/WEB-INF directory:

  • web.xml – this file configures the pentaho web application.  Since we are plugging Pentaho into an existing enterprise application, we need to configure it to reflect this.  All you need to do is make sure that this section of the file is properly defined (see below):
    • <context-param><param-name>base-url</param-name>
    • <param-value></param-value>
    • </context-param>

Now let’s talk about some of these modifications in more depth.

The Property File

This is the only property file that you need to modify for this purpose.

The values in this file are used in three Spring Security bean definition files.  To clarify, Pentaho uses Spring Security to implement its authentication/authorization layer — a wise decision that pays off rather handsomely, as you’ll see later in this article.

Let’s walk through these values:

  • contextSource.providerUrl – this should contain your ldap: URL; between you and your sysadmin, this should be a piece of cake to get.  Example: ldap://
  • contextSource.userDn – set the value to the distinguishedName (DN) of the read-only user that you use for accessing LDAP tree.  Example: CN=LDAP Searchdog,CN=Users,DC=acme,DC=com
  • contextSource.password – the password of the read-only user above

This section takes care of the basics of connecting to the LDAP server and is used in various places, as expected.

Next, we’ll fill out info for single user search (for authentication):

  • userSearch.searchBase – this value points to the root of the LDAP tree where you want the search to commence from. Example: DC=acme,DC=com
  • userSearch.searchFilter – this is the LDAP attribute that will be matched against the supplied parameter (typically the user name).  Example: (sAMAccountName={0})  <– the {0} is where the parameter would be substituted.

Next, we’ll specify how to fetch the roles of a given user (setup on LDAP):

  • populator.convertToUpperCase – when this value is ‘true’, the roles coming from LDAP are converted to all upper case.  Not sure what this buys us, but it’s important to be consistent: don’t set this to true and then forget to capitalize the roles wherever they’re defined.
  • populator.groupRoleAttribute – which LDAP attribute holds the roles. Example: cn
  • populator.groupSearchBase – same as the userSearch.searchBase above
  • populator.groupSearchFilter – specifies the condition for the search, that is, using the username to get the roles he/she is associated with
  • populator.rolePrefix – in case you need a prefix; I haven’t found a reason to need one.
  • populator.searchSubtree – another boolean value that indicates whether or not to search the LDAP subtrees.

Lastly, we give the proper info for searching available roles in LDAP.  This is an important query that will actually populate the Pentaho UI where we select Roles to assign permissions to certain Reports or Cubes (or ‘Solutions’ if we use Pentaho’s lingo).

  • allAuthoritiesSearch.roleAttribute – which LDAP attribute holds the value for the roles.  Example: cn
  • allAuthoritiesSearch.searchBase – where you want the search to begin.  IMPORTANT: the way my LDAP server is organized, when this property is set to the root of the tree (DC=acme,DC=com), the subsequent Pentaho code fails to populate the UI control that allows us to select these roles.  Only when I specify a subtree that holds only the roles does this work.  Example: OU=Some Subgroup,DC=acme,DC=com
  • allAuthoritiesSearch.searchFilter – the criterion shared by all the roles we want to pull from the LDAP server.  Example: (objectClass=group)
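Putting the pieces together, the relevant portion of the property file might look like the sketch below.  The values simply reuse the illustrative acme examples from above; the providerUrl host, the password, and the groupSearchFilter pattern are my assumptions, so substitute your own:

```properties
contextSource.providerUrl=ldap://ad.acme.com:389
contextSource.userDn=CN=LDAP Searchdog,CN=Users,DC=acme,DC=com
contextSource.password=secret

userSearch.searchBase=DC=acme,DC=com
userSearch.searchFilter=(sAMAccountName={0})

populator.convertToUpperCase=false
populator.groupRoleAttribute=cn
populator.groupSearchBase=DC=acme,DC=com
populator.groupSearchFilter=(member={0})
populator.rolePrefix=
populator.searchSubtree=true

allAuthoritiesSearch.roleAttribute=cn
allAuthoritiesSearch.searchBase=OU=Some Subgroup,DC=acme,DC=com
allAuthoritiesSearch.searchFilter=(objectClass=group)
```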

The Exception

One LDAP query that you may want to disable is the allUsernamesSearch.  This query is defined in one of the xml files modified, called: applicationContext-spring-security-ldap.xml.

The reason it is a good idea to disable this is just common security/access-control practice: you do not assign permissions at the user level; you assign permissions to roles instead.

So let’s disable the query.  The way to do it is to make sure that the definition of the Spring bean points to a class that has been programmed to do nothing.  It will look something like this:

<bean id="allUsernamesSearch" class="org.pentaho.platform.plugin.services.security.userrole.ldap.NoOpLdapSearch" />

What this is saying is that when the UI that allows the admin to assign permissions is displayed, the users selection box is empty, thanks to the NoOpLdapSearch class defined above.  This means you can’t assign permissions to an individual user; in 99% of the cases, this is what you want.

The Usage

The last step that needs to be done after all the configuration above, is to actually use the roles defined in LDAP at the appropriate places.

‘Consistency’ is the keyword here: once you have defined a set of new roles in MS Active Directory to be used with Pentaho, you *must* substitute the default Pentaho roles (Admin, Authenticated, etc.) in the aforementioned configuration files with the appropriate new roles.

I don’t see the point of belaboring this, as the application will be unique to your own authentication/authorization needs.  Just be aware that a single typo will bring the system to a halt; using some kind of version control is highly recommended when modifying these files.

The most unexpected and quite amazing fact in this whole thing is that Spring-Security automatically handles the authenticated user information that was sent from IIS to Tomcat without any intervention on my side.

Lesser security libraries would probably require some property tweaking or custom-written filters to do this seemingly trivial but important step.   To me, this proves the maturity of Spring as one of the few Java frameworks that is truly enterprise-worthy.

The Loose Ends

Some miscellaneous random bits of info that would have saved me some time and effort had I known them before I started on this task:

  • The log file for Pentaho is located in: biserver-ce/tomcat/bin/pentaho.log
  • To find out about problems with your ISAPI Filters, view the log files located where the extension .dll file is.  In my case it’s called isapi_redirect.log
  • Turn the log4j.xml logging level for spring-security to INFO or even DEBUG to follow what’s happening if the modifications do not seem to take effect.  This is quite obvious, but when you’re busy pulling your hair out, it’s easy to forget.
  • Don’t forget to turn it back to WARN or ERROR when your modifications *do* work.
  • Oh, and the Pentaho Administrator Console is useless once you’ve switched to LDAP; it is only configured to work with the JDBC user/role management.
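For the log4j tip above, the fragment below sketches what the spring-security category might look like in biserver-ce’s log4j.xml.  The category name is the standard Spring Security package; adjust it if your version differs:

```xml
<!-- Temporarily chase authentication problems at DEBUG;
     revert to WARN or ERROR once the modifications work -->
<category name="org.springframework.security">
  <priority value="DEBUG"/>
</category>
```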
