
Understanding Causality and Big Data: Complexities, Challenges, and Tradeoffs

image credit: Wikipedia, Amitchell125

“Does smoking cause cancer?”

We have heard that a lot of smokers have lung cancer. However, can we mathematically confirm that smoking causes cancer?

We can look at cancer patients and check how many of them smoke. We can look at smokers and check whether they develop cancer. Let’s assume that both answers come up 100%. That is, hypothetically, we see a 1–1 relationship between smoking and cancer.

Okay: can we claim that smoking causes cancer? It turns out that it is not easy to make that claim. Let’s assume that there is a gene that causes cancer and also makes people like to smoke. If that is the case, we will still see the 1–1 relationship between cancer and smoking. In this scenario, cancer is caused by the gene, not by smoking. That means there may be an innocent explanation for the 1–1 relationship we saw between cancer and smoking.

This example shows two interesting concepts from statistics, correlation and causality, which play a key role in Data Science and Big Data. Correlation means that we see two readings behave together (e.g. smoking and cancer), while causality means one is the cause of the other. The key point is that if there is causality, removing the first will change or remove the second. That is not the case with correlation.
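To make the gene scenario concrete, here is a minimal Python sketch of that toy world. Everything in it is an assumption for illustration: the 30% gene frequency, the function names, and the rule that the gene alone decides both smoking and cancer. It shows a perfect correlation between smoking and cancer, and then shows that removing smoking changes nothing.

```python
import random

def simulate(n, force_smoking=None, seed=0):
    """Simulate n people in a toy world where a hidden gene drives both smoking and cancer."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        gene = rng.random() < 0.3  # hypothetical: 30% of people carry the gene
        # Without intervention, carriers smoke; with intervention, we decide.
        smokes = gene if force_smoking is None else force_smoking
        cancer = gene              # in this toy world only the gene causes cancer
        people.append((smokes, cancer))
    return people

def cancer_rate(people, smokers_only=None):
    """Cancer rate for everyone, or only for smokers / non-smokers."""
    outcomes = [cancer for smokes, cancer in people
                if smokers_only is None or smokes == smokers_only]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

observed = simulate(10_000)
print(cancer_rate(observed, smokers_only=True))    # 1.0: every smoker has cancer
print(cancer_rate(observed, smokers_only=False))   # 0.0: no non-smoker has cancer

# Intervention: nobody smokes. The overall cancer rate does not move.
intervened = simulate(10_000, force_smoking=False)
print(cancer_rate(observed) == cancer_rate(intervened))  # True
```

The observational data alone cannot tell this world apart from one where smoking really is the cause. Only the intervention does.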

Correlation does not mean Causation!

This difference is critical when deciding how to react to an observation. If there is causality between A and B, then A is responsible. We might decide to punish A in some way or we might decide to control A. However, correlation alone does not warrant such actions.

For example, as described in the post The Blagojevich Upside, the state of Illinois found that having books at home is highly correlated with better test scores, even if the kids have not read them. So they decided to distribute books. In retrospect, we can easily find a common cause. Having books in a home could be an indicator of how studious the parents are, which will help with better scores. Sending books home, however, is unlikely to change anything.

You see correlation without causality when there is a common cause that drives both readings. This is a common theme of the discussion. You can find a detailed discussion on causality in the talk “Challenges in Causality” by Isabelle Guyon.

Can we prove Causality?

Causality is measured through randomized experiments (a.k.a. randomized trials or A/B tests). A randomized experiment selects samples and randomly breaks them into two groups called the control and the variation. Then we apply the cause (e.g. send a book home) to the variation group and measure the effects (e.g. test scores). Finally, we measure causality by comparing the effect in the control and variation groups. This is how medications are tested.

To be precise, if the error bars of the two groups do not overlap, then the difference is unlikely to be due to chance, and in a randomized experiment that is evidence of causality. Check for more details.
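The error-bar comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores are simulated (mean 70, standard deviation 10, both invented), the error bar is an approximate 95% confidence interval using the normal approximation, and the function names are hypothetical.

```python
import random
import statistics

def mean_with_error_bar(values, z=1.96):
    """Return (mean, half-width) of an approximate 95% confidence interval."""
    mean = statistics.fmean(values)
    half_width = z * statistics.stdev(values) / len(values) ** 0.5
    return mean, half_width

def intervals_overlap(a, b):
    """True if the two (mean, half-width) error bars overlap."""
    (mean_a, hw_a), (mean_b, hw_b) = a, b
    return abs(mean_a - mean_b) <= hw_a + hw_b

def run_experiment(true_effect, n=2000, seed=42):
    """Randomly split n subjects into control and variation and measure both groups."""
    rng = random.Random(seed)
    subjects = list(range(n))
    rng.shuffle(subjects)                # random assignment breaks any common cause
    variation = set(subjects[: n // 2])
    control_scores, variation_scores = [], []
    for subject in range(n):
        baseline = rng.gauss(70, 10)     # hypothetical test score without the cause
        if subject in variation:
            variation_scores.append(baseline + true_effect)
        else:
            control_scores.append(baseline)
    return mean_with_error_bar(control_scores), mean_with_error_bar(variation_scores)

for effect in (0.0, 5.0):
    control, variation = run_experiment(effect)
    print(effect, control, variation, "overlap:", intervals_overlap(control, variation))
```

Non-overlapping bars are a conservative check. Overlapping bars do not prove there is no effect, so a proper significance test is the better tool when the result is close.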

However, that is not always practical. For example, if you want to prove that smoking causes cancer, you need to first select a population, place them randomly into two groups, make half of them smoke, and make sure the other half does not smoke. Then wait for something like 50 years and compare.

Did you see the catch? It is not good enough to compare smokers and non-smokers, as there may be a common cause, like the gene, that causes them to smoke. To prove causality, you need to randomly pick people and ask some of them to smoke. Well, that is not ethical. So this experiment can never be done. Actually, this argument has been used before (e.g. )

This can get funnier. If you want to prove that greenhouse gases cause global warming, you need to find another copy of Earth, apply greenhouse gases to one, and wait a few hundred years!!

To summarize, causality can sometimes be very hard to prove, and you really need to differentiate between correlation and causality.

The following are examples of when causality is needed.

  • Before punishing someone
  • When diagnosing a patient
  • When measuring the effectiveness of a new drug
  • When evaluating the effect of a new policy (e.g. a new tax)
  • When trying to change a behavior

Big Data and Causality

Most big data datasets are observational data collected from the real world. Hence, there is no control group. Therefore, most of the time all you can show is correlation, and it is very hard to prove causality.

There are two reactions to this problem.

First, “Big data guys do not understand what they are doing. It is stupid to try to draw conclusions without a randomized experiment”.

I find this view lazy.

Obviously, there is a lot of interesting knowledge in observational data. If we can find a way to use it, that will let us apply these techniques in many more applications. We need to figure out how to use it and stop complaining. If current statistics does not know how to do it, we need to find a way.

Second is “forget causality! correlation is enough”.

I find this view blind.

Playing ostrich does not make the problem go away. This kind of crude generalization makes people do stupid things and can limit the adoption of Big Data technologies.

We need to find the middle ground!

When do we need Causality?

The answer depends on what we are going to do with the data. For example, if we are just going to recommend a product based on the data, chances are that correlation is enough. However, if we are making a life-changing decision or a major policy decision, we might need causality.

Let us investigate both types of cases.

Correlation is enough when stakes are low, or when we can later verify our decision. The following are a few examples.

  1. When stakes are low (e.g. marketing, recommendations): when showing an advertisement or recommending a product to buy, one has more freedom to make an error.
  2. As a starting point for an investigation: correlation is never enough to prove someone is guilty; however, it can show us useful places to start digging.
  3. Sometimes, it is hard to know what things are connected, but easy to verify the quality given a choice. For example, if you are trying to match candidates to a job or decide on good dating pairs, correlation might be enough. In both these cases, given a pair, there are good ways to verify the fit.

There are other cases where causality is crucial. The following are a few examples.

  1. Finding the cause of a disease
  2. Policy decisions (would a $15 minimum wage be better? would free health care be better?)
  3. When stakes are too high (shutting down a company, passing a verdict in court, sending a book to each kid in the state)
  4. When we are acting on the decision (firing an employee)

Even in these cases, correlation might be useful for finding good experiments to run. You can find factors that are correlated and design experiments to test causality for just those, which will reduce the number of experiments you need to do. In the book example, the state could have run an experiment by selecting a population, sending books to half of them, and looking at the outcome.

In some cases, you can build your system to inherently run experiments that let you measure causality. Google is famous for A/B testing every small thing, down to the placement of a button and the shade of a color. When they roll out a new feature, they select a population, roll out the feature for only part of it, and compare the two groups.
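One common way to build such a split into a system is to derive each user's group from a hash of their ID, so the assignment is effectively random across users but stable for any one user. The sketch below is an assumption about how this could be done, not a description of Google's implementation; the function name, the experiment name, and the user IDs are all hypothetical.

```python
import hashlib

def assign_group(user_id: str, experiment: str, rollout_fraction: float = 0.5) -> str:
    """Deterministically place a user in 'variation' or 'control' for one experiment."""
    # Salting with the experiment name keeps assignments independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # roughly uniform value in [0, 1)
    return "variation" if bucket < rollout_fraction else "control"

# The same user always lands in the same group for the same experiment.
print(assign_group("user-1", "new-button") == assign_group("user-1", "new-button"))  # True

# Across many users the split is close to the requested fraction.
users = [f"user-{i}" for i in range(10_000)]
share = sum(assign_group(u, "new-button") == "variation" for u in users) / len(users)
print(round(share, 3))
```

Because the group is a pure function of the ID, nothing has to be stored to remember who saw the new feature, and the two groups can be compared later just like the control and variation groups of any randomized experiment.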

So in any of the cases, correlation is pretty useful. However, the key is to make sure that the decision makers understand the difference when they act on the results.

Closing Remarks

Causality can be a pretty hard thing to prove. Since most big data is observational data, often we can only show correlation, not causality. If we mix up the two, we can end up doing stupid things.

The most important thing is to have a clear understanding of the difference at the point when we act on the decisions. Sometimes, when stakes are low, correlation might be enough. In other cases, it is best to run an experiment to verify our claims. Finally, some systems might warrant building experiments into the system itself, letting you draw strong causal conclusions. Choose wisely!

Public Services Gateway and Internal Services Gateway Patterns

I wrote earlier about defining a Generic API in your SOA by encapsulating the heterogeneous service platforms that you find in your infrastructure. The two patterns I’ll discuss today are sub-patterns that we can refine from the features provided by the Generic API pattern.



The Internal Services Gateway (ISG) pattern exposes services in the underlying service platforms to internal service consumers by using the Generic API pattern. The WSO2 Enterprise Service Bus (ESB) is deployed in the local area network (LAN) and exposes backend services as proxy services. This aggregates the backend services into a unified services layer and simplifies the backend service contracts.

Security policies for authentication and authorization can be designed for a context in which only internal consumers are allowed to access the services. Some ISG deployments rely only on the network-level security provided by the infrastructure; others leverage Single Sign-On (SSO) through an internal user store hosted by Active Directory, LDAP, or an RDBMS, or through Windows-based Kerberos tokens.

The Public Services Gateway (PSG) pattern exposes select services to external service consumers. In a normal infrastructure this is achieved by deploying a WSO2 ESB in a “DMZ” (demilitarized zone where security is carefully managed – I’ll provide more information about DMZ practices in a future post) and exposing the services to external service consumers. The DMZ ESB pre-processes service requests coming from the public service gateway, and thus originating outside the core network, and routes only valid and authorized messages to the actual service platforms deployed in the LAN.

Pre-processing steps typically consist of message validation, filtering, and transformation. Compared with the ISG, a PSG should maintain a higher level of security, since its service requests originate outside the core network. The PSG should be configured to use the relevant security policies and bridge into the internal security policies by using the security protocol switching capabilities of WSO2 ESB. SSO support for external consumers can be implemented using SAML2 tokens or any other Secure Token format (such as OpenID).
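The validate, filter, transform, and route sequence can be illustrated with a small Python sketch. It is not WSO2 ESB configuration and uses no ESB API. The message shape, the operation names, the header names, and the rules are all hypothetical, chosen only to show how a gateway lets just the valid and authorized messages reach the backend.

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    sender: str
    operation: str
    payload: dict
    headers: dict = field(default_factory=dict)

class Rejected(Exception):
    """Raised when a request must not reach the internal service platforms."""

# Hypothetical rules for the services exposed to external consumers.
ALLOWED_OPERATIONS = {"getQuote", "placeOrder"}
REQUIRED_FIELDS = {"getQuote": {"symbol"}, "placeOrder": {"symbol", "quantity"}}

def validate(msg: Message) -> Message:
    missing = REQUIRED_FIELDS.get(msg.operation, set()) - set(msg.payload)
    if missing:
        raise Rejected(f"missing fields: {sorted(missing)}")
    return msg

def filter_request(msg: Message) -> Message:
    if msg.operation not in ALLOWED_OPERATIONS:
        raise Rejected(f"operation not exposed publicly: {msg.operation}")
    if "token" not in msg.headers:
        raise Rejected("no security token presented")
    return msg

def transform(msg: Message) -> Message:
    # Stand-in for protocol switching: drop the external token and
    # pass on an internal identity header instead.
    return Message(msg.sender, msg.operation, dict(msg.payload),
                   {"internal-user": msg.sender})

def gateway(msg: Message, backend):
    """Run the pre-processing steps in order, then route to the backend."""
    for step in (validate, filter_request, transform):
        msg = step(msg)
    return backend(msg)

request = Message("partner-a", "getQuote", {"symbol": "WSO2"}, {"token": "abc"})
print(gateway(request, backend=lambda m: (m.operation, m.headers)))
```

In the model where the PSG consumes services through an ISG, the `backend` callable would stand for the internal gateway rather than the service itself.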

Two implementation models are popular: a PSG consuming services through an ISG or a PSG directly consuming the backend services. In addition to message-level validation the PSG can extend validation to the attachments coming with the message, for example executing virus checks by configuring WSO2 ESB to execute a virus check program.

In summary, these two patterns provide clean, proper control of the services exposed to internal and external consumers respectively. Security policies appropriate to each type of consumer can be developed, deployed, and managed simply through the internal registry in the ESB or through an external WSO2 Governance Registry instance.

Asanka Abeysinghe, Director of Solutions Architecture
Asanka’s blog:


Enterprise Architects Appreciate “Lean”

Standing out from our conversations with dozens of Enterprise Architects at last week’s Forrester Enterprise Architecture Summit 2011 in San Francisco was the interest in and appreciation of “lean” approaches to integration challenges.  From a lot of nodding in the room after Paul’s assertion that a lean solution was a key factor in eBay’s choice to use the WSO2 ESB in their ultra-scale deployments, to expo floor conversations with Enterprise Architects who are tired of suffering under bloated old industrial middleware and perking up at the idea that this is not inevitable, I came away with the impression that we may be on the cusp of a “lean” wave.

image caption: The cloud descends on San Francisco for the Forrester EA Summit 2011 [Jonathan Marsh from the Golden Gate Bridge 2/16/2011]

Let me be clear, while the WSO2 Carbon platform is lean it’s not skinny.  Through a sophisticated componentization model based on OSGi, there are hundreds of features to choose from, comprising a complete middleware platform from data to screen.  You just don’t typically need them all at once.

What are some of the factors that are driving the lean movement?  I think they include:

  • Simplified installation, configuration, and provisioning.
  • Low resource use, specifically modest disk and memory footprints.
  • High performance as a result of a simple straight-line approach to the problem at hand.
  • Immense productivity and reliability gains which occur when a tool addresses the problem at hand directly, not through multiple layers of generalization and abstraction.

This lean mentality kind of reminds me of my Microsoft days during which Windows Server Data Center Edition was introduced.  DC is essentially a version of Windows Server stripped down to its leanest, most performant and secure core.  It surprised me at the time that they charged significantly more for less actual code.  But it does demonstrate the value proposition of “lean,” and why it may now be a trending topic in the field of Enterprise Architecture.

Jonathan Marsh, VP Business Development and Marketing
Jonathan’s blog:


WSO2 Message Broker Beta Leaked by CTO

Ok, not a completely truthful headline.  As an open source company with a completely open development model, the source code has been hosted publicly for some time and the roadmap for it has been discussed on our public architecture mailing list.  How can you leak something that’s already public?

But the real story is still interesting: WSO2 CTO Paul Fremantle has posted a blog entry helping early adopters download, install, and configure the soon-to-be-released product.

The WSO2 Message Broker marries Apache Qpid with the Carbon OSGi architecture to provide JMS (Java Message Service) and AMQP protocol support.  It is designed to help SOA adopters and those building on the WSO2 Carbon/Stratos platform to easily add messaging patterns to their toolkit of best practices for enterprise integration.

Look forward to more announcements soon, or follow Paul’s directions for your early adopter investigation!

Jonathan Marsh, VP Business Development and Marketing
Jonathan’s blog:


Adding the dynamism of events to a Master Data Management solution

The WSO2 platform provides all the capabilities to address two common architecture patterns — Master Data Management (MDM) and Event Driven Architecture (EDA).

The integration of these two powerful ideas allowed a System Integrator (and WSO2 customer) to refactor and modernize their architecture in their latest release, and roll that out smoothly to their customers.  The new architecture centered around the MDM and EDA patterns.  Built-in facilities enabling MDM and EDA patterns played a factor in choosing the WSO2 SOA-based Middleware Platform.

The existing application software includes a number of RDBMS data repositories, exposed through application-level APIs from various systems. Requirements for the new architecture included the reuse of the existing data as well as support for updates to the existing data stores from messages originating in the new architecture. Even though existing data was reused, the existing data model was not proving a good fit with the new architecture. Therefore converting the data to a new data model also became a key requirement. The MDM pattern fulfilled these two requirements by connecting to the data repositories and converting the data into a universal data model.

The WSO2 platform sports a number of features useful for implementing MDM.  The OxygenTank article Implementing MDM Patterns on WSO2 SOA Platform describes a pattern called Service Adapters that applied neatly to this situation, leveraging the legacy APIs for data access.  The adapters were coded in Java and deployed in the WSO2 Application Server.  WS-Transfer facilitated transformation of the data models and exposed the new universal data model through XML Web Services.
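To illustrate the idea behind service adapters, here is a minimal sketch. The adapters in this project were written in Java; Python is used here only to keep the example short. The `Customer` model, the two legacy record layouts, and all field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Customer:
    """Hypothetical universal data model shared by all consumers."""
    customer_id: str
    full_name: str
    email: str

def from_billing(row: dict) -> Customer:
    """Adapter for a hypothetical billing database row."""
    return Customer(
        customer_id=str(row["CUST_NO"]),
        full_name=f'{row["FIRST"]} {row["LAST"]}'.strip(),
        email=row["EMAIL"].lower(),
    )

def from_crm(record: dict) -> Customer:
    """Adapter for a hypothetical CRM API record."""
    return Customer(
        customer_id=record["id"],
        full_name=record["name"],
        email=record["contact"]["email"].lower(),
    )

# One adapter per legacy source; consumers only ever see Customer.
ADAPTERS = {"billing": from_billing, "crm": from_crm}

def to_universal(source: str, raw: dict) -> Customer:
    return ADAPTERS[source](raw)

billing_row = {"CUST_NO": 42, "FIRST": "Ada", "LAST": "Lovelace", "EMAIL": "ADA@EXAMPLE.COM"}
crm_record = {"id": "42", "name": "Ada Lovelace", "contact": {"email": "ada@example.com"}}
print(to_universal("billing", billing_row) == to_universal("crm", crm_record))  # True
```

The point of the pattern is that the legacy APIs stay untouched: each one gets a thin adapter, and the differences between sources disappear behind the universal model.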


The message exchange pattern (MEP) used to integrate the application components was pub-sub (publish and subscribe), bringing EDA into the picture. Pub-Sub extends the loose coupling of a SOA, allowing new data sources to be integrated by a simple publish/subscribe operation.  The WSO2 Enterprise Service Bus’s native support for the WS-Eventing standard allows it to act as an event broker, while extending mediation capabilities to any pub-sub interaction as well as providing all the QoS controls available within the ESB.
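The loose coupling that pub-sub provides can be seen in a very small in-memory sketch. This is not the WS-Eventing protocol or the ESB's event broker; the class, the topic name, and the event shape are hypothetical and only show the interaction pattern.

```python
from collections import defaultdict

class EventBroker:
    """Minimal topic-based broker: publishers and subscribers never reference each other."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        """Deliver the event to every handler on the topic; return how many received it."""
        handlers = list(self._subscribers.get(topic, []))
        for handler in handlers:
            handler(event)
        return len(handlers)

broker = EventBroker()
audit_log, cache = [], {}

# Two independent consumers of the same master data update.
broker.subscribe("customer/updated", audit_log.append)
broker.subscribe("customer/updated", lambda e: cache.update({e["id"]: e}))

delivered = broker.publish("customer/updated", {"id": "42", "email": "ada@example.com"})
print(delivered)   # 2
print(audit_log)   # [{'id': '42', 'email': 'ada@example.com'}]
```

Adding a new data source or consumer is one more `subscribe` or `publish` call; nothing that already exists has to change.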

By introducing a controller into the architecture, more sophisticated event flows are possible, controlled by business processes and rules. In this architecture, the controller was implemented by using WSO2 Application Server and WSO2 Business Process Server, and combined standard JAX-WS based services and rules defined in BPEL.

Dynamic discovery emerged as a key requirement to avoid tight coupling of service endpoints.  With the combination of WS-Discovery support and a compatible service deployer, endpoint availability is published as each service is deployed.
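The effect of dynamic discovery can be sketched as a small in-memory registry. The method names loosely echo the Hello, Bye, and Probe messages of WS-Discovery, but this is not an implementation of that protocol; the class, the service type, and the endpoint URLs are hypothetical.

```python
class DiscoveryRegistry:
    """Tracks which endpoints are currently available for each service type."""

    def __init__(self):
        self._endpoints = {}

    def hello(self, service_type, endpoint):
        """Announce an endpoint, as a deployer would when a service comes up."""
        self._endpoints.setdefault(service_type, set()).add(endpoint)

    def bye(self, service_type, endpoint):
        """Withdraw an endpoint when its service is undeployed."""
        self._endpoints.get(service_type, set()).discard(endpoint)

    def probe(self, service_type):
        """Return the endpoints currently known for a service type."""
        return sorted(self._endpoints.get(service_type, set()))

registry = DiscoveryRegistry()
registry.hello("CustomerService", "http://10.0.0.5:9763/services/Customer")
registry.hello("CustomerService", "http://10.0.0.6:9763/services/Customer")
registry.bye("CustomerService", "http://10.0.0.5:9763/services/Customer")
print(registry.probe("CustomerService"))  # ['http://10.0.0.6:9763/services/Customer']
```

Consumers ask for a service type instead of hard-coding an address, which is what keeps the endpoints loosely coupled.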

Integration of a Registry/Repository was identified as a key requirement to store service and configuration metadata as well as to enable dynamic metadata look-up. These facilities are provided by the WSO2 Governance Registry, which in addition to a metadata store hosts the topic store for topic-based event subscriptions.


The logical architecture solution above maps to a variety of deployment patterns for different clients of the system integrator, meeting their individual demands for scalability, high availability, infrastructure constraints, and so forth.

The application of aspects of Event Driven Architecture to the problem of Master Data Management adds flexibility and increases the advantages of loose coupling so prized in modern SOA solutions.  We hope the pattern described above gives you some ideas of how your current integration challenges can be approached.

Asanka Abeysinghe, Director of Solutions Architecture
Asanka’s blog:


A WSO2 First: Multi-Tenant Tomcat WebApps

In a previous post I talked about the advantages of unifying Web Applications and Web Services or APIs into a single server runtime.  And about some of the advantages of making Apache Tomcat part of the WSO2 Carbon family.

But there possibly isn’t any aspect of a Carbon-based Tomcat more exciting than combining it with the power of WSO2 Stratos, the WSO2 Carbon-based cloud middleware platform.  Stratos provides hosting on the cloud with all the advantages that implies: the agility of instant self-service provisioning, elasticity to automatically scale up with business peaks and down as demand subsides, the efficiencies of multi-tenant architecture, and greater intelligence through full monitoring and metering.

Because Tomcat is now a WSO2 Carbon family product, Tomcat Web applications can be deployed on the cloud!  Either on your private cloud infrastructure, or on the WSO2 public cloud, relieving your businesses of the chore of maintaining their own IT infrastructure.

We’re very proud to offer the first commercial release of Tomcat available as either server-based software, a virtual machine image, or as a multi-tenant platform as a service (PaaS) on private or public clouds.

You can try the Tomcat WebApp samples, deploy your own WebApp, and more at

Afkham Azeez, Senior Architect and Senior Manager

Azeez’s blog:
