
Event-Driven Architecture and the Internet of Things

It’s common knowledge now that the Internet of Things is projected to be a multi-trillion dollar market, with billions of devices expected to be sold in a few years. It’s happening already. What’s driving IoT is a combination of low-cost hardware and low-power communications, which lets virtually everything become connected cheaply. Even Facebook talked about it at their recent F8 conference (photo by Maurizio Pesce).


And why wouldn’t they? A vast array of devices that make our lives easier and smarter is flooding the market, ranging from fuel-efficient thermostats and security systems to drones and robots. The industrial market for connected control and monitoring already exists and will expand in automated factories, logistics automation, and building automation. However, efficiencies are also being found in new areas. For instance, connected tools for the construction site enable construction companies to better manage construction processes. We are also seeing increased intelligence from what can be referred to as the network effect: the excess value created by the combination of devices all being on a network.

What’s remarkable is that all IoT protocols share one common characteristic: they are all designed around publish/subscribe. The benefit of publish/subscribe, event-driven computing is simplicity and efficiency.

Devices or endpoints can be dynamic, and added or lost with little impact to the system. New devices can be discovered and rules applied to add them to the network and establish their functionality. All IoT standards support some form of discovery mechanism so that new devices can be added as near seamlessly as possible. Over the air a message can be delivered once to many listeners simultaneously without any extra effort by the publisher.
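To make the one-to-many delivery described above concrete, here is a minimal in-memory sketch of publish/subscribe. It is only an illustration: the Broker class, the topic name "home/thermostat" and the handler names are made up for this example and do not correspond to any particular IoT protocol or WSO2 API.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[str], None]


class Broker:
    """Toy in-memory broker: routes a published message to every subscriber of a topic."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def unsubscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].remove(handler)

    def publish(self, topic: str, message: str) -> int:
        """Deliver the message to all current subscribers; return how many received it."""
        handlers = list(self._subscribers[topic])
        for handler in handlers:
            handler(message)
        return len(handlers)


dashboard_log: List[str] = []
alarm_log: List[str] = []


def dashboard(message: str) -> None:
    dashboard_log.append(message)


def alarm(message: str) -> None:
    alarm_log.append(message)


broker = Broker()
broker.subscribe("home/thermostat", dashboard)
broker.subscribe("home/thermostat", alarm)

# The publisher sends once and does not need to know how many listeners exist.
delivered = broker.publish("home/thermostat", "temp=21.5")

# A listener can drop off the network without any change on the publisher side.
broker.unsubscribe("home/thermostat", alarm)
delivered_after = broker.publish("home/thermostat", "temp=22.0")
```

The publisher code is identical before and after the alarm listener leaves, which is the property that makes adding and removing devices cheap.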

Addressing The Challenges

Does all of this efficiency and flexibility sound too good to be true? You guessed right. The greatest challenge is security and privacy. While most protocols support encryption of messages, today’s protocols still have serious security and privacy issues. There are many IoT protocols, and this diversity suggests that a lot of devices will not be secure and that different protocols will have different vulnerabilities. Devices are generally not authenticated, so various attacks based on impersonation are possible.

Most devices and protocols don’t automate software updates, and updating the software on a device sometimes requires complicated manual steps. This can lead to vulnerabilities persisting for long periods. Eventually, however, these issues will be worked out and devices will automatically download authenticated updates. Packets will be encrypted to prevent eavesdropping and IoT devices will become harder to hack, although this could take years. Enterprise versions of devices will undoubtedly flourish and will support better security, as this will be a requirement for enterprise adoption.

Publish/subscribe generates a lot of excitement because of the agility it gives people to leverage information easily, which enables faster innovation and a stronger network effect. Point-to-point technologies, by contrast, lead to brittle architectures in which adding or changing functionality is burdensome.

WSO2 has staked out a significant amount of mindshare and software to support IoT technologies. WSO2 helps companies with its lean, open-source, componentized, event-driven messaging and mediation technology that can go into devices and sensors for communication between devices and services on hubs, in the cloud or elsewhere; big data components for streaming, storing and analyzing data from devices; process automation and device management for IoT; and application management software for IoT applications and devices. WSO2 can help large and small firms deploying or building IoT devices to bring products to market sooner and make their devices or applications smarter, easier, and cheaper to manage.

To learn more about event-driven architecture refer to our white paper – Event-Driven Architecture: The Path to Increased Agility and High Expandability

Want to know more about using analytics to architect solutions? Read IoT Analytics: Using Big Data to Architect IoT Solutions

 

Understanding Causality and Big Data: Complexities, Challenges, and Tradeoffs

image credit: Wikipedia, Amitchell125

“Does smoking cause cancer?”

We have heard that a lot of smokers have lung cancer. However, can we mathematically confirm that smoking causes cancer?

We can look at cancer patients and check how many of them smoke. We can look at smokers and check whether they develop cancer. Let’s assume that both answers come up 100%. That is, hypothetically, we can see a 1–1 relationship between smokers and cancer.

Okay: can we claim that smoking causes cancer? Apparently it is not easy to make that claim. Let’s assume that there is a gene that causes cancer and also makes people like to smoke. If that is the case, we will still see the 1–1 relationship between cancer and smoking. In this scenario, cancer is caused by the gene, not by smoking. That means there may be an innocent explanation for the 1–1 relationship we saw between cancer and smoking.

This example shows two interesting concepts from statistics, correlation and causality, which play a key role in Data Science and Big Data. Correlation means that we see two readings behave together (e.g. smoking and cancer), while causality means that one is the cause of the other. The key point is that if there is causality, removing the first will change or remove the second. That is not the case with correlation.

Correlation does not mean Causation!

This difference is critical when deciding how to react to an observation. If there is causality between A and B, then A is responsible. We might decide to punish A in some way or we might decide to control A. However, correlation alone does not warrant such actions.

For example, as described in the post The Blagojevich Upside, the state of Illinois found that having books at home is highly correlated with better test scores, even if the kids have not read them. So they decided to distribute books. In retrospect, we can easily find a common cause. Having books in a home could be an indicator of how studious the parents are, which helps with better scores. Sending books home, however, is unlikely to change anything.

You see correlation without causality when there is a common cause that drives both readings. This is a common theme in this discussion. You can find a detailed discussion of causality in the talk “Challenges in Causality” by Isabelle Guyon.
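The common-cause effect can be illustrated with a small simulation of the hypothetical gene from the smoking example. Everything here is invented for illustration: the probabilities are made up, and by construction smoking has no effect on cancer at all; only the gene does. The sketch compares what we would observe in the wild with what we would observe if smoking were assigned at random.

```python
import random


def simulate(n, randomize_smoking, seed=0):
    """Return (cancer rate among smokers, cancer rate among non-smokers)."""
    rng = random.Random(seed)
    # counts[smokes] = [number of people, number of cancer cases]
    counts = {True: [0, 0], False: [0, 0]}
    for _ in range(n):
        gene = rng.random() < 0.3  # assumed: 30% of people carry the gene
        if randomize_smoking:
            smokes = rng.random() < 0.5  # coin flip, independent of the gene
        else:
            smokes = rng.random() < (0.8 if gene else 0.1)  # gene makes smoking likely
        # Cancer depends only on the gene, never on smoking.
        cancer = rng.random() < (0.6 if gene else 0.05)
        counts[smokes][0] += 1
        counts[smokes][1] += int(cancer)
    smoker_rate = counts[True][1] / counts[True][0]
    non_smoker_rate = counts[False][1] / counts[False][0]
    return smoker_rate, non_smoker_rate


observed_smokers, observed_non_smokers = simulate(100_000, randomize_smoking=False)
random_smokers, random_non_smokers = simulate(100_000, randomize_smoking=True)

observational_gap = observed_smokers - observed_non_smokers
randomized_gap = random_smokers - random_non_smokers
print(f"observational gap: {observational_gap:.3f}")
print(f"randomized gap:    {randomized_gap:.3f}")
```

In the observational run, smokers show a much higher cancer rate even though smoking does nothing in this model. Once smoking is assigned at random, the gap shrinks to roughly zero, because the gene is now spread evenly across both groups.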

Can we prove Causality?

Causality is measured through randomized experiments (a.k.a. randomized trials or A/B tests). A randomized experiment selects samples and randomly breaks them into two groups, called the control group and the variation group. Then we apply the cause (e.g. send a book home) to the variation group and measure the effect (e.g. test scores). Finally, we measure the causality by comparing the effect in the control and variation groups. This is how medications are tested.

To be precise, if the error bars of the two groups do not overlap, we conclude that there is causality. Check https://www.optimizely.com/ab-testing/ for more details.
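As a rough sketch of the procedure just described, the snippet below randomly splits a hypothetical population into control and variation groups, computes an error bar for each group, and checks whether the bars overlap. The score model (normally distributed scores, with the treatment adding 5 points) and the 1.96 multiplier for an approximate 95% interval are assumptions made for this illustration. Note that the non-overlap check is conservative: bars that overlap slightly can still hide a real difference.

```python
import math
import random
import statistics


def mean_with_interval(values, z=1.96):
    """Return (mean, low, high) using a normal-approximation interval."""
    mean = statistics.mean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(len(values))
    return mean, mean - half_width, mean + half_width


def intervals_overlap(a, b):
    """True if the (mean, low, high) intervals a and b share any point."""
    _, a_low, a_high = a
    _, b_low, b_high = b
    return a_low <= b_high and b_low <= a_high


rng = random.Random(42)

# Randomly split a hypothetical population into two equal groups.
students = list(range(2000))
rng.shuffle(students)
control_ids, variation_ids = students[:1000], students[1000:]


def simulated_score(treated):
    # Assumed model: scores ~ Normal(70, 10); the treatment adds 5 points.
    return rng.gauss(75.0 if treated else 70.0, 10.0)


control_scores = [simulated_score(False) for _ in control_ids]
variation_scores = [simulated_score(True) for _ in variation_ids]

control = mean_with_interval(control_scores)
variation = mean_with_interval(variation_scores)
effect_detected = not intervals_overlap(control, variation)
print("control  :", control)
print("variation:", variation)
print("effect detected:", effect_detected)
```

Because the groups are formed by shuffling, any hidden common cause is spread evenly across both, so a clear gap between the two intervals can be attributed to the treatment.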

However, that is not always practical. For example, if you want to prove that smoking causes cancer, you need to first select a population, place them randomly into two groups, make half of them smoke, and make sure the other half does not smoke. Then you wait for something like 50 years and compare.

Did you see the catch? It is not good enough to compare smokers and non-smokers, as there may be a common cause, like the gene, that causes them to smoke. To prove causality, you need to randomly pick people and ask some of them to smoke. Well, that is not ethical. So this experiment can never be done. Actually, this argument has been used before (e.g. https://en.wikipedia.org/wiki/A_Frank_Statement).

This can get funnier. If you want to prove that greenhouse gases cause global warming, you need to find another copy of Earth, apply greenhouse gases to one, and wait a few hundred years!

To summarize, causality can sometimes be very hard to prove, and you really need to differentiate between correlation and causality.

Following are examples of when causality is needed.

  • Before punishing someone
  • Diagnosing a patient
  • Measuring the effectiveness of a new drug
  • Evaluating the effect of a new policy (e.g. a new tax)
  • Changing a behavior

Big Data and Causality

Most big data datasets are observational data collected from the real world. Hence, there is no control group. Therefore, most of the time all you can show is correlation, and it is very hard to prove causality.

There are two reactions to this problem.

First, “Big data guys do not understand what they are doing. It is stupid to try to draw conclusions without a randomized experiment”.

I find this view lazy.

Obviously, there is a lot of interesting knowledge in observational data. If we can find a way to use it, we will be able to apply these techniques in many more applications. We need to figure out how to use it and stop complaining. If current statistics does not know how to do it, we need to find a way.

Second, “Forget causality! Correlation is enough”.

I find this view blind.

Playing ostrich does not make the problem go away. Crude generalizations like this make people do stupid things and can limit the adoption of Big Data technologies.

We need to find the middle ground!

When do we need Causality?

The answer depends on what we are going to do with the data. For example, if we are just going to recommend a product based on the data, chances are that correlation is enough. However, if we are making a life-changing decision or a major policy decision, we might need causality.

Let us investigate both types of cases.

Correlation is enough when stakes are low, or when we can verify our decision later. Following are a few examples.

  1. When stakes are low (e.g. marketing, recommendations): when showing an advertisement or recommending a product to buy, one has more freedom to make an error.
  2. As a starting point for an investigation: correlation is never enough to prove someone is guilty; however, it can show us useful places to start digging.
  3. Sometimes it is hard to know what things are connected, but easy to verify the quality of a given choice. For example, if you are trying to match candidates to a job or decide on good dating pairs, correlation might be enough. In both these cases, given a pair, there are good ways to verify the fit.

There are other cases where causality is crucial. Following are a few examples.

  1. Finding the cause of a disease
  2. Policy decisions (would a $15 minimum wage be better? would free health care be better?)
  3. When stakes are too high (shutting down a company, passing a verdict in court, sending a book to each kid in the state)
  4. When we are acting on the decision (firing an employee)

Even in these cases, correlation might be useful for finding good experiments to run. You can find factors that are correlated and design experiments to test them for causality, which reduces the number of experiments you need to do. In the book example, the state could have run an experiment by selecting a population, sending books to half of them, and looking at the outcome.

In some cases, you can build your system to inherently run experiments that let you measure causality. Google is famous for A/B testing every small thing, down to the placement of a button and the shade of a color. When they roll out a new feature, they select a population, roll out the feature to only part of that population, and compare the two.
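One common way to build such experiments into a system is to assign each user to a group deterministically from a hash of their ID, so that the same user always sees the same version. The sketch below shows that general pattern only; it is not a description of how Google or any specific product implements rollouts, and the function name, experiment name and user IDs are hypothetical.

```python
import hashlib


def assign_group(user_id: str, experiment: str, rollout_fraction: float = 0.5) -> str:
    """Deterministically place a user in 'variation' or 'control' for an experiment."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "variation" if bucket < rollout_fraction else "control"


# Hypothetical usage: roll a new button color out to half of the users.
users = [f"user-{i}" for i in range(10_000)]
groups = [assign_group(u, "new-button-color") for u in users]
variation_share = groups.count("variation") / len(groups)
print(f"share of users in variation: {variation_share:.3f}")
```

Including the experiment name in the hash means a user who lands in the variation group for one experiment is not automatically in the variation group for every other experiment, which keeps experiments from contaminating each other.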

So in any of the cases, correlation is pretty useful. However, the key is to make sure that the decision makers understand the difference when they act on the results.

Closing Remarks

Causality can be a pretty hard thing to prove. Since most big data is observational data, often we can only show correlation, not causality. If we mix up the two, we can end up doing stupid things.

The most important thing is to have a clear understanding of the difference at the point when we act on our decisions. Sometimes, when stakes are low, correlation might be enough. In other cases, it is best to run an experiment to verify our claims. Finally, some systems might warrant building experiments into the system itself, letting you draw strong causal conclusions. Choose wisely!