10 Jun, 2014

Time Series Analysis with WSO2 Complex Event Processor

  • Sohani Weerasinghe
  • Senior Software Engineer - WSO2
Archived Content
This content is provided for historical perspective only, and may not reflect current conditions. Please refer to the WSO2 analytics page for more up-to-date product information and resources.

A time series is a collection of data recorded over a period of time (e.g. daily, weekly, monthly, or yearly). Time series analysis is mainly used to support decisions based on long-term forecasting. There are two main types of time series: stock and flow. Stock series are measures, or counts, taken at a point in time, e.g. the number of mobile phones available in a shop on a particular day. This figure changes from day to day depending on the amount of stock received that day and the number of phones sold. Flow series are measures of activity over a given period of time, e.g. the number of mobile phones sold by the shop in a particular month. This figure grows day by day as phones are sold, and at the end of the month the total number of sales can be calculated. Both types are treated in much the same way in time series analysis; the main difference is that flow series can be affected by the calendar, known as trading day effects, e.g. a January with five weekends is likely to see more mobile phone sales than one with four.

General aspects of time series patterns

Most time series patterns can be described in terms of four basic components - secular trend, seasonal variation, cyclic variation, and irregular variation.

  • Secular trend

    A secular trend is a long-term movement in a time series in a single direction, either upward or downward. Data is taken over a long period of time, and the trend can be detected simply by averaging over successive periods: if the averages change steadily with time, the series is likely to have a trend. Examples of upward trends are population growth over time, price increases over a period of years, and growth in the production of goods on a country's capital market over a period of years. An example of a declining trend is the sales of a commodity falling over time because of a substitute product. Figure 1 shows how sales have increased over a year, where Y denotes sales and X is the time in months.

    Figure 1

  • Seasonal variation

    Seasonal variation is a short-term fluctuation in a time series that occurs periodically within a year and repeats year after year. The major factors responsible for this repetitive pattern are natural conditions such as weather, social and cultural behaviors, and business and administrative practices. For example, more woollen clothes are sold in winter than in summer. Similarly, regardless of the trend, more ice cream is sold each summer and very little each winter.

    Seasonality in a time series can be identified by regularly spaced peaks that have a consistent direction. Figure 2 shows the sales of a commodity over a year, where sales are higher in winter and summer than in the other seasons.

    Figure 2

  • Cyclical variations

    Cyclical variations are upward or downward movements in a time series where the period of the cycle is greater than a year. These variations are not as regular as seasonal variations, and cycles vary in both length and size. An example of cyclical variation is the ups and downs in business activity over a long period of time.

    Figure 3

  • Irregular variation

    Irregular variations are fluctuations in a time series that are short in duration and follow no regular pattern of occurrence. These variations are also referred to as residual variations since, by definition, they represent what is left in a time series after the trend, cyclical, and seasonal variations are removed. Irregular fluctuations result from unforeseen events such as floods, earthquakes, and wars.
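The averaging idea described under secular trend, and the repeating peaks described under seasonal variation, can be sketched in a few lines of Python. This is only a rough illustration outside WSO2 CEP: the function names and the quarterly sales figures are hypothetical, and the seasonal profile is a simple per-position average rather than a full decomposition.

```python
from statistics import mean


def period_averages(series, period):
    """Average each consecutive, complete block of `period` observations."""
    return [mean(series[i:i + period])
            for i in range(0, len(series) - period + 1, period)]


def seasonal_profile(series, period):
    """Average of each position within the period, minus the overall mean."""
    overall = mean(series)
    return [mean(series[i::period]) - overall for i in range(period)]


# Hypothetical quarterly sales for three years (made-up numbers):
# a steady rise from year to year plus a peak in every fourth quarter.
sales = [10, 12, 11, 20,
         14, 16, 15, 24,
         18, 20, 19, 28]

yearly = period_averages(sales, 4)    # rising averages suggest an upward trend
profile = seasonal_profile(sales, 4)  # the largest entry marks the seasonal peak

print("Yearly averages:", yearly)
print("Seasonal profile:", profile)
```

If the yearly averages rise or fall steadily, the series probably has a trend; a position in the profile that stands well above the others points to a seasonal peak.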

Main process of time series analysis in WSO2 CEP

WSO2 CEP is an enterprise-grade server that integrates with various systems to identify meaningful patterns in real-world events, in areas such as financial analysis, fraud detection, sales, shipping, and other business activities. The core backend runtime engine of the WSO2 CEP server is WSO2 Siddhi, which can be used to perform time series analysis based on user inputs.

The user configures an execution plan that defines the input streams and the appropriate function in the Siddhi query. The function then provides the regression equation with relevant data, such as the beta coefficients, the standard error, and the T statistic of each coefficient, which can be used to forecast future data or to identify outliers. The user also needs to define the required confidence interval, and based on that the coefficients are identified as weak or strong. If a predictor variable’s P value exceeds the P value of the confidence interval, it is identified as a weak variable and its coefficient becomes zero, so that variable has no effect on the criterion variable.
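The weak/strong rule above can be illustrated with a small Python sketch. It is not the Siddhi implementation: it assumes the threshold is 1 minus the confidence level (so 0.05 for a 95% interval), takes the P values as given rather than computing them, and uses a hypothetical function name and made-up numbers.

```python
def prune_weak_coefficients(coefficients, p_values, confidence):
    """Set the coefficient of every weak predictor to zero.

    Assumption: a predictor is weak when its P value exceeds
    1 - confidence (for example 0.05 for a 95% confidence interval).
    """
    alpha = 1.0 - confidence
    return {name: (0.0 if p_values[name] > alpha else beta)
            for name, beta in coefficients.items()}


# Hypothetical coefficients and P values, for illustration only.
coefficients = {"x1": 2.5, "x2": 0.8}
p_values = {"x1": 0.01, "x2": 0.40}

pruned = prune_weak_coefficients(coefficients, p_values, confidence=0.95)
print(pruned)
```

Here x2 would be treated as weak because its P value is above the threshold, so it drops out of the regression equation.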

When configuring the Siddhi query, an event stream is defined first by specifying its name and attributes. Each attribute is defined as a pair consisting of its name and type. After the stream is defined, Siddhi creates an input handler that is used to send events on the defined stream into the system. In addition, callbacks can be used to receive notifications when events are produced on the event stream. Finally, each event is projected into the outgoing event stream based on the defined outgoing stream attributes.

Regression types supported by CEP

  • Simple regression

    In simple linear regression, the predicted result is based on only one variable, called the ‘predictor variable’ and referred to as X. The variable being predicted is called the ‘criterion variable’ and is referred to as Y.

    The regression equation is y = β0 + β1 * x + ε

    where β0 is the intercept, β1 is the slope, and ε is the error term, whose size is summarized by the standard error of the regression equation.

    In simple linear regression, the predictions of Y plotted as a function of X form a graph as shown in Figure 4, where

    Y = Sales

    X = Time in months

    Figure 4

  • Multiple regression

    The general purpose of multiple regression is to quantify the relationship between several independent or predictor variables and a dependent variable or a criterion variable. Here, the criterion variable is predicted by two or more predictor variables.

    The regression equation is y = β0 + β1 * x1 + β2 * x2 + … + βn * xn + ε

    In multiple regression, the predictions of Y plotted as a function of the X variables form a graph as shown in Figure 5, where

    Y = Profit

    X1 = Sales

    X2 = Time in years

    Figure 5

  • Discrete seasonality regression

    Discrete seasonality regression is performed on data that has seasonal patterns. A seasonal pattern is recognized when the same pattern repeats at regular intervals, e.g. peak sales every December as shown in Figure 6, where Y denotes sales and X denotes time in months.

    Figure 6

    If the user identifies discrete seasonality in the data set, the user needs to add a dummy variable. Dummy variables are independent variables that take the value of either 0 or 1. In discrete seasonality regression, a dummy variable with a value of 0 causes its coefficient to drop out of the equation, while a value of 1 causes the coefficient to act as a supplemental intercept, because multiplying by 1 leaves the coefficient unchanged. For example, the dummy variable is 1 if the month is December and 0 otherwise. When the user defines a dummy variable, it is added to the regression equation and a coefficient is calculated for it as well, which the user can then take into account in future predictions.

    The regression equation is y = β0 + β1 * x1 + β2 * D + ε

  • Continuous seasonality regression

    If the seasonality is not discrete but sinusoidal, the data set takes the form y ~ sin(x). The user should create a new variable defined as x1 = sin(x) and use x1 as the predictor variable for the regression. This gives the linear regression equation below.

    y = β0 + β1 * x1 + ε

    In continuous seasonality regression, the predictions of Y plotted as a function of X form a graph as shown in Figure 7, where Y denotes sales and X is the time in months.

    Figure 7

    If the user identifies that the data follows some other pattern, such as quadratic (y ~ x^2) or exponential (y ~ a^x), the user needs to define X variables that represent that pattern, such as x1 = x^2 if it is quadratic and x1 = a^x if it is exponential.
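The regression types above all reduce to fitting a straight line once the predictor has been prepared. The Python sketch below shows this outside WSO2 CEP with a plain least-squares fit; the function names and the data are hypothetical, and the standard error is taken as the square root of the residual sum of squares divided by n - 2, which is an assumption about what the CEP output reports.

```python
import math


def fit_simple(xs, ys):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1, standard_error)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Residual sum of squares, with n - 2 degrees of freedom.
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return b0, b1, math.sqrt(sse / (n - 2))


def december_dummy(month):
    """Dummy variable D: 1 for December, 0 for every other month."""
    return 1 if month == 12 else 0


months = list(range(1, 25))  # two hypothetical years of monthly data

# Discrete seasonality: made-up sales of 100 per month, 150 in December.
dummies = [december_dummy((m - 1) % 12 + 1) for m in months]
sales = [150 if d else 100 for d in dummies]
base, december_lift, _ = fit_simple(dummies, sales)

# Continuous seasonality: made-up data following y = 5 + 3 * sin(x).
# Regressing on x1 = sin(x) turns it into an ordinary linear fit.
ys = [5 + 3 * math.sin(m) for m in months]
x1 = [math.sin(m) for m in months]
b0, b1, std_err = fit_simple(x1, ys)

print("Dummy fit:", base, december_lift)
print("Sine fit:", b0, b1, std_err)
```

In the dummy fit the slope plays the role of the supplemental intercept for December; in the sine fit the transformation x1 = sin(x) is what makes the relationship linear. A quadratic or exponential pattern would be handled the same way with x1 = x^2 or x1 = a^x.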

Sample

The sample below predicts the profit of a company based on the year, investment, export, and number of employees, and shows how to carry out a time series analysis with WSO2 CEP.

Criterion variable - Profit of the company in millions

Predictor variables - Year, investment, export, number of employees

Figures 8 and 9 show the data set and the graph related to the sample.

Figure 8

Figure 9

Looking at the data set and the graph, we can see that this sample calls for multiple regression, since profit depends on more than one variable. The regression equation for this sample is

y = β0 + β1 * x1 + β2 * x2 + β3 * x3 + β4 * x4 + ε

where

y = Profit

x1 = Year

x2 = Investment

x3 = Export

x4 = Employees

β0 = the intercept

β1 = slope of Year

β2 = slope of Investment

β3 = slope of Export

β4 = slope of Employees

ε = the error term (summarized by the Standard Error)
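For readers who want to see how coefficients of this form can be estimated outside WSO2 CEP, the Python sketch below fits a multiple regression by solving the normal equations. It is not the algorithm used by the CEP extension, the function name is hypothetical, and the data is made up (generated from y = 2 + 3 * x1 - x2) because the article's data set is only available as a figure.

```python
def fit_multiple(rows, ys):
    """Least-squares coefficients [b0, b1, ..., bn] via the normal equations."""
    # Design matrix with a leading 1 for the intercept.
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    # Build X^T X and X^T y.
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(k)]
    # Solve (X^T X) b = X^T y by Gauss-Jordan elimination with pivoting.
    aug = [xtx[i] + [xty[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Made-up predictor rows (x1, x2) and responses from y = 2 + 3 * x1 - x2.
rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
ys = [2 + 3 * x1 - x2 for x1, x2 in rows]

coefficients = fit_multiple(rows, ys)
print(coefficients)
```

The normal equations are easy to follow but can be numerically fragile when predictors have very different scales (such as a year near 2000 next to small ratios), so a production implementation would normally use a more stable decomposition.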

After configuring the execution plan, the required output can be obtained by executing the Siddhi query expression below.

from DataStream#transform.timeseries:regress(ci, y, x1, x2, x3, x4)
select *
insert into RegressionResult;

In the above query, ‘DataStream’ is the defined event stream. Because the regression is implemented as a custom transformer, it is invoked through Siddhi's #transform syntax. The namespace of the extension is ‘timeseries’ and the function name is ‘regress’. The required attributes (confidence interval, criterion variable, and predictor variables) are passed as parameters to the regress function. The query defines the output stream ‘RegressionResult’ to hold the output attributes.

The output contains the standard error, the intercept, and the slopes of the predictor variables. For the above example the result is as follows:

Standard Error 48.0580567924631

Intercept -201253.1410667966

Coefficient of Year 100.6863391704232

Coefficient of Investment 0.0

Coefficient of Export 0.0

Coefficient of Employees 0.00684219674285623

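Because the coefficients of Investment and Export are zero, these two variables drop out and the regression equation becomes:

Profit = -201253.1410667966 + 100.6863391704232 * Year + 0.00684219674285623 * Employees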
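The resulting equation can be used directly for forecasting. The Python sketch below plugs inputs into it using the intercept and coefficients reported above; the function name and the input values (year 2015, 5000 employees) are hypothetical and are not taken from the sample data set.

```python
# Intercept and coefficients as reported in the regression result above.
INTERCEPT = -201253.1410667966
COEF_YEAR = 100.6863391704232
COEF_EMPLOYEES = 0.00684219674285623


def predict_profit(year, employees):
    """Forecast profit (in millions) from the fitted regression equation.

    Investment and export are omitted because their coefficients are 0.0.
    """
    return INTERCEPT + COEF_YEAR * year + COEF_EMPLOYEES * employees


# Hypothetical inputs, chosen only to show how the equation is applied.
forecast = predict_profit(year=2015, employees=5000)
print(forecast)
```

Forecasts far outside the range of years and head counts in the original data should be treated with caution, since the linear relationship may not hold there.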

Conclusion

Time series data often arise when monitoring industrial processes, where time series analysis can be used to identify patterns in correlated data such as trends and seasonal variations. WSO2 CEP can be used to identify and quantify relationships in the data, which can then be used for forecasting based on previous patterns.

 
