Time Series Analysis with WSO2 Complex Event Processor
- Sohani Weerasinghe
- Senior Software Engineer - WSO2
A time series is a collection of data recorded over a period of time (e.g. daily, weekly, monthly, or yearly). Time series analysis is mainly used to make decisions based on long-term forecasting. There are two main types of time series: stock and flow. Stock series are measures, or counts, taken at a point in time, e.g. the number of mobile phones available in a shop on a particular day. This figure changes from day to day depending on the amount of stock received that day and the number of phones sold. Flow series are measures of activity over a given period of time, e.g. the number of mobile phones sold by the shop in a particular month. This figure accumulates day by day, depending on the number of phones sold each day, and the total number of sales can be calculated at the end of the month. The two types are treated in much the same way in time series analysis; the main difference is that flow series can be affected by the calendar, known as trading day effects, e.g. there will be more mobile phone sales in a January with five weekends than in one with four.
General aspects of time series patterns
Most time series patterns can be described in terms of four basic components - secular trend, seasonal variation, cyclic variation, and irregular variation.
- Secular trend
Secular trend is a long-term movement in a time series whose direction is either upward or downward. Data is taken over a long period of time, and the trend can be detected simply by taking averages over a certain period. If the averages change with time, there may be a trend in the series. Examples of an upward trend are population increases over a period of time, price increases over a period of years, and increases in the production of goods in a country's capital market over a period of years. An example of a declining trend is a decrease in the sales of a commodity over a period of time due to a substitute product. Figure 1 shows how sales have increased over a year, where Y denotes sales and X is the time in months.
Figure 1
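The averaging approach described above can be sketched in a few lines of Python. This is only an illustration, not part of WSO2 CEP: the function name `period_averages` and the monthly sales figures are hypothetical.

```python
def period_averages(values, period):
    """Average consecutive, non-overlapping blocks of `period` observations."""
    return [
        sum(values[i:i + period]) / period
        for i in range(0, len(values) - period + 1, period)
    ]

# Hypothetical monthly sales for one year (not taken from the article's figures).
monthly_sales = [10, 12, 11, 14, 15, 17, 18, 20, 19, 22, 24, 25]

# Quarterly averages; steadily rising averages suggest an upward trend.
quarterly = period_averages(monthly_sales, 3)
rising = all(a < b for a, b in zip(quarterly, quarterly[1:]))
print(quarterly, "upward trend" if rising else "no clear upward trend")
```

If the block averages stay roughly constant instead, the series probably has no secular trend over that window.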
- Seasonal variation
Seasonal variation is a short-term fluctuation in a time series that occurs periodically within a year and repeats year after year. The major factors responsible for this repetitive pattern are natural conditions such as weather, social and cultural behaviors, and business and administrative practices. For example, more woollen clothes are sold in winter than in summer, and, regardless of the trend, more ice cream is sold each year in summer than in winter.
Seasonality in a time series can be identified by regularly spaced peaks that have a consistent direction. Figure 2 shows the sales of a commodity over a year, where sales are higher in winter and summer than in other seasons.
Figure 2
- Cyclical variations
Cyclical variations are upward or downward movements in a time series where the period of the cycle is greater than a year. These variations are not as regular as seasonal variations, and cycles vary in length and size. An example of cyclical variation is the ups and downs in business activity over a long period of time.
Figure 3
- Irregular variation
Irregular variations are short-lived fluctuations in a time series with no regular pattern of occurrence. They are also referred to as residual variations since, by definition, they represent what is left in a time series after trend, cyclical, and seasonal variations are removed. Irregular fluctuations result from unforeseen events such as floods, earthquakes, and wars.
Main process of time series analysis in WSO2 CEP
WSO2 CEP is an enterprise grade server that integrates with various systems to analyze meaningful patterns in the real world, such as financial analysis, fraud detection, sales, shipping, and other business events. The core backend runtime engine of the WSO2 CEP server is WSO2 Siddhi, which can be used to perform time series analysis based on user inputs.
The user configures an execution plan that defines the input streams and the appropriate function in the Siddhi query. The function then provides the regression equation with relevant data, such as the beta coefficients, the standard error, and the t-statistic of each coefficient, which can be used to forecast future data or to identify outliers. The user also defines the required confidence interval, and based on that the coefficients are identified as weak or strong. If a predictor variable’s p-value exceeds the p-value threshold corresponding to the confidence interval, it is identified as a weak variable and its coefficient is set to zero, so that the variable has no effect on the criterion variable.
When configuring the Siddhi query, an event stream must first be defined by specifying its name and attributes. Each attribute is defined as a pair of a name and a type. After the stream is defined, Siddhi creates an input handler that is used to send events of the defined stream into the system. In addition, callbacks can be used to receive notifications when events are produced on the event stream. Finally, the events are projected into the outgoing event stream based on the defined outgoing stream attributes.
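The weak/strong rule can be illustrated with a short Python sketch. It is not the CEP implementation: it assumes the threshold is the significance level 1 - confidence interval, and the function name, coefficients, and p-values are hypothetical.

```python
def zero_weak_coefficients(coefficients, p_values, confidence_interval):
    """Return the coefficients with weak ones replaced by 0.0.

    Assumption: a coefficient is weak when its p-value exceeds
    alpha = 1 - confidence_interval (e.g. 0.05 for a 95% interval).
    """
    alpha = 1.0 - confidence_interval
    return [0.0 if p > alpha else b for b, p in zip(coefficients, p_values)]

# Hypothetical coefficients and p-values for four predictor variables.
betas = [100.69, 12.5, -3.1, 0.0068]
p_values = [0.001, 0.40, 0.22, 0.03]

print(zero_weak_coefficients(betas, p_values, 0.95))
```

With these made-up inputs the second and third predictors are treated as weak, so they contribute nothing to the prediction.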
Regression types supported by CEP
- Simple regression
In simple linear regression, the predicted result is based on only one variable, called the ‘predictor variable’ and referred to as X. The variable being predicted is called the ‘criterion variable’ and is referred to as Y.
The regression equation is y = β0 + β1 * x + ε
where β0 is the intercept, β1 is the slope, and ε is the error term, which is summarized by the standard error of the regression equation.
In simple linear regression, the predictions of Y plotted as a function of X form a straight line, as shown in Figure 4, where:
Y = Sales
X = Time in months
Figure 4
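As an illustration of the simple regression equation, the following Python sketch estimates β0 and β1 by ordinary least squares. It runs outside WSO2 CEP; the function name `fit_simple_regression` and the sales figures are hypothetical.

```python
def fit_simple_regression(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: X = time in months, Y = sales.
months = [1, 2, 3, 4, 5, 6]
sales = [12.0, 15.0, 19.0, 22.0, 24.0, 29.0]

b0, b1 = fit_simple_regression(months, sales)
forecast_month_7 = b0 + b1 * 7  # extrapolate one month ahead
print(b0, b1, forecast_month_7)
```

The fitted line can then be evaluated at future values of X to forecast Y, which is the use of the regression equation described above.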
- Multiple regression
The general purpose of multiple regression is to quantify the relationship between several independent or predictor variables and a dependent variable or a criterion variable. Here, the criterion variable is predicted by two or more predictor variables.
The regression equation is y = β0 + β1 * x1 + β2 * x2 + ... + βn * xn + ε
In multiple regression, the predictions of Y plotted as a function of the X variables form a graph as shown in Figure 5, where:
Y = Profit
X1 = Sales
X2 = Time in years
Figure 5
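For illustration, the multiple regression coefficients can be estimated by solving the normal equations. The Python sketch below is one way to do this and makes no claim about how Siddhi computes them internally; the function name and the data are hypothetical.

```python
def fit_multiple_regression(rows, ys):
    """Least squares for y = b0 + b1*x1 + ... + bn*xn.

    rows: one list of predictor values per observation.
    Returns [b0, b1, ..., bn] by solving (X^T X) b = X^T y.
    """
    design = [[1.0] + [float(v) for v in row] for row in rows]  # intercept column
    k = len(design[0])
    # Augmented matrix [X^T X | X^T y].
    a = [
        [sum(d[i] * d[j] for d in design) for j in range(k)]
        + [sum(d[i] * y for d, y in zip(design, ys))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        pivot_value = a[col][col]
        a[col] = [v / pivot_value for v in a[col]]
        for r in range(k):
            if r != col:
                factor = a[r][col]
                a[r] = [vr - factor * vc for vr, vc in zip(a[r], a[col])]
    return [a[i][k] for i in range(k)]

# Hypothetical observations: (sales, time in years) -> profit.
predictors = [(1, 1), (2, 1), (1, 2), (3, 4), (2, 5)]
profit = [6.0, 8.0, 9.0, 19.0, 20.0]  # generated as 1 + 2*x1 + 3*x2

print(fit_multiple_regression(predictors, profit))
```

The sketch assumes the predictors are not perfectly collinear; otherwise the normal equations have no unique solution.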
- Discrete seasonality regression
Discrete seasonality regression is performed on data that has seasonal patterns. Seasonal patterns are recognized by the same pattern repeating over a period of time, e.g. peak sales in December as shown in Figure 6, where Y denotes sales and X denotes time in months.
Figure 6
If the user identifies discrete seasonality in the data set, the user needs to add a dummy variable. Dummy variables are independent variables that take the value of either 0 or 1. In discrete seasonality regression, a dummy variable with a value of 0 causes its coefficient to disappear from the equation, while a value of 1 causes the coefficient to act as a supplemental intercept, since multiplying by 1 leaves the coefficient unchanged. For example, if the month is December the value of the dummy variable is 1; otherwise it is 0. When the user defines a dummy variable, it is added to the regression equation and a coefficient is calculated for it as well, which the user can then take into account in future predictions.
The regression equation is y = β0 + β1 * x1 + β2 * D + ε
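The effect of the dummy variable D can be shown with a small Python sketch. The helper names and the coefficient values are hypothetical and chosen only to make the supplemental-intercept behaviour visible.

```python
def december_dummy(month):
    """Dummy variable D: 1 for December (month 12), 0 for any other month."""
    return 1 if month == 12 else 0

def predict(b0, b1, b2, x1, month):
    """Evaluate y = b0 + b1 * x1 + b2 * D for the given month."""
    return b0 + b1 * x1 + b2 * december_dummy(month)

# Hypothetical coefficients: b2 is the extra sales expected in December.
b0, b1, b2 = 10, 2, 50

november = predict(b0, b1, b2, x1=11, month=11)  # D = 0, b2 drops out
december = predict(b0, b1, b2, x1=12, month=12)  # D = 1, b2 shifts the intercept
print(november, december)
```

In a real analysis the dummy column would be added to the input events before regression, so that b2 is estimated from the data rather than chosen by hand.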
- Continuous seasonality regression
If the seasonality is not discrete, i.e. if it is sinusoidal, the data set takes the form y ~ sin(x). The user should create a new variable defined as x1 = sin(x) and use x1 as the predictor variable for the regression. This gives the linear regression equation below.
y = β0 + β1 * x1 + ε
In continuous seasonality regression, the predictions of Y plotted as a function of X form a graph as shown in Figure 7, where Y denotes sales and X is the time in months.
Figure 7
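The x1 = sin(x) transformation can be tried out with the Python sketch below. It assumes x is already expressed in radians and uses synthetic, noise-free data, so the fit recovers the generating coefficients; `fit_line` is a hypothetical helper, not a CEP function.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Synthetic sinusoidal data: y = 2 + 3 * sin(x), with x assumed to be in radians.
xs = [0.5 * i for i in range(20)]
ys = [2.0 + 3.0 * math.sin(x) for x in xs]

# Transform the predictor, then run an ordinary linear regression on x1.
x1 = [math.sin(x) for x in xs]
b0, b1 = fit_line(x1, ys)
print(b0, b1)
```

Because the regression is linear in x1, the same approach works for any transformation of x that the user can compute before sending events.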
If the user identifies that the data follows some other pattern, such as quadratic (y ~ x^2) or exponential (y ~ a^x), the user needs to define X variables that represent that pattern, such as x1 = x^2 if it is quadratic and x1 = a^x if it is exponential.
Sample
The sample below predicts the profit of a company based on the year, investment, export, and number of employees, and shows how to carry out a time series analysis with WSO2 CEP.
Criterion variable - Profit of the company in millions
Predictor variables - Year, investment, export, number of employees
Figures 8 and 9 show the data set and the graph related to the sample.
Figure 8
Figure 9
By looking at the data set and the graph, we can see that this sample calls for multiple regression, since profit depends on more than one variable. The regression equation for this sample is
y = β0 + β1 * x1 + β2 * x2 + β3 * x3 + β4 * x4 + ε
where
y = Profit
x1 = Year
x2 = Investment
x3 = Export
x4 = Employees
β0 = the intercept
β1 = slope of Year
β2 = slope of Investment
β3 = slope of Export
β4 = slope of Employees
ε = Standard Error
After configuring the execution plan, the required output can be obtained by executing the Siddhi query below.
from DataStream#transform.timeseries:regress(ci, y, x1, x2, x3, x4 )
select *
insert into RegressionResult;
In the above query, ‘DataStream’ is the defined event stream. Because the regression is implemented as a custom transformer, it is invoked with Siddhi’s #transform syntax. The namespace of the extension is ‘timeseries’ and the function name is ‘regress’. The required attributes (confidence interval, criterion variable, and predictor variables) are passed as parameters to the regress function. The query defines ‘RegressionResult’ as the output stream that holds the output attributes.
The output contains the standard error, the intercept, and the slopes of the predictor variables. For the above example, the result is as follows:
Standard Error 48.0580567924631
Intercept -201253.1410667966
Coefficient of Year 100.6863391704232
Coefficient of Investment 0.0
Coefficient of Export 0.0
Coefficient of Employees 0.00684219674285623
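Since the coefficients of Investment and Export are zero, these variables were identified as weak and drop out of the equation, which becomes:
Profit = -201253.1410667966 + 100.6863391704232 * Year + 0.00684219674285623 * Employees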
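The resulting equation can be applied directly to forecast profit. The Python sketch below uses the coefficients reported above; the function name `predict_profit` and the input values for year and number of employees are hypothetical.

```python
# Coefficients taken from the regression result above.
INTERCEPT = -201253.1410667966
COEF_YEAR = 100.6863391704232
COEF_EMPLOYEES = 0.00684219674285623

def predict_profit(year, employees):
    """Predicted profit (in millions) from the fitted regression equation."""
    return INTERCEPT + COEF_YEAR * year + COEF_EMPLOYEES * employees

# Hypothetical inputs, used only to show how the equation is evaluated.
print(predict_profit(year=2015, employees=1000))
print(predict_profit(year=2016, employees=1000))
```

With the number of employees held constant, each additional year adds the Year coefficient to the predicted profit.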
Conclusion
Time series data often arise when monitoring industrial processes, and time series analysis can be used to identify patterns in correlated data, such as trends and seasonal variations. WSO2 CEP can be used to identify and quantify relationships among data, which can then be used for forecasting based on previous patterns.