Modelling & Forecasting Time-Series data has been one of the cornerstones of Predictive Analytics in the era of Big Data. A plethora of forecasting techniques is available today, but the context in which each one applies can be hard to grasp, and as we know, in the war on noise, context is crucial ammunition. To that end, we, Hemanth Sindhanuru & Srinidhi K from LatentView Analytics, are presenting this series of articles, in which we discuss a structured methodology to understand, analyse & forecast time-series data.
1: Introduction to Time-Series Analysis
A data scientist comes across time-series in almost every aspect of day-to-day research, be it the straightforward sales data of a firm or complex cohort data on a firm's customer segments over the years.
Although data scientists have a slew of machine-learning techniques at their disposal for generalized data analysis, the analysis of time-series is fundamentally different because time-series have the attribute of Sequentiality, i.e. two time-series with the same n observations but in a different sequence can have completely different characteristics. This attribute gives rise to certain features that are specific to time-series.
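To make the point about Sequentiality concrete, here is a minimal sketch in Python. The data is hypothetical, and the lag-1 autocorrelation is only one of many order-dependent characteristics we could have picked; the helper name lag1_autocorr is ours, not part of any library.

```python
import random

def lag1_autocorr(xs):
    """Sample lag-1 autocorrelation of a sequence of numbers."""
    n = len(xs)
    mean = sum(xs) / n
    denom = sum((x - mean) ** 2 for x in xs)
    numer = sum((xs[t] - mean) * (xs[t - 1] - mean) for t in range(1, n))
    return numer / denom

# Hypothetical data: 48 observations with a steady upward drift.
ordered = [10 + 0.5 * t for t in range(48)]

# Exactly the same 48 values, but in a different sequence.
rng = random.Random(0)
shuffled = ordered[:]
rng.shuffle(shuffled)

r_ordered = lag1_autocorr(ordered)
r_shuffled = lag1_autocorr(shuffled)

print("same observations:", sorted(ordered) == sorted(shuffled))
print("lag-1 autocorrelation, ordered: ", round(r_ordered, 3))
print("lag-1 autocorrelation, shuffled:", round(r_shuffled, 3))
```

Summary statistics such as the mean and variance are identical for both sequences, yet the dependence between neighbouring values is strong in one and largely destroyed in the other.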
This article is the first in a series that will summarize the various techniques for modelling & forecasting a time-series available in the current scientific literature.
We begin with the simplest form of temporal data, Univariate Time-Series, and then move on to time-series structured in other ways, which include Multi-Variate Time-Series & Hierarchical Time-Series.
So, how do we define a Univariate Time-Series as an entity? Any sequence of real numbers collected at regular intervals in time, where each number represents an observed value & its index in the series represents the time-stamp at which it was recorded, can be considered a Univariate Time-Series.
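As an illustration of this definition, the sketch below pairs a sequence of values with regularly spaced time-stamps. The sales figures, the start date and the weekly interval are all made up for the example.

```python
from datetime import date, timedelta

# Hypothetical weekly sales figures (illustrative values only).
values = [120.0, 135.5, 128.0, 142.25, 150.0]

# Regular sampling: one observation every 7 days from an assumed start date.
start = date(2020, 1, 6)
step = timedelta(days=7)

# The position of each value in the sequence determines its time-stamp.
series = [(start + i * step, v) for i, v in enumerate(values)]

for stamp, value in series:
    print(stamp.isoformat(), value)
```

Because the sampling is regular, the time-stamps carry no information beyond the index, which is why a univariate time-series can be treated as a plain ordered sequence of numbers.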
The flowchart below describes a methodical approach for the analysis & forecasting of univariate time-series.
The diagram below lists the forecasting models which will be covered as we move forward in this series.
Inferring from the Time-Series Plots
The first step in time-series analysis (as is the case with any other analysis) is visualizing the data. In most cases, the nature of the time-series & its attributes can be deduced from the time-series plot itself.
Answering the following questions from the time-series plot will give us a very strong context of the time-series attributes which we will be discussing going forward.
• Are there any periodic patterns that are clearly dominant in the data?
➢ Are there any significant peaks or troughs at regular intervals?
➢ What is the scale of these periodic variations? Are they constant at every interval, or do they increase in amplitude?
• Is there any long-term pattern in the data?
➢ Is the mean of the data increasing, decreasing or constant with time?
➢ What is the nature of this pattern: linear, non-linear or ambiguous?
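The questions above are meant to be answered by eye, but a rough numerical cross-check can help. The following is a minimal sketch in Python on a hypothetical monthly series; the helper names (fit_line, autocorr, dominant_period) are ours, and the peak-picking rule is a simple heuristic rather than a standard procedure.

```python
import math

def fit_line(xs):
    """Ordinary least-squares fit of xs against the time index 0..n-1."""
    n = len(xs)
    t_mean = (n - 1) / 2
    x_mean = sum(xs) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (x - x_mean) for t, x in enumerate(xs))
    slope = sxy / sxx
    intercept = x_mean - slope * t_mean
    return slope, intercept

def autocorr(xs, lag):
    """Sample autocorrelation of xs at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    denom = sum((x - mean) ** 2 for x in xs)
    numer = sum((xs[t] - mean) * (xs[t - lag] - mean) for t in range(lag, n))
    return numer / denom

def dominant_period(xs, max_lag):
    """Lag of the highest local peak in the autocorrelation, or None."""
    acf = [autocorr(xs, k) for k in range(max_lag + 2)]
    peaks = [k for k in range(2, max_lag + 1)
             if acf[k] > acf[k - 1] and acf[k] > acf[k + 1]]
    return max(peaks, key=lambda k: acf[k]) if peaks else None

# Hypothetical monthly series: linear trend plus a 12-month cycle.
n = 72
series = [50 + 0.8 * t + 10 * math.sin(2 * math.pi * t / 12) for t in range(n)]

# Long-term pattern: is the mean rising, falling or flat?
slope, intercept = fit_line(series)

# Periodic pattern: remove the fitted trend, then look for a repeating lag.
detrended = [x - (intercept + slope * t) for t, x in enumerate(series)]
period = dominant_period(detrended, max_lag=30)

# Scale of the periodic variation: peak-to-trough range within each cycle.
ranges = [max(detrended[i:i + period]) - min(detrended[i:i + period])
          for i in range(0, n, period)]

print("slope per step:", round(slope, 2))
print("dominant period:", period)
print("range per cycle:", [round(r, 1) for r in ranges])
```

A clearly positive slope points to an increasing mean, a dominant lag points to peaks and troughs at regular intervals, and ranges that stay roughly equal from cycle to cycle suggest that the periodic variation is not growing in amplitude.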
Stationarity of the Time-Series
Stationarity is one of the fundamental attributes of a time-series. A time-series is said to be stationary when its statistical properties (such as mean, variance and autocorrelation) do not depend on the time at which the series is observed, i.e. there is no trend or seasonal pattern in the time-series. Random (white) noise is the simplest example, although a stationary series can still show short-term correlation between neighbouring values. So a stationary time-series would be roughly horizontal (with none of the increasing, decreasing or periodic patterns we looked out for above) with constant variance.
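An informal way to probe the "roughly horizontal with constant variance" description is to compare summary statistics over different stretches of the series. The sketch below, in Python on simulated data, compares the two halves of a series; it is only an eyeball check under our own assumptions, not a formal test.

```python
import random
from statistics import mean, pvariance

def half_summaries(xs):
    """Mean and variance of the first and second half of a series."""
    mid = len(xs) // 2
    first, second = xs[:mid], xs[mid:]
    return (mean(first), pvariance(first)), (mean(second), pvariance(second))

rng = random.Random(42)
n = 200

# Hypothetical stationary series: independent Gaussian noise.
noise = [rng.gauss(0, 1) for _ in range(n)]

# Hypothetical non-stationary series: the same noise around a rising mean.
trending = [0.1 * t + e for t, e in enumerate(noise)]

for name, xs in [("noise", noise), ("trending", trending)]:
    (m1, v1), (m2, v2) = half_summaries(xs)
    print(f"{name}: mean {m1:.2f} -> {m2:.2f}, variance {v1:.2f} -> {v2:.2f}")
```

For the noise series the two halves look alike, while for the trending series the mean shifts markedly between halves, which is exactly the kind of time dependence that stationarity rules out.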
To test whether a series is stationary, we have Unit Root tests, which are hypothesis tests with either non-stationarity or stationarity of the data as the null hypothesis. Some of the reliable ones are the Dickey-Fuller (DF) test and the Phillips-Perron (PP) test, whose null hypothesis is that the series has a unit root (is non-stationary), and the Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test, whose null hypothesis is that the series is stationary.
R has an implementation of the Augmented Dickey-Fuller (ADF) test (for example, adf.test in the tseries package), which extends the Dickey-Fuller test to a larger and more complicated set of time-series models. Please note that since the ADF test is a hypothesis test, it can produce false positives and false negatives. It is therefore recommended to examine the data further to confirm its stationarity.
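To show what such a test computes, here is a minimal sketch in Python of the plain (non-augmented) Dickey-Fuller statistic with a constant term. It is not the R implementation and omits the lagged-difference terms that make the test "augmented"; the function name and the simulated data are our own, and the value -2.86 is the commonly tabulated asymptotic 5% critical value for this constant-only case. For real work, use a tested library implementation.

```python
import math
import random

def dickey_fuller_stat(ys):
    """t-statistic of gamma in: diff(y)[t] = alpha + gamma * y[t-1] + error."""
    x = ys[:-1]                                   # lagged levels y[t-1]
    d = [b - a for a, b in zip(ys[:-1], ys[1:])]  # first differences
    m = len(x)
    x_mean, d_mean = sum(x) / m, sum(d) / m
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (di - d_mean) for xi, di in zip(x, d))
    gamma = sxy / sxx                             # OLS slope
    alpha = d_mean - gamma * x_mean               # OLS intercept
    rss = sum((di - alpha - gamma * xi) ** 2 for xi, di in zip(x, d))
    std_err = math.sqrt(rss / (m - 2) / sxx)      # standard error of gamma
    return gamma / std_err

CRITICAL_5PCT = -2.86  # assumed asymptotic 5% value, constant-only case

rng = random.Random(7)

# Hypothetical stationary series: independent Gaussian noise.
noise = [rng.gauss(0, 1) for _ in range(300)]

# Hypothetical unit-root series: a random walk built from the same shocks.
walk = []
level = 0.0
for e in noise:
    level += e
    walk.append(level)

stat_noise = dickey_fuller_stat(noise)
stat_walk = dickey_fuller_stat(walk)

for name, stat in [("noise", stat_noise), ("random walk", stat_walk)]:
    verdict = "reject unit root" if stat < CRITICAL_5PCT else "cannot reject unit root"
    print(f"{name}: statistic {stat:.2f} -> {verdict}")
```

A statistic well below the critical value is evidence against a unit root. A random walk will usually, but not always, fail to reject, which is the false positive/negative caveat in practice.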
Keep watching this space for more