Multiple time series forecasting with FB Prophet and Apache Spark using Google Colab

Arindum Mondal
4 min read · May 3, 2021

Forecasting the temperature of 120 different cities with a single time series forecasting model, in a distributed manner.

Table of contents:

  1. Introduction
  2. About FB Prophet and Apache Spark
  3. Understanding the business problem
  4. Model building

Introduction

Time series analysis and forecasting is one of the key factors for any business that depends on its production or services. Even so, time series is still considered one of the lesser-known skills in the data science domain. In this article, I will talk about time series forecasting.

Time series forecasting can be treated as a supervised machine learning problem where the target variable depends mainly on the time axis (other variables may also contribute to the prediction; that is called multivariate time series forecasting). For example, in summer people like to drink beer or whiskey, but in winter people generally prefer rum, wine, or brandy.

So it is quite similar to regression, but not identical: a few extra things need to be taken care of in forecasting.

In this article, I will talk about multiple time series forecasting. What is that? Suppose a retail shop stocks multiple products. If the branch manager wants to see the upcoming sales forecast for all of them, data scientists would normally need to build a separate model for each product, because every product has its own sales trend and seasonality. Building so many individual models is a heavy task, and this is where multiple time series forecasting comes into the picture: it lets you build separate forecasting models for individual products within a single model architecture. There are various ways to do this; here I will use the FB Prophet library and Apache Spark to build the models in a distributed manner.

About FB Prophet and Apache Spark

FB Prophet is a nice time series forecasting library that helps build forecasting models without writing many explicit lines of code. It is open source, and it offers some extra features while forecasting that set it apart from traditional statistical methods like ARIMA, SARIMA, and VARMA.

Here is a nice blog about Apache Spark.

Apache Spark is a data processing framework that can quickly perform processing tasks on very large data sets, and can also distribute data processing tasks across multiple computers, either on its own or in tandem with other distributed computing tools — Ian Pointer.

The dataset I have used has almost 2 million (20 lakh) rows before preprocessing, containing temperatures for almost 120 different cities. Pandas can handle this amount of data easily, but in a real-world scenario there may be gigabytes of data, which Pandas cannot handle. That is where Apache Spark helps us.

Understanding the business problem

Here I have used a very common example: forecasting temperature. The data contains past temperatures of all the cities from 1995 to 2020. Based on the past data, I have built a multiple time series forecasting model that gives a different temperature forecast for each city. Apache Spark helps build the model in a fast and distributed manner.

Model building

Basically, Apache Spark needs a few particular things in order to run, such as a JVM (Java) installed on the local system. Practically, there are two easy ways to run Apache Spark without setting this up yourself: one is Databricks, and the other is Google Colab. Here I used a Google Colab notebook. Here is the notebook.

Now, a few important commands need to be run before using Apache Spark in Colab.
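The setup typically looks something like the following (an illustrative sketch; exact package names and versions depend on your Colab runtime, and older Prophet releases were published as `fbprophet`):

```shell
# Spark needs a JVM, so install OpenJDK first.
apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
# Install PySpark and Prophet ("fbprophet" on releases before v1.0).
pip install -q pyspark prophet
```

In a Colab cell, each of these lines is prefixed with `!` so it runs in the shell.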

All the necessary libraries are installed. Now it is time to import them in Colab.

Here I will skip the formal code-along part; if you need it, visit my notebook for the sequential code. I will highlight only the major parts of the code.

Below is the structure of the dataframe:
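As a stand-in for the screenshot, here is a tiny frame in the same shape. The column names follow the public "daily temperature of major cities" dataset and are an assumption about the exact schema used in the notebook:

```python
import pandas as pd

# Two illustrative rows; the real file has ~2 million.
raw = pd.DataFrame({
    "Region": ["Africa", "Africa"],
    "Country": ["Ivory Coast", "Ivory Coast"],
    "City": ["Abidjan", "Abidjan"],
    "Month": [1, 1],
    "Day": [1, 2],
    "Year": [1995, 1995],
    "AvgTemperature": [78.1, 77.6],
})

# Assemble the separate Year/Month/Day columns into one datetime column,
# which Prophet will later expect as 'ds'.
raw["ds"] = pd.to_datetime(raw[["Year", "Month", "Day"]].rename(columns=str.lower))
print(raw[["City", "ds", "AvgTemperature"]])
```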

Before running the code on Apache Spark, a Spark session needs to be created and the Pandas dataframe converted into a Spark dataframe.

In Spark, the converted dataframe looks like:

FB Prophet needs the data in a specific structure, otherwise it cannot run: the date column must be named 'ds' and the target column must be named 'y'.

Here is the output of the Prophet model:

Here you will find that some extra columns have been added, like yhat_upper and yhat_lower. These are nothing but the bounds of the confidence interval around the predicted value. By default the interval width is 80%.

Now you can visualize the forecast for an individual city.

For example, below is the forecast for the city of Abidjan:
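A sketch of how such a plot can be produced from the collected forecast output, using a hypothetical stand-in frame `forecasts_pd` (in practice you would get it via `.toPandas()` on the Spark result):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for the collected Prophet output; values are made up.
forecasts_pd = pd.DataFrame({
    "City": ["Abidjan"] * 3,
    "ds": pd.to_datetime(["2020-05-01", "2020-05-02", "2020-05-03"]),
    "yhat": [80.2, 80.5, 80.1],
    "yhat_lower": [75.0, 75.3, 74.9],
    "yhat_upper": [85.4, 85.7, 85.3],
})

# Filter one city, then draw the point forecast with its shaded band.
city = forecasts_pd[forecasts_pd["City"] == "Abidjan"]
plt.plot(city["ds"], city["yhat"], label="forecast")
plt.fill_between(city["ds"], city["yhat_lower"], city["yhat_upper"],
                 alpha=0.3, label="80% interval")
plt.legend()
plt.title("Abidjan temperature forecast")
plt.show()
```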

Here you can see the forecast for the next 365 days, along with the last three years, together with the confidence interval. If you look closely, you will see that the Prophet model has captured almost all of the past points within the 80% confidence interval. So I can say that the model has done a wonderful job. You can try the same with other cities.


Arindum Mondal

I am a Master's student in Data Science. I like to work on NLP and deep learning optimization problems.