As practitioners of data science, we’re always curious about what the future holds for this field. My focus here is to look at the past and extrapolate to understand the near future. As a famed Danish philosopher once said, “It’s difficult to make predictions, especially about the future.”
Enterprises are inundated with data from social, mobile, IoT and other technologies, and the pace of the data flow is only accelerating. Over 300 hours of video are uploaded to YouTube every minute, up from 100 hours per minute in 2013. Facebook users watch the equivalent of 750 years of video every day. Over 500 million tweets are generated on Twitter every day. The volume of digital data is doubling every 2 to 3 years and is expected to reach about 40 trillion gigabytes by 2020. The total number of ‘things’ that are connected or connectable is expected to hit the 200 billion mark by 2020.
Enterprises are becoming digital. They want to leverage this data to drive business model innovation, interact better with their customers and improve their business processes. To profit from this data, companies need a digital strategy, the right people to drive the execution of that strategy, and a solid technology foundation along with big data analytics capabilities.
A suite of technologies has evolved over the last several years to help enterprises create economic value out of this data. The technologies along this stack have disrupted, or are disrupting, legacy data analytics technologies, which were characterized by enterprise data warehouses, data models and ETL.
Most of the disruptive technologies of today started out just the way many breakthrough technologies do – focusing on a niche segment, offering lower performance on the attributes that mainstream customers care about, and typically offering simpler and less expensive options. However, they improve over time and add new attributes that customers don’t get from existing incumbents, eventually displacing those incumbents. The cycle then repeats itself. This process is well understood thanks to research by Clayton Christensen (see The Innovator’s Dilemma).
Now, let’s see what the stack is, and how the underlying forces are shaping changes in the stack. We will have a lot to say on many of these in our subsequent blog posts.
Data Ingestion & Management – Moving towards secure, two-way, real-time or batch data flows across a variety of network topologies (no longer restricted to hub-and-spoke or one-way flows), along with the ability to store this data, enrich it with master/reference/look-up data, and support look-ups, joins, set operations, sorting and filtering.
From the ETL of the previous era, we are moving into the ‘IoT era,’ where a significant amount of data is expected to be generated by sensors or ‘things’ at the edges of the network. This data flows continuously or intermittently, bi-directionally, and often in point-to-point topologies. It comes in addition to the text, images, videos, etc. that organizations are already inundated with.
The data store has also evolved: from structured, ‘define on write’ (schema-on-write) data warehouses with well-defined metadata and a limited set of structured sources, to ‘define on read’ (schema-on-read) data lakes with sparse metadata and high-velocity, unstructured data sources. Data lakes are implemented through a combination of NoSQL stores, file systems, RDBMSs, MPP databases and in-memory stores.
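To make the contrast concrete, here is a minimal ‘define on read’ sketch using PySpark. The lake path and field names are hypothetical; the point is simply that raw files land in the lake as-is and structure is inferred only when someone reads them.

```python
# A minimal schema-on-read sketch with PySpark.
# The lake path and field names below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# Raw, semi-structured events landed in the data lake as-is, with no upfront model.
events = spark.read.json("s3a://datalake/raw/clickstream/2016/*.json")

# The schema is discovered from the data itself at read time.
events.printSchema()

# Structure is imposed only for the question at hand.
events.filter(events["event_type"] == "purchase") \
      .groupBy("product_id") \
      .count() \
      .show()
```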
Data ingestion and management was an area handled by commercial ETL suites in the mid to late 2000s. Today, the commercial ETL product suite is dead. The fundamental underpinnings of this transformation started out in pure-play digital organizations as a way of managing their own data infrastructure, which led to the birth of Hadoop. From there, Hadoop and its animal farm of technologies overcame resistance from internal IT to eventually be adopted as the cornerstone of modern data management.
Today, the key technologies in this layer of the stack include Hadoop, Spark, Mesos, Akka, Cassandra, Kafka and Storm, which can be used to ingest and manage any type of data for further analysis, insights and actions. MapReduce is now used only for ‘slow’ data, where completeness rather than speed is what matters. In the near future, we are moving towards a data flow management paradigm, exemplified by Niagara Files (NiFi), developed by the National Security Agency and open-sourced under the Apache License. This allows enterprises to collect, conduct and curate real-time data, moving it from any source to any destination while doing real-time analytics and reporting on it.
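As a rough illustration of this style of ingestion, the sketch below consumes a stream from Kafka with Spark Streaming and runs a trivial per-batch computation. The broker address and topic name are made up; a production flow would add serialization, checkpointing and sinks.

```python
# A minimal real-time ingestion sketch: consuming events from Kafka with
# Spark Streaming. The broker address and topic name are hypothetical.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="ingest-sketch")
ssc = StreamingContext(sc, batchDuration=10)  # micro-batches every 10 seconds

# Pull raw sensor/clickstream messages directly from a Kafka topic.
stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["sensor-readings"],
    kafkaParams={"metadata.broker.list": "broker1:9092"},
)

# Each record is a (key, value) pair; count messages per micro-batch
# as a stand-in for real-time analytics on the flow.
stream.map(lambda kv: kv[1]).count().pprint()

ssc.start()
ssc.awaitTermination()
```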
In recent times, we’ve seen the birth of the self-service movement. The promise is that business users can take matters into their own hands to bridge the gap between what they want and what IT delivers. The intersection of the self-service movement, open source and design thinking has led to the emergence of exciting technologies such as Tamr, Trifacta, Datameer, Datawatch, etc., which promise to take the pain out of data management and make it accessible to the business.
Analytics & Reporting – Moving away from huge amounts of ‘data puke’ delivered through complex interfaces built with the classic waterfall model, towards an agile approach that emphasizes time to insight, visual innovation and user experience.
This was an area dominated by Business Intelligence tool vendors, such as Cognos, Microstrategy, SAS, etc. who were building advanced enterprise-wide capabilities such as scalability, metadata, authentication, etc. These tools relied on a foundation of enterprise data warehouses. However, decision makers were not getting all the insights and information needed to make smarter decisions in a timely manner.
A new breed of vendors, such as Tableau, has helped business users squarely address the problem created by the business-IT gap: making it easy to access, explore and visualize data through intuitive interfaces. However, Tableau and its clones do not have the enterprise scalability, metadata, collaboration and security capabilities provided by the traditional BI vendors. Enterprises are willing to overlook this if it helps them reduce time to insight and improve user experience.
In the near term, plenty of competitors are emerging alongside Tableau, such as Spotfire and Qlik, and their on-demand counterparts such as Platfora, GoodData, Domo, Chartio, Power BI, Birst, etc., each excelling in one area or another. A lot of the action is in delivering the experience through the browser and reducing the time to decisions.
Exploratory Analysis – Exploring cause-effect relationships with multi-variate visualizations, interaction analysis, CHAID trees, regression models, ANOVA, statistical testing and multi-level relationships. Traditionally, analysts have undertaken exploratory analysis using technologies such as SAS and (more recently) R.
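To ground this in something runnable, here is a minimal exploratory-analysis sketch in Python, with pandas and statsmodels standing in for SAS or R; the file name and column names are hypothetical.

```python
# A minimal exploratory-analysis sketch: a regression and a one-way ANOVA
# with pandas and statsmodels. The CSV and column names are hypothetical.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("campaign_results.csv")  # hypothetical data set

# Fit a regression: marketing spend and region explaining revenue.
model = smf.ols("revenue ~ spend + C(region)", data=df).fit()
print(model.summary())

# ANOVA table: does revenue differ significantly across regions?
print(sm.stats.anova_lm(model, typ=2))
```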
There is still too much pain in even simple data analysis. IBM Watson, Microsoft Cortana, and smaller competitors such as Ayasdi, BeyondCore, Quid and others are all working to reduce the friction generated when real-world data meets the average human. Each of these vendors is addressing this through a variation of ‘Smart Insights’ offerings, whereby they use natural language to improve the user experience, or apply advanced math (such as topology) to extract hidden insights and present them in a visually compelling manner to the end user. Most of this is still delivered from the cloud, since it needs enormous computing power even on moderately sized data sets.
Apache Spark has enabled the mainstreaming of graph analysis, a technology based on graph theory that analyzes complex relationships in a hyper-connected world of enterprise social graphs. This is a significant new capability that will enable analysts to visually answer questions such as finding the shortest path between two points in a network, making better recommendations in real time, automatically routing agents to where they are needed, and a huge variety of other problems that are too difficult to solve with traditional analytics.
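As a toy illustration of the shortest-path question, the sketch below uses networkx on a hand-made network; at enterprise scale the same idea would run on Spark’s graph libraries (GraphX or GraphFrames). All nodes and weights are invented.

```python
# A toy shortest-path example with networkx; nodes and weights are made up.
import networkx as nx

g = nx.Graph()
g.add_weighted_edges_from([
    ("warehouse", "hub_a", 4),
    ("warehouse", "hub_b", 2),
    ("hub_a", "store", 5),
    ("hub_b", "store", 8),
    ("hub_b", "hub_a", 1),
])

# Cheapest route from warehouse to store across the network.
path = nx.shortest_path(g, "warehouse", "store", weight="weight")
cost = nx.shortest_path_length(g, "warehouse", "store", weight="weight")
print(path, cost)  # ['warehouse', 'hub_b', 'hub_a', 'store'] 8
```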
Advanced Analytics – Advanced analytics may be classified into several areas, including:
• Predictions/Forecasting/Deep Learning/Scoring – Predicting future values through simple or black box approaches, using statistical or machine learning models
• Experiment Design, A/B Testing – Design experiments or quasi-experiments that uncover the root causes of variance, in order to identify the drivers of variability or to improve a process or task
• Optimization – Achieve business goals by finding the best solution among all feasible solutions, for example the best way to allocate stocks within a portfolio in order to maximize overall profit within a given time horizon
Machine learning is a technology that is at least 40 years old. However, the advent of cloud computing has removed the computational barriers to ML modeling. In addition, another exciting area, GPU computing, makes it possible to train deep learning models on local servers, even with the coming slowdown of Moore’s law.
It’s now becoming possible to train machine learning models on large data sets, thanks to the evolution of platforms like Apache Spark, H2O (from 0xdata), Microsoft’s contributions to open source R (in addition to packages like bigmemory), etc. In addition, cloud providers such as Microsoft, Amazon and Google deliver tools that make it easy to build and deploy predictive models.
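As a hedged sketch of what model training on one of these platforms can look like, here is a minimal Spark MLlib (pyspark.ml) example; the input path, feature columns and label are all hypothetical.

```python
# A minimal model-training sketch with the DataFrame-based Spark MLlib API.
# The file path and column names below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("churn-model-sketch").getOrCreate()

df = spark.read.parquet("s3a://datalake/curated/customers.parquet")

# Assemble raw numeric columns into the single feature vector MLlib expects.
assembler = VectorAssembler(
    inputCols=["tenure_months", "monthly_spend", "support_calls"],
    outputCol="features",
)
train = assembler.transform(df)

# Fit a logistic regression to score the likelihood of churn.
lr = LogisticRegression(featuresCol="features", labelCol="churned")
model = lr.fit(train)
print(model.coefficients, model.intercept)
```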
In recent years, several researchers have come together to create Deep Learning algorithms using an open source model. In deep learning, rather than hand-code a new algorithm for each problem, you design architectures that can twist themselves into a wide range of algorithms based on the data you feed them.
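A tiny sketch of that idea, using Keras as one of several possible frameworks: the architecture below is entirely generic, and only the data fed into it determines what it ends up doing. The shapes and the training arrays are placeholders.

```python
# A generic architecture that 'twists itself' to whatever data it is fed.
# Uses Keras; the training arrays below are random placeholders.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Toy data: 1,000 examples with 20 input features and a binary label.
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=(1000,))

# Nothing in this network is hand-coded for a specific problem.
model = Sequential([
    Dense(64, activation="relu", input_dim=20),
    Dense(64, activation="relu"),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Feeding different data would steer the same network toward a different task.
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
```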
Experimental design is an approach used to systematically improve a process (such as making the right offers to customers) when there are multiple factors that affect variability. Building predictive models on observed data provides insights and predictions of future customer behavior, but the data may be biased or may not exhibit variability across all the factors that control the outcomes. Experiment design allows marketers, for example, to massively and systematically increase the number of variables that are tested, even when there are biases or a lack of historical variation. The results of experiments can precisely quantify which variables drive the desired customer behavior. Experimental design has led to significant improvements in direct mail, product feature design, contact center agent allocation and messaging, and customer interactions in general.
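As a small worked example of the kind of test this enables, the sketch below runs a two-proportion z-test on invented conversion counts for two offer variants, using statsmodels.

```python
# A minimal A/B-test sketch: did a new offer lift conversion significantly?
# All counts below are invented for illustration.
from statsmodels.stats.proportion import proportions_ztest

conversions = [310, 255]   # conversions in variant A and variant B
exposures = [5000, 5000]   # customers shown each variant

z_stat, p_value = proportions_ztest(count=conversions, nobs=exposures)
print(z_stat, p_value)  # a small p-value suggests the lift is real
```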
Optimization refers to a broad range of capabilities and has its foundations in industrial engineering. The manufacturing, travel, transportation and logistics, and hospitality industries have invested heavily in optimization capabilities. Optimization is expected to become mainstream in marketing and customer journey management over the next several years.
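To make the earlier portfolio example concrete, here is a toy linear-programming sketch with scipy; the expected returns and constraints are made up, and a real allocation problem would also model risk.

```python
# A toy portfolio-allocation sketch: maximize expected return subject to a
# budget constraint and per-asset caps. All figures are invented.
import numpy as np
from scipy.optimize import linprog

expected_returns = np.array([0.08, 0.12, 0.05])  # per-asset expected return
c = -expected_returns                            # linprog minimizes, so negate

# Weights must sum to 1 (fully invested), and no single asset may exceed 50%.
result = linprog(
    c,
    A_eq=[[1, 1, 1]], b_eq=[1.0],
    bounds=[(0, 0.5), (0, 0.5), (0, 0.5)],
)
print(result.x)  # optimal weights, e.g. [0.5, 0.5, 0.0]
```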
Autonomous Entities – Using a combination of technologies to create smart machines that can perform as well as, or far better than, humans at tasks such as driving cars, making investments and diagnosing diseases. In one scenario, forecasters expect that over the next ten years our commutes will be made painless by shared, driverless cars. In hospitals and clinics, robots will do the primary diagnosis. Everyone’s smart phone will have a personal assistant that acts like a human: coordinating calendars for meetings, sending voice and video messages, providing a visual interface for searching, doing comparison shopping, reminding us to take medications on time and much more. Meanwhile, edible nanobots will make invasive surgeries redundant by repairing or performing preventive maintenance on blood vessels, tissues, bones or organs, or by precisely targeting diseases like cancer. Virtual reality will make travel redundant: anyone will be able to experience the feeling of being anywhere else without leaving the couch. Investments will be driven entirely by autonomous robots. All of this will be made possible by the convergence of big data, machine learning, IoT, design thinking and high-performance computing (cloud, GPU, quantum, etc.).