Idea Labs Journal


Big Data, Big Processing, Big Compute

Ramesh Hariharan

We are at the cusp of the “IoT era”, in which a significant amount of data is expected to be generated by sensors, or “things”, at the edges of the network. This data flows continuously or intermittently, bi-directionally, in a point-to-point topology. In addition to IoT data, there is a large volume of unstructured and semi-structured data (text, images, videos, etc.) that flows mostly in a single direction, from source to data lakes, in batches or as streams.

Enterprises are investing heavily in ingesting, processing, and analyzing all this data in the hope of reducing costs and risks, creating new business models, and improving business processes. However, there are several challenges in making this a reality. To understand these challenges better, let's look at the different types of processing that happen with all of the data.

Typically, we can classify them as: data management (look-up, join, set operations, sorting, filters, etc.); analytics and reporting (summarize, group, roll-up, rank, etc.); extracting rules and building models (predictive and optimization models); real-time scoring (scoring customers in real time); real-time optimal decision making (making the best pricing, recommendation, or bidding decisions); and deep learning (highly compute-intensive tasks such as image classification). As we move from the first category to the last, the complexity of computing increases.

The cost of computation and storage is falling dramatically, thanks to Moore’s law (which has some life left in it even after 50 years). The traditional approach to processing big data has been to throw more computing power at the problem. For batch data management and analytics tasks, the industry evolved the MapReduce paradigm. In this paradigm, the data, rather than the computation, is divided into a large number of pieces that are spread across different compute units. Each piece is processed independently and in parallel (the map task), and the partial results are then combined (the reduce task). This is a classic divide-and-conquer approach that works very well for data management and for some types of analytics tasks.
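The divide-and-conquer flow described above can be sketched in a few lines of Python. This is only an illustration of the idea on a single machine, not how Hadoop or any production framework implements it; the function names (split_into_chunks, map_task, reduce_task, word_count) and the word-count task itself are assumptions chosen for the example, and a thread pool stands in for separate compute units.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(records, num_chunks):
    """Divide the data (not the program) into roughly equal pieces."""
    size = max(1, -(-len(records) // num_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_task(chunk):
    """Count words within one chunk, independently of all other chunks."""
    counts = defaultdict(int)
    for line in chunk:
        for word in line.lower().split():
            counts[word] += 1
    return dict(counts)


def reduce_task(left, right):
    """Merge two partial word counts into one."""
    merged = dict(left)
    for word, n in right.items():
        merged[word] = merged.get(word, 0) + n
    return merged


def word_count(records, num_chunks=4):
    """Run the map tasks in parallel, then fold the partial results together."""
    chunks = split_into_chunks(records, num_chunks)
    with ThreadPoolExecutor() as pool:  # stand-in for separate compute units
        partials = list(pool.map(map_task, chunks))
    return reduce(reduce_task, partials, {})


if __name__ == "__main__":
    lines = ["big data big compute", "big processing", "data lakes"]
    print(word_count(lines, num_chunks=2))
```

The approach works here because the reduce step only needs the partial counts, never the raw lines, so no compute unit ever has to see the whole dataset.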

However, MapReduce does not work well for tasks that require all the data at once (for example, building predictive or optimization models, even in batch mode). It is also a poor fit for processing streaming data and for graph computing. For such problems, the approach so far has been to use specialized software such as SAS, rather than open tools such as R. Over the last few years, Microsoft’s Revolution R has provided an enhanced version of R that takes advantage of multiple threads in a processor, and this has mitigated the problem to some extent. In addition, there are other R packages, such as bigmemory, that support working with large datasets for certain types of predictive modeling.
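A small Python sketch can make the "all the data at once" limitation concrete. It uses the median as a simple stand-in for the model-building tasks mentioned above (that substitution is an assumption made for illustration): a mean can be rebuilt exactly from per-chunk partial results, while a naive per-chunk median cannot. The function names and the sample data are hypothetical.

```python
from statistics import mean, median


def chunked_mean(chunks):
    """Mean decomposes cleanly: each chunk reports (sum, count)."""
    partials = [(sum(c), len(c)) for c in chunks]  # map step
    total, count = map(sum, zip(*partials))        # reduce step
    return total / count


def naive_chunked_median(chunks):
    """Median of per-chunk medians: a tempting but incorrect reduce step."""
    return median(median(c) for c in chunks)


data = [1, 2, 3, 4, 100, 5, 6, 7, 8, 9]
chunks = [data[:3], data[3:5], data[5:]]

if __name__ == "__main__":
    print("mean:  ", mean(data), "vs chunked:", chunked_mean(chunks))
    print("median:", median(data), "vs chunked:", naive_chunked_median(chunks))
```

The chunked mean matches the true mean, but the chunked median does not match the true median, because the correct answer depends on how values are ordered across all chunks rather than within each one.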

More recently, Apache Spark has been rapidly adopted as a platform. It is very adept at handling batch as well as streaming tasks, with growing support for machine learning and graph computing. Spark is evolving into a single platform that covers the full range of capabilities, from batch to stream to machine learning to graph processing.

In recent years, there has been an explosion of deep learning algorithms. These are highly compute-intensive algorithms, used in tasks such as image classification, crowd density estimation, and language translation. These tasks are very well suited to GPU-based computing. Unlike a CPU (central processing unit), which consists of a few cores, a GPU (graphics processing unit) consists of thousands of simpler, smaller cores that can process many tasks in parallel. This makes GPU-based computing a very good fit for deep learning. Tasks that take days with a CPU-based approach can be completed in minutes with a GPU-based approach. However, these are still early days, and a lot of work remains before applications take full advantage of the GPU architecture.

Overall, there is a clear correlation between the types of analytics that are developed and the paradigm of computing that enables them. Hadoop and MapReduce enabled the collection and management of big data. Spark, R, and Python, combined with pervasive computing power, have enabled the proliferation of advanced analytics. Spark is enabling the rise of graph computing and the unveiling of complex relationships within data, while making everything else easier. GPUs have enabled the mainstreaming of deep learning methods, which have led to machines making major strides in pattern recognition and other artificial intelligence tasks. Over the next few years, we expect the emergence of even more advanced paradigms such as quantum computing, which will unleash solutions to different and more complex problems, including problems that are not in the consciousness of mainstream practitioners today.