At IdeaLabs, we’re working on a lot of exciting projects. In this post, we’ll talk about the typical pain associated with even simple analysis tasks, and see how to address these pain points.
We live in a world at the intersection of several technological trends – mobile, social, data-driven, design thinking, on-demand, algorithmic, digital, personalized, localized, miniaturized and simplified. Our mobile phones are at the center of our lives: we use Facebook for socializing, Gmail for email, sophisticated Google search widgets for context-rich information seeking, evaluating alternatives and decision making, and numerous apps for organizing and managing virtually every aspect of our connected lives.
Many of these apps are aesthetically pleasing, easy to use and generally work very well. The collective interaction of these technology trends has made us more productive and effective, and has unleashed our creative potential. We cannot imagine living in a disconnected world – a world without access to Google, email, Facebook or Instagram – for anything more than a few hours. In fact, we see an accelerating deployment of smart machines, beautiful interfaces, and context-rich systems everywhere, all of which reduce the friction of day-to-day interactions in the physical world.
However, when it comes to one specific type of knowledge work – business or data analysis – we are mostly limited to clunky interfaces, such as Excel (for simple problems) and R or Python (for complex problems). This is not to deny the significant progress made in the capabilities of data analysis tools; on the contrary, there has been tremendous technological innovation in many aspects of data analysis, including the ability to manage complex data flows and analyze large amounts of streaming data in real time – capabilities that have enabled our hyper-connected world.
The fact is, even in a Tableau-driven world, it still takes far too much work to discover or visualize even simple patterns in the data. For example, it takes 10 steps to create a histogram in Excel 2013 – and that's just for one column. Now imagine that you want to do a quick visual scan of the data in a text file, data that contains 20 columns and perhaps 1,048,576 rows plus a header row. You have no option but to create a pivot table and perform hundreds of clicks just to get a visual summary of the data.
Tableau offers a summary card, but it still covers only a few columns at a time. Tableau is just the beginning.
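For comparison, a quick visual scan like the one described above takes only a few lines in a scripting environment. The sketch below uses pandas with a small synthetic stand-in for the wide text file (the file contents and column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 20-column text file described above
# (in practice this would come from something like pd.read_csv("data.txt")).
rng = np.random.default_rng(0)
df = pd.DataFrame({f"col{i}": rng.normal(size=100) for i in range(20)})

# One call summarizes every column at once...
summary = df.describe()
print(summary.loc[["mean", "std"]].round(2))

# ...and, with matplotlib installed, one more call draws a histogram
# for every numeric column: df.hist(figsize=(16, 10))
```

The point is not the specific library, but that a whole-table summary is one operation rather than hundreds of clicks.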
Let’s consider a more typical, but complex, analysis task, such as predicting which account holders are likely to default or charge off on a loan. This is a problem of classifying an account as “good” or “bad” in the future, based on the probability of default or charge-off. Since we don’t have access to such a dataset, we’ll consider a similar problem with open data.
Suppose you want to identify demographic characteristics that predict whether a person earns 50K or more. Let’s see how to do this using the typical tools available to a data scientist.
A typical set of steps could be:
1. Extract the most recent census data
2. Load the data into R, do some work to ensure that the data types and formats are correct, and remove any duplicates
3. Quickly create a histogram / summary of variables and inspect them for missing or abnormal values
4. Create the outcome variable by discretizing the gross income
5. If a categorical variable has too many levels, combine some of them. Consider binning some of the continuous variables, such as age, where there are conventionally agreed cut-offs (perhaps not for this dataset, but this applies in specific cases)
6. Split into training and test data sets
7. Build decision tree / logistic regression models, plot ROC curves, confusion matrices, etc., and select the best models
8. Finally, create a story that explains the rules to a decision maker, using Excel or PowerPoint
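The modeling core of the steps above (roughly steps 2–7) can be sketched in Python with pandas and scikit-learn. The data here is a small synthetic stand-in for the census extract – the column names and the income formula are hypothetical – so this shows the shape of the workflow, not real results:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, confusion_matrix

# Step 2: a synthetic stand-in for the loaded, cleaned census data.
rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "education_years": rng.integers(6, 20, n),
    "hours_per_week": rng.integers(10, 60, n),
})
# Hypothetical gross income, loosely driven by the features plus noise.
income = (1500 * df["education_years"] + 300 * df["hours_per_week"]
          + 100 * df["age"] + rng.normal(0, 8000, n))

# Steps 3-4: inspect, then discretize gross income into the outcome.
df["over_50k"] = (income >= 50_000).astype(int)

# Step 5 (optional): bin a continuous variable such as age.
df["age_band"] = pd.cut(df["age"], bins=[17, 30, 45, 60, 80])

# Step 6: split into training and test sets.
X = df[["age", "education_years", "hours_per_week"]]
y = df["over_50k"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Step 7: fit a model, then evaluate via ROC AUC and a confusion matrix.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, probs)
cm = confusion_matrix(y_test, model.predict(X_test))
print("ROC AUC:", round(auc, 3))
print(cm)
```

Even this compressed sketch assumes familiarity with splitting strategies, model choice, and evaluation metrics – which is precisely the expertise gap discussed next.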
To do this, one needs a good understanding of statistics and the ability to work with different tools, such as R and Excel (for charting or pivoting), across tasks such as data cleansing, analysis, modeling, evaluation, rule extraction, and storytelling.
For problems that are even less clearly defined, it takes a lot of talent to get things done. There simply isn’t enough talent to go around.
There is a better way of doing things. One of our goals is to build tools that make the analysis process simple and useful, without reducing it to medieval medicine. We expect to achieve this by putting the user experience at the center of the platform, while making big data and advanced analytics algorithms work for the user.
In our next post, we’ll talk about how to apply some of the emerging technology trends to reduce the pain in common analysis tasks. Keep watching this space for more.