A Lazy Data Scientist’s Toolkit

By Mark Wilcock

In 2006, I was thrown a difficult analytics task. My failure to make inroads into it haunts me and influences the choice of tools I use for analyzing and visualizing data even today. 

I was working for an investment bank in the risk management department. The bank had done a few trades where we had lost money. The risk managers had traced it back to a vulnerability in a pricing model. They had plugged that leak, but were there any others? I had access to the database of roughly a million active trades and, using Excel and SQL, identified which ones had gone bad.

Most of my analytics challenges, like the one above, have a degree of urgency and a commercial imperative. The theme of the Business Analytics Marathon on 7th June is "time to insight", which is spot on. The self-portrait of the 'lazy' data scientist in the title comes about because I am often under time pressure to get results for clients who neither know nor care about the tools or technicalities of data science. Under these conditions, we need tools that are powerful enough to attack the problem but simple enough to use quickly and to assist in having meaningful discussions with the clients. This is where good visualization comes in.

Eleven years later, I still use Excel and SQL but I have added a few more weapons to my armoury: R, Tableau, Power BI, Azure ML and cognitive services. At the PASS Business Analytics Marathon on June 7th I will demonstrate these and explain why I like them. 

Tableau is a data visualization tool. I first used it on a project in 2013. The chief risk manager at a commodities firm gave me my best specification ever: a one-page sketch of a trading dashboard. I reproduced the sketch as a working dashboard, like the snapshot below (once I had generated a random but realistic data source with R).

I was also able to quickly build and show other visualizations based on the same dataset. We had a conversation that went “Here is what you asked for but you may also like to consider these.”

Power BI is a data preparation, data modelling and visualization tool. I will demonstrate its capabilities using a public dataset from the European Banking Authority (EBA): last year's stress test results for 51 EU banks. For example, the visualization below shows the maturity profile of the credit exposure of UK banks.

R is an open source language for statistics, data preparation and machine learning – and much more. There is an old saying “All roads lead to Rome” but in my case, that could be shortened to “All roads lead to R”. I use R not so much for the base language but more for the wonderful variety of packages available. These provide functionality in all sorts of areas; I use (and will demo) packages for cleaning messy data, transforming data, text analysis and creating beautiful plots such as the one below that gives an overview of the progress of a large project.

The number and variety of packages result from the open and generous nature of the R community. I have a sneaking suspicion that people who use R lead more interesting lives, or at least have more interesting jobs, than the rest of us. Questions asked about R on Stack Overflow (a question and answer site for developers) and vignettes (short introductory tutorials) for R packages often hint at interesting jobs or pursuits. A case in point: the tidytext vignette is a source of delight for Jane Austen fans.

I use Azure Machine Learning for the "traditional" data science task of training, testing and evaluating a predictive model, and I will show an example of Azure ML predicting a counterparty's intent (whether to buy or sell) during an RFQ (request for quote) transaction. Azure ML's box-and-connect visual interface, shown in the snapshot below, helps in exploring the dataset as well as choosing and tuning a good predictive model.

The power of any toolkit is in combining tools to achieve a result that is more than the sum of the parts. For example, the R ggplot2 package creates beautiful plots, but these are static images. Power BI can bring a ggplot to life, making it interactive by combining it with other visuals and filters on a dashboard. I will demo this during the webinar using the example below. This is a classic "backtest" chart used in financial market risk.

The latest addition to my armoury is cognitive services. These are powerful algorithms available in the cloud for speech and vision, but I am particularly interested in the language capabilities, especially text analytics. Of course, there are packages in R that make using these cognitive services easy.

Whether any of these tools would have helped me back in 2006, I am not sure. John Tukey, a famous statistician, said 20 years before, in 1986, "Exploratory Data Analysis is attitude, flexibility and graph paper". In 2017, we have better tools than graph paper in our toolkit, but perhaps our best asset is our own curiosity about the analytical problem at hand.
