Academic institutions the world over are creating Data Science courses, degrees, institutes or colleges. Students are overwhelmingly seeking data science degrees. Industry and government are seeking to employ data science literate candidates. Prof Nalini Ravishanker, Department of Statistics, University of Connecticut, shares her experience with Data Science in recent years in a few application domains.
Data science is an emerging discipline with plenty of job opportunities. Data scientists are practitioners trained to analyse data in the best possible way to facilitate good decision-making. In the US, data science has become ubiquitous and every university has a data science program of one sort or another. You can think of data science as a tripod, according to the National Science Foundation. Its three legs can be thought of as statistical methods, computing algorithms and domain discipline.
The world is full of uncertainty. Statistical methods enable us to quantify uncertainty. Computer science is what gives us efficient algorithms. So if computing algorithms and statistical methods pair up, then we have a coherent discipline, where we have efficient algorithms for handling complex, big, humongous, streaming data or any kind of data coming into play. We can forecast and do efficient decision making under the blanket of mathematical statistics or probability, so that you have confidence in what you are doing. It’s not ad hoc. This is a disciplined way of doing things. The third leg of the tripod is the domain discipline. You could be running a marketing firm, a cybersecurity firm or a finance operation that produces data. If you want to analyse it well, in order to make decisions, then perhaps you should think about this new discipline that is cropping up called data science.
Good data scientists are definitely very savvy data analysts who facilitate decision making under uncertainty. Earlier, data engineers did all the warehousing. They brought the data, cleaned it up and brought it to the table. Then the data was passed to the analysts who did all the statistics and computer algorithms. But that division is becoming fuzzy now. Today, a data scientist most likely has to do some data engineering as well. They also need to talk to one another. Everyone is looking to hire data scientists. If you train yourself in data science, you will have a job. There are tons of places where data science is being done these days.
Case Study 1: Freeze Loss Mitigation
We did IoT data analysis for an insurance firm to mitigate insurance loss. The IoT data arrive over time, so this is time series analysis. It was done in collaboration with a company called The Hartford Steam Boiler (HSB), which is in Hartford, Connecticut, near my university. I’ve been working with this company since 2019 on this project, which concerns freeze loss mitigation. Frozen pipes are a very common problem in the cold parts of the United States. The water pipes may break and the whole house or building gets flooded. It is a catastrophe for the client and for the insurance firm, which has to pay out huge settlements. From the insurance firm’s point of view, the question is: what can I do at the client’s place to bring down this risk of freeze loss?
Here is the idea: HSB will install sensors in some locations in a building where they can detect the temperature. The client will have to monitor it. The temperature inside the building will be transmitted back to HSB. There will be engineers and other analysts watching their dashboards and computers, checking whether there is potential for a freeze. If the temperature keeps trending down, they alert the customer by a telephone call or a text message. Then they sit and hope that the insured person will take action, ideally by going and raising the thermostat in their house. The problem is, no client is going to take the time to call HSB and say, “Thank you for the alert. I just fixed the problem.” So HSB is left guessing.
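The alerting idea described above can be illustrated with a very small rule: raise an alert once the indoor temperature has fallen for several readings in a row and has dropped below a warning level. This is only a sketch under stated assumptions. The function name, the 45 °F warning level and the three-reading rule are hypothetical and are not HSB's actual alerting logic.

```python
from typing import Optional, Sequence


def first_freeze_alert(
    temps_f: Sequence[float],
    warn_below_f: float = 45.0,  # hypothetical warning level, not HSB's
    min_drops: int = 3,          # hypothetical: falling readings required in a row
) -> Optional[int]:
    """Return the index of the first reading that should trigger an alert, or None."""
    drops = 0
    for i in range(1, len(temps_f)):
        # Count how many consecutive readings have fallen; reset on any rise or tie.
        drops = drops + 1 if temps_f[i] < temps_f[i - 1] else 0
        if drops >= min_drops and temps_f[i] < warn_below_f:
            return i
    return None


# Illustrative indoor readings (degrees Fahrenheit), made up for this sketch.
readings = [62.0, 61.5, 58.0, 52.0, 47.0, 44.0, 41.0]
alert_index = first_freeze_alert(readings)
print(alert_index)
```

A rule this simple would produce many false alarms on noisy data, which is the motivation for the more careful methods discussed in the rest of this case study.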
And here is where they seek help from a data analyst. They know what the temperature trajectory looks like. They also know what the outside temperature looks like. They came to us and said, “You call yourself time series experts. Can you use this information and tell us how and when to alert the customer first? And second, can you tell by just looking at the data before the alert and after the alert, whether the customer at the other end could have taken action or not?”
It seemed like a tall order. But that’s how data science works. This project was a joint venture between people in computer science and statistics. I was involved with the head of the Computer Science department. He and I formed a team, with our graduate students. For the last four years, we’ve been tackling problems like this.
Need for Effective Alerts
The data ranges from 2020 to 2021, covering just the winter months in the US. HSB installed 509 sensors with at least one alert event, spanning 38 States, 28 parent companies and 28 device locations (like kitchen, basement, etc.). Our team has integrated machine learning (ML) methods with statistical inferential methods to produce effective alerts. Imagine you are watching a movie in your house or taking a nap. An alert keeps coming. A false alarm can get really irritating. It’s just like crying wolf. The next time a real alert happens, you will ignore it. Effective alerts are not false alarms.
We came up with a method to modify an existing machine learning algorithm to reduce the number of false alerts. The second thing we did was to detect the action taken by the insured, after the alert. The pattern of the insured’s data may or may not change. So we compared the before-data with the after-data, again, using statistical ideas, to say whether, with a high probability, we think the customer took action or not. The methodology we use is called causal impact methodology in a Gaussian process modelling framework.
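To make the before/after comparison concrete, here is a deliberately simplified stand-in. It is not the Gaussian process causal impact method named above. It assumes that indoor temperature before the alert can be predicted from outdoor temperature with a straight line, and then asks how far the post-alert readings sit above that prediction. All names, the data and the threshold of 2.0 are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def action_score(out_pre, in_pre, out_post, in_post) -> float:
    """Mean post-alert lift over the pre-alert prediction, in units of pre-alert residual spread."""
    intercept, slope = fit_line(out_pre, in_pre)
    resid_pre = [y - (intercept + slope * x) for x, y in zip(out_pre, in_pre)]
    resid_post = [y - (intercept + slope * x) for x, y in zip(out_post, in_post)]
    return mean(resid_post) / stdev(resid_pre)


def customer_likely_acted(score: float, threshold: float = 2.0) -> bool:
    """Hypothetical decision rule: a large positive lift suggests the heating was turned up."""
    return score > threshold


# Made-up outdoor and indoor temperatures before the alert.
out_pre = [30.0, 28.0, 25.0, 22.0, 20.0, 18.0]
in_pre = [60.0, 59.2, 57.3, 56.1, 55.2, 53.9]
out_post = [17.0, 16.0, 15.0]

acted = action_score(out_pre, in_pre, out_post, [62.0, 64.0, 65.0])
ignored = action_score(out_pre, in_pre, out_post, [53.5, 53.0, 52.4])
print(customer_likely_acted(acted), customer_likely_acted(ignored))
```

A full causal impact analysis would attach a probability to the conclusion rather than a hard threshold. The sketch only shows the shape of the comparison: predict what would have happened without action, then measure the gap.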
The third thing that we did was clustering the sensors. Imagine there are thousands of sensors and the number keeps growing every week. We can cluster the sensors according to the behaviour of the temperatures. For instance, out of 4,000 sensors, 600 behave one way, 400 behave another way and so on. That is useful for a variety of reasons. Here we use data science algorithms called nearest neighbour methods. All our work is generally coded in Python and/or R. This work was done with Python code. The first deliverable was better alerts to customers. (Fig: Deliverable 1)
Between October and February, the external temperature is quite low. There is always the possibility that pipes might freeze. The internal temperature is higher because we all live in heated buildings. But there are points where the internal temperature is dipping. For some reason, we begin to suspect that it can go further and the customer is alerted at this point.
Suddenly, after this, there seems to be a jump up and then it starts to fluctuate again. So, basically what we think is that the customer took action. We obtained this alert through an isolation forest plus algorithm that we developed. We use something called data thresholding and distance-based filtering. We cleverly used the correlation between outside temperature and inside temperature. Of course, there are lots of equations involved.
The second deliverable (Fig: Deliverable 2) is to detect customer action by just looking at the temperatures. There are four scenarios that you can see: True Positive, True Negative, False Positive and False Negative. The next thing that we do is clustering the sensors. In a neighbourhood system, we come up with a distance metric. It is like saying whether somebody is close to me or not close to me. People that are close to me come into the same cluster. People that are far away belong to different clusters. Heuristically, we were able to cluster the different temperature data into six different groups. They are useful for people analysing the data.
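The "close to me or not close to me" idea can be sketched with a distance metric and a radius. The example below is a simple leader-style clustering, used here as a stand-in for the nearest neighbour methods mentioned in the text. The Euclidean metric, the radius of 3.0 and the tiny temperature profiles are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length temperature profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_clusters(profiles: Sequence[Sequence[float]], radius: float) -> List[int]:
    """Put each profile in the cluster of its nearest leader, or start a new cluster."""
    leaders: List[Sequence[float]] = []
    labels: List[int] = []
    for profile in profiles:
        dists = [distance(profile, leader) for leader in leaders]
        if dists and min(dists) <= radius:
            # Close enough to an existing cluster: join the nearest one.
            labels.append(dists.index(min(dists)))
        else:
            # Far from every existing cluster: this profile starts a new one.
            leaders.append(profile)
            labels.append(len(leaders) - 1)
    return labels


# Made-up three-reading profiles for five sensors.
profiles = [
    [68.0, 67.0, 66.0],
    [67.0, 67.0, 65.0],
    [50.0, 48.0, 45.0],
    [51.0, 47.0, 46.0],
    [69.0, 68.0, 66.0],
]
labels = assign_clusters(profiles, radius=3.0)
print(labels)
```

Because a profile that is far from every existing leader opens a new cluster, the grouping can grow as new sensors arrive. The result depends on the order in which profiles are seen, which is one reason a production method would be more careful than this sketch.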
The R method also allows us to come up with new clusters. As new data comes in, they don’t have to fall into the same groups as before. Week by week, we start learning. So, this is a dynamic learning process, where we have used all kinds of tools and models to understand the data. In summary, although I took a very simple example of IoT temperatures that an insurance company is using, the methods are very general and fundamental. We have R code and Python code on our GitLab link that people can leverage and try and apply and see if it works. This can be applied in many disciplines.
Case Study 2: Marketing Promotion
When a firm gives promotions, how effective are they? To which kind of customers should they give promotions, or should they stop giving promotions? We call this the promotion effectiveness study. The data is from a leading personal care manufacturer who spends lots of money and wants to know the increase in revenue to justify the increase in promotional spends. Can they quantify it for different time periods? That is the new thing we can bring to the table. We can do it in a hierarchical way for different retail outlets, different channels, different regions and so on and find out how to optimise the promotional spends.
The study is done in collaboration with a company called Cogitaas AVA. It’s a Mumbai-based consulting company. The selling is done on four sales channels: general trade, general trade-wholesale, modern trade and super stockist. The data is monthly, over 36 months. The previous case used 15-minute data over a long period of time. That is high frequency data. This is just monthly data, the kind of data many retailers might use on volume, sales, price, number of retailers, etc. There are about 15 or 17 regions in India, like Uttar Pradesh, Andhra Pradesh and Tamil Nadu.
As a data analyst, the first thing you do is visualization, using bar graphs. We looked at the regional distribution on the x axis and the channels on the y axis. We can see which channel is dominant in which state. We can also do it by region and channel. We see spikes followed by dips. If there is a promotion, the sales increase.
We also look at trade promotions: both deep and shallow discounts, given to retailers (not to the end customer) to increase distribution. We did this for different products and modelled the sales in week t, week by week. There is an advantage in looking at data as it comes in over time. We modelled the sales for 104 weeks under four channels for 17 regions. We built static and dynamic time series statistical models.
The client believed that a unified management could be cost saving. The data scientists do not do the analysis in isolation; they respond to the client’s needs. So if the client gives us the mandate, we think of the best way to implement the client’s request. So we clustered different regions and used similar statistical models within each of those clusters.
The model that you build for Tamil Nadu need not be the model you build for Gujarat. They could all have their own little models. The dynamic model identifies time periods of above average and below average impact because our goal is to see the impact of promotional spend on sales. We are able to assess in which periods of time, the impact is high and in which periods of time the impact is low and for which channel it is high, low, etc. This kind of deep dive will help the firm to understand better and track the promotional dollars. One of the deliverables was the dynamic impact of spends by channel and region.
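One crude way to see "periods of above average and below average impact" is to estimate the slope of sales on promotional spend separately for each block of time and compare the slopes. This is a blockwise stand-in, not the dynamic time series models used in the study, which would let the impact evolve smoothly. The block length, the function names and the numbers are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Least squares slope of y on x: extra sales per extra unit of spend."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx


def block_impacts(spend: Sequence[float], sales: Sequence[float], block: int) -> List[float]:
    """Estimate the spend-to-sales slope separately for each non-overlapping block."""
    starts = range(0, len(spend) - block + 1, block)
    return [slope(spend[s:s + block], sales[s:s + block]) for s in starts]


def above_average_blocks(impacts: Sequence[float]) -> List[int]:
    """Indices of blocks whose estimated impact exceeds the average impact."""
    avg = mean(impacts)
    return [i for i, b in enumerate(impacts) if b > avg]


# Made-up series for one channel in one region: nine periods, three per block.
spend = [10.0, 12.0, 14.0, 11.0, 13.0, 15.0, 10.0, 12.0, 14.0]
sales = [120.0, 124.0, 128.0, 133.0, 139.0, 145.0, 150.0, 160.0, 170.0]

impacts = block_impacts(spend, sales, block=3)
print(impacts, above_average_blocks(impacts))
```

Running the same calculation per channel and per region gives the kind of slice-by-slice comparison described above, though a real analysis would also control for price, seasonality and other drivers of sales.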
We can do all the slicing and dicing with the kind of models that we build. Then we can create dashboards for firms. They can click the channel or region and see exactly what is happening and get their ROI calculations, which is the final output. We can also do this for E-commerce platforms and dynamically measure the impact on sales and identify campaigns that deliver healthy ROI. The dynamic time series modelling is becoming a routine part of data science.
Data wrangling is a common term when you think about data, and it is the first step. The next steps are developing an effective analysis and modelling framework to tackle the questions, then computation (coding with speed and accuracy), and then creating a dashboard. In almost every situation, the firm would expect the data analyst to help with the automation and create a dashboard with some interactive tools for their people to use.
From an academic point of view, what are the outcomes that people like me see from collaborations with industry?
- First of all, we advance the science, because we develop new methods. Every time, someone from a firm comes to us with a problem and that problem cannot be solved by existing methods effectively, there is an opportunity to develop new methods. That pushes the science forward.
- We can also develop code, which is very useful. If everyone keeps on coding the same algorithm over and over again, it’s a waste of human resource. So the code is developed and shared. People can use code that is available and then automate the methods for deployment.
- That is also a useful tool for our students to learn. We train graduate students in data science for the workforce.
- We also produce publications to push the science and put the work that we have done in the public domain.
- We can have healthy industry-academia partnerships. Firms like HSB or Cogitaas partner with academia (UConn DS). The firms can provide interesting open problems, which can be used as course projects or as capstone projects.
If I’m worried about losing my key customers, can we have alerts to say, “Watch out. This customer could potentially leave”?
You’re talking about churn. There are many well-known models for churn. What methods one uses will depend on the kind of data that one is able to get.
What are your expectations of students applying for the data science programs?
My university as well as any other university that starts these programs would have some gateway. Many US institutions make it as open as possible. We are not saying that only statistics undergraduates or computer science undergraduates should apply. In the curriculum, for example, in my university, we have differential calculus, an intro course in statistics and some training in R and Python. Some quantitative literacy is definitely necessary. Many universities offer an intro to data science course, which levels the field and makes those who join, at least, minimally competent, so that when they start taking next level courses, they could catch on.
My own experience this year has been that when I taught applied statistics for data science students, there was deep silence for the first few weeks. Anytime I would ask a question, the class would just stare back at me and I was scared. But then you would be surprised at how things changed. Towards the end, if I asked a question, there were six hands shooting up. The students are very resilient and they catch up very fast.
How important is data cleaning in the context of the big data world that we are all living in?
It’s very important because we all know that it is garbage in, garbage out. So we need to put good data into the modelling. First of all, there is a lot of missing data. Then we have to come up with good methods for imputing. We must look at the data from different points of view with checks and balances. Everything has to gel. We spend a lot of time making sure that the data is kosher before we trust and put it into a model.
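As one small illustration of imputing missing data, the sketch below fills gaps in a regularly spaced series by linear interpolation between the nearest known readings. This is just one basic option, chosen here for simplicity; the function name and the readings are made up, and the right imputation method depends on the data.

```python
from typing import List, Optional, Sequence


def interpolate_gaps(values: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Fill interior None gaps by linear interpolation; leading and trailing gaps stay None."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        # Change per time step between two neighbouring known readings.
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


# Made-up sensor readings with missing values marked as None.
raw = [None, 60.0, None, None, 54.0, 53.0, None]
clean = interpolate_gaps(raw)
print(clean)
```

Gaps at the start and end are left alone on purpose, since there is nothing on one side to interpolate from.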
Is the intro to data sciences course a preparatory course for MS program in the US or is it a standalone course offered by the university?
The intro to data science will be the first course that the master’s students take after they enroll and come to the program. The performance in the intro course is not necessarily a gateway to enroll for the MS program. The intro to data science carries three credits out of the 30 credits in our program.
How do you ensure the genuineness of the data? Will you be able to identify if the firm is deliberately giving wrong data?
We look in many ways and do lots of checks and balances. I think if a firm works with me, I’m hoping that they are not falsifying the data. If we suspect that the data may be falsified, there are ways in which we can detect. We can look for consistency across time and do some pattern recognition.
How do you address issues that come because of using data from multiple data sources? How do you deal with data security?
Dealing with multiple data sources is very easy. We start building models using hierarchy. Data privacy and security is a very critical thing today for the clients. My university does not want to take responsibility for storing client data on our premises. So all of our analysis is done on the company’s machines. No data comes into the university.
Moonlighting is a serious issue now. How can we prevent employees dealing in data science, from passing on the data to a competitor?
Companies generally make sure that you can only access the data with a computer and you cannot pull out any data from there. You have to do the modelling within the system. The permissions must be provided in such a way that even if you want to, you cannot take anything out of that data. We also have to sign MOUs. There are very strict laws. Even before we start the work, there has to be a contract.
We come across cases where the client has a lack of clarity on the business problem. How can we solve this issue?
We get clarity through dialogue. The first few weeks go into a dialogue between the two parties. That is why perhaps you need an experienced person. Converting a management problem into a data science problem calls for skill and requires a lot of interaction and dialogue. It takes some time. Once you narrow it down and clearly define the problem, then the solutions are bang on. So when you implement it back into your company, you get exceptional results.
How do you get value out of data science? How would you advise the students who want to pursue a career in the field of data science?
I am a statistician, a half data scientist as per my definition. I would, of course, say it’s very, very valuable. The workplace needs data scientists. We know that things change. Once upon a time, Physics was the king. Then Bio became the king. Now everybody wants a trained data scientist. That is why the universities are all responding by opening data science programs, including IIT Madras. There is a need and, therefore, if the students are willing to invest a year or year and a half of their time to study data science, they will get jobs.