Building Effective Data Science Teams for Advanced Decision Making
Anurag Harsh
Founder & CEO: Creating Dental Excellence, Marvel Smiles and AlignPerfect Groups
The Value of Data
Around twelve years ago I started actively working with data and building analytics groups at leading companies to help them sell more, market better and operate more efficiently. I believe that period marked the beginning of data science as a professional discipline and the start of the data movement. The volume of data worldwide is forecast to grow at a 32% CAGR to 180 zettabytes by 2025 (Source: IDC). Data science professionals are now in demand and every company wants to hire them, largely because Google, Facebook and Amazon have shown the world the power of data and applied intelligence, creating something impactful out of it. It is thus imperative to understand the role data scientists play in an organization and how to think about building effective data science teams.
What does it mean to be a data-driven organization? In my experience it is simply a company that collects, processes and uses data in a manner that creates efficiency, yields competitive new products and enables better decisions. I prefer organizations that use data effectively over those that merely claim to collect lots of it or brag about its complexity.
Companies like Amazon have taken data to a whole new level. They have mastered the art of using analytics to their advantage and, in the process, to the advantage of the customer. Other ecommerce companies have since become adept at recommending purchases: recall the section online that reads “people who viewed these items also viewed or purchased”. This is what we call collaborative filtering, and it is the next best thing to an online search because it allows retailers like Amazon to get customers to make multiple purchases, or any purchase at all, in that browsing session instead of losing them. At the core of this phenomenon lies a social network fueled by data: a collection of consumers connected to one another via the products they browse, choose or purchase.
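A minimal sketch may make collaborative filtering concrete. The sessions, item names and simple co-view counting below are invented for illustration; real recommenders work on vastly larger matrices with similarity weighting, but the “people who viewed this also viewed” logic is the same:

```python
from collections import defaultdict, Counter

# Hypothetical browsing sessions: each set is the items one customer viewed.
sessions = [
    {"kettle", "toaster", "mug"},
    {"toaster", "mug"},
    {"kettle", "mug", "teapot"},
    {"kettle", "teapot"},
]

# Count how often each pair of items appears in the same session.
co_views = defaultdict(Counter)
for items in sessions:
    for a in items:
        for b in items:
            if a != b:
                co_views[a][b] += 1

def also_viewed(item, k=2):
    """Top-k items most often viewed alongside `item`."""
    return [other for other, _ in co_views[item].most_common(k)]

print(also_viewed("kettle"))  # mug and teapot co-occur with kettle most often
```

Production systems typically replace raw co-occurrence counts with a similarity measure such as cosine similarity, but even this toy version captures why the technique converts a single browsing session into multiple purchases.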
Social networks use the ‘people you may also know’ principle that you have surely come across: if you know Bob Jones and he knows Liz, there is a probability you know Liz as well. As data scientists we solve these types of problems on graphs where the data points are not as closely connected. Imagine trying to find Bob Jones on Facebook without any context. The social network goes a step further, using state-of-the-art machine learning models that also observe the time and number of connections it takes you to acquire friends or reach a long-term commitment. This is critical because Facebook tries to shorten the time it takes you to reach a critical number of friends, because once you do you will hang around more and perhaps do more, as opposed to quitting.
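The ‘people you may know’ idea can be sketched as a friend-of-friend traversal over a graph. The names and adjacency sets below are invented, and real systems add machine-learned weights for things like connection time, but the core ranking by mutual friends looks like this:

```python
# Hypothetical friendship graph stored as adjacency sets.
friends = {
    "You": {"Bob"},
    "Bob": {"You", "Liz", "Ana"},
    "Liz": {"Bob", "Tom"},
    "Ana": {"Bob", "Tom"},
    "Tom": {"Liz", "Ana"},
}

def people_you_may_know(user):
    """Rank non-friends by how many mutual friends they share with `user`."""
    scores = {}
    for friend in friends[user]:
        for fof in friends.get(friend, set()):
            if fof != user and fof not in friends[user]:
                scores[fof] = scores.get(fof, 0) + 1
    return sorted(scores, key=scores.get, reverse=True)

print(people_you_may_know("You"))  # Liz and Ana are friends of your friend Bob
```

Candidates with more mutual friends rank higher, which is why the suggestion quality improves as your network grows.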
The same concept applies to streaming sites that encourage you to add more movies to your queue, because the data suggests you are likely to become a longer-term customer once you cross a certain threshold of movies in your queue. This lets streaming sites tune the flow of content to maximize the number of new customers who cross that threshold and become long-term streamers. The trial periods of the leading streaming sites do a decent job of solving for this.
Analyzing customer data to fuel longer-term commitment is not just the forte of big social media and streaming players. Video game companies also use data effectively. For example, Zynga generates a massive volume of data by incessantly analyzing its users and their activity and movement within the games. Zynga realizes its gamers’ long-term prospects are a function of their interaction with other gamers on the platform, how many buildings they build in the first ‘x’ days and how many monsters they annihilate in the first ‘y’ hours. This allows Zynga to curate the product to increase engagement so its new users can achieve these goals.
A/B testing of new websites to make them more effective, sell more product or return better search results based on collected data has been around for a long time, pioneered by Amazon and Google, the latter using MapReduce to analyze large datasets. Yahoo created Hadoop, which has become the most important tool in the data scientist’s toolkit today, marketed by Cloudera/Hortonworks (a Yahoo spinoff), MapR and other companies.
In the credit card and payment processing world of MasterCard, PayPal, Amex, Venmo and the like, a common use case is the collection and analysis of data to spot pattern abnormalities for fraud detection and then act on the intel in a matter of milliseconds, a daunting task to say the least. Google deals with data at a scale that perhaps no other company faces today and has had to develop bespoke solutions comprising hardware, software (MapReduce) and algorithms (PageRank) to make sense of it, much of which has since been opened up to the developer community through open source projects. The best data-driven organizations believe that if something cannot be measured it cannot be fixed, and that value is created by
- Collecting as much data as possible so it can be used
- Measuring the data within a short time frame
- Democratizing the analysis and testing of the data so many people get to review, poke holes in it and eradicate bugs, and
- Encouraging inquisitiveness about the data and the reasoning behind its changing state.
Most importantly, remember that data is not just for data scientists and analytics experts but for everybody within an organization.
Roles Played by Data Teams
Let us look at the three groups of roles played by data teams-
Group 1: Data Science and Analytics
Group 2: DataOps and Services
Group 3: Data Engineering & Infrastructure
Group 1: Data Science and Analytics
Business Intelligence and Decision Making
A vital aspect of decision making is creating KPI dashboards for the key metrics, and for that you have to define what those metrics really are, a non-trivial task. I have seen dashboards full of blind spots, with KPIs that have no real impact on operational decision making. Companies need to pick KPIs that act as “dials” to move the business in the right direction, and to read the performance indicators in the context of one another. If you look at percentages, make sure you are also looking at the raw numbers. The KPIs also need to evolve as the business expands: a weather report with just temperature is far worse than one that also includes air pressure, humidity and so on. How the data is shown and distributed also matters, from Excel, Google Sheets or web forms to more sophisticated tools such as Tableau, QlikView, Qlik Sense or Power BI.
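As a toy illustration of why percentages need raw numbers next to them on a dashboard (the channels and figures below are entirely made up), a view showing only conversion rates would crown the wrong channel:

```python
# Invented figures: a percentage alone can mislead without the raw counts.
channels = {
    "email":  {"visits": 40,     "sales": 10},     # 25% rate, tiny volume
    "search": {"visits": 50_000, "sales": 6_000},  # 12% rate, huge volume
}

for name, kpi in channels.items():
    rate = 100 * kpi["sales"] / kpi["visits"]
    print(f"{name}: {rate:.1f}% conversion, {kpi['sales']} sales "
          f"from {kpi['visits']} visits")
```

Email “wins” on percentage, yet search drives 600 times the sales; the two dials only make sense read together.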
I mentioned earlier that data is not the privilege of the data science group alone but must be accessible to everybody within an organization, while taking security policies and other legalities into account. The democratization of data is important. A good example is how Facebook allows its employees to query its Hadoop-powered data store using a language called Hive, so any Facebook employee can create their own custom dashboard. No programming necessary. Zynga has something similar but uses a different technology, with two separate data stores: one with a higher level of performance, backed by service-level agreements, for KPI reporting that is always available, and another for regular employees, where performance may not be at the highest level but the data and reports are accessible nevertheless. eBay does something comparable, using Teradata to create cubes of data for each team, comprising datasets and custom databases unique to that team for it to interact with.
In many organizations the demand for data and analytics to inform key decisions has never been greater. This is what is known as classical decision science: groups of experts look at internal as well as external data sources to carry out competitive analysis, provide clarity on tactical decisions or help with strategic planning. A decision science team works on tasks such as figuring out what to invest in or divest from, when to buy or sell a business, which country to enter, or analyzing a specific market for the company. For this, the team might, for example, combine census data with internal data to build predictive models to test against current or newly procured data. Do note that in all of my 25+ year career I have rarely found a magic bullet in my analyses and data science projects, some rare discovery or number that changed the course of the business, so if you are a data science team looking for one and you find it, that’s great, but don’t count on it. Instead, look for dials you can count on to amplify value, and then look for more dials to increase value further.
Predictive Maintenance and Anomaly Detection
Data science teams are using machine learning to detect anomalies, for example analyzing time-series data from IoT sensors that monitor temperature or vibration to find anomalies and predict the equipment’s remaining life. These teams analyze large volumes of high-dimensional data using deep learning, then layer on audio or image data from other sensors, microphones and cameras, using neural nets to replace more traditional methods of generating insights. These decision scientists can run algorithms that predict failures ahead of time to allow for planned interventions, reducing downtime and operating costs and increasing production yield.
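A minimal, classical baseline for this kind of anomaly detection is a rolling z-score over the sensor readings. The vibration signal below is synthetic, and real deployments would use the deep learning methods described above on far richer inputs, but the shape of the problem is the same:

```python
import statistics

def zscore_anomalies(readings, window=20, threshold=3.0):
    """Flag readings more than `threshold` standard deviations away from
    the mean of the preceding `window` readings."""
    flagged = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        mu = statistics.fmean(history)
        sigma = statistics.stdev(history)
        if sigma > 0 and abs(readings[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Synthetic vibration signal with one injected spike at the end.
readings = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95] * 5 + [9.0]
print(zscore_anomalies(readings, window=10))  # flags the spike at index 30
```

Running the detector against live sensor streams, rather than a list in memory, is what turns this from an analysis into a predictive maintenance alert.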
Logistics optimization for cost reduction
Data science teams also work on optimizing logistics through real-time forecasting, resulting in an overall reduction in costs. Techniques such as continuous estimation can add tremendous value in routing delivery traffic, improving fuel efficiency and reducing delivery times. Sensors can monitor vehicle performance and driver behavior, and that data can then be used to coach drivers in real time: when to slow down, how to optimize fuel use and reduce maintenance costs, and alerts that predict congestion or weather problems.
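Continuous estimation can be sketched with an exponentially weighted moving average, where each new observation nudges the running forecast. The delivery times and smoothing factor below are invented for illustration:

```python
def ewma(values, alpha=0.3):
    """Running forecast updated by a fraction `alpha` of each new surprise."""
    estimate = values[0]
    for v in values[1:]:
        estimate += alpha * (v - estimate)
    return estimate

# Hypothetical delivery times (minutes) observed on one route.
times = [32, 35, 31, 40, 38, 36]
print(f"forecast for next delivery: {ewma(times):.1f} min")
```

A small `alpha` smooths out noise; a larger one reacts faster to genuine shifts such as new congestion on the route, which is the trade-off a logistics team tunes.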
Customer service management and personalized marketing
Decision science teams not only work on speech recognition data from call centers to offer a better customer experience, but also apply deep learning to audio data to assess a customer’s emotional state and route the call to a human when needed. I already mentioned the “customers who bought this also bought” concept, which can lead to considerable new sales. I have also seen decision science teams work on personalized pricing and promotions, where, for example, consumer insurance pricing can be personalized based on driving history and the distances the consumer has driven in the past. Marketing is the #1 function for data science teams today, working on use cases such as:
- Pricing and promotion
- Marketing budget allocation
- Customer service management
- Customer acquisition and lead generation
- Channel management
- Churn reduction
- Customer experience
Product Analytics
Product Analytics is a huge area for data scientists and one that remains relatively unexploited. We are talking about data science teams creating products or applications for personalized content and product recommendations, strengthening the value proposition or reducing non-sale exits, where customers leave without making a purchase. The cost of computation has diminished significantly over the years, so data science teams can now easily build models to test how effective their products have been. Decision scientists are the ones who write the algorithms and build the models that reduce cancellations by even 1% for an airline industry operating 100k+ flights each day. That moves the needle.
Other Roles and Use Cases for Decision Science Teams
· Analytics-driven accounting and IT
· Analytics-driven hiring and retention
· Fraud and debt analytics
· Inventory and parts optimization
· Logistics network and warehouse optimization
· Next product to buy/individualized offering
· Procurement and spend analytics
· Product development cycle optimization
· Product feature optimization
· Risk modeling
· Sales and demand forecasting
· Smart capital expenditures
· Task automation
· Workforce productivity and efficiency
· Yield optimization
Fraud Detection and Risk Management
I have seen data science teams work hand in hand with the corporate risk, privacy and security teams as well as the fraud teams. This role has become increasingly important in the ongoing battle between hackers and corporations. Data teams work on collection, detection, mitigation and forensics. The collection of data is not an easy task, and hackers will exploit any data limitations. Sometimes cost and storage capacity prevent all the relevant data from being collected and stored, so it becomes critical to figure out which datasets to collect, and that is where the data science team’s role becomes crucial. Often I would hear companies claim they could have thwarted an attack or breach if only they had collected some specific types of data they did not have.
I have also seen incident response time be an issue. This is the time it takes to process data, which data engineering teams need to work to reduce so that the fraud team can respond to a real-time attack quickly. Companies therefore need trained data engineering and science teams to engineer, process and respond to breaches or attacks quickly, and then mitigate the attack by separating good users from bad using modeling techniques and shutting down the breach. This is a skilled task in which data scientists transform existing data into new variables: the IP addresses that are collected are useless on their own, but when transformed into a variable such as the number of bad users coming from a particular IP address within a finite period, or the country of origin of that address, the data becomes infinitely more valuable. The idea is to convert data into variables that can be acted upon. Data scientists also work on forensics to learn the reason for an attack, how it was orchestrated and how to thwart similar attacks in the future.
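The IP-address transformation described above can be sketched as follows; the event log, addresses and time window are hypothetical:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical event log: (timestamp, source IP, flagged as bad actor?)
events = [
    (datetime(2024, 1, 1, 12, 0), "203.0.113.7",  True),
    (datetime(2024, 1, 1, 12, 2), "203.0.113.7",  True),
    (datetime(2024, 1, 1, 12, 3), "198.51.100.4", False),
    (datetime(2024, 1, 1, 12, 5), "203.0.113.7",  True),
]

def bad_users_per_ip(events, window=timedelta(minutes=10)):
    """A raw IP is not actionable; a count of bad users per IP within a
    finite window is a variable a fraud model can act on."""
    cutoff = max(ts for ts, _, _ in events) - window
    counts = defaultdict(int)
    for ts, ip, bad in events:
        if bad and ts >= cutoff:
            counts[ip] += 1
    return dict(counts)

print(bad_users_per_ip(events))  # {'203.0.113.7': 3}
```

Three bad events from one address inside a ten-minute window is exactly the kind of derived variable a blocking rule or model can consume, where the raw addresses alone say nothing.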
Group 2: DataOps and Services
DataOps teams are responsible for the data stores, databases and structures, data schemas and warehouses sitting either inside data centers or in the cloud. This team handles the day-to-day maintenance and monitoring of the systems upon which the data science and analytics group’s work sits. DataOps teams work to improve the quality and reduce the cycle time of data analytics, providing the tools, processes and organizational structures that support the data science team’s projects. They work across the entire data lifecycle, from data preparation to reporting, and recognize the interconnected nature of the data science team and ITOps.
DataOps incorporates the Agile methodology to shorten the cycle time of analytics development in alignment with business goals. This team focuses on continuous delivery by leveraging on-demand IT resources and by automating test and deployment of analytics. This merging of software development and ITOps has improved velocity, quality, predictability and scale of software engineering and deployment.
Borrowing methods from DevOps, DataOps seeks to bring these same improvements to data analytics. This team utilizes statistical process control (SPC) to monitor and control the data analytics pipeline. With SPC in place, the data flowing through an operational system is constantly monitored and verified to be working. If an anomaly occurs, the data science team can be notified through an automated alert. DataOps is not tied to a particular technology, architecture, tool, language or framework. Tools that support DataOps promote collaboration, orchestration, quality, security, access and ease of use.
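A minimal sketch of SPC applied to one pipeline metric, with invented baseline figures, might look like this:

```python
import statistics

def control_limits(baseline, sigmas=3.0):
    """Classic SPC: anything outside mean +/- `sigmas` standard deviations
    of the baseline is treated as an out-of-control signal."""
    mu = statistics.fmean(baseline)
    sigma = statistics.stdev(baseline)
    return mu - sigmas * sigma, mu + sigmas * sigma

# Hypothetical baseline: daily row counts arriving through the pipeline.
baseline = [1000, 1020, 980, 1010, 990, 1005, 995]
low, high = control_limits(baseline)

today = 1450  # today's observed row count
if not (low <= today <= high):
    print(f"ALERT: {today} rows outside control limits ({low:.0f}, {high:.0f})")
```

In a real deployment the same check would run on every batch and fire the automated alert to the data science team described above, rather than printing to a console.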
The DataOps group needs to be top notch. They should work closely with the ITOps and BizOps teams while remaining independent of them. This group is trained for disaster recovery, making sure the systems are up, running and performing at their peak. The DataOps team is also chartered to make sure the dashboards and reports used by the business are generated instantly and always accessible via the reporting layer. While the data science team determines what goes on the dashboards, it is the DataOps team that makes sure the reports show up on time, so the two teams must collaborate constantly.
The DataOps and Services group must not be confused with the Data Engineering and Infrastructure group. DataOps is a unique skill set that seeks to increase velocity, reliability, and quality of data analytics. It emphasizes communication, collaboration, integration, automation, measurement and cooperation between data scientists, analysts, Data-ETL engineers (extract, transform, load), IT, and QA. It aims to help organizations rapidly produce insight, turn that insight into operational tools, and continuously improve analytic operations and performance.
Group 3: Data Engineering and Infrastructure
The Data Engineering and Infrastructure Group is responsible for the underlying technology and infrastructure needed to support the DataOps and Data Science groups and make sure AI @Scale can successfully happen. The tools and platforms this team selects, monitors, maintains and supports are some of the most sophisticated in the world of tech and have been constantly evolving. The tech community through open source projects has been collaborating on these platforms for over a decade.
This team also actively participates in implementing the proofs of concept and ML models/apps generated by the data science teams, and makes sure the service-level agreements are in force as the product moves from pilot to production in preparation for delivering performance at scale. For example, the Data Science team (Group 1) might use SQL with a relational database to test a pilot, but moving that pilot to production might require switching to HBase queried via Hive or Pig, AI @Scale tools usually provisioned and monitored by the Data Engineering and Infrastructure team (Group 3).
I have written extensively about the different types of data tools and platforms as well as recorded detailed podcasts on the topic. Should you be interested in indulging and have a few hours to spare, below are excerpts extracted from those podcasts specific to the data sub-topics:
- Data Science Tools
- Data Collection Tools and Strategies
- Data Governance
- Data Integration
- Test Data Management
- Data Visualization
- Big Data Analytics
I can offer some examples below, but you might want to listen to the podcasts to get a sense of the tools and tech out there. I have seen corporations use all of them in some shape or form although I do have my personal favorites that I have highlighted at the top of each podcast.
Some companies use Kafka, Flume and Scribe for streaming and data collection: gathering data from different sources, aggregating it and feeding it into a database or a system like Hadoop, a popular framework for batch data processing and an open source implementation of Google’s 2004 MapReduce model. S4 and Storm process streaming data; job schedulers like Azkaban and Oozie manage the data flows; languages like Pig (a scripting language) and Hive (SQL-like) query large databases; and data stores include Voldemort, Cassandra and HBase.
Placing the Data Analytics Team Within the Organization
Knowing what types of roles the data science team will work on is not enough. I have often seen confusion around how such analytics teams ought to be positioned within the organization: by functional area, as a centralized team, as part of the innovation hub or something else. I suggest looking at how big the organization is, what its core focus is (engineering, marketing, design or product driven) and who the people involved are.
My teams have almost always had multiple roles in the initial phases, from analytics to DataOps to infrastructure, and almost everybody has all these skills, or at least one dominant skill, picking up the rest on the job by working with others. Eventually each member focuses on their core expertise and areas of interest. A Swiss-army-knife approach is also risky because of single points of failure. I have worked in corporations where the data analytics team sat within the COE, a center of excellence for innovation.
I personally prefer, though not always, a hub and spoke model where there is a core team in the center with other team members placed with the sponsoring business units (supply chain, manufacturing, sales etc.). I have also seen structures where every unit has its own fully baked analytics team, but this usually leads to less sharing of information and deeper silos.
I have learnt over the years that a small analytics team should sit together and not be spread out over different regions or offices, especially when working on the same core project. Distance is a burden because in data science speed of interaction matters: the teams work on highly complicated business problems, so the interaction between team members becomes a critical component of the project’s success.
It’s also important that data and analytics not be the forte of just the data science teams, however they are placed within an organization. Companies like Facebook and Zynga have proved that democratizing the analytics function across the organization can work wonders. I mentioned earlier how Facebook allows its employees to query its Hadoop-powered data store using Hive, so any Facebook employee can create their own custom dashboard. No programming necessary. Zynga has done something similar. Moving the data and analytics function company-wide is no easy task: it requires a robust infrastructure that can handle the load, training programs and classes for employees, and an overall change of messaging to include “data thinking”, just as organizations have adopted “design thinking”.
What to call the Data Science Team Members?
How is a data scientist different from a business analyst, when in large companies an analyst with one year of QlikView or Power BI dashboarding experience can call themselves a data scientist? Giving these people the correct title will always be a challenge. “Business analyst” does not do justice to their new skills, and “data analyst” might suggest they only do data analytics and nothing else. Some of the team may have PhDs or engineering experience. I have seen companies hand out titles such as “research scientist”, but research scientists tend to work in labs on isolated projects that are so advanced for their time, and so cut off from the product development teams, that it is a stretch to connect them to the organization at scale.
I tend to focus my teams on building data applications and conducting advanced analytics with substantial impact on the business. These teams work with tools that are not usually used by a business or data analyst and have a different approach to conducting the analysis and presenting the findings. These are people that have deep technical expertise with strong academic backgrounds and advanced degrees in a scientific subject such as Physics, Math and sometimes even Medicine, can cut through the clutter and come up with crystal clear theories for testing, are effective communicators either verbally or via the reports they produce and are some of the more creative people I have worked with in an organization.
Having spent a lot of time with data and good data scientists, we believe at our core that the effort of cleaning and preparing data is not something that gets in the way of solving a problem; the cleanup and preparation of data is the problem. Good data scientists have a range of skills: finding fertile data sources, manipulating large amounts of data irrespective of hardware or software limitations, cleaning, preparing and aggregating datasets, visualizing the data, building applications on top of it and letting others access the data to build their own.
Talent and Recruitment
My approach has always been to recruit talented individuals with a long history of getting their hands dirty with data and creating interesting things with it. I don’t care what industry they did it in, or whether they did it in their personal time; for me, solving a problem with data in a creative, unique way is what bears value.
I also like recruiting bright, talented graduates straight out of university and putting them in a data and analytics rotational program, if that is what they are interested in. I believe it is a skill that will stay with them for life and help them tremendously in their careers. It also keeps me from hiring somebody who has developed bad data habits that might be difficult to unlearn. I avoid hiring business analysts who may have worked for a year on a lone machine learning project and now call themselves data scientists.
I also run hackathons and contests where contestants compete to find the best and most elegant data-driven solution to a prediction problem within a finite amount of time. Companies like Kaggle and TopCoder organize many such contests that you can use as sources of talent to recruit from.
I ask myself, and my recruiting teams, whether we would want to do a startup with a potential new hire, be locked up in an office with them for long hours and still enjoy their company socially and intellectually. I also consider trust and communication a big factor: can we trust and communicate with you, and vice versa? Otherwise, what’s the point? Empathy, trust and communication create efficiency and make working on intense data projects enjoyable.
As a leader I like to set high goals for a new data scientist and do what I can to help that hire succeed. I tell them to impress the hell out of me by going the extra mile, thinking out loud and showing me they can be part of a team that is creating the #internetofneed.
Do they have what it takes? That is what I need to see, and for that my team works long and hard to bring the new hire up to speed and continuously supports them in understanding people, processes, technology and the organization. I also like pairing new hires with buddies who help them be successful. Success is a team sport, period. The best new hires typically show their value in the first three to six months. I also like to imagine a potential new hire in a future role a few years later, even if that may not be with our company. Potential matters.
In Sum
I know I have won when I see data and analytics being used everywhere in the company, when different teams are talking “data” and trying to see how they can do something better using data and more importantly are asking us for help in making them more data conversant. I know I have won when I see data products cropping up everywhere even when our core data teams may not have worked on them. The best thing that can happen to an organization is the democratization of data. Let it flow into the arteries of the organization and let new products emerge everywhere.
There has never been a better time for companies to adopt data. There will never be a better time for an individual to understand and work with data. We are hiring great data scientists, data managers and data engineers across the world. If you are one and want to work on some of the coolest projects of our time with the most advanced tech, you know where to find me.
Other Articles by the Author on Data, AI and Analytics
AI at Scale - The Key to Organizational Growth
Data Culture is Vital for Organizational Growth