What Makes for an Effective Data Practitioner in 2024


Marck Vaisman

Senior Technical Specialist, Microsoft
marck.vaisman@microsoft.com

Adjunct Professor, Data Science, Georgetown University
marck.vaisman@georgetown.edu

These are my personal opinions and do not represent any organization

The OG Data Science Cheat Sheet (2010)

ALL BY MYSELF

Analyzing the Analyzers (2012-2013)


Four personas

  • Data Businesspeople
  • Data Creatives
  • Data Developers
  • Data Researchers


Five skill areas

  • Business
  • ML/Big Data
  • Math/OR
  • Programming
  • Statistics

(Harris, Murphy, and Vaisman 2013)

Skills evaluated in our survey (in alphabetical order)

  • Algorithms (ex: computational complexity, CS theory)
  • Back-End Programming (ex: JAVA/Rails/Objective C)
  • Bayesian/Monte-Carlo Statistics (ex: MCMC, BUGS)
  • Big and Distributed Data (ex: Hadoop, Map/Reduce)
  • Business (ex: management, business development, budgeting)
  • Classical Statistics (ex: general linear model, ANOVA)
  • Data Manipulation (ex: regexes, R, SAS, web scraping)
  • Front-End Programming (ex: JavaScript, HTML, CSS)
  • Graphical Models (ex: social networks, Bayes networks)
  • Machine Learning (ex: decision trees, neural nets, SVM, clustering)
  • Math (ex: linear algebra, real analysis, calculus)
  • Optimization (ex: linear, integer, convex, global)
  • Product Development (ex: design, project management)
  • Science (ex: experimental design, technical writing/publishing)
  • Simulation (ex: discrete, agent-based, continuous)
  • Spatial Statistics (ex: geographic covariates, GIS)
  • Structured Data (ex: SQL, JSON, XML)
  • Surveys and Marketing (ex: multinomial modeling)
  • Systems Administration (ex: *nix, DBA, cloud tech.)
  • Temporal Statistics (ex: forecasting, time-series analysis)
  • Unstructured Data (ex: noSQL, text mining)
  • Visualization (ex: statistical graphics, mapping, web-based dataviz)

Evolution of Data Science and hot skills 2008-2024

  • In the beginning (2008-2012): SQL, R, Python, scikit-learn, Pandas
  • Big Data (~2010 on): Hadoop, MapReduce, Hive, Spark (in 2014)
  • Kaggle era (2011 on): xgboost
  • Data Product era: Product Management
  • Deep Learning era (~2014 onwards): NNets, CNNs, GANs, CUDA
  • MLOps era: AutoML, Models as APIs, MLflow
  • Cloud era: IaaS, PaaS, SaaS, VMs, networking, security, infra-as-code
  • Responsible AI era (~2018 onwards): XAI, Data privacy, Data governance, AI/ML safety
  • Generative AI era (~2022 onwards): using LLMs, LangChain, RAG pattern, building guardrails, prompt engineering, knowledge graphs

Note: all times approximate and somewhat overlapping. Combination of skills and tools.

Skills shown at their peak of hotness, and everything is cumulative!

In 2024, the hard truth: an overloaded definition and set of expectations leading to the Data Practitioner Brew

DALL·E 3 prompt: An ethnically and gender diverse group of at least eight sorcerers standing around a large cauldron labeled “Data Science” acting as if they are throwing stuff in. The cauldron must be visible. The liquid in the cauldron is bubbling and smoke is rising.

How it started, how it’s going

~2012

2024

DALL·E 3 prompt: a developer with eight arms, each typing on a separate macbook laptop where the screen is showing matrix like green text

Expectations vs. Reality

Ooops, we f**** up!

Data Science is complex

Multiple and not mutually-exclusive definitions

  1. Science: Data science as a continuation of empirical science’s tradition of data analysis

  2. Research Paradigm: A new approach to conducting research

  3. Research Method: A method that transforms the research process from deductive to inductive

  4. Discipline: An emerging academic field

  5. Workflow: A series of steps in data analysis

  6. Profession: A career focused on extracting knowledge and insights from data

Mastering its breadth is extremely difficult and takes time

(Hazzan and Mike 2023)

… and teaching it is even harder!

DALL·E 3 prompt: a lecturer standing in the front of the room facing tells a classroom full of students that are working on laptops to “open the command line” and the students have and puzzled, surprised and nervous reactions


Challenges

  • Broad Discipline
  • Complex Topics
  • Variety of Thinking Skills
  • Special Professional and Organizational Skills
  • Educators’ Background

(Hazzan and Mike 2023)

So, WWDSD?

What would a Data Scientist do? #askingforafriend

What?

  • Preliminary research
  • Collect/compile data
  • Process data
  • Analyze/summarize/model data
  • Present results

How?

  • Use generative AI to help, duh!

The work we did in 2013 has received more than 146 citations in the last 10 years

Prior work on data science skills and competencies

  1. EDISON Data Science Framework (EDSF)
  2. AIS IS 2010 Curriculum Guidelines
  3. Business Higher Education Forum (BHEF) Data Science and Analytics Competency Map
  4. ACM and IADSS Competencies
  5. Park City Math Institute Curriculum Guidelines for Undergraduate Programs in Data Science

(Schmitt et al. 2023; Weiser et al. 2022; Hazzan and Mike 2023)

The ability to do a task combines skills, competencies, and knowledge

Cuadrado-Gallego and Demchenko (2020)

To be effective, it’s much more than an à la carte menu

(Weiser et al. 2022; Hamutcu and Fayyad 2020; Fayyad and Hamutcu 2022; Hazzan and Mike 2023; Adhikari and Jordan 2021; Cuadrado-Gallego and Demchenko 2020)

Base Data Competencies

Data Literacy

  • Read, understand, use, and communicate data effectively
  • Capacity to interpret and analyze data
  • Competence to create knowledge from data
  • Proficiency to communicate insights derived from data to others

Data Wrangling

  • Work with structured and unstructured data stored in multiple formats (delimited, JSON, binary) and systems (databases, distributed, APIs, cloud)
    • Acquire (from raw)
    • Adequately combine and manipulate the data to produce analytical datasets
  • Identify implicit and explicit missing data
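The wrangling steps above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the sample records, the `EXPLICIT_MARKERS` set, and the helper names are hypothetical. It also assumes that "explicit" means a value that is present but flagged as empty, and that "implicit" means a value that is simply absent after combining sources.

```python
import csv
import io
import json

# Hypothetical markers that flag a value as explicitly missing.
EXPLICIT_MARKERS = {"", "NA", "N/A", "null"}

def load_delimited(text):
    """Parse delimited (CSV) text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def load_json(text):
    """Parse a JSON array of objects into a list of dict rows."""
    return json.loads(text)

def combine(left, right, key):
    """Left-join two lists of dict rows on `key`."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup.get(row[key], {})} for row in left]

def missing_report(rows, fields, key="id"):
    """List (key, field, kind) for every missing value in `rows`."""
    report = []
    for row in rows:
        for field in fields:
            if field not in row:
                # No record supplied this field at all: implicit.
                report.append((row[key], field, "implicit"))
            elif row[field] is None or str(row[field]).strip() in EXPLICIT_MARKERS:
                # The field exists but is flagged as empty: explicit.
                report.append((row[key], field, "explicit"))
    return report

# Made-up sample data in two formats.
people_csv = "id,name,age\n1,Ana,34\n2,Ben,\n3,Chloe,29\n"
cities_json = '[{"id": "1", "city": "Lima"}, {"id": "2", "city": null}]'

combined = combine(load_delimited(people_csv), load_json(cities_json), key="id")
for entry in missing_report(combined, ["age", "city"]):
    print(entry)
```

In this sketch, record 2 has an explicitly empty age and city, while record 3 never appears in the JSON source, so its city is only implicitly missing.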

Computational/Programming

  • Don’t repeat yourself
  • Write modular code
  • Metaprogramming
  • Run a script/program unattended
  • Interface with multiple systems
  • Set up, manage, and troubleshoot the computational environment
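As one possible illustration of "write modular code" and "run a script/program unattended", the sketch below splits a tiny job into small reusable functions and has `main` return an exit code that a scheduler could check. The job name, the one-number-per-line input format, and all function names are assumptions made for the example.

```python
import argparse
import logging

log = logging.getLogger("nightly_summary")  # hypothetical job name

def parse_args(argv):
    """Read the input path from the command line (no interactive prompts)."""
    parser = argparse.ArgumentParser(description="Summarize a file of numbers.")
    parser.add_argument("path", help="text file with one number per line")
    return parser.parse_args(argv)

def read_values(path):
    """Load one float per non-blank line."""
    with open(path, encoding="utf-8") as handle:
        return [float(line) for line in handle if line.strip()]

def summarize(values):
    """Return count, mean, min, and max for a non-empty list of numbers."""
    if not values:
        raise ValueError("no values to summarize")
    return {
        "n": len(values),
        "mean": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
    }

def main(argv=None):
    """Run the job; return 0 on success and 1 on failure."""
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    args = parse_args(argv)
    try:
        stats = summarize(read_values(args.path))
    except (OSError, ValueError):
        # Log the traceback so a failed unattended run can be diagnosed later.
        log.exception("job failed for %s", args.path)
        return 1
    log.info("summary for %s: %s", args.path, stats)
    return 0

# In a real script, finish with: raise SystemExit(main())
```

Because every step is its own function, `summarize` can be reused or tested without touching the file system, and the exit code lets a scheduler such as cron detect a failed run.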

Data Visualization

  • Ability to present data effectively to the right audience
  • Storytelling

21st Century Competencies


  • Cognitive: Involves reasoning and memory, and includes competencies such as analysis, problem solving, scientific literacy, and creativity.

  • Intrapersonal: Deals with the capacity to manage one’s behavior and emotions to achieve one’s goals. It includes competencies such as perseverance, adaptability, flexibility, self-direction, and the ability to cope with uncertainty.

  • Interpersonal: Includes competencies, such as communication and collaboration, which are used to express information to others, to interpret others’ messages, and to respond appropriately.

  • Ethics: The integration of ethical and social issues into all areas of computer science and, by extension, data science.

  • Decision Intelligence: Encompasses the processes, tools, and practices that enable organizations to make informed decisions based on data analysis and modeling.

Data Analytics Competencies


  • Statistical Methods: Descriptive statistics, exploratory data analysis (EDA) for discovering new features in the data, and confirmatory data analysis (CDA) for hypothesis testing.
  • Machine Learning: Methods for information search, image recognition, decision support, and classification.
  • Data Mining: Modeling and knowledge discovery for predictive rather than purely descriptive purposes.
  • Text Analytics: Extracting and classifying information from textual sources using statistical, linguistic, and structural techniques.
  • Predictive Analytics: Application of statistical models for predictive forecasting or classification.
  • Business Analytics: Covers data analysis that relies heavily on aggregation, involving various data sources to focus on business information.
  • Computational Modeling, Simulation, and Optimization: Pertains to methods used for computational modeling, simulation, and optimization.

Data Engineering Competencies


The application of engineering principles and modern computer technologies to design, implement, and manage data analytics applications and infrastructures throughout the data lifecycle:

  • Utilize engineering principles to research, design, develop, and implement new instruments and applications for data collection, storage, analysis, and visualization.

  • Develop and apply computational and data-driven solutions to domain-related problems using a wide range of data analytics platforms, focusing on big data technologies for large datasets and cloud-based data analytics platforms.

  • Develop and prototype specialized data analysis applications, tools, and supporting infrastructures for data-driven scientific, business, or organizational workflow; use distributed, parallel, batch, and streaming processing platforms, including online and cloud-based solutions.

  • Develop, deploy, and operate large-scale data storage and processing solutions using different distributed and cloud-based platforms for storing data (e.g., Data Lakes, Hadoop, HBase, Cassandra, MongoDB, Accumulo, DynamoDB, etc.).

(Cuadrado-Gallego and Demchenko 2020)

Data Management Competencies


  • Develop and implement data strategy, particularly in the form of data management policy and data management plan (DMP).

  • Develop and implement relevant data models, define metadata using common standards and practices, for different data sources in a variety of scientific and industry domains.

  • Integrate heterogeneous data from multiple sources and provide them for further analysis and use.

  • Maintain historical information on data handling, including reference to published data and corresponding data sources (data provenance).

  • Ensure data quality, accessibility, interoperability, compliance with standards, and publication (data curation).

(Cuadrado-Gallego and Demchenko 2020)

Research Methods Summary


  • Understand and apply research methods and techniques that distinguish the data scientist profession from other fields.
  • Design experiments, including data collection, for hypothesis testing and problem-solving.
  • Develop and guide data-driven projects, encompassing project planning, experiment design, data collection, and handling.
  • Develop and implement research data management plans (DMP), and apply data stewardship procedures.
  • Consistently apply project management workflows, including scope, planning, assessment, quality and risk management, and team management.

(Cuadrado-Gallego and Demchenko 2020)

Generative AI and other Competencies

Higher level thinking!

Data thinking integrates computational thinking, statistical thinking, and domain thinking.

Computational thinking is a set of cognitive and social skills applied in problem-solving processes that are important for everyone, not just computer scientists. It includes skills such as problem formulation, decomposition, organization, and logical analysis of data, and is recognized as a key 21st-century skill.

Statistical thinking involves the logical and statistical analysis of data, using visualizations and statistical methods to identify patterns and anomalies. It emphasizes the importance of preserving the meaning derived from the domain knowledge when performing any process or calculation on data.

Domain thinking refers to the understanding and application of domain-specific knowledge in the context of data science. It involves the integration of computational and statistical thinking with the knowledge and context of a particular domain, such as social science or medicine, to improve computational models and domain understanding.

Together, these modes of thinking are essential components of data thinking, which is the integrated skill set needed to work effectively with data science. Data thinking treats algorithms and data equally, combining computational thinking, statistical thinking, mathematical conceptions, and context-based thinking associated with the application domain.

(Hazzan and Mike 2023; Adhikari and Jordan 2021)

Here we go again…

Integrating liberal arts into skilling and expectations

Call to action

  1. Prioritize
  2. Standardize Roles
  3. Utilize Frameworks
  4. Skills-Based Assessment
  5. Industry-Specific Classifications
  6. Benchmarking and Comparison
  7. Communication and Education
  8. Continuous Role Evolution
  9. Continuous Tooling Evolution

References

Adhikari, Ani, and Michael I. Jordan. 2021. “Interleaving Computational and Inferential Thinking in an Undergraduate Data Science Curriculum.” Harvard Data Science Review, March. https://doi.org/10.1162/99608f92.cb0fa8d2.
Cuadrado-Gallego, Juan J., and Yuri Demchenko. 2020. The Data Science Framework: A View from the EDISON Project. Edited by Juan J. Cuadrado-Gallego and Yuri Demchenko. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-51023-7.
Fayyad, Usama, and Hamit Hamutcu. 2022. “From Unicorn Data Scientist to Key Roles in Data Science: Standardizing Roles.” Harvard Data Science Review, July. https://doi.org/10.1162/99608f92.008b5006.
Hamutcu, Hamit, and Usama Fayyad. 2020. “Toward Foundations for Data Science and Analytics: A Knowledge Framework for Professional Standards.” Harvard Data Science Review, June. https://doi.org/10.1162/99608f92.1a99e67a.
Harris, Harlan D., Sean Patrick Murphy, and Marck Vaisman. 2013. Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work. First edition. Beijing: O’Reilly.
Hazzan, Orit, and Koby Mike. 2023. Guide to Teaching Data Science: An Interdisciplinary Approach. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-031-24758-3.
Schmitt, Karl R. B., Linda Clark, Katherine M. Kinnaird, Ruth E. H. Wertz, and Björn Sandstede. 2023. “Evaluation of EDISON’s Data Science Competency Framework Through a Comparative Literature Analysis.” FoDS 5 (2): 177–98. https://doi.org/10.3934/fods.2021031.
Weiser, Orli, Yoram M. Kalman, Carmel Kent, and Gilad Ravid. 2022. “65 Competencies: Which Ones Should Your Data Analytics Experts Have?” Commun. ACM 65 (3): 58–66. https://doi.org/10.1145/3467018.

Technical Notes (and more relevant skills…)

  • Images generated with the DALL·E 3 model on Azure OpenAI
  • Content summarization and RAG:
  • Presentation written in Quarto and rendered as reveal.js and presented in browser

Who’s with me? Are we going to make this right?

Let’s talk about skilling the next generation!


marck.vaisman@microsoft.com

marck.vaisman@georgetown.edu

wahalulu

marckvaisman

wahalulu