Evolving Landscape of Data Science Education

From Instructor Pedagogy to Policy Perspectives

Sinem Demirci, Ph.D.

Hello!

A headshot of a woman with curly, short, shoulder-length hair with green eyes.

Sinem Demirci, PhD

Personal Website

sinemdemirci

sinemmdemirci

drsinemdemirci

sdemirci@calpoly.edu




Figure 1. Self-Conceptualization of Data Science Education Layers

  • My short journey to data science education
  • Data Science and The Layers of Data Science Education
  • Undergraduate Data Science Education at the Instructor Level
    • Pedagogical Content Knowledge
    • Aim of the Study
    • Methodology & Highlights from the Initial Findings
  • Undergraduate Data Science Education at the Research and Policy Level
    • Aim of the study
    • Methodology
    • Strengths and Knowledge Gaps
  • Weaving Them Together: Further Agenda

Data Science: A Brief Introduction

What is Data Science?

  • an emergent field that blends multiple areas.
  • demands expertise in a range of skills and concepts spanning statistics, computer science, mathematics, and other domains (Mike and Hazzan, 2023).
  • no agreement on a single definition because of its multifaceted nature.
  • A Venn diagram (Figure 2) is typically used to help illustrate the interdisciplinary nature of data science as a discipline.
Venn diagram of data science composed of application domain, computer science, and mathematics and statistics. Data science is located at the intersection of these three domains.

Figure 2. Venn diagram of data science (Mike and Hazzan, 2023)

Data Science Jobs

The US Bureau of Labor Statistics, Occupational Outlook Handbook

  • Between 2022 and 2032, data scientist jobs are projected to grow by 35%.

The Indeed Editorial Team outlined the following reasons for the high demand for data science jobs:

  • Low number of data science professionals
  • Value of data science
  • Competitive salaries
  • Job security
  • Data organization challenges of companies

6 In-Demand Data Scientist Jobs in 2024

  • Data scientist, data analyst, data engineer, data architect, machine learning engineer, business intelligence engineer

Image by jcomp on Freepik

Data Science in Higher Education Context

  • The interdisciplinary nature of data science has been discussed in the data science education community (e.g., Asamoah et al., 2020).
    • It brings unique challenges in determining the scope and content of data science courses/majors (Yan & Davis, 2019).
  • Some initiatives have been taken to provide curriculum guidelines for data science (e.g., De Veaux et al., 2017; National Academies of Sciences, Engineering and Medicine, 2018) and to identify essential skills for a data scientist (e.g., De Veaux et al., 2017).
    • More research on data science education is required
      • to enhance the scope and
      • to cultivate proficient data scientists who can meet the growing demand for data scientists in various professional fields.





Undergraduate Data Science Education at the Instructor Level

Teaching Data Science

  • The effective teaching of concepts across diverse disciplines has evolved through continuous beta testing over the centuries.
  • Unlike many established disciplines, data science is a relatively new field (Kelleher & Tierney, 2018).
  • Compared with other disciplines, the nature of teaching data science has only recently been studied in a systematic manner (Schwab-McCoy et al., 2021).
    • Thus, we can be considered to be still at the beginning of learning how to teach it.

Pedagogical Content Knowledge-I

  • Pedagogical content knowledge (PCK) is a theoretical framework that offers insight into the integration of content and pedagogy in teaching, and that enables instructors to monitor their teaching practices (Shulman, 1987).
  • Beyond content knowledge, teaching requires additional knowledge and skills.
  • Magnusson et al.’s PCK model (1999) for science teaching specifies five components:
  1. orientation toward teaching;
  2. knowledge of learners;
  3. knowledge of curriculum;
  4. knowledge of instructional strategies; and
  5. knowledge of assessment.

Figure 3. Hexagon model of pedagogical content knowledge for science teaching (Park & Oliver, 2008, p.279)

Introduction to Data Science Courses


  • Introduction to Data Science (IDS) courses are introductory courses offered by different departments such as mathematics, statistics, data science, or computer science aiming to enable individuals to grasp the basics of data science.
    • Even though the majority of enrolled students are from these departments, students come to these courses from almost every major/department.

Aim of the study

  • This study, which is planned in line with the capacity-building needs of IDS educators, has two objectives that correspond to two stages:
    • exploring the Pedagogical Content Knowledge (PCK) of the IDS instructors; and
    • developing a measurement tool that measures PCK for introduction to data science teaching.

Methodology

  • In this study, we chose a qualitative research design (Merriam & Tisdell, 2016).
    • Our aim was to understand how IDS instructors interpret their teaching experiences in IDS courses and “what meaning they attribute to their experiences” (Merriam, 2009, p. 23).
    • Table 1 presents a quick summary of the characteristics of qualitative and quantitative research.

Sampling Procedure

  • We defined the target population to consist of instructors who taught an introductory data science course at least twice at the undergraduate level.
  • Our rationale for recruiting participants was as follows:
    • We tried to standardize the names of the IDS courses.
      • We selected participants who taught a course whose title includes ‘Data Science’ and one of the following keywords:
        • Introduction, Principles, Elements, or Fundamentals
    • When instructors teach a course for the first time, they focus on multiple aspects of the course as novices.
      • Thus, instructors who have gone through a second iteration of the course would be able to reflect more deeply on the course and the students.
  • We recruited 16 participants (2 pilot, 14 main study) via mailing lists and online forums with large teacher-scholar communities.
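
The selection criteria above can be sketched as a simple screening rule. The snippet below is only an illustration, assuming each candidate is described by a course title and the number of times they have taught the course; the function name `is_eligible` and its parameters are hypothetical and not part of the study's materials.

```python
# Hypothetical screening rule mirroring the stated recruitment criteria.
# Keywords come from the slide; everything else is an assumption.
TITLE_KEYWORDS = ("introduction", "principles", "elements", "fundamentals")


def is_eligible(course_title: str, times_taught: int) -> bool:
    """Return True if the title contains 'Data Science' plus one keyword
    and the instructor has taught the course at least twice."""
    title = course_title.lower()
    has_data_science = "data science" in title
    # Plain substring match: a variant such as "Introductory" would not match.
    has_keyword = any(keyword in title for keyword in TITLE_KEYWORDS)
    return has_data_science and has_keyword and times_taught >= 2


if __name__ == "__main__":
    # Made-up candidates for illustration only.
    candidates = [
        ("Introduction to Data Science", 2),
        ("Principles of Data Science", 1),
        ("Machine Learning", 5),
    ]
    for title, n in candidates:
        print(title, n, is_eligible(title, n))
```

A real screening would also need to decide how to treat title variants, which this sketch leaves out.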

Sample Profile - IDS Instructors

  • All participants were from North America
  • Only 4 instructors were the sole instructor in their IDS course. The other participants had either co-instructors or TAs/graders.
  • The instructors had terminal degrees in varying subjects including statistics, mathematics, computer science, genetics, and economics.
  • They all had been teaching an introductory data science course for a varying number of years, with a range from 1 to 10 years of experience.
Formal training in Data Science

Self-taught – 4 participants

Workshops – 4 participants

Industry experience – 2 participants

Others – 5 participants (took some DS courses during their graduate years, or graduated from closely related areas such as Stat and CS)

Formal Training in Teaching

Workshops – 3 participants

TA trainings – 3 participants

Course/internship – 3 participants

Degree – 1 participant

None – 5 participants

Sample Profile - IDS Classrooms

  • Students are coming from almost every major/department.

    • Majority – Mathematics, Statistics, Computer Science, Data Science
    • Others – Engineering, Business School, Social Science, Economics, Humanities, Life Sciences, Environmental Science, Political Science, Health Science, Undecided
  • Course has a prerequisite: Yes – 6; No – 8

  • Course is a prerequisite to any other course: Yes – 11; No – 2; Not sure – 1

Table 2: Class sizes reported by IDS instructors

Class Size    n
300+          2
200-299       1
100-199       2
30-39         2
20-29         3
10-19         3
1-9           1

Data Collection

  • We collected data through online semi-structured interviews with 14 participants.
    • We designed specific questions to explore the teaching experiences of IDS instructors.
    • We also asked follow-up questions, depending on the responses, to elaborate on their PCK.
  • Each participant was compensated with a £50 gift card for their time.

Data Analysis I

  • Data analysis started simultaneously with data collection and proceeded iteratively throughout the qualitative research.

  • Qualitative content analysis (Merriam & Tisdell, 2016) was used to generate a comprehensive codebook.

    • A codebook is an important part of qualitative data analysis. It serves to
      • determine/identify themes, categories, and codes.
      • collect evidence for the validity and reliability of the study.

Data Analysis II - PCK Map

  • What does a PCK Map refer to?
    • The PCK mapping approach used in this study was based on the hexagon model of PCK (Park & Oliver, 2008), which defines PCK as the integration of six components.
    • This model emphasizes the importance of interactions among these components.
    • In other words, to advance to higher levels of PCK, teachers not only need to improve individual components but also strengthen the coherence between them.

Sample PCK map from Park and Chen (2012)

Data Analysis III - Summary

Validity and Reliability Evidence of the Study

To enhance the trustworthiness of the study, we collected indicators for transferability, dependability, and credibility (Merriam & Tisdell, 2016).

  • In particular, we provided a detailed description of our participants’ profiles and of our data collection and data analysis procedures.
  • Based on our selection criteria, we also included participants who differed from one another (e.g., in years of experience and terminal degree), which enabled maximum variation in our sample.
  • Our research team continuously compared and discussed codes to determine the scope of the codebook, based on the theoretical framework and the data of the study.

Highlighted Findings

Orientations to Teach IDS

There were three recurring orientations among the IDS instructors:

  • to enhance data literacy
  • to familiarize students with a programming language
  • to teach students how to learn from data

Knowledge of Students’ Understanding

  • Every IDS instructor reported that IDS students have diverse backgrounds.
    • Students who do not have programming background need more help than their peers.
    • Students tend to lose their motivation when they come across error messages while coding

Curriculum and Instructional Strategies

  • Most IDS instructors prepared their own materials.
    • GAISE, The Park City Math Institute report (De Veaux et al., 2017), and ACM’s computing competencies guidelines (Danyluk et al., 2021) were the 3 most frequently used guidelines.
  • IDS instructors reported that they use more than one teaching strategy.
    • Demos, simulations, lecturing, classroom discussions, peer learning

Knowledge of Assessment

  • IDS instructors use more than one assessment tool to evaluate learning in their classroom.
    • The majority of the IDS instructors used both formative and summative assessment, but only a few used these terms explicitly.

A Sample of PCK Map

  • The PCK mapping approach was used to explore interactions among the six components of PCK. More interactions among components indicate a higher level of PCK.

Figure 4. An example of PCK Map
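
One way to picture how interactions among components might be tallied for a PCK map is sketched below. It assumes that each coded teaching episode has been labelled with the PCK components it involves, and it counts how often each pair of components appears together. The episode data, the short component labels, and the function name `count_connections` are hypothetical; this is an illustration of the idea, not the study's actual coding procedure.

```python
from collections import Counter
from itertools import combinations


def count_connections(episodes):
    """Count co-occurrences of component pairs across coded episodes.

    `episodes` is an iterable of collections of component labels.
    Each pair is stored as an alphabetically sorted tuple.
    """
    counts = Counter()
    for components in episodes:
        # Deduplicate and sort so (a, b) and (b, a) map to the same key.
        for pair in combinations(sorted(set(components)), 2):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    # Made-up episodes for illustration only.
    sample_episodes = [
        ["learners", "strategies"],
        ["learners", "strategies", "assessment"],
        ["curriculum"],  # a single component yields no connection
        ["orientation", "learners"],
    ]
    for pair, n in count_connections(sample_episodes).most_common():
        print(pair, n)
```

Under this sketch, a larger total number of connections would correspond to more interaction among components.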

Discussion and Conclusion - I

  • In this study, IDS instructors shared their teaching styles, revealing their potential level of PCK as well as the interactions between its elements.

  • Although data analysis is still ongoing, initial findings have started to emerge. We anticipate unveiling interaction maps of IDS instructors’ PCK, intending to identify needs in undergraduate data science teaching.

    • Ultimately, our goal is to contribute to capacity building in IDS at the instructor level.
  • While we found the PCK framework useful in providing an initial understanding of the nature of IDS instructors’ PCK, we acknowledge that further modifications may be required to determine the elements of PCK in the data science education context.




Undergraduate Data Science Education at the Policy Level

Introduction

  • Undergraduate data science education is a major focus in the data science education community.

  • Efforts are being made to identify data science competencies, such as

    • The Park City Math Institute report (De Veaux et al., 2017),
    • The Framework of National Academies of Sciences, Engineering, and Medicine (2021),
    • Association for Computing Machinery’s computing competencies guidelines (Danyluk et al., 2021),
    • European Union’s EDISON project (Wiktorski et al., 2017a), and
    • Accreditation Board for Engineering and Technology’s (ABET’s) accreditation for data science programs (2024).
  • Many professional organizations and scholars have issued a global call to understand undergraduate data science education as well as the scientific literature on this topic.

Aim of the study

In this project, the goals were to

  1. specify current evidence and knowledge gaps in undergraduate data science education and
  2. inform policymakers and data science educators/practitioners about the present status of data science education.

We conducted a systematic literature review (Evans & Benefield, 2001; Liberati et al., 2009) using pre-determined inclusion and exclusion criteria.

Methodology

  • A systematic literature review is generally used to inform evidence-based decisions in a field by compiling all empirical evidence that fits pre-determined inclusion and exclusion criteria (Evans & Benefield, 2001; Liberati et al., 2009).
  • We selected this approach because we aimed to
    1. gain a deeper understanding in order to specify current evidence and knowledge gaps;
    2. inform policymakers and data science educators/practitioners; and
    3. identify topics for possible future quantitative meta-analyses in this field.

Data Collection -I

We opted to extract data from six databases that potentially include publications on data science education. These databases were

  1. ERIC ProQuest

  2. IEEE Xplore

  3. PubMed

  4. Science Direct

  5. Scopus and

  6. Web of Science

Data Collection - II

  • Data collection and analysis formed an iterative process.
    • Over the span of a year, data extracted from the databases were corrected, and the scope of the variables of interest was continuously revised and discussed.
  • Initial Criterion: Documents including the exact phrase “data science education” (in quotes) in at least one of the following fields: title, abstract, keywords.
    • A total of 197 publications met our initial criterion.
    • The initial database search was conducted in December 2022, so the pool mainly contains publications published prior to this date, with a few exceptions.
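
The initial criterion can be sketched as an exact-phrase check over three fields. The snippet below is a minimal illustration, assuming each database record has been exported as a dictionary with `title`, `abstract`, and `keywords` entries; the record layout and the function name `meets_initial_criterion` are assumptions, and the actual database queries used in the study are not shown here.

```python
# Exact phrase taken from the stated initial criterion.
PHRASE = "data science education"
FIELDS = ("title", "abstract", "keywords")


def meets_initial_criterion(record: dict) -> bool:
    """Return True if the phrase appears in the title, abstract, or keywords."""
    for field in FIELDS:
        value = record.get(field, "")
        if isinstance(value, (list, tuple)):
            # Join with a separator so that two adjacent keywords such as
            # "data science" and "education" do not form a false match.
            value = "; ".join(value)
        if PHRASE in value.lower():
            return True
    return False


if __name__ == "__main__":
    # Made-up records for illustration only.
    records = [
        {"title": "A framework for data science education",
         "abstract": "", "keywords": []},
        {"title": "Data science curricula",
         "abstract": "An overview of programs.",
         "keywords": ["data science", "education"]},
    ]
    print([meets_initial_criterion(r) for r in records])
```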

Data Analysis - In General

  • We conducted data analysis in two stages:
    • preliminary data analysis and
    • in-depth data analysis.
  • Throughout data analysis, we excluded publications from the pool due to either
    • format reasons (e.g., posters, panels, duplicated publications) or
    • level reasons, i.e., not being at the undergraduate level (e.g., K-12, high school).
  • At both preliminary and in-depth analyses stages, each publication was assigned to two researchers randomly for independent review.
    • The reviewer pairs discussed any discrepancies between their analysis decisions and tried to reach a consensus.
    • In cases where conflicts persisted, the entire group deliberated on the final decision.
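
The double-review step described above can be sketched as follows. This is a minimal illustration, assuming publications are identified by IDs and reviewers by names; the reviewer labels, the optional `seed` parameter, and both function names are hypothetical and do not describe the tooling the research team actually used.

```python
import random


def assign_reviewer_pairs(publication_ids, reviewers, seed=None):
    """Randomly assign two distinct reviewers to every publication."""
    rng = random.Random(seed)  # a seed makes the assignment repeatable
    return {pid: tuple(rng.sample(reviewers, 2)) for pid in publication_ids}


def needs_discussion(decision_a, decision_b):
    """Flag a publication when the two independent decisions differ."""
    return decision_a != decision_b


if __name__ == "__main__":
    # Made-up reviewer labels and publication IDs for illustration only.
    team = ["Reviewer A", "Reviewer B", "Reviewer C", "Reviewer D"]
    assignments = assign_reviewer_pairs(range(1, 6), team, seed=2022)
    for pid, pair in assignments.items():
        print(pid, pair)
    print(needs_discussion("include", "exclude"))
```

Discrepancies flagged this way would first go back to the reviewer pair, and only unresolved cases to the whole group.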

Data Analysis - Preliminary Stage

  • In addition to making inclusion-exclusion decisions, we finalized a list of publication variables that may be worth examining during the in-depth analysis stage.

  • In this stage, abstracts were read.

    • 67 publications were excluded
    • 130 publications remained by the end of the preliminary analysis.

Data Analysis - In-Depth Stage

  • In the in-depth analysis stage, we read the full publications.
    • We excluded 53 publications
    • A total of 77 publications remained for the in-depth analysis.
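
The counts reported for the two stages can be checked with a short bookkeeping sketch. The numbers are taken from the slides (197 initial, 67 and 53 excluded); the function name `screening_flow` and the stage labels are illustrative assumptions.

```python
def screening_flow(initial, exclusions):
    """Return (stage, remaining) pairs after applying each exclusion in order."""
    remaining = initial
    flow = [("initial search", remaining)]
    for stage, excluded in exclusions:
        remaining -= excluded
        flow.append((stage, remaining))
    return flow


if __name__ == "__main__":
    flow = screening_flow(
        197,
        [("preliminary stage (abstracts)", 67),
         ("in-depth stage (full text)", 53)],
    )
    for stage, remaining in flow:
        print(f"{stage}: {remaining} publications")
```

Running the sketch reproduces the reported totals: 130 publications after the preliminary stage and 77 after the in-depth stage.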

We collected data related to

  1. affiliation country of researchers;
  2. open access status;
  3. explicit research question stated;
  4. data collection and type of data (quantitative vs. qualitative);
  5. content area of the publication; and
  6. main field and discipline of the publication.
  • In addition, we took big-picture notes on each publication to look for patterns in the main study areas of the publications.

Validity and Reliability Evidence of the Study

Dependability: We recorded our research steps, such as how data were collected, how categories were derived, and how we made decisions throughout the research study, as suggested by Merriam (2015).

Transferability: We provided a rich, thick description (Merriam, 2015) of the data collection, data wrangling, and data analysis procedures in a public repository, which is openly accessible to any reader who wishes to examine or reproduce the data analysis.

We also ensured maximum variation by including undergraduate data science education studies conducted in different fields and/or covering different content areas.

Highlighted Findings







The body of literature detected is very recent and increasing monotonically; the oldest paper was published in 2015.







Over the past eight years,

  • A higher percentage of conference articles (57%) have been published compared to journal articles (38%)
  • A majority of published work on undergraduate data science education has been freely available to the public.

Strengths in Undergraduate Data Science Education

1. Open Access: A majority of published studies in undergraduate data science education are open access, marking a substantial strength in the field.

2. Interdisciplinarity: Scholars from diverse fields are contributing to the data science education literature.

Data science education practices have been reported in different programs, such as
- data science programs,
- computer science education (Bile Hassan & Liu, 2020),
- microbiology (Dill-McFarland et al., 2021), and
- business (Miah et al., 2020).

Course examples include
- introductory computing (Fisler, 2022),
- modern technologies course for computer science and engineering students (Rao et al., 2019),
- general education IT course (Haynes et al., 2019),
- medicine (Doudesis & Manataki, 2022b), and
- psychology (Tucker et al., 2023).

Knowledge Gaps

1. There is insufficient empirical data: 44 studies out of 77 did not collect data.

  • Given the scope of content areas such as calls to action, educational technology, and program examples, coupled with the emergent nature of the field, the lack of empirical data is not a surprising finding.

    • However, it also suggests that undergraduate data science educators have not yet begun collecting empirical data systematically.

Warning

Emphasis on lack of empirical data is not a promotion of empiricism over all other ‘ways of knowing’.

What is being highlighted is the disproportionately high percentage of studies lacking empirical data,

  • which complicates the literature’s potential for gaining a deeper understanding and identifying recurring patterns.

2. Reproducibility is one of the potential challenges in undergraduate data science education research: Speculatively, the absence of critical information about research designs, such as missing research questions, missing participant profiles, and the non-collection of data, may contribute to the reduced reproducibility of available studies.

  • One could argue that the succinct nature of conference articles may inadvertently interfere with the comprehensive documentation necessary for the effective replication or modification of research.

  • A potential reason for this gap may also be the minimal training that most instructors receive in reproducibility (Horton et al., 2022).

3. Not all Data Science disciplines contribute equally to the overall body of knowledge: The prevailing trend indicates ongoing multidisciplinary collaborations.

  • Notably, computer science and data science emerge as the leading contributors to the literature.

    • In contrast, statistics and mathematics, as well as other fields closely related to data science, exhibit a limited presence in studies related to undergraduate data science education.
  • This result aligns with the study of Wiktorski et al. (2017b), who reported that Mathematics and Statistics departments are not at the forefront of data science degree programs.

  • This is perhaps the most important finding for the statistics community.

Recommendations

Scientific studies are integral to reviewing existing practices and to improving higher education institutions’ data science practices. Thus, we should

Recommendation 1: Prioritize investments in empirical studies.

Recommendation 2: Diversify research efforts to enrich the spectrum of studies.

Recommendation 3: Encourage scholars in key data science fields to contribute more to publications.

Weaving Them Together: Further Agenda




Undergraduate Data Science Education at the Classroom/Student Level

What is Next?

  • Over the next few years, my research will extend to encompass university and community college students enrolled in data science courses.

  • I am planning to explore the role of students’ contexts in terms of multiple cognitive, affective, and multicultural variables to support their learning.

    • Among these variables, I would like to conduct further research to contribute to determining ‘threshold concepts’ in data science education.

Note

Threshold concepts: “…‘conceptual gateways’ or ‘portals’ that lead to a previously inaccessible, and initially perhaps ‘troublesome’, way of thinking about something” without which students cannot proceed further in learning a certain discipline.

There are studies reporting these threshold concepts (e.g., Beitelmal et al., 2022; Thomas et al., 2010), but the literature on this topic has yet to be explored systematically.




During the next 5 years of my research, I would like to

  • become proficient in supporting data science instructors’ PCK to teach data science

  • conduct research to determine threshold concepts and explore possible ways to promote students’ learning these concepts

  • share best practices with the data science education community.

References

Asamoah, D. A., Doran, D., & Schiller, S. (2020). Interdisciplinarity in data science pedagogy: a foundational design. Journal of Computer Information Systems, 60(4), 370-377, https://doi.org/10.1080/08874417.2018.1496803

Beitelmal, W. H., Littlejohn, R., Okonkwo, P. C., Hassan, I. U., Barhoumi, E. M., Khozaei, F., … & Alkaaf, K. A. (2022). Threshold Concepts Theory in Higher Education—Introductory Statistics Courses as an Example. Education Sciences, 12(11), 748.

De Veaux, R. D., Agarwal, M., Averett, M., Baumer, B. S., Bray, A., Bressoud, T. C., … & Ye, P. (2017). Curriculum guidelines for undergraduate programs in data science. Annual Review of Statistics and Its Application, 4, 15-30.

Donoghue, T., Voytek, B., & Ellis, S. E. (2021). Teaching creative and practical data science at scale. Journal of Statistics and Data Science Education, 29(sup1), 27-39, https://doi.org/10.1080/10691898.2020.1860725

Fraenkel, J. R., Wallen, N. E., & Hyun, H. H. (2012). How to design and evaluate research in education (Vol. 7, p. 429). New York: McGraw-hill.

Kelleher, J. D., & Tierney, B. (2018). Data science. MIT Press.

Magnusson, S.J., Borko, H., & Krajcik, J.S. (1999). Nature, source, and development of pedagogical content knowledge for science teaching. In: Gess-Newsome, J., & Lederman, N., (Eds.), Examining Pedagogical Content Knowledge. United States: Kluwer Press. pp. 95-132

Merriam, S. B. (2009). Qualitative Research: A Guide to Design and Implementation. San Francisco, CA: Jossey-Bass.

Merriam, S. B., & Tisdell, E. J. (2016). Qualitative Research: A Guide to Design and Implementation (Fourth Edition). San Francisco, CA: Jossey-Bass.

Mike, K., & Hazzan, O. (2023). What is data science? Communications of the ACM, 66(2), 12–13, https://doi.org/10.1145/3575663

National Academies of Sciences, Engineering and Medicine Consensus Report (2018). Data Science for Undergraduates: Opportunities and Options. Washington, https://nas.edu/envisioningds.

Park, S., & Chen, Y. C. (2012). Mapping out the integration of the components of pedagogical content knowledge (PCK): Examples from high school biology classrooms. Journal of Research in Science Teaching, 49(7), 922-941.

Park, S., & Oliver, J. S. (2008). Revisiting the conceptualisation of pedagogical content knowledge (PCK): PCK as a conceptual tool to understand teachers as professionals. Research in Science Education, 38, 261-284.

Park, S., & Suh, J. K. (2019). The PCK map approach to capturing the complexity of enacted PCK (ePCK) and pedagogical reasoning in science teaching. In Repositioning pedagogical content knowledge in teachers’ knowledge for teaching science, 187-199.

Qian, Y., & Lehman, J. (2017). Students’ misconceptions and other difficulties in introductory programming: A literature review. ACM Transactions on Computing Education (TOCE), 18(1), 1-24, https://doi.org/10.3102/0002831213477680

Shulman, L. (1987). Knowledge and teaching: Foundations of the new reform. Harvard Educational Review, 57(1), 1-23.

Schwab-McCoy, A., Baker, C. M., & Gasper, R. E. (2021). Data science in 2020: Computing, curricula, and challenges for the next 10 years. Journal of Statistics and Data Science Education, 29(sup1), S40-S50.

Thomas, L., Boustedt, J., Eckerdal, A., McCartney, R., Moström, J. E., Sanders, K., & Zander, C. (2010). Threshold concepts in computer science: An ongoing empirical investigation. In Threshold concepts and transformational learning (pp. 241-257). Brill.

Yan, D., & Davis, G. E. (2019). A first course in data science. Journal of Statistics Education, 27(2), 99-109, https://doi.org/10.1080/10691898.2019.1623136