Big data has arrived in education. Educational data science, learning analytics, computer adaptive testing, assessment analytics, educational data mining, adaptive learning platforms, new cognitive systems for learning and even educational applications based on artificial intelligence are fast becoming parts of the educational landscape, in schools, colleges and universities, as well as in the networked spaces of online courses.
As part of a recent conversation about the Shadow of the Smart Machine work on machine learning algorithms being undertaken by Nesta, I was asked what I thought were some of the most critical questions about big data and machine learning in education. This reminded me of the highly influential paper ‘Critical questions for big data’ by danah boyd and Kate Crawford, in which they ‘ask critical questions about what all this data means, who gets access to what data, how data analysis is deployed, and to what ends.’
With that in mind, here are some preliminary (work-in-progress) critical questions to ask about big data in education.
How is ‘big data’ being conceptualized in relation to education?
Large-scale data collection has been at the centre of the statistical measurement, comparison and evaluation of the performance of education systems, policies, institutions, staff and students since the mid-1800s. Does big data constitute a novel way of enumerating education? The sociologist David Beer has suggested we need to think about the ways in which big data as both a concept and a material phenomenon has appeared as part of a history of statistical thinking, and in relation to the rise of the data analytics industry—he suggests social science still needs to understand ‘the concept itself, where it came from, how it is used, what it is used for, how it lends authority, validates, justifies, and makes promises.’ Within education specifically, how is big data being conceptualized, thought about, and used to animate specific kinds of projects and technical developments? Where did it come from–data science, computer science–and who are its promoters and sponsors in education? What promises are attached to the concept of big data as it is discussed within the domain of education? We might wish to think about a ‘big data imaginary’ in education—a certain way of thinking about, envisaging and visioning the future of education through the conceptual lens of big data—that is now animating specific technical projects, becoming embedded in the material reality of educational spaces and enacted in practice.
What theories of learning underpin big data-driven educational technologies?
Big data-driven platforms such as learning analytics aim to ‘optimize learning,’ but is it always clear what the organizations and actors that build, promote and evaluate them mean by ‘learning’? Much of the emerging field of ‘educational data science’—which encompasses much educational data mining, learning analytics and adaptive learning software R&D—is informed by conceptualizations of learning that are rooted in cognitive science and cognitive neuroscience. These disciplines tend to focus on learning as an ‘information-processing’ event—to treat learning as something that can be monitored and optimized like a computer program—and pay less attention to the social, cultural, political and economic factors that structure education and individuals’ experiences of learning.
Given the statistical basis of big data, it’s perhaps also not surprising that many actors involved in educational big data analyses are deeply informed by the disciplinary practices and assumptions of psychometrics and its techniques of psychological measurement of knowledge, skills, personality and so on. Aspects of behaviourist theories of learning even persist in behaviour management technologies that are used to collect data on students’ observed behaviours and distribute rewards to reinforce desirable conduct. There is an emerging tension between the strongly psychological, neuroscientific and computational ways of conceptualizing and theorizing learning that dominate big data development in education, and more social scientific critiques of the limitations of such theories.
How are machine learning systems used in education being ‘trained’ and ‘taught’?
The machine learning algorithms that underpin much educational data mining, learning analytics and adaptive learning platforms need to be trained, and constantly tweaked, adjusted and optimized to ensure accuracy of results–such as predictions about future events. This requires ‘training data,’ a corpus of historical data with which the algorithms can be ‘taught’ before being used to find patterns in data ‘in the wild.’ Who selects the training data? How do we know if it is appropriate, reliable and accurate? What if the historical data is in some ways biased, incomplete or inaccurate? Does this risk generating ‘statistical discrimination’ of the sort produced by ‘predictive policing,’ which has in some cases been found to disproportionately predict that black men will commit crime? Educational research has long asked questions about the selection of the knowledge for inclusion in school curricula that is to be taught to students—we may now need to ask about the selection of the data for inclusion in the training corpus of machine learning platforms, as these data could be consequential for learners’ subsequent educational experience.
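The mechanism by which biased historical labels propagate into future predictions can be illustrated with a deliberately minimal sketch. Everything here—the student groups, the ‘at risk’ flags, the toy ‘model’—is invented for illustration; real learning analytics systems are vastly more complex, but the underlying logic of learning from, and so reproducing, historical labels is the same.

```python
# Hypothetical sketch: how bias in historical training labels propagates
# into a model's predictions. All data and the 'model' are invented.

from collections import Counter

# Historical records: (school_group, prior_attainment, was_flagged_at_risk).
# Suppose students in group "B" were historically over-flagged as at risk,
# regardless of their attainment.
training_data = [
    ("A", "high", False), ("A", "low", False), ("A", "low", True),
    ("B", "high", True),  ("B", "low", True),  ("B", "high", True),
]

def train(records):
    """'Learn' the most common historical label for each group."""
    labels_by_group = {}
    for group, _attainment, flagged in records:
        labels_by_group.setdefault(group, []).append(flagged)
    return {g: Counter(v).most_common(1)[0][0] for g, v in labels_by_group.items()}

def predict(model, group):
    return model[group]

model = train(training_data)

# Two students with identical high attainment receive different predictions,
# purely because of the bias baked into the historical labels.
print(predict(model, "A"))  # False
print(predict(model, "B"))  # True
```

A more sophisticated algorithm trained on the same corpus would be subtler, but without deliberate auditing of the training data it would inherit the same historical pattern.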
Moreover, we might need to ask questions about the nature of the ‘learning’ being experienced by machine learning algorithms, particularly as enthusiastic advocates in places like IBM are beginning to propose that advanced machine learning is more ‘natural,’ with ‘human qualities,’ based on computational models of aspects of human brain functioning and cognition. To what extent do such claims appear to conflate understandings of the biological neural networks of the human brain that are mapped by neuroscientists with the artificial neural networks designed by computer scientists? Does this reinforce computational information-processing conceptualizations of learning, and risk addressing young human minds and the ‘learning brain’ as computable devices that can be debugged and rewired?
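The gap between the ‘brain-like’ marketing language and the artefact itself is worth making concrete. An artificial ‘neuron’ of the kind used in artificial neural networks is, at bottom, a weighted sum passed through a squashing function—ordinary arithmetic, not biology. The weights and inputs below are arbitrary illustrative values:

```python
# A single artificial 'neuron': a weighted sum of inputs plus a bias,
# squashed into the range (0, 1) by a logistic function. Whatever the
# 'brain-like' rhetoric, the object itself is ordinary arithmetic.

import math

def neuron(inputs, weights, bias):
    """Compute a weighted sum and apply a logistic squashing function."""
    activation = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-activation))

# Arbitrary illustrative values.
output = neuron(inputs=[0.5, 0.2], weights=[0.4, -0.6], bias=0.1)
print(round(output, 3))
```

That such units, stacked in large numbers, can find useful patterns is remarkable; but nothing in the mathematics licenses treating them as models of children’s minds.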
Who ‘owns’ educational big data?
The sociologist Evelyn Ruppert has asked ‘who owns big data?’, noting that numerous people, technologies, practices and actions are involved in how data is shaped, made and captured. The technical systems for conducting educational big data collection, analysis and knowledge production are expensive to build. Specialist technical staff are required to program and maintain them, to design their algorithms, to produce their interfaces. Commercial organizations see educational data as a potentially lucrative market, and ‘own’ the systems that are now being used to see, know and make sense of education and learning processes. Many of their systems are proprietary, wrapped in IP and patents that make it impossible for other parties to understand how they are collecting data, what analyses they are conducting, or how robust their big data samples are. Specific commercial and political ambitions may also be animating the development of educational data analytics platforms, particularly those associated with Silicon Valley, where ed-tech funding for data-driven applications is soaring and tech entrepreneurs are rapidly developing data-driven educational software and even new institutions.
In this sense, we need to ask critical questions about how educational big data are made, analysed and circulated within specific social, disciplinary and institutional contexts that often involve powerful actors that possess significant economic capital in the shape of funding and resourcing, cultural capital in terms of the production of new specialist knowledge, and social capital through wider networks of affiliations, partnerships and connections. The question of the ownership of educational big data needs to be located in relation to these forms of capital and the networks where they circulate.
Who can ‘afford’ educational big data?
Not all schools, colleges or universities can necessarily afford to purchase a learning analytics or adaptive software platform—or to partner with platform providers. This risks a divide in which wealthy institutions benefit from the real-time insights into learning practices and processes that such analytics afford, while other institutions remain restricted to the more bureaucratic analysis of temporally discrete assessment events.
Can educational big data provide a real-time alternative to temporally discrete assessment techniques and bureaucratic policymaking?
Policy makers in recent years have depended on large-scale assessment data to help inform decision-making and drive reform—particularly the use of large-scale international comparative data such as the datasets collected by OECD testing instruments. Educational data mining and analytics can provide a real-time stream of data about learners’ progress, as well as automated real-time personalization of learning content appropriate to each individual learner. To some extent this changes the speed and scale of educational change—removing the need for cumbersome assessment and country comparison and distancing the requirement for policy intervention. But it potentially places commercial organizations (such as the global education business Pearson) in a powerful new role in education, with the capacity to predict outcomes and shape educational practices at timescales that government intervention cannot emulate.
Is there algorithmic accountability in educational analytics?
Learning analytics is focused on the optimization of learning, and one of its main claims is the early identification of students at risk of failure. What happens if, despite being enrolled on a learning analytics system that has personalized the learning experience for the individual, that individual still fails? Will the teacher and institution be accountable, or can the machine learning algorithms (and the platform organizations that designed them) be held accountable for their failure? Simon Buckingham Shum has written about the need to address algorithmic accountability in the learning analytics field, and noted that ‘making the algorithms underpinning analytics intelligible’ is one way of at least making them more transparent and less opaque.
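One version of what an ‘intelligible’ analytic might look like is a system that returns the reasons for a decision alongside the decision itself, so that a tutor or student can inspect and contest it. The sketch below is hypothetical—the features, thresholds, and the two-reason rule are all invented for illustration, not drawn from any real platform:

```python
# Hypothetical sketch of an 'intelligible' at-risk flag: the decision is
# returned together with human-readable reasons that can be checked and
# contested. All features and thresholds are invented for illustration.

def flag_at_risk(attendance_rate, logins_per_week, avg_grade):
    """Return (flagged, reasons) rather than an unexplained score."""
    reasons = []
    if attendance_rate < 0.8:
        reasons.append(f"attendance below 80% ({attendance_rate:.0%})")
    if logins_per_week < 2:
        reasons.append(f"fewer than 2 VLE logins per week ({logins_per_week})")
    if avg_grade < 40:
        reasons.append(f"average grade below pass mark ({avg_grade})")
    # Invented rule: flag only when at least two indicators co-occur.
    return (len(reasons) >= 2, reasons)

flagged, reasons = flag_at_risk(attendance_rate=0.7, logins_per_week=1, avg_grade=55)
print(flagged)   # True
print(reasons)   # two reasons a tutor could verify or dispute
```

A commercial platform built on opaque machine-learned weights cannot so easily be interrogated in this way, which is precisely the accountability problem at issue.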
Is student data replacing student voice?
Data are sometimes said to ‘speak for themselves,’ but education has a long history of encouraging learners to speak for themselves too. Is the history of pupil voice initiatives being overwritten by the potential of pupil data, which proposes a more reliable, accurate, objective and impartial view of the individual’s learning process unencumbered by personal bias? Or can student data become the basis for a data-dialogic form of student voice, one in which teachers and their students are able to develop meaningful and caring relationships through mutual understanding and discussion of student data?
Do teachers need ‘data literacy’?
Many teachers and school leaders possess little detailed understanding of the data systems that they are using, or are required to use. As glossy educational technologies like ClassDojo are taken up enthusiastically by millions of teachers worldwide, might it be useful to ensure that teachers can ask important questions about data ethics, data privacy and data protection, and be able to engage with educational data in an informed way? Despite calls in the US to ensure that data literacy becomes a focus of teachers’ pre-service training, there appears to be little sign that the provision of data literacy education for educational practitioners is being developed in the UK.
What ethical frameworks are required for educational big data analysis and data science studies?
The UK government recently published an ethical framework for policymakers to use when planning data science projects. Similar ethical frameworks to guide the design of educational big data platforms and education data science projects are necessary.
Some of these questions clearly need refining, but together they make clear, I think, the need for further work to critically interrogate big data in education.