As the artificial intelligence revolution unfolds around us, many education researchers and practitioners believe that AI will soon lead to highly personalized interventions, such as intelligent tutors. In theory, these tools should more precisely respond to students’ needs and engage them with more relevant learning materials, leading to improved educational progress. But AI application development relies on large, high-quality data sets—a standard that too often is unmet, since generative AI models are mostly trained on publicly available data that are opaque, lack documentation, and are likely biased.
The Institute of Education Sciences (IES), the independent science agency in the U.S. Department of Education that I led through March of this year, sits on lots of data that can and should be used to advance our understanding of student learning. This is especially true of IES’s statistical unit, the National Center for Education Statistics (NCES), which administers the National Assessment of Educational Progress (NAEP).
Through its assessments, the NAEP program has amassed vast amounts of high-quality data on what students know and can do. (Around half a million students take the 4th and 8th grade reading and math assessments every other year; tests in other grades and subjects occur less frequently.) NAEP data are particularly valuable for AI training purposes since NAEP assessments are all nationally representative (ensuring that data don’t reflect only a limited segment of the population). Further, the data are “labeled,” meaning the assessments have already been scored by experienced human graders and often include detailed information about the concept being tested. During the last five years, well over $700 million in federal taxpayer funds has been spent (more than $100 million on question development alone) to create this treasure trove of data. It includes hundreds of thousands of student essays, math exercises, and answers to civics tests. This large data set can help researchers, policymakers, parents, and teachers improve student learning and performance using the power of AI.
But this is not happening at the pace it should. Getting access to data for research purposes through NCES is currently far too difficult. Cumbersome application procedures, bureaucratic hurdles, and slow processes plague researchers and organizations alike. For example, one team of highly qualified researchers from Vanderbilt University sought access to three NAEP math datasets for almost a year, facing frustrating administrative inefficiencies like lost paperwork and a refusal to accept e-signatures, which forced the team to mail documents to multiple people before its application for the data could even be submitted.
These issues are due to legacy security policies meant to protect paper records and data stored on compact discs (remember those?). This is not the world we live in today.
Many government agencies, IES included, now provide secure, remote access to confidential datasets. The Administrative Data Research Facility (ADRF) created by the Coleridge Initiative is a secure research platform that eases access to sensitive confidential microdata. It provides a model of how data can be protected while facilitating access to contemporary cloud infrastructure for improved collaboration, access to shared computing resources, and other benefits. This virtual enclave now allows safe access to NAEP and other student data from IES. State education and workforce agencies, post-secondary institutions, and non-profits also make use of this facility.
Despite this innovation, there is a bottleneck in getting remote access to IES data. Applicants must fill out antiquated forms that refer to “anti-virus software,” locked file cabinets, Internet-disconnected computers, and other items that were clearly created in a very different era of data storage and research. It’s time to remove these archaic barriers and get NCES and NAEP data out faster to facilitate AI development for educational purposes.
The first task is a concerted effort to modernize the current secure-data application process, making it easier for researchers and developers to obtain the data they need for their projects. A new proposal request system is needed to process online applications instead of relying on paper submission through the mail. Digital submissions would support more automated reviews that catch low-level errors, such as a missing signature, so they can be fixed quickly. This would in turn allow the trained and highly paid staff who review applications to focus on more substantive issues. Furthermore, it would accommodate the prevalence of remote work and multi-institutional collaborations by enabling e-signatures and collaboration in an online rather than physical space.
In the long term, increasing access will also mean widening the lens on permissible uses of NCES data. NCES appropriately focuses on high-quality data products that support statistical uses of student data, avoiding enforcement, surveillance, or marketing uses. More broadly, IES and its centers primarily work with universities and non-profit organizations. However, many organizations in the private sector, especially tech firms, are interested in using the data for AI-related purposes. Our current system rarely allows for this, in part due to privacy concerns about student data, but equally prohibitive is a bureaucratic culture that looks askance at commercial enterprises. Yet statistical uses can align with training and analytic uses, and there are many privacy-enhancing technologies in development and deployment in other agencies from which IES and NCES can learn.
Clearly, these changes must be consistent with the Family Educational Rights and Privacy Act (FERPA), which governs access to education data. And, just as clearly, IES and NCES must protect the privacy of student data within that law. But none of these proposed updates to the process for accessing data affect the protections in place in existing (archaic) systems. More broadly, FERPA has all too often acted as a brake on needed changes to ways in which valuable data can be used. IES must lead an effort to better balance the concern for protecting student privacy against the reality that the nation also needs breakthroughs that only access to bias-free and representative data of the sort generated by NAEP can provide.
Mark Schneider was the Director of the Institute of Education Sciences from 2018 through March 2024.