11th Grader Takes An AI Tutoring Deep Dive

And a human tutoring expert takes notes

Sean writes:

Great tutoring works, but great tutors are hard to find. Large Language Models (LLMs) could, in theory, meet this massive demand in a cost-effective way. But can they actually tutor—and if so, for whom?

In November 2023, I co-wrote an essay predicting that AI could be transformative for motivated kids but “mere meh” for the unmotivated. In April 2024, our friend Laurence Holt published The 5% Problem in this publication, arguing along the same lines that edtech tends to help the rich get richer—where here the “rich” are the academically strong and motivated to learn the topic at hand. In May, Laurence and I held a small AI summit at Harvard. We hoped a convincing counterargument to our thesis would emerge, but none did. We still hope to find one!

In July, I deployed one of the 5%—my 16-year-old intern Nash—to assess how much current AI helps “stronger” students like him. The results exceeded my expectations. Here is his story. Then rejoin me for my key takeaways at the end.

Nash writes:

I’m Nash, a high school junior. This year I’m taking AP Statistics. I was curious to see if the AI platforms GPT and Claude could help me learn something about the subject in a self-directed way.

Overall, it worked well. I prefer this method to other ways of teaching myself something, and I could even see it as a reasonable substitute for typical classroom instruction—at least for really motivated kids. Here’s what happened.

My first effort was with GPT-4o. The process was simple: I’d look up a question from another source (whether Khan Academy or a traditional textbook) about a topic like standard deviation, take a screenshot of it, and copy it over. Then I’d ask 4o to explain it to me.

For example, I’d ask the question, “Is this a valid probability distribution?” and it would reply like this:

Now, this is very different from a human tutor. It’s not, in its base mode, trying to get me to work through the problem. Rather, it’s answering like Google would (maybe because it is competing with Google?).

However, there is a way to remedy this, at least to some degree. ChatGPT allows you to create your own Custom GPT. We created one meant to mimic a human math tutor, with two main differences from the default GPT-4o. One is that it tries to engage the student more with questions, guiding them to solve it on their own instead of immediately revealing the answer. It also tries to speak more plainly. The result looks like this:
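The exact instructions behind Nash and Sean's Custom GPT aren't published, but the general technique is a system prompt that constrains the model's behavior. A minimal sketch, with hypothetical prompt wording, might look like this:

```python
# Hypothetical sketch of a "tutor" system prompt in the style Nash describes.
# The actual Custom GPT instructions are not published; this wording is an
# assumption for illustration only.

TUTOR_SYSTEM_PROMPT = (
    "You are a patient math tutor. Never reveal the final answer outright. "
    "Ask one guiding question at a time, wait for the student's reply, "
    "and explain things in plain language a high schooler would use."
)

def build_tutor_messages(student_question: str) -> list[dict]:
    """Assemble the message list a chat-model API call would take:
    the tutoring instructions first, then the student's question."""
    return [
        {"role": "system", "content": TUTOR_SYSTEM_PROMPT},
        {"role": "user", "content": student_question},
    ]

messages = build_tutor_messages("Is this a valid probability distribution?")
```

The design choice is simply that the system message, not the user, controls the tutoring style, which is why the same underlying model can behave so differently from its "base mode."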

As you can see in this example, it was able to walk me through the steps of the problem and only provided me with the information that was absolutely necessary to complete it (in this case, the properties of a valid probability distribution). The only downside to this version is that it sometimes breaks the steps down too much. You may, for example, complete the steps and solve the problem but not be able to repeat it on your own, because you lost sight of the big picture and why you were doing what you were doing.

Hallucinations were rarely a problem. I saw a few. But I treated it like a teacher who occasionally makes mistakes on purpose to try to get kids to “catch” them.

For topics where I’m strong, I use base GPT for speed. For topics where I get stuck, I use the Custom GPT.

I also tried the Claude 3.5 Sonnet model. The difference between it and GPT-4o? Minimal. Take a look:

I would observe that ChatGPT seems to be more mathematical, while Claude’s problem solving is more literary. People may prefer one or the other, but they get you to the same destination.

In my situation, I was often progressing from “half know” to “full know.” I could get the gist of what was happening pretty quickly, and the LLM could get me to the finish line. But I think this would go badly with struggling students who have little base knowledge in a topic. A human tutor would be much better at getting someone from “no idea” to “half know.”

Okay, the LLM helped me practice problems. But what if I wanted to go deeper—to learn not just how but why—to go beyond “full know” to “mega know”? Can LLMs help with that?

I tinkered with them. For example, I asked ChatGPT the reasoning behind why we calculate standard deviation the way we do, then asked some follow-up questions.

To me, this summary of the methodology and rationale felt helpful and well explained. It’s easier for me to remember that you square the deviations to exaggerate them, so the outliers weigh more heavily.

However, this leads into what’s most likely the greatest challenge in LLM tutoring right now. A human tutor’s main purposes are to teach and to motivate. It’s nearly impossible to teach a student who doesn’t want to learn. And that is the major drawback to AI tutoring. From the jump, it needs user input even to start the session. If the user is distracted by something else or their responses are not on topic, no teaching (or learning) will get done. I think LLMs work well for motivated learners, but in the cases where the user absolutely does not want to be learning, an AI tutor is not effective because it lacks the strategies to motivate them.

My Learning Efficiency Rankings, from worst to best.

  1. Online videos
  2. Textbook alone
  3. Normal classroom
  4. Claude
  5. GPT

However, efficiency isn’t the only aspect to consider. Personally, I still enjoy learning at school more than trying to learn things on my own. So even if I could theoretically race through AP Stats in two months, I’d rather just learn it in school alongside my classmates.




Sean writes:

I’m a National Board–certified math teacher who taught in New York City and Chicago. Previously, I led math instructional design for a large international education organization, where our teachers achieved significant math gains for students. With that context in mind, here are my impressions after working with Nash:

 

1. GPT-4o right now—for the motivated child described in Holt’s essay—works better than an average human tutor. With those top students, a human tutor introduces a topic, shows an example, and the student typically “gets it.” If not, they might ask the tutor one or two questions to achieve “full know.”

I’d give 4o the slight advantage over a human tutor because it can work at the speed of the motivated top 5% student. Plus, it can elaborate on anything the student needs help with in a style that fits them (especially if you build a custom GPT, as we did for Nash). A recent study corroborates Nash’s experience across 839 students: the custom GPT version outperformed the “base version.”

No human tutor is as fast or intellectually versatile as a state-of-the-art LLM, as long as the prompts it’s fed are clear and specific.

 

2. As Dan Meyer writes, “Great teachers . . . do not wait for the demand for their teaching to arise naturally in a student. They see it as their job to create demand.”

When I watched Nash engage with an AI tutor, that demand was there naturally. He was curious about something or needed help solving a problem, so he asked 4o. It helped him to move forward. He didn’t need a teacher to bring out his motivation.

I noted a transactional quality to Nash’s interactions with 4o that would make some educators uneasy. Observing him teach himself standard deviation, I felt the need to ask him some “Check For Understanding” questions, both to push his understanding and, as a teacher, to feel useful. Our discussions did elevate his understanding, but they weren’t essential. Nash was fine. I can imagine motivated kids in the 5 percent really enjoying interactions with an LLM—the opportunity to go back and forth about a topic at any time and in any depth.

So far, so good?

 

3. Perhaps you’ve intuited the huge caveat. AI problem solving, even when customized to act more like a real tutor, will not work for a great majority of students. I think it would be worse than a typical human tutor for 80 percent of them, the same for 15 percent, and better for 5 percent. This aligns with Holt’s thesis and the spirit of Meyer’s critique.

Not only can LLMs not easily manufacture interest or motivation in students, but their helpfulness may also have unintended consequences. When I asked Nash if some of his peers would use LLMs as just an “answer-giver,” he just smiled; of course they would. (As a former high school teacher, I should’ve known better.) That same randomized controlled trial I cited earlier had a curious finding that backed this up: students overrated how much the AI helped them learn as opposed to giving answers. They leaned on it too much, and having it taken away hurt their performance relative to the control group.

 

4. I think if Nash only worked with GPT-4o as his tutor in AP Stats this year instead of taking the class at his high school, he’d score a perfect 5 on the exam after just six weeks of effort. Instead, he’ll take his class for 30 weeks and probably end up with the same score.

Importantly, Nash does not want to take the more efficient route. He likes high school—his friends, the experience of attending classes, the discussions that happen. He likes his teachers and the social camaraderie. So, what’s the rush?

 

5. However, I can’t help but wonder a few things.

a. If given the option, how many 5 percent students would opt out of honors classes and into self-paced GPT-run courses?

b. If Nash could engage with GPT-4o along with some buddies instead of attending a normal AP Stats class with a teacher, would he choose the AI tutor?

c. How much better will this get? Already there are claims that new advances make months-old versions of AI tools seem prehistoric. OpenAI has released two major updates since Nash and I worked together: an advanced voice mode and the o1 reasoning model, both of which I would have used in my work with Nash.

But we’ve been here before. Edtech waves have come and gone, and empirically we’ve seen that the benefits mostly redound to kids like Nash.

Even so, I was more impressed by what 4o could do as a tutor than by any other tech product I’ve seen kids interact with. Its ceiling as a tutor in a one-on-one context is higher than Khan Academy’s resources or Zearn or any other learning platform I’ve seen. Expert human tutors still have the advantage, but they are hard to find and expensive.

If GPT-4o and Claude surpassed my expectations with Nash, what will the next surprise look like? Laurence Holt and I may need to update our AI predictions in 2025.

Sean Geraghty is an education consultant. Nash Goldstein is a high school junior in Watertown, Massachusetts.
