Can AI Actually Read a Book?
Introduction
Sort of half-seriously, I decided to try an AI co-writing experiment in honor (?) of “National Novel Writing Month” a few months ago. Not officially, mind you, but I am of the generation that vaguely feels like they should be writing a novel every November. It doesn’t escape me that the timing might have gone against the spirit of the annual writing tradition, but I was curious. I’ve been developing context engineering best practices for my AI orchestration layer, Artificer, and a few client apps, and already knew that AI can’t, really, write a novel. Not on its own. What I was curious about was how far I could actually take it. I set up a Claude Project and sallied forth.
I actually enjoyed the initial results: the one actual advantage of co-writing with AI was that I could look forward to what might happen next, because I didn’t necessarily know what was going to happen. Unfortunately, about 10k words into the effort, I abandoned what had become an expensive game of exquisite corpse. Yes, what stopped me was money: I had other things to do with my AI budget (coding), and it turns out directing and editing AI fiction is pretty resource intensive. Who knew?
Honestly, the fact makes me and my MFA rest a bit easier.
It only occurred to me a few weeks later that (real!) worries about “AI slop” overlooked the same thing I myself had overlooked. Namely, before asking the question “could AI write a book,” I had not actually asked if “AI” could read a book.
Once I contemplated this question, I felt increasing cognitive dissonance. I almost felt crazy. Surely, there were AI agents that could read a book. Hadn’t LLMs “read,” like, every book? By definition? Almost all of the AI services related to research and writing generation seemed like they couldn’t exist if that weren’t a solved problem. However, knowing what I know now about RAG patterns, semantic search, and context windows, I know this question wasn’t as straightforward as it seemed. If an LLM could actually “read” a book—that is, comprehend and map contents in a logical and informative way, at minimum—then I would be a lot further along on my various projects.
How AI “Reads” Text Today
When I started looking into this, I assumed the problem was basically solved. There are plenty of AI research tools, content generation services, and “chat with your documents” features. Surely if these tools work, AI can read books, right?
What Actually Exists
Semantic search and research tools like Elicit, Semantic Scholar, and Consensus excel at finding relevant papers, extracting key claims, and synthesizing information across many documents. They use RAG (Retrieval-Augmented Generation) under the hood: chunk documents into passages, generate embeddings for similarity search, retrieve relevant chunks based on your query, then feed those chunks to an LLM for synthesis. They’re designed for the research workflow: “Find me papers about X, summarize their conclusions, identify patterns across the corpus.”
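To make that pipeline concrete, here’s a minimal sketch in Python, assuming the sentence-transformers library for embeddings; `llm_complete` is a stand-in for whatever LLM API you actually call, not a real function from any particular SDK.

```python
# Minimal RAG sketch: chunk, embed, retrieve, synthesize.
# Assumes sentence-transformers is installed; llm_complete() is a placeholder.
from sentence_transformers import SentenceTransformer, util

def rag_answer(document: str, query: str, llm_complete,
               chunk_size: int = 200, top_k: int = 3) -> str:
    # 1. Chunk the document into fixed-size word windows.
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_size])
              for i in range(0, len(words), chunk_size)]

    # 2. Embed chunks and query into the same vector space.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    chunk_vecs = model.encode(chunks, convert_to_tensor=True)
    query_vec = model.encode(query, convert_to_tensor=True)

    # 3. Retrieve the most similar chunks by cosine similarity.
    hits = util.semantic_search(query_vec, chunk_vecs, top_k=top_k)[0]
    context = "\n---\n".join(chunks[hit["corpus_id"]] for hit in hits)

    # 4. Feed the retrieved chunks to the LLM for synthesis.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return llm_complete(prompt)
```

Notice that nothing in this loop knows or cares whether chunk 3 happens before chunk 40 in the story, or whether the two are causally related. That gap is the whole point of what follows.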
This works remarkably well for that use case. Each paper is relatively self-contained. You’re looking for discrete claims and findings. Cross-referencing multiple sources is the goal. The chunking doesn’t destroy meaning because academic papers are structured to make individual sections interpretable on their own.
“Chat with a PDF” tools handle single documents you upload and ask questions about. Under the hood, most use the same RAG pattern or simply try to fit the whole document into an LLM’s context window. For papers, reports, or shorter documents, this works fine.
Content generation tools like Sudowrite or Claude’s extended context features can help with writing by maintaining awareness of your manuscript… up to a point. When I tried uploading a full novel manuscript to test Sudowrite’s marketed capabilities, it couldn’t even accept the file despite claiming to support book-length projects.
Digital humanities approaches have their own toolkit, focused on “distant reading,” which involves analyzing patterns across thousands of texts, tracking word frequencies, and identifying themes across corpora. These tools are powerful for literary scholarship but aren’t designed for deep comprehension of individual narratives.
The Core Limitations
All of these approaches run into the same fundamental constraints:
Context windows, even as they expand, create a hard limit. Yes, models now support 100K, 200K, even 1M+ tokens. But attention degradation is real: benchmarks like AcademicEval (2025) show 20-50% drops in narrative comprehension tasks beyond 100k tokens, with hallucination rates climbing as context expands. Drop a novel into a context window and ask about a character’s motivation arc, and you’re gambling on whether the model actually “saw” the relevant passages or whether they got lost in the attention mechanism’s weighted averaging.
On their own, even agents with access to frontier models lack persistent mental models. An LLM processes text as a sequence and generates a response. It doesn’t build and maintain a structured representation of the story world. When you ask it a follow-up question, it’s not querying an internal knowledge graph; rather, it’s processing the entire conversation history again, from scratch, generating tokens probabilistically based on what it’s seen before.
Even with the addition of RAG-supported “memory” systems, each new interaction starts with retrieval and fresh generation. Agents with RAG may indeed remember the name of the main character over multiple sessions, but there’s no accumulated understanding, no evolving model of character states, no timeline that gets updated as you read further. The system has to reconstruct its “understanding” on every single query even if it gets the facts right.
Context engineering, generally speaking, lacks validation. In software engineering, you write tests. You verify behavior. You can guarantee (within reason) that a function will produce the same output given the same input. In prompt engineering, you’re trying to coax probabilistic behavior toward consistency, but you can’t guarantee the same result twice. This matters enormously when you need reliability: if your “reading” of a novel changes randomly between queries, it’s not really reading.
Why LLMs Can’t “Read” (Simply Put)
Here’s the fundamental difference: LLMs predict the next token based on statistical patterns learned from training data. They’re exceptionally good at this. They generate text that looks like understanding because they’ve seen millions of examples of coherent discussion about narratives.
But they’re not building the kind of structured mental model that a human reader builds. They’re not tracking causality chains, maintaining character state, or constructing timelines. When an LLM “answers” a question about a character’s motivation, it’s generating a plausible response based on the text it can attend to and the patterns it learned during training, not reasoning over a persistent representation of who that character is, what they’ve done, and how they’ve changed. Adding basic RAG is roughly equivalent to having the SparkNotes open while you read a paragraph: possibly useful, possibly misleading.
Although RAG patterns are becoming increasingly common, a lot of people still have “one-off” interactions with LLMs, like asking ChatGPT a quick question or using AI Mode when googling. This is why asking an LLM about plot details from earlier in a long novel often produces confidently stated hallucinations. It’s not that the model is “forgetting”—it never had a structured memory to begin with. It’s generating plausible-sounding text about what might happen in a story with these elements, not retrieving facts from a mental model of this specific story.
The Real Gap
The tools we have work well for what they’re designed for:
- Finding patterns across many documents (semantic search)
- Answering questions about reference material (RAG)
- Generating text that sounds coherent (LLMs)
What they don’t do is comprehend a single long-form narrative as a coherent whole—tracking its entities, their relationships, the causal structure, the temporal flow, and the evolution of character states across 100,000+ words.
That’s not a failure of these tools. It’s just not what they were built to do. And once you recognize that gap, it becomes clear why the tools marketed as “write your novel with AI” or “analyze your manuscript” fall short. They’re applying semantic search, the AI equivalent of cribbed notes, and token prediction to a problem that actually requires something closer to what human readers build: a persistent, structured mental model of the story world.
Which raises an interesting question: what are human readers building when they read?
How Humans Actually Read Narratives
To build an AI that can actually read, we first need to understand what reading is—it turns out to be far more sophisticated than our unconscious experience of it suggests. And as far as I can tell, there’s a remarkable degree of consensus in this area, with key insights stretching back to the early twentieth century.
The foundation in this area of cognitive science begins with Schema Theory (Bartlett, 1932; Rumelhart, 1980). The core insight: readers don’t process text passively. Instead, they use internalized mental frameworks, schemas, to organize and interpret information. You can’t understand a story about a restaurant visit without your existing “restaurant schema” (you order food, someone brings it, you pay). Schemas provide the scaffolding that makes comprehension possible.
For narrative specifically, Story Grammar theory (Mandler & Johnson, 1977; Stein & Glenn, 1979) identified the structural scaffolding readers use: settings, characters, initiating events, internal responses, plans, attempts, consequences, and resolutions. This framework remains foundational, and it’s still the basis for current assessment tools and intervention programs. When something violates story grammar (a resolution before a conflict, a consequence without an attempt), readers immediately notice the incoherence. These aren’t arbitrary categories: they’re cognitive structures readers actively employ to make sense of plot.
Interestingly, these insights from cognitive science echoed discoveries literary theorists had made decades earlier. Russian Formalism in the early 1900s introduced concepts like fabula (the chronological sequence of events) and syuzhet (how those events are presented, such as with flashbacks, revelations, perspective shifts). Essentially, narratologists were describing structurally what cognitive scientists would later discover readers construct mentally. Psychologist Jerome Bruner explicitly drew these connections in Actual Minds, Possible Worlds (1986), showing how narratology and cognitive science had been circling the same phenomena from different angles. Readers mentally construct the fabula even when reading a non-linear syuzhet.
Kintsch’s Construction-Integration model (1988, refined through the 1990s) described the process by which readers build understanding. It’s a two-stage process: First, readers generate propositions and ideas, forming a networked mental map using their prior knowledge (the “construction” phase). Then they choose the best interpretation, monitor their comprehension, and repair understanding when breakdowns occur (the “integration” phase). The result is what’s called a situation model, that is, a mental representation of the world being described, not the text itself. This is why you can remember the gist of a story without remembering exact sentences: you’re storing the situation, not the text.
The Event-Indexing Model (Zwaan, Langston & Graesser, 1995) built on this by describing five dimensions that readers track while building their situation models: time, space, causation, motivation, and protagonist. The model explains why some previous events stay “active” in working memory while others fade. Their salience depends on whether they share these dimensions with the current event. This is why you can follow a character through a book even when they disappear for chapters, but you might forget minor characters who only appear once.
More recent neuroscience work has validated these cognitive models at the neural level. fMRI studies using naturalistic stimuli—actual stories and films rather than isolated sentences—show that the brain’s default mode network activates specifically during situation model construction, with regions like the temporoparietal junction tracking suspense, causality, and character mental states. The theory from the 1970s-90s turns out to map onto measurable brain activity.
What’s crucial here is that human readers are doing three things simultaneously:
- Building a chronological event graph (the fabula)
- Tracking how it’s presented (the syuzhet)
- Maintaining entity state over time (how characters change, what they know, their relationships)
This is radically different from what an LLM does. An LLM processes text as a sequence of tokens and predicts what comes next. It has no persistent mental model. It doesn’t track causality chains or character arcs; instead, it generates text that looks like it understands these things because it’s seen millions of examples of coherent narrative. But there’s no actual graph structure, no timeline, no entity state management happening under the hood.
The gap becomes obvious when you ask an LLM about character motivations across a 100k-word novel. It might give you a plausible answer based on whatever chunks of text were in its context window, but it’s not drawing on a persistent model of that character’s goals, beliefs, and trajectory. It’s pattern-matching against narrative conventions it’s learned, not reasoning over a representation of the story world.
Current RAG-based approaches to the LLM “memory” problem, while increasingly good at keeping changing information up to date, do not necessarily preserve when, how, and why that information changed.
This matters because if we want AI agents that can truly work with long-form narrative—editing, analyzing, translating with coherence—we need to give them something analogous to what human readers build: a story world model with entities, events, causal chains, and temporal structure.
How AI Could “Read” a Book
Now for the interesting part: if we understand what human readers build mentally, we can start to imagine what a computational equivalent might look like.
The approach wouldn’t be to make LLMs “understand” narratives the way they currently process text. Instead, it would be to build explicit computational structures that mirror the cognitive models and then use LLMs (and other tools—LLMs are not necessarily the best at reasoning either) as interpretive and generative engines over those structures. Whether this actually works remains to be seen, but the theoretical architecture intrigues me.
A Possible Three-Layer Architecture
Layer 1: Text Processing and Entity Extraction
You’d start with the fundamentals: chunk the text intelligently (by scene boundaries, not arbitrary token limits), then extract entities, events, and relationships using NLP tools like SpaCy combined with LLM-powered semantic understanding. Existing knowledge graph libraries could handle the graph construction mechanics, but you’d need narrative-specific extraction that identifies story grammar elements, not just generic entities.
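As a rough sketch of what Layer 1 might look like, here’s a toy version assuming spaCy for baseline entity extraction and an assumed scene-break convention (asterisk dividers or “Chapter N” headings); the story-grammar tagging itself would still need an LLM pass on top of this.

```python
# Layer 1 sketch: scene-aware chunking plus baseline entity extraction.
# Assumes spaCy with the small English model installed:
#   python -m spacy download en_core_web_sm
import re
import spacy

nlp = spacy.load("en_core_web_sm")

def split_into_scenes(text: str) -> list[str]:
    """Chunk by scene boundaries (asterisk dividers or chapter headings),
    not by arbitrary token counts."""
    pattern = r"\n\s*(?:\*\s*\*\s*\*|Chapter\s+\w+)\s*\n"
    return [part.strip() for part in re.split(pattern, text) if part.strip()]

def extract_scene_entities(scene: str) -> dict:
    """Baseline named-entity extraction. Story-grammar elements (initiating
    event, attempt, consequence...) would come from an LLM pass, not spaCy."""
    doc = nlp(scene)
    return {
        "characters": sorted({ent.text for ent in doc.ents if ent.label_ == "PERSON"}),
        "places": sorted({ent.text for ent in doc.ents
                          if ent.label_ in ("GPE", "LOC", "FAC")}),
    }
```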
Layer 2: Story World Model Construction
This is where cognitive science could become code. The system would need to build a persistent graph structure that maintains:
- Event nodes with temporal ordering and causal relationships (the fabula)
- Entity tracking with state changes over time: character knowledge, relationships, locations, goals
- Story grammar scaffolding: explicit representation of initiating events, attempts, consequences, resolutions
- The five event-indexing dimensions: time, space, causation, motivation, protagonist connections
Think of it like this: as the LLM reads through the text once, it would be populating a database that captures “what a reader would know at any point in the story.” Not just what’s happened, but what’s causally connected, what’s temporally ordered, which character states have changed. All of this would sit alongside the factual, accurate information that RAG already surfaces.
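Here’s a minimal sketch of what that persistent structure could look like, using networkx for the graph mechanics. Every attribute name below is an illustrative assumption, not a settled schema.

```python
# Toy story world graph: event nodes, entity nodes, causal and participation
# edges, and per-position entity state. All attribute names are illustrative.
import networkx as nx

story = nx.MultiDiGraph()

# Entities carry a state history keyed by fabula position.
story.add_node("alice", kind="character",
               states={1: {"location": "abbey", "knows_secret": False},
                       7: {"location": "London", "knows_secret": True}})
story.add_node("bob", kind="character",
               states={1: {"location": "abbey", "trusts_alice": True}})

# Events carry story-grammar roles and the event-indexing dimensions.
story.add_node("event_12", kind="event", grammar_role="initiating_event",
               fabula_order=12, syuzhet_order=3,  # told in flashback
               time="day 1", space="abbey", protagonist="alice")
story.add_node("event_45", kind="event", grammar_role="consequence",
               fabula_order=45, syuzhet_order=44, protagonist="alice")

# Edges make causality and participation explicit and queryable.
story.add_edge("event_12", "event_45", relation="causes")
story.add_edge("alice", "event_12", relation="participates_in")
story.add_edge("bob", "event_12", relation="participates_in")
story.add_edge("alice", "event_45", relation="participates_in")
```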
The Version History Problem
There’s a subtle but crucial challenge here that bears repeating: human readers don’t just track the current state of the story world; they also maintain something like a “belief history” that shapes how they interpret revelations and plot twists.
Consider a murder mystery where Brother Matthew acts suspiciously throughout the first two-thirds of the novel. Readers (and characters) operate under the working hypothesis that he might be the murderer. Then it’s revealed he was planning the abbot’s surprise birthday party.
A human reader experiences this as: “Ah, I was wrong about Brother Matthew, and that realization recontextualizes all his earlier suspicious behavior, and I understand why the protagonist was misled, and I feel the emotional weight of the misdirection.” They maintain both the current truth AND the history of what was believed when.
An LLM, without (and even with) careful prompt engineering and RAG assistance, might simply update the database: “Brother Matthew: not the murderer” and move on. The system would lose the fact that he was a suspect, that this suspicion shaped character behavior, that the revelation has narrative weight because of the prior belief state.
This is fundamentally about epistemic state tracking—not just “what is true in the story world,” but “what did characters believe at time T” and “what did readers know at time T.” It’s related to the fabula/syuzhet distinction again: the order and timing of revelation matters enormously. A plot twist only works if the system understands what information was available before the twist.
I’ll be honest: I haven’t fully worked out how to architect this. You’d probably need something like version control for belief states—temporal snapshots of “what was known when”—combined with explicit tracking of information revelation. It might look something like:
event_147 = {
    "type": "revelation",
    "content": "Brother Matthew planning party",
    "invalidates_beliefs": ["Belief_23", "Belief_67"],
    "recontextualizes_events": ["Event_12", "Event_45", "Event_89"],
    "emotional_weight": "surprise + relief",
}
But even with that structure, getting an LLM to consistently populate it correctly, to recognize when a revelation is a revelation, to understand what prior beliefs it invalidates—that would require sophisticated context engineering I haven’t figured out yet, and the complexity grows as the graph of structured data balloons in size.
This matters especially for:
- Mystery and thriller narratives (red herrings, false leads)
- Unreliable narrators (what the narrator claims vs. what’s true)
- Dramatic irony (reader knows something characters don’t)
- Character misunderstandings (tracking who knows what when)
Layer 3: Retrieval and Reasoning
When you query the system (“How has Alice’s relationship with Bob evolved?”), you wouldn’t just be doing semantic search over text chunks. You’d:
- Retrieve the relevant subgraph (Alice nodes, Bob nodes, relationship edges, events they share)
- Trace state changes across the timeline
- Use an LLM to reason over that structured data and generate natural language
The LLM wouldn’t be trying to “remember” the book—it would reason over a representation that makes the relevant information explicit and retrievable.
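Continuing the toy graph from Layer 2, the Alice-and-Bob query might reduce to subgraph retrieval plus a single LLM call over the serialized result; `llm_complete` is again a placeholder for your actual API call.

```python
# Layer 3 sketch: retrieve the Alice/Bob subgraph, order their shared events
# by fabula position, and hand the structured result to an LLM for prose.
# Builds on the `story` graph from the Layer 2 sketch.
import networkx as nx

def relationship_trajectory(story: nx.MultiDiGraph, a: str, b: str, llm_complete) -> str:
    # Shared events: event nodes both characters participate in.
    events_a = {v for _, v, d in story.out_edges(a, data=True)
                if d.get("relation") == "participates_in"}
    events_b = {v for _, v, d in story.out_edges(b, data=True)
                if d.get("relation") == "participates_in"}
    shared = sorted(events_a & events_b,
                    key=lambda e: story.nodes[e].get("fabula_order", 0))

    # Serialize the subgraph: shared events in story order, plus each
    # character's recorded state history.
    lines = [f"{e}: {story.nodes[e]}" for e in shared]
    lines += [f"{c} states: {story.nodes[c].get('states', {})}" for c in (a, b)]

    prompt = ("Using only the structured story data below, describe how the "
              f"relationship between {a} and {b} evolves over time.\n"
              + "\n".join(lines))
    return llm_complete(prompt)
```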
The Prompt Engineering Challenge
Building the story world model would also require sophisticated prompting. At the very least you’d need few-shot examples that teach the LLM to identify:
- Story grammar elements (“Is this an initiating event or an attempt?”)
- Causal vs. coincidental connections
- Character state changes vs. static descriptions
- The difference between syuzhet (text order) and fabula (story order)
A multi-agent or “chain of experts” approach might help here: one agent extracts entities, another identifies events, another builds causal chains, another tracks character states. Each would have specialized prompts tuned for its cognitive task.
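As one small example of what a specialized prompt for the story-grammar agent might look like, here’s a sketch of a few-shot classification prompt; the labels and examples are illustrative, not a validated taxonomy.

```python
# Sketch of a few-shot prompt for one specialized agent: classifying a
# passage's story-grammar role. Labels and examples are illustrative.
STORY_GRAMMAR_PROMPT = """You label passages with ONE story-grammar element:
setting, initiating_event, internal_response, plan, attempt, consequence, resolution.

Passage: "The letter arrived on a Tuesday, and nothing was the same after."
Label: initiating_event

Passage: "She decided she would confront him at the feast, publicly."
Label: plan

Passage: "He lunged for the rope, fingers closing on empty air."
Label: attempt

Passage: "{passage}"
Label:"""

def classify_passage(passage: str, llm_complete) -> str:
    # llm_complete() is a placeholder for your LLM API call.
    return llm_complete(STORY_GRAMMAR_PROMPT.format(passage=passage)).strip()
```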
Some computational prototypes already point in this direction. The Indexter system implemented Event-Indexing computationally for interactive narrative (Cardona-Rivera & Young, 2012). More recently, work on social event detection (LSED, 2025) uses LLM+RAG approaches to build event graphs from text with incremental tracking. These aren’t full “reading” systems, but they’re pieces of the puzzle.
The Limitation: Conventional Narratives
This architecture would probably work best with what we might call “conventionally structured narratives”: stories that follow recognizable patterns, have clear cause-and-effect chains, and maintain consistent character identities. Experimental fiction that deliberately subverts narrative logic, stream-of-consciousness prose, or highly fragmented storytelling would be significantly harder. The story grammar scaffolding depends on narratives actually using that grammar.
But that’s not as limiting as it sounds. Most novels, most fanfiction, most genre fiction, most narrative nonfiction—these follow recognizable patterns. The system would likely handle a Tom Clancy thriller, a romance novel, a biography, a Korean webnovel, or a YA fantasy series just fine. It would struggle with Gravity’s Rainbow. (Even if, ironically, an LLM might have a more “literate” hot take on Thomas Pynchon than you do when put on the spot.)
What This Could Enable (That Humans Can’t Do)
If this worked, a system with an explicit story world model could provide capabilities that human readers lack:
Systematic consistency checking: Does Alice’s eye color stay consistent across 300,000 words? Do the travel times between locations make sense? Are there dangling plot threads that never resolved? Humans miss these because working memory is limited, but a computational model wouldn’t forget.
Relationship trajectory analysis: How did the emotional dynamic between these characters change? Can you show me every interaction and map the progression? A human would need to reread and take notes. The system would already have the graph.
Comparative structure analysis: How does this story’s pacing compare to genre conventions? Where do the act breaks fall? Is the climax positioned typically? This requires holding multiple narratives in mind simultaneously—trivial for software, impossible for humans.
Targeted revision support: “Show me everywhere this character’s motivation conflicts with their actions” or “Find plot points that don’t have clear causal antecedents.” These are editing tasks that require holding the entire narrative in structured memory.
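Of these, systematic consistency checking is the easiest to make concrete. Assuming the toy story graph sketched earlier, with per-position state snapshots on each character node, a check might look something like this:

```python
# Sketch: flag contradictory values for an attribute that shouldn't change
# (eye color, a sibling's name) across a character's recorded state history.
# Assumes the `story` graph and `states` attribute from the Layer 2 sketch.
def check_attribute_consistency(story, character: str, attribute: str) -> list[str]:
    states = story.nodes[character].get("states", {})
    seen: dict = {}              # fabula position -> value already recorded
    problems: list[str] = []
    for position, state in sorted(states.items()):
        if attribute not in state:
            continue
        value = state[attribute]
        for prior_position, prior_value in seen.items():
            if prior_value != value:
                problems.append(
                    f"{character}.{attribute}: {prior_value!r} at {prior_position} "
                    f"vs {value!r} at {position}")
        seen[position] = value
    return problems
```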
The goal wouldn’t be to replace reading; rather, it would be to augment it with capabilities that serve writers, editors, translators, and researchers who need to work with narratives at a structural level.
Some might worry that these augmented reading practices would erode the very cognitive abilities AI is supposed to support. I mean, yeah. We simply don’t know the long-term effects, and the stakes are particularly high for generations currently learning to read. In some ways I think this makes understanding the potential and limitations of AI reading even more important, because no matter what we may wish, AI summarizing and answering questions about text is not going to go away. It will only become more important to identify how AI continues to (fail to) read.
Conclusion
So why hasn’t this been solved yet? Recent papers confirm that improvements based on models alone may be something of a “tarpit idea” at the moment: a 2025 arXiv study found LLMs handle causal soundness at small scales but fail on intentionality and conflict arcs without explicit planning mechanisms. The pieces exist—cognitive science frameworks, graph libraries, capable LLMs—but integration remains unsolved. The use case is specialized enough that major AI companies aren’t prioritizing it, though the solutions may apply beyond narrative: legal and medical AI agents face similar challenges with complex, state-dependent documents.
That’s both exciting and sobering. It means there’s meaningful technical work to do, problems to solve that require both understanding the theory and figuring out the implementation details. It’s the kind of challenge that appeals to me, sitting at the intersection of literature, cognitive science, and systems architecture.
What does this imply about how we use AI tools for content generation and studying now? Mostly that we should be clear-eyed about their limitations. RAG works great for reference material. Semantic search works great for finding patterns across documents. LLMs work great for generating plausible text. But none of them are actually reading in the way humans read, and pretending they are leads to disappointment when they fail at tasks that require genuine comprehension.
It’s compelling that a technical implementation is possible today, even if I don’t know yet whether it will work as well as I want it to. The pieces exist. The theory exists. We just need to keep building it and find out what breaks.
References
Bartlett, F.C. (1932). Remembering: A Study in Experimental and Social Psychology. Cambridge University Press.
Bruner, J. (1986). Actual Minds, Possible Worlds. Harvard University Press.
Cardona-Rivera, R.E. & Young, R.M. (2012). Indexter: A computational model of the event-indexing situation model theory of narrative understanding. Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment.
Kintsch, W. (1988). The role of knowledge in discourse comprehension: A construction-integration model. Psychological Review, 95(2), 163-182.
Mandler, J.M. & Johnson, N.S. (1977). Remembrance of things parsed: Story structure and recall. Cognitive Psychology, 9(1), 111-151.
Rumelhart, D.E. (1980). Schemata: The building blocks of cognition. In R.J. Spiro et al. (Eds.), Theoretical Issues in Reading Comprehension. Lawrence Erlbaum.
Stein, N.L. & Glenn, C.G. (1979). An analysis of story comprehension in elementary school children. In R.O. Freedle (Ed.), New Directions in Discourse Processing. Ablex.
Zwaan, R.A., Langston, M.C., & Graesser, A.C. (1995). The construction of situation models in narrative comprehension: An event-indexing model. Psychological Science, 6(5), 292-297.