National Science Foundation grant on computer literacy: AutoTutor (1997-2001)

Project Summary

The long-term practical objective of the proposed research is to develop a fully automated computer tutor. The tutor would be able to (a) extract meaning from the contributions that the student types into a keyboard and (b) formulate dialog contributions with pedagogical value and conversational appropriateness. The dialog contributions of the tutor would be in different formats and media: printed text, synthesized speech, simulated facial movements, graphic displays, and animation. Such an achievement will require an interdisciplinary integration of theory and empirical research from the fields of cognitive psychology, discourse processing, computational linguistics, artificial intelligence, human-computer interaction, and education. An attempt to build an automated tutor should advance theories and computational models in these fields. In particular, the proposed project will directly address constructivist theories of learning and cognition, discourse theories of collaboration and common ground, and theories of language processing that specify how world knowledge and pragmatics interact with lexical, syntactic, and semantic processing.

Most of the previous attempts to develop a fully automated tutor have failed because of a variety of technical and theoretical barriers. These include problems in handling natural language when it is not well-formed semantically and syntactically, the problem of world knowledge being open-ended and incomplete, and the lack of research on human tutorial dialog. However, recent advances have dramatically reduced these barriers, so it is time to revisit the possibility of building an automated tutor. The feasibility of a computer tutor is further fortified by some recent research on human tutoring. This research has revealed that human tutors and learners have a remarkably incomplete understanding of each other's knowledge base and discourse contributions (particularly when common ground is minimal), and that a key feature of effective tutoring lies in generating discourse contributions that assist learners in actively constructing subjective explanations, elaborations, and mental models of the material. The tutor's discourse moves include pumping, prompting, hinting, questioning, summarizing, splicing in correct information, providing immediate feedback, and revoicing student contributions.

The proposed computer tutor embraces both classical symbolic architectures (e.g., those with propositional representations, conceptual structures, curriculum scripts, and production rules) and architectures with multiple soft constraints (i.e., neural network models, fuzzy descriptions and controllers, and latent semantic analysis). Latent semantic analysis (LSA) will provide the backbone for representing the world knowledge; LSA reduces a large corpus of texts to a 100-300 dimensional space through a statistical method called singular value decomposition. The LSA space is used to compute the truth, relevance, and quality of student contributions, as well as the conceptual similarity between text segments (i.e., words, phrases, sentences, texts). The macrostructure that guides the tutorial dialog consists of a curriculum script with didactic descriptions, tutor-posed questions, cases, problems, figures, and diagrams (along with good responses to each topic). The tutor's selection of topics and the dialog moves within each topic are captured by a set of production rules, which vary according to the strategies used by the tutor. The production rules (which are expressed in either crisp or fuzzy form) are sensitive to the truth, relevance, and quality of student contributions. The messages that the learner types into the computer are segmented into speech act units and classified into speech act categories (e.g., question, directive, assertion, answer, short response) by a set of language modules: WordNet, syntactic parsers, a dictionary of frozen expressions, software agents with codelets that sense surface linguistic features, and a recurrent connectionist network that predicts the next speech act category. The computer tutor will manage a mixed-initiative dialog between the tutor and student.
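As a purely illustrative sketch of how a crisp production rule might map the truth and relevance of a student contribution onto a dialog move, consider the following. The function name, the thresholds (0.3 and 0.7), and the particular pairing of conditions with moves are all hypothetical assumptions introduced here; the proposal does not specify them.

```python
def select_dialog_move(truth, relevance):
    """Pick a dialog move from truth and relevance scores in [0, 1].

    Thresholds and move assignments are hypothetical placeholders,
    not rules taken from the proposal.
    """
    if relevance < 0.3:
        # Contribution is off topic: ask the student for more.
        return "pump"
    if truth < 0.3:
        # On topic but mostly wrong: supply the correct information.
        return "splice in correct information"
    if truth < 0.7:
        # Partially correct: nudge toward the expected answer.
        return "hint"
    # Relevant and correct: acknowledge it.
    return "positive feedback"


print(select_dialog_move(truth=0.5, relevance=0.9))
```

A fuzzy version would replace the hard thresholds with graded membership functions, so that a contribution could partially satisfy several rules at once.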

The success of the model will be tested in several ways. Experts in discourse and education will evaluate the appropriateness, relevance, and pedagogical quality of the dialog contributions generated by the computer tutor. The fidelity of particular language modules will be evaluated with respect to recall, precision, and other performance measures used by researchers in computational linguistics. Turing tests will be performed at a fine-grained level in order to assess whether the learner (or a neutral human observer) can discriminate whether particular dialog moves are generated by the computer versus a human tutor. Towards the end of the project, there will be assessments of whether the computer tutor produces significant learning gains (compared to control conditions). The tutoring topics will be in the domains of computer literacy and introductory medicine.

Overview of Motivation of Proposed Research

Educators and advocates of intelligent tutoring systems (ITS) have frequently had the vision of having computers tutor students on skills and domain knowledge. The computer tutor would be fully automated in the best of worlds. Unfortunately, however, language and discourse have constituted a serious barrier in these efforts. As a consequence, language and discourse facilities have been either nonexistent or extremely limited in the most impressive and successful intelligent tutoring systems available, such as Anderson's tutors for geometry, algebra, and computer languages (Anderson, Corbett, Koedinger, & Pelletier, 1995), Van Lehn's tutor for basic mathematics (Van Lehn, 1990), and Lesgold's tutor for diagnosing and repairing electronic equipment (Lesgold, Lajoie, Bunzo, & Eggan, 1992). There have been some attempts to augment ITS's with language and dialog facilities (Holland, Kaplan, & Sams, 1995; Moore, 1994). But such attempts have been limited by (1) the inherent difficulty of getting a computer to "comprehend" the language of users, including utterances that are not well-formed syntactically and semantically, (2) the difficulty of getting computers to effectively use a large body of open-ended, fragmentary, and unstructured world knowledge, and (3) the lack of research on human tutorial dialog. These difficulties have been aggravated by insufficient communication among researchers in discourse processing, education, and computational linguistics and ITS developers.

Advances in research during the last five years make it much more feasible to develop a computer tutor. These developments have provided approximate solutions to the above major barriers. The proposed tutor will attempt to "comprehend" text that the learner types into the keyboard (i.e., it is beyond the scope of our proposal to handle the difficulties of speech recognition). The computer tutor will generate discourse contributions in the form of printed text, synthesized speech, graphic displays, animation, and possibly simulated facial movements and expressions (Cohen & Massaro, 1994; Massaro & Cohen, 1995; Pelachaud, Badler, & Steedman, 1996). That is, the tutor will speak in different media. However, the primary technological contribution of the proposed research lies in formulating helpful discourse contributions by the tutor, as opposed to generating a fancy display of interface features. Simply put, our goal is to determine "what the tutor should say" (i.e., the conceptual content), not "how the tutor should say it" (i.e., in digitized speech, synthesized speech, print, versus a talking head). The value of alternative output modalities is an exciting, but secondary bonus. Different versions of the prototype tutor will be created in an effort to simulate (a) skilled and unskilled human tutors who vary in domain expertise and tutoring experience and (b) ideal tutoring strategies that have been identified in the field of education and by developers of intelligent tutoring systems.

The proposed research stretches beyond the obvious goal of building a practical tool that might facilitate learning. The proposed research will advance our scientific understanding of discourse processing, language understanding, learning, education, and cognition in general. The research will explore further how feasible it is to automate these phenomena computationally. Indeed, we would argue that this project achieves a perfect balance between basic science and engineering. More specifically, we hope to advance three major theoretical developments in cognitive science:

  1. Theories of Complex Learning. The proposed research builds on two major theoretical frameworks. The first consists of the "constructivist" theories of learning that emphasize the process of learners actively building "self-explanations", "self-elaborations", and useful mental models of the domain knowledge (Bransford, Goldman, & Vye, 1991; Brown, 1988; Chi et al., 1989, 1994; Graesser, Person, & Magliano, 1995; King, 1994; Piaget, 1952; Papert, 1980; Trabasso & Magliano, 1996; Webb, Troper, & Fall, 1995; Wittrock, 1990). We view the computer tutor as a cognitive aid (e.g., scaffold, cognitive prosthesis) that assists the learner in actively building subjective mental models, explanations, and elaborations, as opposed to a mere information delivery system that ends up having a minimal impact on the learner (as it frequently turns out). The second theoretical framework consists of the social theories of learning that emphasize the process of collaborating with other agents (e.g., teachers, tutors, peers) as a means of scaffolding to higher levels of mastery (Collins, Brown, & Newman, 1989; Graesser et al., 1995; Palincsar & Brown, 1984; Rogoff, 1990; Roschelle, 1992; Vygotsky, 1978). In our case, the other social agent consists of the computer tutor. Our research will dissect, analyze, and integrate the process of collaboration in tutorial dialog in much greater detail and explicitness than most of the previous research on collaboration. We hope to specify the conditions that trigger (or should trigger) particular pedagogical strategies, dialog moves, didactic declarative knowledge, example problems, questions, figures, pragmatic norms, and so on.
  2. Theories of discourse processing. The development of the proposed tutoring system should help us understand how two agents (learner and tutor) manage to communicate when their "common ground" (Clark & Schaefer, 1989; Clark, 1996; Schober, 1995) is minimal or indeterminate, as opposed to the typical research that focuses on discourse contexts where the speech participants have a large amount of shared knowledge. Our analysis of tutorial dialog should advance our understanding of how a mixed-initiative agenda is coordinated among multiple agents in conversation and in human-computer interaction. For example, there are tradeoffs between pedagogical goals and politeness norms when tutors formulate discourse moves (Fox, 1993; Person, Kreuz, Zwaan, & Graesser, 1995); a good tutor does not abrasively critique the errors of a learner, but also does not politely ignore the errors. There is a challenge in specifying how these two sets of constraints are coordinated. The proposed research will directly address psychological theories of language and discourse that specify how world knowledge and pragmatics interact with lexical, syntactic, and semantic processing (Britton & Graesser, 1996; Gernsbacher, 1990, 1994; Graesser, Singer, & Trabasso, 1994; Kintsch, in press; MacDonald, 1993; Perfetti, in press; Singer, 1990).
  3. Computational models. The proposed research will adopt some computational theories and architectures that have been developed in cognitive science, such as production systems, neural networks, latent semantic analysis, and fuzzy systems, as will be discussed later in this proposal. One of the central missions of cognitive science is to identify the computational models that naturally fit particular cognitive representations and processes.

Recent Advances that Reduce Previous Major Barriers in Developing Computer Tutors

As mentioned above, three major barriers (natural language, world knowledge, and tutorial dialog) have prevented ITS researchers from implementing tutorial dialog facilities in natural language. These barriers exist because of technical limitations, inherent difficulties in modeling the phenomenon, or lack of research. However, recent advances have provided approximate solutions that reduce these barriers, so an ITS with a natural language and dialog facility is much more feasible. It would be unrealistic to expect the proposed project to completely eliminate the barriers, but we are convinced that approximate solutions will provide substantial progress. We plan on incorporating these advances in the proposed tutor.

Natural Language. The Message Understanding initiative, funded by DARPA, has evaluated the performance of natural language extraction systems that have been developed in artificial intelligence and computational linguistics (DARPA, 1995; Jacobs, 1992; Lehnert, in press; Lehnert, Cardie, Fisher, McCarthy, Riloff, & Soderland, 1994). There has been noticeable progress in automating many components of language analysis that lie within the span of a sentence or short discourse segment, such as: identifying the correct sense of words with multiple senses, identifying the correct syntactic class of words, parsing sentence syntax (for sentences that are short or moderate in length), extracting important elements of thematic structures, constructing semantic representations (e.g., propositions, conceptual graphs), and linking anaphoric expressions to previous discourse constituents. Traditional symbolic parsers have made substantial progress in handling the huge amount of lexical, syntactic, and semantic knowledge required to understand real-world texts in a restricted semantic domain, such as articles on terrorism or financial news in The Wall Street Journal (DARPA, 1995; Grishman, Macleod, & Sterling, 1992; Lytinen, Bhattacharyya, Burridge, Hastings, et al., 1992). Nontraditional parsers are more streamlined or they adopt alternative computational architectures, such as finite state automata (Hobbs, Appelt, Tyson, Bear, & Israel, 1992), heuristic fuzzy parsing algorithms (Huyck & Lytinen, 1993), probabilistic parsers with parallel, multi-level, constraint satisfaction mechanisms (Charniak, 1993; Jurafsky, 1996) and neural network architectures (Miikkulainen, 1996). The Message Understanding initiative has objectively and precisely measured the performance of the best natural language processing systems when these systems attempt to process a corpus of naturalistic texts at different levels of analysis (DARPA, 1995). The performance of these systems has been moderately impressive (57% recall and 64% precision, when averaging over benchmark language components at different levels, DARPA, 1995).
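For readers less familiar with the two measures, the sketch below shows how recall and precision are conventionally computed from a set of extracted items and a set of correct (gold-standard) items. The function name and the toy data are illustrative assumptions and are not drawn from the DARPA evaluations.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of items.

    precision = correct predictions / all predictions
    recall    = correct predictions / all gold-standard items
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy example: the system extracts four items, two of which are correct,
# and misses one of the three gold-standard items.
p, r = precision_recall({"a", "b", "c", "d"}, {"a", "b", "e"})
print(p, r)
```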

One of the researchers in the proposed project, Dr. Wiemer-Hastings, has worked on projects funded by the Message Understanding initiative (Hastings, 1995; Hastings & Lytinen, 1994), so we will have the opportunity to incorporate some of these natural language modules into our computer tutor. However, this available technology will not be enough and is not at the core of the proposed research. We are convinced that further progress in natural language comprehension requires a more thorough understanding of how language mechanisms are integrated with world knowledge and discourse. A distinctive characteristic of the proposed research is that it addresses the deeper, more global levels of language processing, and how these levels interact with the lexicon, syntax, and shallow semantics. Dr. Graesser, the PI in the proposed research and editor of the journal Discourse Processes, has extensively investigated the relationships among language, world knowledge, and discourse (Graesser & Clark, 1985; Graesser, Gordon, & Brainerd, 1992; Graesser, Millis, & Zwaan, 1997; Graesser, Singer, & Trabasso, 1994; Graesser, Swamer, & Hu, in press). Messages sometimes contain frozen idiomatic expressions, metaphors, irony, hyperbole, and other forms of figurative language (Gibbs, 1994); Dr. Roger Kreuz, a co-PI on this project, has substantial expertise in figurative language (Kreuz & Glucksberg, 1989; Kreuz & Roberts, 1995) as well as nonfigurative natural language.

World Knowledge. The fact that world knowledge is inextricably bound to natural language comprehension is widely acknowledged in psycholinguistics, cognitive science and discourse processing (Gernsbacher, 1990, 1994; Graesser, Singer, & Trabasso, 1994; Kintsch, in press; MacDonald, 1993; Schank & Riesbeck, 1981; Trabasso & Magliano, 1996), but researchers in computational linguistics have not had a satisfactory approach to handling the deep abyss of world knowledge. The traditional approach to representing world knowledge in artificial intelligence has been structured representations, such as semantic networks, conceptual graphs, and rules (Graesser & Clark, 1985; Lehmann, 1992; Lenat, 1998). World knowledge is frequently open-ended, imprecise, vague, and incomplete, so simple algorithms and computational procedures cannot handle the role of world knowledge in understanding language and in tutoring. In fact, these pervasive characteristics of world knowledge motivated Collins' work on the SCHOLAR tutor (Collins, 1985), an ITS that generated plausible inferences and tutor contributions on the basis of fragments of imprecise and incomplete world knowledge.

"Latent Semantic Analysis" (LSA) will provide the critical backbone for representing world knowledge in the proposed tutor. LSA is a very recent invention and is not as well known as the traditional systems for representing knowledge, so some elaboration is provided in this proposal. LSA has recently been proposed as a statistical representation of a large body of world knowledge (Foltz, 1996; Foltz, Britt, & Perfetti, 1996; Kintsch, in press; Landauer & Dumais, 1997). LSA capitalizes on the fact that particular words appear in particular texts; the occurrence of words in texts reflects the constraints that exist in world knowledge. The input to LSA is a co-occurrence matrix that specifies the number of times that word Wi occurs in text Tj. These frequencies are adjusted with a logarithm transformation that also corrects for the base rates of words appearing across texts; a word is a distinctive index for a text to the extent that its occurrence in the text is above the base rate for that word across texts. Singular value decomposition is a statistical method (closely related to principal components analysis) that reduces the large W x T co-occurrence matrix to K dimensions (typically, 100 to 300 dimensions). Each word, sentence, or text ends up being a weighted vector on the K dimensions. The "match" (i.e., similarity in meaning, conceptual relatedness) between two words, sentences, or texts is computed as a cosine (a normalized dot product) between the two vectors, with values typically ranging from 0 to 1. The match between two language strings can be high even though there are few if any words in common between the two strings. LSA goes well beyond simple string matches because the meaning of a language string is partly determined by the company (other words) that each word keeps.

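The LSA computation described above can be summarized in standard notation. This is a sketch under stated assumptions: the symbols f, g, a, A, U, Sigma, and V are introduced here for illustration, and the specific form of the weighting is one common choice rather than necessarily the one used in the cited studies.

```latex
% f_{ij}: raw count of word i in text j (W words, T texts).
% g_i: a word-specific weight that corrects for the base rate of
%      word i across texts (its exact form is an assumption here).
a_{ij} = g_i \,\log\!\left(f_{ij} + 1\right)

% Truncated singular value decomposition of the weighted W x T matrix A,
% keeping only the K largest singular values (K typically 100 to 300).
A \approx A_K = U_K \,\Sigma_K\, V_K^{\top}, \qquad K \ll \min(W, T)

% Match between two K-dimensional vectors u and v.
\cos(\mathbf{u}, \mathbf{v}) =
  \frac{\mathbf{u} \cdot \mathbf{v}}
       {\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}
```

Under this notation, each row of U_K Sigma_K gives the K-dimensional vector for a word; a vector for a sentence or text is commonly formed by combining the vectors of its words, which is also an assumption of this sketch.
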
The empirical success of LSA has been promising and sometimes remarkable. Landauer and Dumais (1997) created an LSA space with 300 dimensions from 4.6 million words that appeared in 30,473 articles in Grolier's Academic American Encyclopedia. They submitted to the LSA representation the synonym portion of the TOEFL test, a test developed by the Educational Testing Service to assess how well non-native English speakers have mastered the words in the English language. The test has a four-alternative, forced choice format, so there is a 25% chance of answering each of the questions correctly. The LSA model selected the alternative that had the highest match with a comparison word. The LSA model answered 64.4% of the questions correctly, which is essentially equivalent to the 64.5% performance for college students from non-English speaking countries. In another study, Foltz et al. (1996) created an LSA representation with 100 dimensions from 31 texts and encyclopedia articles on the Panama Canal. College students read a sample of the texts and then wrote an essay that summarized the readings on the Panama Canal. For each sentence in the essay, a maximum match score was determined by computing the cosine match with all sentences in the 32 texts and taking the highest match. For example, one of the sentences in a student's summary was: "Only 42 marines were on the U.S.S. Nashville." The two highest matching sentences in the corpus were:

  1. Nov. 2, 5:30 PM: U.S.S. Nashville arrives in Colon Harbor with 42 marines. (.64 cosine match)
  2. To Hubbard, commander of the U.S.S. Nashville, from Secretary of the Navy (Nov. 2, 1903): Maintain free and uninterrupted transit. (.56 cosine match)

These are not easy sentences to parse syntactically, and keyword matches hardly go the distance in determining similarity. The overall quality of a student's summary was the mean maximum match score for all sentences in the summary. There was a significant correlation (r = .35) between the LSA's scores of summary quality and the quality ratings of humans; any given pair of human raters had a correlation in quality ratings of r = .51. Therefore, the quality ratings of the LSA were clearly in the arena of trained human raters. Given these successes, Kintsch (in press) is currently using LSA routinely to capture the world knowledge component of his "construction-integration" model (Kintsch, 1988) -- a neural network model of text comprehension.
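The mean maximum match score can be sketched in a few lines. The example assumes that every sentence has already been mapped to a vector in the LSA space; the function names and the two-dimensional toy vectors are hypothetical and serve only to make the arithmetic visible.

```python
import math


def cosine(u, v):
    """Cosine between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def max_match(sentence_vec, corpus_vecs):
    """Highest cosine between one summary sentence and any corpus sentence."""
    return max(cosine(sentence_vec, c) for c in corpus_vecs)


def summary_quality(summary_vecs, corpus_vecs):
    """Mean of the maximum match scores over all sentences in a summary."""
    scores = [max_match(s, corpus_vecs) for s in summary_vecs]
    return sum(scores) / len(scores)


# Toy vectors standing in for sentences already projected into the LSA space.
corpus = [[1.0, 0.0], [0.0, 1.0]]
summary = [[1.0, 0.0], [1.0, 1.0]]
print(summary_quality(summary, corpus))
```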

LSA will play a central role in the proposed computer tutor. The truth of a student's contribution is evaluated by computing the cosine match between a student's contribution and that sentence in the entire corpus that has the highest match. Alternatively, the contribution is matched to sentences from multiple sources of text in the corpus. The corpus will include the curriculum script with the lesson content and texts on the topic. The relevance of a student's contribution is evaluated by computing its match with expected answers to a question, or expected solutions to a problem, as will be elaborated later. Prior to LSA, there was no empirically defensible computation of the truth and relevance of expressions with respect to a large knowledge base that is open-ended, fragmentary, imprecise, and vague. We believe that LSA will allow us to bootstrap the ITS enterprise to accommodate natural language and dialog for the first time. However, we need to assess whether this is feasible empirically. The proposed research provides such an assessment.
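A minimal sketch of the two evaluations follows, assuming that the student contribution, the corpus sentences, and the expected answers have all been mapped to LSA vectors beforehand. The function names are hypothetical, and taking the best match over the expected answers is an assumption made here; the proposal states only that relevance is computed from the match with expected answers.

```python
import math


def cosine(u, v):
    """Cosine between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def truth_score(contribution, corpus_vecs):
    """Match with the best-matching sentence anywhere in the corpus."""
    return max(cosine(contribution, c) for c in corpus_vecs)


def relevance_score(contribution, expected_vecs):
    """Match with the expected answers to the current question.

    Using the maximum over expected answers is an assumption.
    """
    return max(cosine(contribution, e) for e in expected_vecs)


# Toy three-dimensional vectors; real LSA vectors have 100 to 300 dimensions.
corpus = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
expected_answers = [[0.0, 0.0, 1.0]]
student = [1.0, 0.0, 0.0]

# The contribution matches a corpus sentence but not the expected answer.
print(truth_score(student, corpus), relevance_score(student, expected_answers))
```

In this toy case the contribution scores high on truth and low on relevance, which illustrates why the two quantities are computed against different sets of sentences.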

It is important to emphasize that LSA is not an appropriate representation for topics in precise, well-defined fields, such as mathematics and computer programming. LSA will also not provide an explicit deep mental model of the domain, which presumably is needed for an accomplished expert to solve difficult problems and which is implemented in sophisticated ITS applications in narrow domains. However, there still might be an important role for LSA even in such topics that require precision and deep expertise. LSA would provide a "safety net" for an ITS when the conventional mechanisms fail and the system needs to formulate a reasonable contribution. However, the results of our project may surprise everyone and reveal that LSA provides tutor contributions that are deep, relevant, and appropriate. This is an empirical question.

Expertise in mathematics and computational modeling is needed in our analysis of LSA, as well as other components of the computer tutor. This expertise is provided by the PI (Dr. Graesser) and four researchers on this proposal: Franklin, Garzon, Hu, and Wiemer-Hastings. Dr. Franklin has developed computational models in artificial intelligence and complex dynamical systems, and has recently published an MIT Press book entitled Artificial Minds (Franklin, 1995). Dr. Franklin has collaborated with Dr. Graesser on symbolic computational models of human question answering (Graesser & Franklin, 1990), on neural network models of speech act sequencing in conversation (Swamer, Graesser, Franklin, Sell, Cohen, & Baggett, 1993), and on intelligent software agents (Franklin & Graesser, 1996). Dr. Garzon is an expert in neural networks, fuzzy logic, complex systems, and a variety of implementations of automatic devices (Botelho & Garzon, 1994; Garzon & Eberbach, 1996). His background will assist us in exploring specific algorithms and in assessing whether particular algorithms are computationally feasible in real time. Dr. Hu is a mathematical psychologist who has expertise in fuzzy systems (Hu, 1985) and "processing tree" models of cognition (Hu & Batchelder, 1994). Dr. Hu has previously collaborated with Graesser in developing quantitative models of naturalistic discourse processing (Graesser, Swamer, & Hu, in press). Dr. Wiemer-Hastings has expertise in computational linguistics (as discussed earlier) and AI modeling with production system architectures (he developed a tutorial for people learning to use SOAR). Dr. Graesser (the PI) has developed symbolic computational models of human question answering (Graesser & Franklin, 1990; Graesser et al., 1992), neural network models of speech act sequencing (Swamer et al., 1993; Graesser, Swamer, & Hu, in press), educational software in a hypertext/hypermedia environment (Graesser, Langston, & Baggett, 1993; Graesser, Langston, & Lang, 1992), software that simulates hand movements in vehicles with different driver-panel interfaces (Graesser & Marks, 1993), and software that extracts procedural knowledge from domain experts (Williams, Hultman, & Graesser, in press). Therefore, this research team has a track record of producing computer software in addition to having a theoretical understanding of computation and cognition.

Tutorial Dialog. Researchers in education and ITS development have identified a number of ideal tutoring strategies, such as: the Socratic method (Collins, 1985), modeling-scaffolding-fading (Collins, Brown, & Newman, 1989; Rogoff, 1990), reciprocal teaching (Palincsar & Brown, 1984), anchored learning (Bransford et al., 1991), error diagnosis and correction (Anderson et al., 1995; Van Lehn, 1990; Lesgold et al., 1992), frontier learning (Sleeman & Brown, 1982), building on prerequisites (Gagne, 1977), and sophisticated motivational techniques (Lepper et al., 1990). Researchers who have examined these tutoring strategies have frequently pointed out that tutors need extensive training on the use of these sophisticated ideal tutoring strategies. Not surprisingly, therefore, these strategies do not spontaneously emerge in the repertoire of strategies of unskilled tutors -- the tutors that prevail in actual school systems (Graesser et al., 1995). Previous ITS developers have abandoned attempts to incorporate most of these ideal tutoring strategies in the tutoring systems because of the barriers of natural language and world knowledge. We plan on implementing some of these ideal tutoring strategies in the proposed computer tutor, to the extent that they are technically feasible.

Aside from these ideal tutoring strategies, recent projects have dissected the strategies used by skilled and unskilled human tutors. In some of these studies, the tutors have been highly skilled and knowledgeable about the topic (Fox, 1993; Hume, Michael, Rovick, & Evens, 1996; McArthur, Stasz, & Zmuidzinas, 1990; Merrill, Reiser, Ranney, & Trafton, 1992; Putnam, 1987). Our previous work on tutorial dialog (Graesser, Bowers, Hacker, & Person, in press; Graesser & Person, 1994; Graesser et al., 1995; Person, Graesser, Magliano, & Kreuz, 1994; Person, Kreuz, Zwaan, & Graesser, 1995), funded by the Office of Naval Research, has examined untrained tutors with moderate domain knowledge because these tutors are most representative of tutors in actual school systems. We videotaped, transcribed, and analyzed the tutorial dialog of approximately 100 naturalistic tutoring sessions in 7th grade and in college. Even though most tutors in school systems are untrained, they are surprisingly effective compared to teachers in normal classroom environments. One-on-one human tutoring has shown effect sizes of .4 to 2.3 standard deviation units compared to classroom teaching and other suitable controls (Bloom, 1984; Cohen, Kulik, & Kulik, 1987). Our detailed conversational analyses of normal tutors unveiled the characteristics of the dialog that apparently are responsible for the robust learning gains (Graesser et al., 1995). It should be noted that most of the research that has performed a detailed, turn-by-turn analysis of tutorial dialog has been conducted in the 1990's. Prior to this research, there was not much of a foundation for simulating smooth tutorial dialog on a computer system. Consequently, the time is ripe for integrating research on tutorial dialog with ITS development.

One conceivable advantage of tutoring is an enhanced "meeting of the minds" between student and tutor. That is, the tutor infers the idiosyncratic knowledge, bugs, and misconceptions of the student -- and the student's knowledge drifts to the tutor's knowledge base. Designers of some ITS's have implemented "student modeling," which is an attempt to infer the knowledge states of a student on the basis of the student's questions, answers to questions, and solutions to problems (Anderson et al., 1995; Ohlsson, 1986; Weber, 1996). Discourse theories have frequently emphasized the importance of establishing shared meanings for successful communication (Clark & Schaefer, 1989; Roschelle, 1992; Schober, 1995). There is a radically different perspective on the matter of common ground and student modeling, however. Researchers have cast doubt on the possibility, the need, and the pedagogical utility of detailed student modeling (Newman, 1989). It might not be computationally feasible to induce student knowledge. The educational payoffs of methodically unraveling the misconceptions and bugs of a student might be very small compared to the tutor's modeling good skills and knowledge. Learning can perhaps proceed quite effectively when the tutor presents an environment, a set of tasks, and feedback to the learner, without carefully tracking the knowledge of the student. Our detailed analysis of actual tutoring sessions revealed that there is a very slow convergence towards shared meanings during tutoring (Graesser et al., 1995). The gap in knowledge between the tutor and student is so wide that the two parties in the conversation frequently misunderstand each other and give each other incorrect feedback. For example, tutors normally give positive responses (Yeah, Uh-huh) to student contributions that are vague, incoherent, or error ridden; students who are lost usually say YES or nod their heads when asked Do you understand? (Person et al., 1994). The fact that the tutor manages conversation when there is a breakdown in common ground and feedback mechanisms makes tutoring a fascinating phenomenon to study from the standpoint of theories of communication and discourse processing.

The large gulf that frequently exists between the knowledge of tutors and students gives us reason to believe that it is feasible to develop a computer tutor that mimics human tutors. Human tutors do not normally achieve a deep and complete understanding of the student, particularly when the student contributions are fragmentary, ungrammatical, incoherent, underspecified, and vague. Misunderstandings frequently occur. The tutor scrambles to piece together a minimal understanding of the student's knowledge and to manage the discourse. Given that human tutors face such challenges in seeking shared meanings, a computer tutor might manage surprisingly well with a shallow understanding of the student and a strategic selection of dialog moves. For example, the ANIMATE tutor developed by Nathan, Kintsch, and Young (1992) produced impressive learning gains on algebra word problems, but did not construct a detailed map of what the student knows. Moreover, a key feature of effective tutoring lies in assisting students in actively constructing subjective explanations and elaborations of the material (Chi et al., 1994; Fuchs et al., 1996; Graesser et al., 1995; McDaniel & Donnelly, 1996; Pressley et al., 1992; Schank, Kass, & Riesbeck, 1994; Webb et al., 1995). The tutor's dialog moves in a collaborative exchange might provide effective scaffolding for a student to build such self-explanations -- without the computer fully knowing what the student knows.

The proposed research will simulate a tutor's dialog moves in tutorial dialog. There will be different classes of tutors. One class will be unskilled tutors, the sort of tutors that exist in real school systems. Our previous research has uncovered the dialog moves and pedagogical strategies that are frequently enacted by these tutors, as will be discussed in the next section. Another class will be untrained tutors who acquire more experience in tutoring; the computer tutor will augment its knowledge base by storing answers that students give to questions and solutions to problems (segregating good and bad contributions). More sophisticated classes of tutors will implement various ideal tutoring strategies (such as a Socratic tutor, modeling-scaffolding-fading, and strategic hinting). We will evaluate the quality of the tutor's dialog moves for each of these classes of tutors. Quality is determined by the truth of the contribution, the relevance of the contribution, and the aptness of the contribution in the context of the dialog. These dimensions of quality will be evaluated by experts on this grant. Dr. Person has published extensively on tutorial dialog (Person et al., 1995, 1996, in press); Dr. Hacker is an expert in education and metacognition (Hacker & Graesser, in press); Dr. Kreuz is an expert in language and discourse; and Dr. Gholson is an expert in cognitive development, reasoning, and problem solving (Gholson et al., 1996).

Multiple Media. Our plan is to incorporate different types of media when topics are presented and when the tutor produces dialog moves. The alternative forms of media include printed text, synthesized speech, simulated facial movements that are synchronized with the speech, graphic displays, and animation. It is important to emphasize that the focus of this proposal is not to develop multimedia educational software. But we do plan to use media that are easy to incorporate. More important, we plan on implementing media that are directly relevant to the theories, computational models, and cognitive mechanisms that are at the heart of the proposal. Three types of media will be described here to illustrate a theory-based justification.

  1. Talking heads. Researchers have recently developed computer-generated animated talking heads that have facial features synchronized with speech (Cohen & Massaro, 1994; Pelachaud et al., 1996). Ideally, the computer would control the eyes, eyebrows, mouth, lips, teeth, tongue, cheekbones, and other parts of the face in a fashion that is meshed appropriately with the language and emotions of the speaker. Face39 (developed by Massaro and his colleagues) is a computer-generated animated face that provides most of these capabilities, at 60 frames per second when synchronized with the output of an auditory text-to-speech synthesizer. A talking head would be an important enhancement because it concretely grounds the conversation between the tutor and learner. A talking head also provides a separate channel of cues for providing mixed feedback to the learner. When a learner's contribution is incorrect or vague, for example, the speech could be positive and polite whereas the face could have a puzzled expression; this conflicting message (that satisfies both pedagogical and politeness constraints) is presumably preferable to a threatening speech message that says "That's not right" or "I don't understand." The nonverbal facial cues are known to be an important form of back channel feedback during tutoring (Fox, 1993; Graesser et al., 1995; Person et al., 1994), as well as other contexts of conversation (Clark, 1996). Dr. Marks and Dr. Kreuz would play a central role in integrating the simulated talking heads with the speech. Dr. Marks has conducted psychological research on perception, memory, the integration of verbal and visual codes, and multimedia (Marks, McFalls, & Hopkinson, 1992; Marks & Dulaney, in press). Dr. Kreuz has expertise in language and discourse (as discussed earlier).
  2. Synthesized speech. Pitch, pause, duration, amplitude, and intonation contours are among the variety of intonation cues that signal back channel feedback, affect, and emphasis (Brennan, 1995; Brennan & Williams, 1995; Hirschberg & Ward, 1992, 1995; Kreuz & Roberts, 1995; Selting, 1994). Some of these intonation parameters have been implemented successfully in synthesized and digitized speech (Cowley & Jones, 1992). These intonation parameters are important to tutoring because they qualify the back channel feedback ("Uh huh", "Okay") and substantive contributions of the tutor (Fox, 1993; Graesser et al., 1995). For example, the tutor frequently pauses after a vague student contribution or pounces in quickly after an obviously error-ridden student contribution. The goal of generating context-sensitive intonation patterns is computationally feasible for the tutor's immediate short feedback to the learner because there are a limited number of responses and parameters. The generation of intonation patterns for lengthier content is technically more difficult, but we suspect that context-sensitive intonation is considerably less important to the learner for the lengthy tutor contributions. Dr. Kreuz and Dr. Graesser will play a major role in developing this module because of their experience with language, discourse, and tutorial dialog.
  3. Fuzzy descriptions of figures and diagrams. The corpus of texts in the previous LSA applications included print, but not figures, diagrams, and pictures. Researchers have not previously incorporated pictorial information with the LSA, but it is possible to do this with fuzzy verbal descriptions. Massaro and Cohen (1993) and others have argued that it is possible to generate fuzzy verbal descriptions of pictures. There can be adjectives and adverbs that correspond to quantitative features of objects and spatial relationships, e.g., "the large wheel is left of the very tall pole" or "the correlation between X and Y is small." We will generate a fuzzy description for each graph, figure, and diagram that the tutor has in its curriculum (but of course not the original corpus -- that would be too labor intensive). The fuzzy description will specify the set of components in the picture, properties of components, spatial relationship between components, motion depicted by arrows, and so on (Baggett & Graesser, 1995; Graesser & Clark, 1985; Massaro & Cohen, 1993; Tversky & Hemenway, 1983). The content of this fuzzy description is verbal and can be treated as text. Therefore, the content of a figure can be compared to the LSA space and evaluated on truth and sophistication. Our hope is that the computer tutor will select figures, graphs, and diagrams that are tailored to the student's knowledge, in the learner's zone of proximal development (Kintsch, in press; Rogoff, 1990; Vygotsky, 1978). Dr. Xiangen Hu has investigated the mathematical and psychological foundations of fuzzy systems (Crowther, Batchelder, & Hu, 1995; Hu, 1985) so he will play a central role in developing this application. Dr. Garzon and Dr. Marks will also participate in developing this component.

The computer displays should not become overly congested to the point of overloading the learner's working memory, distracting the learner, or making it difficult for the learner to focus on the critical information (Marcus, Cooper, & Sweller, 1996; Mayer, Bove, Bryman, Mars, & Tapangco, 1996; Sweller, 1988). There are many potential advantages of synchronizing language with pictures: multiple codes are established in memory (Mayer & Sims, 1994), working memory capacity is expanded (Mousavi, Low, & Sweller, 1995), and there is redundancy in delivering critical information. However, these advantages are not realized if a busy display diverts the attention of the user from critical information. For example, if a talking head is speaking during the presentation of a diagram, information will be missed in the diagram when the learner is gazing at the talking head; voice overlay would be preferable so the learner can focus directly on the diagram. The displays need to be designed in a fashion that coordinates the features of the human-computer interface with constraints of cognition (e.g., working memory load, attention). Dr. Gholson, Dr. Marks, and Dr. Graesser have the background for evaluating displays that cater to cognitive constraints.

Overview of Proposed Model of Tutorial Dialog

The proposed computer tutor will incorporate strategies of skilled and unskilled human tutors who vary in domain expertise and tutoring experience (see above references). It will have production rules that select discourse topics and discourse moves of normal unskilled tutors. As discussed earlier, it is well documented that unskilled tutors are quite effective even though they do not use ideal tutoring strategies. More sophisticated versions of the tutor will go to the next level and incorporate those ideal tutoring strategies that are technically feasible to implement (such as the zone of proximal development, modeling-scaffolding-fading, and ordering on prerequisites, as discussed earlier). The Figure at the end of this "Project Description" section presents an overview of the major components of the model. These components are succinctly described below.

Curriculum scripts. At the macro-level, it is well documented that human tutors are guided by a curriculum script that consists of a set of "topics" (Graesser et al., 1995; McArthur et al., 1990; Putnam, 1987). These topics may be structured hierarchically (with subtopics embedded in topics) or ordered along logical or pedagogical principles (e.g., ordering on prerequisites, ordering by increasing difficulty). An accomplished tutor presents appropriate topics that are approximately tailored to the student's knowledge and difficulties (i.e., at the appropriate zone of proximal development) even though the tutor may not achieve detailed, fine-tuned student modeling. In line with this research on human tutors, our computer tutor will contain a set of topics in a curriculum script. Each topic consists of either: (a) a didactic declarative description, (b) a tutor question (with associated answers and hints), (c) an example case or problem (with associated solutions, hints, and comments), or (d) a figure or a diagram (with associated comments). Educational researchers have cautioned against teachers and tutors relying on didactic declarative descriptions (type "a", which prevails in most lectures) because such descriptions impart inert knowledge rather than active knowledge that is put into practice (Scardamalia & Bereiter, 1991). The construction of active knowledge is promoted by the other types of topics. Consequently, learning gains are enhanced by integrating didactic declarative knowledge with example problems and cases (Anderson et al., 1995; Forbus, Gentner, & Law, 1995; Gholson et al., 1996; Kolodner, 1993; Schank et al., 1993; Sweller, 1988), figures and diagrams (Levin et al., 1987; Mayer & Sims, 1994; Mayer et al., 1996), and questions that promote deep reasoning, such as why, how, what-if (Edelson, 1996; King, 1994; McDaniel & Donnelly, 1996; Pressley, 1990; Webb et al., 1995; Zee & Minstrell, 1997).
We know that it is the tutor, not the student, who introduces these subtopics, questions, and examples, even when the tutor highly encourages the student to take a more active role in learning (Graesser et al., 1995). Unskilled tutors tend to bring up the same questions and examples, regardless of the student's performance. The selection and ordering of topics is increasingly improved and tuned to the student as the tutor gains more experience and expertise.

Each topic in the curriculum script can be represented either as a structured set of propositions or as a format-free text (including the figures and diagrams, as discussed earlier). Symbolic computational modeling requires the former whereas LSA can accept the latter. Each of the topics in the curriculum script can be scaled on difficulty. This difficulty value can dynamically change as a function of the computer tutor's experience by having the difficulty metric increase or decrease as a function of the history of the students' responses to the topic; the metric decreases if there is a high volume of good student responses and a low volume of bad student responses to the topic, and vice versa. Associated with each topic in the curriculum script is a set of good student responses (i.e., answers to questions, solutions to problems, comments on figures) and bad student responses that have errors, bugs and misconceptions. Once again, the answer content can be represented in either a structured form or a text form, depending on what computational procedures are applied. Human tutors normally anticipate what student responses to expect and try to steer the student in that direction. They also anticipate traps and common wrong student responses. Our tutor will start out with a small number of good and bad student responses to each topic, which are provided by the designers of the original curriculum script. However, the corpus of good and bad responses will grow with tutoring experience. That is, whenever a good response or a bad student response is identified by the system in response to a particular topic T (by virtue of a high LSA match between a student contribution and a good/bad response stored in the curriculum script), it adds the new response to the good or bad response list for that topic (in free text format). Therefore, a corpus of good and bad responses will grow with tutoring experience, as in the case of human tutors. 
Also associated with each topic T is a set of tutor hints, tutor contributions, and question-answer items that are scaled on difficulty.
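The dynamic difficulty metric described above could be sketched as follows. This is a minimal illustration, not the proposed implementation: the `Topic` class, the 0-to-1 difficulty scale, the `step` size, and the particular update formula are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """Hypothetical curriculum-script entry (field names are assumptions)."""
    name: str
    difficulty: float  # assumed scale: 0.0 (easy) to 1.0 (hard)
    good_responses: list = field(default_factory=list)
    bad_responses: list = field(default_factory=list)


def updated_difficulty(topic, step=0.05):
    """Lower the difficulty when good responses dominate, raise it when bad ones do."""
    good = len(topic.good_responses)
    bad = len(topic.bad_responses)
    total = good + bad
    if total == 0:
        return topic.difficulty  # no response history yet, so nothing to adjust
    # balance runs from -1.0 (all responses good) to +1.0 (all responses bad)
    balance = (bad - good) / total
    # keep the result inside the assumed 0-to-1 scale
    return min(1.0, max(0.0, topic.difficulty + step * balance))
```

A topic with three stored good responses and one bad response would have its difficulty nudged downward by half a step under this formula.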

Latent semantic analysis (LSA) will be used to represent world knowledge, as discussed earlier. The corpus of texts will be reduced to a 100-300 dimensional space through singular value decomposition (Landauer & Dumais, 1997). It takes less than one hour for the computer to compute this space from a corpus of texts of intermediate size (such as a book). This LSA space is subsequently used as a static database to measure similarity between text segments during the process of tutoring; such computations of similarity are virtually instantaneous (as opposed to processing intensive). This space will be used to evaluate the truth, relevance, and quality of student contributions. The space will be used when computing the similarity between any given pair of linguistic descriptions (e.g., word, phrase, sentence, set of sentences, text).
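As a sketch of the computation, using notation assumed here for illustration: let X be the word-by-passage matrix built from the corpus and k the number of retained dimensions. The reduced space and the similarity between two text segments a and b can then be written as:

```latex
\begin{align*}
X &\approx X_k = U_k \Sigma_k V_k^{\top}, \qquad 100 \le k \le 300 \\
\operatorname{sim}(a, b) &= \cos(\mathbf{v}_a, \mathbf{v}_b)
  = \frac{\mathbf{v}_a \cdot \mathbf{v}_b}{\lVert \mathbf{v}_a \rVert \, \lVert \mathbf{v}_b \rVert}
\end{align*}
```

Here v_a and v_b are the k-dimensional vectors for the two segments. How a multi-word segment is mapped to a single vector (for example, by summing the vectors of its words) is an assumption that the description above does not fix.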

LSA has already been applied in the context of medicine (Foltz et al., 1996; Kintsch, in press), so we plan on using introductory medical knowledge as one knowledge domain. The other knowledge domain will be computer literacy because curriculum scripts and a large corpus of texts are already available on the World Wide Web in this knowledge domain. It should be noted that a course in computer literacy teaches students general knowledge about computers and computer tools, but does not go into the depth and precision of computer programming. We will be able to test our tutor in a university class on computer literacy, which will be taught by Dr. Garzon and Dr. Wiemer-Hastings in addition to others not involved with this project.

Although the LSA will be at the core of the proposed computer tutor, it would be informative to explore other available intelligent tutoring systems in medicine and computer literacy, and to compare these systems to the LSA tutor. The most sophisticated tutor might end up being a hybrid between (a) a system that analyzes student contributions at a deep level for very specific knowledge domains and tasks and (b) an LSA system when "a" fails.

Mixed initiative dialog and the production rules that select tutoring topics. Graesser's previous research on human tutoring has dissected the multi-turn, collaborative dialog between the tutor and student when topics are elaborated, questions are answered, and problems are solved. It may take dozens of turns in a conversation for an answer to evolve to a single question. We also know that unskilled human tutors introduce 93% of the subtopics, 96% of the example problems, and 80% of the questions (Graesser et al., 1995). It is the tutor who sets the agenda at the macro-level and who generates the topics. At the micro-level, the tutor tries to get more content and contributions from the student by a variety of discourse moves: pumping, prompting, splicing, hinting, summarizing, revoicing, and giving short feedback (Graesser et al., 1995), as will be discussed shortly. Most of the tutor's dialog moves are short, serving as informative prods that hopefully lead the student to productive cognitive constructions. Students do sometimes take the initiative by asking a question, proposing a problem, or making a request. Our computer tutor will attempt to respond appropriately to these student-initiated dialog moves, but it should be recognized that our computer tutor is more equipped to handle the activities of tutoring that are instigated by the tutor.

Production rules (Anderson et al., 1989, 1995; Just & Carpenter, 1992; Kieras, 1989; Laird & Rosenbloom, 1996) or fuzzy controllers (Kosko, 1992) will determine the selection of topics from the curriculum script. These production rules (in crisp or fuzzy form) will tailor the topic to the global knowledge of the student, in addition to the logical/pedagogical ordering constraints in the curriculum script and to the history of previous topics covered in the tutoring session. The global knowledge (GK) of the student can be computed by matching the LSA space with the vector computed from the accumulating set of previous student contributions in the entire session. The cosine match (or dot product) produces a value from 0 (very low knowledge) to 1 (very high knowledge). These precise values could be converted to fuzzy linguistic variables (very low, low, medium, high, very high) for the sake of formulating appropriate fuzzy rules to infer the best match topic to the zone of proximal development; for example, difficult topics in the curriculum script would be assigned to students with high or very high global knowledge, whereas easy topics would be assigned to students with very low knowledge.
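A simplified sketch of this selection step is given below. It uses crisp bins rather than true fuzzy membership functions, and the cut points, the target difficulty per label, and the function names are illustrative assumptions rather than values from the proposal. Ordering constraints in the curriculum script are not modeled.

```python
# Assumed cut points for converting a GK score into a linguistic label.
LABEL_BOUNDS = [(0.2, "very low"), (0.4, "low"), (0.6, "medium"), (0.8, "high")]

# Assumed target topic difficulty (0 to 1) for each label.
TARGET_DIFFICULTY = {
    "very low": 0.1, "low": 0.3, "medium": 0.5, "high": 0.7, "very high": 0.9,
}


def fuzzy_label(gk):
    """Map a global-knowledge score in [0, 1] to a linguistic label (crisp bins)."""
    if not 0.0 <= gk <= 1.0:
        raise ValueError("gk must lie in [0, 1]")
    for upper, label in LABEL_BOUNDS:
        if gk < upper:
            return label
    return "very high"


def select_topic(gk, topics, covered):
    """Pick the uncovered topic whose difficulty is closest to the target.

    topics maps topic name to difficulty; covered is the set of topic
    names already presented in the session.
    """
    target = TARGET_DIFFICULTY[fuzzy_label(gk)]
    candidates = {name: d for name, d in topics.items() if name not in covered}
    if not candidates:
        return None  # every topic has already been covered
    return min(candidates, key=lambda name: abs(candidates[name] - target))
```

A fuzzy controller would instead let a score belong partially to adjacent labels and blend the resulting rule outputs; the crisp version is shown only because it is the shortest runnable form of the idea.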

Segmentation and classification of student turns. The tutoring system needs to segment and classify student input at each turn in order to effectively manage a mixed initiative dialog at the micro-level. The computer tutor will need to respond differently to the student's Questions and Directives than to the student's Contributions (assertions, answers) and short Responses. For example, it is normally appropriate to evaluate the truth of a speech act that is a Contribution, but not a Question. The students will type their input into the computer (rather than speaking) so it will be trivial to identify turns in a conversation and quite feasible to segment the content of each turn into speech acts. The input of the "speech act classifier" is the message that a student enters during a turn. The output is a segmentation of this message into a sequence of speech acts, each of which is assigned to one (or more) of the following categories: Question, Contribution (i.e., assertion, answer), Response (positive, negative, or neutral, such as yeah, no, huh), and Directive (i.e., command or indirect request). The following components in the Language Module will be developed to achieve this segmentation and classification.

  1. WordNet (Miller, Beckwith, Fellbaum, Gross, & Miller, 1990) will be used to assign words in the student's message to word classes, such as noun, adjective, main verb, determiner, and so on. A spellcheck facility may also be used for words that cannot be classified. WordNet will produce a sequence of word classes for each turn of the student. Multiple word classes may be assigned to a particular word.
  2. Syntactic parsers will input the sequence of word classes in the student's turn and attempt to assign syntactic and semantic constituents. We plan on using parsers that have been developed previously in other laboratories, including those in the Message Understanding initiative (see references presented earlier). The Penn treebank (Marcus, Santorini, & Marcinkiewicz, 1993) is available for identifying frequent syntactic patterns in language. However, it is widely acknowledged in the fields of computational linguistics and speech recognition that a large proportion of the speech acts in spoken or written conversation are ungrammatical and underspecified semantically (e.g., a prevalence of pronouns, vague referring expressions, ellipsis). The proposed model has a mechanism for handling student messages that are not well formed in addition to those that are well formed syntactically and semantically. We will use available parsers to the extent that they compute products that are useful to the computer tutor, such as segmenting student inputs into speech acts and identifying noun-phrase referring expressions. However, the proposed tutoring system will not depend on a successful parser because other modules will be able to proceed without a good parse.
  3. A dictionary of frozen literal and figurative expressions will include frequent expressions that have a special meaning that cannot be derived from the words (Gibbs, 1994) and that signal particular speech act categories. For example, "That's right" is a Response and "I was wondering about X" is a Question or Directive. Once again, this will be useful for segmenting student contributions into speech acts.
  4. Production rules and software agents will be constantly scanning the student input for surface cues that denote ends of clauses (such as commas, dashes, and connectives), beginnings and ends of sentences (such as periods, question marks, dashes, capital letters), nonlexical strings that have important conversational meaning (such as yup, uh-huh, OK), and sequences of words that have a frozen meaning (e.g., What about..., Let me ask a question). Each production rule or codelet will send on the appropriate classification of input that gets sensed. Dr. Franklin (a co-PI on this proposal) has been developing a virtual secretary (Franklin, Graesser, Olde, Song, & Negatu, 1996) that contains intelligent software agents with hundreds of "codelets" (Hofstadter, 1985). The codelets associated with these agents sense surface linguistic patterns in the email messages that are sent to the secretary and take appropriate actions when activated. The proposed tutor will use this intelligent agent architecture for segmenting student speech acts and identifying informative words and patterns. Alternatively, it is possible to use a conventional production system architecture (Just & Carpenter, 1992; Kieras, 1989) in the module that segments and classifies student input into speech act categories.
  5. A simple recurrent connectionist network (Cleeremans & McClelland, 1991; Elman, 1990) will be used to assign each speech act within a student's turn to one of the four speech act categories: Question, Contribution, Response, and Directive. A recurrent connectionist network has already been developed for a large corpus of tutoring data (Graesser, Swamer, & Hu, in press; Swamer, Graesser, Franklin, Cohen, & Sell, 1993). The network predicts the speech act category of speech act N+1, given the sequence of speech act categories 1 through N. This connectionist network will be expanded by including the word classes of the first three words of speech act N+1 (provided by WordNet). The first three words are quite discriminating in revealing the speech act category. For example, a wh-word in the first position suggests the speech act is a Question or Directive but not a Contribution or Response; the previous sequence of speech acts might then help decide whether N+1 is a Question or Directive. We have estimated, based on our previous research, that the speech act category for N+1 will be correctly predicted 90-99% of the time by the recurrent connectionist network that has the following input: (1) the sequence of speech act categories for speech acts 1 through N and (2) the word classes of the first three words of speech act N+1. However, this claim needs to be systematically tested in the proposed research. Other recurrent connectionist models, such as ART (Fausett, 1994) and fuzzy neural networks (Kosko, 1992), have proven successful for tasks in other domains (Silva & Kon, 1997), so these could be tested as alternative solutions for better classifications of speech acts. We plan on investigating models that extract symbolic rules from connectionist networks (Giles & Omlin, 1993); this is often useful when integrating symbolic models with neural network models.
It should be noted that the field of computational linguistics has not produced a satisfactory model that identifies speech act categories. The proposed research will hopefully fill this gap in the field.
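To make the input and output of the speech act classifier concrete, the sketch below segments a typed turn and assigns each segment to one of the four categories using surface cues only. It is a rule-based stand-in under stated assumptions: the word lists, the directive openers, and the rule ordering are hypothetical, and it does not implement the WordNet, parser, codelet, or recurrent-network components described above.

```python
import re

# Hypothetical cue lists; a real system would draw these from the
# dictionary of frozen expressions and the tutoring corpus.
SHORT_RESPONSES = {"yeah", "yes", "yup", "ok", "okay", "right", "uh-huh",
                   "no", "nope", "huh", "hmm"}
WH_WORDS = {"who", "what", "when", "where", "why", "how", "which"}
DIRECTIVE_OPENERS = ("tell me", "show me", "give me", "please")


def segment(turn):
    """Split a typed turn into candidate speech acts at sentence-final punctuation."""
    parts = re.findall(r"[^.?!]+[.?!]*", turn)
    return [p.strip() for p in parts if p.strip()]


def classify(act):
    """Assign one speech act to Question, Contribution, Response, or Directive."""
    text = act.lower().strip()
    bare = text.rstrip(".?!").strip()
    if bare in SHORT_RESPONSES:
        return "Response"
    if bare.startswith(DIRECTIVE_OPENERS):
        return "Directive"
    words = bare.split()
    first = words[0] if words else ""
    if text.endswith("?") or first in WH_WORDS:
        return "Question"
    return "Contribution"  # default: treat as an assertion or answer


def classify_turn(turn):
    """Return (speech act, category) pairs for every segment in the turn."""
    return [(act, classify(act)) for act in segment(turn)]
```

For a turn such as "Yeah. What is RAM? Tell me more about ROM. RAM is volatile memory", the sketch yields one segment in each of the four categories, in that order.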

Student questions and directives occasionally occur during tutoring, although not frequently according to previous research (Graesser et al., 1995). However, when these do occur, the tutor is expected to supply context-sensitive relevant information. Graesser and his associates have collected and analyzed the questions that students ask when solving problems, comprehending text, and interpreting displays in hypertext (Graesser, Baggett, & Williams, 1996; Graesser, Langston, & Baggett, 1993; Graesser & McMahen, 1993; Graesser & Person, 1994). This research has revealed that the vast majority of student questions can be identified theoretically or can be extracted empirically from topics out of context. We plan on developing a question-answer corpus by using Graesser's models of question asking and answering and also empirically by collecting the questions asked by learners for particular topics. The answers will be formulated on the basis of Graesser's cognitive computational model of human question answering, called QUEST (Graesser & Franklin, 1990; Graesser, Gordon & Brainerd, 1992). This model specifies the question answering procedures for sampling information when answering 19 categories of questions, such as concept completion questions (who, what, when, where), definitional questions (what does X mean), and deep reasoning questions (why, how, what-if). Each question-answer item will be represented as a production rule: IF <Question>, THEN <Answer>. There can be different wordings of a particular question, which accumulate with the development of the question-answer corpus. A particular Question-Answer item is triggered by a high LSA match between (a) the learner's question and (b) one of the wordings associated with the Question slot of the Question-Answer item.
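The triggering of a Question-Answer item might look like the sketch below. The bag-of-words cosine in `match_score` is only a crude stand-in for the LSA match, and the sample items, the `threshold` value, and all function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def match_score(a, b):
    """Bag-of-words cosine between two strings (placeholder for an LSA match)."""
    va = Counter(re.findall(r"[a-z0-9']+", a.lower()))
    vb = Counter(re.findall(r"[a-z0-9']+", b.lower()))
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values()) * sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


# Hypothetical question-answer corpus: each item has several wordings
# for its Question slot and one Answer.
QA_ITEMS = [
    {"wordings": ["what does ram mean", "what is ram"],
     "answer": "RAM is random access memory."},
    {"wordings": ["why is data lost when the power goes off"],
     "answer": "RAM is volatile, so its contents need power."},
]


def answer_question(question, items=QA_ITEMS, threshold=0.6):
    """IF the question matches a stored wording closely enough THEN return its answer."""
    best_score, best_answer = 0.0, None
    for item in items:
        for wording in item["wordings"]:
            score = match_score(question, wording)
            if score > best_score:
                best_score, best_answer = score, item["answer"]
    return best_answer if best_score >= threshold else None
```

Returning `None` when no wording clears the threshold leaves room for the tutor to fall back on another dialog move instead of giving an irrelevant answer.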

Evaluating the quality of student contributions. Human tutors are sensitive to the quality of student contributions during the collaborative process of answering a question or solving a problem. This component evaluates the quality of the students' speech acts that are classified as Contributions, but not Questions and Directives. The latent semantic analysis (LSA) will be used in these assessments of quality. Once again, the advantage of LSA is that it can handle messages that are not well formed semantically and syntactically. The truth of the contribution C is the cosine match score between C and the highest matching sentence in the corpus of texts, i.e., t(C). The relevance of contribution C, designated as r(C), can also be determined with respect to the specific topic (e.g., example problem, tutor question, diagram); this is the cosine match score between C and the highest matching sentence in the set of good student responses and bad student responses to the topic. The quality of a student contribution increases to the extent that the student response matches the good student response list as opposed to the bad list. It should be noted that the evaluative dimensions of truth, relevance, and quality are not equivalent. A contribution can be true and relevant, but a bad answer if it has a high match with a sentence in the bad answer list and a high match with the total set of knowledge in the original corpus. Therefore, the LSA has a foundation for classifying contributions into such categories as: (a) true but irrelevant, (b) true, relevant, and good, (c) relevant but bad, and so on. Similarly, the LSA can be used to measure the cumulative quality of all of the student's contributions to the topic (a mean match score per contribution, a maximum score of all student contributions, or a measure that assesses matches to the set of good responses).
Finally, the cumulative quality of all contributions made by the student in the entire tutoring session, up through contribution N, can serve as a measure of the global knowledge of the student.
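The three evaluative dimensions can be written compactly as follows. The symbols t(C) and r(C) come from the description above; the sets G_T and B_T (good and bad student responses stored for topic T) and the difference form of q(C) are notation and a formula assumed here for illustration, since the text states only that quality rises with matches to the good list as opposed to the bad list.

```latex
\begin{align*}
t(C) &= \max_{s \,\in\, \text{corpus}} \cos(C, s) \\
r(C) &= \max_{s \,\in\, G_T \cup B_T} \cos(C, s) \\
q(C) &= \max_{s \,\in\, G_T} \cos(C, s) \;-\; \max_{s \,\in\, B_T} \cos(C, s)
\end{align*}
```

Under this reading, a contribution with high t(C), high r(C), and negative q(C) falls in the "true and relevant, but bad" category.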

The set of good versus bad student responses to a topic will grow with tutoring experience. Whenever a student's contribution is classified as bad (via a high LSA match with a bad answer in the curriculum script or with one of the previous bad student responses), that contribution is added to the set of bad student responses for topic T. Whenever it is classified as good, it is added to the good answer list. Separate thresholds would be needed for good and bad, and many student contributions would fall in-between the two thresholds (and not be stored). We believe this approach does mimic how tutoring expertise develops in human tutors. More sophisticated methods of analyzing student responses may also be pursued.
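The threshold-based storage of new responses can be sketched as below. The Jaccard word overlap in `word_overlap` is a placeholder for the LSA cosine match, and the threshold values, the tie-breaking rule (good wins when both lists match equally well), and the function names are assumptions.

```python
def word_overlap(a, b):
    """Jaccard word overlap between two strings (placeholder for the LSA match)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def store_if_classified(contribution, good, bad, match=word_overlap,
                        good_threshold=0.7, bad_threshold=0.7):
    """Add the contribution to the good or bad list when it clears a threshold.

    good and bad are the response lists for one topic and are modified in
    place. Contributions that clear neither threshold are not stored.
    """
    best_good = max((match(contribution, g) for g in good), default=0.0)
    best_bad = max((match(contribution, b) for b in bad), default=0.0)
    if best_good >= good_threshold and best_good >= best_bad:
        good.append(contribution)
        return "good"
    if best_bad >= bad_threshold:
        bad.append(contribution)
        return "bad"
    return "unclassified"
```

One consequence of this design is drift: each stored response widens what counts as a match next time, so the thresholds would need to be set conservatively and checked against the expert judgments described earlier.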

Tutor dialog moves are generated by the tutor at the micro-level when a topic is collaboratively fleshed out. Our analysis of untrained tutors uncovered a set of dialog moves that are triggered under specific conditions during the collaborative evolution of an answer to a question or a solution to a problem (Graesser et al., 1995; Graesser, Bowers et al., in press; Person & Graesser, in press). Some of these moves are specified below:

  1. Pumping. The tutor pumps the student for more information during the early stages of answering a particular question (or solving a problem). The pump consists of positive feedback (e.g., right, yeah, dramatic head nod), neutral back channel feedback (uh-huh, okay, subtle head nod), or explicit requests for more information (What else?, Tell me more). The tutor pumps for one or two cycles of turns before the tutor contributes information. Pumping serves the functions of exposing knowledge of the student and of encouraging students to construct content by themselves.
  2. Prompting. Tutors supply the student with a discourse context and prompt them to fill in a missing word, phrase, or sentence. A pause or intonation cue is the appropriate prompt signal in speech whereas an underlined slot would serve an analogous function on a computer screen (e.g., As the heart beats faster, the blood pressure ____). Prompting is a scaffolding device for students who are reluctant to supply information. Students are expected to supply more content and more difficult content as they progress in learning the domain knowledge.
  3. Immediate feedback. Tutors periodically give positive, negative, or neutral feedback after the student's contributions. Tutors are normally polite conversation partners, so they are reluctant to give negative feedback after student contributions that have poor quality (Person et al., 1994, 1995). Tutors are reluctant to say "No, that's wrong." Instead, they give positive, neutral, or indirect feedback.
  4. Splicing. The tutor jumps in and splices correct information as soon as the student produces a contribution that is obviously error-ridden. The tutor needs to be able to recognize errors, bugs, and slips in order to do this. Deep misconceptions in the student are more difficult to detect and are not handled by splicing.
  5. Hinting. When the student is having problems answering a question or solving a problem, the tutor gives hints by presenting a fact, asking a leading question, or reframing the problem. Some hints are quite indirect and sophisticated in the case of ideal tutors and ITSs (Hume et al., 1993; McArthur et al., 1990; Merrill et al., 1992), whereas the hints of unskilled tutors are unsophisticated (Graesser et al., 1995).
  6. Summarizing. Unskilled tutors normally give a summary that recaps an answer to a question or solution to a problem. This summary serves the function of succinctly codifying a lengthy, multi-turn, collaborative exchange when a question is answered or problem is solved. A skilled tutor might encourage the student to construct the summary instead of the tutor supplying one. This would promote a more active construction of knowledge on the part of the student, an activity which is known to facilitate learning.

There are additional conversational moves, such as requestioning (asking the same question in slightly different words), revoicing (reiterating or questioning a contribution of the student), and assessment (asking whether the student understands). There is also a structure for managing the collaboration. For example, Graesser and Person (1994) discovered a 5-step dialog frame when tutors ask deep-reasoning questions: (1) tutor asks question, (2) student gives an answer, (3) tutor gives immediate feedback and/or pumps the student, (4) tutor and student collaboratively elaborate an answer, and (5) tutor assesses the student's understanding.

Our model will create production rules that precisely specify the conditions in which the various dialog moves are initiated. These are conditionalized on the content of the curriculum script, the dialog history, and the quality of the student's contribution during the last turn (or alternatively, the cumulative quality of the student's knowledge, or the cumulative quality of the student-tutor exchange). For example, the following simple production rule would be associated with immediate positive feedback.

IF [Script component = Question Qj, & max{sim(C, good-answer(Qj))} > v]

THEN [tutor prints out "That's right."]

The value of sim(X,Y) is the cosine match (normalized dot product) between X and Y, a measure of overlap. The max refers to the highest value of the set of matches. C refers to the student contribution. The parameter v is the threshold in similarity that must be met before the student's contribution is regarded as high in quality. It should be noted that a fuzzy controller architecture could be adopted by incorporating fuzzy descriptions of the evaluation metrics or thresholds. The production rule below would specify how an abrasive tutor would handle an error-ridden contribution.

IF [Script component = Question Qj, & max{sim(C, bad-answer(Qj))} > w]

THEN [tutor prints out "Wait. That's not correct." + <best answer from good-answer(Qj)>]
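The two feedback rules above can be illustrated with a minimal Python sketch. It is not the proposed implementation: the function names, the toy three-dimensional vectors, the threshold values of 0.7, and the example answers are all hypothetical, and real LSA vectors would have many more dimensions. The sketch also assumes that the positive-feedback rule is checked first and that the "best answer" is the good answer closest to the student's contribution.

```python
import math

def sim(x, y):
    """Cosine between two equal-length vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def best_match(c, answers):
    """Return (score, text) for the stored answer closest to contribution c."""
    return max(((sim(c, vec), text) for text, vec in answers), key=lambda p: p[0])

def immediate_feedback(c, good_answers, bad_answers, v=0.7, w=0.7):
    """Apply the two feedback rules; return tutor text, or None if neither fires.

    Assumption: the positive rule takes priority when both conditions hold.
    """
    good_score, good_text = best_match(c, good_answers)
    bad_score, _ = best_match(c, bad_answers)
    if good_score > v:
        return "That's right."
    if bad_score > w:
        return "Wait. That's not correct. " + good_text
    return None

# Hypothetical answer sets for one question Qj, as (text, vector) pairs.
good = [("RAM loses its contents when power is off.", [1.0, 0.0, 0.0])]
bad = [("RAM keeps its contents when power is off.", [0.0, 1.0, 0.0])]

print(immediate_feedback([0.9, 0.1, 0.0], good, bad))
print(immediate_feedback([0.1, 0.9, 0.0], good, bad))
```

Raising v makes the simulated tutor stingier with positive feedback, and raising w makes it slower to splice in a correction, which is one way the threshold parameters could express tutor style.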


The following production rule specifies how a hint would be generated when the student has expressed a contribution that is true but not relevant to the question. The hint is an answer from the good answer set, a contribution that extends the boundaries slightly beyond what a student already knows; this is a form of frontier learning or the zone of proximal development.

IF [Script component = Question Qj, & max{sim(C, good-answer(Qj))} < v,

& max{sim(C, bad-answer(Qj))} < w,

& max{sim(C, entire corpus)} > x]

THEN [tutor prints out a sentence S from good-answer(Qj), such that

sim(S, entire corpus) ≈ sim(cumulative student contributions, entire corpus)]
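A minimal Python sketch of the hint rule follows. It rests on several assumptions that the rule itself leaves open: the match of a vector against the "entire corpus" is taken to be its highest cosine with any corpus vector, the "almost equal" condition is implemented by picking the good answer whose corpus match is closest to that of the cumulative student contributions, and all names, toy vectors, and threshold values are hypothetical.

```python
import math

def sim(x, y):
    """Cosine between two equal-length vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def max_sim(vec, vectors):
    """Highest cosine between vec and any vector in the collection."""
    return max((sim(vec, other) for other in vectors), default=0.0)

def choose_hint(c, cumulative, good_answers, bad_vectors, corpus,
                v=0.7, w=0.7, x=0.3):
    """Return the text of a hint from good_answers, or None if the rule does not fire.

    c           -- vector for the student's last contribution
    cumulative  -- vector for all student contributions so far
    good_answers -- list of (text, vector) pairs for question Qj
    """
    good_vectors = [vec for _, vec in good_answers]
    if not good_answers:
        return None
    if max_sim(c, good_vectors) >= v:   # contribution is already a good answer
        return None
    if max_sim(c, bad_vectors) >= w:    # contribution matches a known bad answer
        return None
    if max_sim(c, corpus) <= x:         # contribution is unrelated to the domain
        return None
    # Pick the good answer whose corpus match is nearest the student's level.
    target = max_sim(cumulative, corpus)
    text, _ = min(good_answers,
                  key=lambda a: abs(max_sim(a[1], corpus) - target))
    return text

# Toy data (hypothetical): three corpus vectors and two good answers.
corpus = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
good = [("hint A", [1.0, 1.0, 0.0]), ("hint B", [1.0, 0.0, 0.0])]
bad = [[0.0, 1.0, 0.0]]
print(choose_hint([0.0, 0.0, 1.0], [1.0, 1.0, 1.0], good, bad, corpus))
```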

The style and expertise of the tutor are defined in part by the production rules that capture these dialog rules and the values of the threshold parameters (v, w, and x). The production rules would be different for the unskilled tutor with minimal experience and a tutor with much experience (i.e., a large accumulated list of good answers and bad answers) who uses frontier learning. An important goal of this project is to simulate the contributions of different classes of tutors which vary in experience and in the use of particular pedagogical strategies.

Computer System. Model development will be on two computers. The more processing-intensive applications will be on a Sun Ultra 2 workstation with dual 167 MHz processors, a high-speed internal bus, advanced graphics capabilities, 64 MB RAM, and a 4 GB hard drive. This computer will be necessary to handle the demands of training the more complex neural networks, extracting rules from them, training the LSA knowledge bases, and running the fuzzy controllers and production rule systems. The other computer will be a 200 MHz Pentium Pro PC with 64 MB RAM and a 9 GB hard drive. This computer is capable of accommodating WordNet, LSA knowledge bases, Neuroworks (for running simpler neural networks), natural language analysis software, and decision control software (including production rules that are crisp or fuzzy). The objective is to have our computer tutor implemented on a Pentium that can be used for training in school systems, in corporate environments, and in governmental agencies.

Prototype computer tutors will be developed for two knowledge domains: computer literacy and medicine. In addition, we plan on developing a computer shell that other researchers can use to develop tutoring systems on other topics. At minimum, the shell would guide the developers in building three of the six database modules shown in the Figure (see last page of this section): the curriculum scripts, the LSA spaces, and the question-answer corpus. The other components in the Figure could remain intact. However, the developer would need to modify the topic selection production rules and the tutor dialog rules in order to implement a new set of tutoring strategies. The language modules could also be enhanced with improvements in automated natural language processing systems.


Evaluating the Computer Tutor

Initial evaluations. The primary objective of the evaluation is to assess the pedagogical quality and conversational aptness of the simulated tutor contributions. That is, a tutor contribution should have some pedagogical value, be relevant to the conversational context, and be informative. The initial versions of our computerized tutors are likely to be a long way from meeting these objectives, so some rapid prototyping is needed before we reach a reasonable level of performance. Informal assessments of the tutoring system will prevail during the first year of the project when the components of the system are being developed. When the computer tutor approaches the arena of supplying reasonable contributions, then more systematic evaluations will be made of the tutor's contributions.

In order to provide such evaluations, college students will interact with the computerized tutor and supply a sample of tutorial dialogs. The students will enter information by keyboard and the tutor will display information on the computer screen or through simulated speech during these interactions. The content of the tutor's contributions in these transcripts will be analyzed by experts in discourse analysis (Dr. Graesser, Dr. Kreuz, Dr. Wiemer-Hastings), experts in education (Dr. Gholson, Dr. Hacker), and graduate research assistants. Each speech act and conversational turn of the tutor will be rated on dimensions such as the following: relevance to the conversational context, relevance to the preceding turn, informativity, aptness, coherence of speech acts within a turn, pedagogical value, and type of dialog move. There will be multiple raters so that interjudge reliability can be assessed. Speech acts with low ratings will be analyzed in an effort to diagnose and correct the problems. For example, there may be problems with the conditions and parameters of the production rules (of the generator of the topics and the dialog moves), problems in finding the optimal match scores in the LSA, and problems with the curriculum script.

Analyses will be performed on particular components of the tutoring model. For example, we can compute the proportion of student turns in which the model correctly segments the students' contributions into speech acts and correctly assigns the speech acts into categories. We can compute the conditional likelihood that the tutor generates an appropriate dialog move, given that the student's contribution is at a particular level of quality. Recall and precision have been the traditional measures of performance in the Message Understanding evaluations (DARPA, 1995; Lehnert, in press), so we will use these measures of performance for language and discourse modules whenever appropriate. Trained judges will rate student contributions on truth and relevance. These ratings will be correlated with the match scores of the LSA and other characteristics of the tutoring system. The correlations should be high and positive to the extent that the LSA is providing a valid assessment of the quality of student contributions.
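The component measures described above can be made concrete with a short Python sketch. It assumes a Pearson correlation for the comparison between judges' ratings and LSA match scores, which is one reasonable choice among several; the function names and the speech act labels in the usage example are hypothetical.

```python
import math

def precision_recall(predicted, actual, category):
    """Precision and recall for one speech act category.

    predicted -- categories assigned by the model, one per speech act
    actual    -- categories assigned by trained judges, in the same order
    """
    true_pos = sum(p == category and a == category
                   for p, a in zip(predicted, actual))
    n_predicted = sum(p == category for p in predicted)
    n_actual = sum(a == category for a in actual)
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_actual if n_actual else 0.0
    return precision, recall

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

# Hypothetical labels for four speech acts.
model = ["question", "assertion", "question", "assertion"]
judges = ["question", "question", "question", "assertion"]
print(precision_recall(model, judges, "question"))
```

In the usage example the model labels two speech acts as questions and both are correct, but it misses one of the three questions identified by the judges, so precision is perfect while recall is two thirds.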

Turing tests. We plan on conducting "Turing tests" at a fine-grained level during the third year, when the tutoring system has achieved some maturity. The first step is to collect a sample of transcripts of the tutoring protocols between the computer tutor and an actual learner. Contributions by the tutor will be sampled randomly. Half of these contributions in the sample transcripts will consist of the computer contribution whereas the other half will be contributions generated by the experts in language, discourse, and education on this project (analogous to a "Wizard of Oz" technique). College students will read the transcripts and make a decision as to whether a human or computer generated each contribution on a 6-point scale: (1) definitely human, (2) probably human, (3) undecided, but guess human, (4) undecided but guess computer, (5) probably computer, and (6) definitely computer. There will be two versions of the transcript, such that half of the sampled tutor contributions are generated by the computer and the other half by the human in version A, with the opposite assignment in version B. Consequently, we will be able to perform a standard signal detection analysis that segregates a true discrimination between human and computer from response biases in making these judgments; we will collect hit rates, false alarm rates, d' scores, and Ag discrimination scores. Dr. Hu, Dr. Marks, and Dr. Graesser have a rich background in performing such signal detection analyses. If our tutor is perfectly successful, college students will not be able to distinguish contributions of the computer versus human experts. It is of course possible to segregate performance scores for different categories of tutor contributions.
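As an illustration of the signal detection analysis, the following Python sketch computes a hit rate, a false alarm rate, and d' from ratings on the 6-point scale. It assumes that ratings of 4 through 6 count as a "computer" response, that a "hit" is a computer-generated contribution judged to be computer-generated, and that a log-linear correction (adding 0.5 to each count and 1 to each total) is applied so that rates of 0 or 1 do not produce infinite z scores. These choices and the function name are assumptions for the sketch, and the Ag score is not covered.

```python
from statistics import NormalDist

def d_prime(ratings, sources, cutoff=4):
    """Return (hit_rate, false_alarm_rate, d') for a set of judgments.

    ratings -- values from 1 (definitely human) to 6 (definitely computer)
    sources -- "computer" or "human", the true origin of each contribution
    cutoff  -- lowest rating treated as a "computer" response (assumed: 4)
    """
    hits = sum(r >= cutoff for r, s in zip(ratings, sources) if s == "computer")
    false_alarms = sum(r >= cutoff for r, s in zip(ratings, sources) if s == "human")
    n_computer = sources.count("computer")
    n_human = sources.count("human")
    # Log-linear correction keeps both rates strictly between 0 and 1.
    hit_rate = (hits + 0.5) / (n_computer + 1)
    fa_rate = (false_alarms + 0.5) / (n_human + 1)
    z = NormalDist().inv_cdf
    return hit_rate, fa_rate, z(hit_rate) - z(fa_rate)

# Hypothetical judgments: identical rating patterns for both sources.
ratings = [1, 6, 1, 6]
sources = ["computer", "computer", "human", "human"]
print(d_prime(ratings, sources))
```

A d' near zero, as in the usage example where both sources receive the same pattern of ratings, is the outcome expected if judges cannot tell the computer tutor from the human experts.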

An alternative method of performing a Turing test would be to have a human expert provide contributions on-line during the tutorial sessions at random points in the dialog (via the Wizard of Oz technique in which the human sends computer messages from another room); the computer would supply contributions at the remaining points. After the session, the learner would go back retrospectively and make the decisions about "computer versus human tutor" contributions. The advantage of this alternative method is that the judgments of the original learner would be deeper, more personal, and more situated.

Comparison of tutors. Our initial goal will be to simulate an unskilled tutor with the dialogue moves identified by Graesser et al. (1995) and with LSA. Once we have created a model with an acceptable level of performance (e.g., conversational aptness and pedagogical value), we will go through a phase of tuning the parameters and rules of the model and examining how its performance changes with tutor experience (i.e., the accumulating corpus of good and bad student responses). It is possible to manipulate the amount of domain knowledge of the computer tutor by feeding in a different volume of texts to the LSA program. We will subsequently compare different classes of tutors that embrace different ideal tutoring strategies, as discussed earlier. These models will be the same except for the production rules that select topics and dialog moves. Trained judges will assess the tutor's contributions with respect to conversational smoothness and pedagogical value, as discussed above. We will simulate tutorial dialogue with these different classes of tutors. However, it is too early to commit ourselves on exactly what classes of tutoring systems we plan on developing.

Learning outcomes. Although the primary objective of the proposed research does not address how well students learn with these computerized tutors, we do plan on conducting preliminary evaluations during the third year of the proposed research. College students in a Computer Literacy course will be randomly assigned to one of three different tutoring conditions: (1) an unskilled tutor that has the dialog moves identified by Graesser et al. (1995), (2) a skilled tutor that has a set of ideal tutoring strategies in addition to the dialog moves in #1, and (3) a reading control in which students read topics in the tutoring script that are yoked to topics that emerged in conditions 1 or 2. After receiving one of these three tutoring conditions, the students will be given a thorough test of the domain knowledge. The test will have a combination of essay questions, deep-reasoning questions, problems to solve, simple objective questions (short-answer questions and multiple-choice questions), and appraisals of particular dimensions of memory. We will include both accuracy-oriented and quantity-oriented measures of memory assessment (Koriat & Goldsmith, 1994). The objective questions and deep-reasoning questions will tap specific topics, whereas the extended-response items will tap the organization of large chunks of material and a flexible integration of multiple concepts (Airasian, 1997; Kintsch, in press; Oosterhof, 1996). The tests that require reasoning and problem solving will tap the active knowledge that is being emphasized in recent trends in educational assessment (Goldman, Pelligrino, & Bransford, 1993; Office of Technology Assessment, 1992). The tests will be scored by LSA (see previous discussion of LSA) in addition to trained judges. Dr. Hacker (an expert in education and educational psychology), Dr. Marks (an expert in cognition, perception, and memory), and Dr. Gholson (an expert in cognitive development, learning, reasoning, and problem solving) will play a major role in developing tests that tap deep levels of understanding and tests that tap memory for the relatively shallow information.

Senior Personnel, Training, and the Institute for Intelligent Systems

The PI, co-PIs, and other senior personnel on this proposal have had extensive interactions over the years in our interdisciplinary Institute for Intelligent Systems (IIS). For the last 12 years, there have been weekly seminars on cognitive science, neural networks, and complex dynamical systems. These seminars have furnished a rich "grass roots" arena for discussing theories, models, and empirical research in cognitive science. The proposed research will be a substantial boost to the interdisciplinary IIS because the team of researchers will be working seriously on a single coordinated project instead of nurturing individual projects. The education of our student RAs will also benefit from a coordinated effort.