Class 15 (English 197 – Fall 2022)

This is the main course website. There is also a course Canvas site for uploading assignments.

Class Business

Practicum 6: Large Language Models & Text-to-Image Models Exercise


Student outputs

  • Jake Houser
  • Emily Franklin
  • Lorna Kreusel
  • Stella Jia
  • [TBD]

Thinking With / Thinking about Large Language Models

Minh Hua and Rita Raley, “Playing With Unicorns: AI Dungeon and Citizen NLP” (2020)

If a complete mode of understanding is as-yet unachievable, then evaluation is the next best thing, insofar as we take evaluation, i.e. scoring the model’s performance, to be a suitable proxy for gauging and knowing its capabilities. (link)

In this endeavor, the General Language Understanding Evaluation benchmark (GLUE), a widely-adopted collection of nine datasets designed to assess a language model’s skills on elementary language operations, remains the standard for the evaluation of GPT-2 and similar transfer learning models…. Especially striking, and central to our analysis, are two points: a model’s performance on GLUE is binary (it either succeeds in the task or it does not)…. But if the training corpus is not univocal — if there is no single voice or style, which is to say no single benchmark — because of its massive size, it is as yet unclear how best to score the model. (link)

Our research questions, then, are these: by what means, with what critical toolbox or with which metrics, can AID [AI Dungeon], as a paradigmatic computational artifact, be qualitatively assessed, and which communities of evaluators ought to be involved in the process? (link)


AID, as an experiment with GPT-2, provides a model for how humanists might more meaningfully and synergistically contribute to the project of qualitative assessment going forward…. (link)

Our presupposition … is that it is not by itself sufficient to bring to bear on the textual output of a machine learning system the apparatus of critical judgment as it has been honed over centuries in relation to language art as a putatively human practice. What is striking even now is the extent to which humanistic evaluation in the domain of language generation is situated as a Turing decision: bot or not. We do not however need tales of unicorns to remind us that passable text is itself no longer a unicorn. And, as we will suggest, the current evaluative paradigm of benchmarking generated text samples — comparing output to the target data to assess its likeness — falls short when the source for generated samples is neither stable nor fully knowable. (link)


Emily M. Bender et al., “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” (2021)

  • Douglass Bakkum, Philip Gamblen, Guy Ben-Ary, Zenas Chao, and Steve Potter, “MEART: The Semi-Living Artist” (2007).
  • Harold Cohen’s “Aaron”

    Harold Cohen and Aaron (samples of art created by Aaron)
  • Margaret A. Boden, The Creative Mind: Myths and Mechanisms, 2nd ed. (1990/2004) (PDF)

