Class 8 (English 146DS – Winter 2021)

Class Business

Plan for class: Discussion of Conceptual Spreadsheets → Discussion of Data Prep

Epigraph for Class

Apollo & Dionysus

Photorealistic painting of a young woman turned away from the viewer
Gerhard Richter, “Betty,” 1988

Abstract Painting made by layering oil paints of different colors and then cutting them away
Gerhard Richter, “Blue,” 1988

Conceptual Spreadsheets

  • Caitlin: Hominids (Discursive)
  • Brian & Nicholas: Desk (Controlled Vocabulary)
  • Wendy & Serena: Furniture (Condition)

Teddy’s Homework Assignment

Our journal’s policy is that all data and code relevant to articles published in CA will be made publicly available through the journal’s open-access data archiving service with Dataverse. […] This policy includes sharing underlying text, audio, or image files; derived data used in the analysis; and code used to acquire, clean, and analyze collections. (About)

Researcher D has been supplied a data set through a third party repository, such as Hathi Trust or their own research library. This researcher would supply the table of file IDs so that other researchers could request identical data sets from Hathi or their own university library. As in the other scenarios, Researcher D supplies derived data from the underlying text files, which might include word counts or higher-level features. The code used to analyze the data would be submitted along with the tables as a separate file. (About)

Is Data Socially Constructed?

Gitelman & Jackson, “Introduction,” “Raw Data” Is an Oxymoron

At first glance data are apparently before the fact: they are the starting point for what we know, who we are, and how we communicate. This shared sense of starting with data often leads to an unnoticed assumption that data are transparent, that information is self-evident, the fundamental stuff of truth itself. (2)

…“the semantic function of data is specifically rhetorical.” Data by definition are “that which is given prior to argument,” given in order to provide a rhetorical basis. (Facts are facts — that is, they are true by dint of being factual — but data can be good or bad, better or worse, incomplete and insufficient.) Yet precisely because data stand as a given, they can be taken to construct a model sufficient unto itself (7)


Objectivity is situated and historically specific; it comes from somewhere and is the result of ongoing changes to the conditions of inquiry, conditions that are at once material, social, and ethical. (4)

The Camera is the Paradigm of Mechanical Objectivity, and Yet …

At the very least the photographic image is always framed, selected out of the profilmic experience in which the photographer stands, points, shoots. (5)

If Data is Socially Constructed, Do We Have to Give Up Science?

Data Workflow

Kotu & Deshpande, Data Science Process, Fig 2.1 (edited)

Visualization of the CRISP Data Mining framework. A flowchart with steps: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Application

Data Preparation


Data exploration approaches involve computing descriptive statistics and visualization of data. They can expose the structure of the data, the distribution of the values, the presence of extreme values, and highlight the inter-relationships within the dataset. (25)

Descriptive statistics like mean, median, mode, standard deviation, and range for each attribute provide an easily readable summary of the key characteristics of the distribution of data. On the other hand, a visual plot of data points provides an instant grasp of all the data points condensed into one chart. (25)
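
As a rough illustration of the summary statistics named above, here is a minimal sketch using only Python's standard library. The `word_counts` values are invented for the example and do not come from any class data set.

```python
import statistics

# Hypothetical attribute values, invented for illustration only.
word_counts = [120, 135, 135, 150, 410]

summary = {
    "mean": statistics.mean(word_counts),
    "median": statistics.median(word_counts),
    "mode": statistics.mode(word_counts),
    "stdev": statistics.stdev(word_counts),  # sample standard deviation
    "range": max(word_counts) - min(word_counts),
}

for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

Note how the single extreme value (410) pulls the mean well above the median, which is the kind of structure data exploration is meant to expose.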


Tracking the data lineage (provenance) of the data source can lead to the identification of systemic issues during data capture or errors in data transformation. (26)


The data cleansing practices include elimination of duplicate records, quarantining outlier records that exceed the bounds, standardization of attribute values, substitution of missing values, etc. (25)
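
The four cleansing practices listed above might look something like the following sketch. Everything here is an assumption made for illustration: the records, the `MAX_PAGES` bound, and the choice of the median as the substitute for missing values.

```python
import statistics

# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "genre": "Novel ", "pages": 320},
    {"id": 1, "genre": "Novel ", "pages": 320},    # exact duplicate
    {"id": 2, "genre": "novel", "pages": None},    # missing value
    {"id": 3, "genre": "POETRY", "pages": 96},
    {"id": 4, "genre": "poetry", "pages": 12000},  # implausible page count
]

MAX_PAGES = 5000  # assumed upper bound; anything above it is quarantined

# 1. Eliminate duplicate records.
seen = set()
unique = []
for r in records:
    key = (r["id"], r["genre"], r["pages"])
    if key not in seen:
        seen.add(key)
        unique.append(r)

# 2. Standardize attribute values (trim whitespace, lowercase).
standardized = [{**r, "genre": r["genre"].strip().lower()} for r in unique]

# 3. Quarantine outliers instead of silently deleting them.
kept, quarantined = [], []
for r in standardized:
    if r["pages"] is not None and r["pages"] > MAX_PAGES:
        quarantined.append(r)
    else:
        kept.append(r)

# 4. Substitute missing values with the median of the known values.
known = [r["pages"] for r in kept if r["pages"] is not None]
fill = statistics.median(known)
cleaned = [{**r, "pages": fill if r["pages"] is None else r["pages"]} for r in kept]

print(cleaned)
print(quarantined)
```

Each step is a decision about what counts as the same record, a valid value, or a plausible number, which connects back to the question of whether data are ever simply "given."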


In some data science algorithms like k-NN, the input attributes are expected to be numeric and normalized, because the algorithm compares the values of different attributes and calculates distance between the data points. Normalization prevents one attribute dominating the distance results because of large values. (27)
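
To see why normalization matters for distance-based methods like k-NN, consider this small sketch. It uses min-max scaling, which is only one of several normalization schemes, and the four data points are hypothetical.

```python
import math

# Hypothetical texts described by (total word count, average sentence length).
points = {
    "A": (50000, 12.0),
    "B": (50400, 30.0),
    "C": (52000, 12.0),
    "D": (90000, 10.0),
}

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def min_max_normalize(named_points):
    """Rescale every attribute to the range 0..1."""
    columns = list(zip(*named_points.values()))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return {
        name: tuple(
            (v - lo) / (hi - lo) if hi != lo else 0.0
            for v, lo, hi in zip(values, lows, highs)
        )
        for name, values in named_points.items()
    }

def nearest(name, named_points):
    """Name of the point closest to `name`, excluding itself."""
    others = [n for n in named_points if n != name]
    return min(others, key=lambda n: euclidean(named_points[name], named_points[n]))

nearest_raw = nearest("A", points)
nearest_scaled = nearest("A", min_max_normalize(points))
print(nearest_raw, nearest_scaled)
```

On the raw values the nearest neighbor of A is B, because word counts in the tens of thousands swamp the difference in sentence length. After rescaling, the nearest neighbor is C, which matches A's sentence length exactly.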

Reducing the number of attributes, without significant loss in the performance of the model, is called feature selection. It leads to a more simplified model and helps to synthesize a more effective explanation of the model. (28)
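
One very simple form of feature selection is to drop attributes that never vary, since a constant column cannot help distinguish one record from another. The sketch below shows only that variance filter; it is not the method Kotu & Deshpande prescribe, and confirming "no significant loss in performance" would still require evaluating a model. The table and the `select_by_variance` helper are hypothetical.

```python
import statistics

# Hypothetical table: each key is an attribute, each list holds its values.
table = {
    "word_count": [1200, 3400, 560, 2100],
    "language_code": [1, 1, 1, 1],  # constant: identical for every row
    "avg_sentence_len": [12.5, 30.1, 9.8, 18.2],
}

def select_by_variance(columns, threshold=0.0):
    """Keep attributes whose population variance exceeds the threshold."""
    return {
        name: values
        for name, values in columns.items()
        if statistics.pvariance(values) > threshold
    }

selected = select_by_variance(table)
print(list(selected))
```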

Data Standards

Quantitative Tabular Data

  • comma-separated values (CSV) file (.csv)
  • tab-delimited file (.tab)
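
As a sketch of how derived data might be saved in either tabular format, the example below writes the same rows as comma-separated and tab-delimited text with Python's standard `csv` module. The file IDs, column names, and the `to_delimited` helper are made up for illustration.

```python
import csv
import io

# Hypothetical derived data: one row per source file.
rows = [
    {"file_id": "doc-001", "word_count": 1200},
    {"file_id": "doc-002", "word_count": 3400},
]

def to_delimited(rows, delimiter):
    """Serialize rows as delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["file_id", "word_count"],
        delimiter=delimiter,
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = to_delimited(rows, ",")   # contents suitable for a .csv file
tab_text = to_delimited(rows, "\t")  # contents suitable for a .tab file
print(csv_text)
print(tab_text)
```

Only the delimiter changes between the two formats; the table itself is the same.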

Qualitative data

  • eXtensible Mark-up Language (XML) text according to an appropriate Document Type Definition (DTD) or schema (.xml)
  • plain text data, UTF-8 (Unicode; .txt)

Digital image data

  • TIFF version 6 uncompressed (.tif)

Digital audio data

  • Free Lossless Audio Codec (FLAC) (.flac)
  • MPEG-1 Audio Layer 3 (.mp3)