Class Business
- A look ahead to the in-class activity for next Tuesday (Class 9)
- Reference: U.S. Census
- Literary: Intro to Cultural Analytics
- Miscellany: Data is Plural
Plan for class:
- Discussion of Conceptual Spreadsheets
- Discussion of Data Prep
Epigraph for Class
Apollo & Dionysus
Gerhard Richter, “Betty,” 1988
Gerhard Richter, “Blue,” 1988
Conceptual Spreadsheets
- Caitlin: Hominids (Discursive)
- Brian & Nicholas: Desk (Controlled Vocabulary)
- Wendy & Serena: Furniture (Condition)
Our journal’s policy is that all data and code relevant to articles published in CA will be made publicly available through the journal’s open-access data archiving service with Dataverse. […] This policy includes sharing underlying text, audio, or image files; derived data used in the analysis; and code used to acquire, clean, and analyze collections. (About)
Researcher D has been supplied a data set through a third party repository, such as Hathi Trust or their own research library. This researcher would supply the table of file IDs so that other researchers could request identical data sets from Hathi or their own university library. As in the other scenarios, Researcher D supplies derived data from the underlying text files, which might include word counts or higher-level features. The code used to analyze the data would be submitted along with the tables as a separate file. (About)
Is Data Socially Constructed?
Gitelman & Jackson, “Introduction,” “Raw Data” is an Oxymoron
At first glance data are apparently before the fact: they are the starting point for what we know, who we are, and how we communicate. This shared sense of starting with data often leads to an unnoticed assumption that data are transparent, that information is self-evident, the fundamental stuff of truth itself. (2)
…“the semantic function of data is specifically rhetorical.” Data by definition are “that which is given prior to argument,” given in order to provide a rhetorical basis. (Facts are facts — that is, they are true by dint of being factual — but data can be good or bad, better or worse, incomplete and insufficient.) Yet precisely because data stand as a given, they can be taken to construct a model sufficient unto itself (7)

Objectivity is situated and historically specific; it comes from somewhere and is the result of ongoing changes to the conditions of inquiry, conditions that are at once material, social, and ethical. (4)
The Camera is the Paradigm of Mechanical Objectivity, and Yet …
At the very least the photographic image is always framed, selected out of the profilmic experience in which the photographer stands, points, shoots. (5)
- Our World in Data: Coronavirus Pandemic Data Explorer
Data Workflow
Kotu & Deshpande, Data Science Process, Fig 2.1 (edited)
Data Preparation
Data exploration approaches involve computing descriptive statistics and visualization of data. They can expose the structure of the data, the distribution of the values, the presence of extreme values, and highlight the inter-relationships within the dataset. (25)
Descriptive statistics like mean, median, mode, standard deviation, and range for each attribute provide an easily readable summary of the key characteristics of the distribution of data. On the other hand, a visual plot of data points provides an instant grasp of all the data points condensed into one chart. (25)
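A minimal sketch of the summary statistics named above, using only Python's standard `statistics` module. The attribute name `heights_cm` and its values are invented for illustration and do not come from the reading.

```python
import statistics

# Hypothetical attribute values, invented for illustration only.
heights_cm = [150, 160, 160, 170, 210]

summary = {
    "mean": statistics.mean(heights_cm),
    "median": statistics.median(heights_cm),
    "mode": statistics.mode(heights_cm),
    "stdev": statistics.stdev(heights_cm),  # sample standard deviation
    "range": max(heights_cm) - min(heights_cm),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

In this toy example the mean (170) sits above the median (160) because of the single extreme value 210, which is the kind of feature of a distribution that exploration is meant to expose.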
Provenance
Tracking the data lineage (provenance) of the data source can lead to the identification of systemic issues during data capture or errors in data transformation. (26)
Quality
The data cleansing practices include elimination of duplicate records, quarantining outlier records that exceed the bounds, standardization of attribute values, substitution of missing values, etc. (25)
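The four practices in this passage can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the records, the `AGE_BOUNDS` range, the `clean` helper, and the choice of the median as the substitute for missing values. Real projects would pick bounds and fill rules per dataset.

```python
import statistics

# Hypothetical records invented for illustration: (id, city, age).
records = [
    (1, "new york", 34),
    (2, "New York ", 29),
    (2, "New York ", 29),   # exact duplicate of the previous record
    (3, "BOSTON", None),    # missing age
    (4, "Boston", 430),     # implausible age, likely a typo
]

AGE_BOUNDS = (0, 120)  # assumed plausible range; choose bounds per dataset


def clean(rows):
    """Return (kept, quarantined) after the four cleansing steps."""
    # 1. Eliminate duplicate records, preserving first-seen order.
    unique = list(dict.fromkeys(rows))

    # 2. Standardize attribute values: trim whitespace, unify capitalization.
    standardized = [(rid, city.strip().title(), age) for rid, city, age in unique]

    # 3. Quarantine outliers that exceed the bounds instead of deleting them.
    low, high = AGE_BOUNDS
    kept, quarantined = [], []
    for row in standardized:
        age = row[2]
        if age is not None and not (low <= age <= high):
            quarantined.append(row)
        else:
            kept.append(row)

    # 4. Substitute missing values with the median of the known ages.
    known = [age for _, _, age in kept if age is not None]
    fill = statistics.median(known)
    kept = [(rid, city, fill if age is None else age) for rid, city, age in kept]
    return kept, quarantined


kept, quarantined = clean(records)
print(kept)
print(quarantined)
```

Quarantined rows are set aside rather than discarded, so the decision to exclude them stays visible and reversible.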
Task-Specific
In some data science algorithms like k-NN, the input attributes are expected to be numeric and normalized, because the algorithm compares the values of different attributes and calculates distance between the data points. Normalization prevents one attribute dominating the distance results because of large values. (27)
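To see why normalization matters for distance-based methods, here is a small sketch that finds the nearest neighbor of one point before and after min-max scaling. The data points and the helper names `min_max_normalize` and `nearest` are hypothetical, and min-max scaling is only one of several normalization schemes.

```python
import math

# Hypothetical data points invented for illustration: (age in years, income in dollars).
points = {
    "a": (25, 50_000),
    "b": (60, 51_000),
    "c": (27, 54_000),
    "d": (40, 90_000),
}


def min_max_normalize(data):
    """Rescale every attribute to [0, 1] using that attribute's min and max.

    Assumes no attribute is constant (otherwise max - min would be zero).
    """
    columns = list(zip(*data.values()))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return {
        key: tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
        for key, row in data.items()
    }


def nearest(data, query):
    """Return the key of the point closest to `query` by Euclidean distance."""
    others = (k for k in data if k != query)
    return min(others, key=lambda k: math.dist(data[k], data[query]))


raw_neighbor = nearest(points, "a")
scaled_neighbor = nearest(min_max_normalize(points), "a")
print(raw_neighbor, scaled_neighbor)
```

On the raw values, income differences in the thousands swamp age differences in the tens, so the 60-year-old "b" counts as the closest match to the 25-year-old "a". After scaling, both attributes contribute on the same footing and "c" becomes the nearest neighbor.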
Reducing the number of attributes, without significant loss in the performance of the model, is called feature selection. It leads to a more simplified model and helps to synthesize a more effective explanation of the model. (28)
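One very simple filter-style heuristic for feature selection is to drop attributes that barely vary. The sketch below is an illustration under stated assumptions, not the method the reading prescribes: the dataset, the `select_by_variance` helper, and the threshold are all hypothetical, and the filter does not itself check model performance, which the passage makes the real criterion.

```python
import statistics

# Hypothetical dataset invented for illustration: attribute name -> values.
attributes = {
    "height": [150, 160, 170, 180],
    "country_code": [1, 1, 1, 1],   # constant, so it cannot distinguish records
    "weight": [50, 65, 62, 80],
}


def select_by_variance(columns, threshold=0.0):
    """Keep attributes whose population variance exceeds `threshold`."""
    return {
        name: values
        for name, values in columns.items()
        if statistics.pvariance(values) > threshold
    }


selected = select_by_variance(attributes)
print(list(selected))
```

A fuller workflow would compare model performance with and without each candidate attribute before removing it.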
- Rajabi, Cleaning a Messy Dataset using Python
- Cleaning Code in GitHub Repo
Data Standards
- File Format: Best Practices
Quantitative Tabular Data
- comma-separated values (CSV) file (.csv)
- tab-delimited file (.tab)
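As a rough sketch of the two tabular formats above, the standard `csv` module can write the same rows with either delimiter. The rows, field names, and the `serialize` helper are invented for illustration.

```python
import csv
import io

# Hypothetical rows invented for illustration.
rows = [
    {"item": "desk", "condition": "good, minor scratches"},
    {"item": "chair", "condition": "fair"},
]


def serialize(records, delimiter):
    """Write the records to an in-memory text buffer with the given delimiter."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["item", "condition"],
        delimiter=delimiter,
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


as_csv = serialize(rows, ",")
as_tab = serialize(rows, "\t")
print(as_csv)
print(as_tab)
```

Note that the comma inside "good, minor scratches" forces quoting in the CSV output but not in the tab-delimited output. When writing to disk rather than to memory, opening the file with `open(path, "w", newline="", encoding="utf-8")` keeps line endings and character encoding explicit.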
Qualitative data
- eXtensible Mark-up Language (XML) text according to an appropriate Document Type Definition (DTD) or schema (.xml)
- plain text data, UTF-8 (Unicode; .txt)
Digital image data
- TIFF version 6 uncompressed (.tif)
Digital audio data
- Free Lossless Audio Codec (FLAC) (.flac)
- MPEG-1 Audio Layer 3 (.mp3)