Class 7 (English 146DS – Winter 2021)

Class Business

Plan for class:  Discussion of big data (continued)  →  Discussion of data structures & models

Epigraph for Class

John Keats, “Ode on a Grecian Urn” (1819)

“Beauty is truth, truth beauty,—that is all
Ye know on earth, and all ye need to know.”


Jonathan Stuart Ward and Adam Barker, “Undefined By Data: A Survey of Big Data Definitions” (2013)

[The Gartner Report, 2001] proposed a three fold definition encompassing the “three Vs”: Volume, Velocity, Variety…. This definition has since been … expanded upon by … others to include a fourth V: Veracity.

Volume | Velocity | Variety | Veracity

Intel is one of the few organisations to provide concrete figures in their literature. Intel links big data to organisations “generating a median of 300 terabytes (TB) of data weekly.”

–Ward and Barker

As of June 2015, the dump of all pages with complete edit history in XML format [of the English Wikipedia] … is about 100 GB compressed using 7-Zip, and 10 TB uncompressed.

Estimates are that the big four [Google, Amazon, Microsoft and Facebook] store at least 1,200 petabytes between them. That is 1.2 million terabytes (one terabyte is 1,000 gigabytes). And that figure excludes other big providers like Dropbox, Barracuda and SugarSync, to say nothing of massive servers in industry and academia.

Locating computers owned by HFT [High-Frequency Trading] firms and proprietary traders in the same premises where an exchange’s computer servers are housed … enables HFT firms to access stock prices a split second before the rest of the investing public. Co-location has become a lucrative business for exchanges, which charge HFT firms millions of dollars for the privilege of “low latency access.”

Papal conclave process

The Barnum and Bailey greatest show on earth – A glance at the great ethnological congress and curious led animals … (circa 1895)
The Ed Sullivan Show
Image illustrating problem of high dimensional data, from Leong Kuan Yew, “Singular Value Decomposition for 3D Visualization of High Dimensional Data” (2017)
For a good introductory explanation of the problem of high-dimensional data, see Tony Yiu, “The Curse of Dimensionality” (2019).
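
The problem Yiu explains can be glimpsed in a few lines of code. The following is a minimal Python sketch (not taken from Yiu’s article; the point counts and dimensions are arbitrary): it scatters random points in a unit hypercube and shows that, as the number of dimensions grows, the nearest and farthest neighbors of any point end up at nearly the same distance, which is why high-dimensional data resists the clustering and visualization intuitions that work in two or three dimensions.

    # Minimal illustration of the "curse of dimensionality" (standard library only).
    # As the number of dimensions grows, the ratio of the smallest to the largest
    # pairwise distance approaches 1: every point looks about equally far from
    # every other point.
    import math
    import random

    random.seed(0)

    def distance_ratio(num_points, dims):
        """Min/max pairwise Euclidean distance among random points in [0, 1]^dims."""
        points = [[random.random() for _ in range(dims)] for _ in range(num_points)]
        dists = []
        for i in range(num_points):
            for j in range(i + 1, num_points):
                d = math.sqrt(sum((a - b) ** 2 for a, b in zip(points[i], points[j])))
                dists.append(d)
        return min(dists) / max(dists)

    for dims in (2, 10, 100, 1000):
        print(f"{dims:>4} dimensions: min/max distance ratio = {distance_ratio(100, dims):.3f}")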

Daniel Rosenberg, “Data before the Fact”

There are important distinctions here: facts are ontological, evidence is epistemological, data is rhetorical. A datum may also be a fact, just as a fact may be evidence. But, from its first vernacular formulation, the existence of a datum has been independent of any consideration of corresponding ontological truth. When a fact is proven false, it ceases to be a fact. False data is data nonetheless. (18)

It is tempting to want to give data an essence, to define what exact kind of fact data is. But this misses the most important aspect of the term, and it obscures why the term became so useful in the mid-twentieth century. Data has no truth. Even today, when we speak of data, we make no assumptions at all about veracity. Electronic data, like the data of the early modern period, is given. It may be that the data we collect and transmit has no relation to truth or reality whatsoever beyond the reality that data helps us to construct. (37)

❧

Matthew L. Jones, “How We Became Instrumentalists (Again): Data Positivism since World War II” (2018) (pp. 674, 675)

Quotation from Matthew L. Jones, “How We Became Instrumentalists (Again): Data Positivism since World War II” (2018), p. 674

Quotation from Matthew L. Jones, “How We Became Instrumentalists (Again): Data Positivism since World War II” (2018), p. 675

❧

Lawtomated, "Explainable AI" (2020)

Mess

Matthew L. Jones, “Querying the Archive: Data Mining from Apriori to PageRank,” p. 314

Data mining concerns databases of very large size—millions or billions of records, usually with elements of high dimensionality (meaning that every record typically comprises a large number of elements). For each record in a retail database, a data mining operation might seek unexpected relationships among the item purchased, the store’s zip code, the purchaser’s zip code, variety of credit card, time of day, date of birth, other items purchased at the same time, even every item viewed, or the history of every previous item purchased or returned. Performing reasonably fast analyses of high dimensional, messy real-world data is central to the identity and purpose of data mining, in contrast to its predecessor fields such as statistics and machine learning….

❧

Something else is needed, something less pure—because it deals with vast impurities of dynamic data, nearly always from a particular business, governmental, or scientific research goal.
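
The kind of pattern-hunting Jones describes, combing transaction records for unexpected relationships among items, zip codes, and dates, can be miniaturized for illustration. The sketch below is a Python toy, not the Apriori algorithm of Jones’s title: it simply counts how often pairs of items occur together in a handful of invented “market baskets” and keeps the pairs above a minimum support threshold, which is the core intuition that association-rule mining scales up to millions of messy, high-dimensional records.

    # Toy sketch of association-style data mining: count co-occurring item pairs
    # across transactions and keep the pairs above a minimum "support" threshold.
    # (Illustrative only; Apriori and its successors prune the search space so
    # that this idea can scale to millions of high-dimensional records.)
    from collections import Counter
    from itertools import combinations

    transactions = [                      # invented retail "market baskets"
        {"milk", "bread", "eggs"},
        {"milk", "bread"},
        {"bread", "butter"},
        {"milk", "bread", "butter"},
        {"milk", "eggs"},
    ]
    min_support = 2                       # a pair must appear in at least 2 baskets

    pair_counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    for (a, b), n in pair_counts.most_common():
        if n >= min_support:
            print(f"{a} + {b}: bought together in {n} of {len(transactions)} baskets")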

Mary Douglas, Purity and Danger (2002)

Geoffrey Bowker and Susan Leigh Star, Sorting Things Out: Classification and its Consequences

Text Structures & Models

Digital Data Structures & Models (selected)

  • Markup (see the markup sketch after this list)
  • Relational database (“table,” “record”): Example of relational database tables (from Alan Liu, Local Transcendence, chap. 9); see the SQLite sketch after this list
  • Objects (see the dictionary-and-class sketch after this list)
    • Key-value pairs (and dictionaries or arrays)
    • Associated methods
  • Linked Data (see the triple sketch after this list)
    • RDF (Resource Description Framework)
    • Provenance (PROV): Figure from Luc Moreau and Paul Groth, Provenance: An Introduction to PROV (2013), p. 18
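
To make the structures above concrete, the four Python sketches that follow give one toy example apiece; all data, names, and URIs in them are invented for illustration. First, markup: a small XML fragment (XML being the format of the Wikipedia dumps quoted earlier, and the basis of TEI text encoding) parsed into a tree of elements and attributes with Python’s standard library.

    # Markup: a tiny XML fragment parsed into a tree with the standard library.
    # The encoding scheme (element and attribute names) is invented for illustration.
    import xml.etree.ElementTree as ET

    xml_text = """
    <poem author="John Keats" year="1819">
      <title>Ode on a Grecian Urn</title>
      <line n="49">Beauty is truth, truth beauty,&#8212;that is all</line>
      <line n="50">Ye know on earth, and all ye need to know.</line>
    </poem>
    """

    root = ET.fromstring(xml_text)
    print(root.get("author"), root.get("year"))   # attributes of the root element
    for line in root.findall("line"):             # child elements tagged <line>
        print(line.get("n"), line.text)
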
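Second, the relational database: data held as records (rows) in tables and recombined by queries that join tables on shared keys. A minimal sketch using the SQLite engine bundled with Python; the two-table schema is a made-up example, not the one in Liu’s chapter.

    # Relational model: tables ("relations") of records, queried and joined with SQL.
    # Uses the in-memory SQLite engine bundled with Python; the schema is invented.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE works   (id INTEGER PRIMARY KEY, title TEXT, year INTEGER,
                              author_id INTEGER REFERENCES authors(id));
    """)
    conn.execute("INSERT INTO authors VALUES (1, 'John Keats')")
    conn.execute("INSERT INTO works VALUES (1, 'Ode on a Grecian Urn', 1819, 1)")

    # A join recombines records split across tables by way of their shared keys.
    query = """
        SELECT authors.name, works.title, works.year
        FROM works JOIN authors ON works.author_id = authors.id
    """
    for row in conn.execute(query):
        print(row)    # ('John Keats', 'Ode on a Grecian Urn', 1819)
    conn.close()
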
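Third, objects: key-value pairs (Python dictionaries, the same shape as JSON objects), with values that may themselves be arrays, plus, in object-oriented terms, data bundled together with its associated methods. The record below is invented.

    # Key-value pairs: a dictionary maps named keys to values (cf. a JSON object).
    record = {
        "title": "Ode on a Grecian Urn",
        "author": "John Keats",
        "year": 1819,
        "keywords": ["beauty", "truth", "urn"],   # a value can itself be an array
    }
    print(record["author"], record["year"])

    # An object couples such data with "associated methods" that act on it.
    class Poem:
        def __init__(self, title, author, year):
            self.title = title     # attributes: the object's own key-value data
            self.author = author
            self.year = year

        def citation(self):        # method: behavior attached to the data
            return f'{self.author}, "{self.title}" ({self.year})'

    poem = Poem(record["title"], record["author"], record["year"])
    print(poem.citation())
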
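Fourth, Linked Data: RDF states knowledge as subject-predicate-object triples whose terms are, ideally, web URIs, and PROV expresses provenance (which resource was derived from which, by whom) in the same triple form. The sketch imitates this with plain Python tuples rather than an RDF library such as rdflib, and every URI is an invented example.org placeholder.

    # Linked Data: RDF statements are subject-predicate-object triples whose terms
    # are URIs (or literal values). Modeled here with plain tuples for illustration;
    # a real project would use an RDF library (e.g., rdflib) and standard
    # vocabularies. All URIs below are invented example.org placeholders.
    EX = "http://example.org/"

    triples = [
        (EX + "ode_on_a_grecian_urn", EX + "hasAuthor", EX + "john_keats"),
        (EX + "ode_on_a_grecian_urn", EX + "hasTitle", "Ode on a Grecian Urn"),
        (EX + "john_keats", EX + "hasName", "John Keats"),
        # A PROV-style provenance statement: one resource was derived from another.
        (EX + "plain_text_edition", EX + "wasDerivedFrom", EX + "ode_on_a_grecian_urn"),
    ]

    def objects_of(subject, predicate):
        """Query the toy graph for every object matching a subject and predicate."""
        return [o for s, p, o in triples if s == subject and p == predicate]

    print(objects_of(EX + "ode_on_a_grecian_urn", EX + "hasAuthor"))
    print(objects_of(EX + "plain_text_edition", EX + "wasDerivedFrom"))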

Examples from Wikipedia article on "Data Structures"
Examples from Wikipedia article on “Data Structures”