Class 5 (English 238 – Fall 2019)

Class Business

Gallery (A Short Visual Exhibit Serving as Counterpoint for this Class)

Minimalism — Yves Klein

Yves Klein, IKB 191 (1962)
Yves Klein, Le Vide (The Void) displayed at the Iris Clert Gallery (1958)

V V V V

[The Gartner Report, 2001] proposed a three-fold definition encompassing the “three Vs”: Volume, Velocity, Variety…. This definition has since been … expanded upon by … others to include a fourth V: Veracity.

–Jonathan Stuart Ward and Adam Barker, “Undefined By Data: A Survey of Big Data Definitions” (2013)

Volume

Intel is one of the few organisations to provide concrete figures in their literature. Intel links big data to organisations “generating a median of 300 terabytes (TB) of data weekly.”

–Ward and Barker

As of June 2015, the dump of all pages with complete edit history in XML format [of the English Wikipedia] … is about 100 GB compressed using 7-Zip, and 10 TB uncompressed.

Estimates are that the big four [Google, Amazon, Microsoft and Facebook] store at least 1,200 petabytes between them. That is 1.2 million terabytes (one terabyte is 1,000 gigabytes). And that figure excludes other big providers like Dropbox, Barracuda and SugarSync, to say nothing of massive servers in industry and academia.

Velocity

Locating computers owned by HFT [High-Frequency Trading] firms and proprietary traders in the same premises where an exchange’s computer servers are housed … enables HFT firms to access stock prices a split second before the rest of the investing public. Co-location has become a lucrative business for exchanges, which charge HFT firms millions of dollars for the privilege of “low latency access.”

Variety
The Barnum and Bailey greatest show on earth – A glance at the great ethnological congress and curious led animals … (circa 1895)
The Ed Sullivan Show

Veracity
Sister Miriam Joseph, What are the Liberal Arts, Figure 4

Mess

While the previously mentioned definitions rely upon a combination of size, complexity and technology, a less common definition relies purely upon complexity. The Method for an Integrated Knowledge Environment (MIKE2.0) project, frequently cited in the open source community, introduces a potentially contradictory idea: “Big Data can be very small and not all large datasets are big.”

–Ward and Barker

In addition to analyzing massive volumes of data, Big Data Analytics poses other unique challenges for machine learning and data analysis, including format variation of the raw data, fast-moving streaming data, trustworthiness of the data analysis, highly distributed input sources, noisy and poor quality data, high dimensionality, scalability of algorithms, imbalanced input data, unsupervised and un-categorized data, limited supervised/labeled data, etc.

–Najafabadi et al.

Data mining concerns databases of very large size—millions or billions of records, usually with elements of high dimensionality (meaning that every record typically comprises a large number of elements). For each record in a retail database, a data mining operation might seek unexpected relationships among the item purchased, the store’s zip code, the purchaser’s zip code, variety of credit card, time of day, date of birth, other items purchased at the same time, even every item viewed, or the history of every previous item purchased or returned. Performing reasonably fast analyses of high dimensional, messy real-world data is central to the identity and purpose of data mining, in contrast to its predecessor fields such as statistics and machine learning.

Something else is needed, something less pure—because it deals with vast impurities of dynamic data, nearly always from a particular business, governmental, or scientific research goal.

–Matthew L. Jones, “Querying the Archive: Data Mining from Apriori to PageRank,” 312, 314
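
Jones’s retail example can be made concrete with a minimal sketch of the kind of counting an Apriori-style pass performs: tallying how often items co-occur across transactions and turning those tallies into support and confidence figures. The transactions, item names, and threshold below are invented for illustration, not drawn from Jones.

```python
# Toy Apriori-style counting: support and confidence for item pairs over a
# handful of invented retail "transactions". Real data mining runs this kind
# of tally over millions of high-dimensional records.
from itertools import combinations
from collections import Counter

transactions = [
    {"milk", "bread", "diapers"},
    {"bread", "beer"},
    {"milk", "diapers", "beer"},
    {"milk", "bread", "diapers", "beer"},
    {"bread", "diapers"},
]

n = len(transactions)
item_counts = Counter(item for basket in transactions for item in basket)
pair_counts = Counter(pair for basket in transactions
                      for pair in combinations(sorted(basket), 2))

MIN_SUPPORT = 0.4  # arbitrary cutoff, standing in for Apriori's minimum-support pruning
for (a, b), count in pair_counts.items():
    support = count / n                      # how often a and b appear together
    if support >= MIN_SUPPORT:
        confidence = count / item_counts[a]  # of baskets containing a, how many also contain b
        print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```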
  • Data science ways of dealing with mess: cleaning, munging, wrangling (e.g., OpenRefine; see the sketch after this list)
  • Humanities ways of thinking about mess?
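
As a minimal illustration of the first bullet, the sketch below normalizes a few deliberately messy records (stray whitespace, inconsistent case, near-duplicate spellings) using only Python’s standard library. The records and the reconciliation table are invented; OpenRefine performs this kind of faceting and clustering interactively rather than in code.

```python
# Minimal "cleaning / munging / wrangling" pass over invented messy records:
# collapse stray whitespace, normalize case, and fold near-duplicate spellings
# together into a single canonical form.
messy = ["  New York ", "new york", "NEW YORK CITY", "Los Angeles", "los angeles "]

# Hand-written reconciliation table standing in for an interactive clustering/merge step.
aliases = {"new york city": "new york"}

def clean(value: str) -> str:
    normalized = " ".join(value.split()).lower()  # trim and collapse whitespace, lowercase
    return aliases.get(normalized, normalized)    # map known variants onto one form

cleaned = sorted({clean(v) for v in messy})
print(cleaned)  # ['los angeles', 'new york']
```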

| | | |

Dispositif: a thoroughly heterogeneous ensemble consisting of discourses, institutions, architectural forms, regulatory decisions, laws, administrative measures, scientific statements, philosophical, moral and philanthropic propositions–in short, the said as much as the unsaid. Such are the elements of the apparatus. The apparatus itself is the system of relations that can be established between these elements. (Power/Knowledge, 194)
Deleuze and Guattari, A Thousand Plateaus
Gehry House in Santa Monica
“… what characterizes the newer “intensities” of the postmodern, which have also been characterized in terms of the “bad trip” and of schizophrenic submersion, can just as well be formulated in terms of the messiness of a dispersed existence, existential messiness, the perpetual temporal distraction of post-sixties life.” (Postmodernism, 117)

Materiality

In classifying and indexing samples of ice, rock, soil, and sediment, we acknowledge the Earth as a vast geo-informatic construct. It is both geology and data, ontology and epistemology. Yet unlike many Big Data operations, which live in the Cloud, this “Linked Earth” is also resolutely material—muddy, icy, soggy.

Yet those data and documents are not structured uniformly. Species collections, core samples, and medieval manuscripts can all help researchers understand the changing climate, but they are subject to widely varying protocols of collection, preservation, and access.

–Shannon Mattern, “The Big Data of Ice, Rocks, Soils, and Sediments”

Reduction

Deep learning algorithms are actually Deep architectures of consecutive layers. Each layer applies a nonlinear transformation on its input and provides a representation in its output. The objective is to learn a complicated and abstract representation of the data in a hierarchical manner by passing the data through multiple transformation layers. The sensory data (for example pixels in an image) is fed to the first layer. Consequently the output of each layer is provided as input to its next layer.

For example by providing some face images to the Deep Learning algorithm, at the first layer it can learn the edges in different orientations; in the second layer it composes these edges to learn more complex features like different parts of a face such as lips, noses and eyes. In the third layer it composes these features to learn even more complex feature like face shapes of different persons. These final representations can be used as feature in applications of face recognition.

–Najafabadi et al.
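
The layered transformation Najafabadi et al. describe can be sketched in a few lines of NumPy: each layer applies an affine map followed by a nonlinearity, and each layer’s output becomes the next layer’s input. The layer sizes and random (untrained) weights below are arbitrary placeholders, not a face-recognition model.

```python
# Minimal sketch of "consecutive layers, each applying a nonlinear transformation":
# raw input passes through a stack of affine maps + nonlinearities, producing
# progressively more abstract (and here, lower-dimensional) representations.
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [64, 32, 16, 8]  # e.g., 64 "pixels" in, an 8-dimensional representation out

# Random, untrained weights; a real deep network would learn these from data.
weights = [rng.standard_normal((m, n)) * 0.1
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x):
    """Pass input x through every layer, returning each layer's representation."""
    representations = []
    for W in weights:
        x = np.tanh(x @ W)  # nonlinear transformation of the previous layer's output
        representations.append(x)
    return representations

x = rng.standard_normal(64)  # stand-in for sensory data (e.g., image pixels)
for i, rep in enumerate(forward(x), start=1):
    print(f"layer {i}: representation of dimension {rep.shape[0]}")
```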