- Responses to your research proposals
- Readings for next class
Gallery (A Short Visual Exhibit Serving as Counterpoint for this Class)
Minimalism — Yves Klein
IKB 191 (1962)
Le Vide (The Void)
displayed at the Iris Clert Gallery (1958)
V V V V
[The Gartner Report, 2001] proposed a threefold definition encompassing the “three Vs”: Volume, Velocity, Variety…. This definition has since been … expanded upon by … others to include a fourth V: Veracity.
Intel is one of the few organisations to provide concrete figures in their literature. Intel links big data to organisations “generating a median of 300 terabytes (TB) of data weekly.”
Estimates are that the big four [Google, Amazon, Microsoft and Facebook] store at least 1,200 petabytes between them. That is 1.2 million terabytes (one terabyte is 1,000 gigabytes). And that figure excludes other big providers like Dropbox, Barracuda and SugarSync, to say nothing of massive servers in industry and academia.
Locating computers owned by HFT [High-Frequency Trading] firms and proprietary traders in the same premises where an exchange’s computer servers are housed … enables HFT firms to access stock prices a split second before the rest of the investing public. Co-location has become a lucrative business for exchanges, which charge HFT firms millions of dollars for the privilege of “low latency access.”
The Barnum and Bailey greatest show on earth
A glance at the great ethnological congress and curious led animals … (circa 1895)
While the previously mentioned definitions rely upon a combination of size, complexity, and technology, a less common definition relies purely upon complexity. The Method for an Integrated Knowledge Environment (MIKE2.0) project, frequently cited in the open source community, introduces a potentially contradictory idea: “Big Data can be very small and not all large datasets are big.”
In addition to analyzing massive volumes of data, Big Data Analytics poses other unique challenges for machine learning and data analysis, including format variation of the raw data, fast-moving streaming data, trustworthiness of the data analysis, highly distributed input sources, noisy and poor quality data, high dimensionality, scalability of algorithms, imbalanced input data, unsupervised and un-categorized data, limited supervised/labeled data, etc.
Data mining concerns databases of very large size—millions or billions of records, usually with elements of high dimensionality (meaning that every record typically comprises a large number of elements). For each record in a retail database, a data mining operation might seek unexpected relationships among the item purchased, the store’s zip code, the purchaser’s zip code, variety of credit card, time of day, date of birth, other items purchased at the same time, even every item viewed, or the history of every previous item purchased or returned. Performing reasonably fast analyses of high dimensional, messy real-world data is central to the identity and purpose of data mining, in contrast to its predecessor fields such as statistics and machine learning.
Something else is needed, something less pure—because it deals with vast impurities of dynamic data, nearly always from a particular business, governmental, or scientific research goal.
- Data science ways of dealing with mess: cleaning, munging, wrangling (e.g., OpenRefine)
- Humanities ways of thinking about mess?
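One data-science way of dealing with mess, OpenRefine's well-known “fingerprint” clustering, can be sketched in a few lines of plain Python. The messy place-name variants below are invented for illustration; the point is only that superficially different strings collapse to one normalized key:

```python
from collections import defaultdict

def fingerprint(value):
    """Normalize a messy string the way OpenRefine's fingerprint
    keying does: lowercase, strip punctuation, deduplicate and
    sort the tokens, then rejoin them."""
    cleaned = "".join(c for c in value.lower() if c.isalnum() or c.isspace())
    return " ".join(sorted(set(cleaned.split())))

# Invented examples of the kind of mess a data wrangler sees
messy = ["New York", "new york ", "York, New", "NEW  YORK"]

clusters = defaultdict(list)
for v in messy:
    clusters[fingerprint(v)].append(v)

# All four variants collapse to the single key "new york"
```

OpenRefine offers several such keying methods; the fingerprint is the simplest and catches differences in case, punctuation, whitespace, and word order.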
“… what characterizes the newer “intensities” of the postmodern, which have also been characterized in terms of the “bad trip” and of schizophrenic submersion, can just as well be formulated in terms of the messiness of a dispersed existence, existential messiness, the perpetual temporal distraction of post-sixties life.” (Postmodernism, 117)
In classifying and indexing samples of ice, rock, soil, and sediment, we acknowledge the Earth as a vast geo-informatic construct. It is both geology and data, ontology and epistemology. Yet unlike many Big Data operations, which live in the Cloud, this “Linked Earth” is also resolutely material—muddy, icy, soggy.
Yet those data and documents are not structured uniformly. Species collections, core samples, and medieval manuscripts can all help researchers understand the changing climate, but they are subject to widely varying protocols of collection, preservation, and access.
- Hadoop, MapReduce (Map, Reduce)
- Marvin Minsky (example of stickleback fish)
- Deep Learning (neural network)
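The map and reduce phases behind Hadoop-style word counting can be sketched in plain Python, no cluster required. The two tiny documents are invented; a real job would distribute the same logic across many machines:

```python
from itertools import groupby

def map_fn(document):
    # Map phase: emit a (word, 1) pair for every word in the document.
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce phase: combine all counts emitted for one word.
    return sum(counts)

documents = ["big data big deal", "data about data"]

# Shuffle/sort: gather all mapped pairs and sort them by key,
# which is what the framework does between the two phases.
pairs = sorted(p for doc in documents for p in map_fn(doc))

word_counts = {
    word: reduce_fn(word, [n for _, n in group])
    for word, group in groupby(pairs, key=lambda kv: kv[0])
}
# word_counts == {'about': 1, 'big': 2, 'data': 3, 'deal': 1}
```

Because each mapper sees only its own document and each reducer sees only one word's counts, both phases parallelize trivially; that independence is what lets Hadoop scale the pattern to petabytes.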
Deep learning algorithms are in fact deep architectures of consecutive layers. Each layer applies a nonlinear transformation to its input and provides a representation as its output. The objective is to learn a complicated and abstract representation of the data in a hierarchical manner by passing the data through multiple transformation layers. The sensory data (for example, pixels in an image) is fed to the first layer; the output of each layer is then provided as input to the next layer.
For example, when face images are provided to a deep learning algorithm, the first layer can learn edges in different orientations; in the second layer it composes these edges to learn more complex features, such as the different parts of a face (lips, noses, and eyes); in the third layer it composes those features to learn still more complex features, such as the face shapes of different people. These final representations can then be used as features in face recognition applications.
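The layer-by-layer transformation just described can be sketched with NumPy. The weights below are random rather than learned, so this only illustrates how a representation is passed through successive nonlinear layers, not actual feature learning; the image size and layer widths are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def layer(x, weights, bias):
    # One layer: an affine transform followed by a ReLU nonlinearity.
    return np.maximum(0.0, x @ weights + bias)

# "Sensory" input: one flattened 8x8 grayscale image (64 pixels, invented).
x = rng.random((1, 64))

# Two stacked layers; in a trained network the first might respond to
# edges and the second to compositions of edges (parts of a face).
W1, b1 = rng.normal(size=(64, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 16)), np.zeros(16)

h1 = layer(x, W1, b1)    # first-layer representation of the pixels
h2 = layer(h1, W2, b2)   # second-layer representation, fed from h1
```

Training would adjust `W1`, `b1`, `W2`, `b2` so that each layer's output becomes a useful feature for the task; the flow of data, each layer consuming the previous layer's output, is exactly as the passage describes.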