- Questions and issues about the course?
- Early thoughts about a research project and topic? (see Assignments)
- Readings for next class
Plan for class:
- Discussion of data science programs
- Discussion of essays by David Donoho and Os Keyes
What is “Data Science”? — An Inductive Exercise
Browse the following sites (and inductively try to answer the question, “what is data science?”):
- UC Berkeley Division of Data Science and Information
- Carnegie Mellon U. data science graduate programs
- NYU Center for Data Science
- Northwestern U. Master’s in Data Science
- UCSB Data Science Initiative
Data Science and the Liberal Arts
- “Data science is the new language of the 21st century and has become a cornerstone of a liberal arts education. Data science skills are also increasingly a requirement for students entering the workforce, government, or research after graduation. As more and more academic disciplines, industries, and media outlets rely on data-driven decision making, research, and evidence, being a sophisticated consumer of data and visualizations, as well as being empowered to analyze and generate discoveries, is naturally becoming a prerequisite for being a global citizen, scientist, and leader.” (NYU)
- “The Division launched in July 2019 to leverage Berkeley’s preeminence in research and excellence across disciplines to propel data science discovery, education, and impact.
Faculty and students from across campus helped shape the vision and foundations for this innovative Division, which mirrors the cross-cutting nature of data science and redefines the research university for the digital age. It’s designed to meet the opportunities and demands of a world increasingly informed and shaped by data, machine learning, and artificial intelligence in virtually every arena, from health to business to politics; from our cities to our climate to the cosmos.” (UC Berkeley)
- Educational tranche?
- Level of data science in higher-ed?
Some Explicit Definitions of Data Science
- “Data science combines computational and inferential reasoning to draw conclusions based on data about some aspect of the real world.” (UC Berkeley)
- “The [NYU Center for Data Science] PhD program model rigorously trains data scientists of the future who (1) develop methodology and harness statistical tools to find answers to questions that transcend the boundaries of traditional academic disciplines; (2) clearly communicate to extract crisp questions from big, heterogeneous, uncertain data; (3) effectively translate fundamental research insights into data science practice in the sciences, medicine, industry, and government; and (4) are aware of the ethical implications of their work.” (NYU)
- “Data Science describes a broad range of theories, algorithms, and tools that lead to a better understanding and predictive modeling of the world around us. It is an interdisciplinary field of scientific methods, processes, and systems to extract knowledge or insights from data, structured or unstructured. It employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics, and computer science; equally important are the application areas that provide the context and domain-specific principles with which data science can have an impact and reach its potential.” (UCSB)
- “Fortunately, there is a solid case for some entity called “data science” to be created, which would be a true science: facing essential questions of a lasting nature and using scientifically rigorous techniques to attack those questions.” (Donoho)
- “We live in the ‘Age of the Petabyte,’ soon to become ‘The Age of the Exabyte.’ Our networked world is generating a deluge of data that no human, or group of humans, can process fast enough.” (NYU)
- “Theories and algorithms from statistics and machine learning for making predictions from data that are heterogeneous, multimodal, and multiscalar.” (UCSB)
What is “Data Science”? — continued
David Donoho, “50 Years of Data Science” (2017)
“The would-be notion takes data science as the science of learning from data, with all that this entails.”
Breiman says that users of data split into two cultures, based on their primary allegiance to one or the other of these goals:
- Prediction: to be able to predict what the responses are going to be to future input variables;
- Inference: to infer how nature is associating the response variables to the input variables.
The “generative modeling” culture seeks to develop stochastic models which fit the data, and then make inferences about the data-generating mechanism based on the structure of those models. Implicit in their viewpoint is the notion that there is a true model generating the data, and often a truly “best” way to analyze the data. Breiman thought that this culture encompassed 98% of all academic statisticians.
The “predictive modeling” culture prioritizes prediction and is estimated by Breiman to encompass 2% of academic statisticians—including Breiman—but also many computer scientists and, as the discussion of his article shows, important industrial statisticians.
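The contrast between the two cultures can be made concrete with a toy example (a sketch of my own; the data and both models are illustrative, not from Breiman's article):

```python
# Breiman's two cultures on one toy dataset.
# x: input variable, y: response.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
n = len(xs)

# Generative-modeling culture: fit a stochastic model (here, a
# least-squares line y = a*x + b) and *infer* how nature links y to x
# by inspecting the fitted parameters themselves.
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx
print(f"inferred slope: {a:.2f}")  # the scientific conclusion is the parameter

# Predictive-modeling culture: any rule is acceptable if it *predicts*
# well; here, a 1-nearest-neighbor rule judged only by its outputs.
def predict(x_new):
    nearest = min(range(n), key=lambda i: abs(xs[i] - x_new))
    return ys[nearest]

print(predict(3.1))  # prediction for a new input, no model of "nature" implied
```

The generative analyst reports the slope as a claim about the data-generating mechanism; the predictive analyst would swap in any rule, however opaque, that scores better on held-out data.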
To my mind, the crucial but unappreciated methodology driving predictive modeling’s success is what computational linguist Mark Liberman has called the Common Task Framework (CTF). An instance of the CTF has these ingredients:
(a) A publicly available training dataset involving, for each observation, a list of (possibly many) feature measurements, and a class label for that observation.
(b) A set of enrolled competitors whose common task is to infer a class prediction rule from the training data.
(c) A scoring referee, to which competitors can submit their prediction rule. The referee runs the prediction rule against a testing dataset, which is sequestered behind a Chinese wall. The referee objectively and automatically reports the score (prediction accuracy) achieved by the submitted rule.
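The three CTF ingredients can be sketched as a minimal scoring referee (a hypothetical illustration; all names and data here are my own, not from Donoho's article):

```python
from collections import Counter

def score(prediction_rule, test_features, test_labels):
    """Ingredient (c): run a submitted rule against the sequestered
    test set and automatically report prediction accuracy."""
    predictions = [prediction_rule(x) for x in test_features]
    correct = sum(p == y for p, y in zip(predictions, test_labels))
    return correct / len(test_labels)

# Ingredient (a): a public training set of feature lists with class labels.
train = [([0, 1], "spam"), ([1, 0], "ham"), ([0, 2], "spam")]

# Ingredient (b): a competitor infers a prediction rule from the
# training data -- here, a trivial majority-class baseline.
majority = Counter(label for _, label in train).most_common(1)[0][0]
rule = lambda features: majority

# The sequestered test set, seen only by the referee.
test_x = [[0, 3], [2, 0]]
test_y = ["spam", "ham"]

print(score(rule, test_x, test_y))  # majority class "spam" gets 1 of 2 right
```

Because the referee is automatic and the test set is hidden, competitors can only improve their score by genuinely improving the rule, which is what makes the framework so effective.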
Quantitative programming environments run ‘scripts,’ which codify precisely the steps of a computation, describing them at a much higher and more abstract level than in traditional computer languages like C++. Such scripts are often today called workflows.
So the skills required to work within a CTF became very specific and very teachable—can we download and productively tweak a set of scripts?
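A workflow script in this sense might look like the following sketch (an invented example; the stages and data are hypothetical):

```python
# A data-analysis "workflow" script: the steps of the computation are
# codified precisely as named stages, at a level abstract enough that a
# student could download the script and productively tweak any stage.

RAW = [("a", "3"), ("b", ""), ("c", "5"), ("d", "4")]

def clean(records):
    """Stage 1: drop records with missing measurements."""
    return [(key, float(value)) for key, value in records if value != ""]

def analyze(records):
    """Stage 2: compute the mean measurement."""
    values = [value for _, value in records]
    return sum(values) / len(values)

# The workflow itself reads as a sequence of tweakable steps.
data = clean(RAW)
print(analyze(data))  # mean of 3, 5, 4 -> 4.0
```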
Data scientists are doing science about data science when they identify commonly occurring analysis/processing workflows, for example, using data about their frequency of occurrence in some scholarly or business domain; when they measure the effectiveness of standard workflows in terms of the human time, the computing resource, the analysis validity, or other performance metric; and when they uncover emergent phenomena in data analysis, for example, new patterns arising in data analysis workflows, or disturbing artifacts in published analysis results.
Os Keyes, “Counting the Countless” (2019)
So: trans existences are built around fluidity, contextuality, and autonomy, and administrative systems are fundamentally opposed to that. Attempts to negotiate and compromise with those systems (and the state that oversees them) tend to just legitimize the state, while leaving the most vulnerable among us out in the cold. This is important to keep in mind as we veer toward data science, because in many respects data science can be seen as an extension of those administrative logics: It’s gussied-up statistics, after all — the “science of the state.”
The ideal data-science system is one that’s optimized to capture and consume as much of the world — and as much of your life — as possible. Why? Because of the final part of that data science definition: for the purpose of decision-making.
So perhaps a more accurate definition of data science would be: The inhumane reduction of humanity down to what can be counted.
So can we reform these systems? Tinker with the variables, the accountability mechanisms, make them humane? I’d argue no. With administrative violence, Spade notes how “reform” often benefits only the least marginalized while legitimizing the system and giving cover for it to continue its violence.
For me, my ethics of care says that we should be working for a radical data science: a data science that is not controlling, eliminationist, assimilatory. A data science premised on enabling autonomous control of data, on enabling plural ways of being. A data science that preserves context and does not punish those who do not participate in the system.
How we get there is a thing I’m still working out. But what you can do right now is build counterpower: alternate ways of being, living, and knowing….