Computer Vision as a Public Act: On Digital Humanities and Algocracy

by Jentery Sayers

Computer vision is generally associated with the programmatic description and reconstruction of the physical world in digital form (Szeliski 2010: 3-10). It helps people construct and express visual patterns in data, such as patterns in image, video, and text repositories. The processes involved in this recognition are incredibly tedious, hence tendencies to automate them with algorithms. They are also increasingly common in everyday life, expanding the role of algorithms in the reproduction of culture. From the perspective of economic sociology, A. Aneesh links such expansion to “a new kind of power” and governance, which he refers to as “algocracy—rule of the algorithm, or rule of the code” (Aneesh 2006: 5). Here, the programmatic treatment of the physical world in digital form is so significantly embedded in infrastructures that algorithms tacitly shape behaviors and prosaically assert authority in tandem with existing bureaucracies. Routine decisions are delegated (knowingly or not) to computational procedures that—echoing the work of Alexander Galloway (2001), Wendy Chun (2011), and many others in media studies—run in the background as protocols or default settings. For the purposes of this MLA panel, I am specifically interested in how humanities researchers may not only interpret computer vision as a public act but also intervene in it through a sort of “critical technical practice” (Agre 1997: 155) advocated by digital humanities scholars such as Tara McPherson (2012) and Alan Liu (2012). In the case of computer vision, such critical technical practice may begin by relinquishing any assumptions that computer vision can be fully understood as, or simply reduced to, “source code.” Then, without eagerly adopting new technologies, it might proceed to ask whether (if ever) and how computer vision should be used for humanities research.

According to various accounts, computer vision research began as early as 1966, during the “Summer Vision Project,” when Marvin Minsky, Seymour Papert, Gerald Jay Sussman, and others in the Artificial Intelligence Group (AIG) at the Massachusetts Institute of Technology (MIT) investigated how to use figure-ground analysis to automatically divide pictures into regions based on surface and shape properties. This region description would act as the basis for object identification, where items in pictures were recognized and named by machines with controlled vocabularies of known objects (Papert 1966; Crevier 1993; Boden 2006). Cameras were attached to computers in order to achieve this automated description and identification, with an express desire to eventually construct a pattern analysis system that would combine “maximal flexibility” with “heuristic rules” (Papert 1966: 6).

Although computer vision has developed significantly since the 1960s and ’70s, the AIG’s Summer Vision Project marks a notable transition in media history, a moment when researchers began integrating image processing into the development of artificial intelligence, including the training of computers to read, detect, and describe aspects of pictures and visual environments (Szeliski 2010: 7-10). During the project, AIG researchers also started asking computer vision questions that, if only tacitly, have persisted: How does computer vision differ from human vision? To what degree should computer vision be modeled on human phenomenology, and to what effects? Can computer or human vision even be modeled? That is, can either even be generalized? Where and when do issues of processing and memory matter most for recognition and description? And how should computer vision handle ambiguity? (Minsky 1974). These questions are at once technical and ideological, as are many responses to them, meaning computer vision (then or now) should not be extracted from the contexts of its conceptual and material development.

Today, computer vision has moved, at least in part, from laboratories into consumer technologies. One popular application is answering the question, “Is this you?” or “Is this them?” iPhoto, Facebook, and Kinect users are intimately familiar with this application, where face detection algorithms analyze patterns to calculate a core or integral image within an image, assess differences across a spectrum of light, and view images across scales. In the open source community, many practitioners combine the Open Source Computer Vision (OpenCV) library with the Python, C++, and Java programming languages to perform this “detection” work. These scripts rely on frameworks to train classifiers to detect “objects” (which, in the language of vision science, include faces, bodies, and body parts) in images based on cascades of features. To see faces while algorithmically scanning images, OpenCV uses the widely popular Viola-Jones object detection framework that relies on “Haar-like” image features for cascading (Viola and Jones 2004). Similar to iPhoto and other image management applications, OpenCV can be used to “identify,” often with errors and omissions, the same face across a distribution, even when multiple faces appear in the same image. Here, I surround terms such as detection and identify with quotation marks because, while this is the language used in vision science, computer vision actively helps produce the patterns it ostensibly discovers in bodies and environments. That is, in order to recognize patterns, it must first perform a translation or remediation into data.
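To make that translation into data concrete, here is a minimal sketch of the two building blocks named above: an integral image (a summed-area table) and one two-rectangle “Haar-like” feature. It is not OpenCV's implementation. The toy 4x4 grid, the function names, and the left-minus-right sign convention are assumptions made for illustration.

```python
def integral_image(img):
    """Return a summed-area table padded with one extra row and column of zeros."""
    height, width = len(img), len(img[0])
    table = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += img[y][x]
            # Sum of everything above and to the left of (y, x), inclusive.
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table


def rect_sum(table, top, left, height, width):
    """Sum the pixels in a rectangle using only four table lookups."""
    bottom, right = top + height, left + width
    return (table[bottom][right] - table[top][right]
            - table[bottom][left] + table[top][left])


def two_rect_feature(table, top, left, height, width):
    """Left half minus right half (sign convention assumed for this sketch)."""
    half = width // 2
    left_sum = rect_sum(table, top, left, height, half)
    right_sum = rect_sum(table, top, left + half, height, half)
    return left_sum - right_sum


# Hypothetical grayscale "image": a bright left half beside a dark right half.
toy = [
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [9, 9, 1, 1],
]
table = integral_image(toy)
edge_response = two_rect_feature(table, 0, 0, 4, 4)
```

The feature responds strongly wherever brightness differs between adjacent regions. A cascade chains many such comparisons, so what gets “detected” is a pattern of numerical contrasts rather than a face as such.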

Even more important, to write computer vision scripts, programmers do not even need to know—or, to be clear, even understand—the particulars of Haar cascades or Viola-Jones. Their scripts simply call and deploy “trusted” cascades (e.g., “Frontal Face,” “Human Body,” “Wall Clock,” and “Mouth”) stored in XML files readily available across the web. Once a computer vision script detects objects in a given cascade, it can extract them from their source files and archive them.

Computer vision techniques may also merge or compare extracted objects with existing objects. Comparisons allow people to confirm or complicate known relationships between objects. For instance, when comparing faces, multiple photos of the same person can train algorithms to recognize an “eigenface” (or “unique face”) generated from the principal components of those photos. Although eigenfaces do not actually exist in any lived, social reality, they are fundamental to the process of face recognition, and datasets with “training face” images for 100+ people per repository are now common online. One of the most popular sets is the Public Figures Face Database (Pubfig), a “real-world face dataset” consisting of 58,797 internet images of 200 “non-cooperative subjects” that were “taken in completely uncontrolled situations” (Columbia University 2010). While this and other face datasets suggest that training faces are central to big data initiatives anchored in computer vision, humanities practitioners have not thoroughly considered the social and cultural implications of treating bodies as big data for vision science. Indeed, much more humanities research is needed in this area, especially as it relates to policing and racial profiling.
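The constructed character of an eigenface is easier to see in a small worked example. The sketch below treats three hypothetical four-pixel “photos” as vectors, averages them, and estimates the leading principal component of what remains. It uses power iteration only to stay dependency-free; face recognition libraries typically rely on an eigendecomposition or singular value decomposition routine instead. All names and pixel values are assumptions made for illustration.

```python
import math


def mean_face(photos):
    """Average the photos pixel by pixel."""
    count = len(photos)
    return [sum(p[i] for p in photos) / count for i in range(len(photos[0]))]


def leading_eigenface(photos, iterations=100):
    """Estimate the first principal component of the photos by power iteration."""
    mean = mean_face(photos)
    centered = [[value - m for value, m in zip(p, mean)] for p in photos]
    # Deterministic, non-uniform starting guess (an arbitrary choice for this sketch).
    component = [float(i + 1) for i in range(len(mean))]
    for _ in range(iterations):
        updated = [0.0] * len(mean)
        for row in centered:
            # Multiply by the scatter matrix without building it explicitly.
            weight = sum(r * c for r, c in zip(row, component))
            for i, r in enumerate(row):
                updated[i] += weight * r
        norm = math.sqrt(sum(u * u for u in updated))
        if norm == 0:
            break  # All photos are identical, so there is no variation to model.
        component = [u / norm for u in updated]
    return mean, component


def project(photo, mean, component):
    """Reduce a photo to a single weight along the eigenface."""
    return sum((value - m) * c for value, m, c in zip(photo, mean, component))


# Hypothetical training "photos" of one person, flattened to four pixel values.
photos = [
    [10, 20, 30, 40],
    [12, 18, 33, 37],
    [8, 22, 27, 43],
]
mean, eigenface = leading_eigenface(photos)
weights = [project(p, mean, eigenface) for p in photos]
```

Neither the mean nor the eigenface needs to correspond to any photo that was actually taken. Both are statistical summaries, and recognition in this sketch amounts to comparing weights along them. The sign of the component is arbitrary.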

It is also important to note that computer vision responses to “Is this you?” or “Is this them?” do not stop at recognition or pattern analysis after the fact. They enable predictive modeling and forecasting. For example, in surveillance and forensics industries, snapshots are extracted from video and stitched together to articulate “people trajectories,” which both archive and anticipate people’s movements over time (Calderara et al. 2009: 13-18). Here, the image processing tradition of photogrammetry is clearly linked with artificial intelligence research. As computer vision stitches together a series of objects, it also learns from them and makes suggestions based on them, possibly in real-time. The programmatic description and reconstruction of the physical world are thus directed at the past as well as the future, only heightening their influence in an algocratic paradigm.

Bundled together, the emergence of these techniques raises not a few questions about how computer vision techniques normalize bodies and environments, including questions about the relevance of computer vision to privacy and social justice issues. At the same time, many developers are researching computer vision applications in a liminal space between standardized and experimental practice, where the consequences remain uncertain or undefined by policy.

To better understand that liminal space, consider the traction computer vision is gaining in the arts, particularly the combination of machine phenomenology with experimental network aesthetics. Matt Jones (2011) suggests that computer vision corresponds with a “sensor-vernacular aesthetic,” or with “optimised, algorithmic sensor sweeps and compression artefacts.” Somewhere between bits and atoms, a sensor-vernacular aesthetic is “an aesthetic born of the grain of seeing and computation,” with David Berry (2012) pointing to a renaissance of 8-bit visuals, the emergence of “pixel fashion,” and, generally speaking, a widespread obsession with seeing like a machine. Think Minecraft and decimated meshes on Thingiverse, or Timo Arnall’s robot-readable world (2012), Martin Backes’s pixelhead (2010), Adam Harvey’s stealth wear (2013), and the machine wanderings of James Bridle’s “New Aesthetic” (2011).

Whatever the label or example, such aesthetics have largely revised the notion of technologies as “extensions of man” to suggest that computer vision now supplants human vision. In this sense, they are at once humanist, non-humanist, and object-oriented aesthetics. They throw the very notion of human perspective into relief, understanding computer vision as withdrawn, as beyond human access, as some sort of algorithmic unconscious. At the same time, they demand consensus about what human perception entails in the first place. They need “human” to operate as a stable (or ahistorical, or normative, or universal) category in order to displace it with a computer’s phenomenology. In the last instance, they are largely reactive in character. Their machine wanderings and robot-readable worlds tend to wonder at machine vision—to suspend “Is this you?” from its social dynamics—without systematically intervening in its cultural functions.

Rather than merely hacking computer vision, repurposing scripts, or fetishizing machine perspectives, maybe the most pressing challenge for humanities studies of computer vision and algocracy is shifting from a tactical reaction to a strategic articulation of vision infrastructures. To be sure, this is no small task, especially for humanities practitioners. From my perspective, it would involve interrogating cascading classifiers for their biases, much in the way Simone Browne (2010) has approached video surveillance, race, and biometrics. After all, numerous examples from software development (e.g., by Flickr, Microsoft, and Hewlett Packard) already point to the racism at work in computer vision algorithms and the interpretation of their results. Critiques of vision infrastructures might also involve reframing computer vision to such a degree that it refuses to establish essentializing, binary ways of seeing (Berger 1972). In other words, amidst the possibilities of using computer vision for oppressive purposes (e.g., its applications for surveillance and racial profiling), we need vision infrastructures that value ironic or ambiguous vision, much like Donna Haraway’s early work (1985/2003) on cyborgs, feminism, and informatics, including—lest that often overlooked section of “A Cyborg Manifesto” be forgotten—her concerns about an informatics of domination. Her concerns there deeply resemble recent concerns about algocracy.

In digital humanities research, we see some steps toward new vision infrastructures (e.g., Bagnall and Sherratt 2011); however, the field has privileged the practical use of optical character recognition (OCR) to digitize, encode, search, and discover texts, with very little research on computer vision as a critical technical practice that entangles aesthetics with politics and big data with bodies. As computer vision proliferates beyond text discovery and analysis into other domains of algorithmic culture, digital humanities practitioners need to ask more questions of computer vision, especially how race is interpreted as an “eigenface” for biometrics and, more generally, how cultural formations are treated instrumentally as data. More specifically, we need significantly more humanities research on how exactly race becomes a principal component of the algorithm training process, and how computer vision may serve the interests of algocratic modeling and measures. While I know digital humanities is often quick to build alternative technologies, among these questions is whether computer vision should be used at all for face recognition in and beyond academic work, especially since its use is often (perhaps always?) a public act involving, for example, “non-cooperative subjects” and “completely uncontrolled situations.” Even in vision science publications, computer vision issues of race and algocracy remain widely neglected, or perhaps purposefully ignored, yet they clearly apply to archival research, public policy, and the everyday spaces (e.g., airports, social networks, parks, and games) people routinely inhabit. Without this research, we risk not only parsing human from computer vision (as has been the case since the 1960s) but also delegating responsibility for questions of justice to the rule of code.

I would like to thank Dorothy Kim and Jesse Stommel for inviting me to participate in this panel.

Works Cited
Agre, Philip E. “Toward a Critical Technical Practice: Lessons Learned in Trying to Reform AI.” Social Science, Technical Systems, and Cooperative Work: Beyond the Great Divide. Eds. Geoffrey Bowker, et al. New York: Erlbaum, 1997.

Aneesh, A. Virtual Migration: The Programming of Globalization. Durham, NC: Duke University Press, 2006.

Arnall, Timo. Robot Readable World. 2012.

Backes, Martin. “New Artwork: Pixelhead.” Martin Backes – Official Website. 2010.

Bagnall, Kate and Tim Sherratt. “Invisible Australians: Living under the White Australia Policy.” 2011.

Berger, John. Ways of Seeing. New York: Penguin Books Limited, 1972.

Berry, David. “What Is the ‘New Aesthetic’?” Stunlaw. 2012.

Boden, Margaret. Mind As Machine: A History of Cognitive Science. New York: Oxford University Press, 2006.

Bridle, James. “The New Aesthetic: Waving at the Machines.” Booktwo. 2011.

Browne, Simone. “Digital Epidermalization: Race, Identity and Biometrics.” Critical Sociology 36.1 (2010): 131–50.

Calderara, Simone, Andrea Prati, and Rita Cucchiara. “Video Surveillance and Multimedia Forensics: An Application to Trajectory Analysis.” Proceedings of the First ACM Workshop on Multimedia in Forensics, MiFor ’09. New York (2009): 13–18.

Chun, Wendy Hui Kyong. Programmed Visions: Software and Memory. Cambridge, MA: MIT Press, 2011.

Columbia University. “Pubfig: Public Figures Face Database.” 2010.

Crevier, Daniel. AI: The Tumultuous History of the Search for Artificial Intelligence. New York: Basic Books, 1993.

Galloway, Alexander. Protocol: How Control Exists after Decentralization. Cambridge, MA: MIT Press, 2001.

Haraway, Donna. The Haraway Reader. 1st ed. New York: Routledge, 2003.

Harvey, Adam. “Stealth Wear.” AH Projects. 2013.

Jones, Matt. “Sensor-Vernacular.” BERG. 2011.

Liu, Alan. “Where Is Cultural Criticism in the Digital Humanities?” Debates in the Digital Humanities. Ed. Matthew K. Gold. Minneapolis: University of Minnesota Press, 2012.

McPherson, Tara. “Why Are the Digital Humanities So White? or Thinking the Histories of Race and Computation.” Debates in the Digital Humanities. Ed. Matthew K. Gold. Minneapolis: University of Minnesota Press, 2012.

Minsky, Marvin. “A Framework for Representing Knowledge.” 1974.

Papert, Seymour. “The Summer Vision Project.” July 1966.

Szeliski, Richard. Computer Vision: Algorithms and Applications. Springer Science & Business Media, 2010.

Viola, Paul, and Michael Jones. “Robust Real-Time Face Detection.” International Journal of Computer Vision 57.2 (2004): 137–154.

Jentery Sayers is Assistant Professor of English and Cultural, Social, and Political Thought, as well as Director of the Maker Lab in the Humanities, at the University of Victoria.

[Image “separationVoronoi4_2697” by flickr user Kyle Macquarrie]