Introduction

Theorem 9 (latent figures) is a video installation that combines computer science and cinematic art; it can be shown as a 3-channel projection. Every frame in a digitized version of the film Alphaville (Jean-Luc Godard, 1965) is reconstructed using four distinct sets of prototypical frames extracted from that same film.

The following video is an excerpt of the work:

 

The work can be shown as a single-channel video as well:

 

Every set of prototypical images can be understood as a visual dictionary. By appropriately weighting (darkening or brightening) and summing the component images in each dictionary, it is possible to generate an approximate reconstruction of any other frame in the movie.

Every dictionary can also be understood as a system of coordinates that is used to describe other images in the movie. Every image in the dictionary is one axis of coordinates. The weights used to reconstruct any image, given a particular dictionary, comprise the internal representation (coordinates) of that image in the system of coordinates determined by the dictionary.

 

 

Latent figures

The concept of latent figures is fundamental to the artistic aims of this project.

The combination of visual frames in a dictionary produces visual analogies and metaphors. Latent visual connections are extracted from the frames in the source movie. This idea can be best appreciated in this demo clip, which uses only one visual dictionary for simplicity.

The meaning of the word "figure" in this project is closely related to the notion of visual metaphor, theorized by philosopher Noël Carroll.(1) Carroll argues that certain artists use visual media (photography, cinema, painting, sculpture, installation) to create composite constructions that fuse or assemble distinct elements together in space, in a manner that draws on visual or thematic correspondences between those elements. Examples of this technique include the surrealist method of free association and the cubist practice of montage. Alphaville explicitly references these techniques, particularly those of surrealism.

Visual figures are essential to Theorem 9. The elements in the dictionary often carry thematic associations, which are obvious to anyone acquainted with Godard’s film. For instance, a flickering artificial light is associated with the supercomputer Alpha 60 and so with the cybernetic society denounced by the film. Agent Lemmy Caution is associated with certain hardboiled or noir heroes, and also with concepts of freedom, love, and rebelliousness. These associations imply thematic contrasts (cybernetics and control vs. freedom and love). In Theorem 9, the reconstruction algorithm juxtaposes various thematically charged dictionary images into a single reconstruction. Each individual dictionary image remains recognizable even in combination with other images.

The reconstruction makes connections among the various images in the dictionary. It also connects the dictionary elements with the frame being reconstructed. These various connections (“visual figures”) sometimes foreground and potentially confound the thematic contrasts made in the original film. For instance, images of Lemmy Caution (associated with existential freedom) and Natasha (associated with love) blend together with the computer light (associated with authoritarian control and technocratic rationality), thus undermining thematic oppositions established in the film and suggesting the need to reframe our understanding of societies of control. The juxtapositions also foreground and undercut gender structures. The male protagonist is reconstructed using female faces.

The analogies are visually compelling because the elements juxtaposed in any reconstruction tend to mimic the graphical structure of the original image. This connection is accomplished automatically, by the algorithm employed to select and combine dictionary elements. The mathematical techniques employed are directly responsible for the conceptual and perceptual properties of this work (see the technical page of this website).

 

 

_____________

(1) Noël Carroll, Beyond Aesthetics: Philosophical Essays (Cambridge: Cambridge University Press, 2001), pp. 347-368.

Setup

The setup of the work is flexible: it can be shown as either a single-channel or a 3-channel projection.

Video:
file format - MP4
codec - H.264
resolution - Full HD (1920x1080)
framerate - 24 fps

Audio: silent, no audio track

Hardware requirements:
- shown in a completely dark room
- one or three* ceiling mounted projector(s) with Full HD (1920x1080) resolution and minimum 3000 ANSI Lumens
*extra hardware (such as a Matrox TripleHead2Go DP Edition) may be required to synchronize the three projectors

Software requirements:
The file can be played from any computer with, for instance, the VLC media player installed.

 

 

An exhibition setup in Hidden Variables, Writing Machine Collective edition 6, Hong Kong, Sep-Oct 2018.

Technical description

Factorization

A frequently used method to understand a given dataset is to select a set of dominant (representative, prototypical) latent components. Every entry in the dataset is to be approximately reconstructed as a weighted mixture ("linear combination") of those latent components. We will refer to any collection of latent components associated with a specific dataset as a “dictionary”. We will refer to any procedure for reconstructing a dataset in the manner described as a “factorization method”.

In this work, the dataset consists of all the frames in a digitized version of the film Alphaville. The dictionary consists of a proper subset of those frames.

We now introduce some basic notions. A grayscale image consisting of n pixels is a vector in ℝⁿ. Images correspond to points in the unit hypercube [0, 1]ⁿ within ℝⁿ. Any set of k linearly independent images (i.e., any dictionary) spans a k-dimensional subspace of ℝⁿ. A dictionary of k elements is represented as an n × k matrix W. The reconstruction F of any frame f by means of dictionary W can be represented as a vector v of weights or "activation coefficients". The ith element of v represents the weight that the ith frame in W has in F. In other words, F = W v.

A factorization method involves, first, a procedure for constructing W and, second, a procedure for reconstructing any f as a weighted combination of entries in W (i.e., a procedure for computing v).
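The relation F = W v can be made concrete with a small numerical sketch (using NumPy; the shapes and values here are made up purely for illustration):

```python
import numpy as np

# A made-up dictionary of k = 3 tiny "frames" of n = 4 pixels each.
# In the actual work, n is the pixel count of a full video frame.
n, k = 4, 3
rng = np.random.default_rng(0)
W = rng.random((n, k))           # n x k dictionary matrix, entries in [0, 1]

# v holds one non-negative activation weight per dictionary frame.
v = np.array([0.5, 0.2, 0.0])

# The reconstruction F = W v is a weighted sum of the dictionary columns:
# each prototype is darkened or brightened, then the results are summed.
F = W @ v
assert np.allclose(F, 0.5 * W[:, 0] + 0.2 * W[:, 1])
```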

There are many possible factorization methods. Some methods, such as Principal Component Analysis, allow both the latent components (dictionary elements) and the activation weights to contain negative values. But then the latent components and their weights have no physical meaning. Suppose that the value (brightness) of any grayscale pixel is represented as a real number in the interval [0, 1], where 0 represents white and 1 represents black. In this context, the notion of an image with negative values, i.e., with pixels lighter than white, has no physical content. Any such "frame" is a purely formal entity and cannot be easily understood or immediately visualized. Moreover, the application of a negative coefficient to a frame does not correspond to any physical operation. Another problem is that a linear combination of elements in the dictionary does not, in general, lie inside the unit hypercube in ℝⁿ.

The algorithm used in this project is designed to avoid these shortcomings. It decomposes every frame using dictionaries of non-negative components and non-negative activation weights.

 

Procedure

The procedure involves two steps, a dictionary selection step and a frame reconstruction step.

 

1. Dictionary selection

The first step selects a set of images suitable to function as dictionary entries. We wish those entries to be extreme data points, i.e., points that reside on or near the convex hull of the point data set. To accomplish this aim, we use the Simplex Volume Maximization (SiVM) algorithm proposed by Thurau, Kersting and Bauckhage(1)(2).

 

1.1 Terminology

It is sometimes useful to speak about objects of arbitrary dimensions, in order to make statements of a general nature about them. This generalization is often accomplished by identifying kinds of objects of different dimensions that are analogous to one another. For instance, a tetrahedron is the simplest instance of an ordinary polyhedron in three-dimensional space; we can consider a triangle, a segment, and a point as two-, one-, and zero-dimensional analogues of a tetrahedron. Analogues of a tetrahedron also exist in higher dimensions. These objects are all designated as simplexes.

The concept of a simplex (or hypertetrahedron) generalizes the concept of a tetrahedron to arbitrary dimensions. A (k − 1)-simplex is the convex hull of k affinely independent points.

The standard n-simplex is the simplex whose vertices are the n + 1 vectors in the standard basis for ℝⁿ⁺¹. For instance, the standard 1-simplex is the segment whose vertices are the points (1,0) and (0,1). The standard 2-simplex is the triangle whose vertices are the points (1,0,0), (0,1,0), and (0,0,1).

An important property of a simplex is its volume. While “volume” in ordinary language describes properties of three-dimensional objects, the same word is used with a more general meaning when speaking of simplexes. The volume of a 1-simplex is its length. The volume of a 2-simplex is its area. The volume of a 3-simplex corresponds to the ordinary notion of volume. Again, higher dimensional analogues of the concept of volume also exist.

It is important to bear in mind that this concept of volume is extremely general, and encompasses a family of different measures, one for each dimension (length, area, etc.). It is nonetheless useful, when formulating a general algorithm, to speak of volume in this abstract sense. This linguistic convention allows us to speak of the volume of a simplex, and by that mean the length of a segment, the area of a triangle, the volume of a tetrahedron, etc., without having to be more specific.
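This general notion of volume is also straightforward to compute. One standard route (a sketch for illustration, not the formula used in the cited SiVM papers) is the Gram-determinant formula: for a k-simplex whose edge vectors form the matrix E, vol = √det(EᵀE) / k!.

```python
import numpy as np
from math import factorial

def simplex_volume(vertices):
    """k-dimensional volume of the simplex spanned by k + 1 vertices in R^n.

    Uses the Gram-determinant formula vol = sqrt(det(E^T E)) / k!, where the
    columns of E are the edge vectors from the first vertex to the others.
    """
    V = np.asarray(vertices, dtype=float)
    E = (V[1:] - V[0]).T                 # n x k matrix of edge vectors
    k = E.shape[1]
    gram = E.T @ E
    return np.sqrt(max(np.linalg.det(gram), 0.0)) / factorial(k)

# The volume of a 1-simplex (a segment) is its length ...
assert np.isclose(simplex_volume([[0, 0], [3, 4]]), 5.0)
# ... and the volume of a 2-simplex (a triangle) is its area.
assert np.isclose(simplex_volume([[0, 0], [1, 0], [0, 1]]), 0.5)
```

The same function handles every dimension at once, which is exactly the convenience the abstract notion of volume affords.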

 

1.2 Motivation

This section motivates the design of the SiVM algorithm.

Any finite set C containing m non-negative vectors in ℝⁿ always lies in a convex polyhedral cone Ω_C, namely, the convex hull of the set of rays whose directions are determined by the vectors in C.

There exists a subset D ⊆ C such that Ω_C is the smallest convex cone containing D.

The cone Ω_C is the set of all non-negative linear combinations of points in C. Points not in Ω_C are approximated by points in Ω_C.

We want to find a subset D ⊆ C of k elements such that we can approximate every point in C by a point in Ω_D.

Given C and k, what is the best choice of D? The answer is the subset of C that maximizes the finite volume of Ω_D, i.e., the volume of Ω_D ∩ Sⁿ⁻¹, where Sⁿ⁻¹ is the unit sphere in ℝⁿ. (The volume of a cone is infinite, so we cannot literally speak of maximizing the volume. Instead, when we speak of maximizing the volume of a cone, we consider the volume of the intersection of the cone with the unit hypersphere.)

The basic idea is best illustrated by taking C to consist of a set of points in ℝ².

We wish to approximate every point in C as a non-negative combination of a subset D ⊆ C consisting of two points (vectors). Suppose we select D as containing points P1 and P2. In that case, we can perfectly approximate point P3 as a non-negative linear combination of P1 and P2.

The problem is that the best approximation of P4 using P1 and P2 will be very far from P4.

Points P1 and P4 make for a superior choice. All other points in C lie within the cone determined by those two points (the shaded area in the diagram below). All can be accurately reconstructed as a non-negative linear combination of P1 and P4. This choice of D maximizes the volume of Ω_D. The selected points can be considered the most representative of the set.

How do we select representative points from a given set? One plausible suggestion (still referring to our example) would be to identify the two points that determine the (nonstandard) 1-simplex with the greatest volume. This suggestion, however, does not work. In the example at hand, the points would be P1 and P3. The simplex here is a segment, and the volume of this simplex is its length. Although the two points determine the simplex with the greatest possible volume (the longest segment joining any pair of points in C), they are not the points that we want. This choice does not correspond to the Ω_D with maximum volume.

We will obtain the result we want if we first pre-process the data, as follows. Define the pullback transformation as the operation of ℓ1-normalizing every vector (dividing every entry in a vector by the ℓ1-norm of that vector). This procedure maps every point associated with a vector in C to a unique point within the standard (n − 1)-simplex. It is illustrated in the following diagrams:

The cone Ω_D projects onto the bounded convex subset F of the standard simplex. It can be shown that, if our data has been so projected, maximizing vol(Ω_D ∩ Sⁿ⁻¹) approximately amounts to maximizing vol(F).

In this example, the two points that determine the simplex with the largest volume are P1' and P4', the projections of P1 and P4. They determine the longest segment of any pair of projected points. These are exactly the points that we wish to select.
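The pullback transformation itself is a one-line operation. A minimal NumPy sketch (column vectors with hypothetical values):

```python
import numpy as np

def pullback(X):
    """l1-normalize every column: divide each entry by the column's l1 norm.

    Maps every non-negative vector to the point where its ray crosses the
    standard simplex (the entries of each projected column sum to 1).
    """
    X = np.asarray(X, dtype=float)
    return X / np.abs(X).sum(axis=0)

# Two points in R^2, one per column.
P = np.array([[2.0, 1.0],
              [2.0, 3.0]])
P_proj = pullback(P)
# Every projected column lies on the standard 1-simplex: entries sum to 1.
assert np.allclose(P_proj.sum(axis=0), 1.0)
```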

Note that, for illustrative purposes, we have considered the situation where we aim to choose two representative points in ℝ². In the actual production of this project, however, we want to choose k representative points in ℝⁿ, where n is the number of pixels in an image, k is the number of elements in the dictionary, and k ≪ n.

 

1.3 SiVM Algorithm

All data is pre-processed using the pullback transformation described in the previous section.

The algorithm involves greedy stochastic hill climbing. It receives as input a database S of vectors (the frames in the chosen movie) and a desired number k of prototypical frames. It outputs a set D containing k prototypical frames. The elements in S reside in (the unit hypercube in) ℝⁿ, a vector space naturally endowed with a distance measure, the Euclidean distance.

The algorithm first initializes D as follows: pick a random frame P1 in S, then find the frame P2 in S that is furthest from P1, and then the frame P3 that is furthest from P2. Select P3 as the first entry in D.

After initialization, the procedure iterates over every vector s_i in S not already in D and computes the volume of the simplex spanned by s_i together with the vectors already in D. The vector s_i that yields the maximum volume is then added to D. This step is repeated until D contains k representative frames.
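The selection loop just described can be sketched as follows. This is a naive illustration that recomputes simplex volumes directly (via a Gram determinant); the published SiVM algorithm of Thurau et al. uses distance-based shortcuts to make the search tractable on large datasets.

```python
import numpy as np
from math import factorial

def simplex_volume(points):
    """Volume of the simplex spanned by the given points (one point per row)."""
    E = (points[1:] - points[0]).T
    gram = E.T @ E
    return np.sqrt(max(np.linalg.det(gram), 0.0)) / factorial(E.shape[1])

def sivm_select(S, k, seed=0):
    """Greedy volume-maximizing selection of k prototype rows of S (a sketch)."""
    rng = np.random.default_rng(seed)
    dist = lambda a, b: np.linalg.norm(S[a] - S[b])
    # Initialization: random frame -> farthest from it -> farthest from that.
    p1 = rng.integers(len(S))
    p2 = int(np.argmax([dist(i, p1) for i in range(len(S))]))
    p3 = int(np.argmax([dist(i, p2) for i in range(len(S))]))
    D = [p3]
    while len(D) < k:
        # Add the candidate that spans the largest simplex with D.
        best, best_vol = None, -1.0
        for i in range(len(S)):
            if i in D:
                continue
            vol = simplex_volume(S[D + [i]])
            if vol > best_vol:
                best, best_vol = i, vol
        D.append(best)
    return D

# On a toy 2-D dataset, the three extreme corners are selected as prototypes
# and the interior point (1, 1) is left out.
S = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [1.0, 1.0]])
assert sorted(sivm_select(S, 3)) == [0, 1, 2]
```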

To speed up execution of the algorithm, we first preselect a proper subset of the frames in Alphaville. This subset consists of all frames that are local peaks in the Entropic Envelope of the movie (a description of the entropic envelope concept can be found here). The preselection is then fed into the SiVM algorithm, which outputs the final selection D.

From this selection, we manually form one or more dictionaries to be used in the next step. Every dictionary matrix W contains a proper subset of the frames in D.

 

 

2. Frame reconstruction

The second step reconstructs every frame in the source movie using the dictionary W identified in the previous step.

The movie is divided into sequences (ordered sets of consecutive frames) of uniform length 𝓁. Each sequence is associated with an n x 𝓁 matrix V whose columns represent the frames in the sequence. V is then reconstructed using a Non-Negative Matrix Factorization (NNMF) algorithm.

NNMF is a family of algorithms that receive an n × m matrix V of non-negative data and approximately decompose it into two matrices, an n × k matrix W and a k × m matrix H, both of which are also non-negative. The matrix W can be interpreted as containing k basis vectors (i.e., a lexicon) and H as containing the mixing coefficients or activation weights on those vectors.

In the above figure, the image represented by the second column in the input matrix V is approximately reconstructed as a linear combination of all columns in W. The reconstruction is accomplished by multiplying every column vector in W by the corresponding entry in the second column of H. The resulting weighted vectors are added together to produce the final reconstruction of the input image.

Most versions of the NNMF algorithm provide ways to compute both the basis matrix W and the activation weight matrix H.

In the present implementation, we do not need to compute W. Every column of W is one of the frames selected in step one. We use the algorithm only to obtain the activation coefficients.

For each matrix V and each dictionary W, we compute a k × 𝓁 coefficient matrix H such that the Frobenius norm || V − WH || is minimized, subject to the constraint that H is non-negative. We use the standard minimization algorithm proposed by Lee and Seung(3)(4), but we fix W and iteratively update only the entries of H. We then obtain the matrix of reconstructed images V′ = WH.
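With W held fixed, the Lee-Seung multiplicative update for H reduces to a single rule, H ← H ⊙ (WᵀV) / (WᵀWH). The following is a minimal sketch of that idea, not the production code of this project; the shapes, initialization, and iteration count are made up for illustration.

```python
import numpy as np

def solve_H(V, W, iters=500, eps=1e-9):
    """Compute non-negative activations H minimizing ||V - WH|| with W fixed.

    Uses the Lee-Seung multiplicative update; only H is updated, so the
    dictionary frames in W are never altered.
    """
    H = np.full((W.shape[1], V.shape[1]), 0.5)   # positive initialization
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)     # eps guards against 0 / 0
    return H

# Toy check: if V is an exact non-negative mixture of W's columns, the
# reconstruction WH should approximate V closely.
rng = np.random.default_rng(1)
W = rng.random((20, 3))
V = W @ rng.random((3, 5))
H = solve_H(V, W)
assert np.all(H >= 0)
assert np.linalg.norm(V - W @ H) < 0.05 * np.linalg.norm(V)
```

Because the update is multiplicative and starts from positive values, H can never become negative, which is what guarantees the non-negativity constraint.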

In the final artwork, we split the dictionary obtained in step one into four distinct bases W1, W2, W3, and W4. We then decompose every frame matrix V using each of the four bases, to obtain four coefficient matrices H1, H2, H3, and H4. We display the four reconstructions. We also show the individually weighted basis elements in order to highlight their individual contributions.

 

 

 

_____________

(1) Thurau, C., Kersting, K., Wahabzada, M. (2012). "Descriptive matrix factorization for sustainability: Adopting the principle of opposites". Data Mining and Knowledge Discovery 24 (2): 325-354.

(2) Kersting K. et al. (2016). "Feeding the World with Big Data: Uncovering Spectral Characteristics and Dynamics of Stressed Plants". In: Lässig J., Kersting K., Morik K. (eds) Computational Sustainability. Studies in Computational Intelligence, vol 645. Springer, Cham.

(3) Daniel D. Lee & H. Sebastian Seung (1999). "Learning the parts of objects by non-negative matrix factorization". Nature 401 (6755): 788-791.

(4) Daniel D. Lee & H. Sebastian Seung (2001). "Algorithms for Non-negative Matrix Factorization". Advances in Neural Information Processing Systems 13: Proceedings of the 2000 Conference. MIT Press. 556–562.

BACKGROUND


The THEOREM series

Theorem 9 (latent figures) is the latest entry in a series of works that explores the idea of a visual dictionary. Earlier works in the series include Theorem 8 and Theorem 8.1. This page explains the original contribution of this new work in the context of artistic problems that arise out of previous entries in the series. Placing the work in this context illuminates some of the artistic and technical decisions involved.

The core concept that drives the entire Theorem series is the following: Given a set of digital images (the "dictionary") of fixed dimensions, decompose any arbitrary source image of the same dimensions as a weighted sum of dictionary images.

How are the weights determined? A standard approach involves orthogonal projection.

Let A be an n × k matrix consisting of the k images in the dictionary. The (i, j)th entry of A contains the ith pixel of the jth image. To project a given vector v in ℝⁿ onto the subspace spanned by the columns of A, we need to obtain the vector c of coefficients

c = A⁺ v

where A⁺ denotes the Moore–Penrose pseudo-inverse of A, or

A⁺ = (AᵀA)⁻¹ Aᵀ

The coefficients in c express the contributions of the individual dictionary images to the reconstruction. The first coefficient represents the weight of the first image in the dictionary, and so on. The sign (positive or negative) of each coefficient specifies whether that image is to be added to or subtracted from the other images in the dictionary. The coefficients thus can be understood as activation weights for the dictionary elements.

Given the coefficients, the orthogonal projection p of v onto the subspace spanned by the dictionary images is

p = A c
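In NumPy terms, the projection amounts to two matrix products (a sketch with arbitrary data; `np.linalg.pinv` computes A⁺):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((100, 4))    # dictionary: four 100-pixel "images" as columns
v = rng.random(100)         # source image to reconstruct

c = np.linalg.pinv(A) @ v   # coefficients c = A+ v (entries may be negative)
p = A @ c                   # orthogonal projection p = A c

# p is the closest point to v in the column space of A, so the residual
# v - p is orthogonal to every dictionary image.
assert np.allclose(A.T @ (v - p), 0.0)
```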

This is the method utilized in the work Theorem 8.1, which is the immediate predecessor of the present project. Every image in Alphaville was projected onto a fixed dictionary of images previously selected from the same film.

In this image, for instance, the original frame is shown next to its reconstruction, and to the images in the dictionary scaled by the appropriate coefficients.

The aim of Theorem 8.1 was not only to visualize the reconstruction of each frame, but also to visualize the individual contributions of the various dictionary elements. The procedure outlined above guarantees that the final reconstruction is the best possible approximation to the source image using a weighted sum of images in the dictionary.



Challenges

The use of orthogonal projection in Theorem 8.1 presents two problems.

1. Many coefficients turn out to be negative. But the notion of an image with negative values has no physical interpretation. The Theorem 8.1 installation was designed to visualize the individual contributions of the various dictionary elements to the reconstruction of every frame. But how can images with negative values be visualized? The following solution was adopted: If c_j < 0, every brightness value A_ij is replaced with (1 − A_ij) and then multiplied by the absolute value of c_j. This transformation replaces the dictionary element with its "negative" image (where the word "negative" has its usual photographic meaning). This solution was conceptually reasonable, but aesthetically confusing. It is often difficult to see the connection between the dictionary elements and the final reconstruction.

2. The reconstruction of every source image is typically ghostly and blurred. The various images in the dictionary become to a large extent unrecognizable in the final blend. In consequence, the opportunity to generate graphically and semantically fruitful juxtapositions (“visual figures”) between images in the dictionary and other frames in the movie is lost.
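The photographic-negative display rule adopted for problem 1 can be stated in a few lines of code (a sketch; the function and variable names are illustrative, not taken from the installation's source):

```python
import numpy as np

def display_component(column, c):
    """Scale a dictionary element (brightness values in [0, 1]) for display.

    For a negative coefficient, the element is first replaced by its
    photographic negative (1 - value), then scaled by |c|.
    """
    if c < 0:
        return (1.0 - column) * abs(c)
    return column * c

col = np.array([0.0, 0.25, 1.0])
assert np.allclose(display_component(col, 0.5), [0.0, 0.125, 0.5])
assert np.allclose(display_component(col, -0.5), [0.5, 0.375, 0.0])
```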


The approach used in this work was designed to overcome these problems by accomplishing two aims:

1. To produce a reasonable reconstruction of any source image using only non-negative activation weights.
2. To ensure that the images in the dictionary remain visible in the final reconstruction, in order to generate visually meaningful juxtapositions.

To accomplish 2, it is important to obtain dictionary images that are good representatives of the rest of the frames in the movie. This result affords the possibility of producing intuitive reconstructions of each frame using fairly small dictionaries. It is crucial to achieve this aim, because if a large dictionary of frames is used, then the graphical character of the individual images gets lost in their combination. Theorem 9 was designed to establish visual figures, i.e., allegorical or metaphorical connections between different images in the movie.

Theorem 9 uses the Simplex Volume Maximization algorithm to accomplish these two aims. See the technical description page for details.


CREDITS


ARTIST AND TECHNICAL DIRECTOR

HECTOR RODRIGUEZ



PROGRAMMING

SAM CHAN

HECTOR RODRIGUEZ



MATHEMATICAL ADVISOR

FELIPE CUCKER



TEXT

HECTOR RODRIGUEZ



WEB DESIGN

SAM CHAN

PHILIP KRETSCHMANN