Errant: The Kinetic Propensity of Images is a project about the automatic analysis and visualization of motion in the cinema. It consists of a two-channel video projection. The following video shows the two channels, and is best viewed in full screen.

The left channel shows the decomposition of a shot's optical flow into basic motion patterns. These motions are extracted using unsupervised machine learning methods.

The original shot's optical flow can be approximately reconstructed by combining those latent motions. The right channel shows this reconstruction.

The material consists of three films directed by King Hu 胡金銓:
A Touch of Zen 俠女 (1971)
Legend of the Mountain 山中傳奇 (1979)
Raining in the Mountain 空山靈雨 (1979)

The following documentation gives an overview of the project:

This work was commissioned by Linda Lai for the exhibition Algorithmic Art: Shuffling Space and Time, held at the Hong Kong City Hall, December 27, 2018 – January 10, 2019.

The following image shows the setup in the show.

Production of this work was partly funded by a City University of Hong Kong Strategic Research Grant, project no. 7004992.

It was also made possible by a research fellowship granted by the School of Creative Media's Center for Applied Computing and Interactive Media (ACIM) for the 2018-19 academic year. Many thanks to ACIM co-directors Richard Allen and Jeffrey Shaw for their support of this project.

MOTIVATION

Most film analysis and criticism describe movement in cinema by reference to the object that moves. Descriptions of scene motion typically focus on the nature of the moving object (whether it is a person, a car, etc.), its velocity, and perhaps certain aspects of its rhythm. Descriptions of global motion mainly focus on the camera as the source of that motion. Writers characterize camera movement as, for instance, a “pan”, “tilt”, “track”, “dolly”, “zoom”, etc. These terms presuppose a privileged object, the camera, as the source of the visible movement. The conventional vocabulary of critical analysis guides the expectations of the critic or theoretician, who sees only what they expect to find, and they expect to find only that for which they have acquired words. Writers on cinema almost invariably presuppose a mobile camera viewing mobile objects in a three-dimensional world. In other words, the focus is on the causes or sources of the movement rather than on its visible quality. Under these conditions, we lack the resources to describe or represent the phenomenological quality of motion in the cinema.

Our awareness of movement in mainstream films, advertisements and so on is typically bound to specific objects and locations in support of story content. Viewers do not typically attend to the visual qualities of movement itself. Our attention is directed to what is moving, not how it moves. In opposition to this dominant approach, this project aims to focus deep perception on motion. Its aim is, moreover, not purely formal. It embodies a reaction against the denigration of close attention that accompanies the “attention economy,” the commodification of attention in which we are currently immersed, and provides a medium for cultivating and enriching the content and manner of sense perception. In other words, the algorithm developed in this project makes movement perceptually salient as an end in itself.

A shot or sequence in a film often contains several on-screen motions, for instance the movements of different people, as well as the effects of camera motion. The organization of movement is a crucial aspect of cinematic art, yet whenever we focus our attention on narrative information, we fail to notice that organization.

The philosophical and conceptual aspects of this work are mediated by an awareness of computational technology. In particular, the methodology employed here relies on unsupervised machine learning to produce a visual dictionary of motion patterns.

The latent components need not correspond to the conventional categories of cinematic criticism and analysis. The algorithm is not “trained” by exposure to already known examples or “model answers” that embed familiar ways of understanding motion. Rather, the algorithm extracts those latent motions for each shot in the movie in an unsupervised way by applying optimization techniques.

We can think of the use of machine learning in this work as a way to help viewers "unlearn" stereotypical ways of seeing and understanding movies, and to sensitize them to certain qualitative aspects of motion in the cinema.

ALGORITHM

The input to the procedure consists of one or more movie sequences available in digital form as sequences of JPEG images (frames). Each image is 960 x 408 pixels. The material has been segmented into shots. Every image is treated as grayscale, so color is ignored.

The following steps are performed for every shot in the database.

1. Optical flow estimation

So-called optical flow techniques receive a video clip as input and estimate the movement depicted in every pair of consecutive frames in that clip. This estimation involves assigning a motion vector to every pixel (or region of pixels) in the first image. A motion vector can be visualized as an arrow. Its orientation represents the direction of the movement depicted in that pixel. Its magnitude represents the apparent speed of the motion. The following images show a short movie excerpt together with the matrix of vectors that represent the movement visible in every pair of consecutive frames in that excerpt.

The classical Lucas-Kanade algorithm is used to compute the optical flow between every pair of consecutive frames.^{(1)} If the shot has m frames, we obtain m–1 flow fields. The flow is computed over a grid of overlapping flowpoints, each about 60 x 60 pixels. The horizontal separation between flowpoints is 30 pixels and the vertical separation is 22 pixels, which yields a fixed grid of 32 x 18 flowpoints.

The resulting vector fields are sometimes noisy, because there is motion blur and some frames have insufficient texture. We perform the following clean-up operation on each flow field:

Identify any vector that is too different (in magnitude or orientation) from its eight neighbors, and treat it as missing data. The following image shows a vector field in which all "missing" vectors are drawn as red arrows.

Produce two matrices, X and Y, containing respectively the x and y components of the entire flow field.

Use Laplacian interpolation to fill in missing data in each matrix to produce new matrices X’ and Y’.

Map the values of X’ and Y’ linearly so that the maximum and minimum values are equal to those of X and Y.

Reconstruct the full optical flow by combining the data from the two resulting matrices. The following image shows the reconstruction of the optical flow of the above frame.
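The clean-up steps above can be sketched as follows. This is a simplified version: the outlier test here is a median-deviation threshold, the Laplacian interpolation is solved with Jacobi iterations, and the final min/max remapping step is omitted for brevity:

```python
import numpy as np

def median_filter_3x3(a):
    """Median of each value's 3 x 3 neighbourhood (edge-padded)."""
    p = np.pad(a, 1, mode='edge')
    windows = np.stack([p[i:i + a.shape[0], j:j + a.shape[1]]
                        for i in range(3) for j in range(3)])
    return np.median(windows, axis=0)

def clean_flow(F, thresh=2.0, iters=200):
    """Flag outlier vectors in a flow field F (H x W x 2) and fill them in
    by Laplacian (harmonic) interpolation via Jacobi iterations."""
    med = np.stack([median_filter_3x3(F[..., c]) for c in range(2)], axis=-1)
    missing = np.linalg.norm(F - med, axis=-1) > thresh   # "too different" test
    out = F.copy()
    out[missing] = 0.0
    for _ in range(iters):
        p = np.pad(out, ((1, 1), (1, 1), (0, 0)), mode='edge')
        avg = (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]) / 4.0
        out[missing] = avg[missing]        # relax only the missing cells
    return out, missing
```

Each missing vector converges to the average of its neighbors, which is the discrete harmonic solution on the missing region.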

2. Factorization

Decompose the optical flow for an entire segment using non-negative factorization. The algorithm requires non-negative data, but motion vectors typically contain negative components. The solution adopted here represents every flow vector (x, y) as a quadruple of non-negative numbers (x+, x-, y+, y-). For instance, the vector (3, -2) is represented as (3, 0, 0, 2).
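This sign-splitting encoding can be expressed in a few lines:

```python
import numpy as np

def split_signs(flow):
    """Encode each flow vector (x, y) as the non-negative quadruple
    (x+, x-, y+, y-) expected by the non-negative factorization."""
    x, y = flow[..., 0], flow[..., 1]
    return np.stack([np.maximum(x, 0), np.maximum(-x, 0),
                     np.maximum(y, 0), np.maximum(-y, 0)], axis=-1)

split_signs(np.array([[3.0, -2.0]]))  # → [[3, 0, 0, 2]]
```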

Form a matrix V of dimensions n x m, where n is the number of flow vectors per frame times four, and m is the number of frames in the shot. This matrix is the input to the factorization algorithm.

Perform projective non-negative factorization on V by searching for a matrix W that minimizes

|| V – W W^{T}V ||_{F}    (1)

where W is a non-negative matrix of dimensions n x k, with k < min(n, m). We use an algorithm proposed by Yuan and Oja.^{(2)}

The minimization algorithm expects a number k and an initial estimate of W. To compute both inputs automatically, we perform a PCA on V, i.e., we obtain the eigenvectors and eigenvalues of the covariance matrix of the column vectors in V. The number k is then obtained by thresholding the rate of decay of the eigenvalues. The first k eigenvectors are then selected as the initial estimate of W.^{(3)} To ensure non-negativity, all negative components of the eigenvectors are set to zero. Once the number of factors and the initial estimate are known, the projective non-negative factorization algorithm minimizes (1) iteratively until convergence.
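The factorization step can be sketched as follows. The multiplicative update used here is the standard form associated with Yuan and Oja's projective NMF; the PCA-based choice of k is replaced by passing k explicitly, while the PCA-based initialization (eigenvectors with negative components zeroed) follows the description above:

```python
import numpy as np

def pnmf(V, k, iters=500, eps=1e-9):
    """Projective NMF sketch: find non-negative W (n x k) such that
    W W^T V approximates V, minimizing ||V - W W^T V||_F."""
    # PCA-based initialization: top-k eigenvectors of the covariance of
    # V's columns, sign-flipped to be mostly positive, negatives zeroed
    vals, vecs = np.linalg.eigh(np.cov(V))
    vecs = vecs[:, ::-1][:, :k]
    vecs *= np.sign(vecs.sum(axis=0, keepdims=True) + eps)
    W = np.maximum(vecs, 0.0) + eps
    VVt = V @ V.T
    for _ in range(iters):
        num = 2.0 * (VVt @ W)
        den = W @ (W.T @ VVt @ W) + VVt @ (W @ (W.T @ W)) + eps
        W *= num / den                 # multiplicative update keeps W >= 0
    return W
```

Because the update only multiplies by non-negative ratios, W stays non-negative throughout the iteration.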

The resulting W can be interpreted as a dictionary of (approximately orthogonal) components. WW^{T} can be interpreted as akin to a projection matrix. H = W^{T}V can be interpreted as a matrix of mixing coefficients or weights. WH = WW^{T}V is the reconstruction of V using the dictionary W.

A sequence often contains complex movements involving various characters as well as the camera. Even a simple case, that of a camera viewing an empty corridor with a wide angle while tracking forward, may be kinetically quite complex if the layout of the scene contains many objects distributed from the foreground to the background. To capture this potential complexity, it is helpful to cluster the latent motion components. We can think of each cluster as a smaller dictionary or sub-dictionary for this sequence. To do this, we perform another projective non-negative matrix factorization on H^{T} in order to obtain a new basis matrix W'. We now look at each column j in the mixing coefficient matrix W'^{T}H^{T}. Let i be the row with the maximum entry in the j-th column. Assign the j-th dictionary entry to the i-th cluster. Suppose we obtain k' latent components. We will form k' matrices W_{1}, W_{2}, …, W_{k'}. Matrix W_{i} contains the elements of W assigned to the i-th cluster.
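The cluster-assignment rule can be sketched as follows, where Wp stands for W' and is assumed to come from the same projective NMF routine applied to H^T:

```python
import numpy as np

def cluster_dictionary(W, H, Wp):
    """Split the dictionary W (n x k) into k' sub-dictionaries.
    Each atom j is assigned to the cluster i whose row holds the
    maximum entry in column j of the mixing matrix Wp^T H^T."""
    M = Wp.T @ H.T                    # k' x k mixing coefficients
    labels = np.argmax(M, axis=0)     # cluster index for each atom j
    k_prime = Wp.shape[1]
    return [W[:, labels == i] for i in range(k_prime)]   # W_1 ... W_k'
```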

3. Projection

We approximate the optical flow f_{j} of the j-th frame in the sequence using the i-th basis matrix W_{i} as
f_{j,i} = W_{i} W_{i}^{T}f_{j} .
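This projection is a one-line operation:

```python
import numpy as np

def project_flow(f, Wi):
    """Approximate a frame's flattened flow vector f with sub-dictionary
    Wi, computing f_{j,i} = Wi Wi^T f_j."""
    return Wi @ (Wi.T @ f)

# with an orthonormal single-column dictionary this is an ordinary
# orthogonal projection
Wi = np.array([[1.0], [0.0], [0.0]])
f = np.array([2.0, 3.0, 4.0])
project_flow(f, Wi)  # → [2, 0, 0]
```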

It is now possible to render the decomposition of each frame in a sequence using each separate matrix, in order to visualize the dynamic components of the sequence. If there are, for instance, 30 sub-dictionaries, we can visualize 30 different reconstructions of the movement. Each reconstruction responds to a different aspect of the movement of the sequence. These are the images shown in the left video channel of the installation.

The right channel of the installation shows the reconstruction WH of the frame using the entire dictionary.

4. Visualization

The flow is visualized using streaklines, a visualization and analysis technique often used in fluid dynamics to represent unsteady (time-varying) flows.^{(4)}

We first identify a set S of keypoints (corners) in the first frame of the shot being analyzed. We associate a virtual particle with each keypoint. Using the optical flow data that we wish to visualize, each of these virtual particles is advected through the video sequence.

In every frame, a new particle is placed at each of the original keypoints in S and advected through the rest of the sequence. If a sequence consists of m frames and p keypoints, (m – 1) x p particles will be generated.

A streakline at time t is the locus of the positions at t of all virtual particles that started from the same initial keypoint.
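The particle-advection scheme above can be sketched as follows. The nearest-neighbour flow sampling is a simplification (a production version would interpolate the flow at sub-pixel positions):

```python
import numpy as np

def streaklines(flows, keypoints):
    """Advect virtual particles through a sequence of dense flow fields.
    flows    : list of (H x W x 2) arrays, flow from frame t to t+1
    keypoints: list of (x, y) seed positions
    At every frame a new particle is released from each keypoint; all
    active particles then move by the flow sampled (nearest-neighbour)
    at their current position. lines[t][i] is the streakline of
    keypoint i after frame t."""
    H, W, _ = flows[0].shape
    particles = {i: [] for i in range(len(keypoints))}
    lines = []
    for F in flows:
        for i, kp in enumerate(keypoints):          # release new particles
            particles[i].append(np.array(kp, dtype=float))
        for i in particles:                         # advect all particles
            for p in particles[i]:
                xi = int(np.clip(np.rint(p[0]), 0, W - 1))
                yi = int(np.clip(np.rint(p[1]), 0, H - 1))
                p += F[yi, xi]
        lines.append({i: [p.copy() for p in particles[i]] for i in particles})
    return lines
```

With m – 1 flow fields and p keypoints this releases exactly (m – 1) x p particles, matching the count given above.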

_____________

^{(1)}
B. D. Lucas and T. Kanade (1981), An iterative image registration technique with an application to stereo vision. Proceedings of Imaging Understanding Workshop, pages 121–130.

^{(2)}
For a description of the projective non-negative matrix factorization method, see: Yuan Z., Oja E. (2005) "Projective Nonnegative Matrix Factorization for Image Compression and Feature Extraction." In: Kalviainen H., Parkkinen J., Kaarna A. (eds) Image Analysis. SCIA 2005. Lecture Notes in Computer Science, vol 3540. Springer, Berlin, Heidelberg. https://link.springer.com/chapter/10.1007/11499145_35

^{(3)}
Zhirong Yang, Zhanxing Zhu, and Erkki Oja, "Automatic Rank Determination in Projective Nonnegative Matrix Factorization", in V. Vigneron et al. (Eds.): LVA/ICA 2010, LNCS 6365, pp. 514–521, 2010.

^{(4)}
For a description of streaklines and an example of their application to computer vision, see: Mehran R., Moore B.E., Shah M. (2010) A Streakline Representation of Flow in Crowded Scenes. In: Daniilidis K., Maragos P., Paragios N. (eds) Computer Vision – ECCV 2010. ECCV 2010. Lecture Notes in Computer Science, vol 6313. Springer, Berlin, Heidelberg. https://link.springer.com/chapter/10.1007/978-3-642-15558-1_32.

SETUP

The work is a single video file spanned across two projectors. It should be played as a loop.

Video format: mp4 file compressed with H.264.
Video length: 30 min 43 sec.
Projector specs: Minimum 300 ANSI Lumens.
Minimum dimensions of the projection surface: 7000mm width x 2000mm height.
Setting: a completely dark room.

The work was exhibited at Hong Kong City Hall using QuickTime Player on a MacBook Pro connected to two projectors via HDMI cables and Thunderbolt adapters.

It is optional to show the documentation video on a TV monitor.

The following video footage was taken by Alex Ngan during the premiere of the work as part of the Algorithmic Art: Shuffling Space and Time exhibition organized by Linda Lai at the Hong Kong City Hall.