Errant version 2: The Kinetic Propensity of Images is a project about the automatic analysis and visualization of motion in the cinema. A newly designed machine learning algorithm decomposes the movement in every sequence of a movie into a set of elementary motions. These elementary motions are then recombined to produce a reconstruction of the visible movement in the sequence. The analysis and reconstruction are displayed as a two-channel video installation. The visualization of the movement uses a variant of the streakline method often employed in fluid dynamics.

The following image is an instance of the left channel. The original movie frame is shown together with eight motion factors discovered by the algorithm.

The following image is an instance of the right channel for the same frame. The movement at that instant is reconstructed by combining the eight motion factors.


The algorithm for this work has been completely redesigned since the original version of this project. That version produced a dictionary containing many different factors, which were often redundant and sometimes very difficult to interpret. Version 2.0 uses a different approach to both data representation and data factorization, producing a smaller and more interpretable dictionary. The original algorithm also produced a different dictionary for every shot, whereas algorithm 2.0 learns one single dictionary for the entire film, so that viewers can more easily compare and contrast the activation patterns of different sequences. Details are in the algorithm section of this website.

The movie being analyzed is Ugetsu Monogatari (Mizoguchi Kenji, 1953). This cinematic text was selected mainly for its carefully choreographed long takes. But there is a sociopolitical subtext to the choice of both film and machine learning methodology.

The machine learning algorithms used to analyze and synthesize motion are all specifically designed for this work, but they are variants of methods primarily used for crowd control and abnormal movement detection. This project is a “détournement” of surveillance techniques, redirecting them to analyze and foreground movements in movies that attack authoritarianism. The film chosen was made by Mizoguchi as a response to the Second World War, and more specifically to a situation in which people were dominated by their own fascistic and militaristic governments. The motions in the film are closely related to this situation.

The original version of this work was commissioned by Linda Lai for the exhibition Algorithmic Art: Shuffling Space and Time, held at the Hong Kong City Hall, December 27, 2018 – January 10, 2019. Production was partly funded by a grant from the Innovation and Technology Fund of the Hong Kong government. Production of the new version was funded by a fellowship from the Centre for Advanced Computing and Interactive Media (ACIM) of the School of Creative Media of the City University of Hong Kong. The project was made at the Perceptron Lab founded by Héctor Rodríguez.


This work addresses the relation between observation and description in the sphere of cinematic motion. Errant offers a system for the description of movement in cinema, which directs our attention towards the organization of visual rhythm.

Mainstream cinema conventions often encourage the viewer to focus on narrative at the expense of movement and duration. When asked to summarize a film, we often describe the film’s storyline rather than its visual rhythm. A shot or sequence in a film often contains several on-screen motions, for instance the movements of different people, as well as the effects of camera motion. The choreography of these diverse movements is a crucial aspect of cinematic art, which goes unnoticed whenever we attend exclusively to the narrative information. Even professional criticism and scholarship often lack a sufficiently detailed method for describing kinetic rhythm in cinematic art.

Most film analysis and criticism describe movement by reference to the object that moves. Descriptions of scene motion typically focus on the nature of the moving object (whether it is a person, a car, etc.) and its speed. Writers characterize camera movement as, for instance, a “pan”, “tilt”, “track”, “dolly”, “zoom”, etc. These terms presuppose a privileged object, the camera, as the source of the visible movement. The conventional vocabulary of critical analysis guides the expectations of the critic or theoretician, who sees only what they expect to find, and they expect to find only that for which they have acquired words. Writers on cinema almost invariably presuppose a mobile camera viewing mobile objects in a three-dimensional world. In other words, the focus is on the causes or sources of the movement rather than on its visible quality. Under these conditions, we lack the resources to describe or represent the phenomenological quality of motion in the cinema.

The situation somewhat resembles the problem confronted in the 1950s by musique concrète composers like Pierre Schaeffer, who struggled to expand the boundaries of music by incorporating sounds recorded in the wild. Designed for a narrow domain of pitched sounds, the language of classical Western music fails to capture the qualitative textures of organized recorded sound. There was no method for describing the new musical objects and methods in a way that would help artists and critics systematize, describe, and analyze this kind of practice.

Errant proposes a general methodology for the observation and description of visual rhythm in cinema.




The aim of this algorithm is to learn an interpretable low-dimensional representation of the movement in a given film.

We select a digitized movie X = (x₁, x₂, x₃, …, xₘ), where xᵢ ∈ ℝⁿ is the ith frame, n is the number of pixels per image, and m is the total number of frames in the movie.

We manually partition the movie into shots. A shot is a sequence of consecutive images bound by straight cuts or fades. (For convenience, we assume that these are the only shot transitions.) This particular movie has 186 shots.

We then obtain a dense field O of optical flow vectors for every frame. Optical flow is a function that associates every pixel in a given frame xᵢ with a vector in ℝ², which represents the hypothesized on-screen motion at that pixel from xᵢ to xᵢ₊₁.

To obtain the optical flow, we used the Python implementation in the scikit-image library of the following method: Le Besnerais, G., & Champagnat, F. (2005, September). Dense optical flow by iterative local window registration. In IEEE International Conference on Image Processing 2005 (Vol. 1, pp. I-137). We selected this method because its results were perceptually very convincing.
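In recent versions of scikit-image, this iterative Lucas-Kanade method is exposed as `optical_flow_ilk` in `skimage.registration` (the library's documentation cites the same Le Besnerais & Champagnat paper). The sketch below assumes that function; the synthetic frame pair, frame size, and `radius` value are illustrative only.

```python
import numpy as np
from skimage.registration import optical_flow_ilk

# Two synthetic consecutive grayscale frames (height x width); the second
# is the first shifted two pixels to the right.
rng = np.random.default_rng(0)
frame_a = rng.random((240, 320))
frame_b = np.roll(frame_a, shift=2, axis=1)

# optical_flow_ilk implements the iterative local-window registration
# scheme of Le Besnerais & Champagnat (2005). It returns one displacement
# component per pixel along each image axis (rows, then columns).
v, u = optical_flow_ilk(frame_a, frame_b, radius=7)

print(v.shape, u.shape)  # one vector component per pixel: a dense field
```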

To reduce the total number of motion vectors per frame to a more computationally tractable size, we divided the image into 2025 regions of 32 × 24 pixels each. Within each region, we chose the motion vector with the greatest magnitude as the representative of the region. We then arranged the optical flow information for the ith frame into a matrix Vᵢ consisting of 2 rows and 2025 columns. Every column represents one motion vector, i.e., a vector in ℝ².
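The per-region reduction can be sketched as follows. This is an illustrative reimplementation, not the code used in the work; the 1440 × 1080 frame size is an assumption chosen so that 32 × 24-pixel cells yield exactly 45 × 45 = 2025 regions.

```python
import numpy as np

def reduce_flow(v, u, cell_h=24, cell_w=32):
    """Keep, for each cell, the motion vector of greatest magnitude.

    v, u: per-pixel vertical/horizontal flow components, shape (H, W).
    Returns a 2 x num_cells matrix V_i, one motion vector per column.
    """
    h, w = v.shape
    columns = []
    for top in range(0, h, cell_h):
        for left in range(0, w, cell_w):
            vb = v[top:top + cell_h, left:left + cell_w]
            ub = u[top:top + cell_h, left:left + cell_w]
            mags = np.hypot(vb, ub)
            idx = np.unravel_index(np.argmax(mags), mags.shape)
            columns.append((vb[idx], ub[idx]))
    return np.array(columns).T

# Assumed frame size: 1080 rows x 1440 columns, so that
# (1080 / 24) * (1440 / 32) = 45 * 45 = 2025 regions, matching the
# 2 x 2025 matrix described above.
```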

We then applied a sparse coding algorithm to every Vi matrix. Among several existing sparse coding methods, we chose the scikit-learn library implementation of the minibatch dictionary learning algorithm: J. Mairal, F. Bach, J. Ponce, G. Sapiro, (2009, June). Online dictionary learning for sparse coding. In ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning 2009 (pp. 689–696). We chose this procedure because of its relatively low time complexity.

A sparse coding algorithm models a dataset as sparse combinations of the elements of a possibly overcomplete linear basis (the “dictionary”) Dᵢ. The number of entries in the dictionary is a parameter; in this case, we asked for a dictionary consisting of five vectors in ℝ². The per-frame dictionaries Dᵢ are then assembled column-wise into one 2 × 5m matrix Vglobal. We applied the same sparse coding algorithm to Vglobal and asked for a new dictionary Dglobal consisting of 8 elements. We chose this number for aesthetic reasons (see the visualization section below).
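A minimal sketch of the two-stage factorization with scikit-learn's `MiniBatchDictionaryLearning` (the method cited above). The random data, the two-frame stand-in for the whole film, and all hyperparameters are placeholders, not the values used in the work.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
V_i = rng.standard_normal((2, 2025))  # one frame: 2025 motion vectors in R^2

# scikit-learn treats rows as samples, so each motion vector (a column of
# V_i) becomes one two-feature sample after transposition.
frame_learner = MiniBatchDictionaryLearning(n_components=5, random_state=0)
codes = frame_learner.fit_transform(V_i.T)  # sparse codes, shape (2025, 5)
D_i = frame_learner.components_             # per-frame dictionary, shape (5, 2)

# Global stage: concatenate the per-frame dictionaries column-wise (here
# just two copies, standing in for all m frames) and learn one 8-atom
# dictionary for the entire film.
V_global = np.hstack([D_i.T, D_i.T])        # shape (2, 10)
global_learner = MiniBatchDictionaryLearning(n_components=8, random_state=0)
global_learner.fit(V_global.T)
D_global = global_learner.components_       # shape (8, 2)
```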

The left channel of the video installation visualizes the decomposition of every optical flow using each of the eight latent factors separately. To produce the data for the visualization of one optical flow, we first encoded the flow using a sparse combination of entries in Dglobal and then used only one of the eight components of the code to reconstruct the flow. This procedure outputs eight different vector fields per frame, corresponding to the eight latent factors discovered by the sparse coding algorithm. This data was then processed by the visualization procedure described in the next section.
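The per-factor reconstruction amounts to zeroing out all but one coefficient of each sparse code before multiplying by the dictionary. A sketch, assuming the same layout as above (codes as rows, dictionary atoms as rows):

```python
import numpy as np

def partial_reconstruction(codes, dictionary, k):
    """Reconstruct a flow field using only the k-th dictionary factor.

    codes:      sparse codes, shape (num_vectors, num_atoms)
    dictionary: atoms as rows, shape (num_atoms, 2)
    Returns the partially reconstructed vectors, shape (num_vectors, 2).
    """
    kept = np.zeros_like(codes)
    kept[:, k] = codes[:, k]   # keep only the k-th coefficient of each code
    return kept @ dictionary

# The full reconstruction used by the right channel keeps every
# coefficient: codes @ dictionary.
```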



This project aims to visualize the decomposition of the movement in a film into its component motions. To do this, we fix a virtual grid consisting of P cells over each image xᵢ. One particle is placed in each cell. A particle always has a lifetime of T frames. We simulate the pathway of every particle in xᵢ over its lifetime using each of the 8 partial reconstructions of Vᵢ, producing 8 different image sequences. Each sequence shows the trajectories of all P particles under a single dictionary factor. At any point in a given sequence, there will be many live particles at different stages of their lifetimes. Each image in the sequence is produced by drawing a spline curve through all the live particles that started out from the same cell. This procedure generates P curves for every image.
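The particle simulation can be sketched as below, assuming nearest-cell sampling of the flow field (the actual interpolation scheme, the staggered particle lifetimes, and the spline drawing are omitted for brevity):

```python
import numpy as np

def advect_particles(seeds, flows):
    """Trace particle pathways through a sequence of flow fields.

    seeds: initial (row, col) positions of the P particles, shape (P, 2)
    flows: per-frame flow fields over the particle lifetime T,
           each of shape (2, H, W)
    Returns trajectories of shape (T + 1, P, 2).
    """
    h, w = flows[0].shape[1:]
    trajectory = [seeds.astype(float)]
    for flow in flows:
        pos = trajectory[-1]
        # Sample the flow at the nearest cell and step each particle by
        # its local motion vector.
        r = np.clip(np.round(pos[:, 0]).astype(int), 0, h - 1)
        c = np.clip(np.round(pos[:, 1]).astype(int), 0, w - 1)
        step = np.stack([flow[0, r, c], flow[1, r, c]], axis=1)
        trajectory.append(pos + step)
    return np.array(trajectory)
```

Running this once per latent factor, with the eight partial reconstructions as the flow fields, yields the eight sets of trajectories shown on the left channel.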

The resulting 8 sets of trajectories are displayed, together with the original movie frame, as a grid on the left video channel.

The right video channel contains a single image. It visualizes P particle trajectories computed by reconstructing the optical flow for each frame with all 8 entries in Dglobal. This visualization displays a synthesis of the factors analyzed by the sparse coding algorithm.

We can think of the left channel as an analysis of the motion in the movie, and of the right channel as a synthesis of the motion in the movie.




The work is a single video file to be spanned across two projectors. It should be played as a loop.

Video format: mp4 file compressed with H.264.
Video length: 30 min 43 sec.
Projector specs: Minimum 3000 ANSI Lumens.
Minimum dimensions of the projection surface: 7000 mm (width) × 2000 mm (height).
Setting: a completely dark room.

The work was exhibited at Hong Kong City Hall using QuickTime Player on a MacBook Pro connected to two projectors via HDMI cables and Thunderbolt adapters.