COBBS – Computer Vision and Tracking

The raw data gathered from the experimental work consist of three sequences of images, one from each camera, for each flocking/swarming event. Each sequence may contain up to 2700 images, so a single event can yield as many as 8100 images in total. Processing each event by hand in a timely manner would be impossible. The solution is to apply computer vision methods and algorithms to extract the required information, namely the position of each individual, from the images automatically. Broadly speaking, the post-processing of the images can be split into two categories: image segmentation, and 3D reconstruction and tracking.

Image Segmentation

The goal of this processing is to separate the individual organisms within an aggregation from the rest of the scene, automatically and without frame-by-frame assistance from a person. In other words, to determine which features are background and not of interest and remove them, while highlighting the objects of interest. For starlings, the background consists of clouds and sky, with occasional intruders such as airplanes and other birds. For midges, the background is more complicated, as the typical setting is a park with trees, shrubs and grass, with intruders consisting of unidentified flying objects (pollen, seeds, other large insects etc.) and people. Our image segmentation procedure uses a combination of well-known image processing methods such as background subtraction, thresholding, morphological operations, and flood-fill/watershed. An example of the typical results is shown in the two figures below. Note that the IDT M5 cameras we are using capture grayscale images.
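As a minimal sketch of the first two steps of such a pipeline (background subtraction, thresholding, and flood-fill labelling), assuming NumPy; the function names and the fixed threshold are ours for illustration, not part of the COBBS implementation:

```python
import numpy as np

def segment_frame(frame, background, threshold=30):
    """Separate moving objects from a static background estimate.

    frame, background: 2D uint8 grayscale arrays of equal shape.
    Returns a boolean mask of foreground pixels.
    """
    # Background subtraction: absolute difference from the background model.
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    # Thresholding: keep pixels that differ strongly from the background.
    return diff > threshold

def label_objects(mask):
    """Flood-fill connected-component labelling (4-connectivity).

    Returns a label image and the number of objects found.
    """
    labels = np.zeros(mask.shape, dtype=int)
    current = 0
    for seed in zip(*np.nonzero(mask)):
        if labels[seed]:
            continue  # pixel already belongs to a labelled object
        current += 1
        stack = [seed]
        while stack:
            r, c = stack.pop()
            if (0 <= r < mask.shape[0] and 0 <= c < mask.shape[1]
                    and mask[r, c] and not labels[r, c]):
                labels[r, c] = current
                stack.extend([(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)])
    return labels, current
```

The barycenter of object `k` then follows directly as the mean of the pixel coordinates where `labels == k`.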

Fig. 1: Raw image of a typical starling flock.

Fig. 2: Result from image segmentation of starling flock in Fig. 1.

Tracking and 3D Trajectory Reconstruction

The result of the image segmentation procedure is a frame-by-frame listing of each identified object's barycenter location within each camera. From this listing, we can then begin tracking and 3D reconstruction of the aggregation. The first task is linking detected objects, both stereoscopically and temporally. Stereoscopic linking consists of matching the same object across the images from different cameras at the same instant in time. It is possible for a given object to have more than two stereoscopic links, due to blob formation (described below). Temporal links match individual objects in time, frame by frame, within a given camera. The temporal links form a track of the object through time in a single camera (see Figs. 3 and 4).
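A minimal sketch of one-to-one temporal linking, matching each barycenter to its nearest neighbour in the next frame (the function name and displacement cutoff are our illustrative choices, not the COBBS criteria):

```python
import numpy as np

def temporal_links(prev_pts, next_pts, max_disp=10.0):
    """Link each object in one frame to the nearest object in the next.

    prev_pts, next_pts: (N, 2) sequences of barycenter coordinates.
    Returns a list of (i, j) index pairs; candidate links farther than
    max_disp pixels are rejected as physically implausible.
    """
    links = []
    nxt = np.asarray(next_pts, dtype=float)
    for i, p in enumerate(np.asarray(prev_pts, dtype=float)):
        d = np.linalg.norm(nxt - p, axis=1)   # distances to all candidates
        j = int(np.argmin(d))                 # nearest neighbour
        if d[j] <= max_disp:
            links.append((i, j))
    return links
```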

The biggest obstacle for tracking is visual occlusion, as shown in Fig. 3. When two birds overlap in the image, they form a "blob", which can persist for many frames and causes the bifurcation shown graphically on the left side of Fig. 3. There are now four possible tracks, as shown in Fig. 4: two that are correct ("good tracks") and two that are wrong ("bad tracks"). The conventional approach is to assign only a single temporal link entering and leaving each object, but this would lose at least one of the birds. Our solution is multi-path linking, which allows many links to enter and leave a given object. In this way it is possible to generate the two good tracks, but also possible to generate the two bad tracks as well.
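The difference from one-to-one linking can be sketched as follows: instead of keeping only the single nearest neighbour, every candidate within the cutoff becomes a link, so a blob can receive links from several birds at once (again, names and cutoff are illustrative assumptions):

```python
import numpy as np

def multipath_links(prev_pts, next_pts, max_disp=10.0):
    """Multi-path temporal linking.

    Unlike one-to-one nearest-neighbour linking, every candidate within
    max_disp becomes a link, so a blob may receive links from (and later
    emit links to) several distinct objects.
    """
    links = []
    nxt = np.asarray(next_pts, dtype=float)
    for i, p in enumerate(np.asarray(prev_pts, dtype=float)):
        d = np.linalg.norm(nxt - p, axis=1)
        for j in np.nonzero(d <= max_disp)[0]:
            links.append((i, int(j)))
    return links
```

For two birds merging into one blob, both birds are linked to the blob, which is exactly the bifurcation drawn in Fig. 3.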

Fig. 3: Example of visual occlusion – ‘blob’ formation.

Fig. 4: Multi-linking dilemma: 2 correct tracks, 2 incorrect tracks.

Once the temporal links have been created, we then have to associate a single track with each detected object. Fig. 5 visualizes the temporal links between detected objects for a single camera in a sample time window using a directed graph: nodes are the detected objects and edges are temporal links. Note the number of bifurcations and multi-object blobs. To find the correct track for each object, we calculate a "score" based on the quality of the stereoscopic links of each temporally connected object. Following the temporal links from head to tail and taking the objects with the highest score will, in theory, build the correct track for each detected object. If we used a greedy algorithm at this point, multi-path linking would be pointless, as a greedy assignment does not allow an object to belong to more than one track. Thus, we use a global optimization to build the tracks.
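A toy sketch of the idea, with the temporal-link graph as an adjacency map and a per-object score standing in for the stereoscopic-link quality (the actual COBBS scoring and optimization are more involved; everything here is a hypothetical simplification):

```python
def all_paths(adj, node):
    """Enumerate every candidate track through the temporal-link graph.

    adj maps each detected object to its successors in the next frame;
    with multi-path linking a node may have several successors, so the
    number of candidate tracks grows at each bifurcation.
    """
    if not adj.get(node):
        return [[node]]
    return [[node] + rest for nxt in adj[node] for rest in all_paths(adj, nxt)]

def best_track(adj, start, score):
    """Pick the candidate track with the highest total score.

    score[n] stands in for the quality of node n's stereoscopic links.
    """
    return max(all_paths(adj, start),
               key=lambda path: sum(score[n] for n in path))
```

Enumerating all paths like this is exactly what becomes intractable at scale, which motivates the block-splitting strategy described next.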

Fig. 5: Exponential growth of tracks.

However, how many tracks should there be, given that we do not know a priori the number of individuals in the aggregation? Moreover, a single object typically forms blobs with several other objects over the course of an acquisition sequence. Each blobbing doubles the number of possible tracks that the object can belong to, so over time the total number of candidate tracks can grow exponentially. An example of a difficult blobbing scenario and the resulting track proliferation can be seen in Fig. 6: there are four birds at the beginning of the sequence, and by the end only a single blob remains. Our solution to this exponential growth is to split the acquisition sequence into smaller blocks and use a recursive method to build the tracks within each camera. At each step of the recursion we continue to use the stereoscopic-link-based scores to determine which blocks should be connected together. In this way we are able to generate long tracks while avoiding an exponential explosion in their number.
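The recursive divide-and-conquer structure can be sketched as follows; `track_block` and `join` are hypothetical placeholders for the per-block tracker and the score-based stitching of neighbouring blocks, which carry the real work in COBBS:

```python
def track_recursive(frames, block_size, track_block, join):
    """Divide-and-conquer tracking over an acquisition sequence.

    Splitting the sequence into short blocks bounds the number of
    blob events per block, and hence the number of candidate tracks;
    neighbouring blocks are then stitched back together by `join`.
    """
    if len(frames) <= block_size:
        return track_block(frames)       # base case: block is small enough
    mid = len(frames) // 2
    left = track_recursive(frames[:mid], block_size, track_block, join)
    right = track_recursive(frames[mid:], block_size, track_block, join)
    return join(left, right)             # score-based stitching goes here
```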

Fig. 6: Blobs are part of the reason for exponential growth of tracks.

Once we have created the 2D tracks within each camera, we have to associate tracks between cameras in order to construct 3D trajectories for each object. A global optimization is used to minimize a cost function based on the three-dimensional tensor of the stereometric distances between objects of interest. We use the C++ APIs of IBM ILOG CPLEX Optimization Studio v12.2 to solve this linear programming problem.
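For intuition only: with two cameras the problem reduces to a minimum-cost one-to-one matching between the two sets of tracks. The brute-force sketch below makes the objective explicit; the real problem (three cameras, hundreds of tracks, a three-dimensional cost tensor) requires a proper LP solver such as CPLEX, as stated above:

```python
from itertools import permutations

def match_tracks(cost):
    """Minimum-cost one-to-one matching between tracks of two cameras.

    cost[i][j] is the stereometric distance between track i in camera 1
    and track j in camera 2 (square cost matrix assumed). Brute force
    over permutations is fine for a toy example only: it is O(n!).
    """
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))
    return list(enumerate(best))
```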

Fig. 7: Visualization of the 179 reconstructed 3D trajectories.

The results of our novel tracking algorithm are promising. We are able to reconstruct the majority of trajectories to lengths greater than 90% of the acquisition length. An example of a reconstructed flock is shown in Fig. 7. The flock consists of 179 individuals, with an acquisition length of 440 frames; 98.88% of the trajectories are longer than 90% of the acquisition length.
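The figure of merit quoted above is straightforward to compute from the per-track lengths. The individual lengths are not given in the text, so the toy data in the test below is invented solely to reproduce the quoted 98.88% (177 of 179 tracks exceeding the cutoff):

```python
def fraction_long(traj_lengths, acquisition_length, frac=0.9):
    """Percentage of reconstructed trajectories longer than a fraction
    (default 90%) of the acquisition length."""
    long_tracks = sum(1 for n in traj_lengths
                      if n > frac * acquisition_length)
    return 100.0 * long_tracks / len(traj_lengths)
```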

Fig. 8: Comparison of our tracking algorithm vs. other fields.

Comparing our results to other 3D tracking algorithms is not a simple matter, as there is no standard set of figures-of-merit for reporting findings. However, it is possible to extract the average trajectory length and the number of tracked objects and plot them, as shown in Fig. 8. There are two main areas where tracking algorithms are used: particle tracking velocimetry (PTV) (labelled “turbulent fluid”) and real-time biological tracking (labelled “realtime”). In PTV, passive tracers are deployed in a medium and the changes in volumes-of-interest are studied. A high density of tracers is used to ensure proper volume sampling; however, it is not of interest in PTV that tracer X started here at time t and ended up over there at time t + acquisition length. Real-time biological tracking algorithms seek to track biological entities frame-by-frame as they are captured; the objects of interest include fruit flies and bats. The difficulty these algorithms face is that they typically do not use multi-linking and instead rely on greedy algorithms to generate their tracks. Hence, visual occlusions become much more difficult to overcome. This is reflected in the shorter trajectory lengths and lower number of overall tracked objects shown in Fig. 8. Our results demonstrate the power of multi-linking, as we are able to track a large number of objects over a large number of frames.

More details about our method can be found in: L. Parisi, “Recursive dynamical tracking in three dimensions: the case of starling flocks”, MASc. Thesis, 2012.