Results from our model across many species and behaviors. Some species are quite rare in our dataset: the polar bear is only present in 0.31% of our ~300 hours of training data, the caribou in 0.025%, the alpaca in 0.5%, and the fossa in 0.14%.
We condition with the same initial frame and short motion history, but sample from our stochastic diffusion model with different random seeds, resulting in slightly different motion forecasts.
We can optionally condition our forecasts on a global 2D (x,y) motion vector. All results before this section were generated without this conditioning.
Our dataset primarily consists of mammals, but we find that our model generalizes beyond this to other animals and even robots. A couple of these examples were found in our validation set, but most are from other sources.
We prompt Stable Video Diffusion with images from our dataset. We find that the model struggles especially on rarer species (rare at least in our dataset; the distribution of Stable Video Diffusion's training data is unknown), failing to model consistent morphologies and realistic animal behavior.
We compare our method to two baselines: constant-velocity extrapolation of the ground-truth tracks, and Track2Act trained on our full dataset. In addition to architectural differences, the baselines do not model occlusion. Raw point-tracking outputs can have inaccurate coordinates for occluded points, so training without proper occlusion handling leads to some of the artifacts seen in the Track2Act results, such as in the bison and horse examples. We give our learned baselines the advantage of training on our camera-stabilized data, even though the original methods do not. For Track2Act and our method, which are both stochastic, we show results using the same random seed.
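The constant-velocity baseline can be sketched in a few lines. The array shapes below are illustrative assumptions, not the paper's exact interface:

```python
import numpy as np

def constant_velocity_forecast(history, num_future):
    """Extrapolate each 2D point track at its last observed velocity.

    history: (N, T, 2) array of past point positions, T >= 2.
    Returns an (N, num_future, 2) array of forecast positions.
    """
    last_pos = history[:, -1, :]                         # (N, 2)
    last_vel = history[:, -1, :] - history[:, -2, :]     # (N, 2) one-step velocity
    steps = np.arange(1, num_future + 1)[None, :, None]  # (1, F, 1)
    return last_pos[:, None, :] + steps * last_vel[:, None, :]
```

This baseline is deterministic: it continues each point along its final displacement, with no interaction between points and no occlusion reasoning.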
Built upon a diffusion transformer (DiT) architecture, our model treats each point track as a token augmented with local visual context from frozen DINOv3 features to capture semantic priors about animal parts. We explicitly incorporate occlusion handling within our track tokens, allowing the model to reason about visibility.
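One standard way to attach local visual context to a track token is to bilinearly sample a frozen feature map at each point's continuous location. The sketch below is a generic numpy version of that operation; the shapes and the name `sample_features` are our assumptions, not the paper's code:

```python
import numpy as np

def sample_features(feat_map, points):
    """Bilinearly sample a (C, H, W) feature map at continuous (x, y) points.

    feat_map: (C, H, W) frozen features (e.g. from a DINOv3-style backbone).
    points:   (N, 2) pixel coordinates. Returns (N, C) per-point features.
    """
    C, H, W = feat_map.shape
    x = np.clip(points[:, 0], 0, W - 1)
    y = np.clip(points[:, 1], 0, H - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    # Weighted blend of the four surrounding feature vectors.
    f = (feat_map[:, y0, x0] * (1 - wx) * (1 - wy)
         + feat_map[:, y0, x1] * wx * (1 - wy)
         + feat_map[:, y1, x0] * (1 - wx) * wy
         + feat_map[:, y1, x1] * wx * wy)
    return f.T  # (N, C)
```

The sampled feature vector can then be concatenated with the point's coordinates and visibility flag to form the track token.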
Our model takes as input the diffusion timestep and, optionally (via dropout during training), a 2D global displacement vector, both injected through adaptive layer norm (AdaLN). The model is trained with a diffusion objective that minimizes an L1 loss to denoise reparameterized point velocities and occlusions. This parameterization focuses the learning signal on motion dynamics rather than absolute coordinates, enabling our model to generate diverse, physically plausible futures from short motion histories.
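A minimal numpy sketch of the velocity reparameterization and L1 denoising loss, assuming a toy linear noise schedule; the actual schedule, the occlusion channel, and the denoising network are not specified here:

```python
import numpy as np

def tracks_to_velocities(tracks):
    """Reparameterize (N, T, 2) absolute positions as (N, T-1, 2) per-step velocities."""
    return tracks[:, 1:, :] - tracks[:, :-1, :]

def add_noise(clean_vel, t, rng):
    """Toy linear schedule: interpolate toward Gaussian noise at timestep t in [0, 1]."""
    noise = rng.standard_normal(clean_vel.shape)
    return (1 - t) * clean_vel + t * noise

def diffusion_l1_loss(clean_vel, pred_vel):
    """L1 denoising loss on the reparameterized velocities."""
    return np.abs(pred_vel - clean_vel).mean()
```

Predicting velocities rather than absolute coordinates makes the target translation-invariant, so the loss concentrates on how points move rather than where they happen to be in the frame.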
A critical component of our method is MammalMotion, a large-scale curated dataset of animal motion.
Starting from the MammalNet dataset, we detect shots using our novel point-tracking-based shot detection algorithm (detailed in the paper), and then detect and segment animals within each shot.
We track points queried within the animal segmentation masks, then apply homography-based camera stabilization to separate animal motion from camera motion.
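Homography-based stabilization of point tracks can be sketched as a plain least-squares DLT fit to background correspondences, followed by warping tracked points back into a reference frame. In practice a robust estimator such as `cv2.findHomography` with RANSAC would be used; this minimal numpy version is illustrative only:

```python
import numpy as np

def estimate_homography(src, dst):
    """Direct Linear Transform: fit H such that dst ~ H @ src (homogeneous).

    src, dst: (N, 2) corresponding background points, N >= 4.
    """
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_homography(H, pts):
    """Map (N, 2) points through H using homogeneous coordinates."""
    ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return ph[:, :2] / ph[:, 2:3]

def stabilize_points(pts, H):
    """Undo camera-induced motion: warp frame-t points back into reference coordinates."""
    return apply_homography(np.linalg.inv(H), pts)
```

With the camera motion factored out, the remaining displacement of points inside the animal masks reflects the animal's own motion.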
Visual intelligence requires anticipating the future behavior of agents, yet vision systems lack a general representation for motion and behavior. We propose dense point trajectories as visual tokens for behavior, a structured mid-level representation that disentangles motion from appearance and generalizes across diverse non-rigid agents, such as animals in the wild. Building on this abstraction, we design a diffusion transformer that models unordered sets of trajectories and explicitly reasons about occlusion, enabling coherent forecasts of complex motion patterns. To evaluate at scale, we curate over 300 hours of unconstrained animal video with robust shot detection and camera-motion compensation. Experiments show that forecasting trajectory tokens achieves category-agnostic, data-efficient prediction, outperforms state-of-the-art baselines, and generalizes to rare species and morphologies, providing a foundation for predictive visual intelligence in the wild.
We propose a Lagrangian approach to behavior forecasting by representing motion as dense point trajectories. Unlike standard Eulerian video generation models that predict color changes on a fixed pixel grid, our method explicitly tracks the movement of physical surface points over time. This distinction allows our model to disentangle complex object dynamics from appearance and lighting, providing a structured mid-level representation that is significantly more data-efficient and generalizes across diverse non-rigid agents.
We thank Noah Snavely for challenging us with general motion forecasting over a lovely Parisian lunch. We thank Andrew Zisserman, Drew Purves, Aleksander Holynski, Linyi Jin, Sander Dieleman, Mark Hamilton, and Jathushan Rajasegeran for helpful discussions and feedback. This work originated (in part) while the authors were visiting the Simons Institute for the Theory of Computing. This work was supported by ONR MURI N00014-21-1-280, and an NSF Graduate Fellowship to NT.
@article{thakkar2026forecasting,
author = {Thakkar, Neerja and Ginosar, Shiry and Walker, Jacob C and Malik, Jitendra and Carreira, Jo{\~a}o and Doersch, Carl},
title = {Forecasting Motion in the Wild},
journal = {arXiv preprint arXiv:2604.01015},
year = {2026},
}