Abstract: Capturing relationships among local and global features in an image is crucial for visual understanding. However, the convolution operation is inherently limited in utilizing long-range information due to its small receptive field. Existing approaches thus rely heavily on non-local network strategies to compensate for the locality of convolutional features. Despite their successful application to various tasks, we argue that there is still considerable room for improvement through exploring the effectiveness of global image context and position-aware representations. Notably, the concept of relative position is surprisingly under-explored in the vision domain, whereas it has proven useful for modeling dependencies in machine translation. In this paper, we propose a new relational reasoning module that incorporates a contextualized diagonal matrix and 2D relative position representations. While simple and flexible, our module allows the relational representation of a feature point to encode the whole image context as well as its relative position information. We also explore multi-head and dropout strategies to further improve relation learning. Extensive experiments show that our module consistently improves over state-of-the-art baselines on different vision tasks, including detection, instance segmentation, semantic segmentation, and panoptic segmentation. The code and models will be released.
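To make the idea of 2D relative position representations concrete, the sketch below builds pairwise position-aware vectors for an H x W feature map by factoring relative offsets into row and column embedding tables, extending the 1D scheme from machine translation to 2D. This is an illustrative assumption on our part, not the paper's exact formulation; the function name, the clipping window `max_dist`, and the use of random stand-ins for learnable tables are all hypothetical.

```python
import numpy as np

def relative_position_repr(H, W, dim, max_dist=4, seed=0):
    """Illustrative sketch: 2D relative position representations.

    Returns a (N, N, dim) array, N = H * W, where entry (i, j) is a
    vector depending only on the clipped offset between positions i and j.
    """
    rng = np.random.default_rng(seed)
    # Learnable embedding tables in a real model; random stand-ins here.
    row_emb = rng.standard_normal((2 * max_dist + 1, dim))
    col_emb = rng.standard_normal((2 * max_dist + 1, dim))

    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    ys, xs = ys.ravel(), xs.ravel()          # flatten to N positions
    # Relative offsets between every pair of positions, clipped to a window
    # and shifted into [0, 2 * max_dist] to index the embedding tables.
    dy = np.clip(ys[:, None] - ys[None, :], -max_dist, max_dist) + max_dist
    dx = np.clip(xs[:, None] - xs[None, :], -max_dist, max_dist) + max_dist
    # Each pair (i, j) gets a position-aware vector: row part + column part.
    return row_emb[dy] + col_emb[dx]         # shape (N, N, dim)

rel = relative_position_repr(H=3, W=3, dim=8)
print(rel.shape)  # (9, 9, 8)
```

Because the representation depends only on the relative offset, any two pairs of positions with the same displacement share the same vector, which is the translation-invariance property that makes relative (rather than absolute) position encoding attractive for dense prediction tasks.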