Sukrit Shankar, Vikas K. Garg and Roberto Cipolla, Deep Carving: Discovering Visual Attributes by Carving Deep Neural Nets, CVPR 2015.
Deep Carving
Most of the approaches for discovering visual attributes in images demand significant supervision, which is cumbersome to obtain. In this paper, we aim to discover visual attributes in a weakly supervised setting that is commonly encountered with contemporary image search engines. For instance, given a noun (say forest) and its associated attributes (say dense, sunlit, autumn), search engines can now generate many valid images for any attribute-noun pair (dense forests, autumn forests, etc). However, images for an attribute-noun pair do not contain any information about other attributes (like which forests in the autumn are dense too). Thus, a weakly supervised scenario occurs: each of the M attributes corresponds to a class such that a training image in class m ∈ {1, . . . , M } contains a single label that indicates the presence of the mth attribute only. The task is to discover all the attributes present in a test image.
                   Deep Convolutional Neural Networks (CNNs) have enjoyed remarkable success in vision applications recently. However, in a weakly supervised scenario, widely used CNN training procedures do not learn a robust model for predicting multiple attribute labels simultaneously. The primary reason is that the attributes highly co-occur within the training data, and unlike objects, do not generally exist as well-defined spatial boundaries within the image. To ameliorate this limitation, we propose Deep-Carving, a novel training procedure with CNNs, that helps the net efficiently carve itself for the task of multiple attribute prediction. During training, the responses of the feature maps are exploited in an ingenious way to provide the net with multiple pseudo-labels (for training images) for subsequent iterations. The process is repeated periodically after a fixed number of iterations, and enables the net carve itself iteratively for efficiently disentangling features.
                  Additionally, we contribute a noun-adjective pairing inspired Natural Scenes Attributes Dataset to the research community, CAMIT - NSAD, containing a number of co-occurring attributes within a noun category. We describe, in detail, salient aspects of this dataset. Our experiments on CAMIT-NSAD and the SUN Attributes Dataset, with weak supervision, clearly demonstrate that the Deep-Carved CNNs consistently achieve considerable improvement in the precision of attribute prediction over popular baseline methods.
Sukrit Shankar, Vijay Badrinarayanan and Roberto Cipolla, Part Bricolage: Flow-Assisted Part-Based Graphs for Detecting Activities in Videos, In Computer Vision–ECCV 2014, pp. 586-601. Springer International Publishing, 2014.
Part Bricolage
Space-time detection of human activities in videos can significantly enhance visual search. To handle such tasks, while solely using low-level features has been found somewhat insufficient for complex datasets; mid-level features (like body parts) that are normally considered, are not robustly accounted for their inaccuracy. Moreover, the activity detection mechanisms do not constructively utilize the importance and trustworthiness of the features.
                  This paper addresses these problems and introduces a unified formulation for robustly detecting activities in videos. Our first contribution is the formulation of the detection task as an undirected node- and edge-weighted graphical structure called Part Bricolage (PB), where the node weights represent the type of features along with their importance, and edge weights incorporate the probability of the features belonging to a known activity class, while also accounting for the trustworthiness of the features connecting the edge. Prize-Collecting-Steiner-Tree (PCST) problem [19] is solved for such a graph that gives the best connected subgraph comprising the activity of interest. Our second contribution is a novel technique for robust body part estimation, which uses two types of state-of-the-art pose detectors, and resolves the plausible detection ambiguities with pre-trained classifiers that predict the trustworthiness of the pose detectors. Our third contribution is the proposal of fusing the low-level descriptors with the mid-level ones, while maintaining the spatial structure between the features. For a quantitative evaluation of the detection power of PB, we run PB on Hollywood and MSR-Actions datasets and outperform the state-of-the-art by a significant margin for various detection paradigms.
Sukrit Shankar, Joan Lasenby and Roberto Cipolla, Semantic Transform: Weakly Supervised Semantic Inference for Relating Visual Attributes, In Computer Vision (ICCV), 2013 IEEE International Conference on, pp. 361-368. IEEE, 2013.
Semantic Transform
Relative (comparative) attributes are promising for thematic ranking of visual entities, which also aids in recognition tasks. However, attribute rank learning often requires a substantial amount of relational supervision, which is highly tedious, and apparently impractical for real-world applications. In this paper, we introduce the Semantic Transform, which under minimal supervision, adaptively finds a semantic feature space along with a class ordering that is related in the best possible way. Such a semantic space is found for every attribute category. To relate the classes under weak supervision, the class ordering needs to be refined according to a cost function in an iterative procedure. This problem is ideally NP-hard, and we thus propose a constrained search tree formulation for the same. Driven by the adaptive semantic feature space representation, our model achieves the best results to date for all of the tasks of relative, absolute and zero-shot classification on two popular datasets.