RECENT WORK
Sukrit Shankar, Vikas Kumar Garg and Roberto Cipolla, Deep Carving: Discovering Visual Attributes by Carving Deep Neural Nets, Under Review, CVPR 2015.
Deep Carving
Most approaches for discovering visual attributes in images demand significant supervision, which is cumbersome to obtain. In this paper, we aim to discover visual attributes in a weakly supervised setting that is commonly encountered with contemporary image search engines. For instance, given a noun (say face) and its associated attributes (say bald, chubby, happy), search engines can now retrieve many valid images for any attribute-noun pair (chubby faces, happy faces, etc.). However, the images retrieved for one attribute-noun pair carry no information about the other attributes (for example, which of the chubby faces are also bald). This gives rise to a weakly supervised scenario: each of the M attributes corresponds to a class, such that a training image in class m ∈ {1, …, M} carries a single label indicating the presence of the mth attribute only. The task is to discover all the attributes present in a test image.
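To make the setup concrete, here is a minimal sketch of the label structure (the attribute names and the NaN convention for unobserved labels are illustrative assumptions, not taken from the paper):

```python
import numpy as np

ATTRS = ["bald", "chubby", "happy"]   # M = 3 attribute classes (illustrative)
M = len(ATTRS)

def weak_label(m):
    """A training image retrieved for attribute m carries one positive
    label; the other attributes are unobserved (NaN), not negative."""
    y = np.full(M, np.nan)
    y[m] = 1.0
    return y

# e.g. one "bald faces" image, two "chubby faces", one "happy faces"
train_labels = np.stack([weak_label(m) for m in (0, 1, 1, 2)])
print(train_labels)
# [[ 1. nan nan]
#  [nan  1. nan]
#  [nan  1. nan]
#  [nan nan  1.]]

# At test time the task is full multi-label prediction: a {0,1} vector
# over all M attributes for each image.
```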
Deep Convolutional Neural Networks (CNNs) have recently enjoyed remarkable success in several vision applications. However, in this weakly supervised scenario, widely used CNN training procedures do not learn an efficient model for predicting multiple attribute labels simultaneously. To ameliorate this limitation, we propose Deep Carving, a novel CNN training procedure that learns a carved (or thinned) net for each attribute class. Each carved net contains an optimally selected subset of feature maps, which contributes maximally to its own attribute class and minimally to the others. A test image is then projected onto each carved net to determine whether the corresponding attribute class is present in the image.
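The paper's exact carving criterion is not reproduced here; the sketch below only illustrates the general idea under assumed definitions, scoring each feature map by how much more it responds to one class than to the others, keeping the dominant maps, and evaluating a test image on each carved (masked) net. The margin and decision threshold are invented values:

```python
import numpy as np

# Stand-in for measured statistics: activations[m] holds mean feature-map
# responses of a trained CNN layer over training images of class m.
rng = np.random.default_rng(0)
num_maps, M = 64, 3
activations = rng.random((M, num_maps))

def carve(m, margin=0.1):
    """Keep the feature maps that respond to class m noticeably more
    than to the average of the other classes (assumed criterion)."""
    others = np.delete(activations, m, axis=0).mean(axis=0)
    return np.where(activations[m] - others > margin)[0]

carved = {m: carve(m) for m in range(M)}

def attribute_score(test_feats, m):
    """Project a test image's feature-map responses onto the carved net
    for class m; a higher score suggests attribute m is present."""
    return test_feats[carved[m]].mean() if carved[m].size else 0.0

test_feats = rng.random(num_maps)
present = [m for m in range(M) if attribute_score(test_feats, m) > 0.5]
print(present)
```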
Additionally, we contribute two new attribute datasets to the research community, Face Attributes and Natural Scenes Attributes, each containing a significant number of co-occurring attributes, and we describe their salient aspects in detail. Our experiments on these and the SUN Attributes Dataset, under weak supervision, clearly demonstrate that Deep-Carved CNNs consistently achieve considerable improvement in the precision of attribute prediction over several baseline methods.
Sukrit Shankar, Vijay Badrinarayanan and Roberto Cipolla, Part Bricolage: Flow-Assisted Part-Based Graphs for Detecting Activities in Videos, In Computer Vision–ECCV 2014, pp. 586-601. Springer International Publishing, 2014.
Part Bricolage
Space-time detection of human activities in videos can significantly enhance visual search. For such tasks, low-level features alone have proven somewhat insufficient on complex datasets, while the mid-level features (such as body parts) that are typically used are not handled in a way that robustly accounts for their inaccuracy. Moreover, existing activity detection mechanisms do not constructively utilize the importance and trustworthiness of the features.
This paper addresses these problems and introduces a unified formulation for robustly detecting activities in videos. Our first contribution is to formulate the detection task over an undirected node- and edge-weighted graphical structure called Part Bricolage (PB), where node weights encode the type of each feature along with its importance, and edge weights incorporate the probability of the features belonging to a known activity class while also accounting for the trustworthiness of the features that the edge connects. The Prize-Collecting Steiner Tree (PCST) problem [19] is then solved on this graph, yielding the best connected subgraph comprising the activity of interest. Our second contribution is a novel technique for robust body-part estimation, which uses two types of state-of-the-art pose detectors and resolves plausible detection ambiguities with pre-trained classifiers that predict the trustworthiness of the pose detectors. Our third contribution is to fuse the low-level descriptors with the mid-level ones while maintaining the spatial structure between the features. For a quantitative evaluation of the detection power of PB, we run it on the Hollywood and MSR Actions datasets and outperform the state-of-the-art by a significant margin under various detection paradigms.
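As an illustration of the graph-plus-PCST formulation, the sketch below builds a toy node- and edge-weighted graph and grows a tree with a simple greedy heuristic. The paper solves the actual PCST problem; the greedy stand-in, the node prizes, and the edge costs here are all invented for illustration:

```python
import networkx as nx

# Toy graph in the spirit of Part Bricolage: node weights ("prizes") encode
# feature type/importance, edge weights encode activity-class probability
# discounted by feature trustworthiness. All numbers are illustrative.
G = nx.Graph()
prizes = {"head": 0.9, "torso": 0.8, "hand": 0.6, "bg_patch": 0.1}
for n, p in prizes.items():
    G.add_node(n, prize=p)
G.add_edge("head", "torso", cost=0.2)
G.add_edge("torso", "hand", cost=0.3)
G.add_edge("torso", "bg_patch", cost=0.7)

def greedy_pcst(G, root):
    """Greedy stand-in for a PCST solver: grow a tree from `root`,
    adding a frontier node whenever its prize exceeds the cost of the
    edge connecting it to the tree. (A didactic approximation only.)"""
    tree = {root}
    while True:
        best, gain = None, 0.0
        for u in tree:
            for v in G.neighbors(u):
                if v in tree:
                    continue
                g = G.nodes[v]["prize"] - G.edges[u, v]["cost"]
                if g > gain:
                    best, gain = v, g
        if best is None:
            return tree
        tree.add(best)

print(greedy_pcst(G, root="head"))  # -> {'head', 'torso', 'hand'}
```

The low-prize, high-cost background patch is left out of the tree, mirroring how the best connected subgraph keeps only the features that plausibly belong to the activity.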
Sukrit Shankar, Joan Lasenby and Roberto Cipolla, Semantic Transform: Weakly Supervised Semantic Inference for Relating Visual Attributes, In Computer Vision (ICCV), 2013 IEEE International Conference on, pp. 361-368. IEEE, 2013.
Semantic Transform
Relative (comparative) attributes are promising for thematic ranking of visual entities, which also aids recognition tasks. However, attribute rank learning often requires a substantial amount of relational supervision, which is highly tedious to obtain and apparently impractical for real-world applications. In this paper, we introduce the Semantic Transform, which, under minimal supervision, adaptively finds a semantic feature space along with a class ordering that are related in the best possible way. Such a semantic space is found for every attribute category. To relate the classes under weak supervision, the class ordering must be refined according to a cost function in an iterative procedure. This problem is NP-hard in general, and we therefore propose a constrained search tree formulation to solve it. Driven by the adaptive semantic feature space representation, our model achieves the best results to date on the tasks of relative, absolute, and zero-shot classification on two popular datasets.
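A didactic sketch of such a constrained search over class orderings follows. The pairwise constraints and the violation-count cost are stand-ins; the paper's actual cost function couples the ordering with the learned semantic feature space:

```python
CLASSES = ["A", "B", "C", "D"]                       # toy attribute classes
CONSTRAINTS = [("A", "B"), ("B", "C"), ("A", "D")]   # weak: x ranks before y

def search(prefix, remaining, best):
    """Constrained depth-first search tree over orderings: prune any
    branch whose partial prefix already violates as many constraints
    as the best complete ordering found so far."""
    partial = sum(
        1 for x, y in CONSTRAINTS
        if x in prefix and y in prefix and prefix.index(x) > prefix.index(y)
    )
    if partial >= best[1]:
        return  # prune: cannot improve on the incumbent
    if not remaining:
        best[0], best[1] = list(prefix), partial
        return
    for c in sorted(remaining):
        search(prefix + [c], remaining - {c}, best)

best = [None, float("inf")]  # (ordering, number of violated constraints)
search([], set(CLASSES), best)
print(best)  # -> [['A', 'B', 'C', 'D'], 0]
```

Once a zero-violation ordering is found, every remaining branch is pruned at its first node, which is the kind of saving a constrained search tree offers over enumerating all orderings of an NP-hard problem.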