The data which knowledge workers need to conduct their work is being stored across an increasing number of
repositories and is growing significantly in size. It is therefore unreasonable to expect that knowledge workers
can efficiently search and identify what they need across a myriad of locations where upwards of hundreds of
thousands of items can be created daily. This work describes a system which can observe user activity and
train models to predict which items a user will access, in order to help knowledge workers discover content.
We specifically investigate network file systems and determine how well we can predict future access to newly
created or modified content. Utilizing file metadata to construct access prediction models, we show how the
performance of these models can be improved for shares demonstrating high collaboration among their users.
Experiments on ten enterprise shares reveal that models based on file metadata can achieve F scores upwards
of 99%. Furthermore, on average, collaboration-aware models can correctly predict nearly half of new file
accesses by users while ensuring a precision of 75%, thus validating that the proposed system can be utilized
to help knowledge workers discover new or modified content.
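
To make the reported metrics concrete, the sketch below computes precision, recall and F score for a set of predicted file accesses. It is only an illustration: the function name, the (user, file) pair representation and the toy data are assumptions, and it assumes that "F score" refers to the balanced F1 measure, which the summary above does not state.

```python
def precision_recall_f1(predicted, actual):
    """Score predicted (user, file) accesses against observed ones.

    Hypothetical helper for illustration; not the system's actual code.
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F score (F1): harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (made up): six observed accesses, four predictions, three correct.
actual = {("alice", f"doc{i}") for i in range(6)}
predicted = {("alice", "doc0"), ("alice", "doc1"),
             ("alice", "doc2"), ("alice", "other")}

precision, recall, f1 = precision_recall_f1(predicted, actual)
print(precision, recall, round(f1, 2))
```

In this toy case three of the four predictions are correct (precision 0.75) and three of the six real accesses are found (recall 0.5), which mirrors the operating point described above of predicting about half of new accesses at 75% precision.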
This work was done at, and in collaboration with, Symantec Research Labs, Mountain View.
Sandeep Bhatkar, Michael Hart, Aleatha Parker-Wood, Sujit Dey
The work was accepted as a full paper at the ICEIS 2015 conference (acceptance rate ~15%). An extended version is under preparation. Please see the Publications section for details.
As humans, we understand the relations between Soccer and Ronaldo, or between Germany and Berlin. How can we make machines understand and use such relations? Taxonomies and ontologies have been studied as ways to represent relations between different entities. WordNet is a popular lexical ontology, constructed manually by experts. While such a graph captures semantic relations between tags, it fails to encode the information about interactions between tags that is present in a given corpus of images. In this project, we study a data-driven approach to constructing an ontological graph for a set of image tags obtained from a large corpus of images, where each image in the corpus is annotated with zero or more tags. With certain simplifying assumptions to help in the construction, we formulate the graph construction as an optimization problem and provide an approximate solution.
Evaluating ontologies or taxonomies is often a difficult task. While most research focuses on manual evaluation or comparison with a manually built gold-standard ontology, we propose evaluating the ontological graphs using novel data-driven tasks that assess how well the tree structures capture tag statistics in images. This work was done in collaboration with Yahoo Labs, Bangalore.
Personalization applications such as content recommendations, product recommendations and advertisements, and social network related recommendations can be quite beneficial for both service providers and users. Such applications need to understand user preferences in order to provide customized services. As user engagement with web videos has grown significantly, understanding user preferences based on watched videos looks promising. However, this requires being able to classify web videos into a set of categories appropriate for the personalization application. Such categories may be substantially different from the common categories (such as Comedy, Entertainment, Pets, etc.) that are used by video sharing websites. Hence, training videos for classifying web videos into the required set of categories might be unavailable.
In this project, we study the feasibility and effectiveness of a fully automated framework for obtaining training videos, so that web videos can be classified into any arbitrary set of categories desired by the personalization application. We investigate the properties of training data that can lead to high performance of the trained classification models. We then develop an approach to identify and score keywords based on their suitability for retrieving training videos with the desired properties, for the specified set of categories. Experimental results using YouTube videos indicate that the proposed approach can feasibly achieve high classification performance. Comparisons with retrieving training videos using category names reveal that our approach performs significantly better. More information on this work is available at the Project Webpage.