[2017 ACM TODS] Efficient SimRank-based Similarity Join

Weiguo Zheng's paper (Efficient SimRank-based Similarity Join) is accepted by ACM Transactions on Database Systems (ACM TODS).

Graphs have been widely used to model complex data in many real-world applications. Answering vertex join queries over large graphs is meaningful and interesting, which can benefit friend recommendation in social networks and link prediction, etc. In this paper, we adopt “SimRank” to evaluate the similarity between two vertices in a large graph because of its generality. Specifically, we define a SimRank-based join (SRJ) query to find all the vertex pairs satisfying the threshold in a data graph G. In order to reduce the search space, we propose a shortest-path distance based upper bound for SimRank scores to prune unpromising vertex pairs. In the verification, we propose a novel index, called h-go cover+, to efficiently compute the SimRank score of any single vertex pair. Given a graph G, we only materialize the SimRank scores of a small proportion of vertex pairs (i.e., the h-go cover+ vertex pairs), based on which, the SimRank score of any vertex pair can be computed easily. To find the h-go cover+ vertex pairs, we propose an efficient method without building the vertex-pair graph. Hence, large graphs can be dealt with easily. Extensive experiments over both real and synthetic datasets confirm the efficiency of our solution.