SimRank: a measure of structural-context similarity SimRank: a measure of structural-context similarity
Paper summary #### Introduction * Algorithm to derive similarity between 2 nodes of a graph (or graphical model derived from any other kind of dataset). * [Link to the paper](http://dl.acm.org/citation.cfm?id=775126) #### SimRank * Input: A directed graph $G = (V, E)$ where $V$ represents vertices and $E$ represents edges. * SimRank defines similarity between 2 vertices (or nodes) $i$ and $j$ as the average of the similarity between their in-neighbours decayed by a constant factor $C$. * Any node is maximally similar to itself (with similarity = 1). * PageRank analyses the individual vertices of the graph with respect to the global structure, while SimRank analyses the relationship between a pair of vertices (edges). * SimRank scores are symmetric and can be defined between all pair of vertices. * $G^{2}$ is defined as the node pair graph such that each node in *G^{2}* corresponds to an ordered pair of nodes of $G$ and there exists an edge between node pair (a, b) and (c, d) if there exists an edge between (a, c) and (b, d). * In $G^{2}$, similarity flows from node to node with singleton nodes (nodes of the form (a, a)) as the source of similarity. #### Variants ##### Minimax Variant * Defines similarity of nodes $i$ and $j$ as the minimum of maximum similarity between $i$ and any in-neighbour of $j$ and between $j$ and any in-neighbour of $i$. #### Computing SimRank * A naive solution can be obtained by iteration to a fixed point. * Space complexity is $O(n^{2})$ and time complexity is $O(kn^{2}d)$ where $k$ is the number of iterations, $n$ is the number of vertices and $d$ is the average of product of indegrees of pair of vertices. * Optimisations can be made by setting the similarity between far off nodes as 0 and considering only nearby nodes for an update. #### Different Interpretations ##### Co-citation Score * The first iteration of SimRank produces results same as co-citation score between a pair of vertices. * Successive iterations improve these initial scores. ##### Random Surfer-Pairs Model * SimRank $s(a, b)$ can be interpreted as the measure of how soon two random surfers are expected to meet at the same node if they start at nodes a and b and walk the graph backwards. * Expected Meeting Distance (EMD) between 2 nodes a and b is the expected number of steps required before 2 surfers (starting at a and b) would meet if they walked randomly in locked step. * Surfers are allowed to teleport with a small probability - to circumvent the infinite EMD problem. * Expected-f Meeting Distance (EMD) - Given length l of a tour, compute f(l) (where f is a non-negative monotonic function) to bound the expected distance to a finite interval. * Common choice for f is $f(z) = C^{z}$ where $C \in (0, 1)$ * The SimRank score for two nodes, with parameter $C$, is the expected-f meeting distance travelling back-edges with $f(z) = C^{z}$ #### Evaluation * Experiments on 2 datasets: * Corpus of scientific research papers from ResearchIndex. * Transcripts of undergrad students at Stanford. * Domain specific properties used to measure similarity and compared with SimRank scores. * Results show improvement over co-citation scores.

Loading...
Your comment:


Short Science allows researchers to publish paper summaries that are voted on and ranked!
About