SimRank: a measure of structural-context similaritySimRank: a measure of structural-context similarityJeh, Glen and Widom, Jennifer2002
Paper summaryshagunsodhani#### Introduction
* Algorithm to derive similarity between 2 nodes of a graph (or graphical model derived from any other kind of dataset).
* [Link to the paper](http://dl.acm.org/citation.cfm?id=775126)
#### SimRank
* Input: A directed graph $G = (V, E)$ where $V$ represents vertices and $E$ represents edges.
* SimRank defines similarity between 2 vertices (or nodes) $i$ and $j$ as the average of the similarity between their in-neighbours decayed by a constant factor $C$.
* Any node is maximally similar to itself (with similarity = 1).
* PageRank analyses the individual vertices of the graph with respect to the global structure, while SimRank analyses the relationship between a pair of vertices (edges).
* SimRank scores are symmetric and can be defined between all pair of vertices.
* $G^{2}$ is defined as the node pair graph such that each node in *G^{2}* corresponds to an ordered pair of nodes of $G$ and there exists an edge between node pair (a, b) and (c, d) if there exists an edge between (a, c) and (b, d).
* In $G^{2}$, similarity flows from node to node with singleton nodes (nodes of the form (a, a)) as the source of similarity.
#### Variants
##### Minimax Variant
* Defines similarity of nodes $i$ and $j$ as the minimum of maximum similarity between $i$ and any in-neighbour of $j$ and between $j$ and any in-neighbour of $i$.
#### Computing SimRank
* A naive solution can be obtained by iteration to a fixed point.
* Space complexity is $O(n^{2})$ and time complexity is $O(kn^{2}d)$ where $k$ is the number of iterations, $n$ is the number of vertices and $d$ is the average of product of indegrees of pair of vertices.
* Optimisations can be made by setting the similarity between far off nodes as 0 and considering only nearby nodes for an update.
#### Different Interpretations
##### Co-citation Score
* The first iteration of SimRank produces results same as co-citation score between a pair of vertices.
* Successive iterations improve these initial scores.
##### Random Surfer-Pairs Model
* SimRank $s(a, b)$ can be interpreted as the measure of how soon two random surfers are expected to meet at the same node if they start at nodes a and b and walk the graph backwards.
* Expected Meeting Distance (EMD) between 2 nodes a and b is the expected number of steps required before 2 surfers (starting at a and b) would meet if they walked randomly in locked step.
* Surfers are allowed to teleport with a small probability - to circumvent the infinite EMD problem.
* Expected-f Meeting Distance (EMD) - Given length l of a tour, compute f(l) (where f is a non-negative monotonic function) to bound the expected distance to a finite interval.
* Common choice for f is $f(z) = C^{z}$ where $C \in (0, 1)$
* The SimRank score for two nodes, with parameter $C$, is the expected-f meeting distance travelling back-edges with $f(z) = C^{z}$
#### Evaluation
* Experiments on 2 datasets:
* Corpus of scientific research papers from ResearchIndex.
* Transcripts of undergrad students at Stanford.
* Domain specific properties used to measure similarity and compared with SimRank scores.
* Results show improvement over co-citation scores.
#### Introduction
* Algorithm to derive similarity between 2 nodes of a graph (or graphical model derived from any other kind of dataset).
* [Link to the paper](http://dl.acm.org/citation.cfm?id=775126)
#### SimRank
* Input: A directed graph $G = (V, E)$ where $V$ represents vertices and $E$ represents edges.
* SimRank defines similarity between 2 vertices (or nodes) $i$ and $j$ as the average of the similarity between their in-neighbours decayed by a constant factor $C$.
* Any node is maximally similar to itself (with similarity = 1).
* PageRank analyses the individual vertices of the graph with respect to the global structure, while SimRank analyses the relationship between a pair of vertices (edges).
* SimRank scores are symmetric and can be defined between all pair of vertices.
* $G^{2}$ is defined as the node pair graph such that each node in *G^{2}* corresponds to an ordered pair of nodes of $G$ and there exists an edge between node pair (a, b) and (c, d) if there exists an edge between (a, c) and (b, d).
* In $G^{2}$, similarity flows from node to node with singleton nodes (nodes of the form (a, a)) as the source of similarity.
#### Variants
##### Minimax Variant
* Defines similarity of nodes $i$ and $j$ as the minimum of maximum similarity between $i$ and any in-neighbour of $j$ and between $j$ and any in-neighbour of $i$.
#### Computing SimRank
* A naive solution can be obtained by iteration to a fixed point.
* Space complexity is $O(n^{2})$ and time complexity is $O(kn^{2}d)$ where $k$ is the number of iterations, $n$ is the number of vertices and $d$ is the average of product of indegrees of pair of vertices.
* Optimisations can be made by setting the similarity between far off nodes as 0 and considering only nearby nodes for an update.
#### Different Interpretations
##### Co-citation Score
* The first iteration of SimRank produces results same as co-citation score between a pair of vertices.
* Successive iterations improve these initial scores.
##### Random Surfer-Pairs Model
* SimRank $s(a, b)$ can be interpreted as the measure of how soon two random surfers are expected to meet at the same node if they start at nodes a and b and walk the graph backwards.
* Expected Meeting Distance (EMD) between 2 nodes a and b is the expected number of steps required before 2 surfers (starting at a and b) would meet if they walked randomly in locked step.
* Surfers are allowed to teleport with a small probability - to circumvent the infinite EMD problem.
* Expected-f Meeting Distance (EMD) - Given length l of a tour, compute f(l) (where f is a non-negative monotonic function) to bound the expected distance to a finite interval.
* Common choice for f is $f(z) = C^{z}$ where $C \in (0, 1)$
* The SimRank score for two nodes, with parameter $C$, is the expected-f meeting distance travelling back-edges with $f(z) = C^{z}$
#### Evaluation
* Experiments on 2 datasets:
* Corpus of scientific research papers from ResearchIndex.
* Transcripts of undergrad students at Stanford.
* Domain specific properties used to measure similarity and compared with SimRank scores.
* Results show improvement over co-citation scores.