SimRank: a measure of structural-context similarity on ShortScience.org

doi.acm.org
sci-hub
scholar.google.com

SimRank: a measure of structural-context similarity
Jeh, Glen and Widom, Jennifer
ACM Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining - 2002 via Local Bibsonomy
Keywords: vertex, similarity, prediction, link, matrix, graph

Summaries/Notes 1

[link] Summary by Shagun Sodhani 7 years ago

#### Introduction

* Algorithm to derive similarity between 2 nodes of a graph (or graphical model derived from any other kind of dataset).
* [Link to the paper](http://dl.acm.org/citation.cfm?id=775126)

#### SimRank

* Input: A directed graph $G = (V, E)$ where $V$ represents vertices and $E$ represents edges.
* SimRank defines similarity between 2 vertices (or nodes) $i$ and $j$ as the average of the similarity between their in-neighbours decayed by a constant factor $C$.
* Any node is maximally similar to itself (with similarity = 1).
* PageRank analyses the individual vertices of the graph with respect to the global structure, while SimRank analyses the relationship between a pair of vertices (edges).
* SimRank scores are symmetric and can be defined between all pair of vertices.
* $G^{2}$ is defined as the node pair graph such that each node in *G^{2}* corresponds to an ordered pair of nodes of $G$ and there exists an edge between node pair (a, b) and (c, d) if there exists an edge between (a, c) and (b, d).
* In $G^{2}$, similarity flows from node to node with singleton nodes (nodes of the form (a, a)) as the source of similarity.

#### Variants

##### Minimax Variant

* Defines similarity of nodes $i$ and $j$ as the minimum of maximum similarity between $i$ and any in-neighbour of $j$ and between $j$ and any in-neighbour of $i$.

#### Computing SimRank

* A naive solution can be obtained by iteration to a fixed point.
* Space complexity is $O(n^{2})$ and time complexity is $O(kn^{2}d)$ where $k$ is the number of iterations, $n$ is the number of vertices and $d$ is the average of product of indegrees of pair of vertices.
* Optimisations can be made by setting the similarity between far off nodes as 0 and considering only nearby nodes for an update.

#### Different Interpretations

##### Co-citation Score

* The first iteration of SimRank produces results same as co-citation score between a pair of vertices.
* Successive iterations improve these initial scores.

##### Random Surfer-Pairs Model

* SimRank $s(a, b)$ can be interpreted as the measure of how soon two random surfers are expected to meet at the same node if they start at nodes a and b and walk the graph backwards.
* Expected Meeting Distance (EMD) between 2 nodes a and b is the expected number of steps required before 2 surfers (starting at a and b) would meet if they walked randomly in locked step.
* Surfers are allowed to teleport with a small probability - to circumvent the infinite EMD problem.
* Expected-f Meeting Distance (EMD) - Given length l of a tour, compute f(l) (where f is a non-negative monotonic function) to bound the expected distance to a finite interval.
* Common choice for f is $f(z) = C^{z}$ where $C  \in (0, 1)$
* The SimRank score for two nodes, with parameter $C$, is the expected-f meeting distance travelling back-edges with $f(z) = C^{z}$

#### Evaluation

* Experiments on 2 datasets:
    * Corpus of scientific research papers from ResearchIndex.
    * Transcripts of undergrad students at Stanford.
* Domain specific properties used to measure similarity and compared with SimRank scores.
* Results show improvement over co-citation scores.

Your comment:

Write your summary here (You can use $\LaTeX$ and markdown syntax):

Anon Private