CoSimRank - a fast and accurate graph based similarity measure

Introduction

The codes provided here will compute CoSimRank. You can also use it to compute the cosine of personalized PageRank vectors. Our experiments have shown that CoSimRank is more accurate.

To learn more about CoSimRank, read the following paper:
http://anthology.aclweb.org/P/P14/P14-1131.pdf
Or watch the talk (20min):
http://techtalks.tv/talks/cosimrank-a-flexible-efficient-graph-theoretic-similarity-measur/60483/

Get started

  1. clone the Git repository:
    git clone https://github.com/casaro/CoSimRank.git
  2. go to the folder and compile the code
    make
  3. run the sandbox to play around
    sandbox/sandbox.sh

Graph files

The graph files are in matrix market file format. Each graph can contain several files, each one presenting one edge type. The code comes with an English and a German graph, with approximately 30k words each. The graphs have no edges type (or just one).

Dictionary files

Each graph requires a dictionary file to connect nodes and words. A dictionary files contains N lines (where as N is the number of words in the graph, i.e. the matrix market files has dimension NxN). Each lines starts with the the index, followed by the word and the pos. All separated by space. The codes also contains the matching dictionary files for the graphs.

Seed files

For bilingual experiments a seed file is required. This file contains known translations. It is also in matrix market file file format and usually contains edges of weigth 1. But other weights might be interesting as well. For monolingual experiments (synonym extraction) this file is not used.

Testset file

You can provide a testset file if you want (see sandbox/example.sh for an example). The code will execute all tests and compute acc@1, acc@10 and MRR. A testset file starts with a line containing the number of tests T, the source graph and the target graph (either A or B). The following T lines will contain the pos (a, n or v), the source word and the expected answer. The anser can be several word separated by ",". All other entries are separated by tabulator. The code has got a TS1000 testset for lexicon extraction and a TS68 testset for synonym extraction.

Cite

If you use CoSimRank, please cite the following paper:

@InProceedings{rothe-schutze:2014:P14-1,
  author    = {Rothe, Sascha  and  Sch\"{u}tze, Hinrich},
  title     = {CoSimRank: A Flexible \& Efficient Graph-Theoretic Similarity Measure},
  booktitle = {Proceedings of the ACL},
  month     = {June},
  year      = {2014},
  address   = {Baltimore, Maryland},
  publisher = {Association for Computational Linguistics},
  pages     = {1392--1402},
  url       = {http://www.aclweb.org/anthology/P/P14/P14-1131}
}

Contact: Sascha Rothe (cis page)