Rosetta
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Macros Pages
Public Member Functions | Protected Member Functions | Private Attributes | List of all members
protocols::cluster::APCluster Class Reference

Public interface for doing affinity propagation clustering. More...

#include <APCluster.hh>

Inheritance diagram for protocols::cluster::APCluster:
Inheritance graph
[legend]

Public Member Functions

 APCluster (core::Size total_pts, core::Size max_sims_per_pt=0)
 Create new clustering class for total_pts input data points. Optionally, set a limit on the number of similarities stored per point. More...
 
virtual ~APCluster ()
 
virtual void set_sim (core::Size i, core::Size k, core::Real sim)
 How appropriate is k as an exemplar for i? More...
 
virtual bool cluster (core::Size maxits, core::Size convits, core::Real lambda)
 Run the actual clustering algorithm. More...
 
virtual core::Size num_pts () const
 
virtual core::Size get_exemplar_for (core::Size i) const
 Return the index of the point that is the exemplar for point i. More...
 
virtual core::Size get_num_exemplars () const
 The number of exemplars selected (number of clusters). Monotonically related to the self-preferences s(k,k). More...
 
virtual void get_all_exemplars (utility::vector1< core::Size > &exemplars) const
 Return the indices of data points chosen as exemplars (cluster centers). More...
 
virtual void get_cluster_for (core::Size k, utility::vector1< core::Size > &cluster) const
 Returns the indices of points with the specified exemplar k. Note that k is the index of an (input) data point that was chosen as an exemplar, not some "cluster index" between 1 and get_num_exemplars(). More...
 
virtual core::Real get_net_sim () const
 The sum of similarities s(i,k) between every data point i and its exemplar k, plus the self preferences of the exemplars. The algorithm should minimize this value – if it dips and climbs again, increase lambda. More...
 
virtual bool save_binary (std::string const &filename) const
 Saves the (sparse) similarity matrix and current cluster assignments (if any), but not the accumulated evidence from the last clustering [ r(i,k) and a(i,k) ]. File format is custom binary and is not portable (host endian-ness). More...
 
virtual bool load_binary (std::string const &filename)
 Wipes all currently held data and reads in similarity values and cluster assignments. Afterwards, points may be re-clustered with different parameters if desired. File format is custom binary and is not portable (host endian-ness). More...
 

Protected Member Functions

virtual void freeze ()
 
virtual void reinitialize ()
 
virtual void update_r_ik (core::Real lambda)
 
virtual void update_a_ik (core::Real lambda)
 
virtual core::Size assign_exemplars ()
 
virtual void save_best_exemplars ()
 
virtual void restore_best_exemplars ()
 

Private Attributes

utility::vector1< DataPointpts_
 
core::Size max_sims_per_pt_
 
bool is_frozen_
 

Detailed Description

Public interface for doing affinity propagation clustering.

Based on Frey and Dueck, "Clustering by Passing Messages Between Data Points", Science 315 (2007). Useful for choosing a set of representative data points (exemplars) out of a large set (e.g. all decoys from a large Rosetta run) given a measure of similarity (e.g. RMSD, maxsub, GDT, ...).

As I understand it, this procedures tries to maximize the sum of similarities between each data point and its exemplar, while balancing that with total number of clusters. Reasonable measures of similarity would be negative RMSD, log-likelihoods, or squared distance (i.e. squared error), depending on what the points represent. Note there is no requirement for symmetry: s(i,j) need not equal s(j,i). The self-similarity s(k,k) ("preference") for each point controls the likelihood it will be selected as an exemplar, and thus indirectly controls the total number of clusters. There is no way to directly specify a specific number of clusters. The authors suggest that using the median of all other similarities will give a moderate number of clusters, and using the minimum of the other similaries will give a small number of clusters.

This implementation is designed for clustering very large numbers of data points with sparse similarity [ s(i,k) = -Inf for most i,k ]. Similarities for each input point are kept in a heap so that you can limit to only the L highest for each. (This scheme is quite likely to break symmetry, as some points will have more close neighbors than others.) Alternately, you may choose to do your own pre-filtering and only enter the G globally highest similarities between any points in the data set. Run time (per cycle) is linear in the number of similarities, or O(N^2) in the limit of a dense similarity matrix.

I follow the conventions of the original paper, where "i" is the index of some generic data point, and "k" is the index of a data point being considered as an exemplar (cluster center).

Definition at line 132 of file APCluster.hh.

Constructor & Destructor Documentation

protocols::cluster::APCluster::APCluster ( core::Size  total_pts,
core::Size  max_sims_per_pt = 0 
)

Create new clustering class for total_pts input data points. Optionally, set a limit on the number of similarities stored per point.

Definition at line 60 of file APCluster.cc.

References max_sims_per_pt_, and pts_.

protocols::cluster::APCluster::~APCluster ( )
virtual

Definition at line 74 of file APCluster.cc.

Member Function Documentation

core::Size protocols::cluster::APCluster::assign_exemplars ( )
protectedvirtual
bool protocols::cluster::APCluster::cluster ( core::Size  maxits,
core::Size  convits,
core::Real  lambda 
)
virtual

Run the actual clustering algorithm.

Parameters
maxitsmaximum number of iterations, period; 100 - 4000 reasonable.
convitsterminate after clusters don't change for this many iterations, 10 - 400 reasonable.
lambdadamping factor, 0.50 - 0.95 is a reasonable range
Returns
true if the clustering converges and false otherwise. Failure to converge is not necessarily a problem – you may be close to a good solution. Try increasing maxits and/or making lambda closer to 1 if you want convergence.

Definition at line 98 of file APCluster.cc.

References assign_exemplars(), freeze(), get_net_sim(), get_num_exemplars(), is_frozen_, reinitialize(), restore_best_exemplars(), save_best_exemplars(), protocols::TR(), update_a_ik(), and update_r_ik().

void protocols::cluster::APCluster::freeze ( )
protectedvirtual
void protocols::cluster::APCluster::get_all_exemplars ( utility::vector1< core::Size > &  exemplars) const
virtual

Return the indices of data points chosen as exemplars (cluster centers).

Definition at line 153 of file APCluster.cc.

References protocols::cluster::DataPoint::curr_exemplar, protocols::cluster::DataPoint::i, core::chemical::element::N, and pts_.

void protocols::cluster::APCluster::get_cluster_for ( core::Size  k,
utility::vector1< core::Size > &  cluster 
) const
virtual

Returns the indices of points with the specified exemplar k. Note that k is the index of an (input) data point that was chosen as an exemplar, not some "cluster index" between 1 and get_num_exemplars().

Definition at line 163 of file APCluster.cc.

References protocols::cluster::DataPoint::curr_exemplar, core::chemical::element::N, and pts_.

virtual core::Size protocols::cluster::APCluster::get_exemplar_for ( core::Size  i) const
inlinevirtual

Return the index of the point that is the exemplar for point i.

Definition at line 148 of file APCluster.hh.

References pts_.

core::Real protocols::cluster::APCluster::get_net_sim ( ) const
virtual

The sum of similarities s(i,k) between every data point i and its exemplar k, plus the self preferences of the exemplars. The algorithm should minimize this value – if it dips and climbs again, increase lambda.

Definition at line 173 of file APCluster.cc.

References protocols::cluster::DataPoint::candidates, protocols::cluster::DataPoint::curr_exemplar, protocols::cluster::Exemplar::k, core::chemical::element::N, pts_, protocols::cluster::Exemplar::s_ik, and protocols::cluster::DataPoint::s_kk.

Referenced by cluster().

core::Size protocols::cluster::APCluster::get_num_exemplars ( ) const
virtual

The number of exemplars selected (number of clusters). Monotonically related to the self-preferences s(k,k).

Definition at line 142 of file APCluster.cc.

References protocols::cluster::DataPoint::curr_exemplar, protocols::cluster::DataPoint::i, core::chemical::element::N, and pts_.

Referenced by cluster().

bool protocols::cluster::APCluster::load_binary ( std::string const &  filename)
virtual

Wipes all currently held data and reads in similarity values and cluster assignments. Afterwards, points may be re-clustered with different parameters if desired. File format is custom binary and is not portable (host endian-ness).

Definition at line 465 of file APCluster.cc.

References protocols::cluster::DataPoint::candidates, protocols::cluster::DataPoint::curr_exemplar, protocols::cluster::DataPoint::i, is_frozen_, protocols::cluster::Exemplar::k, max_sims_per_pt_, core::chemical::element::N, pts_, protocols::cluster::read1(), protocols::cluster::Exemplar::s_ik, and protocols::cluster::DataPoint::s_kk.

virtual core::Size protocols::cluster::APCluster::num_pts ( ) const
inlinevirtual

Definition at line 146 of file APCluster.hh.

References pts_.

void protocols::cluster::APCluster::reinitialize ( )
protectedvirtual
void protocols::cluster::APCluster::restore_best_exemplars ( )
protectedvirtual
void protocols::cluster::APCluster::save_best_exemplars ( )
protectedvirtual
bool protocols::cluster::APCluster::save_binary ( std::string const &  filename) const
virtual

Saves the (sparse) similarity matrix and current cluster assignments (if any), but not the accumulated evidence from the last clustering [ r(i,k) and a(i,k) ]. File format is custom binary and is not portable (host endian-ness).

Definition at line 441 of file APCluster.cc.

References protocols::cluster::DataPoint::candidates, protocols::cluster::DataPoint::curr_exemplar, protocols::cluster::DataPoint::i, protocols::cluster::Exemplar::k, max_sims_per_pt_, core::chemical::element::N, pts_, protocols::cluster::Exemplar::s_ik, protocols::cluster::DataPoint::s_kk, and protocols::cluster::write1().

void protocols::cluster::APCluster::set_sim ( core::Size  i,
core::Size  k,
core::Real  sim 
)
virtual

How appropriate is k as an exemplar for i?

Adding s(i,j) is not the same as adding s(j,i) – you must do both if you want symmetry.

There is currently no protection against adding s(i,k) twice, which will not be caught and will screw up the computation royally.

Definition at line 81 of file APCluster.cc.

References is_frozen_, max_sims_per_pt_, and pts_.

void protocols::cluster::APCluster::update_a_ik ( core::Real  lambda)
protectedvirtual
void protocols::cluster::APCluster::update_r_ik ( core::Real  lambda)
protectedvirtual

Member Data Documentation

bool protocols::cluster::APCluster::is_frozen_
private

Definition at line 184 of file APCluster.hh.

Referenced by cluster(), freeze(), load_binary(), and set_sim().

core::Size protocols::cluster::APCluster::max_sims_per_pt_
private

Definition at line 183 of file APCluster.hh.

Referenced by APCluster(), load_binary(), save_binary(), and set_sim().

utility::vector1< DataPoint > protocols::cluster::APCluster::pts_
private

The documentation for this class was generated from the following files: