Graph Algorithms and Appilcations

Graph Basics

A graph is defined as a collection of objects where some pairs of objects are connected by links.

Undirected graphs have symmetrical/reciprocal links, directed graphs have directed links.

Complete Graph: an undirected graph with the maximum number of edges (such that all pairs of nodes are connected).

Graph attribute Key factor
Connected versus disconnected Whether there is a path between any two nodes in the graph, irrespective of distance
Weighted versus unweighted Whether there are (domain-specific) values on relationships or nodes
Directed versus undirected Whether or not relationships explicitly define a start and end node
Cyclic versus acyclic Whether paths start and end at the same node
Sparse versus dense Relationship to node ratio
Monopartite, bipartite, and k-partite Whether nodes connect to only one other node type (e.g., users like movies) or many other node types (e.g., users like users who like movies)

Erdös-Rényi Random Graph Model
The Erdös-Rényi Random Graph Model is the simplest model of graphs. This random graph model comes in two variants:

  1. \(G_{np}\): undirected graph on \(n\) nodes where each \(edge(u, v)\) appears IID with probability \(p\)
  2. \(G_{nm}\): undirected graph with \(n\) nodes, and \(m\) edges picked uniformly at random

The degree distribution of \(G_{np}\) is binomial, let \(P(k)\) denotes the fraction of nodes with degree \(k\):

$$ P(k) = \binom{n-1}{k} p^k (1-p)^{n-1-k} $$

Small World Random Graph Model
It’s a model for constructing a family of networks with both high clustering and short average path length.

To create such a model, we employ the following steps:

  1. Start with low-dimensional regular attic (ring) by connecting each node to \(k\) neighbors on its right and \(k\) neighbors on its left, with \(k \geq 2\)
  2. Rewire each edge with probability \(p\) by moving its endpoint to a randomly chosen node

Kronecker Random Graph Model
The Kronecker Graph Model is a recursive graph generation model that combines both mathematical tractability and realistic static and temporal network properties. The intuition underlying the Kronecker Graph Model is self-similarity, where the whole has the same shape as one or more of its parts.

Pathfinding and Graph Search Algorithms

Pathfinding algorithms build on top of graph search algorithms and explore routes between nodes, starting at one node and traversing through relationships until the destination has been reached. Here’s a summary of general graph search algorithms:

Algorithm type What it does Example use
Breadth First Search Traverses a tree structure by fanning out to explore the nearest neighbors and then their sublevel neighbors Locating neighbor nodes in GPS systems to identify nearby places of interest
Depth First Search Traverses a tree structure by exploring as far as possible down each branch before backtracking Discovering an optimal solution path in gaming simulations with hierarchical choices
Shortest Path (Variations: A*, Yen’s) Calculates the shortest path between a pair of nodes Finding driving directions between two locations
All Pairs Shortest Path Calculates the shortest path between all pairs of nodes in the graph Evaluating alternate routes around a traffic jam
Single Source Shortest Path Calculates the shorest path between a single root node and all other nodes Least cost routing of phone calls
Minimum Spanning Tree Calculates the path in a connected tree with the smallest cost for visiting all nodes Optimizing connected structure routing, such as laying cable or garbage collection
Random Walk Returns a list of nodes along a path of specified size by randomly choosing relationships to traverse Augmenting training for machine learning or data for graph algorithms

Centrality Algorithms

Centrality algorithms are used to understand the roles of particular nodes in a graph and their impact on that network. They can help us identify the most important nodes in the graph.

Overview

Algorithm type What it does Example use
Degree Centrality Measures the number of relationships a node has Estimating a person’s popularity by looking at their in-degree and using their out-degree to estimate gregariousness
Closeness Centrality Calculates which nodes have the shortest paths to all other nodes Finding the optimal location of new public services for maximum accessibility
Betweenness Centrality Measures the number of shortest paths that pass through a node Improving drug targeting by finding the control genes for specific diseases
PageRank Estimates a current node’s importance from its linked neighbors and their neighbors Finding the most influential features for extraction in machine learning and ranking text for entity relevance in natural language processing

Degree Centrality

This is the simplest graph algorithm, it counts the number of incoming and outgoing relationships from a node, and is used to find popular nodes in a graph.

Use Degree Centrality to analyze influence by looking at the number of incoming and outgoing relationships, or find the popularity of individual nodes.

Closeness Centrality

Closeness Centrality is a way of detecting nodes that are able to spread information efficiently through a subgraph. The closeness centrality of a node is calculated by:

$$ C(u) = \frac{1}{\sum_{v=1}^{n-1}d(u, v)} $$

where:

  • \(u\) is a node
  • \(n\) is the number of nodes in the graph
  • \(d(u,v)\) is the shortest-path distance between another node \(v\) and \(u\)

Apply Closeness Centrality to know which nodes disseminate things the fastest.

Betweenness Centrality

Betweenness Centrality is a way of detecting the amount of influence a node has over the flow of information or resources in a graph.

$$ B(u) = \sum_{s \neq u \neq t} \frac{p(u)}{p} $$

where:

  • \(u\) is a node
  • \(p\) is the total number of shortest paths between nodes \(s\) and \(t\)
  • \(p(u)\) is the number of shortest paths between nodes \(s\) and \(t\) that pass through node \(u\)

Betweenness Centrality applies to a wide range of problems in real world networks. We use it to find bottlenecks, control points, and vulnerabilities.

PageRank

PageRank is the best known of the centrality algorithms. It measures the transitive (or directional) influence of nodes. All the other centrality algorithms we discuss measure the direct influence of a node, whereas PageRank considers the influence of a node’s neighbors, and their neighbors.

$$ PR(u) = (1-d)+ d(\frac{PR(T_1)}{C(T_1)} + … + \frac{PR(T_n)}{C(T_n)}) $$

where:

  • we assume that a page u has citations from pages \(T_1\) to \(T_n\)
  • \(d\) is a damping factor which is set between 0 and 1. It is usually set to 0.85
  • \(1-d\) is the probability that a node is reached directly without following any relationships
  • \(C(T_n)\) is defined as the out-degree of a node \(T\)

PageRank is now used in many domains outside web indexing. Use this algorithm whenever you’re looking for broad influence over a network. For instance, if you’re looking to target a gene that has the highest overall impact to a biological function, it may not be the most connected one. It may, in fact, be the gene with the most relationships with other, more significant functions.

Community Detection Algorithms

Community formation is common in all types of networks, and identifying them is essential for evaluating group behavior and emergent phenomena.

Overview

Algorithm type What it does Example use
Triangle Count and Clustering Coefficient Measures how many nodes form triangles and the degree to which nodes tend to cluster together Estimating group stability and whether the network might exhibit “small-world” behaviors seen in graphs with tightly knit clusters
Strongly Connected Components Finds groups where each node is reachable from every other node in that same group following the direction of relationships Making product recommendations based on group affiliation or similar items
Connected Components Finds groups where each node is reachable from every other node in that same group, regardless of the direction of relationships Performing fast grouping for other algorithms and identify islands
Label Propagation Infers clusters by spreading labels based on neighborhood majorities Understanding consensus in social communities or finding dangerous combinations of possible co-prescribed drugs
Louvain Modularity Maximizes the presumed accuracy of groupings by comparing relationship weights and densities to a defined estimate or average In fraud analysis, evaluating whether a group has just a few discrete bad behaviors or is acting as a fraud ring

Triangle Count and Clustering Coefficient

Triangle Count determines the number of triangles passing through each node in the graph.
A triangle is a set of three nodes, where each node has a relationship to all other nodes.

The goal of the Clustering Coefficient algorithm is to measure how tightly a group is clustered compared to how tightly it could be clustered. The algorithm uses Triangle Count in its calculations, which provides a ratio of existing triangles to possible relationships. A maximum value of 1 indicates a clique where every node is connected to every other node.

Local Clustering Coefficient

The local clustering coefficient of a node is the likelihood that its neighbors are also connected. The computation of this score involves triangle counting.

$$ CC(u) = \frac{2R_{u}}{k_{u}(k_{u}-1)} $$

where:

  • \(u\) is a node
  • \(R(u)\) is the number of relationships through the neighbors of \(u\)
  • \(k(u)\) is the degree of \(u\)

Global Clustering Coefficient

The global clustering coefficient is the normalized sum of the local clustering coefficients.

Clustering Coefficient can provide the probability that randomly chosen nodes will be connected. You can also use it to quickly evaluate the cohesiveness of a specific group or your overall network.

Strongly Connected Components (SCC)

SCC finds sets of connected nodes in a directed graph where each node is reachable in both directions from any other node in the same set.

A component that is strongly connected can be used to profile similar behavior or inclinations in a group for applications such as recommendation engines. Many community detection algorithms like SCC are used to find and collapse clusters into single nodes for further intercluster analysis.

Connected Components

The Connected Components algorithm (sometimes called Union Find or Weakly Connected Components) finds sets of connected nodes in an undirected graph where each node is reachable from any other node in the same set.

Make it a habit to run Connected Components to test whether a graph is connected as a preparatory step for general graph analysis.

Label Propagation

The steps often used for the Label Propagation(LPA) pull method are:

  1. Every node is initialized with a unique label (an identifier), and, optionally preliminary seed labels can be used.

  2. These labels propagate through the network.

  3. At every propagation iteration, each node updates its label to match the one with the maximum weight, which is calculated based on the weights of neighbor nodes and their relationships. Ties are broken uniformly and randomly.

  4. LPA reaches convergence when each node has the majority label of its neighbors.

Use Label Propagation in large-scale networks for initial community detection, especially when weights are available. This algorithm can be parallelized and is therefore extremely fast at graph partitioning.

Louvain Modularity

Louvain quantifies how well a node is assigned to a group by looking at the density of connections within a cluster in comparison to an average or random sample. This measure of community assignment is called modularity.

Use Louvain Modularity to find communities in vast networks. This algorithm applies a heuristic, as opposed to exact, modularity, which is computationally expensive. Louvain can therefore be used on large graphs where standard modularity algorithms may struggle.

Reference

CS224W
Neo4j Graph Algorithms

Chuanrong Li

Read more posts by this author.