## Graph Basics

A graph is defined as a collection of objects where some pairs of objects are connected by links.

Undirected graphs have symmetrical/reciprocal links, directed graphs have directed links.

Complete Graph: an undirected graph with the maximum number of edges (such that all pairs of nodes are connected).

Graph attribute | Key factor |
---|---|

Connected versus disconnected | Whether there is a path between any two nodes in the graph, irrespective of distance |

Weighted versus unweighted | Whether there are (domain-specific) values on relationships or nodes |

Directed versus undirected | Whether or not relationships explicitly define a start and end node |

Cyclic versus acyclic | Whether paths start and end at the same node |

Sparse versus dense | Relationship to node ratio |

Monopartite, bipartite, and k-partite | Whether nodes connect to only one other node type (e.g., users like movies) or many other node types (e.g., users like users who like movies) |

**Erdös-Rényi Random Graph Model**

The Erdös-Rényi Random Graph Model is the simplest model of graphs.
This random graph model comes in two variants:

- \(G_{np}\): undirected graph on \(n\) nodes where each \(edge(u, v)\) appears IID with probability \(p\)
- \(G_{nm}\): undirected graph with \(n\) nodes, and \(m\) edges picked uniformly at random

The degree distribution of \(G_{np}\) is binomial, let \(P(k)\) denotes the fraction of nodes with degree \(k\):

$$ P(k) = \binom{n-1}{k} p^k (1-p)^{n-1-k} $$

**Small World Random Graph Model**

It’s a model for constructing a family of networks with both high clustering and short average path length.

To create such a model, we employ the following steps:

- Start with low-dimensional regular attic (ring) by connecting each node to \(k\) neighbors on its right and \(k\) neighbors on its left, with \(k \geq 2\)
- Rewire each edge with probability \(p\) by moving its endpoint to a randomly chosen node

**Kronecker Random Graph Model**

The Kronecker Graph Model is a recursive graph generation model that combines both mathematical tractability and realistic static and temporal network properties. The intuition underlying the Kronecker Graph Model is self-similarity, where the whole has the same shape as one or more of its parts.

## Pathfinding and Graph Search Algorithms

Pathfinding algorithms build on top of graph search algorithms and explore routes between nodes, starting at one node and traversing through relationships until the destination has been reached. Here’s a summary of general graph search algorithms:

Algorithm type | What it does | Example use |
---|---|---|

Breadth First Search | Traverses a tree structure by fanning out to explore the nearest neighbors and then their sublevel neighbors | Locating neighbor nodes in GPS systems to identify nearby places of interest |

Depth First Search | Traverses a tree structure by exploring as far as possible down each branch before backtracking | Discovering an optimal solution path in gaming simulations with hierarchical choices |

Shortest Path (Variations: A*, Yen’s) | Calculates the shortest path between a pair of nodes | Finding driving directions between two locations |

All Pairs Shortest Path | Calculates the shortest path between all pairs of nodes in the graph | Evaluating alternate routes around a traffic jam |

Single Source Shortest Path | Calculates the shorest path between a single root node and all other nodes | Least cost routing of phone calls |

Minimum Spanning Tree | Calculates the path in a connected tree with the smallest cost for visiting all nodes | Optimizing connected structure routing, such as laying cable or garbage collection |

Random Walk | Returns a list of nodes along a path of specified size by randomly choosing relationships to traverse | Augmenting training for machine learning or data for graph algorithms |

## Centrality Algorithms

Centrality algorithms are used to understand the roles of particular nodes in a graph and their impact on that network. They can help us identify the most important nodes in the graph.

### Overview

Algorithm type | What it does | Example use |
---|---|---|

Degree Centrality | Measures the number of relationships a node has | Estimating a person’s popularity by looking at their in-degree and using their out-degree to estimate gregariousness |

Closeness Centrality | Calculates which nodes have the shortest paths to all other nodes | Finding the optimal location of new public services for maximum accessibility |

Betweenness Centrality | Measures the number of shortest paths that pass through a node | Improving drug targeting by finding the control genes for specific diseases |

PageRank | Estimates a current node’s importance from its linked neighbors and their neighbors | Finding the most influential features for extraction in machine learning and ranking text for entity relevance in natural language processing |

### Degree Centrality

This is the simplest graph algorithm, it counts the number of incoming and outgoing relationships from a node, and is used to find popular nodes in a graph.

Use Degree Centrality to analyze influence by looking at the number of incoming and outgoing relationships, or find the popularity of individual nodes.

### Closeness Centrality

Closeness Centrality is a way of detecting nodes that are able to spread information efficiently through a subgraph. The closeness centrality of a node is calculated by:

$$ C(u) = \frac{1}{\sum_{v=1}^{n-1}d(u, v)} $$

where:

- \(u\) is a node
- \(n\) is the number of nodes in the graph
- \(d(u,v)\) is the shortest-path distance between another node \(v\) and \(u\)

Apply Closeness Centrality to know which nodes disseminate things the fastest.

### Betweenness Centrality

Betweenness Centrality is a way of detecting the amount of influence a node has over the flow of information or resources in a graph.

$$ B(u) = \sum_{s \neq u \neq t} \frac{p(u)}{p} $$

where:

- \(u\) is a node
- \(p\) is the total number of shortest paths between nodes \(s\) and \(t\)
- \(p(u)\) is the number of shortest paths between nodes \(s\) and \(t\) that pass through node \(u\)

Betweenness Centrality applies to a wide range of problems in real world networks. We use it to find bottlenecks, control points, and vulnerabilities.

### PageRank

PageRank is the best known of the centrality algorithms. It measures the transitive (or directional) influence of nodes. All the other centrality algorithms we discuss measure the direct influence of a node, whereas PageRank considers the influence of a node’s neighbors, and their neighbors.

$$ PR(u) = (1-d)+ d(\frac{PR(T_1)}{C(T_1)} + … + \frac{PR(T_n)}{C(T_n)}) $$

where:

- we assume that a page u has citations from pages \(T_1\) to \(T_n\)
- \(d\) is a damping factor which is set between 0 and 1. It is usually set to 0.85
- \(1-d\) is the probability that a node is reached directly without following any relationships
- \(C(T_n)\) is defined as the out-degree of a node \(T\)

PageRank is now used in many domains outside web indexing. Use this algorithm whenever you’re looking for broad influence over a network. For instance, if you’re looking to target a gene that has the highest overall impact to a biological function, it may not be the most connected one. It may, in fact, be the gene with the most relationships with other, more significant functions.

## Community Detection Algorithms

Community formation is common in all types of networks, and identifying them is essential for evaluating group behavior and emergent phenomena.

### Overview

Algorithm type | What it does | Example use |
---|---|---|

Triangle Count and Clustering Coefficient | Measures how many nodes form triangles and the degree to which nodes tend to cluster together | Estimating group stability and whether the network might exhibit “small-world” behaviors seen in graphs with tightly knit clusters |

Strongly Connected Components | Finds groups where each node is reachable from every other node in that same group following the direction of relationships | Making product recommendations based on group affiliation or similar items |

Connected Components | Finds groups where each node is reachable from every other node in that same group, regardless of the direction of relationships | Performing fast grouping for other algorithms and identify islands |

Label Propagation | Infers clusters by spreading labels based on neighborhood majorities | Understanding consensus in social communities or finding dangerous combinations of possible co-prescribed drugs |

Louvain Modularity | Maximizes the presumed accuracy of groupings by comparing relationship weights and densities to a defined estimate or average | In fraud analysis, evaluating whether a group has just a few discrete bad behaviors or is acting as a fraud ring |

### Triangle Count and Clustering Coefficient

Triangle Count determines the number of triangles passing through each node in the graph.

A triangle is a set of three nodes, where each node has a relationship to all other nodes.

The goal of the Clustering Coefficient algorithm is to measure how tightly a group is clustered compared to how tightly it could be clustered. The algorithm uses Triangle Count in its calculations, which provides a ratio of existing triangles to possible relationships. A maximum value of 1 indicates a clique where every node is connected to every other node.

**Local Clustering Coefficient**

The local clustering coefficient of a node is the likelihood that its neighbors are also connected. The computation of this score involves triangle counting.

$$ CC(u) = \frac{2R_{u}}{k_{u}(k_{u}-1)} $$

where:

- \(u\) is a node
- \(R(u)\) is the number of relationships through the neighbors of \(u\)
- \(k(u)\) is the degree of \(u\)

**Global Clustering Coefficient**

The global clustering coefficient is the normalized sum of the local clustering coefficients.

Clustering Coefficient can provide the probability that randomly chosen nodes will be connected. You can also use it to quickly evaluate the cohesiveness of a specific group or your overall network.

### Strongly Connected Components (SCC)

SCC finds sets of connected nodes in a directed graph where each node is reachable in both directions from any other node in the same set.

A component that is strongly connected can be used to profile similar behavior or inclinations in a group for applications such as recommendation engines. Many community detection algorithms like SCC are used to find and collapse clusters into single nodes for further intercluster analysis.

### Connected Components

The Connected Components algorithm (sometimes called Union Find or Weakly Connected Components) finds sets of connected nodes in an undirected graph where each node is reachable from any other node in the same set.

Make it a habit to run Connected Components to test whether a graph is connected as a preparatory step for general graph analysis.

### Label Propagation

The steps often used for the Label Propagation(LPA) pull method are:

Every node is initialized with a unique label (an identifier), and, optionally preliminary seed labels can be used.

These labels propagate through the network.

At every propagation iteration, each node updates its label to match the one with the maximum weight, which is calculated based on the weights of neighbor nodes and their relationships. Ties are broken uniformly and randomly.

LPA reaches convergence when each node has the majority label of its neighbors.

Use Label Propagation in large-scale networks for initial community detection, especially when weights are available. This algorithm can be parallelized and is therefore extremely fast at graph partitioning.

### Louvain Modularity

Louvain quantifies how well a node is assigned to a group by looking at the density of connections within a cluster in comparison to an average or random sample. This measure of community assignment is called modularity.

Use Louvain Modularity to find communities in vast networks. This algorithm applies a heuristic, as opposed to exact, modularity, which is computationally expensive. Louvain can therefore be used on large graphs where standard modularity algorithms may struggle.