Graph algorithms

Vermouth describes molecules and molecular fragments as graphs where atoms are nodes and connections between them (e.g. bonds) are edges. This allows us to use the connectivity to identify which atom is which, meaning we are no longer dependent on atom names.

Definitions

Graph

A graph \(G = (V, E)\) is a collection of nodes (\(V\)) connected by edges (\(E\)): \(e_{ij} = (v_i, v_j) \in E\). In undirected graphs \(e_{ij} = e_{ji}\). Unless we specify otherwise all graphs used in vermouth are undirected. The size of a graph is equal to the number of nodes: \(|G| = |V|\).

Subgraph

Graph \(H = (W, F)\) is a subgraph of graph \(G = (V, E)\) if:

\[\begin{split}|H| &< |G|\\ W &\subset V\\ e_{ij} &\in F \quad \forall e_{ij} \in E\\ e_{ij} &\notin F \quad \forall e_{ij} \notin E\\\end{split}\]

This means that all nodes in \(H\) are in \(G\), and that nodes are connected in \(H\) if and only if they are connected in \(G\).

Graph isomorphism

A graph isomorphism \(m\) between graphs \(H = (W, F)\) and \(G = (V, E)\) is a bijective mapping \(m: V \mapsto W\) such that the following conditions hold:

\[\begin{split}|H| &= |G|\\ m(v) &\simeq v \quad &\forall v \in V\\ (m(v_i), m(v_j)) &\simeq (v_i, v_j) \quad &: (m(v_i), m(v_j)) \in F \enspace \forall (v_i, v_j) \in E\end{split}\]

This means that every node in \(G\) maps to exactly one node in \(H\) such that all connected nodes in \(G\) are connected in \(H\). Note that labels/attributes on nodes and edges (such as element or atom name) can affect the equivalence criteria.

Subgraph isomorphism

A subgraph isomorphism is a Graph isomorphism, but without the constraint that \(|H| = |G|\). Instead, \(|H| <= |G|\) if \(H\) is subgraph isomorphic to \(G\).

Induced subgraph isomorphism

As Subgraph isomorphism with the added constraint that equivalent nodes not connected in \(G\) are not connected in \(H\):

\[(m(v_i), m(v_j)) \notin F \quad \forall (v_i, v_j) \notin E\]

We denote \(H\) being induced subgraph isomorphic to \(G\) as \(H \precsim G\).

It is important to note that a path graph is not subgraph isomorphic to the corresponding cycle graph of the same size. For example, n-propane is not subgraph isomorphic to cyclopropane!

Maximum common induced subgraph

The maximum common induced subgraph between \(G\) and \(H\) is the largest graph \(J\) such that \(J \precsim G\) and \(J \precsim H\). Commonly the answer is given as a general mapping between \(G\) and \(H\).

Isomorphism

Vermouth and martinize2 identify atoms by connectivity, generally combined with a constraint on element or atom name. We do this using either a Maximum common induced subgraph (during the Repair graph step) or a Induced subgraph isomorphism (the other steps). In all these cases we effectively find how nodes in the molecule we’re working on match with nodes in our reference graphs, such as blocks.

During the Repair graph step there are two, related, complications: 1) we need a “best” overlay, where as many atom names match as possible; and 2) There can be very many (equivalent) possible overlays/isomorphisms. Let’s address the second concern first. As example we’ll look at the automorphisms (= self-isomorphism, i.e. how does a graph fit on itself) of propane (CH3-CH2-CH3).

There are 2 isomorphisms for the carbons: \(C_\alpha-C_\beta-C_\gamma \mapsto C_\alpha-C_\beta-C_\gamma\) and \(C_\alpha-C_\beta-C_\gamma \mapsto C_\gamma-C_\beta-C_\alpha\). Similarly, there are 2 isomorphisms for the central methylene group: \(H_1-C_\beta-H_2 \mapsto H_1-C_\beta-H_2\) and \(H_1-C_\beta-H_2 \mapsto H_2-C_\beta-H_1\). Each terminal methyl group however, has 6 unique isomorphisms!

\[H_1H_2H_3 \mapsto (H_1H_2H_3, H_1H_3H_2, H_2H_1H_3, H_3H_1H_2, H_2H_3H_1, H_3H_2H_1)\]

This means that in total, propane, a molecule consisting of 11 atoms, has \(2 (carbons) \times 2 (methylene) \times 6 (methyl) \times 6 (methyl) = 144\) automorphisms! Now imagine how this scales for a lipid. Clearly this spirals out of control very quickly, and it is generally unfeasible to generate all possible isomorphisms [1].

Luckily for us however, we’re not interested in finding all these isomorphisms, since we can consider most of these to be equivalent. For our use case it doesn’t matter whether \(H_1\) maps to \(H_1\) or \(H_2\) so long as \(H_1\) and \(H_2\) are equivalent. There is one catch however: we need to find the isomorphism where most atom names match. We can achieve this by preferentially using nodes with a lower index [2] when given a choice between symmetry equivalent nodes. The [ISMAGS] algorithm does exactly this: it calculates symmetry unique isomorphisms preferentially using nodes with a smaller index.

Note that this problem only comes up when your graphs are (very) symmetric. In all other steps we constrain the isomorphism such that nodes are only considered equal if their atom names match. Since atom names are generally unique, this means that this problem is sidestepped completely. The only place where we cannot do this is during the Repair graph step, since at that point we cannot assume that the atoms names in our molecule are correct.

[1]This problem gets even worse when trying to find the Maximum common induced subgraph.
[2]In other words, we impose an ordering on the nodes in the graph. We do this by ordering the nodes based on whether there is a node with a corresponding atom name in the reference and subsequently sorting by atom name.
[ISMAGS]
  1. Houbraken, S. Demeyer, T. Michoel, P. Audenaert, D. Colle, M. Pickavet, The Index-Based Subgraph Matching Algorithm with General Symmetries (ISMAGS): Exploiting Symmetry for Faster Subgraph Enumeration, PLoS One. 9 (2014) e97896. doi:10.1371/journal.pone.0097896.