Martinize 2 workflow

Pipeline

Martinize 2 is the main command line interface entry point for vermouth. It effectively consists of 6 stages:

  1. reading input files

  2. repairing the input molecule

  3. mapping the input molecule to the desired output force field and resolution

  4. applying Links to generate inter-residue interactions

  5. post-processing, such as building an elastic network

  6. writing output files

We’ll describe each stage in more detail here. It is good to bear in mind however that in all stages the recognition/identification of atoms/particles is based on their connectivity in addition to any atom properties.

Throughout this document, when we refer to an ‘edge’ we mean a connection between two nodes in a graph. With ‘bonds’ we mean a chemical connection including the corresponding simulation parameters. Similarly, with ‘molecule’ we mean a connected graph consisting of atoms and edges. Note that this is not necessarily the same as a protein chain, since these could be connected through e.g. a disulphide bridge.

If martinize2 at some point encounters a situation that might result in an incorrect topology it will issue a warning, and refuse to write output files so that you are forced to examine the situation, but also see the -maxwarn CLI option. The options -v and -vv can be used to print more debug output, while the options -write-graph, -write-repair and -write-canon can be used to write out the system after Make bonds, Repair graph and Identify modifications, respectively. All of these can help you track down what’s going wrong where.

1) Read input files

Martinize2 can currently read input structures from .gro and .pdb files. .pdb files are preferred however, since they contain more information, such as chain identifiers, and TER and CONECT records.

Reading PDB files

Reading PDB files is done by PDBInput. We take into account the following PDB records: MODEL and ENDMDL to determine which model to parse; ATOM and HETATM; TER, which can be used to separate molecules; CONECT, which is used to add edges; and END.

We issue a pdb-alternate warning if any atoms in the PDB file have an alternate conformation that is not ‘A’, since those will always be ignored.

Relevant CLI options: -f; -model; -ignore; -ignh.

Make bonds

Since atom identification is governed by their connectivity we need to generate bonds in the input structure. Where possible we get them from the input file such as PDB CONECT records. Beyond that, edges are added by MakeBonds. By default edges will be added based on atom names and distances, but this behaviour can be changed via the CLI option -bonds-from.

To add edges based on atom names the Block from the input force field is used as reference for every residue in the input structure where possible. This is not possible when a residue contains multiple atoms with the same name, nor when there is no Block corresponding to the residue [1]. Note that this will only ever create edges within residues.

Edges will be added based on distance when they are close enough together, except for a few exceptions (see below). Atoms will be considered close enough based on their element (taken from either the PDB file directly, or deduced from atom name [2]). The distance threshold is multiplied by -bonds-fudge to allow for conformations that are slightly out-of-equilibrium. Edges will not be added from distances in two cases: 1) if edges could be added based on atom names no edges will be added between atoms that are not bonded in the reference Block. 2) If the edge would connect 2 residues, and at least one of the atoms involved is a hydrogen atom. Edges added based on distance are logged as debug output.

If your input structure is far from equilibrium and adding edges based on distance is likely to produce erroneous results, make sure to provide CONECT records describing at least the edges between residues, and between atoms involved in modifications, such as termini and PTMs.

We issue a general warning when it is requested to add edges based on atom names, but this cannot be done for some reason. This commonly happens when your input structure is a homo multimer without TER record and identical residue numbers and chain identifiers across the monomers. In this case martinize2 cannot distinguish the atom “N”, residue ALA1, chain “A” from the atom “N”, residue ALA1, chain “A” in the next monomer. The easiest solution in this case is to place strategic TER records in your PDB file.

Relevant CLI options: -bond-from; -bonds-fudge

Annotate mutations and modifications

As a last step martinize2 allows you to make some changes to your input structure from the CLI, for example to perform point mutations, or to apply PTMs and termini. This is done in part by AnnotateMutMod, and completed by Repair graph.

The -mutate option can be used to change the residue name of one or more residues. For example, you can specify -mutate PHE42:ALA to mutate all residues with residue name “PHE” and residue number 42 to “ALA”. Or change all “HSE” residues to “HIS”: -mutate HSE:HIS. Modifications can be specified in a similar way.

The specifications nter and cter can be used to quickly refer to all N- and C-terminal residues respectively [3]. In addition, the CLI options -nter and -cter can be used to change the N- and C-termini. By default martinize2 will try to apply charged protein termini (‘N-ter’ and ‘C-ter’). If this is not what you want, for example because your molecule is not a protein, be sure to provide the appropriate -nter and -cter options. You can specify the modification none to specify that a residue should not have any modifications. Note that if you use this for the termini you may end up with chemically invalid, uncapped, termini.

Relevant CLI options: -mutate, -modify, -nter, -cter, -nt

2) Repair the input graph

Depending on the origin of your input structure, there may be atoms missing, or atoms may have non-standard names. In addition, some residues may include modifications such as PTMs.

Repair graph

The first step is to complete the graph so that it contains all atoms described by the reference Block, and so that all atoms have the correct names. These blocks are taken from the input force field based on residue names (taking any mutations and modifications into account). RepairGraph takes care of all this.

To identify atoms in a residue we consider the Maximum common induced subgraph between the residue and its reference since the residue can be both too small (atoms missing in the input) and too large (atoms from PTMs) at the same time. Unfortunately, this is a very expensive operation which scales exponentially with the size of the residue. So if you know beforehand that your structure contains (very) large PTMs, such as lipidations, consider specifying those as separate residues.

The maximum common induced subgraph is found using ISMAGS, where nodes are considered equal if their elements are equal. Beforehand, the atoms in the residue will be sorted such that the isomorphism where most atom names correspond with the reference is found. This sorting also speeds up the calculation significantly, so if you’re working with a system containing large residues consider correcting some of the atom names.

We issue an unknown-residue warning if no Block can be retrieved for a given residue name. In this case the entire molecule will be removed from the system.

Identify modifications

Secondly, all modifications are identified. Repair graph also tags all atoms it did not recognise, and those are processed by CanonicalizeModifications.

Modifications are identified by finding the solution where all tagged atoms are covered by the atoms of exactly one Modification, where the modification must be an induced subgraph of the molecule. Every modification must contain at least one “anchoring” atom, which is an atom that is also described by a Block. Unknown atoms are considered to be equal if their element is equal; anchor atoms are considered equal if their atom name is equal. Because modifications must be induced subgraphs of the input structure there can be no missing atoms!

After this step all atoms will have correct atom names, and any residues that include modifications will be labelled. This information is later used during the resolution transformation

An unknown-input warning will be issued if a modification cannot be identified. In this case the atoms involved will be removed from the system.

Rebuild coordinates for missing atoms

Currently martinize2 is not capable of rebuilding coordinates for missing atoms.

3) Resolution transformation

The resolution transformation is done by DoMapping. This processor will produce your molecules at the target resolution, based on the available mappings. These mappings are read from the .map and .mapping files available in the library [4]. See also File formats. In essence these mappings describe how molecular fragments (nodes and edges) correspond to a block in the target force field. We find all the ways these mappings can fit onto the input molecule, and add the corresponding blocks and modifications to the resulting molecule.

For a molecular fragment to match the input molecule the atom and residue names need to match [5]. This is why we first repair the input molecule so that you only need to consider the canonical atom names when adding mappings. Mappings defined by .mapping files can also cross residue boundaries (where specified).

Edges and interactions within the blocks will come from the target force field. Edges between the blocks will be generated based on the connectivity of the input molecule, i.e. if atoms A and B are connected in the input molecule, the particles they map to in the output force field will also be connected. Interactions across separate blocks will be added in the next step.

The processor will do some sanity checking on the resulting molecule, and issue an unmapped-atom warning if there are atoms in the input molecule for which no mapping can be found. In addition, this warning will also be issued if there are any non-hydrogen atoms that are not mapped to the output molecule. A more serious inconsistent-data warning will be issued for the following cases:

  • there are multiple modification mappings, which overlap

  • there are multiple block mappings, which overlap

  • there is an output particle that is constructed from multiple input atoms, and some “residue level” attributes (such as residue name and number) are not consistent between the constructing atoms.

  • there is an atom which maps to multiple particles in the output, but these particles are disconnected

  • there is an interaction that is being set by multiple mappings

Relevant CLI options: -ff, -map-dir

5) Post processing

There can be any number of post processing steps. For example to add an elastic network, or to generate Go virtual sites. We will not describe their function here in detail. Instead, see for example ApplyRubberBand and VirtualSiteCreator.

Relevant CLI options: -elastic, -ef, -el, -eu, -ermd, -ea, -ep, -em, -eb, -eunit, -go, -go-eps, -go-moltype, -go-low, -go-up, -go-res-dist

6) Write output

Finally, the topology and conformation are written to files (if no warnings were encountered along the way). Currently martinize2 and VerMoUTH can only write Gromacs itp files. Martinize2 will write a separate itp file for every unique molecule in the system.

Relevant CLI options: -x, -o, -sep, -merge