Substructure Match Commands

Substructure matching is a complex functionality. While a limited number of object commands are supplied (ens match, mol match, etc.), comprehensive match functionality is accessible via special commands. The match command implements various structure matching commands.

match ss

The match ss command matches substructures. Its syntax scheme is

match ss ?-option value?... ss_spec st_spec ?atommapvar? ?bondmapvar? ?molmapvar?

Structure and substructure may both be independently specified in several different formats:

match ss $ss_handle $st_handle
match ss $ss_handle [list $st_handle 1]
match ss c1ccccc1 $st_handle

The return value of the command is the number of successful matches. For simple match modes, which return only a match/nomatch result, this is 0 or 1, but modes which can produce multiple matches may return higher counts.

The final three optional parameters are names of variables which receive atom, bond and molecule mapping information. If these parameters are not supplied, or a variable name is spelled as an empty string, no variable is created or modified. If a variable is specified, but no match is found, the map variables are set to empty strings.

For match modes which can only return a single match, these map variables are simple nested lists. Each list element contains a substructure and a structure object label, in this order. The number of elements in the result list corresponds to the number of substructure objects. Example:

match ss CN CCN amap bmap mmap

The variable amap is set to “{1 2} {2 3}”, which is the first match of the C-N substructure fragment on the ethylamine structure. The numbers are atom labels - in case of SMILES strings, atom labels are assigned in the order the atoms appear in the string. The bmap variable is set to “{1 2}”, since only a single bond is involved. The first bond of the substructure matches the second bond of the structure. Finally, the mmap variable is set to “{1 1}”, because both substructure and structure contain only a single molecule, which was assigned the default label 1. The bond and mol map results are still nested lists, even if they appear in this simple example as plain lists.

There is no guarantee that the lowest possible labels are used for a simple match - the match algorithm uses internal optimizations for choosing good start atoms for matches. Matches should not be expected to start with the atom with the lowest label. Match result variables are filled in the order of the objects in the internal object lists, which is also not necessarily an ascending label sequence.

These nested result lists can easily be transformed to a Tcl array with a statement like

array set array_amap [join $amap]

The array variable array_amap now contains elements which are named with the substructure labels, and have values which correspond to the structure labels. The unzip command is also useful to isolate the substructure or structure label sets.
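As a minimal sketch in plain Tcl, the following takes the single-match atom map from the example above as a literal value (so no toolkit command is needed to run it), builds the array, and separates the two label columns with an ordinary loop. The variable names ss_labels and st_labels are chosen here for illustration only.

```tcl
# Atom map as reported for: match ss CN CCN amap bmap mmap
set amap {{1 2} {2 3}}

# Flatten the pair list into key/value form and load it into an array
array set array_amap [join $amap]

# Split the pairs into a substructure and a structure label list
set ss_labels {}
set st_labels {}
foreach pair $amap {
    lassign $pair ss st
    lappend ss_labels $ss
    lappend st_labels $st
}

puts "substructure atom 1 maps to structure atom $array_amap(1)"
puts "substructure labels: $ss_labels, structure labels: $st_labels"
```

The loop is only a stand-in for what the unzip command is described as doing. It may still be handy when the labels need further filtering while they are being separated.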

In case a match mode is invoked which can return more than one match, the map variables are constructed with an additional nesting level. They are a list, where each element describes one match. Each of these elements for a specific match is formatted as in above description of simple match results. Note that the actual number of reported matches does not influence the scheme - if there is a theoretical possibility that more than one match can be found, the maximum nesting level is 3, not 2, even if only a single match is finally found.

Example:

set nmatches [match ss -mode distinct CC CCC amap]

Here, the match count is 2 (the distinct mode reports matches which differ by at least one structure atom from any previous match - the all mode reports 4 matches, which include reversals of the CC fragment), and the amap variable is set to “{{1 1} {2 2}} {{1 2} {2 3}}”. The first match maps substructure atom 1 on structure atom 1, and 2 on 2. The second match maps substructure atom 1 on structure atom 2, and substructure atom 2 on structure atom 3.
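A multi-match map can be walked one match at a time, since every element has the single-match layout. The sketch below is plain Tcl and uses the amap value quoted above as a literal. The variable names matchno and st_sets are illustrative and not part of the toolkit.

```tcl
# Atom map as reported above for the two distinct matches of CC on CCC
set amap {{{1 1} {2 2}} {{1 2} {2 3}}}

set matchno 0
set st_sets {}
foreach onematch $amap {
    incr matchno
    # Each element has the single-match layout, so it loads the same way
    array unset m
    array set m [join $onematch]
    # Collect the structure atoms in substructure label order
    set st_atoms {}
    foreach ss [lsort -integer [array names m]] {
        lappend st_atoms $m($ss)
    }
    lappend st_sets $st_atoms
    puts "match $matchno: structure atoms $st_atoms"
}
```

Because the nesting depth depends on the match mode and not on the number of matches found, the same loop also works when such a mode reports only a single match.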

The match ss command has a large number of options, which can be used to fine-tune the matching process. Any number of options, in any order, may be inserted before the substructure specification. This is the list of options:

-align

-align none/rotate/redraw/xaxis/yaxis/diagonal/combined

If a match was successful, change the layout of the structure by modifying the A_XY atomic 2D coordinates property.

Mode none, which is the default, does not perform any 2D coordinate changes.

In modes xaxis, yaxis and diagonal, the coordinates of the matched structure atoms are extracted and the largest principal component/eigenvector of these is computed. The structure is then rotated in such a way that this eigenvector is aligned to the x-axis, y-axis, or diagonal (lower left to upper right). No coordinates of the substructure atoms are used.

In rotate mode, the structure is rotated in steps of 15 degrees, with and without a flip. The orientation which is in best alignment with the coordinates of the matched substructure atoms is retained.

In redraw mode, the structure is completely redrawn, using the coordinates of the matching substructure atoms as starting point. The rest of the structure is drawn around it. The matched structure atoms possess the same coordinates as the matching substructure atoms.

There are some limitations in this mode, which are automatically enforced by setting the corresponding match control flags. First, it is not possible to match partial ring systems. A substructure ring atom must match the same class of ring system, i.e. a substructure 6-membered ring fragment only matches a structure benzene or cyclohexane ring, but not naphthalene, adamantane, etc. This limitation is deeply rooted in the 2D layout generator, which treats ring systems differently from the acyclic connections. Acyclic substructure atoms can only match acyclic structure atoms, with the exception that a terminal acyclic substructure atom may still match a structure ring atom.

The final mode besteffort combines the redraw and rotate modes - if a match in mode redraw fails, the match attempt is automatically repeated in mode rotate, which has relaxed match conditions with respect to ring system checks.

-allowmissingstereo

-allowmissingstereo none/atoms/bonds/both

If not set to none, the default, a substructure atom or bond center with defined stereochemistry may match a stereogenic structure center which does not possess a non-zero stereo descriptor in A_LABEL_STEREO or B_LABEL_STEREO. If there is a structure-side stereo descriptor on the matched center, the normal stereo match process applies (i.e. absolute or relative stereo matching). This is a global option which applies to the complete substructure pattern. There are also atom- and bond-specific bits in A_QUERY and B_QUERY to control this feature on a local level in the pattern.

-anchor

-anchor nested_anchor_atom_list

This option defines restrictions on which substructure and structure atoms must match. The argument is a nested list, where each outer list element is a list of two elements. Each of the inner list elements is either an atom label, an empty string, or the word any. The latter two options are equivalent. The first item identifies a substructure atom, the second a structure atom. If any or an empty identifier is used, it acts as a wildcard. If two atom labels are used, the two atoms must map onto each other in all reported matches. If these atoms are incompatible, no matches are found. If a wildcard is used, it means that the other atom must be part of the match, but without the need to match any specific counter atom. If fuzzy matching is used, this can make sense even on the substructure side. The use of a pair of wildcards is not illegal, but has no effect.

Example:

match ss -anchor {{1 2} {any 3}} $sshandle $ehandle

This sample line forces the match of substructure atom 1 onto structure atom 2, and the inclusion of structure atom 3 in the match, if one exists given the query features in the substructure specification plus the anchor constraints.
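When the constraints are only known at run time, the anchor list can be assembled with ordinary Tcl list commands. The following sketch assumes a hypothetical requirement (one pinned atom pair plus two structure atoms that must take part in the match). Only the list construction is executed, and the match call is shown as a comment because it needs valid handles.

```tcl
# Hypothetical requirement: pin substructure atom 1 to structure atom 2,
# and require structure atoms 3 and 5 to take part in the match
set anchors [list [list 1 2]]
foreach st_label {3 5} {
    # A wildcard on the substructure side: any substructure atom may match here
    lappend anchors [list any $st_label]
}
puts $anchors

# With the toolkit loaded and valid handles, the list would be passed on as:
# match ss -anchor $anchors $sshandle $ehandle
```

Building the value with list and lappend avoids quoting mistakes that can creep in when the nested list is pieced together as a string.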

-atomhighlight

-atomhighlight none/structure/substructure/both

If this flag is set, all matched atoms in the structure (modes structure or both, or numeric encodings 1 or 3) or substructure (modes substructure or both, or the equivalent numeric encodings 2 or 3) have the highlight flag set in property A_FLAGS. In case multiple matches are generated, the result depends on the -multihighlight option setting. By default, only the first match is highlighted, but highlighting the union of all found matches is also possible. This option does not reset existing atom highlight flags - see the -clearatomhighlight option for this functionality. By default this function is disabled (equivalent to mode none or 0).

-atommapproperty

-atommapproperty none/structure/substructure/both

If this flag is set, for each match a new instance of property A_SSMATCH is attached to the structure (in modes structure or both , or numeric encodings 1 or 3) ensemble, or a new instance of property A_STMATCH to the substructure (in modes substructure or both , or the equivalent numeric encodings 2 or 3) ensemble - the first match is recorded in A_SSMATCH or A_STMATCH , the second in A_SSMATCH/2 or A_STMATCH/2 , and so on. If instances of this property are already set on the structure or substructure ensembles, the new instances start with the highest existing instance number plus one. Structure or substructure atoms which are not used in a match have their respective A_SSMATCH or A_STMATCH data set to 0. Matched structure atoms are marked with the atom label of the matching substructure atom, and matched substructure atoms with the atom label of the matching structure atom. By default, this flag is not active (equivalent to mode none or 0).

-bondhighlight

-bondhighlight none/structure/substructure/both

If this flag is set, all matched bonds in the structure (modes structure or both, or numeric encodings 1 or 3) or substructure (modes substructure or both, or the equivalent numeric encodings 2 or 3) have the highlight flag set in property B_FLAGS. In case multiple matches are generated, the result depends on the -multihighlight option setting. By default, only the first match is highlighted, but highlighting the union of all found matches is also possible. This option does not reset existing bond highlight flags - see the -clearbondhighlight option for this functionality. By default this function is disabled (equivalent to mode none or 0).

-bondmapproperty

-bondmapproperty none/structure/substructure/both

If this flag is set, for each match a new instance of property B_SSMATCH is attached to the structure (in modes structure or both , or numeric encodings 1 or 3) ensemble, or a new instance of property B_STMATCH to the substructure (in modes substructure or both , or the equivalent numeric encodings 2 or 3) ensemble - the first match is recorded in B_SSMATCH or B_STMATCH , the second in B_SSMATCH/2 or B_STMATCH/2 , and so on. If instances of this property are already set on the structure or substructure ensembles, the new instances start with the highest existing instance number plus one. Structure or substructure bonds which are not used in a match have their respective B_SSMATCH or B_STMATCH data set to 0. Matched structure bonds are marked with the bond label of the matching substructure bond, and matched substructure bonds with the bond label of the matching structure bond. By default, this flag is not active (equivalent to mode none or 0).

-bondorder

-bondorder 0/1/2

This flag determines whether the bond orders of substructure and structure bonds outside aromatic systems are used for determining a match. By default this flag is set, but it may be disabled with this option. This option affects only the basic bond match. Bond match query expressions which explicitly or implicitly refer to property B_ORDER always use their comparison results to determine matches. In the rarely used mode 2, bond order matching is only used for terminal structure bonds (i.e. those which contain an atom which participates only in a single bond).

-burn

-burn 0/1

If this flag is set, all matched structure atoms are excluded from any further match during the execution of the current command. Effectively, the matched structure atoms are added to the structure exclusion list (see the -exclude_st option). This is a rather exotic option for special-purpose applications, which has an effect only in match modes which generate more than a single match. By default this flag is not set.

-chain

-chain 0/1

If set, this flag allows additional matches after the first match only if these matches are chained to a previous match, i.e. they do not overlap with any previous match, but a normal or complex bond exists between at least one structure atom of the new match and a structure atom of a previous match. In more complex cases, the results of this command variant can depend on the atom order. For example, in case of a structure which contains a left part A and a right part AA linked by some construct, matching with substructure fragment A returns a single hit if the left part is matched first, but two fragment matches if the right part is matched first. However, within a single chain of building blocks in the structure it does not matter where the first match occurs - the chain fragment is recursively appended in all directions and ultimately covers all linked blocks. The chain does not need to be linear - rings or star topologies can be matched, too. Obviously, this option has no effect in match mode first, because specific results are only generated when more than a single match is sought.

-charge

-charge 0/1

This flag determines whether atomic formal charges on the substructure and structure atoms are used for determining the possibility of an atom match. By default, formal charges are ignored. This option only affects the standard match attributes. Atom query expressions which explicitly refer to property A_FORMAL_CHARGE always use their comparison result to determine matches.

-clearatomhighlight

-clearatomhighlight 0/1

If this flag is set, all highlight bits in property A_FLAGS are reset on the structure (and possibly the substructure) ensemble before the first match is processed. By default, this flag is not set and any existing A_FLAGS highlight bit pattern remains unchanged. Because the reset is performed in the routine where the highlight bits are set, this option is effective only in combination with the -atomhighlight option. The decision whether to reset the flags on the structure or substructure side, or both sides, follows the setting of the -atomhighlight mode.

-clearbondhighlight

-clearbondhighlight 0/1

If this flag is set, all highlight bits in property B_FLAGS are reset on the structure (and possibly the substructure) ensemble before the first match is processed. By default, this flag is not set and any existing B_FLAGS highlight bit pattern remains unchanged. Because the reset is performed in the routine where the highlight bits are set, this option is effective only in combination with the -bondhighlight option. The decision whether to reset the flags on the structure or substructure side, or both sides, follows the setting of the -bondhighlight mode.

-cmpflags

-cmpflags flags

This option provides direct access to the full set of flags which modify the substructure match process. The more common flags can be set or unset with specific options of this command for convenience. The default flag set is bondorder|useatomtree|usebondtree.

The specified flag set can either override the default flags (if specified as a simple attribute list), be added to them (if prefixed with a ’+’), be removed from them (if prefixed with a ’-’), or be toggled (if prefixed with a ’^’).

These are generally useful flags recognized:

-command

-command tcl_command

Define a Tcl callback function which is called when a new match is found and all property-based constraints have been checked. This function is called with four parameters. The first two parameters are the handles of the substructure and structure ensembles. The third parameter is a nested list of label pairs (substructure atom label/structure atom label) for all substructure atoms which are currently matched to a structure atom. The fourth parameter is a nested list of label pairs (substructure bond label/structure bond label) for all substructure bonds which are matched to a structure bond. The format of these arguments is the same as that of the map variables of the match command for single-match modes. Within the callback functions, the match can be further evaluated in ways not possible by the standard match options.

If the function returns 0 or any non-numeric value, or throws an error, the post-processing of completed matches, such as atom or bond highlighting, is not executed and the match is discarded.

While the callback routine is free to perform any additional match analysis, it must neither delete the structure or substructure, nor change its connectivity (remove or add atoms and bonds), nor discard or invalidate any property data used in the matching process. The computation or setting of additional property data on the substructure or structure ensembles is allowed.

By default, or in case an empty string is passed as callback procedure name, no callback is executed.

Example:

proc my_match_check {ens_ss ens_st amap bmap} {
    puts $amap
    return 1
}
match ss -command my_match_check CC CC

This example outputs “{1 1} {2 2}”, which is the atom mapping of the match found.
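A callback can also veto matches by returning 0. The sketch below uses a hypothetical filter, reject_atom_one, which discards every match that involves structure atom 1. The procedure is plain Tcl and is exercised here directly with map values in the single-match format. Registering it with -command is shown only as a comment, since that step requires the toolkit.

```tcl
# Hypothetical filter: reject every match that touches structure atom 1
proc reject_atom_one {ens_ss ens_st amap bmap} {
    foreach pair $amap {
        # Each pair is {substructure_label structure_label}
        if {[lindex $pair 1] == 1} {
            return 0    ;# the match is discarded
        }
    }
    return 1            ;# the match is accepted
}

# Direct calls with placeholder handle names and literal maps
set r1 [reject_atom_one ss st {{1 1} {2 2}} {{1 1}}]
set r2 [reject_atom_one ss st {{1 2} {2 3}} {{1 2}}]
puts "first map accepted: $r1, second map accepted: $r2"

# With the toolkit loaded, the filter would be registered like this:
# match ss -mode all -command reject_atom_one CC CCC
```

The callback only reads its arguments, which keeps it within the rule that the structure, the substructure and the property data used for matching must not be altered.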

-creategroup

-creategroup 0/1

If this flag is set, every match creates a new group minor object on the structure ensemble. The atoms in the group are all those structure atoms which were matched by the substructure. The group name (property G_NAME ) is set to the name of the substructure (property E_NAME ). By default, no groups are generated as side effects of a match.

-daylightaro

-daylightaro 0/1

If the flag is set, the use of Daylight aromaticity in the matching is enforced both on the structure and substructure side regardless of the global aromaticity system setting. For the substructure, this applies to implicitly defined aromaticity, for example the presence of a complete aromatic ring with all defined bond orders and elements, not explicit query attributes.

-exclude_ss

-exclude_ss label_list
-excludelabels_ss label_list

This option allows the exclusion of a set of substructure atoms from the match process. All atoms which are listed here are completely ignored by the match algorithm. By default, or when an empty list is passed, all substructure atoms of the ensemble or molecule (if the handle/molecule label specification was used) are used for matching.

Example:

match ss -exclude_ss [ens atoms $sshandle hydrogen] $sshandle $sthandle

This example does not use any hydrogens on the substructure for matching. This is more efficient than stripping and possibly re-attaching the hydrogen atoms from the substructure.

All substructure atom exclusion options can be combined, but not repeated, and are cumulative.

-exclude_st

-exclude_st label_list
-excludelabels_st label_list

This option allows the exclusion of a set of structure atoms from the match process. All atoms which are listed here are completely ignored by the match algorithm. By default, or when an empty list is passed, all structure atoms of the ensemble or molecule (if the handle/molecule label specification was used) are available for matching.

All structure atom exclusion options can be combined, but not repeated, and are cumulative.

-exclude_st_root

-exclude_st_root label_list
-excludelabels_st_root label_list

This set of structure atoms to be excluded is similar to the one specified with -exclude_st. The difference is that this exclusion only applies to the first level of matching. In deeper match levels, for example recursive SMARTS expressions, these atoms are no longer blocked.

All structure atom exclusion options can be combined, but not repeated, and are cumulative.

-excludeenvironment

-excludeenvironment 0/1

If this flag is set and a recursive SMARTS expression is processed, all parts of the structure which are already matched are excluded from the recursive match check. By default, a new recursion level does not have any knowledge about previous matches and may match all atoms in the structure.

Example:

match ss -excludeenvironment 0 {C[$(OC)]} CO
match ss -excludeenvironment 1 {C[$(OC)]} CO

The first example does match, because the carbon of the recursive fragment may match on the same structure carbon as the first carbon atom in the substructure. In the second case, the structure carbon is marked as already matched, and there is no place to map the recursive fragment carbon, so no match is found.

-excludeflags_ss

-excludeflags_ss flag_value

This option allows the exclusion of substructure atoms from the match procedure which have at least one of potentially several bits set in the A_FLAGS property. The decoded flag values are used as a bit mask, and all substructure atoms which have one or more bits of the mask set are hidden from further processing.

Example:

match ss -excludeflags_ss [list starred boxed] $ss_handle $st_handle

This example ignores all substructure atoms which have been marked with the starred or boxed attribute.

All substructure atom exclusion options can be combined, but not repeated, and are cumulative.

-excludeflags_st

-excludeflags_st flag_value

This option allows the exclusion of structure atoms from the match procedure which have at least one of potentially several bits set in the A_FLAGS property. The decoded flag values are used as a bit mask, and all structure atoms which have one or more bits of the mask set are hidden from further processing.

Example:

match ss -excludeflags_st [list starred boxed] $ss_handle $st_handle

This example ignores all structure atoms which have been marked with the starred or boxed attributes.

All structure atom exclusion options can be combined, but not repeated, and are cumulative.

-excludestructures

-excludestructures ens_mol_list

Specify a set of exclusion fragments. These structure fragments are exhaustively matched as substructures on the structure, and all structure atoms and bonds they match are excluded from the actual match procedure invoked by this command. The exclusion fragment substructure match is always performed with the default mode settings - options like -bondorder or -charge are only applied to the final match. The exclusion fragments may be specified in the same styles as the main substructure and structure, i.e. as an ensemble handle, a list of an ensemble handle and a molecule label, or as a SMILES/SMARTS string.

Example:

match ss {[OH]} CC(=O)O
match ss -excludestructures {C(=O)[OH]} {[OH]} CC(=O)O

The first example matches the hydroxyl group of the structure, which is acetic acid. In order to prevent the match of hydroxyl groups which are part of carboxylic acid groups, carboxylic acid groups can be ignored on the structure with a statement like the second example. Of course, this example could easily be made more generic, such as hiding all groups which have the hydroxyl group attached to any non-carbon, or to a carbon with any other hetero atom neighbor, as in

match ss -excludestructures {[!C,C&x{2-}][OH]} {[OH]} $sthandle

All structure atom or fragment exclusion options can be combined, but not repeated, and are cumulative.

-exclude_ss_h

-exclude_ss_h 0/1

If this flag is set, all substructure hydrogen atoms are ignored in the match process. By default, all atoms in the substructure are used.

All substructure atom exclusion options can be combined, but not repeated, and are cumulative.

-exclude_st_h

-exclude_st_h 0/1

If this flag is set, all structure hydrogen atoms are ignored in the match process. By default, all atoms in the structure are used.

All structure atom exclusion options can be combined, but not repeated, and are cumulative.

-fixedframework

-fixedframework 0/1

If this flag is set, all matched carbons in the structure are prevented from possessing any unmatched hetero atom or carbon neighbors. Matched structure hetero atoms may be bonded to unmatched hetero atoms or carbon atoms. By default, the flag is not set. The acceptability of extra unmatched hydrogen, carbon, or hetero atom neighbors may be additionally controlled on the atomic level by setting the appropriate flags in property A_QUERY(flags) on the substructure.

Example:

match ss -fixedframework 1 CC CCO
match ss -fixedframework 1 CCO CCOC

The first example does not match, because in all possible match orientations there is one matched carbon bonded to an unmatched hetero atom (the oxygen atom). The second example does match - the matched hetero atom may possess bonds to unmatched non-hydrogen atoms - the methyl group in this case.

This match option is useful for locating starting materials for synthesis in vendor catalogs.

-forceringmatch

-forceringmatch no/strict/relaxed

This option controls the matching of the substructure into structure ring systems. If the option is not specified, or set to no (or 0), the matching is only controlled by explicitly set atom and bond query attributes, such as the number of ring bonds, or membership in rings of specific size.

The option value strict allows the matching of substructure atoms or bonds which are members of rings only onto structure parts in ring systems of the same class, i.e. the same set of rings of a given size and arrangement, but without consideration of atoms, bond orders, aromaticity, etc. With this option, a phenyl substructure fragment no longer matches a naphthalene structure, and acyclic substructure atoms or bonds can only match acyclic structure parts. All other query attributes, such as bond order, element type, aromaticity, etc. are applied in addition to this constraint.

The relaxed mode has basically the same constraints, but with one small exception: A terminal substructure atom (an atom which has only a single bond, and thus cannot be a ring member) may match onto structure atoms in ring systems, if the normal query attributes allow this. The relaxed mode is automatically enforced if the -align option with value redraw is specified.

-fuzz

-fuzz n

If this option is used with a value n larger than zero, fuzzy substructure matching is activated. In this mode, it is no longer required that all substructure atoms are mapped to structure atoms. Up to n atoms may fail. Within the A_QUERY property, fields are provided which allow a more detailed specification of whether a substructure atom may be in the fail set, and how much fuzz is allowed in its immediate neighborhood. The -anchor option is also useful to force the use of some critical substructure atoms in the found matches.

This match variant is computationally significantly more expensive than the standard match procedure, and can generate a large set of matches if a match mode which can generate more than one match is used.

Example:

match ss -fuzz 1 ClCCCl CCCl
match ss -fuzz 1 ClCCCl CCl
match ss -fuzz 1 ClCCCl ClCCl

The first example matches, since there is only a single unmatched substructure atom in the best mapping (one of the chlorine atoms), but the second and third do not. The third example demonstrates that fail atoms are straightforwardly ignored, but their unmatched neighbors are not allowed to start new implicit fragments. The second chlorine atom in the substructure cannot match because it remains tethered to the main fragment, even if the excess carbon atom in the substructure is designated as the one allowed failure atom. Both examples two and three will, however, match with a fuzz of 2.

-include_ss

-include_ss label_list
-includelabels_ss label_list

Select substructure atoms for use in matching. By default, all substructure atoms are used. If both an inclusion list and an exclusion list (option -exclude_ss) are specified, the inclusion list is processed first. From the remaining atoms, those which are also listed in the exclusion list are removed.

-include_st

-include_st label_list
-includelabels_st label_list

Select structure atoms for use in matching. By default, all structure atoms are used. If both an inclusion list and an exclusion list (option -exclude_st) are specified, the inclusion list is processed first. From the remaining atoms, those which are also listed in the exclusion list are removed.
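The processing order of inclusion and exclusion lists can be modeled in a few lines of plain Tcl. The helper effective_atoms below is hypothetical and not part of the toolkit. It only illustrates the documented order (inclusion first, then exclusion), and it assumes that an empty inclusion list behaves like the default of using all atoms.

```tcl
# Hypothetical helper, not a toolkit command: compute the atom labels that
# remain available when an inclusion and an exclusion list are both given
proc effective_atoms {all_labels include exclude} {
    # Assumption: an empty inclusion list means "use all atoms"
    if {[llength $include] == 0} {
        set selected $all_labels
    } else {
        set selected {}
        foreach label $all_labels {
            if {[lsearch -exact $include $label] >= 0} {
                lappend selected $label
            }
        }
    }
    # The exclusion list is applied to whatever survived the inclusion step
    set result {}
    foreach label $selected {
        if {[lsearch -exact $exclude $label] < 0} {
            lappend result $label
        }
    }
    return $result
}

# Five atoms, atoms 1 to 3 included, atom 2 excluded again
set remaining [effective_atoms {1 2 3 4 5} {1 2 3} {2}]
puts "atoms available for matching: $remaining"
```

In this model an atom named in both lists is never available, which is the practical consequence of applying the exclusion list last.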

-includeflags_ss

-includeflags_ss flag_value

This option allows the selection of substructure atoms for the match procedure which have one of potentially several bits set in the A_FLAGS property. The decoded flag values are used as a bit mask, and only those substructure atoms which have one or more bits of the mask set are selected for matching. By default, all substructure atoms are used for matching. If both an inclusion flag set and an exclusion flag set (option -excludeflags_ss) are specified, the inclusion flag set is processed first. From the remaining atoms, those which match the exclusion filter are removed.

-includeflags_st

-includeflags_st flag_value

This option allows the selection of structure atoms for the match procedure which have one of potentially several bits set in the A_FLAGS property. The decoded flag values are used as a bit mask, and only those structure atoms which have one or more bits of the mask set are selected for matching. By default, all structure atoms are used for matching. If both an inclusion flag set and an exclusion flag set (option -excludeflags_st) are specified, the inclusion flag set is processed first. From the remaining atoms, those which match the exclusion filter are removed.

-isotope

-isotope 0/1

This flag determines whether isotopic labeling is used for matching. By default, isotope label matching is not performed. If this flag is set, substructures with an isotope label must map onto a structure atom with the same isotope label. Even if this option is not set, explicit references to property A_ISOTOPE in atom query expressions are always evaluated and used to determine the match.

-kekule

-kekule none/odd/even/all

By default (value none or 0), the Kekulé bond order of aromatic bonds is not used for matching. A substructure aromatic bond matches a structure aromatic bond, regardless of whether their Kekulé bond orders are the same or not. If this flag is set to all (or 3), aromatic bonds are compared with the drawn bond order. This can be useful, for example, in order to find a sequence of atoms to perform a reaction transformation which allows a simple change of bond orders in the path without a complete rearrangement of the full π system. The modes odd and even are useful for controlled matching of certain heteroaromatic systems. In mode odd (or 1), the Kekulé bond order is used for all bonds which are only a member of aromatic rings with an odd number of atoms, while the order of bonds in even aromatic systems (including those which are simultaneously a member in an odd aromatic system) is disregarded. Mode even (or 2) is the complementary counterpart.

-limit

-limit n

Set the maximum number of reported substructure matches to n . Any additional matches which might be present are ignored.

-mode

-mode first/all/distinctatoms/distinctheavyatoms/distinctinneratoms/distinctbonds/nocommon/unique/distinctssatoms/dualdistinct/distinctmols/distinctfirstatom

This important option determines the match mode. The default mode is first. In mode first, only the first match, if any, is returned, and any list variables used to capture the atom, bond or molecule maps use only a single level of nesting.

Mode all reports all possible matches (subject to a potentially set maximum number of results, see the -limit option) which differ in at least one atom mapping relationship from any other reported match.

Example:

set nmatch [match ss -mode all CC CCC]

returns 4, because the C2 fragment can be embedded in forward and backward direction, and matched on either the first two or last two carbon atoms of the propane structure.

Mode distinctatoms only reports matches which map onto a different set of structure atoms. Example:

set nmatch [match ss -mode distinctatoms CC CCC]

returns 2 for the mapping of the substructure onto the first two, and the last two carbon atoms. The backward matches of the C2 fragment are not reported, because they do not cover a new set of atoms.
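The difference between the all and distinctatoms counts can be reproduced outside the toolkit with a few lines of plain Tcl. The following is only an illustrative sketch, not the toolkit's implementation: the proc name distinct_atom_sets is hypothetical, and the layout of the input (one list of {substructure_label structure_label} pairs per match) is an assumption made for this example.

```tcl
# Illustrative sketch only, not the toolkit's algorithm.
# Assumption: every match is a list of {ss_label st_label} pairs.
proc distinct_atom_sets {matches} {
    set seen [dict create]
    set kept {}
    foreach match $matches {
        # Collect the structure-side labels and sort them, so that forward
        # and backward embeddings produce the same key.
        set st_atoms {}
        foreach pair $match {
            lappend st_atoms [lindex $pair 1]
        }
        set key [lsort -integer -unique $st_atoms]
        if {![dict exists $seen $key]} {
            dict set seen $key 1
            lappend kept $match
        }
    }
    return $kept
}

# The four embeddings of CC on CCC described for mode all
set all_matches {
    {{1 1} {2 2}}
    {{1 2} {2 1}}
    {{1 2} {2 3}}
    {{1 3} {2 2}}
}
puts [llength [distinct_atom_sets $all_matches]]
```

Of the four embeddings, only the two which cover a new set of structure atoms are kept.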

Mode distinctheavyatoms is similar to the distinctatoms mode, but only uses non-hydrogen substructure atoms for determining whether a match should be considered new and included.

Example:

set nmatch [match ss -mode distinctatoms {CC[#1]} CCC]
set nmatch [match ss -mode distinctheavyatoms {CC[#1]} CCC]

The first example reports an astonishing 10 matches, because the hydrogen atom can be mapped to any of the three terminal hydrogens or either of the two central hydrogens, and there are two distinct embeddings of the substructure C2 fragment. Mode distinctheavyatoms reduces the number of reported hits to 2, because only the atom mappings of the two carbons in the substructure are considered. In many cases, hydrogens can be considered equivalent, and in these cases this mode comes in handy.

Mode distinctinneratoms is similar to distinctheavyatoms, but instead of ignoring all hydrogen atoms on the substructure when determining the novelty of a match, all terminal atoms (those with fewer than two bonds) are ignored in filtering new matches.

Mode distinctfirstatom is another mode with a modified view of what are distinct matches. This mode only looks at the structure atom matched by the first substructure atom.

Mode distinctmols requires that the substructure matches a different molecule in the structure ensemble in each accepted match.

Mode distinctbonds uses the set of matched structure bonds to determine whether a match is novel. For cage structures, there may be multiple matches of the same structure atoms, but matching different bond paths.

Mode unique is a stricter version of mode distinctatoms . Here, the matched atoms must additionally be topologically different, as determined by property A_HASH (when matching without stereochemistry) or A_STEREO_HASH (in stereo match mode).

Mode nocommon only reports matches which do not share any common atoms. Example:

set nmatch [match ss -mode nocommon CC CCC]

returns only a single match, because the middle carbon atom in the structure is already matched by the first match. Unfortunately, the results of this match mode may depend on the numbering of atoms. If, by chance, a C2 substructure fragment is first matched in the middle of a C4 chain, only a single match is found, but if it matches first at one of the ends, two matches are found, because the middle match, if found next, is discarded and then the other terminal match is accepted. The described effect is not a problem in all cases, depending on the nature of the substructure, but using this mode requires careful analysis.
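The order dependence of mode nocommon can be made concrete with a small plain Tcl model. This is a sketch under the assumption that matches are accepted greedily in the order in which they are found. The proc name nocommon_filter is hypothetical, and each match is reduced to its list of matched structure atom labels.

```tcl
# Illustrative sketch only: greedy acceptance of matches which share no
# structure atom with any previously accepted match.
proc nocommon_filter {matches} {
    set used {}
    set kept {}
    foreach atoms $matches {
        set clash 0
        foreach a $atoms {
            if {$a in $used} {
                set clash 1
                break
            }
        }
        if {!$clash} {
            lappend kept $atoms
            lappend used {*}$atoms
        }
    }
    return $kept
}

# C2 fragment on a C4 chain, terminal match found first
puts [llength [nocommon_filter {{1 2} {2 3} {3 4}}]]
# Same matches, but the middle match is found first
puts [llength [nocommon_filter {{2 3} {1 2} {3 4}}]]
```

The first call keeps both terminal matches, while the second keeps only the middle match, because both terminal matches share an atom with it.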

Modes distinctssatoms and dualdistinct are only useful in contexts where only a part of the substructure may be matched, for example when using the -fuzz option. Mode distinctssatoms is essentially the same as mode distinctatoms, except that the matched atoms on the substructure side are checked, not those on the structure side. Mode dualdistinct uses substructure atom/structure atom pairs instead of simple atom identities as the criterion of distinctiveness.

-mapanchor

-mapanchor 0/1

If this flag is set, an anchor set (see option -anchor) is automatically constructed from the values of the A_MAPPING properties on the substructure and structure. A_MAPPING is the default property to encode reaction mapping information. Both substructure and structure must possess valid A_MAPPING data, otherwise this option is ignored. If this condition is fulfilled, any substructure atom which has a non-negative mapping number (see footnote 1) is anchored to its counterpart on the structure side with the same mapping number. If no structure atom with that mapping number is present, the command immediately returns zero matches and empty atom/bond/mol mapping variables, if these were specified. This option can be combined with a normal -anchor option. The anchor tables are cumulative in this case.

-maxopenlinks

-maxopenlinks n

Limit the number of open links of the substructure embedded in a match. Any continuation of the structure from the matched substructure into the unmatched parts except by hydrogen atoms is considered an open link. Example:

match ss -mode distinctatoms CC CCCC
match ss -mode distinctatoms -maxopenlinks 1 CC CCCC

The first example reports three matches, the second only two.

In the latter case, the substructure matches only at either end, because a match in the middle of the C4 carbon chain would have two continuation links. The -terminal option is equivalent to using this option with an open link count of one.
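The open link count for the C4 example can be checked by hand with a short plain Tcl sketch. It is not the toolkit's implementation. The proc name count_open_links is hypothetical, and the structure is assumed to be given as a list of bonds between non-hydrogen atoms only, so that hydrogens never contribute.

```tcl
# Illustrative sketch only: count the bonds which lead from a matched atom
# to an unmatched atom. Bonds to hydrogen are assumed to be absent from
# the bond list.
proc count_open_links {bonds matched} {
    set n 0
    foreach bond $bonds {
        lassign $bond a b
        set a_in [expr {$a in $matched}]
        set b_in [expr {$b in $matched}]
        # A bond is an open link if exactly one of its atoms is matched
        if {$a_in != $b_in} {
            incr n
        }
    }
    return $n
}

# Heavy-atom bonds of the C4 chain
set butane {{1 2} {2 3} {3 4}}
puts [count_open_links $butane {1 2}]
puts [count_open_links $butane {2 3}]
puts [count_open_links $butane {3 4}]
```

The two terminal placements have one open link each, the middle placement has two, so only the terminal placements pass a limit of one.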

-multihighlight

-multihighlight 0/1

If this option is set, and the options -atomhighlight and/or -bondhighlight are used, and more than one match is generated, the highlight atom and/or bond attributes are also set for the second and further matches, resulting in a highlight set which is the union of all matches. By default, only the first match is highlighted, even if more than one match is generated and reported.

-noaliphaticonaro

-noaliphaticonaro 0/1

If this flag is set, aliphatic bonds do not map on aromatic bonds. By default, and in the absence of other criteria determining the match of a bond, both single and double (but not triple or higher) aliphatic substructure bonds match aromatic structure bonds, and vice versa. If the flag is set, substructure bonds which are not marked aromatic, either by explicit attribute setting or indirectly by aromaticity analysis of the substructure fragment, do not match aromatic structure bonds. By default, this flag is not set. This option does not influence the processing of bond query expressions which explicitly reference properties such as B_ORDER or B_ISAROMATIC . These are evaluated in any case.

-noarobondfg

-noarobondfg 0/1

If this flag is set, aromatic bonds are not considered functional groups. This flag influences the interpretation of the insulator and separator pseudo-atoms, which are constructs used to separate functional groups in the match process. By default, aromatic bonds are considered part of a functional group.

-nodoubleonaro

-nodoubleonaro 0/1

If this flag is set, double bonds do not map on aromatic bonds. By default, and in the absence of other criteria determining the match of a bond, both single and double (but not triple or higher) aliphatic substructure bonds match aromatic structure bonds, and vice versa. If the flag is set, substructure double bonds which are not marked aromatic, either by explicit attribute setting or indirectly by aromaticity analysis of the substructure fragment, do not match aromatic structure bonds. By default, this flag is not set. This option does not influence the processing of bond query expressions which explicitly reference properties such as B_ORDER or B_ISAROMATIC . These are evaluated in any case.

-noheterofg

-noheterofg 0/1

If this flag is set, bonds to hetero atoms are not considered part of functional groups. This flag influences the interpretation of the insulator and separator pseudo-atoms, which are constructs used to separate functional groups in the match process. By default, bonds involving a hetero atom are considered part of a functional group.

-nomultibondfg

-nomultibondfg 0/1

If this flag is set, non-aromatic multiple bonds are not considered part of functional groups. This flag influences the interpretation of the insulator and separator pseudo-atoms, which are constructs used to separate functional groups in the match process. By default, non-aromatic multiple bonds are considered part of a functional group.

-nosingleonaro

-nosingleonaro 0/1

If this flag is set, single bonds do not map on aromatic bonds. By default, and in the absence of other criteria determining the match of a bond, both single and double (but not triple or higher) aliphatic substructure bonds match aromatic structure bonds, and vice versa. If the flag is set, substructure single bonds which are not marked aromatic, either by explicit attribute setting or indirectly by aromaticity analysis of the substructure fragment, do not match aromatic structure bonds. By default, this flag is not set. This option does not influence the processing of bond query expressions which explicitly reference properties such as B_ORDER or B_ISAROMATIC . These are evaluated in any case.

-nochainonaro

-nochainonaro 0/1

If this flag is set, substructure chain bonds (acyclic bonds) do not match aromatic structure bonds. By default, and if no option prohibiting this, such as -nosingleonaro or -nodoubleonaro, is set, single and double chain bonds can match aromatic structure bonds.

-omitrecursion

-omitrecursion 0/1

This option influences the way matches of recursive SMARTS fragments are reported. Internally, the first atom of a recursive fragment is represented by an any-atom placeholder on the basic substructure. This placeholder atom and its mapped structure counterpart are reported in atom maps, and the bonds leading to the placeholder in bond maps. If this flag is set, the placeholder atom and its bonds are omitted from the maps.

Example:

match ss -omitrecursion 0 {C[$(OC)]} COC amap
match ss -omitrecursion 1 {C[$(OC)]} COC amap

In the first example, the atom map contains the pairs “{1 1} {2 2}”, while in the second example only “{1 1}” is returned as atom map.

In any case, detailed mapping information about all the atoms and bonds of the recursive fragment is currently not directly available on the script level.

-openhcount

-openhcount 0/1

If this flag is set, all hydrogen counts are considered minimum values. If a matched structure atom possesses more hydrogens, the match still succeeds, even if the original comparison operator uses equality as criterion, provided that the compared property value is A_HCOUNT , the standard hydrogen count property, which is the default used by the various query syntax decoders of the toolkit. This option is unusual because it is also applied to comparisons in atom or bond query expressions. By default, this flag is not set.

Example:

match ss -openhcount 0 {[C;H2]} CC
match ss -openhcount 1 {[C;H2]} CC

The first example does not match, because both carbon atoms in the structure possess three hydrogen atoms, not two, while the second attempt succeeds. Note that the simple specification

match ss -openhcount x {[CH2]} CC

succeeds regardless of the setting of this flag. This is a side effect of the implicit expansion of SMARTS hydrogen atoms when they appear directly behind the atom symbol in the default SMARTS decoder mode, which is described in detail in the section about the handling of SMILES strings.

Alternatively, it is of course possible to either use standard SMARTS or-connected hydrogen count alternative values, or use the toolkit-specific range extensions, as in

match ss {[C;H2,H3]} CC
match ss {[CH{2-}]} CC

but in many cases this makes the query more complicated than necessary.

-overlap

-overlap none/any/nobonds/noembedding/distinctatoms/distinctmols

This option controls how potential overlap of multiple substructure fragments on the target structure is handled. If the substructure contains only a single fragment, this option has no effect.

The default mode is none . In this mode, no overlap of substructure fragments on the target structure may occur. All fragments must be matched side by side, matching different structure parts.

Mode distinctmols is even more restrictive than mode none . In this mode, only one substructure fragment may be matched onto each structure fragment (i.e. molecule).

In mode any , every substructure fragment is treated independently of any other substructure fragment. No information about any match by other fragments is used. Arbitrary overlap of the fragments on the target structure is allowed.

Mode nobonds allows the overlap of atoms, but not of bonds. In effect, multiple fragments may overlap at the edges, but not share any larger structure parts.

In mode noembedding , atoms and bonds may overlap, but no substructure fragment may be completely embedded into the matched structure part covered by another fragment, meaning that at least one of any pair of matching substructure fragments must match an atom which is not matched by the other fragment.

Mode distinctatoms is similar to mode noembedding, but in this mode, for any pair of matching substructure fragments, each fragment must match at least one structure atom which is not matched by the other.
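For a single pair of fragments, the atom-based overlap rules can be written as a small predicate. The following plain Tcl sketch covers only the modes any, none and distinctatoms, as read from the descriptions above. The proc name overlap_ok is hypothetical, and the bond-based and embedding-based modes are deliberately left out.

```tcl
# Illustrative sketch only: test whether two fragments, given as lists of
# matched structure atom labels, are compatible under an overlap mode.
proc overlap_ok {mode atoms_a atoms_b} {
    set shared 0
    set only_a 0
    set only_b 0
    foreach a $atoms_a {
        if {$a in $atoms_b} { incr shared } else { incr only_a }
    }
    foreach b $atoms_b {
        if {$b ni $atoms_a} { incr only_b }
    }
    switch -- $mode {
        any           { return 1 }
        none          { return [expr {$shared == 0}] }
        distinctatoms { return [expr {$only_a > 0 && $only_b > 0}] }
        default       { error "mode \"$mode\" is not covered by this sketch" }
    }
}

puts [overlap_ok none {1 2} {2 3}]
puts [overlap_ok distinctatoms {1 2} {2 3}]
puts [overlap_ok distinctatoms {1 2} {1 2 3}]
```

In the last call the first fragment lies completely inside the second, so it contributes no atom of its own and the pair is rejected.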

Because internally bitsets are used to track the mapping of substructure fragments, the maximum number of fragments which may be used in any mode but none or distinctmols is 64. The none and distinctmols modes do not have a maximum fragment count.

-pionaro

-pionaro 0/1

If this flag is set, any bond between atoms which are part of a π system can match an aromatic bond. This option is intended to allow the reproduction of the behavior of the Daylight toolkit, which has a much broader idea about which ring systems are aromatic than the Cactvs toolkit in its default aromaticity mode. The Daylight toolkit recognizes rings with exocyclic keto groups, such as purines and pyrimidines, as aromatic, while this toolkit does not. If the option flag is set, aromatic fragments match on such systems. By default, the flag is not set.

-rotateterminals

-rotateterminals 0/1

If set, the 2D bond direction of matched structure-side terminal atoms (i.e. atoms with only a single bond) is adjusted to match the direction of the corresponding substructure-side bond. This option is useful, for example, to force the same orientation of hydrogens as in a template. For useful results, the general orientation of the matched structure part must be the same as that of the substructure pattern. This is usually enforced by combining this option with the -align option in the rotate, redraw or besteffort modes.

-stereo

-stereo none/absolute/relative

This option controls the global use of stereochemistry information of the substructure in the match process. By default, stereochemistry is ignored. If this flag is set, stereochemistry present in the substructure is checked against the stereochemical features in the structure. Stereo checks are performed on a pseudo-3D model of the compound and do not use simple descriptor values such as R and S.

If a stereo center in the substructure is unspecified, any stereochemistry, including unspecified stereochemistry, is allowed on the structure side in the matching atoms or bonds. If stereochemistry on an atom or bond of the substructure is specified, it must match the features found in the structure. Unspecified stereochemistry for the matched bond or center on the structure normally leads to a mismatch, unless a nostereook flag has been set in A_QUERY( flags ) or B_QUERY( flags ) for the substructure atoms or bonds. Currently, the substructure match system handles stereochemistry of tetrahedral centers (including those which involve free electron pairs), cis/trans double bonds, allenes (both odd and even) and square planar geometries. Other geometries such as pentagonal bipyramids or octahedra are not yet supported.

With stereo match mode absolute, the pseudo-3D configuration of substructure and structure must match at all stereo centers and diastereomeric bonds specified in the substructure. The alternative mode relative allows the opposite configuration at stereo centers (but not bonds), provided that all matched stereo centers possess the opposite configuration. For example, an S,S-substructure would match both an S,S- and R,R-structure, but not the S,R or R,S-isomer. In effect, enantiomers are matched, but not diastereomers. The relative mode is obviously useful only when more than one stereo center needs to be matched.
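The acceptance rule for stereo centers in the absolute and relative modes can be summarized in a few lines of plain Tcl. This sketch is not the toolkit's implementation and covers stereo centers only, not bonds. The proc name stereo_centers_ok is hypothetical, and the input is assumed to be one flag per matched stereo center: 1 if the structure has the same configuration as the substructure, 0 if it has the opposite one.

```tcl
# Illustrative sketch only: decide whether a set of matched stereo centers
# is acceptable in a given stereo mode.
proc stereo_centers_ok {mode same_flags} {
    set n [llength $same_flags]
    set n_same 0
    foreach f $same_flags {
        if {$f} { incr n_same }
    }
    switch -- $mode {
        none     { return 1 }
        absolute { return [expr {$n_same == $n}] }
        relative { return [expr {$n_same == $n || $n_same == 0}] }
        default  { error "unknown stereo mode \"$mode\"" }
    }
}

# S,S substructure against an R,R structure: both centers are opposite
puts [stereo_centers_ok absolute {0 0}]
puts [stereo_centers_ok relative {0 0}]
# S,S substructure against an S,R structure: one center agrees, one does not
puts [stereo_centers_ok relative {1 0}]
```

Only the second call succeeds: the R,R structure is rejected in absolute mode but accepted in relative mode, and the S,R structure is rejected in both.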

Explicit atom stereo groups, such as the MDL stereo groups, override the global absolute or relative settings for the atoms involved.

Examples:

match ss -stereo none {[Cl,Br,I][C@H](CC)C} {C[C@H](CCC)Cl}
match ss -stereo absolute {[Cl,Br,I][C@H](CC)C} {C[C@H](CCC)Cl}
match ss -stereo absolute {[Cl,Br,I][C@H](CC)C} {C[C@@H](CCC)Cl}
match ss -stereo absolute {[Cl,Br,I][C@H](CC)C} {C[CH](CCC)Cl}

In this example set, the first line matches, because stereochemistry is ignored. The second line does not match, because the target structure represents the opposite stereo isomer. The third line does match, and the last line fails again because the substructure requested matching stereochemistry at a center for which no stereochemical information was available on the structure.

-strictexclusion

-strictexclusion 0/1

This is an expert option which controls how substructure fragments are handled which exclusively consist of atoms which bear the attribute that they should not be matched. By default, an attempt to match these fragments is performed after all other substructure fragments have been matched, and their matched structure parts are blocked. If at this point a match of any such fragment succeeds, the match is a failure. However, at this stage, structure parts which could match the exclusion fragment are potentially covered by other substructure fragments and thus protected, if the overlap mode disallows overlaps. If the flag is set, the check of these fragments is performed before the normal substructure fragments are processed. If a match occurs, the match process is immediately aborted.

-strictsmarts

-strictsmarts 0/1

If set, substructure argument specifications are decoded as strict SMARTS definitions. This means for example that the non-aromaticity of upper-case elements in SMARTS is enforced. Atoms for which aromaticity is not relevant need to be encoded with # notation, or as uppercase and lowercase element symbol pair. This flag only has an effect if the substructure is decoded within the match command. If the handle of an existing ensemble is used as substructure specification, its internal representation and match behavior is not changed and was already defined by whatever decoder options were used when it was created.

-tauto

-tauto none/basic/advanced

By default, bond orders and location of hydrogen atoms in the structure are fixed. A tautomer of a compound is considered a different chemical entity and does not match another tautomer. If the tautomer match mode is explicitly set to none , the match procedure continues to work in this style.

The alternative tautomer match modes basic and advanced introduce flexibility - at the cost of longer processing times, and a risk of obtaining matches which are surprising at first glance.

Examples:

match ss -tauto none {C=CO[H]} CC(=O)C
match ss -tauto none {CC=O} C=C(O)C
match ss -tauto basic {C=CO[H]} CC(=O)C
match ss -tauto basic {CC=O} C=C(O)C

The first two sample lines, with the substructures of an enol and a keto group, do not find a match on the structures of acetone and its enol form. The second pair of lines does find matches in both cases.

Atom and bond maps can be used with tautomeric matches, but the results can be surprising. The bond of a wandering hydrogen atom in the substructure is matched to the bond with the hydrogen in the original structure. However, since the substructure hydrogen atom may actually have been matched against a different virtual structure than the one passed to the match routine, the partner atoms of the bonds to the hydrogens in the substructure and structure may not have been mapped onto each other!

The difference between the basic and advanced modes is that the basic mode does not disturb aromatic systems, while the advanced mode considers forms which involve the conversion of aromatic systems into quinoids and vice versa, at the cost of extra processing time and less precisely defined matches.

-terminal

-terminal 1/0

This is another expert flag, and equivalent to the -maxopenlinks option with a link count of one. If it is set, a maximum of one bond, with the exclusion of bonds to hydrogen, may lead from the matched part of the structure to any non-hydrogen unmatched atoms. Essentially, the substructure is mapped into peripheral regions of the structure.

Example:

set nmatch [match ss -mode all CO C(O)C(O)C]
set nmatch [match ss -mode all -terminal 1 CO C(O)C(O)C]

In this example, the first line returns two matches, since the CO fragment can be matched onto both CO groups in the structure. The second line finds only a single match. The substructure cannot be matched onto the second CO group, because in that match the structure carbon atom has two unmatched non-hydrogen neighbors, one leading to the first CO group, and the other to the methyl group.

-transferstereo

-transferstereo none/atoms/bonds/both

If not set to none, the default, stereogenic atoms and/or bonds in the structure that are matched by substructure atoms or bonds with defined stereochemistry, but do not already possess their own stereochemistry descriptors, inherit stereochemistry from the substructure. This is done by setting properties A_LABEL_STEREO or B_LABEL_STEREO in such a fashion that the absolute configuration is the same as in the substructure. Depending on the atom and bond labeling of the structure vs. substructure, this is not necessarily the same descriptor value. In order for such a match to succeed, missing atom or bond stereochemistry on the structure side needs to be allowed (see the -allowmissingstereo option).

-timeout

-timeout nsecs

Set a time-out for the match operation. By default, or when a value of zero is given, the routine does not time out. If a time-out occurs, the match procedure is stopped. If any matches have been found so far, these are reported as results, without raising an error.

-useatomtree

-useatomtree 0/1

This flag is set by default, but may be reset with this option. If the flag is set, atom query expression trees present in property A_QUERY( query ) are evaluated and used to determine match possibilities. If this flag is not set, query trees are ignored and only the flat atom match attribute set is used.

-usebondtree

-usebondtree 0/1

This flag is set by default, but may be reset with this option. If the flag is set, bond query expression trees present in property B_QUERY( query ) are evaluated and used to determine match possibilities. If this flag is not set, query trees are ignored and only the flat bond match attribute set is used.

-varbondglobal

-varbondglobal maxdelta

If this option is used, the global use of approximated fractional bond orders for coordinate compound hypergraph matching is enabled for bonds with explicit approximated bond order request values stored in property B_QUERY( varbo ) . The maxdelta parameter is the maximum allowed average deviation of the matched structure bonds (with fractional order in B_ORDER_ESTIMATE ) vs. the substructure bonds that have a specified value in B_QUERY( varbo ) .

-varbondlocal

-varbondlocal maxdelta

If this option is used, the use of approximated fractional bond orders for coordinate compound hypergraph matching is enabled for bonds with explicit approximated bond order request values stored in property B_QUERY(varbo) . The maxdelta parameter is the maximum allowed individual deviation of the fractional query bond orders in B_QUERY( varbo ) from the structure-side fractional bond order values of matched bonds stored in property B_ORDER_ESTIMATE .

-wedge

-wedge 0/1

If this flag is set, matching bonds on the substructure and structure sides must possess identical wedge attributes (both wedge tip location and up or down direction). This option should be used only under very specific circumstances. It is not a replacement for stereo center matching, since wedges can be placed onto different bonds around a stereo center, and still represent the same stereo isomer.

Tips and Tricks

The following code snippet performs a simple maximum common substructure search, using the fuzzy substructure match capabilities of the toolkit:

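proc max_common_ss {eh1 eh2} {
    # Use the ensemble with fewer atoms as the substructure
    set n1 [ens atoms $eh1 count]
    set n2 [ens atoms $eh2 count]
    if {$n1<$n2} {
        set ss $eh1; set st $eh2
    } else {
        set ss $eh2; set st $eh1
    }
    # Result if the loop body is never executed (substructure without atoms)
    set n 0
    # Raise the fuzz level step by step until a match is found
    loop i 0 [ens atoms $ss count] {
        set n [match ss -mode unique -fuzz $i $ss $st]
        if {$n} break
    }
    # Number of matches at the lowest successful fuzz level, or 0 if none
    return $n
}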

1. A negative atom mapping value indicates an unmapped atom.