SMILES and SMARTS dialects
The toolkit supports the complete range of the Daylight SMILES, SMARTS, Reaction SMILES and SMIRKS standards, including Recursive SMARTS.
The global control variable ::cactvs(smiles_version) can be set to a Daylight release number. The setting of this variable influences various aspects of encoding and decoding of SMARTS data. The default value is 4.9, the version best known for finally introducing the x ring bond count atom attribute. This is the most recent major Daylight SMILES/SMARTS definition update.
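As a minimal sketch, assuming the control variable is assigned like any other Tcl array element, the dialect could be selected explicitly before a SMARTS using the x attribute is decoded. The variable name, the value 4.9 and the ens create call form are taken from this section; the pattern itself is only an illustration.

```tcl
# Select the Daylight 4.9 dialect explicitly (documented as the default value).
set ::cactvs(smiles_version) 4.9

# [x2] uses the ring bond count attribute that Daylight introduced with 4.9:
# an atom with exactly two ring bonds.
set ss [ens create {[x2]} smarts]
```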
In SMARTS context, a simple ’H’ atom attribute without a count is always interpreted by the toolkit as a hydrogen atom for explicit matching, not as the hydrogen neighbour count. This behaviour has been standard in Daylight tools since the 4.51 release.
Octahedral and bi-pyramidal stereochemistry in SMILES is read and written, but is currently not checked by the substructure match routines. Allenes and square planar stereochemistry are fully supported.
Besides supporting the standard syntax and attributes of both atoms and bonds, a significant number of enhancements are also recognized:
Attribute ranges
In addition to a simple numerical count (as in ’[X2]’), bracketed open and closed ranges are supported, as in ’[X{1-}]’, ’[X{-3}]’ or ’[X{2-3}]’. This feature is available for every attribute which can take a count. It is also possible to use the Eli Lilly operator extensions for the same purpose, as in ’[X>1]’ or ’[X<=3]’. The exception is the closed range, which cannot be expressed in Lilly syntax.
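The range forms listed above can be written out as complete queries. This is a sketch only: the patterns are the ones quoted in the text, the ens create call form is the one used by the other examples in this section, and the comments assume the usual Daylight meaning of X (total connection count) and the usual reading of the comparison operators.

```tcl
# Open range, lower bound only: one or more connections.
set ss1 [ens create {[X{1-}]} smarts]

# Open range, upper bound only: up to three connections.
set ss2 [ens create {[X{-3}]} smarts]

# Closed range: two to three connections (no Lilly-style equivalent).
set ss3 [ens create {[X{2-3}]} smarts]

# Lilly-style operators (assumed reading: more than one, at most three).
set ss4 [ens create {[X>1]} smarts]
set ss5 [ens create {[X<=3]} smarts]
```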
Match count prefixes
The SMARTS expression may be prefixed by a simple count, or by an operator and a count. The SMARTS must then match the required number of times. The match mode is automatically adjusted if required. Example:
set ss [ens create {>4a-[F,Cl,Br,I]} smarts]
This matches compounds which contain more than 4 halogen substituents on aromatic rings.
set ss [ens create {0[R]} smarts]
This matches compounds which do not contain rings.
Strict interpretation suffix
The default SMARTS interpretation in Cactvs is more lenient than the original Daylight definition. Specifically, the aliphatic attribute of upper-case element symbols is not enforced by default. Most match commands provide options to fine-tune the interpretation, and it is also possible to switch the toolkit globally into a strict SMARTS interpretation mode.
As a convenience, it is possible to request strict interpretation of a SMARTS string regardless of command options and global configuration by appending an exclamation mark to the string.
Example:
set ss [ens create {C1CCCCC1!} smarts]
This SMARTS does not match benzene, whereas without the suffix it does match in default toolkit mode.
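The effect of the suffix can be checked side by side. The sketch below assumes the default global configuration and assumes that match ss accepts a SMARTS string and a SMILES string directly, as in the /IWAr and /IWVy examples in this section. No output is shown because the exact return format is not specified here.

```tcl
# Lenient default: upper-case C is not forced to be aliphatic,
# so this query is described above as matching benzene.
echo [match ss {C1CCCCC1} c1ccccc1]

# Trailing "!" requests strict interpretation: per the description above,
# benzene is no longer matched.
echo [match ss {C1CCCCC1!} c1ccccc1]
```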
Additional atom attributes
-
a
Besides supporting its standard meaning without a suffix, the toolkit version allows a count suffix on this attribute. If a count is set, the atom must be part of at least that many aromatic bonds. The associated property is A_AROBOND_COUNT. For example, [a3] matches the two central carbon atoms of naphthalene, but not the other ring atoms.
-
b
Followed by a 0 or 1 value, this attribute requires branching or chain character of the matched atom (i.e. up to 2, or 3 or more heavy atom substituents). The associated property is A_SUBSTITUENT_COUNT.
-
d
Heavy atom substituent count. Unlike the D degree attribute, this one ignores even explicitly specified hydrogen atoms. A count suffix is required. The associated property is A_SUBSTITUENT_COUNT.
-
D
If used without a count, this symbol defines a deuterium atom.
-
e
An atom attribute for the ring pi electron count of all ESSSR rings the atom is part of. If there is more than one such ring, a match in any of these is sufficient. If no number modifier is supplied, the condition requires the presence of one or more pi electrons in the ring. The associated property is R_PI_ELECTRON_COUNT.
-
G
The same as the ’i’ attribute. This is an Eli Lilly internal tools compatibility feature. Example:
set ss [ens create {[aD3]-[G0;CH>0,O,N]} smarts]
-
HA
A hydrogen acceptor atom. This interpretation has precedence over the rather pointless “hydrogen&aliphatic” standard SMARTS interpretation.
-
HD
A hydrogen donor atom.
-
i
An atom attribute checking for unsaturation. If a number modifier is specified, it requests a specific number of π bond participations (e.g. i2 on carbon matches either an allene or an alkyne). The associated property is A_UNSATURATION.
-
T
With a count, it is the same as the z attribute. This is an Eli Lilly internal tools compatibility feature. If used without a count, it defines a tritium atom. Example:
set ss [ens create {[CT1]#C} smarts]
-
X
If used without a number modifier, which is illegal in standard SMARTS, this matches a hetero atom.
-
z
An atom attribute indicating a required number of hetero atom neighbors. If no numeric modifier is supplied, one or more hetero neighbors are required. The property associated with this attribute is A_HETERO_SUBSTITUENT_COUNT.
-
Z
An atom attribute indicating a required number of aliphatic hetero atom neighbors. If no numeric modifier is supplied, one or more aliphatic hetero neighbors are required. The property associated with this attribute is A_ALIHETERO_SUBSTITUENT_COUNT.
-
^[0123456]
An atom attribute where a following digit is required. This attribute checks the atom hybridization: 0=s, 1=sp, 2=sp2, 3=sp3, 4=sp3d, 5/6=sp3d2. The property associated with this attribute is A_HYBRIDIZATION.
-
* and ?
These two symbols both specify an ’any’ atom.
-
#X
This atom symbol matches a hetero atom. It is a MOE compatibility feature.
-
$$(...)
This is a variation of normal recursive SMARTS. Standard recursive SMARTS does not know about atoms and bonds already matched in upper levels: the complete structure can be matched by the atoms in the recursive expression. This variant blocks all atoms and bonds already matched in any previous recursion level.
-
|
The vertical bar as bond symbol encodes a bond of type complex. This is a bond which is similar to a standard valence bond, for example with respect to defining molecular fragments, but is not electron-counted.
-
/IWfss
An Eli Lilly extension: the number of SSSR rings in the ring system the atom is a member of. Example:
set ss [ens create {[/IWfss1o,s]1:c:c:c:c1} smarts]
-
/IWspch
An Eli Lilly extension: the 0 or 1 suffix requires that the matched atom is part of the core, or the Molecular Spinach part of the structure. The associated property is A_LILLY_SPINACH; the literature reference is J. Med. Chem. 2012, 55, 9763-9772. Example:
set ss [ens create {[/IWfss1o,s]1:c:c:c:c1} smarts]
-
/IWhr
An Eli Lilly extension: the number of hetero atoms in one SSSR ring the atom is a member of. Example:
set ss [ens create {[/IWhr1n]} smarts]
-
/IWrid
An Eli Lilly extension. Atoms marked with the same ring ID must be members of a common SSSR ring. Atoms with different ring IDs must each be a member of at least one different SSSR ring. The set of chain atoms forms a pseudo-ring class and can also be tagged. Example:
set ss [ens create {Cl-[/IWrid1a].Cl-[/IWrid1a]} smarts]
set ss [ens create {Cl-[/IWrid1].Cl-[/IWrid2]} smarts]
-
/IWfsid
As above, but applies to ring systems, not rings. Example:
set ss [ens create {Cl-[/IWfsid1a].Cl-[/IWfsid1a]} smarts]
-
/IWAr
An Eli Lilly extension. The attribute checks whether the atom is alpha to an aromatic ring. The associated property is A_ALPHA_ARO_COUNT. Example:
echo [match ss {[/IWAr1Cl]} c1ccccc1Cl=CCl] (expect 1)
-
/IWVy
An Eli Lilly extension. The attribute checks whether the atom is alpha to a vinyl, non-aromatic group. The associated property is A_ALPHA_UNSAT_COUNT. Example:
echo [match ss {[/IWVy1Cl]} C=CCl] (expect 1)
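A few of the simpler attributes from the list above can be tried against small test structures. This is a sketch under assumptions: the match ss call form is copied from the /IWAr and /IWVy examples, the test SMILES strings are chosen only for illustration, and no expected output is given because the return format is not specified in this section.

```tcl
# [a3]: atom in three or more aromatic bonds; in naphthalene only the
# two ring fusion carbons are described as qualifying.
echo [match ss {[a3]} c1ccc2ccccc2c1]

# [^1]: sp-hybridized atom, here tested against propyne.
echo [match ss {[^1]} CC#C]

# [Cz2]: carbon with two hetero atom neighbours, tested against dichloromethane.
echo [match ss {[Cz2]} ClCCl]
```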
Operator-chained matches
The toolkit supports the Eli Lilly extensions for chained matches to a limited degree. In these, multiple SMARTS fragments (each of which may consist of multiple dot-disconnected parts) are linked via the two-character operators &&, || or ^^. Each fragment is handled independently, as a separate structure object, without regard to match overlaps as in Recursive SMARTS or explicit setting of the fragment overlap mode in substructure matching.
Example:
set ss [ens create {[nD3]-S(=O)(=O)&&0[aD3]-[G0;CH>0,O,N]} smarts]
Unlike the original Lilly code, the current implementation does not take operator precedence into account. It is possible to combine, for example, || and && parts in one query string, but the fragments are checked in strict left-to-right order, without precedence for the and part.
Only those parts of the expression are checked which are required to obtain the final match result. In the case of an or expression, match processing stops after the first fragment match has been found.
Eli Lilly extended SMARTS
As described above, the toolkit has near-complete support for the published Eli Lilly SMARTS extensions, including match count prefixes, custom attributes, attribute count operators and chained matches.
Extended hydrogen handling
The H symbol may be used as an explicit hydrogen atom outside brackets, even though it is not in the official organic subset element set.
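A minimal sketch of this notation, assuming the same ens create call form as the other examples in this section; the carboxylic acid pattern is chosen only for illustration.

```tcl
# H written outside brackets is read as an explicit hydrogen atom:
# a carboxylic acid pattern with the acidic hydrogen spelled out.
set ss [ens create {C(=O)OH} smarts]
```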