The molfile Command

molfile append

molfile append filehandle property value ?property value?...

Standard data manipulation command for appending property data. It is explained in more detail in the section about setting property data. This is not a command to append file records. Use the molfile write command for this purpose.

Example:

molfile append $fh F_GAUSSIAN_JOB_PARAMS(route) "Opt=(AddRed,CalcFC)"

molfile backspace

molfile backspace filehandle ?nrecords?

Position the file pointer backwards. If no record counter is specified, the file is backspaced by a single record. It is an error to attempt to reposition the file before the beginning of the file.

Examples:

molfile backspace $fh

molfile set $fh record [expr {[molfile get $fh record] - 1}]

These two sample lines provide identical functionality.

The molfile backspace command is often used in combination with the molfile copy command in order to copy records with specific properties verbatim:

set eh [molfile read $fh]
if {[structure_passes_condition $eh]} {
	molfile backspace $fh
	molfile copy $fh $outfilehandle
}

molfile close

molfile close ?filehandle? ...

molfile close all

Close one or more file handles. If the file handle corresponds to a scratch file, the file is deleted. If it corresponds to a pipe, all programs in the pipe are shut down.

If all is passed instead of a set of file handles, all currently opened structure files are closed. Standard Tcl files are not affected.

It is a good idea to close files when they are no longer needed. In addition, while most file format I/O modules commit all data to disk after each record has been written, so that a clean close-down is not absolutely required, there are file formats for which the I/O module has a cleanup or finalization routine which is only called if the file is properly closed.

The command returns the number of files which were closed.

Example:

set fhandle [molfile open scratch]

molfile close $fhandle

The example closes a scratch file, which is automatically deleted from disk when it is closed.

On normal interpreter program exit, the close functions of all remaining open file handles are automatically called.
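
The return value can be used to confirm that the expected number of handles was released. The following is a minimal sketch that uses only the scratch file open mode shown above. The variable names are illustrative.

```tcl
# Open two scratch files; both are deleted from disk when closed.
set fh1 [molfile open scratch]
set fh2 [molfile open scratch]

# Close both handles in one command and report the count returned by it.
set nclosed [molfile close $fh1 $fh2]
puts "closed $nclosed file handle(s)"
```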

molfile copy

molfile copy filehandle ?channel? ?count? ?record?

Copy a record to a Tcl channel, to a Cactvs structure file handle, or retrieve it as a byte image. No interpretation or formatting of the data in the file record(s) takes place - the data is copied verbatim, byte by byte.

If file format conversion is desired, the data items (ensembles, reactions, datasets) must be explicitly read (molfile read command) as chemistry objects and written to another molfile opened for output in the desired format (molfile write command). That procedure involves re-formatting and potential loss of formatting or information which was not captured by the input routine, or cannot be written by the output routine.

By default the next record after the current file pointer position is returned as a byte image. The optional parameters allow the selection of a specific record (beginning with 1 for the first record), the copying of multiple records in one command (by default, a single record is copied), and output to alternative Tcl channels or Cactvs molfile structure file handles. If an empty string or the value 0 is used as start record number, the file is copied from the current position. If the record number is negative, it is interpreted as an offset from the current position. Therefore, passing -1 as parameter instructs the command to backspace by one record prior to copying. Not all files can be backspaced. If the special count values end or all are used, all remaining records in the input file are copied. Otherwise, if the number of available records is smaller than the requested copy count, an error results.

If the output channel argument is omitted, or set to an empty string, the record(s) are returned as a byte sequence command result. Otherwise, the data is written to the file handle the argument is connected to. For Cactvs molfile handles, the destination is the current write position of the underlying file handle. On Unix/Linux systems, writable active Tcl file or socket handles (in the form filexxx or sockxxx) are also supported, but not on Windows. Additionally, the special output channel names stdout and stderr can be used. If output is written to a channel, and not returned as a blob, the number of actually copied records is returned as the command result.

The I/O modules for ctx and sdf formats provide optimized fast copy routines and are thus notably faster to copy than other file formats without explicitly encoded record positions. These still need to read the file line by line and maintain a parser state, though they can avoid decoding the record contents as structures or reactions.
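
The interpretation of the record argument described above can be sketched in plain Tcl. The procedure resolve_copy_start is a hypothetical helper written for illustration only and is not part of the toolkit. The error raised for a position before the first record is an assumption modeled on the behavior of molfile backspace.

```tcl
# Hypothetical helper: map the ?record? argument of molfile copy to an
# absolute record number, given the current record position.
proc resolve_copy_start {current record} {
    # An empty string or 0 means: copy from the current position.
    if {$record eq "" || $record == 0} {
        return $current
    }
    # A negative value is an offset relative to the current position.
    if {$record < 0} {
        set target [expr {$current + $record}]
        if {$target < 1} {
            error "cannot position before the first record"
        }
        return $target
    }
    # A positive value is an absolute record number, starting at 1.
    return $record
}

puts [resolve_copy_start 5 ""]   ;# prints 5
puts [resolve_copy_start 5 -1]   ;# prints 4
puts [resolve_copy_start 5 2]    ;# prints 2
```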

Example:

set eh [molfile read $fhandle]
set fhout [open "metal_compounds.sdf" w]
if {[ens atoms $eh metal exists]} {
	molfile copy $fhandle $fhout 1 [expr {[molfile get $fhandle record] - 1}]
}

This example reads a structure from an input file, checks whether it contains a metal atom, and if yes, copies the record unchanged to an output file, which is opened as a simple Tcl text file channel in this example. The expression which forms the last parameter backspaces the input file by one record, so that the same record which was just read can be copied. A simpler solution for the same functionality is to simply pass -1 as argument. This works of course only if the input file can be repositioned backwards, i.e. normal text files are fine, but standard input or a socket connection do not work.
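
The simpler variant with a relative record offset might look as follows. This is a sketch which assumes that $fhandle is a rewindable input molfile handle and $fhout a writable Tcl channel, set up as in the example above.

```tcl
# Read one record, then copy the same record verbatim if it contains a metal.
set eh [molfile read $fhandle]
if {[ens atoms $eh metal exists]} {
    # Count 1, record -1: step back one record before copying.
    molfile copy $fhandle $fhout 1 -1
}
```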

molfile count

molfile count filehandle ?maxrecs? ?readscope?

Count the number of records in the file.

If the file format contains an internal or external record index with information about the complete file, the answer is produced from the index, and thus is typically obtained fast. Otherwise, the file is skipped from the current position until the end, and the sum of the number of records encountered while skipping and the record index when the count started is returned. In case of files which are rewindable, the original input file pointer position is then restored. On non-rewindable files, the file contents are consumed, and no return to the old input position is possible. For files which are opened for writing, the count usually is simply the current output position, except for those few file formats which support in-file record replacement in combination with a complete file index. In the latter case, the count is again extracted from the index.

During the record skipping part the file contents are not physically read if possible. Rather, the skip function of the responsible file format I/O module is used to scan the file effectively. After arriving at the end of the file, a full in-memory record position index has been assembled for the file, and future record selection within files which support re-positioning is fast.

The type of record boundaries counted depends on the input scope of the file. For file formats which support multiple input modes, such as for extraction of ensembles or molecules or datasets, the count is dependent on the type of object which is configured to be read. If the file input object type is changed, the in-memory record index table is discarded.

If the maxrecs parameter is specified, and is not a negative number, it is the maximum count reported. No attempt is made to position the file beyond this mark during the count process. This has no effect on future input operations - these may still proceed beyond the reported count. This option is not intended to be generally useful, but is used for example in the structure browser csbr with the -m option to enable quick inspection of a file without full scanning.

The optional readscope parameter can be used to temporarily modify the read scope under which the file is processed. It can be any of the generally recognized values (mol, ens, reaction, dataset). If the file format does not support the specified mode, its default mode is silently used. If the file is not positioned at the beginning of the data, the count reports the sum of the currently known records as perceived by the previous read scope, and the remaining file records under the new one. If these values are different, the result may only be useful under very specific circumstances. If the parameter is not set, or an empty string is passed, the currently set read scope or, for one-shot file operations, the default read scope is used.

Example:

set nrecs [molfile count "thefile.sdf"]

set nrecs [molfile count "test.spl" -1 mol]
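
The maxrecs parameter can serve as a cheap size check. The following sketch reuses the file name from the example above. The limit of 1000 is an arbitrary illustration value.

```tcl
# Stop counting after at most 1000 records instead of scanning the full file.
set limit 1000
set nrecs [molfile count "thefile.sdf" $limit]

if {$nrecs >= $limit} {
    # The cap was reached, so the file may contain more records than reported.
    puts "file contains at least $limit records"
} else {
    puts "file contains $nrecs records"
}
```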

molfile dataset

molfile dataset filehandle

Return the handle of the dataset associated with the file handle. If no such dataset is set, the command returns an empty string. The command

molfile get $filehandle dataset

is equivalent.

This command is different from the dataset commands for ensembles, reactions or tables, where it indicates membership in a dataset. File objects cannot be a member of a dataset. This dataset association is explained in more detail in the molfile set command section.

molfile defined

molfile defined filehandle property

This command checks whether a property is defined for the structure file. This is explained in more detail in the section about property validity checking. Note that this is not a check for the presence of property data! The molfile valid command is used for this purpose.

molfile delete

molfile delete filehandle recordlist ?rebuild_index?

Delete records from the file. The file must have been opened for writing or update, and be rewindable. In case the file is not a simple record sequence, the I/O module for its format must provide a deletion function, or the operation will fail.

The deletion record list is a set of record numbers in any order. They are sorted and duplicates removed. It is no error to specify an empty removal record list. The record numbering starts with one, and the record numbers are referring to the record numbering at the moment the command is issued. There is no need to compensate for intermediate record numbering shifts when more than one record is deleted.

The optional index rebuild parameter, a boolean value, can be set to optimize the deletion process for files in formats which maintain field index information. By default, indices are updated as part of the deletion process. In case many records are deleted, it may be more efficient to drop the indices prior to the deletions and rebuild them after the records have been removed. In order to select this alternative procedure, a true parameter value can be set. At this time, the only file format which actually can use that parameter is the bdb database file format.

In case the file is to be truncated, the molfile truncate command is usually more efficient.

This command returns the number of deleted records. It does not close or destroy the file handle, or the underlying file.
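
The record list semantics (any order, duplicates ignored, numbers referring to the numbering at the time the command is issued) can be illustrated on an ordinary Tcl list. The procedure delete_records is a hypothetical stand-in written for this sketch. It does not touch any file and is not part of the toolkit.

```tcl
# Hypothetical illustration: remove the 1-based positions named in
# recordlist from a plain Tcl list, the way molfile delete treats records.
proc delete_records {items recordlist} {
    # Sort the record numbers and drop duplicates.
    set doomed [lsort -integer -unique $recordlist]
    set kept {}
    set n 0
    foreach item $items {
        incr n
        # Positions always refer to the original numbering.
        if {$n ni $doomed} {
            lappend kept $item
        }
    }
    return $kept
}

puts [delete_records {a b c d e} {4 2 2}]   ;# prints: a c e
```

With a real file, the equivalent call would be of the form molfile delete $fh {4 2 2}, without any adjustment of the second record number for the shift caused by the first deletion.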

molfile dget

molfile dget filehandle propertylist ?filterset? ?parameterlist?

Standard data manipulation command for reading object data. It is explained in more detail in the section about retrieving property data.

For examples, see the molfile get command. The difference between molfile get and molfile dget is that the latter does not attempt computation of property data, but rather initializes the property values to the default and returns that default if the data is not yet available. For data already present, molfile get and molfile dget are equivalent.

molfile dup

molfile dup filehandle

This command duplicates a file handle. The duplicate handle points to the same underlying file or other data channel, is opened in the same access mode, and positioned at the same record. Also, all file object attributes and file properties are set to identical values.

Currently, it is not possible to duplicate virtual file sets opened by a molfile lopen command.

The command returns a new file handle.

molfile exists

molfile exists filehandle

Check whether a file handle is currently in use. The return value is the boolean result. No error is raised if the file handle cannot be decoded.
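
Because no error is raised for stale handles, the command is suitable for defensive checks. A minimal sketch, using the scratch file mode shown in the molfile close section:

```tcl
set fh [molfile open scratch]

# The handle is in use at this point.
puts "before close: [molfile exists $fh]"

molfile close $fh

# The handle string is now stale. The check reports this without an error.
puts "after close: [molfile exists $fh]"
```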

molfile extract

molfile extract filename retrievallist

Extract the contents of data fields from the file, without reading full structure or reaction records if possible. This operation requires a support function in the I/O module for the file format. Generally, only formats optimized for query operations, such as the Cactvs bdb and cbs formats, provide such a function in their I/O module.

This command is essentially a shortcut for a molfile scan command with an empty query condition and a propertylist retrieval mode. Please refer to that command for details about the possible contents of the retrieval list.

The result is a nested list of extracted property values, with one outer list element for every file record to the end of the file, and an inner list with one element per retrieval field.

molfile filter

molfile filter filehandle filterlist

Check whether the structure file passes a filter list. The return value is 1 for success and 0 for failure.

Example:

molfile filter $fhandle $filter

molfile fullscan

molfile fullscan filehandle queryexpression ?mode? ?selectlist? ?parameters?

This command is the same as molfile scan, except that an automatic rewind (see molfile rewind) is performed before the query is executed. The same effect can be achieved by setting the startposition parameter value to 1.

molfile get

molfile get filehandle propertylist ?filterset? ?parameterlist?

molfile get filehandle attribute

Standard data manipulation command for reading object data. It is explained in more detail in the section about retrieving property data.

The molfile object possesses a rather extensive set of built-in attributes, which can be retrieved with the get command (but not its related subcommands like dget, sqlget, etc.). Most of them can also be manipulated with a set command. In addition, molfile objects can possess file-level properties. The standard prefix for these is F_.

Example:

set c [molfile get $fhandle F_COMMENT]
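
Built-in attributes are read and written with the same get and set subcommands. The sketch below assumes that $fhandle is an open input molfile handle. The attribute names used (format, record, eof, ignoreerrors) are taken from this section.

```tcl
# Query a few built-in attributes of the file handle.
puts "format: [molfile get $fhandle format]"
puts "record: [molfile get $fhandle record]"
puts "at end: [molfile get $fhandle eof]"

# Change a writable attribute: continue with the next record after I/O errors.
molfile set $fhandle ignoreerrors 1
```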

These built-in attributes are:

atomlabelproperty
The name of a property which holds data for a parallel user-defined atom numbering scheme (see writeflags / writelabels attribute) which can be output by some I/O modules. The default property is A_LABEL. The property must be associated with atoms, but is not required to be an integer, if the I/O module supports alternative data types (for example, for CDX/CDXML the label data in the file format is internally a string, and any different property data type is converted as necessary). This attribute has an effect only if the writelabels flag is also set in the writeflags attribute.
authorization
A service authorization URL, which might for example be presented to the user for approval of access to a resource. In the case of dropbox file access, this data is copied from the global value of the I/O module (see filex get command). For normal files, this attribute is empty, and setting it to a string value has no effect.
batchsize
The number of records in a standard processing batch. The default batch size is 10 records.
bondlength
The standard bond length to be used in the file. The unit is points (1/72 inch). If the value is negative (the default), the standard format-specific bond length is used. This attribute is only supported in a few graphics-oriented file formats, such as CDX or SKC files, or EMF images.
cachesize
The size of the record prefetch cache the file should use. Normally, the size is zero and no such cache is employed. The I/O modules for a few file formats, such as PubChem CID and SID files, where the individual retrieval of a record via the Internet is almost as expensive as fetching a sizable batch, use a cache if allowed and prefetch multiple records when a record read operation is performed and the cache is empty or the requested record is not in the cache. A later read can, if the input record is in the cached set, return the data without establishing a new network connection. Using a cache is beneficial only when the expected access pattern is linear and in ascending record order. It decreases performance if the record access pattern is random and not limited to a continuous record set that fits into the cache.
chain
A single-letter code indicating the chain to be read from records with structure disorder data. These can for example be found in PDB files. The default value ‘?’ automatically selects the first chain which is encountered in the file record. After a record has been read, the attribute is set to the actual character of the chain which was selected, so it needs to be reset in case more than one record is input via this file handle. If the chain character is set to an empty string, all atoms are read from files even if they belong to multiple overlapping disordered structure instances. This can of course lead to problems in connectivity representation. The alternative name disordered is an alias for this attribute.
compact
A boolean flag indicating whether the file is present in an abridged form, or should be written as compactly as possible. This attribute affects only a few file formats. An example is the native Cactvs ASCII format (cascii).
complexresolver
A boolean flag which enables or disables bond type processing after input. If the flag is on, typical complex bonds between metal atoms and ligands, or between metal atoms, are recognized and re-coded as complex bonds, which provide connectivity, but do not participate in valence electron counting. In many cases, this improves the general representation quality of the structures. However, since most chemical data exchange formats do not support this type of bonds, it can also make export of the data difficult. By default, this flag is on. For maximum portability, it should be switched off. This attribute is a convenience shortcut operating on the readflags attribute.
computationlog
A read-only attribute. It is a list of all properties which were computed during a record write operation. This can be used to determine which effects the output has had on the information content of a written object, or to optimize I/O throughput by performing pre-computation of these properties in a separate thread.
compression
The detected file compression type. It can be one of none, compress, pack, gzip or bzip2. Compressed files are automatically opened for reading via a pipe to the suitable decompressor program, if it can be located. This attribute can also be set, but it currently has no effect on the actual output in any format. In order to write compressed files, open the output as a pipe to a compressor program.
ctime
A read-only attribute reporting the time of the last status change of the file. Its unit is seconds since January 1st, 1970. This value is meaningful only for normal disk files.
deletable
Flag indicating whether this molfile can be deleted or closed with a standard molfile close command. The attribute is read-only. Molfiles which are, for example, property data values or a part of a molfile loop command cannot be deleted by standard means.
deselection
This somewhat awkwardly named attribute is the inverse of the selection attribute. For further explanation, refer to the paragraph on selection.
device
A read-only attribute reporting the device number of the file. This is meaningful only for normal disk files, and only supported on Unix/Linux.
droplist
A list of properties which are not to be written to the file, even if they are already present on output objects and the file format can encode them. Naming a property in this list does not delete it from the property set of objects which are written to the file. It only suppresses its output.
embedformat
The format of embedded objects encoded in another format. This is meaningful only for a few file formats, for example zip (which contains single-record files of a different type) or rtf (which may contain cdx or skc embedded OLE objects). If this attribute is not set, the default depends on the wrapper format (i.e. SDF files for zip, cdx OLE objects for rtf). Setting it to an empty string or none disables embedding where applicable. The attribute is updated on input and can be read when a file record is input which contains embedded data.
encoding
The detected encoding type of the file. It can be one of ascii, binary or unicode. This is a read-only attribute.
eof
This read-only boolean attribute indicates whether the file read pointer is at the end of the file.
eolchars
A sequence of characters which are used as line terminators for the output of text-based file formats which do not define a specific line end character. The default value is platform-dependent. It is a single newline character on Linux/Unix, CR/LF on Windows and a single CR on Macs. This attribute has no effect on input. All input routines automatically recognize and read all three variants on all platforms.

On setting, the magic strings windows and mac (both checked for the first three characters only) as well as unix and linux are translated to the standard platform line terminators and not copied verbatim. Alternative names for these standard system encodings are crlf, cr and lf. The special value default resets the attribute to the platform-dependent default.

eor
A read-only attribute which indicates at what type of record terminator the current read position is located. Possible values are none, mol, ens, reaction and dataset. The none value indicates that reading stopped in the middle of a record due to some problem.
errorproperty
A read-only attribute which holds the name of the last property where input failed. This is not supported by all file I/O modules. It is especially useful for binary formats where a line number cannot be used for simple visual inspection of an input problem.
failures
A list of properties for which computation failed on this file object. This is a read-only attribute. Depending on configuration settings, this information may be used to block pointless attempts at re-computation of incomputable data.
fd
A read-only attribute which reports the system channel number the file object is associated with.
fields
This is a list of the names and potentially attributes of data fields in the file. For simple formats such as SD files, this is simply a list of property names, and it is updated after each read record to track a potentially changing field set. For more complex formats such as bdb and cbs, every list item is a nested list which contains the field name, field flags, field object class association and partition file. The field output for simple formats such as SD is controlled via the writelist attribute, and the value of the fields attribute has no effect on output. However, the I/O modules for complex database-type formats such as bdb and cbs provide a handler function which translates an updated value of this attribute into a changed database layout. Depending on the I/O module, this may be supported only for an empty file (cbs), or may be possible even for files which already store records (bdb). This attribute can also be addressed by the alias fieldnames.
filelock
On reading, this is a boolean flag indicating whether a file lock is currently set on the file or not. On setting, the argument can be release, trylock, forcelock or test. The first variant attempts to release an existing file lock, the second variant attempts to set a file lock, but returns immediately if that is not possible, the third variant blocks until the lock can be established, and the fourth variant tests for the presence of a lock. The return value is a boolean status result. This command is not supported on Windows. File locking may pose special problems if the file is not residing on a local file system. The underlying system call is lockf64() or lockf(). Please consult your operating system manual for more details.
fileset
A read-only attribute containing a list of the names of the physical files which are behind the file handle. For normal files, this is a single list element for a single file. However, for file handles opened by means of the molfile lopen command to access a virtual file assembled from multiple physical files, this can be a list with more than one element.
filter
A query expression (see molfile scan command) which input records must match to yield a result object when a molfile read command is run. The read command is automatically looped until a matching record is found, or the end of the input source is reached. Since the test is only applied after a prospective input object has already been fully read internally, this style of record filtering is in many cases considerably less efficient than using molfile scan for file formats which possess query acceleration features, such as CBS, BDB or the PubChem virtual file module. For the reading of simple text files, such as SDF, there is no performance difference to using molfile scan in the ens or reaction object retrieval mode, and this type of filter, which can be easily adjusted or disabled (by setting it to an empty string), can be convenient.
fontsize
The standard font size for text in graphics-oriented formats, such as CDX or SKC. The value is a floating point number measured in points (1/72 inch). A value of zero or less, which corresponds to the default, lets the software choose a suitable value, which is dependent on scaling and bond length.
fold
The number of characters after which the software should look for a good position to use a continuation character and line break. This is only used in a few formats, such as SLN.
format
The standard name of the file format the molfile object is linked to. This is normally only set in scripts for output files, because the format for input files is auto-detected. Nevertheless, it is possible to set a format explicitly also for input files, and even to switch it when records have already been read. When setting a format, generally a set of alias names are recognized in addition to the short official name.
from
The sender of a file. This is only set when the file has been extracted from a mail message or attachment.
handle
The handle of the file as a read-only attribute. Not generally useful, because in standard access modes you already need the handle to identify the file object.
height
The maximum height of a structure or reaction depiction in points (1/72 of an inch). This is only used for graphics-oriented formats, such as CDX, SKC or EMF. If the attribute is set to a negative value, which is the default, the size is indirectly controlled by the bond length and atom coordinates. In case this attribute is set to a positive value, and the depiction would exceed the maximum height, it is automatically scaled down proportionately.
hidden
Flag indicating whether the molfile is hidden. This is not the same as the invisible state. This attribute is intended to be used for rendering object selections. This attribute can be set.
highmaprecord
The maximum record to include in a memory-mapped section of the file for accelerated read access. If set to a negative value, which is the default, the system automatically determines if mapping is worthwhile, and if it is, maps the full file. This attribute is primarily useful for the acceleration of queries which repeatedly operate in a section of a larger file, for example when running distributed queries with multiple processes handling different parts of a large file.
host
This is a shortcut for the host name part of a file or virtual file addressed via a URL. For simple retrieval it is equivalent to the URL field attribute url(hostname). For some I/O modules, for example the interface to access MySQL tables as virtual structure files, a change of the host name does have an effect and results in (re)connection to a different database host. For normal files accessed via a URL a change of the attribute is ignored after the file has been opened. Files that are not associated with a URL have an empty host name value.
hydrogenfilter
A hint about the desired output style of hydrogen atoms of the structure. In contrast to the hydrogens attribute, this hint does not actually change the structure by adding or removing hydrogen atoms, neither on the original output object nor a temporary processed structure or reaction duplicate. Not all I/O modules support this flag. Its availability can be queried via the capabilities attribute of the filex command for the format. The possible values are default (or -1), which is the default and selects the default hydrogen write mode of the file format, none (or 0) which suppresses hydrogen output, special (or 1) which writes hydrogens shown normally with a symbol only, and all (or 2), which writes all extant hydrogens. Since this attribute does not change the hydrogen atom set, setting for example the mode to all when there are no hydrogens attached to the structure has no effect.
hydrogens
The hydrogen processing mode of the file. Its default can be controlled via the system variable ::cactvs(default_hydrogen_addition_mode). Its standard setting is asis, meaning the hydrogen set is to be kept as it is stored in the objects for output, or defined in the original file records for input. Possible modes for this attribute, or the system control variable, are add (add a complete standard set of hydrogens), asis (keep unchanged), strip (strip hydrogens except those which are normally displayed, such as those bonded to hetero atoms or at stereo centers), stripall (strip all hydrogens), stripadded (strip all hydrogens which were added by a hydrogen add command, automatic hydrogen addition on input, or similar mechanisms) and addblind (which is the same as add, but does not register the added hydrogen atoms as implicit in property A_IMPLICIT). When writing a structure object to a file with enabled hydrogen processing, the original object is not changed. Hydrogen processing takes place on an ephemeral duplicate object. On input, hydrogens which are not explicitly encoded, but defined via implicit valence rules in the format specification, are still instantiated in asis mode. For example, a single C atom in an MDL Molfile is read as a single atom, because there are no default valence rules, but a C as a SMILES string is expanded into one carbon plus four hydrogen atoms. For a method to suppress the expansion of valence-implicit hydrogen atoms, see the readflags attribute.
hydrogenstatus
An enumerated value providing information about the hydrogen status of the file. Possible values are unknown, complete (all hydrogens present), partial (some hydrogens present) and missing (no hydrogens present). This attribute is updated when data is read from files which encode this information. It may also be set and has an effect on some post-processing operations on objects read from the file.
ignoreempty
A boolean flag which instructs, when set, the I/O module of the file format associated with the molfile object to ignore empty records without atoms when reading from the file. By default, this flag is not set and empty records are retrieved as empty ensembles or other objects.
ignoreerrors
A boolean flag which tells the I/O module of the file format associated with the molfile object to ignore errors and to attempt to read or write the next record instead. By default the flag is not set and errors in I/O result in Tcl script command errors.
ignorelist
A list of properties which should not be read from the file, even if they are explicitly encoded in the records.
incomplete
This is a read-only boolean flag which indicates that a record was only read partially. This is the same as checking for the presence of the incomplete flag in the flags attribute.
inode
A read-only attribute reporting the inode number of the file. This is meaningful only for normal disk files, and only supported on Unix/Linux.
invisible
Flag indicating whether the molfile object is invisible. This is not the same as the hidden state. An invisible object is no longer accessible via its handle. This is usually the case for objects which are scheduled for deletion, but still have lingering referring pointers. This attribute is read-only.
iscompressed
A boolean read-only attribute which is set when the file is compressed by one of the recognized compression algorithms (gzip, bzip2 by default). In that case, the file is not accessed directly but via a pipe to the appropriate decompression program, which changes the file handling characteristics.
ismapped
A boolean read-only attribute which is set when the file is read via a memory-mapping method.
ispipe
A boolean read-only attribute which is set when the file is accessed via a pipe, either because it was explicitly opened to a pipe, or because decompression (gzip, bzip2) or character encoding (iconv) programs were automatically spliced in.
jstreversal
A boolean flag indicating whether the JST special encoding variant for MDL Molfiles should be used.
lastrecord
The value of the file record read position before the last molfile read command. This is normally the value of the record molfile attribute after the read operation minus one and corresponds to the file record number of the read object in the data file.
line
A read-only attribute returning the current line number. lc is an alias name for this attribute. Generally this attribute is meaningful only for text-based file formats. For most binary formats, the value of this attribute is the same as the record number. This line number always refers to the current physical file. To get the global line number of a virtual file set, use the vline attribute.
loopitem
The current file input item in a molfile loop statement. This is the same as the content of the loop variable. If no loop is active, this is an empty string. This is a read-only attribute.
lowmaprecord
The minimum record to include in a memory-mapped section of the file for accelerated read access. If set to a negative value, which is the default, the toolkit automatically determines if mapping is worthwhile, and if it is, maps the full file. This attribute is primarily useful for the acceleration of queries which repeatedly operate in a section of a larger file, for example when running distributed queries with multiple processes handling different parts of a large file.
mailencoding
This is a read-only attribute which is only set if the file has been extracted from an email message or attachment. Possible values are unknown, ascii, iso (for ISO 8859-1), quoted (for quoted printable), base64 and utf8.
mailproperties
This is a read-only attribute which is only set if the file has been extracted from an email message or attachment. It is a list of properties which were requested for computation in a header field. This attribute is typically used for setting up email-based property computation services.
maxblobsize
The maximum size of Cactvs ensemble or reaction blobs which are part of the file records, measured in bytes. This attribute only applies to those few file formats which store structure and reaction data as Cactvs toolkit blobs. Currently these are CBS and BDB. If the blob size exceeds the limit, the input or output of the record fails. The default value is 256K, which is more than sufficient for standard applications. If the attribute is changed, a minimum value of 64K is silently enforced. Increasing the attribute can have a small negative effect on I/O performance, but is otherwise safe.
mimeboundary
This is a read-only attribute which is only set if the file has been extracted from an email message or attachment. This is the string which was used to separate MIME data blocks in the message.
mimedefaulttype
A read-only attribute giving the default MIME type associated with the current file format.
mimetype
The currently configured MIME type for the file. Initially, it is set to the default type (attribute mimedefaulttype ). However, it can be changed, and it is used for transmitting the file data via various types of Internet connections.
modcount
The molfile object modification count. This is a read-only attribute.
mode
This is a read-only attribute which describes the general file access mode which was established when the file handle was created by a molfile open or molfile lopen command. Possible values are append , pipe , read , string , write and update . Note that in this attribute there is no difference between the standard read and the restricted read-only modes (see molfile open ). The file mode cannot be changed at a later time by directly changing the mode attribute. However, with some limitations, a file may be switched back and forth between input and output modes with the aid of the molfile toggle command.
mtime
A read-only attribute reporting the time of the last modification of the file. The unit is seconds since January 1st, 1970. This value is meaningful only for normal disk files.
name
On input, this attribute simply reports the full path name of the underlying file, or the original magic name in case of special files. This attribute can also be set, and in case of normal disk files, the physical file is renamed, too, if the file access permissions are sufficient for this operation.
nitrostyle
The nitro (and similar) group encoding conventions associated with the file handle. There are actually independent settings of this attribute for input and output. The version reported by the command depends on whether the file is in input or output mode. Possible values are asis, ionic, neutral, xionic and xneutral. The default input value is ionic, while the default output value is asis. When the value is modified, the new value is stored both for input and output. If the value is not asis and a structure item is read, its nitro group (and related groups) connectivity is automatically adjusted to the preferred style. If processing is requested for output, the connectivity change is performed on a temporary duplicate, so that the original output object is not modified.
nullstring
A string which on input is used to identify NULL values, or used on output to encode NULL values. This attribute is only used by a few I/O modules. The most important application is in reading text-based tables with embedded structure notations by means of the table structure I/O module.
offset
A read-only attribute reporting the current byte offset position of the read or write pointer. It is not meaningful for all types of data channels.
orientation
This value can be none (the default), landscape or portrait. It describes the orientation of a drawing area specified via the paper attribute. Few I/O modules use this information. The most important formats which implement it are CDX and CDXML.
originalname
The name as originally used to create the molfile object. The standardized name, with path information in case of disk files, can be accessed via the name attribute. Changing this attribute has no effect on the file system. This is different from the handling of the name attribute.
pagecount
The number of (vertically stacked) pages in the document. This attribute is currently only used for the CDX and CDXML formats.
paper
An attribute describing the size of the drawing area for formats such as CDX or CDXML, which can encode this type of information. Possible values are none (the default), a3, a4, a5, a6, a7, b3, b4, b5, b6, letter, legal and executive. The orientation of the drawing area can be set via the orientation attribute.
parameters
A free-form string which can be used to pass additional, non-standardized parameters to a file format I/O module. Few I/O modules use this, one example is the XFIG output code.
password
A file access password. It is used in various contexts, for example for authentication when using URL-based access to files, to enable the I/O of encrypted records in files which support partial data encryption, such as the Cactvs CBS and BDB formats, or to proceed with the execution of a remote query received via a listener port. In most cases a change of the attribute value after a file has been opened has no effect. Exceptions are modules which access database tables as virtual structure files. These react to a changed user name with re-authentication to the database and table, which may result in different access permissions.
polysymbol
A free-form string used to override the standard symbol used by a file format I/O module to indicate polymer components. If set to an empty string, the standard symbol is used, which depends on the file format. The default is an empty string.
port
The number of a port on which the file handler should accept remote query requests. If set to a negative value (the default), no such requests are accepted, and in case a monitor thread was executing before the value was changed, it is shut down. If a positive port number is set, a monitor thread is automatically started as listener on the specified port.
position
This read-only attribute describes the relative position of the read or write pointer in the full file, as an integer in the range between 0 and 100. It is primarily intended to be used in progress meters and similar widgets. In case the relative position is unknown, for example because the total size of the input file is unknown, the value is zero.
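As an illustration of the progress meter use case, the sketch below turns a percentage in the range 0 to 100 into a simple text bar. The helper progress_bar is a hypothetical name and plain Tcl; only the commented usage line depends on the toolkit.

```tcl
# Render a text progress bar from an integer percentage (0..100).
# progress_bar is a hypothetical helper, not part of the toolkit.
proc progress_bar {percent {width 20}} {
    # Clamp defensively so that unexpected values cannot break the layout
    set percent [expr {max(0, min(100, $percent))}]
    set filled [expr {$percent * $width / 100}]
    set empty [expr {$width - $filled}]
    return [format {[%s%s] %d%%} [string repeat "#" $filled] [string repeat "." $empty] $percent]
}

# Intended use inside a read loop, assuming an open input handle $fh:
#   puts [progress_bar [molfile get $fh position]]
```

Because the attribute reports zero when the total file size is unknown, a bar that never advances does not necessarily indicate a stalled read.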
preservelist
A list of properties which should not be changed if a file record is updated, even if the value in principle depends on, for example, changed connectivity of the main structure record. Currently, the only I/O module which supports this feature is BDB .
previousrecord
This read-only attribute is a convenience accessor for the record number of the file handle before the current record was read. Usually, it is the same as the record attribute minus one, but if reactions were read from files where reagents and products are separate sub-records, or if complete datasets were read, the difference may be larger.
pyobject
If the toolkit was compiled with Python support, this attribute reports the memory address of the Python wrapper class instance, if it exists. This attribute is read-only.
pyrefcount
If the toolkit was compiled with Python support, this attribute contains the reference count of the Python wrapper class instance, if it exists. This attribute is read-only.
reactioncolumn
This attribute is the numerical index of a column in table-style data files which are, for example, read by the reaction table I/O module. The column is expected to contain a string notation for the reaction object which is returned by a molfile read operation. The contents of the other columns are attached to this decoded object as property data. Typically the content of this column is a Reaction SMILES string or similar line notation. A negative value of this attribute indicates that the presence of structure data in a specific column is unconfirmed. In that case, an attempt is made to determine the reaction column automatically, and the attribute is updated accordingly. However, setting it explicitly may still be required in case there are multiple columns with reaction data, or there are too many unreadable or NULL row entries to allow automatic determination.
reactionscreen
The name of the property which is used for bitvector screening in filtering records for reaction transform matching. Its default value is controlled by the global variable ::cactvs(default_reaction_screen_property) and is usually X_SCREEN. If a file is opened that contains information about the screen property set when the file was written (for example, CBS and BDB formats), this attribute is automatically set to the value stored in the file.
readflags
This attribute controls a set of input processing flags. If the attribute is queried, the result is a list of the names of all flags which are currently set. For modification, the preferred method is to use the bit manipulation prefixes for generic bitset operations. In case just additional flags should be activated, the molfile append command can also be conveniently used. There are also a few shortcut alias attribute names which set or reset selected, frequently used flags directly (complexresolver). The following flag names are currently recognized:
none
no flags.
aroresolver
resolve aromatic bonds into a Kekulé form. A frequent application is the input of records from MDL SD files which are not used as query structures, but where aromatic bonds in the original data are nevertheless and illegally encoded as the aromatic structure query bond type.
autowrap
When the end of the file has been reached, automatically start reading from the beginning of the file again, until the full file has been scanned once. This operation affects only the molfile scan command and is used there in order to perform full-file queries starting from an arbitrary position in the middle of the file.
basiconly
only read basic property data set, not full record. This is supported in CBIN , CBS and BDB formats in order to accelerate fast filter and query operations.
chargebalancer
Try to neutralize and balance charges.
chargecombiner
Try to merge opposing charge pairs where possible, changing the bond orders of paths between them if necessary.
complexresolver
perform a bond analysis and re-code typical bonds in metal complexes as non-VB bonds, which do not participate in valence electron counting. For a more detailed explanation, see the alias shortcut complexresolver . This flag is set by default.
continueafterhetatm
For PDB files, consider any atom line after the first HETATM to be a heterogen, regardless of the line type. This feature helps to cope with ligands which contain amino acid substructures and which some other PDB write software misclassified as part of the protein.
fixdoublespace
If set, this flag instructs I/O modules with support for this feature to read structure files which contain one spurious empty line after each data line, which unfortunately appears to happen sometimes when DOS-encoded files are transferred to Apple systems. This is not the same as reading CR/LF files on CR-only or NL-only platforms, or vice versa, which is always possible and fully automatic. This flag addresses the problem that, due to mishandling by obscure transfer software, duplicated EOL markers are introduced in the file (two identical CR/LF, or CR, or NL pairs after each data line).
fixstereo
Remove spurious stereo descriptors on atoms and bonds which are not stereogenic.
fixwedges
Re-code wedges which are attached with the broad base to a stereo center (for example as written by IDBS software) into standard IUPAC format with tips at the stereo centers.
hetatmonly
In PDB files, read only HETATM lines.
ignorecr
Allow an isolated carriage return ( ASCII 13) character without following NL ( ASCII 10) character as data content instead of examining it as potential line break symbol. This flag is necessarily ignored on Mac-style input files which only use CR as EOL markers.
ignoreitherdb
If set, ignore any either attribute data for double bonds in MDL Molfiles . Instead, determine their stereochemistry from coordinates.
ignoreempty
When reading an empty record, with no atoms, from a multi-record file, ignore the record and immediately proceed with the next.
ignoreerrors
When reading a corrupted record from a multi-record file, ignore the error and instead attempt to re-synchronize and read the next record.
ignorenorecall
If set, the norecall field flag supported in some file formats ( CBS , BDB ) is ignored. By default, data from fields which carry this flag is not merged into the property set of ensembles or reactions when they are retrieved as objects from these files, as an optimization to avoid recalling data which is useful for queries, but not so much as object data (for example, screen bits, element counts). With a set attribute flag, all fields of the record are attached as property data to recalled objects.
ignorevisibility
Ignore any display attributes in the input data which would make atoms or bonds invisible in renderings.
latehprocessing
If this flag is set, the standard hydrogen addition/removal operations are performed after other selected processing steps have been performed. By default, hydrogen processing takes place before charge equilibration, radical charging, etc. This flag should be set if the hydrogen set in the file records is known to be complete, but the charge and radical situation is dubious.
lockmemory
Lock the shared memory mapping arena of the file into memory, preventing it from being swapped out. This is only supported on Linux, and has an effect only if the sharedmap flag has been set. Depending on the size of the arena, and the system configuration, this operation may require enhanced privileges.
logqueries
If the file format supports operation logging, activate the log.
keepcoords
In case multiple molecules or ensembles are read in one operation, the system normally verifies that they do not have overlapping 2D display coordinates, and moves them apart if necessary. If this flag is set, the 2D display coordinates in property A_XY are always passed unchanged.
mergedata
In case there are repeat instances of the same data item in an input record, attempt to append it in a suitable fashion to the first property instance on the input object. By default, multiple data items with the same name are not merged, but result in multiple property data instances. This is a problem which is encountered typically while reading data from formats with limited syntactic expressiveness that cannot properly distinguish between these cases.
multibondcheck
attempt to correct unlikely clusters of multiple bonds.
nocoordinatecheck
do not attempt to discover and fix mixed-in missing 2D or 3D coordinates, for example encoded as all-0 values. All coordinate data is to be preserved verbatim.
noorigin
do not register the origin of the property data values from the current file as metadata information.
noeof
do not attempt to detect EOF . More data may be coming.
noimplicith
do not add a standard valence set of hydrogens to explicitly encoded atoms, even if the file format specification defines such a set. The most common application is for reading SMILES strings without the default hydrogen atoms. nohadd is a (slightly misleading) alias for this flag. This flag is independent of the generic hydrogen addition/removal processing option, which can be configured with the hydrogens attribute.
nometa
if this flag is set, it asserts that the file does not contain metal atoms. This is for example useful for reading PDB files which frequently possess ambiguous encodings such as CA for calcium or alpha carbon.
nometalh
suppress addition of hydrogens to metal atoms.
noradicals
assert that the file does not contain records with atoms that are radicals. This is a hint which is used for hydrogen addition, radical charging, and other operations.
pedantic
apply pedantic checking of file syntax rules. For some frequently abused file formats, such as MDL Molfiles or PDB, this may result in a considerable percentage of files being rejected for file format specification violations.
radicalcharger
Edit radicals which are typically formed by reading a file without formal atomic charge information by adding standard formal charges, for example replacing NR4 with N(+)R4 and OR with O(-)R. This only works reasonably well if the file contains a complete hydrogen set.
readas2d
Force the interpretation of atomic coordinates as 2D, regardless of the file type encoding or presence of a third coordinate column, which may have been abused as an additional atom data store.
readparity
If this flag is set, the parity fields in MDL Molfiles and derivatives are read and the data stored in property A_LABEL_STEREO . In accordance with MDL rules, this field is normally ignored, and stereochemistry decoded from wedge bonds and atom coordinates.
sharedmap
If the file is memory-mapped, use a shared memory segment for the data. This can be useful if there are many processes accessing the same file for reading. This flag is only supported on Unix/Linux.
simpleradicals
If this flag is set, the input file is assumed to contain only simple doublet radicals, if any. Any encoding of other, probably miscoded radical forms is changed to a doublet.
tautoresolver
Perform a tautomer standardization on the read structure. This operation invalidates numerous atom and bond properties, such as coordinates, but in this special case all ensemble properties which were attached to the processed structure are retained, regardless of their sensitivity toward atom and bond changes. Tautomer resolution requires a complete hydrogen set, so either these must be present in the input file, or a suitable hydrogen addition mode must have been set on the file handle. The processing behind this input option is comparatively expensive. For normal input, when speedy input and maximum fidelity of the data to the original file is desired, this flag should not be set.
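The text notes that molfile append can be used to activate additional read flags. A minimal sketch of that usage follows. It assumes an open input handle and a Cactvs interpreter; add_readflag is a hypothetical helper name.

```tcl
# Sketch only: needs the Cactvs toolkit, which supplies the molfile command.
# add_readflag is a hypothetical helper, not a toolkit command.
proc add_readflag {fh flag} {
    # Appending activates the flag without resetting flags that are already set
    molfile append $fh readflags $flag
    # Querying the attribute returns the list of all flags currently set
    return [molfile get $fh readflags]
}

# Example: read SMILES input without the default valence hydrogens
#   add_readflag $fh noimplicith
```

Removing flags is not covered here, since the text refers to the generic bitset manipulation prefixes for that purpose without spelling them out in this section.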
readkey
This attribute is only used in certain library configurations which have been configured to restrict read access to specific types of files. The key and data computed from the file name must together match the signature. Usually restricted applications have a compiled-in signature, and one or more read keys which enable read access to the same number of specific files.
readkeysignature
This attribute is used for certain library configurations which have been configured to restrict read access to specific files. This signature is required to verify the read access key.
readkeystatus
A read-only attribute which reports the access key status for a file for which a read key has been specified. It can be unchecked, verified or error.
readscope
This attribute controls which types of objects are read from a file, in case the file contains more than one object type. For example, MDL RXN files can be read as an ensemble record stream, or as a reaction record stream. Cactvs CBIN files can be read as a multi-record stream of individual ensemble or reaction records, or as a single dataset with additional dataset properties. CTX files allow access to individual molecules or ensembles. The hierarchical FDA SPL format supports read modes for molecules, ensembles, and datasets. The default value for this attribute depends on the file format and is automatically updated whenever the format is analyzed or changed. It is generally set to the most commonly used access variant for that format, for example reactions for RXN files and ensemble streams for CBIN, but it may also be set explicitly. Possible values are none, mol, ens, reaction, dataset and auto. In case a file format does not support a specific variant, the next supported type to the right in this sequence is automatically used. The auto mode performs a new content analysis for every record and uses the most suitable scope. Examples where this is useful are RDF files with mixed structure and reaction records, or RTF documents which mix reaction and structure OLE objects. The dataset mode is potentially dangerous when reading large multi-record files which do not contain multiple smaller datasets. In that case, the whole file is interpreted as a single dataset, and that can lead to a large amount of memory being consumed.
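The fallback rule (use the next supported type to the right in the sequence) can be modelled in plain Tcl as shown below. This is only an illustration of the rule as described. The toolkit performs the selection internally, and both the helper name resolve_readscope and the list of scopes passed as supported are hypothetical.

```tcl
# Model of the described fallback rule, not the toolkit's implementation.
# resolve_readscope is a hypothetical helper name.
proc resolve_readscope {requested supported} {
    # Scope sequence in the order given in the attribute description
    set sequence {none mol ens reaction dataset auto}
    set start [lsearch -exact $sequence $requested]
    if {$start < 0} {
        error "unknown read scope \"$requested\""
    }
    # Walk to the right until a scope the format supports is found
    foreach scope [lrange $sequence $start end] {
        if {$scope in $supported} {
            return $scope
        }
    }
    return ""
}
```

For a hypothetical format which supports only ens and dataset, a request for mol would resolve to ens, and a request for reaction would resolve to dataset.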
record
The number of the next record to be read or written, starting with one. This value always refers to the current physical file. In case a virtual file is read, the vrecord attribute can be used to address the global record number. rc is an alias name for the attribute. It is possible to set this attribute in order to reposition the file pointer. In case the file is opened for output, and is not in update or append mode, this operation truncates the file. Repositioning while reading does not modify the file. It is not possible to position the file pointer any further to the rear of a file than immediately behind the end of the last existing record. When setting the value, the magic record numbers last (to set the file pointer so that the last record is accessed) and end (to set the file pointer immediately after the last record) are supported for convenience.
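A brief sketch of repositioning via the record attribute, using the magic record number last and a plain numeric value. It assumes a Cactvs interpreter and a handle opened for input, since setting the attribute on a plain output handle truncates the file. The helper names are hypothetical.

```tcl
# Sketch only: needs the Cactvs toolkit, which supplies the molfile command.
# Both helpers are hypothetical and intended for input handles only, because
# repositioning an output handle outside update or append mode truncates the file.
proc read_last_record {fh} {
    # "last" positions the file pointer so that the last record is accessed
    molfile set $fh record last
    return [molfile read $fh]
}

proc rewind_to_first_record {fh} {
    # Record numbering starts with one
    molfile set $fh record 1
}
```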
recordtable
This is a read-only attribute. It returns a nested list of the attributes of the currently known record positions in the file. Every list element is itself a list which contains, in this order, the record number, the file offset, the line number (which is the same as the record number for binary formats), the eor type of that record, a boolean flag indicating whether the record is physically present in the disk file (0) or virtual (1), and the original file name used to create the handle. In case of multi-file handles, the file name is not constant over all records. In order to guarantee that all records of a file and their offsets are known, execute for example a molfile count command before querying the record table.
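The nested list can be unpacked with standard Tcl list commands. The sketch below builds a dictionary which maps record numbers to file offsets, relying only on the element order given above. The helper name record_offsets is hypothetical, and the commented usage lines assume an open handle in a Cactvs interpreter.

```tcl
# Map record numbers to byte offsets from a record table.
# record_offsets is a hypothetical helper; it is plain Tcl.
proc record_offsets {table} {
    set offsets [dict create]
    foreach entry $table {
        # Element order as documented: record number, offset, line number,
        # eor type, virtual flag, original file name
        lassign $entry recno offset line eortype isvirtual filename
        dict set offsets $recno $offset
    }
    return $offsets
}

# Intended use, after making sure all record positions are known:
#   molfile count $fh
#   set offsets [record_offsets [molfile get $fh recordtable]]
```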
refcount
If the Tcl interpreter is using native Cactvs objects instead of string-based major object handles and integer-based minor object labels to identify toolkit objects, this returns the number of Tcl object references active for this molfile . The attribute is read-only.
replyto
The future recipient of a file. This is only set when the file has been extracted from a mail message or attachment. In order to send mail messages to specific destinations via the mail wrapper I/O module, this attribute may also be set.
resolution
A resolution value in DPI (dots per inch). The default value is 0, meaning that it is undefined. This information can be used by a couple of I/O modules, for example for reading structure data from image files by performing chemical OCR via the interface to the OSRA program.
returnformat
The name of the desired return format if the original file was received by mail. This is only set when the file has been extracted from a mail message or attachment.
scandata
This is a read-only attribute which reports statistics on the last molfile scan command. The returned data is a Tcl dictionary with keys start_time (in seconds since 1970-1-1), stop_time (in seconds since 1970-1-1), scan_time (in seconds), ens_read (count of ensemble objects instantiated), miniens_read (count of Minimol objects decoded), reactions_read (count of reaction objects instantiated), properties_read (count of property records read), ens_screened (count of bit-screen filtering operations performed for substructure/superstructure searches), reactions_screened (count of bit-screen filtering operations performed for reaction matching), records_examined (count of records looked at), records_matched (number of matched records), start_record (record the scan started at), end_record (last visited record), eof_reached (boolean indicator whether the end of the file was reached), max_mmap_used (maximum used size of memory mapping arena), max_mmap_requested (maximum requested size of memory mapping arena), records_skipped (number of records which were skipped with need for re-synchronization), records_repositioned (number of records which were finished without the need for a re-synchronizing skip operation) and scores_computed (the number of scoring function calls executed).
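Since the statistics are returned as a Tcl dictionary, they can be evaluated with the standard dict command. The following sketch derives a hit rate and a scan speed from the documented keys records_examined, records_matched and scan_time. The helper name scan_summary and the result keys hit_rate and records_per_second are hypothetical.

```tcl
# Derive simple figures from a scandata dictionary.
# scan_summary and its result keys are hypothetical; the code is plain Tcl.
proc scan_summary {stats} {
    set examined [dict get $stats records_examined]
    set matched  [dict get $stats records_matched]
    set seconds  [dict get $stats scan_time]
    # Guard against division by zero for empty or very short scans
    set rate  [expr {$examined > 0 ? double($matched) / $examined : 0.0}]
    set speed [expr {$seconds > 0 ? $examined / double($seconds) : 0.0}]
    return [dict create hit_rate $rate records_per_second $speed]
}

# Intended use after a scan on an open handle:
#   set summary [scan_summary [molfile get $fh scandata]]
```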
selected
Flag indicating whether the molfile object is selected. This attribute can be changed.
selection
This attribute is not a molfile handle attribute, but a flag attached to individual records. If queried, the return value is a list of all record numbers for which this flag is set. Using molfile set with a list of record numbers in any order to modify the attribute resets the current flags, and creates a new set. Modifying the attribute via molfile append adds selection flags without resetting the current selection. The selection flag can only be set for existing records. If an attempt is made to set the selection flag ahead of the currently known position set, the command scans the record structure (as in molfile count ), which can be a problem in case of non-rewindable input. In order to facilitate resetting of selection flags, the virtual attribute deselection can be accessed as the inverse of the selection. Setting it to an empty list selects all records up to the end of the file (again this triggers automatic forward scanning, if necessary), and appending a list of records removes them from the selection. The default value of the selection flag for any record is false .
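The three modification paths described above (replacing the selection, extending it, and removing records via the virtual deselection attribute) can be sketched as follows. The example assumes a Cactvs interpreter and a rewindable file; update_selection is a hypothetical helper name.

```tcl
# Sketch only: needs the Cactvs toolkit, which supplies the molfile command.
# update_selection is a hypothetical helper, not a toolkit command.
proc update_selection {fh initial extra removed} {
    # molfile set resets the current flags and creates a new selection
    molfile set $fh selection $initial
    # molfile append adds selection flags without resetting the current set
    molfile append $fh selection $extra
    # Appending to the virtual deselection attribute removes records again
    molfile append $fh deselection $removed
    return [molfile get $fh selection]
}
```

All record numbers passed in must refer to existing records. If they lie beyond the currently known positions, the toolkit scans forward, which is a problem for input that cannot be rewound.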
separator
A string containing one or more column separator characters. This is used for example by the structure and reaction table I/O module. The attribute is also set when a table with an auto-detected separator character was read via the file handle. The default separator is a single tab character.
sessionkey
A free-form string intended to be used to identify sessions.
shmid
In case the memory map arena of the file is in shared memory, this is the shared memory key as read-only value. If the file is not mapped into shared memory, or on platforms where memory mapping is not supported, the value is always minus one.
signature
The signature of a mail message. This is only set when the file has been extracted from a mail message or attachment.
similarityproperty
The name of the property which is used for bitvector similarity computation in file scans. Its default value is controlled by the global variable ::cactvs(default_similarity_property) and is usually either E_SCREEN or E_QUERY_SCREEN. If a file is opened that contains information about the similarity property set when the file was written (for example, CBS and BDB formats), this attribute is automatically set to the value stored in the file.
size
The file size in bytes as read-only data. In case it is not known, for example because the file is accessed via a special stream or a pipe, zero is reported.
sizehint
The expected maximum record count of the file. This attribute is used by some I/O modules to pre-allocate room in files with complex storage layout, in order to avoid the need for expensive re-organization during later record writes. The CBS format especially benefits from this information. File formats which are simple record sequences have no use for this information. A value of zero, which is the default, specifies an unknown future size. If the final size is not known exactly, it is generally preferable to overestimate it somewhat than to be slightly short.
statusflags
A list of boolean flags which describe the status of the machinery behind the I/O operations of this handle. All set flags are reported. When checking for the presence of a flag, make sure not to use simple string comparison, because other flags may also be set. While it is possible to change the flags, this is not a common operation, and if done carelessly can disrupt the I/O functionality of the handle. The older attribute name flags is still a valid alias. The following flags are commonly seen:

append - all file output is appended to the end of the file, ignoring the current write pointer position.

binary - the file is binary, without a line structure.

bzip2-compressed - the file is accessed via a pipe to the bzip2 program.

checkedbinary - the file contents were checked to determine whether they contain non-ASCII characters.

edited - the file contains virtual edited records, or virtual deletes.

fakeposition - the file has no meaningful offset positions for the beginnings of records. The offset data structures contain other forms of access information.

gzip-compressed - the file is accessed via a pipe to the gzip program.

incomplete - the last file record was not read completely. This can be intentional in file formats which support basic and extended data groups, or can be an indication of a non-critical decoder problem.

indexed - the file is accessed via an index file with record positions, not directly.

nommap - memory-mapping of the file contents is suppressed.

initialized - an initialization function of the associated I/O module has been called.

locked - there is currently a flock()/lockf() style file lock active on the file.

mapallocated - the memory mapping arena for the file was allocated and filled via some read operation, not mmap()ed.

memlocked - a mapping of the file is locked into memory and is not swapped out.

readable - the file handle can be read from.

readonly - the file has been opened for read-only access, without the possibility to switch the handle to a different mode.

remotefs - the physical file resides on a non-local file system.

rewindable - the file can be rewound if necessary.

scratch - the file is a scratch file and is automatically deleted when the file is closed.

shared - the file contents reside in shared memory.

validcount - the number of known record positions is known to equal the total number of records in the file.

virtual - the file is a virtual file built from multiple physical files.

ucs2-encoded - the file is accessed via a pipe to the iconv program.

url - the file is accessed via a URL, not a file system path.

updating - the file is currently being updated.

writeable - the file handle can be written to.

xdr - the file is associated with an XDR encoder or decoder structure.
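
The advice against simple string comparison can be illustrated with a small helper. This is a minimal sketch in plain Tcl; the procedure names has_flag and handle_is_readonly are hypothetical and not part of the toolkit, and the file handle passed to the second procedure is assumed to be an open molfile handle.

```tcl
# Return 1 if flag is an element of the flag list, 0 otherwise.
# A list search is used because several flags are usually set at once,
# and because some flag names share a prefix (readable, readonly).
proc has_flag {flaglist flag} {
    return [expr {[lsearch -exact $flaglist $flag] >= 0}]
}

# Hypothetical wrapper: test the statusflags attribute of an open handle.
proc handle_is_readonly {fh} {
    return [has_flag [molfile get $fh statusflags] readonly]
}
```

A call such as has_flag [molfile get $fh statusflags] scratch then works regardless of which other flags are reported alongside.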

style
A free-format string identifying a predefined attribute bundle for graphics-oriented file formats. This is currently supported for CDX, CDXML, SKC and TGF, where, for example, the acs value selects settings corresponding to the “ACS Journal” settings in ChemDraw or ISISDraw.
structurecolumn
This attribute is the numerical index of a column in table data files which are, for example, read by the structure table I/O module. The column is expected to contain a string notation for the basic structure object which is returned by a molfile read operation. This string is decoded and the content of the other columns is attached as property data to this object. Typically the content of the structure column is a SMILES, SLN or InChI string. A negative value of this attribute indicates that the presence of such structure data is not confirmed. In that case, an attempt is made to determine the structure column automatically, and the attribute is updated accordingly. However, setting it explicitly may still be required in case there are multiple columns with structure data, or there are too many unreadable or NULL row entries to allow automatic determination.
subformat
An enumerated value which encodes the subtype of the main file format. The most common values are mol2d, mol3d and mol0d, which indicate structure records with 2D, 3D or no coordinates. The type reaction can be encountered for RDF and CTX files with reaction data, since these can also be structure files in other cases. The attribute is automatically set when a file is read. For some formats an explicit specification of the attribute controls the output formatting, for example for all file formats which contain an MDL ctab block, which can store either 2D, 3D, or 0D information, but not simultaneously.
substructurescreen
The name of the property which is used for bitvector screening in filtering records for substructure matching. Its default value is controlled by the global variable ::cactvs(default_substructure_screen_property) and is usually either E_SCREEN or E_QUERY_SCREEN. If a file is opened that contains information about the screen property set when the file was written (for example, CBS and BDB formats), this attribute is automatically set to the value found in the file.
superstructurescreen
The name of the property which is used for bitvector screening in filtering records for superstructure matching. Its default value is controlled by the global variable ::cactvs(default_superstructure_screen_property) and is usually either E_NO_HYDROGEN_SCREEN or E_NO_HYDROGEN_QUERY_SCREEN. If a file is opened that contains information about the screen property set when the file was written (for example, CBS and BDB formats), this attribute is automatically set to the value found in the file.
template
The name of a template file to be used for output formatting. At this time, only the RTF I/O module uses this information. It switches between de novo RTF formatting and replacing chemistry tags in the template file. If this value is set to an empty string, no template is used.
timeout
The maximum number of seconds to spend in a molfile scan command. When the time is exhausted, the scan terminates after the respective current record has been cleanly processed by all query threads, even if the end of the file has not been reached. Setting the attribute to zero, which is the default, allows an unlimited time to be spent on a query. Another function where the timeout value is used is in reading a record via an Internet connection, for example an http or ftp URL. If the timeout expires and the record has not been downloaded, an error results.
url
A read-only attribute with the URL in case the file is accessed via an Internet connection. If no such connection exists, the result is an empty string. If a URL has been set, this attribute may be indexed using the same fields as a URL property data item in order to retrieve URL components.

The allowed field names are hash, host, hostname, href, pathname, port, protocol, search, user, password, directory, file, ipaddr, lastmodified and mimetype. Note that in this context the port field name is the port over which the file is transferred via the Internet connection, which generally is not the same as the listener port for remote requests (see molfile get attribute port). Likewise, the mimetype here is the MIME type as reported by the server, not the file MIME type defined by the file format handler module. Example:

set ip [molfile get $fh url(ipaddr)]

user
This is a shortcut for the user name part of a file or virtual file addressed via a URL. For simple retrieval it is equivalent to the URL field attribute url(user). For some I/O modules, for example the interface to access MySQL tables as virtual structure files, a change of the user name has an effect and results in a re-authentication of the database and table access, which can result in different access permissions. For normal files accessed via a URL a change of the attribute is ignored after the file has been opened. Files that are not associated with a URL have an empty user name value.
uuid
An automatically generated UUID globally identifying the molfile object. This attribute is read-only, different for every molfile, and not dependent on the contents or format of the disk file this object is associated with.
valencelevel
For files which support this concept, an indicator of what kind of structure (stable, intermediate, MS ion, etc.) is stored in the file.
version
The file format version as a string. This attribute is set automatically when a file is opened for reading. If it is not set, files are generally read or written in the latest supported version. If a data file contains a known version indicator, input routines in some cases adjust to older encoding standards. The I/O modules of some file formats support the writing of old versions. Examples are the CDX and CDXML modules, which, when the file version is explicitly set to less than 8.0, do not write the InterpretChemically tag, which is not understood by older ChemDraw releases.
vline
The current virtual line count as a read-only attribute. For simple files, this is identical to the standard line count (attribute line or lc). However, for virtual files opened by means of the molfile lopen command, this attribute is the global line number in the virtual file, while line/lc refers to the line count within the current physical file. The attribute name vlc is an alias.
vrecord
The virtual record number of the next record to be read, starting with one. For simple files, this is identical to the standard record count (attribute record or rc). However, for virtual files opened by means of the molfile lopen command, this attribute is the global record number in the virtual file, while record/rc refers to the record count within the current physical file. The attribute name vrc is an alias. This attribute can be set, and changing it results in repositioning of the file pointer, and potentially even a change in the active physical file.

Since virtual files which refer to multiple physical files can only be opened for reading, this attribute behaves exactly like the standard record attribute for output files. When setting the attribute, the special values end and last can be used to position the file pointer behind the last record, or before the last record, respectively.
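
The difference between the virtual and the physical record counter can be observed with a short script. This is a sketch only: it assumes that the current directory contains several .mol files, and it uses the same attribute syntax for molfile set and molfile get as the record attribute.

```tcl
# Assumption: the current directory contains more than one .mol file.
set fh [molfile lopen [lsort [glob *.mol]]]

# Position the file pointer before the last record of the virtual file.
molfile set $fh vrecord last

# Global record number versus record number within the active physical file.
puts "vrecord: [molfile get $fh vrecord]"
puts "record:  [molfile get $fh record]"

molfile close $fh
```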

width
The maximum width of a structure or reaction depiction in points (1/72 of an inch). This is only used for graphics-oriented formats, such as CDX, SKC or EMF. If the attribute is set to a negative value, which is the default, the size is indirectly controlled by the bond length and atom coordinates. In case this attribute is set to a positive value, and the depiction would exceed the maximum width or height, it is automatically scaled down proportionately.
writeend
An enumerated value indicating what kind of record end marker should be written on output, if the file format has such a concept. Possible values are none, mol and block. The default value is block, which translates into the standard record terminator for almost all file formats. The mol type is only significant for CTX format output. The none value can be useful if a programmer wants to add custom data to the end of a record and then write the end marker explicitly, which can be done without much effort for an SD file, for example.
writeflags
A collection of boolean flags controlling output details. When queried, this attribute returns a list of the names of all set flags. Modification of this attribute supports the standard bit manipulation prefixes. The following flag names are currently recognized:

none - no flags

computeprops - attempt to compute properties in the write list if they are not yet present in the output objects.

miniheader - keep the file header as concise as possible.

multiwriter - prepare the file to handle multiple simultaneous writers. The only file format I/O module which currently supports this is BDB.

noimplicith - do not output hydrogen atoms which were added as implicit atoms.

nopropertymapping - always synthesize property descriptions, do not attempt to map them onto existing standard system definitions. The only module currently supporting this feature is the PubChem ASN.1 module.

nostereo - do not write stereo information into the file, even if present in the output structures.

nostereoperception - do not attempt to perceive stereochemistry from the available object data such as 2D coordinates and wedges, or 3D atomic coordinates, even if the file format normally requires this information.

omitct - if the inclusion of a structure connectivity table is optional, this flag can be used to suppress the output of this block.

pedantic - perform pedantic output format checking, for example by refusing to write long lines in text formats which exceed the exact format specification, or refusing to write structures with more atoms than officially supported.

rawcoordinates - do not perform any coordinate checking, scaling, and centring but write the coordinates exactly as they are currently stored.

recalcbaseprops - if the output file content is a single property (for example E_GIF for GIF or PNG files, E_EMF_IMAGE for EMF and WMF files), force recalculation of this property before output.

supergroupexpansion - if a file format can be written with either expanded or contracted superatom groups (specified as type SUP in property G_TYPE and group label in G_NAME), the default is to write them contracted. If this flag is set, the expanded form is used instead. This option affects few file formats (currently cdx and cdxml). It does not perform expansion of superatoms which are only present as a single pseudo atom in the ensemble by decoding their tag (see the ens expand command to achieve this). Rather, it expects the full set of atoms of the expanded form in the ensemble, plus one or more properly set up group objects indicating the atoms of the expanded form of a functional group or fragment which are not shown in the contracted style. If these groups are present, only the first atom in any group is shown, with the G_NAME data as atom tag, which overrides all other label information. However, the output file still contains the hidden atoms and their data. Tools like ChemDraw use this data to support interactive group expansion utilizing the original layout coordinates of the previously hidden atoms and other information.

synchronous - use synchronous writes for files which normally use buffering to increase performance, for example in the bdb format.

splitmol - split output into individual ensembles and write each molecular fragment as a separate record.

upgrade - if this flag is set, and the format of a file is not of the most current version, but there is an upgrade function available in the support library, invoke the upgrade function to change the file layout to the most current version. The bdb module is the only one which currently supports this feature.

write0d - write records without coordinates if possible.

write2d - write 2D records if possible.

write3d - write 3D records if possible.

writearo - write aromatic bonds instead of a Kekulé form if the file format supports this. SMILES files are an example where this makes sense. MDL Molfiles are a counterexample: this option can enforce the encoding of aromatic bonds of non-query structures as the aromatic query bond type, but that is technically incorrect and violates the format specification. Nevertheless, there are third-party programs which require data in this aberrant format for further processing.

writecolor - write atom and bond colouring information if this is an optional part of the file format specification.

writeenzymes - if the output data contains enzyme superatoms, include them in the output if that is an option. The SDF3000 I/O module is an example of a module recognizing this flag.

writelabels - write explicit atom labels, as defined in the attribute atomlabelproperty, if the file format supports it. This does not override the natural numbering of the written atom objects. It only applies to formats which support a parallel user-defined labelling scheme, such as CDX/CDXML.

writename - write a structure name section if this is optional information in the output. SMILES files are an example.

writekey
This attribute is only used in certain library configurations which have been configured to restrict write access to specific types of files. The key and data computed from the file name must together match the signature. Usually restricted applications have a compiled-in signature, and one or more write keys which enable write access to the same number of specific files.
writekeysignature
This attribute is used for certain library configurations which have been configured to restrict write access to specific files. This signature is required to verify the write access key.
writekeystatus
A read-only attribute which reports the access key status for a file for which a write key has been specified. It can be unchecked, verified or error.
writelist
A list of properties that should be included in the output if the file format supports this. Standard properties defining basic connectivity etc. usually do not need to be listed because they are written out by default where needed. Normally, this list contains only ensemble- or reaction-level properties, like SD data fields. Properties listed both in the write list and the drop list are not written. By default properties listed here are not computed. If they are not already present in the output objects, they are omitted. The computeprops bit in the writeflags attribute can be used to automatically initiate a computation attempt. Still, if a computation attempt fails, the output of that property data is silently omitted.
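
The interplay of the writelist attribute and the computeprops flag can be sketched as follows. The output file name results.sdf is hypothetical, and the script assumes that ens create decodes a SMILES string and that molfile write takes the file handle followed by the object handle. Setting writeflags to a single name, as done here, is assumed to replace any previously set flags.

```tcl
# Assumption: ens create decodes the SMILES string passed to it.
set eh [ens create CCO]

# Hypothetical output file, written via the MDL I/O module.
set out [molfile open results.sdf w format mdl]

# Request two ensemble-level properties as additional output data.
molfile set $out writelist [list E_NAME E_WEIGHT]

# Ask for a computation attempt on listed properties that are not yet present.
molfile set $out writeflags computeprops

molfile write $out $eh
molfile close $out
ens delete $eh
```

Without the computeprops flag, properties in the write list which are not already present on the ensemble would simply be left out of the record.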

The attribute list above is also referenced by the molfile set command. This is the reason why it contains information about the read-only status of the individual attributes. Only attributes that can be set can be addressed by the molfile set command.

For the use of the optional property parameter list argument, refer to the documentation of the ens get command.

Filters in the optional filter set must apply directly to the file object. Filters which operate on other object types are ignored.

Variants of the molfile get command are molfile new, molfile dget, molfile nget, molfile show, molfile sqldget, molfile sqlget, molfile sqlnew, and molfile sqlshow. These only apply to retrieval of file-level property data, not the attributes.

molfile getline

molfile getline filehandle ?skiprecord?

Read a text line from the file, with repositioning of the file pointer. This operation is only possible on text files which have been opened for reading. The command is not frequently used, because it tends to disrupt the normal file record parsing.

If the skiprecord boolean argument is set, the file is positioned to the beginning of the next record after the line has been retrieved.

The command returns the line read. Line termination characters are removed.
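
A minimal usage sketch, assuming that $fh is the handle of a text-format file opened for reading:

```tcl
# Assumption: $fh is a text-format molfile handle opened for reading.
# Read one raw line, then let the command skip ahead to the next record.
set line [molfile getline $fh 1]
puts "line read: $line"
```

Because the skiprecord argument is set, normal record input with molfile read can continue after the call without starting in the middle of a record.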

molfile getparam

molfile getparam filehandle property ?key? ?default?

Retrieve a named computation parameter from valid property data. If the key is not present in the parameter list, an empty string is returned. If the default argument is supplied, that value is returned in case the key is not found.

If the key parameter is omitted, a complete set of the parameters used for computation of the property value is returned in key/value format.

This command does not attempt to compute property data. If the specified property is not present, an error results.

Example:

molfile getparam $fhandle F_QUERY_GIF format

returns the actual format of the data in that property, which could be a GIF, PNG or a bitmap format.

molfile hloop

molfile hloop filehandle objvar ?maxrec? body

This command is functionally equivalent to the molfile loop command. The difference is that for the duration of the loop command hydrogen addition is enabled for the file handle. The original hydrogen addition mode of the file object is restored when the loop finishes.
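
A short sketch, assuming that $fh is a handle opened for reading and that the return value is the record count, as for molfile loop:

```tcl
# Assumption: $fh is a molfile handle opened for reading.
# Hydrogen addition is active only while the loop runs.
set n [molfile hloop $fh eh {
    puts "[ens get $eh E_NAME]: [ens get $eh E_WEIGHT]"
}]
puts "$n records processed"
```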

molfile hread

molfile hread filehandle ?datasethandle/enshandle? ?recordcount?

This command is identical to the molfile read command, except that standard hydrogen addition is enabled for the duration of the command. The original hydrogen mode is reset when the command completes.

Example:

set eh [molfile hread "myfile.mol"]

This is a simple single-record structure input with hydrogen addition, using a file name instead of a file handle. The file is automatically opened for the duration of the command and closed afterwards.

molfile list

molfile list ?filterlist?

This command returns a list of the molfile handles currently registered in the application. This list may optionally be filtered by a standard filter list.

Example:

molfile list

lists the handles of all open molfiles in the application.

molfile lock

molfile lock filehandle propertylist/objclass/all ?compute?

Lock property data of the file handle, meaning that it is no longer subject to the standard data consistency manager control. The data consistency manager deletes specific property data if anything is done to the file handle which would invalidate the information. Property data remains locked until it is explicitly unlocked.

The property data to lock can be selected by providing a list of the following identifiers:

Property names
Valid property instances on the file object are locked. If the boolean compute flag is set, an attempt is made to compute the property if it is not yet present. Otherwise, a request to lock non-existent data is silently ignored. It is not possible to lock individual property fields.
all
All valid file properties are locked. The compute flag is ignored.
molfile
This is an object class identifier. All property data which is controlled by the file major object and attached to the specified object class is locked. Since files do not incorporate minor objects, this identifier is equivalent to all.

The lock can be released by a molfile unlock command.

This command is a generic property data manipulation command which is implemented for all major objects in the same fashion and is not related to disk file locking. Disk file locks can be set or reset by modifying the molfile object attribute lock. This is explained in more detail in the paragraph on the molfile get command.

The return value is the molfile handle.
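
A lock is typically placed around operations which would otherwise cause the property data to be discarded. The sketch below assumes that $fh is an open handle and that molfile unlock accepts the same property list argument as molfile lock.

```tcl
# Assumption: $fh is an open molfile handle.
# Compute the property if it is missing (compute flag set), then lock it.
molfile lock $fh F_AVERAGE_ATOM_COUNT 1

# ... operations on $fh which would normally invalidate the property ...

# Return the data to the control of the consistency manager.
molfile unlock $fh F_AVERAGE_ATOM_COUNT
```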

molfile loop

molfile loop filehandle objvar ?maxrec? body

Execute a loop over the file. Objects are read from the file from the current file position onwards. The type of object read (usually ensemble or reaction, but in principle also a table or dataset object) depends on the read scope of the file. The handle of every object input from a file record is assigned to the specified Tcl object variable. Next, the Tcl script code in the body argument is executed. The body code typically uses the value of the variable to perform some operations with the currently read object. After the body code has been executed, the object which was just read is deleted, and the cycle is repeated, either until EOF has been reached on the file (the default), or the maximum number of records specified by the optional parameter has been reached, whichever comes first. In either case, no error is generated when the end of file has been reached. Setting the maximum record count parameter to an empty string, or to a negative value, results in the default processing style running until the end of the file.

Within the body, the standard Tcl break and continue commands work as expected. If the loop code generates an error, the loop is terminated and the error reported. Programs should not expect that the same object handle value stored in the variable is reused in each iteration.

Since the input objects are automatically deleted after they have been processed, it is not required to delete them in the loop code. Deletion requests on the loop object executed within the loop are ignored. Any other operation on the structure object is allowed. The loop code may perform repositioning operations on the input file, but not close it.

The return value is the number of processed records.

Example:

set th [table create]

table addcol $th E_NAME

table addcol $th E_WEIGHT

molfile loop $myfile eh {

	table addrow $th #auto end [list [ens get $eh E_NAME] [ens get $eh E_WEIGHT]]

}

This sample loop successively reads all records from the file and stores the ensemble handles in variable eh. In the loop body, the handle is used to extract name and molecular weight information from the structure and store it in a table object.
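
The optional record limit and the return value can be combined as in the following sketch. It assumes that $myfile is a handle opened for reading; the limit of 100 records and the weight threshold are arbitrary illustration values. Since the loop object is deleted after each iteration, the script collects property values rather than object handles.

```tcl
# Assumption: $myfile is a molfile handle opened for reading.
set heavy {}
set n [molfile loop $myfile eh 100 {
    set w [ens get $eh E_WEIGHT]
    # Skip records for which no weight is available.
    if {$w eq ""} continue
    if {$w > 500} {
        # Store the name, not the handle: the ensemble is deleted
        # automatically once the body has been executed.
        lappend heavy [ens get $eh E_NAME]
    }
}]
puts "$n records processed, [llength $heavy] above the weight threshold"
```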

molfile lopen

molfile lopen filelist ?mode? ?attribute value?...

Open a list of files as a virtual file. The files identified by the file list items are implicitly concatenated in the list order. In addition to normal files, the standard set of special input types such as URLs, pipes, Tcl file handles or standard channels may be used. This command returns a single file handle, regardless of the number of input files passed as parameter.

A file list can only be opened for read operations on input objects. Writing, appending, updating or string input are not supported.

Most input file operations can be performed on virtual files. One important exception is currently file scanning with query expressions. This only works for lists of standard sequential files, not files which contain optimized query layouts, such as the native Cactvs CBS and BDB file formats. These can only be used as a single file for molfile scan commands. However, simple structure input is possible across file boundaries even with these formats.

The rest of the options are processed in the same way as the standard molfile open command.

Example:

set fhandle [molfile lopen [lsort [glob *.mol]]]

molfile max

molfile max filehandle property ?filterset?

Scan the file for the maximum value of the specified property from the current read position to the end of the file. If no error occurs, the file is at end-of-file after the end of the command.

If a filter set is provided, it is applied to the objects read from the file during the scan, not the molfile object proper. Objects which do not pass the filter are ignored.

The property may correspond either to a data column in the file, or to a computable property on the structure or reaction objects read during the scan. Read objects are transient and automatically discarded. The property argument may contain a field specification, and in that case, only the field value is compared.

The maximum value determination uses the standard property comparison function associated with its data type. For properties which are implicitly defined during file I/O, an explicit property definition with a correct data type may be beneficial. For example, when testing the values of an SD data field, by default the data is read as an implicitly created string property. If the field content is actually an integer, the comparison as a string value does not yield the same results as when the data is compared as an integer. For file formats which encode a proper data type of their contents this is not necessary.

The return value is the maximum property or property field value found, or an empty string if no input was processed.

molfile metadata

molfile metadata filehandle property field ?value?

Obtain property metadata information, or set it. The handling of property metadata is explained in more detail in its own introductory section. The related commands molfile setparam and molfile getparam can be used for convenient manipulation of specific keys in the computation parameter field. Metadata can only be read from or set on valid property data.

molfile min

molfile min filehandle property ?filterset?

Scan the file for the minimum value of the specified property from the current read position to the end of the file. If no error occurs, the file is at end-of-file after the end of the command.

If a filter set is provided, it is applied to the objects read from the file during the scan, not the molfile object proper. Objects which do not pass the filter are ignored.

The minimum value determination uses the standard property comparison function associated with its data type. For properties which are implicitly defined during file I/O, an explicit property definition with a correct data type may be beneficial. For example, when testing the values of an SD data field, by default the data is read as an implicitly created string property. If the field content is actually an integer, the comparison as a string value does not yield the same results as when the data is compared as an integer. For file formats which encode a proper data type of their contents this is not necessary.

The return value is the minimum property or property field value found, or an empty string if no input was processed.
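
Because both scans leave the file at end-of-file, a range determination needs to reposition the file between the two commands. The following is a sketch under the assumption that $fh is a handle opened for reading on a file which can be repositioned via the record attribute.

```tcl
# Assumption: $fh is a molfile handle opened for reading on a rewindable file.
set start [molfile get $fh record]
set wmax [molfile max $fh E_WEIGHT]

# The scan ran to the end of the file, so go back before the second scan.
molfile set $fh record $start
set wmin [molfile min $fh E_WEIGHT]

puts "weight range: $wmin to $wmax"
```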

molfile mutex

molfile mutex filehandle mode

Manipulate the object mutex. During the execution of a script command, the mutex of the major object(s) associated with the command is automatically locked and unlocked, so that the operation of the command is thread-safe. This applies to builds that support multi-threading, either by allowing multiple parallel script interpreters in separate threads or by supporting helper threads for the acceleration of command execution or background information processing. This command locks major objects for a period of time that exceeds a single command. A lock on the object can only be released from the same interpreter thread that set the lock. Any other threaded interpreter or auxiliary thread which accesses a locked object blocks until a mutex release command has been executed. This command supports the following modes:

lock
Increase the recursive mutex lock count on the object. The command returns the current lock count after the command, excluding the transient single-command lock.
reset
Release all persistent locks on the object, if any exist.
test
Return the current persistent lock count on the object. This excludes the transient per-command lock.
unlock
Decrease the recursive lock count on the object. The command returns the current lock count after the command, excluding the transient single-command lock. Unlocking an object which has not been persistently locked results in an error.

There is no trylock command variant because the command already needs to be able to acquire a transient object mutex lock for its execution.
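
A persistent lock is useful when a sequence of commands on the same handle must not be interleaved with operations from other threads. The sketch below assumes that $fh is an open handle in a multi-threaded build; the standard Tcl catch command is used so that the lock is released even if one of the protected commands fails.

```tcl
# Assumption: $fh is an open molfile handle shared between threads.
molfile mutex $fh lock

# Protected section: reposition and re-read without interference.
set failed [catch {
    molfile backspace $fh
    set eh [molfile read $fh]
} msg]

# Always release the persistent lock before reporting any error.
molfile mutex $fh unlock
if {$failed} {
    error $msg
}
```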

molfile need

molfile need filehandle propertylist ?mode?

Standard command for the computation of property data, without immediate retrieval of results. This command is explained in more detail in the section about retrieving property data.

The return value is the file handle.

Example:

molfile need $fhandle F_AVERAGE_ATOM_COUNT

molfile new

molfile new filehandle propertylist ?filterset? ?parameterlist?

Standard data manipulation command for reading object data. It is explained in more detail in the section about retrieving property data.

For examples, see the molfile get command. The difference between molfile get and molfile new is that the latter forces the re-computation of the property data, regardless of whether it is present and valid or not.

molfile nget

molfile nget filehandle propertylist ?filterset? ?parameterlist?

Standard data manipulation command for reading object data. It is explained in more detail in the section about retrieving property data.

For examples, see the molfile get command. The difference between molfile get and molfile nget is that the latter always returns numeric data, even if symbolic names for the values are available.

molfile open

molfile open filename ?mode? ?attribute value?...

molfile open filename ?mode? ?attributedict?

This command opens a structure file or other input source for input or output. The filename argument may be any of:

A disk file

This is the most common case. File names may be absolute or relative. On the Windows platform, the path naming follows the Tcl convention, with backslashes replaced by forward slashes, and optional drive letters, in the same way as the standard Tcl open command. Tilde substitution is also supported and built into the command. In case a file name could possibly collide with a reserved name, the file name can be prefixed with ./ in order to force interpretation as a file name. File name expansion can be conveniently performed by means of the standard Tcl glob command. File names must currently be spelled in the 8-bit ISO8859-1 character set. Unicode file names are not yet supported. On Unix platforms, named pipes and sockets may also be opened with this command.

Examples:

molfile open ./stdout r

molfile open ~theuser/data/newleads.sdf

molfile open C:/temp/calicheaamycin.pdb w

A standard channel

The file names stdout, stderr and stdin are reserved and connect the file handle to a standard I/O channel. stdout and stderr can only be opened for output, and stdin can only be read from. The character ’-’ (minus) is an alternative name for standard input.

Example:

molfile open stdout w format mdl

molfile open ./stdout

The first line opens an MDL file for output on standard output. The second sample line opens the file in the current directory which is named “stdout” for input. By prefixing file names with directory information, any file with a reserved name can be opened as a standard file.

A scratch file

The name scratch is reserved as the name of a generic scratch file. The file is initially opened for writing, but may be switched to input later by a molfile toggle command. The magic filename is translated into the name of a platform-specific temporary file. Every invocation of this command variant generates a new scratch file, with a different name. The true file name can be obtained with an attribute query:

set fh [molfile open scratch]

set name [molfile get $fh name]

Scratch files are automatically deleted when they are closed, or when the program exits.

A pipe

If a file name starts with a vertical bar character “|”, a pipe is opened from (in read mode) or to (write mode) the commands listed after the bar.

Example:

molfile open "|gzip >thefile.sdf.gz" w format mdl

When the file is closed, the pipe and all programs connected to it are automatically shut down. Pipes cannot be rewound, or switched from input to output and vice versa.

A URL

The Cactvs toolkit supports reading from various types of URLs. Currently, the schemes ftp, http, file and gopher are supported. file URLs are just another notation for normal disk files, as described above. Among the other URL schemes, only ftp and http connections may be opened for writing. The support for ftp URLs includes username and password components. If the server side supports it, passive ftp is the preferred mode. Http connections opened for writing use the HTTP PUT command, which is often not activated in standard Web server set-ups and may therefore be of limited practical usefulness. URL connections can be rewound and backspaced, but this is costly because the existing connection has to be closed, and the data from the beginning of the file up to the desired position has to be transferred again and discarded.

Examples:

set fh [molfile open http://www.yourcompany.com/repository/jcamp/ir1.jcp]

molfile open ftp://yourid:yourpasswd@ftp.yourcompany.com/upload/ideas.sdf

A directory

If the target is a directory, all files in the directory are scanned. Those files which are identified as structure data files by any of the built-in or currently loaded I/O module extensions are concatenated to a virtual file which comprises all individual files. The order in which the files are concatenated is largely unpredictable, because it is defined by the order of the file name entries in the directory, and not by any alphabetic sort criterion. The files may be of different formats, and may be any mixture of single-record and multi-record files. Subdirectories of the opened directory are not entered by default, but this may be activated by appending a ’d’ character to the open mode. Directories may only be opened for reading.

Example:

set fh [molfile open .]

set fh [molfile open $mydir rd]

The second example opens not only perceived structure files in the source directory, but also in all subdirectories thereof.

A string

The Cactvs toolkit can read most file formats directly from a string. There is no need to write structure data which was obtained as a string image to a temporary file in order to decode it. Data strings are opened as a structure file with mode ’s’. Only input is possible, but navigation within the string with molfile rewind etc. works as expected. The complementary molfile string command can be used to generate a string image of a file record.

Example:

set fh [molfile open $thedatablob s]

set eh1 [molfile read $fh]

set eh2 [molfile read $fh]

molfile close $fh

A Tcl file or socket handle

Any file name which begins with file or sock, and where the rest of the file name is a sequence of digits, is interpreted as a reference to a Tcl file handle.

Example:

set tcl_fh [open thefile.txt w]

set cactvs_fh [molfile open $tcl_fh w]

A Tcl handle can only be accessed by this command in a mode which is compatible with the mode it was opened with, i.e. it is not possible to write to a file via a Tcl handle if it was opened for reading. If a structure file coupled to a Tcl handle is closed with a molfile close command, the Tcl handle remains valid, and may be used freely once the association to the structure file I/O object is broken. Closing the Tcl handle while the piggybacked structure file handle is being used is illegal. No input, output or positioning should be performed on the Tcl handle with standard Tcl commands while it is being referred to by a molfile object.

This functionality is not available on Windows, because on this platform Tcl internally uses Windows handles for I/O, while the Cactvs toolkit builds on standard Posix C library FILE pointers.
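The handle life cycle described above can be sketched as follows for non-Windows platforms. This is only an illustration: the file name thefile.sdf is hypothetical, and the point of the sketch is the order of the two close operations.

```tcl
# Hypothetical file name; any structure file readable by the toolkit will do.
set tcl_fh [open thefile.sdf r]

# Piggyback a structure file handle on the Tcl handle. The mode must be
# compatible with the mode of the Tcl handle (both are read-only here).
set cactvs_fh [molfile open $tcl_fh r]

# While the association exists, all I/O goes through the molfile handle,
# and no standard Tcl I/O commands are applied to $tcl_fh.
set eh [molfile read $cactvs_fh]

# Break the association first. The Tcl handle remains valid afterwards ...
molfile close $cactvs_fh

# ... and is closed separately with the standard Tcl command.
close $tcl_fh
```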

A virtual file

Some I/O modules implement access to a variety of information sources as a virtual file, which neither has a presence on the local disk, nor is one of the standard magic file names or access methods. Such virtual file names are by convention written in angle brackets.

Example:

set fh [molfile open <pubchem>]

This command loads the PubChem virtual file access module, and returns a handle which may be used in a similar fashion as, for example, a handle to a huge local SD file. Depending on the I/O module, various operations on the handle may be optimized to be performed remotely. For example, the PubChem module offloads as many query operations of molfile scan commands as possible to the NCBI computers and downloads result structures only if they are needed as results, or query sub-expressions were specified which cannot be processed by the NCBI system.

The first optional parameter is the file access mode. It may be one of:

r
Open for reading, but with the option of later changing the mode to writing or appending. This is the default.
rt
As above, but automatically start a thread which immediately starts gathering file status information, such as the record count and record positions. This mode can be useful when operations such as reading data for display are to begin immediately, but overall record count information ultimately needs to be displayed, which can take a while to collect for larger files. The status thread is only started for rewindable files, and has no effect on files which directly provide record index and total record count information. Operations which would duplicate the efforts of the status thread, such as molfile count, are automatically blocked until the thread has completed, and then directly use its results. Operations which change the nature of the access to the file, or its record contents or positions, silently terminate the status thread.
ro or rot
Open for read-only access. If a file is opened in this mode, it is not possible to switch to write access later via a molfile toggle command. If the file permissions do not allow write access, the standard ‘r’ mode automatically falls back to this variant. Mode ‘rot’ is also possible and additionally starts a file status thread (see mode ‘rt’).
w
Open for writing. If the file exists, it is overwritten. If not overridden by an explicit format specification, the file format is inferred from the suffix of the file name, if possible.
a
Open for appending. If the file exists, new data written to it is appended. If not overridden by an explicit format specification, the file format is inferred from the suffix of the file name, if possible. Not all file formats support appending.
u
Open an existing file for updating, i.e. the replacement of specific records. Not all file formats support this mode. It is generally useful for database-style formats such as Cactvs BDB and, to a limited degree, CBS. It can also be used for simple record sequence files like SD, though in this case it can be inefficient because a lot of data copying may be required to adjust the file layout. For single-record file formats, this mode is not useful. For multi-record files which are not simple record sequences, and for which the I/O module does not provide a special function, this mode is not supported.
s
Open a string image of a file. If this mode is used, the file name is interpreted as an in-memory image of a structure file in any of the formats the toolkit understands, and not as a file name, URL, or any of the other types of input objects. Binary file formats may be used with this mode.
p
Open in pipe reader mode. The input is expected to be a pipe or socket on which new data is posted sporadically. If an attempt is made to read from the file, a check is made whether any data is present. If no data is waiting, the input command returns immediately without blocking. At a later time, new data may be present and the input may succeed. If as little as a single byte of data is present on a pipe input channel, the read routine blocks until the record for which input has begun has been read completely.
R, Ro, Rot
Open the file for reading and infer the format of the file from the suffix alone, without actually attempting to read the initial section of the file contents, which is the default method to determine its format. This mode can be useful in case the data contains text with embedded structure data, where the plain text is read by scripted commands and the occasional embedded structure or reaction record is to be extracted by means of molfile read commands. For such files, an automatic format detection would fail. The ‘o’ and ‘t’ flags may also be appended, and have the same meaning as in the standard ‘r’ mode.

For some files and file formats, two more mode characters have meaning if appended to the primary mode. They are silently ignored if the file argument or file format does not support them.

d
Recursive opening. This is for example useful when opening a directory as a virtual file for input. If this flag is set, all files recursively found under the specified directory form the virtual file, not just the files directly located in the specified directory.
f
Fast mode. The file is opened for maximum performance, taking chances with respect to data integrity in case of program or computer crashes, etc. One file format where this flag is supported is the Cactvs Data Archive (CDA) format.
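The following sketch illustrates how some of the access modes and modifier flags above might be used. The file and directory names are hypothetical, and the single-argument form of molfile count is an assumption, since that command is only mentioned in passing here.

```tcl
# Hypothetical file and directory names.
set bigfile /data/library.sdf
set datadir /data/incoming

# 'rt': begin reading at once while a background thread collects
# record count and record position information.
set fh [molfile open $bigfile rt]
set first [molfile read $fh]

# According to the mode description, this call waits for the status
# thread and reuses its result. (Argument form of molfile count assumed.)
set nrecords [molfile count $fh]
molfile close $fh

# 'ro': read-only access; a later switch to write access is ruled out.
set fh [molfile open $bigfile ro]
molfile close $fh

# 'rd': read mode with the recursive flag, so structure files in all
# subdirectories become part of the virtual file.
set fh [molfile open $datadir rd]
molfile close $fh
```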

The remaining parameters of the molfile open command are optional keyword/value pairs, or alternatively a single dictionary with the same function. The processing of these parameters is exactly the same as in the molfile set command.

Example:

set fhandle1 [molfile open thefile.pdb]

molfile set $fhandle1 hydrogens add nitrostyle ionic

set fhandle2 [molfile open thefile.pdb r hydrogens add nitrostyle ionic]

The first two lines and the final line perform exactly the same task: Open an input file, and set up input flags so that a complete set of hydrogens is added, and nitro groups and similar groups are converted to an ionic (as opposed to pentavalent) representation.

When a file is opened for reading, its format is automatically determined. Do not use the format attribute except under very special circumstances.

The command returns the file handle of the opened input file. This is the handle which is required by most other molfile commands which refer to an opened file.

Depending on the encoding of the opened file, the actual access mode to the file may be different than expected. In case a disk file is compressed with gzip or bzip2, the file is opened via a pipe to the responsible decompressor program. Likewise, a UCS-2 encoded file is opened via a pipe to the iconv program, which converts the contents to the UTF-8 encoding. Files which are opened indirectly via such helper pipes have different access characteristics than directly addressed files. For example, backspacing is expensive, because the pipe has to be closed and re-opened, and the data stream skipped to the desired position. This takes much longer than simply repositioning a file pointer.

molfile properties

molfile properties filehandle ?pattern? ?noempty?

Generate a list of the names of all properties attached to the molfile object. Optionally, the list may be filtered by a string match pattern.

In most cases, this list is empty. Only structure file properties, such as F_COMMENT, etc., are listed, but no object attributes, such as readflags, nitrostyle, etc. Few file formats support the concept of storing file-level properties, and therefore an empty property set is usually reported. Since file objects do not contain minor objects, and currently cannot be a member of other major objects such as datasets or reactions, no properties belonging to other classes except file objects are ever listed.

If the noempty flag is set, only properties where at least one data element is not the property default value are output. By default, the filter pattern is an empty string, and the noempty flag is not set.

The property list may be modified by input operations. In some cases, the defined file-level properties may vary with the record position, or may become available only after the first input operation, not immediately after opening the file.

The command may be abbreviated to props instead of the full name properties.

Example:

set plist [molfile properties $fhandle]
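As a sketch of the filtering and timing behavior described above, the following lines list file-level properties only after the first record has been read. The file name is hypothetical, and it is assumed that the property values can be retrieved by name with molfile get.

```tcl
# Hypothetical input file.
set fh [molfile open thefile.sdf]

# Some file-level properties only become available after the first
# input operation, so read one record before querying the list.
set eh [molfile read $fh]

# Restrict the list with a string match pattern; props is the
# abbreviated command name.
foreach prop [molfile props $fh F_*] {
    puts "$prop: [molfile get $fh $prop]"
}

molfile close $fh
```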

molfile purge

molfile purge filehandle propertylist/molfile/all ?emptyonly?

Delete property data from the molfile object. Only molfile property data may be deleted with this command (these usually have a F_ prefix). Molfile attributes are not deletable.

If the optional emptyonly flag is set, only file property values which are identical to the default of the property are deleted. By default, or when this flag is 0, properties are deleted regardless of their values. In case a listed property is not present, or is not a file property, the request is silently ignored, but using property names which cannot be resolved leads to an error. If the object class name molfile, or the keyword all, is used instead of a property name, all file-level property data is deleted from the molfile object.

Example:

molfile purge $fhandle F_COMMENT

molfile purge $fhandle all

The first command deletes a specific property, the second command deletes all file property data associated with the handle.
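The effect of the emptyonly flag can be sketched as follows. The handle $fhandle is taken from the example above; that the flag is passed as a plain boolean value is inferred from the statement that a value of 0 disables it.

```tcl
# Delete F_COMMENT only if it still holds the property default value.
molfile purge $fhandle F_COMMENT 1

# An explicit 0 deletes the property unconditionally, which is the
# same as omitting the flag. If the property is no longer present,
# the request is silently ignored.
molfile purge $fhandle F_COMMENT 0
```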

molfile putline

molfile putline filehandle ?lines?

Write user-specified string lines to a file, bypassing the normal record writing mechanism. This operation is only supported on files which are opened for output and contain text data. The lines should not contain end-of-line characters. These are automatically supplied according to the file object configuration set in the eolchars attribute.

The command returns the file handle.

molfile read

molfile read fhandle ?datasethandle/enshandle/#auto/new ?flags?? ?recordcount?

molfile read fhandle ?datasethandle/enshandle/#auto/new ?flags?? ?attributedict?

This important command reads chemistry objects from a structure file. The type of objects returned depends on the read scope of the file. They can be ensembles, reactions, or datasets. Read scope mol returns single-molecule ensembles, but (with I/O modules supporting this feature) reads only individual molecules into the output ensemble, splitting a multi-molecule file data ensemble if necessary. The return value of the command is a list of all objects which were generated, except when the #auto dataset creation method was used, or an unlimited number of objects was read into a dataset. In that case, the recipient dataset handle is returned.

By default, the returned objects are not a member of any dataset. If a dataset handle is passed as fourth parameter, the returned objects are appended to that dataset if possible. The special value #auto or new creates a new dataset as container. This is equivalent to using the nested statement [dataset create] as dataset handle argument. If the fourth parameter is an ensemble handle, and the object read from the file is also an ensemble, the read data is stored in the shell of the old ensemble, after all old ensemble data has been deleted. Its object handle remains unchanged, as is its dataset membership. The reuse of reaction handles is currently not supported. This parameter can be skipped by specifying an empty string.

In addition to passing an empty string, or a simple dataset or ensemble handle, as the fourth command argument, a list consisting of a handle and a modifier flag set can be specified. The only flag value which is currently recognized is checkroom . If that flag is set, and the input objects are to become members of a dataset with enabled maximum size or insertion mode control, a test is made whether the dataset has sufficient room to allow the insertion of the new object(s), or whether a suitable alternative action is configured to handle the read object in a different fashion, such as discarding it. If that is not the case, the command returns immediately, without performing any input, and returns an empty string. If the test succeeds, the input operation is atomic, since the dataset is locked for the full duration of the command, so that no other threads can manipulate its status between the initial check and the file input result object transfer.

The final optional parameter is either a single argument specifying the number of objects which should be read, or a dictionary with key/value attributes. The default is equivalent to passing a simple numerical value of one, in the first, simple format. In order to read until the end of the file, the special value all may be used instead of a numerical count. With an all parameter value, the input operation is finished when no more data is available on the file. Until this condition is met, an unlimited number of records is read. No error is generated when EOF is met. There are also no EOF errors reported if a numerical record count of more than one was specified, and at least one object could be successfully read. Another magical value of the simple argument form is batch , which is substituted by the batch record set size configured on the molfile handle (see molfile get/set ).

In the second form of the final parameter, an attribute dictionary is persistently applied equivalent to a molfile set command before the input commences. Standard file handle attributes and an input limit may be both set in parallel by using the special attribute name limit as part of the dictionary. It is only recognized in this context, but not with molfile set or molfile string . The allowed values of the limit attribute are the same as in the simple command variant.

The command raises an error if input could not be completed, regardless of whether the reason is a file syntax error or a simple EOF (but see above for exceptions). If an input error occurs, the EOF attribute of the file handle should therefore be checked in order to distinguish between these two conditions. In case the input file was opened for pipe reading (mode ’p’), or is connected to a Tcl channel, an EOF report may only indicate that no data is currently available on the pipe or Tcl channel, but data could still arrive at a future point in time.

Examples:

if {[catch {molfile read $fhandle} ehandle]} {

	if {![molfile get $fhandle eof]} {

		puts "Error: $ehandle"

	}

} else {

	puts "Read [ens get $ehandle E_NAME]"

}

The prototypical snippet above shows the input of the next ensemble record from a previously opened file, with proper error checking.

molfile read "acd.sdf" [dataset create] all

This sample command reads a complete input file into a newly created dataset in memory. It uses the single-operation feature of the molfile command to open and close the file acd.sdf automatically for the duration of the command. Reading huge datasets is of course not necessarily a good idea without large amounts of RAM. On typical current workstations, 10,000 or 20,000 compounds are no problem, but beyond that there is a real risk of running out of memory.
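When memory is a concern, records can be processed one at a time instead. The following is a minimal sketch of such a loop, using the EOF check described above to tell the end of the file from a genuine input error. The file name is hypothetical, and ens delete is assumed to be the command for discarding an ensemble that is no longer needed.

```tcl
# Hypothetical input file.
set fh [molfile open thefile.sdf r]
set nread 0
while {1} {
    if {[catch {molfile read $fh} eh]} {
        # A failed read is either a plain EOF or a genuine input error.
        if {[molfile get $fh eof]} break
        molfile close $fh
        error "input failed after $nread records: $eh"
    }
    incr nread
    puts "Record $nread: [ens get $eh E_NAME]"
    # Assumed clean-up command, so that ensembles do not pile up in memory.
    ens delete $eh
}
molfile close $fh

# Dictionary form of the final parameter: file handle attributes and
# the input limit are set together, and a new dataset is the container.
set fh [molfile open thefile.sdf r]
set dh [molfile read $fh #auto {limit all hydrogens add}]
molfile close $fh
```

In the second variant the return value is the handle of the newly created dataset, not a list of ensemble handles.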

molfile reorganize

molfile reorganize filehandle

This command only has an effect for file formats for which the I/O module provides a reorganizer function. This function typically optimizes and compacts the file for input and queries, and should usually be called after all records have been written. Writing to a reorganized file is typically at least initially slower than writing to a file which has not been processed.

The function returns a boolean value indicating whether any reorganization has actually been performed. In case the command is applied to a file which is not writable, an error results.

molfile rewind

molfile rewind filehandle

Reposition the file before the first record, and clear all error status information. If the file is already at the first record, and no error condition is set, this command does nothing.

Not all file channels can be rewound, and for some which can, it can be an expensive operation. For example, standard input or pipe input channels are not rewindable, and an FTP URL channel has to be closed and re-opened.

Rewinding a virtual file set positions the file pointer before the first record of the first file in the set.

Standard text-stream style output files can be rewound, too. This effectively truncates them. Files which are opened for appending are truncated to their original length.

Rewinding is not necessary in all cases. The molfile scan command automatically rewinds the input file if it is at EOF at the beginning of a scan.

The return value of the command is the file handle.

molfile rewrite

molfile rewrite filehandle recordlist propertylist ?values? ?filter? ?callback?

This command updates specific property fields in a file, without rewriting the complete record. This is only supported if the file was opened for writing or updating, and the I/O module for the format of the file supports this operation by a special function. This typically limits the applicability of this command to database-style file formats such as Cactvs CBS and BDB.

The record list parameter is either a list of numerical records, with one as the first file record, or one of the special values all (all file records are updated), current, next, previous (the indicated record is updated), or a table handle, optionally followed by a table column name. In the last case, the table is expected to contain the data for rewriting, and in case a column name is specified, that column should contain the applicable record numbers. If the table version is selected without a record column, the file records from one to the number of table rows are updated. None of the special values can be combined with the simple numerical record sequence style. If the parameter is a numerical record sequence, the order of the records is significant.

The values list can be empty, or it must match the length of the property list. In the latter case, every specified value must be a valid value for the property in the same list index position. Note that while it is possible to manipulate multiple records in one step with this command, it is not possible to assign a different set of values to the data fields for each processed record. For this operation, multiple rewrite statements must be issued. If the value list is absent, or empty, the values are recomputed from the structure or reaction object that is temporarily read from the file record for this purpose. This is a useful feature in case the computation function for a computable property has changed. In case the record list references a table instead of a numerical record list or a magic record name, the value list is ignored. Instead, the table is expected to contain table columns which match the properties in the list, but not necessarily in the same column order, or containing exclusively the properties in the list.

The optional filter argument is a query expression in the same style as used in the molfile scan command. If a filter expression is supplied, only records which match the expression are changed. Non-matching records are skipped. In case no filter is used, all records selected by the record list are processed.

After processing, the file pointer is on the last processed record.

If the name of a Tcl callback procedure is specified, it is called after each processed record. The Tcl procedure arguments depend on the processing mode. In case of table-based processing, the arguments are the table handle, the current table row, the file handle and the current file record.

This command is not fully implemented yet. CBS files currently only support re-computation of property data from object data, not updates from explicit value lists. Neither BDB nor CBS I/O modules currently call the Tcl callback procedure except in table-based processing mode.

The command returns the number of updated records.

Example:

molfile rewrite $fh current E_NAME "Black tar, grade A"

molfile rewrite $fh all E_XLOGP2

molfile rewrite $fh [list $mytable records] [list E_IDENT E_REGID]

The first command changes the property field E_NAME in the current record to the specified value. The second variant recomputes all E_XLOGP2 values in the file from the stored structure data - for example after updating the computation function of that property, or having added it as a new field to the file. The final version changes the fields E_IDENT and E_REGID for the records stored in table column records, replacing them with the data found in the table columns of the same name.

A complication in the use of this command is that database-type files like the Cactvs CBS and BDB formats store property definitions themselves. After opening the file, a newly set up property definition, which may for example possess an upgraded computation function, can have been replaced by the old definition from the file. In that case, the new property definition must be explicitly re-read to gain the upper hand again, for example with a prop read command.

molfile scan

molfile scan filehandle|remotehandle expression ?mode? ?parameters?

Execute a query on the file and return results. The structure file is scanned, by default starting from its current read position, and results are gathered until either the end of the file has been reached (or the scan wrapped once around the file, if the wraparound file flag has been set) or a scan condition caused the stopping of the scan procedure. If the scan finished without reaching the end of the file, it can be resumed with another molfile scan command at a later time.

The file scan works in principle on any file, but with very different efficiency. Files managed by file format I/O modules which support direct field access, and can supply structure and reaction data in binary form, can be queried much (often a factor of 1000 or more) faster than, for example, a plain SD file. In the latter format, every record needs to be fully parsed, the structure compared against the query expression, and most of the structure data is discarded immediately after the record has been checked. Files in formats which support various types of indexing for numerical values, bit-screen filtering for super- and substructure searches, hash codes for full-structure matching and other means of acceleration can be effectively queried with typical expressions in a few seconds, even while containing millions of compounds.

The two basic built-in Cactvs formats for effective searching are CBS (static files, good performance on CDROM and other linear media) and BDB (efficiently updateable, and with more advanced indexing than CBS). In contrast, the systematic reading of a million-record SD file takes a few hours. Nevertheless, the feature of universal query support is very useful for working with typical data sets of a few thousand records. These do not need to be converted from their original formats to a query file for a quick exploratory data scan.

Query expression syntax classes

The toolkit currently supports two syntactically unrelated classes of query expressions: Native Cactvs expressions, which are described below, and Bruns/Watson structure queries as described in J. Med. Chem. 2012, 55, 9763-9772. The exact syntax supported is that of the internal Lilly suite in October 2014, which is significantly extended from the description in the paper, but also discards some outdated syntactic elements briefly mentioned in the paper.

Example:

set demerits [molfile scan $fh [read_file 9_aminoacridine.qry] {record demerit}]

This command returns a nested list of the records which match the query, together with the merit/demerit score computed by that rule. Note that records which do not match the expression are omitted; they are not reported with a zero demerit in the result. Internally, Bruns/Watson queries are mapped to the standard toolkit query expression data structure. Many of the queries in the standard Lilly rule set can be expressed equivalently as a native query. However, at this time there are a few specific Lilly query features which cannot be expressed in native toolkit syntax.

If a query expression cannot be parsed as Bruns/Watson code, an attempt is made to interpret it as a native Cactvs expression, and all error messages relate to that interpretation attempt. The following paragraphs all apply exclusively to the native toolkit expression style.

Branch node expression classes

The expression argument is a tree of individual query statements. It is formatted as a nested Tcl list. The allowed depth of branching as well as the allowed number of leaf nodes is unlimited. The following branch operations are supported in this tree:

and
One to any number of child branches. The branch query only succeeds if all branches match.
or
One to any number of child branches. The branch query succeeds if any of the branches match. As soon as the first branch is a match, the other child branches are no longer executed. This is usually desired because it accelerates the processing of the query. However, in some circumstances, for example when computing similarity scores or coloring matched atoms or bonds, this is not the desired behavior. The orcontinue operator has the same query branch logic, but all branches are visited.
orcontinue
See above. This is an or operator variant where all child branches are always executed. It can also be written as orcont.
xor
One to any number of child branches. The branch query succeeds if an odd number of the child branches match. eor is an alias name of the operator.
not
Exactly one child branch. This operator inverts the match/nomatch status of the child branch, and lets all other status conditions reported by the child branch pass unchanged.
bind objclass
One or an odd number of child branches. This is a rather unique operator. Its effect is to force the use of the same minor object in all controlled branches. For example, if the child branches were to contain two molecule property checks connected by an and operator, by default the molecules of database structure ensembles which pass these conditions are independent and can be different. If a bind node is located upstream, those two molecules must be the same. Only when the first of a series of conditions is checked, all molecules are iterated as potential matches. If the query continues with a match of the first condition, the molecule is no longer unbound, and only the molecule already matched with the first condition is tested with the other conditions. Bind nodes can be used with any ensemble minor object class on structure queries (such as atom , mol , ring ) or ensembles ( ens ) on reaction queries. The objclass argument part must be set to the desired class name. Bind nodes only affect controlled nodes which are property queries with properties belonging to the bound object class.

If more than one branch is specified, the query expression branches (first, third, etc. argument) are linked by an identifier which determines how these branches interact under the umbrella of the bind node. The link argument it itself a list. Its first element is the link type identifier (currently one of independent , singlebond or doublebond ). Except in case of the first mode, the next element is the index (starting with 0) of the query branch in the bind node. It must refer to an existing branch index, i.e. forward declarations are not possible. For the determination of the branch index only the query branches count. The interspersed link arguments do not generated query branches.

If the mode is not independent, the allowed atoms or other minor objects which are tested in the additional branches depend on the current minor object in the referred branch. In modes singlebond and doublebond, these can only be atoms linked via the specified bond type to the referrer object, not the full atom set of the tested ensemble. Linked query branches are checked recursively. If a minor object in the leading branch matches, but fails to match in a dependent linked branch, more allowed minor object combinations are tested until they are exhausted or a combination of suitable minor objects is found which matches all branches. In any case, a minor object is only utilized once per bind node, so that for example a chain of three singlebond connected query branches needs to match three different atoms. The third branch cannot go back along the bond between the atoms selected for the first and second branch matches.

Example:

set q {
bind atom {and {A_ELEMENT in {7 8 16}} {A_NEIGHBORS = 2} {A_RING_COUNT = 0}}
{singlebond 0}
{and {A_ELEMENT = 6} {A_UNSATURATION = 0} {A_RING_COUNT = 0}}
{singlebond 1}
{and {A_ELEMENT in {7 8 16}} {A_NEIGHBORS = 2} {A_RING_COUNT = 0}}
}

molfile scan $fh $q

This query tests for a fragment of three atoms which are connected by single bonds, and where each atom is subject to a check on a different set of atomic attribute conditions. The same query could also be realized as a SMARTS pattern. The advantage of this notation is that arbitrary properties can be used as attributes, and that an extended operator set and the full set of comparison mode flags are available. The disadvantages are a less readable pattern representation, and that no substructure query accelerator techniques such as bitvector screening are automatically employed.

passswitch
A switch which selects a single child node depending on the current value of the pass index. All other child nodes are ignored in that query pass. This is used internally for smart similarity queries and is of limited usefulness for normal user-written queries, but it may be used in expert queries. In standard queries, only a single pass, with index zero, is ever executed. The maximum number of passes of a query is determined by the largest number of child nodes in any passswitch node in the query.

Here are a few simple expression patterns:

molfile scan $fh $leafexpression1

molfile scan $fh [list "and" $l1 $l2]

molfile scan $fh [list "or" $l1 [list "and" $l2 $l3 $l4]]

molfile scan $fh [list "orcontinue" [list not $l1] [list "xor" $l2 $l3]]

molfile scan $fh [list bind mol [list and $l1 $l2]]

All branch nodes need to end in leaf expression nodes. An empty query expression is valid and matches every input record. Also, it is legal and actually a common case to have an expression which is just a single leaf node expression. The order of the branches does not matter. An automatically invoked optimizer sorts and simplifies the branches in order to achieve maximum performance.
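The match logic of the basic branch operators can be sketched in plain Tcl. This is only an illustration of the combination rules, not the toolkit's implementation: the helper branch_matches is hypothetical, leaf nodes are assumed to be already reduced to 0 (no match) or 1 (match), and the additional status conditions passed along by operators such as orcontinue are ignored.

```tcl
# Hypothetical helper, not part of the toolkit: evaluate a nested branch
# expression whose leaf nodes have already been reduced to 0 or 1.
proc branch_matches {node} {
    # A bare 0 or 1 stands in for an evaluated leaf node.
    if {$node eq "0" || $node eq "1"} {
        return $node
    }
    # First list element is the operator, the rest are child branches.
    set children [lassign $node op]
    set hits 0
    foreach child $children {
        incr hits [branch_matches $child]
    }
    switch -- $op {
        and { return [expr {$hits == [llength $children]}] }
        or  { return [expr {$hits > 0}] }
        xor -
        eor { return [expr {$hits % 2 == 1}] }
        not { return [expr {$hits == 0}] }
        default { error "unsupported operator \"$op\"" }
    }
}

puts [branch_matches {xor 1 1 1}]          ;# odd number of hits: 1
puts [branch_matches {or 0 {and 1 1 0}}]   ;# no branch matches: 0
```

The nesting mirrors the list structure of the query expressions above, with the operator as the first element of every branch list.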

Leaf node expression classes

These are the supported classes of leaf node expressions:

all
This is just a placeholder. It matches every record.
filename
A condition on the name of the current physical file. This is only useful for scans involving virtual files.
formula
A molecular formula expression.
isnull
Check whether property data is absent.
notnull
Check whether property data is present.
property
A condition on a property value. If possible, this is evaluated without reading a full structure or reaction object from the file. However, if necessary, the checked property data is extracted from, or even computed on, the full record data item. The first word of a property leaf node expression is the name of the property, not the class name.
reaction
A reaction query to find records with reactions containing specific bond transformations.
record
A condition on the file record of the current physical file. For simple single-file scans, this is the same as the virtual record.
smartsearch
A special variant on the structure search node. This node is internally expanded into four internal alternative queries controlled by a pass-dependent switch node. The expanded queries are a full structure query, a substructure query, and Tanimoto similarity queries with thresholds of 95% and 90%. The complete query is automatically re-run with the next branch of the series of alternative queries until at least one hit has been found. This query mode only works on data sources where the file or other input source can be repositioned to the original start position if a second or later pass is required.
structure
A structure match operation on the primary database structure, a derived version thereof, or a reaction component. This type of query supports a variety of full-structure, substructure, superstructure and similarity matching methods. Some of these expressions, such as full-structure queries, are internally rewritten to property queries. For full-structure queries, these are hash code checks. Others, such as substructure matching, are handled by special routines. The first word of the leaf node specification can either be structure, for the main record structure, which is expected to be cleaned up and standardized, or any other of the recognized structure file ensemble classes (reagent, product, solvent, catalyst, parent, scaffold, original, deprotected, salt). If a tested file record does not contain the requested structure variant, an attempt is made to derive it from the main record structure. This works for the parent structure, for example, but for obvious reasons not for the original.
vrecord
A condition on the virtual file record. For simple files, this is the same as the physical record.

The various leaf expression classes have different syntax schemes, which are explained in the next paragraphs.

record and vrecord expressions

The record and vrecord expression classes are always written with three list elements: The expression class name, the operator, and the value or value list. The operators can be from the standard six numerical types, the range operator (<->), and the in or notin set operators. Numerical comparisons require a single comparison value, the range operator a pair of values, and the set operators a list. Examples:

"record <= 100"

"vrecord <-> {1 1000}"

"record in {1 7 19 230}"
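When the bounds or record sets are held in Tcl variables, the three-element form is conveniently assembled with the list command, which takes care of the nesting. The following is a minimal sketch; the variable names are illustrative only, and the molfile handle $fh in the trailing comment is assumed to have been opened elsewhere.

```tcl
# Illustrative values only.
set first  1
set last   1000
set wanted {1 7 19 230}

# Class name, operator, value or value list: always three elements.
set q_max   [list record <= 100]
set q_range [list vrecord <-> [list $first $last]]
set q_set   [list record in $wanted]

puts [llength $q_range]    ;# three list elements
puts [lindex $q_range 2]   ;# the nested pair of range bounds

# With an open molfile handle in $fh, a scan would then look like:
#   molfile scan $fh $q_range
```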

filename expressions

The filename expression class is even simpler. It always consists of three elements: The expression class name, the operator (which can only be = or !=), and the file name. The actual file comparison operation uses device and inode identifiers on Linux/Unix platforms if the file is accessible, so the exact spelling of any path components does not matter. Example:

"filename = part1.sdf"

isnull and notnull expressions

The isnull and notnull expression classes are written with two elements. The first is the class name, and the second a property name. The property name may be qualified with an ensemble class modifier. If the modifier is not specified, the query applies to the main database structure. Otherwise, the property of the specified ensemble class is addressed. Examples:

"isnull E_NAME"

"notnull product:E_ASSAY_RESULT"

property expressions

The property query expression class is a little bit more complex. It has a variable number of elements, between three and eight. The general syntax scheme is

property {operator ?modifiers?..} value ?threshold? ?multimode? ?filter? ?c1? ?c2?

The first three elements are always the property name, which can be qualified with an ensemble class, the comparison operator, and one or more values. The number of required values is dependent on the operator. The comparison operator can be a nested list. It needs to contain as a list element the basic comparison operator (numerical, range or in/notin set operators) and may additionally contain modifier words, which are translated into flags potentially influencing the datatype-specific comparison functions. It depends on the data type of the property whether any flag word has an effect.

If the object flag word is supplied as part of the operator list, the value part of the query is parsed as a chemistry object handle, more specifically an ensemble handle, a decodable string representation of an ensemble, a reaction handle, or a decodable string representation of a reaction. The ensemble variants are accepted if the query property is attached to an ensemble or an ensemble minor object, and the reaction variants can be used if the property is reaction-related. The value of the query is then automatically extracted, even computed if needed, from the object. Properties with subfields can be entered with the basic name, or any qualified subfield name. In addition, the property name may be prefixed by a structure class designator (see paragraph on structure queries). By default a property is assumed to be data of the main structure of the file record, or the main reaction. Examples:

"E_NAME = methane"

"solvent:E_NAME {in ignorecase} {benzene toluene ethylbenzene}"

"E_IRSPECTRUM(source) {= shell nocase} *bruker*"

"E_WEIGHT {<= object} $ehtest"

"E_CAS {= ignoredashes ignorecase} 88337-96-6"
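Because the operator and the value of a set comparison are both nested lists, it is easy to lose a nesting level when the expression is written as a quoted string with embedded command substitution. A minimal sketch of assembling the same kind of expressions with the list command, which preserves the nesting, is shown here. The variable names are illustrative.

```tcl
# The value list of an in/notin comparison must stay a single list element.
set names   [list benzene toluene ethylbenzene]
set q_names [list solvent:E_NAME {in ignorecase} $names]

# Operator with two modifier words, single comparison value.
set q_cas   [list E_CAS {= ignoredashes ignorecase} 88337-96-6]

puts [llength $q_names]              ;# property, operator list, value list
puts [llength [lindex $q_names 2]]   ;# number of names in the set
```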

These are the comparison flag words which are recognized:

absolute
Use absolute numerical values for comparison.
alternative
Use alternative variant of comparison algorithm, if supported. For example, the bitset/bitunset comparison methods by default report 0 (equality) only if all bits are identical. The alternative version reports 0 when there is any common bit.
approximate
Use an approximate version of the comparison operator. For strings, this means that case, whitespace, numbers and punctuation are ignored. For floating point data, it means that the comparison employs rounded integer values. This can also be specified by an at @ character directly attached to the operator.
asnumber
Extract number from, for example, a string and use that for numerical comparison instead of literal comparison.
bitset
Interpret the query expression value as bit mask and check whether all bits in that mask are also set in the file value.
bitunset
Interpret the query expression value as bit mask and check whether all bits in that mask are unset in the file value.
contained
Test whether the query expression value is contained in the file value. For strings, this is simple substring matching. For vectors, this is an element match.
cosine
Compute cosine similarity coefficient percentage from query expression and file value and remember this as score. This comparison is only supported for bit vectors.
correlation
Compute correlation coefficient from numerical vector types.
dice
Compute Dice similarity coefficient on bit vectors, bit sets or strings (via bigrams).
euclidean
Compute Euclidean distance from numerical vector types.
extended
Use an extended version of a comparison method. For example, in conjunction with regular expressions, this enables extended regexp syntax.
glob
Interpret query value as shell expression. This can also be specified by an asterisk * character directly attached to the operator.
ignorecase
Ignore case for string-related comparisons. This can also be specified by an i character directly attached to the operator.
ignoredashes
Ignore dash/minus characters in string-related comparisons.
ignorewhitespace
Ignore whitespace in string-related comparisons.
ignorezero
For numerical vector comparisons, ignore zero elements.
needelementmatch
For vector comparisons with the contained flag, the default method is to check whether all elements of the query vector value compare to one element in the file vector data, but not necessarily in the same position. If this flag is supplied additionally, any single element match will suffice for a positive comparison result.
needelementmismatch
For vector comparisons with the contained flag, the default method is to check whether all elements of the query vector value compare to one element in the file vector data, but not necessarily in the same position. If this flag is supplied additionally, there needs to be at least one element mismatch for a positive comparison result.
object
Decode value as object, and compute the comparison value from it. If the object is a string representation, the object is only created temporarily and discarded as soon as the value has been obtained. Persistent objects that are addressed via their handles remain valid and unchanged, except that their property data set is potentially extended by the computation.
precision
Use the precision as defined in the property description to check for equality. By default, full CPU precision is used.
regexp
Interpret query value as regular expression. This can also be specified by a tilde ~ character directly attached to the operator. Starting with toolkit version 3.352, the regular expression syntax on all platforms is that of the PCRE library, also known as the Perl style.
swap
Swap left and right side of the expression in the comparison. This is especially useful for asymmetric operations such as regular or shell expressions. With a swap word, the regular or shell expression is the string from the file, not the written query value.
tanimoto
Compute Tanimoto similarity coefficient percentage from query expression and file value and remember this as score. This comparison is only supported for bitsets and bit vectors.
tversky
Compute Tversky similarity coefficient percentage from query expression and file value and remember this as score. This comparison is only supported for bitsets and bit vectors.
trim
Ignore leading and trailing whitespace. Spaces in the middle of a string are still significant.
unique
Hint for the query processor that the value is expected to match only once in the file, if at all. This is useful for query optimization. If a hit has been found, additional records need not be checked.
vectorrange
For numerical vector comparisons. The query expression value vector is expected to contain twice as many elements as the file values. Every pair of values in the query vector is interpreted as a required upper and lower bound for the file values.
withdigits
In conjunction with the approximate modifier, make digits significant again.

If the operator is the in or notin word, the value part is interpreted as a list. The value, or value list item, must be parseable according to the property data definition. Enumerated values and similar encodings may be used if properly defined in the property descriptor record.

If the comparison function computes a score (for example, the Tversky or Tanimoto variants), the next optional argument is a threshold value which needs to be exceeded to register as a hit. If the threshold parameter is not specified, or given as a negative value, any score passes. Example:

"E_SCREEN {>= tanimoto object} $eh 95"

The next two optional arguments concern the case when there is more than one file data value to compare against the expression value. This generally happens when the tested property is not a major object property, but a minor object property, such as an atom or molecule property. In that case, the database record often contains multiple values, because there is more than one atom, or more than one molecule in the structure in the record. The first argument is the general match criterion. It can be set to one, all, none, or both. The default is one. Mode one means that it is sufficient if one of the record values matches. Mode all requires all to match, mode none requires that none matches, and mode both requires that there are both matches and mismatches.

The next optional parameter is a filter which can be used to restrict the values tested. If it is not present, or an empty string, no filter is applied. Example:

"A_ELEMENT = 6 {} all ringatom"

The above expression checks whether all ring atoms in the structure are carbon. Any record with a hetero ring atom fails the test.
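The four match criteria can be pictured as a reduction over the individual comparison results. The following plain Tcl sketch assumes one 0/1 result per tested value after filtering. The helper multimode_passes is hypothetical, and its behavior for an empty result list (all and none pass, one and both fail) is an assumption, not documented toolkit behavior.

```tcl
# Hypothetical helper, not part of the toolkit: combine per-value
# comparison results (0 or 1) according to the match criterion.
proc multimode_passes {mode results} {
    set hits 0
    foreach r $results {
        if {$r} { incr hits }
    }
    set misses [expr {[llength $results] - $hits}]
    switch -- $mode {
        one  { return [expr {$hits > 0}] }
        all  { return [expr {$misses == 0}] }
        none { return [expr {$hits == 0}] }
        both { return [expr {$hits > 0 && $misses > 0}] }
        default { error "unknown match criterion \"$mode\"" }
    }
}

# Ring atoms of pyridine tested with "A_ELEMENT = 6": one nitrogen fails.
set results {1 1 1 0 1 1}
puts [multimode_passes all  $results]   ;# 0, a hetero ring atom is present
puts [multimode_passes both $results]   ;# 1, matches and mismatches exist
```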

The final two optional arguments are integer constants which may be used by the comparison operation. If they are not specified, both are implicitly passed as zero. If the first is specified, but not the second, the second is set to 100 minus the first value. Almost all comparison operations on the various data types ignore these.

One comparison mode which does make use of them is the Tversky bit vector similarity score. Here c1 and c2 are the weights of the bits in the first and second compared value. For scoring, both parameters are divided by one hundred and the floating point results are used as weight multipliers. Example:

"E_SCREEN {>= tversky object} $eh 90 {} {} 30 70"

The above expression computes a Tversky score on the standard structure search screen E_SCREEN with 30% weight for the database structure features and 70% for the query structure features (i.e. imbalanced towards a substructure rating), and reports the record if the score is 90% or higher.
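For orientation, the weighting can be sketched with the textbook Tversky index, computed here in plain Tcl on two bit sets given as lists of set bit positions. This is an assumption about the form of the score, not the toolkit's verified implementation: the helper tversky_percent is hypothetical, the weights are applied to the bits present on only one side, and the first weight is assigned to the file value as in the example in the text. With both weights set to 100 the result equals the Tanimoto coefficient.

```tcl
# Hypothetical helper, not part of the toolkit. Bit sets are lists of
# distinct set-bit positions; c1 and c2 are integer percent weights.
proc tversky_percent {filebits querybits c1 {c2 ""}} {
    # If only the first constant is given, the second is 100 minus the first.
    if {$c2 eq ""} { set c2 [expr {100 - $c1}] }
    set w1 [expr {$c1 / 100.0}]
    set w2 [expr {$c2 / 100.0}]
    set common 0
    foreach b $filebits {
        if {[lsearch -exact $querybits $b] >= 0} { incr common }
    }
    set onlyfile  [expr {[llength $filebits]  - $common}]
    set onlyquery [expr {[llength $querybits] - $common}]
    set denom [expr {$common + $w1 * $onlyfile + $w2 * $onlyquery}]
    if {$denom == 0} { return 0.0 }
    return [expr {100.0 * $common / $denom}]
}

set filebits  {1 2 3 4}
set querybits {3 4 5}
puts [tversky_percent $filebits $querybits 30 70]
puts [tversky_percent $filebits $querybits 100 100]   ;# Tanimoto: 40.0
```

A score computed this way would be compared against the threshold argument of the expression, 90 in the example.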

Starting with version 3.358 of the toolkit, property expressions where the data type of the query property is structure or reaction are no longer parsed as standard property expressions, but as structure or reaction query expressions, respectively. Example:

"V_ONTOLOGY_TERM(substructure) {>= swap stereo isotope charge} $eh"

Since the data type of the subfield of V_ONTOLOGY_TERM is structure, the syntax rules of normal property expressions no longer apply. Instead, the syntax for structure expressions explained below is substituted.

structure expressions

Structure expressions are used to invoke structure comparison operations, such as sub- and superstructure search. The expression is a list, with three to six elements. A structure expression starts with the structure identifier, followed by the operator, which, as in property queries, may be written as a list with auxiliary modifier words, and as third mandatory argument the comparison structure source.

The structure identifier is the name of a structure class. Usually it is present as part of the record in the queried file, but some structure classes can be computed from the main structure if necessary. If a structure class can neither be found in a file record, nor computed, the node will not match. The following structure classes are supported:

structure
The main structure. Usually expected to be a standardized, normalized form.
original
An original structure, un-standardized. deposited is an alternative name.
salt
A salt form.
deprotected
A variant without protective groups.
parent
A parent compound. There is a standard computation function for this form.
scaffold
A structure core, isolated by some algorithm.
reagent
A reagent ensemble. Usually this is a part of a reaction record, but it can be present also on its own.
product
A product ensemble. Usually this is a part of a reaction record, but it can be present also on its own.
solvent
Solvent for a reaction. Usually this is a part of a reaction record, but it can be present also on its own.
catalyst
Catalyst for a reaction. Usually this is a part of a reaction record, but it can be present also on its own.

At minimum, the operator section contains a standard numerical operator symbol. Additionally, modifier words may be present as additional list elements. The following operators are supported.

=
Structure identity, i.e. full-structure search. This is internally re-written to an equivalent hash code search as a property comparison node. A suitable hash code is automatically selected depending on the operator modifiers such as stereo and isotope.
!=
Structure inequality, i.e. a negated full-structure search. This is internally re-written to an equivalent hash code search as a property comparison node. A suitable hash code is automatically selected depending on the operator modifiers such as stereo and isotope.
>=
Substructure search.
>
Substructure search, excluding identity.
<=
Superstructure search. This operation ignores hydrogens on the database structures (see below).
<
Superstructure search, excluding identity. Superstructure search ignores hydrogens on the database structures when the database entries are used as sub-graphs - otherwise a normal, fully specified database molecule will not match much. For the identity check, hydrogens are significant.
~>= or ~>
Tanimoto similarity search with a reporting limit. This is internally re-written to an equivalent property search.
%>= or %>
Tversky similarity search with a reporting limit. This is internally re-written to an equivalent property search.
<->
Substructure match count range search. This automatically changes the substructure match mode to distinctinneratoms (see match ss command and the count modifier below). It is possible to use a lower bound of zero which lets structure mismatches pass the query condition. This can be useful when match-dependent data is retrieved, for example the matchcounts pseudo property (see below).

The default substructure match mode has the bondorder, useatomtree and usebondtree flags set (see match ss command). The initial flag set can be modified with modifier words linked to the operator. As far as it makes sense, the modifier words also change the operation of derived query modes, such as full-structure matching via hash codes.

These are the modifier words which can be used in structure expressions:

absolutestereo
Perform absolute stereo matching. By default, stereochemistry is not used in the query, except if set up explicitly as atom- or bond-specific query attribute in properties A_QUERY and B_QUERY as part of the query substructure specification. An alternative syntax is to directly attach an uppercase S character to the operator.
allowmissingstereo
If set, absent stereochemistry descriptors in file structures can be matched by explicit stereo centers in the query structure. However, stereo center mismatches still lead to a match failure.
anyfragment
Report a match for full-structure search if any molecule of the file structure is identical to the query structure. For substructure/superstructure queries, this flag has no effect, since their default operation mode already covers the effects of the flag.
anyoverlap
If the substructure contains multiple fragments, they may match overlapping parts of the structure ensembles. By default, matched substructure fragments cannot overlap. This flag cannot be combined with atomoverlap.
arotautomer
A more aggressive form of the tautomer mode. In this mode, tautomers involving the dissolution of aromatic systems are also found, in addition to the more low-energy tautomer forms matched with the normal tautomer mode.
atomoverlap
If the substructure contains multiple fragments, they may match overlapping atoms, but not overlapping bonds. By default, matched substructure fragments cannot overlap at all. This flag cannot be combined with anyoverlap.
charge
Match formal charges of query atoms. By default, charges are not compared, except if set up explicitly as atom-specific query attribute in property A_QUERY in the query substructure specification.
count
For substructure and superstructure matching, check not only for the presence of a match, but count the number of distinct matches, equivalent to the match mode distinctinneratoms in the match ss command. The normal substructure match mode is equivalent to the first mode in the match ss command, yielding only counts of zero or one.
emptyssismismatch
By default, a substructure without any atoms matches anything. If this flag is set, it matches nothing instead.
exactaro
Match aromatic bonds exactly. By default, simple single or double query structure bonds match structure file record aromatic bonds.
exactringsystem
Rings in substructure fragments must match complete ring systems only. For example, with this flag a benzene substructure no longer matches naphthalene, anthracene, etc. Non-ring parts of the substructure can still, if other query attributes do not prevent this, match both ring and chain parts of file structures. For full-structure queries, this flag has no effect.
extended
Use extended versions of the match procedures. For similarity queries, this enables the PubChem extended scoring mechanism. If the query structure is identical to a file structure both in stereochemistry and isotope labels, an artificial score of 104 is computed, 103 if isotopes or stereochemistry match, but only one of these, 102 for basic equivalence of connectivity without isotopes or stereochemistry, and 101 for a tautomer. Compounds which are not structurally identical to the query structures using one of these criteria are scored normally.
fragmentsplit
Treat every molecule in the query structure as a separate fragment. The query ensemble is implicitly split, and every component therein is stored in an independent structure expression node. These nodes are then connected with an or or orcontinue branch mode. This is similar to using a file handle pointing to a file with multiple records as query structure data source (see below).
framework
Substructure carbon atoms cannot have any unmatched, directly bonded carbon or hetero atom neighbors in the structure. Unmatched bonded hydrogen is allowed. This flag has an effect only for sub- and superstructure match modes.
implicitsinglearo
If this flag is set, bonds which were created with an implicit bond order when the query structure was decoded are matched as if they were explicit single/aro query bonds. This is a useful mode for emulating Daylight software.
isotope
Perform isotope matching. By default, isotope labels are not used in the queries, except if set up explicitly as atom-specific query attribute in property A_QUERY in the query structure specification. An alternative syntax is to directly attach an i character to the operator.
matchallheavyatoms
Require that all heavy atoms in the file structures are matched. This feature generates matches of file structures similar to full-structure matches while allowing the use of substructures with variable match conditions, such as atom lists.
nobondorder
Do not compare bond orders. This flag has an effect only for sub- and superstructure match modes.
nochainonaro
Do not match chain parts of the query substructure on aromatic bonds in the file structures. This flag has an effect only for sub- and superstructure match modes.
nochainonring
Do not match chain parts of the query substructure on ring bonds in the file structures. This flag has an effect only for sub- and superstructure match modes.
nodoubleonaro
Do not match otherwise unmarked double bonds in the substructure onto aromatic bonds of the structures.
noquerytree
Deactivate extended matches requiring full checks of the query tree fields in the A_QUERY and B_QUERY properties in the query structures. Certain query inputs need these trees for precise matching, because the query cannot be expressed as a flat set of query attributes. Examples of queries requiring tree matching for proper execution are complex SMARTS expressions beyond those using only simple explicit or implicit and in atomic or bond expressions, and Recursive SMARTS. Disabling tree matching with this flag may lead to a small speed-up for simple substructure queries.
nosingleonaro
Do not match otherwise unmarked single bonds in the substructure onto aromatic bonds of the structures.
nosubstructureh
For substructure match, ignore any hydrogens present in the query structure. This is a convenient shortcut to allow the use of hydrogen-complete structures as simple substructures. A similar scheme is automatically invoked for superstructure search, where hydrogens in the file structures are ignored in matching.
reactionflags
Match reaction transform flags in the substructure. Both query and file structures need to have data for property B_REACTION_CENTER set. The supported set of comparisons is compatible with MDL’s ISIS database. Note that this flag can be used gainfully in structure expressions for half-reaction matching. It is not limited to full reaction queries. This flag is on by default in reaction queries, but off for structure queries.
relativestereo
Perform relative stereo matching. By default, stereochemistry is not used in the query, except if set up explicitly as atom- or bond-specific query attribute in properties A_QUERY and B_QUERY in the query structure specification. An alternative syntax is to directly attach a lowercase s character to the operator.
sethighlight
In case structure ensembles are retrieved from the file (molfile scan modes ens, enslist, reaction or reactionlist), the bonds and atoms matched by a substructure are marked in the returned structure-side ensembles with the highlight flags in properties B_FLAGS and A_FLAGS. In case multiple matches occur, the highlight set is a union of all processed matching substructure mappings. This flag is also automatically set if the property retrieval set in the molfile scan command includes related pseudo properties, such as matchatoms or matchbonds.
setmatchproperty
In case structure ensembles are retrieved from the file (molfile scan modes ens, enslist, reaction or reactionlist), the bonds and atoms matched by a substructure are marked in the returned structure-side ensembles by attached properties A_SSMATCH and B_SSMATCH. These are set to the labels of the matching substructure atoms or bonds. Unmatched structure ensemble parts have match property values of zero. In contrast to the sethighlight flag, this option attaches a new match property instance for any successful and processed match. Returned ensembles may therefore possess series of property instances like A_SSMATCH, A_SSMATCH/2 and so on.
swap
Swap the left and right structures in the query. This means, for example, that the database is expected to contain substructure definitions, and the query value argument a fully defined structure. This is not exactly the same as a superstructure search because hydrogens are handled differently. For superstructure search, hydrogen atoms in the file records are ignored, generating a simplified structure from the record data for matching, but in case of a swapped substructure search, the file record is submitted as substructure for matching without any processing.
tautomer
Match tautomers of the query structure. If this flag is active, non-aromatic single and double bonds in tautomer systems need not be matched exactly, as long as the overall bond order count is a match. Mobile hydrogens can either be specified explicitly, or a full implicit set can be used if the useimplicith flag of property B_ISTAUTOMERIC is set. The standard mode does not consider tautomeric forms which destroy aromatic systems. If you need to find matches between aromatic and non-aromatic tautomer systems, use the more aggressive arotautomer mode.
unique
Hint for the query processor that the query ensemble is expected to be matched only once in the file, if at all. This is useful for query optimization. If a hit has been found, additional records need not be checked.

Many of these global flags can be overridden, or activated on a local level, for individual atoms or bonds, in the A_QUERY and B_QUERY properties. For example, A_QUERY has fields for flags which can request the matching of stereo or charges for specific atoms, or to allow missing stereochemistry at a specific center. These per-atom or per-bond requests override global query flag settings.

The third mandatory expression list element is the structure source. It can be one of

an ensemble handle
The ensemble is directly decoded.
a list of ensemble handle and molecule label
The fragment indicated by the molecule label is extracted from the ensemble and used for the query as isolated entity. If the molecule label cannot be found, an error is reported.
structure line notation string
For example, a SMARTS/SMILES/SLN/InChI/CID string or a packed Cactvs ensemble - anything which can be decoded by the ens create command. The string is decoded into a transient ensemble, which is automatically discarded when it is no longer needed. The exact decoding specifications depend on the operator. For full-structure search, a fully specified structure is created, while for substructure-type queries implicit hydrogens are not attached, and the full range of query specifications of the encoding format is allowed.
a dataset handle
A dataset containing at least one ensemble. All dataset objects are checked, and internally for every ensemble a separate expression node is created. The nodes are then linked via an or or orcontinue (in case a scoring operator is used) branch node. Dataset objects which are not ensembles are silently ignored. The hydrogen status of the dataset ensembles is not changed. In case there is only a single ensemble in the dataset, this command is indistinguishable from using the ensemble handle directly. In case the dataset does not contain any ensembles, an error is raised.
a molfile handle
An opened structure file. All remaining records are read, and internally for every record a separate structure expression node is created. The nodes are then linked via an or or orcontinue (in case a scoring operator is used) branch node. If the match operation is full-structure, the file is read with automatic hydrogen addition (see molfile set), otherwise without any conversion flags. However, since the hydrogen addition flag is the only file attribute which may be temporarily overridden, other molfile object attributes may be set before the file is used in the query expression. Since every record becomes a separate expression node, using a file with a huge number of records in this fashion may cause problems. In case the file does not contain any records after the read pointer at the time the command is parsed, an error is raised.

Query specifications found in structure sources are understood in a variety of formats. Daylight and MDL formats are decoded and translated into an internal representation in an almost completely compatible fashion. That includes Recursive SMARTS, ISIS 3D queries, MDL stereo groups and MDL reaction queries. A significant range of Sybyl SLN and CambridgeSoft ChemFinder query expressions are also understood, as well as features found in the CSD ConQuest software. Finally, in Cactvs there is no fundamental difference between a query fragment and a normal structure object. Query structures are just structures with additional information stored in properties A_QUERY, B_QUERY and possibly B_REACTION_CENTER. For basic matching, any structure object will do, even if it does not possess these query attribute properties. However, keep an eye on the hydrogen status of query fragments. If no specific flags are set, substructure matches attempt to match hydrogen atoms just like any other atom. Example:

set ehss [ens create C]

set ehss [ens create C smarts]

The first substructure ensemble does not, in the absence of hydrogen ignore flags, match any structure ensemble except those which contain a full methane (one C plus four H) molecule as fragment, because that is what the substructure represents. The second code line decodes the substructure in full SMARTS mode. Not only can the full range of SMARTS expressions now be parsed (though none are used in this example), but the structure is also created without implicit hydrogens. The first substructure could still be used in a molfile scan command as a simple carbon match test if the nosubstructureh modifier flag were supplied.
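As a hedged illustration of the hydrogen handling described above, the following sketch runs both variants of the query against a file in count mode. The file name sample.sdf and the variable names are placeholders, and the rewind between the two scans is an assumption about where the second scan would otherwise start.

```tcl
# Sketch only: sample.sdf is a placeholder file name.
set fh [molfile open sample.sdf r]

# Hydrogen-complete methane decoded as a normal structure. The
# nosubstructureh flag lets it act as a plain carbon match test.
set ehss [ens create C]
set n1 [molfile scan $fh "structure {>= nosubstructureh} $ehss" count]

# Assumption: reposition to the first record before the second scan.
molfile set $fh record 1

# The same test with the query decoded in SMARTS mode, which
# creates the substructure without implicit hydrogens.
set ehq [ens create C smarts]
set n2 [molfile scan $fh "structure >= $ehq" count]

molfile close $fh
puts "nosubstructureh: $n1 hits, SMARTS query: $n2 hits"
```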

In order to read query structures from a file, the following generic open statement is the standard approach:

molfile open $file r hydrogens asis readflags noimplicith

Simple query formats, such as MDL ISIS query Molfiles, are read into a flat set of attributes. More complex formats, such as SMARTS, may require the use of a tree of expressions on individual atoms and bonds, similar to the overall query tree with branch and leaf nodes described here for the molfile scan command. These complex formats are nevertheless also translated, to the degree possible, to the flat model. For example, a SMARTS expression which only uses simple atom lists or atom and bond query attributes, all connected just by and, can be fully represented in this way. This also means that translation into other query file formats is possible for these simple expressions. The use of the full query trees in matching can in some cases be a performance issue. The noquerytree flag is available to restrict the match to those parts of the full query which can be expressed in the flat model.

The fourth and optional expression list element in the query expression is used only for a few match modes. If it is not set, the default value is minus one.

similarity queries: The minimum score required to report a hit
substructure count ranges: A list of the acceptable minimum and maximum occurrence counts of the substructure. If only a single value is supplied, it is used both as minimum and maximum value.

Example:

"structure ~>= $eh 90"

"product <-> C(=O)\[OH\] {2 3}"

The first sample expression is a standard Tanimoto similarity query, with a 90% threshold. The second query matches product structures with two to three carboxyl groups.

Optional expression list elements five and six correspond to the c1 and c2 parameters in property query expressions. These are currently only used in Tversky similarity queries:

"structure %>= $eh 90 30 70"

This is an expression for a skewed Tversky similarity (70% query structure, 30% file structure weight) with a 90% reporting threshold.

If the file format supports it, bitvector screening is automatically applied to reduce the number of records for which structures need to be pulled and sent to graph-based substructure matching. The default structure match screening property is E_SCREEN. The standard versions of E_SCREEN implement three predefined fragment sets, numbered zero to two, which yield bit vectors of increasing length and selectivity, but also increasing storage requirements. The higher sets are identical to the lower ones in the leading bits. A set can be requested by setting

prop setparam E_SCREEN extended 0/1/2

The bit set read from the query file must correspond to the parameter setting for E_SCREEN in the current Tcl interpreter, if the screen bits are automatically computed on the query structure. The CBS and BDB file formats, which are optimized for structure query operations, contain screen bit version information in the file header and automatically configure the property parameter setting when the file is opened. For other file formats with screen bits this needs to be done explicitly in the application script. It is also possible to change the structure bit-screen property associated with a file by setting the appropriate molfile handle attribute, so it is easily possible to use custom screen bit sets instead of the default property.

Starting with version 3.358 of the toolkit, property query expressions where the data type of the property is structure are automatically parsed as structure expressions.

smartsearch expressions

This query expression takes the same arguments as a structure expression. It is internally expanded into four alternative queries, linked by a pass-dependent switch control node. The four alternative queries are a full-structure query (equivalent to operator = in a structure query), a substructure query (operator >=), and two Tanimoto similarity queries with thresholds of 95% and 90% (operator ~>=).

When such a query expression is a component of a query expression tree, the query is first run with the full-structure query. If that query yields fewer results than the pass match limit, the input data source is repositioned to the original start record and the substructure query is run. If that run also does not yield sufficient hits, the two similarity queries are tried one after another. The pass match limit is one by default, which means that the next alternative is tried only if the query did not match anything. It can be configured via the molfile passlimit attribute.

Running the second and later alternatives is only possible if the data source can be repositioned to the original start position of the first pass. If that fails, the query is silently terminated early. The pass match limit that triggers a possible re-execution of the query is compared with the global hit count of the query, not the number of hits returned by the smartsearch branch. If other parts of a complex query produce sufficient hits, the query is not re-run even if a smartsearch branch did not return any hits.

Hits returned in different passes can be distinguished by including the pass pseudo-property in the retrieval data.

By convention, smartsearch expressions are written with an = operator. The actual operator in a smartsearch expression is ignored, but modifiers are not. So specifying options like the use of stereochemistry or isotopes is supported and useful.

It is possible to have multiple smart search expressions in a query. The query pass index for these is incremented in parallel, not independently.

The smart search feature was inspired by a similar functionality in the Accelrys Isentris system.

Examples:

"smartsearch = c1ncccc1"

"smartsearch {= stereo} \"L-lysine\""
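A smartsearch expression is passed to molfile scan like any other query expression. The sketch below is an illustration under assumptions: the file name is a placeholder, and setting the passlimit attribute through molfile set is inferred from the attribute name mentioned above rather than confirmed by this section.

```tcl
# Sketch only: sample.sdf is a placeholder file name.
set fh [molfile open sample.sdf r]

# Assumption: passlimit is set like other molfile attributes.
# With a limit of five, the next alternative query is tried
# as long as fewer than five hits have been found.
molfile set $fh passlimit 5

# enslist mode returns the handles of all matching ensembles.
set hits [molfile scan $fh {smartsearch = c1ncccc1} enslist]
puts "[llength $hits] hits"

molfile close $fh
```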

formula expressions

Formula expressions are used to match file structures by element composition. Conceptually, this is a special syntax for a complex property match on file structure properties E_ELEMENT_COUNT and M_ELEMENT_COUNT. A formula search expression is always a list of three elements. The first element is always formula, the second element is the comparison operator, and the third element is the formula specification. The following operators are supported:

=
Match the formula specification. There cannot be any elements present in the structure which are not mentioned in the formula.
>=
Match the formula specification. Elements which are not mentioned in the formula may be present in the tested structure.
>
Match the formula specification. At least one element which is not mentioned in the formula must be present in the tested structure.

For formula queries, there are no modifier words for the operator.

The syntax of the formula is built on the lowest level by element or pseudo-element symbols, which may be grouped into sum or difference expressions and may possess a prefixed count multiplier. The symbol or symbol group can then be suffixed by a simple count, or an open or closed count range. If no count range is specified, the default count is one. In case an element is entered more than once, all counts for that element are added. Finally, the expression may be grouped by period characters into sub-expressions to be applied to different molecular fragments in the tested structures.

Besides normal elements, the following pseudo-elements, which are compatible with the set of the CSD ConQuest software, are recognized:

?
An atom in the tested structure which is not a simple element.
[Any]
Any atom which is a simple element (SLN syntax)
[Hev]
Any atom which is a simple element and not hydrogen (SLN syntax)
[Het]
Any atom which is a simple element and neither carbon nor hydrogen (SLN syntax)
[1A]
Elements from the first PSE main group, excluding hydrogen (Li, Na, ..).
[2A]
Elements from the second PSE main group (Be, Mg, ..)
[3A]
Elements from the third PSE main group (B, Al, ..)
[4A]
Elements from the fourth PSE main group (C, Si, ..)
[5A]
Elements from the fifth PSE main group (N, P, ..)
[6A]
Elements from the sixth PSE main group (O, S, ..)
[7A] or [Hal]
Elements from the seventh PSE main group (F, Cl, ..)
[8A]
Elements from the eighth PSE main group (He, Ne, ..)
[1B]
Elements from the first PSE minor group (Cu, Ag, ..)
[2B]
Elements from the second PSE minor group (Zn, Cd, ..)
[3B]
Elements from the third PSE minor group (Sc, Y, ..)
[4B]
Elements from the fourth PSE minor group (Ti, Zr, ..)
[5B]
Elements from the fifth PSE minor group (V, Nb, ..)
[6B]
Elements from the sixth PSE minor group (Cr, Mo, ..)
[7B]
Elements from the seventh PSE minor group (Mn, Tc, ..)
[8B]
Elements from the full eighth PSE minor group (Fe, Co, Ni, Ru, Rh, ..)
[8X]
Elements from the first column of the eighth PSE minor group (Fe, Ru, ..)
[8Y]
Elements from the second column of the eighth PSE minor group (Co, Rh, ..)
[8Z]
Elements from the third column of the eighth PSE minor group (Ni, Pd, ..)
[1M]
Metals from the first and second main groups (Li, Na, Mg, K, Ca, ..)
[2M]
Metals from the third to sixth main groups (Al, Ga, Ge, Sb,..; but not Si, As, Se, Te)
[3M]
All main group metals (union of [1M] and [2M])
[TR]
All transition group metals, no main group elements or lanthanides/actinides
[LN]
Lanthanides
[AN]
Actinides (no, this is not [AC]!)
[4M]
All metals in the PSE
[NM]
All non-metallic elements

Element items can be grouped with round brackets into sums or differences. However, this is no full arithmetic expression parser. Element symbols can only be used as stand-alone syntactic elements, bracketed all-sum expressions, or bracketed all-difference expressions.

An element or an arithmetic group can have an appended count. This count can be:

missing
The default count is one.
a simple integer
The count must be matched exactly.
a full integer range
The count must lie between the minimum and maximum values.
an open range
Left-open ranges have an implicit minimum count of zero, right-open ranges an implicit maximum count of infinity.
an asterisk
This is the same as a right-open range starting with zero, i.e. zero to any number of occurrences.
a plus character
This is the same as a right-open range starting with one, i.e. one to any number of occurrences.
a standard numerical comparison operator, followed by a number
The value is compared according to the specification. This is a CSD compatibility feature.

Examples:

"formula = C6H6"

"formula = C5-6H6-"

"formula >= (Cl+Br)2"

"formula > \[4M\]>=3"

"formula = (2C-H)-6"

"formula = CH3COOH"

"formula = \[Het\]>1"

The first expression is a simple search which matches any ensemble with a composition of six carbon and six hydrogen atoms. The second looks for compounds with five to six carbon atoms and six or more hydrogen atoms, but no other elements. The third line finds compounds where the sum of chlorine and bromine atoms is two. Other elements may be present but are not required, so this expression matches Cl2, Br2 and ClBr as well as dichlorobenzene. The fourth expression finds structures with three or more metal atoms. The fifth expression finds compounds where twice the number of carbon atoms minus the number of hydrogen atoms has a value of up to six. The next line finds compounds with a formula of C2H4O2, since the counts for repeated elements are summed up. The last example matches any compound with more than one hetero atom.
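The backslashes in front of the square brackets in some of these examples are needed only because the expressions are written as double-quoted Tcl strings, where unescaped brackets would trigger command substitution. A brace-quoted expression avoids the escapes. The following sketch first shows the equivalence with plain Tcl, then passes the expression to molfile scan in count mode. The file name sample.sdf and the variable names are placeholders.

```tcl
# Both spellings yield the same expression string, so this prints 1.
set quoted "formula > \[4M\]>=3"
set braced {formula > [4M]>=3}
puts [string equal $quoted $braced]

# Sketch only: sample.sdf is a placeholder file name.
# count mode returns the number of hits as an integer.
set fh [molfile open sample.sdf r]
set nmetal [molfile scan $fh $braced count]
molfile close $fh
puts "$nmetal records with three or more metal atoms"
```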

Periods can be used to define separate formula sections. These are applied to individual molecules in the tested structures, not the full ensemble. If a single dot is specified at the beginning or end of the expression, it signifies a single expression section to be applied to a molecule. When a test for formula sections is applied, all permutations of possible matches between the molecules in an ensemble and the formula expression sections are tried. It is neither required to have any specific order of the molecules in the ensemble, nor in the formula expression sections, nor is there a need for a match between the molecule and expression section count. However, every expression section in a formula needs to match a different molecule in the tested ensemble.

Examples:

"formula = C6H6.C7H8"

"formula = .H2O"

The first expression looks for ensembles which contain one molecule with the formula C6H6, and another with formula C7H8. The second expression matches ensembles with one or more water molecules. In both cases, molecules/fragments with a different composition may be present in the record. In order to test for two or more formulae with the additional condition that there are no other molecules/fragments, use two formula expression nodes connected with an and branch node, as in

and "formula = C6H6.C7H8" "formula = C6H6C7H8"

Element symbols which stand for specific isotopes, such as D for deuterium, are currently not processed. D is read as a simple alias for hydrogen, disregarding the isotope label.

It is possible to use an ensemble handle instead of a formula expression. In that case, the elemental formula of that ensemble is used in the query, as computed by property E_FORMULA .

reaction expressions

Reaction expressions are the construct used for reaction substructure searches, for example when looking for certain bond transformations in a database of reactions. Obviously, the scanned file needs to contain reaction information for this to succeed.

An important aspect of reaction searches is the use of atom mapping numbers, which link atoms in the reagent ensemble to the product ensemble, and likewise in the transformation scheme which needs to be matched. The central property for this is A_MAPPING. If this property is present, it is used to restrict matches to those reactions which embody a certain transformation, and are not a simple pair of ensembles which match substructures of the left and right part of the query transformation somewhere in their connectivity. Nevertheless, it is still possible to query reactions without a mapping scheme. That is identical to a pair of substructure searches. Also, individual parts of a reaction (the reagent and product ensembles, but potentially also the catalyst or solvent entries) can be used as targets for single-ensemble sub/super/full-structure searches via structure query expressions (see above).

A reaction expression is a list of three to six elements. The first element is always reaction , the second element the operator, and the third element the reaction source. The following operators can be used:

=
Reaction identity, i.e. full-structure reaction search. This is internally re-written to an equivalent hash code search as a property node.
!=
Reaction inequality, i.e. a negated full-structure reaction search. This is internally re-written to an equivalent hash code search as a property node.
>=
Reaction substructure search.
>
Reaction substructure search, excluding identity.
<=
Reaction superstructure search.
<
Reaction superstructure search, excluding identity.
~> or ~>=
Reaction Tanimoto similarity search with a reporting threshold.
%> or %>=
Reaction Tversky similarity search with a reporting threshold.

Similar to structure query expressions, the operator can be modified by adding flag words as additional list elements to the operator list element. The following flags are recognized:

absolutestereo
Perform absolute stereo matching. By default, stereochemistry is not used in the query, except if set up explicitly as atom- or bond-specific query attribute in properties A_QUERY and B_QUERY. An alternative syntax is to directly attach an uppercase S character to the operator.
allowmissingstereo
If set, absent stereochemistry descriptors in file structures can be matched by explicit stereo centers in the query structure. However, stereo center mismatches still lead to a match failure.
anyfragment
Report a match for full-structure search if any molecule of the file structure is identical to the query structure. For substructure/superstructure queries, this flag has no effect, since their default operation mode already covers the effects of the flag.
anyoverlap
If the substructure contains multiple fragments, they may match overlapping parts of the structure ensembles. By default, matched substructure fragments cannot overlap. This flag cannot be combined with atomoverlap .
atomoverlap
If the substructure contains multiple fragments, they may match overlapping atoms, but not bonds. By default, matched substructure fragments cannot overlap. This flag cannot be combined with anyoverlap .
bidirectional
If the query reaction does not match, try to match it also in the reverse reaction direction.
charge
Match formal charges of query atoms. By default, charges are not compared, except if set up explicitly as atom-specific query attribute in property A_QUERY .
emptyssismismatch
By default, a substructure without any atoms matches anything. If this flag is set, it matches nothing instead.
exactaro
Match aromatic bonds exactly. By default, simple single or double query structure bonds match structure file record aromatic bonds.
exactringsystem
Rings in substructure fragments must match complete ring systems only. For example, with this flag a benzene substructure no longer matches naphthalene, anthracene, etc. Non-ring parts of the substructure can still, if other query attributes do not prevent this, match both ring and chain parts of file structures. For full-structure queries, this flag has no effect.
extended
Use extended versions of the match procedures. For similarity queries, this enables the PubChem extended scoring mechanism. If the query structure is identical to a file structure both in stereochemistry and isotope labels, an artificial score of 104 is computed, 103 if isotopes or stereochemistry match, but only one of these, 102 for basic equivalence of connectivity without isotopes or stereochemistry, and 101 for a tautomer. Compounds which are not structurally identical to the query structures using one of these criteria are scored normally.
framework
Substructure carbon atoms cannot have any unmatched, directly bonded carbon or hetero atom neighbors in the structure. Unmatched bonded hydrogen is allowed. This flag has an effect only for sub- and superstructure match modes.
implicitsinglearo
If this flag is set, bonds which were created with an implicit bond order when the query structure was decoded are matched as if they were explicit single/aro query bonds. This is a useful mode for emulating Daylight software.
isotope
Perform isotope matching. By default, isotope labels are not used in the queries, except if set up explicitly as atom-specific query attribute in property A_QUERY . An alternative syntax is to directly attach an i character to the operator.
matchallheavyatoms
Require that all heavy atoms in the file structures are matched. This feature generates matches of file structures similar to full-structure matches while allowing the use of substructures with variable match conditions, such as atom lists.
nobondorder
Do not compare bond orders. This flag has an effect only for sub- and superstructure match modes.
nochainonaro
Do not match chain parts of the query substructure on aromatic bonds in the file structures. This flag has an effect only for sub- and superstructure match modes.
nochainonring
Do not match chain parts of the query substructure on ring bonds in the file structures. This flag has an effect only for sub- and superstructure match modes.
nodoubleonaro
Do not match otherwise unmarked double bonds in the substructure onto aromatic bonds of the structures.
noquerytree
Deactivate extended matches requiring full checks of the query tree fields in the A_QUERY and B_QUERY properties in the query structures. Certain query inputs need these trees for precise matching, because the query cannot be expressed as a flat set of query attributes. Examples of queries requiring tree matching for proper execution are complex SMARTS expressions beyond those using only simple explicit or implicit and in atomic or bond expressions, and Recursive SMARTS. Setting this flag, and thereby disabling tree matching, may lead to a small speed-up for simple substructure queries.
noreactionflags
Do not match reaction transform flags in the substructure. If reaction flags are checked, which is the default for reaction queries but not for structure queries, both query and file structures need to have property B_REACTION_CENTER set for this to work. The supported set of comparisons is compatible with MDL’s ISIS database. For standard reaction queries which check for specific bond changes, this flag should not be set.
nosingleonaro
Do not match otherwise unmarked single bonds in the substructure onto aromatic bonds of the structures.
nosubstructureh
For substructure match, ignore any hydrogens present in the query structure. This is a convenient shortcut to allow the use of hydrogen-complete structures as simple substructures. A similar scheme is automatically invoked for superstructure search, where hydrogens in the file structures are ignored in matching.
relativestereo
Perform relative stereo matching. By default, stereochemistry is not used in the query, except if set up explicitly as atom- or bond-specific query attribute in properties A_QUERY and B_QUERY . An alternative syntax is to directly attach a lowercase s character to the operator.
sethighlight
In case structure ensembles are retrieved from the file (molfile scan modes ens, enslist, reaction or reactionlist), the bonds and atoms matched by a substructure are marked in the returned structure-side ensembles with the highlight flags in properties B_FLAGS and A_FLAGS. In case multiple matches occur, the highlight set is a union of all processed matching substructure mappings. This flag is also automatically set if the data retrieval set in the molfile scan command includes related pseudo properties, such as matchatoms or matchbonds.
setmatchproperty
In case structure ensembles are retrieved from the file (molfile scan modes ens, enslist, reaction or reactionlist ), the bonds and atoms matched by a substructure are marked in the returned structure-side ensembles by attached properties A_SSMATCH and B_SSMATCH . These are set to the labels of the matching substructure atoms or bonds. Unmatched structure ensemble parts have match property values of zero. In contrast to the sethighlight flag, this option attaches a new match property instance for every successful and processed match. Returned ensembles may therefore possess series of properties like A_SSMATCH , A_SSMATCH/2 ... and so on.
unique
Hint for the query processor that the query reaction is expected to be matched only once in the file, if at all. This is useful for query optimization. If a hit has been found, additional records need not be checked.

The third mandatory parameter is the query reaction source. It can be any of

A reaction handle
The handle is decoded directly.
A dataset handle
A dataset containing at least one reaction. All dataset objects are checked, and internally for every reaction a separate expression node is created. The nodes are then linked via an or or orcontinue (in case a scoring operator is used) branch node. Dataset objects which are not reactions are silently ignored. The hydrogen status of the dataset reactions is not changed. In case there is only a single reaction in the dataset, this command is indistinguishable from using the reaction handle directly. In case the dataset does not contain any reactions, an error is raised.
reaction line notation string
A string representation of a reaction, in any format that can be decoded by the reaction create statement, for example a Reaction SMILES, SMIRKS or a Cactvs serialized reaction object string. This query reaction is only temporarily instantiated and automatically deleted when the command finishes.

Reading one or more query reactions from a file handle directly in the query statement, as is possible for structure queries, is currently not supported. Also, the tautomer match mode is not available for reaction matching because it interferes with atom map processing.

The optional query list items four to six are identical to those for structure query expressions. They represent a reporting threshold value and the c1 and c2 comparison algorithm parameters. Please refer to the paragraph on structure match expressions for more details.

The general approach to reaction sub- and superstructure matching is as follows:

Perform bit vector screening for acceleration, if supported by the file format. The default reaction screen property is X_SCREEN . The name of the reaction screen bit property can be changed by setting the appropriate molfile handle attribute, so it is easily possible to use a custom reaction screen.
Match the reagent side from the file record onto the reagent side of the query reaction, just like a structure query expression. If possible, structure screening (see paragraph on structure match expressions) is used as an acceleration filter in addition to the reaction screen.
If atom mapping information is available, use it to set up a match constraint table for the product side, i.e. allow the product side substructure atoms with an atom mapping label which has a counterpart in a reagent substructure atom mapping value to match only the atom in the file product structure which has the same mapping label as the reagent side atom which was matched by the reagent substructure. For this to work, there need to be two matching pairs of mapping values on the reaction substructure and file reaction, though they of course can be different in both reactions. In case a 1:1 relationship cannot be established for an atom, the matching of this atom is not restricted.
Match the product side, using mapping constraints where possible, and also using structure screens if available.
If any of the previous steps fail, abort the sequence early, but if bidirectional matching is allowed, try again with the roles of the reaction substructure reagent and product ensembles swapped.
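The sequence above is triggered by an ordinary reaction expression. The sketch below is a hedged illustration: the file name reactions.rdf is a placeholder, the mapped query string is a made-up transformation, and how the line notation is decoded in detail is left to the reaction create command.

```tcl
# Sketch only: reactions.rdf is a placeholder file name.
set fh [molfile open reactions.rdf r]

# Hypothetical mapped query: a C=O double bond on the reagent side
# becomes a C-O single bond on the product side. The map labels 1
# and 2 constrain the product side match to the same atoms.
set query {[C:1]=[O:2]>>[C:1]-[O:2]}

# Reaction substructure search; bidirectional also tries the query
# with reagent and product roles swapped if the first try fails.
set expression [list reaction {>= bidirectional} $query]
set hits [molfile scan $fh $expression recordlist]

molfile close $fh
puts "matching records: $hits"
```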

Besides the ensemble-level query attribute properties A_QUERY and B_QUERY, reaction matches also make use of B_REACTION_CENTER (for constraints on the type of transformation a bond undergoes) and E_REACTION_ROLE (for the identification of reagent and product ensembles in the reaction object).

Reaction similarity queries use the reaction screen set (by default, property X_SCREEN ) instead of the structure screen that is used for structure similarity. This operation returns a single score. There is no scoring of the reagent or product ensembles.

Full-structure reaction matches are performed via hash code checks on both the reagent and product sides. Atom mapping information is not used for this query operation. The suitable hash code is automatically selected depending on the operator modifiers (stereo, isotopes).

Starting with version 3.358 of the toolkit, property query expressions where the data type of the property is reaction are automatically parsed as reaction expressions.

Scan modes

The return value of the molfile scan command depends on the query mode. The default mode is enslist for the molfile scan command, but may be different when scanning other objects, such as datasets, networks or tables. The following modes are supported for file queries via the molfile scan command. Scan modes for other objects may include specific additional modes, while disallowing others.

array
The mode parameter is a list consisting of the mode selector array and a nested list of properties and pseudo-properties. Each property item can be a list of one to three elements. The first element is a property or pseudo-property, the second element a name, and the third element again a property or pseudo-property. If the second property item list element is omitted, the name is the same as the first element. If the third element is missing, it is assumed to be the pseudo-property record. In this scan mode, the molfile scan command returns a list of the names of the created arrays. For each name, a global Tcl array variable is created, and for each match, a Tcl array element is created with an element name equal to the value of the first element of the item specification and an element value equal to the value of the third element of the item specification. For example, the scan mode specification

{array {E_NAME name2rec} {record rec2name E_NAME}}

results in the creation of two global Tcl arrays in the current interpreter, called name2rec and rec2name. The first has elements where the element name is the name of the matching structure (property E_NAME), and the value is the file record number (because record is the default for the omitted third element). The second array has elements where the record number is the array element name, and the corresponding value is the structure name. The return value of the Tcl statement is the list "name2rec rec2name", the names of the two variables created.

If array elements for a specific key already exist, the new value is appended as a list object. The result registration procedure does not overwrite the existing content. So, for example in the above case, if there are multiple records with the same structure name, the array element indexed by name would contain a list of records, not just a single record. Since global arrays are persistent, data is also appended over multiple scan statements. If this is not desired, a statement like unset -nocomplain $arrayname should be executed before the scan is started. It is legal to use the same array name for the registration of multiple properties. In this case, each match appends a new list element for every reported property, though these lists will not be nested.
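A minimal sketch of the array mode follows, reusing the mode specification shown above. It assumes execution at the global level of the interpreter, so that the created arrays are directly visible. The file name and the formula query are placeholders chosen for illustration.

```tcl
# Clear leftovers from earlier scans. Without this step the new
# results would be appended to the existing array elements.
unset -nocomplain name2rec rec2name

# Sketch only: sample.sdf is a placeholder file name.
set fh [molfile open sample.sdf r]
set arrays [molfile scan $fh "formula = C6H6" \
    {array {E_NAME name2rec} {record rec2name E_NAME}}]
molfile close $fh

# The return value lists the names of the created arrays.
puts "created arrays: $arrays"

# An element may hold several records if a structure name occurs
# more than once in the file.
foreach name [lsort [array names name2rec]] {
    puts "$name: records $name2rec($name)"
}
```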

bitvector
Return a string-encoded bit vector (series of 0s and 1s) indicating the match status for every visited record.
count
Just count the number of hits, but do not report details. The result value is an integer.
delete
Delete hits from the file, if this is possible. This operation is performed after the scan has completed, not during the scan, so that file record numbers etc. do not change within a query.
ens
Return the handle of the first matching ensemble. The query is stopped at that point. If no hits are found, an empty string is returned.
enslist
Return the handles of all matching ensembles. If no hits are found, an empty list is the result.
exists
Return a boolean flag indicating whether any hit exists. This is very similar to the count mode, except that query processing is stopped after the first match.
index
The file position index of the first matching object. This is the same as the record mode, except that each hit value is one less, since indices start at zero. The query is stopped after the first hit.
indexlist
A variant of the recordlist mode. The returned values are one less than the records, since indices start at zero.
molfile
The mode parameter list consists of the mode selector molfile and a structure file handle, which must have been opened for writing, appending, or updating. The first matching structure is written to the file. After this, the query stops. The output file attributes determine format, selection of data written, structure encoding conventions such as hydrogen status, etc. If no matching structure is found, nothing is written. In this mode, the return value of the command is the matching record number of the input file, just as in the record mode.
molfilelist
The mode parameter is a list consisting of the mode selector molfilelist and a structure file handle, which must have been opened for writing, appending, or updating. Matching structures are written to that file. The output file attributes determine format, selection of data written, structure encoding conventions such as hydrogen status, etc. If no matching structures are found, nothing is written. This mode is also implicitly selected if a structure file handle is directly provided as mode argument. In this mode, the return value of the command is a list of the matching record numbers of the input file, just as in the recordlist mode.
property
The mode parameter is a list consisting of the mode selector property and a sequence of properties and pseudo-properties. The selected properties for the first match are returned as a list. If there are no hits, an empty string is returned. The query stops after the first match.
propertylist
The mode parameter is a list consisting of the mode selector propertylist and a sequence of properties and pseudo-properties. The selected properties for all matches are returned as a nested list. If there are no hits, an empty string is returned. This mode is also selected if the mode argument is simply a list of property and pseudo property names without an identifiable mode keyword as first list element.
reaction
Return the handle of the first matching reaction. The query is stopped at that point. If no hits are found, an empty string is returned.
reactionlist
Return the handles of all matching reactions. If no hits are found, an empty list is the result.
record
The record number of the first file record which matches. In case a single physical file is searched, this is the same as vrecord , but if the scanned file is a virtual file consisting of multiple physical component files, this is the record number in the matching physical file. The scan is stopped when the first match has been found. If there are no matches, an empty string is returned.
recordlist
The same as the record mode, except that more than one match is potentially reported. In case a virtual file is searched, it is possible that duplicate values are returned, because the same record number from different physical files may be a hit. For unique record numbers, use the vrecordlist variant.
table
The mode parameter is a list consisting of the mode selector table and a sequence of properties and pseudo-properties. This scan mode returns a table handle. The table is automatically configured with properly typed columns corresponding to the requested properties. For each hit, a row is added. If there are no hits, a table handle is still returned, but the table does not have any rows. This retrieval mode is only available if the toolkit has been compiled with table support.

The individual properties may also each be specified as a list consisting of the property name, and an arbitrary string. In that case, the string is used as the column name. By default, the column names are the same as the name of the property they store. Example:

{table {E_NAME name} {E_CAS casno} record}

sets up a table with three columns called name , casno and record . The first two columns contain property data from the matching file records, the last one the record in the file which matched.

Instead of the keyword table , an existing table handle may also be used. In that case, any existing matching table columns are automatically re-used to store result data. Additionally specified properties are added as new columns to the right of the previously existing columns. New table rows generated by matches are appended to the bottom of the table.

tablecollection
This mode is mostly identical with the table mode, and takes the same column specification parameters. The important difference is that this scan mode always retrieves the full objects associated with the filled table rows (ensembles or reactions). They are preserved and their relationship with the table is marked. This can be useful if at a later stage in handling the table additional data needs to be computed or retrieved from an object. On the other hand, this mode can be memory-intensive if many objects are created. Referral to associated objects may happen indirectly, for example with image columns where the exact image property is unknown until output time when the storage format is selected.

In this mode, the scan command returns the table handle as its result. The associated row objects are stored in the general namespace, and are not members of any dataset. They are visible like any other object of their type, for example via ens list or reaction list commands. The commands table ens and table reaction are useful to get the object subset associated with this table. Note that these table-associated objects are not automatically deleted when the table is destroyed - only their association is severed. If they are no longer needed, they should be destroyed explicitly.

vrecord
If the scan is executed on a single file, this is the same as record . In case a virtual file which consists of multiple physical files is searched, this is the virtual file record number, i.e. the overall record number in the concatenated component files.
vrecordlist
If the scan is executed on a single file, this is the same as recordlist . In case a virtual file which consists of multiple physical files is searched, this is a list of the virtual file record numbers, i.e. the overall record numbers in the concatenated component files.

If requested property data is not present on the object representing a hit, an attempt is made to compute it. If this fails, the retrieval modes table and tablecollection generate NULL cells, and property retrieval as list data produces empty list elements, but no errors. For minor object properties, the property list retrieval modes produce lists of all object property values instead of a single value. In table-based modes, only the data for the first minor object associated with the major object is retrieved, which makes these modes less suitable for direct minor object property retrieval.

Pseudo properties for retrieval

The following pseudo properties can be retrieved in property/propertylist scan modes or as table values, in addition to standard property data:

avgscore
The average value of all computed scores, such as Tanimoto or Tversky similarity scores, in the matching query for this result.
conformerindex
The index of the matching conformer in case of 3D queries with multiple conformations, -1 if no matching conformer index was determined.
conformer
A list of the atomic coordinates of the matching conformer, if a 3D query was performed. If this is not the case, an empty vector is the result. The data type of this vector is coorvec (x,y,z-triples as vector elements).
filename
The name of the physical file the match occurred in. For normal, single-file scans, this is not interesting. However, for virtual files, only the combination of the pseudo properties filename and record is a complete reference.
image
A structure GIF image (property E_GIF ) with highlighted matching substructure atoms and bonds. A normal E_GIF retrieval property would just show the structure, but without highlighting. The data type of this property is the same as that of E_GIF (depending on the configuration, a diskfile reference or an in-memory blob ).
index
This is the same as record , except that the value is one less, since indices start with zero.
matchatoms
An integer vector holding the labels of all atoms matching the substructures used in evaluating the query expression. If no substructure was used for the match, this vector is empty. highlightatoms is an alias for this pseudo property.
matchbondatoms
The same as matchbonds , except that each element is a pair of the labels of the matching atoms in the bonds, not the bond label as a single number.
matchbonds
An integer vector holding the labels of all bonds matching the substructures used in evaluating the query expression. If no substructure was used for the match, this vector is empty. highlightbonds is an alias for this pseudo property.
matchcount
The first element of the matchcounts array, as described below. If the query does not contain any substructure match nodes, the result is empty.
matchcounts
An integer vector holding the number of distinct substructure matches for substructure query nodes in the query tree. For normal substructure expressions, this value can only be zero or one because the standard substructure match mode only checks for the presence of any match (match mode first ). Additionally, this value can be minus one if the node was never evaluated, for example because it is part of an or expression. Only if the count modifier is used together with the substructure query operator, or the substructure operator is the range operator, the possibility of multiple matches is evaluated and larger values can be obtained. For these operations the match mode is currently always distinctinneratoms (see match ss command).
maxscore
The maximum value of all computed scores, such as Tanimoto or Tversky similarity scores, in the matching query for this result.
merit
For queries which use a merit/demerit rating scheme (for example, Bruns/Watson queries) this retrieves the accumulated merit/demerit sum of the top-level query node. The query needs to match for this retrieval to work, so in case none of the demerit rules match, you get an empty result, not a default zero merit/demerit value. Internally, there is no distinction between merit and demerit scores. The keyword demerit is an alias for this pseudo-property.
minscore
The minimum value of all computed scores, such as Tanimoto or Tversky similarity scores, in the matching query for this result.
pass
The pass number of the query execution. Normal queries, i.e. those without smartquery nodes and without hand-crafted passswitch nodes, are executed only once, and the pass is always zero.
parent
The parent structure of the matching structure as a packed, base64-encoded serialized object string. If the structure file does not contain a precomputed parent structure, or the main file structure contains it as property, it is computed from the main file structure as property E_PARENT_STRUCTURE .
productmatchatoms
The same as the matchatoms pseudo property, but for the ensemble on the right side of a matching reaction, not a simple structure. If no reaction was matched, this is an empty list.
productmatchbondatoms
The same as the matchbondatoms pseudo property, but for the ensemble on the right side of a matching reaction, not a simple structure. If no reaction was matched, this is an empty list.
productmatchbonds
The same as the matchbonds pseudo property, but for the ensemble on the right side of a matching reaction, not a simple structure. If no reaction was matched, this is an empty list.
reagentmatchatoms
The same as the matchatoms pseudo property, but for the ensemble on the left side of a matching reaction, not a simple structure. If no reaction was matched, this is an empty list.
reagentmatchbondatoms
The same as the matchbondatoms pseudo property, but for the ensemble on the left side of a matching reaction, not a simple structure. If no reaction was matched, this is an empty list.
reagentmatchbonds
The same as the matchbonds pseudo property, but for the ensemble on the left side of a matching reaction, not a simple structure. If no reaction was matched, this is an empty list.
record
The physical record number of the current physical file. For normal, single-file scans this is the same as the virtual record. For virtual files, this property needs to be combined with the filename pseudo property to obtain a complete reference.
rgatoms(rg)
A list of the atom labels in a matching structure which were mapped to an expanded R-group atom in the query. The property index is the name of the R-group of interest defined in the substructure, usually something like R1. If there was no expanded R-group of that name, the result list is empty.
rgattachments(rg)
A nested list of the atom label pairs of the bonds in a matching structure which connect between the structure framework and the atoms expanded as the named R-group rg . If there was no expanded R-group of that name, the result list is empty.
score
The first element of the scores array, as described below. If the query does not contain any scoring expressions, the result is empty.
scores
An integer vector of the results of all query expression branches, in depth-first left-to-right order, which computed a score, such as structure similarity queries with Tanimoto or Tversky bitvector comparisons. In case a branch was not executed when the match was determined, zero is entered.
structure
The dataset structure as a packed, base64-encoded serialized object string.
vrecord
The virtual record number. For single-file scans this is the same as the physical record number.
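As an illustration of why filename and record belong together for virtual files, the following sketch retrieves both pseudo properties for every hit and combines them into one reference string per hit. The helper name hit_references, the handle variable vfh and the file#record notation are assumptions made for this example, as is the expectation that the nested result list holds one sublist per hit, with the values in the requested order.

```tcl
# Hypothetical helper: convert a nested list of {filename record} pairs, as
# assumed to be returned by a propertylist scan, into "file#record" strings.
proc hit_references {hits} {
    set refs {}
    foreach hit $hits {
        lassign $hit fname rec
        lappend refs "${fname}#${rec}"
    }
    return $refs
}

# Only executed inside a Cactvs interpreter. vfh is assumed to be the handle
# of a virtual file consisting of several physical files.
if {[llength [info commands molfile]] && [info exists vfh]} {
    set hits [molfile scan $vfh {structure >= c1ccccc1} {propertylist filename record}]
    puts [join [hit_references $hits] \n]
}
```

If a single number per hit is sufficient, the vrecord pseudo property can be requested instead of the filename and record pair.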

Record visitation order

The optional visitation order parameter, one of the optional query parameters listed in the next section, is primarily intended to be used for convenient execution of queries on a subset of records which were selected by a previous query on the same file. It can either be a numerical record list, with the first file record indicated as record one, or one of the keywords sortup or sortdown , followed by a property name. If this parameter is not set, or set to an empty string, or the magic string all , records are visited from the current input position in simple sequential order. If the query parameter dictionary additionally contains a startposition value, this start position refers to the index (plus one) of the first element of the specified record set, not to the original underlying file.

In the record list variant of this argument, the specified (virtual) records in the file are visited in the list order, and all other file records are ignored. For optimum performance, the records should be sorted in ascending order, but this is not necessary, and, since it does affect the order of the returned results, record visitation sets with record sequences in custom order sorted to some criterion can have uses. A suitable format for a record list is a saved result of molfile scan in the recordlist or vrecordlist scan modes. It is possible to use a sorted record list with a non-rewindable input file, but an unsorted list will fail in that case if the file input pointer needs to be positioned backwards.

The sort property option variant implies a visit of all file records, but in the order of the values of a property in that file, not the native record sequence in the file. Using this access method is not too much overhead for indexed file formats such as CBS or BDB with an index on the sort property , but a serious performance hit for standard text files. This method cannot be used with files which cannot be rewound and do not have the sort property data in some direct access field, since it requires a full pass through the file to gather the sort property data values before the actual query is processed.

Examples:

molfile scan $fh "structure >= C1NCCC1" vrecordlist \
	[dict create order [list 3 6 29 157]]

molfile scan $fh "structure ~>= $ehcmp 90" {table E_SMILES score} \
	[dict create order {sortup E_WEIGHT}]
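The intended use case, a second query restricted to the hits of a first one, could look like the following sketch. The helper name order_params and the handle variable fh are assumptions for this example, and the parameter dictionary is passed in the same argument position as in the examples above. Because an empty order value means that all records are visited, the helper refuses to build a dictionary from an empty hit list.

```tcl
# Hypothetical helper: build the query parameter dictionary which restricts a
# scan to the given records. The records are sorted in ascending order, which
# is the recommended form for optimum performance.
proc order_params {records} {
    if {[llength $records] == 0} {
        error "no records to revisit"
    }
    return [dict create order [lsort -integer -increasing -unique $records]]
}

# Only executed inside a Cactvs interpreter. fh is assumed to be an open
# structure file handle.
if {[llength [info commands molfile]] && [info exists fh]} {
    # Stage 1: property filter, keep the virtual record numbers of the hits.
    set stage1 [molfile scan $fh {E_WEIGHT < 100} vrecordlist]
    if {[llength $stage1]} {
        # Stage 2: substructure test on the stage 1 hits only.
        set stage2 [molfile scan $fh {structure >= C1NCCC1} vrecordlist \
            [order_params $stage1]]
    }
}
```

Sorting discards any custom order of the record list. If the order of the returned results matters, the list should be passed unsorted instead.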

Query parameters

The final optional parameter is a keyword/value list of various additional attributes for fine-tuning the execution of the query. The following keywords are recognized:

branchmaxhits number
Set a maximum number of hits for every branch below the root node of the query. This is a per-branch version of the more commonly used maxhits option, which sets a global hit limit. By default the limit recursively applies to all nodes below the root, but this option is often combined with the branchnodetype option. If the latter is set, the limit only applies to those nodes which are of the specified type. Any nodes which have accumulated the specified maximum number of hits during the execution of a scan command no longer match regardless of the contents of additional records they are tested against. The hit count is increased whenever the branch returns a positive result, even if an overall positive match is not found because conditions in other branches are not met. The option can also be applied to logical nodes, such as and or or. In case of or nodes and in circumstances with similar optimization opportunities, the use of this option does not force the execution of lower branches if the match result of the node can already be determined by a partial testing of its branches, so the count may be less than expected.
branchnodetype nodeclass
This option is useful only in combination with a branchmaxhits option. If it is set to the name of a query node class, such as structure, reaction, formula or property, the limit only applies to those nodes below the root node which are of the specified type.
fullblockscan auto/no/yes
This parameter can be set to the values auto (or -1), no (or 0) and yes (or 1). The default value is auto. If this flag is true, scanning does not immediately stop after the maxhits or maxscan limits have been reached. Instead, each query thread completes its currently allocated file section, but does not pick up more work afterwards. This guarantees that a subsequent query on the file can resume after the last visited record, without omitting to test records in file sections where the threads did not complete their task. However, in the full block scan mode the maxhits and maxscan parameters are then only a guideline, since the threads will scan more records, and possibly generate more hits, until they have finished their block. In auto mode, the full block scan mode is active if more than one thread is actually spawned, and inactive when there is only a single thread processing the query.
matchcallback procname
The name of a Tcl procedure in the current interpreter which is called upon each match after processing all standard query conditions, as well as once each for initialization and finalization. In some scan modes, the function is also called to report a mismatch. The parameters passed to the function are the callback mode as a string (one of init, match, mismatch or final), the current number of hits, and a reference for the match results accumulation object. The format of the latter depends on the scan mode - for example, in scan mode bitvector it is the evolving string representation of the result vector, while in scan mode propertylist it is a nested list of property values extracted so far. The structure of the result accumulator is usually the same as the final result of the scan operation. Setting the procedure name to an empty string is the same as omitting this attribute - no Tcl procedure is called.
maxhits number
The maximum number of hits to report. If this number has been reached, the scan stops. If it is set to a negative value, which is the default, an unlimited number of hits could be reported.
maxscan number
The maximum number of records to scan in all query threads combined. If this number has been reached, the scan is stopped (but see fullblockscan parameter). If set to a negative value, which is the default, an infinite number of records could be scanned.
maxthreads number
The maximum number of threads to use for scanning. By default, only a single thread is used. The use of multiple threads can significantly accelerate the processing of large input files, but for input data sets with fewer than 5K text records, or 50K binary and indexed records, the overhead is likely to outweigh the gains. If multiple threads are used, a rule of thumb is that maximum acceleration on a sufficiently large file, with plenty of memory and no competing processor load, is observed with two query threads per processor core. If set to a negative value, the internally used maximum number of threads is adjusted to the number of visible processor cores and the number of threads already internally spawned for other purposes. Because of the need to have multiple concurrent read positions for multi-threaded searching, files which cannot be rewound are always processed by a single thread. If the query file has fewer records than the maximum number of threads multiplied by the thread block size, the actual number of threads used can be smaller than specified.
order order_list
Specify a scan order. By default, the data records are visited in increasing sequence from the current start position. The format of the value part of this dictionary pair has been described in the previous section. For more information, please refer to the paragraph on the record order list.
passlimit number
The number of accumulated hits which will prevent the execution of another query pass, typically with relaxed match conditions, in smartquery expressions and similar constructs. The default value is one, i.e. no additional passes will be executed if there is at least one match.
progresscallback procname
The name of a Tcl function which is called regularly during the file scan. That function could, for example, update a progress bar. The arguments to that function are, in this order, the operation code (init, scan, final), the handle of the scanned object (a molfile handle for the molfile scan command), the number of record scans performed so far, the hit count, and the full scanned object size (file record count, dataset element count) counted as records. If the object size is not known, minus one is passed. The init and final function calls are made only once each, before and after, respectively, any scan calls for the execution of this statement. The short form callback is an alias for this keyword. Setting the option to an empty string disables all progress callback function calls.
progresscallbackfrequency number
The frequency of callback function invocations, measured as the number of records scanned between calls. If set to a negative value, the default is used (currently one call per 100K records scanned). If set to zero, the callback function is not called during the scan, but still for initialization and finalization. The short form callbackfrequency is an alias for this keyword.
sscheckcallback procname
The name of a Tcl procedure which is called after all preliminary checks of a substructure or superstructure match operation have succeeded. Records which are skipped by screening mechanisms or where standard sub/superstructure query attributes already exclude a match do not trigger a function call. This function can be used to add additional criteria to the query which cannot be expressed by standard means.

The arguments passed to this function are, in this order, the substructure object handle, the structure object handle, a nested list with label pairs of all matched substructure and structure atoms, and a nested list with label pairs of all matched substructure and structure bonds. In case of superstructure searches, the roles of substructure and structure are reversed, i.e. the substructure handle and the listed atoms and bonds refer to the current structure read from the scanned data source. The check function should either return 1 for a successful final check, or 0, which leads to a rejection of the match. It is also possible to raise an error, which terminates the query with an error, or exit with a break, which terminates the query without an error.

While the callback routine is free to perform any additional match analysis, it must neither delete the structure or substructure, nor change its connectivity (remove or add atoms and bonds), nor discard or invalidate any property data used in the matching process. The computation or setting of any additional property data on the substructure or structure ensembles is allowed.

startposition number
A specific record to begin the scan at. By default the scan begins at the current read position of the file, except when it is at EOF . In that case, the file is automatically rewound. If a record visitation order list is used, the start position parameter indicates the record list index plus one to use as first file record to visit, not the file record proper.
target datasethandle/remotedataset
The value of this argument is a local or remote dataset handle. If the results of the scan are ensembles (query modes ens or enslist), reactions (query modes reaction or reactionlist) or a table object (mode table or tablecollection), the objects are moved to the specified dataset. In case the dataset is local, the move happens during the query, so that a different script thread could already begin further processing. Data transfer to remote datasets is performed in a single batch just before the query command finishes. For query modes which do not generate chemical objects, such as the recordlist, property or count modes, this parameter is ignored.
threadblocksize number
If multiple threads are used, each thread processes a section of the file. If it completes the section, it will then request the allocation of a new section after the last section already allocated to any worker thread. If this parameter is set to a negative value, which is the default, a suitable thread block size is automatically determined from the file characteristics. It will then be typically a value between 10K and 100K records.
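A progress callback that follows the argument order described for the progresscallback keyword could be sketched as below. The procedure names progress_message and scan_progress, the handle variable fh and the chosen parameter values are assumptions made for this example. Splitting the message formatting from the output is merely a design choice that keeps the formatting testable outside of a scan.

```tcl
# Hypothetical helper: format a status line from the progress callback
# arguments (operation code, object handle, records scanned so far, hit count,
# total size in records or -1 if unknown).
proc progress_message {op handle nscanned nhits size} {
    switch -- $op {
        init {
            return "scan of $handle started"
        }
        final {
            return "scan of $handle finished: $nhits hits in $nscanned records"
        }
        default {
            if {$size > 0} {
                set pct [expr {100.0 * $nscanned / $size}]
                return [format "%.1f%% scanned, %d hits" $pct $nhits]
            }
            return "$nscanned records scanned, $nhits hits"
        }
    }
}

# The callback proper: print the status line to stderr.
proc scan_progress {op handle nscanned nhits size} {
    puts stderr [progress_message $op $handle $nscanned $nhits $size]
    return
}

# Only executed inside a Cactvs interpreter. fh is assumed to be an open
# structure file handle.
if {[llength [info commands molfile]] && [info exists fh]} {
    set params [dict create \
        maxhits 100 \
        progresscallback scan_progress \
        progresscallbackfrequency 10000]
    set hits [molfile scan $fh {structure >= c1ccccc1} recordlist $params]
}
```

With these settings the callback is invoked once for initialization, then after every 10000 scanned records, and once for finalization.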

More typical examples

Examples:

molfile scan $fh {structure = c1ccccc1} recordlist

molfile scan $fh {E_WEIGHT < 100} {propertylist E_SMILES E_NAME E_WEIGHT}

molfile scan $fh {notnull E_CAS} {table E_SMILES E_CAS}

molfile scan $fh {structure ~>= c1nnccc1 90} {score record}

molfile scan $fh "and {structure >= $ehss} {formula >= N3}" ens

Distributed queries

Molfile object handles can be configured to listen on specific ports for remote scan requests. The syntax of a remote scan request is the same as for a scan on a normal file, except for the handle argument. The command is executed asynchronously and therefore returns no direct results. For this reason, remote scans typically use a mode which yields network-transferable objects (modes ens, enslist, reaction, reactionlist, table) and specify a target dataset object on the local system.

On the local system, a typical set-up looks like this:

set dh [dataset create]

dataset set $dh port 10001

molfile scan $remotehost:10002 {structure >= c1ncccc1} \
	{table record E_NAME E_CAS} {} [list target $localhost:10001 startposition 1]

while {![dataset tables $dh {} count]} {

	sleep 1

}

In above code, we first create a recipient dataset object, and configure it to listen on port 10001 for incoming Cactvs objects - we are expecting a table object as result later. We then issue the query for execution on the remote host, and wait until the table object containing the results has arrived.

On the remote server, the set-up could look like this:

molfile open $dbfile r port 10002

vwait forever

Here the database file is opened, and a listener for incoming requests is set up on port 10002. The vwait Tcl statement does nothing by itself, but keeps the interpreter running while it waits for and processes events such as incoming scan commands. Standard Tcl requires a variable name as argument to vwait, so the sample uses a variable called forever, which is never set. In this sample set-up, the remote server needs to be started first, because otherwise the connection to the remote file fails on the client.

Since execution of remote queries is asynchronous, the client could issue multiple query requests to different remote handles and then wait until results from all these requests have been collected, or a timeout or other error condition has been reached. The results could arrive in any order. The scan commands for a group of servers could, for example, specify different start positions and maximum scan values for distributed searching of a big file, or could gather results from different small files. Additionally, the use of multiple scan threads could be requested on the server by passing appropriate parameters in the control section of the command. Nevertheless, only a single remote scan command per Tcl script thread is executed on the server at any time. If multiple scans need to be executed in parallel on a single server, a collection of script threads needs to be created via the Thread package, and every thread then told to open its own port listener.
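Waiting for several results with a timeout could be sketched on the client as follows. The helper name wait_until, the expected table count of two and the timeout values are assumptions for this example. The dataset handle dh is assumed to be the listening recipient dataset from the set-up above, and the table count is queried with the same dataset tables call as in that sample.

```tcl
# Hypothetical helper: evaluate a condition script in the scope of the caller
# until it yields a true value or the timeout has expired.
# Returns 1 if the condition became true, 0 on timeout.
proc wait_until {condition timeout_ms {poll_ms 200}} {
    set deadline [expr {[clock milliseconds] + $timeout_ms}]
    while {![uplevel 1 $condition]} {
        if {[clock milliseconds] >= $deadline} {
            return 0
        }
        after $poll_ms
    }
    return 1
}

# Only executed inside a Cactvs interpreter. dh is assumed to be the
# recipient dataset, and two remote scans are assumed to have been issued.
if {[llength [info commands dataset]] && [info exists dh]} {
    if {![wait_until {expr {[dataset tables $dh {} count] >= 2}} 60000]} {
        puts stderr "timeout: not all remote scans have delivered a result"
    }
}
```

The helper blocks in after between polls. This is acceptable on the client only because, as explained below, listening dataset objects receive data in independent threads and do not depend on an idle interpreter.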

The mechanism for the reception of messages for remote scans on molfile handles which listen on ports is subtly different from the processing of commands sent to listening dataset objects. The execution of scans requires active collaboration of a Tcl interpreter. Commands are only read and processed when the interpreter is idle, for example while sitting in a vwait or sleep statement. In contrast, dataset object listeners do not rely on Tcl interpreters, and are implemented as independent threads. Remote dataset commands, such as ens move or dataset pop with a remote dataset handle, are therefore executed at any time when a mutex lock on the database object and other accessed objects can be secured.

molfile set

molfile set filehandle ?property/attribute value?...

molfile set filehandle attribute_dictionary

A standard data manipulation command. It is explained in more detail in the section on setting property data. The alternative short form with the single dictionary argument is functionally equivalent to using the expanded dictionary as separate property and value arguments.

Examples:

molfile set $fhandle F_GAUSSIAN_JOB_PARAMS(link0) [list \
	"%chk=144__303_2EVE_PDB_Opt8.chk" "%mem=128MB" "%nprocshared=2"]

The command can also be used to set a broad range of object attributes. The list of attributes is documented in the section on the molfile get command.

In case a set command is applied to a virtual file, the command applies to the current physical file only, if this makes sense.

Example:

molfile set $fhandle record 2

The above command repositions the file read/write pointer to the second record.

This command supports a special attribute value syntax for manipulating bitset-type attributes (only attributes, not property values). If the first character of the argument is a minus character (-), the named bits in the set identified by the remainder of the argument are unset. If it is a plus (+), they are additionally set. With an equal sign (=), or no special lead character, the flag set replaces the old value. A leading caret character (^) toggles the selected bits.

Example:

molfile set $fhandle readflags +pedantic
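The prefix forms described above can be sketched side by side. This is an illustration only: it assumes that $fhandle is an open molfile handle, and it reuses the readflags attribute and the pedantic flag from the example above rather than introducing other flag names.

```tcl
# Assumption: $fhandle refers to an already opened molfile handle.
molfile set $fhandle readflags +pedantic   ;# set pedantic in addition to the current flags
molfile set $fhandle readflags -pedantic   ;# unset pedantic, leave all other flags untouched
molfile set $fhandle readflags ^pedantic   ;# toggle pedantic
molfile set $fhandle readflags =pedantic   ;# replace the whole flag set with pedantic
molfile set $fhandle readflags pedantic    ;# no lead character: same effect as the = form
```

Only the + , - and ^ forms preserve flags that were set before; the last two forms discard them.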

molfile setparam

molfile setparam filehandle property key value ?key value?...

Set or update a property computation parameter in the metadata parameter list of a valid property. This command is described in the section about retrieving property data. The current settings of the computation parameters in the property definition are not changed.

molfile show

molfile show filehandle propertylist ?filterset? ?parameterlist?

Standard data manipulation command for reading object data. It is explained in more detail in the section about retrieving property data.

For examples, see the molfile get command. The difference between molfile get and molfile show is that the latter does not attempt computation of property data, but raises an error if the data is not present and valid. For data already present, molfile get and molfile show are equivalent.
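Because molfile show raises an error instead of computing missing data, scripts usually guard it with catch. The following is a minimal sketch, assuming that $fh is an open molfile handle; the property F_COMMENT is used purely as an illustration.

```tcl
# Assumption: $fh is an open molfile handle.
if {[catch {molfile show $fh F_COMMENT} comment]} {
    # No valid F_COMMENT value is present. molfile show does not
    # try to compute it, so fall back to an empty string here.
    set comment ""
}
puts "comment: $comment"
```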

molfile skip

molfile skip filehandle ?recordcount?

Skip records in a file opened for input. If the file pointer is at the beginning of a new record, this next record is the first skipped. If the file pointer is stuck in the middle of a record, for example because a molfile read command failed due to a file syntax error, the first record counted is the remainder of the current record. An attempt is made to re-synchronize to the beginning of the next record.

By default a single record is skipped. If the record count parameter is specified, more than one record can be skipped. Because of the re-synchronization feature for partially read records, negative record counts are not allowed in this command. The molfile backspace and molfile set record commands can be used to go back in a file.

The command returns the number of the next record to be read. In case an attempt was made to position beyond the end of a file, or a record re-synchronization failed, an error is reported.
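A typical use of the re-synchronization feature is recovery after a failed read. The sketch below assumes that $fh is a molfile handle opened for input. Since a read can also fail at the end of the file, in which case the skip reports an error as well, the skip is wrapped in its own catch.

```tcl
# Assumption: $fh is a molfile handle opened for input.
if {[catch {molfile read $fh} eh]} {
    # The read failed, possibly in the middle of a record. Skip the
    # remainder of that record and re-synchronize to the next one.
    if {[catch {molfile skip $fh} next]} {
        puts stderr "cannot continue: $next"
    } else {
        puts stderr "read failed, continuing at record $next"
    }
}
```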

molfile sort

molfile sort fhandle {{propertylist ?direction? ?cmpflags?}..} ?outfile/handle?

Sort the records in the file according to the values of one or more properties or property subfields contained in the file records, or computable on the objects read from the file. The output records are byte-for-byte identical images of the input records, not records reconstructed from input data objects.

The property sort list consists of zero or more sort specification elements. Every specification element is parsed as a sublist, but only the first element therein is mandatory. This element is either a property name, a property subfield name, or one of the magic names #record or record (for the file record) or #random or random (for a random number assigned to that record). The optional sort direction element may be up or down. The default sort direction is upwards. The final optional comparison flags parameter can be set to a combination of any of the values allowed with the prop compare command. The default is an empty flag set.

The first property or magic name in the sort list has the highest priority. In addition to the specified properties, the original record number is implicitly added as tie breaker to yield a stable sort. This automatic value is always sorted upwards. If an empty property list is specified, the result is thus a simple file copy without record rearrangement.

The sort properties do not need to be already present in the file. If necessary, an attempt is made to compute these on the objects read from the file in the first pass. It is possible to sort on properties which are not of the object class read from the file, for example atom properties when ensembles are read, or ensemble properties when reactions are read. In that case, the record is output at the position determined by the lowest sort rank of the property of that object, for example the minimum or maximum value of all values of an atom property in an ensemble. Additional data instances of the property associated with a given record are ignored, so no record duplicates are output.

The optional output parameter can be the handle of an opened Tcl channel (including standard output and standard error), the name of a (preferably new) file, or a pipe construct. Output is appended to this output channel. If the parameter is omitted, the output is first written to a temporary file, then the original file is deleted and the temporary file is renamed to the name of the original file. In that case, the original file handle is automatically re-opened for reading on the new file. The input file handle must be positionable, because file records are accessed twice, once for reading the sort data and once for copying the records out. Sorting from standard input, pipes or other non-rewindable sources is therefore not supported, and neither is the sorting of files which are not simple record sequences. Sorting such files is currently only possible by using explicitly scripted record data buffering mechanisms.

On Windows, output to an open Tcl file handle is not supported, except for the standard output and error channels.

The return value of the command is the number of records written. The position of the sort file handle is set to the same location as before the command.

Examples:

molfile sort $fh {{E_NAME up {dictionary nocase}}} dict.sdf

molfile sort myfile.sdf {{record down}}

set fhtcl [open "randomized.sdf" w]; molfile sort $fh {{random}} $fhtcl

molfile sort $fh {{A_ELEMENT down} {E_WEIGHT up}} "|gzip >heavy.sdf.gz"

The first example creates a new file dict.sdf which contains the remaining records in the file associated with the file handle sorted by the value of property E_NAME in case-insensitive dictionary order. The second example reverses the order of the records in the file, replacing the original file in the process. The third example randomizes the record sequence in the original file, outputting the records in a new file which was opened for writing as a normal Tcl text file. The final example outputs a compressed SD file, with structures sorted by the heaviest element in the ensembles, and using the molecular weight as tie breaker.

molfile sqldget

molfile sqldget filehandle propertylist ?filterset? ?parameterlist?

Standard data manipulation command for reading object data. It is explained in more detail in the section about retrieving property data.

For examples, see the molfile get command. The differences between molfile get and molfile sqldget are that the latter does not attempt computation of property data, but initializes the property value to the default and returns that default, if the data is not present and valid; and that the SQL command variant formats the data as SQL values rather than for Tcl script processing.

molfile sqlget

molfile sqlget filehandle propertylist ?filterset? ?parameterlist?

Standard data manipulation command for reading object data. It is explained in more detail in the section about retrieving property data.

For examples, see the molfile get command. The difference between molfile get and molfile sqlget is that the SQL command variant formats the data as SQL values rather than for Tcl script processing.

molfile sqlnew

molfile sqlnew filehandle propertylist ?filterset? ?parameterlist?

Standard data manipulation command for reading object data. It is explained in more detail in the section about retrieving property data.

For examples, see the molfile get command. The differences between molfile get and molfile sqlnew are that the latter forces re-computation of the property data, and that the SQL command variant formats the data as SQL values rather than for Tcl script processing.

molfile sqlshow

molfile sqlshow filehandle propertylist ?filterset? ?parameterlist?

Standard data manipulation command for reading object data. It is explained in more detail in the section about retrieving property data.

For examples, see the molfile get command. The differences between molfile get and molfile sqlshow are that the latter does not attempt computation of property data, but raises an error if the data is not present and valid, and that the SQL command variant formats the data as SQL values rather than for Tcl script processing.

molfile string

molfile string enshandle/reactionhandle/datasethandle ?attribute value?...

molfile string enshandle/reactionhandle/datasethandle ?attribute_dict?

This command returns a byte vector representation of a structure file. The handle argument of this command is an ensemble, reaction or dataset handle, not a file handle as with other molfile commands.

If the selected output format module supports direct output into a string, the record image is created without intermediary forms. Otherwise, an anonymous temporary file is opened, the ensemble or reaction(s) are written to that file, and the file content is returned as a string, with all newlines and similar characters preserved. The file is then removed.

Writing to binary formats is possible. The return value of the command is a byte vector, not a simple text string, so it may contain NUL bytes. By default, in the absence of an explicit format specification, an MDL Molfile is written.

The remaining parameters are interpreted as in the molfile set command. There are two equivalent command variants, either using attribute and value argument pairs or a dictionary as a single argument. The parameters in the extra arguments or dictionary are typically used to set a hydrogen status, select the output format, etc.

Example:

set jmestring [string trim [molfile string [ens create C1CC1] format jme]]

The example creates an input string for the popular JME Java structure editor by P. Ertl/Novartis. The string trim statement deletes the trailing newline. The necessary JME output module is automatically loaded if it is not already loaded or compiled-in when the format parameter is decoded.

String record representations generated by this command can be opened for input as string data with the s mode of the molfile open command:

set fh [molfile open [molfile string $eh] s]

molfile subcommands

molfile subcommands

Lists all subcommands of the molfile command. Note that this command does not require a molfile handle.

molfile sync

molfile sync filehandle

This command synchronizes the file contents with the file system. The I/O modules for most file formats automatically perform a simple file buffer flush upon finishing the output of a record, so this command is needed only under special circumstances: when complete file system synchronization is required, the file was written without immediate commits, the I/O module for the file format provides a special synchronization function, or the output was done via asynchronous I/O. In any case, every file is fully synchronized when it is closed, so calling this function for normal output operations is not required.

The command returns the file handle.

molfile toggle

molfile toggle filehandle

Switch a file from input to output, or vice versa. If the file was in write, append or update mode when the command is executed, the file is rewound and the read pointer is now pointing to the first record, or the original end point for append files. If the file was configured for input, the file output mode is changed to append if the file is a normal file. If the file is a scratch file, the file is truncated to an empty file and the write position set to the first record.

Not all file types can be toggled. Special file types, except FTP streams, cannot be toggled, and it is not possible to toggle a simple disk file which was originally opened in read-only mode (see the molfile open command).

The command returns the molfile handle.
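As a sketch of the write-then-read pattern, the following lines write one record to a scratch file and then switch the handle to input. The example assumes that a handle created with molfile open scratch starts out in an output mode and that the default output format is acceptable; neither is guaranteed by this section.

```tcl
# Assumption: the scratch handle is initially open for output.
set fh [molfile open scratch]
molfile write $fh [ens create C1CC1]
molfile toggle $fh          ;# rewinds; the read pointer is at the first record
set eh [molfile read $fh]   ;# read back the record just written
molfile close $fh           ;# scratch files are deleted when closed
```

Toggling the scratch handle back to output would truncate it to an empty file, so any data still needed should be read before a second toggle.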

molfile truncate

molfile truncate filehandle ?record?

Truncate a file. If no explicit record is given, the file is truncated after the current record. In case the current record count of the file is less than the specified record, the command raises an error.

Only files which are rewindable can be truncated. In addition, the program must have write permission to the file, although it is not required that the file handle is opened for writing. The I/O modules for file formats which are not a simple record sequence must provide a truncation function or the operation will fail.

The command returns the molfile handle.

molfile unlock

molfile unlock filehandle propertylist/molfile/all

Unlock property data for the file object, meaning that they are again under the control of the standard data consistency manager.

The property data to unlock can be selected by providing a list of the following identifiers:

Property names
Valid property instances on the file object are unlocked. Non-existent data is silently ignored. It is not possible to unlock individual property fields.
all
All valid file object properties are unlocked.
molfile
This is an object class identifier. All property data which is controlled by the molfile major object and attached to the specified object class is unlocked. Since files do not incorporate minor objects, this identifier is equivalent to all.

Property data locks are obtained by the molfile lock command.

The return value is the molfile handle.

molfile upgrade

molfile upgrade filehandle

If the I/O module provides a function to upgrade the format of an older file to the latest version of the format, for example after a support library upgrade, that function may be used. The only format which currently supports this feature is BDB.

The command returns the molfile handle.

molfile valid

molfile valid filehandle propertylist

Returns a list of boolean values indicating whether values for the named properties are currently set for the structure file. No attempt at computation is made.

Example:

if [molfile valid $fhandle F_COMMENT] {...}
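Since the command returns one boolean per property, several properties can be checked in a single call. A minimal sketch, assuming that $fhandle is an open molfile handle; the two property names are taken from other examples in this section and serve only as illustrations.

```tcl
# Assumption: $fhandle is an open molfile handle.
set props [list F_COMMENT F_GAUSSIAN_JOB_PARAMS]
# Walk the property names and the returned boolean list in parallel.
foreach prop $props isset [molfile valid $fhandle $props] {
    puts "$prop: [expr {$isset ? "set" : "not set"}]"
}
```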

molfile vappend

molfile vappend filehandle objectlist

Virtually append records to an open file handle. The underlying file is not modified, but all future input operations on this file behave as if the extra records were present.

Because no actual output is generated, this command can only be applied on files opened for reading, not output files. In addition, the file handle needs to refer to a normal disk file and to support going backwards in the file, i.e. this command cannot be used on structure files opened via URLs, standard I/O channels, socket connections or composite virtual files with multiple physical files or the contents of a directory. The file format must support multiple records and the records must be encoded as a simple concatenated byte sequence. Examples for formats which work are SMILES or SD files for structures, or RXN or RD files for reactions.

The object list may contain ensemble, reaction or dataset handles. The data is split into virtual records according to the storage capabilities of the file. The format of the data written to the virtual records can be controlled by setting the writelist, droplist and hydrogens status attributes on the file handle.

When executed for the first time on a file handle for which the record count is yet unknown, the existing file records must be tallied and all current physical record positions be registered. For very large files, this can take some time. However, this is not equivalent to reading the complete file, so it does not consume much memory and the command can in principle work on arbitrarily large files.

Virtual records are held as string images in memory. A couple of thousand such records should not be a problem for typical workstations, but for systematic editing of large files where every record is touched an explicit scripted input/output loop is preferable.

The return value is the new record count of the file.

Changes to the file can be committed to disk by means of the molfile vrewrite command.

Example:

molfile vappend $fhandle [ens create c1ccccc1]
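The next sketch appends several objects at once and sets the writelist attribute beforehand. It assumes that $fhandle is an SD file opened for reading and that E_NAME is a property the format can store; on ensembles without a name the value may simply be omitted from the virtual record.

```tcl
# Assumption: $fhandle is a multi-record disk file opened for reading.
molfile set $fhandle writelist [list E_NAME]
set objects [list [ens create c1ccccc1] [ens create C1CC1]]
# The return value is the new (virtual) record count of the file.
set count [molfile vappend $fhandle $objects]
puts "file now appears to contain $count records"
```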

molfile vdelete

molfile vdelete filehandle recordlist

Virtually delete records from an open file handle. The underlying file is not modified, but all future input operations on this file behave as if the specified records had been deleted.

The record list is a list of integer values, with one as the first file record. The list does not need to be sorted, and duplicate record numbers or record numbers out of range are ignored. It is possible to virtually delete file records which are themselves virtual, i.e. were added by the vappend, vreplace or vinsert subcommands and are not physically present in the file.

Virtually deleted records have negligible memory demands, but will slightly slow down input operations on edited files.

The return value is the new record count of the file.

Changes to the file can be committed to disk by means of the molfile vrewrite command.

Example:

molfile vdelete $fhandle [list 3 9 6]
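A complete edit cycle, from opening the file to committing the deletions, might look like the following sketch. The file names are hypothetical.

```tcl
# Hypothetical file names, for illustration only.
set fh [molfile open myfile.sdf]
set count [molfile vdelete $fh [list 3 9 6]]   ;# new virtual record count
puts "$count records remain"
# Commit the edits to a different file; myfile.sdf itself stays untouched.
molfile vrewrite $fh myfile_pruned.sdf
molfile close $fh
```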

molfile vinsert

molfile vinsert filehandle objectlist

Insert virtual records for the specified objects into the file. The insertion position is before the current read position.

Except for the difference in the location where the virtual records are inserted, the command is equivalent to the molfile vappend command and has the same features and limitations. Please refer to that command for details.

molfile vreplace

molfile vreplace filehandle objectlist

Insert virtual records for the specified objects into the file. The current input record is virtually overwritten.

Except for the difference in the location where the virtual records are inserted, and the fact that an existing record is replaced, the command is equivalent to the molfile vappend command and has the same features and limitations. Please refer to that command for details.

It is possible to replace a record which is itself virtual, i.e. was introduced by a vappend, vinsert or vreplace subcommand. If more than one output object is passed, or the object is written as multiple file records, additional virtual records are created and the record count of the file increased accordingly.

Example:

set eh [molfile read $fh]

ens expand $eh

molfile backspace $fh

molfile vreplace $fh $eh

ens delete $eh

This command sequence virtually replaces a record with a version where superatoms are expanded.

molfile vrewrite

molfile vrewrite filehandle ?filename?

Commit all virtual record additions, deletions or replacements to a physical file. If no file name is given, the current file name is used. After writing, the file handle remains valid. It is open for reading, and positioned before the first record. At this moment, the file no longer contains any virtual modifications, but the file handle may again be subjected to virtual edit operations. In case a file name is specified, and is not the same as the name of the current file, the file handle refers to the new file when the command has finished.

All valid records are copied verbatim to the new file, without going through decoding and re-encoding of records (see the molfile copy command). A temporary file in the same directory as the current file is created, and sufficient disk space needs to be present to hold both the original file and the edited version at the same time. In case a problem occurs, the temporary file is deleted and the current file remains active. Only if all write operations succeed is the old file deleted and the temporary file renamed if necessary. In case a file name is specified, and it is not the same as that of the current file, the original file remains untouched, but is no longer linked to the molfile handle. For large files, this operation can take some time because massive amounts of data may need to be moved.

If the file referenced by the file handle has not been edited with virtual record operations (vappend, vdelete, vinsert, vreplace), the command does nothing and is equivalent to a molfile rewind.

The command returns the number of records written.

Example:

set fh [molfile open "myfile.sdf"]

molfile vinsert $fh 1 [ens create c1ncccc1]

molfile vrewrite $fh "myfile_with_pyrdine_inserted_in_rec_1.sdf"

molfile write

molfile write filehandle ?objecthandle?...

This command writes structure and reaction data to a file. Object handles may be ensemble handles, reaction handles, dataset handles, or molfile handles.

If an object is an input molfile handle, objects are read from the file until EOF is encountered if the output file supports multiple records. If the output file type is single-record, only the next record is read. The types of objects which are collected from the input molfile handle are dependent on its read scope. These objects are then treated as if they were used as parameter objects directly. Objects obtained via a molfile handle are automatically deleted after they have been written. If the input file is already at EOF when the command is executed, no objects are read, and no error is generated. However, this does not trigger the NULL record output handling described below, because the file object was specified as an argument.

The type of data which is actually written to the file depends on its format. A file opened for ensemble output can be fed with any type of handle. If reactions or datasets are passed, these are taken apart and written as individual records. If the output file is a reaction file, and an ensemble is passed, the reaction it is a member of is looked up and used as output object. If the ensemble is not a reaction ensemble, an attempt is made to store it as a plain ensemble outside any reaction. If the output routine rejects this, an error is raised. In case of datasets passed as objects for reaction output, the individual dataset objects (ensembles or reactions) are written, in combination with reaction reference substitution in case ensembles instead of reactions are found. For full-dataset output, it is legal to pass non-dataset objects. No dataset-level information is written and the objects are stored as an anonymous dataset.

It is legal to supply no object handles at all. Normally, this means that simply no output is performed. However, I/O modules for specific file formats may support the output of special NULL records. In that case, the output function is called once without any objects. Gaussian job files are an example: they allow you to write records in multi-link files, where the computation instructions are taken from the file property F_GAUSSIAN_JOB_PARAMS, without supplying a structure record.

As part of the output process, new information may be computed on the objects. In case the active settings on the output molfile handle demand a structural change of an object, for example the addition or removal of hydrogen atoms, or the re-coding of ionic versus pentavalent nitro groups and similar functionality, the write objects are temporarily duplicated and these duplicates undergo the structure changes. The original output objects are never indirectly edited in their connectivity by this command.

The writelist attribute of molfiles may be set to a list of properties which should be included in the output. This has an effect only for file formats which support the storage of custom data values and which can cope with the data types of the listed properties. By default, no attempt is made to actively compute these properties for output. If they are not present in the input data, their output is silently omitted, or NULL values are written, depending on how the output format encodes these things. However, if the computeprops flag is set on the output molfile, an attempt for computation is made, and after output, the objects retain this additional data if the computation succeeds.

If the hydrogen set mode of the output molfile calls for a change in hydrogen status, the stage when these computations are performed depends on the hydrogen addition mode. If the output mode calls for potential hydrogen additions, the computations are executed after the addition - and this means, on the temporary duplicate, so the original object does not see the new property data. If the hydrogen mode does not change the hydrogen set, or potentially removes hydrogens, computations are performed on the original objects and then the object is potentially duplicated, with all its data, for hydrogen removal and output. In the latter case, the additional property data is visible on the original input objects.

The command returns a list of the object handles which were actually written to file. In cases like a reaction being split into ensembles, or a dataset taken apart, this is not necessarily the same object handle collection as the input object list. For output from an input molfile argument, the total number of objects written is returned instead, because the read objects are not retained.

Examples:

molfile write "myfile.sdf" $eh1 $eh2

set fhandle [molfile open z.cbin w hydrogens add format cbin]

molfile write $fhandle $dset1

molfile write $fhandle $dset2

molfile close $fhandle

The first sample line uses the single-shot file operation feature of the molfile command. Instead of a molfile handle, a file name is passed, and that file is automatically opened, the output performed, and then the file is closed. Two ensembles are written with a single statement to the output file myfile.sdf. The desired file format is guessed from the file name suffix. No change in hydrogen status, etc. is performed, and no extra data is written out.

The next four example lines show how two complete datasets can be written to a native Cactvs toolkit binary file. Hydrogens are added to structures or reactions in the dataset - but the original dataset elements are not changed, since the addition is performed on temporary object duplicates. Also, the Cactvs binary format is requested explicitly by setting the format attribute. In this case, this is not really required, since the file format could also be guessed from the file name suffix. However, in case a non-standard file name suffix is used, formats must be specified explicitly, or the default format (MDL SD-file) is used. If the Cactvs binary file is later opened for reading with a read scope of dataset, all dataset elements plus the dataset-level property data can be recovered.
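Passing an input molfile handle as the object argument gives a compact file conversion. The following sketch uses hypothetical file names and assumes that the output format accepts multiple records, so that the input file is read until EOF.

```tcl
# Hypothetical file names, for illustration only.
set fhin  [molfile open input.sdf]
set fhout [molfile open output.cbin w format cbin]
# With a molfile handle as source, the return value is the number of
# objects written, not a list of object handles.
set nwritten [molfile write $fhout $fhin]
puts "$nwritten objects written"
molfile close $fhin $fhout
```

The objects read from the input handle are deleted automatically after they have been written, so no explicit cleanup of ensembles or reactions is needed in this pattern.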