The molfile Command

The molfile command is the generic command used to manipulate chemical structure and reaction files. These can be of any supported format, not just MDL molfiles.

Molfiles are major objects. They are uniquely identified by their label alone. Molfiles do not contain minor objects.

Example:

set fhandle [molfile open myfile.sdf]
set ehandle [molfile read $fhandle]
molfile get $fhandle record

As explained in more detail in the section about working with structure files, the molfile handle identifier can be replaced by a file name. This file is automatically opened, the command executed, and the file closed in a single one-shot operation.

In the context of structure files, file-related data is usually provided as attributes. However, molfiles can store property data like any other chemistry object.

Example:

molfile get $fhandle F_COMMENT

This is the list of currently officially supported subcommands:

molfile append

molfile append filehandle property value ?property value?..

Standard data manipulation command for appending property data. It is explained in more detail in the section about setting property data. This is not a command to append file records. Use the molfile write command for this purpose.

Example:

molfile append $fh F_GAUSSIAN_JOB_PARAMS(route) "Opt=(AddRed,CalcFC)"

molfile backspace

molfile backspace filehandle ?nrecords?

Position the file pointer backwards. If no record counter is specified, the file is backspaced by a single record. It is an error to attempt to reposition the file before the beginning of the file.

Examples:

molfile backspace $fh
molfile set $fh record [expr {[molfile get $fh record] - 1}]

These two sample lines provide identical functionality.

The molfile backspace command is often used in combination with the molfile copy command in order to copy records with specific properties verbatim:

set eh [molfile read $fh]
if {[structure_passes_condition $eh]} {
	molfile backspace $fh
	molfile copy $fh $outfilehandle
}

molfile close

molfile close ?filehandle? ...
molfile close all

Close one or more file handles. If the file handle corresponds to a scratch file, the file is deleted. If it corresponds to a pipe, all programs in the pipe are shut down.

If all is passed instead of a set of file handles, all currently opened structure files are closed. Standard Tcl files are not affected.

It is a good idea to close files when they are no longer needed. In addition, while most file format I/O modules commit all data to disk after each record has been written, so that a clean close-down is not absolutely required, there are file formats for which the I/O module has a cleanup or finalization routine which is only called if the file is properly closed.

The command returns the number of files which were closed.

Example:

set fhandle [molfile open scratch]
molfile close $fhandle

The example closes a scratch file, which is automatically deleted from disk when it is closed.

On normal interpreter program exit, the close functions of all remaining open file handles are automatically called.
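
The close all form can be sketched as follows. The file names a.sdf and b.sdf are hypothetical placeholders for any two readable structure files.

```tcl
# Hypothetical input files; any readable structure files would do.
set fh1 [molfile open a.sdf]
set fh2 [molfile open b.sdf]

# Close every open structure file handle in one call.
# Standard Tcl channels created with [open] are not affected.
set nclosed [molfile close all]
puts "closed $nclosed structure file(s)"
```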

molfile copy

molfile copy filehandle ?channel? ?count? ?record?

Copy a record to a Tcl channel, to a Cactvs structure file handle, or retrieve it as a byte image. No interpretation or formatting of the data in the file record(s) takes place - the data is copied verbatim, byte by byte.

If file format conversion is desired, the data items (ensembles, reactions, datasets) must be explicitly read (molfile read command) as chemistry objects and written to another molfile opened for output in the desired format (molfile write command). That procedure involves re-formatting and potential loss of formatting or information which was not captured by the input routine, or cannot be written by the output routine.

By default the next record after the current file pointer position is returned as a byte image. The optional parameters allow the selection of a specific record (beginning with 1 for the first record), the copying of multiple records in one command (by default, a single record is copied), and output to alternative Tcl channels or Cactvs molfile structure file handles. If an empty string or the value 0 is used as the start record number, the file is copied from the current position. If the record number is negative, it is interpreted as an offset from the current position. Therefore, passing -1 as parameter instructs the command to backspace by one record prior to copying. Not all files can be backspaced. If the special count values end or all are used, all remaining records in the input file are copied. Otherwise, if the number of available records is smaller than the requested copy count, an error results.

If the output channel argument is omitted, or set to an empty string, the record(s) are returned as a byte sequence command result. Otherwise, the data is written to the file handle the argument is connected to. For Cactvs molfile handles, the destination is the current write position of the underlying file handle. On Unix/Linux systems, writable active Tcl file or socket handles (in the form filexxx or sockxxx) are also supported, but not on Windows. Additionally, the special output channel names stdout and stderr can be used. If output is written to a channel, and not returned as a blob, the number of actually copied records is returned as the command result.

The I/O modules for ctx and sdf formats provide optimized fast copy routines and are thus notably faster to copy than other file formats without explicitly encoded record positions. These still need to read the file line by line and maintain a parser state, though they can avoid decoding the record contents as structures or reactions.

Example:

set eh [molfile read $fhandle]
set fhout [open "metal_compounds.sdf" w]
if {[ens atoms $eh metal exists]} {
	molfile copy $fhandle $fhout 1 [expr {[molfile get $fhandle record] - 1}]
}

This example reads a structure from an input file, checks whether it contains a metal atom, and if yes, copies the record unchanged to an output file, which is opened as a simple Tcl text file channel in this example. The expression which forms the last parameter backspaces the input file by one record, so that the same record which was just read can be copied. A simpler solution for the same functionality is to pass -1 as the record argument. This of course works only if the input file can be repositioned backwards, i.e. normal text files are fine, but standard input or a socket connection does not work.
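
The simpler variant with a relative record number might look like the sketch below. It assumes $fhandle is an already opened, rewindable input molfile handle; the output file name is the one used in the example above.

```tcl
# $fhandle is assumed to be an open molfile handle on a rewindable file.
set fhout [open "metal_compounds.sdf" w]
set eh [molfile read $fhandle]
if {[ens atoms $eh metal exists]} {
    # Record -1: step back one record from the current position,
    # then copy a single record verbatim to the Tcl channel.
    molfile copy $fhandle $fhout 1 -1
}
close $fhout
```

As noted above, output to a plain Tcl channel is only supported on Unix/Linux systems.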

molfile count

molfile count filehandle ?maxrecs? ?readscope?

Count the number of records in the file.

If the file format contains an internal or external record index with information about the complete file, the answer is produced from the index, and thus is typically obtained fast. Otherwise, the file is skipped from the current position until the end, and the sum of the number of records encountered while skipping and the record index when the count started is returned. In case of files which are rewindable, the original input file pointer position is then restored. On non-rewindable files, the file contents are consumed, and no return to the old input position is possible. For files which are opened for writing, the count usually is simply the current output position, except for those few file formats which support in-file record replacement in combination with a complete file index. In the latter case, the count is again extracted from the index.

During the record skipping part the file contents are not physically read if possible. Rather, the skip function of the responsible file format I/O module is used to scan the file efficiently. After arriving at the end of the file, a full in-memory record position index has been assembled for the file, and future record selection within files which support re-positioning is fast.

The type of record boundaries counted depends on the input scope of the file. For file formats which support multiple input modes, such as for extraction of ensembles or molecules or datasets, the count is dependent on the type of object which is configured to be read. If the file input object type is changed, the in-memory record index table is discarded.

If the maxrecs parameter is specified, and is not a negative number, it is the maximum count reported. No attempt is made to position the file beyond this mark during the count process. This has no effect on future input operations - these may still proceed beyond the reported count. This option is not intended to be generally useful, but is used for example in the structure browser csbr with the -m option to enable quick inspection of a file without full scanning.

The optional readscope parameter can be used to temporarily modify the read scope under which the file is processed. It can be any of the generally recognized values (mol, ens, reaction, dataset). If the file format does not support the specified mode, its default mode is silently used. If the file is not positioned at the beginning of the data, the count reports the sum of the currently known records as perceived by the previous read scope, and the remaining file records under the new one. If these values are different, the result may only be useful under very specific circumstances. If the parameter is not set, or an empty string is passed, the currently set read scope or, for one-shot file operations, the default read scope is used.

Example:

set nrecs [molfile count "thefile.sdf"]
set nrecs [molfile count "test.spl" -1 mol]
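
A capped count, as described for the maxrecs parameter, could be used as in this sketch. The handle $fhandle and the limit of 1000 are illustrative assumptions.

```tcl
# $fhandle is assumed to be an open input molfile handle.
# Stop counting after at most 1000 records to avoid a full file scan.
set limit 1000
set n [molfile count $fhandle $limit]
if {$n >= $limit} {
    # The reported count is capped, so the file may hold more records.
    puts "file contains at least $limit records"
} else {
    puts "file contains $n records"
}
```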

molfile dataset

molfile dataset filehandle

Return the handle of the dataset associated with the file handle. If no such dataset is set, the command returns an empty string. The command

molfile get $filehandle dataset

is equivalent.

This command is different from the dataset commands for ensembles, reactions or tables, where it indicates membership in a dataset. File objects cannot be a member of a dataset. This dataset association is explained in more detail in the molfile set command section.

molfile defined

molfile defined filehandle property

This command checks whether a property is defined for the structure file. This is explained in more detail in the section about property validity checking. Note that this is not a check for the presence of property data! The molfile valid command is used for this purpose.

molfile delete

molfile delete filehandle recordlist ?rebuild_index?

Delete records from the file. The file must have been opened for writing or update, and be rewindable. In case the file is not a simple record sequence, the I/O module for its format must provide a deletion function, or the operation will fail.

The deletion record list is a set of record numbers in any order. They are sorted and duplicates removed. It is no error to specify an empty removal record list. The record numbering starts with one, and the record numbers are referring to the record numbering at the moment the command is issued. There is no need to compensate for intermediate record numbering shifts when more than one record is deleted.

The optional index rebuild parameter, a boolean value, can be set to optimize the deletion process for files in formats which maintain field index information. By default, indices are updated as part of the deletion process. In case many records are deleted, it may be more efficient to drop the indices prior to the deletions and rebuild them after the records have been removed. In order to select this alternative procedure, a true parameter value can be set. At this time, the only file format which actually can use that parameter is the bdb database file format.

In case the file is to be truncated, the molfile truncate command is usually more efficient.

This command returns the number of deleted records. It does not close or destroy the file handle, or the underlying file.
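
A minimal sketch of a deletion call, assuming $fhandle refers to a rewindable file which was opened for writing or update and holds at least twelve records:

```tcl
# Record numbers refer to the numbering at the time the command is issued.
# Order does not matter and the duplicate entry is removed automatically.
set ndeleted [molfile delete $fhandle {7 3 3 12}]
puts "deleted $ndeleted record(s)"
```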

molfile dget

molfile dget filehandle propertylist ?filterset? ?parameterlist?

Standard data manipulation command for reading object data. It is explained in more detail in the section about retrieving property data.

For examples, see the molfile get command. The difference between molfile get and molfile dget is that the latter does not attempt computation of property data, but rather initializes the property values to the default and returns that default if the data is not yet available. For data already present, molfile get and molfile dget are equivalent.
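
The following sketch illustrates the default-returning behavior. It reuses the F_COMMENT property from the earlier examples and assumes, purely for illustration, that the default value of this property is an empty string.

```tcl
# $fhandle is assumed to be an open molfile handle.
# dget never triggers a computation: if F_COMMENT holds no data yet,
# the property is initialized to its default and that default is returned.
set comment [molfile dget $fhandle F_COMMENT]
if {$comment eq ""} {
    # Assumption: the default of F_COMMENT is an empty string.
    puts "no file comment stored"
}
```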

molfile dup

molfile dup filehandle

This command duplicates a file handle. The duplicate handle points to the same underlying file or other data channel, is opened in the same access mode, and positioned at the same record. Also, all file object attributes and file properties are set to identical values.

Currently, it is not possible to duplicate virtual file sets opened by a molfile lopen command.

The command returns a new file handle.

molfile exists

molfile exists filehandle

Check whether a file handle is currently in use. The return value is the boolean result. No error is raised if the file handle cannot be decoded.
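
Because no error is raised for invalid handles, the command is convenient in cleanup code. A short sketch, with fhandle as an assumed variable name:

```tcl
# Close the handle only if the variable is set and still refers to an open molfile.
if {[info exists fhandle] && [molfile exists $fhandle]} {
    molfile close $fhandle
}
```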

molfile extract

molfile extract filename retrievallist

Extract the contents of data fields from the file, without reading full structure or reaction records if possible. This operation requires a support function in the I/O module for the file format. Generally, only formats optimized for query operations, such as the Cactvs bdb and cbs formats, provide such a function in their I/O module.

This command is essentially a shortcut for a molfile scan command with an empty query condition and a propertylist retrieval mode. Please refer to that command for details about the possible contents of the retrieval list.

The result is a nested list of extracted property values, with one outer list element for every file record to the end of the file, and an inner list with one element per retrieval field.

molfile filter

molfile filter filehandle filterlist

Check whether the structure file passes a filter list. The return value is 1 for success and 0 for failure.

Example:

molfile filter $fhandle $filter

molfile fullscan

molfile fullscan filehandle queryexpression ?mode? ?selectlist? ?parameters?

This command is the same as molfile scan, except that an automatic rewind (see molfile rewind) is performed before the query is executed. The same effect can be achieved by setting the startposition parameter value to 1.

molfile get

molfile get filehandle propertylist ?filterset? ?parameterlist?
molfile get filehandle attribute

Standard data manipulation command for reading object data. It is explained in more detail in the section about retrieving property data.

The molfile object possesses a rather extensive set of built-in attributes, which can be retrieved with the get command (but not its related subcommands like dget, sqlget, etc.). Most of them can also be manipulated with a set command. In addition, molfile objects can possess file-level properties. The standard prefix for these is F_.

Example:

set c [molfile get $fhandle F_COMMENT]

These built-in attributes are:

append - all file output is appended to the end of the file, ignoring the current write pointer position.

binary - the file is binary, without a line structure.

bzip2-compressed - the file is accessed via a pipe to the bzip2 program.

checkedbinary - the file contents were checked to determine whether they contain non-ASCII characters.

edited - the file contains virtual edited records, or virtual deletes.

fakeposition - the file has no meaningful offset positions for the beginnings of records, the offset data structures contain other forms of access information

gzip-compressed - the file is accessed via a pipe to the gzip program.

incomplete - the last file record was not read completely. This can be intentional in file formats which support basic and extended data groups, or can be an indication of a non-critical decoder problem.

indexed - the file is accessed via an index file with record positions, not directly.

nommap - memory-mapping of the file contents is suppressed.

initialized - an initialization function in the associated I/O module has been called.

locked - there is currently a flock()/lockf() style file lock active on the file.

mapallocated - the memory mapping arena for the file was allocated and filled via some read operation, not mmap()ed.

memlocked - the mapping of the file is locked into memory and is not swapped out.

readable - the file handle can be read from.

readonly - the file has been opened for read-only access, without the possibility to switch the handle to a different mode.

remotefs - the physical file resides on a non-local file system.

rewindable - the file can be rewound if necessary

scratch - the file is a scratch file and is automatically deleted when the file is closed.

shared - the file contents reside in shared memory.

validcount - the current number of known positions is known to correspond to the total of records in the file.

virtual - the file is a virtual file built from multiple physical files.

ucs2-encoded - the file is accessed via a pipe to the iconv program.

url - the file is accessed via a URL, not a file system path.

updating - the file is currently being updated

writeable - the file handle can be written to.

xdr - the file is associated with an XDR encoder or decoder structure.

none - no flags

computeprops - attempt to compute properties in the write list if they are not yet present in the output objects.

miniheader - keep the file header as concise as possible.

multiwriter - prepare the file to handle multiple simultaneous writers. The only file format I/O module which currently supports this is BDB.

noimplicith - do not output hydrogen atoms which were added as implicit atoms.

nopropertymapping - always synthesize property descriptions, do not attempt to map them onto existing standard system definitions. The only module currently supporting this feature is the PubChem ASN.1 module.

nostereo - do not write stereo information into the file, even if present in the output structures.

nostereoperception - do not attempt to perceive stereochemistry from the available object data such as 2D coordinates and wedges, or 3D atomic coordinates, even if the file format normally requires this information.

omitct - if the inclusion of a structure connectivity table is optional, this flag can be used to suppress the output of this block.

pedantic - perform pedantic output format checking, for example by refusing to write long lines in text formats which exceed the exact format specification, or refusing to write structures with more atoms than officially supported.

rawcoordinates - do not perform any coordinate checking, scaling, and centring but write the coordinates exactly as they are currently stored.

recalcbaseprops - if the output file content is a single property (for example E_GIF for GIF or PNG files, E_EMF_IMAGE for EMF and WMF files), force recalculation of this property before output.

supergroupexpansion - If a file format can either be written with expanded or contracted superatom groups (specified as type SUP in property G_TYPE and group label in G_NAME), the default is to write them contracted. If this flag is set, the expanded form is used instead. This option affects only a few file formats (currently cdx and cdxml). It does not perform expansion of superatoms which are only present as a single pseudo atom in the ensemble by decoding their tag (see the ens expand command to achieve this). Rather, it expects the full set of atoms of the expanded form in the ensemble, plus one or more properly set up group objects indicating the atoms of the expanded form of a functional group or fragment which are not shown in the contracted style. If these groups are present, only the first atom in any group is shown, with the G_NAME data as atom tag, which overrides all other label information. However, the output file still contains the hidden atoms and their data. Tools like ChemDraw use this data to support interactive group expansion utilizing the original layout coordinates of the previously hidden atoms and other information.

synchronous - use synchronous writes for files which normally use buffering to increase performance, for example in the bdb format.

splitmol - Split output into individual ensembles and write each molecular fragment as a separate record.

upgrade - if this flag is set, and the format of a file is not of the most current version, but there is an upgrade function available in the support library, invoke the upgrade function to change the file layout to the most current version. The bdb module is the only one which currently supports this feature.

write0d - write records without coordinates if possible

write2d - write 2D records if possible

write3d - write 3D records if possible

writearo - write aromatic bonds instead of a Kekulé form if the file format supports this. An example where this makes sense are SMILES files. A counterexample are MDL Molfiles - you can enforce the encoding of aromatic bonds of non-query structures as the aromatic query bond type with this option, but that is technically incorrect and violating the format specification. Nevertheless, there are third party programs which require data in that format aberration for further processing.

writecolor - write atom and bond colouring information if this is an optional part of the file format specification.

writeenzymes - if the output data contains enzyme superatoms, include them in the output if that is an option. The SDF3000 I/O module is an example for a module recognizing this flag.

writelabels - write explicit atom labels, as defined in the attribute atomlabelproperty, if the file format supports it. This does not override the natural numbering of the written atom objects. It only applies to formats which support a parallel user-defined labelling scheme, such as CDX/CDXML.

writename - write a structure name section if this is optional information in the output. An example are SMILES files.

The attribute list above is also referenced by the molfile set command. This is the reason why it contains information about the read-only status of the individual attributes. Only attributes that can be set can be addressed by the molfile set command.

For the use of the optional property parameter list argument, refer to the documentation of the ens get command.

Filters in the optional filter set must apply directly to the file object. Filters which operate on other object types are ignored.

Variants of the molfile get command are molfile new, molfile dget, molfile nget, molfile show, molfile sqldget, molfile sqlget, molfile sqlnew, and molfile sqlshow. These only apply to retrieval of file-level property data, not the attributes.

molfile getline

molfile getline filehandle ?skiprecord?

Read a text line from the file, with repositioning of the file pointer. This operation is only possible on text files which have been opened for reading. The command is not frequently used, because it tends to disrupt the normal file record parsing.

If the skiprecord boolean argument is set, the file is positioned to the beginning of the next record after the line has been retrieved.

The command returns the line read. Line termination characters are removed.

molfile getparam

molfile getparam filehandle property ?key? ?default?

Retrieve a named computation parameter from valid property data. If the key is not present in the parameter list, an empty string is returned. If the default argument is supplied, that value is returned in case the key is not found.

If the key parameter is omitted, a complete set of the parameters used for computation of the property value is returned in key/value format.

This command does not attempt to compute property data. If the specified property is not present, an error results.

Example:

molfile getparam $fhandle F_QUERY_GIF format

returns the actual format of the data in that property, which could be a GIF, PNG or a bitmap format.
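
The optional arguments can be combined as in the sketch below. It assumes $fhandle is an open molfile handle; the fallback value unknown is an arbitrary illustrative choice.

```tcl
# Guard the calls: the command raises an error if the property
# data is not present, because it never attempts a computation.
if {[catch {
    # Fall back to "unknown" if the format key is missing from the parameter list.
    set fmt [molfile getparam $fhandle F_QUERY_GIF format unknown]
    # Omit the key to obtain all computation parameters in key/value format.
    set params [molfile getparam $fhandle F_QUERY_GIF]
} msg]} {
    puts "F_QUERY_GIF is not available: $msg"
} else {
    puts "format: $fmt"
    puts "all parameters: $params"
}
```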

molfile hloop

molfile hloop filehandle objvar ?maxrec? body

This command is functionally equivalent to the molfile loop command. The difference is that for the duration of the loop command hydrogen addition is enabled for the file handle. The original hydrogen addition mode of the file object is restored when the loop finishes.
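
A minimal sketch, assuming $fhandle is an open input molfile handle and using the E_WEIGHT property from the molfile loop example:

```tcl
# Hydrogen addition is enabled only while the loop runs.
# The previous hydrogen addition mode of the handle is restored afterwards.
molfile hloop $fhandle eh {
    puts [ens get $eh E_WEIGHT]
}
```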

molfile hread

molfile hread filehandle ?datasethandle/enshandle? ?recordcount?

This command is identical to the molfile read command, except that standard hydrogen addition is enabled for the duration of the command. The original hydrogen mode is reset when the command completes.

Example:

set eh [molfile hread "myfile.mol"]

This is a simple single-record structure input with hydrogen addition, using a file name instead of a file handle. The file is automatically opened for the duration of the command and closed afterwards.

molfile list

molfile list ?filterlist?

This command returns a list of the molfile handles currently registered in the application. This list may optionally be filtered by a standard filter list.

Example:

molfile list

lists the handles of all open molfiles in the application.

molfile lock

molfile lock filehandle propertylist/objclass/all ?compute?

Lock property data of the file handle, meaning that it is no longer subject to the standard data consistency manager control. The data consistency manager deletes specific property data if anything is done to the file handle which would invalidate the information. Property data remains locked until it is explicitly unlocked.

The property data to lock is selected by providing a property list, an object class name, or the keyword all.

The lock can be released by a molfile unlock command.

This command is a generic property data manipulation command which is implemented for all major objects in the same fashion and is not related to disk file locking. Disk file locks can be set or reset by modifying the molfile object attribute lock. This is explained in more detail in the paragraph on the molfile get command.

The return value is the molfile handle.

molfile loop

molfile loop filehandle objvar ?maxrec? body

Execute a loop over the file. Objects are read from the file from the current file position onwards. The type of object read (usually ensemble or reaction, but in principle also a table or dataset object) depends on the read scope of the file. The handle of every object input from a file record is assigned to the specified Tcl object variable. Next, the Tcl script code in the body argument is executed. The body code typically uses the value of the variable to perform some operations with the currently read object. After the body code has been executed, the object which was just read is deleted, and the cycle is repeated, either until EOF has been reached on the file (the default), or the maximum number of records specified by the optional parameter has been reached, whichever comes first. In either case, no error is generated when the end of file has been reached. Setting the maximum record count parameter to an empty string, or to a negative value, results in the default processing style running until the end of the file.

Within the body, the standard Tcl break and continue commands work as expected. If the loop code generates an error, the loop is terminated and the error reported. Programs should not expect that the same object handle value stored in the variable is reused in each iteration.

Since the input objects are automatically deleted after they have been processed, it is not required to delete them in the loop code. Deletion requests on the loop object executed within the loop are ignored. Any other operation on the structure object is allowed. The loop code may perform repositioning operations on the input file, but not close it.

The return value is the number of processed records.

Example:

set th [table create]
table addcol $th E_NAME
table addcol $th E_WEIGHT
molfile loop $myfile eh {
	table addrow $th #auto end [list [ens get $eh E_NAME] [ens get $eh E_WEIGHT]]
}

This sample loop successively reads all records from the file and stores the ensemble handles in variable eh. In the loop body, the handle is used to extract name and molecular weight information from the structure and store it in a table object.
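
The record limit and the Tcl continue command can be combined as in the following sketch. The limit of 100 and the metal atom test (borrowed from the molfile copy example) are illustrative choices.

```tcl
# Process at most 100 records, starting at the current file position.
set nrecs [molfile loop $myfile eh 100 {
    # Skip structures without a metal atom.
    if {![ens atoms $eh metal exists]} {
        continue
    }
    puts [ens get $eh E_NAME]
    # No explicit deletion needed: the loop object is removed automatically.
}]
puts "processed $nrecs record(s)"
```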

molfile lopen

molfile lopen filelist ?mode? ?attribute value?...

Open a list of files as a virtual file. The files identified by the file list items are implicitly concatenated in the list order. In addition to normal files, the standard set of special input types such as URLs, pipes, Tcl file handles or standard channels may be used. This command returns a single file handle, regardless of the number of input files passed as parameter.

A file list can only be opened for read operations on input objects. Writing, appending, updating or string input are not supported.

Most input file operations can be performed on virtual files. One important exception is currently file scanning with query expressions. This only works for lists of standard sequential files, not files which contain optimized query layouts, such as the native Cactvs CBS and BDB file formats. These can only be used as a single file for molfile scan commands. However, simple structure input is possible across file boundaries even with these formats.

The rest of the options are processed in the same way as the standard molfile open command.

Example:

set fhandle [molfile lopen [lsort [glob *.mol]]]

molfile max

molfile max filehandle property ?filterset?

Scan the file for the maximum value of the specified property from the current read position to the end of the file. If no error occurs, the file is at end-of-file after the end of the command.

If a filter set is provided, it is applied to the objects read from the file during the scan, not the molfile object proper. Objects which do not pass the filter are ignored.

The property may correspond either to a data column in the file, or to a computable property on the structure or reaction objects read during the scan. Read objects are transient and automatically discarded. The property argument may contain a field specification, and in that case, only the field value is compared.

The maximum value determination uses the standard property comparison function associated with its data type. For properties which are implicitly defined during file I/O, an explicit property definition with a correct data type may be beneficial. For example, when testing the values of an SD data field, by default the data is read as an implicitly created string property. If the field content is actually an integer, the comparison as a string value does not yield the same results as when the data is compared as an integer. For file formats which encode a proper data type of its contents this is not necessary.

The return value is the maximum property or property field value found, or an empty string if no input was processed.

molfile metadata

molfile metadata filehandle property field ?value?

Obtain property metadata information, or set it. The handling of property metadata is explained in more detail in its own introductory section. The related commands molfile setparam and molfile getparam can be used for convenient manipulation of specific keys in the computation parameter field. Metadata can only be read from or set on valid property data.

molfile min

molfile min filehandle property ?filterset?

Scan the file for the minimum value of the specified property from the current read position to the end of the file. If no error occurs, the file is at end-of-file after the end of the command.

If a filter set is provided, it is applied to the objects read from the file during the scan, not the molfile object proper. Objects which do not pass the filter are ignored.

The property may correspond either to a data column in the file, or to a computable property on the structure or reaction objects read during the scan. Read objects are transient and automatically discarded. The property argument may contain a field specification, and in that case, only the field value is compared.

The minimum value is determined with the standard comparison function associated with the data type of the property. For properties which are implicitly defined during file I/O, an explicit property definition with a correct data type may be beneficial. For example, when testing the values of an SD data field, by default the data is read as an implicitly created string property. If the field content is actually an integer, the comparison as a string value does not yield the same results as when the data is compared as an integer. For file formats which encode the data type of their contents, this is not necessary.

The return value is the minimum property or property field value found, or an empty string if no input was processed.

molfile mutex

molfile mutex filehandle mode

Manipulate the object mutex. During the execution of a script command, the mutexes of the major object(s) associated with the command are automatically locked and unlocked, so that the operation of the command is thread-safe. This applies to builds that support multi-threading, either by allowing multiple parallel script interpreters in separate threads or by supporting helper threads for the acceleration of command execution or background information processing. This command locks major objects for a period of time that exceeds a single command. A lock on the object can only be released from the same interpreter thread that set the lock. Any other threaded interpreters, or auxiliary threads, block when they access a locked object until a mutex release command has been executed. This command supports the following modes:

There is no trylock command variant because the command already needs to be able to acquire a transient object mutex lock for its execution.

molfile need

molfile need filehandle propertylist ?mode?

Standard command for the computation of property data, without immediate retrieval of results. This command is explained in more detail in the section about retrieving property data.

The return value is the file handle.

Example:

molfile need $fhandle F_AVERAGE_ATOM_COUNT

molfile new

molfile new filehandle propertylist ?filterset? ?parameterlist?

Standard data manipulation command for reading object data. It is explained in more detail in the section about retrieving property data.

For examples, see the molfile get command. The difference between molfile get and molfile new is that the latter forces the re-computation of the property data, regardless of whether it is already present and valid.

molfile nget

molfile nget filehandle propertylist ?filterset? ?parameterlist?

Standard data manipulation command for reading object data. It is explained in more detail in the section about retrieving property data.

For examples, see the molfile get command. The difference between molfile get and molfile nget is that the latter always returns numeric data, even if symbolic names for the values are available.

molfile open

molfile open filename ?mode? ?attribute value?...
molfile open filename ?mode? ?attributedict?

This command opens a structure file or other input source for input or output. The filename argument may be any of:

This is the most common case. File names may be absolute or relative. On the Windows platform, the path naming follows the Tcl convention, with backslashes replaced by forward slashes, and optional drive letters, in the same way as the standard Tcl open command. Tilde substitution is also supported and built into the command. In case a file name could possibly collide with a reserved name, the file name can be prefixed with ./ in order to force interpretation as a file name. File name expansion can be conveniently performed by means of the standard Tcl glob command. File names must currently be spelled in the 8-bit ISO8859-1 character set. Unicode file names are not yet supported. On Unix platforms, named pipes and sockets may also be opened with this command.

Examples:

molfile open ./stdout r
molfile open ~theuser/data/newleads.sdf
molfile open C:/temp/calicheaamycin.pdb w

The file names stdout , stderr and stdin are reserved and connect the file handle to a standard I/O channel. stdout and stderr can only be opened for output, and stdin can only be read from. The character ’-’ (minus) is an alternative name for standard input.

Example:

molfile open stdout w format mdl
molfile open ./stdout 

The first line opens an MDL file for output on standard output. The second sample line opens the file in the current directory which is named “stdout” for input. By prefixing file names with directory information any file with a reserved name can be opened as standard file.

The name scratch is reserved as the name of a generic scratch file. The file is initially opened for writing, but may be switched to input later by a molfile toggle command. The magic filename is translated into the name of a platform-specific temporary file. Every invocation of this command variant generates a new scratch file, with a different name. The true file name can be obtained with an attribute query:

set fh [molfile open scratch]
set name [molfile get $fh name]

Scratch files are automatically deleted when they are closed, or when the program exits.

If a file name starts with a vertical bar character “|”, a pipe is opened from (in read mode) or to (write mode) the commands listed after the bar.

Example:

molfile open "|gzip >thefile.sdf.gz" w format mdl

When the file is closed, the pipe and all programs connected to it are automatically shut down. Pipes cannot be rewound, or switched from input to output and vice versa.

The Cactvs toolkit supports reading from various types of URLs. Currently, the schemes ftp , http, file and gopher are supported. file URLs are just another notation for normal disk files, as described above. From among the other URL schemes, only ftp and http connections may be opened for writing. The support for ftp URLs includes username and password components. If the server side supports it, passive ftp is the preferred mode. Http connections opened for writing use the PUT http command, which often is not activated in standard Web server set-ups and may therefore be of limited practical usefulness. URL connections can be rewound and backspaced, but this is costly because the existing connection has to be disconnected and the initial data from the beginning of the file to the desired position needs to be re-transferred and discarded.

Examples:

set fh [molfile open http://www.yourcompany.com/repository/jcamp/ir1.jcp]
molfile open ftp://yourid:yourpasswd@ftp.yourcompany.com/upload/ideas.sdf

If the target is a directory, all files in the directory are scanned. Those files which were identified as structure data files by any of the built-in or currently loaded I/O module extensions are concatenated to a virtual file which comprises all individual files. The order in which the files are concatenated is largely unpredictable, because it is defined by the order of the file name entries in the directory, and not any alphabetic sort criterion. The files may be of different formats, and may be any mixture of single-record and multi-record files. Subdirectories of the opened directory are not entered by default, but this may be activated by appending a ’d’ character to the open mode. Directories may only be opened for reading.

Example:

set fh [molfile open .]
set fh [molfile open $mydir rd]

The second example opens not only perceived structure files in the source directory, but also in all subdirectories thereof.

The Cactvs toolkit can read most file formats directly from a string. There is no need to write structure data which was obtained as a string image to a temporary file to decode it. Data strings are opened as structure file with mode ’s’. Only input is possible, but navigation within the string with molfile rewind etc. works as expected. The complementary molfile string command can be used to generate a string image of a file record.

Example:

set fh [molfile open $thedatablob s]
set eh1 [molfile read $fh]
set eh2 [molfile read $fh]
molfile close $fh

Any file name which begins with file or sock , and where the rest of the file name is a sequence of digits, is interpreted as a reference to a Tcl file handle.

Example:

set tcl_fh [open thefile.txt w]
set cactvs_fh [molfile open $tcl_fh w]

A Tcl handle can only be accessed by this command in a mode which is compatible with the mode it was opened with, i.e. it is not possible to write to a file via a Tcl handle if it was opened for reading. If a structure file coupled to a Tcl handle is closed with a molfile close command, the Tcl handle remains valid, and may be used freely once the association to the structure file I/O object is broken. Closing the Tcl handle while the piggybacked structure file handle is being used is illegal. No input, output or positioning should be performed on the Tcl handle with standard Tcl commands while it is being referred to by a molfile object.

This functionality is not available on Windows, because on this platform Tcl internally uses Windows handles for I/O, while the Cactvs toolkit builds on standard Posix C library FILE pointers.
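The naming rules described so far mean that a plain disk file can be shadowed by a reserved name or by something that looks like a Tcl handle. The following sketch, in plain Tcl, shows one way to guard against this by applying the ./ prefix mentioned above. The procedure plain_file_name is a hypothetical helper and not part of the toolkit, and its pattern only approximates the rules as they are worded in this section. Pipe, URL and virtual file names are not covered.

```tcl
# Hypothetical helper: return a name that is read as a plain disk file name
proc plain_file_name {name} {
    # Reserved names listed in this section
    set reserved {stdin stdout stderr - scratch}
    # Names such as file5 or sock12 look like Tcl handles
    if {$name in $reserved || [regexp {^(file|sock)[0-9]+$} $name]} {
        return ./$name
    }
    return $name
}

foreach name {stdout scratch file12 thefile.sdf sockets} {
    puts "$name -> [plain_file_name $name]"
}
```

Only the first three sample names receive the prefix. The names thefile.sdf and sockets are returned unchanged, because they neither match a reserved name nor consist of file or sock followed by digits only.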

Some I/O modules implement access to a variety of information sources as a virtual file, which neither has a presence on the local disk, nor is one of the standard magic file names or access methods. Such virtual file names are by convention written in angle brackets.

Example:

set fh [molfile open <pubchem>]

This command loads the PubChem virtual file access module, and returns a handle which may be used in a similar fashion as, for example, a handle to a huge local SD file. Depending on the I/O module, various operations on the handle may be optimized to be performed remotely. For example, the PubChem module offloads as many query operations of molfile scan commands as possible to the NCBI computers and downloads result structures only if they are needed as results, or query sub-expressions were specified which cannot be processed by the NCBI system.

The first optional parameter is the file access mode. It may be one of:

For some files and file formats, two more mode characters have meaning if appended to the primary mode: They are silently ignored if the file argument or file format do not support them.

The remaining parameters of the molfile open command are optional keyword/value pairs, or alternatively a single dictionary with the same function. The processing of these parameters is exactly the same as in the molfile set command.

Example:

set fhandle1 [molfile open thefile.pdb]
molfile set $fhandle1 hydrogens add nitrostyle ionic
set fhandle2 [molfile open thefile.pdb r hydrogens add nitrostyle ionic]

The first two lines and the final line perform exactly the same task: Open an input file, and set up input flags so that a complete set of hydrogens is added, and nitro groups and similar groups are converted to an ionic (as opposed to pentavalent) representation.

When a file is opened for reading, its format is automatically determined. Do not use the format attribute except under very special circumstances.

The command returns the file handle of the opened input file. This is the handle which is required by most other molfile commands which refer to an opened file.

Depending on the encoding of the opened file, the actual access mode to the file may be different than expected. In case a disk file is compressed with gzip or bzip2 , the file is opened via a pipe to the responsible decompressor program. Likewise, a UCS-2 encoded file is opened via a pipe to the iconv program which converts the contents to the UTF-8 encoding. Files which are opened indirectly via such helper pipes have different access characteristics than directly addressed files. For example, backspacing is expensive, because the pipe has to be closed and re-opened, and the data stream skipped to the desired position. This takes much longer than simply repositioning a file pointer.

molfile properties

molfile properties filehandle ?pattern? ?noempty?

Generate a list of the names of all properties attached to the molfile object. Optionally, the list may be filtered by a string match pattern.

In most cases, this list is empty. Only structure file properties, such as F_COMMENT , etc., are listed, but no object attributes, such as readflags , nitrostyle , etc. Few file formats support the concept of storing file-level properties, and therefore an empty property set is usually reported. Since file objects do not contain minor objects, and currently cannot be a member of other major objects such as datasets or reactions, no properties belonging to other classes except file objects are ever listed.

If the noempty flag is set, only properties where at least one data element is not the property default value are output. By default, the filter pattern is an empty string, and the noempty flag is not set.

The property list may be modified by input operations. In some cases, the defined file-level properties may vary with the record position, or may become available only after the first input operation, not immediately after opening the file.

The command may be abbreviated to props instead of the full name properties .

Example:

set plist [molfile properties $fhandle]

molfile purge

molfile purge filehandle propertylist/molfile/all ?emptyonly?

Delete property data from the molfile object. Only molfile property data may be deleted with this command (these usually have a F_ prefix). Molfile attributes are not deletable.

If the optional emptyonly flag is set, only file property values which are identical to the default of the property are deleted. By default, or when this flag is 0, properties are deleted regardless of their values. In case a listed property is not present, or not a file property, the request is silently ignored, but using property names which cannot be resolved leads to an error. If the object class name molfile or the keyword all is used instead of a property name, all file-level property data is deleted from the molfile object.

Example:

molfile purge $fhandle F_COMMENT
molfile purge $fhandle all

The first command deletes a specific property, the second command deletes all file property data associated with the handle.

molfile putline

molfile putline filehandle ?lines?

Write user-specified string lines to a file, bypassing the normal record writing mechanism. This operation is only supported on files which are opened for output and contain text data. The lines should not contain end-of-line characters. These are automatically supplied according to the file object configuration set in the eolchars attribute.

The command returns the file handle.

molfile read

molfile read fhandle ?datasethandle/enshandle/#auto/new ?flags?? ?recordcount?
molfile read fhandle ?datasethandle/enshandle/#auto/new ?flags?? ?attributedict?

This important command reads chemistry objects from a structure file. The type of objects returned depends on the read scope of the file. They can be ensembles, reactions, or datasets. Read scope mol returns single-molecule ensembles, but (with I/O modules supporting this feature) reads only individual molecules into the output ensemble, splitting a multi-molecule file data ensemble if necessary. The return value of the command is a list of all objects which were generated, except when the #auto dataset creation method was used, or an unlimited number of objects was read into a dataset. In that case, the recipient dataset handle is returned.

By default, the returned objects are not a member of any dataset. If a dataset handle is passed as fourth parameter, the returned objects are appended to that dataset if possible. The special value #auto or new creates a new dataset as container. This is equivalent to using the nested statement [dataset create] as dataset handle argument. If the fourth parameter is an ensemble handle, and the object read from the file is also an ensemble, the read data is stored in the shell of the old ensemble, after all old ensemble data has been deleted. Its object handle remains unchanged, as is its dataset membership. The reuse of reaction handles is currently not supported. This parameter can be skipped by specifying an empty string.

In addition to passing an empty string, or a simple dataset or ensemble handle, as the fourth command argument, a list consisting of a handle and a modifier flag set can be specified. The only flag value which is currently recognized is checkroom . If that flag is set, and the input objects are to become members of a dataset with enabled maximum size or insertion mode control, a test is made whether the dataset has sufficient room to allow the insertion of the new object(s), or whether a suitable alternative action is configured to handle the read object in a different fashion, such as discarding it. If that is not the case, the command returns immediately, without performing any input, and returns an empty string. If the test succeeds, the input operation is atomic, since the dataset is locked for the full duration of the command, so that no other threads can manipulate its status between the initial check and the file input result object transfer.

The final optional parameter is either a single argument specifying the number of objects which should be read, or a dictionary with key/value attributes. The default is equivalent to passing a simple numerical value of one, in the first, simple format. In order to read until the end of the file, the special value all may be used instead of a numerical count. With an all parameter value, the input operation is finished when no more data is available on the file. Until this condition is met, an unlimited number of records is read. No error is generated when EOF is met. There are also no EOF errors reported if a numerical record count of more than one was specified, and at least one object could be successfully read. Another magical value of the simple argument form is batch , which is substituted by the batch record set size configured on the molfile handle (see molfile get/set ).

In the second form of the final parameter, an attribute dictionary is persistently applied equivalent to a molfile set command before the input commences. Standard file handle attributes and an input limit may be both set in parallel by using the special attribute name limit as part of the dictionary. It is only recognized in this context, but not with molfile set or molfile string . The allowed values of the limit attribute are the same as in the simple command variant.

The command raises an error if input could not be completed, regardless of whether the reason is a file syntax error or simple EOF (but see above for exceptions). If an input error occurs, the EOF attribute of the file handle should therefore be checked in order to distinguish between these two conditions. In case the input file was opened for pipe reading (mode ’p’), or is connected to a Tcl channel, an EOF report may only indicate that no data is currently available on the pipe or Tcl channel, but data could still arrive at a future point in time.

Examples:

if {[catch {molfile read $fhandle} ehandle]} {
	if {![molfile get $fhandle eof]} {
		puts "Error: $ehandle"
	}
} else {
	puts "Read [ens get $ehandle E_NAME]"
}

The prototypical snippet above shows the input of the next ensemble record from a previously opened file, with proper error checking.

molfile read "acd.sdf" [dataset create] all

This sample command reads a complete input file (we are using the single-operation feature of the molfile command to open and close the file acd.sdf automatically for the duration of this command) into a newly created dataset in memory. Reading huge datasets is of course not necessarily a good idea without large amounts of RAM. On typical current workstations, 10,000 or 20,000 compounds are no problem, but beyond that there is a real risk of running out of memory.

molfile reorganize

molfile reorganize filehandle 

This command only has an effect for file formats for which the I/O module provides a reorganizer function. This function typically optimizes and compacts the file for input and queries, and should usually be called after all records have been written. Writing to a reorganized file is typically at least initially slower than writing to a file which has not been processed.

The function returns a boolean value indicating whether any reorganization has actually been performed. In case the command is applied to a file which is not writable, an error results.

molfile rewind

molfile rewind filehandle

Reposition the file before the first record, and clear all error status information. If the file is already at the first record, and no error condition is set, this command does nothing.

Not all file channels can be rewound, and for some which can, it can be an expensive operation. For example, standard input or pipe input channels are not rewindable, and an FTP URL channel has to be closed and re-opened.

Rewinding a virtual file set positions the file pointer before the first record of the first file in the set.

Standard text-stream style output files can be rewound, too. This effectively truncates them. Files which are opened for appending are truncated to their original length.

Rewinding is not necessary in all cases. The molfile scan command automatically rewinds the input file if it is at EOF at the beginning of a scan.

The return value of the command is the file handle.

molfile rewrite

molfile rewrite filehandle recordlist propertylist ?values? ?filter? ?callback?

This command updates specific property fields in a file, without rewriting the complete record. This is only supported if the file was opened for writing or updating, and the I/O module for the format of the file supports this operation by a special function. This typically limits the applicability of this command to database-style file formats such as Cactvs CBS and BDB .

The record list parameter is either a list of numerical records, with one as the first file record, or one of the special values all (all file records are updated), current , next , previous (the indicated record is updated), or a table handle, optionally followed by a table column name. In the last case, the table is expected to contain the data for rewriting, and in case a column name is specified, that column should contain the applicable record numbers. If the table version is selected without a record column, the file records from one to the number of table rows are updated. None of the special values can be combined with the simple numerical record sequence style. If the parameter is a numerical record sequence, the order of the records is significant.

The values list can be empty, or it must match the length of the property list. In the latter case, every specified value must be a valid value for the property in the same list index position. Note that while it is possible to manipulate multiple records in one step with this command, it is not possible to assign a different set of values to the data fields for each processed record. For this operation, multiple rewrite statements must be issued. If the value list is absent, or empty, the values are recomputed from the structure or reaction object that is temporarily read from the file record for this purpose. This is a useful feature in case the computation function for a computable property has changed. In case the record list references a table instead of a numerical record list or a magic record name, the value list is ignored. Instead, the table is expected to contain table columns which match the properties in the list, but not necessarily in the same column order, or containing exclusively the properties in the list.

The optional filter argument is a query expression in the same style as used in the molfile scan command. If a filter expression is supplied, only records which match the expression are changed. Non-matching records are skipped. In case no filter is used, all records selected by the record list are processed.

After processing, the file pointer is on the last processed record.

If the name of a Tcl callback procedure is specified, it is called after each processed record. The Tcl procedure arguments depend on the processing mode. In case of table-based processing, the arguments are the table handle, the current table row, the file handle and the current file record.

This command is not fully implemented yet. CBS files currently only support re-computation of property data from object data, not updates from explicit value lists. Neither BDB nor CBS I/O modules currently call the Tcl callback procedure except in table-based processing mode.

The command returns the number of updated records.

Example:

molfile rewrite $fh current E_NAME "Black tar, grade A"
molfile rewrite $fh all E_XLOGP2
molfile rewrite $fh [list $mytable records] [list E_IDENT E_REGID]

The first command changes the property field E_NAME in the current record to the specified value. The second variant recomputes all E_XLOGP2 values in the file from the stored structure data - for example after updating the computation function of that property, or having added it as a new field to the file. The final version changes the fields E_IDENT and E_REGID for the records stored in table column records, replacing them with the data found in the table columns of the same name.

A complication in the use of this command is that database-type files like the Cactvs CBS and BDB formats store property definitions themselves. After opening the file, a newly set up property definition, which may for example possess an upgraded computation function, can have been replaced by the old definition from the file. In that case, the new property definition must be explicitly re-read to gain the upper hand again, for example with a prop read command.

molfile scan

molfile scan filehandle|remotehandle expression ?mode? ?parameters?

Execute a query on the file and return results. The structure file is scanned, by default starting from its current read position, and results are gathered until either the end of the file has been reached (or the scan wrapped once around the file, if the wraparound file flag has been set) or a scan condition caused the stopping of the scan procedure. If the scan finished without reaching the end of the file, it can be resumed with another molfile scan command at a later time.

The file scan works in principle on any file, but with very different efficiency. Files managed by file format I/O modules which support direct field access, and can supply structure and reaction data in binary form, can be queried much (often a factor of 1000 or more) faster than, for example, a plain SD file. In the latter format, every record needs to be fully parsed, the structure compared against the query expression, and most of the structure data is discarded immediately after the record has been checked. Files in formats which support various types of indexing for numerical values, bit-screen filtering for super- and substructure searches, hash codes for full-structure matching and other means of acceleration can be effectively queried with typical expressions in a few seconds, even while containing millions of compounds.

The two basic built-in Cactvs formats for effective searching are CBS (static files, good performance on CDROM and other linear media) and BDB (efficiently updateable, and with more advanced indexing than CBS). In contrast, the systematic reading of a million-record SD file takes a few hours. Nevertheless, the feature of universal query support is very useful for working with typical data sets of a few thousand records. These do not need to be converted from their original formats to a query file for a quick exploratory data scan.

Query expression syntax classes

The toolkit currently supports two syntactically unrelated classes of query expressions: Native Cactvs expressions, which are described below, and Bruns/Watson structure queries as described in J. Med. Chem. 2012, 55, 9763-9772. The exact syntax supported is that of the internal Lilly suite in October 2014, which is significantly extended from the description in the paper, but also discards some outdated syntactic elements briefly mentioned in the paper.

Example:

set demerits [molfile scan $fh [read_file 9_aminoacridine.qry] {record demerit}]

This expression returns a nested list of the records which match the query, together with the merit/demerit score computed by that rule. Note that records which do not match the expression are omitted; they do not report a zero demerit in the result. Internally, Bruns/Watson queries are mapped to the standard toolkit query expression data structure. Many of the queries in the standard Lilly rule set can be expressed equivalently as a native query. However, at this time there are a few specific Lilly query features which cannot be expressed in native toolkit syntax.

If a query expression cannot be parsed as Bruns/Watson code, an attempt is made to interpret it as a native Cactvs expression, and all error messages relate to that interpretation attempt. The following paragraphs all apply exclusively to the native toolkit expression style.

Branch node expression classes

The expression argument is a tree of individual query statements. It is formatted as a nested Tcl list. The allowed depth of branching as well as the allowed number of leaf nodes is unlimited. The following branch operations are supported in this tree:

Here are a few simple expression patterns:

molfile scan $fh $leafexpression1
molfile scan $fh [list "and" $l1 $l2]
molfile scan $fh [list "or" $l1 [list "and" $l2 $l3 $l4]]
molfile scan $fh [list "orcontinue" [list not $l1] [list "xor" $l2 $l3]]
molfile scan $fh [list bind mol [list and $l1 $l2]]

All branch nodes need to end in leaf expression nodes. An empty query expression is valid and matches every input record. Also, it is legal and actually a common case to have an expression which is just a single leaf node expression. The order of the branches does not matter. An automatically invoked optimizer sorts the branches, and simplifies them, in order to achieve maximum performance.
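Because the expression is an ordinary nested Tcl list, it can be assembled and inspected with the standard Tcl list commands before it is passed to the toolkit. The following minimal sketch runs in plain Tcl. The leaf expressions are sample values of the kind used elsewhere in this section, and the variable names are arbitrary.

```tcl
# Three sample leaf expressions, each a plain Tcl list
set l1 [list record <= 100]
set l2 [list E_NAME = methane]
set l3 [list notnull E_NAME]

# Branch nodes: operator word first, then the sub-expressions
set expression [list or $l1 [list and $l2 $l3]]

puts $expression
puts "top-level elements: [llength $expression]"
puts "nested operator:    [lindex $expression 2 0]"

# With the toolkit loaded and an open file handle in $fh, the
# expression would then be used as:
#   molfile scan $fh $expression
```

Building the tree with list rather than by string concatenation keeps every leaf a single, correctly quoted list element, regardless of the spaces or braces its values may contain.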

Leaf node expression classes

These are the supported classes of leaf node expressions:

The various leaf expression classes have different syntax schemes, which are explained in the next paragraphs.

record and vrecord expressions

The record and vrecord expression classes are always written with three list elements: The expression class name, the operator, and the value or value list. The operators can be from the standard six numerical types, the range operator (<->), and the in or notin set operators. Numerical comparisons require a single comparison value, the range operator a pair of values, and the set operators a list. Examples:

"record <= 100"
"vrecord <-> {1 1000}"
"record in {1 7 19 230}"

filename expressions

The filename expression class is even simpler. It always consists of three elements: The expression class name, the operator (which can only be = or !=), and the file name. The actual file comparison operation uses device and inode identifiers on Linux/Unix platforms if the file is accessible, so the exact spelling of any path components does not matter. Example:

"filename = part1.sdf"

isnull and notnull expressions

The isnull and notnull expression classes are written with two elements. The first is the class name, and the second a property name. The property name may be qualified with an ensemble class modifier. If the modifier is not specified, the query applies to the main database structure. Otherwise, the property of the specified ensemble class is addressed. Examples:

"isnull E_NAME"
"notnull product:E_ASSAY_RESULT"

property expressions

The property query expression class is a little bit more complex. It has a variable number of elements, between three and eight. The general syntax scheme is

property {operator ?modifiers?..} value ?threshold? ?multimode? ?filter? ?c1? ?c2?

The first three elements are always the property name, which can be qualified with an ensemble class, the comparison operator, and one or more values. The number of required values is dependent on the operator. The comparison operator can be a nested list. It needs to contain as a list element the basic comparison operator (numerical, range or in/notin set operators) and may additionally contain modifier words, which are translated into flags potentially influencing the datatype-specific comparison functions. It depends on the data type of the property whether any flag word has an effect.

If the object flag word is supplied as part of the operator list, the value part of the query is parsed as a chemistry object handle, more specifically an ensemble handle, a decodable string representation of an ensemble, a reaction handle, or a decodable string representation of a reaction. The ensemble variants are accepted if the query property is attached to an ensemble or an ensemble minor object, and the reaction variants can be used if the property is reaction-related. The value of the query is then automatically extracted, even computed if needed, from the object. Properties with subfields can be entered with the basic name, or any qualified subfield name. In addition, the property name may be prefixed by a structure class designator (see paragraph on structure queries). By default a property is assumed to be data of the main structure of the file record, or the main reaction. Examples:

"E_NAME = methane"
"solvent:E_NAME {in ignorecase} [list benzene toluene ethylbenzene]"
"E_IRSPECTRUM(source) {= shell nocase} *bruker*"
"E_WEIGHT {<= object} $ehtest"
"E_CAS {= ignoredashes ignorecase} 88337-96-6"

These are the comparison flag words which are recognized:

If the operator is the in or notin word, the value part is interpreted as a list. The value, or value list item, must be parseable according to the property data definition. Enumerated values and similar encodings may be used if properly defined in the property descriptor record.

If the comparison function computes a score (for example, the Tversky or Tanimoto variants), the next optional argument is a threshold value which needs to be exceeded to register as a hit. If the threshold parameter is not specified, or given as a negative value, any score passes. Example:

"E_SCREEN {>= tanimoto object} $eh 95"

The next two optional arguments concern the case when there is more than one file data value to compare against the expression value. This generally happens when the tested property is not a major object property, but a minor object property, such as an atom or molecule property. In that case, the database record often contains multiple values, because there is more than one atom, or more than one molecule, in the structure in the record. The first argument is the general match criterion. It can be set to one, all, none, or both. The default is one. Mode one means that it is sufficient if one of the record values matches. Mode all requires all values to match, mode none requires that none matches, and mode both requires that there are both matches and mismatches.

The next optional parameter is a filter which can be used to restrict the values tested. If it is not present, or an empty string, no filter is applied. Example:

"A_ELEMENT = 6 {} all ringatom"

The above expression checks whether all ring atoms in the structure are carbon. Any record with a hetero ring atom fails the test.
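
The interplay of the match criterion and the filter can be sketched as follows. This is an illustration in Python, not toolkit code. Atoms are modelled as hypothetical (element, in_ring) tuples, `multimode_match` is a made-up helper name, and the result of mode all on an empty filtered set (here: a match) is an assumption.

```python
def multimode_match(values, predicate, mode="one", value_filter=None):
    """Combine per-value comparison results using one/all/none/both."""
    if value_filter is not None:
        # The filter restricts which record values are tested at all.
        values = [v for v in values if value_filter(v)]
    hits = sum(1 for v in values if predicate(v))
    misses = len(values) - hits
    if mode == "one":
        return hits > 0
    if mode == "all":
        return misses == 0      # assumption: an empty value set passes
    if mode == "none":
        return hits == 0
    if mode == "both":
        return hits > 0 and misses > 0
    raise ValueError(f"unknown match mode: {mode}")

def is_carbon(atom):
    return atom[0] == 6

def is_ring_atom(atom):
    return atom[1]

# Toluene: six ring carbons plus one methyl carbon (hydrogens omitted).
toluene = [(6, True)] * 6 + [(6, False)]
# Pyridine: five ring carbons and one ring nitrogen.
pyridine = [(6, True)] * 5 + [(7, True)]

print(multimode_match(toluene, is_carbon, "all", is_ring_atom))
print(multimode_match(pyridine, is_carbon, "all", is_ring_atom))
```

In this model toluene passes the all-ring-atoms-are-carbon test and pyridine fails it, in line with the description of the sample expression.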

The final two optional arguments are integer constants which may be used by the comparison operation. If they are not specified, both are implicitly passed as zero. If the first is specified, but not the second, the second is set to 100 minus the first value. Almost all comparison operations on the various data types ignore these.

One comparison mode which does make use of them is the Tversky bit vector similarity score. Here c1 and c2 are the weights of the bits in the first and second compared value. For scoring, both parameters are divided by one hundred and the floating point results are used as weight multipliers. Example:

"E_SCREEN {>= tversky object} $eh 90 {} {} 30 70"

The above expression computes a Tversky score on the standard structure search screen E_SCREEN with 30% weight for the database structure features and 70% weight for the query structure features (i.e. imbalanced towards a substructure rating), and reports the record if the score is 90% or higher.
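
The defaulting rule for c1 and c2 and their use as weights can be sketched like this. The code is Python and purely illustrative. It uses the textbook Tversky index on bit sets; that the toolkit computes its score with exactly this formula is an assumption, and both function names are hypothetical.

```python
def resolve_weights(c1=None, c2=None):
    """Apply the documented defaults for the two integer constants."""
    if c1 is None:
        return 0, 0              # neither given: both passed as zero
    if c2 is None:
        return c1, 100 - c1      # only the first given: second is 100 - c1
    return c1, c2

def tversky_percent(file_bits, query_bits, c1, c2):
    """Textbook Tversky index in percent (assumed to mirror the toolkit).

    c1 weights bits found only in the file value, c2 weights bits found
    only in the query value. Both are divided by 100 before use.
    """
    a, b = set(file_bits), set(query_bits)
    common = len(a & b)
    alpha, beta = c1 / 100.0, c2 / 100.0
    denominator = common + alpha * len(a - b) + beta * len(b - a)
    return 100.0 * common / denominator if denominator else 0.0

c1, c2 = resolve_weights(30)
print(c1, c2)                                         # 30 70
print(tversky_percent({1, 2, 3, 4}, {1, 2}, 0, 100))  # 100.0
```

With all weight on the query side, a file value that contains every query bit scores 100 regardless of its extra bits, which is why a skew towards the query behaves like a substructure rating.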

Starting with version 3.358 of the toolkit, property expressions where the data type of the query property is structure or reaction are no longer parsed as standard property expressions, but as structure or reaction query expressions, respectively. Example:

"V_ONTOLOGY_TERM(substructure) {>= swap  stereo isotope charge} $eh"

Since the data type of the subfield of V_ONTOLOGY_TERM is structure, the syntax rules of normal property expressions no longer apply. Instead, the syntax for structure expressions explained below is substituted.

structure expressions

Structure expressions are used to invoke structure comparison operations, such as sub- and superstructure search. The expression is a list, with three to six elements. A structure expression starts with the structure identifier, followed by the operator, which, as in property queries, may be written as a list with auxiliary modifier words, and as third mandatory argument the comparison structure source.

The structure identifier is the name of a structure class. Usually it is present as part of the record in the queried file, but some structure classes can be computed from the main structure if necessary. If a structure class can neither be found in a file record, nor computed, the node will not match. The following structure classes are supported:

At minimum, the operator section contains a standard numerical operator symbol. Additionally, modifier words may be present as additional list elements. The following operators are supported.

The default substructure match mode has the bondorder, useatomtree and usebondtree flags set (see match ss command). The initial flag set can be modified with modifier words linked to the operator. As far as it makes sense, the modifier words also change the operation of derived query modes, such as full-structure matching via hash codes.

These are the modifier words which can be used in structure expressions:

Many of these global flags can be overridden, or activated on a local level, for individual atoms or bonds, in the A_QUERY and B_QUERY properties. For example, A_QUERY has fields for flags which can request the matching of stereo or charges for specific atoms, or to allow missing stereochemistry at a specific center. These per-atom or per-bond requests override global query flag settings.

The third mandatory expression list element is the structure source. It can be one of

Query specifications found in structure sources are understood in a variety of formats. Daylight and MDL formats are decoded and translated into an internal representation in an almost completely compatible fashion. That includes Recursive SMARTS, ISIS 3D queries, MDL stereo groups and MDL reaction queries. A significant range of Sybyl SLN and CambridgeSoft ChemFinder query expressions are also understood, as well as features found in the CSD ConQuest software. Finally, in Cactvs there is no fundamental difference between a query fragment and a normal structure object. Query structures are just structures with additional information stored in properties A_QUERY, B_QUERY and possibly B_REACTION_CENTER. For basic matching, any structure object will do, even if it does not possess these query attribute properties. However, an eye should be kept on the hydrogen status of query fragments. If no specific flags are set, substructure matches attempt to match hydrogen atoms just like any other atom. Example:

set ehss [ens create C]
set ehss [ens create C smarts]

The first substructure ensemble does not, in the absence of hydrogen ignore flags, match any structure ensemble except those which contain a full methane molecule (one C plus four H) as a fragment, because that is what the substructure represents. The second code line decodes the substructure in full SMARTS mode. Not only can the full range of SMARTS expressions now be parsed (though none are used in this example), but the structure is also created without implicit hydrogens. The first substructure could still be used in a molfile scan command as a simple carbon match test if the nosubstructureh modifier flag were supplied.

In order to read query structures from a file, the following generic open statement is the standard approach:

molfile open $file r hydrogens asis readflags noimplicith

Simple query formats, such as MDL ISIS query Molfiles, are read into a flat set of attributes. More complex formats, such as SMARTS, may require the use of a tree of expressions on individual atoms and bonds, similar to the overall query tree with branch and leaf nodes described here for the molfile scan command. These complex formats are nevertheless also translated, to the degree possible, to the flat model. For example, a SMARTS expression which only uses simple atom lists, or atom and bond query attributes that are all connected by logical and, can be fully represented in this way. This also means that translation into other query file formats is possible for these simple expressions. The use of the full query trees in matching can in some cases be a performance issue. The noquerytree flag is available to restrict the match to those parts of the full query which can be expressed in the flat model.

The fourth and optional expression list element in the query expression is used only for a few match modes. If it is not set, the default value is minus one.

Example:

"structure ~>= $eh 90"
"product <-> C(=O)\[OH\] {2 3}"

The first sample expression is a standard Tanimoto similarity query, with a 90% threshold. The second query matches product structures with two to three carboxyl groups.

Optional expression list elements five and six correspond to the c1 and c2 parameters in property query expressions. These are currently only used in Tversky similarity queries:

"structure %>= $eh 90 30 70"

This is an expression for a skewed Tversky similarity (70% query structure, 30% file structure weight) with a 90% reporting threshold.

If the file format supports it, bitvector screening is automatically applied to reduce the number of records for which structures need to be pulled and sent to graph-based substructure matching. The default structure match screening property is E_SCREEN. The standard versions of E_SCREEN implement three predefined fragment sets. The higher sets are identical to the lower ones in the leading bits. Sets zero to two, which yield bit vectors of increasing length and selectivity, but also increasing storage requirements, can be requested by setting

prop setparam E_SCREEN extended 0/1/2

The bit set read from the query file must correspond to the parameter setting for E_SCREEN in the current Tcl interpreter, if the screen bits are automatically computed on the query structure. The CBS and BDB file formats, which are optimized for structure query operations, contain screen bit version information in the file header and automatically configure the property parameter setting when the file is opened. For other file formats with screen bits this needs to be done explicitly in the application script. It is also possible to change the structure bit-screen property associated with a file by setting the appropriate molfile handle attribute, so it is easily possible to use custom screen bit sets instead of the default property.

Starting with version 3.358 of the toolkit, property query expressions where the data type of the property is structure are automatically parsed as structure expressions.

smartsearch expressions

This query expression takes the same arguments as a structure expression. It is internally expanded into four alternative queries, linked by a pass-dependent switch control node. The four alternative queries are a full-structure query (equivalent to operator = in a structure query), a substructure query (operator >=), and two Tanimoto similarity queries with thresholds of 95% and 90% (operator ~>=).

When such a query expression is a component of a query expression tree, the query is first run with the full-structure query. If that query yields fewer results than the pass match limit, the input data source is repositioned to the original start record and the substructure query is run. If that run also does not yield sufficient hits, the two similarity queries are tried one after another. The pass match limit is one by default, meaning that the next alternative is tried only if the query did not match anything. It can be configured via the molfile passlimit attribute.

Running the second and later alternatives is only possible if the data source can be repositioned to the original start position of the first pass. If that fails, the query is silently terminated early. The pass match limit is compared with the global hit count of the query, not with the number of hits returned by the smartsearch branch. If other parts of a complex query produce sufficient hits, the query is not re-run even if a smartsearch branch did not return any hits.

Hits returned in different passes can be distinguished by including the pass pseudo-property in the retrieval data.

By convention, smartsearch expressions are written with an = operator. The actual operator in a smartsearch expression is ignored, but modifiers are not. So specifying options like the use of stereochemistry or isotopes is supported and useful.

It is possible to have multiple smart search expressions in a query. The query pass index for these is incremented in parallel, not independently.
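
The pass escalation described above can be modelled as a simple loop. This is a Python sketch for illustration only, not toolkit code. The `run_pass` callback stands in for a complete scan of the data source, pass numbers starting at one are an assumption, and so is the accumulation of hits across passes.

```python
# The four alternatives named in the text, in the order they are tried.
PASSES = ["=", ">=", "~>= 95", "~>= 90"]

def run_smartsearch(run_pass, passlimit=1, can_rewind=True):
    """Try the alternatives in order until the global hit count suffices.

    run_pass(query) is a hypothetical callback returning the matching
    record numbers for one full scan with the given alternative.
    """
    hits = []
    for pass_number, query in enumerate(PASSES, start=1):
        if pass_number > 1 and not can_rewind:
            break  # source cannot be repositioned: terminate silently
        hits.extend((pass_number, record) for record in run_pass(query))
        if len(hits) >= passlimit:
            break  # enough hits overall, no further pass is needed
    return hits

# Made-up scan results for a small file.
fake_results = {"=": [], ">=": [4, 9], "~>= 95": [4], "~>= 90": [4, 9, 12]}

print(run_smartsearch(fake_results.__getitem__))
print(run_smartsearch(fake_results.__getitem__, can_rewind=False))
```

In this model the first call stops after the substructure pass, while the second returns nothing because the full-structure pass found no hits and the source could not be rewound.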

The smart search feature was inspired by a similar functionality in the Accelrys Isentris system.

Examples:

"smartsearch = c1ncccc1"
"smartsearch {= stereo} \"L-lysine\""

formula expressions

Formula expressions are used to match file structures by element composition. Conceptually, this is a special syntax for a complex property match on file structure properties E_ELEMENT_COUNT and M_ELEMENT_COUNT. A formula search expression is always a list of three elements. The first element is always formula, the second element the comparison operator, and the third element the formula specification. The following operators are supported:

For formula queries, there are no modifier words for the operator.

The syntax of the formula is built on the lowest level by element or pseudo-element symbols, which may be grouped into sum or difference expressions and may possess a prefixed count multiplier. The symbol or symbol group can then be suffixed by a simple count, or an open or closed count range. If no count range is specified, the default count is one. In case an element is entered more than once, all counts for that element are added. Finally, the expression may be grouped by period characters into sub-expressions to be applied to different molecular fragments in the tested structures.

Besides normal elements, the following pseudo-elements, which are compatible with the set of the CSD ConQuest software, are recognized:

Element items can be grouped with round brackets into sums or differences. However, this is no full arithmetic expression parser. Element symbols can only be used as stand-alone syntactic elements, bracketed all-sum expressions, or bracketed all-difference expressions.

An element or an arithmetic group can have an appended count. This count can be:

Examples:

"formula = C6H6"
"formula = C5-6H6-"
"formula >= (Cl+Br)2"
"formula > \[4M\]>=3"
"formula = (2C-H)-6"
"formula = CH3COOH"
"formula = \[Het\]>1"

The first expression is a simple search which matches any ensemble with a composition of six carbon and six hydrogen atoms. The second looks for compounds with five to six carbon atoms and six or more hydrogen atoms, but no other elements. The third line finds compounds where the sum of chlorine and bromine atoms is two. Other elements may be present but are not required, so this expression matches Cl2, Br2 and ClBr as well as dichlorobenzene. The fourth expression finds structures with three or more metal atoms. The fifth expression finds compounds where twice the number of carbon atoms minus the number of hydrogen atoms has a value up to six. The next line finds compounds with a formula of C2H4O2, because the counts for repeated elements are summed up. The last example matches any compound with one or more hetero atoms.
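
The summing of repeated elements can be illustrated with a minimal counter. This Python sketch handles only plain element symbols with optional simple counts; count ranges, bracketed groups, pseudo-elements and period-separated sections are deliberately left out. The `element_counts` helper is hypothetical and not part of the toolkit.

```python
import re
from collections import Counter

def element_counts(formula):
    """Sum element counts in a plain formula such as CH3COOH."""
    counts = Counter()
    # An element symbol is an uppercase letter plus an optional lowercase
    # letter, followed by an optional count (default: one).
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return dict(counts)

print(element_counts("CH3COOH"))
```

For CH3COOH this yields two carbon, four hydrogen and two oxygen atoms, the C2H4O2 composition mentioned in the text.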

Periods can be used to define separate formula sections. These are applied to individual molecules in the tested structures, not the full ensemble. If a single dot is specified at the beginning or end of the expression, it signifies a single expression section to be applied to a molecule. When a test for formula sections is applied, all permutations of possible matches between the molecules in an ensemble and the formula expression sections are tried. No specific order is required, either of the molecules in the ensemble or of the formula expression sections, nor is there a need for a match between the molecule and expression section counts. However, every expression section in a formula needs to match a different molecule in the tested ensemble.

Examples:

"formula = C6H6.C7H8"
"formula = .H2O"

The first expression looks for ensembles which contain one molecule with the formula C6H6, and another with formula C7H8. The second expression matches ensembles with one or more water molecules. In both cases, molecules/fragments with a different composition may be present in the record. In order to test for two or more formulae with the additional condition that there are no other molecules/fragments, use two formula expression nodes connected with an and branch node, as in

and "formula = C6H6.C7H8" "formula = C6H6C7H8"
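
The rule that every formula section must match a different molecule, in any order, amounts to searching for an assignment of sections to distinct molecules. The following Python sketch illustrates this with a brute-force permutation search. Molecules are modelled as hypothetical element-count dictionaries and section tests as plain predicates; none of this is toolkit code.

```python
from itertools import permutations

def sections_match(section_tests, molecules):
    """True if each section test matches a different molecule."""
    if len(section_tests) > len(molecules):
        return False
    # Try every ordered selection of distinct molecules for the sections.
    for chosen in permutations(molecules, len(section_tests)):
        if all(test(mol) for test, mol in zip(section_tests, chosen)):
            return True
    return False

benzene = {"C": 6, "H": 6}
toluene = {"C": 7, "H": 8}
water = {"H": 2, "O": 1}

def is_c6h6(mol):
    return mol == benzene

def is_c7h8(mol):
    return mol == toluene

# Corresponds to the sections of "formula = C6H6.C7H8".
print(sections_match([is_c6h6, is_c7h8], [toluene, water, benzene]))
print(sections_match([is_c6h6, is_c7h8], [benzene, benzene]))
```

The first ensemble matches although its molecules are in a different order and an extra water molecule is present. The second does not, because no C7H8 molecule is available for the second section.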

Element symbols which stand for specific isotopes, such as D for deuterium, are currently not processed. D is read as a simple alias for hydrogen, disregarding the isotope label.

It is possible to use an ensemble handle instead of a formula expression. In that case, the elemental formula of that ensemble is used in the query, as computed by property E_FORMULA .

reaction expressions

Reaction expressions are the construct used for reaction substructure searches, for example when looking for certain bond transformations in a database of reactions. Obviously, the scanned file needs to contain reaction information for this to succeed.

Atom mapping numbers, which link atoms in the reagent ensemble to the product ensemble, and likewise in the transformation scheme which needs to be matched, are an important aspect of reaction searches. The central property for this is A_MAPPING. If this property is present, it is used to restrict matches to those reactions which embody a certain transformation, and are not a simple pair of ensembles which match substructures of the left and right part of the query transformation somewhere in their connectivity. Nevertheless, it is still possible to query reactions without a mapping scheme. That is identical to a pair of substructure searches. Also, individual parts of a reaction (the reagent and product ensembles, but potentially also the catalyst or solvent entries) can be used as targets for single-ensemble sub/super/full-structure searches via structure query expressions (see above).

A reaction expression is a list of three to six elements. The first element is always reaction , the second element the operator, and the third element the reaction source. The following operators can be used:

Similar to structure query expressions, the operator can be modified by adding flag words as additional list elements to the operator list element. The following flags are recognized:

The third mandatory parameter is the query reaction source. It can be any of

Reading one or more query reactions from a file handle directly in the query statement, as it is possible for structure queries, is currently not supported. Also, the tautomer match mode is not available for reaction matching because it interferes with atom map processing.

The optional query list items four to six are identical to those for structure query expressions. They represent a reporting threshold value and the c1 and c2 comparison algorithm parameters. Please refer to the paragraph on structure match expressions for more details.

The general approach to reaction sub- and superstructure matching is as follows:

Besides the ensemble-level query attribute properties A_QUERY and B_QUERY, reaction matches also make use of B_REACTION_CENTER (for constraints on the type of transformation a bond undergoes) and E_REACTION_ROLE (for the identification of reagent and product ensembles in the reaction object).

Reaction similarity queries use the reaction screen set (by default, property X_SCREEN) instead of the structure screen that is used for structure similarity. This operation returns a single score. There is no scoring of the reagent or product ensembles.

Full-structure reaction matches are performed via hash code checks on both the reagent and product sides. Atom mapping information is not used for this query operation. The suitable hash code is automatically selected depending on the operator modifiers (stereo, isotopes).

Starting with version 3.358 of the toolkit, property query expressions where the data type of the property is reaction are automatically parsed as reaction expressions.

Scan modes

The return value of the molfile scan command depends on the query mode. The default mode is enslist for the molfile scan command, but may be different when scanning other objects, such as datasets, networks or tables. The following modes are supported for file queries via the molfile scan command. Scan modes for other objects may include specific additional modes, while disallowing others.

If requested property data is not present on the object representing a hit, an attempt is made to compute it. If this fails, the retrieval modes table and tablecollection generate NULL cells, and property retrieval as list data produces empty list elements, but no errors. For minor object properties, the property list retrieval modes produce lists of all object property values instead of a single value. In table-based mode, only the data for the first minor object associated with the major object is retrieved, which makes this mode less suitable for direct minor object property retrieval.

Pseudo properties for retrieval

The following pseudo properties can be retrieved in property/propertylist scan modes or as table values, in addition to standard property data:

Record visitation order

The optional visitation order parameter, one of the optional query parameters listed in the next section, is primarily intended to be used for convenient execution of queries on a subset of records which were selected by a previous query on the same file. It can either be a numerical record list, with the first file record indicated as record one, or one of the keywords sortup or sortdown, followed by a property name. If this parameter is not set, or set to an empty string, or the magic string all, records are visited from the current input position in simple sequential order. If the query parameter dictionary additionally contains a startposition value, this start position refers to the index (plus one) of the first element of the specified record set, not to the original underlying file.

In the record list variant of this argument, the specified (virtual) records in the file are visited in the list order, and all other file records are ignored. For optimum performance, the records should be sorted in ascending order, but this is not necessary, and, since it does affect the order of the returned results, record visitation sets with record sequences in a custom order sorted by some criterion can have uses. A suitable format for a record list is a saved result of molfile scan in the recordlist or vrecordlist scan modes. It is possible to use a sorted record list with a non-rewindable input file, but an unsorted list will fail in that case if the file input pointer needs to be positioned backwards.

The sort property option variant implies a visit of all file records, but in the order of the values of a property in that file, not the native record sequence in the file. Using this access method is not too much overhead for indexed file formats such as CBS or BDB with an index on the sort property, but a serious performance hit for standard text files. This method cannot be used with files which cannot be rewound and do not have the sort property data in some direct access field, since it requires a full pass through the file to gather the sort property data values before the actual query is processed.

Examples:

molfile scan $fh "structure >= C1NCCC1" vrecordlist \
	[dict create "order" [list 3 6 29 157]]
molfile scan $fh "structure ~>= $ehcmp 90" {table E_SMILES score} \
	[dict create "order" {sortup E_WEIGHT}]

Query parameters

The final optional parameter is a keyword/value list of various additional attributes for fine-tuning the execution of the query. The following keywords are recognized:

More typical examples

Examples:

molfile scan $fh {structure = c1ccccc1} recordlist
molfile scan $fh {E_WEIGHT < 100} {propertylist E_SMILES E_NAME E_WEIGHT}
molfile scan $fh {notnull E_CAS} {table E_SMILES E_CAS}
molfile scan $fh {structure ~>= c1nnccc1 90} {score record}
molfile scan $fh "and {structure >= $ehss} {formula >= N3}" ens

Distributed queries

Molfile object handles can be configured to listen on specific ports for remote scan requests. The syntax of a remote scan request is the same as for a normal file. The only exception is the handle argument. The command is executed asynchronously. Because no direct results are returned, remote scans are typically of a type which yields network-transferable objects (modes ens, enslist, reaction, reactionlist, table) and specify a target dataset object on the local system.

On the local system, a typical set-up looks like this:

set dh [dataset create]
dataset set $dh port 10001
molfile scan $remotehost:10002 {structure >= c1ncccc1} \
	{table record E_NAME E_CAS} {} {target $localhost:10001 startposition 1}
while {![dataset tables $dh {} count]} {
	sleep 1
}

In the above code, we first create a recipient dataset object, and configure it to listen on port 10001 for incoming Cactvs objects. We are expecting a table object as the result later. We then issue the query for execution on the remote host, and wait until the table object containing the results has arrived.

On the remote server, the set-up could look like this:

molfile open $dbfile r port 10002
vwait forever

Here the database file is opened, and a port for incoming requests opened. The vwait Tcl statement does nothing, but keeps the interpreter running, while waiting for and processing events such as incoming scan commands. In this sample set-up, the remote server needs to be started first, because otherwise the connection to the remote file fails on the client.

Since execution of remote queries is asynchronous, the client could issue multiple query requests to different remote handles and then wait until results from all these requests have been collected, or a timeout or other error condition has been reached. The results could arrive in any order. The scan commands for a group of servers could, for example, specify different start positions and maximum scan values for distributed searching of a big file, or could gather results from different small files. Additionally, the use of multiple scan threads could be requested on the server by passing appropriate parameters in the control section of the command. Nevertheless, only a single remote scan command per Tcl script thread is executed on the server at any time. If multiple scans need to be executed in parallel on a single server, a collection of script threads needs to be created via the Thread package, and every thread then told to open its own port listener.

The mechanism for the reception of messages for remote scans on molfile handles which listen on ports is subtly different from the processing of commands sent to listening dataset objects. The execution of scans requires active collaboration of a Tcl interpreter. Commands are only read and processed when the interpreter is idle, for example while sitting in a vwait or sleep statement. In contrast, dataset object listeners do not rely on Tcl interpreters, and are implemented as independent threads. Remote dataset commands, such as ens move or dataset pop with a remote dataset handle, are therefore executed at any time when a mutex lock on the database object and other accessed objects can be secured.

molfile set

molfile set filehandle ?property/attribute value?...
molfile set filehandle attribute_dictionary

A standard data manipulation command. It is explained in more detail in the section on setting property data. The alternative short form with the single dictionary argument is functionally equivalent to using the expanded dictionary as separate property and value arguments.

Examples:

molfile set $fhandle F_GAUSSIAN_JOB_PARAMS(link0) [list \
	"%chk=144__303_2EVE_PDB_Opt8.chk" "%mem=128MB" "%nprocshared=2"]

The command can also be used to set a broad range of object attributes. The list of attributes is documented in the section on the molfile get command.

In case a set command is applied to a virtual file, the command applies to the current physical file only, if this makes sense.

Example:

molfile set $fhandle record 2

Above command repositions the file read/write pointer to the second record.

This command supports a special attribute value syntax for manipulating bitset-type attributes (only attributes, not property values). If the first character of the argument is a minus character (-), the named bits in the set identified by the remainder of the argument are unset. If it is a plus (+), they are additionally set. With an equal sign (=), or no special lead character, the flag set replaces the old value. A leading caret character (^) toggles the selected bits.

Example:

molfile set $fhandle readflags +pedantic
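
The effect of the four lead characters can be modelled with ordinary set operations. This Python sketch is for illustration only and is not toolkit code. The `update_flags` helper is hypothetical, and treating flag names as whitespace-separated words is an assumption about the syntax.

```python
def update_flags(current, argument):
    """Return the new flag set after applying a +, -, =, or ^ argument."""
    current = set(current)
    lead, rest = argument[:1], argument[1:]
    if lead not in ("+", "-", "=", "^"):
        # No special lead character behaves like an explicit "=".
        lead, rest = "=", argument
    named = set(rest.split())    # assumption: names separated by spaces
    if lead == "+":
        return current | named   # additionally set the named bits
    if lead == "-":
        return current - named   # unset the named bits
    if lead == "^":
        return current ^ named   # toggle the named bits
    return named                 # replace the old value

flags = update_flags({"noimplicith"}, "+pedantic")
print(sorted(flags))
```

Starting from a set that contains only noimplicith, the argument +pedantic leaves noimplicith in place and adds pedantic.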

molfile setparam

molfile setparam filehandle property key value ?key value?...

Set or update a property computation parameter in the metadata parameter list of a valid property. This command is described in the section about retrieving property data. The current settings of the computation parameters in the property definition are not changed.

molfile show

molfile show filehandle propertylist ?filterset? ?parameterlist?

Standard data manipulation command for reading object data. It is explained in more detail in the section about retrieving property data.

For examples, see the molfile get command. The difference between molfile get and molfile show is that the latter does not attempt computation of property data, but raises an error if the data is not present and valid. For data already present, molfile get and molfile show are equivalent.

molfile skip

molfile skip filehandle ?recordcount?

Skip records in a file opened for input. If the file pointer is at the beginning of a new record, this next record is the first skipped. If the file pointer is stuck in the middle of a record, for example because a molfile read command failed due to a file syntax error, the first record counted is the remainder of the current record. An attempt is made to re-synchronize to the beginning of the next record.

By default a single record is skipped. If the record count parameter is specified, more than one record can be skipped. Because of the partially read record re-synchronization feature, negative record counts are not allowed in this command. The molfile backspace and molfile set record commands can be used to go back in a file.

The command returns the number of the next record to be read. If an attempt is made to position past the end of the file, or if record re-synchronization fails, an error is reported.
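
The two uses described above, skipping ahead and recovering from a failed read, might look as follows. This is a sketch only: the file name is a placeholder and the file is assumed to contain more than ten records.

```tcl
set fh [molfile open myfile.sdf]
# Skip the first ten records; the return value is the number of the
# next record to be read
set next [molfile skip $fh 10]
if {[catch {molfile read $fh} eh]} {
    # The read failed, possibly in the middle of a record: skip the
    # remainder and re-synchronize to the start of the following record
    set next [molfile skip $fh]
}
molfile close $fh
```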

molfile sort

molfile sort fhandle {{propertylist ?direction? ?cmpflags?}..} ?outfile/handle?

Sort the records in the file according to the values of one or more properties or property subfields contained in the file records, or computable on the objects read from the file. The output records are byte-for-byte identical images of the input records, not records reconstructed from input data objects.

The property sort list consists of zero or more sort specification elements. Every specification element is parsed as a sublist, but only the first element therein is mandatory. This element is either a property name, a property subfield name, or one of the magic names #record or record (for the file record) or #random or random (for a random number assigned to that record). The optional sort direction element may be up or down. The default sort direction is upwards. The final optional comparison flags parameter can be set to a combination of any of the values allowed with the prop compare command. The default is an empty flag set.

The first property or magic name in the sort list has the highest priority. In addition to the specified properties, the original record number is implicitly added as tie breaker to yield a stable sort. This automatic value is always sorted upwards. If an empty property list is specified, the result is thus a simple file copy without record rearrangement.

The sort properties do not need to be already present in the file. If necessary, an attempt is made to compute these on the objects read from the file in the first pass. It is possible to sort on properties which are not of the object class read from the file, for example atom properties when ensembles are read, or ensemble properties when reactions are read. In that case, the record is output at the position determined by the lowest sort rank of the property of that object, for example the minimum or maximum value of all values of an atom property in an ensemble. Additional data instances of the property associated with a given record are ignored, so no record duplicates are output.

The optional output parameter can be the handle of an opened Tcl channel (including standard output and standard error), the name of a (preferably new) file, or a pipe construct. Output is appended to this output channel. If the parameter is omitted, the output is first written to a temporary file, the original file is deleted and the temporary file is renamed to the original file. In that case, the original file handle is automatically re-opened for reading on the new file. The input file handle must be positionable, because file records are accessed twice, once for reading the sort data and once for copying the records out. Sorting from standard input, pipes or other non-rewindable sources is therefore not supported, and neither is the sorting of files which are not simple record sequences. Sorting such files is currently only possible by using explicitly scripted record data buffering mechanisms.

On Windows, output to an open Tcl file handle is not supported, except for the standard output and error channels.

The return value of the command is the number of records written. The position of the sort file handle is set to the same location as before the command.

Examples:

molfile sort $fh {{E_NAME up {dictionary nocase}}} dict.sdf
molfile sort myfile.sdf {{record down}}
set fhtcl [open "randomized.sdf" w]; molfile sort $fh {{random}} $fhtcl
molfile sort $fh {{A_ELEMENT down} {E_WEIGHT up}} "|gzip >heavy.sdf.gz"

The first example creates a new file dict.sdf which contains the remaining records in the file associated with the file handle sorted by the value of property E_NAME in case-insensitive dictionary order. The second example reverses the order of the records in the file, replacing the original file in the process. The third example randomizes the record sequence in the original file, outputting the records in a new file which was opened for writing as a normal Tcl text file. The final example outputs a compressed SD file, with structures sorted by the heaviest element in the ensembles, and using the molecular weight as tie breaker.

molfile sqldget

molfile sqldget filehandle propertylist ?filterset? ?parameterlist?

Standard data manipulation command for reading object data. It is explained in more detail in the section about retrieving property data.

For examples, see the molfile get command. The differences between molfile get and molfile sqldget are that the latter does not attempt computation of property data, but initializes the property value to the default and returns that default, if the data is not present and valid; and that the SQL command variant formats the data as SQL values rather than for Tcl script processing.

molfile sqlget

molfile sqlget filehandle propertylist ?filterset? ?parameterlist?

Standard data manipulation command for reading object data. It is explained in more detail in the section about retrieving property data.

For examples, see the molfile get command. The difference between molfile get and molfile sqlget is that the SQL command variant formats the data as SQL values rather than for Tcl script processing.

molfile sqlnew

molfile sqlnew filehandle propertylist ?filterset? ?parameterlist?

Standard data manipulation command for reading object data. It is explained in more detail in the section about retrieving property data.

For examples, see the molfile get command. The differences between molfile get and molfile sqlnew are that the latter forces re-computation of the property data, and that the SQL command variant formats the data as SQL values rather than for Tcl script processing.

molfile sqlshow

molfile sqlshow filehandle propertylist ?filterset? ?parameterlist?

Standard data manipulation command for reading object data. It is explained in more detail in the section about retrieving property data.

For examples, see the molfile get command. The differences between molfile get and molfile sqlshow are that the latter does not attempt computation of property data, but raises an error if the data is not present and valid, and that the SQL command variant formats the data as SQL values rather than for Tcl script processing.

molfile string

molfile string enshandle/reactionhandle/datasethandle ?attribute value?...
molfile string enshandle/reactionhandle/datasethandle ?attribute_dict?

This command returns a byte vector representation of a structure file. The third argument to this command is an ensemble, reaction or dataset handle, not a file handle as for other molfile commands.

If the selected output format module supports direct output into a string, the record image is created without intermediary forms. Otherwise, an anonymous temporary file is opened, the ensemble or reaction(s) written to that file, and the file content returned as a string, including all newlines. The file is then removed.

Writing to binary formats is possible. The return value of the command is a byte vector, not a simple text string, so it may contain NUL bytes. By default, in the absence of an explicit format specification, an MDL Molfile is written.

The remaining parameters are interpreted as in the molfile set command. There are two equivalent command variants, either using attribute and value argument pairs or a dictionary as a single argument. The parameters in the extra arguments or dictionary are typically used to set a hydrogen status, select the output format, etc.

Example:

set jmestring [string trim [molfile string [ens create C1CC1] format jme]]

The example creates an input string for the popular JME Java structure editor by P. Ertl/Novartis. The string trim statement deletes the trailing newline. The necessary JME output module is automatically loaded if it is not already loaded or compiled-in when the format parameter is decoded.

String record representations generated by this command can be opened for input as string data with the s mode of the molfile open command:

set fh [molfile open [molfile string $eh] s]
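
The dictionary variant of the command can be sketched in the same way. The lines below assume that a dictionary built with the standard Tcl dict create command is accepted as the single attribute argument; the structure and the format value are taken from the earlier example.

```tcl
set eh [ens create C1CC1]
# Attribute/value pairs as separate arguments
set s1 [molfile string $eh format jme]
# The same request with the attributes packed into one dictionary
set s2 [molfile string $eh [dict create format jme]]
ens delete $eh
```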

molfile subcommands

molfile subcommands

Lists all subcommands of the molfile command. Note that this command does not require a molfile handle.

molfile sync

molfile sync filehandle

This command synchronizes the file contents with the file system. The I/O modules for most file formats automatically perform a simple file buffer flush upon finishing the output of a record, so this command is needed only under special circumstances: when complete file system synchronization is required, when the file was written without immediate commits, when the I/O module for the file format provides a special synchronization function, or when the output was done via asynchronous I/O. In any case, every file is fully synchronized when it is closed, so calling this function for normal output operations is not required.

The command returns the file handle.

molfile toggle

molfile toggle filehandle

Switch a file from input to output, or vice versa. If the file was in write, append or update mode when the command is executed, the file is rewound and the read pointer is now pointing to the first record, or the original end point for append files. If the file was configured for input, the file output mode is changed to append if the file is a normal file. If the file is a scratch file, the file is truncated to an empty file and the write position set to the first record.

Not all file types can be toggled. Special file types except FTP streams cannot, and it is not possible to toggle a simple disk file which was originally opened in read only mode (see molfile open command).

The command returns the molfile handle.
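
A typical use is to write a file and read it back through the same handle. The following sketch assumes a normal disk file opened in write mode; the file name is a placeholder.

```tcl
set fh [molfile open scratch.sdf w]
set eh [ens create C1CC1]
molfile write $fh $eh
# Switch from output to input: the file is rewound and the read
# pointer is placed on the first record
molfile toggle $fh
set eh2 [molfile read $fh]
molfile close $fh
ens delete $eh
ens delete $eh2
```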

molfile truncate

molfile truncate filehandle ?record?

Truncate a file. If no explicit record is given, the file is truncated after the current record. In case the current record count of the file is less than the specified record, the command raises an error.

Only files which are rewindable can be truncated. In addition, the program must have write permission to the file, although it is not required that the file handle is opened for writing. The I/O modules for file formats which are not a simple record sequence must provide a truncation function or the operation will fail.

The command returns the molfile handle.

molfile unlock

molfile unlock filehandle propertylist/molfile/all

Unlock property data for the file object, meaning that they are again under the control of the standard data consistency manager.

The property data to unlock can be selected by providing a list of the following identifiers:

Property data locks are obtained by the molfile lock command.

This command is a generic property data manipulation command which is implemented for all major objects in the same fashion and is not related to disk file locking. Disk file locks can be set or reset by modifying the molfile object attribute lock. This is explained in more detail in the paragraph on the molfile get command.

The return value is the molfile handle.

molfile upgrade

molfile upgrade filehandle

If the I/O module provides a function to upgrade the format of an older file to the latest version of the format, for example after a support library upgrade, that function may be used. The only format which currently supports this feature is BDB.

The command returns the molfile handle.

molfile valid

molfile valid filehandle propertylist

Returns a list of boolean values indicating whether values for the named properties are currently set for the structure file. No attempt at computation is made.

Example:

if [molfile valid $fhandle F_COMMENT] {...}
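
When more than one property is named, the result list contains one boolean per property, in the same order. The sketch below pairs the names with the results; the file name and the choice of properties are illustrative.

```tcl
set fhandle [molfile open myfile.sdf]
set props [list F_COMMENT F_GAUSSIAN_JOB_PARAMS]
# One boolean is returned for each property in the list
foreach prop $props isvalid [molfile valid $fhandle $props] {
    puts "$prop: $isvalid"
}
molfile close $fhandle
```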

molfile vappend

molfile vappend filehandle objectlist

Virtually append records to an open file handle. The underlying file is not modified, but all future input operations on this file behave as if the extra records were present.

Because no actual output is generated, this command can only be applied on files opened for reading, not output files. In addition, the file handle needs to refer to a normal disk file and to support going backwards in the file, i.e. this command cannot be used on structure files opened via URLs, standard I/O channels, socket connections or composite virtual files with multiple physical files or the contents of a directory. The file format must support multiple records and the records must be encoded as a simple concatenated byte sequence. Examples for formats which work are SMILES or SD files for structures, or RXN or RD files for reactions.

The object list may contain ensemble, reaction or dataset handles. The data is split into virtual records according to the storage capabilities of the file. The format of the data written to the virtual records can be controlled by setting the writelist, droplist and hydrogens status attributes on the file handle.

When executed for the first time on a file handle for which the record count is yet unknown, the existing file records must be tallied and all current physical record positions be registered. For very large files, this can take some time. However, this is not equivalent to reading the complete file, so it does not consume much memory and the command can in principle work on arbitrarily large files.

Virtual records are held as string images in memory. A couple of thousand such records should not be a problem for typical workstations, but for systematic editing of large files where every record is touched an explicit scripted input/output loop is preferable.

The return value is the new record count of the file.

Changes to the file can be committed to disk by means of the molfile vrewrite command.

Example:

molfile vappend $fhandle [ens create c1ccccc1]

molfile vdelete

molfile vdelete filehandle recordlist

Virtually delete records from an open file handle. The underlying file is not modified, but all future input operations on this file behave as if the specified records had been deleted.

Because no actual output is generated, this command can only be applied on files opened for reading, not output files. In addition, the file handle needs to refer to a normal disk file and to support going backwards in the file, i.e. this command cannot be used on structure files opened via URLs, standard I/O channels, socket connections or composite virtual files with multiple physical files or the contents of a directory. The file format must support multiple records and the records must be encoded as a simple concatenated byte sequence. Examples for formats which work are SMILES or SD files for structures, or RXN or RD files for reactions.

When executed for the first time on a file handle for which the record count is yet unknown, the existing file records must be tallied and all current physical record positions be registered. For very large files, this can take some time. However, this is not equivalent to reading the complete file, so it does not consume much memory and the command can in principle work on arbitrarily large files.

The record list is a list of integer values, with one as the first file record. The list does not need to be sorted, and duplicate record numbers or record numbers out of range are ignored. It is possible to virtually delete file records which are themselves virtual, i.e. were added by the vappend, vreplace or vinsert subcommands and are not physically present in the file.

Virtually deleted records have negligible memory demands, but will slightly slow down input operations on edited files.

The return value is the new record count of the file.

Changes to the file can be committed to disk by means of the molfile vrewrite command.

Example:

molfile vdelete $fhandle [list 3 9 6]
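
The return value and the commit step can be combined as in the following sketch. The file names are placeholders; the deleted records only disappear from a physical file once molfile vrewrite has been called.

```tcl
set fhandle [molfile open myfile.sdf]
# Virtually delete three records; the new record count is returned
set count [molfile vdelete $fhandle [list 3 9 6]]
puts "records after virtual deletion: $count"
# Commit the edited record sequence to a new file, leaving the
# original file untouched
set written [molfile vrewrite $fhandle myfile_edited.sdf]
molfile close $fhandle
```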

molfile vinsert

molfile vinsert filehandle objectlist

Insert virtual records for the specified objects into the file. The insertion position is before the current read position.

Except for the difference in the location where the virtual records are inserted, the command is equivalent to the molfile vappend command and has the same features and limitations. Please refer to that command for details.

molfile vreplace

molfile vreplace filehandle objectlist

Insert virtual records for the specified objects into the file. The current input record is virtually overwritten.

Except for the difference in the location where the virtual records are inserted, and the fact that an existing record is replaced, the command is equivalent to the molfile vappend command and has the same features and limitations. Please refer to that command for details.

It is possible to replace a record which is itself virtual, i.e. was introduced by a vappend, vinsert or vreplace subcommand. If more than one output object is passed, or the object is written as multiple file records, additional virtual records are created and the record count of the file increased accordingly.

Example:

set eh [molfile read $fh]
ens expand $eh
molfile backspace $fh
molfile vreplace $fh $eh
ens delete $eh

This command sequence virtually replaces a record with a version where superatoms are expanded.

molfile vrewrite

molfile vrewrite filehandle ?filename?

Commit all virtual record additions, deletions or replacements to a physical file. If no file name is given, the current file name is used. After writing, the file handle remains valid. It is open for reading, and positioned before the first record. At this moment, the file no longer contains any virtual modifications, but the file handle may again be subjected to virtual edit operations. In case a file name is specified, and is not the same as the name of the current file, the file handle refers to the new file when the command has finished.

All valid records are copied verbatim to the new file, without going through decoding and re-encoding of records (see molfile copy command). A temporary file in the same directory as the current file is created, and sufficient disk space needs to be present to hold both the original file and the edited version at the same time. In case a problem occurs, the temporary file is deleted and the current file remains active. Only if all write operations succeed is the old file deleted and the temporary file renamed if necessary. In case a file name is specified, and it is not the same as that of the current file, the original file remains untouched, but is no longer linked to the molfile handle. For large files, this operation can take some time because massive amounts of data may need to be moved.

If the file referenced by the file handle has not been edited with virtual record operations (vappend, vdelete, vinsert, vreplace), the command does nothing and is equivalent to a molfile rewind.

The command returns the number of records written.

Example:

set fh [molfile open "myfile.sdf"]
molfile vinsert $fh 1 [ens create c1ncccc1]
molfile vrewrite $fh "myfile_with_pyridine_inserted_in_rec_1.sdf"

molfile write

molfile write filehandle ?objecthandle?...

This command writes structure and reaction data to a file. Object handles may be ensemble handles, reaction handles, dataset handles, or molfile handles.

If an object is an input molfile handle, objects are read from the file until EOF is encountered if the output file supports multiple records. If the output file type is single-record, only the next record is read. The types of objects which are collected from the input molfile handle are dependent on its read scope. These objects are then treated as if they were used as parameter objects directly. Objects obtained via a molfile handle are automatically deleted after they have been written. If the input file is already at EOF when the command is executed, no objects are read, and no error is generated. However, this does not trigger the NULL record output handling described below, because the file object was specified as an argument.

The type of data which is actually written to the file depends on its format. A file opened for ensemble output can be fed with any type of handle. If reactions or datasets are passed, these are taken apart and written as individual records. If the output file is a reaction file, and an ensemble is passed, the reaction it is a member of is looked up and used as output object. If the ensemble is not a reaction ensemble, an attempt is made to store it as a plain ensemble outside any reaction. If the output routine rejects this, an error is raised. In case of datasets passed as objects for reaction output, the individual dataset objects (ensembles or reactions) are written, in combination with reaction reference substitution in case ensembles instead of reactions are found. For full-dataset output, it is legal to pass non-dataset objects. In that case, no dataset-level information is written and the objects are stored as an anonymous dataset.

It is legal to supply no object handles at all. Normally, this means that simply no output is performed. However, I/O modules for specific file formats may support the output of special NULL records. In that case, the output function is called once without any objects. One example is Gaussian job files, which allow you to write records in multi-link files, where the computation instructions are taken from the file property F_GAUSSIAN_JOB_PARAMS, without supplying a structure record.

As part of the output process, new information may be computed on the objects. In case the active settings on the output molfile handle demand a structural change of an object, for example the addition or removal of hydrogen atoms, or the re-coding of ionic versus pentavalent nitro groups and similar functionality, the write objects are temporarily duplicated and these duplicates undergo the structure changes. The original output objects are never indirectly edited in their connectivity by this command.

The writelist attribute of molfiles may be set to a list of properties which should be included in the output. This has an effect only for file formats which support the storage of custom data values and which can cope with the data types of the listed properties. By default, no attempt is made to actively compute these properties for output. If they are not present in the input data, their output is silently omitted, or NULL values are written, depending on how the output format encodes these things. However, if the computeprops flag is set on the output molfile, an attempt for computation is made, and after output, the objects retain this additional data if the computation succeeds.
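
As a minimal sketch of the default behavior, the following lines set a writelist on an SD output file. The file name and the property selection are illustrative. Since the computeprops flag is not set here, the properties are only written if they are already present on the ensemble.

```tcl
set fh [molfile open out.sdf w]
# Request that these properties be stored with every record
molfile set $fh writelist [list E_NAME E_WEIGHT]
set eh [ens create C1CC1]
# Without computeprops, missing properties are not computed for output
molfile write $fh $eh
molfile close $fh
ens delete $eh
```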

If the hydrogen set mode of the output molfile calls for a change in hydrogen status, the stage when these computations are performed depends on the hydrogen addition mode. If the output mode calls for potential hydrogen additions, the computations are executed after the addition - and this means, on the temporary duplicate, so the original object does not see the new property data. If the hydrogen mode does not change the hydrogen set, or potentially removes hydrogens, computations are performed on the original objects and then the object is potentially duplicated, with all its data, for hydrogen removal and output. In the latter case, the additional property data is visible on the original input objects.

The command returns a list of the object handles which were actually written to file. In cases like a reaction being split into ensembles, or a dataset taken apart, this is not necessarily the same object handle collection as the input object list. For output from an input molfile argument, the total number of objects written is returned instead, because the read objects are not retained.

Examples:

molfile write "myfile.sdf" $eh1 $eh2
set fhandle [molfile open z.cbin w hydrogens add format cbin]
molfile write $fhandle $dset1
molfile write $fhandle $dset2
molfile close $fhandle

The first sample line uses the single-shot file operation feature of the molfile command. Instead of a molfile handle, a file name is passed, and that file is automatically opened, the output performed, and then the file is closed. Two ensembles are written with a single statement to the output file myfile.sdf. The desired file format is guessed from the file name suffix. No change in hydrogen status, etc. is performed, and no extra data is written out.

The next four example lines show how two complete datasets can be written to a native Cactvs toolkit binary file. Hydrogens are added to structures or reactions in the dataset - but the original dataset elements are not changed, since the addition is performed on temporary object duplicates. Also, the Cactvs binary format is requested explicitly by setting the format attribute. In this case, this is not really required, since the file format could also be guessed from the file name suffix. However, in case a non-standard file name suffix is used, formats must be specified explicitly, or the default format (MDL SD-file) is used. If the Cactvs binary file is later opened for reading with a read scope of dataset, all dataset elements plus the dataset-level property data can be recovered.
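
Passing an input molfile handle as the object argument gives a compact way to convert a file from one format to another. The following is a sketch only, with placeholder file names. It relies on the behavior described above: records are read until EOF when the output format supports multiple records, and the number of objects written is returned.

```tcl
set fhin [molfile open myfile.sdf]
set fhout [molfile open copy.cbin w format cbin]
# All remaining records of the input file are read, written to the
# output file, and the temporary objects are deleted automatically
set nwritten [molfile write $fhout $fhin]
puts "objects written: $nwritten"
molfile close $fhin
molfile close $fhout
```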