2.1. MolSSI Formatting Guidelines for Machine Learning Products
This document presents a set of guidelines and best practices for formatting machine learning (ML) products (e.g., datasets, modules, models, etc.) before submitting them on the Zenodo platform and tagging them to one or more curated MolSSI community collections.
2.1.1. Requirements and Policies
2.1.1.1. Prerequisites
Before getting started, please take a glance at the MolSSI Publishing Guidelines on Zenodo Platform to familiarize yourself with the basic mechanics and recommended strategies for publishing your software products on Zenodo.
2.1.1.2. Dataset File Formats
Datasets are tables of data hosting instances (chemical species such as atoms, molecules, macromolecules etc.) in rows and features or descriptors in columns. Each label results from an experimental observation or theoretical calculation and corresponds to a feature. As such, labels are stored at the intersection of each row and column.
We recognize two main conceptual categories for featurizing the input data: (i) geometrical data (e.g., coordinates, connectivities, atomic symbols, etc.), and (ii) chemical features (e.g., energetics, electronic properties etc.).
Geometrical Data
Representation of geometrical data pertinent to individual chemical species such as molecules (or monomers, dimers, polymers, clusters, unit cells, etc.) is dependent upon the task and adopted ML algorithm. In general, the raw information on individual molecular structures should be stored as separate files within a subfolder of the root directory called geometries. The recommended file format for storing geometrical data is the Chemical Table Format (*.mol*) which allows for a convenient usage of popular and free chemical data conversion toolkits such as Open Babel. This representation is probably most useful before training each model since the majority of the ML models require a featurized version of these structures into a numerical representation or embedding.
For molecules with no bonding information, it is acceptable to use the XYZ file format (*.xyz*) while for periodic (crystalline) systems the Crystallographic Information File (*.cif*) would be appropriate, while for biomolecules the closely related Macromolecular Crystallographic Information File (*.cif* or *.mmcif*) is usually the best choice.
The text-based molecular structure representations such as Simplified Molecular-Input Line-Entry System (SMILES) provide an alternative to geometrical coordinates and are comparatively straightforward to be stored in the dataset tables. SMILES are often precursors for an embedded vector representation of the input structures. Software packages such as RDKit cheminformatics program and Open Babel toolkit provide modules for the featurization of the SMILES into acceptable representations for ML algorithms. These representations, for example, can be based on graph convolutions, physical properties or extended-connectivity fingerprints (ECFPs).
Chemical Features
MolSSI’s recommendation for storing the features such as SMILES, energetics or property values is the comma-separated value (CSV)
format. This file should be stored in the root directory of the main dataset repository as <dataset_name>_meta.csv
. This provides a
coherent and convenient way for popular tools such as the Pandas library in python to perform the
input/output (I/O) operations and easily convert them between the popular storage formats such as hierarchical data format (HDF), JSON etc..
2.1.1.3. Metadata File Formats
MolSSI encourages all users to carefully record their metadata in a separate <dataset_name>_meta.csv
file and store it in
the root directory of the main dataset repository. Each row in the metadata table should correspond to a column label
[the headline (first) row of the main dataset CSV file] and should at least, describe its type accompanied by a short
description of what they are referring to. For example, molecular charge is often of integer type and reflects the electrical
charge on the molecule at its ground electronic state in vacuum.
2.1.1.4. Naming Conventions
In order to make both I/O, query and search operations more convenient, all geometrical files within geometries folder should
be named as <dataset_name>_<id>.mol
where the name of the dataset in the all-lowercase form is appended by its dataset table
id
and separated by an underscore. Each id
is an integer type value located at the first column of the dataset table and
corresponds to an individual chemical species included in the geometry .mol files.
For performance reasons, a large number of geometry files can be batched into separate subfolders of the geometries folder and
sorted according to the geometries/<batch_number>
naming convention. For the sake of consistency, all file names, feature
titles, and labels should be in lowercase.
2.1.1.5. Data Types
Depending on the nature of the data object, both features and labels can assume different data types.
Scalar Quantities
Numerical-type scalar properties such as molecular charge can be stored as integers or floats with different bit lengths.
Vector Quantities
By default, the elements of the dense vector quantities such as multiple moments should be stored in separate columns.
For dense vectors of variable length, however, multiple techniques can be adopted. Since the majority of ML models require
the input vectors of the same length, it is possible to find the dimension of the largest vector and pad the other smaller-size
vector instances with zeros. Popular ML frameworks such as TensorFlow provide data preprocessing tools to deal with these cases.
Nevertheless, zero padding wastes the storage resources. In some cases, we might choose to organize such variable length features
in separate <feature>.csv
files in the root directory and deal with them differentlly alongside describing the details of the
adopted approach in the corresponding <dataset_name>_meta.csv
file. MolSSI’s recommendation, however, is to keep the number of
files in the root directory to its minimum.
Another way is to store the vector elements is to serialize the vector to a string or byte representation. The former representation might not be economical for very large datasets while the latter might be more appropriate for I/O operations via limited bandwidths over the network.
Higher-order (>1) Tensor Quantities
Dense matrices and higher dimensional tensors can be flattened to their one dimensional counterparts and treated like vectors.
For sparse matrices and higher dimensional tensors, however, ML frameworks usually provide tools to use the sparsity or manipulate
the representation such as converting it to a one-hot for classification tasks. There are also libraries that can handle the
packing/unpacking operations for sparse matrices and tensors such as Armadillo in C++ and
scipy.sparse in python. Should the user decide to use a different
approach that is not supported by the existing tools, the details of packing/unpacking of the tensor elements should be explicitly
delineated in the metadata <dataset_name>_meta.csv
or Readme.md
file in the root directory of the main repository. MolSSI
recommends the authors to provide scripts or utility modules in order to assist the data engineering process. The corresponding
tools should be placed in the utility
subfolder in the root directory.
Non-numerical Quantities
Non-numerical quantities such as SMILES, by construction, can be stored in the string format for each instance entry and as a separate feature.