Input YAML configuration file#
The YAML configuration file contains all the inputs you need to define and customize your RSS workflow. This section provides a detailed explanation of how to set up the file, using silicon (Si
) as an example. To make it easier to understand, we will break down the YAML file into several sections based on the key components of the RSS workflow and explain each part in detail.
General parameters#
The section defines the general settings for the RSS workflow, including the system’s formula, whether to start fresh, or resume from a previous state.
# General Parameters
tag: 'Si'
train_from_scratch: false
resume_from_previous_state:
test_error:
pre_database_dir:
mlip_path:
isolated_atom_energies:
The tag
identifies the elements in the system, such as Si
in this example, and can also be passed to the parameters for random structure generation. Setting train_from_scratch=false
indicates that the RSS workflow will start from scratch. To resume a workflow, refer to the Resuming workflow from point of interruption section.
Buildcell parameters#
The section defines the settings for generating initial random structures. These parameters control the diversity, symmetry, size, and number of the generated structures. We utilize the buildcell
software from AIRSS for random structure generation, ensuring flexibility and efficiency in exploring configurational space.
# Buildcell Parameters
generated_struct_numbers:
- 8000
- 2000
buildcell_options:
- NFORM: '1'
SYMMOPS: '1-4'
SLACK: 0.25
OVERLAP: 0.1
NATOM: '{6,8,10,12,14,16,18,20,22,24}'
- NFORM: '1'
SYMMOPS: '1-4'
SLACK: 0.25
OVERLAP: 0.1
NATOM: '{7,9,11,13,15,17,19,21,23}'
fragment_file: null
fragment_numbers: null
num_processes_buildcell: 128
The buildcell_options
parameter is one of the most critical settings,
as it determines the scope of the RSS search and directly influences the diversity of the generated structures.
We provide multiple sets of buildcell
input parameters.
Using the configuration file above, a total of 8000 structures will be generated, each containing an even number of atoms.
Additionally, 2000 structures will be created with an odd number of atoms. The symmetry operations will vary from 1 to 4.
You can supply a single set or multiple sets as needed.
This flexibility ensures that the initial structures have sufficient diversity.
In principle, any parameter supported by buildcell can be used in the buildcell_options
section.
The fragment_file
and fragment_numbers
parameters are used during random structure generation to define specific fragments as the smallest building blocks. For example, you can define an H2O molecule as a fragment and use it as the basic unit for generating random structures. This allows for more customized and realistic initial configurations when working with molecular or other complex systems. The num_processes_buildcell
parameter specifies the number of CPU cores to be used in parallel during random structure generation. Note that this parameter is limited to a single node.
Note: The
generated_struct_numbers
andbuildcell_options
parameters must have the same length. Each entry inbuildcell_options
corresponds to the number of structures specified at the same position ingenerated_struct_numbers
.
Sampling parameters#
The section controls the selection methods of structures during the RSS workflow. These parameters determine how initial and iteratively generated structures are selected.
# Sampling Parameters
num_of_initial_selected_structs:
- 80
- 20
num_of_rss_selected_structs: 100
initial_selection_enabled: true
rss_selection_method: 'bcur2i'
bcur_params:
soap_paras:
l_max: 12
n_max: 12
atom_sigma: 0.0875
cutoff: 10.5
cutoff_transition_width: 1.0
zeta: 4.0
average: true
species: true
frac_of_bcur: 0.8
bolt_max_num: 3000
random_seed: null
If the RSS workflow starts exploration from scratch, it consists of two stages. In the first stage, there is no pre-existing potential, so structures are directly selected from the initial random structures for fitting. The num_of_initial_selected_structs
parameter defines how many structures are selected from the initial random structures. In the example provided, the workflow selects 80 structures from 8000 even-numbered cells and 20 structures from 2000 odd-numbered cells for fitting the initial potential. In the second stage, after obtaining the initial potential, the workflow transitions to ML-driven RSS iterations. During this stage, the num_of_rss_selected_structs
parameter specifies the number of structures sampled in each RSS iteration.
The initial_selection_enabled
parameter enables initial structure selection when set to true
. In this case, the sampling method will be CUR. The rss_selection_method
defines the strategy for selecting structures during the RSS iteration. In this example, the method bcur2i
is used, which combines Boltzmann-weighted energy histograms and CUR sampling to select low-energy and diverse structures. The bcur_params
section provides detailed settings for the bcur2i
method. The soap_paras
define the SOAP descriptors used for structure representation.
DFT labelling parameters#
The section allows you to set up VASP static calculations for accurately labeling training structures, including bulk, isolated atoms, and dimers.
# DFT Labelling Parameters
include_isolated_atom: true
isolatedatom_box:
- 20.0
- 20.0
- 20.0
e0_spin: false
include_dimer: true
dimer_box:
- 20.0
- 20.0
- 20.0
dimer_range:
- 1.0
- 5.0
dimer_num: 21
custom_incar:
KPAR: 8
NCORE: 16
LSCALAPACK: ".FALSE."
LPLANE: ".FALSE."
ISMEAR: 0
SIGMA: 0.05
PREC: "Accurate"
ADDGRID: ".TRUE."
EDIFF: 1E-7
NELM: 250
LWAVE: ".FALSE."
LCHARG: ".FALSE."
ALGO: "normal"
AMIX: 0.1
LREAL: ".FALSE."
ISYM: 0
ENCUT: 400.0
KSPACING: 0.25
GGA: 'PS'
custom_potcar:
vasp_ref_file: 'vasp_ref.extxyz'
If include_isolated_atom
and include_dimer
are set to true
, the program will automatically identify all elements present in the structures generated by buildcell
and set up cells for isolated atoms and dimers accordingly based on the recognized elements. The custom_incar
parameter allows you to define any VASP settings. You can adjust these settings by adding or removing keys as needed.
Data preprocessing parameters#
The section defines how the training data is prepared before fitting potentials. This includes filtering, regularization, combining external datasets, and splitting the data into training and testing sets.
# Data Preprocessing Parameters
config_types:
- 'initial'
- 'traj_early'
- 'traj'
rss_group:
- 'traj'
test_ratio: 0.1
regularization: true
scheme: 'linear-hull'
retain_existing_sigma: true
reg_minmax:
- [0.1, 1]
- [0.001, 0.1]
- [0.0316, 0.316]
- [0.0632, 0.632]
distillation: false
force_max: null
force_label: null
pre_database_dir: null
Regularization is currently only applicable to GAP potentials and is adjusted using the scheme
parameter. Common schemes include 'linear-hull'
and 'volume-stoichiometry'
. For systems with fixed stoichiometry, 'linear-hull'
is recommended. For systems with varying stoichiometries, 'volume-stoichiometry'
is more appropriate.
MLIP Parameters#
The section defines the settings for training machine learning potentials. Currently supported architectures include GAP, ACE(Julia), NequIP, M3GNet, and MACE. You can specify the desired model using the mlip_type
argument and tune hyperparameters flexibly by adding key-value pairs. Default and adjustable hyperparameters are available in autoplex/autoplex/fitting/common/mlip-rss-defaults.json
.
# MLIP Parameters
mlip_type: 'GAP'
ref_energy_name: 'REF_energy'
ref_force_name: 'REF_forces'
ref_virial_name: 'REF_virial'
auto_delta: true
num_processes_fit: 32
device_for_fitting: 'cpu'
twob:
cutoff: 10.0
n_sparse: 30
theta_uniform: 1.0
threeb:
cutoff: 3.25
soap:
l_max: 8
n_max: 8
atom_sigma: 0.75
n_sparse: 2000
cutoff: 5.0
general:
three_body: true
RSS Exploration Parameters#
The section sets up the RSS iterative process, including parameters for structure optimization and convergence criteria.
# RSS Exploration Parameters
scalar_pressure_method: 'exp'
scalar_exp_pressure: 1
scalar_pressure_exponential_width: 0.2
scalar_pressure_low: 0
scalar_pressure_high: 25
max_steps: 200
force_tol: 0.01
stress_tol: 0.01
stop_criterion: 0.001
max_iteration_number: 25
num_groups: 6
initial_kt: 0.3
current_iter_index: 1
hookean_repul: true
hookean_paras:
'(14, 14)': [1000, 1.0]
keep_symmetry: true
write_traj: true
num_processes_rss: 128
device_for_rss: 'cpu'
The RSS workflow supports searching for structures under high pressure, controlled by the scalar_pressure_method
parameter. Two methods are available: exp
, where pressure is sampled based on an exponential distribution with control parameters scalar_exp_pressure
and scalar_pressure_exponential_width
, and uniform
, where pressure is sampled uniformly within a range defined by scalar_pressure_low
and scalar_pressure_high
.
To terminate the iterative process, two stopping criteria are provided: stop_criterion
and max_iteration_number
. The iterations stop when the prediction error falls below the value of stop_criterion
. Or the iterations stop when the number of iterations exceeds the limit defined by max_iteration_number
. The workflow will stop when either of the above criteria is satisfied.
We strongly recommend enabling hookean_repul
, as it applies a strong repulsive force when the distance between two atoms falls below a certain threshold. This ensures that the generated structures are physically reasonable.
The GAP-RSS model of Si was iterated 25 times on a server cluster with 7 nodes, each equipped with 128 cores, taking approximately 1 day to complete. The resulting potential was found to accurately describe different crystalline polymorphs as well as disordered phases.