preprocess_data
- autoplex.data.common.jobs.preprocess_data(vasp_ref_dir, test_ratio=None, regularization=False, retain_existing_sigma=False, scheme='linear-hull', element_order=None, distillation=False, force_max=40, force_label='REF_forces', energy_label='REF_energy', pre_database_dir=None, reg_minmax=None, isolated_atom_energies=None)
Preprocess data before fitting machine learning models.
This function handles tasks such as splitting the dataset, applying regularization, accumulating the database, and filtering out structures based on maximum force values.
- Parameters:
vasp_ref_dir (str) – Path to the directory containing the reference VASP calculation data.
test_ratio (float) – Fraction of the data assigned to the test set when splitting. If None, no splitting is performed.
regularization (bool) – If True, apply regularization. This only works for GAP.
retain_existing_sigma (bool) – If True, existing sigma values for specific configuration types are left unchanged.
scheme (str) – Scheme to use for regularization.
element_order (list | None) – List of atomic numbers in the chosen order (e.g. [42, 16] for MoS2). This value is useful when constructing high-dimensional convex hulls based on the “volume-stoichiometry” scheme. In particular, if the dataset contains compounds with different numbers of constituent elements (e.g., both binary and ternary structures), this value must be set explicitly to ensure the convex hull is constructed consistently.
distillation (bool) – If True, apply data distillation.
force_max (float) – Maximum allowed force value; structures exceeding it are excluded.
force_label (str) – The label of force values to use for distillation.
energy_label (str) – The label of energy values to use for distillation.
pre_database_dir (str) – Directory where the previous database was saved.
reg_minmax (list[tuple]) – A list of tuples representing the minimum and maximum values for regularization.
isolated_atom_energies (dict) – A dictionary containing the isolated-atom energies for the different species.
- Returns:
The current working directory.
- Return type:
Path
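The splitting and force-filtering steps described above can be illustrated with a minimal, self-contained sketch. This is not autoplex's implementation: the helper names `filter_by_force_max` and `split_train_test`, the dictionary-based structure records, and the `seed` argument are all hypothetical. It also assumes that `force_max` is compared against the largest absolute force component, which the documentation above does not specify.

```python
import random


def filter_by_force_max(structures, force_max=40.0):
    """Keep structures whose largest absolute force component is <= force_max.

    ASSUMPTION: each structure is a dict with a "forces" entry holding one
    [fx, fy, fz] list per atom. The real function works on its own data format.
    """
    kept = []
    for structure in structures:
        largest = max(abs(c) for force in structure["forces"] for c in force)
        if largest <= force_max:
            kept.append(structure)
    return kept


def split_train_test(structures, test_ratio=None, seed=0):
    """Shuffle and split into (train, test); no split if test_ratio is None."""
    if test_ratio is None:
        return list(structures), []
    shuffled = list(structures)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# Toy data: one atom per structure, with force component i along x.
data = [{"id": i, "forces": [[float(i), 0.0, 0.0]]} for i in range(50)]
kept = filter_by_force_max(data, force_max=40.0)
train, test = split_train_test(kept, test_ratio=0.1)
```

Filtering before splitting, as done here, keeps outlier structures out of both subsets; whether preprocess_data applies the steps in this order is an assumption.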