preprocess_data
- autoplex.data.common.jobs.preprocess_data(vasp_ref_dir, test_ratio=None, regularization=False, retain_existing_sigma=False, scheme='linear-hull', distillation=False, force_max=40, force_label='REF_forces', pre_database_dir=None, reg_minmax=None, isolated_atom_energies=None)
Preprocess data before fitting machine learning models.
This function handles tasks such as splitting the dataset, applying regularization, accumulating the database, and filtering structures based on maximum force values.
- Parameters:
vasp_ref_dir (str) – Path to the directory containing the reference VASP calculation data.
test_ratio (float) – Fraction of the data to place in the test set. If None, no splitting is performed.
regularization (bool) – If True, apply regularization. This only works for GAP.
retain_existing_sigma (bool) – If True, existing sigma values for specific configuration types are kept unchanged.
scheme (str) – Scheme to use for regularization.
distillation (bool) – If True, apply data distillation.
force_max (float) – Maximum force value; structures whose forces exceed it are excluded.
force_label (str) – The label of force values to use for distillation.
pre_database_dir (str) – Directory where the previous database was saved.
reg_minmax (list[tuple]) – A list of tuples representing the minimum and maximum values for regularization.
isolated_atom_energies (dict) – A dictionary containing isolated-atom energy values for different species.
- Returns:
The current working directory.
- Return type:
Path
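The force filter and the train/test split described above can be illustrated with a minimal, library-free sketch. It is not autoplex's implementation. The dict-based structure layout, the helper names (`max_force_norm`, `filter_by_force`, `split_train_test`), the use of the per-atom force magnitude as the filter criterion, and the `seed` argument are all assumptions made for illustration. Only `force_max`, `force_label` and `test_ratio` mirror the documented parameters.

```python
import random


def max_force_norm(forces):
    """Return the largest per-atom force magnitude in one structure."""
    return max((fx * fx + fy * fy + fz * fz) ** 0.5 for fx, fy, fz in forces)


def filter_by_force(structures, force_max=40.0, force_label="REF_forces"):
    """Keep structures whose largest force magnitude does not exceed force_max.

    Assumption: the criterion is the per-atom force norm; the real function
    may apply a different rule.
    """
    return [s for s in structures if max_force_norm(s[force_label]) <= force_max]


def split_train_test(structures, test_ratio=None, seed=0):
    """Shuffle and split into (train, test); no split when test_ratio is None."""
    if test_ratio is None:
        return list(structures), []
    shuffled = list(structures)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy data: each structure is a dict with a list of force vectors.
structures = [
    {"name": "a", "REF_forces": [(0.1, 0.0, 0.0), (0.0, 0.2, 0.0)]},
    {"name": "b", "REF_forces": [(50.0, 0.0, 0.0)]},  # above the threshold
    {"name": "c", "REF_forces": [(3.0, 4.0, 0.0)]},
    {"name": "d", "REF_forces": [(0.0, 0.0, 39.9)]},
    {"name": "e", "REF_forces": [(1.0, 1.0, 1.0)]},
]

kept = filter_by_force(structures, force_max=40.0)
train, test = split_train_test(kept, test_ratio=0.25)
unsplit_train, unsplit_test = split_train_test(kept, test_ratio=None)
```

In this toy data, structure `b` is dropped by the filter. Passing `test_ratio=None` leaves the data unsplit, matching the documented behaviour of that parameter.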