preprocess_data


autoplex.data.common.jobs.preprocess_data(vasp_ref_dir, test_ratio=None, regularization=False, retain_existing_sigma=False, scheme='linear-hull', distillation=False, force_max=40, force_label='REF_forces', pre_database_dir=None, reg_minmax=None, isolated_atom_energies=None)

Preprocess data before fitting machine learning models.

This function handles dataset splitting, regularization, database accumulation, and filtering of structures based on maximum force values.

Parameters:
  • vasp_ref_dir (str) – Path to the directory containing the reference VASP calculation data.

  • test_ratio (float) – Fraction of the data assigned to the test set. If None, no splitting is performed.

  • regularization (bool) – If True, apply regularization. This only works for GAP.

  • retain_existing_sigma (bool) – If True, existing sigma values for specific configuration types are kept unchanged.

  • scheme (str) – Scheme to use for regularization.

  • distillation (bool) – If True, apply data distillation.

  • force_max (float) – Maximum force value used to filter structures; structures whose forces exceed it are excluded.

  • force_label (str) – The label of force values to use for distillation.

  • pre_database_dir (str) – Directory where the previous database was saved.

  • reg_minmax (list[tuple]) – A list of tuples representing the minimum and maximum values for regularization.

  • isolated_atom_energies (dict) – A dictionary containing isolated-atom energy values for different species.
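
The following is a minimal sketch of the two ideas behind force_max and test_ratio: dropping structures whose largest force exceeds a threshold, then splitting the remainder into training and test sets. It is not the autoplex implementation. The helper names (max_force_norm, filter_by_force, split_train_test), the dictionary-based structure format, the seed argument, and the choice to compare per-atom force norms rather than individual components are all assumptions made for illustration.

```python
import random


def max_force_norm(forces):
    """Return the largest Euclidean norm among per-atom force vectors."""
    return max((fx * fx + fy * fy + fz * fz) ** 0.5 for fx, fy, fz in forces)


def filter_by_force(structures, force_max=40, force_label="REF_forces"):
    """Keep only structures whose largest force norm is at most force_max.

    Assumption: each structure is a dict storing its forces under force_label.
    """
    return [s for s in structures if max_force_norm(s[force_label]) <= force_max]


def split_train_test(structures, test_ratio=None, seed=0):
    """Shuffle and split into (train, test); no split when test_ratio is None."""
    if test_ratio is None:
        return list(structures), []
    shuffled = list(structures)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy data: structure "b" carries a force above the threshold.
structures = [
    {"name": "a", "REF_forces": [(0.1, 0.0, 0.0), (0.0, 0.2, 0.0)]},
    {"name": "b", "REF_forces": [(50.0, 0.0, 0.0)]},
    {"name": "c", "REF_forces": [(3.0, 4.0, 0.0)]},
    {"name": "d", "REF_forces": [(0.0, 0.0, 1.0)]},
]

kept = filter_by_force(structures, force_max=40)
train, test = split_train_test(kept, test_ratio=0.33)
```

With this toy data, structure "b" is removed by the filter and the remaining three structures are divided between the two sets.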

Returns:

The current working directory.

Return type:

Path
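
A call using the documented keyword arguments might look like the sketch below. The wrapper name call_preprocess_data and the argument values are hypothetical. This page does not state whether the call runs immediately or returns an object to be executed by a workflow manager, so the sketch simply returns whatever the call produces.

```python
def call_preprocess_data(vasp_ref_dir):
    """Hypothetical wrapper showing the documented keyword arguments."""
    # Imported lazily so the sketch can be loaded without autoplex installed.
    from autoplex.data.common.jobs import preprocess_data

    return preprocess_data(
        vasp_ref_dir=vasp_ref_dir,  # directory with the reference VASP data
        test_ratio=0.1,             # illustrative value; None disables splitting
        distillation=True,          # enable data distillation
        force_max=40,               # documented default
        force_label="REF_forces",   # documented default
    )
```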