openclean_metanome.algorithm.hyfd module

Wrapper to run the HyFD algorithm (A Hybrid Approach to Functional Dependency Discovery) from the Metanome data profiling library. HyFD is a functional dependency discovery algorithm.

Thorsten Papenbrock, Felix Naumann. A Hybrid Approach to Functional Dependency Discovery. ACM International Conference on Management of Data (SIGMOD '16).

From the abstract: […] HyFD combines fast approximation techniques with efficient validation techniques in order to find all minimal functional dependencies in a given dataset. While operating on compact data structures, HyFD not only outperforms all existing approaches, it also scales to much larger datasets.
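To make the term concrete: a functional dependency X → Y holds in a dataset when any two rows that agree on the columns in X also agree on column Y. The following is a minimal sketch of a naive check for a single candidate dependency. It is not the HyFD algorithm and does not use this module; the helper name `fd_holds` and the sample rows are made up for illustration.

```python
from typing import Dict, Hashable, List, Sequence


def fd_holds(rows: List[Dict[str, Hashable]], lhs: Sequence[str], rhs: str) -> bool:
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}  # maps each LHS value combination to the first RHS value seen
    for row in rows:
        key = tuple(row[col] for col in lhs)
        value = row[rhs]
        if key in seen and seen[key] != value:
            # Two rows agree on the LHS but differ on the RHS.
            return False
        seen.setdefault(key, value)
    return True


# Hypothetical sample data.
rows = [
    {"zip": "10001", "city": "New York", "name": "Ann"},
    {"zip": "10001", "city": "New York", "name": "Bob"},
    {"zip": "60601", "city": "Chicago", "name": "Ann"},
]

print(fd_holds(rows, ["zip"], "city"))   # True
print(fd_holds(rows, ["name"], "city"))  # False: "Ann" maps to two cities
```

Checking every candidate this way quickly becomes expensive as the number of columns grows, which is the problem that discovery algorithms such as HyFD address.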

class openclean_metanome.algorithm.hyfd.HyFD(max_lhs_size: int = -1, input_row_limit: int = -1, validate_parallel: bool = False, memory_guardian: bool = True, null_equals_null: bool = True, env: Optional[Dict] = None, verbose: Optional[bool] = True)

Bases: openclean.profiling.constraints.fd.FunctionalDependencyFinder

HyFD is a hybrid discovery algorithm for functional dependencies. HyFD combines fast approximation techniques with efficient validation techniques in order to find all minimal functional dependencies in a given dataset:

Thorsten Papenbrock, Felix Naumann. A Hybrid Approach to Functional Dependency Discovery. ACM International Conference on Management of Data (SIGMOD '16).

run(df: pandas.core.frame.DataFrame) → List[openclean.profiling.constraints.fd.FunctionalDependency]

Run the HyFD algorithm on the given data frame.

Returns a list of all discovered functional dependencies. If execution of the Metanome algorithm fails, a RuntimeError is raised.

Parameters

df (pd.DataFrame) – Input data frame.

Returns

Return type

list of FunctionalDependency
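A minimal sketch of using the class together with the documented RuntimeError, assuming the package and whatever Metanome runtime it depends on are installed. The helper name `find_fds_or_empty` and the fallback of returning an empty list are illustrative choices, not part of the library.

```python
def find_fds_or_empty(df, max_lhs_size=2):
    """Run HyFD on df and return the discovered FDs, or [] if the run fails."""
    # Imported lazily so this sketch can be loaded even where the wrapper
    # is not installed; in real code a top-level import is more usual.
    from openclean_metanome.algorithm.hyfd import HyFD

    # Constructor arguments follow the class signature documented above.
    finder = HyFD(max_lhs_size=max_lhs_size, verbose=False)
    try:
        return finder.run(df)
    except RuntimeError as err:
        # run() raises RuntimeError when the Metanome algorithm fails.
        print(f"HyFD failed: {err}")
        return []
```

Because the configuration lives on the instance, the same `HyFD` object can be passed around and applied to several data frames; whether swallowing the error is appropriate depends on the calling code.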

openclean_metanome.algorithm.hyfd.hyfd(df: pandas.core.frame.DataFrame, max_lhs_size: int = -1, input_row_limit: int = -1, validate_parallel: bool = False, memory_guardian: bool = True, null_equals_null: bool = True, env: Optional[Dict] = None, verbose: Optional[bool] = True) → List[openclean.profiling.constraints.fd.FunctionalDependency]

Run the HyFD algorithm on a given data frame. HyFD is a hybrid discovery algorithm for functional dependencies.

Parameters
  • df (pd.DataFrame) – Input data frame.

  • max_lhs_size (int, default=-1) – Defines the maximum size of the left-hand-side for discovered FDs. Use -1 to ignore size limits on FDs.

  • input_row_limit (int, default=-1) – Limit the number of rows from the input file that are used for functional dependency discovery. Use -1 for all rows.

  • validate_parallel (bool, default=False) – If True, the algorithm uses multiple threads (one thread per available CPU core).

  • memory_guardian (bool, default=True) – Activate the memory guardian to prevent out-of-memory errors.

  • null_equals_null (bool, default=True) – Result of comparing two NULL values: if True, two NULL values are considered equal.

  • env (dict, default=None) – Optional environment variables that override the system-wide settings.

  • verbose (bool, default=True) – Output run logs if True.
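The effect of null_equals_null is easiest to see on a small example. The sketch below is a naive pairwise check written for illustration only; it assumes the common interpretation in which two NULL values either compare as equal or never match anything. How the Metanome implementation treats NULLs internally is not described here, and the helper names and sample rows are hypothetical.

```python
def values_equal(a, b, null_equals_null):
    """Compare two cell values, using None to represent NULL."""
    if a is None or b is None:
        # NULL only matches NULL, and only if null_equals_null is set.
        return null_equals_null and a is None and b is None
    return a == b


def fd_holds_with_nulls(rows, lhs, rhs, null_equals_null=True):
    """Naive pairwise check of lhs -> rhs under the chosen NULL semantics."""
    for i, first in enumerate(rows):
        for second in rows[i + 1:]:
            same_lhs = all(
                values_equal(first[col], second[col], null_equals_null)
                for col in lhs
            )
            if same_lhs and not values_equal(first[rhs], second[rhs], null_equals_null):
                return False
    return True


# Hypothetical sample data with missing department values.
rows = [
    {"dept": None, "floor": 1},
    {"dept": None, "floor": 2},
    {"dept": "HR", "floor": 3},
]

# NULL == NULL: the first two rows agree on dept but differ on floor.
print(fd_holds_with_nulls(rows, ["dept"], "floor", null_equals_null=True))   # False
# NULL != NULL: no two rows agree on dept, so nothing violates the FD.
print(fd_holds_with_nulls(rows, ["dept"], "floor", null_equals_null=False))  # True
```

Under this interpretation, the setting can change which dependencies are reported for columns with many missing values.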

Returns

Return type

list of FunctionalDependency
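A sketch of calling the function with non-default settings, assuming the package and its Metanome runtime are available. The keyword names come from the signature above; the specific values, the `HYFD_OPTIONS` dictionary, and the helper name `discover_small_fds` are illustrative assumptions.

```python
# Example settings (illustrative values, not recommendations).
HYFD_OPTIONS = {
    "max_lhs_size": 2,          # only report FDs with at most two LHS columns
    "input_row_limit": 10_000,  # use only a limited number of input rows
    "validate_parallel": True,  # one validation thread per available CPU core
    "memory_guardian": True,    # keep the out-of-memory protection enabled
    "null_equals_null": False,  # do not treat two NULL values as equal
    "verbose": False,           # suppress run logs
}


def discover_small_fds(df, options=HYFD_OPTIONS):
    """Run hyfd() on a pandas data frame with the options above."""
    # Imported lazily so this sketch can be loaded even where the wrapper
    # is not installed; in real code a top-level import is more usual.
    from openclean_metanome.algorithm.hyfd import hyfd

    return hyfd(df, **options)
```

Limiting max_lhs_size and input_row_limit trades completeness for a smaller search, which may be useful for a first look at a wide or long table.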

openclean_metanome.algorithm.hyfd.parse_result(outputfile: str, colmap: Dict) → List[openclean.profiling.constraints.fd.FunctionalDependency]

Parse the result file of the FD discovery run to generate a list of discovered functional dependencies.

Parameters
  • outputfile (string) – Path to the output file containing the discovered FDs.

  • colmap (dict) – Mapping of column names from surrogate names to column names in the input data frame schema.

Returns

Return type

list of FunctionalDependency
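The role of colmap can be sketched as follows. The format of the output file and the form of the surrogate names are not described above, so this example skips file parsing entirely and only shows the renaming step on in-memory data. The surrogate names (`col0`, `col1`, `col2`), the tuple representation of a dependency, and the helper name `rename_columns` are all hypothetical.

```python
from typing import Dict, List, Tuple

# A dependency as (left-hand-side columns, right-hand-side column).
SimpleFD = Tuple[List[str], str]


def rename_columns(fds: List[SimpleFD], colmap: Dict[str, str]) -> List[SimpleFD]:
    """Translate surrogate column names back to the data frame's names."""
    return [([colmap[col] for col in lhs], colmap[rhs]) for lhs, rhs in fds]


# Hypothetical mapping from surrogate names to original column names.
colmap = {"col0": "zip", "col1": "city", "col2": "state"}

# Hypothetical dependencies expressed with surrogate names.
surrogate_fds = [(["col0"], "col1"), (["col0", "col1"], "col2")]

print(rename_columns(surrogate_fds, colmap))
# [(['zip'], 'city'), (['zip', 'city'], 'state')]
```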