openclean_metanome.algorithm.hyfd module
Wrapper to run the HyFD algorithm (A Hybrid Approach to Functional Dependency Discovery) from the Metanome data profiling library. HyFD is a functional dependency discovery algorithm.
Thorsten Papenbrock, Felix Naumann. A Hybrid Approach to Functional Dependency Discovery. ACM International Conference on Management of Data (SIGMOD ’16).
From the abstract: […] HyFD combines fast approximation techniques with efficient validation techniques in order to find all minimal functional dependencies in a given dataset. While operating on compact data structures, HyFD not only outperforms all existing approaches, it also scales to much larger datasets.
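To make the notion of a minimal functional dependency concrete, the sketch below finds FDs by brute force over a small list of rows. It is not the HyFD algorithm, only a naive (exponential) enumeration for illustration. The helper names `holds` and `naive_minimal_fds` and the sample data are hypothetical and not part of openclean_metanome.

```python
from itertools import combinations


def holds(rows, lhs, rhs):
    """Return True if the columns in lhs functionally determine rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        # setdefault stores the first rhs value seen for this lhs key.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def naive_minimal_fds(rows, columns, max_lhs_size=-1):
    """Enumerate minimal FDs (lhs, rhs) by increasing lhs size.

    Hypothetical helper: max_lhs_size mirrors the wrapper's parameter
    name only for readability; -1 means no limit.
    """
    limit = len(columns) - 1 if max_lhs_size < 0 else max_lhs_size
    result = []
    for rhs in columns:
        others = [c for c in columns if c != rhs]
        found = []  # minimal left-hand sides already found for this rhs
        for size in range(0, limit + 1):
            for lhs in combinations(others, size):
                # A superset of a known lhs cannot be minimal, so skip it.
                if any(set(f) <= set(lhs) for f in found):
                    continue
                if holds(rows, lhs, rhs):
                    found.append(lhs)
                    result.append((lhs, rhs))
    return result


# Hypothetical sample data.
rows = [
    {"zip": "10001", "city": "New York", "name": "Ann"},
    {"zip": "10001", "city": "New York", "name": "Bob"},
    {"zip": "60601", "city": "Chicago", "name": "Ann"},
]
columns = ["zip", "city", "name"]
fds = naive_minimal_fds(rows, columns)
```

In this sample, `zip` and `city` determine each other, while no combination of the other columns determines `name`. The dependency `(zip, name) → city` also holds but is not reported, because `zip → city` already covers it.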
- class openclean_metanome.algorithm.hyfd.HyFD(max_lhs_size: int = -1, input_row_limit: int = -1, validate_parallel: bool = False, memory_guardian: bool = True, null_equals_null: bool = True, env: Optional[Dict] = None, verbose: Optional[bool] = True)
Bases: openclean.profiling.constraints.fd.FunctionalDependencyFinder
HyFD is a hybrid discovery algorithm for functional dependencies. HyFD combines fast approximation techniques with efficient validation techniques in order to find all minimal functional dependencies in a given dataset:
Thorsten Papenbrock, Felix Naumann. A Hybrid Approach to Functional Dependency Discovery. ACM International Conference on Management of Data (SIGMOD ’16).
- run(df: pandas.core.frame.DataFrame) → List[openclean.profiling.constraints.fd.FunctionalDependency]
Run the HyFD algorithm on the given data frame.
Returns a list of all discovered functional dependencies. If execution of the Metanome algorithm fails, a RuntimeError is raised.
- Parameters
df (pd.DataFrame) – Input data frame.
- Returns
- Return type
list of FunctionalDependency
- openclean_metanome.algorithm.hyfd.hyfd(df: pandas.core.frame.DataFrame, max_lhs_size: int = -1, input_row_limit: int = -1, validate_parallel: bool = False, memory_guardian: bool = True, null_equals_null: bool = True, env: Optional[Dict] = None, verbose: Optional[bool] = True) → List[openclean.profiling.constraints.fd.FunctionalDependency]
Run the HyFD algorithm on a given data frame. HyFD is a hybrid discovery algorithm for functional dependencies.
- Parameters
df (pd.DataFrame) – Input data frame.
max_lhs_size (int, default=-1) – Maximum size of the left-hand side of discovered FDs. Use -1 for no size limit.
input_row_limit (int, default=-1) – Maximum number of rows from the input that are used for functional dependency discovery. Use -1 to use all rows.
validate_parallel (bool, default=False) – If True, the algorithm uses multiple threads (one thread per available CPU core).
memory_guardian (bool, default=True) – Activate the memory guardian to prevent out-of-memory errors.
null_equals_null (bool, default=True) – Result value when comparing two NULL values. If True, two NULL values are treated as equal.
env (dict, default=None) – Optional environment variables that override the system-wide settings.
verbose (bool, default=True) – Output run logs if True.
- Returns
- Return type
list of FunctionalDependency
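The effect of `null_equals_null` on whether a dependency holds can be seen in a small pure-Python check. This is an illustrative sketch, not the Metanome implementation: the helper `fd_holds` is hypothetical, and `None` stands in for NULL.

```python
from itertools import count


def fd_holds(rows, lhs, rhs, null_equals_null=True):
    """Check lhs -> rhs over a list of dicts, with None standing in for NULL."""
    fresh = count()  # source of unique tokens for NULLs that must not match

    def norm(value):
        # With null_equals_null=False, each NULL becomes a unique token,
        # so it never compares equal to another NULL.
        if value is None and not null_equals_null:
            return ("__null__", next(fresh))
        return value

    seen = {}
    for row in rows:
        key = tuple(norm(row[c]) for c in lhs)
        val = norm(row[rhs])
        # A repeated lhs key with a different rhs value violates the FD.
        if seen.setdefault(key, val) != val:
            return False
    return True


# Hypothetical sample data: two rows with NULL in column "a".
rows = [
    {"a": None, "b": 1},
    {"a": None, "b": 2},
]
equal_nulls = fd_holds(rows, ["a"], "b", null_equals_null=True)
distinct_nulls = fd_holds(rows, ["a"], "b", null_equals_null=False)
```

When the two NULLs in `a` count as equal, the rows share a left-hand side but differ in `b`, so `a → b` is violated. When NULLs are distinct, the rows never share a left-hand side and the dependency holds.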
- openclean_metanome.algorithm.hyfd.parse_result(outputfile: str, colmap: Dict) → List[openclean.profiling.constraints.fd.FunctionalDependency]
Parse the result file of the FD discovery run to generate a list of discovered functional dependencies.
- Parameters
outputfile (string) – Path to the output file containing the discovered FDs.
colmap (dict) – Mapping of column names from surrogate names to column names in the input data frame schema.
- Returns
- Return type
list of FunctionalDependency
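The role of `colmap` can be sketched as a simple renaming step. The example below does not read a Metanome output file and does not reproduce the wrapper's parsing logic. It assumes the FDs have already been read as `(lhs, rhs)` pairs of surrogate names; the helper `rename_columns` and the surrogate names `col0` and `col1` are hypothetical.

```python
def rename_columns(fds, colmap):
    """Translate (lhs, rhs) pairs from surrogate names to original names."""
    return [([colmap[c] for c in lhs], colmap[rhs]) for lhs, rhs in fds]


# Hypothetical mapping from surrogate names to data frame column names.
colmap = {"col0": "zip", "col1": "city"}
# Hypothetical FD expressed in surrogate names: col0 -> col1.
surrogate_fds = [(["col0"], "col1")]
renamed = rename_columns(surrogate_fds, colmap)
```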