openclean_metanome.algorithm.hyucc module
Wrapper to run the HyUCC algorithm (A Hybrid Approach for Efficient Unique Column Combination Discovery) from the Metanome data profiling library. HyUCC is a unique column combination discovery algorithm.
- class openclean_metanome.algorithm.hyucc.HyUCC(max_ucc_size: int = -1, input_row_limit: int = -1, validate_parallel: bool = False, memory_guardian: bool = True, null_equals_null: bool = True, env: Optional[Dict] = None, verbose: Optional[bool] = True)
Bases: openclean.profiling.constraints.ucc.UniqueColumnCombinationFinder
HyUCC is a hybrid discovery algorithm for unique column combinations. It uses the same discovery techniques as the hybrid functional dependency discovery algorithm HyFD. HyUCC discovers all minimal unique column combinations in a given dataset:
Thorsten Papenbrock and Felix Naumann, A Hybrid Approach for Efficient Unique Column Combination Discovery, Datenbanksysteme fuer Business, Technologie und Web (BTW 2017).
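To make the term "minimal unique column combination" concrete, the following is a minimal brute-force sketch in plain Python. It is not the HyUCC algorithm (which uses a hybrid strategy to avoid exhaustive enumeration) and does not call Metanome; the function name `minimal_uccs` and the sample data are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def minimal_uccs(rows, columns):
    """Return all minimal unique column combinations by exhaustive search.

    Illustrative helper only; it is not part of openclean_metanome.
    """
    found = []
    # Enumerate candidates by increasing size so smaller UCCs are found first.
    for size in range(1, len(columns) + 1):
        for combo in combinations(range(len(columns)), size):
            # A superset of a known UCC is unique but not minimal, so skip it.
            if any(set(ucc) <= set(combo) for ucc in found):
                continue
            projected = [tuple(row[i] for i in combo) for row in rows]
            # The combination is unique if no projected tuple repeats.
            if len(set(projected)) == len(projected):
                found.append(combo)
    return [[columns[i] for i in ucc] for ucc in found]


columns = ["first", "last", "city"]
rows = [
    ("Ann", "Lee", "Oslo"),
    ("Ann", "Kim", "Oslo"),
    ("Bob", "Lee", "Rome"),
    ("Bob", "Kim", "Oslo"),
]
uccs = minimal_uccs(rows, columns)
print(uccs)
```

In this sample no single column is unique, and the only minimal combination is `first` together with `last`. The exhaustive search grows exponentially with the number of columns, which is the cost that hybrid approaches such as HyUCC are designed to reduce.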
- run(df: pandas.core.frame.DataFrame) → List[Union[int, str, List[Union[str, int]]]]
Run the HyUCC algorithm on the given data frame. Returns a list of all discovered unique column sets.
If execution of the Metanome algorithm fails, a RuntimeError is raised.
- Parameters
df (pd.DataFrame) – Input data frame.
- Returns
- Return type
list of columns
- openclean_metanome.algorithm.hyucc.hyucc(df: pandas.core.frame.DataFrame, max_ucc_size: int = -1, input_row_limit: int = -1, validate_parallel: bool = False, memory_guardian: bool = True, null_equals_null: bool = True, env: Optional[Dict] = None, verbose: Optional[bool] = True) → List[Union[int, str, List[Union[str, int]]]]
Run the HyUCC algorithm on a given data frame. HyUCC is a hybrid discovery algorithm for unique column combinations. The algorithm returns a list of discovered column combinations.
- Parameters
df (pd.DataFrame) – Input data frame.
max_ucc_size (int, default=-1) – Defines the maximum size of discovered column sets. Use -1 to return all discovered unique column combinations.
input_row_limit (int, default=-1) – Limit the number of rows from the input file that are used for column combination discovery. Use -1 for all rows.
validate_parallel (bool, default=False) – If True, the algorithm uses multiple threads (one thread per available CPU core).
memory_guardian (bool, default=True) – Activate the memory guardian to prevent out-of-memory errors.
null_equals_null (bool, default=True) – Result value when comparing two NULL values. If True, two NULL values are considered equal.
env (dict, default=None) – Optional environment variables that override the system-wide settings.
verbose (bool, default=True) – Output run logs if True.
- Returns
- Return type
list of columns
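The effect of the null_equals_null parameter on uniqueness can be sketched in plain Python. This is an interpretation of the parameter description above, not code taken from Metanome or openclean_metanome; the helper `is_unique` is hypothetical and models NULL as Python's `None`.

```python
def is_unique(values, null_equals_null=True):
    """Check whether a single column contains no duplicate values.

    Illustrative helper only; NULL is represented by None.
    """
    seen = set()
    for value in values:
        if value is None and not null_equals_null:
            # NULL never equals NULL, so a NULL cannot cause a duplicate.
            continue
        if value in seen:
            return False
        seen.add(value)
    return True


column = ["a", None, "b", None]
# NULL == NULL: the two None entries count as duplicates.
nulls_equal = is_unique(column, null_equals_null=True)
# NULL != NULL: the None entries are treated as distinct values.
nulls_distinct = is_unique(column, null_equals_null=False)
print(nulls_equal, nulls_distinct)
```

Under this reading, the default setting (True) is the stricter one: a column with more than one missing value cannot be part of a single-column unique combination.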
- openclean_metanome.algorithm.hyucc.parse_result(outputfile: str, colmap: Dict) → List[Union[int, str, List[Union[str, int]]]]
Parse the result file of the UCC discovery run to generate a list of discovered unique column sets.
- Parameters
outputfile (string) – Path to the output file containing the discovered UCCs.
colmap (dict) – Mapping from surrogate column names to the column names in the input data frame schema.
- Returns
- Return type
list of columns
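The role of colmap can be illustrated with a short sketch of the renaming step alone. It assumes the discovered UCCs have already been read into lists of surrogate names; the layout of the Metanome output file is not described here and is not modeled. The function `rename_uccs` and the surrogate names `C0`, `C1`, `C2` are hypothetical.

```python
def rename_uccs(uccs, colmap):
    """Translate surrogate column names back to the original column names.

    Illustrative helper only; it is not the library's parse_result.
    """
    return [[colmap[name] for name in ucc] for ucc in uccs]


# Hypothetical surrogate names; original column names may be str or int.
colmap = {"C0": "name", "C1": 2020, "C2": "city"}
surrogate_uccs = [["C0"], ["C1", "C2"]]
renamed = rename_uccs(surrogate_uccs, colmap)
print(renamed)
```

A mapping of this kind lets the input data frame keep arbitrary column names (including integers) while the external algorithm works on simple, unambiguous surrogate identifiers.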