openclean_metanome.algorithm.hyucc module

Wrapper to run the HyUCC algorithm (A Hybrid Approach for Efficient Unique Column Combination Discovery) from the Metanome data profiling library. HyUCC is a unique column combination discovery algorithm.

class openclean_metanome.algorithm.hyucc.HyUCC(max_ucc_size: int = -1, input_row_limit: int = -1, validate_parallel: bool = False, memory_guardian: bool = True, null_equals_null: bool = True, env: Optional[Dict] = None, verbose: Optional[bool] = True)

Bases: openclean.profiling.constraints.ucc.UniqueColumnCombinationFinder

HyUCC is a hybrid discovery algorithm for unique column combinations. It uses the same discovery techniques as the hybrid functional dependency discovery algorithm HyFD. HyUCC discovers all minimal unique column combinations in a given dataset:

Thorsten Papenbrock and Felix Naumann, A Hybrid Approach for Efficient Unique Column Combination Discovery, Datenbanksysteme fuer Business, Technologie und Web (BTW 2017).
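
To make the notion of a minimal unique column combination concrete, the following is a minimal brute-force sketch. It is not the HyUCC algorithm and does not use the Metanome library; the function name `minimal_uccs` and the sample data are hypothetical and exist only for illustration. A column combination is unique if no two rows share the same values for it, and minimal if no proper subset is also unique.

```python
from itertools import combinations


def minimal_uccs(columns, rows, max_size=-1):
    """Return all minimal unique column combinations by exhaustive search.

    Illustrative only: this enumerates every candidate and is exponential
    in the number of columns. A negative max_size means "no size limit".
    """
    limit = len(columns) if max_size < 0 else min(max_size, len(columns))
    found = []
    # Candidates are visited in order of increasing size, so any unique
    # combination that contains an earlier result cannot be minimal.
    for size in range(1, limit + 1):
        for combo in combinations(range(len(columns)), size):
            if any(set(ucc) <= set(combo) for ucc in found):
                continue
            projected = [tuple(row[i] for i in combo) for row in rows]
            if len(set(projected)) == len(projected):
                found.append(combo)
    return [[columns[i] for i in combo] for combo in found]


columns = ["id", "first", "last"]
rows = [
    (1, "Ann", "Lee"),
    (2, "Ann", "Kim"),
    (3, "Bob", "Lee"),
]
uccs = minimal_uccs(columns, rows)
print(uccs)
```

In this sample, `id` is unique on its own, while `first` and `last` are only unique in combination. The `max_size` argument plays a role similar to the `max_ucc_size` parameter described in this module.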

run(df: pandas.core.frame.DataFrame) → List[Union[int, str, List[Union[str, int]]]]

Run the HyUCC algorithm on the given data frame. Returns a list of all discovered unique column sets.

If execution of the Metanome algorithm fails, a RuntimeError is raised.

Parameters

df (pd.DataFrame) – Input data frame.

Returns

Return type

list of columns
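
A usage sketch based on the constructor and `run` signatures documented above. The helper name `find_uccs` is hypothetical, and the sketch assumes that the `openclean_metanome` package and a working Metanome runtime are available in the environment; neither is verified here.

```python
def find_uccs(df, max_ucc_size=-1):
    """Run HyUCC on a pandas data frame and return the discovered UCCs.

    Hypothetical helper; assumes openclean_metanome is installed and a
    Metanome runtime is configured.
    """
    # Imported lazily so this sketch can be loaded without the package.
    from openclean_metanome.algorithm.hyucc import HyUCC

    algorithm = HyUCC(max_ucc_size=max_ucc_size, verbose=False)
    # run() raises a RuntimeError if the Metanome execution fails.
    return algorithm.run(df)
```

A caller would pass a `pandas.DataFrame`, for example `find_uccs(df, max_ucc_size=2)`, and handle `RuntimeError` where a failed run should not abort the program.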

openclean_metanome.algorithm.hyucc.hyucc(df: pandas.core.frame.DataFrame, max_ucc_size: int = -1, input_row_limit: int = -1, validate_parallel: bool = False, memory_guardian: bool = True, null_equals_null: bool = True, env: Optional[Dict] = None, verbose: Optional[bool] = True) → List[Union[int, str, List[Union[str, int]]]]

Run the HyUCC algorithm on a given data frame. HyUCC is a hybrid discovery algorithm for unique column combinations. The algorithm returns a list of discovered column combinations.

Parameters
  • df (pd.DataFrame) – Input data frame.

  • max_ucc_size (int, default=-1) – Defines the maximum size of discovered column sets. Use -1 to return all discovered unique column combinations.

  • input_row_limit (int, default=-1) – Limit the number of rows from the input file that are used for column combination discovery. Use -1 for all rows.

  • validate_parallel (bool, default=False) – If True, the algorithm will use multiple threads (one thread per available CPU core).

  • memory_guardian (bool, default=True) – Activate the memory guardian to prevent out-of-memory errors.

  • null_equals_null (bool, default=True) – Result value when comparing two NULL values.

  • env (dict, default=None) – Optional environment variables that override the system-wide settings.

  • verbose (bool, default=True) – Output run logs if True.

Returns

Return type

list of columns
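
The effect of `null_equals_null` on uniqueness can be illustrated with a small stand-alone sketch. It rests on an assumed reading of the flag: when two NULL values compare as equal, a column containing two NULLs has a duplicate; when they do not, NULLs never collide. The function `is_unique` is hypothetical and not part of the library.

```python
def is_unique(values, null_equals_null=True):
    """Check whether a column is unique under the given NULL semantics.

    Assumption for illustration: None stands in for NULL.
    """
    seen = set()
    for value in values:
        if value is None and not null_equals_null:
            # NULL is never equal to another NULL, so it cannot
            # introduce a duplicate.
            continue
        if value in seen:
            return False
        seen.add(value)
    return True


column = ["a", None, "b", None]
print(is_unique(column, null_equals_null=True))
print(is_unique(column, null_equals_null=False))
```

Under this reading, the default `null_equals_null=True` is the stricter setting: columns with several missing values are not reported as unique.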

openclean_metanome.algorithm.hyucc.parse_result(outputfile: str, colmap: Dict) → List[Union[int, str, List[Union[str, int]]]]

Parse the result file of the UCC discovery run to generate a list of discovered unique column sets.

Parameters
  • outputfile (string) – Path to the output file containing the discovered UCCs.

  • colmap (dict) – Mapping of column names from surrogate names to column names in the input data frame schema.

Returns

Return type

list of columns
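
The role of `colmap` can be sketched as follows. The sketch only covers the renaming step; the layout of the Metanome output file is not reproduced here, and the surrogate names (`col0`, `col1`, ...) and the helper `rename_columns` are hypothetical.

```python
def rename_columns(combinations, colmap):
    """Translate surrogate column names back to the data frame's names.

    combinations: list of column combinations using surrogate names.
    colmap: dict from surrogate name to original column name.
    """
    return [[colmap[name] for name in combo] for combo in combinations]


# Hypothetical surrogate names; the names used by the library may differ.
colmap = {"col0": "id", "col1": "first", "col2": "last"}
surrogate_uccs = [["col0"], ["col1", "col2"]]
renamed = rename_columns(surrogate_uccs, colmap)
print(renamed)
```

Because the values of `colmap` come from the input data frame schema, they may be integers as well as strings, which is consistent with the `Union[int, str]` element type in the documented return annotation.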