Utilities API Reference
This document provides detailed API documentation for utility classes and functions in Torch-RecHub.
Data Processing Tools (data.py)
Dataset Classes
TorchDataset
- Introduction: Basic PyTorch dataset implementation for handling feature and label data.
- Parameters:
- x(dict): Feature dictionary with feature names as keys and feature data as values
- y(array): Label data
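The behavior described above can be sketched in plain Python. This is a hypothetical minimal stand-in for illustration, not the library's actual implementation (which subclasses `torch.utils.data.Dataset`):

```python
class TorchDatasetSketch:
    """Minimal sketch of a dict-of-features dataset (illustrative only)."""

    def __init__(self, x, y):
        self.x = x  # dict: feature name -> sequence of values
        self.y = y  # sequence of labels

    def __len__(self):
        return len(self.y)

    def __getitem__(self, index):
        # One sample: a dict of per-feature values plus the matching label
        return {name: values[index] for name, values in self.x.items()}, self.y[index]


x = {"user_id": [1, 2, 3], "item_id": [10, 20, 30]}
y = [1, 0, 1]
ds = TorchDatasetSketch(x, y)
features, label = ds[1]
```

Indexing returns one row across all feature columns, which is exactly what a DataLoader needs to batch dict-structured features.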
PredictDataset
- Introduction: Dataset class for prediction stage containing only feature data.
- Parameters:
- x(dict): Feature dictionary with feature names as keys and feature data as values
MatchDataGenerator
- Introduction: Data generator for recall tasks to generate training and test data loaders.
- Main Methods:
generate_dataloader(x_test_user, x_all_item, batch_size, num_workers=8): Generate train, test, and item data loaders
- Parameters:
  - x_test_user(dict): Test user features
  - x_all_item(dict): All item features
  - batch_size(int): Batch size
  - num_workers(int): Number of worker processes for data loading
DataGenerator
- Introduction: General-purpose data generator supporting dataset splitting and loading.
- Main Methods:
generate_dataloader(x_val=None, y_val=None, x_test=None, y_test=None, split_ratio=None, batch_size=16, num_workers=0): Generate train, validation, and test data loaders
- Parameters:
  - x_val, y_val: Validation set features and labels
  - x_test, y_test: Test set features and labels
  - split_ratio(list): Split ratio for train, validation, and test sets
  - batch_size(int): Batch size
  - num_workers(int): Number of worker processes for data loading
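The split_ratio parameter divides one dataset into train/validation/test slices, e.g. [0.7, 0.1, 0.2]. The core splitting logic might look like the following sketch (assumed semantics, not the library's code):

```python
def split_by_ratio(x, y, split_ratio):
    """Split features/labels into train/val/test slices by ratio (sketch)."""
    n = len(y)
    train_end = int(n * split_ratio[0])
    val_end = train_end + int(n * split_ratio[1])
    train = (x[:train_end], y[:train_end])
    val = (x[train_end:val_end], y[train_end:val_end])
    test = (x[val_end:], y[val_end:])  # remainder goes to the test split
    return train, val, test


x = list(range(10))
y = [i % 2 for i in range(10)]
train, val, test = split_by_ratio(x, y, [0.7, 0.1, 0.2])
```

Giving the remainder to the last slice guarantees every sample lands in exactly one split even when the ratios do not divide n evenly.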
Utility Functions
get_auto_embedding_dim
- Introduction: Automatically calculate embedding dimension based on number of categories.
- Parameters:
- num_classes(int): Number of categories
- Returns:
- int: Embedding dimension, calculated as `floor(6 * num_classes^(1/4))`
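The formula above can be reproduced directly; this sketch assumes the floor-rounding convention shown:

```python
import math


def get_auto_embedding_dim_sketch(num_classes):
    """emb_dim = floor(6 * num_classes^(1/4))  (sketch of the stated formula)."""
    return int(math.floor(6 * num_classes ** 0.25))
```

For example, 1000 categories gives floor(6 * 5.62) = 33 embedding dimensions, so the dimension grows slowly with vocabulary size.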
get_loss_func
- Introduction: Get loss function.
- Parameters:
- task_type(str): Task type, "classification" or "regression"
- Returns:
- torch.nn.Module: Corresponding loss function
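Conventionally, classification maps to a binary cross-entropy loss and regression to mean squared error. The dispatch can be sketched dependency-free, with pure-Python stand-ins for the torch modules (the stand-ins are assumptions for illustration):

```python
import math


def bce_loss(y_true, y_pred):
    # Stand-in for torch.nn.BCELoss (mean over samples)
    eps = 1e-12
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for t, p in zip(y_true, y_pred)) / len(y_true)


def mse_loss(y_true, y_pred):
    # Stand-in for torch.nn.MSELoss (mean over samples)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def get_loss_func_sketch(task_type="classification"):
    if task_type == "classification":
        return bce_loss
    if task_type == "regression":
        return mse_loss
    raise ValueError(f"unknown task_type: {task_type}")
```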
get_metric_func
- Introduction: Get evaluation metric function.
- Parameters:
- task_type(str): Task type, "classification" or "regression"
- Returns:
- function: Corresponding evaluation metric function
generate_seq_feature
- Introduction: Generate sequence features and negative samples.
- Parameters:
- data(pd.DataFrame): Raw data
- user_col(str): User ID column name
- item_col(str): Item ID column name
- time_col(str): Timestamp column name
- item_attribute_cols(list): Item attribute columns for sequence feature generation
- min_item(int): Minimum number of items per user
- shuffle(bool): Whether to shuffle data
- max_len(int): Maximum sequence length
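The core idea, one training row per position in a user's time-ordered history, can be sketched as follows (simplified: no negative sampling or attribute columns; the function and variable names are illustrative):

```python
def build_seq_samples(interactions, min_item=2, max_len=3):
    """interactions: list of (user, item, timestamp) tuples.
    Returns (user, history_sequence, target_item) rows (sketch)."""
    by_user = {}
    for user, item, ts in sorted(interactions, key=lambda r: r[2]):
        by_user.setdefault(user, []).append(item)

    samples = []
    for user, items in by_user.items():
        if len(items) < min_item:
            continue  # user has too few interactions to form a sequence
        for i in range(1, len(items)):
            hist = items[max(0, i - max_len):i]  # keep at most max_len items
            samples.append((user, hist, items[i]))
    return samples


rows = [("u1", 10, 1), ("u1", 20, 2), ("u1", 30, 3), ("u2", 40, 1)]
samples = build_seq_samples(rows)
```

Each user with at least min_item interactions expands into len(history) - 1 rows, each predicting the next item from the truncated prefix.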
Recall Tools (match.py)
Data Processing Functions
gen_model_input
- Introduction: Merge user and item features, handle sequence features.
- Parameters:
- df(pd.DataFrame): Data with historical sequence features
- user_profile(pd.DataFrame): User feature data
- user_col(str): User column name
- item_profile(pd.DataFrame): Item feature data
- item_col(str): Item column name
- seq_max_len(int): Maximum sequence length
- padding(str): Padding method, 'pre' or 'post'
- truncating(str): Truncation method, 'pre' or 'post'
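The seq_max_len/padding/truncating handling follows the usual sequence-padding convention (Keras-style `pad_sequences` semantics are assumed here); for a single sequence it can be sketched as:

```python
def pad_sequence(seq, max_len, padding="post", truncating="post", value=0):
    """Pad or truncate one sequence to exactly max_len items (sketch)."""
    if len(seq) > max_len:
        # truncating='pre' drops the oldest items; 'post' drops the newest
        seq = seq[-max_len:] if truncating == "pre" else seq[:max_len]
    pad = [value] * (max_len - len(seq))
    # padding='pre' puts zeros before the sequence; 'post' puts them after
    return pad + seq if padding == "pre" else seq + pad
```

For behavior sequences, truncating='pre' is the common choice because it keeps the most recent interactions.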
negative_sample
- Introduction: Negative sampling method for recall models.
- Parameters:
- items_cnt_order(dict): Item count dictionary sorted by count in descending order
- ratio(int): Negative sample ratio
- method_id(int): Sampling method ID
  - 0: Random sampling
  - 1: Word2Vec-style popularity sampling
  - 2: Log popularity sampling
  - 3: Tencent RALM sampling
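The popularity-based variants differ only in how sampling weights are derived from item counts. The weighting formulas below are common conventions assumed for illustration, not quoted from the library (method 3 is omitted):

```python
import math


def sampling_weights(items_cnt_order, method_id=0):
    """Unnormalized sampling weight per item for each method (sketch)."""
    counts = items_cnt_order  # dict: item -> count
    if method_id == 0:        # uniform random sampling
        return {item: 1.0 for item in counts}
    if method_id == 1:        # word2vec-style: count^0.75
        return {item: c ** 0.75 for item, c in counts.items()}
    if method_id == 2:        # log popularity: log(count + 1)
        return {item: math.log(c + 1) for item, c in counts.items()}
    raise ValueError("method 3 (Tencent RALM) omitted in this sketch")


w = sampling_weights({"a": 16, "b": 1}, method_id=1)
```

Raising counts to the 0.75 power (or taking logs) flattens the popularity distribution, so popular items are still sampled more often as negatives but do not dominate.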
Vector Retrieval Classes
Annoy
- Introduction: Vector retrieval tool based on Annoy library.
- Parameters:
- metric(str): Distance metric
- n_trees(int): Number of trees
- search_k(int): Search parameter
- Main Methods:
- fit(X): Build index
- query(v, n): Query nearest neighbors
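The fit/query interface can be illustrated with a brute-force stand-in. Annoy itself builds random-projection trees for approximate search; this sketch only mimics the interface with exact L2 distances:

```python
import math


class BruteForceIndex:
    """Interface-compatible stand-in for the Annoy wrapper (sketch)."""

    def fit(self, X):
        self.X = X  # list of vectors to index

    def query(self, v, n):
        # Return (indices, distances) of the n nearest vectors by L2 distance
        dists = [math.dist(v, x) for x in self.X]
        order = sorted(range(len(dists)), key=dists.__getitem__)[:n]
        return order, [dists[i] for i in order]


index = BruteForceIndex()
index.fit([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
ids, dists = index.query([0.9, 0.0], n=2)
```

In a recall pipeline, fit is called on all item embeddings and query on each user embedding to retrieve candidate items.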
Milvus
- Introduction: Vector retrieval tool based on Milvus.
- Parameters:
- dim(int): Vector dimension
- host(str): Milvus server address
- port(str): Milvus server port
- Main Methods:
- fit(X): Build index
- query(v, n): Query nearest neighbors
Multi-Task Learning Tools (mtl.py)
Utility Functions
shared_task_layers
- Introduction: Get shared and task-specific layer parameters from multi-task models.
- Parameters:
- model(torch.nn.Module): Multi-task model; supported types include MMOE, SharedBottom, PLE, and AITM
- Returns:
- list: Shared layer parameters
- list: Task-specific layer parameters
Optimizer Classes
MetaBalance
- Introduction: MetaBalance optimizer for balancing gradients across tasks in multi-task learning.
- Parameters:
- parameters(list): Model parameters
- relax_factor(float): Gradient scaling relaxation factor, default 0.7
- beta(float): Moving average coefficient, default 0.9
- Main Methods:
step(losses): Perform optimization step and update parameters
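The core MetaBalance idea is rescaling each auxiliary task's gradient toward the main task's gradient magnitude, moderated by the relax factor. A numeric sketch of that single step (a simplification of the paper's moving-average scheme; the function name is illustrative):

```python
def metabalance_scale(g_main, g_aux, relax_factor=0.7):
    """Rescale an auxiliary gradient toward the main gradient's norm (sketch)."""
    norm = lambda g: sum(v * v for v in g) ** 0.5
    scale = norm(g_main) / (norm(g_aux) + 1e-12)
    # relax_factor=1 matches norms exactly; 0 leaves the gradient unchanged
    return [v * (scale * relax_factor + (1.0 - relax_factor)) for v in g_aux]
```

The relax factor interpolates between the raw auxiliary gradient and a fully magnitude-matched one, preventing any one task from dominating the shared parameters.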
Gradient Processing Functions
gradnorm
- Introduction: Implement GradNorm algorithm for dynamic task weight adjustment in multi-task learning.
- Parameters:
- loss_list(list): Loss list for each task
- loss_weight(list): Task weight list
- share_layer(torch.nn.Parameter): Shared layer parameters
- initial_task_loss(list): Initial task loss list
- alpha(float): GradNorm algorithm hyperparameter
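The heart of GradNorm is computing a target gradient norm per task from its inverse training rate. This numeric sketch shows that step only; the full algorithm backpropagates a gradient-norm loss into loss_weight (symbols follow the GradNorm paper and are assumptions here):

```python
def gradnorm_targets(grad_norms, loss_list, initial_task_loss, alpha=0.16):
    """Target gradient norm per task: mean_norm * (r_i)^alpha (sketch)."""
    mean_norm = sum(grad_norms) / len(grad_norms)
    # Inverse training rate: tasks that improved less get a larger target
    ratios = [l / l0 for l, l0 in zip(loss_list, initial_task_loss)]
    mean_ratio = sum(ratios) / len(ratios)
    rel_rates = [r / mean_ratio for r in ratios]
    return [mean_norm * r ** alpha for r in rel_rates]
```

alpha controls how aggressively slow-learning tasks are favored: alpha=0 pushes every task toward the mean gradient norm, while larger values amplify the correction.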
