Graph Operators

Graph-Learn_torch (GLT) improves the end-to-end training throughput of GNN models by accelerating graph sampling and feature collection. GLT currently implements two vertex-based samplers: graphlearn_torch.sampler.NeighborSampler and graphlearn_torch.sampler.RandomNegativeSampler. Edge-based and subgraph-based samplers will be added in the next release. This tutorial describes the design of these graph operators in GLT.

NeighborSampler

Similar to PyG, GLT wraps the configuration, initialization and execution of samplers inside data loaders. The following code shows an example of graphlearn_torch.loader.NeighborLoader in single-machine training.

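import graphlearn_torch as glt

# graphlearn_torch NeighborLoader
# (glt_dataset, split_idx and device are assumed to be defined earlier)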
train_loader = glt.loader.NeighborLoader(glt_dataset,
                                         [15, 10, 5],
                                         split_idx['train'],
                                         batch_size=1024,
                                         shuffle=True,
                                         drop_last=True,
                                         device=device,
                                         as_pyg_v1=True)
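
To get a feel for what this configuration implies, the following sketch computes a worst-case bound on the number of nodes in one mini-batch. It assumes that [15, 10, 5] is the per-hop fan-out (as in PyG's NeighborLoader) and that no sampled node is shared between seeds; the helper name max_sampled_nodes is hypothetical and not part of GLT.

```python
from typing import List


def max_sampled_nodes(batch_size: int, fanouts: List[int]) -> int:
    """Upper bound on sampled nodes per batch (illustration, not a GLT API)."""
    per_seed = 1   # the seed node itself
    frontier = 1   # nodes reachable at the current hop from one seed
    for fanout in fanouts:
        frontier *= fanout      # each frontier node adds at most `fanout` neighbors
        per_seed += frontier
    return batch_size * per_seed


# 1024 seeds, fan-out 15 -> 10 -> 5:
# 1024 * (1 + 15 + 15*10 + 15*10*5) = 1024 * 916
upper_bound = max_sampled_nodes(1024, [15, 10, 5])
print(upper_bound)  # 937984
```

In practice the sampled subgraph is usually much smaller than this bound, because low-degree nodes have fewer neighbors than the fan-out and many neighbors are shared between seeds.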

During the initialization of the neighbor loader, an instance of graphlearn_torch.sampler.NeighborSampler is created.

class NeighborSampler(BaseSampler):
  r""" Neighbor Sampler.
  """
  def __init__(self,
               graph: Union[Graph, Dict[str, Graph]],
               num_neighbors: NumNeighbors,
               device: torch.device=torch.device('cuda', 0),
               with_edge: bool=False,
               strategy: str = 'random'):

To be compatible with PyG, NeighborSampler in GLT inherits from the class torch_geometric.sampler.BaseSampler. Both homogeneous and heterogeneous graphs are supported. The sampler instance is created with user-specified parameters: the number of hops and the number of neighbors sampled at each hop (num_neighbors), and the device on which sampling operations are performed (device). GLT supports both CPU and GPU sampling. Setting with_edge to True includes edge ids in the sampled results; edge ids can be used to extract edge features. The default sampling strategy is random sampling.
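
The roles of the hop count and the per-hop neighbor count can be illustrated with a minimal pure-Python sketch of multi-hop neighbor sampling on a homogeneous graph. This is not GLT's implementation (GLT runs sampling in its own CPU and CUDA kernels); the function name sample_neighbors_sketch, the adjacency-dict input and the choice of uniform sampling without replacement are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple


def sample_neighbors_sketch(
    adj: Dict[int, List[int]],
    seeds: List[int],
    num_neighbors: List[int],
    rng: random.Random,
) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Multi-hop uniform neighbor sampling (illustration only, not the GLT kernel)."""
    nodes = list(seeds)            # sampled nodes in global ids, seeds first
    seen = set(seeds)
    edges: List[Tuple[int, int]] = []   # (neighbor, target) pairs in global ids
    frontier = list(seeds)
    for fanout in num_neighbors:   # one list entry per hop
        next_frontier = []
        for v in frontier:
            nbrs = adj.get(v, [])
            # Keep at most `fanout` neighbors, chosen uniformly without replacement.
            picked = nbrs if len(nbrs) <= fanout else rng.sample(nbrs, fanout)
            for u in picked:
                edges.append((u, v))
                if u not in seen:  # expand each node only once
                    seen.add(u)
                    nodes.append(u)
                    next_frontier.append(u)
        frontier = next_frontier
    return nodes, edges


# Toy graph: node -> list of neighbors.
adj = {0: [1, 2, 3], 1: [4, 5], 2: [5], 3: [], 4: [0], 5: [1, 2]}
nodes, edges = sample_neighbors_sketch(
    adj, seeds=[0], num_neighbors=[2, 1], rng=random.Random(0)
)
```

Here num_neighbors=[2, 1] requests two hops, with at most two neighbors per node in the first hop and at most one in the second.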

NeighborSampler in GLT directly uses the input and output formats of the PyG samplers. The input format of graphlearn_torch.sampler.NeighborSampler is torch_geometric.sampler.NodeSamplerInput.

@dataclass
class NodeSamplerInput(CastMixin):
  r""" The sampling input of
  :meth:`~graphlearn_torch.sampler.BaseSampler.sample_from_nodes`.

  This class corresponds to :class:`~torch_geometric.sampler.NodeSamplerInput`:
  https://github.com/pyg-team/pytorch_geometric/blob/master/torch_geometric/sampler/base.py

  Args:
    node (torch.Tensor): The indices of seed nodes to start sampling from.
    input_type (str, optional): The input node type (in case of sampling in
      a heterogeneous graph). (default: :obj:`None`).
  """
  node: torch.Tensor
  input_type: Optional[NodeType] = None

The output format of sampling results on homogeneous graphs is torch_geometric.sampler.SamplerOutput.

@dataclass
class SamplerOutput(CastMixin):
  r""" The sampling output of a :class:`~graphlearn_torch.sampler.BaseSampler` on
  homogeneous graphs.

  Args:
    node (torch.Tensor): The sampled nodes in the original graph.
    row (torch.Tensor): The source node indices of the sampled subgraph.
      Indices must be re-indexed to :obj:`{ 0, ..., num_nodes - 1 }`
      corresponding to the nodes in the :obj:`node` tensor.
    col (torch.Tensor): The destination node indices of the sampled subgraph.
      Indices must be re-indexed to :obj:`{ 0, ..., num_nodes - 1 }`
      corresponding to the nodes in the :obj:`node` tensor.
    edge (torch.Tensor, optional): The sampled edges in the original graph.
      This tensor is used to obtain edge features from the original
      graph. If no edge attributes are present, it may be omitted.
    batch (torch.Tensor, optional): The vector to identify the seed node
      for each sampled node. Can be present in case of disjoint subgraph
      sampling per seed node. (default: :obj:`None`).
    device (torch.device, optional): The device that all data of this output
      resides in. (default: :obj:`None`).
    metadata (Any, optional): Additional metadata information.
      (default: :obj:`None`).
  """
  node: torch.Tensor
  row: torch.Tensor
  col: torch.Tensor
  edge: Optional[torch.Tensor]  = None
  batch: Optional[torch.Tensor] = None
  device: Optional[torch.device] = None
  metadata: Optional[Any] = None
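
The re-indexing requirement on row and col can be made concrete with a small sketch. It assumes a list of sampled edges given as (source, destination) pairs in global ids and assigns local indices in order of first appearance, with seed nodes first; the helper relabel_sketch and that ordering are illustrative assumptions, not GLT's actual relabeling routine.

```python
from typing import Dict, List, Tuple


def relabel_sketch(
    seeds: List[int], edges: List[Tuple[int, int]]
) -> Tuple[List[int], List[int], List[int]]:
    """Build (node, row, col) where row/col index into node (illustration only)."""
    local: Dict[int, int] = {}   # global id -> local index
    node: List[int] = []         # local index -> global id

    def local_id(global_id: int) -> int:
        if global_id not in local:
            local[global_id] = len(node)
            node.append(global_id)
        return local[global_id]

    for s in seeds:              # seeds receive the smallest local indices
        local_id(s)
    row, col = [], []
    for src, dst in edges:
        row.append(local_id(src))
        col.append(local_id(dst))
    return node, row, col


node, row, col = relabel_sketch([10, 42], [(7, 10), (42, 10), (99, 42)])
print(node)  # [10, 42, 7, 99]
print(row)   # [2, 1, 3]
print(col)   # [0, 0, 1]
```

The original endpoints of the i-th sampled edge can be recovered as node[row[i]] and node[col[i]].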

The output format of sampling results on heterogeneous graphs is torch_geometric.sampler.HeteroSamplerOutput.

@dataclass
class HeteroSamplerOutput(CastMixin):
  r""" The sampling output of a :class:`~graphlearn_torch.sampler.BaseSampler` on
  heterogeneous graphs.

  Args:
    node (Dict[str, torch.Tensor]): The sampled nodes in the original graph
      for each node type.
    row (Dict[Tuple[str, str, str], torch.Tensor]): The source node indices
      of the sampled subgraph for each edge type. Indices must be re-indexed
      to :obj:`{ 0, ..., num_nodes - 1 }` corresponding to the nodes in the
      :obj:`node` tensor of the source node type.
    col (Dict[Tuple[str, str, str], torch.Tensor]): The destination node
      indices of the sampled subgraph for each edge type. Indices must be
      re-indexed to :obj:`{ 0, ..., num_nodes - 1 }` corresponding to the nodes
      in the :obj:`node` tensor of the destination node type.
    edge (Dict[Tuple[str, str, str], torch.Tensor], optional): The sampled
      edges in the original graph for each edge type. This tensor is used to
      obtain edge features from the original graph. If no edge attributes are
      present, it may be omitted. (default: :obj:`None`).
    batch (Dict[str, torch.Tensor], optional): The vector to identify the
      seed node for each sampled node for each node type. Can be present
      in case of disjoint subgraph sampling per seed node.
      (default: :obj:`None`).
    edge_types (List[Tuple[str, str, str]], optional): The list of edge types
      of the sampled subgraph. (default: :obj:`None`).
    device (torch.device, optional): The device that all data of this output
      resides in. (default: :obj:`None`).
    metadata (Any, optional): Additional metadata information.
      (default: :obj:`None`).
  """
  node: Dict[NodeType, torch.Tensor]
  row: Dict[EdgeType, torch.Tensor]
  col: Dict[EdgeType, torch.Tensor]
  edge: Optional[Dict[EdgeType, torch.Tensor]] = None
  batch: Optional[Dict[NodeType, torch.Tensor]] = None
  edge_types: Optional[List[EdgeType]] = None
  device: Optional[torch.device] = None
  metadata: Optional[Any] = None
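
In the heterogeneous case, row indices refer to the node list of the source node type and col indices to the node list of the destination node type. The sketch below shows this per-type indexing in plain Python; the helper hetero_relabel_sketch and the node and edge type names ("user", "item", "buys", "similar_to") are hypothetical and only serve as an example.

```python
from typing import Dict, List, Tuple

EdgeType = Tuple[str, str, str]


def hetero_relabel_sketch(edges: Dict[EdgeType, List[Tuple[int, int]]]):
    """Per-type relabeling of sampled edges (illustration only, not GLT code)."""
    node: Dict[str, List[int]] = {}          # node type -> global ids
    local: Dict[str, Dict[int, int]] = {}    # node type -> global id -> local index

    def local_id(ntype: str, global_id: int) -> int:
        ids = local.setdefault(ntype, {})
        nodes_of_type = node.setdefault(ntype, [])
        if global_id not in ids:
            ids[global_id] = len(nodes_of_type)
            nodes_of_type.append(global_id)
        return ids[global_id]

    row: Dict[EdgeType, List[int]] = {}
    col: Dict[EdgeType, List[int]] = {}
    for etype, pairs in edges.items():
        src_type, _, dst_type = etype
        row[etype] = [local_id(src_type, s) for s, _ in pairs]   # index into node[src_type]
        col[etype] = [local_id(dst_type, d) for _, d in pairs]   # index into node[dst_type]
    return node, row, col


hetero_edges = {
    ("user", "buys", "item"): [(5, 100), (6, 100), (5, 101)],
    ("item", "similar_to", "item"): [(101, 102)],
}
h_node, h_row, h_col = hetero_relabel_sketch(hetero_edges)
print(h_node)  # {'user': [5, 6], 'item': [100, 101, 102]}
print(h_row)   # {('user', 'buys', 'item'): [0, 1, 0], ('item', 'similar_to', 'item'): [1]}
print(h_col)   # {('user', 'buys', 'item'): [0, 0, 1], ('item', 'similar_to', 'item'): [2]}
```

Note that item 101 keeps the same local index (1) whether it appears as a destination of "buys" or as a source of "similar_to", because indices are assigned per node type, not per edge type.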

Negative Sampler

GLT also supports both GPU and CPU execution for negative sampling. The code below shows graphlearn_torch.sampler.RandomNegativeSampler in GLT.

class RandomNegativeSampler(object):
  r""" Random negative Sampler.

  Args:
    graph: A ``graphlearn_torch.data.Graph`` object.
    mode: Execution mode of sampling, 'CUDA' means sampling on
      GPU, 'CPU' means sampling on CPU.
  """
  def __init__(self, graph, mode='CUDA'):
    self._mode = mode
    if mode == 'CUDA':
      self._sampler = pywrap.CUDARandomNegativeSampler(graph.graph_handler)
    else:
      self._sampler = pywrap.CPURandomNegativeSampler(graph.graph_handler)

The sample method of RandomNegativeSampler takes three inputs: req_num, the maximum number of negative samples requested; trials_num, the maximum number of trials used to generate enough negative samples; and padding, which specifies whether to fill the output with randomly generated samples (which may be positive or negative) when fewer than req_num negative samples have been found after trials_num trials.

  def sample(self, req_num, trials_num=5, padding=False):
    r""" Negative sampling.

    Args:
      req_num: The requested (maximum) number of negative samples.
      trials_num: The number of trials for negative sampling.
      padding: Whether to pad the negative sampling results to req_num.
        If True, and the number of true negative samples is still less than
        req_num after trials_num trials, randomly sampled edges (non-strict
        negatives) are added as negative samples.

    Returns:
      negative edge_index(non-strict when padding is True).
    """
    rows, cols = self._sampler.sample(req_num, trials_num, padding)
    return torch.stack([rows, cols], dim=0)
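
The interplay of req_num, trials_num and padding can be sketched in pure Python as follows. The internals of pywrap.CPURandomNegativeSampler and pywrap.CUDARandomNegativeSampler are not shown in this tutorial, so the trial loop below is only an assumption about how such a sampler might behave; the function random_negative_sample_sketch and its edge_set and num_nodes arguments are hypothetical.

```python
import random
from typing import List, Optional, Set, Tuple


def random_negative_sample_sketch(
    edge_set: Set[Tuple[int, int]],
    num_nodes: int,
    req_num: int,
    trials_num: int = 5,
    padding: bool = False,
    rng: Optional[random.Random] = None,
) -> List[List[int]]:
    """Random negative edge sampling with retries (illustration only)."""
    rng = rng or random.Random()
    rows: List[int] = []
    cols: List[int] = []
    for _ in range(trials_num):
        missing = req_num - len(rows)
        if missing <= 0:
            break
        # Draw `missing` candidate pairs and keep those that are not existing edges.
        for _ in range(missing):
            r, c = rng.randrange(num_nodes), rng.randrange(num_nodes)
            if (r, c) not in edge_set:
                rows.append(r)
                cols.append(c)
    if padding:
        # Non-strict fill: these pairs are not checked against edge_set.
        while len(rows) < req_num:
            rows.append(rng.randrange(num_nodes))
            cols.append(rng.randrange(num_nodes))
    return [rows, cols]   # same [rows, cols] layout as the stacked edge_index


sparse_edges = {(0, 1), (1, 2)}
neg = random_negative_sample_sketch(sparse_edges, num_nodes=4, req_num=6,
                                    rng=random.Random(0))

# A graph in which every pair is an edge has no true negatives at all.
dense_edges = {(i, j) for i in range(3) for j in range(3)}
strict = random_negative_sample_sketch(dense_edges, 3, req_num=4,
                                       rng=random.Random(0))
padded = random_negative_sample_sketch(dense_edges, 3, req_num=4, padding=True,
                                       rng=random.Random(0))
```

With padding=False the result may contain fewer than req_num pairs (none at all for the dense graph above); with padding=True the result always has req_num pairs, at the cost of possibly including existing edges.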