# Graph

A graph is a structure used to represent entities(nodes or vertices) and
their relations(edges). The real-world graphs are usually associated with
node features or edge features.
GLT stores the topology data and feature data of a graph separately, and the
node/edge ids are represented by consecutive integers starting from 0.
The topology data of graph is represented by an instance of
[`graphlearn_torch.data.graph.Graph`](graphlearn_torch.data.graph.Graph) and
the feature data is described by an instance of
[`graphlearn_torch.data.feature.Feature`](graphlearn_torch.data.feature.Feature).


## 1. Graph
A graph can be described by a 2D edge_index Tensor, a CSR matrix, or a CSC matrix.
GLT implements the class [`graphlearn_torch.data.graph.Topology`](graphlearn_torch.data.graph.Topology)
to represent the input data of a graph, which supports edge_index tensor, CSR martix and CSC matrix format.
Then the graph object is described by an instance of
[`graphlearn_torch.data.graph.Graph`](graphlearn_torch.data.graph.Graph),
which takes `Topology` as input and stores the graph data in CPU memory,
pinned memory or GPU memory according to corresponding mode.
Based on `Graph`, GLT provides graph operations(both cpu and cuda versions are available)
like neighbor sampling, negative sampling and subgraph sampling.


### 1.1 Topology

The graph topology data is formed into
[`graphlearn_torch.data.graph.Topology`](graphlearn_torch.data.graph.Topology)
object from edge index (with the format of 'COO', 'CSC' or 'CSR' derectly),
which will be used to build `Graph`.
`Topology` also supports the input `edge_ids` to represent edge ids and will
assign ordinal indices to edges by input order by default.
`Topology` uses `input_layout` to represent the input edge index format and 
`layout` to represent the target edge index format. `input_layout` supports ‘COO’, 
‘CSC’ and ‘CSR’ as input, and `layout` can select ‘CSC’ or ‘CSR’ format, 
which depends on whether the sampling method is in-bound or out-bound.

``` python
class Topology(object):
  r""" Graph topology with support for CSC and CSR formats.

  Args:
    edge_index (a 2D torch.Tensor or numpy.ndarray, or a tuple): The edge
      index for graph topology, in the order of first row and then column. 
    edge_ids (torch.Tensor or numpy.ndarray, optional): The edge ids for
      graph edges. If set to ``None``, it will be aranged by the edge size.
      (default: ``None``)
    input_layout (str): The edge layout representation for the input edge index,
      should be 'COO' (rows and cols uncompressed), 'CSR' (rows compressed)
      or 'CSC' (columns compressed). (default: 'COO')
    layout ('CSR' or 'CSC'): The target edge layout representation for 
      the output. (default: 'CSR')
  """
  def __init__(self, edge_index, edge_ids, input_layout = 'COO', 
               layout: Literal['CSR', 'CSC'] = 'CSR')
```

### 1.2 Graph

The [`graphlearn_torch.data.graph.Graph`](graphlearn_torch.data.graph.Graph)
takes `Topology` as input and supports three storage modes:
  - `CPU`: graph data are stored in the CPU memory and graph
    operations are also executed on CPU.
  - `ZERO_COPY`: graph data are stored in the pinned CPU memory and graph
    operations are executed on GPU.
  - `CUDA`: graph data are stored in the GPU memory and graph operations
    are executed on GPU.

``` python
class Graph(object):
  r""" A graph object used for graph operations such as sampling.

  Args:
    csr_topo (Topology): An instance of ``Topology`` with graph topology data.
    mode (str): The graph operation mode, must be 'CPU', 'ZERO_COPY' or 'CUDA'.
      (Default: 'ZERO_COPY').
    device (int, optional): The target cuda device rank to perform graph
      operations.
  """
  def __init__(self, csr_topo: Topology, mode = 'ZERO_COPY',
               device: Optional[int] = None):
```


## 2. Feature
Feature data of large scale graphs often exceeds the limit of GPU memory,
so we cannot simply use CUDA Tensor to store feature data.
If the feature data is stored in CPU memory,
Large number of data copy operations between
host(CPU) and device(GPU) will be the bottleneck in the whole process of
GPU training.

GLT implements the
[`graphlearn_torch.data.feature.Feature`](graphlearn_torch.data.feature.Feature)
to manage storage of nodes feature and edges feature(not supported yet) and
handle the feature loookup.
When using GPU training, the feature data will be split into two parts according
to different ratios, which will be stored on GPU
(e.g., feature data that are frequently accessed) and pinned memory
respectively. Then a CUDA kernel function is used to perform feature lookup,
reducing the time of data copy and increasing the overall throughput.

### 2.1 UnifiedTensor

Feature consists of a CPU-GPU unified memory instance called
[`graphlearn_torch.data.unified_tensor.UnifiedTensor`](graphlearn_torch.data.unified_tensor.UnifiedTensor).
`UnifiedTensor` unifies the management of CUDA Tensor and CPU Tensor to provide efficient data access.
As shown in the figure below, if GPUs have direct peer2peer access (with NVLink), the memory of
these GPUs can also be unified and managed by `UnifiedTensor`.

![unified_tensor](../figures/uni_tensor.png)

Therefore, `UnifiedTensor` can access CUDA Tensor directly,
CUDA Tensor on other GPUs via NVLink, and CPU Tensor by ZERO-COPY via UVA.

One way to create a `UnifiedTensor` is to use the `init_from` method, which takes
a list of CPU Tensors as input and will store each CPU Tensor to their
corresponding device according to `tensor_devices`.

``` python
class UnifiedTensor(object):
  r"""

  Args:
    current_device (int): An integer to represent the GPU device where the
      underlying cuda operation kernel is launched.
    dtype (torch.dtype): The data type of the tensor elements.
  """
  def __init__(self, current_device: int, dtype: torch.dtype = torch.float32):
    self.current_device = current_device
    self.dtype = dtype
    self.unified_tensor = pywrap.UnifiedTensor(current_device, dtype)
    self.cpu_part = None # tensor stored in CPU memory.

  def init_from(self, tensors, tensor_devices):
    r""" Initialize from CPU torch.Tensors.

    Args:
      tensors: CPU torch.Tensors indicating the tensors that need to be stored
        on different GPUs and CPU.
      tensor_devices: The indices of devices indicating the location of the
        tensor storage, -1 means on CPU and other > 0 value means on GPUs.
      Note that tensors and tensor_devices must correspond to each other.
    """
    self.unified_tensor.init_from(tensors, tensor_devices)
```

### 2.2 DeviceGroup

GLT uses an instance of [`graphlearn_torch.data.feature.DeviceGroup`](graphlearn_torch.data.feature.DeviceGroup)
to represent a group of GPUs that have p2p access to each other.
For example, suppose there are 8 GPUs, if there is no NVLink between each other, then there will be 8 DeviceGroups,
each of which has one GPU.
And if GPU #[0,1,2,3] have NVLink connections between each other, and GPU #[4,5,6,7] have NVLink connections,
then GPU #[0,1,2,3] compose a DeviceGroup, GPU #[4,5,6,7] compose a DeviceGroup.

### 2.3 Feature

`Feature` splits the input CPU tensor into GPU(hot) part and CPU(cold) part according
to `split_ratio` in the input order(assuming it has been [reordered](#24-feature-reordering)).
The CPU part will be put into shared memory and pinned, so that different GPU can
share and access it through UVA.
For the GPU part, each `DeviceGroup` in the `device_group_list` stores a replica,
and then the data within the `DeviceGroup` is equally divided into each deivce(GPU) for storage.
It is well known that the bandwidth of NVLink is much higher than that of PCIe.
Therefore, compared to the solution that each GPU holds a replica of the feature cache,
introducing `DeviceGroup` in GLT allows us to significantly expand the cache capacity of
feature data in GPU and make full utilization of the NVLink bandwidth.

![replica_per_gpu](../figures/replica_per_gpu.png)
![replica_per_dg](../figures/replica_per_dg.png)

`Feature` uses `UnifiedTensor`s to manage the data by default, and also provides
a wrapper of CPU tensor when there is no gpu is available(set `with_gpu` to
`False` for this case).

```python
class Feature(object):
  r""" A class for feature storage and lookup with hardware topology awareness
  and high performance.

  Args:
    feature_tensor (torch.Tensor or numpy.ndarray): A CPU tensor of the raw
      feature data.
    id2index (torch.Tensor, optional):: A tensor mapping the node id to the
      index in the raw CPU feature tensor.
    split_ratio (float): The proportion of feature data allocated to the GPU,
      between 0 and 1.  (Default: ``0.0``).
    device_group_list (List[DeviceGroup], optional): A list of device groups.
    device (int, optional): The target cuda device rank to perform feature
      lookups with the GPU part.
    with_gpu: A Boolean value indicating whether the ``Feature`` uses
      ``UnifiedTensor``.
  """
  def __init__(self,
               feature_tensor: Union[torch.Tensor, numpy.ndarray],
               id2index: Optional[torch.Tensor] = None,
               split_ratio: float = 0.0,
               device_group_list: Optional[List[DeviceGroup]] = None,
               device: Optional[int] = None,
               with_gpu: Optional[bool] = True):
```

You can create a [`graphlearn_torch.data.feature.Feature`](graphlearn_torch.data.feature.Feature)
instance by a input CPU Tensor `feature_tensor`. Here is a simple example:

``` python
import torch
import graphlearn_torch as glt

feat_tensor = torch.ones(512, 128)
# suppose you have 8 GPUs.
# if there is no NVLink.
# device_groups = [glt.data.DeviceGroup(i, [i]) for i in range(8)]
# if there are NVLinks between GPU0-3 and GPU4-7.
device_groups = [glt.data.DeviceGroup(0, [0,1,2,3]),
                 glt.data.DeviceGroup(1, [4,5,6,7])]
# Split the CPU feature tensor, of which the GPU part accounts for 60%.
# Launch the GPU kernel on device 0 for this ``Feature`` instance.
feature = glt.data.Feature(feat_tensor,
                           split_ratio=0.6,
                           device_group_list=device_groups,
                           device=0,
                           dtype=torch.float32)
input = torch.tensor([1,2,5,8], device='cuda:0')
print(feature[input])
```

### 2.4 Feature Reordering
[`graphlearn_torch.data.reorder.sort_by_in_degree`](graphlearn_torch.data.reorder.sort_by_in_degree)

The `Feature` is sliced in the input order, so it will have better performance
if the hot data is placed in front of input data.
Therefore, before feature splitting, we need to reorder the input feature
according to different strategies, e.g., according to the in-degrees of vertices
or the pre-sampled hotness distribution of vertices.
The following feature reordering methods are currently supported:
- [`graphlearn_torch.data.reorder.sort_by_in_degree`](graphlearn_torch.data.reorder.sort_by_in_degree)
: Reorder the features according to the in-degree of the nodes in the graph.