# Node Classification Node classification is the basic task of graph learning and GNNs are powerful models of graph learning. We introduce the basic workflow of GNN training through the node classification example on [OGBN-Products dataset](https://ogb.stanford.edu/docs/nodeprop/#ogbn-products). The code is based on PyG's signle GPU training of [GraphSAGE on OGBN-Products](https://github.com/pyg-team/pytorch_geometric/blob/master/examples/ogbn_products_sage.py) and the only difference is to use GLT's [`graphlearn_torch.loader.neighbor_loader.NeighborLoader`](graphlearn_torch.loader.neighbor_loader.NeighborLoader) instead of PyG's `torch_geometric.loader.NeighborSampler` to accelerate training on GPU. For model testing, we keep the original NeighborSampler in PyG to do the usage comparison. ## Loading OGBN-Products dataset. ``` python import time import torch import graphlearn_torch as glt import os.path as osp import torch.nn.functional as F from ogb.nodeproppred import PygNodePropPredDataset, Evaluator from torch_geometric.loader import NeighborSampler from torch_geometric.nn import SAGEConv from tqdm import tqdm root = osp.join(osp.dirname(osp.realpath(__file__)), '..', 'data', 'products') dataset = PygNodePropPredDataset('ogbn-products', root) split_idx = dataset.get_idx_split() evaluator = Evaluator(name='ogbn-products') data = dataset[0] # PyG NeighborSampler test_loader = NeighborSampler(data.edge_index, node_idx=None, sizes=[-1], batch_size=4096, shuffle=False, num_workers=12) ``` > **Note** > In PyG 1.x, NeighborSampler is actually neighbor loader. ## Creating data loader. This part is the only difference from PyG's example. We first create an instance of [`graphlearn_torch.data.dataset.Dataset`](graphlearn_torch.data.dataset.Dataset) and initialize it with edge_index and node features. The graph data is stored in pinned memory, since the `graph_mode` is set to `ZERO_COPY`. The `graph_mode` can be `GPU`, `ZERO_COPY` or `CPU` indicating the data is stored in GPU memory, pinned memory and CPU memory, respectively. `GPU` and `ZERO_COPY` are recommended, and you should choose `ZERO_COPY` when graph data is larger than GPU memory capacity. The node features are sorted by in-degrees of nodes and then split into two parts according to `split_ratio`. The node features with higher in-degrees are stored in GPU memory, and the remaining part is stored in pinned memory for UVA. Then, we use a neigbor loader [`graphlearn_torch.loader.neighbor_loader.NeighborLoader`](graphlearn_torch.loader.neighbor_loader.NeighborLoader) which is totally compatible with PyG's `NeighborSampler`. This loader uses train index as input seeds and samples 3-hop of neighbors for each seed. In the overall process of large-scale graph GNN GPU training, sampling and feature lookup often become bottlenecks due to the low bandwidth of the PCIe and the limited concurrency of the CPU. In GLT, sampling and feature lookup are executed on the GPU, which provides a significant performance boost compared to the CPU. ``` python glt_dataset = glt.data.Dataset() glt_dataset.init_graph( edge_index=dataset[0].edge_index, graph_mode='ZERO_COPY', directed=False ) glt_dataset.init_node_features( node_feature_data=data.x, sort_func=glt.data.sort_by_in_degree, split_ratio=0.2, device_group_list=[glt.data.DeviceGroup(0, [0])], ) glt_dataset.init_node_labels(node_label_data=data.y) # graphlearn_torch NeighborLoader train_loader = glt.loader.NeighborLoader(glt_dataset, [15, 10, 5], split_idx['train'], batch_size=1024, shuffle=True, drop_last=True, device=torch.device(0), as_pyg_v1=True) ``` > **Note** > In PyG 2.x, the neighbor sampler output has been changed from that in PyG 1.x, > so we add the argument `as_pyg_v1` to support sampling in PyG 1.x. ## Defining model. Here we directly show the PyG's GraphSAGE model defination. ```python class SAGE(torch.nn.Module): def __init__(self, in_channels, hidden_channels, out_channels, num_layers): super(SAGE, self).__init__() self.num_layers = num_layers self.convs = torch.nn.ModuleList() self.convs.append(SAGEConv(in_channels, hidden_channels)) for _ in range(num_layers - 2): self.convs.append(SAGEConv(hidden_channels, hidden_channels)) self.convs.append(SAGEConv(hidden_channels, out_channels)) def reset_parameters(self): for conv in self.convs: conv.reset_parameters() def forward(self, x, adjs): # `train_loader` computes the k-hop neighborhood of a batch of nodes, # and returns, for each layer, a bipartite graph object, holding the # bipartite edges `edge_index`, the index `e_id` of the original edges, # and the size/shape `size` of the bipartite graph. # Target nodes are also included in the source nodes so that one can # easily apply skip-connections or add self-loops. for i, (edge_index, _, size) in enumerate(adjs): x_target = x[:size[1]] # Target nodes are always placed first. x = self.convs[i]((x, x_target), edge_index) if i != self.num_layers - 1: x = F.relu(x) x = F.dropout(x, p=0.5, training=self.training) return x.log_softmax(dim=-1) def inference(self, x_all): pbar = tqdm(total=x_all.size(0) * self.num_layers) pbar.set_description('Evaluating') # Compute representations of nodes layer by layer, using *all* # available edges. This leads to faster computation in contrast to # immediately computing the final representations of each batch. total_edges = 0 for i in range(self.num_layers): xs = [] for batch_size, n_id, adj in test_loader: edge_index, _, size = adj.to(device) total_edges += edge_index.size(1) x = x_all[n_id].to(device) x_target = x[:size[1]] x = self.convs[i]((x, x_target), edge_index) if i != self.num_layers - 1: x = F.relu(x) xs.append(x.cpu()) pbar.update(batch_size) x_all = torch.cat(xs, dim=0) pbar.close() return x_all model = SAGE(dataset.num_features, 256, dataset.num_classes, num_layers=3) model = model.to(device) ``` ## Training and testing. Finally, you can use the GLT's trainer_loader defined above to speed up your program. ``` python def train(epoch): model.train() pbar = tqdm(total=split_idx['train'].size(0)) pbar.set_description(f'Epoch {epoch:02d}') total_loss = total_correct = 0 step = 0 glt_dataset.node_labels = glt_dataset.node_labels.to(device) for batch_size, n_id, adjs in train_loader: # `adjs` holds a list of `(edge_index, e_id, size)` tuples. adjs = [adj.to(device) for adj in adjs] optimizer.zero_grad() out = model(glt_dataset.node_features[n_id], adjs) loss = F.nll_loss(out, glt_dataset.node_labels[n_id[:batch_size]]) loss.backward() optimizer.step() total_loss += float(loss) total_correct += int(out.argmax(dim=-1).eq(glt_dataset.node_labels[n_id[:batch_size]]).sum()) step += 1 pbar.update(batch_size) pbar.close() loss = total_loss / step approx_acc = total_correct / split_idx['train'].size(0) return loss, approx_acc @torch.no_grad() def test(): model.eval() out = model.inference(glt_dataset.node_features) y_true = glt_dataset.node_labels.cpu().unsqueeze(-1) y_pred = out.argmax(dim=-1, keepdim=True) train_acc = evaluator.eval({ 'y_true': y_true[split_idx['train']], 'y_pred': y_pred[split_idx['train']], })['acc'] val_acc = evaluator.eval({ 'y_true': y_true[split_idx['valid']], 'y_pred': y_pred[split_idx['valid']], })['acc'] test_acc = evaluator.eval({ 'y_true': y_true[split_idx['test']], 'y_pred': y_pred[split_idx['test']], })['acc'] return train_acc, val_acc, test_acc test_accs = [] for run in range(1, 2): print('') print(f'Run {run:02d}:') print('') model.reset_parameters() optimizer = torch.optim.Adam(model.parameters(), lr=0.003) best_val_acc = final_test_acc = 0 for epoch in range(1, 21): epoch_start = time.time() loss, acc = train(epoch) print(f'Epoch {epoch:02d}, Loss: {loss:.4f}, Approx. Train: {acc:.4f}', f'Epoch Time: {time.time() - epoch_start}') if epoch > 5: train_acc, val_acc, test_acc = test() print(f'Train: {train_acc:.4f}, Val: {val_acc:.4f}, Test: {test_acc:.4f}') if val_acc > best_val_acc: best_val_acc = val_acc final_test_acc = test_acc test_accs.append(final_test_acc) test_acc = torch.tensor(test_accs) print('============================') print(f'Final Test: {test_acc.mean():.4f} ± {test_acc.std():.4f}') ``` This example can have about 1x performance improvement compared to the original code, while increasing GPU utilization and decreasing CPU usage.