Github

class dhg.data.Github(data_root=None)

Bases: dhg.data.base.BaseData

The Github dataset is a collaboration network dataset for the vertex classification task. Nodes correspond to developers who have starred at least 10 repositories, and edges to mutual follower relationships. Node features are location, starred repositories, employer, and e-mail address. The labels are binary, denoting web developers and machine learning developers. For more details, see the Multi-Scale Attributed Node Embedding paper.

Note

L1-normalization of the features is not recommended for this dataset.

The content of the Github dataset includes the following:

  • num_classes: The number of classes: \(4\).

  • num_vertices: The number of vertices: \(37,700\).

  • num_edges: The number of edges: \(144,501\).

  • dim_features: The dimension of features: \(4,005\).

  • features: The vertex feature matrix. torch.Tensor with size \((37,700 \times 4,005)\).

  • edge_list: The edge list. List with length \(144,501\), where each entry is a pair of vertex indices.

  • labels: The label list. torch.LongTensor with size \((37,700, )\).
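The items listed above can be inspected with a short sketch like the following. It assumes the items are read by key (for example data["features"]), and the helper names load_github and summarize are hypothetical, introduced here only for illustration. A tiny stand-in dictionary with the same keys is used so the sketch runs without downloading anything.

```python
def load_github(data_root=None):
    """Hypothetical helper: load the dataset through DHG.

    Requires the dhg package; the import is done lazily so that the rest
    of this sketch runs without it.
    """
    from dhg.data import Github
    return Github(data_root=data_root)


def summarize(data):
    """Hypothetical helper: report the observed sizes of a dataset-like mapping.

    Assumes the items are accessible by key, as in data["features"].
    """
    features = data["features"]
    edges = data["edge_list"]
    labels = data["labels"]
    return {
        "num_vertices": len(features),        # one feature row per vertex
        "dim_features": len(features[0]),     # length of a single feature row
        "num_edges": len(edges),              # one entry per edge
        "num_labels": len(labels),            # one label per vertex
        "num_classes": len({int(y) for y in labels}),  # distinct label values
    }


# Stand-in data with the same keys (not the real dataset).
toy = {
    "features": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]],
    "edge_list": [(0, 1), (1, 2), (2, 3)],
    "labels": [0, 1, 0, 1],
}

print(summarize(toy))
```

With dhg installed, summarize(load_github()) should report sizes that can be compared against the values listed above; this has not been verified here.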

Parameters

data_root (str, optional) – The root directory in which the data is stored. If set to None, the data is automatically downloaded from the server and saved into the default directory ~/.dhg/datasets/. Defaults to None.