Github
- class dhg.data.Github(data_root=None)
Bases: dhg.data.base.BaseData
The Github dataset is a collaboration network dataset for the vertex classification task. Nodes correspond to developers who have starred at least 10 repositories, and edges to mutual follower relationships. Node features are location, starred repositories, employer, and e-mail address. The labels are binary, denoting web developers and machine learning developers. For more details, see the Multi-Scale Attributed Node Embedding paper.
Note
The L1-normalization for the feature is not recommended for this dataset.
The content of the Github dataset includes the following:
- num_classes: The number of classes: \(4\).
- num_vertices: The number of vertices: \(37,700\).
- num_edges: The number of edges: \(144,501\).
- dim_features: The dimension of features: \(4,005\).
- features: The vertex feature matrix. torch.Tensor with size \((37,700 \times 4,005)\).
- edge_list: The edge list. List with length \((144,501 \times 2)\).
- labels: The label list. torch.LongTensor with size \((37,700, )\).
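The fields listed above constrain one another: the feature matrix and the label list should each have one entry per vertex, and the edge list should have one pair per edge. The following is a minimal sketch of such a consistency check, not part of DHG. The helper name `check_content` and the toy data are hypothetical, and the check assumes vertex indices and labels start at 0. It relies only on `len()` and indexing, so it should work with plain lists as well as tensors.

```python
def check_content(content):
    """Return a list of inconsistencies between the documented fields.

    `content` is any mapping with the keys listed above. An empty list
    means the fields agree with each other.
    """
    problems = []
    n = content["num_vertices"]
    features = content["features"]
    edge_list = content["edge_list"]
    labels = content["labels"]

    # One feature row per vertex, each row dim_features wide.
    if len(features) != n:
        problems.append("features: row count differs from num_vertices")
    elif n > 0 and len(features[0]) != content["dim_features"]:
        problems.append("features: row width differs from dim_features")

    # One (u, v) pair per edge, with both endpoints valid vertex indices
    # (assumption: indices are 0-based).
    if len(edge_list) != content["num_edges"]:
        problems.append("edge_list: length differs from num_edges")
    if any(not (0 <= u < n and 0 <= v < n) for u, v in edge_list):
        problems.append("edge_list: endpoint outside [0, num_vertices)")

    # One label per vertex, each in [0, num_classes) (assumption: 0-based).
    if len(labels) != n:
        problems.append("labels: length differs from num_vertices")
    if any(not (0 <= int(y) < content["num_classes"]) for y in labels):
        problems.append("labels: value outside [0, num_classes)")

    return problems


# Tiny hand-made stand-in with the same keys (not the real dataset).
toy = {
    "num_classes": 2,
    "num_vertices": 3,
    "num_edges": 2,
    "dim_features": 4,
    "features": [[0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 1, 0]],
    "edge_list": [(0, 1), (1, 2)],
    "labels": [0, 1, 0],
}
toy_problems = check_content(toy)
```

With DHG installed, and assuming the dataset object supports item access by these key names (for example `data["features"]` on `data = Github()`), the same helper could be applied to a dictionary built from the real fields.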
- Parameters
data_root (str, optional) – The directory in which the data is stored. If set to None, the dataset is downloaded automatically from the server and saved to the default directory ~/.dhg/datasets/. Defaults to None.
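The fallback rule for data_root can be illustrated with a short sketch. This is not DHG's implementation: `resolve_data_root` is a hypothetical helper that only mirrors the behavior described above, returning the given directory when one is passed and the default `~/.dhg/datasets/` (with `~` expanded to the user's home directory) otherwise.

```python
from pathlib import Path

# Default location named in the parameter description above.
DEFAULT_ROOT = "~/.dhg/datasets/"


def resolve_data_root(data_root=None):
    """Hypothetical helper: choose the directory the dataset files live in."""
    root = Path(DEFAULT_ROOT if data_root is None else data_root)
    # Expand a leading "~" to the current user's home directory.
    return root.expanduser()


default_root = resolve_data_root()
custom_root = resolve_data_root("/tmp/my_datasets")
```

In practice this means `Github()` and `Github(data_root=None)` are expected to use the default directory, while `Github(data_root="/tmp/my_datasets")` would look for the data in the given location.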