DSE Graph Foundation Model Subgroup

Introduction

Greetings! We are a small research group from the Data Science and Engineering (DSE) Laboratory at Michigan State University. We focus on Graph Foundation Models (GFMs). Our perspectives are as follows: (1) LLMs can be one choice for building GFMs, but they are not there yet. (2) GFMs require guidance from theoretical principles. This is exciting, as it connects advanced theoretical progress to remarkable empirical success (check details here). (3) There is an initial spark of a neural scaling law on graphs; to scale further, we need more high-quality data, a better model backbone, and better pre-training task design. (4) The most important thing for GFMs is the right application scenario. Beyond traditional graph topics in the data mining domain, we are also interested in the potential of GFMs in other domains. Check below for more details on our current progress, including papers, talks, open-source repositories, and a reading list.

Papers

Perspective

  • Graph Foundation Models
    Haitao Mao*, Zhikai Chen*, Wenzhuo Tang, Jianan Zhao, Yao Ma, Tong Zhao, Neil Shah, Michael Galkin, Jiliang Tang;
    ICML 2024 Spotlight
    Details
    • We propose a “graph vocabulary” perspective, aiming to find the basic transferable units underlying graphs that encode the invariances of graphs.
    • We provide theoretical guidance for graph vocabulary design.
    • We emphasize practical techniques for building GFMs following neural scaling laws.

GFMs

Principles

  • Do Neural Scaling Laws Exist on Graph Self-Supervised Learning?
    Qian Ma, Haitao Mao, Zhehua Zhang, Chunlin Feng, Jingzhe Liu, Yu Song, Yao Ma;
    preprint, 2024
  • Neural Scaling Laws on Graphs
    Jingzhe Liu, Haitao Mao, Zhikai Chen, Tong Zhao, Neil Shah, Jiliang Tang;
    preprint, 2024
    Details
    • We examine the model and data scaling laws on graphs.
    • For model scaling, we observe several graph-specific phenomena and identify their potential causes.
    • For data scaling, we propose that the total edge number is a better data metric, and extend the data scaling law to node classification and link prediction tasks (a minimal power-law fit is sketched after this list).
  • A Data Generation Perspective to the Mechanism of In-Context Learning
    Haitao Mao, Guangliang Liu, Yao Ma, Rongrong Wang, Jiliang Tang;
    preprint, 2024
    Details
    • We study the underlying mechanism of ICL from a data generation perspective.
    • We rigorously adopt the terms skill learning and skill recognition; the difference between them is that skill learning can acquire new data generation functions from in-context data, whereas skill recognition cannot.
    • We illustrate two analysis frameworks, i.e., the Bayesian inference statistical framework and the function learning statistical framework.
  • Revisiting Link Prediction: A Data Perspective
    Haitao Mao, Juanhui Li, Harry Shomer, Bingheng Li, Wenqi Fan, Yao Ma, Tong Zhao, Neil Shah, Jiliang Tang;
    ICLR, 2024
    Details
    • We recognize three fundamental factors critical to link prediction: local structural proximity, global structural proximity, and feature proximity (a toy computation is sketched after this list).
    • We unearth an incompatibility between feature proximity and structural proximity.
    • We collect diverse link prediction datasets and provide new guidance for model architecture design.
  • Demystifying Structural Disparity in Graph Neural Networks: Can One Size Fit All?
    Haitao Mao, Zhikai Chen, Wei Jin, Haoyu Han, Yao Ma, Tong Zhao, Neil Shah, Jiliang Tang;
    NeurIPS, 2023
    Details
    • We recognize two fundamental factors critical to node classification: homophily and heterophily (an edge homophily ratio is sketched after this list).
    • GNNs can work on either the homophily pattern or the heterophily one, but not both.
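
As referenced in the scaling-law entry above, below is a minimal, purely illustrative sketch of a data scaling-law fit in which data size is measured by the total edge count. The assumed power-law form and all numbers are made up for illustration; they are not results or code from the paper.

```python
# A minimal sketch (not the paper's code) of a data scaling-law fit where the
# data size is measured by the total number of edges seen during training.
import numpy as np

# Hypothetical (total_edges, test_error) pairs from runs at increasing data scales.
edges = np.array([1e4, 3e4, 1e5, 3e5, 1e6])
error = np.array([0.42, 0.35, 0.29, 0.24, 0.20])

# Assume a power law, error ≈ a * edges^(-b), and fit it by linear regression
# in log-log space.
slope, intercept = np.polyfit(np.log(edges), np.log(error), deg=1)
a, b = np.exp(intercept), -slope
print(f"error ≈ {a:.3f} * edges^(-{b:.3f})")

# Extrapolate to a 10x larger pre-training corpus.
print("predicted error at 1e7 edges:", a * 1e7 ** (-b))
```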
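
For the link prediction entry above, the following sketch instantiates the three proximity factors with simple stand-ins: common neighbors for local structural proximity, inverse shortest-path distance for global structural proximity, and cosine similarity for feature proximity. These concrete choices are illustrative assumptions; the paper's analysis treats the factors more generally.

```python
# A minimal sketch of the three link prediction factors on a toy graph.
import networkx as nx
import numpy as np

G = nx.karate_club_graph()
# Toy node features; real text-attributed graphs would use learned embeddings.
feat = {v: np.random.default_rng(v).normal(size=16) for v in G}

def proximity(u, v):
    local = len(list(nx.common_neighbors(G, u, v)))        # local structural proximity
    global_ = 1.0 / nx.shortest_path_length(G, u, v)       # global structural proximity
    fu, fv = feat[u], feat[v]
    feature = float(fu @ fv / (np.linalg.norm(fu) * np.linalg.norm(fv)))  # feature proximity
    return local, global_, feature

print(proximity(0, 33))  # a candidate link between the two community hubs
```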
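
For the structural disparity entry above, the sketch below computes an edge homophily ratio on a toy graph; a per-node variant of the same idea distinguishes homophilic from heterophilic nodes within a single graph. The graph and labels here are only illustrative.

```python
# A minimal sketch of an edge homophily ratio: the fraction of edges whose
# endpoints share a label (close to 1 = homophilous, close to 0 = heterophilous).
import networkx as nx

G = nx.karate_club_graph()
labels = {v: G.nodes[v]["club"] for v in G}  # karate-club factions as toy labels

def edge_homophily(G, labels):
    same = sum(labels[u] == labels[v] for u, v in G.edges())
    return same / G.number_of_edges()

print(f"edge homophily: {edge_homophily(G, labels):.2f}")
```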

LLMs on Graphs

  • Label-free Node Classification on Graphs with Large Language Models (LLMs)
    Zhikai Chen, Haitao Mao, Hongzhi Wen, Haoyu Han, Wei Jin, Haiyang Zhang, Hui Liu, Jiliang Tang;
    ICLR, 2024
    Details
    • Graph Neural Networks (GNNs) have achieved remarkable advances in node classification, but they require abundant high-quality labels to ensure promising performance. In contrast, Large Language Models (LLMs) exhibit impressive zero-shot proficiency on text-attributed graphs, yet they face challenges in efficiently processing structural data and suffer from high inference costs. In light of these observations, this work introduces LLM-GNN, a pipeline for label-free node classification on graphs with LLMs that amalgamates the strengths of both GNNs and LLMs while mitigating their limitations (a minimal sketch of the pipeline follows this list).
  • Exploring the Potential of Large Language Models (LLMs) in Learning on Graphs
    Zhikai Chen, Haitao Mao, Hang Li, Wei Jin, Hongzhi Wen, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Wenqi Fan, Hui Liu, Jiliang Tang;
    SIGKDD Explorations and NeurIPS GLFrontiers 2023 [codes]
    Details
    • In this paper, we study how LLMs can be used to empower graph machine learning. For node classification, we propose two pipelines: LLMs-as-Enhancers and LLMs-as-Predictors. LLMs-as-Enhancers uses LLMs to enhance node text features, which improves GNN performance. LLMs-as-Predictors adopts LLMs directly for inference, presenting feature information together with inductive biases in natural language, and achieves promising zero-shot performance (a minimal sketch of both pipelines follows this list).
  • Graph Machine Learning in the Era of Large Language Models (LLMs)
    Wenqi Fan, Shijie Wang, Jiani Huang, Zhikai Chen, Yu Song, Wenzhuo Tang, Haitao Mao, Hui Liu, Xiaorui Liu, Dawei Yin, Qing Li;
    preprint, 2024
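
For the LLM-GNN entry above, here is a minimal sketch of the pipeline, assuming the LLM annotates a small, carefully selected node set and an ordinary GNN is then trained on those pseudo-labels; all callables are hypothetical stand-ins rather than the paper's actual interfaces.

```python
# A minimal sketch of the LLM-GNN idea with hypothetical helper callables.
from typing import Callable, Dict, List

def llm_gnn_pipeline(node_texts: Dict[int, str],
                     select_nodes: Callable[[Dict[int, str]], List[int]],
                     llm_annotate: Callable[[str], str],
                     train_gnn: Callable[[Dict[int, str]], object]) -> object:
    # 1) pick nodes whose LLM annotations are likely to be reliable
    candidates = select_nodes(node_texts)
    # 2) query the LLM once per selected node; no human labels are needed
    pseudo_labels = {v: llm_annotate(node_texts[v]) for v in candidates}
    # 3) train a GNN on the pseudo-labels; full-graph inference then stays cheap
    return train_gnn(pseudo_labels)
```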
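
For the entry on exploring LLMs in learning on graphs, the sketch below contrasts the two pipelines at the interface level; `llm_embed`, `llm_classify`, `neighbor_summary`, and `train_gnn` are hypothetical stand-ins, not interfaces from the paper.

```python
# A minimal sketch contrasting LLMs-as-Enhancers and LLMs-as-Predictors.
from typing import Callable, Dict, List

def llms_as_enhancers(node_texts: Dict[int, str],
                      llm_embed: Callable[[str], List[float]],
                      train_gnn: Callable[[Dict[int, List[float]]], object]) -> object:
    # The LLM only enhances node text features; a GNN still makes the predictions.
    enhanced = {v: llm_embed(text) for v, text in node_texts.items()}
    return train_gnn(enhanced)

def llms_as_predictors(node_texts: Dict[int, str],
                       neighbor_summary: Callable[[int], str],
                       llm_classify: Callable[[str], str]) -> Dict[int, str]:
    # The LLM predicts directly from a natural-language description of each node
    # (optionally with its neighborhood); no GNN is trained, enabling zero-shot use.
    return {v: llm_classify(text + "\n" + neighbor_summary(v))
            for v, text in node_texts.items()}
```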

Benchmarks

Others

Paper Lists

Acknowledgement

We sincerely thank the people below for their guidance and collaboration in our research work.

Advisory: Neil Shah, Tong Zhao, Yao Ma, Wei Jin, Michael Galkin, Jian Tang, Michael Bronstein, Xavier Bresson, Bryan Hooi, Haiyang Zhang, Xiafeng Tang, Chen Luo.

Students: Harry Shomer, Juanhui Li, Guangliang Liu, Jianan Zhao, Xiaoxin He, Qian Huang, Xinyu Yuan, Zhaocheng Zhu.

Sponsors