data-centric-AI: Curated Resource List for AI Developers

| GitHub Stats | Value |
| --- | --- |
| Stars | 1026 |
| Forks | 72 |
| Language | - |
| Created | 2023-03-07 |
| License | - |

The ‘data-centric-AI’ project is a curated collection of resources focused on the principles and practices of data-centric artificial intelligence. This initiative aims to highlight a diverse range of ideas and techniques in the field, rather than attempting to be exhaustive. It includes survey papers, perspective papers, and tutorials such as the Large Time Series Model (LTSM) and a KDD 2023 tutorial on data-centric AI techniques. The project welcomes contributions to enhance and refine the list, making it a valuable resource for those interested in the evolving landscape of data-centric AI.

Data-centric AI is an emerging field that focuses on engineering data to improve AI systems, emphasizing data quality and quantity over model complexity.

  • Unlike traditional model-centric AI, which focuses on developing more effective models, data-centric AI shifts the focus to systematically engineering high-quality data.
  • Training Data Development: Collect and produce rich, high-quality training data.
  • Inference Data Development: Create novel evaluation sets to provide granular insights into models or trigger specific model capabilities.
  • Data Maintenance: Ensure data quality and reliability in dynamic environments.
  • Data Collection: Methods for discovering and integrating datasets.
  • Data Labeling: Techniques for active learning, weak supervision, and label correction.
  • Data Preparation: Feature extraction, normalization, and cleaning.
  • Data Reduction: Feature selection and dimensionality reduction.
  • Data Augmentation: Techniques to increase dataset size and diversity.
  • Pipeline Search: Automated preprocessing pipeline optimization.
  • In-distribution Evaluation: Methods for evaluating models within known data distributions.
  • Out-of-distribution Evaluation: Techniques for evaluating models on unseen data distributions.
  • Prompt Engineering: Strategies for creating effective prompts to interact with AI models.
  • Data Understanding: Tools for visualizing and interpreting data.
  • Data Quality Assurance: Methods for ensuring data accuracy and reliability.
  • Data Storage and Retrieval: Efficient storage and retrieval systems.
  • Training Data Development Benchmark: Evaluating data collection, labeling, preparation, and augmentation methods.
  • Inference Data Development Benchmark: Assessing in-distribution and out-of-distribution evaluation methods.
  • Data Maintenance Benchmark: Evaluating data storage, retrieval, and quality assurance.
  • Unified Benchmark: Comprehensive benchmarks for overall data-centric AI development.
  • The project includes a curated list of papers, tutorials, and blogs on data-centric AI concepts, techniques, and challenges.
  • It also provides resources for graph structure learning, knowledge graph-based paper search engines, and community discussion channels.
  • The project welcomes contributions through pull requests and encourages community engagement to enrich and refine the list of resources.
  • Data Collection: Use resources like Aurum for dataset discovery in data lakes, or Table Union Search to find relevant datasets efficiently. For example, in time series analysis, TODS can automate outlier detection.

    - Example: Automate time series outlier detection using TODS [Paper](https://arxiv.org/abs/2009.09822) [Code](https://github.com/datamllab/tods)
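A full TODS pipeline is beyond a short snippet, but the core idea it automates, flagging time-series points that deviate sharply from recent behavior, can be sketched with a rolling z-score rule (plain NumPy, not the TODS API):

```python
import numpy as np

def detect_outliers(series, window=20, threshold=3.0):
    """Flag points deviating from the rolling mean by more than
    `threshold` rolling standard deviations (a simple z-score rule)."""
    flags = np.zeros(len(series), dtype=bool)
    for i in range(window, len(series)):
        ref = series[i - window:i]
        mu, sigma = ref.mean(), ref.std()
        if sigma > 0 and abs(series[i] - mu) > threshold * sigma:
            flags[i] = True
    return flags

# Synthetic series with one injected spike.
rng = np.random.default_rng(0)
ts = rng.normal(0.0, 1.0, 200)
ts[150] += 12.0  # anomaly
flags = detect_outliers(ts)
print(flags[150])  # the spike is flagged
```

Libraries like TODS search over many such detectors (and their hyperparameters) automatically rather than hard-coding one rule.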
  • Data Labeling: Apply programmatic labeling techniques, such as weak supervision with Snorkel or active learning with Meta-AAD, to label large datasets efficiently. For instance, Segment Anything relies on extensive annotated data for training.

    - Example: Use Snorkel for weak supervision in labeling tasks [Paper](https://arxiv.org/abs/1711.10160) [Code](https://github.com/snorkel-team/snorkel)
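The weak-supervision pattern behind Snorkel is writing small heuristic labeling functions and combining their noisy votes. Snorkel fits a generative label model over the votes; this sketch uses hypothetical labeling functions for a toy spam task and a plain majority vote instead:

```python
import numpy as np

# Hypothetical labeling functions: each returns 1 (spam), 0 (ham),
# or -1 (abstain). These heuristics are illustrative, not from Snorkel.
def lf_keyword(text):
    return 1 if "free money" in text.lower() else -1

def lf_shouting(text):
    return 1 if text.isupper() else -1

def lf_length(text):
    return 0 if len(text.split()) > 8 else -1

LFS = [lf_keyword, lf_shouting, lf_length]

def weak_label(text):
    """Combine non-abstaining votes by simple majority."""
    votes = [lf(text) for lf in LFS if lf(text) != -1]
    if not votes:
        return -1  # no labeling function fired; leave unlabeled
    return int(np.bincount(votes).argmax())

print(weak_label("FREE MONEY NOW"))  # 1 (spam)
```

Snorkel's `LabelModel` improves on majority vote by estimating each labeling function's accuracy and correlations before aggregating.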
  • Data Preparation: Use tools like AlphaClean to generate data-cleaning pipelines automatically, ensuring high-quality data for model training.

    - Example: Generate data cleaning pipelines with AlphaClean [Paper](https://arxiv.org/abs/1904.11827) [Code](https://github.com/sjyk/alphaclean)
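AlphaClean frames cleaning as a search over parameterized repair operators that maximize a data-quality objective. A toy version of that search, with a hypothetical list-of-dicts table and two hand-written operators (not AlphaClean's actual API):

```python
def drop_nulls(rows):
    return [r for r in rows if all(v is not None for v in r.values())]

def clip_age(rows):
    return [{**r, "age": min(max(r["age"], 0), 120) if r["age"] is not None else None}
            for r in rows]

def quality(rows):
    """Toy objective: fraction of rows that are complete and in-range."""
    ok = [r for r in rows if r["age"] is not None and 0 <= r["age"] <= 120]
    return len(ok) / max(len(rows), 1)

def greedy_clean(rows, ops, steps=2):
    """Greedily apply whichever operator most improves the objective."""
    for _ in range(steps):
        best = max(ops, key=lambda op: quality(op(rows)))
        if quality(best(rows)) <= quality(rows):
            break
        rows = best(rows)
    return rows

data = [{"age": 34}, {"age": -5}, {"age": None}, {"age": 250}]
cleaned = greedy_clean(data, [drop_nulls, clip_age])
print(quality(cleaned))  # 1.0 after clipping and null removal
```

The real system searches a much larger operator space with pruning, but the objective-driven loop is the same idea.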
  • In-distribution Evaluation: Employ methods like FOCUS (Flexible Optimizable Counterfactual Explanations) to probe model decisions on data within the training distribution.

    - Example: Use FOCUS for counterfactual explanations [Paper](https://arxiv.org/abs/1911.12199) [Code](https://github.com/a-lucic/focus)
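A counterfactual explanation answers "what minimal change to this input would flip the model's prediction?" FOCUS optimizes this through a differentiable relaxation of tree ensembles; as a stand-in, this sketch finds the smallest single-feature change that flips a simple linear classifier, via coordinate line search:

```python
import numpy as np

# Hypothetical two-feature linear scorer standing in for a tree ensemble.
def predict(x, w=np.array([1.0, -2.0]), b=-0.5):
    return int(x @ w + b > 0)

def counterfactual(x, step=0.05, max_steps=200):
    """Smallest single-feature perturbation that flips predict(x)."""
    base = predict(x)
    best = None
    for i in range(len(x)):
        for sign in (+1.0, -1.0):
            cand = x.copy()
            for _ in range(max_steps):
                cand[i] += sign * step
                if predict(cand) != base:
                    dist = abs(cand[i] - x[i])
                    if best is None or dist < best[0]:
                        best = (dist, cand.copy())
                    break
    return best[1] if best else None

x = np.array([1.0, 0.5])        # scored 1 - 1 - 0.5 = -0.5, class 0
cf = counterfactual(x)
print(predict(x), predict(cf))  # 0 1
```

FOCUS replaces this brute-force search with gradient descent on a smoothed model, which scales to many features and correlated changes.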
  • Out-of-distribution Evaluation: Use benchmarks like WILDS to test model robustness under distribution shifts.

    - Example: Evaluate model robustness with the WILDS benchmark [Paper](https://arxiv.org/pdf/2012.07421v3.pdf) [Code](https://github.com/p-lambda/wilds)
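WILDS packages real-world distribution shifts with standardized splits; the evaluation pattern it promotes, reporting in-distribution and shifted-split accuracy side by side, can be sketched with synthetic data and a nearest-centroid classifier (illustrative only, not the WILDS API):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_domain(shift):
    """Two Gaussian classes; `shift` moves both, simulating domain shift."""
    x0 = rng.normal([0 + shift, 0], 0.5, (100, 2))
    x1 = rng.normal([2 + shift, 2], 0.5, (100, 2))
    return np.vstack([x0, x1]), np.array([0] * 100 + [1] * 100)

# "Train" a nearest-centroid classifier on the source domain.
X_train, y_train = make_domain(shift=0.0)
centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])

def accuracy(X, y):
    pred = np.argmin(
        np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
    return (pred == y).mean()

X_id, y_id = make_domain(shift=0.0)    # in-distribution test split
X_ood, y_ood = make_domain(shift=2.0)  # shifted test split
print(accuracy(X_id, y_id), accuracy(X_ood, y_ood))  # OOD accuracy drops
```

The gap between the two numbers is the robustness signal WILDS is designed to surface on real datasets.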
  • Prompt Engineering: Apply techniques such as SPeC (soft prompt-based calibration) to calibrate the performance of pre-trained language models.

    - Example: Enhance language models with SPeC [Paper](https://arxiv.org/pdf/230
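The mechanism behind soft prompting, which SPeC builds on, is to prepend trainable embedding vectors to the token embeddings so that only the prompt is tuned while the model weights stay frozen. A shape-level sketch (illustrative dimensions, no actual language model):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, prompt_len = 16, 4  # assumed toy dimensions

soft_prompt = rng.normal(0, 0.02, (prompt_len, d_model))  # trainable
token_embeds = rng.normal(0, 1.0, (7, d_model))           # frozen lookup

def with_soft_prompt(embeds, prompt):
    """Prepend the soft prompt along the sequence dimension."""
    return np.concatenate([prompt, embeds], axis=0)

seq = with_soft_prompt(token_embeds, soft_prompt)
print(seq.shape)  # (11, 16): prompt_len + sequence length tokens
```

During training, gradients flow only into `soft_prompt`, making this far cheaper than fine-tuning the full model.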
  • Focus on Data Quality: Data-centric AI shifts the focus from model development to systematic engineering of data, emphasizing data quality and quantity to improve AI systems.
  • Comprehensive Framework: The framework includes training data development, inference data development, and data maintenance, each with specific sub-goals.
  • Applications: Successful examples include GPT models and Segment Anything, highlighting the critical role of high-quality training data.
  • Benchmarks and Tools: Various benchmarks (e.g., OpenGSL, REIN) and tools (e.g., AutoAugment, MixText) are being developed to standardize and enhance data-centric practices.
  • Future Potential: Expected to improve model robustness, fairness, and interpretability by addressing data flaws and ensuring continuous data maintenance.

Data-centric AI is an emerging field that prioritizes the engineering of high-quality data to enhance AI performance. It complements traditional model-centric approaches by focusing on data collection, labeling, preparation, reduction, augmentation, and maintenance. The project includes extensive resources, benchmarks, and tools to support this paradigm shift. The future potential of data-centric AI lies in its ability to improve model robustness, fairness, and interpretability, making AI systems more reliable and effective.

For further insights and additional resources, explore the original daochenzha/data-centric-AI repository.

Content derived from the daochenzha/data-centric-AI repository on GitHub. Original materials are licensed under their respective terms.