Top 30 AI dataset tools

Discover the most powerful AI tools in this category with pricing, features, demo and use cases

ArXiv Dataset

ANALYTICS AI
SEARCH RETRIEVAL AI
95

ArXiv Dataset is not a singular AI tool or model, but rather a vast repository of scientific preprin...

Platforms
WEB
API
OTHER
Domains
RESEARCH
EDUCATION
DEVELOPMENT
DATA ANALYTICS
+1 more
Use Cases
Training AI models on vast collections of scientific literature
Enabling systematic literature reviews and trend analysis
Providing a source for natural language processing (NLP) research
+1 more
Target Users
AI RESEARCHER
DATA SCIENTIST
RESEARCHER
+2 more
Modalities
TEXT
MULTIMODAL
Integrations
API CONNECTOR
OTHER
Pricing
FREE
MNIST

COMPUTER VISION
95

MNIST is a foundational dataset of handwritten digits, widely used for training and evaluating machi...

Platforms
SDK
OTHER
Domains
DATA ANALYTICS
RESEARCH
EDUCATION
DEVELOPMENT
Use Cases
Training image classification models for handwritten digits
Benchmarking and comparing the performance of different machine learning algorithms
Developing and testing optical character recognition (OCR) systems
Target Users
MACHINE LEARNING ENGINEER
DATA SCIENTIST
AI RESEARCHER
+2 more
Modalities
IMAGE
TABULAR
Integrations
API CONNECTOR
OTHER
Pricing
FREE
CIFAR-10

COMPUTER VISION
95

CIFAR-10 is a widely used benchmark dataset for image classification tasks, consisting of 60,000 32x...

Platforms
SDK
API
Domains
RESEARCH
EDUCATION
DATA ANALYTICS
DEVELOPMENT
Use Cases
Training and evaluating image classification models
Benchmarking computer vision algorithms
Developing and testing deep learning architectures for image recognition
Target Users
MACHINE LEARNING ENGINEER
AI RESEARCHER
DATA SCIENTIST
+2 more
Modalities
IMAGE
Integrations
OTHER
Pricing
FREE
ImageNet

COMPUTER VISION
90

ImageNet is a foundational large-scale visual database designed for use in visual object recognition...

Platforms
OTHER
Domains
RESEARCH
DATA ANALYTICS
Use Cases
Training and evaluating image classification models
Developing and testing object detection algorithms
Benchmarking computer vision research advancements
+1 more
Target Users
MACHINE LEARNING ENGINEER
AI RESEARCHER
DATA SCIENTIST
Modalities
IMAGE
Pricing
FREE
LAION-400M

GENERATIVE AI
85

LAION-400M is a massive, open-source dataset containing roughly 400 million image-text pairs, primarily used...

Platforms
OTHER
Domains
RESEARCH
DEVELOPMENT
IMAGE GENERATION
CONTENT CREATION
Use Cases
Training text-to-image generation models
Developing multimodal AI systems
Facilitating research in generative AI
+1 more
Target Users
AI RESEARCHER
MACHINE LEARNING ENGINEER
DATA SCIENTIST
+2 more
Modalities
IMAGE
TEXT
MULTIMODAL
Integrations
OTHER
Pricing
FREE
CIFAR-100

COMPUTER VISION
85

CIFAR-100 is a widely used benchmark dataset for image classification tasks, containing 100 fine-gra...

Platforms
SDK
OTHER
Domains
RESEARCH
EDUCATION
DATA ANALYTICS
DEVELOPMENT
Use Cases
Training and evaluating image classification models
Benchmarking performance of deep learning architectures
Researching novel computer vision algorithms
+1 more
Target Users
AI RESEARCHER
MACHINE LEARNING ENGINEER
DATA SCIENTIST
+2 more
Modalities
IMAGE
Integrations
OTHER
Pricing
FREE
COCO (Common Objects in Context)

COMPUTER VISION
85

COCO (Common Objects in Context) is a large-scale object detection, segmentation, and captioning dat...

Platforms
OTHER
Domains
RESEARCH
DEVELOPMENT
DATA ANALYTICS
Use Cases
Training object detection models
Evaluating image segmentation algorithms
Developing image captioning systems
+1 more
Target Users
MACHINE LEARNING ENGINEER
DATA SCIENTIST
AI RESEARCHER
+2 more
Modalities
IMAGE
Pricing
FREE
Fashion-MNIST

COMPUTER VISION
85

Fashion-MNIST is a benchmark dataset of 70,000 28x28 grayscale images of 10 fashion categories, wide...

Platforms
API
SDK
Domains
DATA ANALYTICS
EDUCATION
RESEARCH
DEVELOPMENT
+1 more
Use Cases
Training image classification models
Benchmarking performance of new computer vision algorithms
Teaching foundational concepts in machine learning and deep learning
+1 more
Target Users
MACHINE LEARNING ENGINEER
DATA SCIENTIST
AI RESEARCHER
+2 more
Modalities
IMAGE
Integrations
OTHER
Pricing
FREE
Wikipedia Dump

SEARCH RETRIEVAL AI
OTHER
85

Wikipedia Dump is a massive, publicly available dataset of Wikipedia articles, offering an unparalle...

Platforms
OTHER
Domains
RESEARCH
EDUCATION
DATA ANALYTICS
CONTENT CREATION
Use Cases
Training large language models for knowledge understanding
Developing information retrieval and search systems
Conducting linguistic and NLP research
+1 more
Target Users
AI RESEARCHER
DATA SCIENTIST
MACHINE LEARNING ENGINEER
+2 more
Modalities
TEXT
Integrations
OTHER
Pricing
FREE
LAION-5B

OTHER
85

LAION-5B is a massive, open-source dataset of 5.85 billion image-text pairs, designed to facilitate ...

Platforms
OTHER
Domains
RESEARCH
DEVELOPMENT
CONTENT CREATION
IMAGE GENERATION
+1 more
Use Cases
Training large-scale text-to-image generation models
Developing multimodal AI systems for image captioning and retrieval
Benchmarking and advancing research in foundation models
+1 more
Target Users
AI RESEARCHER
MACHINE LEARNING ENGINEER
DATA SCIENTIST
+2 more
Modalities
IMAGE
TEXT
MULTIMODAL
Integrations
OTHER
Pricing
FREE
Amazon Personalize

RECOMMENDATION AI
ANALYTICS AI
75

Amazon Personalize is a fully managed machine learning service that makes it easy for developers to ...

Platforms
API
SDK
WEB
Domains
ECOMMERCE
MARKETING
CONTENT CREATION
ENTERTAINMENT
+1 more
Use Cases
Deliver personalized product recommendations to e-commerce customers in real-time.
Create personalized content recommendations for media and entertainment platforms.
Develop personalized marketing campaigns based on user behavior.
Target Users
DEVELOPER
SOFTWARE ENGINEER
MACHINE LEARNING ENGINEER
+2 more
Modalities
TABULAR
TIME_SERIES
Integrations
API CONNECTOR
DATABASE
OTHER
Pricing
PAID
CUSTOM
Stack Exchange Data Dump

SEARCH RETRIEVAL AI
75

The Stack Exchange Data Dump provides a comprehensive, publicly accessible collection of anonymized ...

Platforms
OTHER
Domains
DEVELOPMENT
RESEARCH
EDUCATION
PRODUCTIVITY
Use Cases
Training natural language processing models on community-generated technical questions and answers.
Analyzing trends in programming languages, technologies, and software development practices.
Building custom search and knowledge retrieval systems for technical domains.
+1 more
Target Users
RESEARCHER
DATA SCIENTIST
AI RESEARCHER
+2 more
Modalities
TEXT
TABULAR
Integrations
DATABASE
API CONNECTOR
OTHER
Pricing
FREE
Open Images Dataset

COMPUTER VISION
75

The Open Images Dataset is a large-scale, open dataset of ~9 million images annotated with image-lev...

Platforms
API
SDK
Domains
RESEARCH
DEVELOPMENT
DATA ANALYTICS
PRODUCTIVITY
+1 more
Use Cases
Training models for object detection and recognition
Developing image segmentation algorithms
Benchmarking computer vision model performance
+1 more
Target Users
AI RESEARCHER
MACHINE LEARNING ENGINEER
DATA SCIENTIST
+2 more
Modalities
IMAGE
TEXT
Integrations
API CONNECTOR
DATABASE
OTHER
Pricing
FREE
SVHN (Street View House Numbers)

COMPUTER VISION
75

SVHN (Street View House Numbers) is a computer vision dataset used for training and evaluating digit...

Platforms
OTHER
Domains
RESEARCH
DATA ANALYTICS
DEVELOPMENT
OTHER
Use Cases
Training models for reading street numbers in autonomous driving systems.
Evaluating digit recognition algorithms in real-world visual noise.
Developing systems for address extraction from imagery.
Target Users
AI RESEARCHER
MACHINE LEARNING ENGINEER
DATA SCIENTIST
+2 more
Modalities
IMAGE
Integrations
OTHER
Pricing
FREE
Flickr30k

COMPUTER VISION
SEARCH RETRIEVAL AI
75

Flickr30k is a large-scale dataset for image captioning and visual-linguistic research, comprising o...

Platforms
OTHER
Domains
RESEARCH
DEVELOPMENT
IMAGE GENERATION
DATA ANALYTICS
Use Cases
Training models for image captioning
Developing visual question answering systems
Evaluating multimodal AI model performance
+1 more
Target Users
AI RESEARCHER
MACHINE LEARNING ENGINEER
DATA SCIENTIST
+2 more
Modalities
IMAGE
TEXT
MULTIMODAL
Integrations
OTHER
Pricing
FREE
Common Crawl

75

Common Crawl is a non-profit organization that provides a massive, open repository of web crawl data...

Platforms
OTHER
Domains
DEVELOPMENT
RESEARCH
DATA ANALYTICS
BUSINESS
Use Cases
Training large language models for NLP tasks
Developing search engine algorithms
Conducting large-scale web data analysis
+1 more
Target Users
AI RESEARCHER
DATA SCIENTIST
MACHINE LEARNING ENGINEER
+2 more
Modalities
TEXT
IMAGE
TABULAR
Integrations
API CONNECTOR
DATABASE
OTHER
Pricing
FREE
The Pile

OTHER
75

The Pile is a massive, diverse dataset curated for training large language models, encompassing a wi...

Platforms
OTHER
Domains
RESEARCH
DEVELOPMENT
DATA ANALYTICS
Use Cases
Training foundation language models on diverse text and code
Benchmarking and evaluating LLM performance
Researching data curation and its impact on model capabilities
Target Users
AI RESEARCHER
MACHINE LEARNING ENGINEER
DATA SCIENTIST
+2 more
Modalities
TEXT
Integrations
OTHER
Pricing
FREE
OpenWebText

75

OpenWebText is an open-source dataset designed to replicate the quality and diversity of OpenAI's We...

Platforms
OTHER
Domains
RESEARCH
DEVELOPMENT
Use Cases
Training large language models on diverse web text
Evaluating language model performance
Researching text generation and understanding
Target Users
AI RESEARCHER
MACHINE LEARNING ENGINEER
DATA SCIENTIST
Modalities
TEXT
Pricing
FREE
Wikitext-103

ANALYTICS AI
OTHER
75

Wikitext-103 is a large text dataset primarily used for language modeling tasks, serving as a benc...

Platforms
SDK
OTHER
Domains
RESEARCH
DEVELOPMENT
EDUCATION
DATA ANALYTICS
Use Cases
Evaluate the performance of new language models
Train and fine-tune language generation models
Conduct research in natural language processing
Target Users
AI RESEARCHER
MACHINE LEARNING ENGINEER
DATA SCIENTIST
Modalities
TEXT
Integrations
OTHER
Pricing
FREE
Project Gutenberg

75

Project Gutenberg is a massive online library of over 60,000 free eBooks, focusing on public d...

Platforms
WEB
Domains
EDUCATION
RESEARCH
WRITING
ENTERTAINMENT
Use Cases
Access and read public domain literature
Download free eBooks for offline reading
Research historical and classic texts
Target Users
STUDENT
RESEARCHER
WRITER
+2 more
Modalities
TEXT
Pricing
FREE
CelebA

COMPUTER VISION
75

CelebA is a large-scale dataset of celebrity images with annotations for various facial attributes, ...

Platforms
OTHER
Domains
RESEARCH
DATA ANALYTICS
DEVELOPMENT
Use Cases
Training models for facial attribute recognition (e.g., age, gender, hair color)
Developing and evaluating face detection algorithms
Researching and benchmarking computer vision models for facial analysis
Target Users
AI RESEARCHER
DATA SCIENTIST
MACHINE LEARNING ENGINEER
+2 more
Modalities
IMAGE
Pricing
FREE
LibriSpeech

SPEECH AI
75

LibriSpeech is a large-scale, open-source dataset of read English speech used for training and evalu...

Platforms
OTHER
Domains
RESEARCH
EDUCATION
DEVELOPMENT
AUDIO MUSIC
Use Cases
Training and evaluating automatic speech recognition (ASR) models
Developing and testing speaker recognition and identification systems
Benchmarking the performance of different ASR architectures
+1 more
Target Users
MACHINE LEARNING ENGINEER
AI RESEARCHER
DEVELOPER
+2 more
Modalities
AUDIO
Pricing
FREE
VoxCeleb

SPEECH AI
75

VoxCeleb is a large-scale dataset for speaker recognition and speaker diarization, comprising a vast...

Platforms
OTHER
Domains
RESEARCH
AUDIO MUSIC
CONTENT CREATION
Use Cases
Training and evaluating speaker recognition models
Developing and testing speaker diarization systems
Researching robust voice biometrics applications
Target Users
AI RESEARCHER
MACHINE LEARNING ENGINEER
DATA SCIENTIST
+1 more
Modalities
AUDIO
Pricing
FREE
Common Voice (Mozilla)

SPEECH AI
75

Common Voice is an open-source initiative by Mozilla to collect diverse voice data, enabling the tra...

Platforms
WEB
Domains
RESEARCH
DEVELOPMENT
EDUCATION
PRODUCTIVITY
Use Cases
Train custom automatic speech recognition (ASR) models
Develop and improve voice-enabled applications
Facilitate linguistic research on spoken language
+1 more
Target Users
RESEARCHER
AI RESEARCHER
DEVELOPER
+2 more
Modalities
AUDIO
Integrations
OTHER
Pricing
FREE
AudioSet

ANALYTICS AI
OTHER
75

AudioSet is a large-scale dataset containing diverse audio events annotated with semantic labels, pr...

Platforms
OTHER
Domains
RESEARCH
EDUCATION
AUDIO MUSIC
OTHER
Use Cases
Training models for real-time sound event detection in smart devices.
Benchmarking audio classification algorithms across a wide range of sounds.
Developing applications for ambient sound analysis and environmental monitoring.
Target Users
RESEARCHER
MACHINE LEARNING ENGINEER
DATA SCIENTIST
+2 more
Modalities
AUDIO
Integrations
OTHER
Pricing
FREE
VQAv2 (Visual Question Answering)

COMPUTER VISION
75

VQAv2 is a benchmark dataset and evaluation metric for Visual Question Answering (VQA) systems, desi...

Domains
RESEARCH
DEVELOPMENT
DATA ANALYTICS
Use Cases
Evaluating VQA models on image understanding
Benchmarking multimodal AI systems
Developing new approaches to visual reasoning
Target Users
AI RESEARCHER
MACHINE LEARNING ENGINEER
DATA SCIENTIST
+2 more
Modalities
IMAGE
TEXT
CLIP Benchmark Dataset

COMPUTER VISION
ANALYTICS AI
75

CLIP Benchmark Dataset is a curated collection of image-text pairs designed to evaluate the zero-sho...

Platforms
SDK
OTHER
Domains
RESEARCH
DEVELOPMENT
DATA ANALYTICS
EDUCATION
Use Cases
Evaluate zero-shot image classification performance
Benchmark multimodal model robustness
Compare image-text retrieval capabilities
+1 more
Target Users
AI RESEARCHER
MACHINE LEARNING ENGINEER
DATA SCIENTIST
+2 more
Modalities
TEXT
IMAGE
MULTIMODAL
Integrations
OTHER
Pricing
FREE
MIMIC-III

ANALYTICS AI
OTHER
75

MIMIC-III is a critical benchmark dataset that enables research in critical care medicine. It contai...

Platforms
OTHER
Domains
HEALTHCARE
RESEARCH
DATA ANALYTICS
Use Cases
Develop predictive models for patient outcomes
Analyze treatment effectiveness in critical care settings
Advance research in electronic health record analysis
+1 more
Target Users
RESEARCHER
AI RESEARCHER
DATA SCIENTIST
+1 more
Modalities
TABULAR
TIME_SERIES
Integrations
OTHER
Pricing
FREE
KITTI

COMPUTER VISION
75

KITTI is a specialized computer vision benchmark dataset and associated software development kit, pr...

Platforms
OTHER
Domains
DEVELOPMENT
RESEARCH
AUTOMATION
OPERATIONS
+1 more
Use Cases
Training and evaluating object detection models for autonomous vehicles
Developing and testing algorithms for 3D scene perception using lidar and stereo cameras
Benchmarking performance of tracking algorithms in dynamic environments
+1 more
Target Users
AI RESEARCHER
MACHINE LEARNING ENGINEER
DATA SCIENTIST
+1 more
Modalities
IMAGE
TABULAR
SENSOR_DATA
Integrations
OTHER
Pricing
FREE
Waymo Open Dataset

COMPUTER VISION
OTHER
75

The Waymo Open Dataset is a comprehensive, large-scale dataset for autonomous driving research, prov...

Platforms
SDK
OTHER
Domains
DEVELOPMENT
RESEARCH
AUTOMATION
DATA ANALYTICS
+1 more
Use Cases
Training and evaluating perception models for autonomous vehicles
Developing and testing sensor fusion algorithms
Researching object detection and tracking in complex urban environments
+1 more
Target Users
AI RESEARCHER
DATA SCIENTIST
MACHINE LEARNING ENGINEER
+2 more
Modalities
THREE_D
IMAGE
SENSOR_DATA
Integrations
OTHER
Pricing
FREE
