With more devices connecting to the Internet, more data is collected and shared. Much of the research done today relies on datasets that are readily available to everyone. Whether you are studying climate change, clean energy, or the most popular side dish served at Thanksgiving dinner, there is a dataset for that.
To help prepare for the increased data demands coming in 2021, here is a list of the top 100 websites and sources for datasets on the Internet. As with any list of this type and length, opinions will vary. Regardless of how these sites rank for you, at least we can agree on the value of any website providing reliable, verified data.
Without further ado, the list.
- Google Dataset Search: Aggregated data from external sources, providing a clear summary, a description of the data, who it’s provided by, and when it was last updated.
- Kaggle: A community hub offering aggregated datasets.
- Data.gov: Over 200,000 datasets covering everything from climate change to crime.
- Datahub.io: Stock market data, property prices, inflation, logistics, and more.
- UCI Machine Learning Repository: Datasets are categorized by task (e.g., classification, regression, or clustering), attribute (e.g., categorical, numerical), data type, and area of expertise.
- NASA’s Earth Data: Access to all of NASA’s satellite observation data — from weather and climate measurements to atmospheric observations, ocean temperatures, vegetation mapping, and more.
- NASA’s Planetary Data: This repository includes data from interplanetary missions.
- CERN Open Data Portal: Particle physics datasets.
- Global Health Observatory Data Repository: The World Health Organization’s gateway to health-related statistics from around the world.
- U.S. Bureau of Labor Statistics: Datasets across a wide range of industries collected by the United States Department of Labor.
- U.S. Census Bureau: Government statistics on population, economy, education, and geography.
- UNData: United Nations data.
- Amazon Public Datasets: Large datasets relating to biology, chemistry, economics, and physiology, including the Human Genome Project.
- Pew Research: Public opinion polls, demographic research, and other social science research.
- Google Scholar: Information, including articles, books, abstracts, white papers, and court decisions.
- Datasets Subreddit: A little of everything, from English grain prices of the 14th Century to U.S. homelessness rates.
- FiveThirtyEight: Statistical analysis of elections, politics, sports, science, economics, and more.
- Qlik DataMarket: Datasets related to economics, healthcare, agriculture, and the auto industry.
- London Datastore: Data about life in London.
- NYC Open Data: Data for New York City, including data on corruption, elections, and media.
- BFI film industry statistics: The BFI collects and releases data on everything from U.K. box office figures to audience demographics, home entertainment, movie production costs, and more.
- NYC Taxi Trip Data: Find datasets covering pick-up/drop-off times and locations, trip distances, fares, rate and payment types, passenger counts, and more.
- FBI Crime Data Explorer: This site provides a broad collection of crime statistics from various state organizations and governments.
- World Bank Open Data: Free and open access to global development data.
- DataBank: Analysis and visualizations of time series data on various topics.
- Uniform Crime Reporting Statistics: Data on violent crime at city, county, state, and national levels.
- U.S. Food and Drug Administration: Datasets of drug submissions and approvals.
- First Databank: Drug data and drug databases.
- Education Data by Unicef: Data on school completion, attendance, and literacy rates.
- European Union Open Data Portal: Datasets from European Union institutions.
- Open Data Network: Government-related data, including some visualizations.
- Gapminder: Massive collection of data sources covering everything from agriculture and employment to aid given and death.
- Awesome-Public-Datasets on GitHub: This repository hosts a library of awesome public datasets.
- Google Public Datasets: Public datasets available from Google Cloud.
- Academic Torrents: A site geared toward sharing the datasets from scientific papers.
- Quandl: Repository of economic and financial data.
- Google Trends: Statistics on search volume (as a proportion of total search) for any given term, since 2004.
- Crunchbase: Business information about private and public companies.
- Glassdoor Research: Data related to employment.
- National Center for Education Statistics: Datasets related to education.
- Million Song Dataset: Data containing audio and metadata for a million popular songs.
- The Numbers: Movie financials, including box office, DVD, and Blu-ray sales and release dates.
- Statista: Insights and facts across 170 industries and 150+ countries.
- Academic Rights Press: Provides access to perpetually updated, week-to-week information on Spotify streaming, as well as current and historical data from Billboard, GfK, and more.
- National Center for Environmental Health: Government-funded data related to environmental public health.
- National Climatic Data Center: Storm and climate indices from the National Oceanic and Atmospheric Administration.
- National Weather Service: Climate data from U.S. observation, including historical weather conditions and long-term averages.
- U.S. Bureau of Economic Analysis: U.S. economic statistics (e.g., national income and gross domestic product).
- National Bureau of Economic Research: Industry, productivity, trade, and international financial data.
- U.S. Securities and Exchange Commission: Datasets of filed information from exhibits to corporate financial reports.
- World Bank Open Data: Education statistics from finances to service delivery.
- Global Financial Data: 300 years of analytic global economy data for 60,000 companies.
- Data Catalogs: Comprehensive list of open data catalogs.
- The CIA World Factbook: Data on every country focused on history, people, government, economy, energy, geography, communications, transportation, and defense.
- Centers for Disease Control and Prevention: Public health data and statistics.
- World Health Organization: International public health information, data, statistics, and reports.
- National Center for Health Statistics: Datasets, growth charts, and other vital records.
- HealthData.gov: Health data including data on Medicaid, Medicare, treatments, and clinical studies.
- U.S. Small Business Administration: Employment data from business owners’ perspective, including economic indicators and projections.
- Gallup: Data-driven news based on U.S. and world polls.
- Rand State Statistics: Social science data for the U.S. at the national, state, and local levels.
- Roper Center for Public Opinion Research: U.S. and international polling and public opinion survey data.
- BuzzFeed News: Datasets, analysis, libraries, tools, and guides used in BuzzFeed articles.
- Socrata: Cleaned, open data sources spanning government, business, and education.
- Google Finance: 40 years’ worth of stock market data, updated in real time.
- Google Books: Search and analyze the full text of any of the millions of books digitized as part of the Google Books project.
- Data Source Network: DataSN crawls, parses, and hosts Internet data as machine-friendly, human-readable data objects rather than raw web pages. One example is the Yellow Pages dataset.
- LoveTheSales: Free access to data for editors and academics to mine stats on the retail industry.
- National Government Statistical Web Sites: Data, reports, statistical yearbooks, press releases, and more from about 70 websites, covering countries in Africa, Europe, Asia, and Latin America.
- National Space Science Data Center: NASA datasets from planetary exploration and more.
- SourceForge.net Research Data: Statistics on approximately 140,000 projects and over 1.5 million registered users.
- Football (Soccer) Data: Information on soccer players, games, officials, and more.
- Sports Statistics: Data for soccer, the NBA, the NFL, the NHL, and more.
- DBpedia: Millions of pieces of data, structured and unstructured, on every subject under the sun.
- xView: A massive publicly available dataset of overhead imagery. It contains images from complex scenes around the world, annotated with bounding boxes.
- ImageNet: The largest image dataset for computer vision. It provides an accessible image database that is organized hierarchically, according to WordNet.
- Kinetics-700: A large-scale dataset of over 650,000 video URLs from YouTube.
- Google’s Open Images: A vast dataset from Google A.I. containing over 10 million images.
- Appen Open Source Datasets: Created and curated for teams working on world-class A.I. applications.
- AssetMacro: Historical data of Macroeconomic Indicators and Market Data.
- BigML: Datasets from industries including aerospace, automotive, energy, entertainment, financial services, food, healthcare, IoT, pharmaceutical, transportation, telecommunications, and more.
- Data.world: Datasets to get clear, accurate, fast answers to any business question.
- Credit Risk Analytics Data: Datasets covering home equity loan credit, loan-level mortgage data, Loss Given Default (LGD), and corporate ratings.
- DataPlanet: Standardized and structured statistical data.
- EconData: Economic time series data produced by many U.S. Government agencies.
- Europeana Data: Open metadata on 20 million texts, images, videos, and sounds.
- FIMI repository: Datasets from various data mining implementations.
- GDELT: Global data on events, location, and tone.
- Generated Photos: Free dataset with AI-generated photos.
- GEO (Gene Expression Omnibus): A curated, online resource for gene expression data browsing, query, and retrieval.
- HitCompanies Datasets: Comprehensive data on 10,000 randomly sampled U.K. companies, updated automatically using A.I./Machine Learning.
- ICWSM-2009 dataset: 44 million blog posts made between August and October 2008.
- JMP Public featured datasets: Datasets ranging from COVID-19 in Italy to Thanksgiving dinner sides.
- Linking Open Data: W3C SWEO Community Project.
- Web Data Commons: Structured data from the Common Crawl, the most extensive public web corpus.
- Webhose free datasets: Data from a range of different sources, languages, and categories.
- U.N. Office on Drugs and Crime: Data on transnational organized crime.
- National Institute on Drug Abuse: Data on the prevalence of and trends in drug abuse in the United States.
- Reeep Data: Free-to-use clean energy datasets.
- USDA – Food Composition: The United States Department of Agriculture provides data about the composition and nutrient values of different foods.