How to Access Google Big Data Via Hexomatic

This provides access to research data on an unprecedented scale accessible in a few clicks. No coding required. 

You can find the datasets here:

COVID-19 Public Datasets

COVID-19 Public Datasets help to make data more accessible to researchers, data scientists and analysts. The program hosts a repository of public datasets that relate to the COVID-19 crisis and make them free to access and analyze. Datasets from the New York Times, European Centre for Disease Prevention and Control, Google, Global Health Data from the World Bank, and OpenStreetMap are included in the program.

Austin Crime Data

Austin Crime Data includes Part 1 crimes (as defined by Uniform Crime Reporting Statistics) for 2014 and 2015. Data is provided by the Austin Police Department and may differ from official APD crime data due to the variety of reporting and collection methods used.

Bitcoin Cash Cryptocurrency Dataset

Bitcoin Cash is a cryptocurrency that allows more bytes to be included in each block relative to it’s common ancestor Bitcoin. This dataset contains the blockchain data in their entirety, pre-processed to be human-friendly and to support common use cases such as auditing, investigating, and researching the economic and financial properties of the system. The program is hosting several cryptocurrency datasets, with plans to both expand offerings to include additional cryptocurrencies and reduce the latency of updates.

Bitcoin Cryptocurrency 

Bitcoin is a crypto currency leveraging blockchain technology  to store transactions in a distributed ledger.
This dataset is part of a larger effort to make cryptocurrency data available in BigQuery through the Google Cloud Public Datasets program. The program is hosting several cryptocurrency datasets, with plans to both expand offerings to include additional cryptocurrencies and reduce the latency of updates.

Bitcoin Blockchain Transaction Data

Bitcoin is a cryptocurrency/blockchain-based asset leveraging blockchain technology to store transactions in a distributed ledger. Bitcoin uses peer-to-peer technology to operate with no central authority or banks; managing transactions and the issuing of bitcoins is carried out collectively by the network. It allows exciting uses that could not be covered by any previous payment system.

BREATHE BioMedical Literature Dataset

BREATHE is a large-scale biomedical database containing entries from 10 major repositories of biomedical research. The dataset contains both abstract and full body texts of biomedical papers going back for decades and contains more than 16 million unique papers. This dataset can be used to train language models to better understand outcomes from biomedical research and uncover insights to combat the COVID-19 pandemic.

Catalonia Mobile Coverage Data

The GenCat Mobile Coverage app is an initiative of the Government of Catalonia to crowdsource data collection on the state of mobile telephone network coverage in Catalonia. The platform uses an Android app to record citizens’ data through their mobile devices on the level of coverage per operator, network (2G, 3G and 4G) and the device’s location. This dataset contains the platform data over the 2015-2017 period.

This data might be used to analyze the quality of mobile coverage in Catalonia of the four main operators (Movistar, Vodafone, Orange and Yoigo) and filter data according to the technology used (2G, 3G or 4G).

GEO US Boundary Dataset

These are full-resolution boundary files, derived from TIGER/Line Shapefiles, the fully-supported, core geographic products from the US Census Bureau. They are extracts of selected geographic and cartographic information from the US Census Bureau’s Master Address File/Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) database. These include information for the 50 states, the District of Columbia, Puerto Rico, and the Island areas (American Samoa, the Commonwealth of the Northern Mariana Islands, Guam, and the United States Virgin Islands). These include polygon boundaries of geographic and statistical areas, linear features including roads and hydrography, and point features.

Center for Medicare and Medicaid Services – Dual Enrollment

This public dataset was created by the Centers for Medicare & Medicaid Services. The data summarize counts of enrollees who are dually-eligible for both Medicare and Medicaid programs, including those in Medicare Savings Programs. 

“Duals” represent 20 percent of all Medicare beneficiaries, yet they account for 34 percent of all spending by the program, according to the Commonwealth Fund. As a representation of this high-needs, high-cost population, these data offer a view of regions ripe for more intensive care coordination that can address complex social and clinical needs.

CFPB Consumer Complaint Database

The Consumer Complaint Database is a collection of complaints about consumer financial products and services that are sent to companies for response. Complaints are published after the company responds, confirming a commercial relationship with the consumer, or after 15 days, whichever comes first. Complaints referred to other regulators, such as complaints about depository institutions with less than $10 billion in assets, are not published in the Consumer Complaint Database.

Chicago Crime Dataset

This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department’s CLEAR (Citizen Law Enforcement Analysis and Reporting) system. In order to protect the privacy of crime victims, addresses are shown at the block level only and specific locations are not identified.

Chicago Taxi Trips Dataset

This dataset includes taxi trips from 2013 to the present, reported to the City of Chicago in its role as a regulatory agency. To protect privacy but allow for aggregate analyses, the Taxi ID is consistent for any given taxi medallion number but does not show the number, Census Tracts are suppressed in some cases, and times are rounded to the nearest 15 minutes. Due to the data reporting process, not all trips are reported but the City believes that most are. 

COVID-19 Cases by Country

This dataset is maintained by the European Centre for Disease Prevention and Control (ECDC)  and reports on the geographic distribution of COVID-19 cases worldwide. It includes COVID-19 reported cases and deaths broken out by country.

COVID-19 Cases in Italy

This is the Italian Coronavirus data repository from the Dipartimento della Protezione Civile. This dataset was created in response to the Coronavirus public health emergency in Italy and includes COVID-19 cases reported by region. More information on the data repository is available here.  For additional information on Italy’s situation tracking and reporting, see the department’s Coronavirus site  and interactive dashboard.

Weather Source Data for COVID-19

Weather Source, a leading provider of weather and climate technologies for business intelligence, is offering complimentary data for those researching the potential connections between weather and COVID-19 viability and transmission.


COVID-19 Data Repository by CSSE at JHU

This is the data repository for the 2019 Novel Coronavirus Visual Dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). The data include the location and number of confirmed COVID-19 cases, deaths, and recoveries for all affected countries, aggregated at the appropriate province/state. It was developed to enable researchers, public health authorities and the general public to track the outbreak.

COVID-19 Mobility Impact

The global economy is seeing significant differences in commercial vehicle activity due to the COVID-19 pandemic. The COVID-19 Mobility Impact Dataset offers insight into changes in commercial vehicle mobility and plotting its course toward recovery. Discover trends that illustrate recovery to pre-pandemic norms by industry and region.

COVID-19 Open Data

This repository contains the largest COVID-19 epidemiological database available in addition to a powerful set of expansive covariates. It includes open sourced data with a permissive license (enabling commercial use) relating to vaccinations, epidemiology, hospitalizations, demographics, economy, geography, health, mobility, government response, weather, and more. Moreover, the data merges daily time-series from hundreds of data sources at a fine spatial resolution, containing over 20,000 locations and using a consistent set of region keys.

COVID-19 Public Forecasts

Developed on Google Cloud’s robust infrastructure with guidance from the Harvard Global Health Institute, the COVID-19 Public Forecasts offer a prediction of COVID-19’s impact over the next 28 days. The forecasts are generated from a novel time series machine learning approach that combines AI with a robust epidemiological foundation and are trained on public data. The forecasts are maintained by Google Cloud to ensure they remain up-to-date in the changing landscape.

Covid-19 RxRx19

RxRx19a  and RxRx19b  are the first public datasets that demonstrate the rescue of morphological effects of the SARS-CoV-2 virus and COVID-19-associated cytokine storm. Through these datasets, researchers in the scientific community will have access to both the images and the corresponding deep learning embeddings to analyze or apply to their own experimentation. The embeddings are vectors that correspond to each image and come from Recursion’s internal model trained on additional cell types and perturbation modalities. Results and conclusions drawn from the in vitro experiments and targeted hypothesis-driven research will contribute to the growing body of scientific data in the fight against COVID-19.

COVID-19 Search Trends Symptoms Dataset

COVID-19 Search Trends symptoms dataset shows aggregated, anonymized trends in Google searches for a broad set of health symptoms, signs, and conditions. The dataset provides a daily or weekly time series for each region showing the relative volume of searches for each symptom. This dataset is intended to help researchers to better understand the impact of COVID-19. It shouldn’t be used for medical diagnostic, prognostic, or treatment purposes. It also isn’t intended to be used for guidance on personal travel plans.

COVID-19 Vaccination Access

The COVID-19 Vaccination Access Dataset characterizes access to COVID-19 vaccination sites based on travel times. This dataset is intended to help public health officials, researchers, and healthcare providers to identify areas with insufficient access, deploy interventions, and research these issues—you shouldn’t use this dataset for other purposes.

COVID-19 Vaccination Search Insights

The COVID-19 Vaccination Search Insights data shows aggregated, anonymized trends in searches related to COVID-19 vaccination. The dataset provides a weekly time series for each region showing the relative interest of Google searches related to COVID-19 vaccination, across several categories. The data is intended to help public health officials design, target, and evaluate public education campaigns.

USAFacts COVID-19 Database

This data from USAFacts provides US COVID-19 case and death counts by state and county. This data is sourced from the CDC, and state and local health agencies. For more information, see the USAFacts site on the Coronavirus. Interactive data visualizations are also available via USAFacts.

Oxford COVID-19 Government Response Tracker

This is the Coronavirus Government Response Tracker from the University of Oxford Blavatnik School of Government. This dataset was created in response to the COVID-19 outbreak to track and compare policy responses from governments around the world.

COVID-19 Google Community Mobility Reports

This dataset aims to provide insights into what has changed in response to policies aimed at combating COVID-19. It reports movement trends over time by geography, across different categories of places such as retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential.

Cymbal Investments

The Cymbal brand was created to make storytelling consistent across Google Cloud. Datasets are synthetic, and provided to industry practitioners for the purpose of product discovery, testing, and evaluation.

Dash Cryptocurrency

Dash is a cryptocurrency governed by a decentralized autonomous organization (DAO) run by a subset of users, called “masternodes”. The currency permits fast transactions that are difficult to trace. This dataset contains the blockchain data in their entirety, pre-processed to be human-friendly and to support common use cases such as auditing, investigating, and researching the economic and financial properties of the system.

NHTSA Data 360 Traffic and Safety

High-demand automotive curated datasets, making it easy to access and discover deep insights into vehicle safety, driver behavior and competitors. Datasets contain historical data sourced from authentic and trusted sources like The National Highway Traffic Safety Administration (NHTSA), the National Center for Statistics and Analysis (NCSA), and the Bureau of Economic Analysis (BEA).

Dataflix COVID Dataset

The Dataflix COVID dataset is a centralized repository of up-to-date and curated data focused on key tracking metrics and U.S. census data. The dataset is publicly-readable & accessible on Google BigQuery – ready for analysis, analytics and machine learning initiatives. The dataset is built on data sourced from trusted sources like CSSE at Johns Hopkins University and government agencies, covering a wide range of metrics including confirmed cases, new cases, % population, mortality rate and deaths, aggregated at various geographic levels including city, county, state and country. New data is published on a daily basis.

Dogecoin Cryptocurrency Dataset

Dogecoin is an open source peer-to-peer digital currency, favored by Shiba Inus worldwide. It is qualitatively more fun while being technically nearly identical to its close relative Bitcoin. This dataset contains the blockchain data in their entirety, pre-processed to be human-friendly and to support common use cases such as auditing, investigating, and researching the economic and financial properties of the system.

Eclipse Megamovie

This is the full set of images submitted for the Eclipse Megamovie project, a citizen science project to capture images of the Sun’s corona during the August 21, 2017 total solar eclipse. These images were taken by volunteer photographers (as well as the general public) from across the country using consumer camera equipment. The Eclipse Megamovie project was a collaboration between UC Berkeley, Google, the Astronomical Society of the Pacific, and many more.

Ethereum Classic Cryptocurrency Dataset

Ethereum Classic is a cryptocurrency with shared history with the Ethereum cryptocurrency. On technical merits, the two cryptocurrencies are nearly identical, differing only in programming language features supported by the Ethereum Virtual machine which is used to write smart contracts. This dataset contains the blockchain data in their entirety, pre-processed to be human-friendly and to support common use cases such as auditing, investigating, and researching the economic and financial properties of the system.

Ethereum Cryptocurrency

Ethereum is a crypto currency which leverages blockchain technology to store transactions in a distributed ledger.

Ethereum On-Chain Transaction Data

Ethereum is a cryptocurrency/blockchain-based asset leveraging blockchain technology to store transactions in a distributed ledger.

FCC Political Ads

The FCC political ads public inspection files dataset contains political ad file information that broadcast stations have uploaded to their public inspection files, which are housed on the FCC website. This data includes all political ad files that have been provided by TV and radio broadcast stations, which dates back to 2012 when the FCC started requiring digital uploads of files to its website. Broadcasters are required to maintain this data in their public inspection files for two years, after which the stations are permitted to remove them from the FCC website.

FDIC-Insured Banks and Branches

The FDIC’s Institution Directory provides a list of all FDIC-insured institutions. The file includes demographic information related to the institution such as locational detail (name, city state, etc) and operating status (active, inactive, bank class, etc). The download file also contains key financial information reported by the FDIC-insured institution, such as total deposits, quarterly net income, and more. Additionally, the dataset includes data going back to 1996.

FEC Campaign Finance

The FEC campaign finance dataset contains committee, candidate and campaign finance data for the current election cycle and for election cycles as far back as 1980. The data for the current election cycle plus the two most recent election cycles are regularly updated. The Committees, Candidates and Linkages tables are updated daily. The Itemized Records, Contributions to Candidates, Individual Contributions and Operating Expenditures tables are updated weekly on Sunday. When these datasets are updated, they include data entered into the FEC database through that date. Please note for financial data complete entry from each reporting period takes about 30 days. Because of this, some information may not have completed the entry process when these files were created.

Forest Inventory Analysis

The Forest Inventory and Analysis dataset is a nationwide survey of the forest assets of the United States. The Forest Inventory and Analysis (FIA) research program has been in existence since mandated by Congress in 1928. FIA’s primary objective is to determine the extent, condition, volume, growth, and use of trees on the Nation’s forest land.

GHCN Daily

The GHCN-Daily is an integrated database of daily climate summaries from land surface stations across the globe, and consists of daily climate records from over 100,000 stations in 180 countries and territories, and includes some data from every year since 1763.

GHCN Monthly

The GHCN-Monthly is a temperature dataset that contains monthly mean temperatures and is used for operational climate monitoring activities. It is comprised of climate records from over 7,000 stations.

GitHub Activity Data

GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008. This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

NOAA’s Global Forecast System

NOAA’s Global Forecast System (GFS) is a weather forecast model produced by the National Centers for Environmental Prediction (NCEP). Dozens of atmospheric and land-soil variables are available through this dataset, from temperatures, winds, and precipitation to soil moisture and atmospheric ozone concentration.

Global Historical Tsunami Database

The Global Historical Tsunami Database provides information on over 2,400 tsunamis from 2000 BC to the present around the globe. The dataset includes two related tables.

Global Hurricane Tracks (IBTrACS)

IBTrACS collects data about tropical cyclones (TC) reported by international monitoring centers who have a responsibility to forecast and report on TCs (and also includes some important historical datasets). Presently, IBTrACS includes data from 9 different countries. Historically, the data describing these systems has included best estimates of their track and intensity (hence the term, best track).

Global Unique Device Identification Database (GUDID)

The Global Unique Device Identification Database (GUDID) contains key device identification information submitted to the FDA about medical devices that have Unique Device Identifiers (UDI). The National Library of Medicine (NLM), in collaboration with the FDA, has created the AccessGUDID portal (the source of this dataset) to make device identification information in the GUDID available for everyone, including patients, caregivers, health care providers, hospitals, and industry.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD)  is maintained by an international coalition of investigators to aggregate and harmonize data from large-scale sequencing projects.

GOES-16

The Geostationary Operational Environmental Satellite-R Series (GOES-R) is the next generation of geostationary weather satellites. The GOES-R series will significantly improve the detection and observation of environmental phenomena that directly affect public safety, protection of property and our nation’s economic health and prosperity. The GOES-16 satellite, known as GOES-R prior to launch, is the first satellite in the series. It will provide images of weather patterns and severe storms as frequently as every 30 seconds, which will contribute to more accurate and reliable weather forecasts and severe weather outlooks.

GOES-17

GOES-17 (Geostationary Operational Environmental Satellite) is the second in the GOES-R series that promises significant upgrades in observing environmental phenomena. It provides images of weather patterns and severe storms as frequently as every 30 seconds, which supports more accurate and reliable weather forecasts and severe weather outlooks.

Google Analytics Sample

The dataset provides 12 months (August 2016 to August 2017) of obfuscated Google Analytics 360 data from the Google Merchandise Store , a real ecommerce store that sells Google-branded merchandise, in BigQuery. It’s a great way to analyze business data and learn the benefits of using BigQuery to analyze Analytics 360 data. 

Google Cloud Release Notes

This table contains release notes for the majority of generally available Google Cloud products found on cloud.google.com. You can use this BigQuery public dataset to consume release notes programmatically across all products. 

The Google Trends dataset will provide critical signals that individual users and businesses alike can leverage to make better data-driven decisions. This dataset simplifies the manual interaction with the existing Google Trends UI by automating and exposing anonymized, aggregated, and indexed search data in BigQuery.

Google Books Ngram Dataset

This Google Books Ngram Dataset contains frequencies of any set of search strings using a yearly count of n-grams found in sources printed between 1500 and 2012 in Google’s text corpora. Languages include English, Chinese (simplified), French, German, Hebrew, Italian, Russian, and Spanish.

NOAA GSOD

This public dataset was created by the National Oceanic and Atmospheric Administration (NOAA) and includes global data obtained from the USAF Climatology Center. This dataset covers GSOD data between 1929 and the present (updated daily), collected from over 9000 stations.

Hacker News

This dataset contains all stories and comments from Hacker News from its launch in 2006 to the present. Each story contains a story ID, the author that made the post, when it was written, and the number of points the story received.

Healthcare Common Procedure Coding System (CMS Medicare)

Classification of procedures performed for patients is important for billing and reimbursement in healthcare. The primary classification system used in the United States is Healthcare Common Procedure Coding System (HCPCS), maintained by Centers for Medicare and Medicaid Services (CMS). This system is divided into two levels: level I and level II. Level I HCPCS codes classify services rendered by physicians. This system is based on Common Procedure Terminology (CPT), a coding system maintained by the American Medical Association (AMA). Level II codes, which are the focus of this public dataset, are used to identify products, supplies, and services not included in level I codes. The level II codes include items such as ambulance services, durable medical goods, prosthetics, orthotics, and supplies used outside a physician’s office.

Health Professional Shortage Areas

Health Professional Shortage Areas (HPSAs) are federal designations that indicate health care provider shortages. HRSA’s Bureau of Health Workforce (BHW) develops shortage designation criteria and uses them to decide whether or not a geographic area or population group is a Health Professional Shortage Area (HPSA), Medically Underserved Area (MUA), or Medically Underserved Population (MUP). These designated areas indicate where there is a shortage of healthcare professionals, indicating potential limited access to medical services for geography or a particular population. These data include the degree of shortage (e.g. ratio of providers), the geography of the shortage area, and FTE shortage of practitioners.

EPA Historical Air Quality

The United States Environmental Protection Agency (EPA) protects both public health and the environment by establishing the standards for national air quality. The EPA provides annual summary data as well as hourly and daily data in the categories of criteria gases, particulates, meteorological, and toxics.

Human Variant Annotation Datasets

These datasets are important to genomics researchers because they characterize several aspects of what the scientific community has learned to date about human sequence variants. Making this human annotation data freely available in GCP will enable researchers to focus less on data movement and management tasks associated with procuring this data and instead make immediate use of the data to better understand the clinical relevance of particular variants such as disease-causing or protective variants (ClinVar), search a catalog of SNPs that have been identified in the human genome (dbSNP), and discover how frequently a particular variant occurs across the human population (1000Genomes, ESP, ExAC, gnomAD). 

Census Bureau International Data

The United States Census Bureau’s international dataset provides estimates of country populations since 1950 and projections through 2050. Specifically, the dataset includes midyear population figures broken down by age and gender assignment at birth. Additionally, time-series data is provided for attributes including fertility rates, birth rates, death rates, and migration rates.

World bank International Debt

This dataset contains both national and regional debt statistics captured by over 200 economic indicators. Time series data is available for those indicators from 1970 to 2015 for reporting countries.

World bank International Education

This dataset combines key education statistics from a variety of sources to provide a look at global literacy, spending, and access.

Iowa Liquor Retail Sales

This dataset contains every wholesale purchase of liquor in the State of Iowa by retailers for sale to individuals since January 1, 2012. The State of Iowa controls the wholesale distribution of liquor intended for retail sale, which means this dataset offers a complete view of retail liquor sales in the entire state. The dataset contains every wholesale order of liquor by all grocery stores, liquor stores, convenience stores, etc., with details about the store and location, the exact liquor brand and size, and the number of bottles ordered.

IRS 990

Form 990 is used by the United States Internal Revenue Service to gather financial information about nonprofit/exempt organizations. This BigQuery dataset can be used to perform research and analysis of organizations that have electronically filed Forms 990, 990-EZ and 990-PF.

Libraries.io Data

Libraries.io gathers data on open source software from 33 package managers and 3 source code repositories. We track over 2.4m unique open-source projects, 25m repositories and 121m interdependencies between them. This gives Libraries.io a unique understanding of open-source software.

London Bicycle Hires

This data contains the number of hires of London’s Santander Cycle Hire Scheme from 2011 to the present. Data includes start and stop timestamps, station names and ride duration.

London Crime

This data counts the number of crimes at two different geographic levels of London (LSOA and borough) by year, according to crime type. Includes data from 2008 to present.

SDOH HUD Low Income Housing Tax Credit Program

The Low-Income Housing Tax Credit (LIHTC) program gives State and local agencies the equivalent of nearly $8 billion in annual budget authority to issue tax credits for the acquisition, rehabilitation, or new construction of rental housing targeted to lower-income households.

Medicare

This public dataset was created by the Centers for Medicare & Medicaid Services. The data summarizes the utilization and payments for procedures, services, and prescription drugs provided to Medicare beneficiaries by specific inpatient and outpatient hospitals, physicians, and other suppliers. The dataset includes the following data – common inpatient and outpatient services, all physician and other supplier procedures and services, and all Part D prescriptions.

Baseball Dataset

This public data includes pitch-by-pitch data for Major League Baseball (MLB) games in 2016. This dataset contains the following tables: games_wide (every pitch, steal, or lineup event for each at bat in the 2016 regular season), games_post_wide(every pitch, steal, or lineup event for each at-bat in the 2016 postseason), and schedules (the schedule for every team in the regular season).

NCAA Basketball

This dataset contains data about NCAA Basketball games, teams, and players. Game data covers play-by-play and box scores back to 2009, as well as final scores back to 1996. Additional data about wins and losses go back to the 1894-1895 season for some teams. All data runs through the end of the 2017-2018 season.

NHTSA Traffic Fatalities

This public dataset was created by the United States Department of Transportation’s National Highway Traffic Safety Administration (NHTSA) and includes 20 tables that describe numerous aspects of traffic accidents that resulted in fatalities. Aspects of traffic accidents include the types of cars and roads, the maneuvers that preceded the accident, and the involvement of pedestrians and cyclists.

NOAA Global Forecast System Dataset

The Global Forecast System (GFS) is a weather forecast model produced by the National Centers for Environmental Prediction (NCEP). The GFS dataset consists of selected model outputs (described below) as gridded forecast variables. The 384-hour forecasts, with a 3-hour forecast interval, are made at 6-hour temporal resolution (i.e. updated four times daily).

National Plan and Provider Enumeration System

The CMS National Plan and Provider Enumeration System (NPPES)  was developed as part of the Administrative Simplification provisions in the original HIPAA act. The primary purpose of NPPES was to develop a unique identifier for each physician that billed medicare and medicaid. This identifier is now known as the National Provider Identifier Standard (NPI)  which is a required 10 digit number that is unique to an individual provider at the national level.

NYC 311

This data includes all New York City 311 service requests from 2010 to the present and is updated daily. 311 is a non-emergency number that provides access to non-emergency municipal services.

NYC Citi Bike Trips

Citi Bike is the nation’s largest bike share program, with 10,000 bikes and 600 stations across Manhattan, Brooklyn, Queens, and Jersey City. This dataset includes Citi Bike trips since Citi Bike launched in September 2013 and is updated daily. The data has been processed by Citi Bike to remove trips that are taken by staff to service and inspect the system, as well as any trips below 60 seconds in length, which are considered false starts.

NYC Street Trees

The NYC street tree data includes data from the 1995, 2005 and 2015 Street Tree Censuses, which are conducted by volunteers organized by the NYC Department of Parks and Recreation. Trees were inventoried by address and identified by tree species, diameter, and condition.

NYC TLC Trips

This dataset is collected by the NYC Taxi and Limousine Commission (TLC) and includes trip records from all trips completed in Yellow and Green taxis in NYC from 2009 to the present, and all trips in for-hire vehicles (FHV) from 2015 to present. Records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.

GEO OpenStreetMap Public Dataset

Adapted from Wikipedia: OpenStreetMap (OSM) is a collaborative project to create a free editable map of the world. Created in 2004, it was inspired by the success of Wikipedia and more than two million registered users who can add data by manual survey, GPS devices, aerial photography, and other free sources.

Passive Acoustic Index

NOAA National Centers for Environmental Information has established a passive acoustic data archive and serves to steward and centralize access to numerous marine monitoring projects.

Patent PDF Samples with Extracted Structured Data

The dataset consists of PDFs in Google Cloud Storage from the first page of select US and EU patents, and BigQuery tables with extracted entities, labels, and other properties, including a link to each file in GCS. The structured data contains labels for eleven patent entities (patent inventor, publication date, classification number, patent title, etc.), global properties (US/EU issued, language, invention type), and the location of any figures or schematics on the patent’s first page.

SDOH HUD Homelessness Count

This database contains the data reported in the Annual Homeless Assessment Report to Congress (AHAR). It represents a point-In-time count (PIT) of homeless individuals, as well as a housing inventory count (HIC) conducted annually.

Political Advertising on Google

This dataset contains information on how much money is spent by verified advertisers on political advertising across Google Ad Services. In addition, insights on demographic targeting used in political ad campaigns by these advertisers are also provided.

NOAA’s Preliminary Storm Reports

NOAA’s Storm Prediction Center (SPC) maintains a database of daily US storm data as reported by local National Weather Service offices from trained weather spotters. The types of storm data recorded by SPC include reports of Tornados, Wind, and Hail.

Project Sunroof

As the price of installing solar has gotten less expensive, more homeowners are turning to it as a possible option for decreasing their energy bill. Project Sunroof puts Google’s expansive data in mapping and computing resources to use, helping calculate the best solar plan for you.

Python Package Index (PyPI)

This dataset provides download statistics for all package downloads from the Python Package Index (PyPI). It also includes a dataset containing all the metadata for every distribution released on PyPI.

Google Carbon-free Energy Data

To power each Google Cloud region and/or data center, we use electricity from the grid where the region is located. This electricity generates more or less carbon emissions (gCO2eq), depending on the type of power plants generating electricity for that grid and when we consume it. In 2020, we set a goal to match our energy consumption with carbon-free energy (CFE), every hour and in every region by 2030. As we work towards our 2030 goal, we want to provide transparency on our progress. To characterize each region we use a metric: “CFE%”. This metric is calculated for every hour in every region and tells us what percentage of the energy we consumed during an hour that is carbon-free. We take into account the carbon-free energy that’s already supplied by the grid, in addition to the investments we have made in renewable energy in that location to reach our 24/7 carbon-free objective. We then aggregate the available average hourly CFE percentage for each region for the year.

RxNorm: Cinical Drugs Normalized Naming System

RxNorm was created by the U.S. National Library of Medicine (NLM) to provide a normalized naming system for clinical drugs, defined as the combination of {ingredient + strength + dose form}. In addition to the naming system, the RxNorm dataset also provides structured information such as brand names, ingredients, drug classes, and so on, for each clinical drug. Typical uses of RxNorm include navigating between names and codes among different drug vocabularies and using the information in RxNorm to assist with health information exchange/medication reconciliation, e-prescribing, drug analytics, formulary development, and other functions.

SafeGraph Neighborhood Patterns Sample Dataset (1H of 2020)

Discover SafeGraph’s  neighborhood patterns dataset covering ~5% of census block groups (CBGs) in the US. Census Block Statistical Areas included are (i) 12060 Atlanta-Sandy Springs-Roswell, GA, (ii) 12420 Austin-Round Rock, TX, (iii) 40380 Rochester, NY, (iv) 41180 St. Louis, MO-IL, (v) 41740 San Diego-Carlsbad, CA, and (vi) 42660 Seattle-Tacoma-Bellevue, WA.

San Francisco Ford GoBike Share

San Francisco Ford GoBike , managed by Motivate, provides the Bay Area’s bike share system. Bike share is a convenient, healthy, affordable, and fun form of transportation. It involves a fleet of specially designed bikes that are locked into a network of docking stations. Bikes can be unlocked from one station and returned to any other station in the system. People use bike share to commute to work or school, run errands, get to appointments, and more. The dataset contains trip data from 2013-2018, including start time, end time, start station, end station, and latitude/longitude for each station.

SEC Public Dataset

In the U.S. public companies, certain insiders and broker-dealers are required to regularly file with the SEC. The SEC makes this data available online for anybody to view and use via their Electronic Data Gathering, Analysis, and Retrieval (EDGAR) database. The SEC updates this data every quarter going back to January, 2009. To aid analysis a quick summary view of the data has been created that is not available in the original dataset. The quick summary view pulls together signals into a single table that otherwise would have to be joined from multiple tables and enables a more streamlined user experience.

NOAA Historic Severe Storm Data

The Storm Events Database is an integrated database of severe weather events across the United States from 1950 to this year, with information about a storm event’s location, azimuth, distance, impact, and severity, including the cost of damages to property and crops. Data about a specific event is added to the dataset within 120 days to allow time for damage assessments and other analyses.

San Francisco 311 service requests Dataset

This data includes all San Francisco 311 service requests from July 2008 to the present and is updated daily. 311 is a non-emergency number that provides access to non-emergency municipal services.

SF Film Locations Dataset

This data includes a list of the filming locations for movies shot in San Francisco as far back as 1924. Maintained by the San Francisco Film Commission, includes titles, locations, fun facts, names of the director, writer, actors, and studio for most films.

San Francisco Police Department Reports Database

This data includes incidents from the San Francisco Police Department (SFPD) Crime Incident Reporting system, from January 2003 until the present (2 weeks ago from the current date). The dataset is updated daily. Please note: the SFPD has implemented a new system for tracking crime. This dataset is still sourced from the old system, which is in the process of being retired (a multi-year process).

NOAA Significant Earthquakes Database

The Significant Earthquake Database is a global listing of over 5,700 earthquakes from 2150 BC to the present. In order to be classified as a significant earthquake, the event must meet at least one of the following criteria: Moderate damage (approximately $1 million or more), 10 or more deaths, Magnitude 7.5 or greater, Modified Mercalli Intensity X or greater, or the earthquake generated a tsunami. The database provides information on the date and time of occurrence, latitude, and longitude, focal depth, magnitude, maximum MMI intensity, and socio-economic data such as the total number of casualties, injuries, houses destroyed, and houses damaged, and $ dollar damage estimates.

SDOH SNAP Enrollment

This public dataset published by USDA summarizes the total number of enrollees in the Supplemental Nutrition Assistance Program (SNAP) by region. SNAP provides nutrition benefits to supplement the food budget of families and persons meeting eligibility criteria related to monthly income.

Stack Overflow Data

Stack Overflow is the largest online community for programmers to learn, share their knowledge, and advance their careers. Updated on a quarterly basis, this BigQuery dataset includes an archive of Stack Overflow content, including posts, votes, tags, and badges. 

San Francisco Street Trees 

This data includes a list of San Francisco Department of Public Works maintained street trees including planting date, species, and location. Data includes 1955 to present.

Sustainable Development Goals (SDG) Indicators

SDG indicators are a global indicator framework for the Sustainable Development Goals (SDG) and succeed the Millenium Development Goals (MDGs) which ended in 2015. They include indicators and statistical data to monitor progress, inform policy and ensure accountability of all stakeholders for the 17 goals set by the United Nations. Data available is from the early 1990s to the present and covers economic and social development issues including poverty, hunger, health, education, social justice, water, sanitation, and many more.

Synthea Generated Synthetic Data in FHIR

The Synthea Generated Synthetic Data in FHIR hosts over 1 million synthetic patient records generated using Synthea in FHIR  format. Exported from the Google Cloud Healthcare API  FHIR Store into BigQuery using analytics schema. 

The Immune Epitope Database (IEDB)

The Immune Epitope Database (IEDB) is a freely available resource funded by NIAID. It catalogs experimental data on antibody and T cell epitopes studied in humans, non-human primates, and other animal species in the context of infectious disease, allergy, autoimmunity, and transplantation.

The Met Public Domain Art Works Data

The Metropolitan Museum of Art, better known as the Met, provides a public domain dataset with over 200,000 objects including metadata and images. In early 2017, the Met debuted their Open Access policy to make part of their collection freely available for unrestricted use under the Creative Commons Zero designation and their own terms and conditions. 

The New York Times US Covid-19 Database

This is the US Coronavirus data repository from The New York Times. This data includes COVID-19 cases and deaths reported by state and county. The New York Times compiled this data based on reports from state and local health agencies.

The World Bank World Development Indicators Data

World Development Indicators Data is the primary World Bank collection of development indicators, compiled from officially-recognized international sources. It presents the most current and accurate global development data available and includes national, regional, and global estimates.

US Census Data

The United States census count (also known as the Decennial Census of Population and Housing) is a count of every resident of the US. The census occurs every 10 years and is conducted by the United States Census Bureau. Census data is publicly available through the census website, but much of the data is available in summarized data and 

US Inflation and Unemployment (BLS)

This dataset includes economic statistics on inflation, prices, unemployment, and pay & benefits provided by the Bureau of Labor Statistics (BLS).

GEO US Roads

The TIGER/Line Shapefiles are the fully-supported, core geographic products from the US Census Bureau. They are extracts of selected geographic and cartographic information from the US Census Bureau’s Master Address File/Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) database. The all roads dataset contains all linear street features with “S” (Street) type MTFCCs in the MAF/TIGER database. These include primary roads, secondary roads, local neighborhood roads, rural roads, city streets, vehicular trails (4WD), ramps, service drives, walkways, stairways, alleys, and private roads.

USA Names Dataset

This public dataset was created by the Social Security Administration and contains all names from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in this data. For others who did apply, records may not show the place of birth, and again their names are not included in the data.

USPTO OCE Office Actions Data

The Office Action Research Dataset for Patents contains detailed information derived from the Office actions issued by patent examiners to applicants during the patent examination process. The “Office action” is a written notification to the applicant of the examiner’s decision on patentability and generally discloses the grounds for rejection, the claims affected, and the pertinent prior art. This initial release consists of three files derived from 4.4 million Office actions mailed during 2008 to the mid-2017 period from USPTO examiners to the applicants of 2.2 million unique patent applications.

Who’s on First Gazetteer

Who’s On First is a gazetteer of all the places in the world, from continents to neighborhoods and venues. Who’s On First is not a carefully selected list of important or relevant places. It is not meant to act as the threshold by which places are measured. Who’s On First, instead, is meant to provide the raw material with which a multiplicity of thresholds might be created.

Zcash Cryptocurency

Zcash is a cryptocurrency and distributed ledger system (blockchain) that provides enhanced privacy by including additional cryptographic features relative to Bitcoin. This dataset contains the blockchain data in their entirety, pre-processed to be human-friendly and to support common use cases such as auditing, investigating, and researching the economic and financial properties of the system. 


Automate & scale time-consuming tasks like never before

Hexomatic. The no-code, point and click work automation platform.

Harness the internet as your own data source, build your own scraping bots and leverage ready made automations to delegate time consuming tasks and scale your business.

No coding or PhD in programming required.