Data is Plural archive

Jul 2022

Shark bites.

Madeline Riley et al. describe the Australian Shark-Incident Database, which contains details about 1,100+ shark bites (and attempted shark bites) between 1791 and early 2022, gathered by the Taronga Conservation Society using “questionnaires provided to shark-bite victims or witnesses, media reports,” and information from state agencies. Read more: “New dataset shows shark bites in Australia are increasing and researchers want to know why” (The Guardian).

Jul 2022

Startup factories.

Venture studios are firms that build and launch startups. Jim Moran’s Venture Studio Index tracks 260+ of them, plus 1,200+ of the startups they’ve launched. The dataset, “collected manually by a team of researchers familiar with venture capital and the technology startup ecosystem,” includes founding years, locations, employee counts, relevant URLs, and more.

Jul 2022

Monkeypox strains.

Nextstrain, “an open-source project to harness the scientific and public health potential of pathogen genome data,” has begun analyzing genetic sequences from hundreds of monkeypox virus samples, the vast majority from infections in the past few months. The project provides metadata on each sample, including the date, country, variant, and mutation metrics, as well as detailed sequencing data from NCBI Virus. Previously: Coronavirus variant data from outbreak.info (DIP 2021.03.10). [h/t Karsten Johansson]

Jul 2022

Hospital price lists.

Since January 2021, the US government has required hospitals to publish machine-readable files listing the standard charges for all items and services they provide. But there’s no standard format for these price lists (also known as “chargemasters”), no official central repository of them, and compliance has been lacking. Seeing those problems, the versioned-data platform DoltHub earlier this year ran a paid crowdsourcing campaign that pulled nearly 300 million prices from the published lists of roughly 1,800 hospitals into a single database. Related: Thanks to an earlier price transparency rule, California posts chargemasters for hundreds of hospitals, with records going back to 2011.

Jul 2022

Wildfires around the world.

The Global Wildfire Information System, expanding on the work of the European Forest Fire Information System, uses satellite data to provide weekly and annual estimates of the number of fires and area burned in 200+ countries. Its bulk data indicates monthly burned hectares by country, sub-country unit, and land type from 2002 to 2019, as well as the boundaries of individual fires from 2001 to 2020. It also publishes gridded spatial data relating to fire danger forecasts, active fires, emissions, and more. As seen in: El Diario’s analysis of forest fires in Spain. [h/t Olaya Argüeso Pérez]

Jul 2022

The World Cup.

Josh Fjelstul’s World Cup Database, published this month, provides “extensively cleaned and cross-validated” information about each of the 21 FIFA World Cup tournaments played so far. Its 27 tables contain “approximately 1.1 million data points” regarding the teams that participated, their players and managers, the referees, match outcomes, goals, penalties, and more.

Jul 2022

Digital trade provisions.

Mira Burri et al.’s TAPED dataset, which “seeks to comprehensively trace developments in the area of digital trade governance,” categorizes 100+ relevant aspects of 300+ preferential trade agreements signed since 2000. The dataset indicates, for instance, that the Peru-Australia Free Trade Agreement contains binding agreements on personal data protection, nonbinding language on cybersecurity, and no provisions regarding net neutrality.

Jul 2022

Budget apportionments.

Congress, through a process called appropriations, chooses how much money goes to each US federal agency and program. But the Office of Management and Budget, through a process called apportionment, ultimately sets the rules for spending those funds, “typically limit[ing] the obligations [an agency] may incur for specified time periods, programs, activities, projects, objects, or any combination thereof.” Those binding decisions have generally not been available to the public — until last week, when OMB launched a database of apportionments for FY 2022, per a requirement in Congress’s 2022 spending bill. [h/t Caitlin Emma]

Jul 2022

Notable people.

“A new strand of literature aims at building the most comprehensive and accurate database of notable individuals,” observe Morgane Laouenan et al., who contribute a “cross-verified database of 2.29 million individuals” mined from Wikidata and the English, French, German, Italian, Spanish, Portuguese and Swedish editions of Wikipedia. For each person, the dataset provides their birth and death dates, gender, citizenship, occupations, and other details. Previously: The MIT-based Pantheon dataset (DIP 2016.02.03), also based on Wikipedia and since updated. [h/t Philip Jung]

Jul 2022

New voting laws.

The Voting Rights Lab has been tracking 2,000+ laws proposed in US state legislatures since 2021. The tracker focuses on “12 major issue areas relating to voter access and representation,” such as early voting, same-day registration, and ID requirements. It lists each bill’s state, number, author, date introduced, current status, and issue areas, plus a summary and the lab’s “assessment of whether the legislation is likely to improve or interfere with voter access or the administration of elections.” As seen in: “Has Your State Made It Harder To Vote?” (FiveThirtyEight). Related: States Newsroom’s Kira Lerner has compiled a spreadsheet of 120 new election-related criminal penalties, based partly on the tracker’s data.

Jul 2022

Early movie theaters in Oregon.

The Oregon Theater Project, developed as part of a cinema studies course, “aims to document the history of moviegoing in Oregon – why people went to the movies, where people watched them, and what people thought about them,” with a current focus on the silent film era. Its directory of 200+ theaters is available to download and explore online.

Jul 2022

Digital payments in India.

PhonePe, a digital payments company serving India, publishes quarterly aggregated data on users and transactions. The statistics, which go back to 2018 and power an interactive map, are available at the national, state, and district levels. User counts are provided as totals, as well as grouped by device brand. Transactions are measured by count and total value, with the counts also disaggregated into a few categories.

Jul 2022

Legal language.

Peter Henderson et al.’s Pile of Law is “a 256GB (and growing) dataset of open-source English-language legal and administrative data, covering court opinions, contracts, administrative rules, and legislative records.” Its texts come from CourtListener (DIP 2016.04.13), the Constitute Project (DIP 2022.04.13), the European Court of Human Rights’ published opinions, the Consumer Financial Protection Bureau’s collection of credit card agreements, and many other sources. [h/t Lynn Cherny]

Jul 2022

Technology adoption.

Diego A. Comin and Bart Hobijn’s Cross-country Historical Adoption of Technology dataset, published in 2009, compiles statistics “on the adoption of over 100 technologies in more than 150 countries since 1800.” Examples include the number of telegrams sent, television sets in use, knee replacement surgeries performed, and metric tons of freight carried on railways. Last month, Charles Kenny and George Yang published a dataset and accompanying working paper that update those numbers and expand the technologies covered. [h/t Ranil Dissanayake]

Jul 2022

Heat metrics.

When you talk about outdoor heat, you’re likely referring to “dry-bulb” temperatures, measured by a thermometer shielded from the sun and moisture. But other factors also contribute to the physiological experience of hot weather. To that end, Keith R. Spangler et al. have created a dataset containing daily estimates of the wet-bulb globe temperature, Universal Thermal Climate Index, heat index, humidex, and other heat metrics for every county in the contiguous United States from 2000 through 2020. That first metric, for instance, was “originally developed in the 1950s to establish epidemiologically relevant thermal thresholds to prevent heat-related illnesses at US military training camps,” and takes humidity, solar radiation, and wind speed into account.
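To make the gap between dry-bulb temperature and the "felt" metrics concrete, here is a minimal Python sketch of one of them, the heat index, using the regression commonly attributed to the US National Weather Service (Rothfusz). This is an illustration only: the function name is hypothetical, and the dataset's authors may well compute their heat index with a different algorithm. It also covers only temperature and relative humidity, not the solar radiation and wind speed that wet-bulb globe temperature accounts for.

```python
import math


def heat_index_f(temp_f: float, rh_pct: float) -> float:
    """Approximate heat index (deg F) from dry-bulb temperature (deg F)
    and relative humidity (percent). Hypothetical helper, not the
    dataset's own code."""
    # Simple estimate, already averaged with the air temperature.
    simple = 0.5 * (temp_f + 61.0 + (temp_f - 68.0) * 1.2 + rh_pct * 0.094)
    if simple < 80.0:
        return simple

    # Full Rothfusz regression for hotter conditions.
    t, rh = temp_f, rh_pct
    hi = (
        -42.379
        + 2.04901523 * t
        + 10.14333127 * rh
        - 0.22475541 * t * rh
        - 0.00683783 * t * t
        - 0.05481717 * rh * rh
        + 0.00122874 * t * t * rh
        + 0.00085282 * t * rh * rh
        - 0.00000199 * t * t * rh * rh
    )

    # Published adjustments for very dry and very humid air.
    if rh < 13 and 80 <= t <= 112:
        hi -= ((13 - rh) / 4) * math.sqrt((17 - abs(t - 95)) / 17)
    elif rh > 85 and 80 <= t <= 87:
        hi += ((rh - 85) / 10) * ((87 - t) / 5)
    return hi


if __name__ == "__main__":
    # A 90 deg F day at 70% relative humidity feels considerably hotter.
    print(round(heat_index_f(90, 70), 1))
```

Under these assumptions, a 90°F day at 70% relative humidity comes out at roughly 106°F, which is the kind of difference the dataset's alternative metrics are meant to capture.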

Jul 2022

Saturday Night Live.

Joel Navaroli’s snlarchives.net aims to catalog and cross-reference every episode, cast member, host, character, sketch, impression, and other aspects of Saturday Night Live’s 47-and-counting seasons. An open-source project by Hendrik Hilleckes and Colin Morris scrapes much of that information into structured data files. As seen in: Morris’s 2017 analysis of gender representation in SNL sketches.

Jul 2022

Shakespeare.

The Folger Shakespeare “brings you the complete works of the world’s greatest playwright, edited for modern readers.” Its digital editions of the Bard’s plays and poems are available to read online and to download in various file formats. It also provides an API, with endpoints for synopses, roles, monologues, word frequencies, and more. [h/t Cameron Armstrong]

Jul 2022

European air traffic.

Eurocontrol, the main organization coordinating Europe’s air traffic management, publishes an “aviation intelligence portal” with a range of industry metrics, including traffic reports that count the daily number of flights by country, by airport, and by operator. The portal also offers bulk datasets on topics such as airport traffic, flight efficiency, estimated CO2 emissions, and more. [h/t Giuseppe Sollazzo]

Jul 2022

Mass expulsions.

Political scientist Meghan M. Garrity’s Government-Sponsored Mass Expulsion dataset focuses on “policies in which governments systematically remove ethnic, racial, religious or national groups, en masse.” Using a combination of archival research and secondary sources, Garrity documents 139 such events, estimated to have expelled more than 30 million people between 1900 and 2020. For each expulsion, the dataset provides “information on the expelling country, onset, duration, region, scale, category of persons expelled, and frequency.” To download it, visit the Journal of Peace Research’s replication data portal and search for “mass expulsion.”

Jul 2022

Banned and challenged books.

A recent report from PEN America identified 1,500+ decisions, made between July 2021 and March 2022, to ban books from classrooms and school libraries. A spreadsheet accompanying the report lists each decision’s date, type, state, and school district, as well as each banned book’s title, authors, illustrators, and translators. Related: Independent researcher Tasslyn Magnusson, in partnership with EveryLibrary, maintains a spreadsheet of both book bans and book challenges, with 3,000+ entries since the 2021–22 school year. [h/t Gary Price]

Jun 2022

Dendrochronology.

The Vernacular Architecture Group’s Dendrochronology Database “provides the tree-ring dates for over 4500 buildings in the United Kingdom, ranging from cathedrals to cottages and barns.” Each entry lists a date range, dating method, location, building type, and descriptive notes.

Jun 2022

Bay Area rents.

In 2018, economist Kate Pennington used the Wayback Machine to collect data on two decades of Craigslist posts advertising housing in the San Francisco Bay Area. The results span 200,000+ listings between 2000 and 2018, from which Pennington extracted each post’s date, price, bedroom count, location, and more. Previously: Twentieth-century San Francisco rents from the city’s Housing Study DataBook and transcribed from newspaper listings (DIP 2016.05.25). [h/t Alex Albright]

Jun 2022

Automated driving crashes.

Last June, the National Highway Traffic Safety Administration issued a directive requiring manufacturers and fleet operators to report certain crashes involving either advanced driver assistance or higher-level “automated driving systems.” Earlier this month, the agency published its first release of crash report data, which it says “will be updated on a monthly basis.” The files describe each report and include information about the car, location, circumstances, and injury level. [h/t Faiz Siddiqui et al.]

Jun 2022

Diaspora voting policies.

If you don’t live in your country of citizenship, can you still vote there? Nathan Allen et al.’s Extraterritorial Voting Rights and Restrictions Dataset examines 20+ characteristics of 195 countries’ policies from 1950 to 2020. The dataset’s variables relate to dual citizenship, voter registration, mail-in ballots, and more. Read more: An introductory Twitter thread from coauthor Elizabeth Iams Wellman. [h/t Rabbia Tariq]

Jun 2022

State abortion laws.

The Guttmacher Institute, a “research and policy organization committed to advancing sexual and reproductive health and rights,” maintains a table summarizing each US state’s abortion laws. It examines key aspects of the legal landscape — such as gestational age limits, mandated counseling, and whether an abortion must be performed by a licensed physician — and links to topic-specific tables with additional detail. A separate table categorizes, as of June 1, the policy implications in each state of overturning Roe v. Wade. Previously: Guttmacher’s state-level statistics on pregnancy, birth, and abortion (DIP 2020.10.28) and Global Abortion Incidence Dataset (DIP 2021.04.14), the World Health Organization’s Global Abortion Policies Database (DIP 2021.10.13), and Caitlin Knowles Myers’s dataset of abortion facility distances (DIP 2022.02.02).

Jun 2022

NYC tree plantings.

New York City’s Department of Parks & Recreation publishes a map and dataset of recent and likely-upcoming street tree plantings. The information includes each location’s coordinates, nearest street address, ZIP code, city council district, and borough, as well as the dates of completed plantings. Previously: Every street tree in NYC (DIP 2016.11.16). [h/t Soph Warnes]

Jun 2022

Inclusive crossword names.

In a post for The New York Times’ Gameplay section, psychology professor Erica Hsiung Wojcik describes her motivations for creating the Expanded Crossword Name Database, “a free and regularly updated list of names, places and things that represent groups, identities and people often excluded from crossword grids,” with a particular focus on “names of women, non-binary, trans, and/or people of color.” It contains 2,400+ potential entries — from AALIYAH to ZORANEALEHURSTON — that correspond to 900+ distinct proper nouns, each briefly described in the project’s main spreadsheet. [h/t George Ho]

Jun 2022

Central bank interest rates.

The Bank for International Settlements maintains a longitudinal dataset of policy interest rates, which central banks adjust to influence inflation and other aspects of the economy. The dataset, which includes both official policy rates and analogous precursors, covers three dozen countries plus the European Central Bank. The records span decades, going as far back as 1946 for Denmark, India, Japan, Sweden, Switzerland, and the UK; 1954 for the US; 1960 for Canada; and 1976 for Australia.

Jun 2022

Ukraine air raid alerts.

Volodymyr Agafonkin, a Kyiv-based software engineer, has been scraping and charting the emergency notifications published through Air Alert Ukraine, a Telegram channel. The notifications serve as a digital counterpart to the sirens that warn residents of potentially imminent Russian air attacks. Agafonkin’s dataset indicates the starting and ending times of 8,000+ alerts for 240 locations since March 15. Read more: An interview with Agafonkin in How To Read This Chart, a Washington Post newsletter.

Jun 2022

Monkeypox cases.

Global.health, a data-sharing initiative launched during the COVID-19 pandemic, has compiled a dataset of 2,500+ confirmed cases from this year’s monkeypox outbreak. Drawing from government and media sources, the dataset lists each case’s country and publicly known characteristics, such as the patient’s gender, age range, date of confirmation, and/or symptoms. As seen in: Charts and maps from the Global.health team and from Our World In Data.