Digging into data access: The need for reform
Challenges in accessing archived geoscience data could hinder the UK’s adoption of new low-carbon technologies. Alex Dickinson and Mark Ireland suggest that a centralised archival system could help.
Geoscience technologies are likely to play a vital role in realising the world’s net-zero ambitions. Subsurface reservoirs have the potential to store hydrogen fuel, lock away radioactive waste, and sequester carbon dioxide. Ground-source heat pumps and deep geothermal aquifers can support decarbonisation of heating systems, whilst increased extraction of critical minerals will be key to the construction of low-carbon infrastructure.
To develop these technologies, geoscientists need a reliable understanding of how heat and fluids move through the subsurface. This understanding is derived from the analysis of extensive empirical datasets. Geoscientists drill boreholes to characterise the composition of soils and rocks at depth, to measure variables such as subsurface temperature, and to sample the chemistry of groundwater in aquifers. They use remote-sensing methods, including magnetic surveying and seismic imaging, to reveal the three-dimensional geometry of subsurface structures and to track the movement of fluids.
Acquiring these datasets can be costly. Drilling a 1.5-km-deep onshore borehole, for instance, may cost £2 million. Such costs are often not justified during the initial development of technologies with unproven economic returns. As a result, nearly all nascent low-carbon projects make use of existing datasets. The UK has a rich trove of such datasets, most of which have been acquired in the past century to support the extractive and construction industries. However, barriers to data access and inconsistencies in data presentation threaten to slow the development of new technologies, hindering our ability to meet net-zero targets.
What data are needed?
In the past, geoscientists commonly worked with observational datasets acquired in a small geographical area. They became familiar with the quirks of each dataset, which they often analysed by making subjective judgements. Today, geoscientists are making increasing use of open-source software that can automate data analysis and make the results more reliable (see box ‘Prising open software’). Using such software, they develop tools that help policymakers and investors find promising locations for low-carbon technologies and identify regions in which acquisition of further data would be worthwhile (see box ‘Heat under Holland’).
Efficient analysis using open-source software requires access to trustworthy datasets that are stored in consistently structured computer-readable formats. (By computer-readable, we mean digital files that can be manipulated using computer code, such as comma-separated-values files for numerical data or shapefiles for geospatial data. Scanned copies of printed records, while technically stored in a digital format, are not computer-readable.) Geoscientists can no longer rely on picking out quirks by eye. Unfortunately, many of the UK’s existing geoscience data are archived in inconsistent formats and are hard to obtain.
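To make the idea of computer-readability concrete, the sketch below shows how a borehole temperature record stored as comma-separated values can be loaded and analysed in a few lines of Python. The column names and all values are invented for illustration and are not drawn from any dataset discussed in this article. The same calculation on a scanned paper record would first require someone to transcribe the numbers by hand.

```python
import csv
import io

# Hypothetical borehole temperature log (depth in metres, temperature in deg C).
# The column names and values are invented for illustration only.
CSV_TEXT = """depth_m,temperature_c
0,10.0
500,25.0
1000,40.0
1500,55.0
"""


def read_log(text):
    """Parse CSV text into a list of (depth_m, temperature_c) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(float(row["depth_m"]), float(row["temperature_c"])) for row in reader]


def gradient_c_per_km(points):
    """Least-squares slope of temperature against depth, in deg C per km."""
    n = len(points)
    mean_z = sum(z for z, _ in points) / n
    mean_t = sum(t for _, t in points) / n
    num = sum((z - mean_z) * (t - mean_t) for z, t in points)
    den = sum((z - mean_z) ** 2 for z, _ in points)
    return 1000.0 * num / den


points = read_log(CSV_TEXT)
print(f"Estimated geothermal gradient: {gradient_c_per_km(points):.1f} deg C per km")
```

Because the file has a consistent structure, the same two functions could in principle be applied unchanged to thousands of boreholes, which is the kind of automated analysis that inconsistent archiving prevents.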
BOX: Prising open software
Many geoscientists have traditionally worked with closed-source software, which is sold under licences that restrict the ability of users to modify and share it. Closed-source software is often excellently designed for specific tasks, and usually includes intuitive graphical user interfaces. Examples of such software in the geosciences include tools for interpreting seismic data and geographic information systems for analysing spatial information. Due to the licensing of closed-source software, it can be difficult for users to adapt workflows to their own specific datasets or requirements.
Open-source software is freely available and is distributed under licences that let users change the underlying code and share these changes with others. This fosters collaborative development of innovative software that can be quickly adapted to new purposes. In particular, open-source software has supported the rapid growth of data-analysis and machine-learning tools that can be used to explore and model extremely large datasets. These tools have revolutionised the worlds of business and finance, and are starting to transform scientific fields from drug discovery to climate modelling.
Geoscientists can turn to an increasing number of excellent open-source software packages. The computing start-up Agile Scientific, for instance, has published packages for manipulating digital borehole logs, for interpreting seismic data, and for converting hand-drawn diagrams to digital form. Even large hydrocarbon companies, which have long been bastions of secrecy, are committing to the open-source revolution. Since 2015, for example, the Norwegian energy giant Equinor has required that its software developers make all their work open-source.
Currently, a search for the keyword geoscience on the popular code-sharing site GitHub yields 842 software packages. This number is certain to increase as more geoscientists recognise the power of open-source tools.
BOX: Heat under Holland
The Netherlands Organisation for Applied Scientific Research has constructed nationwide, three-dimensional predictive models of subsurface temperature and the economic potential of future geothermal systems. These models are freely available online. Due to this modelling initiative, over 90% of drilled geothermal projects have been successful, and installed geothermal power expanded by a factor of ten between 2010 and 2018 (Ministry of Economic Affairs and Climate Policy, 2020).
These models are underpinned by a repository of data from all boreholes drilled by the hydrocarbon industry. Anyone can freely access these data. Imagine, for instance, that a Dutch geoscientist is investigating the potential for a geothermal heating system near Utrecht. They want to download computer-readable logs for a certain borehole, which they find using an interactive map. Clicking on the borehole, they learn that six types of log are available. Within ten minutes, they have downloaded all these logs and 87 associated reports for no cost. These files have a total size of 380 megabytes. (Readers can find this borehole by searching for MEERKERK-01.)
Current data access
Researchers hoping to access UK geoscience data are confronted by an array of websites that are maintained by various organisations. Each organisation has a different approach to archiving and data sharing, and ease of data access is highly variable. To illustrate this variation, we consider five kinds of data that are most likely to be of use in supporting the energy transition (Fig. 1; Table 1).
Offshore data are usually easy to access. Most of these data were acquired to support the hydrocarbon industry, which has generally been well regulated since its expansion in the 1970s. Regulations include data-reporting requirements and define clear confidentiality periods that determine when archived data can be released to the public domain (see Table 2a for the current requirements). Publicly available hydrocarbon-industry data can be accessed using the National Data Repository (NDR), which provides online search tools and an interactive map to help users find metadata for boreholes and seismic surveys. Geoscientists can register with the NDR and download archived computer-readable data for free, subject to limits and delivery costs (see Table 3a).
Increasing volumes of offshore data are also being collected during the development of renewable-energy infrastructure such as wind farms. The Crown Estate archives such data, which include sediment cores, seismic reflection records, and bathymetric measurements, and which are again governed by clear confidentiality periods (Table 2a). Geoscientists can freely download computer-readable versions of these records, along with associated reports, by visiting the Marine Data Exchange.
Ease of access to onshore data is more varied. Most onshore seismic reflection datasets were acquired by the coal-mining or hydrocarbon industries, and are now preserved by the UK Onshore Geophysical Library. Website visitors can freely view seismic images. Both the original seismic records and the processed data can be purchased by firms, and are often made available to academic researchers at minimal cost (Table 3b).
Most remaining onshore datasets are maintained by the British Geological Survey (BGS), which hosts two national, semi-autonomous data centres: the National Geological Repository and the National Geoscience Data Centre (here, we use the term BGS to collectively refer to the BGS proper, the National Geological Repository, and the National Geoscience Data Centre). These datasets include measurements acquired since the nineteenth century for purposes as diverse as quarrying, mining, civil engineering, hydrocarbon exploration, nuclear-waste disposal, assessment of geothermal resources, and blue-skies scientific research. Data ownership is governed by a wide range of legal mechanisms, many of which refer to organisations and companies that no longer exist, and historical confidentiality periods are often unclear (see Table 2b for the current periods).
Due to the complexities of ownership and confidentiality, few BGS datasets can be downloaded (Table 3b). Many datasets lack detailed metadata, so lengthy correspondence is needed to establish what data exist and whether their confidentiality period has expired. Due to the extensive nature of the BGS holdings, mechanisms for accessing data can be unclear to researchers who are not well acquainted with the structure of the archives.
As an example, consider access to data acquired in onshore boreholes. More than one million onshore boreholes are described by scanned copies of printed reports, which visitors to the BGS website can download for free. These reports contain highly useful descriptions of subsurface lithologies and geological formations. However, they rarely include details of logs. (By log, we mean records of quantitative measurements made in boreholes. Commonly measured properties include density, electrical resistivity, velocity, and natural radioactivity. These measurements are key in helping geoscientists model the behaviour of geological formations. Note that we do not include qualitative descriptions of subsurface lithologies and geological formations in our use of the term log.) Importantly, the scanned reports are not computer-readable.
Computer-readable borehole records are maintained in at least three BGS databases. The first database contains records that were acquired during development of infrastructure such as roads or pipelines. These records, which can be freely downloaded in a standardised format, mainly comprise lithological descriptions that extend to depths of 30 m or less. In addition to being useful for future construction projects, they may be valuable for planning the deployment of ground-source heat pumps as part of low-carbon heating systems.
However, these near-surface records are of limited use for investigating the potential of other low-carbon technologies such as deep geothermal power and hydrogen storage. Investigating these technologies instead requires logs from boreholes that reach depths of up to 4 km. Over 45,000 computer-readable logs for more than 4,500 boreholes are housed in two BGS databases. (These figures are not available as descriptive metadata on the BGS website. Instead, we have compiled them from downloaded datasets and from publicly unavailable spreadsheets which BGS employees have kindly shared with us.) Visitors to the BGS website can view and download the locations of these boreholes using an interactive map. However, the website does not detail which logs exist for each borehole. To gain this information, and to gain access to the logs, geoscientists must contact the BGS directly (see box ‘Getting hold of boreholes’).
Such difficulties are not restricted to accessing borehole data. We spent several days seeking to understand all the mechanisms by which geoscientists can access onshore UK data (Fig. 1; Table 3b). Despite our best efforts and extensive correspondence with BGS employees, we later found that we had missed various datasets. This experience underlines our point that current data-delivery mechanisms are confusing and inefficient for geoscientists who are not already closely acquainted with the structure of the BGS archives.
When onshore data can be made publicly available, they are archived using a range of physical and digital media (Fig. 2). Due to variety in the quality of archiving, the cost of data retrieval can be high (see Table 3b) and users must often manually convert data to computer-readable formats. As a result of the cost and effort of accessing and formatting onshore data, geoscientists at companies, universities and research institutes often decide to develop and maintain their own databases of publicly available records. This approach leads to two problems. First, databases may not be shared with other members of the geoscience community due to concerns over ownership (for instance, a recently compiled database of more than 900 measurements of subsurface temperature has not been made publicly available; Farr et al., Q J Eng Geol Hydrogeol 2020). Second, the use of independently maintained databases in geoscience studies may reduce confidence in the accuracy and trustworthiness of results.
BOX: Getting hold of boreholes
A researcher is working for a small British company that hopes to develop a geothermal heating system near Middlesbrough. They would like to examine computer-readable logs from a particular borehole. Whilst looking for the logs, they come across four different interactive online maps, which are maintained by three separate organisations. Only one of the maps states that logs exist for the borehole. However, it provides no explanation of how to gain access to the logs. Undeterred, the researcher searches further and discovers that the BGS houses all publicly available logs for onshore boreholes within the UK. The researcher cannot download these logs from the website, and so they contact the BGS by email. After a lengthy correspondence, they receive eleven logs for the cost of £130. Each of these logs is a text file approximately one megabyte in size. It is now two weeks since the researcher began their search. Contrast this experience with the situation in the Netherlands (see box ‘Heat under Holland’).
Improving data access
Improving access to onshore and offshore geoscience data requires work in three areas:
1) Reviewing the legal status of datasets: Compilation of a comprehensive summary of ownership and availability of all existing datasets would help geoscientists identify information that is readily accessible. Where the ownership of historical records is unclear, their legal status should be reviewed to ensure that as many data as possible are publicly available. The UK must also ensure that data acquired in future are appropriately reported and archived. Data-reporting requirements for the majority of established industries are well defined. However, preservation of near-surface data acquired by the construction industry is currently voluntary, and the requirements for nascent industries, such as geothermal power, are not always clear (Tables 2a and 2b). To avoid the loss of valuable information, regulators should review data-reporting mechanisms for growing industries and plan for the emergence of new industries. There must be continued focus on the requirements for publicly funded research projects to publish open-source data.
2) Curating standardised digital datasets: Ideally, data from all existing print and digital sources would be converted into simple, widely used open-source formats (e.g. Log ASCII Standard files for borehole logs, or SEG-Y files for seismic reflection data). Archived physical specimens would be detailed in easily searchable databases, and all datasets and databases would be accompanied by comprehensive descriptive metadata. Wherever possible, metadata for subjective interpretations (such as three-dimensional models of geological structure) would provide exhaustive links to the underlying observational data on which they are based. To aid planning of future projects, descriptions of confidential datasets would clearly indicate when the confidentiality period expires. If datasets cannot be made publicly available, the metadata should explain why.
3) Building a single online platform: Delivery of all open-source digital geoscience data through a single platform would dramatically reduce the time that geoscientists spend tracking down and requesting data. Ideally, the platform would be underpinned by an intuitively structured database that can be easily searched using text strings. Datasets with geographical information would also be displayed on a single interactive map. Each database entry would provide clear links to all associated data, metadata, and reports. (The NDR is a good example of this structure.) Wherever possible, data would be free to download or to access using tools such as Application Programming Interfaces. These tools remove the need for researchers to maintain personal archives of data on local computers. Instead, they can analyse consistent, regularly updated datasets that are remotely hosted in the cloud.
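To illustrate why the simple, standardised formats mentioned in point 2 matter, the sketch below reads a heavily abbreviated Log ASCII Standard (LAS) file using only the Python standard library. It is a simplified illustration rather than a complete implementation of the LAS specification: the file contents are invented, the parser assumes unwrapped data, and the null value is passed in as an assumed default rather than read from the file header. For real work, an established open-source package such as lasio would be a more robust choice.

```python
# Hypothetical, heavily abbreviated LAS 2.0 file. All values are invented.
LAS_TEXT = """~Version
VERS.   2.0 : CWLS LOG ASCII STANDARD - VERSION 2.0
WRAP.   NO  : ONE LINE PER DEPTH STEP
~Well
NULL.   -999.25 : NULL VALUE
~Curve
DEPT.M      : Depth
GR  .GAPI   : Gamma ray
~ASCII
100.0   45.2
100.5   -999.25
101.0   52.8
"""


def parse_las(text, null=-999.25):
    """Return {mnemonic: [values]} from unwrapped LAS text; nulls become None.

    Simplified sketch: only the curve (~C) and data (~A) sections are read,
    and the null value is an assumed default, not taken from the ~W section.
    """
    section = None
    names = []
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("~"):
            section = line[1:2].upper()  # first letter identifies the section
            continue
        if section == "C":
            # Curve lines look like "MNEM.UNIT : description"
            names.append(line.split(".", 1)[0].strip())
        elif section == "A":
            rows.append([float(value) for value in line.split()])
    curves = {name: [] for name in names}
    for row in rows:
        for name, value in zip(names, row):
            curves[name].append(None if value == null else value)
    return curves


curves = parse_las(LAS_TEXT)
for depth, gamma in zip(curves["DEPT"], curves["GR"]):
    print(depth, gamma)
```

Because the format states which measurements are present and in what units, a geoscientist can process logs from many boreholes without first inspecting each file by eye.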
Thanks to the explosion of digital technologies in the past decade, now is an excellent time to build the digital services and infrastructure that will underpin curation and delivery of standardised datasets. Scanned records can be automatically and quickly digitised with the help of improvements in optical character recognition, whilst petabytes of data can now be easily shared using cloud-based systems. Once a standardised digital archive has been established, it will be simple to update it to accommodate new storage formats, delivery tools, and data acquired by future projects.
The necessary infrastructure might be best developed in partnership with a dedicated cloud service provider. For instance, NASA has established partnerships with Amazon Web Services and Google to make a projected 250 petabytes of open-source data available through cloud-based services by 2025. Aside from commercial partnerships, collaboration with data custodians in other countries could drive efficient and mutually beneficial improvements to data access (see box ‘Heat under Holland’ for an exemplary initiative in the Netherlands).
Who will improve access?
Data custodians are aware of the need to improve access to their resources, and have already begun several excellent initiatives. The BGS, for instance, has recently published its first digital strategy (The British Geological Survey, 2020), and has set up Application Programming Interfaces for several of its datasets. However, similar strategies and tools are often independently developed by each custodial organisation. For example, the Marine Data Exchange, the NDR and the UK Onshore Geophysical Library all maintain online platforms that help geoscientists find and download seismic data. This duplication of data-management initiatives is inefficient and wastes resources.
Reducing these inefficiencies by coordinating efforts across custodial organisations would be challenging, not least because each organisation relies on different sources of funding (Fig. 3). The NDR is predominantly funded by a levy on the holders of Oil and Gas Authority-issued hydrocarbon licences, whilst the Marine Data Exchange is supported by the revenues of the Crown Estate. (The Oil and Gas Authority became the North Sea Transition Authority in March 2022.) The UK Onshore Geophysical Library is funded entirely by sale of data to companies and by charitable donations. None of these three organisations has responsibilities other than data management and data preservation.
In contrast, both the Environment Agency and the BGS receive funding from multiple sources and undertake many vital roles beyond data archiving. Although the Environment Agency relies on Government funding, it derives a third of its income from the sale of data and from licensing of activities such as fishing and waste disposal. The BGS receives more than half of its income from Government funding, with the remainder made up from commercial consultancy, competitive research grants, and data sales. These competing incentives, and often a lack of funding, mean that improvements to data access cannot always be prioritised.
We suggest that the tensions and inefficiencies of the current system could be overcome by establishing a single, centralised organisation dedicated to the preservation and management of all onshore and offshore geoscience data. This organisation would focus on reviewing the legal status of historical records and overseeing the development of digital infrastructure for the curation and delivery of standardised datasets. It would have responsibility, where appropriate, for setting charges associated with data delivery. It could also work with regulators towards efficient enforcement of data reporting.
Establishing this new model for geoscience data management would likely cost tens of millions of pounds (the Oil and Gas Authority has spent between £3 million and £4 million per year on the NDR since its establishment in 2019; Oil and Gas Authority, 2021a). In the short term, much of the cost would cover development of the necessary digital infrastructure, followed by digitisation of print media and reformatting of compiled data into standardised datasets. The digital infrastructure might be most easily created by expanding or mimicking the cloud-based systems that currently host the Marine Data Exchange and the NDR. (Moves have already been made in this direction, with the NDR beginning to absorb onshore data from the UK Onshore Geophysical Library. The NDR aims to increase its storage capacity to four petabytes by 2026; Oil and Gas Authority, 2021b; see Table 1.) Once established, cloud-based infrastructure could reduce annual data-storage costs by up to £400,000 per petabyte (Net Zero Technology Centre, 2021).
Long-term maintenance costs could be met through levies on the hydrocarbon and renewable-energy industries (the NDR is currently funded by such a levy; see Fig. 3). The overall cost of data preservation will almost certainly reduce due to elimination of duplicate initiatives. Establishment of a centralised organisation for data delivery may also require changes to the business model of the BGS, which receives around 7% of its income from data sales (The British Geological Survey, 2021) (Fig. 3).
However the costs are met, overhauling management of the UK’s geoscience data will cost only a fraction of the estimated annual investment of £50 billion needed for decarbonisation (The Climate Change Committee, 2020). This small investment seems highly worthwhile. Improved data access will help researchers, engineers and policymakers make decisions that are fully informed by all existing data, and will encourage development of low-carbon industries. By exporting these industries to the rest of the world, the UK can profit from the projected global investment in clean-energy technology of £3 trillion per year by 2030 (The International Energy Agency, 2021).
Aside from the economic benefits, transparent sharing of data and decision-making will help foster public support for low-carbon technologies that could otherwise be portrayed as dangerous. Standardised datasets will facilitate changes to university geoscience courses, and will encourage the teaching of highly valued analytical skills such as data science. On the level of fundamental research, reanalysis of comprehensive datasets using modern computational tools could revolutionise our understanding of British geology.
Data-archiving experts may disagree that establishment of a centralised body is the best way to realise the full potential of the UK’s geoscience data. We acknowledge that our suggestions are based on our perspective as end users of data, and that insights from those with expertise in data architecture are vital. Many of the organisations mentioned here are already doing excellent and often unrecognised work to improve data access. By providing an end-user viewpoint, we hope that this article will prompt discussion about the best way to build on their efforts and maximise the value of the UK’s rich data trove.
Dr Alex Dickinson is a Postdoctoral Research Associate in Energy Geosciences at Newcastle University, UK; Email: firstname.lastname@example.org; Twitter: @TheLittleCzar
Dr Mark Ireland is a Lecturer in Energy Geosciences at Newcastle University, UK; Email: email@example.com; Twitter: @MirelandMark
We thank employees at the British Geological Survey, the National Data Repository, and the Oil and Gas Authority (now the North Sea Transition Authority) for their help in finding information for this article.
Supplementary Information is available at doi.org/10.6084/m9.figshare.c.5939299
- The British Geological Survey (2020) BGS Digital Strategy 2020-2025; https://www.bgs.ac.uk/download/bgs-digital-strategy-2020-2025/
- The British Geological Survey (2021) BGS Annual Report 2019-2020; https://www.bgs.ac.uk/download/bgs-annual-report-2019-2020/
- The Climate Change Committee (2020) Sixth Carbon Budget; https://www.theccc.org.uk/publication/sixth-carbon-budget/
- England, P.C. et al. (1980) Heat refraction and heat production in and around granite plutons in north-east England, Geophysical Journal International 62 (2), 439–455; https://doi.org/10.1111/j.1365-246X.1980.tb04866.x
- Farr, G. et al. (2020) The temperature of Britain’s coalfields. Q J Eng Geol Hydrogeol 54, qjegh2020-109; https://doi.org/10.1144/qjegh2020-109
- The International Energy Agency (2021) Net Zero by 2050: A Roadmap for the Global Energy Sector; https://iea.blob.core.windows.net/assets/deebef5d-0c34-4539-9d0c-10b13d840027/NetZeroby2050-ARoadmapfortheGlobalEnergySector_CORR.pdf
- Ministry of Economic Affairs and Climate Policy (2020) Natural resources and geothermal energy in the Netherlands: An overview of exploration, production and subsurface storage. 2020 Annual review; nlog.nl
- Net Zero Technology Centre (2021) Taking seismic data into the cloud; https://www.netzerotc.com/solution-centre/projects/case-studies/osokey-seismic-in-the-cloud/
- The Oil and Gas Authority (2021a) OGA Annual Report and Accounts 2020–21; https://www.ogauthority.co.uk/media/7685/oga-annual-report-and-accounts-2020-21.pdf
- The Oil and Gas Authority (2021b) Quicker, Easier, Better – unleashing a digital revolution in energy; https://www.ogauthority.co.uk/media/7736/oga-briefing-pack-v2-aug1121.pdf