
GeoGPT: Open science in practice

Mike Stephenson and colleagues from the Deep-time Digital Earth project discuss the latest developments for GeoGPT and argue that AI will rapidly revolutionise the geosciences – if we can tackle the complexities of open science.

9 October 2024


GeoGPT is an open-source, non-profit Large Language Model (LLM), entirely for the geosciences. If you’ve not heard of LLMs or GPT models, they are most certainly coming your way, and will probably change the way that you work as a geoscientist.

A Large Language Model, of which the Generative Pre-trained Transformer (GPT) is the best-known type, is a form of artificial intelligence that can generate language and classify information. The model is trained on large amounts of data, such as text from the internet, and uses neural networks (inspired by the systems in the human brain) to learn how words, characters and sentences fit together. A recent paper by Microsoft (Microsoft Research AI4Science, 2023) using their GPT-4 LLM showed the potential to analyse scientific literature, help researchers visualise large datasets, uncover trends in complex data, create code from text, and even develop novel hypotheses. Most scientists agree that LLMs are likely to have a big impact on how science in general is done.
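The core idea of learning from text how words fit together can be illustrated with a deliberately tiny sketch. This is an assumption-laden simplification – real LLMs use neural networks with billions of parameters, not word counts – but it shows the same principle of predicting the next word from what came before:

```python
# Toy "language model": count which word follows which in training
# text, then generate by picking the most likely next word. Real
# GPT models learn far richer patterns, but the prediction task is
# the same in spirit.
from collections import Counter, defaultdict

training_text = "the rock is hard the rock is old the mineral is hard"

counts = defaultdict(Counter)
words = training_text.split()
for prev, nxt in zip(words, words[1:]):
    counts[prev][nxt] += 1  # tally each observed word pair

def most_likely_next(word):
    """Return the word most often seen after `word` in the training text."""
    return counts[word].most_common(1)[0][0]

print(most_likely_next("rock"))  # → is
```

Scaled up by many orders of magnitude, and with neural networks in place of simple counts, this prediction task is what lets a GPT model complete sentences, answer questions and draft code.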

GeoGPT

But what about geoscience? The best known LLMs, like OpenAI’s GPT series of models (e.g., GPT-3.5 and GPT-4, used in ChatGPT and Microsoft Copilot), and Google’s PaLM and Gemini were trained on very large datasets and text from across the internet. But until now, there has been no geoscience-specific LLM or GPT. So, the Deep-time Digital Earth programme (DDE; https://ddeworld.org/) set about working with the Zhejiang Laboratory in China to create a system trained on open-access geoscience data. The result is GeoGPT.

In its current unreleased prototype form, GeoGPT has some astonishing capabilities: it can extract key information from geoscience documents, develop computer code, and draw charts and graphs from text. The latest version also provides Retrieval-Augmented Generation (RAG), so that the sources of answers can be traced to individual articles and papers. A user can now also choose between a Chinese (Qwen), French (Mixtral) or American (Llama) base model to compare the results – an option that is not commonly offered by LLMs.
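The traceability that RAG provides can be sketched in a few lines. This is a minimal illustration, not GeoGPT's actual pipeline: the toy corpus and the keyword-overlap retriever are assumptions standing in for a real search index, and in a real system the retrieved passage would be fed into the LLM's prompt. The point is that the answer arrives with the identifier of the paper it came from:

```python
# Minimal RAG sketch: retrieve the most relevant document for a
# question, and return the answer context together with its source id.
TOY_CORPUS = {
    "smith_2021": "Carboniferous coal measures of the Pennine Basin",
    "garcia_2019": "Basalt geochemistry along mid-ocean ridges",
}

def retrieve(question, corpus):
    """Return the id of the document sharing the most words with the question."""
    q_words = set(question.lower().split())
    def overlap(item):
        doc_id, text = item
        return len(q_words & set(text.lower().split()))
    return max(corpus.items(), key=overlap)[0]

def answer_with_source(question, corpus):
    doc_id = retrieve(question, corpus)
    # A real system would prepend corpus[doc_id] to the LLM prompt;
    # here we simply return the passage alongside its citation.
    return {"source": doc_id, "context": corpus[doc_id]}

result = answer_with_source("Tell me about coal measures", TOY_CORPUS)
print(result["source"])  # → smith_2021
```

Because the retrieval step happens outside the model, the system can always report which paper grounded a given answer – the property that lets GeoGPT cite single articles.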

GeoGPT has been trained on a large amount of high-quality open-source geoscience data, but until now has not used 'paywall' published material. Put simply, this means it knows about geoscience from open access papers, Wikipedia, and other sources – as well as the abstracts of peer-reviewed paywall papers. You could say that it has 'master's level' knowledge; but the team wants to develop 'PhD level' expertise in GeoGPT. This is where the GeoGPT team – like most other LLM developers – comes up against a challenge: we want these systems to be as good as they can be, providing wide and deep coverage, but some of the best training material is behind a paywall. We also want GeoGPT to be as widely used as possible.

Open science

These issues go to the heart of 'open science', broadly defined by UNESCO as '…making multilingual scientific knowledge openly available, accessible and reusable for everyone…'. But open science isn't as simple as it sounds. The UNESCO Open Science Recommendation (UNESCO, 2021) urges publishers against protecting scientific information behind paywalls. Court cases have also highlighted the question of whether unlicensed use of paywall material for LLM training constitutes copyright infringement. In an ongoing case between OpenAI and the New York Times, OpenAI has argued that LLM training constitutes 'fair use', in the same way that Google Books claims fair use when it scans snippets of books for its website. The key is whether the use is – in the jargon – 'transformative', or simply regurgitates content verbatim. If it is transformative, it is more likely to be deemed fair use. In Japan, since 2018 there has been considerable latitude to use copyright works for training machine learning models. In the European Union, the new AI Act introduces exceptions in copyright law for text and data mining, recognising the importance of balancing copyright protection with promoting innovation and research – so again, a sign of 'transformative' usage swings the balance in favour of free use of paywall material in LLM training.

You could argue that LLMs are open science in a pure form, in that they have the potential to provide reliable information free of charge, from a very wide range of sources, to anyone with a computer and an internet link. But how do you make LLMs work in a way that keeps both scientists and publishers happy? There is movement in this area: some publishers are beginning to see that LLMs might increase the circulation of published works, among other benefits, and are seeking balanced solutions through appropriate agreements with LLM developers.

Another challenge for LLMs is the openness and transparency of their ‘inner machinery’. This is a hot topic, with some academics arguing for complete transparency (e.g., The Economist, 2024a) to maximise the chances of further innovation; and others arguing for secrecy in LLMs (The Economist, 2024b) for fear that vital code may fall into the wrong hands. Similarly, there are contemporary discussions in UNESCO (UNESCO, 2021) about AI ethics, mainly concerned that the Global South is not left behind or exploited by AI, and that the benefits of AI technologies are shared equitably.

The GeoGPT team recognises these complex issues. In working with publishers to secure high-quality training material, we’ve realised that relationships between LLMs and publishers can help foster open science while also allowing publishers to benefit. In relation to its inner workings, GeoGPT is striving for complete openness in that the broad range of its training materials are visible to everyone, and could potentially be provided at a more granular level in the future. Only its ‘data cleaning’ technology (which fixes or removes incorrect, corrupted, duplicate or incomplete data) developed by Zhejiang Laboratory is proprietary.
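The kind of data cleaning described above can be sketched in miniature. To be clear, the Zhejiang Laboratory pipeline itself is proprietary, so this toy version is an assumption about the general approach rather than their method: it simply drops incomplete and verbatim-duplicate records before they could reach a training set.

```python
# Toy data-cleaning pass: remove incomplete and duplicate records.
raw_records = [
    {"id": 1, "text": "Granite is an intrusive igneous rock."},
    {"id": 2, "text": "Granite is an intrusive igneous rock."},  # duplicate
    {"id": 3, "text": ""},                                        # incomplete
    {"id": 4, "text": "Shale is a fine-grained sedimentary rock."},
]

def clean(records):
    """Keep only records with non-empty text not already seen."""
    seen = set()
    kept = []
    for rec in records:
        text = rec["text"].strip()
        if not text:      # drop incomplete entries
            continue
        if text in seen:  # drop verbatim duplicates
            continue
        seen.add(text)
        kept.append(rec)
    return kept

print(len(clean(raw_records)))  # → 2
```

Production pipelines also repair corrupted text and filter low-quality documents, but the dedup-and-drop pattern above is the common core.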

Global effort

GeoGPT also sees its brightest future in the Global South, where scientific data and knowledge are still hard to come by (e.g., Nobes & Harris, 2019). Recent discussions in Namibia and Nigeria during DDE workshops show how valuable GeoGPT – a free and comprehensive service for geoscientists – will be to young researchers, students and professionals alike.

GeoGPT is a global effort of open science practice aiming at unprecedented access to geoscience information and knowledge. It will likely come to a computer near you very soon!

Authors

Prof Mike Stephenson, Past and Founding President of the Deep-time Digital Earth program

Prof Jieping Ye, GeoGPT Technical Lead, Vice President of Alibaba Cloud, and a visiting scholar at Zhejiang Lab, China

Prof Yitian Xiao, Former ExxonMobil Senior Geoscience Advisor and Senior Consultant of the GeoGPT Project

Prof Hans Thybo, Chair of the DDE Science Committee and President of the International Lithosphere Program

Dr Natarajan Ishwaran, Director, International Relations at Deep-time Digital Earth
