Try taking a picture of each of North America’s roughly 11,000 tree species, and you’ll have a mere fraction of the millions of photos within nature image datasets. These massive collections of snapshots—ranging from butterflies to humpback whales—are a great research tool for ecologists because they provide evidence of organisms’ unique behaviors, rare conditions, migration patterns, and responses to pollution and other forms of environmental change.
While comprehensive, nature image datasets aren’t yet as useful as they could be. It’s time-consuming to search these databases and retrieve the images most relevant to your hypothesis. You’d be better off with an automated research assistant—or perhaps AI systems called multimodal vision language models (VLMs). They’re trained on both text and images, making it easier for them to pinpoint finer details, like the specific trees in the background of a photo.
But just how well can VLMs assist nature researchers with image retrieval? A team from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), University College London, iNaturalist, University of Edinburgh, and UMass Amherst designed a performance test to find out. Each VLM’s task: Locate and reorganize the most relevant results within the team’s “INQUIRE” dataset, composed of 5 million wildlife pictures and 250 search prompts from ecologists and other biodiversity experts.
In these evaluations, the researchers found that larger, more advanced VLMs, which are trained on far more data, can sometimes get researchers the results they want to see. The models performed reasonably well on straightforward queries about visual content, like identifying debris on a reef, but struggled significantly with queries requiring expert knowledge, like identifying specific biological conditions or behaviors. For example, VLMs fairly easily uncovered examples of jellyfish on the beach, but struggled with more technical prompts like “axanthism in a green frog,” a condition that limits a frog’s ability to produce yellow pigment in its skin.
Their findings, in an article now posted to the arXiv preprint server, indicate that the models need much more domain-specific training data to process difficult queries. MIT CSAIL Ph.D. student Edward Vendrow, who co-led work on the dataset, believes that with training on more informative data, VLMs could one day be great research assistants. “We want to build retrieval systems that find the exact results scientists seek when monitoring biodiversity and analyzing climate change,” says Vendrow.
“Multimodal models don’t quite understand more complex scientific language yet, but we believe that INQUIRE will be an important benchmark for tracking how they improve in comprehending scientific terminology and ultimately helping researchers automatically find the exact images they need.”
The team’s experiments illustrated that larger models tended to be more effective for both simpler and more intricate searches due to their expansive training data. They first used the INQUIRE dataset to test if VLMs could narrow a pool of 5 million images to the top 100 most relevant results (also known as “ranking”). For straightforward search queries like “a reef with manmade structures and debris,” relatively large models like “SigLIP” found matching images, while smaller CLIP models struggled. According to Vendrow, larger VLMs are “only starting to be useful” at ranking tougher queries.
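To make “ranking” concrete, here is a minimal sketch of how a CLIP-style model scores images against a text query: both are embedded in a shared space and the images closest to the query embedding are kept. This uses an off-the-shelf open_clip checkpoint and illustrative helper names, not the paper’s exact pipeline.

```python
# Sketch of CLIP-style zero-shot text-to-image ranking (illustrative, not the
# paper's exact setup): embed query and images, keep the top-k most similar.
import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.to(device).eval()

def embed_images(paths, batch_size=256):
    """Encode images into L2-normalized embedding vectors."""
    feats = []
    with torch.no_grad():
        for i in range(0, len(paths), batch_size):
            batch = torch.stack(
                [preprocess(Image.open(p).convert("RGB")) for p in paths[i:i + batch_size]]
            ).to(device)
            f = model.encode_image(batch)
            feats.append(f / f.norm(dim=-1, keepdim=True))
    return torch.cat(feats)

def rank(query, image_paths, k=100):
    """Return the k image paths whose embeddings best match the query text."""
    with torch.no_grad():
        tokens = tokenizer([query]).to(device)
        t = model.encode_text(tokens)
        t = t / t.norm(dim=-1, keepdim=True)
    image_feats = embed_images(image_paths)
    scores = (image_feats @ t.T).squeeze(1)  # cosine similarity per image
    top = scores.topk(min(k, len(image_paths))).indices
    return [image_paths[i] for i in top.tolist()]

# e.g. rank("a reef with manmade structures and debris", all_paths, k=100)
```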
Vendrow and his colleagues also evaluated how well multimodal models could rerank those 100 results, reorganizing which images were most pertinent to a search. In these tests, even large multimodal models trained on more curated data, such as GPT-4o, struggled: GPT-4o’s precision score was only 59.6%, and that was still the highest score achieved by any model.
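For intuition, reranking amounts to asking a stronger multimodal model to re-score the 100 candidates, and the result can then be compared against the annotators’ relevance labels. The sketch below is hypothetical: `vlm_relevance_score` stands in for whatever model interface is used, and the average-precision formula shown is one common convention, not necessarily the paper’s exact metric.

```python
# Hedged sketch of reranking plus one way to score it against annotator labels.
def rerank(query, candidates, vlm_relevance_score):
    """Sort top-100 candidates by a VLM's relevance score, highest first.

    `vlm_relevance_score(query, image_path) -> float` is a placeholder for a
    call to a multimodal model; its prompt and interface are not specified here.
    """
    scored = [(vlm_relevance_score(query, path), path) for path in candidates]
    return [path for _, path in sorted(scored, key=lambda x: x[0], reverse=True)]

def average_precision(ranked_paths, relevant):
    """Average precision of a ranked list against a set of relevant images."""
    hits, ap = 0, 0.0
    for i, path in enumerate(ranked_paths, start=1):
        if path in relevant:
            hits += 1
            ap += hits / i
    denom = min(len(relevant), len(ranked_paths)) or 1
    return ap / denom
```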
The researchers will present these results at the Conference on Neural Information Processing Systems (NeurIPS 2024) in December.
Inquiring for INQUIRE
The INQUIRE dataset includes search queries based on discussions with ecologists, biologists, oceanographers, and other experts about the types of images they’d look for, including animals’ unique physical conditions and behaviors. A team of annotators then spent 180 hours searching the iNaturalist dataset with these prompts, carefully combing through roughly 200,000 results to label 33,000 matches that fit the prompts.
For instance, the annotators used queries like “a hermit crab using plastic waste as its shell” and “a California condor tagged with a green ‘26’” to identify the subsets of the larger image dataset that depict these specific, rare events.
Then, the researchers used the same search queries to see how well VLMs could retrieve iNaturalist images. The annotators’ labels revealed when the models struggled to understand scientists’ keywords, as their results included images previously tagged as irrelevant to the search. For example, VLMs’ results for “redwood trees with fire scars” sometimes included images of trees without any markings.
“This is a careful curation of data, with a focus on capturing real examples of scientific inquiries across research areas in ecology and environmental science,” says Sara Beery, MIT Homer A. Burnell Career Development Assistant Professor, CSAIL principal investigator, and co-senior author.
“It’s proved vital to expanding our understanding of the current capabilities of VLMs in these potentially impactful scientific settings. It has also outlined gaps in current research that we can now work to address, particularly for complex compositional queries, technical terminology, and the fine-grained, subtle differences that delineate categories of interest for our collaborators.”
“Our findings imply that some vision models are already precise enough to aid wildlife scientists with retrieving some images, but many tasks are still too difficult for even the largest, best-performing models,” says Vendrow. “Although INQUIRE is focused on ecology and biodiversity monitoring, the wide variety of its queries means that VLMs that perform well on INQUIRE are likely to excel at analyzing large image collections in other observation-intensive fields.”
Taking their project further, the researchers are working with iNaturalist to develop a query system to better help scientists and other curious minds find the images they actually want to see. Their working demo allows users to filter searches by species, enabling quicker discovery of relevant results like, say, the diverse eye colors of cats.
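The details of that query system aren’t described here, but the basic idea of narrowing by species metadata before running a text search can be sketched as follows. The record schema (“path”, “species”) is hypothetical, and `rank_fn` can be any query-versus-images ranking function, such as the CLIP sketch earlier.

```python
# Hedged sketch of a species pre-filter ahead of text-to-image ranking.
def filtered_search(query, records, species, rank_fn, k=100):
    """Restrict candidates to one species, then rank them against the query."""
    candidates = [r["path"] for r in records if r["species"] == species]
    return rank_fn(query, candidates, k=k)

# e.g. filtered_search("a cat with two different eye colors", records,
#                      species="Felis catus", rank_fn=rank)
```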
Vendrow and co-lead author Omiros Pantazis, who recently received his Ph.D. from University College London, also aim to improve the reranking system by augmenting current models to provide better results.
University of Pittsburgh Associate Professor Justin Kitzes highlights INQUIRE’s ability to uncover secondary data. “Biodiversity data sets are rapidly becoming too large for any individual scientist to review,” says Kitzes, who wasn’t involved in the research. “This paper draws attention to a difficult and unsolved problem, which is how to effectively search through such data with questions that go beyond simply ‘who is here’ to ask instead about individual characteristics, behavior, and species interactions.
“Being able to efficiently and accurately uncover these more complex phenomena in biodiversity image data will be critical to fundamental science and real-world impacts in ecology and conservation.”
Serge Belongie, Pioneer Center for Artificial Intelligence Director and University of Copenhagen professor, notes that INQUIRE reveals the current limits of multimodal models in understanding scientists’ search queries. “This work is both a huge step forward in our understanding of multimodal models for scientific inquiry, and a sobering, inspiring reminder of just how difficult the text-to-image retrieval task remains when the details matter,” says Belongie, who wasn’t involved in the paper.
More information: Edward Vendrow et al, INQUIRE: A Natural World Text-to-Image Retrieval Benchmark, arXiv (2024). DOI: 10.48550/arXiv.2411.02537
Provided by Massachusetts Institute of Technology