FAQs
If your question is not answered here, please don’t hesitate to contact us: aqqua@geomar.de
Information on AqQua
What is a foundation model?
A foundation model is a machine-learning model trained at scale, usually with self-supervised methods on broad, multimodal data, that can be adapted to carry out diverse downstream tasks (Bommasani et al. 2022). AqQua is a foundation model for plankton computer vision that will be trained using state-of-the-art vision transformers on billions of plankton images from diverse imaging devices. This model will be fine-tuned for the downstream tasks of plankton identification, classification, trait detection, outlier detection and global interpolation of plankton distribution.
Will the model require human validation?
We will release the AqQua exploration tool that will enable visualization and clustering of embeddings. We will also support export to TSV formats supported by EcoTaxa, so users will be able to upload a subset of their data with predictions generated by the AqQua model for manual validation in EcoTaxa.
Image content differs a lot across instruments (modalities). How will AqQua deal with that? Is there any intent to leverage the diversity in the AqQua model?
Size information should help in identifying plankton classes across multiple instruments. Data from different modalities will enhance the AqQua model, as images from different modalities overlap and share mutual information. However, we face the challenge of ensuring that the model does not merely distinguish modalities but learns plankton-relevant features. Recent advances in vision transformers (ViTs) that support multiple channels will also be leveraged.
Are you developing a common segmentation tool across instruments?
That depends on what is meant by "segmentation". Many in-situ imaging platforms support segmentation of acquired frames into "regions of interest" (ROIs). The AqQua model will be trained for the downstream task of object segmentation on existing ROIs. While the AqQua model might generalize to the task of segmenting ROIs from full frames, this is not planned in the course of the project.
How will global interpolation work in detail?
You can explicitly choose whether you would like to share your data for global interpolation studies within AqQua. We will then also need the volume sampled per image acquisition. We will use boosted regression trees and possibly other machine learning algorithms to learn the global plankton or particle distribution and associated process rates from the AqQua image data. Please see Drago et al. 2020 and Clements et al. (2022, 2023) for further details.
Is the AqQua project limited to Germany or is it an international project?
AqQua is funded in Germany (see below) but aims for international collaboration. We are already supported by partners worldwide and are open to all contributors.
How is AqQua funded?
AqQua is funded via the Helmholtz Foundation Model Initiative. It is a one-shot endeavour to collect the data and build the foundation model. The project is funded for three years.
Data Collection
What kind of data are you looking for?
We’re gathering images of marine and freshwater zooplankton and phytoplankton. All kinds of labels/identification are welcome but optional, as we’re using self-supervised learning for training our foundational model, which does not require labels.
What are the minimum metadata that AqQua needs?
We require the date and time of image acquisition, the latitude, longitude and altitude of sampling, and the pixel resolution of the instrument. Image data without these required fields cannot be ingested by the AqQua model.
Are microscopy images of value? What about pre-segmented microscopy images?
Microscopy images are most welcome. We prefer segmented images for now. We might be able to generalize to ROI segmentation tasks in the future (see above) and will reach out for unsegmented frames then.
Are plankton image data from lakes of value?
Yes, we welcome data from lakes and other freshwater bodies. Please indicate the latitude, longitude and altitude at which these data were acquired.
What do I gain from sharing data with you?
By sharing data with us for model development, you contribute to the diversity of the AqQua dataset and increase the chances that the developed model will be particularly useful for the kind of data that you are working with. Every data contributor will be co-author on a joint dataset paper and invited to contribute to further publications.
I have millions of images, do you want them all?
Yes, we try to gather all existing plankton images, as the foundation model requires as much image data from diverse regions and imaging devices as possible.
What if my data is messy (e.g., Planktoscope with poor quality images)?
Your data are still valuable, as messy data help models learn to handle noise.
My instrument outputs three copies of every image in different formats. Do I limit my data or filter it before sending it to you?
We at AqQua would like to make the process of data contribution as simple as possible. Please send all your data and indicate the particulars of the formats in the data sharing form. We will select the format we need (raw images) on our end.
What will happen to the data that is shared with you?
We will build the AqQua Dataset by bringing together data from thousands of individual sources, a suite of different imaging devices, and diverse habitats. The AqQua Dataset will be published under an open-access license in July 2027 at the earliest. Every data contribution will be duly acknowledged and every data contributor will be co-author on a joint dataset paper. Using the AqQua Dataset, we will train a foundational model and fine-tune it for multiple downstream tasks, including classification, trait extraction, and global interpolation of plankton and particle distribution. The developed code, models, and tools will be made open source and shared with the plankton imaging community to help with plankton image recognition tasks and to support further method development. For example, this could include contributing a generalist image recognition model to EcoTaxa.
If new data appears after form submission, should I fill out a new form? Also, if I want to share more data from a different instrument, should I fill out the form again?
We recommend filling out a new data sharing form in both of these cases, as this helps with tracking, licensing and attribution.
I have already sent you an Excel sheet with my datasets. Do I need to submit the data sharing form in addition to sending the Excel file by email?
Yes, please fill out the data sharing form as this helps with tracking, licensing and attribution.
How fixed is the October 31st deadline?
It is enough to fill out the data sharing form by the deadline (see also the next question).
Can new datasets be added after the October 31st deadline?
Please list all datasets, even potential ones or those not ready for transfer, before the deadline. This will help us plan our project. The data transfer itself can be carried out until the end of 2025.
Are you only interested in data with validated annotations?
No! Annotations are welcome but strictly optional, as we’re using self-supervised learning for training our foundational model. This does not require labels.
In exactly what form will the data be made publicly accessible in the data release by July 2027?
Data will be published in the "AqQua Data Format": Images (blosc2 inside LMDB) + Converted Metadata (Parquet) + Raw Metadata (Parquet). We will select a repository such as Zenodo, where the data can be downloaded for analysis. Additionally, we might make the "AqQua Data Exploration Tool" for the visualization of data available as a public service.
Will the "AqQua Data Exploration Tool" let people see the original image with its metadata (latitude, longitude, temperature, etc.) on an organism-by-organism level?
Yes, the idea of this tool is to visualize the embedding space and link each point to the individual image and its metadata.
How do multiple lab members get credited?
You can use the comment field in the form to list additional contributors; the project team will follow up for necessary clarifications.
Although I am the contact person of a project, it is not my decision to make if the data can be shared. How do I proceed?
You don’t have to make the decision yourself. Check with the principal investigator, data owner, or other relevant stakeholders before proceeding. Then, let us know.
Also, if your data is hosted on EcoTaxa, please make sure that you are correctly listed as the contact person of a project. If not, select the correct person in the EcoTaxa project settings:
- In the menu, select “Project / Edit project settings”.
- In the “Privileges” tab, select the correct person as contact.
- Click “Save”.
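Before submitting, contributors may want to check that each image record carries the minimum metadata listed under Data Collection (date and time of acquisition, latitude, longitude, altitude, and pixel resolution). The following is a minimal Python sketch of such a check, not an official AqQua tool. The field names (`acquired_at`, `latitude`, `longitude`, `altitude_m`, `pixel_size_um`), the units, and the ISO 8601 timestamp format are assumptions made for illustration; the FAQ does not specify a schema.

```python
from datetime import datetime

# Hypothetical field names and units: the FAQ names the required quantities
# but does not define column names or a file format.
REQUIRED_FIELDS = ("acquired_at", "latitude", "longitude", "altitude_m", "pixel_size_um")


def validate_record(record):
    """Return a list of problems; an empty list means the record looks complete."""
    # Report absent or empty fields first and stop there.
    missing = [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]
    if missing:
        return ["missing: " + name for name in missing]

    problems = []

    # Assumed format: ISO 8601. Note that a date-only string also parses,
    # so this does not prove that a time of day was recorded.
    try:
        datetime.fromisoformat(str(record["acquired_at"]))
    except ValueError:
        problems.append("acquired_at is not an ISO 8601 timestamp")

    # All remaining fields must be numeric.
    try:
        lat = float(record["latitude"])
        lon = float(record["longitude"])
        float(record["altitude_m"])  # negative values (depth) are allowed here
        pixel = float(record["pixel_size_um"])
    except (TypeError, ValueError):
        problems.append("latitude, longitude, altitude_m and pixel_size_um must be numeric")
        return problems

    if not -90.0 <= lat <= 90.0:
        problems.append("latitude out of range [-90, 90]")
    if not -180.0 <= lon <= 180.0:
        problems.append("longitude out of range [-180, 180]")
    if pixel <= 0:
        problems.append("pixel_size_um must be positive")
    return problems


example = {
    "acquired_at": "2024-05-17T10:30:00",
    "latitude": 54.33,
    "longitude": 10.18,
    "altitude_m": -15.0,
    "pixel_size_um": 2.5,
}
print(validate_record(example))  # prints []
```

Running a check like this over a metadata table before filling out the data sharing form makes it easier to spot records that lack a required field.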