
FAQs

If your question is not answered here, please don’t hesitate to contact us: aqqua@geomar.de

Information on AqQua
What is a foundation model? A foundation model is a machine-learning model trained at scale, usually with self-supervised methods on broad, multimodal data, that can be adapted to carry out diverse downstream tasks (Bommasani et al. 2022). AqQua is a foundational model for plankton computer vision that will be trained using state-of-the-art vision transformers on billions of plankton images from diverse imaging devices. This model will be fine-tuned for the downstream tasks of plankton identification, classification, trait detection, outlier detection and global interpolation of plankton distribution.
Will the model require human validation? We will release the AqQua exploration tool that will enable visualization and clustering of embeddings. We will also support export to TSV formats supported by EcoTaxa, so users will be able to upload a subset of their data with predictions generated by the AqQua model for manual validation in EcoTaxa.
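As a rough illustration of what such an export could look like, here is a minimal sketch that writes model predictions to a tab-separated file. The column names and the second row of type markers follow the EcoTaxa TSV import convention as commonly documented, but they are assumptions here and should be checked against the current EcoTaxa documentation. The function name and the input keys (`image`, `object_id`, `label`) are hypothetical and are not part of any AqQua tool.

```python
import csv
import io

# Assumed EcoTaxa-style header: a row of column names followed by a row of
# type markers ([t] = text). Verify against the EcoTaxa import documentation.
COLUMNS = [
    "img_file_name",
    "object_id",
    "object_annotation_category",
    "object_annotation_status",
]
TYPES = ["[t]", "[t]", "[t]", "[t]"]


def predictions_to_tsv(predictions):
    """Serialize predictions (a list of dicts) to a TSV string.

    Each dict is expected to carry the hypothetical keys
    'image', 'object_id' and 'label'.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerow(TYPES)
    for p in predictions:
        # Mark every row as a model prediction awaiting manual validation.
        writer.writerow([p["image"], p["object_id"], p["label"], "predicted"])
    return buffer.getvalue()


if __name__ == "__main__":
    example = [{"image": "img_001.png", "object_id": "obj_001", "label": "Copepoda"}]
    print(predictions_to_tsv(example))
```

Marking rows as predicted rather than validated keeps the human validation step in EcoTaxa explicit.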
Image content differs a lot across instruments (modalities). How will AqQua deal with that? Is there any intent to leverage the diversity in the AqQua model? Size information of plankton should help in identifying plankton classes across multiple instruments. Data from different modalities will enhance the AqQua model as there is overlap and mutual information in images from different modalities. However, we need to face the challenge that the model should not merely learn to distinguish modalities but should learn meaningful, plankton-relevant features. Recent advances in ViTs that support multiple channels will also be leveraged.
Are you developing a common segmentation tool across instruments? That depends on what is meant by "segmentation". Many in-situ imaging platforms support segmentation of acquired frames into "regions of interest" (ROIs). The AqQua model will be trained for the downstream task of object segmentation on existing ROIs. While the AqQua model might generalize to the task of segmenting ROIs from full frames, this is not planned in the course of the project.
How will global interpolation work in detail? You can explicitly choose whether you would like to share your data for global interpolation studies within AqQua. We will then also need the volume sampled per image acquisition. We will use boosted regression trees and possibly other machine learning algorithms to learn the global plankton or particle distribution and associated process rates from the AqQua image data. Please see Drago et al. 2020 and Clements et al. (2022, 2023) for further details.
Is the AqQua project limited to Germany or is it an international project? AqQua is funded in Germany (see below) but aims for international collaboration. We are already supported by partners worldwide and are open to all contributors.
How is AqQua funded? AqQua is funded via the Helmholtz Foundation Model Initiative. It is a one-shot endeavour to collect the data and build the foundation model. The project is funded for three years.
Data Collection
What kind of data are you looking for? We’re gathering images of marine and freshwater zooplankton and phytoplankton. All kinds of labels/identification are welcome but optional, as we’re using self-supervised learning for training our foundational model, which does not require labels.
What are the minimum metadata that AqQua needs? We require the date and time of image acquisition, the latitude, longitude and altitude of sampling, and the pixel resolution of the instrument. Image data without these required fields cannot be ingested by the AqQua model.
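Before submitting, it can help to check your own records against these required fields. The following is a minimal sketch, not the AqQua ingestion code. The field names, the ISO 8601 timestamp format, and the longitude range of -180 to 180 degrees are assumptions made for illustration only, since the required quantities are listed above but no column names or units are prescribed.

```python
from datetime import datetime

# Hypothetical field names for the required quantities listed in the FAQ.
REQUIRED_FIELDS = ("datetime", "latitude", "longitude", "altitude", "pixel_resolution")


def _as_float(value):
    """Return value as float, or None if it cannot be converted."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def missing_or_invalid(record):
    """Return the names of required fields that are absent or implausible."""
    # First pass: fields that are missing or empty.
    problems = [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]

    # Second pass: plausibility checks on the fields that are present.
    if "datetime" not in problems:
        try:
            datetime.fromisoformat(str(record["datetime"]))  # assumes ISO 8601
        except ValueError:
            problems.append("datetime")

    bounds = {"latitude": (-90.0, 90.0), "longitude": (-180.0, 180.0)}
    for name, (low, high) in bounds.items():
        if name not in problems:
            value = _as_float(record[name])
            if value is None or not low <= value <= high:
                problems.append(name)

    if "altitude" not in problems and _as_float(record["altitude"]) is None:
        problems.append("altitude")

    if "pixel_resolution" not in problems:
        value = _as_float(record["pixel_resolution"])
        if value is None or value <= 0:
            problems.append("pixel_resolution")

    return problems


if __name__ == "__main__":
    record = {"datetime": "2024-06-01T12:30:00", "latitude": 54.3,
              "longitude": 10.1, "altitude": -15.0, "pixel_resolution": 3.2}
    print(missing_or_invalid(record))
```

An empty list means the record carries all required fields with plausible values. Negative altitudes are accepted, on the assumption that they denote sampling depth below the surface.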
Are microscopy images of value? What about pre-segmented microscopy images? Microscopy images are most welcome. We prefer segmented images for now. We might be able to generalize to ROI segmentation tasks in the future (see above) and will reach out for unsegmented frames then.
Are plankton image data from lakes of value? Yes, we welcome data from lakes and other fresh-water bodies. Please indicate the latitude, longitude and altitude at which these data were acquired.
What do I gain from sharing data with you? By sharing data with us for model development, you contribute to the diversity of the AqQua dataset and increase the chances that the developed model will be particularly useful for the kind of data that you are working with. Every data contributor will be co-author on a joint dataset paper and invited to contribute to further publications.
I have millions of images, do you want them all? Yes, we try to gather all existing plankton images, as the foundation model requires as much image data from diverse regions and imaging devices as possible.
What if my data is messy (e.g., Planktoscope with poor quality images)? Your data are still valuable, as messy data help models learn to handle noise.
My instrument outputs three copies of every image in different formats. Do I limit my data or filter it before sending it to you? We at AqQua would like to make the process of data contribution as simple as possible. Please send all your data and indicate the particulars of the formats in the data sharing form. We will filter out the necessary format (raw images) on our end.
What will happen to the data that is shared with you? We will build the AqQua Dataset by bringing together data from thousands of individual sources, a suite of different imaging devices, and diverse habitats. The AqQua Dataset will be published under an open-access license in July 2027 at the earliest. Every data contribution will be duly acknowledged and every data contributor will be co-author on a joint dataset paper. Using the AqQua Dataset, we will train a foundational model and fine-tune it for multiple downstream tasks, including classification, trait extraction, and global interpolation of plankton and particle distribution. The developed code, models, and tools will be made open source and shared with the plankton imaging community to help with plankton image recognition tasks and to support further method development. For example, this could include contributing a generalist image recognition model to EcoTaxa.
If new data appears after form submission, should I fill out a new form? Also, if I want to share more data from a different instrument, should I fill out the form again? We recommend filling out a new data sharing form in both of these cases, as this helps with tracking, licensing and attribution.
I have already sent you an excel sheet with my datasets. Do I need to submit the data sharing form in addition to sending the excel file by email? Yes, please fill out the data sharing form as this helps with tracking, licensing and attribution.
How fixed is the October 31st deadline? It is enough to fill out the data sharing form by the deadline (see the next FAQ as well).
Can new datasets be added after the October 31st deadline? Please list all datasets, even potential ones or those not ready for transfer, before the deadline. This will help us plan our project. The data transfer itself can be carried out until the end of 2025.
Are you only interested in data with validated annotations? No! Annotations are welcome but strictly optional as we’re using self-supervised learning for training our foundational model. This does not require labels.
In exactly what form will the data be made publicly accessible in the data release by July 2027? Data will be published in the "AqQua Data Format": Images (blosc2 inside LMDB) + Converted Metadata (Parquet) + Raw Metadata (Parquet). We will select a repository such as Zenodo, where the data can be downloaded for analysis. Additionally, we might make the "AqQua Data Exploration Tool" for the visualization of data available as a public service.
Will the "AqQua Data Exploration Tool" let people see the original image with its metadata (latitude, longitude, temperature, etc.) on an organism-by-organism level? Yes, the idea of this tool is to visualize the embedding space and to link each point to the individual image and its metadata.
How do multiple lab members get credited? You can use the comment field in the form to list additional contributors; the project team will follow up for necessary clarifications.
Although I am the contact person of a project, it is not my decision to make if the data can be shared. How do I proceed?

You don’t have to make the decision yourself. Check with the principal investigator, data owner, or other relevant stakeholders before proceeding. Then, let us know.
Also, if your data is hosted on EcoTaxa, please make sure that you are correctly listed as the contact person of a project. If not, select the correct person in the EcoTaxa project settings:

  • In the menu, select “Project / Edit project settings”.
  • In the “Privileges” tab, select the correct person as contact.
  • Click “Save”.

Data Transfer
How can I transfer my data? We support a number of different transfer methods. If you are unsure, please contact us and we will work together with you to determine the best option for your data. The optimal method depends largely on the size of the data. If the data is already externally accessible, you can just provide us with access to the existing location. Please inform us about your preferred method when submitting the data sharing form. The suggestions below are purely to support you in your choice; other options are always possible.
My data is larger than ~200GB. For such large datasets we recommend Globus or using an FTP server. Please contact your IT department to find out if your institute provides a Globus instance and for information on how to set up a data share. Once set up, it allows for easy upload and download of terabyte-scale datasets. If Globus is not available, we recommend using an FTP server. If you don't have one available, please contact us for access to our own FTP server. Transfers to an FTP server can be continued after an interruption without having to start from scratch. Sending us a physical hard drive is also a valid alternative.
My data is larger than ~20GB and smaller than ~200GB. For datasets of this size we suggest GigaMove. This service allows one to upload files of up to 100GB and share access via a simple link.
My data is smaller than ~20GB. For datasets of this size we suggest either using one of the options listed above or using a cloud-based storage system such as Google Drive, Dropbox or Nextcloud.
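The size-based suggestions above can be summarized as a small lookup. This is only an illustrative sketch: the function name is hypothetical, and the 20 GB and 200 GB cut-offs are the approximate values from this FAQ, treated here as hard thresholds for simplicity.

```python
def suggest_transfer_method(size_gb):
    """Map an approximate dataset size in GB to the transfer option suggested in the FAQ."""
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    if size_gb > 200:
        # Terabyte-scale data: resumable or institutional transfer services.
        return "Globus or FTP server (alternatively a physical hard drive)"
    if size_gb > 20:
        # Note: GigaMove accepts individual files of up to 100 GB.
        return "GigaMove"
    return "cloud storage (e.g. Nextcloud) or any of the options for larger data"


if __name__ == "__main__":
    for size in (5, 80, 1500):
        print(size, "GB ->", suggest_transfer_method(size))
```

The result is a suggestion only; any other method agreed with the AqQua team remains possible.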
My data is already on EcoTaxa, how can I share it with you? If your data is already on EcoTaxa, you can share it with us by simply adding the aqqua@geomar.de user with view permissions to your project. This will enable us to download your data. We will inform you once we have downloaded your data, so that you can revoke access, if you would like to.
I have many projects on EcoTaxa that I would like to share. Is there something quicker than adding the AqQua user manually? You can download these Python scripts that use the EcoTaxa API to access your EcoTaxa projects. There are two scripts. The first one generates a list of all your projects. You can use this list when submitting the data sharing form. The second script helps you to easily add the AqQua user to a subset of your projects. You can also change the access rights for multiple projects in bulk via the EcoTaxa API at any time.
How do I submit data with custom segmentation masks (e.g. on Squidle) and other complex metadata? Is it possible to use some other way than EcoTaxa? Please submit the data sharing form and mention the type of metadata in the comments. It is possible to submit data through channels other than EcoTaxa; the AqQua team will reach out to you and follow up on the transfer.