Ahmad Mustafa Anis

I'm a Deep Learning Computer Vision Engineer at Roll.ai and an AI Research Fellow at the Fatima Fellowship, advised by Dr. Wei Peng of Stanford University. I completed my Bachelor's in Computer Science at International Islamic University, Pakistan, advised by Dr. Muhammad Nadeem.

I was an AI Fellow in the 14th batch of the PI School of AI, with a scholarship worth €12,500.

I serve as a community lead (for the Geo-Regional Asia and ML-Maths subgroups) at Cohere For AI, led by Sara Hooker. In this role, I've hosted over 50 researchers from Asia to present their work at Cohere For AI and completed a study group on the NYU Deep Learning course by Yann LeCun.

See my resume here (updated Mar 2025).

Email  /  Twitter  /  Google Scholar  /  Hugging Face  /  Github  /  Medium  /  Linkedin

profile photo

Research

I am interested in deep learning broadly and read literature from different fields. In particular, I try to follow research in the following areas:

  • Improving Reasoning in Vision-Language Models
    • Visual Question Answering (VQA)
      • How well can a model reason over an image or video and answer questions? How can we extend this to multi-cultural and multi-lingual settings while maintaining alignment?
    • Image Generation
      • How well can a model reason about a textual caption and generate a corresponding image? How can we extend this to multi-cultural and multi-lingual settings while maintaining alignment?
  • Self-Supervised Learning
    • How can we learn good representations from huge amounts of data that can be used for downstream tasks without much additional effort?
  • World Models
    • How can we build models that can plan and reason in an environment?

News

  • [Dec, 2024] Our work "Bridging the Data Provenance Gap Across Text, Speech and Video" has been accepted at ICLR 2025.
  • [Sep, 2024] Starting as a Pre-Doctoral AI Research Fellow at the Fatima Fellowship, advised by Dr. Wei Peng, Research Scientist at Stanford University.
  • [July, 2024] Our work "Consent in Crisis: The Rapid Decline of the AI Data Commons" has been accepted at NeurIPS 2024.
  • [June, 2024] Serving as NLP Lead for the Bytewise Fellowship.
  • [April, 2024] Accepted into the Oxford Machine Learning Summer School (OxML), MLX Fundamentals.
  • [April, 2024] Started working as a Deep Learning Computer Vision Engineer at Roll.ai.
  • [Dec, 2023] Accepted as an AI Fellow in the 14th batch of the PI School of AI, with a scholarship worth €12,500.
  • [Dec, 2023] Started as an ML-Maths Community Lead at Cohere For AI, led by Sara Hooker.
  • [June, 2023] Served as Urdu Language Ambassador for AYA. Contributed 3 datasets and led data crowdsourcing.
  • [Mar, 2023] Serving as Data Science Lead for the Bytewise Fellowship.
  • [Feb, 2023] Started as an Asian Community Lead at Cohere For AI, led by Sara Hooker. See our sessions here.
  • [Aug, 2022] Graduated from IIUI with a Bachelor's in Computer Science.
  • [April, 2022] Started as a Machine Learning Engineer at Redbuffer.ai.
  • [Dec, 2021] Started as a Software Engineer (Deep Learning and Computer Vision) at Wortel.ai.
  • [July, 2021] Started as a Deep Learning and Computer Vision Intern at Wortel.ai.
  • [Sep, 2018] Started my undergraduate studies in CS at IIUI.

Publications

    VLM Limitations
    On the Limitations of Vision Language Models in Understanding Image Transforms
    Ahmad Mustafa Anis, Hasnain Ali, Saquib Sarfraz
    Preprint
    This paper investigates the image-level understanding of VLMs, specifically CLIP by OpenAI and SigLIP by Google. Our findings reveal that these models lack an understanding of several image-level transformations and augmentations.
    Bridging the Gap
    Bridging the Data Provenance Gap Across Text, Speech and Video
    S Longpre, N Singh, [8 authors], Ahmad Mustafa Anis, et al.
    ICLR 2025
    Consent in Crisis
    Consent in Crisis: The Rapid Decline of the AI Data Commons
    S Longpre, R Mahari, [13 authors], Ahmad Mustafa Anis, et al.
    NeurIPS 2024

Favourite Papers

List of papers I really admire (and hope to do similarly impactful work).

  • CLIP, abs
  • World Models, abs
  • JEPA, abs
  • SimCLR, abs
  • Vision Transformers Need Registers, abs
  • PaLI-3, abs
  • Learning by Distilling Context, abs
  • SigLIP, abs
  • GILL: Generating Images with Multimodal Language Models, abs

  • Website code from Jon Barron's website.