Discovering Hidden Visual Concepts Beyond Linguistic Input in Infant Learning

CVPR 2025
¹Nanyang Technological University  ²The Max Planck Institute for Psycholinguistics

Motivation

Computational Model

Computational Model (CVCL)[1]

A multimodal model trained on an infant's daily experience.

Biological Infant

Sequential visual and language development of infants.

✨ Our Research Question

When a multimodal model is trained on an infant's daily experience, will its vision encoder develop visual representations that go beyond its linguistic input, as a real infant does?

Key Contributions

1

Visual Development Beyond Text

The computational model's visual representations develop beyond its linguistic input, similar to real infants.

2

Single Neurons Can Classify Images in a Training-Free Manner

Leveraging explainability techniques, neurons in the learned representation yield a strong training-free classifier.

3

Broadly Trained Models vs. the Infant Model

CLIP and ResNet have richer high-level representations than the infant model, while their low-level representations are similar.

1. Visual Development Beyond Text

Key Finding

The infant model's visual representations develop beyond its linguistic input, similar to real infants. Through 'neuron labeling', we identify specific neurons inside the infant model's vision encoder that encode visual concepts not present in the model's training vocabulary.

We leverage CLIP-Dissect [2] to 'label' each neuron with a corresponding semantic concept, thereby discovering many visual concepts. These concepts are categorized as in-vocabulary or out-of-vocabulary based on whether or not they appear in the infant's linguistic input. Many out-of-vocabulary visual concepts emerge in specific neurons, demonstrating visual learning beyond linguistic supervision.
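Below is a minimal sketch of this neuron-labeling step under simplifying assumptions: each neuron is scored against each candidate concept by correlating the neuron's activation profile over a probe image set with that concept's CLIP image-text similarity profile (CLIP-Dissect supports several scoring functions, e.g. soft-WPMI; plain correlation is used here for brevity). The tensor names and the vocabulary split below are illustrative placeholders, not our released implementation.

import torch

def label_neurons(neuron_acts: torch.Tensor,   # [num_probe_images, num_neurons]
                  clip_sims: torch.Tensor,     # [num_probe_images, num_concepts]
                  concept_words: list[str]) -> list[str]:
    """Assign each neuron the concept whose probe-image profile best matches its activations."""
    # Standardize each profile over the probe images (rows)
    acts = (neuron_acts - neuron_acts.mean(0)) / (neuron_acts.std(0) + 1e-8)
    sims = (clip_sims - clip_sims.mean(0)) / (clip_sims.std(0) + 1e-8)
    # Correlation between every (neuron, concept) pair of profiles
    scores = acts.T @ sims / acts.shape[0]       # [num_neurons, num_concepts]
    best = scores.argmax(dim=1)                  # best-matching concept per neuron
    return [concept_words[int(i)] for i in best]

def split_by_vocabulary(labels: list[str], training_vocab: set[str]):
    # Concepts that appear in the infant model's training transcripts vs. never mentioned
    in_vocab = [w for w in labels if w in training_vocab]
    out_of_vocab = [w for w in labels if w not in training_vocab]
    return in_vocab, out_of_vocab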

CLIP-Dissect Method

Neuron labeling of the vision encoder (e.g., this neuron is sensitive to the 'rug' visual concept).

In-vocabulary vs Out-of-vocabulary Concepts

Examples of in-vocabulary and out-of-vocabulary concepts.

AoA Ranking of Discovered Concepts in Infant Model's Representation

From a cognitive perspective, we rate concepts using Age of Acquisition (AoA) [3], which estimates when children typically acquire each concept (e.g., simple concepts like "ball" are acquired earlier, while more complex concepts like "calculator" are acquired later).
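As a rough illustration (not the analysis script used in the paper), the snippet below loads the Kuperman et al. norms from a CSV and sorts discovered concept words from earliest to latest acquired; the column names "Word" and "Rating.Mean" follow the published norms but may differ across file versions.

import csv

def load_aoa_ratings(path: str) -> dict[str, float]:
    """Load word -> mean AoA rating from the Kuperman et al. (2012) norms CSV."""
    ratings: dict[str, float] = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            try:
                ratings[row["Word"].lower()] = float(row["Rating.Mean"])
            except (KeyError, ValueError):
                continue  # skip rows without a numeric rating
    return ratings

def rank_by_aoa(concepts: list[str], ratings: dict[str, float]) -> list[tuple[str, float]]:
    # Keep only concepts that have an AoA rating, sorted from earliest- to latest-acquired
    rated = [(c, ratings[c.lower()]) for c in concepts if c.lower() in ratings]
    return sorted(rated, key=lambda item: item[1])

# e.g. rank_by_aoa(["calculator", "ball"], ratings) places "ball" before "calculator"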

AoA Analysis

Complex patterns in out-of-vocabulary concepts suggest visual learning beyond text supervision.

2. Training-Free Neuron-Based Classification

Key Finding

We leverage neurons discovered in the infant model's representation to build a training-free classification framework that significantly improves classification performance. Our approach demonstrates that single neurons can effectively classify images without additional training, revealing the rich representational structure the infant model has learned through its discovered visual concepts.

Training-Free Classification Framework

Our training-free neuron-based classification framework leverages discovered visual concepts to enable image classification without additional training.

Neuron classifier

Overview of our training-free neuron-based classification framework.
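The sketch below captures the core idea under an assumed interface: given a mapping from class names to the indices of neurons labeled with those concepts (obtained from neuron labeling), an image is assigned to the class whose neurons respond most strongly. It is an illustration of the principle rather than the full NeuronClassifier pipeline.

import torch

def neuron_classify(layer_acts: torch.Tensor,
                    class_to_neurons: dict[str, list[int]]) -> list[str]:
    """layer_acts: [batch, num_neurons] pooled activations of the chosen layer."""
    class_names = list(class_to_neurons)
    # Score each class by the mean activation of the neurons labeled with its concept
    scores = torch.stack(
        [layer_acts[:, class_to_neurons[name]].mean(dim=1) for name in class_names],
        dim=1,
    )                                            # [batch, num_classes]
    preds = scores.argmax(dim=1)                 # no training: neurons act as class detectors
    return [class_names[int(i)] for i in preds]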

Classification Performance Results

Our NeuronClassifier framework leverages neurons discovered in the representation to enable broader recognition capabilities, particularly demonstrating significant improvements in CVCL performance (in bold).

Table of NeuronClassifier results.

“Vanilla” refers to classification based on image-text similarity. “❌” denotes cases where direct classification is not possible due to a missing text encoder or the need for fine-tuning. “All” represents the combined performance on both in-vocabulary and out-of-vocabulary classes.

3. Representation Analysis

Key Finding

Our analysis reveals that general representations (CLIP, ImageNet models) exhibit significant differences from the infant model (CVCL) in the final layers (abstract concepts) while sharing similar representations in lower layers (simple patterns).

CKA Similarity Analysis

We first employ Centered Kernel Alignment (CKA) [4] to measure layer-wise representational similarity between different models. Using the ImageNet validation set as input, we quantitatively analyze how similar the learned representations are across different architectures. The results show that CVCL exhibits similarity to CLIP-RN50 and ResNeXt50 in the shallow layers (lower-level features) but diverges significantly in the final layer (higher-level features). Notably, Layer 4 of CVCL shows very low similarity to all layers of both common models, suggesting unique high-level representations in the infant model.
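For reference, linear CKA between two layers' feature matrices can be computed as below; X and Y stand for pooled activations of, say, a CVCL layer and a CLIP-RN50 layer on the same ImageNet validation images (a sketch, not our exact evaluation code).

import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> float:
    """Linear CKA between features X: [n, d1] and Y: [n, d2] computed on the same n inputs."""
    X = X - X.mean(dim=0, keepdim=True)   # center each feature dimension
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (Y.T @ X).norm() ** 2          # ||Y^T X||_F^2
    return (hsic / ((X.T @ X).norm() * (Y.T @ Y).norm())).item()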

CKA Matrix Analysis

CKA similarity matrices comparing CVCL with CLIP-RN50 and ResNeXt50 models.

Net Dissect Analysis

Next, we apply Net Dissect [5] to quantify the emergence of interpretable visual concepts across different layers. The number of unique concepts indicates the richness of the encoded representation. Our analysis reveals that neurons in deeper layers capture increasingly complex concepts: early layers primarily detect lower-level features such as color and texture, while higher-level concepts such as objects and scenes emerge in deeper layers. CVCL exhibits fewer visual concepts than the ImageNet model, especially higher-level ones (e.g., objects and scenes).
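The per-layer concept counts can be tallied from Network Dissection output roughly as sketched below; the dissection data structure and the IoU threshold here are illustrative assumptions rather than the exact settings of our analysis.

from collections import defaultdict

def count_unique_concepts(dissection: dict[str, list[tuple[str, str, float]]],
                          iou_threshold: float = 0.04) -> dict[str, dict[str, int]]:
    """dissection: layer name -> list of (concept, category, IoU) tuples, one per neuron."""
    summary: dict[str, dict[str, int]] = {}
    for layer, units in dissection.items():
        per_category: dict[str, set[str]] = defaultdict(set)
        for concept, category, iou in units:
            if iou > iou_threshold:       # keep only units well aligned with a concept
                per_category[category].add(concept)
        # Number of unique concepts per category (color, texture, object, scene, ...)
        summary[layer] = {cat: len(names) for cat, names in per_category.items()}
    return summary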

Net Dissect Analysis

Net Dissect Analysis showing concept emergence across different layers and models.

BibTeX

🥳 Thank you for your interest in our work! For more details, please check our paper. If our work is helpful, please consider citing the following:
@InProceedings{Ke_2025_CVPR,
    author    = {Ke, Xueyi and Tsutsui, Satoshi and Zhang, Yayun and Wen, Bihan},
    title     = {Discovering Hidden Visual Concepts Beyond Linguistic Input in Infant Learning},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {4343-4352}
}

References

[1] Wai Keen Vong et al. "Grounded language acquisition through the eyes and ears of a single child." Science 383, 504–511 (2024). DOI: 10.1126/science.adi1374

[2] Tuomas Oikarinen and Tsui-Wei Weng. "CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks." ICLR 2023 (Spotlight).

[3] V. Kuperman, H. Stadthagen-Gonzalez, and M. Brysbaert. "Age-of-acquisition ratings for 30,000 English words." Behavior Research Methods 44, 978–990 (2012).

[4] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. "Similarity of Neural Network Representations Revisited." Proceedings of the International Conference on Machine Learning (ICML), pages 3519–3529, 2019.

[5] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. "Network Dissection: Quantifying Interpretability of Deep Visual Representations." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
