Big Question 1

The nature of the mental lexicon: How to bridge neurobiology and psycholinguistic theory by computational modelling?

This Big Question addresses how to use computational modelling to link levels of description, from neurons to cognition and behaviour, in understanding the language system. The focus is on the mental lexicon, and the aim is to characterize its structure in a way that is precise and meaningful in neurobiological and (psycho)linguistic terms. The overarching goal is to devise causal/explanatory models of the mental lexicon that can explain neural and behavioural data. This will significantly deepen our understanding of the neural, cognitive, and functional properties of the mental lexicon, lexical access, and lexical acquisition.

The BQ1 team takes advantage of recent progress in modelling realistic neural networks, improvements in neuroimaging techniques and data analysis, and developments in accounting for the semantic, syntactic, and phonological properties of words and other items stored in the mental lexicon. Using one common notation (high-dimensional numerical vectors), neurobiological and computational (psycho)linguistic models of the mental lexicon are integrated, and methods are developed for comparing model predictions to large-scale neuroimaging data.

BQ1 thus comprises three main research strands, focusing respectively on models of lexical representation, models of neural processing, and methods for bridging between model predictions and neural data. It is taken into account that lexical items rarely occur in isolation but form parts of (and are interpreted in the context of) sentences and discourse. Moreover, the BQ1 team refrains from prior assumptions about what the lexical items are: lexical items do not need to be equivalent to words but may be smaller or larger units.

Thus, the Big Question is tackled from three directions:
(i) by investigating which vector representations of items in the mental lexicon are appropriate to encode their linguistically salient (semantic, combinatorial, and phonological) properties;
(ii) by developing neural processing models of access to, and development of, the mental lexicon; and
(iii) by designing novel evaluation methods and accessing appropriate data for linking the models to neuroimaging and behavioural data.

The BQ1 endeavour is inherently interdisciplinary in that it applies computational research methods to explain neural, behavioural, and linguistic empirical phenomena. One of its main innovative aspects consists of bringing together neurobiology, psycholinguistics, and linguistic theory (roughly corresponding to different levels of description of the language system) using a single mathematical formalism; a feat that requires extensive interdisciplinary team collaboration. Thus, BQ1 integrates questions of a Linguistic, Psychological, Neuroscientific, and Data-analytic nature.

People involved

Steering group

Dr. Stefan Frank
Coordinator BQ1
Tenure track researcher
Profile page

Dr. Jelle Zuidema
Tenure track researcher
Profile page

Prof. dr. Marcel van Gerven
Profile page

Dr. Hartmut Fitz
Profile page

Team Members


Prof. dr. Rens Bod
Profile page

Prof. dr. Mirjam Ernestus
Profile page

Dr. Raquel Fernández
Profile page

Prof. dr. Peter Hagoort
Programme Director
PI / Coordinator BQ2
Profile page

Dr. Karl-Magnus Petersson
Profile page


Dr. Jakub Szymanik
Profile page

Prof. dr. Robert van Rooij
Profile page

Dr. Tamar Johnson
Profile page

PhD Candidates

Samira Abnar
PhD Candidate
Profile page

Marianne de Heer Kloots
PhD Candidate
Profile page

Alessio Quaresima
PhD Candidate
Profile page


Dr. Luca Ambrogioni
Dr. Julia Berezutskaya
Dr. Louis ten Bosch
Dr. Renato Duarte
Dr. Umut Güçlü
Prof. dr. Abigail Morrison
Dr. David Neville
Dr. Roel Willems


Lisa Beinborn – Postdoc
Hartmut Fitz – Postdoc
Dieuwke Hupkes – PhD
Alessandro Lopopolo – PhD
Danny Merkx – PhD
Joe Rodd – PhD
Chara Tsoukala – PhD
Marvin Uhlmann – PhD

Research Highlights (2021)

Highlight 1

Modelling word learning and recognition using visually grounded speech

Team members:  Danny Merkx, Sebastiaan Scholten, Stefan Frank, Mirjam Ernestus, and Odette Scharenborg

Computational models of speech recognition often assume that the set of target words is already given. This implies that these models do not learn to recognise speech from scratch without prior knowledge and explicit supervision. Visually grounded speech (VGS) models learn to recognise speech without prior knowledge by exploiting statistical dependencies between spoken and visual input. While it has previously been shown that VGS models learn to recognise the presence of words in the input, we explicitly investigate such a model as a model of human speech recognition.

We investigated the time-course of word recognition as simulated by the model using a gating paradigm (Figure 1) to test whether its recognition is affected by well-known word-competition effects in human speech processing. We furthermore investigated whether vector quantisation, a technique for discrete representation learning, aids the model in the discovery and recognition of words.

Figure 1. Model architecture: the model consists of two branches with the image decoder depicted on the left and the caption decoder on the right. The audio features consist of 13 MFCCs with first- and second-order derivatives, by t frames. Each LSTM hidden state h_t has 1024 features, which are concatenated for the forward and backward LSTM into 2048-dimensional hidden states. Vectorial attention weighs and sums the hidden states, resulting in the caption embedding. The linear projection in the image branch maps the image features to the same 2048-dimensional space as the caption embedding. Finally, we calculate the cosine similarity between the image and caption embedding.
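The last steps described in the caption (attention pooling over hidden states, a linear image projection, and cosine similarity) can be illustrated with a minimal NumPy sketch. This is not the model's actual code: the attention is written in one common form (a per-feature softmax over time), bias terms are omitted, all weights are random placeholders, and the dimensions are shrunk from 2048 to 8 for readability. The names `vectorial_attention`, `w1`, `w2`, and `image_projection` are hypothetical.

```python
import numpy as np

def vectorial_attention(hidden, w1, w2):
    """Pool hidden states of shape (t, d) into one d-dimensional embedding.

    Assumed form: scores = tanh(hidden @ w1) @ w2, followed by a softmax
    over the time axis computed separately for every feature dimension.
    """
    scores = np.tanh(hidden @ w1) @ w2             # (t, d)
    scores = scores - scores.max(axis=0)           # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=0)        # each column sums to 1
    return (weights * hidden).sum(axis=0)          # (d,)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy dimensions (hypothetical): 20 frames, 8-dimensional hidden states,
# 6-dimensional image features. Random values stand in for trained ones.
rng = np.random.default_rng(0)
t, d, d_img = 20, 8, 6
hidden_states = rng.normal(size=(t, d))    # stand-in for bi-LSTM outputs
w1 = rng.normal(size=(d, d))
w2 = rng.normal(size=(d, d))
image_features = rng.normal(size=d_img)
image_projection = rng.normal(size=(d_img, d))

caption_embedding = vectorial_attention(hidden_states, w1, w2)
image_embedding = image_features @ image_projection   # same d-dim space
similarity = cosine_similarity(image_embedding, caption_embedding)
print(similarity)
```

With trained weights, a matching image and caption would be expected to score higher than a mismatched pair; with the random placeholders used here the value carries no meaning.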

We investigated whether VGS models learn to discover and recognise words from natural speech. Our results show that our model learns to recognise nouns. To a lesser extent, the model is capable of recognising verbs, but future research should look to the image recognition side of the model to further improve this. Our model even learned to encode meaningful sub-lexical information, enabling it to interpret the visual difference signalled by the plural morphology. Contrary to what we expected based on previous research, our results show no evidence that vector quantisation aids in the discovery and recognition of words in speech. Importantly, we investigated the cognitive plausibility of the model by testing whether word competition influences our model’s word recognition performance, as we know happens in humans. We have shown that two well-known measures of word competition predict word recognition in our model and found evidence in favour of a disputed interaction between word count and neighbourhood density found in human word recognition.

Taking inspiration from human learning processes, our research has shown that using multiple streams of sensory information allows our model to discover and recognise words without any prior linguistic information from a relatively small dataset of scenes and spoken descriptions. We think that using realistic and naturally occurring input is important for creating speech recognition models that are more cognitively plausible, and that visual grounding is an important step in that direction.

The current work combines insights from research on speech recognition, artificial intelligence, and psycholinguistics. Earlier models of speech recognition have a prior lexicon. Hence, they assume that words are already known and are unable to demonstrate how the lexicon may have come about. We have shown that it is (to some extent) possible to learn the meanings of individual words from associating images and their full-sentence spoken descriptions. These results provide a deeper understanding of the interaction between different levels of linguistic representation (from phonetics to lexical semantics) as well as between different cognitive modalities (speech and vision). This work could not have been completed without close collaboration between experts in speech recognition, artificial intelligence (AI), and psycholinguistics.

Highlight 2

Cosine Contours: A multipurpose representation for melodies

Team members: Bas Cornelissen, Jelle Zuidema, and John Ashley Burgoyne

In our search for universals in the musics of the world (which we ultimately would like to compare and relate to language universals), we found ourselves in need of a new representation of pitch contours: one that is mathematically well-behaved and allows comparisons between many different musics. In this project, we propose such a representation: Cosine Contours.

Here’s an interesting observation. Suppose you take a lot of melodies and measure the pitch height at, say, 50 points in time, equally spaced between the start and end of the melody. All your melodies are now sequences of 50 pitches: a 50-dimensional dataset. The principal components of this dataset will probably look like cosine functions (see Figure 2). In this project, we demonstrate that the principal components of melodies (from motifs and phrases to complete songs) tend to be cosine-shaped. The same is true for artificial random-walk melodies, which suggests that the cosines are a kind of mathematical artefact. The culprit turns out to be the covariance matrix: it roughly approximates a so-called Toeplitz matrix, and such matrices (asymptotically) have sinusoidal eigenvectors, which are our principal components.
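The random-walk observation can be checked with a small NumPy experiment. The sketch below is an illustration under stated assumptions rather than the procedure used in the paper: the "melodies" are Gaussian random walks of 50 points, each walk is centred on its own mean pitch before the analysis, and the names `dct_basis` and `principal_components` are hypothetical helpers.

```python
import numpy as np

def dct_basis(n_points, n_components):
    """Unit-norm cosine functions cos(pi * k * (n + 0.5) / N), k = 1, 2, ..."""
    n = np.arange(n_points)
    k = np.arange(1, n_components + 1)[:, None]
    basis = np.cos(np.pi * k * (n + 0.5) / n_points)
    return basis / np.linalg.norm(basis, axis=1, keepdims=True)

def principal_components(data, n_components):
    """Leading eigenvectors of the sample covariance matrix, one per row."""
    centred = data - data.mean(axis=0)
    cov = centred.T @ centred / (len(data) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order].T

# Assumption: 2000 artificial "melodies", each a 50-step Gaussian random walk.
rng = np.random.default_rng(0)
walks = np.cumsum(rng.normal(size=(2000, 50)), axis=1)
# Assumption: remove each walk's own mean so only the contour shape remains.
walks = walks - walks.mean(axis=1, keepdims=True)

pcs = principal_components(walks, 3)
cosines = dct_basis(50, 3)

# Absolute cosine similarity between each component and the matching cosine
# (absolute, because the sign of an eigenvector is arbitrary).
match = np.abs(np.sum(pcs * cosines, axis=1))
print(match)
```

Values of `match` close to 1 indicate that the leading principal components of these artificial contours are nearly identical to cosine functions of increasing frequency.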

Figure 2. Outline of the work discussed in Cornelissen, Zuidema, & Burgoyne (2021).

Now suppose we want to efficiently describe the contour of melodies, capturing as much of the variability in as few dimensions as possible. A principal component projection would be the best choice, but our observations show that we might as well use cosines instead of the principal components, that is, use a discrete cosine transform. This leads us to propose cosine contours, which are simply the discrete cosine transforms of the melodies (pitch sequences). In the published paper, we present three short case studies to illustrate their practical usefulness.
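As a rough illustration of the idea, the sketch below resamples a melody at 50 equally spaced points and takes the first few coefficients of an orthonormal discrete cosine transform (DCT-II). It is a minimal sketch, not the published implementation: the toy melody, the use of MIDI note numbers, the linear interpolation, and the function names `resample_contour`, `dct_matrix`, `cosine_contour`, and `reconstruct` are all assumptions made for the example.

```python
import numpy as np

def resample_contour(times, pitches, n_points=50):
    """Sample pitch height at n_points equally spaced times (linear interp.)."""
    grid = np.linspace(times[0], times[-1], n_points)
    return np.interp(grid, times, pitches)

def dct_matrix(n_points):
    """Orthonormal DCT-II matrix; row k is the k-th cosine basis function."""
    n = np.arange(n_points)
    k = n[:, None]
    m = np.sqrt(2.0 / n_points) * np.cos(np.pi * k * (n + 0.5) / n_points)
    m[0] /= np.sqrt(2.0)            # rescale the constant row to unit norm
    return m

def cosine_contour(contour, n_coefficients):
    """First n_coefficients DCT coefficients of a pitch contour."""
    return dct_matrix(len(contour))[:n_coefficients] @ contour

def reconstruct(coefficients, n_points):
    """Approximate contour rebuilt from a truncated set of coefficients."""
    return dct_matrix(n_points)[:len(coefficients)].T @ coefficients

# Hypothetical arch-shaped melody: note onsets (in beats) and MIDI pitches.
times = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
pitches = np.array([60.0, 64.0, 67.0, 64.0, 60.0])

contour = resample_contour(times, pitches)          # 50-dimensional
coefficients = cosine_contour(contour, 5)           # 5-dimensional summary
approximation = reconstruct(coefficients, len(contour))
print(coefficients)
```

Because the transform is orthonormal, keeping all 50 coefficients reproduces the contour exactly, while keeping only the first few yields a smoothed approximation of its overall shape.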

The present study combines insights from signal processing, AI, comparative linguistics, and ethnomusicology. It illustrates a new, simple, well-defined, and useful representation, using known ingredients in an innovative combination. This work steps a bit outside the usual themes of Language in Interaction, but will ultimately help answer one of its key questions: how universal tendencies can arise from the interactions between many users of a culturally transmitted system.