Dutch Word Associations


This is the homepage for the Dutch Word Association Database. Use the Firefox web browser for optimal viewing.

News

The current online version contains 8995 cues and 2.85 million responses. This work would not have been possible without the kind collaboration of thousands of volunteers who participated in the study. One way of thanking them is by making these data available not only to the scientific community but to anyone who is interested in word associations, language and memory. We hope this website will contribute to this goal.

Most of the functionality described here is self-explanatory and is easiest to understand by trying it out. Simply submit a word on the appropriate page and see what happens. We are open to any requests or comments, so don't hesitate to get in touch (see info page). This website is a continuous work in progress, so please check back for updates. For this and other reasons it is best to contact us in advance if you want to use any of these results for scientific purposes. We hope to submit a new manuscript about these data by the end of 2010, at which point most of the new data will be made publicly available.

September 2010

Word Association Experiment

[Figure: network for the word 'taal' (language)]
[message for Dutch speakers]
We are still looking for many more participants for our study on word associations.
Go to http://www.kuleuven.be/lsa to take part online. In this short experiment we ask you to generate word associations for about 18 words. Anyone can take part, regardless of age, and you may participate more than once. Feel free to forward this link to other people who can help.

Lookups

From here you can look up the most frequent responses to a cue (see top menu, Cue Lookup) and the 10 most frequent cues triggering a certain association (see top menu, Association Lookup). In addition, you can find some basic statistical information on the most frequent responses in the database and generate visualizations of small networks.

Synonyms

Apart from looking up the direct responses for a given cue word, it is also interesting to calculate a measure of semantic relatedness by considering the distributional overlap of the association responses of two words. This is also useful for finding the near neighbors ('synonyms') of certain words. This way we can look at relationships even when two words are not directly associated.

Network Centrality

In many natural networks, certain nodes are more central than others. There are many possible ways to define node centrality. Central nodes could be those that have many incoming and outgoing links. More advanced techniques for deriving the importance of nodes in a network, such as the PageRank and HITS algorithms, go a step further and also consider how important the neighboring nodes of a specific node are, so that centrality is defined recursively.

Visualizations

I am experimenting with some visualizations based on the Prefuse Flare visualisation toolkit released by Jeffrey Heer from Stanford University. The visualizations are dynamic: you can drag a network around by holding down the mouse button. By hovering over a node you can see its incoming and outgoing edges, its part of speech and its frequency.
[Figures: strawberry network; cue network; cue network detail. Illustrations of word association networks]

Additional Info

General information about the goal of the project, related publications and general statistics of the project, as well as downloadable data, are available by clicking the info tab in the menu.
Copyright © Simon De Deyne. All rights reserved. Last modified: 28/09/2010 22:46:22.

Cue Lookup

Look up the ten most frequently generated associations for any cue in the list of 8995 cue words that are currently present in the database. In addition to the list of most frequent associations, the association frequency is also presented.

Note that the participants provided three different associations for each cue they were presented with. Apart from a global count that does not consider the position of the association response, the counts for the first, second and third response are shown in separate tables.

F: association frequencies summed over the first, second and third response
F1, F2, F3: association frequency for the first, second, or third response
FS: forward strength, the probability that a cue word produces an association
BS: backward strength, the probability that an association produces its cue word
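
As a rough illustration of how forward and backward strength relate to the raw counts, the following Python sketch computes both from a list of (cue, response) pairs. The function names and the toy data are hypothetical and not part of the database. It also assumes that backward strength is simply the forward strength in the opposite direction, which is only defined when the association was itself presented as a cue.

```python
from collections import Counter, defaultdict


def strength_table(pairs):
    """Map each cue to {response: proportion of that cue's responses}."""
    counts = defaultdict(Counter)
    for cue, response in pairs:
        counts[cue][response] += 1
    table = {}
    for cue, responses in counts.items():
        total = sum(responses.values())
        table[cue] = {resp: n / total for resp, n in responses.items()}
    return table


def forward_strength(table, cue, response):
    """Estimated probability that `cue` produces `response` (0.0 if never observed)."""
    return table.get(cue, {}).get(response, 0.0)


def backward_strength(table, cue, response):
    """Assumption: BS is the forward strength from `response` back to `cue`."""
    return forward_strength(table, response, cue)


# Hypothetical toy data: (cue, response) pairs.
pairs = [
    ("roos", "rood"), ("roos", "rood"), ("roos", "bloem"), ("roos", "doorn"),
    ("rood", "kleur"), ("rood", "kleur"), ("rood", "roos"), ("rood", "bloed"),
]
table = strength_table(pairs)
fs = forward_strength(table, "roos", "rood")   # 2 of the 4 responses to 'roos'
bs = backward_strength(table, "roos", "rood")  # 1 of the 4 responses to 'rood'
```

In this toy example 'roos' elicits 'rood' more often than 'rood' elicits 'roos', so the forward strength is larger than the backward strength.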

Association Lookup

The database of word associations contains over 100,000 different association responses. With association lookup you can search for the cues that most often elicited a certain association. In addition to the list of most frequent cues, the cue frequency is also presented.

Note that the participants provided three different associations for each cue they were presented with. Apart from a global count that does not consider the position of the association response, the counts for the first, second and third response are shown in separate tables.

F: cue frequencies summed over the first, second and third response
F1, F2, F3: cue frequency for the first, second, or third response
FS: forward strength, the probability that an association word produces a cue
BS: backward strength, the probability that a cue produces its association word

    

Synonyms

Look up near-neighbor cues (synonyms) based on the distributional overlap between the association responses of the cue word and those of any other cue. Note that the results go beyond direct associations and might therefore differ from the direct associates (also see cue lookup). If no match is found, consider entering a different word form.

To reduce frequency artefacts in calculating distributional overlap, a frequency transformation is often applied. The resulting weighted distributions are used to obtain a cosine match, which is an index of similarity between words. For every word the ten most similar words are presented. The following frequency transformations are available:
log2: log-transformed frequency counts
PMI: pointwise mutual information transformed frequency counts
t-score: t-score transformed frequency counts
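
The procedure described above can be sketched in Python as follows. This is a minimal illustration and not the code behind this page: the function names and toy counts are hypothetical, and it assumes a positive PMI weighting (negative values are dropped), which is one common variant of the PMI transformation.

```python
import math
from collections import Counter


def ppmi_vectors(counts):
    """Turn {cue: Counter(response -> frequency)} into PMI-weighted sparse vectors.

    Assumption: only positive PMI values are kept.
    """
    grand_total = sum(sum(resp.values()) for resp in counts.values())
    response_totals = Counter()
    for resp in counts.values():
        response_totals.update(resp)
    vectors = {}
    for cue, resp in counts.items():
        cue_total = sum(resp.values())
        vector = {}
        for word, n in resp.items():
            pmi = math.log2(n * grand_total / (cue_total * response_totals[word]))
            if pmi > 0:
                vector[word] = pmi
        vectors[cue] = vector
    return vectors


def cosine(u, v):
    """Cosine between two sparse vectors stored as dicts."""
    dot = sum(weight * v[word] for word, weight in u.items() if word in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbors(vectors, cue, k=10):
    """Return the k cues most similar to `cue`, most similar first."""
    scores = [(other, cosine(vectors[cue], vec))
              for other, vec in vectors.items() if other != cue]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical toy counts: response frequencies per cue.
counts = {
    "zee": Counter({"water": 5, "strand": 3, "zout": 2}),
    "oceaan": Counter({"water": 4, "groot": 2, "zout": 2}),
    "auto": Counter({"rijden": 6, "wiel": 3, "groot": 1}),
}
vectors = ppmi_vectors(counts)
neighbors = nearest_neighbors(vectors, "zee")
```

Here 'zee' and 'oceaan' come out as near neighbors because they share the responses 'water' and 'zout', even though neither word appears as a response to the other.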

Network Centrality

This page is currently under construction. Results are preliminary.

In a word association network, some words are more important than others. This difference in centrality can be expressed by the number of different associates a word has (outdegree) or the number of different cues for which it is given as an association response (indegree). Apart from outdegree and indegree, more complicated measures have been formulated to measure how important a node is in a network. Some of these measures, like PageRank and HITS, are currently used to measure how central certain pages are on the World Wide Web.

Below you can look up the centrality values for words and take a look at the twenty most important nodes in the network according to these measures.

If no match is found, consider entering a different word form.

Top 20 most central words

Various Network Centrality Measurements
PageRank  inDegree  outDegree  Betweenness  Authority  Hub       CC
zon       hangen    mooi       werk         fontein    water     ouderwets
water     lichaam   kinderen   zwart        kraan      zee       vroeger
warm      beroep    werk       pijn         dorst      eten      moeilijk
eten      leven     eten       eten         roeien     lekker    oud
geld      vast      leuk       saai         pomp       warm      gevaarlijk
lekker    werken    school     mooi         gieten     boot      kinderen
zee       brengen   kind       wit          zwemmen    vis       klein
mooi      werk      pijn       rood         vis        zon       film
pijn      aandacht  moeilijk   school       fles       groen     zwart
leuk      nadenken  water      vervelend    nat        regen     groot
rood      streven   rood       vrouw        dolfijn    rood      wit
groen     wennen    man        kind         dweilen    geld      werk
vakantie  pijn      warm       leuk         emmer      zomer     vervelend
school    populair  geld       geld         drinken    koud      man
liefde    studeren  wit        klein        zee        wit       negatief
koud      geluk     vervelend  groot        gieter     vakantie  rood
wit       lelijk    vrouw      vies         strand     nat       slecht
zomer     missen    zwart      lelijk       bron       mooi      nodig
gezellig  blijven   oud        groen        vloeistof  auto      jeugd
auto      spelen    vakantie   moeilijk     oceaan     strand    lelijk
Centrality measures provide indices of how important or central a node is to a network. Importance could be calculated by simply counting how many times a certain word is mentioned as a response. Other measures from network theory are also given below. Note that all centrality indices are calculated on a cue-by-cue adjacency matrix. This corresponds to a graph in which each node represents a cue. The following centrality measures are available:
PageRank: the importance of a node depending on the importance of its neighbors, similar to the Google implementation
inDegree: the number of different incoming links
outDegree: the number of different outgoing links
authority: HITS authority score
hub: HITS hub score
betweenness: the degree to which a node occurs along possible paths
CC: clustering coefficient
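
To make the first three measures more concrete, the Python sketch below computes inDegree, outDegree and a basic PageRank for a small directed graph. It is only an illustration under simplifying assumptions: the edge list is made up, the edges are unweighted (association frequencies are ignored), and the damping factor of 0.85 and the fixed number of iterations are conventional choices rather than the settings used for this page.

```python
def degrees(edges):
    """Count the distinct incoming and outgoing links of every node."""
    nodes = {node for edge in edges for node in edge}
    in_degree = {node: 0 for node in nodes}
    out_degree = {node: 0 for node in nodes}
    for source, target in set(edges):
        out_degree[source] += 1
        in_degree[target] += 1
    return in_degree, out_degree


def pagerank(edges, damping=0.85, iterations=100):
    """Unweighted PageRank by power iteration.

    Nodes without outgoing links spread their rank evenly over all nodes.
    """
    nodes = sorted({node for edge in edges for node in edge})
    successors = {node: set() for node in nodes}
    for source, target in edges:
        successors[source].add(target)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[node] for node in nodes if not successors[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        for source in nodes:
            if successors[source]:
                share = damping * rank[source] / len(successors[source])
                for target in successors[source]:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical cue-by-cue edges (cue -> response, both of them cues).
edges = [
    ("zee", "water"), ("kraan", "water"), ("dorst", "water"),
    ("water", "zee"), ("zee", "strand"), ("strand", "zee"),
]
in_degree, out_degree = degrees(edges)
rank = pagerank(edges)
most_central = max(rank, key=rank.get)
```

In this toy graph 'water' has the highest inDegree, but 'zee' receives the highest PageRank because it collects all of the rank passed on by 'water' and 'strand'. This shows how the two measures can disagree.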

Visualize a cue in the association database.

You can drag the network by holding down the left mouse button inside the network and moving the mouse. Hovering the mouse cursor over a node displays additional information about its incoming and outgoing edges. To pan, click and hold on the background of the visualization. Sink nodes (words never presented as cues) are shown as squares. Only associations with an association frequency larger than 7 are shown.
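
The two display rules mentioned above (the frequency threshold and the sink nodes) can be sketched in Python as follows. The function names and toy frequencies are hypothetical. Only the threshold of 7 comes from the description above.

```python
def visible_edges(frequencies, threshold=7):
    """Keep only associations whose frequency is larger than the threshold."""
    return {pair: n for pair, n in frequencies.items() if n > threshold}


def sink_nodes(edges, cues):
    """Responses that were never presented as cues (drawn as squares)."""
    return {response for _cue, response in edges if response not in cues}


# Hypothetical toy data: (cue, response) -> association frequency.
frequencies = {
    ("aardbei", "rood"): 40,
    ("aardbei", "fruit"): 25,
    ("aardbei", "slagroom"): 8,
    ("aardbei", "veld"): 7,
    ("rood", "kleur"): 30,
}
cues = {"aardbei", "rood", "fruit", "kleur"}
shown = visible_edges(frequencies)
squares = sink_nodes(shown, cues)
```

With these toy values the edge to 'veld' is hidden because its frequency is exactly 7, and 'slagroom' is the only sink node among the edges that remain.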

Goal

The average Dutch university student knows about 45,000 words. This lexicon can also be characterized by the relationships between these words. One way of doing this is by using the metaphor of a semantic associative network. Our aim is to collect word association data to study the properties of human semantic knowledge. We use a snowball approach to gradually expand the set of cues. We will gather data from 100 persons for each cue. When all this data is collected, we select the most frequent association responses that are not yet in our set and present them as cues in a second round.
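
The selection step of this snowball approach can be illustrated with a short Python sketch. The function name, the toy data and the number of new cues per round are hypothetical. The only thing taken from the description above is the idea of picking the most frequent responses that are not yet cues.

```python
from collections import Counter


def next_round_cues(pairs, current_cues, k):
    """Return the k most frequent responses that are not yet in the cue set."""
    counts = Counter(response for _cue, response in pairs
                     if response not in current_cues)
    return [word for word, _count in counts.most_common(k)]


# Hypothetical toy data from a first round with two cues.
pairs = (
    [("zee", "water")] * 3
    + [("zee", "strand")] * 2
    + [("zon", "warm")]
    + [("zon", "zee")] * 2
)
current_cues = {"zee", "zon"}
new_cues = next_round_cues(pairs, current_cues, k=2)
```

In this example 'zee' is skipped although it is a frequent response, because it is already a cue. The words 'water' and 'strand' would be added for the second round.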

Background

This project is part of our research into semantic networks and concept representation at the Research Group on Concepts and Categories in Leuven, Belgium, in collaboration with Prof. Gert Storms. There's also a Dutch text that gives a general overview of the word association project. This project originally started in 2003 with the collection of associations for a set of 400 concrete concepts (see Ruts et al., 2004 for a description). Over the years this set was gradually expanded to a stimulus set of about 1,400 words. The results at this point were reported in De Deyne and Storms (2008a). We hope to make the latest version of the dataset available to other scientists in the beginning of 2010.

Participants. At the end of September 2010, a total of 58,335 persons had participated in the study. Some participant statistics are shown below. These graphs show that most participants were young, which is not surprising since many of them were university students. In addition, the majority of the volunteers were female and came from the Flemish-speaking part of Belgium.
[Figures: age distribution of the participants; gender distribution of the participants; distribution of the Dutch variants spoken by the participants]

Current Projects

Evolving Networks. The core assumption of distributional approaches to lexico-semantics is that the meaning of a word is shaped by the way it is used in language. This would imply that the meaning of words and the way we use them change during the course of our life. Word associations can be used to unravel how this meaning changes over time. This can be done by comparing the association responses of young and older participants.

Small World Networks. Word associations also offer us a method to investigate how our memory works in general. When word associations are represented as a network, we notice that words are not connected at random. Instead, the network shows a specific topology similar to that of other growing networks. These networks are referred to as small world networks because any two nodes in the network are separated by a small number of steps. Just like social networks, where two arbitrary persons are only a few handshakes away from each other, two random words in the association network are on average separated by 4 associations. Such a network structure might explain why some words are retrieved quite easily, while others require more effort.
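
The average separation between two words can be estimated with a breadth-first search from every node. The Python sketch below is a minimal illustration on a made-up network. It assumes that associations are treated as undirected links and that only pairs of words that can reach each other are counted. Neither assumption is confirmed as the method behind the figure of 4 associations quoted above.

```python
from collections import deque


def bfs_distances(adjacency, start):
    """Number of steps from `start` to every reachable node."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


def average_path_length(edges):
    """Mean shortest-path length over all connected pairs of distinct nodes."""
    adjacency = {}
    for a, b in edges:
        # Assumption: links are treated as undirected.
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    total = 0
    pairs = 0
    for node in adjacency:
        for other, steps in bfs_distances(adjacency, node).items():
            if other != node:
                total += steps
                pairs += 1
    return total / pairs if pairs else 0.0


# Hypothetical toy network: a chain of four words.
edges = [("roos", "rood"), ("rood", "bloed"), ("bloed", "wond")]
mean_steps = average_path_length(edges)
```

For a network of realistic size, the same computation would normally be done with a dedicated graph library, since a search from every node becomes slow as the network grows.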

Missing Links. When people generate associations, there are occasions where a certain response is generated by almost all participants. When presented with the word rose, most people will respond red. It is however likely that the semantic network around rose contains other information as well. The goal of predicting missing links is to find which weak links are lacking from the network and to investigate the role of these weak links.

Cross-Cultural Comparisons. Finally, we are also interested in studying word associations in different cultures. These studies allow us to compare which words are central in other languages and where two languages converge or differ. One of the topics of interest is comparing how the meaning of certain responses differs between Dutch speakers from Flanders and Dutch speakers from the Netherlands.

Articles

De Deyne, S., & Storms, G. (2008a). Word associations: Norms for 1,424 Dutch words in a continuous task. Behavior Research Methods, 40, 198-205.

De Deyne, S., & Storms, G. (2008b). Word associations: Network and semantic properties. Behavior Research Methods, 40, 213-231.

Ruts, W., De Deyne, S., Ameel, E., Vanpaemel, W., Verbeemen, T., & Storms, G. (2004). Dutch norm data for 13 semantic categories and 338 exemplars. Behavior Research Methods, Instruments, & Computers, 36, 506-515.

Applications

The word associations are currently used as part of the Dutch Womima association website. It allows you to explore personal association networks in a unique way and offers many different types of visualizations.

Downloads

The results reported in De Deyne and Storms (2008a) are available for download. The comma-separated text files contain word association data for 1424 stimulus words, collected between 2003 and 2006. For each cue, at least 83 participants generated three different associations.

raw data [2 MB]: contains the cue and three responses as they were generated by each participant.
count data [1 MB]: contains for each cue the frequency of each unique first, second and third association.

Press

For a nice introduction to the world of association networks, you might want to read Berthold Maris' article for the NRC and De Standaard. This article explains some of the properties of these networks, such as the small-world structure.
A different article appeared in the Dutch periodical Onze Taal. It focusses on specific semantic relationships found in the network and on differences between groups of people. Finally, as a response to this article, an interesting blog post further discusses the differences between the associations of men and women.

Contact

My personal research page can be accessed here. Comments, questions or suggestions are welcome. simon.dedeyne[at]psy[dot]kuleuven[dot]be