Lexical Resource

Vector representations of English words and compounds


Word representations used in Dima (2019). The vectors were trained on the concatenation of encow14ax (https://corporafromtheweb.org/) and the Müller and Schütze (2015) version of English Wikipedia, about 9 billion words of text in total. The corpus was pre-processed for compounds: each compound from the en-comcom dataset was joined with an underscore and treated as a single word, e.g. 'police car' was rewritten as 'police_car'. The embeddings were trained with a minimum word frequency of 100, which yields a vocabulary of 424,014 words. The vocabulary words and their corpus frequencies are listed in the file 'glove_encow14ax_enwiki_9B.400k_min100.vocab'. The representations are available in four dimensionalities: 50, 100, 200 and 300. The embeddings were trained with GloVe for 15 iterations, using a symmetric 10-word window (10 words on each side, i.e. 20 words surrounding a given word). GloVe settings: MAX_ITER=15 WINDOW_SIZE=10 BINARY=0 NUM_THREADS=8 X_MAX=100
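The following is a minimal sketch of how vectors of this kind might be read and how a compound could be looked up. It assumes the vector files use the usual GloVe text layout (one token per line followed by space-separated floats), which the BINARY=0 setting suggests but this record does not state outright. The function names and the three-dimensional toy values are illustrative only and are not taken from the resource.

```python
import io
import math


def load_glove_text(handle):
    """Parse GloVe-style text vectors: a token, then its float components."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def compound_key(phrase):
    """Join the words of a compound with underscores: 'police car' -> 'police_car'."""
    return "_".join(phrase.split())


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy stand-in for a vector file; the numbers are made up for illustration.
SAMPLE = "police 0.1 0.2 0.3\ncar 0.0 0.5 0.1\npolice_car 0.1 0.4 0.2\n"

vectors = load_glove_text(io.StringIO(SAMPLE))
key = compound_key("police car")
similarity = cosine(vectors[key], vectors["car"])
```

To read one of the actual files, the in-memory sample can be replaced by a handle from open(path, encoding="utf-8"), where path points to a downloaded vector file. Keeping all 424,014 vectors as Python lists is memory-hungry at 300 dimensions, so an array library may be preferable for real use.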

2017-03-14

1

2c446ae2-7a9c-4f22-a2c4-0152288644f6

8cefa5dd-f5fb-4527-8acb-88cc6824eb48

50 dimensions

100 dimensions

200 dimensions

300 dimensions

No linked resources are available!