Vector representations of English words and compounds
Word representations used in Dima (2019). The vectors were generated from the concatenation of encow14ax (https://corporafromtheweb.org/) and the English Wikipedia (the Müller and Schütze, 2015 version), roughly 9 billion words of text. The corpus was also pre-processed for compounds: the compounds from the en-comcom dataset were joined with an underscore and treated as a single word, e.g. 'police car' was rewritten to 'police_car'.

The embeddings were trained with a minimum word frequency of 100, resulting in a vocabulary of 424,014 words. The vocabulary words and their corpus frequencies can be found in the file 'glove_encow14ax_enwiki_9B.400k_min100.vocab'. Word representations are provided in four dimensionalities: 50, 100, 200, and 300 dimensions (a loading sketch follows the dimension list below).

The embeddings were trained with GloVe for 15 iterations, using a 10-word symmetric context window (20 words surrounding a given word). Training parameters: MAX_ITER=15, WINDOW_SIZE=10, BINARY=0, NUM_THREADS=8, X_MAX=100.
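The compound pre-processing step is not distributed with the vectors; the following is a minimal sketch of the rewriting described above, assuming the en-comcom compounds are available as plain two-word strings (the actual list format and tooling are not specified here).

import re

def link_compounds(text, compounds):
    # Rewrite each known compound as a single underscore-joined token,
    # e.g. 'police car' -> 'police_car'.
    for compound in compounds:
        joined = compound.replace(" ", "_")
        text = re.sub(r"\b" + re.escape(compound) + r"\b", joined, text)
    return text

# Hypothetical input; the real compound list comes from the en-comcom dataset.
compounds = ["police car", "apple tree"]
print(link_compounds("a police car parked by an apple tree", compounds))
# -> a police_car parked by an apple_tree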
2017-03-14
50 dimensions
100 dimensions
200 dimensions
300 dimensions
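A minimal sketch for reading the distributed files, assuming the .vocab file holds one 'word frequency' pair per line and the vectors are in GloVe's plain-text format (one word followed by its float components per line, as produced with BINARY=0); the vector file name used below is only an illustration.

import numpy as np

def load_vocab(path):
    # The .vocab file is assumed to contain one 'word frequency' pair per line.
    vocab = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, freq = line.split()
            vocab[word] = int(freq)
    return vocab

def load_vectors(path):
    # GloVe plain-text output: one word followed by its float components per line.
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vectors

# The .vocab name below is taken from the description above; the vector file
# name is hypothetical.
# vocab = load_vocab("glove_encow14ax_enwiki_9B.400k_min100.vocab")
# vectors = load_vectors("glove_encow14ax_enwiki_9B.300d.txt")
# print(vectors["police_car"])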