eng Vector representations of German words and compounds
eng Word representations used in Dima(2015), Dima (2019). The vectors were generated from the decow14ax corpus (https://corporafromtheweb.org/), ~10 billion words of raw text. Corpus pre-processing: words lowercased, punctuation removed, each number was replaced by the string 'NUMBER'. Embeddings trained using a minimum word frequency of 100, leading to a vocabulary 1,029,270 words. The vocabulary file 'decow14ax_all_min_100.vocab' contains these word representations and their frequency in the support corpus. 'decow14ax_full.vocab' contains the full vocabulary generated for the corpus (no cut-off). The embeddings were trained with GloVe, for 15 iterations, using a 10-word symmetric window of text (20 words surrounding a particular word). The files are suffixed with the dimensionality of the vector representations: 50 dimensional, 100 dimensional, 200 dimensional and 300 dimensional. MAX_ITER=15 WINDOW_SIZE=10 BINARY=0 NUM_THREADS=8 X_MAX=100
2017-03-14
1
3b2f7fe4-2081-47af-aeeb-0f822a262770
8cefa5dd-f5fb-4527-8acb-88cc6824eb48
50 dimensions
100 dimensions
200 dimensions
300 dimensions