Lexical Resource

Vector representations of German words and compounds

eng Vector representations of German words and compounds

eng Word representations used in Dima(2015), Dima (2019). The vectors were generated from the decow14ax corpus (https://corporafromtheweb.org/), ~10 billion words of raw text. Corpus pre-processing: words lowercased, punctuation removed, each number was replaced by the string 'NUMBER'. Embeddings trained using a minimum word frequency of 100, leading to a vocabulary 1,029,270 words. The vocabulary file 'decow14ax_all_min_100.vocab' contains these word representations and their frequency in the support corpus. 'decow14ax_full.vocab' contains the full vocabulary generated for the corpus (no cut-off). The embeddings were trained with GloVe, for 15 iterations, using a 10-word symmetric window of text (20 words surrounding a particular word). The files are suffixed with the dimensionality of the vector representations: 50 dimensional, 100 dimensional, 200 dimensional and 300 dimensional. MAX_ITER=15 WINDOW_SIZE=10 BINARY=0 NUM_THREADS=8 X_MAX=100

2017-03-14

1

3b2f7fe4-2081-47af-aeeb-0f822a262770

8cefa5dd-f5fb-4527-8acb-88cc6824eb48

50 dimensions

100 dimensions

200 dimensions

300 dimensions

No linked resources are available!
No linked resources are available!