Vector representations of German words and compounds

Lexical resource

Title

eng Vector representations of German words and compounds

Resource_description

eng Word representations used in Dima (2015) and Dima (2019). The vectors were generated from the decow14ax corpus (https://corporafromtheweb.org/), ~10 billion words of raw text. Corpus pre-processing: words were lowercased, punctuation was removed, and each number was replaced by the string 'NUMBER'. The embeddings were trained with a minimum word frequency of 100, which yields a vocabulary of 1,029,270 words. The vocabulary file 'decow14ax_all_min_100.vocab' contains these words and their frequencies in the support corpus. 'decow14ax_full.vocab' contains the full vocabulary generated for the corpus (no cut-off). The embeddings were trained with GloVe for 15 iterations, using a symmetric 10-word window (10 words on each side of the target word, i.e. up to 20 context words). The files are suffixed with the dimensionality of the vector representations: 50, 100, 200 and 300 dimensions. GloVe training parameters: MAX_ITER=15 WINDOW_SIZE=10 BINARY=0 NUM_THREADS=8 X_MAX=100
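To look up a word in these embeddings, query text probably needs the same normalisation as the training corpus. The following is a minimal sketch of the three pre-processing steps named above (lowercasing, punctuation removal, number replacement). The whitespace tokenisation, the ASCII-only punctuation set and the number pattern are assumptions made for illustration; the record does not specify how the original pipeline handled these details, and the function name `preprocess` is hypothetical.

```python
import re
import string

# Assumption: a "number" is a run of digits, optionally with '.' or ','
# separators (e.g. 2015, 3,5, 1.000.000). The original pattern is not documented.
_NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")

# Assumption: only ASCII punctuation is removed.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(line: str) -> list[str]:
    """Lowercase, drop punctuation, and map numbers to the string 'NUMBER'."""
    tokens = []
    for raw in line.lower().split():
        # Strip surrounding punctuation first so that "2015," counts as a number.
        core = raw.strip(string.punctuation)
        if not core:
            continue  # the token consisted of punctuation only
        if _NUMBER_RE.fullmatch(core):
            tokens.append("NUMBER")
            continue
        cleaned = core.translate(_PUNCT_TABLE)
        if cleaned:
            tokens.append(cleaned)
    return tokens
```

For example, `preprocess("Im Jahr 2015, kosteten 3,5 Kilo Äpfel 10 Euro!")` yields the tokens `im jahr NUMBER kosteten NUMBER kilo äpfel NUMBER euro`.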

Md_id

https://doi.org/10.57754/FDAT.fx84s-dxe33

Md_timestamp

2017-03-14

Lc_version

Tech_landing_page

https://doi.org/10.57754/FDAT.fx84s-dxe33

entityId

3b2f7fe4-2081-47af-aeeb-0f822a262770

sourceId

8cefa5dd-f5fb-4527-8acb-88cc6824eb48

Lex_size

50 dimensions

100 dimensions

200 dimensions

300 dimensions
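
Since the description lists BINARY=0, the vector files are presumably in GloVe's plain-text format: one word per line, followed by its space-separated vector components. The sketch below reads such a file into a dictionary and checks that each line has the expected dimensionality (50, 100, 200 or 300). The function names `load_glove_text` and `cosine` are hypothetical helpers, and the in-memory sample stands in for a real file, whose exact name is not given in this record.

```python
import io
import math


def load_glove_text(handle, expected_dim=None):
    """Read GloVe text-format vectors from an open text handle.

    Returns a dict mapping each word to a list of floats. If expected_dim is
    given, a line with a different number of components raises ValueError.
    """
    vectors = {}
    for line_no, line in enumerate(handle, start=1):
        parts = line.split()
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        word, values = parts[0], parts[1:]
        if expected_dim is not None and len(values) != expected_dim:
            raise ValueError(
                f"line {line_no}: expected {expected_dim} values, got {len(values)}"
            )
        vectors[word] = [float(v) for v in values]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 3-dimensional sample in the assumed format (not real data from the resource).
sample = io.StringIO("haus 0.1 0.2 0.3\nhaustür 0.1 0.2 0.4\n")
vectors = load_glove_text(sample, expected_dim=3)
similarity = cosine(vectors["haus"], vectors["haustür"])
```

With a real file, the handle would come from something like `open(path, encoding="utf-8")`, where `path` points to one of the downloaded vector files and `expected_dim` matches the dimensionality in its suffix. Loading roughly one million 300-dimensional vectors into Python lists is memory-hungry, so an array library may be preferable in practice.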

No linked resources are available.