A Central Asian Language Survey
Dataset posted on 17.06.2016 by Philippe Mennecier, John Nerbonne, Evelyne Heyer, Franz Manni
We have documented language varieties (either Turkic or Indo-European) spoken at 23 test sites by 88 informants belonging to the major ethnic groups of Kyrgyzstan, Tajikistan and Uzbekistan (Karakalpaks, Kazakhs, Kyrgyz, Tajiks, Uzbeks, Yaghnobis). The recorded linguistic material covers 176 words of the extended Swadesh list and will be made publicly available with the publication of this paper. Phonological diversity is measured by the Levenshtein distance and displayed as a consensus bootstrap tree and as multidimensional scaling plots. Linguistic contact is measured as the number of borrowings from one linguistic family into the other, identified by a precision/recall analysis and further validated by expert judgment. For the Turkic languages, the results from our sample do not support treating Kazakh and Karakalpak as distinct languages, and they indicate the existence of several separate Karakalpak varieties. Kyrgyz and Uzbek, on the other hand, appear quite homogeneous. Among the Indo-Iranian languages, the distinction between Tajik and Yaghnobi varieties is very clear-cut. More generally, the degree of borrowing is higher than average where language families are in contact in one of the many situations that characterize Central Asia: frequent bilingualism, shifting political boundaries, and ethnic groups living outside the “mother” country.
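As a rough illustration of the distance measure mentioned above, the following Python sketch computes a plain Levenshtein distance between two transcriptions and averages a length-normalized version over the word-list items shared by two sites. It is a minimal sketch under stated assumptions, not the authors' procedure: the description does not say whether distances were normalized or weighted by phonetic features, so the normalization by the longer string, the function names and the dictionary layout are all assumptions, and the example strings are placeholders rather than data from the survey.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Levenshtein distance divided by the longer length (an assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def site_distance(words_a, words_b):
    """Mean normalized distance over the concepts recorded at both sites.

    words_a and words_b map a concept label to one transcription
    (a hypothetical layout chosen for this sketch).
    """
    shared = [concept for concept in words_a if concept in words_b]
    if not shared:
        raise ValueError("the two sites share no concepts")
    total = sum(normalized_distance(words_a[c], words_b[c]) for c in shared)
    return total / len(shared)


# Placeholder strings, not taken from the survey.
site_a = {"w1": "abc", "w2": "abcd"}
site_b = {"w1": "abc", "w2": "abed"}
print(site_distance(site_a, site_b))
```

Applying `site_distance` to every pair of sites would yield the kind of symmetric distance matrix from which a tree or a multidimensional scaling plot can be built.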