Description
In the context of the digital economy, language accessibility has become a vital element of equitable participation in communication, digital engagement, and the preservation of linguistic diversity. As digital platforms evolve into major channels of interaction, the availability (or absence) of virtual keyboards significantly affects the usability of many languages. While globally dominant languages are well supported, many minority and regional languages, such as Turkish, continue to face accessibility challenges.
This study explores Turkish language accessibility in digital keyboard systems. User-generated content was collected from platforms such as Ekşi Sözlük, Reddit, and YouTube, as well as from online forums. The initial dataset comprised 64 unique entries of unstructured textual feedback with timestamps. The comments addressed Turkish character support, keyboard layouts (F and Q), mobile typing usability, and multilingual input.
A hybrid methodology combining expert labeling and computational analysis was used. Experts first reviewed the dataset and labeled it thematically. Preprocessing consisted of lowercasing; removing URLs, punctuation, and numeric symbols; and normalizing whitespace, while preserving Turkish-specific characters (ç, ğ, ı, ö, ş, ü). To improve dataset variety and robustness, three data augmentation techniques were applied: random word shuffling, synonym replacement, and word deletion. This expanded the dataset from 64 to 141 entries.
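The preprocessing and augmentation steps can be sketched in Python roughly as follows. This is a minimal illustration, not the study's code: the function names, the deletion probability, and the one-entry synonym map are hypothetical, and the Turkish-aware lowercasing of "I" and "İ" is an assumption about how the dotted and dotless letters would be kept distinct (Python's default `str.lower()` maps "I" to "i" rather than "ı").

```python
import random
import re

# Assumption: map the Turkish capitals explicitly before calling lower(),
# so that "I" becomes dotless "ı" and "İ" becomes a plain "i".
TR_LOWER = str.maketrans({"I": "ı", "İ": "i"})


def preprocess(text):
    """Lowercase, strip URLs, punctuation and digits, normalize whitespace."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text, flags=re.IGNORECASE)
    text = text.translate(TR_LOWER).lower()
    # \w is Unicode-aware for str patterns, so ç, ğ, ı, ö, ş, ü survive.
    text = re.sub(r"[^\w\s]|[\d_]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def shuffle_words(text, rng):
    """Random word shuffling: same words, random order."""
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)


def delete_words(text, rng, p=0.2):
    """Word deletion: drop each word with probability p (hypothetical value)."""
    words = text.split()
    kept = [w for w in words if rng.random() >= p]
    return " ".join(kept or words[:1])  # keep one word if all were dropped


def replace_synonyms(text, synonyms):
    """Synonym replacement driven by a caller-supplied word map."""
    return " ".join(synonyms.get(w, w) for w in text.split())


if __name__ == "__main__":
    rng = random.Random(0)
    clean = preprocess("Işık TÜRKÇE klavye!! https://example.com 123")
    print(clean)
    print(shuffle_words(clean, rng))
    print(delete_words(clean, rng))
    # Hypothetical one-entry synonym map, for illustration only.
    print(replace_synonyms("klavye sorun", {"sorun": "problem"}))
```

Passing a seeded `random.Random` instance keeps the augmented samples reproducible between runs.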
TF-IDF vectorization and Non-negative Matrix Factorization (NMF) were applied to extract topics. Five topics were identified: (1) Turkish Character Input Issues, (2) Unclear/Off-topic or Social Commentary, (3) Multilingual Keyboard Concerns, (4) Typing Experience and Personal Sentiment, and (5) General Complaints / Usability Frustrations. Each comment was then assigned to one of these topics, yielding a labeled dataset.
To evaluate topic prediction, three machine learning models were trained using an 80/20 stratified train/test split: Logistic Regression, Random Forest, and a linear Support Vector Machine (SVM). The linear SVM and Logistic Regression each achieved 93.1% accuracy, while Random Forest reached 79.3%. These results indicate that lightweight machine learning models are feasible for automatic topic classification of digital language accessibility feedback.
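The evaluation could look roughly like the sketch below, assuming scikit-learn pipelines with TF-IDF features. The abstract does not state the features or hyperparameters behind the reported scores, so the settings here are assumptions, and the toy comments and labels are invented. If the split was taken over all 141 entries, a 20% test set would hold 29 comments (scikit-learn rounds the test size up), in which case 93.1% and 79.3% would correspond to 27 and 23 correct predictions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def evaluate_models(texts, labels, seed=42):
    """Train three classifiers on a stratified 80/20 split; return accuracies."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=seed
    )
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=100,
                                                random_state=seed),
        "linear_svm": LinearSVC(),
    }
    scores = {}
    for name, model in models.items():
        # Fit TF-IDF on the training part only, then classify the held-out part.
        pipeline = make_pipeline(TfidfVectorizer(), model)
        pipeline.fit(x_train, y_train)
        scores[name] = accuracy_score(y_test, pipeline.predict(x_test))
    return scores


# Invented toy data: two topics, five comments each (not the study's data).
toy_texts = [
    "ş ve ğ harfleri klavyede çıkmıyor",
    "ı harfi yerine i yazıyor",
    "ö ve ü harfleri eksik",
    "ç harfi klavyede yok",
    "türkçe harfleri bulamıyorum",
    "ingilizce ve türkçe arasında dil değiştirmek zor",
    "iki dil arasında geçiş yapmak yavaş",
    "almanca klavye ile türkçe yazamıyorum",
    "çok dilli klavye desteği zayıf",
    "dil değiştirme tuşu çalışmıyor",
]
toy_labels = ["character_input"] * 5 + ["multilingual"] * 5

if __name__ == "__main__":
    for name, acc in evaluate_models(toy_texts, toy_labels).items():
        print(f"{name}: {acc:.3f}")
```

Wrapping the vectorizer and classifier in one pipeline keeps vocabulary and IDF weights from being learned on the held-out comments.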
This hybrid approach supports scalable and adaptive analysis of linguistic accessibility and may be applicable to other structurally complex or low-resource languages.