Standardization of Wikipedia articles according to the lexical constancy of their introductions and body texts
Ludovic BOCKEN, PhDs (c) - INTP-T
Innovation, Governance, Enterprise Architecture, (Generative) Artificial Intelligence, Knowledge Engineering Specialist | Drummer
Wikipedia is a prolific encyclopedia of unequal quality. To standardize it, qualitative categories are used to classify certain articles. Article length has been identified as one of the best predictor variables, but it is an insufficient criterion for standardizing the cognitive accessibility of Wikipedia. This research measures how constantly vocabulary is repeated between an article's introduction and its body. Our reproducible methodology draws largely on supervised classification methods and similarity metrics. Wikipedia's application programming interface (API) and the quanteda software library are used to collect and measure articles from two quality categories: the positive category of featured articles and the negative category of articles needing rewrite. After idempotence tests, the K and Vm metrics are selected and applied to the texts. A complementary measure is formalized as the difference between independent measurements. Models of combinatorial properties are then evaluated. Decision trees give an overview. The performance of aggregated models for each metric is then compared using support vector machines (SVMs). The K and Vm metrics appear to be better candidates than length for normalizing Wikipedia, with K appearing to be the more discriminating of the two.
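The abstract does not spell out how K is computed; in quanteda it is available through `textstat_lexdiv` as Yule's K, defined as 10^4 · (Σᵢ i²·Vᵢ − N) / N², where N is the number of tokens and Vᵢ the number of types occurring exactly i times. A minimal Python sketch of that formula, plus a hypothetical `constancy_difference` helper standing in for the "difference between independent measurements" described above (the exact operationalization in the article is not shown here):

```python
from collections import Counter

def yules_k(tokens):
    """Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2,
    where V_i is the number of types occurring exactly i times.
    Higher K means more repetitive vocabulary."""
    n = len(tokens)
    freqs = Counter(tokens)        # type -> frequency
    vi = Counter(freqs.values())   # frequency i -> number of types V_i
    s2 = sum(i * i * v for i, v in vi.items())
    return 1e4 * (s2 - n) / (n * n)

def constancy_difference(intro_tokens, body_tokens, metric=yules_k):
    """Hypothetical complementary measure: the metric computed
    independently on the introduction minus the metric on the body."""
    return metric(intro_tokens) - metric(body_tokens)

# Example: a fully repetitive text scores higher than a varied one.
print(yules_k(["wiki"] * 4))              # 7500.0
print(yules_k(["wiki", "wiki", "page"]))  # ~2222.22
```

Tokenization, stop-word handling, and the aggregation into per-category models (decision trees, SVMs) are additional steps the article describes but that this sketch deliberately omits.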
Please let me know if you are interested in this kind of research and/or if you would like a talk about it. Thanks for sharing!
5y — You might be interested in this: https://www.dhirubhai.net/pulse/wikim-r-package-measure-wikipedia-ludovic-bocken-phds-c- :)
Prof. Dr. in Computational Linguistics
5y — Hi Ludovic! Where could I see the whole article?
Language Processing, Information Retrieval, Robotics, Data!
5y — Hi Ludovic, have you checked my paper about the Arabic Wikipedia? https://ieeexplore.ieee.org/document/6987558 If you don't have access, please let me know.
Principal Software Engineer at Wikimedia Foundation
5y — Hi Ludovic! Are you familiar with the ORES project? https://www.mediawiki.org/wiki/ORES Wikimedia is very interested in automated quality assessment, especially for detecting vandalism, but also for surfacing more subtle problems. Aaron Halfaker is the research scientist on the project.
Chief Customer Officer at CarbonCatalyst
5y — Interesting viewpoint. We are trying to create a Wikipedia site at the moment. Quite hard.