Standardization of Wikipedia articles according to the lexical constancy of their introductions and body texts

Wikipedia is a prolific encyclopedia of uneven quality. To standardize it, certain articles are assigned to quality categories. Article length has been identified as one of the best predictor variables, but it is an insufficient criterion for standardizing the cognitive accessibility of Wikipedia. This research measures how consistently vocabulary is repeated between the introduction and the body of an article. Our reproducible methodology is largely inspired by supervised classification methods and similarity metrics. Wikipedia's programming interface (API) and the quanteda software library are used to collect and measure two quality categories: featured articles (the positive category) and articles needing a rewrite (the negative category). After idempotence tests, the K and Vm metrics are selected and applied to the texts. A complementary measure is formalized as the difference between independent measurements. Models of combinatorial properties are then evaluated. Decision trees give an overview. The performance of the aggregated models for each metric is then compared using support vector machines (SVMs). The K and Vm metrics appear to be better candidates than article length for normalizing Wikipedia, and of the two, K appears to be the more discriminating.
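
To make the measurement step more concrete, here is a minimal Python sketch. It is not the study's code (the study uses quanteda, an R library). It assumes that K and Vm refer to Yule's K and Herdan's Vm as they are commonly defined from the frequency spectrum of a text, and it assumes that the complementary measure can be read as the absolute difference between the introduction's value and the body's value. The tokenizer and the function names (`tokenize`, `yule_k`, `herdan_vm`, `constancy_gap`) are hypothetical and chosen for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Very rough tokenizer (an assumption): lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_spectrum(tokens):
    """Map each frequency i to the number of word types occurring exactly i times."""
    type_counts = Counter(tokens)
    return Counter(type_counts.values())


def yule_k(tokens):
    """Yule's K: 10^4 * (sum_i i^2 * V_i - N) / N^2, with N tokens."""
    n = len(tokens)
    if n == 0:
        raise ValueError("cannot measure an empty text")
    spectrum = frequency_spectrum(tokens)
    s2 = sum(v_i * i * i for i, v_i in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)


def herdan_vm(tokens):
    """Herdan's Vm: sqrt(sum_i V_i * (i / N)^2 - 1 / V), with V distinct types."""
    n = len(tokens)
    if n == 0:
        raise ValueError("cannot measure an empty text")
    v = len(set(tokens))
    spectrum = frequency_spectrum(tokens)
    s = sum(v_i * (i / n) ** 2 for i, v_i in spectrum.items())
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(s - 1.0 / v, 0.0))


def constancy_gap(intro, body, metric):
    """Hypothetical complementary measure: absolute difference between the
    metric computed on the introduction and on the body, independently."""
    return abs(metric(tokenize(intro)) - metric(tokenize(body)))


if __name__ == "__main__":
    intro = "Wikipedia is a free encyclopedia. The encyclopedia is written by volunteers."
    body = "Volunteers write and edit articles. Articles are reviewed by other volunteers."
    print("K gap: ", constancy_gap(intro, body, yule_k))
    print("Vm gap:", constancy_gap(intro, body, herdan_vm))
```

Under these assumptions, a small gap means that the introduction and the body repeat their vocabulary at a similar rate. Both metrics are 0 for a text in which every word occurs exactly once and grow as repetition increases.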

Please let me know if you are interested in this kind of research or if you would like a presentation about it. Thanks for sharing!

Ludovic BOCKEN, PhDs (c) - INTP-T

Innovation, Governance, Enterprise Architecture, (Generative) Artificial Intelligence, Knowledge Engineering Specialist | Drummer

5 years ago

Nina Khairova

Prof. Dr. in Computational Linguistics

5 years ago

Hi Ludovic! Where could I see the whole article?

Ali Salhi

Language Processing, Information Retrieval, Robotics, Data!

5 years ago

Hi Ludovic, have you checked my paper about the "Arabic" Wikipedia? https://ieeexplore.ieee.org/document/6987558 If you don't have access, please let me know.

Daniel Kinzler

Principal Software Engineer at Wikimedia Foundation

5 years ago

Hi Ludovic! Are you familiar with the ORES project? https://www.mediawiki.org/wiki/ORES Wikimedia is very interested in automated quality assessment, especially for detecting vandalism, but also for surfacing more subtle problems. Aaron Halfaker is the research scientist on the project.

Graeme Wood

Chief Customer Officer at CarbonCatalyst

5 years ago

Interesting viewpoint. We are trying to create a Wikipedia site at the moment. Quite hard.
