Analyzing discourse and characterizing it in terms of its communicative objectives is essential to most Natural Language Processing tasks. Consequently, research on textual genres as expressions of such objectives presents an opportunity to enhance both automatic techniques and resources. An investigation of this kind requires a good understanding of what defines and distinguishes each textual genre. This research presents a data-driven approach to discovering and analyzing patterns in several textual genres, with the aim of identifying and quantifying the differences between them by considering how language is employed and meaning is expressed in each case. To identify and analyze patterns within genres, a set of linguistic features is first defined, then extracted and computed using several Natural Language Processing tools. Specifically, the analysis is performed over a corpus of documents (news, tales and reviews) gathered from different sources to ensure a heterogeneous representation. Once the feature dataset has been generated, machine learning techniques are used to ascertain how and to what extent each feature should be present in a document depending on its genre. The results show that the defined feature set is relevant for characterizing the different genres. Furthermore, the findings allow a qualitative analysis of these features, which corroborates their usefulness and suitability. The results can benefit natural language discourse processing tasks, which are useful for both understanding and generating language.
Journal: IEEE Access
Authors: Marta Vicente, María Miró Maestre, Elena Lloret and Armando Suárez Cueto
Article published: February 2021
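
The feature-extraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the three surface features, the function names `extract_features` and `genre_profiles`, and the toy corpus are all assumptions chosen for illustration, and the paper's actual feature set and tools are not specified here. The sketch computes a small feature vector per document and averages it per genre, which is the kind of table a machine learning model could then be trained on.

```python
import re
from collections import defaultdict
from statistics import mean

# Hypothetical feature: a small first-person pronoun list (an assumption,
# not taken from the paper).
FIRST_PERSON = {"i", "me", "my", "we", "us", "our"}


def extract_features(text):
    """Compute three illustrative surface features for one document."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens or not sentences:
        return {
            "avg_sentence_len": 0.0,
            "type_token_ratio": 0.0,
            "first_person_ratio": 0.0,
        }
    return {
        # Mean number of tokens per sentence.
        "avg_sentence_len": len(tokens) / len(sentences),
        # Lexical variety: distinct tokens over total tokens.
        "type_token_ratio": len(set(tokens)) / len(tokens),
        # Share of tokens that are first-person pronouns.
        "first_person_ratio": sum(t in FIRST_PERSON for t in tokens) / len(tokens),
    }


def genre_profiles(corpus):
    """Average each feature over the documents of each genre.

    `corpus` is an iterable of (genre, text) pairs.
    """
    by_genre = defaultdict(list)
    for genre, text in corpus:
        by_genre[genre].append(extract_features(text))
    return {
        genre: {name: mean(f[name] for f in feats) for name in feats[0]}
        for genre, feats in by_genre.items()
    }


# Toy corpus invented for this example; it is not data from the study.
toy_corpus = [
    ("news", "The council approved the budget on Monday. "
             "Officials expect work to begin in March."),
    ("tale", "Once upon a time a fox lived in the woods. "
             "The fox was clever and the fox was kind."),
    ("review", "I loved this phone. My only complaint is the battery. "
               "We would buy it again."),
]

profiles = genre_profiles(toy_corpus)
for genre, profile in profiles.items():
    print(genre, {name: round(value, 3) for name, value in profile.items()})
```

In a fuller setup, the per-document feature vectors (rather than the per-genre averages) would be passed to a classifier, and the fitted model inspected to see which features separate the genres.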