Module C: Evaluation

Evaluation is a crucial aspect of the project. To ensure the quality of the generated text, two types of evaluation will be carried out, intrinsic and extrinsic, combining automatic and manual evaluation metrics.

This module fulfils objective OBJ5.

Activity 1. Intrinsic Evaluation

The intrinsic evaluation will independently determine, both quantitatively and qualitatively, the performance and quality of each model obtained and each approach proposed during the project. It will be rigorously defined according to the aspects of the text to be evaluated.

For the quantitative evaluation, metrics widely used and accepted by the NLP and NLG research communities, such as coverage, precision, F-measure, perplexity, and BLEU (Papineni et al., 2002), will initially be used. Since most of these metrics require reference texts against which the automatically generated output can be compared, the possibility of defining new, additional metrics to address the deficiencies of the existing ones will also be investigated where necessary.
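As an illustration of the kind of automatic metrics mentioned above, the following minimal Python sketch implements a simplified sentence-level BLEU (clipped n-gram precisions with a brevity penalty, after Papineni et al., 2002) and a token-level F-measure. This is an illustrative sketch, not the project's final evaluation tooling; in practice, established implementations would be used.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, hypothesis, max_n=4):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) multiplied by a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        precisions.append(clipped / max(sum(hyp_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    brevity = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return brevity * math.exp(log_mean)

def f_measure(reference, hypothesis, beta=1.0):
    """Token-overlap precision/recall combined into an F-measure."""
    ref, hyp = Counter(reference), Counter(hypothesis)
    overlap = sum(min(c, ref[t]) for t, c in hyp.items())
    precision = overlap / max(sum(hyp.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
```

A perfect match yields a score of 1.0 on both metrics, while a hypothesis sharing no n-grams with the reference scores 0.0 under this unsmoothed BLEU, which is precisely the dependence on reference texts that motivates investigating complementary metrics.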

A qualitative evaluation will also be carried out to assess other aspects, such as the grammatical correctness of the generated text, its meaning, and whether it fulfils the communicative purpose for which it was generated. This evaluation, for which a Likert scale of at least 5 points will be defined, will be carried out by real expert users (Pu et al., 2012) through crowdsourcing platforms, such as Figure Eight, that include adequate privacy and data-protection policies guaranteeing that participants' personal data will not be distributed. In addition, the quality of the evaluation will be verified via preliminary tests or tasks that ensure the full commitment of the participants. The results of the expert evaluations may lead to the creation of a reference corpus.
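The quality-control step described above, preliminary test items that verify annotator commitment before their Likert ratings are trusted, can be sketched as follows. The function name, data layout, and accuracy threshold are illustrative assumptions, not a specification of the project's crowdsourcing pipeline.

```python
from statistics import mean

def filter_and_aggregate(ratings, gold_answers, min_accuracy=0.8):
    """Drop annotators who fail too many attention-check items,
    then average the remaining 1-5 Likert ratings per generated text.

    ratings: {annotator: {item_id: rating}}
    gold_answers: {check_item_id: expected_rating}
    """
    kept = {}
    for annotator, answers in ratings.items():
        checks = [answers.get(item) == gold
                  for item, gold in gold_answers.items()]
        if checks and sum(checks) / len(checks) >= min_accuracy:
            kept[annotator] = answers
    # Aggregate only real items, excluding the check items themselves.
    items = {i for a in kept.values() for i in a if i not in gold_answers}
    return {i: mean(a[i] for a in kept.values() if i in a) for i in items}
```

Ratings from annotators who miss the check items are simply discarded before aggregation; in a real deployment one would likely also report inter-annotator agreement over the retained ratings.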

Milestone: results obtained from the intrinsic evaluation

Activity 2. Extrinsic Evaluation

Extrinsic evaluation is intended to measure the usefulness and demonstrate the applicability of the proposed holistic approach in the context of other NLP tasks and other domains. The project aims to evaluate the methods and tools developed from two perspectives: their application to real scenarios, and the generation of abstractive summaries.

The automatic generation of abstractive summaries, i.e., producing a summary as a person would, rewriting the information read in one's own words, is still at a preliminary stage, addressed mainly through sentence compression (Yao et al., 2017). The integration of the NLG approach developed in Integer will have a positive impact on this field of research, all the more so given the research team's extensive experience in automatic summarization, mainly extractive (Lloret and Palomar, 2013; Lloret et al., 2013; Lloret et al., 2015; Lloret, 2016), which will guarantee the feasibility and applicability of the project beyond NLG.

As for the real evaluation scenarios, since the project's proposal is domain-independent, the evaluation will be carried out in the tourism and political domains.

The tourism domain is undergoing a transition from traditional travel agencies to digital ones. The latter give users access to several sources of information within a single website and provide a wide offer through sophisticated, powerful search engines that allow results to be filtered by price, location, or distance, among other criteria. Even so, users face an arduous task when selecting the resource best suited to their interests from the varied selection these search engines present. Comments from other users are also highly valued; however, their number can be tremendously high and they vary over time, so users must themselves extract an overview of the quality of the offer.

Some websites try to facilitate this task through various tools, but in no case do they exploit the unstructured information in the text of the comments. The approach proposed in this project would improve on these tools by exploiting the textual content of user comments, allowing the generation of different types of summaries, parameterised by the user.

Another area of application is politics, more specifically debates in parliamentary committees. In a field as sensitive as politics, automatic summaries could provide an additional guarantee of fidelity to the original transcripts.

In public sessions of parliamentary committees, transcriptions are made of each and every intervention. When the public is not admitted, in some cases summaries of the speeches are also produced in addition to the transcription. These are documents that aim to summarize the speeches of both the speakers and the deputies in an impartial way. Integer could generate such summaries automatically from the transcripts, and the degree of fidelity to the most important milestones of the successive interventions could even be verified. With prior authorisation, manual transcriptions and summaries can be provided to us to evaluate and verify the effectiveness of automatic summary generation.

Milestone: results obtained from the extrinsic evaluation


  • Lloret, E. (2016). Introducing the Key Stages for Addressing Multi-perspective Summarization. Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2016) – Volume 1: KDIR, Porto – Portugal, November 9 – 11, 2016: 321-326.
  • Lloret, E., Boldrini, E., Vodolazova, E., Martínez-Barco, P., Muñoz, R. and M. Palomar (2015). A novel concept-level approach for ultra-concise opinion summarization. Expert Systems with Applications 42(20): 7148-7156.
  • Lloret, E. and M. Palomar (2013). COMPENDIUM: a text summarisation tool for generating summaries of multiple purposes, domains, and genres. Natural Language Engineering 19(2): 147-186.
  • Lloret, E., L. Plaza, A. Aker (2013). Analyzing the capabilities of crowdsourcing services for text summarization. Language Resources and Evaluation 47(2): 337-369.
  • Papineni, K., S. Roukos, T. Ward and W. Zhu (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL ’02), pages 311-318. Association for Computational Linguistics.
  • Pu, P., L. Chen and R. Hu (2012). Evaluating recommender systems from the user’s perspective: survey of the state of the art. User Modeling and User-Adapted Interaction 22(4-5): 317-355.
  • Yao, J., Wan, X. and J. Xiao (2017). Recent advances in document summarization. Knowledge and Information Systems 53(2): 297-336.