Established in mid-2017, our group addresses fundamental questions on scientific writing using cutting-edge technology.
How can we help scientists write and communicate their research better? What is the difference between well-written and poorly written papers?
What makes a paper easy to read and logically written? What are the underlying linguistic patterns of well-written papers?
How can we apply automation to help academic publishers make the review process faster and more efficient?
How can we make the review process of scientific texts more objective? Can papers be evaluated based on quantified factors of writing quality?
To answer these questions, we are applying state-of-the-art machine learning techniques, applied linguistic research, and expert knowledge on scientific writing to develop new models, functions, and algorithms.
We seek to aid researchers throughout the entire writing process. We pursue this goal through applied research, development, and innovation (R+D+i), merging the latest technological advances with established writing guidelines. Our R+D+i is embodied in WriteWise, software intended to modernize scientific writing by reducing the time and effort required of researchers when writing and of journals and academic publishers when reviewing manuscript submissions.
We combine machine learning and computational linguistics within the framework of natural language processing, as applied to modelling and revising the writing process and scientific texts. This line of research applies the following methodologies:
1. Novel approaches for representing textual data from scientific articles:
2. Novel computational approaches for analyzing scientific articles, with specific investigative focus on:
This research line is led by Prof. Héctor Allende. The team members involved in this R+D+i area include:
Figure 1. The WriteWise long short-term memory (LSTM) unit for constructing recurrent neural networks for scientific article analyses. These units not only have inputs and outputs, but can also share a state among all units within the same layer. This shared state, together with the processing of the LSTM gates, provides each unit with short-term memory, which is crucial for calculating outputs.
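The gate processing described in this caption can be sketched as follows. This is a minimal illustration of a textbook LSTM step for a single scalar unit, not the WriteWise implementation. The function names, the `params` layout, and the weights are all assumptions chosen for readability.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single scalar LSTM unit (illustrative only).

    x, h_prev and c_prev are floats. params maps each gate name to a
    hypothetical (input weight, recurrent weight, bias) triple.
    """
    def pre(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate value to write

    c = f * c_prev + i * g            # updated cell state (the unit's memory)
    h = o * math.tanh(c)              # output passed to the next step or layer
    return h, c


def run_sequence(xs, params):
    """Feed a sequence through the unit, carrying the state between steps."""
    h, c = 0.0, 0.0
    outputs = []
    for x in xs:
        h, c = lstm_step(x, h, c, params)
        outputs.append(h)
    return outputs
```

In a full layer, `x` and `h_prev` would be vectors and each gate would see the previous outputs of every unit in the layer. The scalar form is used here only to keep the role of each gate visible.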
Figure 2. The WriteWise deep-learning architecture applied to natural language processing tasks for scientific article analyses. The architecture is separated into three layers: (1) Input Layer: uses word embeddings and trait vectors to generate an input sequence; (2) Hidden Layer: composed of a bidirectional long short-term memory (LSTM) sub-layer, followed by a unidirectional network of LSTM units; and (3) Output Layer: composed of a few neurons that indicate the probability of a given sequence of words (Input Layer) being followed by a punctuation mark.
Figure 3. Graph of words (GoW) for a scientific article. Our team has developed new algorithms and code based on graph-theory representations of text that capture term dependencies and ordering. Shown is the k-core decomposition of a GoW, which defines hierarchy levels of increasing cohesiveness. The main core retains the GoW members with the highest levels of importance, which can serve as text keywords. This figure and its contents have been simplified due to copyright.
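The idea behind this figure can be illustrated with a small sketch. It builds an undirected co-occurrence graph from a token list and computes core numbers with the standard peeling procedure. It is not the WriteWise code. The sliding-window size, the undirected edges, and all function names are assumptions made for the example.

```python
from collections import defaultdict


def graph_of_words(tokens, window=3):
    """Undirected graph linking terms that co-occur within a sliding window."""
    adj = defaultdict(set)
    for i, term in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != term:
                adj[term].add(other)
                adj[other].add(term)
    return adj


def core_numbers(adj):
    """Core number of each node, found by repeatedly peeling the
    remaining node of lowest degree."""
    degree = {node: len(neigh) for node, neigh in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        node = min(remaining, key=lambda n: (degree[n], n))
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for neigh in adj[node]:
            if neigh in remaining:
                degree[neigh] -= 1
    return core


def main_core(adj):
    """Terms in the innermost (highest-k) core, as candidate keywords."""
    core = core_numbers(adj)
    if not core:
        return []
    top = max(core.values())
    return sorted(n for n, k in core.items() if k == top)
```

For example, `main_core(graph_of_words(text.lower().split()))` would return the most densely interconnected terms of `text`. A realistic pipeline would first remove stop words and normalize terms, which this sketch omits.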
We use functional and applied discursive frameworks, combined with corpus analysis, computational linguistics, and natural language processing approaches, to empirically determine the discursive and linguistic norms and requirements of academic and scientific texts. This line of research seeks to identify and understand the:
1. Communicative purposes and lexical-grammar features that constitute written texts in distinct scientific disciplines.
2. Textual and discursive foundations of academic and scientific texts.
This research line is led by Prof. René Venegas. The team members involved in this R+D+i area include:
Figure 1. Academic discourse tagging system. Our group has developed a new tagging system for sentences in scientific articles, where each sentence has a function definable through unique subsets of words. This figure and its contents have been simplified due to copyright.
Figure 2. Academic discourse model. Our group has developed a novel model for writing scientific articles, in which each section is subdivided into N subsections, each subsection has N linguistic functions, each function is represented by N sentences, and each sentence is composed of N words in some combination. This figure and its contents have been simplified due to copyright.
We use scientometrics combined with natural language processing to predict the impact and recognition of scientific publications.
The team members involved in this R+D+i area include:
Figure 1. Similarity network between an unpublished paper (query) and two academic journals. Network nodes indicate the similarity of the query paper with manuscripts from the selected journals, where closer nodes represent greater similarity (i.e., more neighbor nodes). This similarity network is based on a novel “journal thumbprints” algorithm, which allows comparisons and similarity predictions between query papers and target journals. With our new method, we can detect N number of words, sentences, and paragraphs, among other traits, and plot these features against different journals, considering factors such as quartiles (Qs), citations, etc. This figure and its contents have been simplified due to copyright.
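The "journal thumbprints" algorithm itself is not described here, so the sketch below uses a deliberately simple stand-in to show the general shape of a query-to-journal comparison: cosine similarity between bag-of-words term counts. The function names, the whitespace tokenization, and the pooled-count journal profile are all assumptions and should not be read as the WriteWise method.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts (whitespace tokenization, lowercased)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def journal_profile(papers):
    """Pooled term counts over a journal's papers (a crude stand-in profile)."""
    profile = Counter()
    for paper in papers:
        profile.update(term_vector(paper))
    return profile


def rank_journals(query, journals):
    """Journals sorted by similarity of their profile to the query text."""
    q = term_vector(query)
    scores = {
        name: cosine(q, journal_profile(papers))
        for name, papers in journals.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Here `journals` is a hypothetical mapping from a journal name to a list of paper texts. Features such as sentence and paragraph counts, quartiles, or citations, which the caption mentions, would need additional inputs that this sketch does not model.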
Led by Dr. Eduardo Fuentes, the WriteWise Research Group is a multidisciplinary team of experienced scientific editors, linguists, and computer scientists focused on improving and modernizing scientific communication.