Lexical Complexity Using Semantics Research Paper
Dylan | May 25, 2020
Researchers Ketakee Nimavat and Tushar Champaneria propose an interesting, intuitive method of gauging the lexical complexity of a text in their paper, "Estimation of Lexical Complexity using Language Semantics".
There are many algorithms to assess text complexity in the field of Natural Language Processing (NLP), but as with anything, you must choose the right tool for the job. Different algorithms assign complexity by examining entirely different things. Some algorithms concern themselves with grammatical complexity, while others equate lexical richness with complexity.
Nimavat and Champaneria's proposed algorithm focuses solely on lexical (word) complexity and ignores grammatical (syntactic) complexity entirely. The authors have an interesting perspective on what makes words difficult. Their understanding is perhaps based on Steven Pinker's book, "The Language Instinct", or at the very least supported by it. Let's dive in and see how their algorithm works.
Intuitive Understanding
Nimavat and Champaneria assert that a word's complexity is reflected in its definition: in how many words the definition contains and in how difficult those words are. By this reasoning, the definitions of simple words should consist of simple words, very few words, or both. Consider the following definitions from the Merriam-Webster Dictionary:
bed: a piece of furniture on which to lie or sleep
grandiloquence: a lofty, extravagantly colorful, pompous, or bombastic style, manner, or quality especially in language
alchemy: a medieval chemical science and speculative philosophy aiming to achieve the transmutation of the base metals into gold, the discovery of a universal cure for disease, and the discovery of a means of indefinitely prolonging life
To distinguish the simple words from the complicated ones in the definitions, the authors construct a basic_word_list by merging Ogden's 2,000-word list and a 1,000-word basic word list from Wikipedia.
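Merging the two lists amounts to a set union after normalizing case and whitespace. The snippet below is a minimal sketch of that step, not the authors' code: the function name `merge_word_lists` and the two sample lists are hypothetical stand-ins for the real word lists, which would normally be loaded from files.

```python
def merge_word_lists(*word_lists):
    """Return one lowercase, whitespace-trimmed set containing every word."""
    merged = set()
    for words in word_lists:
        # Skip blank entries and normalize so "Bed" and "bed " collapse to "bed".
        merged.update(w.strip().lower() for w in words if w.strip())
    return merged

# Hypothetical excerpts for illustration; not the actual list contents.
ogden_sample = ["Bed", "sleep", "piece "]
wikipedia_sample = ["sleep", "house", "bed"]

basic_word_list = merge_word_lists(ogden_sample, wikipedia_sample)
print(sorted(basic_word_list))  # ['bed', 'house', 'piece', 'sleep']
```

Storing the result as a set keeps the later membership checks fast and removes duplicates that appear in both lists.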
Input: a word
Output: complexity score depicting the complexity of the given word
complexity_score = 0
defn = get_definition(word)
tokens = tokenize(defn)
useful_words = remove_stopwords(tokens)
for elem in useful_words:
    if elem in basic_word_list:
        complexity_score += 1
    else:
        defn2 = get_definition(elem)
        tokens2 = tokenize(defn2)
        useful_words2 = remove_stopwords(tokens2)
        complexity_score += len(useful_words2)
When given a word as input, this algorithm looks up and tokenizes the word's definition. Tokenization is the process of segmenting a string of text into individual words, or tokens, for additional manipulation or analysis. After tokenizing the definition, it filters out all of the stop words. In general and in this case, stop words refer to the most common words in a language, such as "the", "at", and "on" in English.
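To make the preprocessing concrete, here is a minimal sketch of tokenization and stop word removal applied to the definition of "bed" quoted earlier. The regular-expression tokenizer and the tiny `STOP_WORDS` set are my own simplifications for illustration; the paper does not specify which tokenizer or stop word list it uses, and real stop word lists are much longer.

```python
import re

# Hypothetical, deliberately small stop word list for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "on", "at", "in", "to", "or", "and", "which"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    """Keep only the tokens that are not in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

definition = "a piece of furniture on which to lie or sleep"
tokens = tokenize(definition)
useful_words = remove_stopwords(tokens)
print(useful_words)  # ['piece', 'furniture', 'lie', 'sleep']
```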
After retrieving and preprocessing the definition of the word, the algorithm loops through each word in the definition. If the word exists in our previously defined set of basic words, we add one to the word's complexity_score.
If the word isn't found in our set of basic words, we retrieve that word's own definition, tokenize it, and filter out its stop words. Instead of repeating the whole process recursively and risking an infinite loop, we simply count the words that remain in that definition and add the count to the complexity_score. If the intuition discussed above holds true, a longer definition indicates a more complex word, so the complexity_score increases accordingly.
This process continues until each word in the input word's definition has been analyzed.
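The whole procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: `DEFINITIONS` is a toy two-entry dictionary (the "furniture" definition is invented for the example), `BASIC_WORD_LIST` and `STOP_WORDS` are tiny hypothetical sets, and a word with no dictionary entry contributes nothing to the score, a case the paper's pseudocode does not address.

```python
import re

# Toy data for illustration only; real runs need a full dictionary and word lists.
STOP_WORDS = {"a", "an", "the", "of", "on", "at", "in", "to", "or", "and", "for", "which"}
BASIC_WORD_LIST = {"piece", "lie", "sleep"}
DEFINITIONS = {
    "bed": "a piece of furniture on which to lie or sleep",
    "furniture": "movable articles used to make a room ready for living",  # invented
}

def get_definition(word):
    """Look up a definition; unknown words yield an empty string (an assumption)."""
    return DEFINITIONS.get(word, "")

def useful_words(text):
    """Tokenize into lowercase words and drop the stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def complexity(word):
    """Score a word from its definition, looking only one level deep."""
    score = 0
    for elem in useful_words(get_definition(word)):
        if elem in BASIC_WORD_LIST:
            score += 1  # basic words cost one point each
        else:
            # Non-basic words cost the length of their own filtered definition.
            score += len(useful_words(get_definition(elem)))
    return score

print(complexity("bed"))  # 10
```

Here "piece", "lie", and "sleep" each add 1, while "furniture" adds 7 for the seven non-stop words in its toy definition.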
Considerations
I do not believe the algorithm, as presented in the paper, is robust enough to compete with more established measures of text difficulty, such as the Flesch Reading Ease scale. Its performance varies widely depending on how the engineer decides to handle the following components.
Dictionary
This algorithm is highly dependent on the dictionary used, as some dictionaries give extremely complicated definitions for even the most basic words. Consider this definition of "dog" from the Oxford Dictionary:
"a domesticated carnivorous mammal that typically has a long snout, an acute sense of smell, nonretractable claws, and a barking, howling, or whining voice."
Basic Word List
This algorithm is also highly dependent on the list of basic words chosen at the start. The paper uses Ogden's list of 2,000 basic English words along with Wikipedia's list of 1,000 basic English words. I found the word "banana" missing from these lists, while words such as "manhole", "serum" and "schist" are included. What makes the cut appears to be largely left to the intuition of the engineer building the algorithm.
Perhaps you could extract the top 2000 or 80% of the most common words from a large corpus to build a list that relies less on intuition.
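A frequency-based list along these lines could be sketched as follows. The function name and parameters are hypothetical, and reading "80%" as "the most frequent words that together account for 80% of all word occurrences in the corpus" is my interpretation, not something the paper defines.

```python
import re
from collections import Counter

def build_basic_word_list(corpus, top_n=None, coverage=None):
    """Pick basic words by frequency: either the top_n most common words,
    or the most common words that together cover `coverage` of all tokens."""
    if top_n is None and coverage is None:
        raise ValueError("provide top_n or coverage")
    counts = Counter(re.findall(r"[a-z]+", corpus.lower()))
    ranked = counts.most_common()  # (word, count) pairs, most frequent first
    if top_n is not None:
        return {word for word, _ in ranked[:top_n]}
    total = sum(counts.values())
    selected, running = set(), 0
    for word, count in ranked:
        if running / total >= coverage:
            break  # the words chosen so far already reach the target coverage
        selected.add(word)
        running += count
    return selected

# Tiny stand-in corpus; a real list would need a large, representative corpus.
corpus = "the cat sat on the mat the cat ran"
print(sorted(build_basic_word_list(corpus, top_n=2)))       # ['cat', 'the']
print(sorted(build_basic_word_list(corpus, coverage=0.5)))  # ['cat', 'the']
```

The resulting list still depends on the corpus chosen, so this shifts the judgment call from picking words to picking texts.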
Multiple Definitions
Most words have multiple definitions and uses. This algorithm is unable to determine which definition is most appropriate given the context in which the word appears. One solution proposed in the paper is to simply average the complexity of each possible definition. While this may be a workable solution, I'm afraid it may skew the reliability of the complexity score. The more definitions of a word we consider, the more likely our derived complexity score is to drift toward the average complexity of all words in the dictionary. Consider the following example of "scaffold" as defined by the Merriam-Webster Dictionary.
- A temporary or movable platform for workers to stand or sit on when working at a height above the floor or ground
- A platform on which a criminal is executed
- A platform at a height above the ground or floor level
- A supporting framework
If the fourth definition is used, the derived complexity score will be much lower than if we had used the first definition. Likewise, if the first definition is the most accurate one in context, averaging the complexity scores of all four definitions will strongly dilute its contribution.
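The dilution is easy to see with numbers. The sketch below scores each "scaffold" definition with a simplified proxy: the count of words left after removing stop words, which equals the algorithm's score only if every remaining word happens to be in the basic word list. Both that simplification and the small `STOP_WORDS` set are assumptions made for illustration.

```python
import re
from statistics import mean

# Hypothetical, deliberately small stop word list for illustration only.
STOP_WORDS = {"a", "an", "the", "or", "and", "for", "to", "on", "at",
              "of", "in", "when", "which", "is"}

SCAFFOLD_DEFINITIONS = [
    "A temporary or movable platform for workers to stand or sit on "
    "when working at a height above the floor or ground",
    "A platform on which a criminal is executed",
    "A platform at a height above the ground or floor level",
    "A supporting framework",
]

def proxy_score(definition):
    """Count the non-stop words, assuming each one is a basic word worth 1."""
    tokens = re.findall(r"[a-z]+", definition.lower())
    return sum(1 for t in tokens if t not in STOP_WORDS)

scores = [proxy_score(d) for d in SCAFFOLD_DEFINITIONS]
average = mean(scores)
print(scores)   # [11, 3, 6, 2]
print(average)  # 5.5
```

Under this proxy the first sense scores 11 and the fourth scores 2, while the average of 5.5 represents neither of them well.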