Language is one of humanity's most ingenious inventions. It is language that enables us to communicate with each other, to know each other and eventually to build relationships with each other. The Oxford English Dictionary defines language as “the method of human communication, either spoken or written, consisting of the use of words in a structured and conventional way.” Language sets us apart from other living beings and has been one of the most important factors in making us the dominant species on this planet. It has simplified human life to a great extent, and over time languages themselves have evolved: from sign language to letters, alphabets, words, sentences, dialects and more.
With the invention of computers and the introduction of information technology, human beings began looking for ways to enhance and simplify their lives with the help of machines. There arose the need for a new way of communicating with these computers – the programming language. This, too, has automated and simplified many aspects of human life.
But as technology has advanced, people now want machines to understand human language and to think and respond the way we do – creating more seamless, efficient and accessible communication between humans and machines. This thinking has led to the birth of a new field: Natural Language Processing, or NLP.
There are many applications of NLP, and all of them have enhanced or improved human life in one way or another. The most common applications include:
Speech recognition: Found in most smartphones in the form of Google Assistant or Siri, which can understand and communicate with humans in natural language.
Sentiment analysis: Interprets users' sentiments by analyzing Twitter and Facebook posts and comments, movie reviews and more. Interpreting sentiment in this way helps political and business leaders make better-informed decisions.
Chatbots: Much of customer care now uses chatbots to interact with users, and they have been successful in answering customers' basic queries and concerns.
Translation: Google Translate can now translate text from one language into many others.
Advertisement matching: Based on a user's past search history, NLP helps recommend products, movies, series and songs that may be of interest.
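As a toy illustration of how sentiment analysis works at its simplest, a lexicon-based scorer counts positive and negative words. The tiny word lists below are illustrative assumptions, not a real sentiment lexicon:

```python
# Minimal lexicon-based sentiment sketch.
# These tiny word lists are illustrative assumptions, not a real lexicon.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "sad"}

def sentiment_score(text: str) -> int:
    """Count positive words minus negative words in the text."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score("I love this great movie"))      # positive score
print(sentiment_score("terrible plot and bad acting")) # negative score
```

Real systems use far larger lexicons or trained models, and handle negation ("not good") and intensity, but the core idea of mapping words to sentiment is the same.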
But in order to get to the point where NLP can be used in these applications, and in our daily lives, the following two components must be understood and addressed:
Natural Language Understanding (NLU)
NLU involves understanding the speech, the words and the intent behind a given utterance. We all know that natural language is sometimes ambiguous: the same word or sentence may have different meanings depending on the context and the intent of the speaker, so understanding the correct meaning is inherently complex, especially for machines. Specifically, NLU faces the following types of ambiguity:
Lexical ambiguity: A single word can have several meanings, which leads to confusion in understanding it. Example: He is going to the bank. Here ‘bank’ may refer to a bank where we deposit or withdraw money, or to the bank of a river.
Syntactical ambiguity: When a single sentence has more than one possible interpretation because of its structure. Example: Mumpy saw someone on the hill with a telescope. Did Mumpy use a telescope to see someone on the hill, or did she see someone on the hill holding a telescope?
Referential ambiguity: When the correct referent in a sentence is not clear. Example: Munna went to meet his father. He was very excited. To whom does ‘He’ refer – Munna or his father?
Natural Language Generation (NLG)
NLG involves generating language to present to the user. After recognizing and understanding the input, the machine should respond in the same language, in a way that is intelligent, logical, relevant and conversational. For this, NLG follows these steps:
Text planning: The relevant words are selected after understanding the input and are chosen from the corpus or knowledge base.
Sentence planning: The selected words are framed into a sentence so that the output is meaningful and grammatical. The words are placed in the proper sequence to produce structured, coherent communication.
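The two planning stages above can be sketched with a hand-written template. The weather "knowledge base" and the template sentence here are illustrative assumptions, not a real NLG system:

```python
# Sketch of NLG's two planning stages using a hand-written template.
# The knowledge base and template below are illustrative assumptions.
KNOWLEDGE_BASE = {"city": "Mumbai", "condition": "sunny", "temp_c": 31}

def text_planning(kb: dict) -> dict:
    # Text planning: select the relevant content words from the knowledge base.
    return {"city": kb["city"], "condition": kb["condition"], "temp": kb["temp_c"]}

def sentence_planning(content: dict) -> str:
    # Sentence planning: arrange the selected words into a grammatical sentence.
    return f"It is {content['condition']} in {content['city']} at {content['temp']}°C."

print(sentence_planning(text_planning(KNOWLEDGE_BASE)))
```

Production NLG systems replace the fixed template with learned or grammar-driven surface realization, but the select-then-arrange pipeline is the same.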
To overcome the various challenges that arise in NLU and NLG, NLP relies on the following processes to understand and generate human language:
Tokenization: The process of breaking a sequence of strings into pieces known as tokens. The tokens can be the words or phrases of a sentence, or the sentences of a paragraph.
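A minimal word and sentence tokenizer can be sketched with regular expressions; real tokenizers handle contractions, abbreviations and punctuation far more carefully:

```python
import re

def word_tokenize(text: str) -> list[str]:
    # Naive word tokenizer: pull out runs of word characters.
    return re.findall(r"\w+", text)

def sentence_tokenize(text: str) -> list[str]:
    # Naive sentence tokenizer: split after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(word_tokenize("NLP breaks text into tokens."))
print(sentence_tokenize("It works. Does it? Yes!"))
```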
Stemming: The process of reducing a word to its root or base form by chopping off its suffixes or prefixes. Although it may produce a token with no dictionary meaning (e.g., an aggressive stemmer reduces smoker, smoking and smoked to ‘smok’, which is not an actual word), stemming is generally useful for information retrieval systems like search engines.
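A crude suffix-stripping stemmer illustrates the idea. Real stemmers such as Porter's apply many ordered, conditional rules; the short suffix list here is an illustrative assumption:

```python
def simple_stem(word: str) -> str:
    # Crude suffix-stripping stemmer: chop the first matching common suffix,
    # keeping at least three characters of stem. Real stemmers (e.g., Porter)
    # apply many ordered, conditional rewrite rules instead.
    for suffix in ("ing", "ed", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("smoker", "smoking", "smoked"):
    print(w, "->", simple_stem(w))  # all three reduce to "smok"
```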
Lemmatization: Similar to stemming, but with lemmatization the word is reduced to its meaningful base form, or ‘lemma’. Example: smoker, smoking and smoked, when lemmatized, resolve to ‘smoke’, which is an actual word. The list of lemmas comes from a lexical knowledge base.
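A dictionary-lookup lemmatizer can be sketched as follows; the tiny lemma table is an illustrative assumption standing in for a full lexical knowledge base such as WordNet:

```python
# Dictionary-lookup lemmatizer sketch. A real lemmatizer consults a full
# lexical knowledge base; this tiny table is an illustrative assumption.
LEMMAS = {"smoker": "smoke", "smoking": "smoke", "smoked": "smoke",
          "better": "good", "mice": "mouse"}

def lemmatize(word: str) -> str:
    # Look the word up; unknown words are returned unchanged.
    return LEMMAS.get(word.lower(), word)

for w in ("Smoker", "smoking", "smoked"):
    print(w, "->", lemmatize(w))
```

Unlike the stemmer above, the lookup guarantees that every output is a real word, at the cost of needing the knowledge base.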
POS Tags: Part-of-speech tags are assigned to each word in a sequence, which helps machines process and interpret natural language. Tagging identifies each word as a noun, verb, pronoun, adjective, etc. Example: The cat is running. In this sentence the POS tags are: ‘The’ – determiner, ‘cat’ – noun, ‘is running’ – verb.
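A lookup-based tagger illustrates the idea; production taggers use statistical or neural models, and the small tag dictionary here is an illustrative assumption:

```python
# Lookup-based POS tagger sketch. Real taggers use statistical or neural
# models; this tag dictionary is an illustrative assumption.
TAGS = {"the": "DET", "cat": "NOUN", "is": "VERB", "running": "VERB"}

def pos_tag(tokens: list[str]) -> list[tuple[str, str]]:
    # Unknown words default to NOUN, a common baseline heuristic.
    return [(t, TAGS.get(t.lower(), "NOUN")) for t in tokens]

print(pos_tag(["The", "cat", "is", "running"]))
```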
Named Entity Recognition (NER): This process involves identifying each word or phrase of a sentence and classifying it under various predefined categories. Example: Mumpy watched Titanic during her stay in Sweden in 2016. Here the NER technique identifies Mumpy as a person, Titanic as a movie, Sweden as a location and 2016 as a date.
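A gazetteer (lookup-table) approach is the simplest sketch of NER; real systems use trained sequence models, and the table below is an illustrative assumption:

```python
# Gazetteer-based NER sketch. Real NER systems use trained sequence models;
# this lookup table is an illustrative assumption.
GAZETTEER = {"mumpy": "PERSON", "titanic": "MOVIE",
             "sweden": "LOCATION", "2016": "DATE"}

def tag_entities(tokens: list[str]) -> list[tuple[str, str]]:
    # Keep only the tokens that appear in the gazetteer, with their classes.
    return [(t, GAZETTEER[t.lower()]) for t in tokens if t.lower() in GAZETTEER]

print(tag_entities("Mumpy watched Titanic during her stay in Sweden in 2016".split()))
```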
Chunking: Now that the strings are reduced into pieces and analyzed, it is time to combine different pieces into chunks and tag them so as to get a larger picture. Good chunking facilitates comprehension and retrieval of meaningful information.
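A simple chunker can be sketched by grouping determiner-adjective-noun runs of a tagged sentence into noun-phrase chunks; the single grammar rule is an illustrative assumption:

```python
# Simple noun-phrase (NP) chunker sketch: group consecutive determiner,
# adjective and noun tokens into one chunk. The rule is an illustrative
# assumption; real chunkers use richer grammars or learned models.
def np_chunk(tagged: list[tuple[str, str]]) -> list[list[str]]:
    chunks, current = [], []
    for word, tag in tagged:
        if tag in ("DET", "ADJ", "NOUN"):
            current.append(word)        # extend the current NP chunk
        else:
            if current:
                chunks.append(current)  # close the chunk at a non-NP tag
            current = []
    if current:
        chunks.append(current)
    return chunks

tagged = [("The", "DET"), ("black", "ADJ"), ("cat", "NOUN"),
          ("chased", "VERB"), ("a", "DET"), ("mouse", "NOUN")]
print(np_chunk(tagged))
```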
Apart from the processes mentioned above, NLP also involves the removal of punctuation, site links and stop words. Stop words help frame an interpretable sentence, but even if we remove them the basic meaning can still be understood. Some common stop words are: “the”, “is”, “in”, “for”, “when”, “to”, “at”, etc.
One caveat about removing stop words: in some use cases, stop words, site links or even punctuation can be critical for understanding sentiment, so they should not be removed.
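Stop-word removal itself is a one-line filter; the small stop-word set below is an illustrative assumption (real pipelines use larger curated lists):

```python
# Stop-word removal sketch. This small set is an illustrative assumption;
# real pipelines use larger curated stop-word lists.
STOP_WORDS = {"the", "is", "in", "for", "when", "to", "at", "a", "an"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    # Drop any token that appears in the stop-word set (case-insensitive).
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("The cat is sleeping in the garden".split()))
```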
All the NLP processes discussed here can be implemented using the Natural Language Toolkit (NLTK), a very useful and interesting Python library for NLP. It contains modules for the different NLP processes (tokenizing, stemming, lemmatizing, etc.), which come as a package when NLTK is downloaded.
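Assuming NLTK is installed (pip install nltk), tokenizing and stemming can be sketched as follows; NLTK's RegexpTokenizer and PorterStemmer need no extra corpus downloads:

```python
# Assumes NLTK is installed (pip install nltk). These two components
# work without downloading any additional NLTK corpora.
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"\w+")  # regex-based word tokenizer
stemmer = PorterStemmer()            # classic rule-based stemmer

tokens = tokenizer.tokenize("The smokers were smoking heavily.")
print([stemmer.stem(t) for t in tokens])
```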
With the use of these NLP features and functions, Vantage can now understand the sentiments of customers so organizations can interpret and/or predict their needs and deliver the required solutions accordingly. This helps to build strong customer relationships and satisfaction, which gives businesses a competitive edge in an always-on world.
Dhruba Barman has been working with Teradata since 2012 in India and is part of Managed Services in Mumbai, working as a Performance DBA. His daily tasks involve performance optimization to add value to customers' data in the form of jobs and reports. He provides Vantage training within Managed Services and Tech Talk sessions, and is actively working on the new Vantage system for his customers, providing different ways of optimizing the new IFX system using the new features of TD16.20. He is interested in learning new technologies and actively participates in Machine Learning Hackathons, both inside and outside Teradata.