What is Natural Language Processing (NLP)?

Posted in Data Science

"This is Captain Kirk of the Starship Enterprise. Beam me up." Natural Language Processing (NLP) is no longer just science fiction from the "Star Trek" franchise, where Captain Kirk and the crew of the Starship Enterprise talk freely with devices and computers. Today, NLP is real, and it is ubiquitous in everyday life.

What is NLP?

Natural Language Processing (NLP) is an area of computer science, artificial intelligence, and computational linguistics that deals with the interaction between computers and humans in a natural manner using human languages. NLP covers speech recognition, text-to-speech synthesis, machine translation, natural language text processing and summarization, user interfaces, multilingual and cross-language information retrieval (CLIR), and applications in artificial intelligence (AI) and expert systems.

What sets NLP apart from conventional computing is that conventional programs work on explicit IF/THEN logic statements, whereas NLP endeavors to bridge the gap in human communication: it analyzes what a human user says or writes and processes it to derive what the user meant. This interactive capability makes NLP important to applications and advancements in artificial intelligence and machine learning.

A Few Sample Application Areas of NLP

NLP is the driving engine in several common applications that we see today:

  • Natural language translation, such as Google Translate.
  • Grammar checking in word processors such as MS Word and in writing assistants such as Grammarly.
  • Speech recognition and Interactive Voice Response (IVR) systems used in call centers.
  • Personal digital assistants such as Google Home, Siri, Cortana, and Alexa.

Technical Aspects of NLP

NLP applies algorithms to identify and extract the rules of natural language and to convert unstructured language data into a form that a computer can understand. For example, given a piece of text, appropriate algorithms extract the relevant and logical meaning of the words in every sentence and collect that data. When the computer fails to understand the meaning, it may produce errors and incorrect output. The objective of NLP techniques is to get as close as possible to 100% accuracy in both understanding and output.
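
As a minimal illustration of turning unstructured text into something a computer can work with, the sketch below converts a short passage into word counts using only the Python standard library. The function name `bag_of_words` and the simple regular-expression tokenizer are assumptions made for this example; they are not taken from any particular NLP tool.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, extract word tokens, and count each one."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

counts = bag_of_words("The cat sat on the mat. The mat was warm.")
print(counts.most_common(2))  # [('the', 3), ('mat', 2)]
```

Word counts discard word order and meaning, so this is only the first step of a real pipeline, but it shows how free text becomes structured data.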

What are the components of NLP?

Syntactic and semantic analysis are the two main components of Natural Language Processing pipelines.

Let us see what constitutes Syntax and Semantics.

1. Syntax

Syntax refers to how words are arranged in a sentence so that they make grammatical sense.

In NLP, syntactic analysis assesses how well natural language text aligns with the grammar rules of that particular language. Algorithms apply the grammar rules to a group of words to derive meaning from them.

Here are some syntax techniques that can be used:

1. Lemmatization

It deals with reducing the various inflected forms of a word, such as plural forms of nouns; past tense, past participle, and present participle forms of verbs; and comparative and superlative forms of adjectives and adverbs. The key is to reduce each word to a single base form (its lemma) for easy analysis.
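
The idea can be sketched in a few lines of Python. This is a toy example, assuming a tiny hand-made dictionary: `LEMMAS`, `IRREGULAR`, and `SUFFIX_RULES` are hypothetical tables invented for illustration, whereas a real lemmatizer relies on a full lexicon such as WordNet. A suffix rule is accepted only when the result is a known dictionary word.

```python
# Hypothetical mini-dictionary of known base forms (lemmas).
LEMMAS = {"go", "good", "mouse", "study", "walk", "cat", "run"}

# Irregular forms that no suffix rule can handle.
IRREGULAR = {"went": "go", "better": "good", "mice": "mouse", "ran": "run"}

# (suffix, replacement) rules for regular inflections, tried in order.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def lemmatize(word):
    """Return the dictionary base form of a word, or the word itself."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word in LEMMAS:
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            # Accept the rule only if it yields a known lemma.
            if candidate in LEMMAS:
                return candidate
    return word

print([lemmatize(w) for w in ["mice", "went", "studies", "walked", "cats"]])
# ['mouse', 'go', 'study', 'walk', 'cat']
```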

2. Morphological Segmentation

Words are divided into individual units called morphemes, which are the smallest meaningful units of a language.
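
A rough sketch of the idea in Python, assuming short hand-picked affix lists (`PREFIXES` and `SUFFIXES` are hypothetical and far from complete): the function peels one known prefix and any known suffixes off a word and keeps at least three letters for the stem.

```python
# Hypothetical affix lists; real systems learn or look up far more morphemes.
PREFIXES = ["un", "re", "dis"]
SUFFIXES = ["ness", "able", "ing", "ed", "s"]

def segment_morphemes(word):
    """Split a word into [prefix] + stem + [suffixes] using the affix lists."""
    word = word.lower()
    prefixes = []
    for prefix in PREFIXES:
        # Strip at most one prefix, leaving at least three letters.
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            prefixes.append(prefix)
            word = word[len(prefix):]
            break
    suffixes = []
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                suffixes.insert(0, suffix)  # keep suffixes in reading order
                word = word[: len(word) - len(suffix)]
                stripped = True
                break
    return prefixes + [word] + suffixes

print(segment_morphemes("unbreakable"))  # ['un', 'break', 'able']
print(segment_morphemes("dislikes"))     # ['dis', 'like', 's']
```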

3. Word Segmentation

It requires dividing a large stretch of continuous text into distinct units, such as words.
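
For text written without spaces, one classic approach is greedy longest-match segmentation (often called MaxMatch). The sketch below assumes a tiny hypothetical vocabulary, `VOCAB`; unknown characters fall back to single-character units.

```python
# Hypothetical vocabulary; a real system would use a large word list.
VOCAB = {"the", "cat", "sat", "on", "mat", "a", "at"}

def max_match(text, vocab=VOCAB):
    """Segment unspaced text by repeatedly taking the longest known word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

print(max_match("thecatsatonthemat"))
# ['the', 'cat', 'sat', 'on', 'the', 'mat']
```

Because the match is greedy, the result depends heavily on the vocabulary: a longer dictionary word can win over the intended shorter one, which is why practical segmenters usually add statistical scoring.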

4. Part-of-Speech Tagging

It requires identifying the part of speech for every word and tagging it.
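
As a minimal sketch, the simplest possible tagger looks each word up in a lexicon and falls back to a default tag. The `LEXICON` table and the tag names below are assumptions for this example; real taggers also use the surrounding context to resolve words that can take more than one part of speech.

```python
# Hypothetical word-to-tag lexicon.
LEXICON = {
    "the": "DET", "a": "DET",
    "dog": "NOUN", "cat": "NOUN",
    "chased": "VERB", "sleeps": "VERB",
    "big": "ADJ", "quickly": "ADV",
}

def pos_tag(tokens, default="NOUN"):
    """Tag each token by lexicon lookup, using a default tag for unknown words."""
    return [(tok, LEXICON.get(tok.lower(), default)) for tok in tokens]

print(pos_tag("The big dog chased a squirrel".split()))
# [('The', 'DET'), ('big', 'ADJ'), ('dog', 'NOUN'),
#  ('chased', 'VERB'), ('a', 'DET'), ('squirrel', 'NOUN')]
```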

5. Parsing

This is a very common task. It analyzes the grammatical structure of a given sentence, typically producing a parse tree that shows how the words group into phrases.
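
To make this concrete, here is a small recursive-descent parser for a toy grammar (S -> NP VP, NP -> DET NOUN, VP -> VERB NP | VERB). The grammar, the tag names, and the tuple-based tree format are all assumptions chosen for illustration, and the input is expected to be already tagged with parts of speech.

```python
def parse_np(tagged, i):
    """NP -> DET NOUN. Return (tree, next_index) or None."""
    if i + 1 < len(tagged) and tagged[i][1] == "DET" and tagged[i + 1][1] == "NOUN":
        return ("NP", tagged[i][0], tagged[i + 1][0]), i + 2
    return None

def parse_vp(tagged, i):
    """VP -> VERB NP | VERB. Return (tree, next_index) or None."""
    if i < len(tagged) and tagged[i][1] == "VERB":
        np = parse_np(tagged, i + 1)
        if np:
            return ("VP", tagged[i][0], np[0]), np[1]
        return ("VP", tagged[i][0]), i + 1
    return None

def parse_sentence(tagged):
    """S -> NP VP. Return a tree if the whole input is consumed, else None."""
    np = parse_np(tagged, 0)
    if not np:
        return None
    vp = parse_vp(tagged, np[1])
    if vp and vp[1] == len(tagged):
        return ("S", np[0], vp[0])
    return None

tagged = [("the", "DET"), ("dog", "NOUN"), ("chased", "VERB"),
          ("a", "DET"), ("cat", "NOUN")]
print(parse_sentence(tagged))
# ('S', ('NP', 'the', 'dog'), ('VP', 'chased', ('NP', 'a', 'cat')))
```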

6. Sentence Breaking

Here, sentence boundaries are identified within a large body of text.
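
A naive version can be sketched with a regular expression that splits after ".", "!", or "?" when the next sentence starts with a capital letter. The rule and the function name `split_sentences` are assumptions for this example.

```python
import re

def split_sentences(text):
    """Split after ., ! or ? when followed by whitespace and an uppercase letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

print(split_sentences("NLP is real. Is it everywhere? Yes! It is."))
# ['NLP is real.', 'Is it everywhere?', 'Yes!', 'It is.']
```

A rule this simple breaks on abbreviations such as "Dr. Smith", which is why production sentence breakers rely on abbreviation lists or trained models.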

7. Stemming

This reduces inflected words to their root form (stem), typically by stripping suffixes. Unlike lemmatization, the resulting stem need not be a valid dictionary word.
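
A bare-bones suffix stripper shows the idea. The suffix list below is a hypothetical, heavily simplified stand-in for the rule sets used by real stemmers such as the Porter stemmer.

```python
# Hypothetical suffix list, longest suffixes first.
SUFFIXES = ["ingly", "edly", "ing", "ed", "ly", "es", "s"]

def stem(word):
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

print([stem(w) for w in ["connected", "connecting", "connections", "studies"]])
# ['connect', 'connect', 'connection', 'studi']
```

Note that "studies" becomes "studi", which is not a real word: a stemmer only chops characters, whereas a lemmatizer would return "study".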

2. Semantics

Semantics refers to the meaning conveyed by a text. Semantic analysis is one of the more challenging areas of Natural Language Processing and is still evolving. It entails applying algorithms to derive meaning and to interpret words and sentence structure.

Let us see some of the Semantic analysis techniques:

2.1. Named Entity Recognition (NER)

This entails identifying parts of the text and sorting them into preset categories. Names of people, places, and countries are examples of the entities that NER picks out.
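
The simplest form of NER is a gazetteer lookup, sketched below. The `GAZETTEER` table and its category labels are hypothetical; modern NER systems use trained statistical models rather than fixed lists, so treat this only as an illustration of the input and output.

```python
# Hypothetical gazetteer: category -> known entity names.
GAZETTEER = {
    "PERSON": ["Captain Kirk", "Ada Lovelace"],
    "LOCATION": ["London", "New York"],
    "ORGANIZATION": ["Stanford"],
}

def find_entities(text):
    """Return (entity_text, label, start_offset) tuples sorted by position."""
    found = []
    for label, names in GAZETTEER.items():
        for name in names:
            start = text.find(name)
            while start != -1:
                found.append((name, label, start))
                start = text.find(name, start + len(name))
    return sorted(found, key=lambda entity: entity[2])

for name, label, start in find_entities("Captain Kirk flew from London to New York."):
    print(label, name, start)
```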

2.2. Word Sense Disambiguation (WSD)

This entails choosing the correct meaning of a word that has several senses, based on the scenario or context in which it appears.
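
One well-known baseline is the simplified Lesk approach: pick the sense whose dictionary definition shares the most words with the sentence. The sketch below assumes a hypothetical two-sense inventory for "bank" (`SENSES`) and a small stopword list; ties go to the sense listed first.

```python
import re

# Hypothetical sense inventory: word -> {sense name: definition}.
SENSES = {
    "bank": {
        "financial institution": "an institution that accepts deposits and lends money",
        "river edge": "the sloping land beside a river or stream of water",
    }
}
STOPWORDS = {"a", "an", "the", "of", "or", "and", "that", "to", "in", "on"}

def content_words(text):
    """Lowercased words of the text, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def disambiguate(word, sentence):
    """Return the sense whose definition overlaps most with the sentence."""
    context = content_words(sentence)
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(disambiguate("bank", "She sat on the bank of the river and watched the water"))
# river edge
print(disambiguate("bank", "He went to the bank to deposit money"))
# financial institution
```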

2.3. Natural Language Generation (NLG)

NLG uses databases to arrive at semantic intentions and converts those intentions into understandable human language.
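
A minimal sketch of template-based generation follows: a structured record (as it might come from a database) is turned into an English sentence. The record fields and the function `describe_weather` are hypothetical and exist only for this example.

```python
def describe_weather(record):
    """Turn a structured weather record into an English sentence."""
    diff = record["high_c"] - record["yesterday_high_c"]
    if diff > 0:
        trend = f"{diff} degrees warmer than yesterday"
    elif diff < 0:
        trend = f"{-diff} degrees cooler than yesterday"
    else:
        trend = "the same as yesterday"
    return (f"{record['city']} will be {record['condition']} "
            f"with a high of {record['high_c']} C, {trend}.")

record = {"city": "Pune", "condition": "sunny", "high_c": 31, "yesterday_high_c": 28}
print(describe_weather(record))
# Pune will be sunny with a high of 31 C, 3 degrees warmer than yesterday.
```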

Natural Language Processing Tools and Frameworks

Tools are the backbone of any technology; without them, efficiency, quality, and scale cannot be achieved. Natural Language Processing is arguably one of the most complex areas of artificial intelligence, given the number of languages in the world. Fortunately, good and versatile tools are available to support NLP work and make it easier for NLP professionals.

While many tools are in use, for the sake of getting hands-on with NLP quickly, we will focus on open-source tools because they are free to use.

1. Stanford's CoreNLP Suite

The Stanford CoreNLP suite comes with a variety of functions. It provides the base forms of words and their parts of speech, and it indicates whether words are names of companies, people, and so on. It normalizes dates, times, and numeric quantities, and marks up sentence structure in terms of phrases and syntactic dependencies. Further, it indicates which noun phrases refer to the same entities, detects sentiment, extracts particular or open-class relations between entities, and retrieves quotes said by people.

It may be wise to choose Stanford CoreNLP if you or your organization requires the following:

  • A broad range of grammatical analysis tools
  • A versatile, robust, and fast annotator for arbitrary texts that is widely used in production
  • A regularly updated package that delivers high-quality text analytics
  • Support for a number of major human languages
  • APIs for most major programming languages
  • The ability to run as a simple web service

Stanford CoreNLP's objective is to make it easy to apply a set of linguistic analysis tools to a piece of text. The tool pipeline can be run on plain text with just a few lines of code. Designed to be highly flexible and extensible, it provides options to enable and disable individual tools. It integrates several of Stanford's NLP tools, such as the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, sentiment analysis, pattern learning, and open information extraction tools. A highlight is that it can be integrated with additional custom or third-party annotators. CoreNLP's analytic engine provides the fundamental building blocks for higher-level and domain-specific text understanding applications. The suite is a GPL-licensed framework and supports tokenization, grammar parsing, and entity recognition. It can process text in several languages, including English, Chinese, and Spanish.
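
The notion of a pipeline whose tools can be switched on and off can be sketched generically in Python. To be clear, this is not the CoreNLP API: the annotator functions, the `ANNOTATORS` registry, and `run_pipeline` are hypothetical names that only illustrate the design, in which each enabled step adds its results to a shared document.

```python
import re

def tokenize(doc):
    """Add a list of word and punctuation tokens to the document."""
    doc["tokens"] = re.findall(r"\w+|[^\w\s]", doc["text"])

def split_sentences(doc):
    """Add a list of sentences to the document."""
    doc["sentences"] = re.split(r"(?<=[.!?])\s+", doc["text"].strip())

def count_tokens(doc):
    """Add a token count; requires the tokenize step to have run first."""
    doc["token_count"] = len(doc["tokens"])

# Hypothetical registry mapping annotator names to functions.
ANNOTATORS = {"tokenize": tokenize, "ssplit": split_sentences, "count": count_tokens}

def run_pipeline(text, enabled):
    """Run only the enabled annotators, in the order given."""
    doc = {"text": text}
    for name in enabled:
        ANNOTATORS[name](doc)
    return doc

doc = run_pipeline("Beam me up. NLP is real!", ["tokenize", "ssplit"])
print(doc["tokens"])     # ['Beam', 'me', 'up', '.', 'NLP', 'is', 'real', '!']
print(doc["sentences"])  # ['Beam me up.', 'NLP is real!']
```

Leaving a name out of the `enabled` list disables that tool, and a custom annotator can be added by registering one more function.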

2. Natural Language Toolkit (NLTK)

NLTK is a widely used platform for writing Python programs that work with human language data. Its highlight is a set of easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet. It comes bundled with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, along with wrappers for other NLP libraries. It also provides a discussion forum for the developer community.

NLTK is highly versatile and suitable for a wide variety of professionals such as linguists, engineers, students, educators, researchers, and industry users. It also comes with a hands-on guide that introduces programming fundamentals alongside topics in computational linguistics, as well as API documentation. NLTK runs on Windows, Mac OS X, and Linux.

3. Apache OpenNLP

Apache OpenNLP is similar to Stanford's library in functionality but takes a different approach. The primary objective of OpenNLP is to be a robust and mature toolkit. An additional objective is to provide a large number of pre-built models for various human languages, along with the relevant annotated text resources.

The Apache OpenNLP library is a machine-learning-based toolkit for processing natural language text. It carries out NLP tasks such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. Each of these features is provided through application programming interfaces (APIs). In addition, a command-line interface (CLI) is available for running experiments and training models.

OpenNLP supports the advanced processing requirements of the above services and uses machine learning algorithms such as maximum entropy.
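
To give a feel for what a maximum entropy model computes, the sketch below scores each label by summing the weights of the active features and then normalizes the exponentiated scores into probabilities. This is not OpenNLP code: the feature names and the values in `WEIGHTS` are invented for illustration, whereas a real model learns its weights from annotated training data.

```python
import math

# Hypothetical feature weights for each label.
WEIGHTS = {
    "PERSON": {"is_capitalized": 1.5, "prev_is_title": 2.0},
    "OTHER": {"is_capitalized": 0.2, "prev_is_title": -1.0},
}

def maxent_probabilities(active_features):
    """Return P(label | features) = exp(score) / sum of exp(score) over labels."""
    scores = {
        label: sum(weights.get(f, 0.0) for f in active_features)
        for label, weights in WEIGHTS.items()
    }
    z = sum(math.exp(s) for s in scores.values())  # normalizing constant
    return {label: math.exp(s) / z for label, s in scores.items()}

# A capitalized token preceded by a title such as "Captain".
probs = maxent_probabilities(["is_capitalized", "prev_is_title"])
print(max(probs, key=probs.get))  # PERSON
```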

4. GATE (General Architecture for Text Engineering)

GATE was developed over 15 years ago and is still actively used for all types of computational tasks involving human language. GATE performs well at text analysis of all shapes and sizes. Being highly versatile, it is used by large corporations and startups alike. From large research consortia to undergraduate student projects, the GATE user community is one of the largest and most diverse of its kind and is spread across all continents.

GATE is open source and free to use. It can be downloaded from GATE.ac.uk or procured on a commercial basis from GATE community commercial partners, which brings technical support along with it.

The GATE Family

GATE has evolved into a family that includes a desktop client for developers, a workflow-based web application, a Java library, an architecture, and a process engine.

What exactly is GATE?

  • An IDE: GATE Developer is an Integrated Development Environment (IDE) for language processing components. It works with a wide variety of information extraction systems and comes with a comprehensive set of plugins.
  • A web app: GATE Teamware is a collaborative annotation environment that is well suited to commercial-scale semantic annotation projects. It comes with a workflow engine and a robust, scalable backend infrastructure.
  • A framework: GATE Embedded is an object library designed for inclusion in diverse applications, giving them access to the services used by GATE Developer.
  • An architecture: a high-level organizational picture of how language processing software components fit together.

The GATE Community also provides:

  • A hosted solution for large-scale, cloud-based text processing, accessible through GATECloud.net.
  • GATE Mímir: the "Multi-paradigm Information Management Index and Repository." This is a very large and scalable multi-paradigm index built on top of Ontotext's semantic repository family, GATE's annotation structures database, and full-text indexing from MG4J.
  • A wiki/CMS (GATEWiki.sf.net): developed to host the community's websites, it also provides a testbed for NLP experiments.

Besides these core functions, GATE also provides components for language processing tasks in multiple languages, e.g., morphological analysis, tagging, parsing, and information extraction. GATE Developer and GATE Embedded come with an information extraction system (ANNIE), which has been widely adapted and evaluated in industrial systems and in research systems assessed in MUC, TREC, ACE, DUC, Pascal, and NTCIR.

5. Apache UIMA

What is UIMA? UIMA stands for Unstructured Information Management Architecture. Unstructured Information Management (UIM) applications are systems that analyze large volumes of unstructured information to discover knowledge that is meaningful and relevant to end users. Apache UIMA supports a large community of users and developers who use UIMA frameworks, tools, and annotators. A UIM application typically analyzes plain text, audio, and video to identify attributes such as people's names, places, companies, voices, images, and locations.

Apache UIMA is an Apache-licensed open-source implementation of the UIMA specification, which is developed by a technical committee at OASIS, a standards organization.

From a technical standpoint, the frameworks run the components and are available for both Java and C++. The C++ framework supports annotators written in C/C++, as well as Perl, Python, and Tcl annotators. There are two scale-out frameworks, UIMA-AS and UIMA-DUCC, that are add-ons to the base Java framework. UIMA-AS provides scale-out using JMS (Java Message Service) and ActiveMQ. UIMA-DUCC provides cluster management services to automate the scale-out of UIMA pipelines.

UIMA comes with several add-on components, including annotators and consumers such as the Whitespace Tokenizer Annotator, Snowball Annotator, Regular Expression Annotator, Dictionary Annotator, Hidden Markov Model Tagger Annotator, BSF Annotator, and OpenCalais Annotator.

Besides, Apache UIMA offers servers, such as the Simple Server (a UIMA REST service), and packaging tools such as the PEAR Packaging Ant Task and the PEAR Packaging Maven Plugin.

Apache UIMA also offers the following:

  • UIMA Ruta, a rule-based scripting language whose rules are run by an analysis engine built on top of UIMA.
  • An Eclipse-based tooling workbench designed for interactive development and testing of rules.

Conclusion

Despite significant technological advancements, NLP still has many challenges to overcome in terms of overall quality and efficacy. NLP is designed to make people's work easier across all its applications. At this point, however, NLP is not autonomous: it still requires human intervention to get close to 100% efficiency and efficacy.

A common fear among people and industry is that NLP will start a trend of replacing humans in their jobs. While that concern is partly valid, NLP cannot function effectively without human involvement and input. The responsibility for making NLP systems and applications more robust, and their output near-natural and seamless in quality, lies with their human architects. Despite the fears of NLP taking away jobs in the future, NLP is a hotbed of research and development in today's information-cluttered industry.

I hope this article has helped you pick up the fundamentals of what NLP is and learn about its components. The NLP tools described here should also help you move forward: try some of them to see which one suits you best.

Ramya Shankar