NLP Overview

Overview

NLP-Overview

  1. Introduction
  • The field of NPL encompasses a variety of topics which involve the computational processing and understanding of human languages. Since the 1980s, the field has increasingly relied on data-driven computation involving statistics, probability, and machine learning.

  • Recent increases in computational power and parallelization, harnessed by Graphical Processing Units (GPUs), now allow for “deep learning”, which utilizes artificial neural networks (ANNs), sometimes with billions of trainable parameters. Additionally, the contemporary availability of large datasets, facilitated by sophisticated data collection processes, enables the training of such deep architectures.

  1. Natural Language Processing
  • The field of natural language processing, also known as computational linguistics, involves the engineering of computational models and processes to solve practical problems in understanding human languages. Although it is sometimes difficult to distinguish clearly to which areas issues belong.

  • Work in NLP can be divided into two broad sub-areas: core areas and applications.

  • Often one needs to handle one or more of the core issues successfully and apply those ideas and procedures to solve practical problems.

  • Currently, NLP is primarily a data-driven field using statistical and probabilistic computations along with machine learning.

2.1. Core area of NLP

  • The core issues are those that are inherently present in any computational linguistic system. To perform translation, text summarization, image captioning, or any other linguistic task, there must be some understanding of the underlying language. This understanding can be broken down into at least four main areas: language modeling, morphology, parsing, and semantics. The number of scholarly works in each area over the last decade is shown in Figure 3.

image

Fig. 3: Publication Volume for Core Areas of NLP. The number of publications, indexed by Google Scholar, relating to each topic over the last decade is shown. While all areas have experienced growth, language modeling has grown the most.

2.1.1 Language Modeling and Word Embeddings

  • Language modeling can be viewed in two ways. First, it determines which words follow which. By extension, however, this can be viewed as determining what words mean, as individual words are only weakly meaningful, deriving their full value only from their interactions with other words.

  • The core areas address fundamental problems such as language modeling, which underscores quantifying associations among naturally occurring words.

  • Arguably, the most important task in NLP is that of language modeling. Language modeling (LM) is an essential piece of almost any application of NLP. Language modeling is the process of creating a model to predict words or simple linguistic components given previous words or components. This is useful for applications in which a user types input, to provide predictive ability for fast text entry. However, its power and versatility emanate from the fact that it can implicitly capture syntactic and semantic relationships among words or components in a linear neighborhood, making it useful for tasks such as machine translation or text summarization. Using prediction, such programs are able to generate more relevant, human-sounding sentences

2.1.2. Morphology

  • Morphology is the study of how words themselves are formed. It considers the roots of words and the use of prefixes and suffixes, compounds, and other intraword devices, to display tense, gender, plurality, and a other linguistic constructs. (Understanding later)

  • Morphological processing, dealing with segmentation of meaningful components of words and identifying the true parts of speech of words as used.

  • Morphology is concerned with finding segments within single words, including roots and stems, prefixes, suffixes,and—in some languages—infixes. Affixes (prefixes, suffixes, or infixes) are used to overtly modify stems for gender, number, person, et cetera.

2.1.3. Parsing

  • Parsing considers which words modify others, forming constituents, leading to a sentential structure. The area of semantics is the study of what words mean. It takes into account the meanings of the individual words and how they relate to and modify others, as well as the context these words appear in and some degree of world knowledge, i.e., “common sense”. There is a significant amount of overlap between each of these areas.

  • Syntactic processing, or parsing, which builds sentence diagrams as possible precursors to semantic processing.

  • Parsing examines how different words and phrases relate to each other within a sentence. There are at least two distinct forms of parsing: constituency parsing and dependency parsing. In constituency parsing, phrasal constituents are extracted from a sentence in a hierarchical fashion. Dependency parsing looks at the relationships between pairs of individual words.

  • Remaining Challenges: Outside of universal parsing, a parsing challenge that needs to be further investigated is the building of syntactic structures without the use of treebanks for training. Attempts have been made using attention scores and Tree-LSTMs, as well as outside-inside auto-encoders. If such approaches are successful, they have potential use in many environments, including in the context of low-resource languages and out-of-domain scenarios. While a number of other challenges remain, these are the largest and are expected to receive the most focus.

D. Semantics

  • Semantic processing, which attempts to distill meaning of words, phrases, and higher level components in text.

  • Semantic processing involves understanding the meaning of words, phrases, sentences, or documents at some level. Word embeddings, such as Word2Vec and GloVe, claim to capture meanings of words, following the Distributional Hypothesis of Meaning. As a corollary, when vectors corresponding to phrases, sentences, or other components of text are processed using a neural network, a representation that can be loosely thought to be semantically representative is computed compositionally.

  • In this section, neural semantic processing research is separated into two distinct areas: Work on comparing the semantic similarity of two portions of text, and work on capturing and transferring meaning in high level constituents, particularly sentences.

  • Semantic Challenges: In addition to the challenges already mentioned, researchers believe that being able to solve tasks well does not indicate actual understanding. Integrating deep networks with general word-graphs (e.g. WordNet) or knowledge-graphs (e.g. DBPedia) may be able to endow a sense of understanding. Graph-embedding is an active area of research, and work on integrating language-based models and graph models has only recently begun to take off, giving hope for better machine understanding.

  1. Summary of Core Issues
  • Deep learning has generally performed very well, surpassing existing states of the art in many individual core NLP tasks, and has thus created the foundation on which useful natural language applications can and are being built. However, it is clear from examining the research reviewed here that natural language is an enigmatically complex topic, with myriad core or basic tasks, of which deep learning has only grazed the surface. It is also not clear how architectures for ably executing individual core tasks can be synthesized to build a common edifice, possibly a much more complex distributed neural architecture, to show competence in multiple or “all” core tasks. More fundamentally, it is also not clear, how mastering of basic tasks, may lead to superior performance in applied tasks, which are the ultimate engineering goals, especially in the context of building effective and efficient deep learning models. Many, if not most, successful deep learning architectures for applied tasks, discussed in the next section, seem to forgo explicit architectural components for core tasks, and learn such tasks implicitly. Thus, some researchers argue that the relevance of the large amount of work on core issues is not fully justified, while others argue that further extensive research in such areas is necessary to better understand and develop systems which more perfectly perform these tasks, whether explicitly or implicitly.
  1. Application of NLP Using Deep Learning
  • While the study of core areas of NLP is important to understanding how neural models work, it is meaningless in and of itself from an engineering perspective, which values applications that benefit humanity, not pure philosophical and scientific inquiry. Current approaches to solving several immediately useful NLP tasks are summarized here. Note that the issues included here are only those involving the processing of text, not the processing of verbal speech. Because speech processing [162], [163] requires expertise on several other topics including acoustic processing, it is generally considered another field of its own, sharing many commonalities with the field of NLP. The number of studies in each discussed area over the last decade is shown in Figure 4.

image

Fig. 4: Publication Volume for Applied Areas of NLP. All areas of applied natural language processing discussed have witnessed growth in recent years, with the largest growth occurring in the last two to three years.

  • The application areas involve topics

    • Extraction of useful information (e.g. named entities and relations), translation of text between and among languages, summarization of written works, automatic answering of questions by inferring answers, and classification and clustering of documents.

4.1. Information Retrieval

  • The purpose of Information Retrieval (IR) systems is to help people find the right (most useful) information in the right (most convenient) format at the right time (when they need it) [164]. Among many issues in IR, a primary problem that needs addressing pertains to ranking documents with respect to a query string in terms of relevance scores for ad-hoc retrieval tasks, similar to what happens in a search engine.

4.2. Information Extraction

  • Information extraction extracts explicit or implicit information from text. The outputs of systems vary, but often the extracted data and the relationships within it are saved in relational databases [172]. Commonly extracted information includes named entities and relations, events and their participants, temporal information, and tuples of facts.

4.3. Text Generation

  • Many NLP tasks require the generation of human-like language. Summarization and machine translation convert one text to another in a sequence-to-sequence (seq2seq) fashion. Other tasks, such as image and video captioning and automatic weather and sports reporting, convert non-textual data to text. Some tasks, however, produce text without any input data to convert (or with only small amounts used as a topic or guide). These tasks include poetry generation, joke generation, and story generation.

4.4. Summarization

  • Summarization finds elements of interest in documents in order to produce an encapsulation of the most important content. There are two primary types of summarization: extractive and abstractive. The first focuses on sentence extraction, simplification, reordering, and concatenation to relay the important information in documents using text taken directly from the documents. Abstractive summaries rely on expressing documents’ contents through generation-style abstraction, possibly using words never seen in the documents [48].

4.5. Question Answering

  • Similar to summarization and information extraction, question answering (QA) gathers relevant words, phrases, or sentences from a document. QA returns this information in a coherent fashion in response to a request. Current methods resemble those of summarization.

4.6. Machine Translation

  • Machine translation (MT) is the quintessential application of NLP. It involves the use of mathematical and algorithmic techniques to translate documents in one language to another. Performing effective translation is intrinsically onerous even for humans, requiring proficiency in areas such as morphology, syntax, and semantics, as well as an adept understanding and discernment of cultural sensitivities, for both of the languages (and associated societies) under consideration [48].
  1. Summary of Deep Learning NLP Applications
  • Numerous other applications of natural language processing exist including grammar correction, as seen in word processors, and author mimicking, which, given sufficient data, generates text replicating the style of a particular writer. Many of these applications are infrequently used, understudied, or not yet exposed to deep learning. However, the area of sentiment analysis should be noted, as it is becoming increasingly popular and utilizing deep learning. In large part a semantic task, it is the extraction of a writer’s sentiment—their positive, negative, or neutral inclination towards some subject or idea [268]. Applications are varied, including product research, futures prediction, social media analysis, and classification of spam [269], [270]. The current state of the art uses an ensemble including both LSTMs and CNNs [271].

  • This section has provided a number of select examples of the applied usages of deep learning in natural language processing. Countless studies have been conducted in these and similar areas, chronicling the ways in which deep learning has facilitated the successful use of natural language in a wide variety of applications. Only a minuscule fraction of such work has been referred to in this survey.

  • While more specific recommendations for practitioners have been discussed in some individual subsections, the current trend in state-of-the-art models in all application areas is to use pre-trained stacks of Transformer units in some configuration, whether in encoder-decoder configurations or just as encoders. Thus, self-attention which is the mainstay of Transformer has become the norm, along with cross-attention between encoder and decoder units, if decoders are present. In fact, in many recent papers, if not most, Transformers have begun to replace LSTM units that were preponderant just a few months ago. Pre-training of these large Transformer models has also become the accepted way to endow a model with generalized knowledge of language. Models such as BERT, which have been trained on corpora of billions of words, are available for download, thus providing a practitioner with a model that possesses a great amount of general knowledge of language already. A practitioner can further train it with one’s own general corpora, if desired, but such training is not always necessary, considering the enormous sizes of the pre-training that downloaded models have received. To train a model to perform a certain task well, the last step a practitioner must go through is to use available downloadable task-specific corpora, or build one’s own task-specific corpus. This last training step is usually supervised. It is also recommended that if several tasks are to be performed, multi-task training be used wherever possible.

Reference:

[1] https://arxiv.org/pdf/1807.10854.pdf

[2] https://arxiv.org/pdf/1708.02709.pdf

[3] https://arxiv.org/pdf/2108.05542.pdf

[4] Recent Advances in Natural Language Processing via Large Pre-TrainedLanguage Models: A Survey https://arxiv.org/pdf/2111.01243.pdf

[5] AMMUS : A Survey of Transformer-based Pretrained Models in Natural Language Processing https://arxiv.org/pdf/2108.05542.pdf Appendix:

Encompasses (v) to include different types of things

Underscores (v) to emphasize something, or to show that it is important

Data-driven science is an interdisciplinary field of scientific methods to extract knowledge from data

Data-driven learning a learning approach driven by research-like access to data

Inherently (v) is a basic or essential feature that gives something its character

Arguably (adv) used for stating your opinion or belief, especially when you think other people may disagree (Được cho là)

Versatility (adj) able to change easily or to be used for different purposes (Tính linh hoạt)

Emanate (v) to come from a particular place (bắt nguồn).

Implicitly (adj) not stated directly, but expressed in the way that someone behaves, or understood from what they are saying (ngầm hiểu).

Corollary (n) something that will also be true if a particular idea or statement is true, or something that will also exist if a particular situation exists (kết quả tất yếu)1

Owner
PeterPham
PhD Student at National Chung Cheng University
PeterPham
Transformation spoken text to written text

Transformation spoken text to written text This model is used for formatting raw asr text output from spoken text to written text (Eg. date, number, i

Nguyen Binh 16 Dec 28, 2022
내부 작업용 django + vue(vuetify) boilerplate. 짠 하면 돌아감.

Pocket Galaxy 아주 간단한 개인용, 혹은 내부용 툴을 만들어야하는데 이왕이면 웹이 편하죠? 그럴때를 위해 만들어둔 django와 vue(vuetify)로 이뤄진 boilerplate 입니다. 각 폴더에 있는 설명서대로 실행을 시키면 일단 당장 뭔가가 돌아갑니

Jamie J. Seol 16 Dec 03, 2021
ttslearn: Library for Pythonで学ぶ音声合成 (Text-to-speech with Python)

ttslearn: Library for Pythonで学ぶ音声合成 (Text-to-speech with Python) 日本語は以下に続きます (Japanese follows) English: This book is written in Japanese and primaril

Ryuichi Yamamoto 189 Dec 29, 2022
A simple tool to update bib entries with their official information (e.g., DBLP or the ACL anthology).

Rebiber: A tool for normalizing bibtex with official info. We often cite papers using their arXiv versions without noting that they are already PUBLIS

(Bill) Yuchen Lin 2k Jan 01, 2023
뉴스 도메인 질의응답 시스템 (21-1학기 졸업 프로젝트)

뉴스 도메인 질의응답 시스템 본 프로젝트는 뉴스기사에 대한 질의응답 서비스 를 제공하기 위해서 진행한 프로젝트입니다. 약 3개월간 ( 21. 03 ~ 21. 05 ) 진행하였으며 Transformer 아키텍쳐 기반의 Encoder를 사용하여 한국어 질의응답 데이터셋으로

TaegyeongEo 4 Jul 08, 2022
OpenChat: Opensource chatting framework for generative models

OpenChat is opensource chatting framework for generative models.

Hyunwoong Ko 427 Jan 06, 2023
LSTM based Sentiment Classification using Tensorflow - Amazon Reviews Rating

LSTM based Sentiment Classification using Tensorflow - Amazon Reviews Rating (Dataset) The dataset is from Amazon Review Data (2018)

Immanuvel Prathap S 1 Jan 16, 2022
Python3 to Crystal Translation using Python AST Walker

py2cr.py A code translator using AST from Python to Crystal. This is basically a NodeVisitor with Crystal output. See AST documentation (https://docs.

66 Jul 25, 2022
Perform sentiment analysis on textual data that people generally post on websites like social networks and movie review sites.

Sentiment Analyzer The goal of this project is to perform sentiment analysis on textual data that people generally post on websites like social networ

Madhusudan.C.S 53 Mar 01, 2022
PIZZA - a task-oriented semantic parsing dataset

The PIZZA dataset continues the exploration of task-oriented parsing by introducing a new dataset for parsing pizza and drink orders, whose semantics cannot be captured by flat slots and intents.

17 Dec 14, 2022
Turkish Stop Words Türkçe Dolgu Sözcükleri

trstop Turkish Stop Words Türkçe Dolgu Sözcükleri In this repository I put Turkish stop words that is contained in the first 10 thousand words with th

Ahmet Aksoy 103 Nov 12, 2022
Data manipulation and transformation for audio signal processing, powered by PyTorch

torchaudio: an audio library for PyTorch The aim of torchaudio is to apply PyTorch to the audio domain. By supporting PyTorch, torchaudio follows the

1.9k Jan 08, 2023
FastFormers - highly efficient transformer models for NLU

FastFormers FastFormers provides a set of recipes and methods to achieve highly efficient inference of Transformer models for Natural Language Underst

Microsoft 678 Jan 05, 2023
KoBERTopic은 BERTopic을 한국어 데이터에 적용할 수 있도록 토크나이저와 BERT를 수정한 코드입니다.

KoBERTopic 모델 소개 KoBERTopic은 BERTopic을 한국어 데이터에 적용할 수 있도록 토크나이저와 BERT를 수정했습니다. 기존 BERTopic : https://github.com/MaartenGr/BERTopic/tree/05a6790b21009d

Won Joon Yoo 26 Jan 03, 2023
Stuff related to Ben Eater's 8bit breadboard computer

8bit breadboard computer simulator This is an assembler + simulator/emulator of Ben Eater's 8bit breadboard computer. For a version with its RAM upgra

Marijn van Vliet 29 Dec 29, 2022
Python module (C extension and plain python) implementing Aho-Corasick algorithm

pyahocorasick pyahocorasick is a fast and memory efficient library for exact or approximate multi-pattern string search meaning that you can find mult

Wojciech Muła 763 Dec 27, 2022
Rhyme with AI

Local development Create a conda virtual environment and activate it: conda env create --file environment.yml conda activate rhyme-with-ai Install the

GoDataDriven 28 Nov 21, 2022
🤗 The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools

🤗 The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools

Hugging Face 15k Jan 02, 2023
State of the Art Natural Language Processing

Spark NLP: State of the Art Natural Language Processing Spark NLP is a Natural Language Processing library built on top of Apache Spark ML. It provide

John Snow Labs 3k Jan 05, 2023
Simple Python script to scrape youtube channles of "Parity Technologies and Web3 Foundation" and translate them to well-known braille language or any language

Simple Python script to scrape youtube channles of "Parity Technologies and Web3 Foundation" and translate them to well-known braille language or any

Little Endian 1 Apr 28, 2022