A Newbie's Natural Language Processing Study Notes, Part 1

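A while ago I applied to do a guided research project in NLP next semester with two professors from my school. I didn't have high hopes, so I was surprised when the application was accepted. I have almost no serious research experience, and my programming knowledge is pitifully thin. I'm grateful to the professors for giving a newbie like me such a valuable opportunity.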

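The project is in sentiment analysis: the goal is to determine whether the texts in a digital library convey positive or negative sentiment. I have forgotten nearly all the Python I learned last semester, so I need to review it quickly.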

Meanwhile, I have started teaching myself some NLP basics. I asked ChatGPT to explain a few concepts:

Segmentation: Segmentation refers to the process of dividing a continuous piece of text, such as a sentence or a paragraph, into smaller meaningful units. In some languages, segmentation is relatively straightforward since words are clearly separated by spaces. However, in other languages like Chinese or Thai, words are not separated by spaces, making segmentation a more challenging task.
Tokenizing: Tokenizing is the process of breaking down a text into individual units called tokens. These tokens are usually words, but they can also be subwords or characters, depending on the specific task and language. Tokenization is a fundamental step in most natural language processing tasks because it converts the text into a format that the computer can process and analyze.
Stop Words: Stop words are common words that appear frequently in a language and are often removed during text processing. These words do not carry significant meaning and are unlikely to contribute much to the understanding or analysis of a text. Examples of stop words in English include “the,” “a,” “an,” “and,” “is,” “are,” etc. Removing stop words can help reduce the dimensionality of the data and improve processing efficiency for certain NLP tasks like text classification or sentiment analysis.
Stemming: Stemming is a process in which words are reduced to their root or base form by removing suffixes or prefixes. The resulting form may not always be a valid word, but it helps to group related words together and reduce inflected variations. For example, stemming might convert “running” and “runs” to the common stem “run.” There are various stemming algorithms available, such as the Porter stemming algorithm, which is widely used. (ChatGPT originally listed “ran” in this example too, which is wrong: “ran” → “run” is lemmatization, not stemming. Don't trust ChatGPT blindly.)
Lemmatization: Lemmatization is similar to stemming in that it aims to reduce words to their base form. However, unlike stemming, lemmatization ensures that the resulting form is a valid word (a lemma) found in the language’s dictionary. This means that lemmatization produces more accurate and meaningful results compared to stemming. For instance, lemmatization would convert “better” to “good” and “ran” to “run.”
Part-of-Speech Tagging: Part-of-speech (POS) tagging is the process of assigning a grammatical tag to each word in a sentence, indicating its part of speech (e.g., noun, verb, adjective, etc.). POS tagging is essential for understanding the syntactic structure of a sentence and is used in many NLP tasks like information extraction, machine translation, and syntactic parsing.
Named Entity Recognition: Named entity recognition (NER), also called named entity tagging, is a process that involves identifying and classifying named entities in text into predefined categories such as person names, organization names, locations, dates, monetary values, etc. NER is useful for extracting valuable information from unstructured text and is employed in various applications, including information retrieval, question answering systems, and sentiment analysis.
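
To make tokenizing and stop-word removal a bit more concrete, here is a minimal sketch in plain Python. It is only an illustration under my own assumptions: the `tokenize` and `remove_stop_words` helpers and the tiny `STOP_WORDS` set are made up for this example, and real libraries such as NLTK or spaCy ship far more careful tokenizers and much longer stop-word lists.

```python
import re

# A tiny illustrative stop-word list (an assumption for this sketch);
# real lists contain a hundred or more entries.
STOP_WORDS = {"the", "a", "an", "and", "is", "are"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


tokens = tokenize("The cat and the dog are friends.")
print(tokens)
print(remove_stop_words(tokens))
```

The regular expression only works for space-separated languages like English. For Chinese or Thai, the segmentation step described above would need a dedicated word segmenter instead.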

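These concepts are all fairly easy to grasp. My impression is that lemmatization mainly serves fusional or agglutinative languages; an isolating language like Chinese doesn't seem to need much lemmatizing. Stemming sounds rather unreliable, given how many irregular inflections Indo-European languages have.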
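
Here is a rough sketch of why irregular forms trip up stemming. Everything in it is a toy I made up for illustration: `toy_stem` is a crude suffix stripper (not the Porter algorithm), and `toy_lemmatize` is just a lookup in a hand-written `LEMMAS` dictionary, standing in for the real dictionary a lemmatizer would consult.

```python
def toy_stem(word):
    """Crude suffix stripping: drop "ing" or "s", then undo a doubled final consonant."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # "running" -> "runn" -> "run"
    if len(word) >= 4 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word


# Hand-written lemma dictionary (an assumption for this sketch).
LEMMAS = {"running": "run", "runs": "run", "ran": "run", "better": "good"}


def toy_lemmatize(word):
    """Look the word up in the lemma dictionary; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)


for word in ["running", "runs", "ran", "better"]:
    print(word, "| stem:", toy_stem(word), "| lemma:", toy_lemmatize(word))
```

The suffix stripper handles "running" and "runs" but leaves "ran" and "better" untouched, because no amount of chopping off endings turns them into "run" or "good". The lookup gets them right, but only because someone wrote those entries into the dictionary first.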

I also found a few YouTube tutorials to work through later:

Natural Language Processing (NLP) Zero to Hero by TensorFlow
Complete Natural Language Processing (NLP) Tutorial in Python! (with examples) by Keith Galli
Sentiment Analysis with/without NLTK Python by buildwithpython
