A Beginner's Natural Language Processing Study Notes, Part 2

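Today I met my advisor online for the first time and learned more about the background of the research we'll be doing. He is very chill: he said that since I'm the only student on this project, the research topic and the plan are both very flexible. He also said that since I speak other languages, I might be able to analyze non-English datasets down the road. I figure I should build a solid foundation with English datasets first, since right now I don't know anything 😅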

I need to learn the concept of the Bag-of-words model (which can also be thought of as one form of extracted features). Bringing in ChatGPT once again:

In the realm of natural language processing (NLP), “extracted features” refer to specific attributes or characteristics of text data that are captured and represented in a structured format. These features are derived from the raw text and are used as inputs for machine learning models to perform various NLP tasks.

Think of text data as unstructured information that is difficult for computers to directly understand. Extracted features help convert this text into a format that algorithms can work with effectively. These features can encompass a wide range of linguistic and semantic aspects, such as:

Word Frequencies: Counting how often each word appears in a document or a collection of documents. This can help capture the importance of certain words within the context of the data.
N-grams: These are sequences of ‘n’ consecutive words in a text. For example, “natural language processing” would be a 3-gram. Capturing these sequences can provide context and help understand relationships between words.
Part-of-Speech (POS) Tags: Assigning grammatical labels to each word, such as nouns, verbs, adjectives, etc. This can provide insights into sentence structure and syntactic relationships.
TF-IDF (Term Frequency-Inverse Document Frequency): A numerical representation of the importance of a word in a document relative to its frequency across a collection of documents. It helps identify key terms in a document.
Word Embeddings: Dense vector representations of words that capture semantic relationships. These are often generated using techniques like Word2Vec or GloVe.
Sentiment Scores: Assigning a sentiment score to text to determine whether it’s positive, negative, or neutral. This can be useful for sentiment analysis.
Named Entity Recognition (NER): Identifying and classifying entities such as names of people, places, organizations, etc., in text.
Syntax Trees: Representations of sentence structure that show how words are connected grammatically.
Dependency Relations: Capturing the grammatical relationships between words, such as subject-verb-object relationships.
Topic Modeling: Identifying the main topics or themes within a collection of documents.
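
To make the first few items in the list concrete, here is a minimal sketch in plain Python of word frequencies (a bag-of-words), n-grams, and TF-IDF. Everything in it is an assumption for illustration: the two toy sentences, the simple whitespace tokenizer, the helper names (`tokenize`, `bag_of_words`, `ngrams`, `tf_idf`), and the unsmoothed TF-IDF formula (term frequency times log of total documents over documents containing the term). Real libraries often use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (word.strip(".,!?;:\"'()") for word in text.lower().split())
    return [t for t in tokens if t]


def bag_of_words(tokens):
    """Word frequencies: how often each token appears, ignoring order."""
    return Counter(tokens)


def ngrams(tokens, n):
    """All sequences of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs_tokens):
    """Unsmoothed TF-IDF per document: (count / doc length) * log(N / df)."""
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))  # count each term once per document
    scores = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Toy corpus (hypothetical, not from any real dataset).
docs = [
    "Natural language processing is fun.",
    "Language models process natural text.",
]
tokenized = [tokenize(d) for d in docs]

bow = bag_of_words(tokenized[0])
trigrams = ngrams(tokenized[0], 3)
weights = tf_idf(tokenized)

print(bow)
print(trigrams)
print(weights[0])
```

In this toy corpus, words that appear in both sentences (such as "natural" and "language") get a TF-IDF weight of zero, while words unique to one sentence get a positive weight, which is the sense in which TF-IDF highlights the terms that distinguish a document.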

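My next meeting with my advisor is a week away. I plan to spend this week getting familiar with the documentation and tutorials on the website he gave me, and to try following a tutorial with a smaller dataset to do some simple analysis and visualization on my own.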
