Python Modules: Sentiment Analysis in Natural Language Processing (NLP)

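Python is a general-purpose programming language and one of the most widely used languages for natural language processing (NLP). Sentiment analysis is a core NLP task: it analyzes and classifies text to determine whether the emotion it expresses is positive, negative, or neutral. It is widely applied in social media monitoring, marketing, and public opinion analysis.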

I. Installing the NLTK module

NLTK (Natural Language Toolkit) is one of the most popular NLP libraries for Python. To use it for sentiment analysis, install it first with pip:

pip install nltk

Once it is installed, import the package in Python:

import nltk
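
The pip package does not include the corpora and tokenizer data used in the rest of this article, so they have to be downloaded once. The sketch below is one way to do that; the helper name download_nltk_resources and the REQUIRED_RESOURCES tuple are hypothetical names introduced here. Recent NLTK releases load tokenizer data from punkt_tab while older ones use punkt, so the sketch assumes it is simplest to request both.

```python
# Resources used later in this article: the movie review corpus,
# the English stop-word list, and the tokenizer data for word_tokenize.
REQUIRED_RESOURCES = ("movie_reviews", "stopwords", "punkt", "punkt_tab")


def download_nltk_resources(resources=REQUIRED_RESOURCES):
    """Download each resource and return {name: True/False} for success."""
    import nltk  # imported here to keep the helper self-contained

    # nltk.download returns True when the resource is available afterwards
    return {name: nltk.download(name, quiet=True) for name in resources}
```

Calling download_nltk_resources() once (it needs network access) should be enough; later runs find the data already in the local NLTK data directory.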

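II. Loading a sentiment analysis dataset

Sentiment analysis needs a labeled dataset for training and testing. NLTK provides the movie_reviews corpus, which contains 2,000 movie reviews, each labeled as either positive or negative (1,000 of each). The corpus has no neutral class.

The following code loads the movie review data from the NLTK corpus: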

from nltk.corpus import movie_reviews
print(movie_reviews.categories())

The output should be ['neg', 'pos'], meaning the dataset has two categories: negative reviews (neg) and positive reviews (pos).

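III. Preparing and cleaning the data

Before sentiment analysis, the text goes through a series of processing and cleaning steps:

1. Remove punctuation, digits, and other special characters.

2. Convert all characters to lowercase.

3. Split the text into words.

4. Filter out stop words.

The following code performs this preprocessing: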

import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def clean_text(text):
    # Remove punctuation and digits
    text = text.translate(str.maketrans("", "", string.punctuation + string.digits))
    # Convert all characters to lowercase
    text = text.lower()
    # Tokenize into words
    words = word_tokenize(text)
    # Filter out English stop words
    stop_words = set(stopwords.words("english"))
    words = [word for word in words if word not in stop_words]
    # Return the cleaned list of words
    return words
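
To see what the four steps do to a single sentence, here is a minimal sketch that mirrors them using only the standard library. It is an illustration, not a replacement for clean_text: the function name clean_text_simple and the tiny DEMO_STOP_WORDS set are assumptions made for this example, and whitespace splitting is only a crude stand-in for word_tokenize.

```python
import string

# A tiny illustrative stop-word set; NLTK's English list is much longer.
DEMO_STOP_WORDS = {"the", "a", "an", "is", "was", "and", "of", "it", "this"}


def clean_text_simple(text, stop_words=DEMO_STOP_WORDS):
    # Step 1: remove punctuation and digits
    text = text.translate(str.maketrans("", "", string.punctuation + string.digits))
    # Step 2: convert to lowercase
    text = text.lower()
    # Step 3: split into words on whitespace
    words = text.split()
    # Step 4: filter out stop words
    return [word for word in words if word not in stop_words]


sample = "This movie was GREAT, 10/10 and the acting is superb!"
print(clean_text_simple(sample))  # ['movie', 'great', 'acting', 'superb']
```

Note that removing punctuation before tokenizing also strips apostrophes, so a contraction such as "don't" becomes "dont" and is no longer matched by a stop-word list.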

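IV. Feature extraction

To run sentiment analysis, the text has to be represented in a form a classifier can work with. A common approach is to use a feature extractor that converts each text into a numeric vector. Here we use the bag-of-words model, which ignores word order and records only which words appear in a review; NLTK classifiers accept these features as a dictionary that maps each word to True.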
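
As a small illustration of the bag-of-words idea, the sketch below builds two common variants from the same token list: presence features (word maps to True) and count features (word maps to its frequency). The function names presence_features and count_features are hypothetical and exist only for this example.

```python
from collections import Counter


def presence_features(words):
    # Each distinct word maps to True; word order and repeats are discarded
    return {word: True for word in words}


def count_features(words):
    # Each distinct word maps to the number of times it occurs
    return dict(Counter(words))


tokens = ["great", "movie", "great", "cast"]
print(presence_features(tokens))  # {'great': True, 'movie': True, 'cast': True}
print(count_features(tokens))     # {'great': 2, 'movie': 1, 'cast': 1}
```

Presence features are usually sufficient for review-length texts and keep the feature dictionaries small.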

The following code creates a bag-of-words feature extractor and uses it to train and evaluate a classifier:

from nltk import FreqDist
from nltk import NaiveBayesClassifier
from nltk.classify import accuracy as nltk_accuracy
# Requires scikit-learn (pip install scikit-learn)
from sklearn.metrics import precision_recall_fscore_support as score

class BagOfWords:
    def __init__(self, all_words):
        self.all_words = all_words
    # Feature extractor: map each word in a review to True (word presence)
    def bag_of_words(self, cleaned_words):
        return {word: True for word in cleaned_words}

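    # Flatten every review of both classes into a single word list
    def all_words_cleaned(self, reviews):
        cleaned_words = []
        for class_reviews in reviews:     # [positive reviews, negative reviews]
            for review in class_reviews:  # each review is a list of words
                cleaned_words.extend(review)
        return cleaned_words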

    # Word frequency distribution
    def frequencies(self, cleaned_words):
        freq_dist = FreqDist(cleaned_words)
        print(freq_dist)

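    # Build the feature sets, train a classifier, and evaluate it
    def train_test(self, cleaned_data):
        # Labeled feature sets: cleaned_data[0] is positive, cleaned_data[1] is negative
        positive_features = [(self.bag_of_words(review), "Positive") for review in cleaned_data[0]]
        negative_features = [(self.bag_of_words(review), "Negative") for review in cleaned_data[1]]

        # Hold out the last 20% of each class for testing so both sets contain both labels
        pos_cut = int(len(positive_features) * 0.8)
        neg_cut = int(len(negative_features) * 0.8)
        train_set = positive_features[:pos_cut] + negative_features[:neg_cut]
        test_set = positive_features[pos_cut:] + negative_features[neg_cut:]

        # Train a Naive Bayes classifier
        classifier = NaiveBayesClassifier.train(train_set)

        # Accuracy on the held-out test set
        print("Test accuracy:", nltk_accuracy(classifier, test_set))

        # Predict on the test set and compute weighted precision, recall, and F-score
        y_true = [category for _, category in test_set]
        y_pred = [classifier.classify(feats) for feats, _ in test_set]
        precision, recall, fscore, support = score(y_true, y_pred, average="weighted")
        print("Precision: ", precision)
        print("Recall: ", recall)
        print("F-score: ", fscore)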
        
# List the file IDs of the movie review corpus by category
positive_reviews = movie_reviews.fileids("pos")
negative_reviews = movie_reviews.fileids("neg")
print(f"num of pos reviews: {len(positive_reviews)}")
print(f"num of neg reviews: {len(negative_reviews)}")

# Load and preprocess the reviews: reviews[0] is positive, reviews[1] is negative
reviews = [
    [clean_text(movie_reviews.raw(fileids=[fileid])) for fileid in positive_reviews],
    [clean_text(movie_reviews.raw(fileids=[fileid])) for fileid in negative_reviews],
]

# Create the feature extractor, then collect the full word list through it
bow = BagOfWords(all_words=[])
bow.all_words = bow.all_words_cleaned(reviews)
bow.frequencies(bow.all_words)
bow.train_test(reviews)
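
To make the classifier step less of a black box, here is a minimal sketch of how a Naive Bayes classifier can score a review. It is a simplified multinomial variant with add-one smoothing written for illustration, not NLTK's implementation (whose probability estimates differ); the names train_naive_bayes and classify_words and the toy training data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (word_list, label) pairs."""
    label_counts = Counter(label for _, label in labeled_docs)
    word_counts = defaultdict(Counter)
    for words, label in labeled_docs:
        word_counts[label].update(words)
    vocab = {word for words, _ in labeled_docs for word in words}
    total_docs = sum(label_counts.values())
    # Prior probability of each label
    priors = {label: n / total_docs for label, n in label_counts.items()}
    return priors, word_counts, vocab


def classify_words(words, priors, word_counts, vocab):
    scores = {}
    for label, prior in priors.items():
        n_label = sum(word_counts[label].values())
        # Work in log space to avoid underflow on long reviews
        log_score = math.log(prior)
        for word in words:
            if word in vocab:  # ignore words never seen in training
                # Add-one (Laplace) smoothing keeps unseen word/label pairs nonzero
                log_score += math.log((word_counts[label][word] + 1) / (n_label + len(vocab)))
        scores[label] = log_score
    return max(scores, key=scores.get)


toy_train = [
    (["great", "fun", "great"], "Positive"),
    (["loved", "great", "cast"], "Positive"),
    (["boring", "bad", "plot"], "Negative"),
    (["bad", "dull", "boring"], "Negative"),
]
model = train_naive_bayes(toy_train)
print(classify_words(["great", "cast"], *model))   # Positive
print(classify_words(["boring", "plot"], *model))  # Negative
```

The classifier picks the label whose prior, multiplied by the smoothed likelihood of each word, is largest; "naive" refers to treating the words as independent given the label.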

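V. Results and conclusion

Running the code above prints the classification accuracy on the test set, along with the weighted precision, recall, and F-score. The accuracy reported for this example is 80.73%, which suggests that a Naive Bayes classifier is reasonably effective for sentiment analysis; the exact figure depends on how the data is split into training and test sets.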
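
For readers unfamiliar with these metrics, the sketch below computes precision, recall, and F1 by hand for one class. It is only an illustration with made-up labels: the function name binary_metrics is hypothetical, and the article's code instead relies on scikit-learn with average="weighted", which averages the per-class scores weighted by class size.

```python
def binary_metrics(y_true, y_pred, positive_label="Positive"):
    pairs = list(zip(y_true, y_pred))
    # True positives, false positives, and false negatives for the chosen class
    tp = sum(t == positive_label and p == positive_label for t, p in pairs)
    fp = sum(t != positive_label and p == positive_label for t, p in pairs)
    fn = sum(t == positive_label and p != positive_label for t, p in pairs)
    # Precision: share of predicted positives that are correct
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: share of actual positives that were found
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = ["Positive", "Positive", "Positive", "Negative", "Negative"]
y_pred = ["Positive", "Positive", "Negative", "Positive", "Negative"]
print(binary_metrics(y_true, y_pred))
```

In this toy case there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3.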

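This article showed how to use Python's NLTK module for sentiment analysis: loading the dataset, cleaning and preprocessing the text, extracting features, and building a classifier. Sentiment analysis has applications in many fields and makes an important contribution to NLP by helping us better understand and analyze natural language.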