Getting Started with Natural Language Processing in Python

A significant portion of the data that is generated today is unstructured. Unstructured data includes social media comments, browsing history and customer feedback. Have you found yourself in a situation with a bunch of textual data to analyse, and no idea how to proceed? Natural language processing in Python can help.

The objective of this tutorial is to enable you to analyze textual data in Python through the concepts of Natural Language Processing (NLP). You will first learn how to tokenize your text into smaller chunks, normalize words to their root forms, and then, remove any noise in your documents to prepare them for further analysis.

Let’s get started!

Prerequisites

In this tutorial, we will use Python’s nltk library to perform all NLP operations on the text. At the time of writing this tutorial, we used version 3.4 of nltk. To install the library, you can use the pip command on the terminal:

pip install nltk==3.4

To check which version of nltk you have in the system, you can import the library into the Python interpreter and check the version:

import nltk
print(nltk.__version__)

To perform certain actions within nltk in this tutorial, you may have to download specific resources. We will describe each resource as and when required.

However, if you would like to avoid downloading individual resources later in the tutorial and grab them now in one go, run the following command:

python -m nltk.downloader all

Step 1: Convert into Tokens

A computer system cannot find meaning in natural language by itself. The first step in processing natural language is to convert the original text into tokens. A token is a sequence of contiguous characters that carries some meaning. It is up to you to decide how to break a sentence into tokens. For instance, an easy method is to split a sentence on whitespace to break it into individual words.

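As a quick illustration, Python’s built-in str.split() performs exactly this kind of whitespace tokenization. Note how the punctuation stays attached to the neighbouring words:

sentence = "Hi, this is a nice hotel."
print(sentence.split())
# ['Hi,', 'this', 'is', 'a', 'nice', 'hotel.']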

In the NLTK library, you can use the word_tokenize() function to convert a string to tokens. However, you will first need to download the punkt resource. Run the following command in the terminal:

nltk.download('punkt')

Next, you need to import word_tokenize from nltk.tokenize to use it.

from nltk.tokenize import word_tokenize
print(word_tokenize("Hi, this is a nice hotel."))

The output of the code is as follows:

['Hi', ',', 'this', 'is', 'a', 'nice', 'hotel', '.']

You’ll notice that word_tokenize does not simply split a string based on whitespace, but also separates punctuation into tokens. It’s up to you if you would like to retain the punctuation marks in the analysis.

Step 2: Convert Words to their Base Forms

When you are processing natural language, you’ll often notice that there are various grammatical forms of the same word. For instance, “go”, “going” and “gone” are forms of the same verb, “go”.

While the requirements of your project may call for retaining words in their various grammatical forms, let us discuss a way to convert the various grammatical forms of the same word into its base form. There are two techniques that you can use to convert a word to its base form.

The first technique is stemming. Stemming is a simple algorithm that removes affixes from a word. There are various stemming algorithms available for use in NLTK. We will use the Porter algorithm in this tutorial.

We first import PorterStemmer from nltk.stem.porter. Next, we initialize the stemmer, assign it to the stemmer variable, and then use the .stem() method to find the base form of a word.

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem("going"))

The output of the code above is go. If you run the stemmer for the other forms of “go” described above, you will notice that the stemmer returns the same base form, “go”. However, as stemming is only a simple algorithm based on removing word affixes, it fails when the words are less commonly used in language.

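For example, running the stemmer over both the base form and the “-ing” form of the verb returns the same stem:

# Both forms reduce to the stem "go".
for word in ["go", "going"]:
    print(stemmer.stem(word))
# go
# go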

When you try the stemmer on the word “constitutes”, it gives an unintuitive result.

print(stemmer.stem("constitutes"))

You will notice the output is “constitut”.

This issue is solved by moving on to a more complex approach towards finding the base form of a word in a given context. The process is called lemmatization. Lemmatization normalizes a word based on the context and vocabulary of the text. In NLTK, you can lemmatize sentences using the WordNetLemmatizer class.

First, you need to download the wordnet resource from the NLTK downloader in the Python terminal.

nltk.download('wordnet')

Once it is downloaded, you need to import the WordNetLemmatizer class and initialize it.

from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()

To use the lemmatizer, use the .lemmatize() method. It takes two arguments — the word and the context. In our example, we will use “v” for context. Let us explore the context further after looking at the output of the .lemmatize() method.

print(lem.lemmatize('constitutes', 'v'))

You would notice that the .lemmatize() method correctly converts the word “constitutes” to its base form, “constitute”. You would also notice that lemmatization takes longer than stemming, as the algorithm is more complex.

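The context argument is also what lets the lemmatizer handle irregular verb forms. As a quick check, assuming the wordnet resource downloaded above, the forms of “go” mentioned earlier (plus the irregular past tense “went”) all map back to the same lemma:

# With "v" as context, the different verb forms resolve to the lemma "go".
for word in ["going", "gone", "went"]:
    print(lem.lemmatize(word, 'v'))
# go
# go
# go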

Let’s check how to determine the second argument of the .lemmatize() method programmatically. NLTK has a pos_tag function which helps in determining the context of a word in a sentence. However, you first need to download the averaged_perceptron_tagger resource through the NLTK downloader.

nltk.download('averaged_perceptron_tagger')

Next, import the pos_tag function and run it on a sentence.

from nltk.tag import pos_tag
sample = "Hi, this is a nice hotel."
print(pos_tag(word_tokenize(sample)))

You will notice that the output is a list of pairs. Each pair consists of a token and its tag, which signifies the context of the token in the overall text. Notice that a punctuation mark is tagged as itself.

[('Hi', 'NNP'), (',', ','), ('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('nice', 'JJ'), ('hotel', 'NN'), ('.', '.')]

How do you decode the context of each token? A full list of all tags and their corresponding meanings is available on the web. Notice that the tags of all nouns begin with “N”, and those of all verbs begin with “V”. We can use this information in the second argument of our .lemmatize() method.

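If you prefer to look up tag descriptions without leaving Python, NLTK also bundles a small helper for this. Note that it needs the separate tagsets resource, an extra download not used elsewhere in this tutorial:

import nltk.help
nltk.download('tagsets')
nltk.help.upenn_tagset('NN')  # prints the description and example usage of the NN tag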

def lemmatize_tokens(sentence):
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = []
    for word, tag in pos_tag(sentence):
        # Map the part-of-speech tag to the context argument expected by the lemmatizer.
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        lemmatized_tokens.append(lemmatizer.lemmatize(word, pos))
    return lemmatized_tokens

sample = "Legal authority constitutes all magistrates."
print(lemmatize_tokens(word_tokenize(sample)))

The output of the code above is as follows:

['Legal', 'authority', 'constitute', 'all', 'magistrate', '.']

This output is as expected: “constitutes” and “magistrates” have been converted to “constitute” and “magistrate”, respectively.

Step 3: Data Cleaning

The next step in preparing data is to clean the data and remove anything that does not add meaning to your analysis. Broadly, we will look at removing punctuation and stop words from your analysis.

Removing punctuation is a fairly easy task. The punctuation object of the string library contains all the punctuation marks in English.

import string
print(string.punctuation)

The output of this code snippet is as follows:

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In order to remove punctuation from tokens, you can simply run:

cleaned_tokens = []
for token in tokens:
    if token in string.punctuation:
        continue  # skip punctuation tokens
    cleaned_tokens.append(token)

Next, we will focus on removing stop words. Stop words are commonly used words in language, like “I”, “a” and “the”, which add little meaning to text when analyzing it. We will therefore remove stop words from our analysis. First, download the stopwords resource from the NLTK downloader.

nltk.download('stopwords')

Once your download is complete, import stopwords from nltk.corpus and use the .words() method with “english” as the argument. The result is a list of 179 stop words in the English language.

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
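
You can quickly inspect what was loaded; the exact contents depend on the version of the NLTK data you have installed:

print(len(stop_words))  # 179 with the corpus version used in this tutorial
print(stop_words[:5])   # ['i', 'me', 'my', 'myself', 'we']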

We can combine the lemmatization example with the concepts discussed in this section to create the following function, clean_data(). Additionally, before checking whether a word is part of the stop words list, we convert it to lowercase. This way, we still capture a stop word if it occurs at the start of a sentence and is capitalized.

def clean_data(tokens, stop_words=()):
    cleaned_tokens = []
    lemmatizer = WordNetLemmatizer()
    for token, tag in pos_tag(tokens):
        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        token = lemmatizer.lemmatize(token, pos)
        # Keep the token only if it is neither punctuation nor a stop word.
        if token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token)
    return cleaned_tokens

sample = "The quick brown fox jumps over the lazy dog."
stop_words = stopwords.words('english')
print(clean_data(word_tokenize(sample), stop_words))

The output of the example is as follows:

['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']

As you can see, the punctuation and stop words have been removed.

Word Frequency Distribution

Now that you are familiar with the basic cleaning techniques in NLP, let’s try to find the frequency of words in a text. For this exercise, we’ll use the text of the fairy tale, The Mouse, The Bird and The Sausage, which is freely available on Project Gutenberg. We’ll store the text of this fairy tale in a string, text.

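How you get the story into the text variable is up to you. As a minimal sketch, assuming you have saved the story from Project Gutenberg to a local file (the filename below is only a placeholder), you could read it like this:

# Hypothetical filename; save the fairy tale text under any name you like.
with open("the_mouse_the_bird_and_the_sausage.txt", encoding="utf-8") as f:
    text = f.read()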

First, we tokenize text and then clean it using the function clean_data that we defined above.

tokens = word_tokenize(text)
cleaned_tokens = clean_data(tokens, stop_words=stop_words)

To find the frequency distribution of words in your text, you can use the FreqDist class of NLTK. Initialize the class with the tokens as an argument. Then use the .most_common() method to find the commonly occurring terms. Let us try to find the top ten terms in this case.

from nltk import FreqDist
freq_dist = FreqDist(cleaned_tokens)
print(freq_dist.most_common(10))

Here are the ten most commonly occurring terms in this fairy tale.

[('bird', 15), ('sausage', 11), ('mouse', 8), ('wood', 7), ('time', 6), ('long', 5), ('make', 5), ('fly', 4), ('fetch', 4), ('water', 4)]

Unsurprisingly, the three most common terms are the three main characters in the fairy tale.

The raw frequency of words may not be very informative on its own when analysing text. Typically, the next step in NLP is to generate a statistic, TF-IDF (term frequency-inverse document frequency), which signifies the importance of a word in a collection of documents.

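As a pointer in that direction, here is a minimal sketch of computing TF-IDF scores with scikit-learn’s TfidfVectorizer. scikit-learn is not used elsewhere in this tutorial and is assumed to be installed separately (pip install scikit-learn), and the three tiny “documents” below are made up purely for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
# Toy corpus; in practice each entry would be one of your cleaned documents.
documents = [
    "the bird fetched the wood",
    "the sausage cooked the dinner",
    "the mouse carried the water",
]
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(documents)
# Rows correspond to documents, columns to terms, values to TF-IDF weights.
print(vectorizer.get_feature_names_out())  # requires scikit-learn 1.0 or newer
print(tfidf.toarray())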

Conclusion

In this post, you were introduced to natural language processing in Python. You converted text to tokens, converted words to their base forms and finally, cleaned the text to remove any part which didn’t add meaning to the analysis.

Although you looked at simple NLP tasks in this tutorial, there are many techniques to explore. One may wish to perform topic modelling on textual data, where the objective is to find a common topic that a text might be talking about. A more complex task in NLP is the implementation of a sentiment analysis model to determine the feeling behind any text.

What procedures do you follow when you are given a pile of text to work with? Let us know in the comments below.

Translated from: https://www.sitepoint.com/natural-language-processing-python/
