Natural Language Processing (NLP) for Beginners (2023)

Step-by-step beginner’s guide to NLP using Python


In this post, I will introduce you to one of the best-known fields of artificial intelligence: Natural Language Processing (NLP). After the introduction, I will walk you through a hands-on exercise in which we extract the most frequent keywords from a web page. For the hands-on project, we will use an NLP library called NLTK (Natural Language Toolkit), which is covered right after the introduction. By the end of this article, you will have a better understanding of natural language processing applications and how they work. Without losing any time, let’s get started!

Table of Contents

  • Introduction
  • NLTK (Natural Language Toolkit)
  • BS4 (Beautiful Soup 4)
  • Step 1: Import Libraries
  • Step 2: Reading the Page
  • Step 3: Data Cleaning
  • Step 4: Tokenization
  • Step 5: Data Visualization
  • Video Demonstration

Natural language refers to the language we use in our daily lives. The study of language has been around for a long time, but AI-related research on it has grown alongside computer science and programming. The internet has changed the way we live and how we communicate with each other. For example, instead of sending paper mail and letters, we now use text messages, emails, voice messages, and so on. You can learn more about this field by doing some research online.

To give you a better understanding of how Natural Language Processing is used in machine learning and artificial intelligence, here are some real-life applications:

  • Google Translate: The machine intelligence behind Google Translate does not simply translate word by word. It translates whole sentences into the language you want, aiming to preserve the meaning of the sentence.
  • Grammarly: The machine intelligence behind this service is good with grammar and vocabulary. It is a great example of how language processing has evolved in the last couple of years. It checks the grammar of your sentences and even recommends ways to improve the quality of the writing.
  • Voice Assistants: This is another area where language processing has improved a lot. Voice assistants combine speech recognition with the processing of words. Currently, the best-known ones are Apple Siri, Google Assistant, and Amazon Alexa.
  • Chatbots: Another great example of language processing is chatbots. They are very much like virtual assistants but with more specific goals. They are most commonly used on customer-facing websites. They help you get the information you need without talking to a real person. First, they try to understand what you need, and then they present the results to you.
  • Web Scraping: Web scraping is another field where language processing is commonly used. It extracts information from a web page so that you do not have to copy each paragraph by hand. It is a great way to collect valuable data for training a machine learning model, and it is also a helpful tool for search engine optimization.

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.

Reference: http://www.nltk.org
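
To make one of those capabilities concrete, here is a minimal sketch of stemming with NLTK’s PorterStemmer, which does not need any extra corpus download. The word list is made up for illustration and is not part of the project we build later.

```python
from nltk.stem import PorterStemmer

# Hypothetical sample words, chosen only for illustration
words = ["processing", "processed", "running", "languages"]

stemmer = PorterStemmer()
# Map each word to its stem (a shortened base form, not always a dictionary word)
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Stemming is one way to make different forms of the same word count as one keyword; we do not use it in the steps below, but it is a natural next experiment.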


We have to install the NLTK module before we can use it in our project. Running the following line in your terminal will install it:

pip install nltk

Beautiful Soup is a Python library for getting data out of HTML, XML, and other markup languages. Beautiful Soup helps you pull particular content from a webpage, remove the HTML markup, and save the information. It is a tool for web scraping that helps you clean up and parse the documents you have pulled down from the web.

Reference: https://programminghistorian.org/en/lessons/intro-to-beautiful-soup

Now, let’s install the latest version of the Beautiful Soup library using pip:

pip install beautifulsoup4
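
If you want to confirm that the installation worked before opening the notebook, a quick check like the one below can help. This is only a sketch: the helper name find_missing is my own, and the module list assumes the three imports used later in this guide (note that beautifulsoup4 is imported under the name bs4).

```python
import importlib.util

# Import names used later in this guide (beautifulsoup4 is imported as "bs4")
REQUIRED_MODULES = ["nltk", "bs4", "plotly"]

def find_missing(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are installed.")
```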

Once the libraries are installed, we can start programming. The visualization step also uses Plotly, which you can install the same way with pip install plotly. For this project, I will be using Jupyter Notebook. First things first, let’s import the libraries into the notebook.

import nltk
from nltk.corpus import stopwords
from bs4 import BeautifulSoup
import urllib.request
import plotly.io as pio

In this step, we will open the web page using urllib.request.urlopen. After opening it, we will read the whole source code of the page. As you may know, every web page is built from HTML code behind the scenes. You can right-click on any web page and choose ‘Inspect’ (or ‘Inspect Element’) to get an idea of what that code looks like.

I chose the Wikipedia page about Natural Language Processing.


page = urllib.request.urlopen('https://en.wikipedia.org/wiki/Natural_language_processing')
html_plain = page.read()
print(html_plain)

This is what it looks like when we print the plain HTML code:

[Screenshot: the raw HTML printed by the code above]
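
One detail worth knowing: page.read() returns bytes rather than a string, which is why the printed output starts with b'. The sketch below shows one possible way to wrap the download in a small helper that also decodes the bytes into text. The function name fetch_html, the User-Agent value, and the 10-second timeout are my own assumptions, not something the steps in this guide depend on.

```python
import urllib.request

def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a str (hypothetical helper)."""
    # Some sites reject requests without a User-Agent header, so send a simple one
    request = urllib.request.Request(url, headers={"User-Agent": "nlp-beginner-tutorial"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset declared in the response, falling back to UTF-8
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")

# Usage (requires network access):
# html_text = fetch_html("https://en.wikipedia.org/wiki/Natural_language_processing")
```

BeautifulSoup accepts both bytes and strings, so the rest of the guide works either way.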

As you can see from the screenshot, the plain HTML code needs some cleaning. BeautifulSoup will help us with this. We have to get rid of lots of unnecessary characters such as double quotes, slashes, angle brackets, and many more. Don’t worry, it also removes the HTML tags themselves 😊

Let’s see the magical power of BS4 (Beautiful Soup 4) after running the following lines:

soup = BeautifulSoup(html_plain, 'html.parser')
soup_text = soup.get_text(strip=True)
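
A small caveat with get_text(strip=True): without a separator, the text of neighbouring elements is joined directly, so the last word of a heading can end up glued to the first word of the next paragraph. Here is a minimal sketch of a slightly more careful cleanup, using a made-up HTML snippet instead of the real page. It passes separator=" " and explicitly removes script and style elements, whose contents can otherwise end up in the text in some Beautiful Soup versions.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a downloaded page
sample_html = """
<html><head><title>Demo</title><style>p { color: red; }</style></head>
<body><h1>Natural language</h1><p>Processing text</p>
<script>console.log("ignored");</script></body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Drop script and style elements so their contents do not end up in the text
for tag in soup(["script", "style"]):
    tag.decompose()

# separator=" " keeps a space between text from neighbouring elements
text = soup.get_text(separator=" ", strip=True)
print(text)
```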

Great! One more thing before we move on to splitting the words: to improve the quality of the processing, I recommend lowercasing all the characters. This will help when we start counting word frequencies. Otherwise, the machine will treat “Natural” and “natural” as different words, because string comparison is case-sensitive.

ready_text = soup_text.lower()
print(ready_text)

[Screenshot: the lowercased page text without HTML tags]

Looks much better! Now, let’s move to the next step, where we will split the text into a list of words. This process is known as tokenization.

This step is crucial when working on a natural language processing project. First, we will tokenize the text by splitting it into a list of words. After that, we will do some word cleaning: we will use NLTK (Natural Language Toolkit) to remove the stop words, such as a, and, of, that, the, with, etc. This leaves us with the keywords, which give us a better idea of what the page is about.

tokens = []
# split() with no arguments splits on any run of whitespace
for t in ready_text.split():
    tokens.append(t)
print(tokens)

# Run this line if the next code block raises a LookupError for the stop words
nltk.download('stopwords')
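
Splitting on whitespace keeps punctuation attached to the words, so “processing,” and “processing” would be counted separately. As an optional refinement, here is a minimal sketch of a regex-based tokenizer built only on the standard library. The function name tokenize and the sample sentence are my own; note that this simple pattern only keeps the letters a to z, so accented characters and digits are dropped.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation and digits."""
    # Match runs of letters, optionally with an internal apostrophe (e.g. "isn't")
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

# Hypothetical sample sentence for illustration
sample = "Natural language processing (NLP) isn't magic; it's statistics, mostly."
print(tokenize(sample))
```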

Now, let’s remove the stop words from our tokens list.

# A set makes the membership test fast
stop_words = set(stopwords.words('english'))
# Keep only the tokens that are not stop words
clean_tokens = [token for token in tokens if token not in stop_words]
print(clean_tokens)
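
On a real web page you will often find a few page-specific words (menu labels, “edit” links, and so on) that are not in NLTK’s stop word list but still add noise. One possible way to handle them is sketched below; the helper remove_stop_words and the tiny word lists are made up for illustration, and in our project the stop words would come from stopwords.words('english').

```python
def remove_stop_words(tokens, stop_words, extra_words=()):
    """Return the tokens that are in neither stop_words nor extra_words."""
    # A set gives fast membership checks
    blocked = set(stop_words) | set(extra_words)
    return [token for token in tokens if token not in blocked]

# Hypothetical inputs for illustration
sample_tokens = ["the", "history", "of", "natural", "language", "processing", "edit"]
sample_stop_words = ["the", "of", "a", "and"]

print(remove_stop_words(sample_tokens, sample_stop_words, extra_words=["edit"]))
```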

In this step, we will first count the frequency of the tokens and then keep only the high-frequency ones. After filtering, we will visualize the most frequently used words on the natural language processing Wikipedia page. The visualization will show them ordered by frequency.

Let’s calculate the frequency of words using NLTK’s FreqDist class.

freq = nltk.FreqDist(clean_tokens)
for key, val in freq.items():
    print('Word: ' + str(key) + ', Quantity:' + str(val))
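
Printing every word can produce a very long list. FreqDist is a subclass of Python’s collections.Counter, so it also offers most_common, which returns only the top entries. The sketch below uses a plain Counter and a made-up token list so that it runs without NLTK; with our data, the equivalent call would be freq.most_common(2).

```python
from collections import Counter

# Hypothetical token list for illustration
sample_tokens = ["language", "processing", "language", "natural", "language", "processing"]

counts = Counter(sample_tokens)
# most_common(n) returns the n most frequent items as (word, count) pairs,
# ordered from most to least frequent
top_two = counts.most_common(2)
for word, count in top_two:
    print(f"Word: {word}, Quantity: {count}")
```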

Now, we will define a new dictionary and collect the tokens that have been used more than 10 times on the page. These keywords are more valuable than the others:

high_freq = dict()
for key, val in freq.items():
    if val > 10:
        high_freq[key] = val
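
The same filter can be written as a dictionary comprehension, which also makes it easy to try different thresholds. This is just an alternative sketch: the function name filter_by_count and the sample counts are hypothetical.

```python
def filter_by_count(counts, threshold):
    """Keep only the entries whose count is strictly greater than threshold."""
    return {word: count for word, count in counts.items() if count > threshold}

# Hypothetical counts for illustration
sample_counts = {"language": 25, "processing": 14, "history": 3, "rule-based": 10}
print(filter_by_count(sample_counts, 10))
```

With our data, filter_by_count(freq, 10) would give the same result as the loop above.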

Perfect! Now we have a new dictionary called high_freq. Let’s move to the final step and create a bar chart, which works well for comparing quantities such as word counts. The bars are sorted in descending order so that the word with the highest frequency comes first. Here is the visualization code:

# Note: the keys and values of the high_freq dictionary have to be converted to lists before passing them
fig = dict({
    "data": [{"type": "bar",
              "x": list(high_freq.keys()),
              "y": list(high_freq.values())}],
    "layout": {"title": {"text": "Most frequently used words in the page"},
               "xaxis": {"categoryorder": "total descending"}}
})
pio.show(fig)
[Figure: bar chart of the most frequently used words in the page]
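
In the code above, the sorting is done by Plotly through the categoryorder setting. If you prefer to sort the data yourself, for example to reuse the ordered list elsewhere, a sketch like the following builds the same kind of figure dictionary from already sorted values. The helper build_bar_figure and the sample counts are my own assumptions.

```python
def build_bar_figure(counts, title):
    """Build a Plotly-style figure dictionary with bars sorted by count."""
    # Sort (word, count) pairs from highest to lowest count
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {
        "data": [{
            "type": "bar",
            "x": [word for word, _ in ordered],
            "y": [count for _, count in ordered],
        }],
        "layout": {"title": {"text": title}},
    }

# Hypothetical counts for illustration
fig = build_bar_figure({"processing": 14, "language": 25, "speech": 11},
                       "Most frequently used words in the page")
print(fig["data"][0]["x"])
# To display it, pass the dictionary to Plotly: pio.show(fig)
```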

Congrats! You have created a program that detects the keywords in a page. Now, without reading the whole page, you can still get an idea of what it is about using natural language processing. I hope you enjoyed this hands-on guide and learned something new today. Working on hands-on programming projects like this one is the best way to sharpen your coding skills. Feel free to contact me if you have any questions while implementing the code.

Follow my blog and YouTube channel to stay inspired. Thank you!


Article information

Author: Delena Feil

Last Updated: 06/06/2023
