Preprocessing - Stop Word Removal

Stop words

The idea of Natural Language Processing is to do some form of analysis, or processing, where the machine can understand, at least to some level, what the text means, says, or implies.

This is an obviously massive challenge, but there are steps to doing it that anyone can follow. The main idea, however, is that computers simply do not, and never will, understand words directly. Neither do humans, surprisingly. In humans, memory is broken down into electrical signals in the brain, in the form of neural groups that fire in patterns. There is still a lot about the brain that remains unknown, but the more we break the human brain down into its basic elements, the simpler those basic elements turn out to be. It turns out computers store information in a very similar way, so if we want to mimic how humans read and understand text, we need a way to get as close to that as possible.

Generally, computers use numbers for everything. We see this directly in programming whenever we use binary signals: True or False, which translate to 1 or 0, and which originate from either the presence of an electrical signal (True, 1) or its absence (False, 0). So we need a way to convert words into values: numbers, or signal patterns. The process of converting data into something a computer can understand is referred to as "pre-processing."
One of the major forms of pre-processing is filtering out useless data. In natural language processing, useless words (data) are referred to as stop words.

Immediately, we can recognize that some words carry more meaning than others. We can also see that some words are just plain useless filler words, for example:

- "umm"
- "uhh"
- other such "fluff"

We would not want these words taking up space in our database, or taking up valuable processing time. As such, we call them "stop words" because they are useless, and we wish to do nothing with them.
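The idea can be sketched in plain Python before bringing in any library: keep a set of words we have decided are useless, and drop every token that appears in it. The tiny stop-word set below is purely illustrative; real lists contain well over a hundred entries.

```python
# A tiny, hand-picked stop-word set -- illustrative only,
# real lists (like NLTK's) are far longer.
stop_words = {"umm", "uhh", "the", "is", "a"}

tokens = ["umm", "the", "cat", "is", "on", "a", "mat", "uhh"]

# Keep only the tokens that are not stop words
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['cat', 'on', 'mat']
```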

You can do this easily by storing a list of words that you consider to be stop words. NLTK starts you off with a set of words that it considers to be stop words; you can access it via the NLTK corpus with the following import (you may first need to run nltk.download('stopwords')):

from nltk.corpus import stopwords

Here is the list of stop words for the English language:

set(stopwords.words('english'))
{'ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', 'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such', 'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him', 'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don', 'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them', 'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 'what', 'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 'just', 'where', 'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being', 'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than'}


Here is code you can use to remove the stop words from your text:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)

# One-line version using a list comprehension:
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# Equivalent loop version:
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)

After removing the stop words, you can see the output here:

['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']  # word_tokens
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']  # filtered_sentence
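One detail worth noticing in the output above: 'This' survives the filter, because the NLTK list contains only the lowercase 'this' and set membership checks are case-sensitive. A common fix is to lowercase each token before testing it, sketched here with a tiny illustrative stop-word set rather than the full NLTK list:

```python
# Illustrative subset only -- in practice use set(stopwords.words('english'))
stop_words = {"this", "is", "a", "the", "off"}

word_tokens = ["This", "is", "a", "sample", "sentence"]

# Lowercase each token before the membership test so "This" matches "this"
filtered = [w for w in word_tokens if w.lower() not in stop_words]
print(filtered)  # ['sample', 'sentence']
```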

You can do the same using the spaCy library:

import spacy  # Import the spaCy library

nlp = spacy.blank("en")  # Blank English pipeline (includes English defaults)

from spacy.lang.en.stop_words import STOP_WORDS  # The set of English stop words

print(STOP_WORDS)  # Print all the stop words
len(STOP_WORDS)    # Number of stop words

nlp.vocab["the"].is_stop  # Check whether a word is a stop word

doc = nlp("This is a sample sentence, showing off the stop words filtration.")

for word in doc:  # Print all the stop words in the given doc
    if word.is_stop == True:
        print(word)

for word in doc:  # Print the doc with all the stop words removed
    if word.is_stop == False:
        print(word)

# Remove all the stop words and collect the remaining tokens in a list
[word for word in doc if word.is_stop == False]
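If you need the filtered result back as a single string rather than a list of tokens, you can join the surviving token texts. This is shown below with plain strings; with spaCy tokens you would join word.text instead, since the list comprehension above yields Token objects, not strings.

```python
# Tokens left over after stop word removal (plain strings for illustration)
filtered_tokens = ["sample", "sentence", "showing", "stop", "words", "filtration"]

# Rebuild a single string from the surviving tokens
filtered_text = " ".join(filtered_tokens)
print(filtered_text)  # sample sentence showing stop words filtration
```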

How can we add our own stop word to the set, or remove one from it?

STOP_WORDS.add("Lol")     # Add a new stop word to the set, as you wish
STOP_WORDS.remove("Lol")  # Remove a stop word from the set
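The NLTK list can be customized the same way: stopwords.words('english') returns a plain Python list, so once you copy it into a set you can add or discard entries freely. Sketched below with an ordinary set standing in for the NLTK list; discard is often safer than remove because it does not raise an error if the word is absent.

```python
# An ordinary set standing in for set(stopwords.words('english'))
stop_words = {"the", "is", "a"}

stop_words.add("lol")      # add a custom stop word
stop_words.discard("lol")  # remove it again (no error even if absent)
print(sorted(stop_words))  # ['a', 'is', 'the']
```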

You can refer to my GitHub repository (Python file) here.
