Detect fake news headlines with Python

Let's build a simple Python script that classifies news headlines as fake or real!

First things first, import these libraries:

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

For this machine learning project, I'll be using this dataset for training our model to detect real or fake news headlines.

Now, to start working with the data, load the dataset and define the x and y variables:

data = pd.read_csv("news.csv")

x = np.array(data["title"])
y = np.array(data["label"])

x holds the news headlines that our model will be trained and tested on.
y holds the label (Fake or Real) that we are going to predict.

Next, add these lines of code to your script:

cv = CountVectorizer()
x = cv.fit_transform(x)

"WTH?" you might ask.
To put it simply:

"CountVectorizer()" creates a vectorizer that counts how many times each word occurs in each headline. A model like this has no sense of what sounds plausible, so it can't judge a headline by its content the way you would. Its best bet at telling real and fake headlines apart is their tone and choice of words, and you'd probably agree that this is where the main difference lies anyway.

"fit_transform()" then builds a vocabulary from every word found in the headlines (real and fake alike) and transforms x from plain text into a matrix of word counts, with one row per headline and one column per word. These counts are what the model uses to differentiate real from fake headlines by their word choice and tone.
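To see what the vectorizer produces, here is a minimal sketch on two made-up headlines. They are not from the dataset, and the names "toy_headlines", "toy_cv" and "toy_counts" are only there for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up headlines, for illustration only (not from news.csv).
toy_headlines = ["Aliens land in Paris", "Paris mayor wins election"]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_headlines)  # sparse matrix of word counts

# vocabulary_ maps each lowercased word to its column index;
# sorting by that index gives the column order of the matrix.
vocabulary = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(vocabulary)
print(toy_counts.toarray())  # one row per headline, one column per word
```

Both rows end up with a 1 in the "paris" column, since that word appears in both headlines, while every other word is counted in only one row.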

To make, train and test our model, add these lines of code to your script:

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=42)
model = MultinomialNB()
model.fit(xtrain, ytrain)

Let me explain...

First of all, we split our dataset into train (80%) and test (20%) sets, and set "random_state" to 42 to make sure we get the same train and test sets every time we run the script (the number 42 has no special meaning, you can use any integer).

Next, we define our model using "MultinomialNB()", a naive Bayes classifier that is well suited to classifying text based on word counts.

Finally, we fit our model with the "xtrain" and "ytrain" sets.
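If you'd like to see how word counts steer the prediction, here is a rough sketch on four made-up headlines. The headlines and the "FAKE"/"REAL" label strings are assumptions for illustration and may not match the labels in the real dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up headlines and labels, for illustration only.
toy_headlines = [
    "Shocking secret cure doctors hate",
    "Shocking alien secret revealed",
    "Parliament passes budget bill",
    "Senate passes climate bill",
]
toy_labels = ["FAKE", "FAKE", "REAL", "REAL"]

toy_cv = CountVectorizer()
toy_model = MultinomialNB()
toy_model.fit(toy_cv.fit_transform(toy_headlines), toy_labels)

# "shocking" and "secret" were only seen in fake headlines,
# "bill" only in real ones, so the fake class should win here.
new_counts = toy_cv.transform(["Shocking secret bill"])
prediction = toy_model.predict(new_counts)
probabilities = toy_model.predict_proba(new_counts)

print(toy_model.classes_)  # order of the probability columns
print(prediction)
print(probabilities)
```

Two of the three words in the new headline only ever appeared in the fake examples, which is enough to tip the model towards the fake class. That is the whole trick: no understanding of the content, just word statistics.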

Start detecting real or fake headlines!

Now, to predict whether a news headline is real or not, add these lines of code to your script:

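news_headline = "Atlantis discovered under the Atlantic Ocean!"
# use transform() (not fit_transform()) so the vocabulary learned earlier is reused
# a new variable name keeps the "data" DataFrame from being overwritten
features = cv.transform([news_headline])
print(news_headline)
print(model.predict(features))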

Now if you run your script, you should see that it predicts this news headline is fake.

Now let's take a random news headline from BBC News and see if our model classifies it as real:

news_headline = "Kathy Hochul: Who is New York's first female governor?"

Now of course, this model is not perfect.
If you add print(model.score(xtest, ytest)) to your script, you'll see an accuracy score of ~80% on the test set. But when I tested 40 news headlines from last week, I only got 50% to 60% accuracy, even though the dataset we are using to train our model is a whopping 30MB worth of plain text.
That's because news headlines, their vocabulary and their topics change all the time.
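One way to see why newer headlines are harder: the vectorizer silently ignores any word it did not see when it was fitted. Here is a small sketch with made-up headlines (the names "old_headlines" and "new_headline" are hypothetical, and the simple split on spaces is only an approximation of the vectorizer's own tokenizer):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up "training" headlines, for illustration only.
old_headlines = ["Senate passes budget bill", "Shocking secret revealed"]
toy_cv = CountVectorizer()
toy_cv.fit(old_headlines)

# A made-up newer headline that shares only one word with the old ones.
new_headline = "Chatbot regulation bill stalls"
new_counts = toy_cv.transform([new_headline])

# Rough check of which words survive (lowercase + split on spaces).
known_words = [w for w in new_headline.lower().split() if w in toy_cv.vocabulary_]
print(known_words)       # the words the model can actually use
print(new_counts.sum())  # total count of known words in the new headline
```

Only "bill" makes it through, so the model would have to decide based on a single word. The more the vocabulary of the news drifts away from the training data, the less the model has to work with.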