Scrape Naver Organic Results with Python
What is Naver Search
I covered what Naver Search is in my first blog post about scraping Naver News results, so you can find that information there.
Intro
This blog post continues the Naver web scraping series. Here you'll see how to scrape website ranking, title, link, displayed link, and snippet from Naver Organic Results with Python using the beautifulsoup, requests, and lxml libraries.
Note: This blog post shows how to extract only the data listed in the "What will be scraped" section.
Prerequisites and Imports
pip install requests
pip install lxml
pip install beautifulsoup4
- Basic knowledge of Python.
- Basic familiarity with the packages mentioned above.
- Basic understanding of CSS selectors, because you'll mostly see the select() and select_one() beautifulsoup methods, which accept CSS selectors.
I wrote a dedicated blog post about web scraping with CSS selectors that covers what they are, their pros and cons, and why they matter from a web-scraping perspective.
Imports
import requests, lxml
from bs4 import BeautifulSoup
What will be scraped
Process
If you don't need an explanation, jump to the code section.
The process takes three steps:
- Save HTML locally to test everything before making a lot of direct requests.
- Pick CSS selectors for all the needed data.
- Extract the data.
Saving HTML locally helps prevent your IP address from being blocked or banned, especially when many requests need to be made to the same website while testing the code.
A normal user won't make 100+ requests in a very short period of time, and won't repeat the same pattern over and over as scripts do, so websites might flag this behavior as unusual and block the IP address for some period (the block might be mentioned in the response: requests.get("URL").text) or ban it permanently.
import requests

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "query": "bruce lee",
    "where": "web"  # there's also a "nexearch" value that will produce different results
}


def save_naver_organic_results():
    html = requests.get("https://search.naver.com/search.naver", params=params, headers=headers).text

    # replace every space with an underscore (_) so "bruce lee" becomes "bruce_lee"
    query = params['query'].replace(" ", "_")

    with open(f"{query}_naver_organic_results.html", mode="w") as file:
        file.write(html)
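As a rough sketch of the same idea, the save step could be wrapped so that a request is made only when no local copy exists yet. The names load_or_fetch_html() and fetch are hypothetical and not part of the original code; fetch stands for any function that returns the page HTML as a string, for example a small wrapper around requests.get(...).text. Writing and reading as UTF-8 is my own assumption to keep Korean text intact.

```python
from pathlib import Path


def load_or_fetch_html(path, fetch):
    """Return HTML from a local file, calling fetch() only if the file is missing.

    `fetch` is a hypothetical zero-argument callable that returns HTML as a string.
    """
    file = Path(path)
    if file.exists():
        # reuse the saved copy, so no new request is made
        return file.read_text(encoding="utf-8")

    # first run: make the request once and keep the result on disk
    html = fetch()
    file.write_text(html, encoding="utf-8")
    return html
```

With this in place, repeated test runs read the saved file, and only the first run touches the website.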
import requests
Add user-agent and query parameters
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
# query parameters
params = {
    "query": "bruce lee",
    "where": "web"
}
I tend to pass query parameters to requests.get() via params=params instead of leaving them in the URL because I find it more readable. For example, compare two ways of building the exact same URL:
params = {
    "where": "web",
    "sm": "top_hty",
    "fbm": "1",
    "ie": "utf8",
    "query": "bruce lee"  # requests encodes the space as "+"; a literal "+" would become "%2B"
}
requests.get("https://search.naver.com/search.naver", params=params)
# VS
requests.get("https://search.naver.com/search.naver?where=web&sm=top_hty&fbm=1&ie=utf8&query=bruce+lee")  # Press F.
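A quick way to check how a params dictionary ends up in the URL is the standard library's urlencode(). As far as I know, requests builds the query string the same way, but treat that as an assumption and compare against response.url if in doubt. The sketch below shows that a space in a value becomes +, while a literal + is percent-encoded.

```python
from urllib.parse import urlencode

params = {
    "where": "web",
    "sm": "top_hty",
    "fbm": "1",
    "ie": "utf8",
    "query": "bruce lee",
}

# a space in a value is encoded as "+"
query_string = urlencode(params)
print(f"https://search.naver.com/search.naver?{query_string}")

# a literal "+" in a value is percent-encoded as "%2B" instead
print(urlencode({"query": "bruce+lee"}))
```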
As for user-agent, it's needed to make the request look like a "real" user visit; otherwise the request might be denied. You can read more about it in my other blog post about how to reduce the chance of being blocked while web scraping search engines.
Selecting the container (the CSS selector that wraps all needed data), title, link, displayed link, and snippet.
The GIF above translates to this code snippet:
for result in soup.select(".total_wrap"):
    title = result.select_one(".total_tit").text.strip()
    link = result.select_one(".total_tit .link_tit")["href"]
    displayed_link = result.select_one(".total_source").text.strip()
    snippet = result.select_one(".dsc_txt").text
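One thing to keep in mind: select_one() returns None when nothing matches the selector, so calling .text on the result raises an AttributeError if a result block lacks, say, a snippet. Below is a minimal sketch of a guard, assuming you'd rather get None than a crash. safe_text() is a hypothetical helper, not part of beautifulsoup.

```python
def safe_text(node, selector, default=None):
    """Return the stripped text of the first match for `selector`, or `default`.

    `node` is expected to be a BeautifulSoup object or Tag
    (anything that exposes select_one()).
    """
    element = node.select_one(selector)
    if element is None:
        # nothing matched, so there is no .text to read
        return default
    return element.text.strip()
```

Inside the loop it would be used as snippet = safe_text(result, ".dsc_txt").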
import lxml, json
from bs4 import BeautifulSoup


def extract_local_html_naver_organic_results():
    with open("bruce_lee_naver_organic_results.html", mode="r") as html_file:
        html = html_file.read()
    soup = BeautifulSoup(html, "lxml")

    data = []

    for index, result in enumerate(soup.select(".total_wrap")):
        title = result.select_one(".total_tit").text.strip()
        link = result.select_one(".total_tit .link_tit")["href"]
        displayed_link = result.select_one(".total_source").text.strip()
        snippet = result.select_one(".dsc_txt").text

        data.append({
            "position": index + 1,  # starts from 1, not from 0
            "title": title,
            "link": link,
            "displayed_link": displayed_link,
            "snippet": snippet
        })

    print(json.dumps(data, indent=2, ensure_ascii=False))
import lxml, json
from bs4 import BeautifulSoup
Open the saved HTML file, read it, pass it to the BeautifulSoup() object, and assign lxml as the HTML parser
with open("bruce_lee_naver_organic_results.html", mode="r") as html_file:
    html = html_file.read()
soup = BeautifulSoup(html, "lxml")
data = []
Iterate and append each result as a dictionary to a temporary list()
Since we also need an index (rank position), we can use the enumerate() function, which adds a counter to an iterable and returns it.
Example:
grocery = ["bread", "milk", "butter"]  # iterable
for index, item in enumerate(grocery):
    print(f"{index} {item}")
'''
0 bread
1 milk
2 butter
'''
Actual code:
# in our case the iterable is the list returned by soup.select()
for index, result in enumerate(soup.select(".total_wrap")):
    title = result.select_one(".total_tit").text.strip()
    link = result.select_one(".total_tit .link_tit")["href"]
    displayed_link = result.select_one(".total_source").text.strip()
    snippet = result.select_one(".dsc_txt").text

    data.append({
        "position": index + 1,  # starts from 1, not from 0
        "title": title,
        "link": link,
        "displayed_link": displayed_link,
        "snippet": snippet
    })
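As a side note, enumerate() also takes an optional start argument, so the counter can begin at 1 directly instead of adding 1 to the index. A small sketch with the same grocery list (the "position" and "item" keys are only for illustration):

```python
grocery = ["bread", "milk", "butter"]

# start=1 makes the counter begin at 1 instead of 0
ranked = [
    {"position": position, "item": item}
    for position, item in enumerate(grocery, start=1)
]

print(ranked)
```

Either form works; index + 1 and start=1 produce the same positions.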
Full Code
Combining everything together, we get four functions:
- The first function saves HTML locally.
- The second function opens local HTML and calls a parser function.
- The third function makes an actual request and calls a parser function.
- The fourth function is a parser that's being called by the second and third functions.
Note: the first and second functions can be skipped if you don't want to save HTML locally, but keep in mind the possible consequences mentioned above.
import requests
import lxml, json
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "query": "bruce lee",  # search query
    "where": "web"         # nexearch will produce different results
}


# function that saves HTML locally
def save_naver_organic_results():
    html = requests.get("https://search.naver.com/search.naver", params=params, headers=headers).text

    # replace every space with an underscore so "bruce lee" becomes "bruce_lee"
    query = params['query'].replace(" ", "_")

    # utf-8 keeps Korean text intact regardless of the platform's default encoding
    with open(f"{query}_naver_organic_results.html", mode="w", encoding="utf-8") as file:
        file.write(html)


# function that opens local HTML and calls a parser function
def extract_naver_organic_results_from_html():
    with open("bruce_lee_naver_organic_results.html", mode="r", encoding="utf-8") as html_file:
        html = html_file.read()

    # calls naver_organic_results_parser() function to parse the page
    data = naver_organic_results_parser(html)
    print(json.dumps(data, indent=2, ensure_ascii=False))
# function that makes an actual request and calls a parser function
def extract_naver_organic_results_from_url():
    # .text passes the HTML as a string, the same type the local-file function passes
    html = requests.get("https://search.naver.com/search.naver", params=params, headers=headers).text

    # calls naver_organic_results_parser() function to parse the page
    data = naver_organic_results_parser(html)
    print(json.dumps(data, indent=2, ensure_ascii=False))


# parser that's being called by the second and third functions
def naver_organic_results_parser(html):
    # html is a string in both cases, so it goes to BeautifulSoup directly
    soup = BeautifulSoup(html, "lxml")

    data = []

    for index, result in enumerate(soup.select(".total_wrap")):
        title = result.select_one(".total_tit").text.strip()
        link = result.select_one(".total_tit .link_tit")["href"]
        displayed_link = result.select_one(".total_source").text.strip()
        snippet = result.select_one(".dsc_txt").text

        data.append({
            "position": index + 1,  # starts from 1, not from 0
            "title": title,
            "link": link,
            "displayed_link": displayed_link,
            "snippet": snippet
        })

    return data
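The ensure_ascii=False argument passed to json.dumps() matters for Naver, since results are often in Korean. Here is a short sketch of the difference, using a made-up one-item list rather than real scraped data:

```python
import json

# made-up sample, not real scraped output
data = [{"position": 1, "title": "이소룡"}]

# default: non-ASCII characters are written as \uXXXX escape sequences
escaped = json.dumps(data)

# ensure_ascii=False keeps the characters as they are
readable = json.dumps(data, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode back to the same data with json.loads(); the second one is just easier to read in a terminal.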
Alternatively, you can achieve the same results by using SerpApi. SerpApi is a paid API with a free plan.
The difference is that there's no need to create the parser from scratch, pick the correct CSS selectors, and get pissed off when certain selectors don't work as expected. There's also no need to maintain the parser over time: if something in the HTML changes, a handmade script can blow up with an error on the next run.
Additionally, there's no need to bypass blocks from Google (or other search engines) or figure out how to scale request volume, because that's already handled under the hood for end users with appropriate plans. Have a try in the playground.
Install SerpApi library
pip install google-search-results
Example code to integrate:
from serpapi import GoogleSearch
import os, json


def serpapi_get_naver_organic_results():
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "naver",     # search engine (Google, Bing, DuckDuckGo..)
        "query": "Bruce Lee",  # search query
        "where": "web"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    data = []

    for result in results["organic_results"]:
        data.append({
            "position": result["position"],
            "title": result["title"],
            "link": result["link"],
            "displayed_link": result["displayed_link"],
            "snippet": result["snippet"]
        })

    print(json.dumps(data, indent=2, ensure_ascii=False))
Import serpapi, os, json libraries
from serpapi import GoogleSearch
import os, json
params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "naver",     # search engine (Google, Bing, DuckDuckGo..)
    "query": "Bruce Lee",  # search query
    "where": "web"         # filter to extract data from organic results
}
This is happening under the hood so you don't have to think about these two lines of code.
search = GoogleSearch(params)  # data extraction
results = search.get_dict()    # structured JSON that is used later
data = []
for result in results["organic_results"]:
    data.append({
        "position": result["position"],
        "title": result["title"],
        "link": result["link"],
        "displayed_link": result["displayed_link"],
        "snippet": result["snippet"]
    })
print(json.dumps(data, indent=2, ensure_ascii=False))
# ----------------
# part of the output
'''
[
  {
    "position": 1,
    "title": "Bruce Lee",
    "link": "https://brucelee.com/",
    "displayed_link": "brucelee.com",
    "snippet": "New Podcast Episode: #402 Flowing with Dustin Nguyen Watch + Listen to Episode “Your inspiration continues to guide us toward our personal liberation.” - Bruce Lee - More Podcast Episodes HBO Announces Order For Season 3 of Warrior! WARRIOR Seasons 1 & 2 Streaming Now on HBO & HBO Max “Warrior is still the best show you’re"
  }
  # other results..
]
'''
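Not every result is guaranteed to carry every key (I'm assuming here that a snippet, for instance, can be missing), in which case result["snippet"] raises a KeyError. Below is a hedged sketch that uses dict.get() instead. pick_organic_fields() is a hypothetical helper, and the sample dictionary is made up to mirror the output shape shown above, not taken from a real response.

```python
FIELDS = ("position", "title", "link", "displayed_link", "snippet")


def pick_organic_fields(results):
    """Collect the needed fields, using None for any key that is absent."""
    data = []
    # .get() with a default avoids a KeyError if "organic_results" is missing
    for result in results.get("organic_results", []):
        data.append({field: result.get(field) for field in FIELDS})
    return data


# made-up sample shaped like the output above, with the snippet left out
sample = {
    "organic_results": [
        {
            "position": 1,
            "title": "Bruce Lee",
            "link": "https://brucelee.com/",
            "displayed_link": "brucelee.com",
        }
    ]
}

print(pick_organic_fields(sample))
```

In the example above, results from search.get_dict() would take the place of sample.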
If you need more information about the plans, SerpApi team member Justin O'Hara explained them in his breakdown of SerpApi's subscriptions blog post (the information is the same, except you don't have to log in to the SerpApi website).
Links
Outro
If you have anything to share, any questions, suggestions, or something that isn't working correctly, feel free to drop a comment in the comment section or reach out via Twitter at @dimitryzub or @serp_api.
Yours,
Dimitry, and the rest of SerpApi Team.