Networked Programs in Python - But With a Program to Convert Web Pages to Markdown
The protocol that powers the web is known as HTTP (Hypertext Transfer Protocol). Over a network connection, two programs (typically a client such as a browser and a web server) can send and receive data.
A protocol is a set of rules and guidelines for communicating data. Rules are defined for each step and process during communication between two or more computers. Networks have to follow these rules to successfully transmit data.
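To make the idea of "rules" concrete, here is a minimal sketch of what an HTTP/1.0 GET request looks like as bytes on the wire. The helper name `build_get_request` is hypothetical and exists only for illustration; the host is the one used elsewhere in this post.

```python
def build_get_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.0 GET request (hypothetical helper, for illustration)."""
    lines = [
        f"GET {path} HTTP/1.0",  # request line: method, path, protocol version
        f"Host: {host}",         # header naming the server we want to talk to
        "",                      # an empty line marks the end of the headers
        "",
    ]
    # HTTP expects every line to end with CRLF ("\r\n")
    return "\r\n".join(lines).encode("ascii")


request = build_get_request("data.pr4e.org")
print(request)
```

If either side skips one of these rules, for example the blank line that ends the headers, the other side cannot tell where the request stops.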
Python provides a library called socket for opening connections and exchanging data over them.
import socket

# create a TCP socket and connect to the web server on port 80
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(("data.pr4e.org", 80))

# an HTTP request is sent as bytes, so encode the string first
cmd = "GET http://data.pr4e.org/ HTTP/1.0\r\n\r\n".encode()
mysock.sendall(cmd)

# loop through the received data until recv() returns nothing,
# which indicates that no more data is being sent
while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(), end='')

mysock.close()
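The socket loop above prints the response headers and the body together. As a rough sketch of how the two could be separated, the hypothetical helper `split_response` below splits a raw response at the first blank line. The sample bytes are made up for illustration and are not output from a real server.

```python
def split_response(raw: bytes):
    """Split a raw HTTP response into (status_line, headers, body)."""
    # the headers end at the first empty line (CRLF CRLF)
    head, _, body = raw.partition(b"\r\n\r\n")
    header_lines = head.decode("iso-8859-1").split("\r\n")
    status_line = header_lines[0]
    headers = {}
    for line in header_lines[1:]:
        name, _, value = line.partition(":")
        # header names are case-insensitive, so store them in lowercase
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# made-up sample response, not captured from a real server
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
)
status, headers, body = split_response(sample)
print(status)
print(headers["content-type"])
print(body.decode())
```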
Python also provides a library called urllib that handles HTTP for you and abstracts away the whole header part of HTTP.
import urllib.request, urllib.parse, urllib.error

# download the image as bytes
data = urllib.request.urlopen("http://data.pr4e.org/cover3.jpg").read()

# write the bytes to a local file in binary mode
fhand = open("image.jpg", "wb")
fhand.write(data)
fhand.close()
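Calling `read()` with no argument loads the whole file into memory at once. For large files, one possible alternative is to copy the data in fixed-size chunks. The sketch below uses a hypothetical helper `copy_in_chunks` and in-memory `io.BytesIO` objects as stand-ins, so it runs without a network connection; the chunk size is an arbitrary choice.

```python
import io


def copy_in_chunks(source, dest, chunk_size=100_000):
    """Copy from source to dest in chunks and return the number of bytes copied."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if len(chunk) < 1:  # an empty read means there is no more data
            break
        dest.write(chunk)
        total += len(chunk)
    return total


# in-memory stand-ins for a network response and an output file
source = io.BytesIO(b"x" * 250_000)
dest = io.BytesIO()
copied = copy_in_chunks(source, dest)
print(copied, "bytes copied")
```

With urllib, the object returned by `urllib.request.urlopen(...)` could be passed as `source` and a file opened in `"wb"` mode as `dest`, since both support `read(size)` and `write(data)` respectively.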
urllib is particularly useful when you want to scrape a website and use the information.
- Install markdownify, which converts HTML to Markdown
- Install BeautifulSoup, which parses HTML
- The code below requests a URL entered by the user, parses the HTML, converts the HTML to Markdown, and saves the Markdown in a file named after the document title
from markdownify import markdownify as md
import urllib.request, urllib.parse, urllib.error
import ssl
from bs4 import BeautifulSoup

url = input("Enter URL: ")

# ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

try:
    # pass the context so the SSL settings above actually take effect
    html = urllib.request.urlopen(url, context=ctx).read()
except Exception:
    print("Error opening URL")
    exit()

soup = BeautifulSoup(html, "html.parser")

# remove all those lengthy class, id, name and inline style attributes
for tag in soup():
    for attribute in ["class", "id", "name", "style"]:
        del tag[attribute]

# remove the tags in the list below
for tag in soup(["style", "script", "sidebar", "aside"]):
    tag.decompose()

prettified_html = soup.prettify()
print(prettified_html)
markdownified_html = md(prettified_html)

# save the Markdown in a file named after the page title
fhand = open("{}.md".format(soup.title.string), "w")
fhand.write(markdownified_html)
fhand.close()
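One weak spot in the script is the output file name: a page may have no title at all, and a title can contain characters such as "/" that are not valid in file names. The sketch below shows one possible way to guard against that. The helper `safe_filename` and its default value are hypothetical and not part of the original script.

```python
import re


def safe_filename(title, default="untitled"):
    """Turn a page title into a usable file name (hypothetical helper)."""
    if title is None:
        return default
    # collapse whitespace and characters that are invalid in file names
    # on common operating systems into a single space
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', " ", title).strip()
    return cleaned or default


print(safe_filename("React / A JavaScript library"))
print(safe_filename(None))
```

In the script, this could be used as `safe_filename(soup.title.string if soup.title else None)` when building the name passed to `open()`.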
Figure: Markdown preview of the page at the React framework URL.
The code we wrote here is by no means perfect. It is just a quick example of the things you can do with HTTP connections in Python. We could work on it further to add extra features and integrations.
Thanks for reading through to the end! I look forward to your comments.