Scrape Google Inline Images with Python

Contents: intro, imports, what will be scraped, process, code, links, outro.
Intro
This blog post is a continuation of the Google web scraping series. Here you'll see how to scrape Inline Images using Python with the beautifulsoup, requests, lxml, re, base64, BytesIO, and PIL libraries. An alternative API solution is also shown.
Note: This blog post assumes that you're familiar with the beautifulsoup and requests libraries and have a basic understanding of regular expressions.
Imports
import requests, lxml, re, base64
from bs4 import BeautifulSoup 
from io import BytesIO # for decoding base64 image
from PIL import Image # for saving decoded image
from serpapi import GoogleSearch # alternative API solution
What will be scraped
Process
Selecting the container, the link, and the page where the photo is being used.
Extracting thumbnail
To extract a thumbnail, look at the <img> tag whose id is dimg_XX (where XX is some number).
If you open the page source (Ctrl + U) and search for dimg_36 (the digits depend on the HTML), you'll find two occurrences. One of them sits inside a <script> tag, and that's the one we need.
To extract the thumbnails, we use a regular expression to pull them out of the <script> tags. Parsing the src attribute of the <img> tag instead returns a data URI, that is, a base64-encoded picture.
More about this topic can be found on MDN (Mozilla Developer Network).
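To make the data URI format concrete, here is a minimal sketch that splits one into its media type and payload and then decodes the payload. The helper name parse_data_uri and the sample payload are illustrative assumptions and are not part of the scraper further down.

```python
import base64


def parse_data_uri(uri: str) -> tuple:
    """Split a base64 data URI into (media type, decoded bytes).

    Hypothetical helper for illustration. It handles only the
    'data:<media type>;base64,<payload>' form.
    """
    header, _, payload = uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    media_type = header[len("data:"):-len(";base64")]
    return media_type, base64.b64decode(payload)


# Build a sample URI from known bytes so the round trip is easy to check.
sample_bytes = b"hello"
sample_uri = "data:text/plain;base64," + base64.b64encode(sample_bytes).decode("ascii")

media_type, decoded = parse_data_uri(sample_uri)
print(media_type, decoded)
```

The same split applies to a data:image/jpeg;base64,... value: everything after the comma is the encoded picture.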
The regular expression is extremely simple:
s='data:image/jpeg;base64,(.*?)';
Regular expression explanation:
  • match the literal text s='data:image/jpeg;base64,
  • a non-greedy capture group (.*?) grabs everything that follows, up to the closing '; characters.
  • only the capture group is extracted, without the surrounding parts.
  • the expression can be tried out on regex101: https://regex101.com/r/L3IZXe/4
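As a quick check of the pattern, the sketch below runs it against a made-up <script> fragment. The fragment, the variable names, and the assumption that the base64 padding may arrive escaped as \x3d inside JavaScript are all illustrative; real Google markup may differ.

```python
import base64
import re

# Hypothetical stand-in for an image payload; not a real JPEG.
raw = b"fake-jpeg!"
encoded = base64.b64encode(raw).decode("ascii")  # ends with '==' padding

# Assumption: '=' may be escaped as \x3d inside the script source.
escaped = encoded.replace("=", r"\x3d")

# Made-up script fragment shaped like the one described above.
script_text = (
    f"(function(){{var s='data:image/jpeg;base64,{escaped}';"
    "var ii=['dimg_36'];})();"
)

pattern = re.compile(r"s='data:image/jpeg;base64,(.*?)';")
matches = pattern.findall(script_text)

# Undo JavaScript-style escapes before handing the text to base64.
unescaped = [bytes(m, "ascii").decode("unicode_escape") for m in matches]
decoded = [base64.b64decode(u) for u in unescaped]
print(decoded)
```

Because the group is non-greedy, each match stops at the first '; it meets, so several payloads in one script block come back as separate list items.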
After that, the decoded base64 string can be saved using the PIL module. More details can be found in this Stack Overflow answer: https://stackoverflow.com/a/6966225/15164646
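If Pillow isn't needed for further processing, the decoded bytes can also be written to disk directly, since a data:image/jpeg payload is already a complete JPEG file. The sketch below is an alternative to the PIL route, not the approach used in this post. The function name save_jpeg_bytes, the signature check, and the temporary directory are illustrative choices.

```python
import base64
import binascii
import tempfile
from pathlib import Path

JPEG_SIGNATURE = b"\xff\xd8\xff"  # JPEG files start with these bytes


def save_jpeg_bytes(b64_payload: str, target: Path) -> bool:
    """Decode a base64 payload and write it to target if it looks like a JPEG.

    Returns True when the file was written, and False when the payload
    was not valid base64 or did not start with the JPEG signature.
    """
    try:
        data = base64.b64decode(b64_payload, validate=True)
    except (binascii.Error, ValueError):
        return False
    if not data.startswith(JPEG_SIGNATURE):
        return False
    target.write_bytes(data)
    return True


# Stand-in payload: JPEG signature plus filler, not a viewable picture.
fake_jpeg = base64.b64encode(JPEG_SIGNATURE + b"\x00" * 8).decode("ascii")

with tempfile.TemporaryDirectory() as tmp:
    ok = save_jpeg_bytes(fake_jpeg, Path(tmp) / "inline_image_0.jpg")
    rejected = save_jpeg_bytes("not base64!", Path(tmp) / "inline_image_1.jpg")
    print(ok, rejected)
```

Going through PIL, as the code in this post does, remains the better fit when the image has to be inspected, resized, or converted before saving.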
Code
    import requests, lxml, re, base64  # lxml is the parser backend used by BeautifulSoup below
    from bs4 import BeautifulSoup
    from PIL import Image
    from io import BytesIO
    
    headers = {
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    }
    
    params = {
        "q": "minecraft shaders photo",
        "sourceid": "chrome",
    }
    
    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')
    
    for result in soup.select('div[jsname=dTDiAc]'):
        link = f"https://www.google.com{result.a['href']}"
        being_used_on = result['data-lpage']
        print(f'Link: {link}\nBeing used on: {being_used_on}\n')
    
    # finding all script (<script>) tags
    script_img_tags = soup.find_all('script')
    
    # https://regex101.com/r/L3IZXe/4
    img_matches = re.findall(r"s='data:image/jpeg;base64,(.*?)';", str(script_img_tags))
    
    for index, image in enumerate(img_matches):
        try:
            # https://stackoverflow.com/a/6966225/15164646
            final_image = Image.open(BytesIO(base64.b64decode(image)))
    
            # https://www.educative.io/edpresso/absolute-vs-relative-path
            # https://stackoverflow.com/a/31434485/15164646
            final_image.save(f'your/absolute_or_relative/path/inline_image_{index}.jpg', 'JPEG')
        except (ValueError, OSError) as error:
            # skip payloads that fail to decode or that PIL cannot read or save
            print(f'Skipping image {index}: {error}')
    
    ------------------
    # part of the output:
    '''
    Link: https://www.google.com/search?q=minecraft+shaders+photo&tbm=isch&source=iu&ictx=1&fir=1DCWjzl0od3bFM%252Cc4Qd0ZKVFnHrsM%252C_&vet=1&usg=AI4_-kTAvknTGktfEC1K8ciH7Ot7GsAFkA&sa=X&ved=2ahUKEwiAiaDV6_HxAhVBeawKHfbtDCIQ9QF6BAgWEAE#imgrc=1DCWjzl0od3bFM
    Being used on: https://pixabay.com/illustrations/minecraft-shader-minecraft-wallpaper-1970876/
    
    Link: https://www.google.com/search?q=minecraft+shaders+photo&tbm=isch&source=iu&ictx=1&fir=bwVoAE4HTl8GXM%252Cz3y5GvasoN8hFM%252C_&vet=1&usg=AI4_-kRfUHjrz711om99elb_i3GwJuTBnw&sa=X&ved=2ahUKEwiAiaDV6_HxAhVBeawKHfbtDCIQ9QF6BAgVEAE#imgrc=bwVoAE4HTl8GXM
    Being used on: https://www.pcgamesn.com/minecraft/minecraft-shaders-best-graphics-mods
    ...
    '''
The alternative API solution mentioned in the intro is SerpApi, a paid API with a free trial of 5,000 searches.
The biggest difference is that you don't have to work out where to parse each element from in order to get a properly sized image, since that's already done for you. Beyond that, there's no need to maintain the parser or to find workarounds when your requests get blocked.
    import json
    from serpapi import GoogleSearch
    
    params = {
      "api_key": "YOUR_API_KEY",
      "engine": "google",
      "q": "minecraft shaders photo",
      "hl": "en",
    }
    
    search = GoogleSearch(params)
    results = search.get_dict()
    
    print(json.dumps(results['inline_images'], indent=2, ensure_ascii=False))
    
    ------------------------
    '''
    [
      {
        "link": "/search?q=minecraft+shaders+photo&hl=en&tbm=isch&source=iu&ictx=1&fir=bwVoAE4HTl8GXM%252Cz3y5GvasoN8hFM%252C_&vet=1&usg=AI4_-kRfUHjrz711om99elb_i3GwJuTBnw&sa=X&ved=2ahUKEwit6Jq38PHxAhUkSTABHfJyCn8Q9QF6BAgWEAE#imgrc=bwVoAE4HTl8GXM",
        "thumbnail": "https://serpapi.com/searches/60f6e03895bf92b91f6fb3d6/images/9cce8031b6aba2675322296c8d247839d434db3be723a5fec2f933d8b4bd4d1e.jpeg"
      }
    ]
    ...
    '''
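The link field in the response above is relative. Here is a minimal sketch of post-processing such a response. It uses a hand-written dictionary shaped like the output above rather than a live API call, resolves each link against https://www.google.com in the same way the scraper earlier in this post builds its links, and falls back to an empty list if inline_images is missing. The function name and the sample values are assumptions.

```python
from urllib.parse import parse_qs, urljoin, urlsplit

GOOGLE_BASE = "https://www.google.com"


def absolute_inline_images(results: dict) -> list:
    """Return inline images with absolute links and the search query pulled out."""
    images = []
    # .get() avoids a KeyError when the response has no inline images.
    for item in results.get("inline_images", []):
        link = urljoin(GOOGLE_BASE, item.get("link", ""))
        query = parse_qs(urlsplit(link).query)
        images.append({
            "link": link,
            "query": query.get("q", [""])[0],
            "thumbnail": item.get("thumbnail"),
        })
    return images


# Hand-written sample shaped like the response above (shortened link, made-up thumbnail URL).
sample = {
    "inline_images": [
        {
            "link": "/search?q=minecraft+shaders+photo&hl=en&tbm=isch",
            "thumbnail": "https://example.com/thumbnail.jpeg",
        }
    ]
}

images = absolute_inline_images(sample)
print(images[0]["link"])
print(images[0]["query"])
```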
Links
Outro
If you have any questions, if something isn't working correctly, or if you'd like to see something else covered, feel free to leave a comment in the comment section or reach out on Twitter at @serp_api.
Yours,
Dimitry, and the rest of the SerpApi Team.
