In this day and age, you can’t find a modern SEO guide that doesn’t recommend schema markup as part of your overall search strategy, and for good reason. Yet schema markup, which has been supported by all the major search engines (Google, Bing, Yahoo, DuckDuckGo, etc.) for the better part of seven years now, has abysmal adoption rates. Perceptions of it run the gamut from intimidating to superfluous, but in my conversations with the SEO community, it is most often avoided simply because of the time it takes to implement. Like writing meta titles and descriptions, schema markup is a highly contextual, backend aspect of SEO that isn’t an inherent ranking factor, so the returns of adopting it are outweighed by the time it takes to implement across hundreds or thousands of URLs. But what if there were a way to automate the process?
In this instance, I’ll be using VideoObject as the main focus for automating schema in the SEO process, though this framework can be reworked to automate the creation of schema for products, articles, or other properties. As in my other articles, I’ll be using Python as the main language for automating the process.
What you will need for this code to work for you:
- An Excel file full of YouTube videos and their corresponding links for your site
- Genson, a JSON schema building library
- Selenium WebDriver, for automating some browser activity used in the code
- BeautifulSoup, an HTML parser to scrape parts of YouTube
- A few other staples (pandas, Requests, python-dateutil, and the standard json library)
What You Need for VideoObject Schema
The minimum required properties for valid VideoObject schema markup are:
- description: A description detailing the content of the video
- name: The title of the video
- thumbnailUrl: A URL pointing to the video’s thumbnail image
- uploadDate: The date the video was published, in ISO 8601 format
Fortunately, this script hits all of these requirements in one go.
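For reference, here is the shape of the finished JSON-LD the script works toward, sketched as a Python dict with placeholder values (the script’s actual output is a json.dumps() of a structure like this):

```python
# Target VideoObject JSON-LD structure, with placeholder values
video_object = {
    "@context": "http://schema.org",
    "@type": "VideoObject",
    "name": "Video title",
    "description": "A description detailing the content of the video",
    "thumbnailUrl": "https://img.youtube.com/vi/VIDEO_ID/maxresdefault.jpg",
    "uploadDate": "2019-05-01T00:00:00",
    "embedUrl": "https://www.youtube.com/embed/VIDEO_ID",
}
```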
How to Run the VideoObject Generator
Load in Your Data
To start automating your VideoObject schema, format your file to look like the following.
Feel free to add a few additional pieces of information in this file for context (and to make the process easier to hand off to a tech team or other team members on the project). The main columns you will need are Video URL and Embed URL.
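As an example, with placeholder video IDs, the layout might look like this:

| Video URL | Embed URL |
| --- | --- |
| https://www.youtube.com/watch?v=VIDEO_ID_1 | https://www.youtube.com/embed/VIDEO_ID_1 |
| https://www.youtube.com/watch?v=VIDEO_ID_2 | https://www.youtube.com/embed/VIDEO_ID_2 |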
```python
from genson import SchemaBuilder
import dateutil.parser as parser
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
from selenium import webdriver

# Load in your file
xlsx = pd.ExcelFile('Meta_Data.xlsx')
video_data = pd.read_excel(xlsx, 'Video Data')
youtube = video_data['Video URL']
link = video_data['Embed URL']
```
Grabbing the Thumbnail Image
```python
def Thumbnail_Pull(url):
    # Run Chrome headless so no browser window opens
    options = webdriver.ChromeOptions()
    options.add_argument("headless")
    cpath = 'C:\\Webdriver\\chromedriver'
    driver = webdriver.Chrome(options=options, executable_path=cpath)
    # Load the page and capture the resolved (canonical) URL
    driver.get(url)
    url_main = driver.current_url
    driver.quit()
    # Strip the URL down to the bare video ID
    i_id = url_main.replace('https://www.youtube.com/watch?v=', '').replace('&feature=youtu.be', '')
    # YouTube serves each video's thumbnail at a predictable image URL
    thumbnail = 'https://img.youtube.com/vi/' + i_id + '/maxresdefault.jpg'
    return thumbnail
```
Using Selenium, the script takes the URL from your spreadsheet, loads the page, and captures the resolved source URL. This ensures that a consistent URL is used for the thumbnail (rather than a vanity URL, redirect, or other URL variation). From there, the code strips out the video ID and constructs a URL that returns the same thumbnail present on your YouTube videos.
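As a quick sanity check, you can run the function on a single share-style link (a hypothetical example) and confirm it resolves to the expected thumbnail URL:

```python
# Hypothetical example: a youtu.be share link resolves to the canonical
# watch URL, and the video ID maps to its thumbnail image
thumb = Thumbnail_Pull('https://youtu.be/VIDEO_ID')
print(thumb)  # https://img.youtube.com/vi/VIDEO_ID/maxresdefault.jpg
```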
And the Rest of the Data
```python
def YouTube_Data_Crawl(url):
    data = []
    r = requests.get(url)
    content = r.content
    soup = BeautifulSoup(content, 'lxml')
    # Retrieve the title
    try:
        title = soup.select('.watch-title')
        title = title[0].getText()
        title = title.strip().replace('\n', ' ')
    except IndexError:
        title = 'null'
    # Retrieve the description
    try:
        des = soup.find('p', {'id': 'eow-description'})
        description = des.text
        description = description.replace('\n', '-').replace('\r', ' ').replace('\t', ' ').replace('"', '')
    except (IndexError, AttributeError):
        description = 'null'
    # Retrieve the date and convert to ISO format
    try:
        dates = soup.find('div', {'id': 'watch-uploader-info'})
        date_fake = dates.text.replace('Published on ', '').replace('Uploaded on ', '')
        date = parser.parse(date_fake)
        date = date.isoformat()
    except (IndexError, ValueError, AttributeError):
        date = 'null'
    data.append((title, description, date))
    return data
```
The code requests the URL and uses BeautifulSoup to parse out the title and description of the video. It also grabs the published date and converts it into ISO 8601 format using date.isoformat(). All the data is then placed into a list and returned.
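Called on a single video (hypothetical URL), the return value is a one-element list of (title, description, date) tuples:

```python
# Hypothetical example showing the shape of the returned data
rows = YouTube_Data_Crawl('https://www.youtube.com/watch?v=VIDEO_ID')
print(rows)  # [('Video title', 'Video description', '2019-05-01T00:00:00')]
```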
Putting it All Together
```python
### Schema Build ###
def SchemaBuild(des, name, date, thumbnailURL, embedded):
    builder = SchemaBuilder()
    builder.add_schema({"@type": "VideoObject"})
    builder.add_schema({"description": des})
    builder.add_schema({"name": name})
    builder.add_schema({"thumbnailUrl": thumbnailURL})
    builder.add_schema({"uploadDate": date})
    builder.add_schema({"embedUrl": embedded})
    meta_data = builder.to_schema()
    # Swap Genson's default $schema key for the @context key
    # that schema.org markup expects
    meta_data['$schema'] = 'http://schema.org'
    meta_data["@context"] = meta_data['$schema']
    del meta_data['$schema']
    meta = json.dumps(meta_data)
    return meta
```
The final function uses Genson’s SchemaBuilder to construct the JSON metadata from the data collected in the previous steps. Genson’s default $schema key is then swapped out for the @context key that schema.org expects, and the result is serialized with json.dumps() into a fully usable JSON-LD string.
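A quick sketch of what calling it looks like, with hypothetical values:

```python
# Hypothetical values; the output is a JSON-LD string ready for a page
meta = SchemaBuild(
    des='A description detailing the content of the video',
    name='Video title',
    date='2019-05-01T00:00:00',
    thumbnailURL='https://img.youtube.com/vi/VIDEO_ID/maxresdefault.jpg',
    embedded='https://www.youtube.com/embed/VIDEO_ID',
)
# meta is a string along the lines of:
# {"@type": "VideoObject", "description": "...", "name": "Video title", ...,
#  "@context": "http://schema.org"}
```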
```python
frames = []
for video, embedded in zip(youtube, link):
    img = Thumbnail_Pull(url=video)
    vid_data = YouTube_Data_Crawl(url=video)
    df_main = pd.DataFrame(vid_data, columns=['Title', 'Description', 'Date'])
    title = df_main['Title'].iloc[0]
    description = df_main['Description'].iloc[0]
    date = df_main['Date'].iloc[0]
    metadata = SchemaBuild(des=description, name=title, thumbnailURL=img,
                           date=date, embedded=embedded)
    frames.append(metadata)

# Save the raw schema output, then merge it back into the original sheet
data_list = pd.DataFrame(frames, columns=['VideoObject Schema'])
data_list['Video'] = youtube
data_list.to_excel('schema_data_raw.xlsx')
video_data['Video Meta Data'] = data_list['VideoObject Schema']
video_data.to_excel('video_meta_data.xlsx')
```
The final product is added to your initial spreadsheet as a brand new column, lining up with its corresponding video.
From here, you can hard-code the JSON-LD directly into the pages you are optimizing or deploy it through a plugin in your CMS.
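If you go the hard-coding route, the JSON-LD belongs in a script tag of type application/ld+json. A minimal hypothetical helper (not part of the script above) to wrap the generated string:

```python
# Hypothetical helper: wrap the generated JSON-LD string in the
# <script> tag format that search engines read from the page
def to_script_tag(meta_json):
    return '<script type="application/ld+json">\n' + meta_json + '\n</script>'
```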
For the full script, see my GitHub repo and feel free to reach out to me directly with any questions on this script or process!