Google Webmaster Tools (recently renamed Google Search Console) is a treasure trove of data, especially keyword data.
Unfortunately, Google only keeps that data for about 90 days, which isn’t long enough for any meaningful trending.
If you want to get more use out of Google Webmaster Tools data, you need to store it in your own database, which can be challenging without a developer background.
I’ve been storing GWMT data ever since (not provided) came into full swing, and I recommend everyone do so as well.
Here is the Python script I have been using to download this data on a monthly basis…
Note: This script has been updated to use the new Search Analytics API.
Click here to jump to the new script.
import shutil
import gwmt.downloader
import glob
import os
import re
import csv
import time
import collections
import MySQLdb
import warnings
dbUser = ''           # MySQL Username
dbPassword = ''       # MySQL Password
dbHost = 'localhost'  # MySQL Host
dbPort = 3306         # MySQL Host Port
dbSchema = ''         # MySQL Database Name
email = ''            # Google Webmaster Tools Username
emailPassword = ''    # Google Webmaster Tools Password
siteUrl = ''          # Google Webmaster Tools Domain (format: http://domain.com)
#based on https://bitbucket.org/richardpenman/csv2mysql
# suppress annoying mysql warnings
warnings.filterwarnings(action='ignore', category=MySQLdb.Warning)
def get_type(s):
    """Find type for this string
    """
    # try integer type
    try:
        v = int(s)
    except ValueError:
        pass
    else:
        if abs(v) > 2147483647:
            return 'bigint'
        else:
            return 'int'
    # try float type
    try:
        float(s)
    except ValueError:
        pass
    else:
        return 'double'
    # check for timestamp
    dt_formats = (
        ('%Y-%m-%d %H:%M:%S', 'datetime'),
        ('%Y-%m-%d %H:%M:%S.%f', 'datetime'),
        ('%Y-%m-%d', 'date'),
        ('%H:%M:%S', 'time'),
    )
    for dt_format, dt_type in dt_formats:
        try:
            time.strptime(s, dt_format)
        except ValueError:
            pass
        else:
            return dt_type
    # doesn't match any other types so assume text
    if len(s) > 255:
        return 'text'
    else:
        return 'varchar(255)'

def most_common(l):
    """Return most common value from list
    """
    # some formats trump others
    for dt_type in ('text', 'bigint'):
        if dt_type in l:
            return dt_type
    return max(l, key=l.count)

def get_col_types(input_file, max_rows=1000):
    """Find the type for each CSV column
    """
    csv_types = collections.defaultdict(list)
    reader = csv.reader(open(input_file))
    # test the first few rows for their data types
    for row_i, row in enumerate(reader):
        if row_i == 0:
            header = row
        else:
            for col_i, s in enumerate(row):
                data_type = get_type(s)
                csv_types[header[col_i]].append(data_type)
        if row_i == max_rows:
            break
    # take the most common data type for each column
    return [most_common(csv_types[col]) for col in header]

def get_schema(table, header, col_types):
    """Generate the schema for this table from given types and columns
    """
    schema_sql = """CREATE TABLE IF NOT EXISTS %s (
        id int NOT NULL AUTO_INCREMENT,""" % table
    for col_name, col_type in zip(header, col_types):
        schema_sql += '\n%s %s,' % (col_name, col_type)
    schema_sql += """\nPRIMARY KEY (id)
    ) DEFAULT CHARSET=utf8;"""
    return schema_sql

def get_insert(table, header):
    """Generate the SQL for inserting rows
    """
    field_names = ', '.join(header)
    field_markers = ', '.join('%s' for col in header)
    return 'INSERT INTO %s (%s) VALUES (%s);' % \
        (table, field_names, field_markers)

def safe_col(s):
    return re.sub(r'\W+', '_', s.lower()).strip('_')
def putCsvToDb(input_file, user, password, host, port, table, database):
    print "Importing `%s' into MySQL database `%s.%s'" % (input_file, database, table)
    db = MySQLdb.connect(host=host, user=user, passwd=password, port=port)
    cursor = db.cursor()
    # create database if it doesn't exist
    cursor.execute('CREATE DATABASE IF NOT EXISTS %s;' % database)
    db.select_db(database)
    # define table
    print 'Analyzing column types ...'
    col_types = get_col_types(input_file)
    print col_types
    header = None
    for row in csv.reader(open(input_file)):
        if header:
            cursor.execute(insert_sql, row)
        else:
            header = [safe_col(col) for col in row]
            schema_sql = get_schema(table, header, col_types)
            print schema_sql
            # create table
            #cursor.execute('DROP TABLE IF EXISTS %s;' % table)
            cursor.execute(schema_sql)
            # create index for more efficient access
            try:
                cursor.execute('CREATE INDEX ids ON %s (id);' % table)
            except MySQLdb.OperationalError:
                pass  # index already exists
            print 'Inserting rows ...'
            # SQL string for inserting data
            insert_sql = get_insert(table, header)
    # commit rows to database
    print 'Committing rows to database ...'
    db.commit()
    print 'Done!'
def downloadCsvs(email, password, siteUrl):
    downloader = gwmt.downloader.Downloader()
    downloader.LogIn(email, password)
    downloader.DoDownload(siteUrl, ['TOP_PAGES', 'TOP_QUERIES'])

def convertLongFileNames():
    os.chdir(".")
    files = glob.glob("*.csv")
    for file in files:
        if 'TopSearchUrls' in file:
            shutil.copyfile(file, 'TOP_PAGES.csv')
            os.remove(file)
        if 'TopSearchQueries' in file:
            shutil.copyfile(file, 'TOP_QUERIES.csv')
            os.remove(file)

def removeChangeAndCtrColumns(filename):
    with open(filename, "rb") as source:
        rdr = csv.reader(source)
        with open("temp", "wb") as result:
            wtr = csv.writer(result)
            for r in rdr:
                wtr.writerow((r[0], r[1], r[3], r[7]))
    shutil.copyfile("temp", filename)
    os.remove("temp")

def addDateColumn(filename):
    with open(filename, 'r') as source:
        with open('temp', 'w') as result:
            writer = csv.writer(result, lineterminator='\n')
            reader = csv.reader(source)
            all = []
            row = next(reader)
            row.append('Date')
            all.append(row)
            for row in reader:
                row.append(time.strftime("%Y-%m-%d"))
                all.append(row)
            writer.writerows(all)
    shutil.copyfile("temp", filename)
    os.remove("temp")
downloadCsvs(email, emailPassword, siteUrl)
convertLongFileNames()
removeChangeAndCtrColumns('TOP_QUERIES.csv')
addDateColumn('TOP_QUERIES.csv')
removeChangeAndCtrColumns('TOP_PAGES.csv')
addDateColumn('TOP_PAGES.csv')
putCsvToDb('TOP_PAGES.csv',dbUser,dbPassword,dbHost,dbPort,'TOP_PAGES',dbSchema)
putCsvToDb('TOP_QUERIES.csv',dbUser,dbPassword,dbHost,dbPort,'TOP_QUERIES',dbSchema)
os.remove('TOP_PAGES.csv')
os.remove('TOP_QUERIES.csv')
Click Here to Download my GWMT Python Script.
To use the script, you will probably want to get some hosting like Amazon AWS or Digital Ocean.
I am using and recommend Digital Ocean since it’s a little bit easier to use than AWS and only runs me $5 per month.
You’ll need to have Python and MySQL installed.
You need to install the MySQLdb Python library, which can be a wonky process. If your hosting is on a Debian-based distribution like Ubuntu, I recommend installing it with apt-get:
sudo apt-get install python-mysqldb
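If the install went smoothly, a quick sanity check like this should run without an ImportError (just a throwaway snippet, not part of the main script):

# Sanity check: confirms the MySQLdb bindings are importable
import MySQLdb
print 'MySQLdb is installed and importable'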
Then you need to install gwmt-dl. To install, SSH into your server and run:
sudo pip install gwmt-dl
Edit my script to include your MySQL database information and your GWMT login details (see annotations in the script comments on lines 13-21).
To schedule the script to run monthly:
sudo crontab -e
and add this command to the bottom:
0 0 1 * * python /{path to script}/gwmt2mysql.py
Hit ‘Control + X’ and then ‘Y’ to save and exit the cron file.
This will automatically download your Webmaster Tools data at 12:00 a.m. on the 1st of every month.
You can use a tool like HeidiSQL (Windows) or Sequel Pro (OS X) to run queries and analyze your data.
You can even use these tools to export the data to Excel, if that’s your cup of tea. I usually throw it into a tool like Tableau or TIBCO Spotfire since I have access to them.
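If you prefer to stay in Python, here’s a rough sketch of the kind of trend query you can run once a few months of data have accumulated. The column names assumed here ('query', 'impressions', 'clicks', 'avg_position', 'date') are my guess at what safe_col() generates from Google’s CSV headers, so check them against your own table (e.g., with DESCRIBE TOP_QUERIES) before relying on this:

# Rough sketch: monthly trend for a single keyword from the stored data.
# Adjust the connection details and column names to match your setup.
import MySQLdb

db = MySQLdb.connect(host='localhost', user='your_user', passwd='your_password', db='your_schema')
cursor = db.cursor()
cursor.execute("""SELECT date, impressions, clicks, avg_position
                  FROM TOP_QUERIES
                  WHERE query = %s
                  ORDER BY date""", ('your keyword',))
for row in cursor.fetchall():
    print row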
Disclaimer:
Google announced that they would be discontinuing the original Google Webmaster Tools API on April 20, 2015, and the new API did not provide a way to download the useful keyword information found in the Search Queries (recently renamed “Search Analytics”) report.
Thankfully, they are still allowing you to download this data via a method in a Python script they had previously provided.
Since my script is using a similar method, it is still functional for the time being. I will provide an updated script at the time that this script stops functioning.
Update:
Google is currently testing an updated API for accessing this data from the Search Console. I will release updated code once I am permitted to under the non-disclosure agreement.
Update – 8/5/2015: Search Analytics API
Here’s the updated code using the new v3 API for Google Search Console for Search Analytics. You’ll need to follow the instructions found here under “Run a sample query” to install the necessary libraries and generate a “client_secrets.json” file.
Download the updated Google Search Console Search Analytics Python script.
#!/usr/bin/python
# -*- coding: utf-8 -*-
#
# Copyright 2015 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
'''
To use:
1) Install the Google Python client library, as shown at https://developers.google.com/webmaster-tools/v3/libraries.
2) Sign up for a new project in the Google APIs console at https://code.google.com/apis/console.
3) Register the project to use OAuth2.0 for installed applications.
4) Copy your client ID, client secret, and redirect URL into the client_secrets.json file included in this package.
5) Run the app in the command-line as shown below.
Sample usage:
$ python search-analytics-api-sample.py 'https://www.example.com/' '2015-05-01' '2015-05-30'
'''
import argparse
import sys
from googleapiclient import sample_tools
import shutil
import glob
import os
import re
import time
import collections
import MySQLdb
import warnings
import csv
# Declare command-line flags.
argparser = argparse.ArgumentParser(add_help=False)
argparser.add_argument('property_uri', type=str,
                       help=('Site or app URI to query data for (including '
                             'trailing slash).'))
argparser.add_argument('start_date', type=str,
                       help=('Start date of the requested date range in '
                             'YYYY-MM-DD format.'))
argparser.add_argument('end_date', type=str,
                       help=('End date of the requested date range in '
                             'YYYY-MM-DD format.'))
def main(argv):
    service, flags = sample_tools.init(
        argv, 'webmasters', 'v3', __doc__, __file__, parents=[argparser],
        scope='https://www.googleapis.com/auth/webmasters.readonly')

    # Get the queries for the date range, sorted by click count, descending.
    request = {
        'startDate': flags.start_date,
        'endDate': flags.end_date,
        'dimensions': ['query'],
    }
    response = execute_request(service, flags.property_uri, request)
    print_table(response, 'Export to CSV complete')
def execute_request(service, property_uri, request):
    '''Executes a searchAnalytics.query request.

    Args:
        service: The webmasters service to use when executing the query.
        property_uri: The site or app URI to request data for.
        request: The request to be executed.

    Returns:
        An array of response rows.
    '''
    return service.searchanalytics().query(
        siteUrl=property_uri, body=request).execute()
def print_table(response, title):
    '''Prints out a response table.

    Each row contains key(s), clicks, impressions, CTR, and average position.

    Args:
        response: The server response to be printed as a table.
        title: The title of the table.
    '''
    #print title + ':'
    if 'rows' not in response:
        print 'Empty response'
        return
    rows = response['rows']
    row_format = '{:<20}' + '{:>20}' * 4
    # print row_format.format('Keys', 'Clicks', 'Impressions', 'CTR', 'Position')
    f = open("./TOP_QUERIES.csv", 'wt')
    writer = csv.writer(f)
    writer.writerow(('query', 'impressions', 'clicks', 'avg_position'))
    for row in rows:
        keys = ''
        # Keys are returned only if one or more dimensions are requested.
        if 'keys' in row:
            keys = u','.join(row['keys']).encode('utf-8')
        #print row_format.format(
        #    keys, row['clicks'], row['impressions'], row['ctr'], row['position'])
        writer.writerow((keys, row['impressions'], row['clicks'], row['position']))
    f.close()
# Fill out with your MySQL database information
dbUser = ''           # MySQL Username
dbPassword = ''       # MySQL Password
dbHost = 'localhost'  # MySQL Host
dbPort = 3306         # MySQL Host Port
dbSchema = ''         # MySQL Database Name
#based on https://bitbucket.org/richardpenman/csv2mysql
# suppress annoying mysql warnings
warnings.filterwarnings(action='ignore', category=MySQLdb.Warning)
def get_type(s):
    """Find type for this string
    """
    # try integer type
    try:
        v = int(s)
    except ValueError:
        pass
    else:
        if abs(v) > 2147483647:
            return 'bigint'
        else:
            return 'int'
    # try float type
    try:
        float(s)
    except ValueError:
        pass
    else:
        return 'double'
    # check for timestamp
    dt_formats = (
        ('%Y-%m-%d %H:%M:%S', 'datetime'),
        ('%Y-%m-%d %H:%M:%S.%f', 'datetime'),
        ('%Y-%m-%d', 'date'),
        ('%H:%M:%S', 'time'),
    )
    for dt_format, dt_type in dt_formats:
        try:
            time.strptime(s, dt_format)
        except ValueError:
            pass
        else:
            return dt_type
    # doesn't match any other types so assume text
    if len(s) > 255:
        return 'text'
    else:
        return 'varchar(255)'

def most_common(l):
    """Return most common value from list
    """
    # some formats trump others
    for dt_type in ('text', 'bigint'):
        if dt_type in l:
            return dt_type
    return max(l, key=l.count)

def get_col_types(input_file, max_rows=1000):
    """Find the type for each CSV column
    """
    csv_types = collections.defaultdict(list)
    reader = csv.reader(open(input_file))
    # test the first few rows for their data types
    for row_i, row in enumerate(reader):
        if row_i == 0:
            header = row
        else:
            for col_i, s in enumerate(row):
                data_type = get_type(s)
                csv_types[header[col_i]].append(data_type)
        if row_i == max_rows:
            break
    # take the most common data type for each column
    return [most_common(csv_types[col]) for col in header]

def get_schema(table, header, col_types):
    """Generate the schema for this table from given types and columns
    """
    schema_sql = """CREATE TABLE IF NOT EXISTS %s (
        id int NOT NULL AUTO_INCREMENT,""" % table
    for col_name, col_type in zip(header, col_types):
        schema_sql += '\n%s %s,' % (col_name, col_type)
    schema_sql += """\nPRIMARY KEY (id)
    ) DEFAULT CHARSET=utf8;"""
    return schema_sql

def get_insert(table, header):
    """Generate the SQL for inserting rows
    """
    field_names = ', '.join(header)
    field_markers = ', '.join('%s' for col in header)
    return 'INSERT INTO %s (%s) VALUES (%s);' % \
        (table, field_names, field_markers)

def safe_col(s):
    return re.sub(r'\W+', '_', s.lower()).strip('_')
def putCsvToDb(input_file, user, password, host, port, table, database):
    print "Importing `%s' into MySQL database `%s.%s'" % (input_file, database, table)
    db = MySQLdb.connect(host=host, user=user, passwd=password, port=port)
    cursor = db.cursor()
    # create database if it doesn't exist
    cursor.execute('CREATE DATABASE IF NOT EXISTS %s;' % database)
    db.select_db(database)
    # define table
    print 'Analyzing column types ...'
    col_types = get_col_types(input_file)
    print col_types
    header = None
    for row in csv.reader(open(input_file)):
        if header:
            cursor.execute(insert_sql, row)
        else:
            header = [safe_col(col) for col in row]
            schema_sql = get_schema(table, header, col_types)
            print schema_sql
            # create table
            #cursor.execute('DROP TABLE IF EXISTS %s;' % table)
            cursor.execute(schema_sql)
            # create index for more efficient access
            try:
                cursor.execute('CREATE INDEX ids ON %s (id);' % table)
            except MySQLdb.OperationalError:
                pass  # index already exists
            print 'Inserting rows ...'
            # SQL string for inserting data
            insert_sql = get_insert(table, header)
    # commit rows to database
    print 'Committing rows to database ...'
    db.commit()
    print 'Done!'
def convertLongFileNames():
    os.chdir(".")
    files = glob.glob("*.csv")
    for file in files:
        if 'TopSearchUrls' in file:
            shutil.copyfile(file, 'TOP_PAGES.csv')
            os.remove(file)
        if 'TopSearchQueries' in file:
            shutil.copyfile(file, 'TOP_QUERIES.csv')
            os.remove(file)

def removeChangeAndCtrColumns(filename):
    with open(filename, "rb") as source:
        rdr = csv.reader(source)
        with open("temp", "wb") as result:
            wtr = csv.writer(result)
            for r in rdr:
                wtr.writerow((r[0], r[1], r[3], r[7]))
    shutil.copyfile("temp", filename)
    os.remove("temp")

def addDateColumn(filename):
    with open(filename, 'r') as source:
        with open('temp', 'w') as result:
            writer = csv.writer(result, lineterminator='\n')
            reader = csv.reader(source)
            all = []
            row = next(reader)
            row.append('Date')
            all.append(row)
            for row in reader:
                row.append(time.strftime("%Y-%m-%d"))
                all.append(row)
            writer.writerows(all)
    shutil.copyfile("temp", filename)
    os.remove("temp")
if __name__ == '__main__':
    main(sys.argv)
    convertLongFileNames()
    addDateColumn('TOP_QUERIES.csv')
    putCsvToDb('TOP_QUERIES.csv', dbUser, dbPassword, dbHost, dbPort, 'TOP_QUERIES', dbSchema)
    os.remove('TOP_QUERIES.csv')
You’ll also have to modify your cron entry accordingly to account for the new command (and its date range arguments).
For example, I can run the script for my site using the following command:
python searchanalytics2mysql.py 'http://searchwilderness.com' '2015-07-01' '2015-07-31' --noauth_local_webserver
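Since the new script takes a date range on the command line, one way to keep the monthly automation going is a small wrapper that works out the previous calendar month and calls the script for you. This is just a sketch under my own assumptions (the script is saved as searchanalytics2mysql.py alongside the wrapper, and the property URL below is a placeholder); point your cron entry at the wrapper instead:

# Rough sketch of a monthly wrapper for cron: computes the previous
# calendar month's date range and invokes the Search Analytics script.
import datetime
import subprocess

today = datetime.date.today()
last_day_prev_month = today.replace(day=1) - datetime.timedelta(days=1)
first_day_prev_month = last_day_prev_month.replace(day=1)

subprocess.call([
    'python', 'searchanalytics2mysql.py',
    'http://your-site.com',  # placeholder: use your verified property URI
    first_day_prev_month.strftime('%Y-%m-%d'),
    last_day_prev_month.strftime('%Y-%m-%d'),
    '--noauth_local_webserver',
])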
Nicely done!
This sounds really useful. The hurdle I have to get over is Python. But it’s on my radar.
I have a question. What are the benefits of downloading all the search data rather than just looking at the search data provided by Google Universal Analytics?
Hey David,
The data in Google Analytics found under Acquisition->Search Engine Optimization->Queries is the same data found within Google Webmaster Tools. Whether you view it in Webmaster Tools or Google Analytics, only 90 days of data are available.
If you’re referring to the data found in Google Analytics under Acquisition->Campaigns->Organic Keywords, that data has been woefully incomplete ever since Google enabled encrypted search on all searches. You’re likely to find that >90% of your keyword data shows as “(not provided)”, rendering the data useless as a sample.
Hope that helps!
-Paul
I have a question about this statement. Using the Acquisition->Search Engine Optimization->Queries report in Google Analytics, I get slightly different results than the Keywords tool in Search Console, specifically for CTR.
Any ideas why?
Any chance to use this with google spreadsheets?
Amazing. I was looking to develop a solution like this, but you’ve done it for me! 🙂
Hi Paul, thanks for the method.
I followed all the steps, but there is an invalid syntax error at line 13. Can you help me with it, please?
This is exactly what I was looking for. Thanks for sharing the code. I also made some modifications:
– I cannot install MySQLdb, so I installed PyMySQL
– the code can then be tweaked to use PyMySQL by replacing MySQLdb with pymysql
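For example, the swap can be done with a small shim near the very top of the script (an untested sketch, assuming PyMySQL is installed), so the rest of the code can keep importing MySQLdb unchanged:

# Rough sketch: let PyMySQL stand in for MySQLdb.
# Must run before the script's own "import MySQLdb" line.
import pymysql
pymysql.install_as_MySQLdb()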
Can you please post your solution code here? I ran into the same issue as you because Python 3.x has problems with MySQLdb. I was able to successfully install PyMySQL. Now the only missing piece is the tweaked file. It would really help a lot. Thanks in advance.
Paul- this write up is an answered prayer, thank you!
Now, this might seem like a silly question…could this script be integrated into an AdWords script? E.g., grab keywords and bid down on avg position of X or lower?
Excited to dive in this week…
Yes, you can do this with the API, although I wouldn’t specifically use this script for that purpose.
Total n00b question, but in the updated Python script for Search Analytics, at what point does the client_secrets.json come into play? I’ve read over the example in Google’s docs and yours as well, but don’t understand at what point the authorization is taking place :-/
For instance, I’m assuming it’s somewhere here:
service, flags = sample_tools.init(
    argv, 'webmasters', 'v3', __doc__, __file__, parents=[argparser],
    scope='https://www.googleapis.com/auth/webmasters.readonly')
But that is *a lot* different from the one found here https://developers.google.com/webmaster-tools/v3/quickstart/quickstart-python:
# Copy your credentials from the console
CLIENT_ID = 'YOUR_CLIENT_ID'
CLIENT_SECRET = 'YOUR_CLIENT_SECRET'
# Check https://developers.google.com/webmaster-tools/v3/ for all available scopes
OAUTH_SCOPE = 'https://www.googleapis.com/auth/webmasters.readonly'
# Redirect URI for installed apps
REDIRECT_URI = 'urn:ietf:wg:oauth:2.0:oob'
# Run through the OAuth flow and retrieve credentials
flow = OAuth2WebServerFlow(CLIENT_ID, CLIENT_SECRET, OAUTH_SCOPE, REDIRECT_URI)
authorize_url = flow.step1_get_authorize_url()
print 'Go to the following link in your browser: ' + authorize_url
code = raw_input('Enter verification code: ').strip()
credentials = flow.step2_exchange(code)
# Create an httplib2.Http object and authorize it with our credentials
http = httplib2.Http()
http = credentials.authorize(http)
webmasters_service = build('webmasters', 'v3', http=http)
I believe it’s handled inside Google’s library and isn’t actually referenced in the script. It’s just a static file required for auth. Put it in the same directory as the script.
Yep. client_secrets.json is opened on init(). Check line 76 of the source code: https://github.com/google/google-api-python-client/blob/master/googleapiclient/sample_tools.py
Nice job! But maybe using MySQL is overkill. I’ve tried the Search Console API. Check my code if you’re interested in it: https://www.ukietech.com/blog/marketing/download-google-search-queries-using-google-search-console-api/
Nice Alex. Anyone needing to download to CSV can use your script. My script actually dumps to CSV and then writes to SQL, but I think people will appreciate not having to modify/write their own code 🙂
Hey, thanks for providing the script. I got the updated version to work without much hassle, but am still facing a problem. For most URLs, I get the following error message:
googleapiclient.errors.HttpError:
It seems to be working for home pages only. For example, I can access the data for http://www.example.com/, but will receive the given error message for http://www.example.com/directory/index.html. Does anyone have an idea why that is and what to do about it?
Looks like the error message from the previous post was stripped because of its brackets. Here it is again:
HttpError 403 when requesting https://www.googleapis.com/webmasters/v3/sites/{URL}/query?alt=json returned “User does not have sufficient permission for site ‘{URL}’. See also: https://support.google.com/webmasters/answer/2451999.”
What about the ability to write “top queries” to one file/table and “top pages” to a separate file/table? How would your script need to be modified to accomplish that?