Python Script for Creating an XML Sitemap with rel="alternate" hreflang="x"

Google specifies three scenarios for which rel="alternate" hreflang="x" is recommended:

  1. You translate only the template of your page, such as the navigation and footer, and keep the main content in a single language. This is common on pages that feature user-generated content, like a forum post.
  2. Your pages have broadly similar content within a single language, but the content has small regional variations. For example, you might have English-language content targeted at readers in the US, GB, and Ireland.
  3. Your site content is fully translated. For example, you have both German and English versions of each page.

via Google Webmaster Tools Help

Similarly, there are three ways to implement hreflang. It can be tagged with the <link> element within the <head> section of each page, expressed through the HTTP header for non-HTML files, or declared within your XML sitemap. There is an obvious advantage to applying it within an XML sitemap for enterprise-level sites, like the ones I tend to work on: it is typically much easier to get an updated XML sitemap uploaded than to apply new tagging to a myriad of pages. However, even when applied within an XML sitemap, it can be a tedious process for large websites. I created a quick Python script to help make that process a little bit easier.
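
For reference, the sitemap form attaches one xhtml:link child per alternate to each url entry. Below is a minimal sketch of a single entry built with ElementTree. The helper name build_url_entry and the example.com URLs are placeholders of mine and are not part of the tool that follows.

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = 'http://www.sitemaps.org/schemas/sitemap/0.9'
XHTML_NS = 'http://www.w3.org/1999/xhtml'

# Serialize sitemap tags without a prefix and XHTML tags as xhtml:...
ET.register_namespace('', SITEMAP_NS)
ET.register_namespace('xhtml', XHTML_NS)


def build_url_entry(loc, alternates):
    """Return a <url> element with one xhtml:link per (langcode, href) pair."""
    url = ET.Element('{%s}url' % SITEMAP_NS)
    ET.SubElement(url, '{%s}loc' % SITEMAP_NS).text = loc
    for langcode, href in alternates:
        ET.SubElement(url, '{%s}link' % XHTML_NS,
                      {'rel': 'alternate', 'hreflang': langcode, 'href': href})
    return url


# Hypothetical URLs, for illustration only.
entry = build_url_entry(
    'http://www.example.com/page.html',
    [('en-us', 'http://www.example.com/page.html'),
     ('en-gb', 'http://www.example.co.uk/page.html')])
xml_text = ET.tostring(entry).decode('ascii')
```

Here xml_text holds the serialized url entry, including the xmlns:xhtml declaration that the xhtml:link children rely on.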

hreflang python tool usage

This script is designed for a website where each alternate language site has the same number of pages as the primary language site. For example, there are the same number of pages for en-gb and en-ca as there are for en-us.

For sites with a varied number of pages for each language, I recommend using the tool created by theMediaFlow, located here. An example would be a site where the pages have the same structure for en-gb and en-us, but en-ca doesn’t have an equivalent for every page.
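
One rough way to tell which situation you are in is to apply the intended replacement to the primary URLs and see whether each result exists in the alternate site's URL list. The sketch below assumes you already have both URL lists in memory. The function name find_missing_alternates and the example URLs are hypothetical.

```python
def find_missing_alternates(primary_urls, alternate_urls, old, new):
    """Return the expected alternate URLs that the alternate site lacks."""
    available = set(alternate_urls)
    missing = []
    for url in primary_urls:
        expected = url.replace(old, new)
        if expected not in available:
            missing.append(expected)
    return missing


# Hypothetical URL lists, for illustration only.
us_urls = ['http://www.example.com/en-us/a.html',
           'http://www.example.com/en-us/b.html']
ca_urls = ['http://www.example.com/en-ca/a.html']

missing = find_missing_alternates(us_urls, ca_urls, '/en-us/', '/en-ca/')
```

An empty result suggests the one-to-one assumption holds for that language. Anything else lists the alternate pages the script would point to even though they don't exist.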

# Python 2.7: add rel="alternate" hreflang="x" links to every URL in an XML sitemap.
from xml.etree import ElementTree

SITEMAP_NS = 'http://www.sitemaps.org/schemas/sitemap/0.9'
ElementTree.register_namespace('', SITEMAP_NS)
ElementTree.register_namespace('xhtml', 'http://www.w3.org/1999/xhtml')
print 'HREFLANG XML Sitemap Generator v0.15 by Paul Shapiro\n'
finput = raw_input('Please specify the file path to your source xml sitemap:\n')
doc = ElementTree.parse(finput)
root = doc.getroot()
# Declare the xhtml prefix on the root so the literal 'xhtml:link' tags below resolve.
root.set('xmlns:xhtml', 'http://www.w3.org/1999/xhtml')
# raw_input returns a string; int() converts it without evaluating user input.
nparts = int(raw_input('How many parts of the URL do you want to replace:\n'))
nlangs = int(raw_input('How many languages:\n'))
lang_replacements = dict()
matches = []
for i in xrange(nparts):
    matches.append(raw_input('please input a part of the URL you want to replace:\n'))

# For each language, collect one (match, replacement) pair per URL part.
for i in xrange(nlangs):
    langcode = raw_input('Please enter the #' + str(i + 1) + ' language code:\n')
    replacements = []
    for j in xrange(nparts):
        replace_match = matches[j]
        replace_with = raw_input('what would you like to replace ' + replace_match + ' with?:\n')
        replacements.append((replace_match, replace_with))

    lang_replacements[langcode] = replacements

# Append one xhtml:link per language to every <url> entry.
for el in doc.findall('{%s}url' % SITEMAP_NS):
    url = el.find('{%s}loc' % SITEMAP_NS).text
    for (langcode, replacements) in lang_replacements.iteritems():
        localized_url = url
        for (old, new) in replacements:
            localized_url = localized_url.replace(old, new)

        ElementTree.SubElement(el, 'xhtml:link', {'rel': 'alternate',
         'hreflang': langcode,
         'href': localized_url})

def indent(elem, level=0):
    # Pretty-print the tree in place by rewriting whitespace-only text and tails.
    i = '\n' + level * '  '
    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + '  '
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for child in elem:
            indent(child, level + 1)

        # The last child is followed by the parent's closing tag.
        if not child.tail or not child.tail.strip():
            child.tail = i
    elif level and (not elem.tail or not elem.tail.strip()):
        elem.tail = i

indent(root)
foutput = raw_input('choose a filename to save output as *.xml \n')
# tostring() emits no XML declaration for its default ASCII output, so prepend one.
f = open(foutput, 'w')
f.write('<?xml version="1.0" encoding="UTF-8"?>\n' + ElementTree.tostring(root))
f.close()

Download hreflang_gen_v0.15.py.
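
To make the replacement step easier to reason about before running the tool on a real sitemap, here is a small sketch of that logic on its own. It mirrors the lang_replacements structure used by the script, but the function name alternates_for and the example URL are my own placeholders. It runs on both Python 2.7 and Python 3.

```python
def alternates_for(url, lang_replacements):
    """Map each language code to the URL produced by its replacements."""
    alternates = {}
    for langcode, replacements in lang_replacements.items():
        localized = url
        # Replacements run in order, and str.replace changes every occurrence.
        for old, new in replacements:
            localized = localized.replace(old, new)
        alternates[langcode] = localized
    return alternates


# Two URL parts ('.com' and '/us/') and two languages, for illustration only.
lang_replacements = {
    'en-us': [('.com', '.com'), ('/us/', '/us/')],
    'en-gb': [('.com', '.co.uk'), ('/us/', '/uk/')],
}
result = alternates_for('http://www.example.com/us/shoes.html', lang_replacements)
```

Because each replacement is applied to the whole URL, choose parts that are specific enough not to match elsewhere in the path.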

The usage below assumes that you have Python 2.7.x installed and are comfortable running scripts from the command line.

Using the tool to generate optimized XML Sitemaps with rel="alternate" hreflang="x":

  1. Download the .py file or copy the source code into your text editor of choice and save it.
  2. Make sure you have the XML sitemap you wish to modify saved locally. Placing it in the same directory as the tool will make it easier to use, but isn’t necessary.
  3. Open a command prompt/terminal and run the command “python [file name of .py file]”. The script will prompt you for the location of your original XML Sitemap file.
  4. If the input XML Sitemap is in the same directory as the tool, at the prompt “Please specify the file path to your source xml sitemap” you can enter just the filename. If it is in a different directory, enter the full path to the file (e.g., C:\OriginalSitemap.xml). Press the Enter key.
  5. If you entered a valid path to a file, you will be prompted with the question “How many parts of the URL do you want to replace”. Enter a number equal to the number of alterations that must be made to the existing URL structure in order to change it into its international form (in most cases this value will be 1). For example, if you want to change domain.com to domain.co.uk you would enter the value as “1” (since you are only changing the 1 part “.com” to “.co.uk”). Press the Enter key.
  6. Next, at the prompt “How many languages”, enter the number of alternate international websites you wish to add to the sitemap (that is, the number of rel="alternate" hreflang="x" tags added per URL in your XML Sitemap). The same parts of the URL are replaced for every language, so the structural changes must be consistent across all of them. If they aren’t, you can simply rerun the program after you have made your initial changes. Press the Enter key.
  7. At the prompt “please input a part of the URL you want to replace”, enter the part of the original URL that must change to form the international URL for rel="alternate" hreflang="x". If you entered a value greater than 1 in Step 5, this question will be asked that many times. Press the Enter key.
  8. At the prompt “Please enter the #x language code”, enter the language code you wish to put in the hreflang attribute. If you entered a value greater than 1 in Step 6, this question will be asked once per language. Press the Enter key.
  9. The next prompt will ask “what would you like to replace x with?”. Enter the replacement text for the corresponding part from Step 7. This question is asked once for each part from Step 5, and Steps 8 and 9 repeat together for each language from Step 6. Press the Enter key.
  10. The last prompt will ask you to “choose a filename to save output as *.xml”. Entering just a filename will write your optimized XML Sitemap to the directory you ran the script from. You can also enter a complete file path. Remember to add the .xml extension to the filename. Press the Enter key and enjoy your optimized XML Sitemap! Check the XML source to confirm that the output is correct. The XML may not display the same as the original sitemap did in your browser, but it will validate correctly within Google Webmaster Tools.
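
For the final check in Step 10, a short script can count the alternates attached to each URL instead of reading the XML by eye. This is only a sketch: the function name count_alternates is hypothetical, and the inline sample stands in for the contents of your output file. It should run on both Python 2.7 and Python 3.

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = 'http://www.sitemaps.org/schemas/sitemap/0.9'
XHTML_NS = 'http://www.w3.org/1999/xhtml'


def count_alternates(sitemap_xml):
    """Return {loc: number of xhtml:link children} for each <url> entry."""
    root = ET.fromstring(sitemap_xml)
    counts = {}
    for url in root.findall('{%s}url' % SITEMAP_NS):
        loc = url.find('{%s}loc' % SITEMAP_NS).text
        counts[loc] = len(url.findall('{%s}link' % XHTML_NS))
    return counts


# Stand-in for the generated file; the second entry is short one alternate.
sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" '
    'xmlns:xhtml="http://www.w3.org/1999/xhtml">'
    '<url><loc>http://www.example.com/a.html</loc>'
    '<xhtml:link rel="alternate" hreflang="en-us" href="http://www.example.com/a.html" />'
    '<xhtml:link rel="alternate" hreflang="en-gb" href="http://www.example.co.uk/a.html" />'
    '</url>'
    '<url><loc>http://www.example.com/b.html</loc>'
    '<xhtml:link rel="alternate" hreflang="en-us" href="http://www.example.com/b.html" />'
    '</url>'
    '</urlset>'
)
counts = count_alternates(sample)
```

Every count should equal the number of languages entered in Step 6. Any entry with a different count is worth inspecting by hand.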
