Reorganizing XML Sitemaps with Python for Fun & Profit


One of the main advantages of combining programming and SEO skills is that you can find clever solutions that would be difficult to see if you only knew SEO or programming separately.

For instance, monitoring the indexing of your most important pages is a crucial SEO task.

If they are not being indexed, you need to know the reason and take action. The best part is that we can learn all this for free directly from Google Search Console.

GSC - Sitemaps

In the screenshot above, the XML sitemaps are grouped by page type, but the four sitemaps listed here are specifically used to track the progress of some SEO A/B tests we ran for this client.

GSC - Sitemaps 2

In the Index Coverage reports, we can check each sitemap, learn which specific pages are not indexed, why they are not indexed, and get a sense of how to fix them (if they can be fixed).

The rest of this post will cover how to reorganize your XML sitemaps using any criteria that can help you isolate indexing problems on pages you care about.

Required Libraries

In this article, we are going to use Python 3 and third-party libraries including pandas, NLTK, and Jinja2.

If you use Google Colab, you need to upgrade pandas. Type:

!pip install --upgrade pandas==0.23

Overall Process

We are going to read URLs from existing XML sitemaps, load them into pandas data frames, create or use additional columns, group the URLs by the columns we will use as criteria, and write the groups of URLs into XML sitemaps.

Read Sitemap URLs from XML Sitemap Indices

Let’s start by reading a list of sitemap URLs from the Search Engine Journal sitemap index.
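As a rough sketch of this step, the snippet below parses a sitemap index with the standard library and collects each sitemap URL with its last modification timestamp. The function name parse_sitemap_index and the example.com sample document are assumptions made for illustration; they are not the code used for the output in this post, and a live run would first download the index XML.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap_index(xml_text):
    """Return {sitemap URL: lastmod} from a <sitemapindex> document."""
    root = ET.fromstring(xml_text)
    sitemaps = {}
    for node in root.findall("sm:sitemap", SITEMAP_NS):
        loc = node.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = node.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if loc:
            sitemaps[loc.strip()] = lastmod.strip() if lastmod else None
    return sitemaps


# Hypothetical sample index standing in for a downloaded file.
sample_index = (
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<sitemap><loc>https://www.example.com/post-sitemap1.xml</loc>"
    "<lastmod>2005-08-15T10:52:01-04:00</lastmod></sitemap>"
    "<sitemap><loc>https://www.example.com/page-sitemap.xml</loc>"
    "<lastmod>2019-02-01T09:00:00-05:00</lastmod></sitemap>"
    "</sitemapindex>"
)

sitemaps = parse_sitemap_index(sample_index)
print("The number of sitemaps are", len(sitemaps))
print(sitemaps)
```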

The partial output is:

The number of sitemaps are 30
{'https://www.searchenginejournal.com/post-sitemap1.xml': '2005-08-15T10:52:01-04:00', …

Next, we load them into a pandas data frame.
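A minimal sketch of the loading step, assuming the sitemap URLs and timestamps are already in a dictionary. The example.com entries and the column names sitemap and lastmod are placeholders chosen for this illustration.

```python
import pandas as pd

# Hypothetical {sitemap URL: lastmod} dictionary, shaped like the output above.
sitemaps = {
    "https://www.example.com/post-sitemap1.xml": "2005-08-15T10:52:01-04:00",
    "https://www.example.com/post-sitemap2.xml": "2006-03-02T08:15:00-05:00",
    "https://www.example.com/page-sitemap.xml": "2019-02-01T09:00:00-05:00",
}

# One row per sitemap; the column names are our own choice.
sitemaps_df = pd.DataFrame(list(sitemaps.items()), columns=["sitemap", "lastmod"])

# Parse the ISO 8601 strings so the timestamps can be sorted and filtered.
sitemaps_df["lastmod"] = pd.to_datetime(sitemaps_df["lastmod"], utc=True)

print(sitemaps_df.head(10))
```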

The output shows the first 10 URLs with their last modification timestamp.

Read URLs from XML Sitemaps

Now that we have sitemap URLs, we can pull the actual website URLs. For illustration purposes, we will only pull URLs from the post sitemaps.
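Here is one way this step could look, again as a sketch. The helper names parse_urlset and sample_urlset, the "post-sitemap" filename check, and the in-memory downloaded dictionary are all assumptions standing in for real HTTP downloads.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_urlset(xml_text):
    """Return a list of (url, lastmod) tuples from a <urlset> sitemap."""
    root = ET.fromstring(xml_text)
    pages = []
    for node in root.findall("sm:url", SITEMAP_NS):
        loc = node.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = node.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if loc:
            pages.append((loc.strip(), lastmod.strip() if lastmod else None))
    return pages


def sample_urlset(slugs):
    """Build a small hypothetical <urlset> document for this demo."""
    entries = "".join(
        f"<url><loc>https://www.example.com/{slug}/</loc>"
        "<lastmod>2019-02-01T00:00:00+00:00</lastmod></url>"
        for slug in slugs
    )
    return (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        f"{entries}</urlset>"
    )


# Stand-in for the downloaded sitemap files, keyed by sitemap URL.
downloaded = {
    "https://www.example.com/post-sitemap1.xml": sample_urlset(["google-news", "seo-tips"]),
    "https://www.example.com/category-sitemap.xml": sample_urlset(["seo"]),
}

all_pages = {}
for sitemap_url, xml_text in downloaded.items():
    if "post-sitemap" not in sitemap_url:
        continue  # only keep the post sitemaps
    pages = parse_urlset(xml_text)
    print(sitemap_url)
    print("The number of URLs are", len(pages))
    all_pages.update(dict(pages))
```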

The partial output is:

https://www.searchenginejournal.com/post-sitemap1.xml
The number of URLs are 969
https://www.searchenginejournal.com/post-sitemap2.xml
The number of URLs are 958
https://www.searchenginejournal.com/post-sitemap3.xml
The number of URLs are 943

Search Engine Journal XML sitemaps use the Yoast SEO plugin, which separates categories and blogs but groups all posts into post-sitemapX.xml sitemap files.

We want to reorganize post sitemaps by the most popular words that appear in the slugs. We created the word cloud you see above with the most popular words we found. Let’s put this together!

Creating a Word Cloud

Word Cloud

In order to organize sitemaps by the most popular words in their URLs, we will create a word cloud. A word cloud is just the most popular words ordered by their frequency. We eliminate common words like “the”, “a”, etc. to have a clean group.

We first create a new column with only the paths of the URLs, then download English stopwords from the NLTK package.

The process is to first take only the path portion of the URLs, break the words by using - or / as separators, and count the word frequency. When counting, we exclude stop words and words that are only digits. Think of the 5 in “5 ways to do X”.
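The counting process can be sketched with the standard library alone. In this illustration, the small STOP_WORDS set is a hand-written stand-in for the NLTK English stopword list, and count_slug_words and the example.com URLs are names invented for the demo.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Stand-in for the NLTK English stopwords (assumption: a tiny subset).
STOP_WORDS = {"the", "a", "an", "to", "do", "of", "in", "for", "and"}


def count_slug_words(urls, stop_words=STOP_WORDS):
    """Count words in URL paths, skipping stop words and pure digits."""
    counts = Counter()
    for url in urls:
        path = urlparse(url).path  # keep only the path portion
        for word in re.split(r"[-/]", path.lower()):
            if word and not word.isdigit() and word not in stop_words:
                counts[word] += 1
    return counts


urls = [
    "https://www.example.com/5-ways-to-do-seo/123/",
    "https://www.example.com/google-search-update/",
    "https://www.example.com/google-seo-tips/",
]

word_counts = count_slug_words(urls)
print(word_counts.most_common(3))
```

Counter.most_common() returns (word, count) pairs sorted by count, which is the same shape as the partial output shown in this post.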

The partial output is:

[('google', 4430), ('search', 2961), ('seo', 1482), ('yahoo', 1049), ('marketing', 989), ('new', 919), ('content', 919), ('social', 821), …

Just for fun (as promised in the headline), here is the code that will create a visual word cloud with the word frequencies above.

Now, we add the wordcloud column as a category to the data frame with the sitemap URLs.
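One possible way to build that column is sketched below. The rule used here (label each path with the first popular word it contains, or "other" when none match) is an assumption for illustration, as are the function name first_popular_word and the sample paths; the column is named category to match the filter used in this post.

```python
import re

import pandas as pd

# Popular words ordered by frequency (taken from the word counts).
popular_words = ["google", "search", "seo"]


def first_popular_word(path, words=popular_words):
    """Return the most frequent popular word found in the path, else 'other'."""
    tokens = set(re.split(r"[-/]", path.lower()))
    for word in words:
        if word in tokens:
            return word
    return "other"  # fallback label, our own assumption


df = pd.DataFrame(
    {"path": ["/google-search-tips/", "/local-seo-guide/", "/yahoo-news/"]}
)
df["category"] = df["path"].apply(first_popular_word)
print(df)
```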

Here is what the output looks like.

Wordcloud column

We can use this new category to review URLs that contain the popular word: Google.

df[df["category"] == "google"]

This lists only the URLs with that popular word in the path.

Breaking the 1k URL Index Coverage Limit

Google Search Console’s Index Coverage report is powerful, but it limits the reports to only one thousand URLs. We can split our already filtered XML sitemap URLs further into groups of 1k URLs.

We can use pandas’ powerful indexing capability for this.
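A short sketch of the idea using positional slicing with iloc. The helper name split_into_chunks and the generated example.com URLs are illustrative assumptions.

```python
import pandas as pd


def split_into_chunks(frame, size=1000):
    """Split a data frame into consecutive slices of at most `size` rows."""
    return [frame.iloc[start:start + size] for start in range(0, len(frame), size)]


# Hypothetical data frame with 2,500 URLs.
df = pd.DataFrame(
    {"url": [f"https://www.example.com/page-{i}/" for i in range(2500)]}
)

chunks = split_into_chunks(df)
print([len(chunk) for chunk in chunks])  # [1000, 1000, 500]
```

Each chunk can then be written out as its own XML sitemap, so every sitemap stays within the 1k URL limit of the report.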

Reorganizing Sitemaps by Bestsellers

One of the most powerful uses of this technique is to break out pages that lead to conversions.

On ecommerce sites, we could break out the best sellers and learn which ones are not indexed. Easy money!

As SEJ is not a transactional site, I will create some fake transactions to illustrate this tactic. Normally, you would fetch this data from Google Analytics.

I’m assuming that pages with the words “adwords”, “facebook”, “ads” or “media” have transactions.

We create a fake transactions column with only the relative path, as you would normally find in Google Analytics.
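A sketch of how such fake data could be generated. The random counts between 1 and 100, the fixed seed, and the sample paths are assumptions made for this illustration; the data frame and column names follow the ones used in this post.

```python
import random

import pandas as pd

# Words assumed to signal pages with transactions.
TRANSACTION_WORDS = ("adwords", "facebook", "ads", "media")

df = pd.DataFrame(
    {
        "path": [
            "/google-adwords-tips/",
            "/facebook-ads-guide/",
            "/social-media-trends/",
            "/seo-basics/",
        ]
    }
)

# Simple substring match against any of the words.
has_word = df["path"].str.contains("|".join(TRANSACTION_WORDS))

random.seed(0)  # fixed seed so the fake numbers are repeatable
fake_transaction_pages = df.loc[has_word, ["path"]].copy()
fake_transaction_pages["fake_transactions"] = [
    random.randint(1, 100) for _ in range(len(fake_transaction_pages))
]
print(fake_transaction_pages)
```

Note that a substring match will also flag words that merely contain “ads” or “media”, which is acceptable for fake data but worth tightening for real reports.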

Next, we will merge the two data frames to add the transaction data to the original sitemap data frame. By default, the pandas merge function performs an inner join, so only the rows in common are available.

df.merge(fake_transaction_pages, left_on="path", right_on="path")

As I want all rows, I will change the join type to left so it includes all rows in the original data frame. Note that the missing rows have NaN (missing value) in the fake transactions column.

df.merge(fake_transaction_pages, left_on="path", right_on="path", how="left")

missing rows

I can easily fill the missing values with zeros.

df.merge(fake_transaction_pages, left_on="path", right_on="path", how="left").fillna(0)

fill missing values

I can now get just the list of best sellers (by transaction) using this.

new_df = df.merge(fake_transaction_pages, left_on="path", right_on="path", how="left").fillna(0)

new_df[new_df.fake_transactions > 0]
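To see the whole merge sequence run end to end, here is a small self-contained sketch. The three sample paths and the transaction counts are made up for illustration.

```python
import pandas as pd

# Hypothetical sitemap paths and fake transaction data.
df = pd.DataFrame(
    {"path": ["/google-adwords-tips/", "/seo-basics/", "/facebook-ads-guide/"]}
)
fake_transaction_pages = pd.DataFrame(
    {
        "path": ["/google-adwords-tips/", "/facebook-ads-guide/"],
        "fake_transactions": [12, 7],
    }
)

# Inner join (the default): only paths present in both frames survive.
inner = df.merge(fake_transaction_pages, on="path")

# Left join: every sitemap path is kept; unmatched rows get NaN.
left = df.merge(fake_transaction_pages, on="path", how="left")

# Fill only the transactions column, leaving other columns untouched.
new_df = left.fillna({"fake_transactions": 0})

best_sellers = new_df[new_df.fake_transactions > 0]
print(len(inner), len(left), len(best_sellers))
```

Passing a dictionary to fillna() limits the fill to the named column, which is a slightly safer variant of fillna(0) when the data frame has other columns where a zero would be misleading.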

Write XML Sitemaps

So far, we have seen how to group URLs in pandas data frames using different criteria, but how do we convert these URLs back into XML sitemaps? Quite easy!

There is always a hard way to do things, and when it comes to creating XML sitemaps, that would be to use BeautifulSoup, lxml, or similar libraries to build the XML tree from scratch.

A simpler approach is to use a templating language like those used to build web apps. In our case, we will use a popular template language called Jinja2.

There are three parts right here:

  • The template with a for loop to iterate over a context object called pages. Each item must be a Python tuple, where the first element is the URL, and the second is the last modification timestamp.
  • Our original pandas data frame has one index (the URL) and one column (the timestamp). We can call pandas itertuples(), which will create a sequence that renders nicely as an XML sitemap.
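The pieces described above can be sketched as follows. The template text, the lastmod column name, and the example.com URLs are assumptions for illustration rather than the exact template used for this post.

```python
import pandas as pd
from jinja2 import Template

# Sitemap template: loops over `pages`, a sequence of (url, lastmod) tuples.
SITEMAP_TEMPLATE = Template("""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
{%- for page in pages %}
  <url>
    <loc>{{ page[0]|e }}</loc>
    <lastmod>{{ page[1] }}</lastmod>
  </url>
{%- endfor %}
</urlset>
""")

# Hypothetical data frame: URL as the index, timestamp as the only column.
df = pd.DataFrame(
    {"lastmod": ["2019-02-01T10:00:00+00:00", "2019-02-02T11:30:00+00:00"]},
    index=["https://www.example.com/a/", "https://www.example.com/b/?x=1&y=2"],
)

# name=None makes itertuples() yield plain (index, lastmod) tuples.
sitemap_xml = SITEMAP_TEMPLATE.render(pages=df.itertuples(name=None))
print(sitemap_xml)
```

The e filter escapes characters such as & in URLs so the result stays valid XML. Writing sitemap_xml to a file per group of URLs then produces one sitemap per criterion.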

This is at least 10 times simpler than building the sitemaps from scratch!

Resources to Learn More

As usual, this is just a sample of the cool stuff you can do when you add Python scripting to your day-to-day SEO work. Here are some links to explore further.

Image Credits

Screenshots taken by author, February 2019
Word cloud plot generated by author, February 2019
