An Introduction to Python for SEO Pros Using Spreadsheets


2019 far exceeded my expectations in terms of Python adoption within the SEO community.

As we start a new year and I hear more SEO professionals wanting to join in the fun, but frustrated by the initial learning curve, I decided to write this introductory piece with the goal of getting more people involved and contributing.

Most SEO work involves working with spreadsheets, which you have to redo manually when working with multiple brands or repeating the same analysis over time.

When you implement the same workflow in Python, you can trivially reproduce the work or even automate the whole workflow.

We are going to learn Python basics while studying code John Mueller recently shared on Twitter that populates Google Sheets. We will modify his code to add a simple visualization.

Setting up the Python Environment

Similar to working with Excel or Google Sheets, you have two main options when working with Python.

You can install and run Python on your local computer, or you can run it in the cloud using Google Colab or Jupyter notebooks.

Let’s review both.

Working with Python on Your Local Computer

I often choose to work on my Mac when there is software that won’t run in the cloud, for example, when I need to automate a web browser.

You need to download three software packages:

  • Anaconda.
  • Visual Studio Code.
  • The Python bindings for Code.

Go to https://www.anaconda.com/distribution/ to download and install Python 3.7 for your operating system. Anaconda includes Python and most of the libraries that you need for data analysis.

This will take a while to complete.

Once done, search for the Anaconda Navigator and launch it.

Click to launch JupyterLab and it should open a new tab in your browser with a JupyterLab session.

Click on the big icon to start a Python 3 notebook and you are ready to start typing or copy/pasting code snippets.

You can think of this notebook as similar to a new Excel sheet.

The next step is optional.

Go to https://code.visualstudio.com/download, then download and install Visual Studio Code for your computer.

I personally use Visual Studio Code when I need to write code in both Python and JavaScript, or when writing JavaScript code. You can also use it if you want to convert your notebook code into a command-line script.

It is easier to prototype in Jupyter notebooks, and once you get everything to work, you can use Visual Studio Code to put everything together in a script or app that others can use from the command line.

Make sure to install the Python extension for Visual Studio Code. You can find it in the Visual Studio Code Marketplace.

Visual Studio Code has built-in support for Jupyter Notebooks.

You can create one by typing the keyboard combination Command+Shift+P and selecting the option “Python Jupyter Notebook”.

Working with Python in the Cloud

I do most of my Python work on Google Colab notebooks, so this is my preferred option.

Go to https://colab.research.google.com/ and you can skip the download and installation steps.

Click on the option to start a new Python 3 notebook and you will have the equivalent of a new Google Sheet.

Learning the Basics of Python & Pandas

Mueller shared a Colab notebook that pulls data from Wikipedia and populates a Google Sheet with that data.

Professional programmers need to learn the ins and outs of a programming language, and that can take a lot of time and effort.

For SEO practitioners, I think a simpler approach that involves studying and adapting existing code could work better. Please share your feedback if you try this and see if I’m right.

We are going to cover most of the same basics you learn in typical Python programming tutorials, but with a practical context in mind.

Let’s start by saving Mueller’s notebook to your Google Drive.

After you click on the link, select File > Save a copy in Drive.

Here is the example Google Sheet with the output of the notebook.

Overall Workflow

Mueller wants to get topic ideas that perform better on mobile compared to desktop.

He learned that celebrity, entertainment, and medical content does best on mobile.

Let’s read through the code and comments to get a high-level overview of how he figured this out.

We have several pieces to the puzzle.

  1. An empty Google Sheet with 6 prefilled columns and 7 columns that need to be filled in.
  2. The empty Google Sheet includes a pivot table in a separate tab that shows mobile views represent 70.59% of all views in Wikipedia.
  3. The notebook code populates the 7 missing columns, mostly in pairs, by calling a helper function called update_spreadsheet_rows.
  4. The helper function receives the names of the columns to update and a function to call that will return the values for the columns.
  5. After all the columns are populated, we get a final Google Sheet that includes an updated pivot table with a breakdown of the topic.

Python Building Blocks

Let’s learn some common Python building blocks while we review how Mueller’s code retrieves values to populate a couple of fields: the PageId and Description.

# Get the Wikipedia page ID -- needed for a bunch of things. Uses "Article" column

def get_PageId(title):

# Get page description from Wikipedia

def get_description(pageId):

We have two Python functions to retrieve the fields. Python functions are like functions in Google Sheets, but you define their behavior in any way you want. They take input, process it, and return an output.

Here is the PageId we get when we call get_PageId("Avengers: Endgame")

'44254295'

Here is the Description we get when we call get_description(pageId)

'2019 superhero film produced by Marvel Studios'

Anything after the # symbol is considered a Python comment and is ignored. You use comments to document the intention of the code.

Let’s step through the get_PageId function, line by line, to learn how it gets the ID of the article whose title we pass in.

# call the Wikipedia API to get the PageId of the article with the given title.

  q = {"action": "query", "format": "json", "prop": "info", "titles": title}

q is a Python dictionary. It holds key-value pairs. If you look up the value of “action”, you get “query”, and so on. For example, you would perform such a lookup using q["action"].

“action” is a Python string. It represents textual information.

"titles": title maps the “titles” key to the Python variable title that we passed as input to the function. All keys and values are hardcoded and explicit, except for the last one. This is what the dictionary looks like when we execute this function.

  q = {"action": "query", "format": "json", "prop": "info", "titles": "Avengers: Endgame"}
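As a minimal sketch of the dictionary ideas above, the snippet below builds the same q dictionary and looks up values by key. The title is hardcoded here only for illustration; in Mueller’s function it arrives as the title parameter.

```python
# Hardcoded sample input (an assumption for this sketch); in the notebook
# the title comes from the "Article" column of the sheet.
title = "Avengers: Endgame"

# Every key and value is a fixed string except "titles", which maps to a variable.
q = {"action": "query", "format": "json", "prop": "info", "titles": title}

# Look up values by key, the same way you would with q["action"].
action_value = q["action"]
titles_value = q["titles"]

print(action_value)    # query
print(titles_value)    # Avengers: Endgame
print(list(q.keys()))  # the four keys, in insertion order
```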

In the next line we have.

  url = "https://en.wikipedia.org/w/api.php?" + urllib.parse.urlencode(q)

Here we have a Python module function, urllib.parse.urlencode. Module functions are like Google Sheets functions that provide standard functionality.

Before we call module or library functions, we need to import the module that contains them.

This line at the top of the notebook does that.

import urllib.parse

Let’s clarify the call and see the output we get.

urllib.parse.urlencode({"action": "query", "format": "json", "prop": "info", "titles": "Avengers: Endgame"})

You can find detailed documentation on the urlencode module function in the official Python documentation. Its job is to convert a dictionary of URL parameters into a query string. A query string is the part of the URL after the question mark.

This is the output we get after we run it.

"action=query&format=json&prop=info&titles=Avengers%3A+Endgame"

This is what our URL definition line looks like after we add the result of urlencode.

  url = "https://en.wikipedia.org/w/api.php?" + "action=query&format=json&prop=info&titles=Avengers%3A+Endgame"

The + sign here concatenates the strings to form one.

url = "https://en.wikipedia.org/w/api.php?action=query&format=json&prop=info&titles=Avengers%3A+Endgame"

This resulting string is the API request the notebook sends to Wikipedia.
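The URL-building steps above can be put together in one short sketch. The helper name build_api_url is hypothetical and is not part of Mueller’s notebook; it only wraps the dictionary, urlencode, and concatenation steps so you can try them with other titles.

```python
import urllib.parse


def build_api_url(title):
    """Build the Wikipedia API URL for a page-info query (hypothetical helper)."""
    q = {"action": "query", "format": "json", "prop": "info", "titles": title}
    # urlencode turns the dictionary into a query string, escaping ":" and spaces.
    query_string = urllib.parse.urlencode(q)
    # The + operator joins the base URL and the query string into one string.
    return "https://en.wikipedia.org/w/api.php?" + query_string


url = build_api_url("Avengers: Endgame")
print(url)
```

Changing the argument to another article title is a quick way to see how urlencode escapes different characters.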

In the next line of code, we open the dynamically generated URL.

  response = requests.get(url)

requests.get is a Python third-party module function. You need to install third-party libraries using the Python tool pip.

!pip install --upgrade -q requests

You can run command-line scripts and tools from a notebook by prepending them with !

The code after ! is not Python code. It is Unix shell code. This article provides a comprehensive list of the most common shell commands.

After you install the third-party module, you need to import it like you do with standard libraries.

import requests

Here is what the translated call looks like.

  response = requests.get("https://en.wikipedia.org/w/api.php?action=query&format=json&prop=info&titles=Avengers%3A+Endgame")

You can open this request in the browser and see the API response from Wikipedia. The function call allows us to do this without manually opening a web browser.

The result of the requests.get call gets stored in the Python variable response.

This is what the result looks like.

{"batchcomplete": "",
 "query": {"pages": {"44254295": {"contentmodel": "wikitext",
                                  "lastrevid": 933501003,
                                  "length": 177114,
                                  "ns": 0,
                                  "pageid": 44254295,
                                  "pagelanguage": "en",
                                  "pagelanguagedir": "ltr",
                                  "pagelanguagehtmlcode": "en",
                                  "title": "Avengers: Endgame",
                                  "touched": "2020-01-03T17:13:02Z"}}}}

You can think of this complex data structure as a dictionary where some values contain other dictionaries, and so on.

The next line of code slices and dices this data structure to extract the PageId.

  result = list(response.json()["query"]["pages"].keys())[0]

Let’s step through it to see how it gets it.

response.json()["query"]

When we look up the value for the key “query”, we get a smaller dictionary.

{"pages": {"44254295": {"contentmodel": "wikitext",
                        "lastrevid": 933501003,
                        "length": 177114,
                        "ns": 0,
                        "pageid": 44254295,
                        "pagelanguage": "en",
                        "pagelanguagedir": "ltr",
                        "pagelanguagehtmlcode": "en",
                        "title": "Avengers: Endgame",
                        "touched": "2020-01-03T17:13:02Z"}}}

Then, we look up the value of “pages” in this smaller dictionary.

response.json()["query"]["pages"]

We get an even smaller one. We are drilling down into the big response data structure.

{"44254295": {"contentmodel": "wikitext",
              "lastrevid": 933501003,
              "length": 177114,
              "ns": 0,
              "pageid": 44254295,
              "pagelanguage": "en",
              "pagelanguagedir": "ltr",
              "pagelanguagehtmlcode": "en",
              "title": "Avengers: Endgame",
              "touched": "2020-01-03T17:13:02Z"}}

The PageId is available in two places in this slice of the data structure: as the only key, or as a value in the nested dictionary.

John made the most sensible choice, which is to use the key to avoid further exploration.

response.json()["query"]["pages"].keys()

The response from this call is a Python dictionary view of the keys. You can learn more about dictionary views in the Python documentation.

dict_keys(["44254295"])

We have what we are looking for, but not in the right format.

In the next step, we convert the dictionary view into a Python list.

list(response.json()["query"]["pages"].keys())

This is what the conversion looks like.

["44254295"]

Python lists are like rows in a Google Sheet. They generally contain multiple values separated by commas, but in this case, there is only one.

Finally, we extract the only element that we care about from the list: the first one.

list(response.json()["query"]["pages"].keys())[0]

The first element in Python lists starts at index 0.

Here is the final result.

"44254295"

As this is an identifier, it is better to keep it as a string, but if we needed a number to perform arithmetic operations, we would do another transformation.

int(list(response.json()["query"]["pages"].keys())[0])

In this case, we get a Python integer.

44254295
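To try the drill-down without calling the API, here is a minimal sketch that works on a trimmed, hardcoded copy of the response shown earlier. The variable names are my own, and the nested lookup at the end is only the alternative route mentioned above, not what the notebook does.

```python
# Trimmed, hardcoded copy of the API response (no network call is made).
data = {
    "batchcomplete": "",
    "query": {"pages": {"44254295": {"pageid": 44254295,
                                     "title": "Avengers: Endgame"}}},
}

pages = data["query"]["pages"]   # drill down two levels
keys_view = pages.keys()         # dictionary view of the keys
keys_list = list(keys_view)      # convert the view into a list
page_id = keys_list[0]           # first (and only) element, at index 0
page_id_number = int(page_id)    # optional: convert the string to an integer

# Alternative route: read the ID from the nested dictionary instead of the key.
nested_id = pages[page_id]["pageid"]

print(page_id, page_id_number, nested_id)
```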

The main differences between strings and integers are the types of operations you can perform with them. As you saw before, we can use the + operator to concatenate two strings, but if we used the same operator on two numbers, it would add them together.

 "44254295" + "3" = "442542953"

44254295 + 3 = 44254298
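The two lines above are shorthand rather than runnable Python. A small runnable sketch of the same idea follows, including what happens if you mix the two types by accident.

```python
page_id = "44254295"

joined = page_id + "3"     # string + string concatenates
total = int(page_id) + 3   # integer + integer adds

print(joined)  # 442542953
print(total)   # 44254298

# Mixing a string and an integer with + is an error in Python.
try:
    page_id + 3
except TypeError as error:
    print("TypeError:", error)
```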

As a side note, I should mention jq, a cool command-line tool that allows you to slice and dice JSON responses directly from curl calls (another awesome command-line tool). curl allows you to do the equivalent of what we are doing with the requests module here, but with limitations.

So far we’ve learned how to create functions and data types that allow us to extract and filter data from third-party sites (Wikipedia in our case).

Let’s call the next function in John’s notebook to learn another important building block: flow control structures.

get_description("44254295")

This is what the API URL looks like. You can try it in the browser.

"https://en.wikipedia.org/w/api.php?action=query&format=json&prop=pageterms&pageids=44254295"

Here is what the response looks like.

{"ns": 0,
 "pageid": 44254295,
 "terms": {"alias": ["Avengers Endgame", "Avengers End Game", "Avengers 4"],
           "description": ["2019 superhero film produced by Marvel Studios"],
           "label": ["Avengers: Endgame"]},
 "title": "Avengers: Endgame"}

This is the code we will step through to understand control flow in Python.

  # some pages don't have descriptions, so we can't blindly grab the value

  if "terms" in rs and "description" in rs["terms"]:

    result = rs["terms"]["description"][0]

  else:

    result = ""

  return result

This part checks if the response structure (above) includes a key named “terms”. It uses the Python if ... else control flow statement. Control flow statements are the algorithmic building blocks of programs in most languages, including Python.

if "terms" in rs

If this check is successful, we look up the value of that key with rs["terms"].

We expect the result to be another dictionary and check it to see if there is a key named “description”.

"description" in rs["terms"]

If both checks are successful, then we extract and store the description value.

result = rs["terms"]["description"][0]

We expect the final value to be a Python list, and we only want the first element, as we did before.

The Python logical operator and combines both checks into one, where both need to be true for the whole condition to be true.

If the check is false, the description is an empty string.

 result = ""
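Here is a self-contained sketch of the same check that you can run without calling Wikipedia. The function name extract_description and the second sample dictionary are my own assumptions for illustration; the first sample is copied from the response shown above.

```python
def extract_description(rs):
    """Return the first description in rs, or "" when there is none (hypothetical helper)."""
    # Both checks must be true before we look inside rs["terms"]["description"].
    if "terms" in rs and "description" in rs["terms"]:
        result = rs["terms"]["description"][0]
    else:
        result = ""
    return result


# Sample copied from the API response shown earlier.
with_description = {
    "ns": 0,
    "pageid": 44254295,
    "terms": {"description": ["2019 superhero film produced by Marvel Studios"]},
    "title": "Avengers: Endgame",
}

# Made-up sample of a page that has no "terms" key at all.
without_description = {"ns": 0, "pageid": 1, "title": "Example page"}

print(extract_description(with_description))
print(repr(extract_description(without_description)))
```

Because and stops at the first false check, the second lookup never runs on a page without “terms”, which is what keeps the code from failing.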

Populating Google Sheets from Python

With a solid understanding of Python’s basic building blocks, we can now focus on the most exciting part of Mueller’s notebook: automatically populating Google Sheets with the values we are pulling from Wikipedia.

# helper function to update all rows in the spreadsheet with a function

def update_spreadsheet_rows(fieldName, parameterName, functionToCall, forceUpdate=False):

  # Go through spreadsheet, update column 'fieldName' with the data calculated 

  # by 'functionToCall(parameterName)'. Show a progressbar while doing so.

  # Only calculate / update rows without values there, unless forceUpdate=True.

Let’s step through some interesting parts of this function.

The functionality to update Google Sheets is covered by a third-party module.

We need to install it and import it before we can use it.

!pip install --upgrade -q gspread

import gspread

Mueller chose to convert the sheets into a pandas data frame. As he mentions in the comments, it was not necessary, but we can take the opportunity to learn a little bit of pandas too.

update_spreadsheet_rows("PageId", "Article", get_PageId)

At the end of every helper function that fills a column, we have a call like the one above.

We are passing the relevant columns and the function that will get the corresponding values.

When you pass the name of a function without parentheses or arguments in Python, you are not passing data but code for the receiving function to execute. This is not something that, as far as I know, you can do in a spreadsheet.
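A small sketch of passing a function as an argument follows. The names apply_to_rows and title_length are hypothetical and much simpler than update_spreadsheet_rows; the point is only that the function name is passed without parentheses and called later by the receiver.

```python
def apply_to_rows(rows, function_to_call):
    """Call function_to_call on every row and collect the results (hypothetical helper)."""
    return [function_to_call(row) for row in rows]


def title_length(title):
    """Return the number of characters in a title."""
    return len(title)


titles = ["Avengers: Endgame", "Python"]

# Pass the function itself (no parentheses), not the result of calling it.
lengths = apply_to_rows(titles, title_length)
shouting = apply_to_rows(titles, str.upper)  # built-in functions work too

print(lengths)   # [17, 6]
print(shouting)
```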

  columnNr = df.columns.get_loc(fieldName) + 1 # column number of output field

The first thing we want to know is which column we need to update. When we run the code above we get 7, which is the column position of the PageId in the sheet (starting with 1).

  for index, row in df.iterrows():

In this line of code, we have another control flow statement, the Python for loop. For loops allow you to iterate over elements that represent collections, for example, lists and dictionaries.

In our case above, we are iterating over the rows of the data frame, much like iterating over the items of a dictionary: the index variable holds the row label (the key), and the row variable holds the values of that row.

To be more precise, df.iterrows() returns a Python generator, which produces each (index, row) pair on demand instead of building the whole collection up front.

<generator object DataFrame.iterrows at 0x7faddb99f728>

When you print the result of df.iterrows(), you don’t actually get the values, but a Python generator object.

Generators are iterators that access data on demand, so they generally require less memory than building the full collection in advance.

INDEX:

2

ROW:

Article                                     César Alonso de las Heras

URL                 https://en.wikipedia.org/wiki/César_Alonso_de_...

Views                                                       1,944,569

PartMobile                                                     79.06%

ViewsMobile                                                 1,537,376

ViewsDesktop                                                  407,193

PageId                                                       18247033

Description                                                          

WikiInLinks                                                          

WikiOutLinks                                                         

ExtOutLinks                                                          

WikidataId                                                           

WikidataInstance                                                     

Name: 2, dtype: object


This is an example iteration of the for loop. I printed the index and row values.
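If you want to experiment with the loop outside the notebook, here is a minimal sketch on a tiny data frame. It assumes pandas is installed, and the two sample rows and the rows_needing_update list are made up for illustration; Mueller’s sheet has more columns and rows.

```python
import pandas as pd

# Two made-up rows standing in for the Google Sheet contents.
df = pd.DataFrame.from_records([
    {"Article": "Avengers: Endgame", "PageId": "44254295"},
    {"Article": "César Alonso de las Heras", "PageId": ""},
])

# Position of the output column, counted from 1 as in a spreadsheet.
column_nr = df.columns.get_loc("PageId") + 1

rows_needing_update = []
for index, row in df.iterrows():
    # index is the row label; row holds the values of that row by column name.
    print("INDEX:", index)
    print("ROW:", row["Article"], "->", repr(row["PageId"]))
    if not row["PageId"]:
        rows_needing_update.append(index)

print(column_nr, rows_needing_update)
```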

# if we already did it, don't recalculate unless 'forceUpdate' is set.

    if forceUpdate or not row[fieldName]: 

      result = functionToCall(row[parameterName])

forceUpdate is a Python boolean value which defaults to False. Booleans can only be true or false.

row["PageId"] is empty initially, so not row["PageId"] is true and the next line will execute. The or operator allows the next line to execute on subsequent runs only when the flag forceUpdate is true.
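The condition can be tried on its own with a short sketch. The function name needs_update is hypothetical; it simply isolates the forceUpdate or not row[fieldName] test so you can see the three cases.

```python
def needs_update(current_value, force_update=False):
    """Mirror the notebook's check: update when forced or when the cell is empty (hypothetical helper)."""
    # An empty string is "falsy" in Python, so `not ""` is True.
    return bool(force_update or not current_value)


first_run = needs_update("")                               # empty cell
second_run = needs_update("44254295")                      # already filled
forced_run = needs_update("44254295", force_update=True)   # filled, but forced

print(first_run, second_run, forced_run)  # True False True
```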

      result = get_PageId(row["Article"])

This is the code that calls our custom function to get the page IDs.

The result value for the example iteration is 39728003

When you review the function carefully, you will notice that we use df, which is not defined in the function. The code that defines it is at the beginning of the notebook.

# Convert to a DataFrame and render. 

# (A DataFrame is overkill, but I wanted to play with them more :))

import pandas as pd

df = pd.DataFrame.from_records(worksheetRows)

The code uses the third-party module pandas to create a data frame from the Google Sheet rows. I recommend reading the 10 minutes to pandas article to get familiar with it. It is a very powerful data manipulation library.

Finally, let’s see how we update the Google Sheet.

      row[fieldName] = result # save locally

      worksheet.update_cell(index+1, columnNr, result) # update sheet too

This code can be translated to.

      row["PageId"] = 39728003 # save locally

      worksheet.update_cell(3+1, 7, 39728003) # update sheet too

This is the code that updates the Google Sheet. The variable worksheet is also not defined in the update_spreadsheet_rows function, but you can find it at the beginning of the notebook.

# Authenticate (copy & paste key as detailed), and read spreadsheet

# (This is always confusing, but it works)

from google.colab import auth

auth.authenticate_user()

import gspread

from oauth2client.client import GoogleCredentials

gc = gspread.authorize(GoogleCredentials.get_application_default())

# get all data from the spreadsheet

worksheet = gc.open(spreadsheetName).sheet1

worksheetRows = worksheet.get_all_values()

I left this code for last because it is the last piece we need to trace and it is also more complicated than the previous code. However, it is the first thing you need to execute in the notebook.

First, we import the third-party module gspread, and complete an OAuth authentication in Chrome to get access to Google Sheets.

# get all data from the spreadsheet

worksheet = gc.open("Wikipedia-Views-2019").sheet1

worksheetRows = worksheet.get_all_values()

We manipulate the Google Sheet with the worksheet variable and we use the worksheetRows variable to create the pandas DataFrame.

Visualizing from Python

Now we get to your homework.

I wrote code to partially reproduce John’s pivot table and plot a simple bar chart.

Your job is to add this code to your copy of the notebook and add print(variable_name) statements to understand what I’m doing. This is how I analyzed John’s code.

Here is the code.

# Visualize from Python

import numpy as np

df.groupby("WikidataInstance").agg({"ViewsMobile": np.sum, "ViewsDesktop": np.sum})

# the aggregation doesn't work because the numbers include commas

# This gives an error ValueError: Unable to parse string "1,038,950,248" at position 0
#pd.to_numeric(df["ViewsMobile"])

# StackOverflow is your friend :)

#https://stackoverflow.com/questions/22137723/convert-number-strings-with-commas-in-pandas-dataframe-to-float
import locale

from locale import atoi
locale.setlocale(locale.LC_NUMERIC, '')

#df[["ViewsMobile", "ViewsDesktop"]].applymap(atoi)

df["ViewsMobile"] = df["ViewsMobile"].apply(atoi)

df["ViewsDesktop"] = df["ViewsDesktop"].apply(atoi)

# We try again and it works
totals_df = df.groupby("WikidataInstance").agg({"ViewsMobile": np.sum, "ViewsDesktop": np.sum})

totals_df

# Here we plot
totals_df.head(20).plot(kind="bar")
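As a possible alternative to the locale-based conversion, here is a sketch that strips the commas with pandas string methods, which does not depend on the locale settings of the machine. It assumes pandas is installed, and the three sample rows are made up so the snippet runs on its own; it is not the approach used in the code above.

```python
import pandas as pd

# Made-up sample rows; in the notebook, df comes from the Google Sheet.
df = pd.DataFrame({
    "WikidataInstance": ["film", "film", "human"],
    "ViewsMobile": ["1,537,376", "2,000", "10"],
    "ViewsDesktop": ["407,193", "1,000", "5"],
})

# Remove the thousands separators, then convert the text to integers.
for column in ["ViewsMobile", "ViewsDesktop"]:
    df[column] = df[column].str.replace(",", "", regex=False).astype(int)

# Sum mobile and desktop views per topic, like the pivot table does.
totals_df = df.groupby("WikidataInstance")[["ViewsMobile", "ViewsDesktop"]].sum()
print(totals_df)
```

Calling totals_df.plot(kind="bar") on the result draws the same kind of bar chart, provided matplotlib is available.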

If you got this far and want to learn more, I recommend you follow the links I included in the article and practice the code snippets in this guide.

At the end of most of my columns, I share interesting Python projects from the SEO community. Please consider checking out the ones that interest you and studying them as we did here.

But, even better, see how you might be able to add something simple but valuable that you could share back!


Image Credits

Screenshot taken by author, January 2020


