Python Scraping – How to get S&P 500 companies from Wikipedia

Scraping is a great way to extract large amounts of data from the web with no manual intervention at all. For a financial analyst or investor, it can save hours of time-consuming manual data collection. In this article, I will show you how to scrape Wikipedia with Python to collect information from the S&P 500 index.

We will build a script, using Python and Beautiful Soup, to get a list of S&P 500 company tickers for any given industry. If you are not familiar with scraping or Beautiful Soup, I advise you to have a look at one of my earlier articles, where I explain the basics of scraping a website with Python.

Photo by Nicolas Picard on Unsplash

A Python Scraper for Wikipedia

In this post, we will build a script to extract a list of tickers for the companies in the S&P 500 index. If we have a look at the Wikipedia page containing the list of S&P 500 tickers, we see that the information we want is included in a table:

Wikipedia Table – S&P 500 Companies

Each row (except the first one, which is the header) contains information for an individual company. Therefore, we will need to scrape this table to get the ticker symbol, in the 1st column, and the GICS Sub-Industry, in the 5th column, for each row.

First of all, we need to have a look at the source code of the page. To do that, we simply right-click on the Wikipedia page and select the option View Page Source. Then, we are able to see the HTML code of the page. The table can be found under a table tag with wikitable sortable as its class:

<table class="wikitable sortable" id="constituents">
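Note that the table also carries the id constituents, so once the page has been parsed with Beautiful Soup (as in the next code block), an equivalent and arguably more robust selector targets the id instead of the class; a minimal sketch:

# Alternative: select the constituents table by its id rather than its class
table = soup.find('table', {'id': 'constituents'})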

Great, we have located our target object: in this case, a table. Now we are ready to start building our code to scrape the table and extract the required data:

import bs4 as bs
import requests
import pandas as pd
 
resp = requests.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
soup = bs.BeautifulSoup(resp.text, 'lxml')
table = soup.find('table', {'class': 'wikitable sortable'})
print(table)

#Response:
<table class="wikitable sortable" id="constituents">
<tbody><tr>
<th><a href="/wiki/Ticker_symbol" title="Ticker symbol">Ticker Symbol</a>
</th>
<th>Security</th>
<th><a href="/wiki/SEC_filing" title="SEC filing">SEC filings</a></th>
<th><a href="/wiki/Global_Industry_Classification_Standard" title="Global Industry Classification Standard">GICS</a> Sector</th>
<th>GICS Sub Industry</th>
<th>Headquarters Location</th>
<th>Date first added</th>
<th><a href="/wiki/Central_Index_Key" title="Central Index Key">CIK</a></th>
<th>Founded
</th></tr>
<tr>
<td><a class="external text" href="https://www.nyse.com/quote/XNYS:MMM" rel="nofollow">MMM</a>
</td>..
...
...

If the above code is confusing, I recommend having a look at one of my previous posts explaining the basics of web scraping with Python.

The code above first makes an HTTP request to fetch the page. Then, it parses the response with Beautiful Soup, which builds a parse tree that lets us extract data from the HTML.
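One defensive addition worth considering is checking that the request actually succeeded before parsing. A minimal sketch, reusing the imports above and requests' built-in helper:

resp = requests.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
resp.raise_for_status()  # raise an exception on a 4xx/5xx response instead of parsing an error page
soup = bs.BeautifulSoup(resp.text, 'lxml')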

The variable table now contains the HTML of the Wikipedia table. Next, we can extract the actual text from it using Beautiful Soup methods such as findAll.

If we print table.findAll('tr'), we get a list containing each of the tr (i.e. table row) elements in the Wikipedia table (see below).

print(table.findAll('tr')[1:])

#outcome:
[<tr>
<td><a class="external text" href="https://www.nyse.com/quote/XNYS:MMM" rel="nofollow">MMM</a>
</td>
<td><a href="/wiki/3M" title="3M">3M Company</a></td>
<td><a class="external text" href="https://www.sec.gov/cgi-bin/browse-edgar?CIK=MMM&amp;action=getcompany" rel="nofollow">reports</a></td>
<td>Industrials</td>
<td>Industrial Conglomerates</td>
<td><a class="mw-redirect" href="/wiki/St._Paul,_Minnesota" title="St. Paul, Minnesota">St. Paul, Minnesota</a></td>
<td></td>
<td>0000066740</td>
<td>1902
</td></tr>, <tr>
<td><a ...,
...,
..]

We are interested in all rows of the table except for the first one, which is why we slice with [1:].

Looping through the table rows

Now, we can simply loop through the list; each iteration of the loop handles a different row. Let's create two empty lists and append the ticker and industry of each row to them. Remember that each row (tr) represents a different company:

tickers = []
industries = []
for row in table.findAll('tr')[1:]:
    # first td (index 0) holds the ticker symbol
    ticker = row.findAll('td')[0].text
    # fifth td (index 4) holds the GICS Sub-Industry
    industry = row.findAll('td')[4].text

    tickers.append(ticker)
    industries.append(industry)
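
As a small optional refinement, calling findAll('td') once per row avoids scanning each row twice; an equivalent sketch:

for row in table.findAll('tr')[1:]:
    cells = row.findAll('td')
    tickers.append(cells[0].text)      # ticker symbol, first column
    industries.append(cells[4].text)   # GICS Sub-Industry, fifth column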

Looking again at the output returned by the table.findAll('tr') method above, we can see that the ticker of each company is contained under the first td tag of its row. For the first company, it has the value MMM.


If we use the findAll method again, this time to find all td elements within a row, the ticker will be in the first position. That is why, in each loop iteration (i.e. each row of the table), we extract the first element ([0]) and read its text via .text.

We can do the same for the industry. Since the industry sits in the fifth column and Python indexing starts at zero, we specify [4] after findAll('td').

Now, after the loop has completed, we have all tickers and industries stored in the tickers and industries lists:

print(tickers)        

#Outcome:
['MMM\n', 'ABT\n', 'ABBV\n', 'ABMD\n', 'ACN\n', 'ATVI\n', 'ADBE\n', 'AMD\n', ....]

We can get rid of the \n at the end of each ticker by using map, lambda and strip:

tickers = list(map(lambda s: s.strip(), tickers))
industries = list(map(lambda s: s.strip(), industries))

print(tickers)

#outcome:
['MMM',
 'ABT',
 'ABBV',
 'ABMD',
 'ACN',
 'ATVI',
 'ADBE',...
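
An equivalent and arguably more Pythonic alternative to map and lambda is a list comprehension:

tickers = [s.strip() for s in tickers]
industries = [s.strip() for s in industries]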

Creating a Pandas DataFrame from the S&P 500 List

Now that we have our lists of tickers and industries, we create a Pandas DataFrame with two columns containing the ticker and the industry. We do that so we can filter the data, for example extracting the tickers of all pharmaceutical companies.

First, we create two Pandas DataFrames, one with the S&P 500 tickers and one with the industries. Then, we combine them side by side using the concat method:

tickerdf = pd.DataFrame(tickers,columns=['ticker'])
sectordf = pd.DataFrame(industries,columns=['industry'])

tickerandsector = pd.concat([tickerdf, sectordf], axis=1)
print(tickerandsector)

And just like that, we have a nice Pandas DataFrame with the ticker and industry for each of the S&P 500 companies. You can add more information to it by tweaking the code above.

S&P 500 Tickers and Sectors – Python
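
As an aside, the two-step concat can be collapsed into a single constructor call by building the DataFrame directly from a dictionary; an equivalent sketch:

# Build the two-column DataFrame in one step from the two lists
tickerandsector = pd.DataFrame({'ticker': tickers, 'industry': industries})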

Filtering tickers for a particular S&P 500 sector

Now, with basic Pandas operations, we can filter the DataFrame and select any industry, as shown below. Then, we can convert the filtered ticker column into a list:

filtersector = tickerandsector.loc[tickerandsector['industry'] == 'Pharmaceuticals']

listoftickers = filtersector['ticker'].tolist()
print(listoftickers)

#Outcome:
['ABBV', 'AGN', 'JNJ', 'LLY', 'MRK', 'MYL', 'PRGO', 'PFE', 'ZTS']
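
If you are not sure which industry names are available to filter on, the unique values of the industry column give you the full set; a quick sketch:

# List all distinct GICS Sub-Industry values present in the table
print(sorted(tickerandsector['industry'].unique()))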

Wrapping Up

What a great tool we have created: we are now able to scrape Wikipedia to extract the ticker and industry of every S&P 500 company.

Web scraping is super useful, and you can reuse this code to scrape other information from the web. Although scraping is generally not illegal, I advise you to have a look at the target website's policies to make sure scraping does not violate its terms. Also, try to limit the number of HTTP requests per second sent to the page in order to avoid overloading the target server.

To conclude, I would like to share the full code below. I have included the code inside a function, save_sp500_tickers(), so you can call the function with the name of an industry and get back a list of tickers for that particular industry. Enjoy!

Recently, I made a video showing an easier way to extract the S&P 500 tickers into a Python list using Pandas. You can have a look at it below:

SP500 Tickers to Python List
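
For reference, one common shortcut along those lines uses pandas.read_html, which parses every table on a page into a DataFrame. A minimal sketch (taking the first column, which holds the tickers, rather than assuming a particular column name):

import pandas as pd

# read_html returns a list of DataFrames, one per table on the page;
# the constituents table is the first one
sp500 = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')[0]
tickers = sp500.iloc[:, 0].tolist()
print(tickers[:5])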

If you find the blog content on Python for Finance interesting, I would appreciate if you could share the blog in your social media channels.


import bs4 as bs
import requests
import pandas as pd

def save_sp500_tickers(selectedsector):
    # Fetch and parse the Wikipedia page listing the S&P 500 constituents
    resp = requests.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('table', {'class': 'wikitable sortable'})

    tickers = []
    industries = []

    # Skip the header row, then collect the ticker (1st td) and
    # GICS Sub-Industry (5th td) from every company row
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text
        industry = row.findAll('td')[4].text

        tickers.append(ticker)
        industries.append(industry)

    # Remove the trailing newline characters
    tickers = list(map(lambda s: s.strip(), tickers))
    industries = list(map(lambda s: s.strip(), industries))

    tickerdf = pd.DataFrame(tickers, columns=['ticker'])
    sectordf = pd.DataFrame(industries, columns=['industry'])

    # Combine the two columns and keep only the selected industry
    tickerandsector = pd.concat([tickerdf, sectordf], axis=1)
    filtersector = tickerandsector.loc[tickerandsector['industry'] == selectedsector]

    listoftickers = filtersector['ticker'].tolist()
    return listoftickers

print(save_sp500_tickers('Pharmaceuticals'))
