Python – SEC Edgar Scraping Financial Statements

This post on Python SEC Edgar Scraping Financial Statements is a bit different from the others on my blog. I simply want to share a script that scrapes financial statements from the SEC Edgar website. I explain how it works in a YouTube video rather than in a full write-up, since a line-by-line article would add little on top of the video.

There are great APIs, such as fmpcloud.io, that make it very easy to get high-quality financial data. Even so, some of you may find it useful to scrape financial statements directly from the SEC Edgar portal, especially to retrieve information not provided by any API, such as financial debt maturities, PP&E…

The code below uses the Python Edgar package and pandas to scrape the financial statements of whichever company is set in the selectedcompany variable.

In the example below, we scrape the 10-Q report for Facebook in order to get the Balance Sheet for the last two years into a Pandas DataFrame. For the code to work, there are a few things to be done:

- Install the Python Edgar library (pip install python-edgar) together with pandas and a parser that pd.read_html can use, such as lxml.
- Replace the path passed to edgar.download_index with a folder on your own machine.
- Run the script from the folder where the index files were downloaded (or adjust the path in pd.read_csv) so that 2019-QTR4.tsv can be found.

For further clarification on how the code works, refer to my YouTube video, Python SEC Edgar Scraping Financial Statements, where I explain it step by step.
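
Before running the full script, it can help to see what a downloaded index file actually contains. Here is a minimal sketch to peek at the first rows (assuming the index was downloaded to the working folder; the column layout in the comment is the one the python-edgar package is expected to produce):

# Print the first few lines of a quarterly index file.
# Each line is pipe-delimited, roughly:
# CIK|Company Name|Form Type|Date Filed|path/to/filing.txt|path/to/filing-index.html
with open('2019-QTR4.tsv') as f:
    for _ in range(3):
        print(f.readline().strip())

The script below filters these lines for the company name and report type, then splits on '|' to pick out the -index.html path.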

# Requires the Python Edgar library (pip install python-edgar)
import edgar
import pandas as pd

# Download the EDGAR quarterly index files since 2018 into a local folder;
# replace this path with a folder on your own machine
edgar.download_index('/Users/jos/Desktop/Desktop/Desktop', 2018, skip_all_present_except_last=False)

# Company and report type to look for in the index
selectedcompany = 'Facebook'
selectedreport = '10-Q'

# Each index line is pipe-delimited; reading with a tab separator loads the
# whole line into a single column that we can filter with str.contains
csv = pd.read_csv('2019-QTR4.tsv', sep='\t', lineterminator='\n', names=None)
csv.columns.values[0] = 'Item'

# Keep only the rows mentioning both the company and the report type
companyreport = csv[(csv['Item'].str.contains(selectedcompany)) & (csv['Item'].str.contains(selectedreport))]

# Split the first matching line into its pipe-delimited fields and keep the
# one pointing to the filing's -index.html page
Filing = companyreport['Item'].str.split('|')
Filing = Filing.to_list()

for item in Filing[0]:
    if 'html' in item:
        report = item

url = 'https://www.sec.gov/Archives/' + report
# e.g. https://www.sec.gov/ix?doc=/Archives/edgar/data/1652044/000165204419000032/goog10-qq32019.htm

# The filing's index page lists every document it contains; the first HTML
# table on that page is the document index
df = pd.read_html(url)
document_index = df[0]
document_index = document_index.dropna()

# Find the row describing the 10-Q itself and take its file name
document_name = document_index[document_index['Description'].str.contains(selectedreport)]
document_name = document_name['Document'].str.split(' ')
document_name = document_name[0][0]

# Turn the -index.html path into the filing's folder and append the document
# name to build the URL of the actual report
report_formatted = report.replace('-','').replace('index.html','')
url = 'https://www.sec.gov/Archives/' + report_formatted + '/' + document_name


# Read every table in the report and keep the one that looks like a balance
# sheet, i.e. whose first column mentions 'Retained' or 'Total Assets'
df = pd.read_html(url)

for item in df:
    BS = (item[0].str.contains('Retained') | item[0].str.contains('Total Assets'))
    if BS.any():
        Balance_Sheet = item

# Drop the title rows and keep the line-item column plus the two period columns
Balance_Sheet = Balance_Sheet.iloc[2:, [0, 2, 6]]

# Promote the first remaining row to column headers
header = Balance_Sheet.iloc[0]
Balance_Sheet = Balance_Sheet[1:]
Balance_Sheet.columns = header

Balance_Sheet.columns.values[0] = 'Item'
Balance_Sheet = Balance_Sheet[Balance_Sheet['Item'].notna()]

# Clean the numeric columns: '(' marks a negative value (the closing
# parenthesis sits in a column we already dropped), ',' is a thousands
# separator and the em dash stands for zero
Balance_Sheet[Balance_Sheet.columns[1:]] = Balance_Sheet[Balance_Sheet.columns[1:]].astype(str)

for col in Balance_Sheet.columns[1:]:
    Balance_Sheet[col] = Balance_Sheet[col].map(lambda x: x.replace('(', '-').replace(',', '').replace('—', '0'))

Balance_Sheet[Balance_Sheet.columns[1:]] = Balance_Sheet[Balance_Sheet.columns[1:]].astype(float)

# Display the final DataFrame (when running in a notebook)
Balance_Sheet
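
One caveat: SEC.gov may reject automated requests that do not identify themselves, returning HTTP 403 (a reader runs into exactly this in the comments below), in which case pd.read_html(url) fails before it ever sees a table. Here is a minimal sketch of fetching the page with the requests library and a declared User-Agent before handing the HTML to pandas; the name and email are placeholders you should replace with your own details:

import pandas as pd
import requests

# EDGAR asks automated clients to identify themselves in the User-Agent
# header; the contact string below is a placeholder
HEADERS = {'User-Agent': 'Sample Name sample@example.com'}

def read_tables(url):
    r = requests.get(url, headers=HEADERS)
    r.raise_for_status()  # surface 403/404 instead of a 'No tables found' error
    return pd.read_html(r.text)

Calling df = read_tables(url) in place of df = pd.read_html(url) keeps the rest of the script unchanged.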


2 thoughts on “Python – SEC Edgar Scraping Financial Statements”

  1. Thank you for your great contribution, Joey! I am already interested in going through your work to get more info on how to retrieve the data I want.
    Thanks a lot! You did a great job with that package!

  2. Hello. Thank you for this great piece of work. Brilliant. I also really liked your style of walking through the code line by line in the YouTube video. I learned some new techniques!

    I am somewhere between novice and intermediate, and I tried implementing your code line by line in the PyCharm IDE. I had to do a couple of tweaks and was able to successfully run the code.

    The issue I keep running into is that sometimes it works and sometimes I get an error, either a 403 Forbidden or a ValueError("No tables found"). I tried adding a function to make the URL request look like it comes from a browser, but that didn’t help. I also added a line to write the Balance Sheet to an Excel file using openpyxl.

    Any ideas on handling the error?

    Here is the code I used (lots of commented print statements as I checked the outputs):

    # This is a Python script to get EDGAR data

    import edgar
    import pandas as pd
    from pandas import DataFrame, Series
    from pandas.io.parsers import TextFileReader
    from typing import Any, Union  # needed for the type annotation on csv
    import requests
    from IPython.display import display

    # Requires installation of requests, html5lib, lxml, beautifulsoup4, IPython, openpyxl

    def main():
        new_index = "n"
        new_index = input("Download updated index? (y or n) ")
        if new_index == "y":
            edgar.download_index('.', 2018, skip_all_present_except_last=False)

        selectedcompany = 'Alphabet Inc.'
        selectedreport = '10-Q'

        csv: Union[Union[TextFileReader, Series, DataFrame, None], Any] = pd.read_csv('2019-QTR4.tsv', sep='\t', lineterminator='\n', names=None)
        #csv = pd.read_csv('2019-QTR4.tsv', names=None)

        csv.columns.values[0] = 'Item'
        #print(csv)
        companyreport = csv[(csv['Item'].str.contains(selectedcompany)) & (csv['Item'].str.contains(selectedreport))]

        #print('Printing the Company Report')
        #print(companyreport)

        Filing = companyreport['Item'].str.split('|')
        Filing = Filing.to_list()

        #print('Printing the Filing')
        #print(Filing)

        for item in Filing[0]:
            if 'html' in item:
                report = item

        url = 'https://www.sec.gov/Archives/' + report
        #print(url)

        # https://www.sec.gov/ix?doc=/Archives/edgar/data/1652044/000165204419000032/goog10-qq32019.htm
        df = pd.read_html(url)
        #print('printing the df')
        #print(df)

        document_index = df[0]
        #print(document_index)
        document_index = document_index.dropna()
        #print('printing dataframe after NaN')
        #print(document_index)

        document_name = document_index[document_index['Description'].str.contains(selectedreport)]

        document_name = document_name['Document'].str.split(' ')
        document_name = document_name[0][0]
        #print('Document name is ')
        #print(document_name)

        report_formatted = report.replace('-', '')
        report_formatted = report_formatted.replace('index.html', '')
        report_formatted = report_formatted.replace('\r', '')

        url = 'https://www.sec.gov/Archives/' + report_formatted + '/' + document_name
        print(url)
        #print("ok to final url")

        df = request_url(url)
        #df = pd.read_html(url)
        #display(df)
        #print('Successfully read the 10-Q')

        for item in df:
            BS = (item[0].str.contains('Retained') | item[0].str.contains('Total Assets'))
            # this was an or statement using |
            if BS.any():
                Balance_Sheet = item

        #print(Balance_Sheet)

        Balance_Sheet = Balance_Sheet.iloc[2:, [0, 2, 6]]
        print('Displaying the Balance Sheet')
        display(Balance_Sheet)
        print('Successfully read the Balance Sheet')

        header = Balance_Sheet.iloc[0]
        Balance_Sheet = Balance_Sheet[1:]

        Balance_Sheet.columns = header

        Balance_Sheet.columns.values[0] = 'Item'
        Balance_Sheet = Balance_Sheet[Balance_Sheet['Item'].notna()]

        Balance_Sheet[Balance_Sheet.columns[1:]] = Balance_Sheet[Balance_Sheet.columns[1:]].astype(str)
        Balance_Sheet[Balance_Sheet.columns[1]] = Balance_Sheet[Balance_Sheet.columns[1]].map(lambda x: x.replace('(', '-'))
        Balance_Sheet[Balance_Sheet.columns[2]] = Balance_Sheet[Balance_Sheet.columns[2]].map(lambda x: x.replace('(', '-'))

        Balance_Sheet[Balance_Sheet.columns[1]] = Balance_Sheet[Balance_Sheet.columns[1]].map(lambda x: x.replace(',', ''))
        Balance_Sheet[Balance_Sheet.columns[2]] = Balance_Sheet[Balance_Sheet.columns[2]].map(lambda x: x.replace(',', ''))

        Balance_Sheet[Balance_Sheet.columns[1]] = Balance_Sheet[Balance_Sheet.columns[1]].map(lambda x: x.replace('—', '0'))
        Balance_Sheet[Balance_Sheet.columns[2]] = Balance_Sheet[Balance_Sheet.columns[2]].map(lambda x: x.replace('—', '0'))

        Balance_Sheet[Balance_Sheet.columns[1:]] = Balance_Sheet[Balance_Sheet.columns[1:]].astype(float)

        #print(Balance_Sheet)
        Balance_Sheet.to_excel("output.xlsx")

    def request_url(url):
        # requests - make the request look like it comes from a browser
        header = {
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
            "X-Requested-With": "XMLHttpRequest"
        }
        #r = requests.get(url, headers=header)
        #df = pd.read_html(r.text)

        r = requests.get(url)
        #print('html page is :')
        #print(r)
        dfTables = pd.read_html(r.text)

        #print('printing the df')
        #print(dfTables)
        return dfTables

    if __name__ == '__main__':
        main()
