Read websites' content using Python

When I think about when it all started, I cannot pinpoint the exact moment, but as far back as I can remember I have always loved AS Roma. I have always cheered for them, laughed when they won and cried when they lost, which was pretty often :D

Recently I wanted to go through some sources and see what has really happened over these years: the quality of the players season by season, whether it has risen or fallen, and which players have been bought or sold. Fortunately there are many websites that keep these stats, and they are all free to use. The problem is that you have to go through every year, or even every player's page, to collect the data, and you cannot have it in one integrated place such as a database or even an Excel sheet.


So it struck me to use Python. There is a Python package called Selenium, a framework for automating tests of web applications, but it is commonly used for web scraping too. It is pretty straightforward: all you need is a little HTML knowledge to tell it to go through many pages, read the content, and finally store it, for example in a spreadsheet.

We'll be using three Python libraries: selenium, pandas, and unidecode (used later to normalize accented player names). To install them, simply run

pip install selenium pandas unidecode

In addition, you'll need a browser driver to simulate browser sessions. Since I am on Chrome, we'll use that for the walk-through.

You can download the ChromeDriver from its official downloads page and save it somewhere in the project.

Getting started

from selenium import webdriver
import pandas as pd
import re
import unidecode

# Point Selenium at the ChromeDriver executable downloaded earlier.
browser = webdriver.Chrome("C:\\Users\\Milad\\PycharmProjects\\TransferMarkt\\driver\\chromedriver.exe")
browser.maximize_window()

Now we have an instance of the driver and an open browser to start with.

I have defined a function to map the many detailed player positions onto four main lines.

def get_player_line(argument):
    # Map each detailed Transfermarkt position onto one of four main lines.
    switcher = {
        'Goalkeeper': "Goal",
        'Centre-Back': "Defence",
        'Left-Back': "Defence",
        'Right-Back': "Defence",
        'Attacking Midfield': "Midfield",
        'Central Midfield': "Midfield",
        'Defensive Midfield': "Midfield",
        'Left Midfield': "Midfield",
        'Right Midfield': "Midfield",
        'Left Winger': "Forward",
        'Right Winger': "Forward",
        'Second Striker': "Forward",
        'Centre-Forward': "Forward"
    }

    return switcher.get(argument)
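A quick self-contained check of the mapping (repeating just a few entries from the full switcher above):

```python
# A trimmed copy of the position-to-line mapping, for a sanity check.
switcher = {
    'Goalkeeper': "Goal",
    'Centre-Back': "Defence",
    'Central Midfield': "Midfield",
    'Left Winger': "Forward",
}

def get_player_line(argument):
    # Unknown positions fall through to None, as in the full function.
    return switcher.get(argument)

print(get_player_line('Left Winger'))  # Forward
print(get_player_line('Striker'))      # None (not in the mapping)
```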

Now it is time to define the team and the DataFrame in which we want to save the data.

Club = "AS Roma"
Country = "Italy"


df = pd.DataFrame(columns=['Country', 'Club', 'Season', 'player_name',
                           'player_position', 'player_line', 'player_age',
                           'player_nationality', 'player_value'])

If you check the Transfermarkt website and look at a team's player details for each season, you will see that the link format is constant, with only the year number changing:

https://www.transfermarkt.us/as-roma/startseite/verein/12?saison_id=2018

So you can change the year at the end of the URL to browse whichever season you want. We continue by writing a for loop to browse the last 15 seasons.

for i in range(15):
    browser.get('https://www.transfermarkt.co.uk/as-roma/startseite/verein/12?saison_id={}'.format(2019-i))
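As a quick sanity check of the URL construction (no browser needed; the club id 12 and the saison_id parameter come straight from the link above):

```python
# Build the season URLs for the last 15 seasons (2005 through 2019),
# following the constant-link-plus-year pattern described above.
base = "https://www.transfermarkt.co.uk/as-roma/startseite/verein/12?saison_id={}"
season_urls = [base.format(2019 - i) for i in range(15)]

print(season_urls[0])   # most recent season, 2019
print(season_urls[-1])  # oldest season, 2005
```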

After opening the page, the best trick for finding the element you are looking for is to locate a constant id among the page's elements and walk down its tree-like structure.

Here we find the table that contains the player details and get its length, so that we can write an inner loop that visits each player one by one.

    players_table_count = len(browser.find_elements_by_xpath(
        '//*[@id="yw1"]/table/tbody/tr'))

    Season = 2019 - i

    for j in range(players_table_count):
        row = str(j + 1)

        # Name and position share the second column of the row.
        player_name_path = '//*[@id="yw1"]/table/tbody/tr[' + row + ']/td[2]'
        player_name_position = browser.find_element_by_xpath(player_name_path).text

        # The first line of the cell is the name; strip accents for consistency.
        player_name = unidecode.unidecode(player_name_position.split("\n")[0])

        player_position_path = '//*[@id="yw1"]/table/tbody/tr[' + row + ']/td[2]/table/tbody/tr[2]'
        player_position = browser.find_element_by_xpath(player_position_path).text

        player_line = get_player_line(player_position)

        # The age column sits in td[4] or td[3] depending on the page layout,
        # so fall back to td[3] when td[4] yields nothing.
        player_age_path = '//*[@id="yw1"]/table/tbody/tr[' + row + ']/td[4]'
        player_age_path_2 = '//*[@id="yw1"]/table/tbody/tr[' + row + ']/td[3]'
        player_age = browser.find_element_by_xpath(player_age_path_2).text[-3:-1] \
            if browser.find_element_by_xpath(player_age_path).text[-3:-1] == "" \
            else browser.find_element_by_xpath(player_age_path).text[-3:-1]

        # Nationality is in td[5] on the current season's page, td[4] otherwise;
        # the flag image's alt attribute carries the country name.
        player_nationality_path = '//*[@id="yw1"]/table/tbody/tr[' + row + ']/td[5]' \
            if i == 0 else '//*[@id="yw1"]/table/tbody/tr[' + row + ']/td[4]'
        player_nationality = browser.find_element_by_xpath(player_nationality_path) \
            .find_element_by_tag_name('img').get_attribute('alt')

        # Market value: "k" figures are already in thousands, "m" figures are
        # converted to thousands; anything unparseable defaults to 0.
        player_value_path = '//*[@id="yw1"]/table/tbody/tr[' + row + ']/td[6]'
        try:
            player_value_text = browser.find_element_by_xpath(player_value_path).text
            player_value = float(re.findall(r"[-+]?\d*\.\d+|\d+", player_value_text)[0]) \
                if "k" in player_value_text \
                else float(re.findall(r"[-+]?\d*\.\d+|\d+", player_value_text)[0]) * 1000
        except (IndexError, ValueError):
            player_value = 0

Do not be terrified by the code. You just need to know how table, tr, td and other tags work in HTML; press F12 on the websites you like and study their structure to understand this piece of code better. The key is understanding how HTML works and locating elements by passing their paths to the code.
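To make the long XPath strings less opaque, here is how the row and column paths are assembled. The id "yw1" is the players table used throughout; the f-string form is simply a more readable way of writing the same concatenation:

```python
# Build the XPath for row `row` (1-based, as XPath indices are) and
# column `col` of the Transfermarkt players table with id "yw1".
def cell_xpath(row, col):
    return f'//*[@id="yw1"]/table/tbody/tr[{row}]/td[{col}]'

print(cell_xpath(1, 2))  # //*[@id="yw1"]/table/tbody/tr[1]/td[2]
```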

We can then store the scraped data in our DataFrame:

        record_insert = int(df['Country'].count() + 1)

        df.loc[record_insert] = [Country, Club, Season, player_name, player_position,
                                 player_line, player_age, player_nationality, player_value]

        print(player_name + " Season " + str(Season) + " Done")

and finally save everything to a CSV file,

df.to_csv("player_dataset.csv", index=False)

and then do whatever analysis you want on the dataset you have built.


I hope you have enjoyed this article. Please share it if you did :)

