Read websites' content using Python
When I think of when it all started, I cannot remember the exact time, but as far as I can remember I have always loved AS Roma: I have always cheered for them, laughed when they won, and cried when they lost, which was pretty often :D
Recently I was thinking of going through some sources to see what has really happened over these years: has the quality of the players risen or fallen? What about the players who have been bought and sold? Fortunately there are many websites that keep these stats, and they are all free to use, but the problem is that you have to go through every year, or even every player's page, to get the data, and you cannot have it in one integrated place like a database or even an Excel sheet.
That is when it struck me to use Python. There is a package in Python called Selenium, which is a framework for automating tests for web applications, but it is commonly used for web scraping too. It is pretty straightforward: all you need is a little HTML to tell it to go through many pages, read the content, and finally store it in, for example, an Excel sheet.
We'll be using two Python libraries, selenium and pandas, plus unidecode to clean up accented player names. To install them simply run
pip install selenium pandas unidecode
In addition to this, you'll need a browser driver to simulate browser sessions. Since I am on Chrome, we'll be using that for the walk-through.
You can download the Chrome driver here and save it somewhere in the project.
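Alternatively, if you would rather not download and track the driver binary yourself, the third-party webdriver-manager package (an optional alternative, not used in the rest of this walk-through) can fetch a matching driver automatically. A minimal sketch, assuming you also run pip install webdriver-manager:

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

# webdriver-manager downloads a chromedriver matching the installed Chrome
browser = webdriver.Chrome(ChromeDriverManager().install())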
Getting started
from selenium import webdriver
import pandas as pd
import re
import unidecode

browser = webdriver.Chrome("C:\\Users\\Milad\\PycharmProjects\\TransferMarkt\\driver\\chromedriver.exe")
browser.maximize_window()
Now we have an instance of the driver, and a browser opens so we can start.
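A quick version note: this walk-through uses the Selenium 3 style API (passing the driver path directly, and the find_element_by_* methods you will see below). If you are on Selenium 4 or newer, the equivalent setup looks roughly like this, with the path wrapped in a Service object and lookups done through find_element(By.XPATH, ...):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4+: the driver path goes through a Service object
browser = webdriver.Chrome(service=Service("path/to/chromedriver.exe"))
# and element lookups use By constants, for example:
# browser.find_element(By.XPATH, '//*[@id="yw1"]')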
I have defined a function to map the many player positions onto the 4 main lines.
def get_player_line(argument):
    switcher = {
        'Goalkeeper': "Goal",
        'Centre-Back': "Defence",
        'Left-Back': "Defence",
        'Right-Back': "Defence",
        'Attacking Midfield': "Midfield",
        'Central Midfield': "Midfield",
        'Defensive Midfield': "Midfield",
        'Left Midfield': "Midfield",
        'Right Midfield': "Midfield",
        'Left Winger': "Forward",
        'Right Winger': "Forward",
        'Second Striker': "Forward",
        'Centre-Forward': "Forward"
    }
    return switcher.get(argument)
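For example (note that any position missing from the map returns None, since switcher.get is called without a default):

print(get_player_line('Right Winger'))  # Forward
print(get_player_line('Centre-Back'))   # Defence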
Now it is time to define the team and the DataFrame in which we want to save the data.
Club = "AS Roma"
Country = "Italy"
df = pd.DataFrame(columns=['Country', 'Club', 'Season', 'player_name', 'player_position',
                           'player_line', 'player_age', 'player_nationality', 'player_value'])
If you check the Transfermarkt website and look at a team's player details for each season, you will see that the link format is constant, with only the year number changing:
https://www.transfermarkt.us/as-roma/startseite/verein/12?saison_id=2018
So you can change the year number at the end to browse whichever season you want. We continue by writing a for loop to browse the last 15 seasons.
for i in range(15):
    browser.get('https://www.transfermarkt.co.uk/as-roma/startseite/verein/12?saison_id={}'.format(2019 - i))
After opening the page, the best trick for finding the element you are looking for is to locate a constant id among the page's elements and walk down its tree-like structure.
Here we find the table that contains the player details and get its length, so that we can write a second loop that visits the players one by one:
    # still inside the season loop: count the rows of the players table
    players_table_count = len(browser.find_elements_by_xpath('//*[@id="yw1"]/table/tbody/tr'))
    Season = 2019 - i
    for j in range(players_table_count):
        # the name sits in the 2nd column; strip accents so names are plain ASCII
        player_name_path = '//*[@id="yw1"]/table/tbody/tr[' + str(j + 1) + ']/td[2]'
        player_name_position = browser.find_element_by_xpath(player_name_path).text
        player_name = unidecode.unidecode(player_name_position.split("\n")[0])

        # the position lives in a nested table inside the same cell
        player_position_path = '//*[@id="yw1"]/table/tbody/tr[' + str(j + 1) + ']/td[2]/table/tbody/tr[2]'
        player_position = browser.find_element_by_xpath(player_position_path).text
        player_line = get_player_line(player_position)

        # the age column index differs between season layouts, so fall back to td[3] when td[4] is empty
        player_age_path = '//*[@id="yw1"]/table/tbody/tr[' + str(j + 1) + ']/td[4]'
        player_age_path_2 = '//*[@id="yw1"]/table/tbody/tr[' + str(j + 1) + ']/td[3]'
        player_age = browser.find_element_by_xpath(player_age_path_2).text[-3:-1] \
            if browser.find_element_by_xpath(player_age_path).text[-3:-1] == "" \
            else browser.find_element_by_xpath(player_age_path).text[-3:-1]

        # nationality comes from the flag image's alt text; the column shifts for the current season
        player_nationality_path = '//*[@id="yw1"]/table/tbody/tr[' + str(j + 1) + ']/td[5]' \
            if i == 0 else '//*[@id="yw1"]/table/tbody/tr[' + str(j + 1) + ']/td[4]'
        player_nationality = browser.find_element_by_xpath(player_nationality_path) \
            .find_element_by_tag_name('img').get_attribute('alt')

        # market value, normalised to thousands: "...k" stays as-is, "...m" is multiplied by 1000
        player_value_path = '//*[@id="yw1"]/table/tbody/tr[' + str(j + 1) + ']/td[6]'
        try:
            value_text = browser.find_element_by_xpath(player_value_path).text
            player_value = float(re.findall(r"[-+]?\d*\.\d+|\d+", value_text)[0]) \
                if "k" in value_text else \
                float(re.findall(r"[-+]?\d*\.\d+|\d+", value_text)[0]) * 1000
        except Exception:
            # no value listed for this player
            player_value = 0
Do not be terrified by the code; you just need to know how the table, tr, and td tags work in HTML. Press F12 on the websites you like and study their structure to understand this piece of code better. The key is understanding how HTML is laid out and locating elements by handing their XPath to the driver.
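One optional hardening step, not part of the original script: parts of a page can render a moment after navigation, so you can make the scraper less flaky by waiting for the players table to be present before reading it, using Selenium's built-in WebDriverWait. A minimal sketch:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# block for up to 10 seconds until the players table (id="yw1") is in the DOM
WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.ID, "yw1")))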
We can then store the scraped data in our DataFrame:
        # still inside the player loop: append the row to the DataFrame
        record_insert = int(df['Country'].count() + 1)
        df.loc[record_insert] = [Country, Club, Season, player_name, player_position,
                                 player_line, player_age, player_nationality, player_value]
        print(player_name + " Season " + str(Season) + " Done")
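As an aside, computing the next index with count() works here, but appending to a DataFrame row by row is slow in pandas. A common alternative (a sketch, not the original approach) is to collect plain dicts and build the DataFrame once at the end:

rows = []  # created once, before the season loop

# ... inside the player loop, instead of df.loc[...]:
rows.append({'Country': Country, 'Club': Club, 'Season': Season,
             'player_name': player_name, 'player_position': player_position,
             'player_line': player_line, 'player_age': player_age,
             'player_nationality': player_nationality, 'player_value': player_value})

# ... after both loops finish:
df = pd.DataFrame(rows)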
and finally store it all in a CSV file, which Excel opens just fine,
df.to_csv("player_dataset.csv", index=False)
and do whatever analysis you want on the data set that you have.
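For instance, here is a minimal sketch of one such analysis, reading the CSV back and summing the squad's market value per season (it assumes the column names defined earlier; also remember to call browser.quit() once scraping is done):

import pandas as pd

df = pd.read_csv("player_dataset.csv")
# total squad value (in thousands) per season
print(df.groupby('Season')['player_value'].sum())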
I hope that you have enjoyed this article; please share it if you did :)