Read websites' content using Python

When I think about when it all started, I cannot pinpoint the exact moment, but as far back as I can remember I have always loved AS Roma. I have always cheered for them, laughed when they won and cried when they lost, which was pretty often :D

Recently I wanted to go through some sources and see what has really happened over these years: the quality of the players season by season, whether it has risen or fallen, and which players have been bought or sold. Fortunately there are many websites that keep these stats, and they are all free to use. The problem is that you have to go through every year, or even every player's page, to collect the data, and you cannot have it in one integrated place such as a database or even an Excel sheet.


So it struck me to use Python. There is a Python package called Selenium, a framework for automating tests of web applications, but it is commonly used for web scraping too. It is pretty straightforward: all you need is a little HTML knowledge to tell it to go through many pages, read the content, and finally store it, for example in a spreadsheet.

We'll be using three Python libraries: selenium, pandas, and unidecode (used later to normalize accented player names). To install them, simply run

pip install selenium pandas unidecode

In addition, you'll need a browser driver to simulate browser sessions. Since I am on Chrome, we'll use that for the walk-through.

You can download the ChromeDriver from its official downloads page and save it somewhere in the project.

Getting started

from selenium import webdriver
import pandas as pd
import re
import unidecode

# Point Selenium at the ChromeDriver executable downloaded earlier.
browser = webdriver.Chrome("C:\\Users\\Milad\\PycharmProjects\\TransferMarkt\\driver\\chromedriver.exe")
browser.maximize_window()

Now we have an instance of the driver and an open browser to start with.

I have defined a function to map the many detailed player positions onto four main lines.

def get_player_line(argument):
    # Map each detailed Transfermarkt position onto one of four main lines.
    switcher = {
        'Goalkeeper': "Goal",
        'Centre-Back': "Defence",
        'Left-Back': "Defence",
        'Right-Back': "Defence",
        'Attacking Midfield': "Midfield",
        'Central Midfield': "Midfield",
        'Defensive Midfield': "Midfield",
        'Left Midfield': "Midfield",
        'Right Midfield': "Midfield",
        'Left Winger': "Forward",
        'Right Winger': "Forward",
        'Second Striker': "Forward",
        'Centre-Forward': "Forward"
    }

    return switcher.get(argument)
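A quick self-contained check of the mapping (repeating just a few entries from the full switcher above):

```python
# A trimmed copy of the position-to-line mapping, for a sanity check.
switcher = {
    'Goalkeeper': "Goal",
    'Centre-Back': "Defence",
    'Central Midfield': "Midfield",
    'Left Winger': "Forward",
}

def get_player_line(argument):
    # Unknown positions fall through to None, as in the full function.
    return switcher.get(argument)

print(get_player_line('Left Winger'))  # Forward
print(get_player_line('Striker'))      # None (not in the mapping)
```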

Now it is time to define the team and the DataFrame in which we want to save the data.

Club = "AS Roma"
Country = "Italy"


df = pd.DataFrame(columns=['Country', 'Club', 'Season', 'player_name',
                           'player_position', 'player_line', 'player_age',
                           'player_nationality', 'player_value'])

If you check the Transfermarkt website and look at a team's player details for each season, you will see that the link format is constant, with only the year number changing:

https://www.transfermarkt.us/as-roma/startseite/verein/12?saison_id=2018

So you can change the year at the end of the URL to browse whichever season you want. We continue by writing a for loop to browse the last 15 seasons.

for i in range(15):
    browser.get('https://www.transfermarkt.co.uk/as-roma/startseite/verein/12?saison_id={}'.format(2019-i))
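As a quick sanity check of the URL construction (no browser needed; the club id 12 and the saison_id parameter come straight from the link above):

```python
# Build the season URLs for the last 15 seasons (2005 through 2019),
# following the constant-link-plus-year pattern described above.
base = "https://www.transfermarkt.co.uk/as-roma/startseite/verein/12?saison_id={}"
season_urls = [base.format(2019 - i) for i in range(15)]

print(season_urls[0])   # most recent season, 2019
print(season_urls[-1])  # oldest season, 2005
```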

After opening the page, the best trick for finding the element you are looking for is to locate a constant id among the page's elements and walk down its tree-like structure.

Here we find the table that contains the player details and get its length, so that we can write an inner loop that visits each player one by one.

    players_table_count = len(browser.find_elements_by_xpath(
        '//*[@id="yw1"]/table/tbody/tr'))

    Season = 2019 - i

    for j in range(players_table_count):
        row = str(j + 1)

        # Name and position share the second column of the row.
        player_name_path = '//*[@id="yw1"]/table/tbody/tr[' + row + ']/td[2]'
        player_name_position = browser.find_element_by_xpath(player_name_path).text

        # The first line of the cell is the name; strip accents for consistency.
        player_name = unidecode.unidecode(player_name_position.split("\n")[0])

        player_position_path = '//*[@id="yw1"]/table/tbody/tr[' + row + ']/td[2]/table/tbody/tr[2]'
        player_position = browser.find_element_by_xpath(player_position_path).text

        player_line = get_player_line(player_position)

        # The age column sits in td[4] or td[3] depending on the page layout,
        # so fall back to td[3] when td[4] yields nothing.
        player_age_path = '//*[@id="yw1"]/table/tbody/tr[' + row + ']/td[4]'
        player_age_path_2 = '//*[@id="yw1"]/table/tbody/tr[' + row + ']/td[3]'
        player_age = browser.find_element_by_xpath(player_age_path_2).text[-3:-1] \
            if browser.find_element_by_xpath(player_age_path).text[-3:-1] == "" \
            else browser.find_element_by_xpath(player_age_path).text[-3:-1]

        # Nationality is in td[5] on the current season's page, td[4] otherwise;
        # the flag image's alt attribute carries the country name.
        player_nationality_path = '//*[@id="yw1"]/table/tbody/tr[' + row + ']/td[5]' \
            if i == 0 else '//*[@id="yw1"]/table/tbody/tr[' + row + ']/td[4]'
        player_nationality = browser.find_element_by_xpath(player_nationality_path) \
            .find_element_by_tag_name('img').get_attribute('alt')

        # Market value: "k" figures are already in thousands, "m" figures are
        # converted to thousands; anything unparseable defaults to 0.
        player_value_path = '//*[@id="yw1"]/table/tbody/tr[' + row + ']/td[6]'
        try:
            player_value_text = browser.find_element_by_xpath(player_value_path).text
            player_value = float(re.findall(r"[-+]?\d*\.\d+|\d+", player_value_text)[0]) \
                if "k" in player_value_text \
                else float(re.findall(r"[-+]?\d*\.\d+|\d+", player_value_text)[0]) * 1000
        except (IndexError, ValueError):
            player_value = 0

Do not be terrified by the code. You just need to know how table, tr, td and other tags work in HTML; press F12 on the websites you like and study their structure to understand this piece of code better. The key is understanding how HTML works and locating elements by passing their paths to the code.
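To make the long XPath strings less opaque, here is how the row and column paths are assembled. The id "yw1" is the players table used throughout; the f-string form is simply a more readable way of writing the same concatenation:

```python
# Build the XPath for row `row` (1-based, as XPath indices are) and
# column `col` of the Transfermarkt players table with id "yw1".
def cell_xpath(row, col):
    return f'//*[@id="yw1"]/table/tbody/tr[{row}]/td[{col}]'

print(cell_xpath(1, 2))  # //*[@id="yw1"]/table/tbody/tr[1]/td[2]
```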

We can then store the scraped data in our DataFrame:

        record_insert = int(df['Country'].count() + 1)

        df.loc[record_insert] = [Country, Club, Season, player_name, player_position,
                                 player_line, player_age, player_nationality, player_value]

        print(player_name + " Season " + str(Season) + " Done")

and finally save everything to a CSV file,

df.to_csv("player_dataset.csv", index=False)

and then do whatever analysis you want on the dataset you have built.


I hope you have enjoyed this article. Please share it if you did :)

