Python IO - Locked Down Lessons
Python offers easy-to-use functions for reading data from PDF, JSON, text files, the web, and an RDBMS. Sample code is included below for reference. I use Jupyter (via Anaconda) to test the code.
Steps to read a PDF with the PyPDF2 library
PyPDF2 is a third-party package rather than a built-in one, so it has to be installed and then imported.
import PyPDF2

# Open the PDF in binary mode and read it
# Uses the PyPDF2 3.x names (PdfReader, pages, extract_text)
with open('example.pdf', 'rb') as pdf_object:
    pdf_reader = PyPDF2.PdfReader(pdf_object)

    # Number of pages in the PDF file
    print(len(pdf_reader.pages))

    # Get one page as a Python object; indexing starts at 0, so 1 is the second page
    page_object = pdf_reader.pages[1]

    # Extract the text from page_object; string operations can then be used to dissect the text
    print(page_object.extract_text())
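Once the page text is available as a plain string, ordinary string methods are enough to pick it apart. The sketch below assumes the text has already been extracted; `sample_text` and the helper `find_lines` are hypothetical names used only for illustration.

```python
def find_lines(text, keyword):
    """Return the stripped lines of `text` that contain `keyword` (case-insensitive)."""
    matches = []
    for line in text.splitlines():
        if keyword.lower() in line.lower():
            matches.append(line.strip())
    return matches


# Hypothetical stand-in for the string returned by the page's text extraction
sample_text = "Invoice 1042\nTotal: 250.00\n  Due date: 2020-01-31\nThank you"

total_lines = find_lines(sample_text, "total")
print(total_lines)
```

The same pattern works with any other string method, such as `split()` or `startswith()`, depending on how the PDF lays out its text.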
Two steps to request and read data from an API:
- Use the requests module to connect to a URL and fetch a response
- Use json.loads() to convert the JSON text of the response to a Python dictionary
import json
import pprint

import requests

# url is expected to hold the API endpoint, including the API key
r = requests.get(url)

# Convert the JSON text of the response to a dict using json.loads()
r_dict = json.loads(r.text)

# The JSON is now a Python dictionary, which is a key-value data structure
pprint.pprint(r_dict)
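To make the conversion concrete without calling a real API, here is a small sketch that runs json.loads() on a hard-coded string. The payload `sample_body` and its field names are made up for illustration; a real API returns its own structure.

```python
import json

# Hypothetical response body; a real API would return its own fields
sample_body = '{"city": "Pune", "temps": [31, 29, 33], "source": {"name": "demo"}}'

data = json.loads(sample_body)      # JSON text -> Python dict

print(type(data).__name__)          # name of the resulting Python type
print(data["city"])                 # top-level value
print(data["source"]["name"])       # a nested JSON object becomes a nested dict
print(max(data["temps"]))           # a JSON array becomes a Python list
```

When the response comes from requests, `r.json()` performs the same conversion in one call.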
Getting Data from websites for analysis
Web scraping refers to the art of programmatically getting data from the internet. One of the coolest features of Python is that it makes it easy to scrape websites. In Python 3, one of the most popular libraries for web scraping is BeautifulSoup.
The general procedure to get data from websites is:
- Use requests to connect to a URL and get data from it
- Create a BeautifulSoup object
- Get attributes of the BeautifulSoup object (i.e. the HTML elements that you want)
import requests, bs4

# Get the HTML of the Amazon search page
url = "https://www.amazon.in/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=sport+shoes"
req = requests.get(url)  # requests fetches the HTML page at the URL

# Create a bs4 object
# To avoid warnings, name the parser explicitly ("html5lib" must be installed separately)
soup = bs4.BeautifulSoup(req.text, "html5lib")

# print(soup)
# soup.select('div > p')  # selects the <p> elements that are direct children of a <div>
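The third step, picking out the HTML elements you want, is easier to see on a small page. The sketch below parses a hard-coded HTML string instead of a live site; the snippet and the example.com link are hypothetical, and it swaps in Python's bundled "html.parser" so that no extra parser package is needed.

```python
import bs4

# Hypothetical HTML snippet standing in for req.text
html = """
<html><body>
  <div><p>First paragraph</p><p>Second paragraph</p></div>
  <p>Outside the div</p>
  <a href="https://example.com/shoes">Shoes</a>
</body></html>
"""

# "html.parser" ships with Python, so no extra parser package is needed here
soup = bs4.BeautifulSoup(html, "html.parser")

# CSS selector: <p> elements that are direct children of a <div>
paragraphs = [p.get_text() for p in soup.select("div > p")]
print(paragraphs)

# Attributes of a tag are read like dictionary keys
links = [a["href"] for a in soup.find_all("a")]
print(links)
```

On a real page the selectors depend entirely on that site's markup, so it usually helps to inspect the HTML in a browser first.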
Getting Data from CSV file
The easiest way to read delimited files is pd.read_csv(filepath, sep, header), where sep specifies the separator (delimiter). The default separator is a comma.
import numpy as np
import pandas as pd

# Read a tab-separated file using encoding="ISO-8859-1"
file = pd.read_csv("xxxxxx.txt", sep="\t", encoding="ISO-8859-1")

# head() shows the first 5 rows by default
file.head()
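To try read_csv without a file on disk, the sketch below feeds it an in-memory string through io.StringIO. The tab-separated content in `raw` is made up for illustration.

```python
import io

import pandas as pd

# Hypothetical tab-separated content standing in for a file on disk
raw = "name\tprice\nshoe\t1200\nsock\t150\nlace\t40\n"

# sep="\t" overrides the default comma; header=0 takes column names from the first line
df = pd.read_csv(io.StringIO(raw), sep="\t", header=0)

print(df.head(2))   # first two rows instead of the default five
print(df.shape)     # (rows, columns)
```

Passing a number to head() is handy when five rows are more, or fewer, than you need for a quick look.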