Metadata Extraction from Unstructured Data (PDF, DOC, Images) using Python and NLP
Dhivya vijayakumar
Data engineer || Python developer || Certified AWS Developer Associate
I'm thrilled to share a project I've been working on involving the extraction of metadata from unstructured data sources such as PDFs, DOC files, and images using Python and NLP (Natural Language Processing) techniques.
Objective: The goal was to efficiently extract valuable metadata (like titles, authors, dates, keywords, and entities) from various unstructured data formats.
Note: store the sample DOC/PDF/image files locally and provide the complete path to the folder (a folder-level driver is sketched at the end of this post).
Technologies Used:
Python for scripting
PyMuPDF for PDF text extraction
python-docx for DOC text extraction
pytesseract and Pillow for OCR on images
NLTK and spaCy for text processing and NLP
Steps Involved:
Extract Text from PDFs: Utilized PyMuPDF to extract text from PDF files, ensuring accuracy and efficiency.
(Note: this handles PDF files with multiple pages.)
import fitz  # PyMuPDF

def extract_text_from_pdf(file_path):
    # Read every page of the PDF and concatenate its text
    text = ""
    pdf_document = fitz.open(file_path)
    for page_num in range(pdf_document.page_count):
        page = pdf_document.load_page(page_num)
        text += page.get_text()
    pdf_document.close()
    return text

pdf_text = extract_text_from_pdf("sample.pdf")
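Beyond the page text, PyMuPDF also exposes the PDF's embedded metadata dictionary, which can be a useful complement to the text-based extraction below. A minimal sketch, reusing the same sample.pdf (the helper name extract_pdf_metadata is just for illustration):

import fitz  # PyMuPDF

def extract_pdf_metadata(file_path):
    # The document's built-in metadata is a dict with keys such as
    # 'title', 'author', 'creationDate', and 'producer'
    pdf_document = fitz.open(file_path)
    metadata = pdf_document.metadata
    pdf_document.close()
    return metadata

pdf_metadata = extract_pdf_metadata("sample.pdf")
print(pdf_metadata.get("title"), pdf_metadata.get("author"))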
Extract Text from DOC Files: Leveraged python-docx to seamlessly extract text from DOC/DOCX files.
from docx import Document

def extract_text_from_doc(file_path):
    # Join the text of all paragraphs in the document
    doc = Document(file_path)
    return ' '.join(para.text for para in doc.paragraphs)

doc_text = extract_text_from_doc("sample.docx")
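python-docx also surfaces the properties stored inside the .docx package itself via core_properties, which can prefill the author and title fields before any NLP is applied. A small sketch (the helper name extract_docx_properties is just for illustration):

from docx import Document

def extract_docx_properties(file_path):
    # core_properties carries the stored author, title, and creation date
    props = Document(file_path).core_properties
    return {
        'author': props.author,
        'title': props.title,
        'created': str(props.created)  # created is a datetime (or None); stringify for easy serialisation
    }

doc_properties = extract_docx_properties("sample.docx")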
Extract Text from Images: Implemented OCR using pytesseract and Pillow to extract text from images.
from PIL import Image?
import pytesseract?
def extract_text_from_image(image_path):?
image = Image.open(image_path)?
return pytesseract.image_to_string(image)?
image_text = extract_text_from_image("sample_image.png")?
?
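One practical note: pytesseract only wraps the Tesseract OCR engine, so the Tesseract binary must be installed separately. If it is not on the system PATH, you can point pytesseract at it explicitly (the path below is just an example for a typical Windows install):

import pytesseract

# Only needed when the tesseract executable is not on the PATH;
# adjust the path to wherever Tesseract is installed on your machine.
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"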
Text Preprocessing: Cleaned and prepared the text data using NLTK and spaCy.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

def preprocess_text(text):
    # Tokenise, drop punctuation/non-alphanumeric tokens, and remove stop words
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(text)
    filtered_text = ' '.join(
        word for word in word_tokens
        if word.isalnum() and word.lower() not in stop_words
    )
    return filtered_text

cleaned_text = preprocess_text(pdf_text)
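Since spaCy is used for the NLP steps anyway, an equivalent clean-up can also be done with its built-in stop-word and lemma attributes; a sketch under that assumption (the helper name preprocess_text_spacy is just for illustration):

import spacy

nlp = spacy.load("en_core_web_sm")

def preprocess_text_spacy(text):
    # Keep lower-cased lemmas of alphabetic tokens that are not stop words
    doc = nlp(text)
    return ' '.join(
        token.lemma_.lower()
        for token in doc
        if token.is_alpha and not token.is_stop
    )

cleaned_text_spacy = preprocess_text_spacy(pdf_text)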
Metadata Extraction: Extracted titles, authors, dates, keywords, and entities using regex and spaCy's NLP models.
import re
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

def extract_title(text):
    # Use the first sentence as the title if it is short, otherwise fall back to the first line
    first_line = text.split('\n')[0]
    first_sentence = text.split('.')[0]
    return first_sentence if len(first_sentence) < 100 else first_line

def extract_author(text):
    # Look for patterns like "By Jane Doe" or "Author: Jane Doe"
    author_pattern = re.compile(r'(By\s+[A-Z][a-z]+\s+[A-Z][a-z]+)|(Author:\s+[A-Z][a-z]+\s+[A-Z][a-z]+)')
    match = author_pattern.search(text)
    return match.group(0) if match else "Unknown"

def extract_date(text):
    # Match common numeric date formats such as 12/31/2023 or 2023-12-31
    date_pattern = re.compile(r'\b(?:\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\d{4}-\d{2}-\d{2})\b')
    match = date_pattern.search(text)
    return match.group(0) if match else "Unknown"

def extract_keywords(text, n=10):
    # Rank noun and proper-noun lemmas by frequency and keep the top n
    doc = nlp(text)
    nouns = [token.lemma_.lower() for token in doc if token.pos_ in ("NOUN", "PROPN")]
    keywords = [word for word, _ in Counter(nouns).most_common(n)]
    return keywords

def extract_entities(text):
    # Named-entity recognition: return (text, label) pairs for each entity spaCy finds
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities

title = extract_title(pdf_text)
author = extract_author(pdf_text)
date = extract_date(pdf_text)
keywords = extract_keywords(cleaned_text)
entities = extract_entities(cleaned_text)
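spaCy's NER returns every mention it finds, so the entity list can be noisy. One optional refinement is to keep only a few label types and drop duplicates; a sketch (the label set and helper name are just illustrative choices):

def filter_entities(entities, keep_labels=("PERSON", "ORG", "GPE", "DATE")):
    # Deduplicate while preserving order and keep only selected entity labels
    seen = set()
    filtered = []
    for text, label in entities:
        if label in keep_labels and (text, label) not in seen:
            seen.add((text, label))
            filtered.append((text, label))
    return filtered

filtered_entities = filter_entities(entities)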
Saving Metadata: Stored the extracted metadata in a JSON file for further use.
(Note: you can also save this in CSV format; see the sketch after the JSON example.)
import json

def save_metadata(title, author, date, keywords, entities):
    # Collect the extracted fields into a dictionary and write them as JSON
    metadata = {
        'title': title,
        'author': author,
        'date': date,
        'keywords': keywords,
        'entities': entities
    }
    with open('metadata.json', 'w') as f:
        json.dump(metadata, f, indent=4)

save_metadata(title, author, date, keywords, entities)
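As noted above, the same metadata can be written to CSV instead; a minimal sketch using the standard csv module, where each field becomes one column and lists are joined into strings (the helper name save_metadata_csv is just for illustration):

import csv

def save_metadata_csv(title, author, date, keywords, entities, path='metadata.csv'):
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['title', 'author', 'date', 'keywords', 'entities'])
        writer.writerow([
            title,
            author,
            date,
            '; '.join(keywords),
            '; '.join(f"{text} ({label})" for text, label in entities)
        ])

save_metadata_csv(title, author, date, keywords, entities)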
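To tie the pieces together for a whole folder (as mentioned in the note at the top), a small driver can dispatch each file to the right extractor based on its extension. A sketch, assuming the functions defined above and an illustrative folder path:

import os

EXTRACTORS = {
    '.pdf': extract_text_from_pdf,
    '.docx': extract_text_from_doc,
    '.png': extract_text_from_image,
    '.jpg': extract_text_from_image,
}

def process_folder(folder_path):
    results = {}
    for file_name in os.listdir(folder_path):
        ext = os.path.splitext(file_name)[1].lower()
        extractor = EXTRACTORS.get(ext)
        if extractor is None:
            continue  # skip unsupported file types
        text = extractor(os.path.join(folder_path, file_name))
        cleaned = preprocess_text(text)
        results[file_name] = {
            'title': extract_title(text),
            'author': extract_author(text),
            'date': extract_date(text),
            'keywords': extract_keywords(cleaned),
            'entities': extract_entities(cleaned),
        }
    return results

all_metadata = process_folder("path/to/your/folder")  # replace with your folder path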
This project was a fantastic learning experience, demonstrating the power of Python and NLP in handling unstructured data.