Metadata Extraction from Unstructured Data (PDF, DOC, Images) using Python and NLP

I'm thrilled to share a project I've been working on: extracting metadata from unstructured data sources such as PDFs, DOCX files, and images using Python and NLP (Natural Language Processing) techniques.


Objective: The goal was to efficiently extract valuable metadata (like titles, authors, dates, keywords, and entities) from various unstructured data formats.

(Note: store the sample DOC/PDF/image files locally and provide the complete path to the folder.)

Technologies Used:

- Python for scripting
- PyMuPDF for PDF text extraction
- python-docx for DOCX text extraction
- pytesseract and Pillow for OCR on images
- NLTK and spaCy for text processing and NLP
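Assuming a standard pip environment, the stack above can be installed as follows (note that pytesseract also needs the Tesseract OCR engine itself, which is a separate system package, not a pip install):

```shell
pip install PyMuPDF python-docx pytesseract Pillow nltk spacy
python -m spacy download en_core_web_sm
```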


Steps Involved:

Extract Text from PDFs: Utilized PyMuPDF to extract text from PDF files, ensuring accuracy and efficiency. (Note: this handles PDF files with multiple pages.)

```python
import fitz  # PyMuPDF

def extract_text_from_pdf(file_path):
    text = ""
    pdf_document = fitz.open(file_path)
    # Iterate over every page and accumulate its text
    for page_num in range(pdf_document.page_count):
        page = pdf_document.load_page(page_num)
        text += page.get_text()
    pdf_document.close()
    return text

pdf_text = extract_text_from_pdf("sample.pdf")
```


Extract Text from DOCX Files: Leveraged python-docx to seamlessly extract text from DOCX files. (Note: python-docx reads the .docx format; legacy .doc files need to be converted first.)

```python
from docx import Document

def extract_text_from_doc(file_path):
    doc = Document(file_path)
    # Join the text of all paragraphs into a single string
    return ' '.join(para.text for para in doc.paragraphs)

doc_text = extract_text_from_doc("sample.docx")
```


Extract Text from Images: Implemented OCR using pytesseract and Pillow to extract text from images. (pytesseract is a wrapper around the Tesseract OCR engine, which must be installed on the system.)

```python
from PIL import Image
import pytesseract

def extract_text_from_image(image_path):
    image = Image.open(image_path)
    return pytesseract.image_to_string(image)

image_text = extract_text_from_image("sample_image.png")
```
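If the Tesseract binary is not on your PATH (common on Windows), pytesseract can be pointed at it explicitly. This is a configuration sketch; the path below is only an example for a default Windows install and depends entirely on your machine:

```python
import pytesseract

# Example only: adjust to wherever Tesseract is installed on your system.
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
```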

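With all three extractors in place, the folder mentioned in the note above can be processed in one pass. The helper below is my own sketch, not part of the original write-up: it takes a mapping from file extension to extractor callable (e.g. `{'.pdf': extract_text_from_pdf, '.docx': extract_text_from_doc, '.png': extract_text_from_image}`), so it stays independent of any one format.

```python
from pathlib import Path

def extract_folder(folder, handlers):
    """Run the matching extractor for each supported file in `folder`.

    `handlers` maps a lowercase extension to an extractor callable;
    files with unsupported extensions are skipped.
    """
    results = {}
    for path in sorted(Path(folder).glob('*')):
        handler = handlers.get(path.suffix.lower())
        if handler is not None:
            results[path.name] = handler(str(path))
    return results
```

The returned dict maps each file name to its extracted text, ready for the preprocessing step below.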

Text Preprocessing: Cleaned and prepared the text data using NLTK and spaCy.

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

def preprocess_text(text):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(text)
    # Keep alphanumeric tokens that are not stop words; compare in
    # lowercase, since the NLTK stop-word list is all lowercase
    filtered_text = ' '.join(
        word for word in word_tokens
        if word.isalnum() and word.lower() not in stop_words
    )
    return filtered_text

cleaned_text = preprocess_text(pdf_text)
```


Metadata Extraction: Extracted titles, authors, dates, and keywords using regex and NLP models.

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_title(text):
    # Prefer the first sentence if it is short; otherwise fall back
    # to the first line
    first_line = text.split('\n')[0]
    first_sentence = text.split('.')[0]
    return first_sentence if len(first_sentence) < 100 else first_line

def extract_author(text):
    # Match patterns like "By Jane Doe" or "Author: Jane Doe"
    author_pattern = re.compile(
        r'(By\s+[A-Z][a-z]+\s+[A-Z][a-z]+)|(Author:\s+[A-Z][a-z]+\s+[A-Z][a-z]+)'
    )
    match = author_pattern.search(text)
    return match.group(0) if match else "Unknown"

def extract_date(text):
    # Match dates like 01/02/2024, 1-2-24, or 2024-01-02
    date_pattern = re.compile(r'\b(?:\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\d{4}-\d{2}-\d{2})\b')
    match = date_pattern.search(text)
    return match.group(0) if match else "Unknown"

def extract_keywords(text, n=10):
    doc = nlp(text)
    keywords = [token.text for token in doc if token.is_alpha and not token.is_stop]
    return keywords[:n]

def extract_entities(text):
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

title = extract_title(pdf_text)
author = extract_author(pdf_text)
date = extract_date(pdf_text)
keywords = extract_keywords(cleaned_text)
entities = extract_entities(cleaned_text)
```
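Note that `extract_keywords` above simply takes the first n non-stop-word tokens. A frequency-ranked variant (a sketch of an alternative, not part of the original pipeline; the regex tokenizer and the `stop_words` parameter are my simplifications) often surfaces more representative keywords:

```python
from collections import Counter
import re

def extract_keywords_by_frequency(text, stop_words, n=10):
    # Lowercase, tokenize on alphanumeric runs, drop stop words,
    # then rank the remaining tokens by how often they occur
    tokens = re.findall(r'[A-Za-z0-9]+', text.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    return [word for word, _ in counts.most_common(n)]
```

For example, `extract_keywords_by_frequency("the cat sat on the cat mat", {'the', 'on'}, 2)` ranks "cat" first because it occurs twice.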


Saving Metadata: Stored the extracted metadata in a JSON file for further use.

(Note: you can also save the metadata in CSV format.)


```python
import json

def save_metadata(title, author, date, keywords, entities):
    metadata = {
        'title': title,
        'author': author,
        'date': date,
        'keywords': keywords,
        'entities': entities
    }
    with open('metadata.json', 'w') as f:
        json.dump(metadata, f, indent=4)

save_metadata(title, author, date, keywords, entities)
```
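For the CSV option mentioned in the note above, here is a minimal sketch of a CSV counterpart to `save_metadata`. Joining the list-valued fields (keywords, entities) with "; " is my own choice, made so that each document stays a single flat row:

```python
import csv

def save_metadata_csv(title, author, date, keywords, entities,
                      path='metadata.csv'):
    # One header row plus one data row per call; list fields are
    # flattened into delimited strings
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['title', 'author', 'date', 'keywords', 'entities'])
        writer.writerow([
            title,
            author,
            date,
            '; '.join(keywords),
            '; '.join(f'{text} ({label})' for text, label in entities),
        ])
```

Usage mirrors the JSON version: `save_metadata_csv(title, author, date, keywords, entities)`.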


This project was a fantastic learning experience, demonstrating the power of Python and NLP in handling unstructured data.

