Similar Company Search API Micro-service
1. Overview
Objective: Build a micro-service that identifies companies similar to a given company profile using a combination of firmographic data and textual descriptions. The service ingests raw data from multiple sources, preprocesses it, generates vector embeddings via an external API (e.g., OpenAI), stores the enriched data in PostgreSQL (for full record storage) and FAISS (for fast vector similarity search), and exposes RESTful APIs via a Nest.js backend.
Key Components:
2. System Architecture
High-Level Diagram
AWS Stack Components
3. Data Ingestion & Preprocessing
Data Sources & Ingestion Pipeline
Preprocessing Steps
1. Standardize Numeric Values:
Example (Min-Max Normalization in Python):
import numpy as np
# Example revenue data (in millions)
revenues = np.array([50, 200, 150, 1000])
min_rev = revenues.min() # e.g., 50
max_rev = revenues.max() # e.g., 1000
# Normalize values to [0, 1]
normalized_revenues = (revenues - min_rev) / (max_rev - min_rev)
print(normalized_revenues)
Purpose: Ensures that features like revenue, funding, and employee counts are on a comparable scale for downstream processing and distance computations.
2. Encode Categorical Features:
Example (One-Hot Encoding using Pandas):
import pandas as pd
df = pd.DataFrame({
    'company_id': [1, 2, 3],
    'industry': ['Tech', 'Finance', 'Healthcare']
})
# Replace the 'industry' column with one indicator column per category
encoded_df = pd.get_dummies(df, columns=['industry'])
print(encoded_df)  # Expected columns: company_id, industry_Finance, industry_Healthcare, industry_Tech
3. Clean Textual Fields:
Example (Using NLTK for Stopword Removal and Stemming):
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()
text = "advanced analytics and innovative solutions"
words = text.split()
cleaned_words = [ps.stem(word) for word in words if word.lower() not in stop_words]
cleaned_text = " ".join(cleaned_words)
print(cleaned_text)  # Expected output: "advanc analyt innov solut"
Purpose: Removes irrelevant words and reduces words to their base forms, improving the quality of the input to the embedding model.
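The three preprocessing steps above can be combined into a single per-record function. The following is a minimal sketch using only the standard library, not the pipeline's actual implementation. The revenue bounds, the industry vocabulary, the tiny stop-word set, and the `description` field name are all assumptions made for illustration; stemming is omitted here.

```python
from typing import Dict, List

# Hypothetical reference values; in practice these would be computed from the
# full dataset during ingestion.
REVENUE_BOUNDS = (50.0, 1000.0)            # min and max revenue, in millions
INDUSTRIES = ["Finance", "Healthcare", "Tech"]
STOP_WORDS = {"and", "the", "of", "for"}   # tiny stand-in for NLTK's list


def min_max(value: float, lo: float, hi: float) -> float:
    """Scale value into [0, 1]; return 0.0 when the range is empty."""
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)


def one_hot(value: str, vocabulary: List[str]) -> List[int]:
    """Return a 0/1 vector with a 1 at the position of value, if known."""
    return [1 if value == item else 0 for item in vocabulary]


def remove_stop_words(text: str) -> str:
    """Lowercase the text and drop stop words (no stemming in this sketch)."""
    return " ".join(w for w in text.lower().split() if w not in STOP_WORDS)


def preprocess(record: Dict) -> Dict:
    """Apply the three preprocessing steps to one raw company record."""
    lo, hi = REVENUE_BOUNDS
    return {
        "revenue_scaled": min_max(record["revenue"], lo, hi),
        "industry_one_hot": one_hot(record["industry"], INDUSTRIES),
        "description_clean": remove_stop_words(record["description"]),
    }


example = preprocess(
    {"revenue": 525.0, "industry": "Tech",
     "description": "Advanced analytics and innovative solutions"}
)
print(example)
```

The guard in `min_max` avoids a division by zero when every company in a batch has the same value for a field, a case the NumPy example does not handle.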
4. Embedding Generation
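import os
import requests

def generate_embedding(text: str) -> list:
    """Request an embedding vector for the given text from the OpenAI API."""
    response = requests.post(
        'https://api.openai.com/v1/embeddings',
        # The API key is read from the environment rather than hard-coded.
        headers={'Authorization': f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            'model': 'text-embedding-ada-002',
            'input': text
        },
        timeout=30
    )
    response.raise_for_status()  # Raise on HTTP errors instead of parsing an error body
    embedding = response.json()['data'][0]['embedding']
    return embedding

# Create composite text from company data
# (`company` is assumed to be a dict with 'industry', 'keywords', and 'tech_used' keys)
composite_text = f"{company['industry']} {' '.join(company['keywords'])} {' '.join(company['tech_used'])}"
company_embedding = generate_embedding(composite_text)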
Storage: Save the generated embedding along with the company record in PostgreSQL. Also, store a mapping from each company ID to the position of its vector in the FAISS index, so that search results can be resolved back to company records.
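To illustrate the ID mapping described above, here is a minimal in-memory sketch. It is a pure-Python stand-in that uses brute-force cosine similarity, not FAISS itself; the class name `VectorStore` and its methods are hypothetical and only mirror the add-or-update and search operations the service relies on.

```python
import math
from typing import Dict, List, Tuple


class VectorStore:
    """In-memory stand-in for a vector index with a company-ID mapping."""

    def __init__(self) -> None:
        self._vectors: List[List[float]] = []      # row position -> vector
        self._row_to_company: List[int] = []       # row position -> company ID
        self._company_to_row: Dict[int, int] = {}  # company ID -> row position

    def add_or_update(self, company_id: int, vector: List[float]) -> None:
        """Insert a new vector, or overwrite the row of a known company."""
        row = self._company_to_row.get(company_id)
        if row is None:
            self._company_to_row[company_id] = len(self._vectors)
            self._row_to_company.append(company_id)
            self._vectors.append(list(vector))
        else:
            self._vectors[row] = list(vector)

    def search(self, query: List[float], k: int) -> List[Tuple[int, float]]:
        """Return up to k (company_id, cosine similarity) pairs, best first."""
        scored = [
            (self._row_to_company[row], _cosine(query, vec))
            for row, vec in enumerate(self._vectors)
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


def _cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


store = VectorStore()
store.add_or_update(101, [1.0, 0.0])
store.add_or_update(102, [0.0, 1.0])
store.add_or_update(103, [1.0, 1.0])
store.add_or_update(102, [1.0, 0.1])   # update replaces the earlier vector
hits = store.search([1.0, 0.0], k=2)
print(hits)
```

Keeping both directions of the mapping lets an update reuse the existing row rather than appending a duplicate, and lets search results be translated from row positions back to company IDs.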
5. Storage & Indexing
PostgreSQL (RDS)
CREATE TABLE companies (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255),
    industry VARCHAR(100),
    revenue NUMERIC,
    funding NUMERIC,
    employee_count INTEGER,
    keywords TEXT[],
    tech_used TEXT[],
    embedding REAL[],
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Purpose: Acts as the authoritative store for company profiles and metadata. Supports complex queries and data integrity.
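One detail of the schema above is worth spelling out: `DEFAULT CURRENT_TIMESTAMP` applies only when a row is inserted, so an upsert has to set `updated_at` explicitly. The sketch below builds such a statement as a string. It assumes the caller already knows the company's `id`, and it uses the `%s` placeholder style accepted by drivers such as psycopg; the function name `build_upsert` is hypothetical.

```python
from typing import Dict, List, Tuple

# Updatable columns of the companies table (timestamps handled separately).
COLUMNS = ["name", "industry", "revenue", "funding", "employee_count",
           "keywords", "tech_used", "embedding"]


def build_upsert(company: Dict) -> Tuple[str, List]:
    """Build an INSERT ... ON CONFLICT statement and its parameter list."""
    columns = ["id"] + COLUMNS
    placeholders = ", ".join(["%s"] * len(columns))
    # EXCLUDED refers to the row that was proposed for insertion.
    updates = ", ".join(f"{col} = EXCLUDED.{col}" for col in COLUMNS)
    sql = (
        f"INSERT INTO companies ({', '.join(columns)}) "
        f"VALUES ({placeholders}) "
        f"ON CONFLICT (id) DO UPDATE SET {updates}, "
        f"updated_at = CURRENT_TIMESTAMP"
    )
    # Missing keys become None, which the driver sends as NULL.
    params = [company.get(col) for col in columns]
    return sql, params


sql, params = build_upsert({"id": 7, "name": "Acme", "industry": "Tech"})
print(sql)
print(params)
```

Passing values as parameters rather than formatting them into the string leaves quoting to the database driver.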
FAISS Index Service
6. API Layer (Nest.js Micro-service)
API Endpoints
Sample Controller (Nest.js)
// src/companies/companies.controller.ts
import { Controller, Get, Post, Body, Param, HttpException, HttpStatus } from '@nestjs/common';
import { CompaniesService } from './companies.service';

@Controller('companies')
export class CompaniesController {
  constructor(private readonly companiesService: CompaniesService) {}

  @Post()
  async createOrUpdateCompany(@Body() companyData: any) {
    try {
      const company = await this.companiesService.createOrUpdateCompany(companyData);
      return company;
    } catch (error) {
      throw new HttpException('Error creating/updating company', HttpStatus.INTERNAL_SERVER_ERROR);
    }
  }

  @Get(':id/similar')
  async getSimilarCompanies(@Param('id') id: string) {
    try {
      const similarCompanies = await this.companiesService.findSimilarCompanies(id);
      return similarCompanies;
    } catch (error) {
      throw new HttpException('Error fetching similar companies', HttpStatus.INTERNAL_SERVER_ERROR);
    }
  }
}
Service Layer (Nest.js)
// src/companies/companies.service.ts
import { Injectable, Logger } from '@nestjs/common';
import axios from 'axios';
import { DatabaseService } from '../database/database.service'; // Interface for PostgreSQL operations
import { FaissService } from '../faiss/faiss.service'; // Interface for FAISS operations

@Injectable()
export class CompaniesService {
  private readonly logger = new Logger(CompaniesService.name);

  constructor(
    private readonly dbService: DatabaseService,
    private readonly faissService: FaissService,
  ) {}

  async createOrUpdateCompany(companyData: any): Promise<any> {
    // 1. Data Preprocessing
    const cleanedData = this.cleanCompanyData(companyData);
    // 2. Embedding Generation
    const compositeText = `${cleanedData.industry} ${cleanedData.keywords.join(' ')} ${cleanedData.tech_used.join(' ')}`;
    const embedding = await this.generateEmbedding(compositeText);
    cleanedData.embedding = embedding;
    // 3. Save to PostgreSQL
    const company = await this.dbService.upsertCompany(cleanedData);
    // 4. Update FAISS Index
    await this.faissService.addOrUpdateVector(company.id, embedding);
    return company;
  }

  async findSimilarCompanies(companyId: string): Promise<any[]> {
    const company = await this.dbService.getCompanyById(companyId);
    if (!company || !company.embedding) {
      throw new Error('Company not found or missing embedding');
    }
    const similarVectors = await this.faissService.search(company.embedding, 10);
    const similarCompanyIds = similarVectors.map(v => v.companyId);
    const similarCompanies = await this.dbService.getCompaniesByIds(similarCompanyIds);
    return similarCompanies;
  }

  cleanCompanyData(data: any): any {
    // Implement normalization, one-hot encoding, and text cleaning here
    // e.g., normalize numeric fields, encode categorical values, clean keywords/texts.
    return data;
  }

  async generateEmbedding(text: string): Promise<number[]> {
    const response = await axios.post(
      'https://api.openai.com/v1/embeddings',
      {
        model: 'text-embedding-ada-002',
        input: text,
      },
      {
        headers: { 'Authorization': `Bearer ${process.env.OPENAI_API_KEY}` },
      }
    );
    return response.data.data[0].embedding;
  }
}
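Two details of the similarity lookup in the service above deserve care. Searching with a company's own embedding will typically return that company as the top hit, and a bulk lookup by ID (for example, a SQL `IN` query) does not guarantee that rows come back in rank order. The following Python sketch shows one way to handle both, under the assumption that the index returns `(company_id, score)` pairs sorted best first; the function name `rank_similar` and the record shape are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_similar(
    query_id: int,
    hits: List[Tuple[int, float]],
    records: Dict[int, dict],
    k: int,
) -> List[dict]:
    """Drop the query company from the hits and return records in rank order."""
    ranked = []
    for company_id, score in hits:          # hits are assumed sorted best first
        if company_id == query_id:
            continue                        # skip the query company itself
        record = records.get(company_id)
        if record is None:
            continue                        # index entry with no database row
        ranked.append({**record, "score": score})
        if len(ranked) == k:
            break
    return ranked


# Illustrative data: company 1 is the query and appears as its own best match.
hits = [(1, 1.0), (3, 0.92), (2, 0.85), (4, 0.40)]
records = {2: {"id": 2, "name": "B"}, 3: {"id": 3, "name": "C"},
           4: {"id": 4, "name": "D"}}
top = rank_similar(query_id=1, hits=hits, records=records, k=2)
print([r["id"] for r in top])
```

With this approach the service would request one more neighbor than it intends to return (k + 1), so that k results remain after the query company is filtered out.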
7. Deployment & Operational Considerations
Deployment:
Scalability & Maintenance:
Security & Logging:
8. Summary
This design describes a micro-service architecture for a similar company search API built on an AWS stack. It covers end-to-end data ingestion, preprocessing (normalization, encoding, text cleaning), embedding generation via OpenAI’s API, and storage in both PostgreSQL (for full records) and FAISS (for high-performance similarity search). The Nest.js API layer orchestrates these components, providing a robust backend for targeted sales and lead generation without frontend considerations.