Similar Company Search API Micro-service

Manish Katyan

Published: February 27, 2025

1. Overview

Objective: Build a micro-service that identifies companies similar to a given company profile using a combination of firmographic data and textual descriptions. The service ingests raw data from multiple sources, preprocesses it, generates vector embeddings via an external API (e.g., OpenAI), stores the enriched data in PostgreSQL (for full record storage) and FAISS (for fast vector similarity search), and exposes RESTful APIs via a Nest.js backend.

Key Components:

Data Ingestion & Preprocessing: Extract data, clean and standardize numeric fields, encode categorical features, and clean textual fields.
Embedding Generation: Use OpenAI’s embedding model to generate vector representations.
Storage: PostgreSQL for persistent storage of full company records and metadata; FAISS for efficient nearest-neighbor (similarity) searches.
API Layer: Nest.js-based REST API exposing endpoints for creating/updating company data and retrieving similar companies.

2. System Architecture

High-Level Diagram

AWS Stack Components

AWS Lambda: For data ingestion, preprocessing, and embedding generation using Python.
Amazon RDS (PostgreSQL): To store complete company profiles, including raw, cleaned, and normalized data.
FAISS Service: Hosted on an EC2 instance (or as a container in AWS Fargate) with a Python-based micro-service to manage the FAISS index.
Nest.js API: Deployed (e.g., in an ECS cluster or as a container in AWS Fargate) to expose REST endpoints.

3. Data Ingestion & Preprocessing

Data Sources & Ingestion Pipeline

Sources: External public company databases, internal CRM exports, third-party APIs.
Ingestion Process: AWS Lambda functions (triggered on a schedule or by an event) fetch raw data, which is passed through a preprocessing pipeline before storage.
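The ingestion flow described above might be wired into a Lambda entry point roughly as follows. This is a minimal sketch, not the author's implementation: fetch_raw_records and preprocess are hypothetical placeholders, and the "source" event field is an assumption made for illustration.

```python
import json
from typing import Any, Dict, List


def fetch_raw_records(source: str) -> List[Dict[str, Any]]:
    """Hypothetical fetcher; a real one would call a CRM export or third-party API."""
    # Hard-coded sample so the sketch runs without network access.
    return [{"name": " Acme Corp ", "industry": "Tech", "revenue": "200"}]


def preprocess(record: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical cleaning step: trim strings and cast revenue to float."""
    return {
        "name": record["name"].strip(),
        "industry": record["industry"].strip(),
        "revenue": float(record["revenue"]),
    }


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Lambda entry point: fetch, preprocess, and report how many records were handled."""
    # The "source" key is an assumed event field, not part of any AWS contract.
    source = event.get("source", "crm")
    cleaned = [preprocess(r) for r in fetch_raw_records(source)]
    # A real handler would persist `cleaned` (for example to RDS) at this point.
    return {"statusCode": 200, "body": json.dumps({"processed": len(cleaned)})}
```

Keeping the fetch and preprocessing steps as separate functions makes each one easy to test outside Lambda.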

Preprocessing Steps

1. Standardize Numeric Values:

Example (Min-Max Normalization in Python):

import numpy as np

# Example revenue data (in millions)
revenues = np.array([50, 200, 150, 1000])
min_rev = revenues.min()  # e.g., 50
max_rev = revenues.max()  # e.g., 1000

# Normalize values to [0, 1]
normalized_revenues = (revenues - min_rev) / (max_rev - min_rev)
print(normalized_revenues)

Purpose: Ensures that features like revenue, funding, and employee counts are on a comparable scale for downstream processing and distance computations.
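Because records arrive incrementally, the minimum and maximum used for scaling generally need to be remembered and reused, otherwise each batch would be scaled against a different range. The sketch below illustrates one way to do that in plain Python. RangeScaler is a hypothetical helper introduced here, and clipping out-of-range values to [0, 1] is an assumed design choice.

```python
from typing import Iterable, List


class RangeScaler:
    """Remembers the fitted range so later records are scaled consistently."""

    def fit(self, values: Iterable[float]) -> "RangeScaler":
        data = list(values)
        self.min_ = min(data)
        self.max_ = max(data)
        return self

    def transform(self, values: Iterable[float]) -> List[float]:
        span = self.max_ - self.min_
        if span == 0:
            # All fitted values were identical; map everything to 0.0.
            return [0.0 for _ in values]
        # Clip so values outside the fitted range stay within [0, 1].
        return [min(1.0, max(0.0, (v - self.min_) / span)) for v in values]


scaler = RangeScaler().fit([50, 200, 150, 1000])
scaled = scaler.transform([50, 1000, 2000])
print(scaled)  # [0.0, 1.0, 1.0]
```

The zero-span guard avoids a division by zero when every fitted value is the same.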

2. Encode Categorical Features:

Example (One-Hot Encoding using Pandas):

import pandas as pd

df = pd.DataFrame({
    'company_id': [1, 2, 3],
    'industry': ['Tech', 'Finance', 'Healthcare']
})

encoded_df = pd.get_dummies(df, columns=['industry'])
print(encoded_df)  # Expected columns: company_id, industry_Finance, industry_Healthcare, industry_Tech

Purpose: Converts categorical features (e.g., industry, segments, location) into numeric vectors that can be concatenated with other numerical data. For high-cardinality fields, consider dense embeddings.
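When records are encoded in separate batches, the set of one-hot columns depends on which categories happen to appear in each batch. One way to keep vectors the same length and order is to encode against a fixed category list, sketched below. The INDUSTRIES list and the one_hot helper are assumptions for illustration, as is mapping unseen values to an all-zero vector.

```python
from typing import List

# Assumed fixed category list; in practice it would be stored alongside the pipeline.
INDUSTRIES: List[str] = ["Finance", "Healthcare", "Tech"]


def one_hot(value: str, vocabulary: List[str]) -> List[int]:
    """Return a one-hot vector; values outside the vocabulary map to all zeros."""
    return [1 if value == category else 0 for category in vocabulary]


print(one_hot("Tech", INDUSTRIES))    # [0, 0, 1]
print(one_hot("Retail", INDUSTRIES))  # [0, 0, 0]
```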

3. Clean Textual Fields:

Example (Using NLTK for Stopword Removal and Stemming):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

text = "advanced analytics and innovative solutions"
words = text.split()
cleaned_words = [ps.stem(word) for word in words if word.lower() not in stop_words]
cleaned_text = " ".join(cleaned_words)

print(cleaned_text)  # Expected output: "advanc analyt innov solut"

Purpose: Removes irrelevant words and reduces words to their base forms, improving the quality of the input to the embedding model.

4. Embedding Generation

Process: After preprocessing, a composite text is formed by combining cleaned textual fields (e.g., keywords, tech used) and, optionally, normalized structured fields.
Implementation Example:

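import os

import requests

def generate_embedding(text: str) -> list:
    """Request an embedding vector for the given text from the OpenAI API."""
    response = requests.post(
        'https://api.openai.com/v1/embeddings',
        # Read the API key from the environment rather than hard-coding it
        headers={'Authorization': f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            'model': 'text-embedding-ada-002',
            'input': text
        },
        timeout=30
    )
    response.raise_for_status()  # Surface HTTP errors before parsing the body
    embedding = response.json()['data'][0]['embedding']
    return embedding

# Example company record (illustrative values only)
company = {'industry': 'Tech', 'keywords': ['analytics', 'cloud'], 'tech_used': ['python', 'postgresql']}

# Create composite text from company data
composite_text = f"{company['industry']} {' '.join(company['keywords'])} {' '.join(company['tech_used'])}"
company_embedding = generate_embedding(composite_text)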

Storage: Save the generated embedding along with the company record in PostgreSQL. Also, store a mapping from company ID to the FAISS vector index.

5. Storage & Indexing

PostgreSQL (RDS)

Schema Example:

CREATE TABLE companies (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255),
    industry VARCHAR(100),
    revenue NUMERIC,
    funding NUMERIC,
    employee_count INTEGER,
    keywords TEXT[],
    tech_used TEXT[],
    embedding REAL[],
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Purpose: Acts as the authoritative store for company profiles and metadata. Supports complex queries and data integrity.

FAISS Index Service

Function: Maintain an in-memory index of embeddings for fast similarity search.
Operations:
1. Add/Update: When a new company record is created or updated, add or update its embedding in FAISS.
2. Search: Given a query embedding, return the nearest-neighbor company IDs and distances.
Integration: The Nest.js API calls the FAISS service (via a REST endpoint or a direct library interface) to obtain similar company IDs. These IDs are then used to retrieve full records from PostgreSQL.
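The add/update and search operations above can be illustrated with a brute-force stand-in written in plain Python. This is a sketch of the contract only, not of FAISS itself: InMemoryVectorIndex is a hypothetical class, it scans every stored vector on each query, and it returns cosine similarities where the service described above returns distances.

```python
import math
from typing import Dict, List, Tuple


class InMemoryVectorIndex:
    """Brute-force stand-in for the FAISS service, keyed by company ID."""

    def __init__(self) -> None:
        self._vectors: Dict[int, List[float]] = {}

    def add_or_update(self, company_id: int, embedding: List[float]) -> None:
        # Store unit-length vectors so a dot product equals cosine similarity.
        norm = math.sqrt(sum(x * x for x in embedding))
        if norm == 0:
            raise ValueError("embedding must be non-zero")
        self._vectors[company_id] = [x / norm for x in embedding]

    def search(self, query: List[float], k: int) -> List[Tuple[int, float]]:
        norm = math.sqrt(sum(x * x for x in query))
        if norm == 0:
            raise ValueError("query must be non-zero")
        unit = [x / norm for x in query]
        scored = [
            (cid, sum(a * b for a, b in zip(unit, vec)))
            for cid, vec in self._vectors.items()
        ]
        # Highest cosine similarity first.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


index = InMemoryVectorIndex()
index.add_or_update(1, [1.0, 0.0])
index.add_or_update(2, [0.0, 1.0])
index.add_or_update(3, [1.0, 1.0])
print([cid for cid, _ in index.search([1.0, 0.1], k=2)])  # [1, 3]
```

Keying the store by company ID means an update simply overwrites the previous vector, which is the behavior the Add/Update operation calls for.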

6. API Layer (Nest.js Micro-service)

API Endpoints

POST /companies: Create or update a company profile. Triggers data preprocessing, embedding generation, and updates to both PostgreSQL and FAISS.
GET /companies/:id/similar: Retrieve a list of similar companies. Retrieves the company embedding from PostgreSQL, queries FAISS for similar vectors, and fetches full company details.

Sample Controller (Nest.js)

// src/companies/companies.controller.ts
import { Controller, Get, Post, Body, Param, HttpException, HttpStatus } from '@nestjs/common';
import { CompaniesService } from './companies.service';

@Controller('companies')
export class CompaniesController {
  constructor(private readonly companiesService: CompaniesService) {}

  @Post()
  async createOrUpdateCompany(@Body() companyData: any) {
    try {
      const company = await this.companiesService.createOrUpdateCompany(companyData);
      return company;
    } catch (error) {
      throw new HttpException('Error creating/updating company', HttpStatus.INTERNAL_SERVER_ERROR);
    }
  }

  @Get(':id/similar')
  async getSimilarCompanies(@Param('id') id: string) {
    try {
      const similarCompanies = await this.companiesService.findSimilarCompanies(id);
      return similarCompanies;
    } catch (error) {
      throw new HttpException('Error fetching similar companies', HttpStatus.INTERNAL_SERVER_ERROR);
    }
  }
}

Service Layer (Nest.js)

// src/companies/companies.service.ts
import { Injectable, Logger } from '@nestjs/common';
import axios from 'axios';
import { DatabaseService } from '../database/database.service'; // Interface for PostgreSQL operations
import { FaissService } from '../faiss/faiss.service'; // Interface for FAISS operations

@Injectable()
export class CompaniesService {
  private readonly logger = new Logger(CompaniesService.name);

  constructor(
    private readonly dbService: DatabaseService,
    private readonly faissService: FaissService,
  ) {}

  async createOrUpdateCompany(companyData: any): Promise<any> {
    // 1. Data Preprocessing
    const cleanedData = this.cleanCompanyData(companyData);
    
    // 2. Embedding Generation
    const compositeText = `${cleanedData.industry} ${cleanedData.keywords.join(' ')} ${cleanedData.tech_used.join(' ')}`;
    const embedding = await this.generateEmbedding(compositeText);
    cleanedData.embedding = embedding;

    // 3. Save to PostgreSQL
    const company = await this.dbService.upsertCompany(cleanedData);

    // 4. Update FAISS Index
    await this.faissService.addOrUpdateVector(company.id, embedding);

    return company;
  }

  async findSimilarCompanies(companyId: string): Promise<any[]> {
    const company = await this.dbService.getCompanyById(companyId);
    if (!company || !company.embedding) {
      throw new Error('Company not found or missing embedding');
    }

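    // Request one extra neighbor: the query company is itself in the index
    const similarVectors = await this.faissService.search(company.embedding, 11);
    const similarCompanyIds = similarVectors
      .map(v => v.companyId)
      .filter(id => String(id) !== String(companyId))
      .slice(0, 10);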
    const similarCompanies = await this.dbService.getCompaniesByIds(similarCompanyIds);
    return similarCompanies;
  }

  cleanCompanyData(data: any): any {
    // Implement normalization, one-hot encoding, and text cleaning here
    // e.g., normalize numeric fields, encode categorical values, clean keywords/texts.
    return data;
  }

  async generateEmbedding(text: string): Promise<number[]> {
    const response = await axios.post(
      'https://api.openai.com/v1/embeddings',
      {
        model: 'text-embedding-ada-002',
        input: text,
      },
      {
        headers: { 'Authorization': `Bearer ${process.env.OPENAI_API_KEY}` },
      }
    );
    return response.data.data[0].embedding;
  }
}

7. Deployment & Operational Considerations

Deployment:

Package the Nest.js micro-service in Docker and deploy via AWS ECS/Fargate (preferable for ease of deployment and scaling).
Use AWS Lambda for ETL/preprocessing jobs.
Host PostgreSQL on Amazon RDS and the FAISS service on an EC2 instance/Fargate.

Scalability & Maintenance:

Schedule regular ETL jobs to update company records and refresh the FAISS index.
Monitor latency and throughput for FAISS searches and API endpoints.
Ensure data consistency between PostgreSQL and the FAISS index through periodic reconciliation jobs.
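A periodic reconciliation job could start by comparing the set of company IDs in PostgreSQL with the set held in the FAISS index. The sketch below shows only that comparison step. plan_reconciliation is a hypothetical helper, and how each ID list is obtained is left as an assumption.

```python
from typing import Dict, Iterable, Set


def plan_reconciliation(db_ids: Iterable[int], index_ids: Iterable[int]) -> Dict[str, Set[int]]:
    """Compare ID sets and report what the index is missing or should drop."""
    db_set, index_set = set(db_ids), set(index_ids)
    return {
        "to_add": db_set - index_set,     # in PostgreSQL, absent from the index
        "to_remove": index_set - db_set,  # in the index, no longer in PostgreSQL
    }


plan = plan_reconciliation(db_ids=[1, 2, 3, 4], index_ids=[2, 3, 5])
print(sorted(plan["to_add"]), sorted(plan["to_remove"]))  # [1, 4] [5]
```

This check catches missing and orphaned vectors only; detecting embeddings that are stale relative to an updated record would need an extra signal such as the updated_at column.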

Security & Logging:

Secure API endpoints with proper authentication.
Log ingestion, embedding generation, and index update metrics for troubleshooting and monitoring.

8. Summary

This design describes a micro-service architecture for a similar company search API built on an AWS stack. It covers end-to-end data ingestion, preprocessing (normalization, encoding, text cleaning), embedding generation via OpenAI’s API, and storage in both PostgreSQL (for full records) and FAISS (for high-performance similarity search). The Nest.js API layer orchestrates these components, providing a robust backend for targeted sales and lead generation without frontend considerations.
