Similar Company Search API Micro-service

1. Overview

Objective: Build a micro-service that identifies companies similar to a given company profile using a combination of firmographic data and textual descriptions. The service ingests raw data from multiple sources, preprocesses it, generates vector embeddings via an external API (e.g., OpenAI), stores the enriched data in PostgreSQL (for full record storage) and FAISS (for fast vector similarity search), and exposes RESTful APIs via a Nest.js backend.

Key Components:

  • Data Ingestion & Preprocessing: Extract data, clean and standardize numeric fields, encode categorical features, and clean textual fields.
  • Embedding Generation: Use OpenAI’s embedding model to generate vector representations.
  • Storage: PostgreSQL for persistent storage of full company records and metadata; FAISS for efficient nearest-neighbor (similarity) searches.
  • API Layer: Nest.js-based REST API exposing endpoints for creating/updating company data and retrieving similar companies.


2. System Architecture

High-Level Diagram


[Figure: high-level diagram]

AWS Stack Components

  • AWS Lambda: For data ingestion, preprocessing, and embedding generation using Python.
  • Amazon RDS (PostgreSQL): To store complete company profiles, including raw, cleaned, and normalized data.
  • FAISS Service: Hosted on an EC2 instance (or as a container in AWS Fargate) with a Python-based micro-service to manage the FAISS index.
  • Nest.js API: Deployed (e.g., in an ECS cluster or as a container in AWS Fargate) to expose REST endpoints.


3. Data Ingestion & Preprocessing

Data Sources & Ingestion Pipeline

  • Sources: External public company databases, internal CRM exports, third-party APIs.
  • Ingestion Process: AWS Lambda functions (triggered on a schedule or by an event) fetch raw data, which is passed through a preprocessing pipeline before storage.
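The ingestion step can be sketched as a Lambda handler. This is a minimal illustration, not the author's implementation: the `records` key in the event and the `preprocess` helper are hypothetical, and the write to PostgreSQL is left as a comment.

```python
import json
from typing import Any


def preprocess(record: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical placeholder for the preprocessing steps described in this section."""
    cleaned = dict(record)
    cleaned["name"] = str(record.get("name", "")).strip()
    return cleaned


def lambda_handler(event: dict[str, Any], context: Any) -> dict[str, Any]:
    """Entry point invoked by AWS Lambda; `event["records"]` is an assumed payload shape."""
    raw_records = event.get("records", [])
    processed = [preprocess(record) for record in raw_records]
    # A real pipeline would persist `processed` to PostgreSQL at this point.
    return {"statusCode": 200, "body": json.dumps({"processed": len(processed)})}
```

Keeping `preprocess` as a pure function makes it easy to unit-test outside Lambda and to reuse from the API path.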

Preprocessing Steps

1. Standardize Numeric Values:

Example (Min-Max Normalization in Python):

import numpy as np

# Example revenue data (in millions)
revenues = np.array([50, 200, 150, 1000])
min_rev = revenues.min()  # e.g., 50
max_rev = revenues.max()  # e.g., 1000

# Normalize values to [0, 1]
normalized_revenues = (revenues - min_rev) / (max_rev - min_rev)
print(normalized_revenues)        

Purpose: Ensures that features like revenue, funding, and employee counts are on a comparable scale for downstream processing and distance computations.
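In a running service, a newly ingested company has to be scaled with the same minimum and maximum as the existing data, otherwise its values are not comparable. The sketch below assumes the bounds are fitted once and stored; the names `MinMaxBounds`, `fit_bounds`, and `normalize` are hypothetical. It also guards against a constant column (which would divide by zero) and clips values outside the fitted range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinMaxBounds:
    low: float
    high: float


def fit_bounds(values: list[float]) -> MinMaxBounds:
    """Compute the bounds once from the reference dataset."""
    return MinMaxBounds(low=min(values), high=max(values))


def normalize(value: float, bounds: MinMaxBounds) -> float:
    """Scale `value` to [0, 1] using stored bounds."""
    if bounds.high == bounds.low:
        return 0.0  # constant column: avoid division by zero
    scaled = (value - bounds.low) / (bounds.high - bounds.low)
    return min(1.0, max(0.0, scaled))  # clip values outside the fitted range


bounds = fit_bounds([50, 200, 150, 1000])
print(normalize(525, bounds))   # 0.5
print(normalize(2000, bounds))  # 1.0 (clipped)
```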

2. Encode Categorical Features:

Example (One-Hot Encoding using Pandas):

import pandas as pd

df = pd.DataFrame({
    'company_id': [1, 2, 3],
    'industry': ['Tech', 'Finance', 'Healthcare']
})

encoded_df = pd.get_dummies(df, columns=['industry'])
print(encoded_df)  # Expected columns: company_id, industry_Finance, industry_Healthcare, industry_Tech

Purpose: Converts categorical features (e.g., industry, segments, location) into numeric vectors that can be concatenated with other numerical data. For high-cardinality fields, consider dense embeddings.
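The concatenation mentioned above can be sketched in plain Python. The field names `revenue_norm` and `employees_norm` and the fixed `INDUSTRIES` list are assumptions for illustration. The category order must stay fixed so that every company's vector has the same layout.

```python
def one_hot(value: str, categories: list[str]) -> list[float]:
    """Return a one-hot vector; an unseen category maps to all zeros."""
    return [1.0 if value == category else 0.0 for category in categories]


def build_feature_vector(company: dict, industries: list[str]) -> list[float]:
    """Concatenate already-normalized numeric fields with the encoded industry."""
    numeric = [company["revenue_norm"], company["employees_norm"]]
    return numeric + one_hot(company["industry"], industries)


INDUSTRIES = ["Finance", "Healthcare", "Tech"]  # fixed order, assumed known up front

vector = build_feature_vector(
    {"industry": "Tech", "revenue_norm": 0.25, "employees_norm": 0.5},
    INDUSTRIES,
)
print(vector)  # [0.25, 0.5, 0.0, 0.0, 1.0]
```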

3. Clean Textual Fields:

Example (Using NLTK for Stopword Removal and Stemming):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

text = "advanced analytics and innovative solutions"
words = text.split()
cleaned_words = [ps.stem(word) for word in words if word.lower() not in stop_words]
cleaned_text = " ".join(cleaned_words)

print(cleaned_text)  # Expected output: "advanc analyt innov solut"

Purpose: Removes irrelevant words and reduces words to their base forms, improving the quality of the input to the embedding model.

4. Embedding Generation

  • Process: After preprocessing, a composite text is formed by combining cleaned textual fields (e.g., keywords, tech used) and, optionally, normalized structured fields.
  • Implementation Example:

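import os

import requests

# Read the API key from the environment instead of hard-coding it
OPENAI_API_KEY = os.environ['OPENAI_API_KEY']

def generate_embedding(text: str) -> list:
    """Request an embedding vector for `text` from the OpenAI embeddings endpoint."""
    response = requests.post(
        'https://api.openai.com/v1/embeddings',
        headers={'Authorization': f'Bearer {OPENAI_API_KEY}'},
        json={
            'model': 'text-embedding-ada-002',
            'input': text
        },
        timeout=30,
    )
    response.raise_for_status()  # surface HTTP errors before parsing the body
    embedding = response.json()['data'][0]['embedding']
    return embedding

# Example company record (illustrative values only)
company = {
    'industry': 'Tech',
    'keywords': ['analytics', 'cloud'],
    'tech_used': ['python', 'postgresql'],
}

# Create composite text from company data
composite_text = f"{company['industry']} {' '.join(company['keywords'])} {' '.join(company['tech_used'])}"
company_embedding = generate_embedding(composite_text)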

Storage: Save the generated embedding along with the company record in PostgreSQL. Also, store a mapping from each company ID to the position of its vector in the FAISS index.
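A plain FAISS index returns row positions rather than application IDs, which is why the mapping is needed. The sketch below shows one way to keep it; the class `VectorIdMap` and its method names are hypothetical. FAISS also offers an `IndexIDMap` wrapper with `add_with_ids` that stores caller-supplied 64-bit IDs directly, which may remove the need for a separate mapping.

```python
from typing import Optional


class VectorIdMap:
    """Bidirectional mapping between company IDs and index row positions."""

    def __init__(self) -> None:
        self._company_to_pos: dict[int, int] = {}
        self._pos_to_company: list[int] = []

    def register(self, company_id: int) -> int:
        """Return the row position for `company_id`, allocating the next row if new."""
        if company_id in self._company_to_pos:
            return self._company_to_pos[company_id]
        position = len(self._pos_to_company)
        self._pos_to_company.append(company_id)
        self._company_to_pos[company_id] = position
        return position

    def company_for(self, position: int) -> int:
        """Translate a row position returned by a search back to a company ID."""
        return self._pos_to_company[position]

    def position_for(self, company_id: int) -> Optional[int]:
        """Return the row position for a company, or None if it was never indexed."""
        return self._company_to_pos.get(company_id)
```

The mapping has to be persisted (for example, in a PostgreSQL table) so that it survives a restart of the index service.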

5. Storage & Indexing

PostgreSQL (RDS)

  • Schema Example:

CREATE TABLE companies (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255),
    industry VARCHAR(100),
    revenue NUMERIC,
    funding NUMERIC,
    employee_count INTEGER,
    keywords TEXT[],
    tech_used TEXT[],
    embedding REAL[],
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);        

Purpose: Acts as the authoritative store for company profiles and metadata. Supports complex queries and data integrity.

FAISS Index Service

  • Function: Maintain an in-memory index of embeddings for fast similarity search.
  • Operations:
      1. Add/Update: When a company record is created or updated, add or update its embedding in FAISS.
      2. Search: Given a query embedding, return the nearest-neighbor company IDs and distances.
  • Integration: The Nest.js API calls the FAISS service (via a REST endpoint or a direct library interface) to obtain similar company IDs. These IDs are then used to retrieve full records from PostgreSQL.
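The add/update and search contract described above can be illustrated with a brute-force stand-in. This is not FAISS: it performs an exact linear scan with Euclidean distance, and the class `BruteForceIndex` and its method names are hypothetical. It is useful mainly as a reference for what the real service should return, or as a test double.

```python
import math
from typing import Optional


class BruteForceIndex:
    """Exact nearest-neighbor search over an in-memory dict (stand-in for FAISS)."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self._vectors: dict[int, list[float]] = {}

    def add_or_update(self, company_id: int, vector: list[float]) -> None:
        """Insert a new vector or overwrite the existing one for this company."""
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")
        self._vectors[company_id] = list(vector)

    def search(
        self, query: list[float], k: int, exclude_id: Optional[int] = None
    ) -> list[tuple[int, float]]:
        """Return up to k (company_id, distance) pairs, closest first."""
        scored = [
            (company_id, math.dist(query, vector))
            for company_id, vector in self._vectors.items()
            if company_id != exclude_id
        ]
        scored.sort(key=lambda pair: pair[1])
        return scored[:k]


index = BruteForceIndex(dim=2)
index.add_or_update(1, [0.0, 0.0])
index.add_or_update(2, [1.0, 0.0])
index.add_or_update(3, [5.0, 5.0])
print(index.search([0.0, 0.0], k=2, exclude_id=1))
```

The `exclude_id` parameter reflects a practical detail: when the query is a stored company's own embedding, that company is its own nearest neighbor and usually needs to be filtered out.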


6. API Layer (Nest.js Micro-service)

API Endpoints

  1. POST /companies. Function: Create or update a company profile. Triggers data preprocessing, embedding generation, and updates to both PostgreSQL and FAISS.
  2. GET /companies/:id/similar. Function: Retrieve a list of similar companies. Retrieves the company embedding from PostgreSQL, queries FAISS for similar vectors, and fetches full company details.
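From a client's point of view, the two endpoints could be called as sketched below, using only the Python standard library. The base URL and the payload fields are assumptions; the article does not specify the request schema beyond the fields in the database table.

```python
import json
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "http://localhost:3000"  # assumed address of the Nest.js service


def build_upsert_request(company: dict) -> Request:
    """Build the POST /companies request with a JSON body."""
    return Request(
        f"{BASE_URL}/companies",
        data=json.dumps(company).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_similar_request(company_id: int) -> Request:
    """Build the GET /companies/:id/similar request."""
    return Request(
        f"{BASE_URL}/companies/{quote(str(company_id), safe='')}/similar",
        method="GET",
    )
```

Passing either object to `urllib.request.urlopen` sends the request; the response body is expected to be JSON.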

Sample Controller (Nest.js)

// src/companies/companies.controller.ts
import { Controller, Get, Post, Body, Param, HttpException, HttpStatus } from '@nestjs/common';
import { CompaniesService } from './companies.service';

@Controller('companies')
export class CompaniesController {
  constructor(private readonly companiesService: CompaniesService) {}

  @Post()
  async createOrUpdateCompany(@Body() companyData: any) {
    try {
      const company = await this.companiesService.createOrUpdateCompany(companyData);
      return company;
    } catch (error) {
      throw new HttpException('Error creating/updating company', HttpStatus.INTERNAL_SERVER_ERROR);
    }
  }

  @Get(':id/similar')
  async getSimilarCompanies(@Param('id') id: string) {
    try {
      const similarCompanies = await this.companiesService.findSimilarCompanies(id);
      return similarCompanies;
    } catch (error) {
      throw new HttpException('Error fetching similar companies', HttpStatus.INTERNAL_SERVER_ERROR);
    }
  }
}        

Service Layer (Nest.js)

// src/companies/companies.service.ts
import { Injectable, Logger } from '@nestjs/common';
import axios from 'axios';
import { DatabaseService } from '../database/database.service'; // Interface for PostgreSQL operations
import { FaissService } from '../faiss/faiss.service'; // Interface for FAISS operations

@Injectable()
export class CompaniesService {
  private readonly logger = new Logger(CompaniesService.name);

  constructor(
    private readonly dbService: DatabaseService,
    private readonly faissService: FaissService,
  ) {}

  async createOrUpdateCompany(companyData: any): Promise<any> {
    // 1. Data Preprocessing
    const cleanedData = this.cleanCompanyData(companyData);
    
    // 2. Embedding Generation
    const compositeText = `${cleanedData.industry} ${cleanedData.keywords.join(' ')} ${cleanedData.tech_used.join(' ')}`;
    const embedding = await this.generateEmbedding(compositeText);
    cleanedData.embedding = embedding;

    // 3. Save to PostgreSQL
    const company = await this.dbService.upsertCompany(cleanedData);

    // 4. Update FAISS Index
    await this.faissService.addOrUpdateVector(company.id, embedding);

    return company;
  }

  async findSimilarCompanies(companyId: string): Promise<any[]> {
    const company = await this.dbService.getCompanyById(companyId);
    if (!company || !company.embedding) {
      throw new Error('Company not found or missing embedding');
    }

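    // The query company is normally its own nearest neighbor, so request one extra result and drop it
    const similarVectors = await this.faissService.search(company.embedding, 11);
    const similarCompanyIds = similarVectors
      .map(v => v.companyId)
      .filter(id => String(id) !== String(companyId))
      .slice(0, 10);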
    const similarCompanies = await this.dbService.getCompaniesByIds(similarCompanyIds);
    return similarCompanies;
  }

  cleanCompanyData(data: any): any {
    // Implement normalization, one-hot encoding, and text cleaning here
    // e.g., normalize numeric fields, encode categorical values, clean keywords/texts.
    return data;
  }

  async generateEmbedding(text: string): Promise<number[]> {
    const response = await axios.post(
      'https://api.openai.com/v1/embeddings',
      {
        model: 'text-embedding-ada-002',
        input: text,
      },
      {
        headers: { 'Authorization': `Bearer ${process.env.OPENAI_API_KEY}` },
      }
    );
    return response.data.data[0].embedding;
  }
}        


7. Deployment & Operational Considerations

Deployment:

  1. Package the Nest.js micro-service in Docker and deploy via AWS ECS/Fargate (preferable for ease of deployment and scaling).
  2. Use AWS Lambda for ETL/preprocessing jobs.
  3. Host PostgreSQL on Amazon RDS and the FAISS service on an EC2 instance/Fargate.

Scalability & Maintenance:

  1. Schedule regular ETL jobs to update company records and refresh the FAISS index.
  2. Monitor latency and throughput for FAISS searches and API endpoints.
  3. Ensure data consistency between PostgreSQL and the FAISS index through periodic reconciliation jobs.
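The reconciliation job in item 3 reduces to a set comparison between the IDs stored in PostgreSQL and the IDs present in the FAISS index. A minimal sketch, assuming both ID sets have already been fetched (the function name `reconcile` and the result keys are hypothetical):

```python
def reconcile(db_ids: set[int], index_ids: set[int]) -> dict[str, list[int]]:
    """Report companies missing from the index and vectors with no database row."""
    return {
        "missing_from_index": sorted(db_ids - index_ids),
        "orphaned_in_index": sorted(index_ids - db_ids),
    }


report = reconcile(db_ids={1, 2, 3, 4}, index_ids={2, 3, 4, 9})
print(report)  # {'missing_from_index': [1], 'orphaned_in_index': [9]}
```

Companies in `missing_from_index` need their embeddings re-added; entries in `orphaned_in_index` should be removed from the index.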

Security & Logging:

  1. Secure API endpoints with proper authentication.
  2. Log ingestion, embedding generation, and index update metrics for troubleshooting and monitoring.
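For the Python side of the pipeline, per-stage timing can be logged with a small decorator. This is a sketch under assumptions: the logger name, the `timed` decorator, and the `stage=... duration_ms=...` message format are all hypothetical, and `fake_generate_embedding` stands in for the real API call.

```python
import functools
import logging
import time

logger = logging.getLogger("company_search.metrics")


def timed(stage: str):
    """Log how long the wrapped function takes, even if it raises."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                logger.info("stage=%s duration_ms=%.2f", stage, elapsed_ms)
        return wrapper
    return decorator


@timed("embedding_generation")
def fake_generate_embedding(text: str) -> list[float]:
    """Placeholder for the real embedding call."""
    return [float(len(text))]
```

Calling `logging.basicConfig(level=logging.INFO)` at startup makes these messages visible; in Lambda, standard logging output is typically collected by CloudWatch Logs.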


8. Summary

This design describes a micro-service architecture for a similar company search API built on an AWS stack. It covers end-to-end data ingestion, preprocessing (normalization, encoding, text cleaning), embedding generation via OpenAI’s API, and storage in both PostgreSQL (for full records) and FAISS (for high-performance similarity search). The Nest.js API layer orchestrates these components, providing a robust backend for targeted sales and lead generation without frontend considerations.




