SparkCognition AI Studio - A Test Drive

Sanil Pillai

Bridging Human Potential and AI Innovation | Coaching for the Future of Work

Published: February 3, 2024

Recently, Deven Samant and I had the chance to explore SparkCognition's AI Studio by building a no-code AI pipeline. The opportunity arose thanks to Amir Husain and Jarred Capellman, who initiated a competition that facilitated this experience. Our goal with the pipeline was to evaluate how easily Large Language Models (LLMs) can be used to extract insights from resumes and job descriptions, and to find correlations between the two. Fortunately, AI Studio let us do this without writing any code. Here's an overview of our approach.

1. Downloaded a dataset of PDF resumes from Kaggle and bulk-converted them into text using pdftotext.com
2. Downloaded a dataset of CSV job descriptions from Kaggle and bulk-converted them into text using convertio.co/csv-txt/
3. Combined both sets of text files into everything.txt (lazy with the naming here :-))
4. Uploaded the text to the AI Studio repository
5. Connected the text file to a Document node
6. Connected the Document node to an AI Studio Question Prompt
7. Added an LLM node (OpenAI in this case). Configured the node with my OpenAI key
8. Connected the Question node and the LLM node to a LangChain node
9. Ran the pipeline. That's it!
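
The combining step above was done by hand, but it is easy to script. Here is a minimal sketch in Python, assuming the converted files sit in local folders. The helper name `combine_text_files` and the folder layout are hypothetical and have nothing to do with AI Studio itself.

```python
from pathlib import Path


def combine_text_files(input_dirs, output_path):
    """Concatenate every .txt file found in input_dirs into one file.

    Returns the number of non-empty files that were merged.
    """
    output_path = Path(output_path)
    merged = 0
    with output_path.open("w", encoding="utf-8") as out:
        for directory in input_dirs:
            # Sort so the merged file has a reproducible order across runs
            for path in sorted(Path(directory).glob("*.txt")):
                if path.resolve() == output_path.resolve():
                    continue  # never read the file we are writing
                text = path.read_text(encoding="utf-8", errors="replace").strip()
                if text:
                    out.write(text + "\n")
                    merged += 1
    return merged
```

Calling something like `combine_text_files(["resumes_txt", "jobs_txt"], "everything.txt")` (both folder names are placeholders) would produce the single file that gets uploaded in the next step.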

This is what our pipeline looked like:

[Figure: Our TalentAI pipeline in SparkCognition AI Studio]

We managed to run various successful queries against the pipeline; the original post illustrated them with example screenshots.

In a very short amount of time, we had a working Retrieval-Augmented Generation (RAG) setup for LLMs. By contrast, I also tried to build similar functionality in Python, which, as a beginner, took me a few hours and a fair amount of debugging. The code is below:

# Code to create a vector store from everything.txt
from pinecone import Pinecone

pc = Pinecone(api_key='******')

index_name = '384index'

# Create the index if it doesn't already exist (kept commented out, as in the original run).
# The dimension must match the embedding model: all-MiniLM-L6-v2 produces
# 384-dimensional vectors, not the 768 of BERT-base models.
# Note: recent Pinecone clients also require a `spec` argument for create_index.
#if index_name not in pc.list_indexes().names():
#    pc.create_index(name=index_name, dimension=384, metric='cosine')

# Connect to your index
index = pc.Index(name=index_name)

file_path = '/Users/sanilpillai/Downloads/everything.txt'

# Treat each non-empty line of the file as one document
with open(file_path, 'r', encoding='utf-8') as file:
    documents = [line.strip() for line in file if line.strip()]

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings
document_embeddings = model.encode(documents)

for i, (text, embedding) in enumerate(zip(documents, document_embeddings)):
    document_id = f"doc_{i}"  # Generate a unique ID for each document
    index.upsert(vectors=[(document_id, embedding.tolist(), {"text": text})])

# Sanity check: querying with the first document's own embedding should return it as the top match
query_result = index.query(vector=document_embeddings[0].tolist(), top_k=1)
print(query_result)
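
For intuition about what the `index.query` call above is doing, here is a toy, in-memory version of cosine-similarity search in plain Python. It is only a sketch of the general idea and not how Pinecone is implemented internally; the function names and the tiny 3-dimensional vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=1):
    """Return the k (doc_id, score) pairs most similar to query_vector.

    `stored` maps a document ID to its embedding vector.
    """
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in stored.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up 3-dimensional "embeddings"; real all-MiniLM-L6-v2 vectors have 384 dimensions
stored = {
    "doc_0": [1.0, 0.0, 0.0],
    "doc_1": [0.0, 1.0, 0.0],
    "doc_2": [0.7, 0.7, 0.0],
}
print(top_k([1.0, 0.1, 0.0], stored, k=2))
```

With these toy vectors the query points almost exactly along `doc_0`, so `doc_0` ranks first and `doc_2` second. A vector store does the same kind of ranking, just at scale and with persistence.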

# Code to retrieve a query response
from openai import OpenAI
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

# Initialize the OpenAI and Pinecone clients
client = OpenAI(api_key='*****')
pc = Pinecone(api_key='*****')

# Connect to the index populated by the script above
index_name = '384index'
index = pc.Index(index_name)

# Load the embedding model once; it must be the same model used to embed the documents
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

def retrieve_documents(query, top_k=3):
    """
    Retrieve the text of the top_k most relevant documents from Pinecone.
    """
    query_vector = embedding_model.encode(query).tolist()
    response = index.query(vector=query_vector, top_k=top_k, include_metadata=True)
    return [match['metadata']['text'] for match in response['matches']]

def generate_response(document_snippets, prompt, model="gpt-3.5-turbo"):
    """
    Generate a response using OpenAI's GPT based on retrieved documents.
    """
    # Put the retrieved snippets first, then the instruction and question
    augmented_prompt = "\n\n".join(document_snippets) + "\n\n" + prompt
    response = client.chat.completions.create(
        model=model,
        messages=[{'role': 'user', 'content': augmented_prompt}],
        max_tokens=150,
        temperature=0.7,
        top_p=1.0,
        frequency_penalty=0.0,
        presence_penalty=0.0
    )
    return response.choices[0].message.content

# Example usage
query = "who was an executive between 2005 and 2013?"
document_snippets = retrieve_documents(query)
# Include the question itself in the prompt; otherwise the model never sees it
response = generate_response(document_snippets, "Given the context above, " + query)
print(response)
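
The augmentation step in `generate_response` is plain string concatenation, so long snippets can crowd out the question or overflow the model's context window. Below is a minimal sketch of one way to cap the context, assuming a simple character budget. The helper `build_augmented_prompt` and its `max_context_chars` parameter are hypothetical, and a character count is only a rough stand-in for tokens.

```python
def build_augmented_prompt(snippets, question, max_context_chars=4000):
    """Join retrieved snippets and the question, keeping the context under a budget.

    Snippets are assumed to be ordered most relevant first. A snippet that
    would push the context past max_context_chars is skipped, so a shorter
    snippet further down the list can still be included.
    """
    kept = []
    used = 0
    for snippet in snippets:
        if used + len(snippet) > max_context_chars:
            continue
        kept.append(snippet)
        used += len(snippet)
    context = "\n\n".join(kept)
    return f"{context}\n\nGiven the context above, {question}"


# Made-up snippets, purely for illustration
prompt = build_augmented_prompt(
    ["Alice: VP of Sales, 2005-2013", "Bob: Software Engineer, 2015-2020"],
    "who was an executive between 2005 and 2013?",
)
print(prompt)
```

The resulting string could be sent as the single user message, in place of the concatenation inside `generate_response`.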

The benefits of AI Studio became abundantly clear during our brief exploration. We could iterate much faster without having to worry about the intricacies of vector stores and embeddings. I look forward to its evolution and to the ways it will simplify complex processes for technical and non-technical users alike, making advanced AI tools accessible and user-friendly for everyone!

Lillian Liang Emlet, MD MS CPC ELI-MP

Energy Leadership Coach for Healthcare Professionals | Founder & CEO, Transforming Healthcare Coaching | Contact us for Signature 1:1 & Group Coaching Programs for Healthcare Clinicians | Academic Intensivist | MedEd

9 months ago

This is super interesting: and I appreciate the breaking down of the thought process and output. Thank you for sharing!

Khurram Mahmood

Co-Founder, Ensemble | UT Austin, CMU and Ex-Workday, Oracle, Veeva

9 months ago

Super cool! AI Studio is revolutionary. Thanks for sharing your experience Sanil Pillai

Amir Husain

Founder: Avathon (prev SparkCognition), SkyGrid, Navigate | Author: The Sentient Machine, Gen AI for Leaders, Hyperwar | Board: UT Austin PAIB & CS, WorldQuant Predictive, SpecFive, Global Venture Bridge

9 months ago

Excellent project!
