10 Tips on LLMs (ChatGPT) for Market Research (with a Pythonic Approach)

Using ChatGPT for market research is like having a super smart, know-it-all intern who is on call 24/7. But there are good and better ways to use it, and things to watch out for.

Here are my top 10 tips for using LLMs in market research:

  1. Use GPT to transform data into different formats
  2. Context really matters!
  3. Use "DIALECTICS" in prompting
  4. Use LLMs to Refactor your code
  5. Trust but verify, especially when it comes to math
  6. Use GPT to build simple tools
  7. Build a simple RAG (Retrieval Augmented Generation)
  8. Build a couple of summarizers
  9. Use LLMs to write charting code
  10. Don't always copy and paste the code

No. 1: Use GPT to Transform Data into Different Formats

Make this prompt your friend:

"Here are my data below. Put them in a JSON [dataframe, table, or whatever format you need] format for me.

[Paste your data here]"

For me, this is probably the most useful prompt for a market researcher, or anyone whose daily routine involves data analysis, because we waste so much time copying and pasting. Back in the old days, we'd often need to:

  • do a couple of operations in Excel first (like CLEAN or TRIM)
  • dedupe
  • double-check for spelling errors/typos/empty spaces
  • load or read the data into Python (if needed)

But GPT can do that in a few seconds, mostly error free. If you are dealing with a small ad hoc dataset (e.g., data you copied from a website or an Excel file), this does wonders. You can just copy and paste the output into your file (see the example below.)

df = pd.DataFrame(
    {
        "Year": [
            2004,
            2005,
            2006,
            2007,
            2008,
            2009,
            2010,
            2011,
            2012,
            2013,
            2014,
            2015,
            2016,
            2017,
            2018,
            2019,
        ],
        "PCE": [
            8853.9,
            9327.5,
            9841.3,
            10287.2,
            10408.7,
            10164.0,
            10523.6,
            11025.5,
            11464.4,
            11874.8,
            12363.5,
            12807.3,
            13267.5,
            13719.6,
            14276.7,
            14729.3,
        ],
        "Federal_Spending": [
            2292.8,
            2472.0,
            2655.1,
            2728.7,
            3003.2,
            3517.7,
            3456.2,
            3603.1,
            3537.1,
            3454.6,
            3506.7,
            3688.0,
            3853.1,
            3982.0,
            4108.3,
            4447.2,
        ],
    }
)        

A couple of variations of the prompt include:

"Here are my data below in [Chinese, German, Japanese, Spanish]. Translate them into English and put them in a JSON [dataframe, table, or whatever format you need] format for me.

[Paste your data here]"

"Here are my data below in [Chinese, German, Japanese, Spanish]. Put them in a dataframe format for me, with [YYY] as rows and [XXX] as columns.

[Paste your data here]"

You can also be more creative and prompt GPT to perform simple mapping/matching tasks (e.g., take two lists of industries - one from the US and one from Eurostat - and ask GPT to compare and match them.) This will take away 50% to 60% of the initial manual effort; however, you'd still need human judgement afterwards.
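
As a rough illustration, here is a minimal sketch of how such a matching task could be scripted against the OpenAI API instead of pasted into the chat window; the industry lists, model name, and prompt wording below are placeholders for illustration, not a recommendation.

import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

# Hypothetical industry lists to be matched (placeholders only)
us_industries = ["Motor vehicle manufacturing", "Software publishers", "Oil and gas extraction"]
eu_industries = ["Manufacture of motor vehicles", "Computer programming", "Extraction of crude petroleum"]

prompt = (
    "Match each US industry to its closest Eurostat industry and return a JSON object "
    "mapping US names to Eurostat names. Flag any uncertain matches.\n"
    f"US industries: {us_industries}\n"
    f"Eurostat industries: {eu_industries}"
)

response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # review the proposed mapping manually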

No. 2: Context Really Matters

The next one is including context in the prompt. For those who have been using LLMs for a while, this seems obvious, but for market research, it's a good habit to keep and will save you time - you get there in one shot instead of multiple prompts.

Here is an example:

"In the context of real estate market [or any context you want to set], are there any known research studies that compare the two valuation methods mentioned before?"

When it comes to LLM efficiency and accuracy, specificity is your friend. In theory, ChatGPT does a good job of remembering the conversation, meaning it can infer the context from the conversation history. But there is a limit; in LLM terms, this is the context window - how much history the model "remembers". GPT's limit is around 7,000 tokens (roughly 6,000+ words; punctuation and symbols count as tokens too.) If you use Gemini (Google) or Claude models, the context windows are much larger. But most of the LLMs we use are transformer based, which rely on the "attention" mechanism. In layman's terms, every time the model "thinks", it re-reads the entire context. That is why it works so well, but it is also computationally expensive, and as the context gets very long, accuracy tends to drop.
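
If you want a rough sense of how much of that window a prompt is using, you can count tokens locally with the tiktoken library; here is a minimal sketch (the sample text and model name are placeholders).

import tiktoken

text = "Paste your prompt or data sample here."  # placeholder text

try:
    encoding = tiktoken.encoding_for_model("gpt-4")  # look up the encoding for a given model
except KeyError:
    encoding = tiktoken.get_encoding("cl100k_base")  # fall back to a common encoding

print(f"Approximate token count: {len(encoding.encode(text))}")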

Therefore, as a habit, it's good to always provide context, just to be sure. Many people recommend defining the "role" as well, such as "as an economist/philosopher/market researcher/...". I find this less important than providing specific context: the newer models, e.g., GPT-4o, are actually pretty good at inferring the role from the context.

No. 3: Use "DIALECTICAL REASONING" in Prompting

The third tip is using dialectical prompting. What is dialectic anyway? Dialectic is originally a technique from philosophy. It has three parts: thesis, antithesis, and synthesis. In plain English, it means this:

  • Thesis is your main point (viewpoint, recommendation, conclusion), e.g., "Democracy is the ideal political system."
  • Antithesis is the counterpoint, e.g., the limitations of democracy, concerns, counterarguments, etc.
  • Synthesis brings the two together, e.g., "acknowledge the strengths of democracy while addressing its weaknesses, proposing a system that aims to balance broad participation with safeguards against potential abuses."

If you haven't noticed, GPT is "conversational", so it tends to agree with users more often than not. While this is definitely great for the user experience, you want your final output for readers (the public, executives, clients ...) to be tight. Therefore, you want to ask the LLM to play devil's advocate a bit. What I usually do is prompt GPT to:

  1. validate the main argument
  2. poke holes in it
  3. bring everything together, for example by summarizing the pros and cons
  4. (optional) ask GPT "how can I strengthen my ..."

Example: Globalization Impact

  1. Thesis prompt: "What are the primary economic benefits of globalization?"
  2. Antithesis prompt: "What are the main economic drawbacks and criticisms of globalization?"
  3. Synthesis prompt: "How can countries maximize the economic benefits of globalization while mitigating its negative impacts?"

Example: Artificial Intelligence Impact

  1. Pro-Agent Prompt: "What are the primary benefits of artificial intelligence in modern society?"
  2. Con-Agent Prompt: "What are the main drawbacks and ethical concerns associated with artificial intelligence?"
  3. Reconciliation-Agent Prompt: "How can society balance the benefits of artificial intelligence with its potential risks and ethical concerns?"
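
If you want to run this thesis/antithesis/synthesis pattern repeatedly, it is easy to script; below is a minimal sketch that chains the three prompts through the OpenAI API while carrying the conversation history forward (the topic and model name are placeholders).

import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

topic = "globalization"  # placeholder topic
prompts = [
    f"What are the primary economic benefits of {topic}?",  # thesis
    f"What are the main economic drawbacks and criticisms of {topic}?",  # antithesis
    f"How can countries maximize the benefits of {topic} while mitigating its negative impacts?",  # synthesis
]

messages = []
for prompt in prompts:
    messages.append({"role": "user", "content": prompt})
    response = openai.chat.completions.create(model="gpt-4o", messages=messages)
    answer = response.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})  # keep history so each step builds on the last
    print(answer)
    print("-" * 60)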

No. 4: Use LLMs to Refactor Your Code

This one is more for people like me, who have moved away from Excel-like tools to Python or R for data analysis. Personally, I find it more fun. I suppose you could use the same approach for VBA, but I have never tried that myself.

The biggest low-hanging fruit of coding with LLMs, at least for people like me who are self-taught coders rather than professionally trained developers, is using them to write code that we already understand, at least conceptually.

I use GPT to reorganize my code, add documentation (docstrings), and tweak code for different purposes. If I do everything myself, it takes a lot of detailed and repetitive work, and I have to look up other people's code examples (or my old code from other projects.) GPT can do this faster and more accurately.

Here are a few prompt examples:

  • "Below is my code. Can you simplify it (if you can), make it more professional looking, and add all the necessary docstrings (module, class, functions) with examples?"
  • "Below is my code. Can you turn it into a class (function) file, including def main and if __name__ == "__main__", as well as docstrings?"
  • "Below are my functions. Can you make them ASYNC and also add a CLI?" (CLI stands for command-line interface; a sketch of the kind of result you might get back follows below.)
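
To give a feel for that last prompt, here is a minimal sketch of the kind of output you might get back: a trivial, hypothetical fetch function rewritten as async with a small argparse CLI attached (aiohttp is assumed as the HTTP client).

import argparse
import asyncio

import aiohttp  # assumed dependency for the async example


async def fetch_text(url: str) -> str:
    """Fetch the text content of a URL asynchronously."""
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            return await resp.text()


def cli() -> None:
    """Minimal command-line interface around fetch_text."""
    parser = argparse.ArgumentParser(description="Fetch a URL asynchronously.")
    parser.add_argument("url", help="URL to fetch")
    args = parser.parse_args()
    text = asyncio.run(fetch_text(args.url))
    print(text[:200])  # show the first 200 characters


if __name__ == "__main__":
    cli()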

No. 5: Trust but Verify, Especially When It Comes to Math

This is one thing you really need to watch out for: LLMs are not that great at math. If you ask GPT to calculate percentages and just throw out those numbers without checking, you will be in trouble more often than not. There are technical reasons why this is so, which I won't go into, but there are plenty of online discussions on the topic.

Based on my personal experience, there have been a few times when GPT spit out numbers that would make a 3rd grader scratch their head. If a task involves multiple computations, you want to either do it in Excel or ask GPT to write the code so you can do it in pandas yourself (pandas is a Python library for data analysis.)
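
For example, instead of asking GPT for the growth rates directly, have it write a few lines of pandas and run them yourself; here is a minimal sketch using a small slice of the PCE data from tip No. 1.

import pandas as pd

# A small slice of the dataframe from tip No. 1, for illustration
df = pd.DataFrame(
    {
        "Year": [2017, 2018, 2019],
        "PCE": [13719.6, 14276.7, 14729.3],
        "Federal_Spending": [3982.0, 4108.3, 4447.2],
    }
)

df["PCE_YoY_%"] = df["PCE"].pct_change() * 100  # year-over-year growth
df["Fed_Share_of_PCE_%"] = df["Federal_Spending"] / df["PCE"] * 100  # spending as a share of PCE
print(df.round(1))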

When you ask GPT to find data that you plan to publish or send to clients, always ask for the source. This will not only help you cite it in your report, but also give GPT more specific context. It is also more consistent to compare numbers drawn from similar sources. For example, if you are comparing economic data, such as industry output, CPI/PPI, or employment across countries, you actually need to prompt GPT specifically to limit itself to government statistical bureaus' data only (e.g., BEA and BLS in the US, Eurostat for Europe, NBSC for China, SBJ for Japan, NSO for India, and so on.)

BEA - Bureau of Economic Analysis (part of the Dept of Commerce); BLS - Bureau of Labor Statistics (part of the Dept of Labor); Eurostat - Statistical Office of the European Union; NBSC - National Bureau of Statistics of China; SBJ - Statistics Bureau of Japan; NSO - National Statistical Office (India)

It's also a good habit to ask GPT how the data was calculated. Unless you specify it, GPT doesn't always consider the nuances.

For example, when I was calculating the average annual returns of a few companies' stocks, I noticed that when I asked GPT what the difference is between the stock price and the price adjusted for splits and dividends, it knows the difference (if you bought a stock for $100 a year ago and it has paid out a $1 dividend since, the historical price in Yahoo Finance is no longer $100 but $100 - $1, or $99.) But if you just ask for the historical price, or the average annual return, GPT will not pull the right numbers for you. For high-visibility numbers, you are better off just doing the grunt work and calculating them manually.
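
If you do end up doing the grunt work in code, a minimal sketch along these lines - assuming the yfinance library and a placeholder ticker and date range - pulls the split- and dividend-adjusted prices and computes the average annual return from them.

import yfinance as yf

# Placeholder ticker and date range - adjust to your own analysis
prices = yf.download("AAPL", start="2019-01-01", end="2024-01-01", auto_adjust=False)
adj_close = prices["Adj Close"].squeeze()  # prices adjusted for splits and dividends

years = (adj_close.index[-1] - adj_close.index[0]).days / 365.25
total_growth = float(adj_close.iloc[-1] / adj_close.iloc[0])
avg_annual_return = total_growth ** (1 / years) - 1
print(f"Average annual return (adjusted): {avg_annual_return:.2%}")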

No. 6: Use GPT to Build Simple Tools

If you are working on a project that will take more than a few days, it's worth spending half a day to build some quick data tools with the help of LLMs.

In the pre-GPT days, here was how this worked:

  • I would look around for some code samples and build something.
  • Of course, as a rule, it never worked the first time. I'd try to debug it myself, which could take a while.
  • If that didn't work, I'd have to find a developer to help. When I had direct access to developers, it was easier, but I'd be pulling that guy off his project for half a day, so I couldn't do it too often.
  • If I didn't have access to a programmer, I'd find someone who had access to Python developers (I kept a rolodex.) If I were lucky, he would lend me "Amy" or "Michael" from his team to help me out. But I'd have to wait, of course, and it was disruptive to his team. (I used to take managers from my former company's IT department out to lunch just to score brownie points for that.)
  • The whole process could easily take 3 to 4 days.

With LLMs, it is so much easier:

You can either build a Python function yourself - a simple one that doesn't even have to work - and ask GPT to modify/expand it, or you can just prompt GPT to build it from scratch. It's like having an entourage of "Amys" and "Michaels" for just $20 a month, and you don't have to feel stupid for asking stupid questions!

Here are a couple of example tools that GPT helped me build.

DataFrameUnitConverter:

When you deal with data internationally, different countries use different numeric systems. Here in the US, we have thousands, millions, billions, and so on. I can't count anything without the "," separators. Numbers from other countries often don't come with separators. China and Japan group numbers by 10 to the power of 4, or 10,000 (called "wan" in Chinese and "man" in Japanese.) India uses lakh (100,000) and crore (10,000,000).

It's such a pain. So, ChatGPT helped me build a unit converter so that I can convert numbers in the Indian and Chinese numeric systems to the familiar trillions, billions, and millions.

import pandas as pd


class DataFrameUnitConverter:
    """
    A class used to perform scaling operations on specified columns of a pandas DataFrame.

    Attributes
    ----------
    df : pd.DataFrame
        The DataFrame to be manipulated.

    Example
    -------
    converter = DataFrameUnitConverter(df)

    # Multiply column 'A' by 1,000,000 (1 million)
    df = converter.multi_by_million('A')

    # Divide column 'B' by 1,000 (1 thousand)
    df = converter.div_by_thousand('B')

    # Convert column 'B' from 100 million ('yi') to billion
    df = converter.convert_yi_to_billion('B')

    # Convert column 'C' from lakh to million
    df = converter.convert_lakh_to_million('C')
    """

    def __init__(self, df):
        self.df = df

    def multiply(self, col_names=None, factor=1):
        """Multiply specified columns by a factor."""
        if col_names is None:
            col_names = self.df.select_dtypes(include="number").columns.tolist()
        elif isinstance(col_names, str):
            col_names = [col_names]

        for col_name in col_names:
            self.df[col_name] = self.df[col_name] * factor
        return self.df

    def divide(self, col_names=None, factor=1):
        """Divide specified columns by a factor."""
        if col_names is None:
            col_names = self.df.select_dtypes(include="number").columns.tolist()
        elif isinstance(col_names, str):
            col_names = [col_names]

        for col_name in col_names:
            self.df[col_name] = self.df[col_name] / factor
        return self.df

    def multi_by_trillion(self, col_names=None):
        return self.multiply(col_names, 10**12)

    def multi_by_billion(self, col_names=None):
        return self.multiply(col_names, 10**9)

    def multi_by_100_million(self, col_names=None):
        return self.multiply(col_names, 10**8)

    def multi_by_million(self, col_names=None):
        return self.multiply(col_names, 10**6)

    def multi_by_10_thousand(self, col_names=None):
        return self.multiply(col_names, 10**4)

    def multi_by_thousand(self, col_names=None):
        return self.multiply(col_names, 10**3)

    def div_by_trillion(self, col_names=None):
        return self.divide(col_names, 10**12)

    def div_by_billion(self, col_names=None):
        return self.divide(col_names, 10**9)

    def div_by_100_million(self, col_names=None):
        return self.divide(col_names, 10**8)

    def div_by_million(self, col_names=None):
        return self.divide(col_names, 10**6)

    def div_by_10_thousand(self, col_names=None):
        return self.divide(col_names, 10**4)

    def convert_wan_to_million(self, col_names=None):
        """Convert values from 'wan' (10 thousand) to millions."""
        self.multi_by_10_thousand(col_names)
        self.div_by_million(col_names)
        return self.df

    def convert_wan_to_billion(self, col_names=None):
        """Convert values from 'wan' (10 thousand) to billions."""
        self.multi_by_10_thousand(col_names)
        self.div_by_billion(col_names)
        return self.df

    def convert_yi_to_billion(self, col_names=None):
        """Convert values from 'yi' (100 million) to billions."""
        self.multi_by_100_million(col_names)
        self.div_by_billion(col_names)
        return self.df

    def convert_yi_to_trillion(self, col_names=None):
        """Convert values from 'yi' (100 million) to trillions."""
        self.multi_by_100_million(col_names)
        self.div_by_trillion(col_names)
        return self.df

    def convert_lakh_to_million(self, col_names=None):
        """Convert values from lakh (100 thousand) to millions."""
        self.divide(col_names, 10)
        return self.df

    def convert_crore_to_million(self, col_names=None):
        """Convert values from crore (10 million) to millions."""
        self.multiply(col_names, 10)
        return self.df

    def convert_lakh_to_crore(self, col_names=None):
        """Convert values from lakh (100 thousand) to crore (10 million)."""
        self.divide(col_names, 100)
        return self.df

    def convert_lakh_to_thousand(self, col_names=None):
        """Convert values from lakh (100 thousand) to thousand."""
        self.multiply(col_names, 100)
        return self.df


def main():
    # Sample DataFrame
    df = pd.DataFrame(
        {"A": [1, 2, 3], "B": [1000, 2000, 3000], "C": [1000000, 2000000, 3000000]}
    )

    # Create an instance of DataFrameUnitConverter
    converter = DataFrameUnitConverter(df)

    # Example usage
    # df = converter.multi_by_million()
    # df = converter.div_by_10_thousand()
    # df = converter.convert_yi_to_billion()
    df = converter.convert_lakh_to_million()

    # Display the resulting DataFrame
    print(df)


if __name__ == "__main__":
    main()        

DataFrameStyler

This one is a simple styling class for pandas dataframes. Because tables with colors and formats not to my liking really bother me, I built this one with GPT's help - more a matter of my own proclivity than anything else. It took a little longer, but that was mainly my own fault: I tried a whole bunch of different colors and styles. The point is that these tools are not difficult to build with LLMs nowadays.

"""
File: dataframe_styler.py
Author: Xiao-Fei Zhang
Date: last updated on Jun 22, 2024

DataFrameStyler is a class to style pandas dataframes with custom formatting.
"""

import pandas as pd


class DataFrameStyler:
    """
    A class to apply various styles to pandas DataFrames and export them to HTML.
    """

    def __init__(self, df):
        self.df = df.copy()

    def style_numeric_columns(
        self, header_color="lightblue", index_color="lightblue", decimal_points=1
    ):
        """
        Format all numeric columns in the DataFrame to have specified decimal places and thousands separators.

        Parameters:
        header_color (str): The background color for the header cells.
        index_color (str): The background color for the index cells.
        decimal_points (int): The number of decimal points for formatting numeric columns.

        Returns:
        pandas.io.formats.style.Styler: The formatted DataFrame.

        Example:
        >>> df = pd.DataFrame({"A": [0.2345, 0.3456, 0.4567], "B": [4.5678, 5.6789, 6.7890]})
        >>> styler = DataFrameStyler(df)
        >>> styled = styler.style_numeric_columns()
        >>> styler.export_styled_to_html(styled, "styled_numeric.html")
        """
        df_filled = self.df.fillna("-")
        formatted_df = df_filled.copy()

        format_str = "{:,.{}f}".format
        for col in formatted_df.columns:
            if pd.api.types.is_numeric_dtype(formatted_df[col]):
                formatted_df[col] = formatted_df[col].map(
                    lambda x: format_str(x, decimal_points)
                )

        styled = formatted_df.style

        # Apply header and index styles
        styled = self.apply_header_and_index_styles(styled, header_color, index_color)

        return styled

    def style_numeric_columns_with_gradient(
        self,
        cmap="coolwarm",
        header_color="lightblue",
        index_color="lightblue",
        decimal_points=1,
    ):
        """
        Format all numeric columns in the DataFrame to have specified decimal places and thousands separators
        and then apply a gradient background style to these columns.

        Parameters:
        cmap (str): The colormap for the gradient background.
        header_color (str): The background color for the header cells.
        index_color (str): The background color for the index cells.
        decimal_points (int): The number of decimal points for formatting numeric columns.

        Returns:
        pandas.io.formats.style.Styler: The styled DataFrame.

        Example:
        >>> df = pd.DataFrame({"A": [0.2345, 0.3456, 0.4567], "B": [4.5678, 5.6789, 6.7890]})
        >>> styler = DataFrameStyler(df)
        >>> styled = styler.style_numeric_columns_with_gradient()
        >>> styler.export_styled_to_html(styled, "styled_numeric_grad.html")
        """
        numeric_cols = self.df.select_dtypes(include=["number"]).columns
        format_str = "{:,.{}f}".format
        styled = self.df.style.format(
            lambda x: format_str(x, decimal_points), subset=numeric_cols
        ).background_gradient(cmap=cmap, subset=numeric_cols)

        # Apply header and index styles
        styled = self.apply_header_and_index_styles(styled, header_color, index_color)

        return styled

    def style_percentage_columns(
        self, header_color="lightblue", index_color="lightblue", decimal_points=2
    ):
        """
        Format numeric columns in the DataFrame as percentages without multiplying by 100.

        Parameters:
        header_color (str): The background color for the header cells.
        index_color (str): The background color for the index cells.
        decimal_points (int): The number of decimal points for formatting numeric columns.

        Returns:
        pandas.io.formats.style.Styler: The styled DataFrame.

        Example:
        >>> df = pd.DataFrame({"A": [0.2345, 0.3456, 0.4567], "B": [4.5678, 5.6789, 6.7890]})
        >>> styler = DataFrameStyler(df)
        >>> styled = styler.style_percentage_columns()
        >>> styler.export_styled_to_html(styled, "styled_pct.html")
        """
        formatted_df = self.df.copy()
        numeric_cols = formatted_df.select_dtypes(include=["number"]).columns
        format_str = "{:.{}%}".format
        formatted_df[numeric_cols] = formatted_df[numeric_cols].apply(
            lambda x: x.map(lambda x: format_str(x, decimal_points))
        )

        styled = formatted_df.style

        # Apply header and index styles
        styled = self.apply_header_and_index_styles(styled, header_color, index_color)

        return styled

    def style_percentage_columns_with_gradient(
        self,
        cmap="coolwarm",
        header_color="lightblue",
        index_color="lightblue",
        decimal_points=1,
    ):
        """
        Apply a percentage format to numeric columns without multiplying by 100,
        and then apply a gradient background style to these columns.

        Parameters:
        cmap (str): The colormap for the gradient background.
        header_color (str): The background color for the header cells.
        index_color (str): The background color for the index cells.
        decimal_points (int): The number of decimal points for formatting numeric columns.

        Returns:
        pandas.io.formats.style.Styler: The styled DataFrame.

        Example:
        >>> df = pd.DataFrame({"A": [0.2345, 0.3456, 0.4567], "B": [4.5678, 5.6789, 6.7890]})
        >>> styler = DataFrameStyler(df)
        >>> styled = styler.style_percentage_columns_with_gradient()
        >>> styler.export_styled_to_html(styled, "styled_pct_grad.html")
        """
        styled_df = self.df.copy()
        numeric_cols = styled_df.select_dtypes(include=["number"]).columns

        format_str = "{:.{}%}".format
        styled = styled_df.style.format(
            lambda x: format_str(x, decimal_points), subset=numeric_cols
        ).background_gradient(cmap=cmap, subset=numeric_cols)

        # Apply header and index styles
        styled = self.apply_header_and_index_styles(styled, header_color, index_color)

        return styled

    def custom_format_and_gradient_columns(
        self,
        format_dict,
        cmap="coolwarm",
        header_color="lightblue",
        index_color="lightblue",
        decimal_points=1,
    ):
        """
        Apply specified formatting and gradient background style to columns in the DataFrame, and
        set background colors for index and header cells.

        Parameters:
        format_dict (dict): A dictionary specifying the format for each column.
        cmap (str): The colormap for the gradient background.
        header_color (str): The background color for the header cells.
        index_color (str): The background color for the index cells.
        decimal_points (int): The number of decimal points for formatting numeric columns.

        Returns:
        pandas.io.formats.style.Styler: The styled DataFrame.

        Example:
        >>> df = pd.DataFrame({"A": [0.2345, 0.3456, 0.4567], "B": [4.5678, 5.6789, 6.7890]})
        >>> format_dict = {"A": "float", "B": "percent"}
        >>> styler = DataFrameStyler(df)
        >>> styled = styler.custom_format_and_gradient_columns(format_dict)
        >>> styler.export_styled_to_html(styled, "styled_custom_grad.html")
        """
        styled_df = self.df.copy()
        formatter = {}

        for col, fmt in format_dict.items():
            if fmt == "percent":
                formatter[col] = lambda x: "{:.{}%}".format(x, decimal_points)
            elif fmt == "float":
                formatter[col] = lambda x: "{:,.{}f}".format(x, decimal_points)

        styled = styled_df.style.format(formatter)
        numeric_cols = styled_df.select_dtypes(include=["number"]).columns
        styled = styled.background_gradient(cmap=cmap, subset=numeric_cols)

        # Apply header and index styles
        styled = self.apply_header_and_index_styles(styled, header_color, index_color)

        return styled

    def apply_header_and_index_styles(self, styler, header_color, index_color):
        """
        Apply header and index styles to the Styler object.

        Parameters:
        styler (pandas.io.formats.style.Styler): The styled DataFrame.
        header_color (str): The background color for the header cells.
        index_color (str): The background color for the index cells.

        Returns:
        pandas.io.formats.style.Styler: The styled DataFrame with header and index styles applied.

        Example:
        >>> df = pd.DataFrame({"A": [0.2345, 0.3456, 0.4567], "B": [4.5678, 5.6789, 6.7890]})
        >>> styler = DataFrameStyler(df)
        >>> styled = styler.style_numeric_columns()
        >>> styled = styler.apply_header_and_index_styles(styled, "lightblue", "lightblue")
        """
        styles = [
            {
                "selector": "th",
                "props": [
                    ("background-color", header_color),
                    ("color", "black"),
                    ("font-weight", "bold"),
                ],
            },
            {
                "selector": "thead th",
                "props": [
                    ("background-color", header_color),
                    ("color", "black"),
                    ("font-weight", "bold"),
                ],
            },
            {
                "selector": "tbody th",
                "props": [
                    ("background-color", index_color),
                    ("color", "black"),
                    ("font-weight", "bold"),
                ],
            },
        ]
        return styler.set_table_styles(styles)

    def export_styled_to_html(self, styler, output_file):
        """
        Export the styled DataFrame to an HTML file.

        Parameters:
        styler (pandas.io.formats.style.Styler): The styled DataFrame.
        output_file (str): The file path to save the HTML output.

        Example:
        >>> df = pd.DataFrame({"A": [0.2345, 0.3456, 0.4567], "B": [4.5678, 5.6789, 6.7890]})
        >>> styler = DataFrameStyler(df)
        >>> styled = styler.style_numeric_columns()
        >>> styler.export_styled_to_html(styled, "styled_table.html")
        """
        html = styler.to_html()
        with open(output_file, "w") as f:
            f.write(html)
        print(f"Styled table exported to {output_file}")


def main():

    # Sample DataFrame
    df = pd.DataFrame({"A": [0.2345, 0.3456, 0.4567], "B": [4.5678, 5.6789, 6.7890]})

    # Create an instance of DataFrameStyler
    styler = DataFrameStyler(df)

    # Example 1: Style numeric columns with default parameters
    styled_numeric = styler.style_numeric_columns()
    styler.export_styled_to_html(styled_numeric, "styled_numeric.html")

    # Example 2: Style numeric columns with gradient
    styled_numeric_grad = styler.style_numeric_columns_with_gradient()
    styler.export_styled_to_html(styled_numeric_grad, "styled_numeric_grad.html")

    # Example 3: Style percentage columns with default parameters
    styled_pct = styler.style_percentage_columns()
    styler.export_styled_to_html(styled_pct, "styled_pct.html")

    # Example 4: Style percentage columns with gradient
    styled_pct_grad = styler.style_percentage_columns_with_gradient()
    styler.export_styled_to_html(styled_pct_grad, "styled_pct_grad.html")

    # Example 5: Style with custom format and gradient
    custom_format_dict = {"A": "float", "B": "percent"}
    styled_custom_grad = styler.custom_format_and_gradient_columns(custom_format_dict)
    styler.export_styled_to_html(styled_custom_grad, "styled_custom_grad.html")


if __name__ == "__main__":
    main()        

Here are some ideas for other useful tools: quarterizing revenues, shifting fiscal-year revenues to calendar years, industry matching, geographic mapping, language translation, FX translation ...

No. 7: Build a Simple RAG (Retrieval Augmented Generation)

If your project involves going through multiple documents, there is a lot of reading and searching - for example, an analysis that needs a couple of years of 10-Ks, 10-Qs, earnings releases, and transcripts, or an economic study of a particular industry or country.

I recommend building a simple RAG for the searching, because "Control + F" will eat up your day!

Compared to the simple tools mentioned above, this is a bit more complex. You need to do more than just copy and paste code from LLMs. But if you keep it simple and use a high-level framework, it's totally doable.

Here is an example of the quick RAG I built for one of my recent personal research projects. I used the OpenAI API and the LlamaIndex framework for the search and chatbot, and Streamlit for the web interface (Streamlit is a popular web interface tool that's both Python and LLM friendly.)

After I decided to keep it simple and easy, the actual coding part wasn't hard:

  • fewer than 100 lines of code, excluding comments/docstrings
  • maybe just a day or two to build
  • LlamaIndex's chat feature offers a retrieve/query/chat all-in-one approach
  • the web interface part in particular was super easy with GPT (I have zero experience with HTML or JavaScript and no intention to learn) - all I had to do was dump the code into the prompt and ask GPT to add the Streamlit code to it.

The RAG part took a bit longer, mainly because I first compared different methods to see which one worked better, and GPT-4 is not trained on the most recent LlamaIndex code (it's still new) - I had to rely on both GPT and LlamaIndex's own chatbot.

Here is the example code:

"""
This module sets up a Streamlit application that allows users to interact 
with a LlamaIndex-powered chatbot to search and retrieve information 
from documents stored on their PC. 

The chatbot uses OpenAI's GPT-4 model to provide accurate and context-aware 
responses based on the content of the documents.

The main functionalities include:
- Loading OpenAI API settings and initializing the language model (LLM) 
and embedding model.
- Checking if a pre-existing vectorstore index is available; 
if not, creating a new index from the documents.
- Setting up a chat engine using LlamaIndex that supports conversational interactions 
with context memory.
- Configuring a Streamlit interface for user interaction, 
including a chat input and message display area.
- Implementing a "safeword" feature to allow users to exit the chat session.

Modules and Functions:
- is_directory_empty: Checks if a given directory is empty.
- load_or_create_index: Loads an existing vectorstore index 
or creates a new one from documents.
- chat_with_bot: Handles the interaction with the chat engine 
to get responses based on user input.
"""

import os
import sys
import logging
from dotenv import load_dotenv
import openai
import streamlit as st
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    Settings,
    StorageContext,
    load_index_from_storage,
)
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from get_file_names import get_file_names
from language_detector import LanguageDetector

# Ensure UTF-8 encoding throughout, logging
os.environ["PYTHONIOENCODING"] = "utf-8"
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

# Load OpenAI API key and LLM settings
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")
llm = OpenAI("gpt-4", temperature=0.1)
embed_model = OpenAIEmbedding(
    embed_model="text-embedding-ada-002"
)  # Removed max_length
Settings.llm = llm
Settings.embed_model = embed_model

# Directories for persisted vectorstore and documents
PERSIST_DIR = (
    "Your directory to persist/save vectorstore/index files"
)
DOC_DIR = "Your document directory path"


def is_directory_empty(dir_path):
    """
    Check if the given directory is empty.

    Args:
        dir_path (str): Path to the directory.

    Returns:
        bool: True if the directory is empty, False otherwise.
    """
    return len(os.listdir(dir_path)) == 0


def load_or_create_index():
    """
    Load the vectorstore index from disk if it exists, otherwise create a new one from documents.

    Returns:
        VectorStoreIndex: The loaded or newly created vectorstore index.
    """
    if is_directory_empty(PERSIST_DIR):
        txt_files = get_file_names(DOC_DIR, full_path=True, file_types=".txt")
        en_txt_files = [
            file
            for file in txt_files
            if LanguageDetector.detect_primary_language_from_file(file) == "en"
        ]
        documents = SimpleDirectoryReader(
            DOC_DIR, encoding="utf-8", filename_as_id=True
        ).load_data()
        index = VectorStoreIndex.from_documents(documents, show_progress=True)
        index.storage_context.persist(PERSIST_DIR)
        return index
    else:
        storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
        return load_index_from_storage(storage_context)


def chat_with_bot(user_input):
    """
    Get a response from the chat engine based on user input.

    Args:
        user_input (str): The user's input.

    Returns:
        str: The chat engine's response.
    """
    response = chat_engine.chat(user_input)
    return response.response


# Load or create the vectorstore index
index = load_or_create_index()

# Initialize chat engine
memory = ChatMemoryBuffer.from_defaults(token_limit=3900)
chat_engine = index.as_chat_engine(
    chat_mode="condense_plus_context",
    memory=memory,
    llm=llm,
    context_prompt=(
        "You are a chatbot, able to have normal interactions, as well as talk about "
        "the economic indicators in these documents."
        "Here are the relevant documents for the context:\n"
        "{context_str}"
        "\nInstruction: Use the previous chat history, or the context above, to interact and help the user."
    ),
    verbose=False,
)

# Set up Streamlit app
st.set_page_config(
    page_title="FindMyStuff, a RAG bot powered by LlamaIndex",
    page_icon="??",
    layout="centered",
    initial_sidebar_state="auto",
)

st.title(
    "FindMyStuff, a RAG bot to search content on your PC - powered by LlamaIndex"
)
st.info('To exit the chat session, the safe word is "quit".')

SAFE_WORD = "quit"

# Initialize chat messages history
if "messages" not in st.session_state:
    st.session_state.messages = [
        {
            "role": "assistant",
            "content": "Ask me a question about content on documents in your PC!",
        }
    ]

if "quit" not in st.session_state:
    st.session_state.quit = False

# Chat functionality
if st.session_state.quit:
    st.write("Chat session ended. Refresh the page to start a new session.")
else:
    if prompt := st.chat_input("Your question"):
        if prompt.lower() == SAFE_WORD:
            st.session_state.quit = True
            st.write("Chat session ended. Refresh the page to start a new session.")
        else:
            st.session_state.messages.append({"role": "user", "content": prompt})
            response = chat_with_bot(prompt)
            st.session_state.messages.append({"role": "assistant", "content": response})

# Display chat messages
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.write(message["content"])        

Here is what it looks like in action: the bot has some memory and can carry on a simple conversation.

Here are a few other ideas you may want to explore:

  • If your project is more extensive, I'd recommend building a full-fledged RAG that can search tables and text separately, adding a separate SQL database and a knowledge graph, and using multiple agents (in LangChain or LlamaIndex, you will find tools for querying SQL/table data, or any structured data).
  • If you need to gather data from websites, it's better to build a RAG for websites (you will find plenty of tutorials on how to do this; a minimal starting point is sketched below.)
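
For the website case, a minimal LlamaIndex sketch might look like the following; it assumes the llama-index-readers-web package and a placeholder URL, and it reuses the same index/query pattern as the document RAG above.

from llama_index.core import VectorStoreIndex
from llama_index.readers.web import SimpleWebPageReader

# Placeholder URL - replace with the pages you actually need (requires OPENAI_API_KEY in the environment)
documents = SimpleWebPageReader(html_to_text=True).load_data(
    ["https://example.com/some-market-report"]
)
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
print(query_engine.query("What are the key growth figures mentioned on this page?"))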

No. 8: Build a Couple of Summarizers

I like to keep a couple of summarization tools around. You can obviously just use a prompt, but with the API you can customize the tool and create two or three different versions.

Here is a simple one using the native OpenAI API:

"""
File: doc_summarizer_wt_native_gpt
Author: Xiao-Fei Zhang
Date: 2024 Jun 27

Function to summarize a document using the native OpenAI API.
"""

import os
import logging
import sys
import argparse
from dotenv import load_dotenv
import openai

# Logging @ normal level
logging.basicConfig(
    stream=sys.stdout,
    level=logging.WARNING,  # Set the logging level to WARNING
    format="%(asctime)s - %(levelname)s - %(message)s",
)


def load_env(env_path=None):
    """
    Load environment variables from the specified .env file.

    Parameters:
    env_path (str): Path to the .env file. If None, loads from the default location.

    Returns:
    None
    """
    if env_path:
        load_dotenv(dotenv_path=env_path)
    else:
        load_dotenv()


def summarize_document(source_path, output_path, llm_model="gpt-4", env_path=None):
    """
    Summarize the content of a document and save the summary to a specified file.

    Parameters:
    source_path (str): Path to the source file or directory containing the document(s).
    llm_model (str): The language model to use for summarization ('gpt-3.5-turbo', 'gpt-4', or 'gpt-4o').
    output_path (str): Path to save the summary text file.
    env_path (str): Path to the .env file. If None, loads from the default location.

    Returns:
    None

    Examples:
    summarize_document("path/to/source/file.txt", "path/to/output/summary.txt", "gpt-4", "path/to/.env")
    summarize_document("path/to/source/directory", "path/to/output/summary.txt", "gpt-4o", "path/to/.env")
    """
    # Load the OpenAI API key from the environment variable
    logging.debug("Loading environment variables...")

    load_env(env_path)

    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        logging.error("OPENAI_API_KEY not found in environment variables.")
        return
    openai.api_key = api_key

    # Read the file contents
    if os.path.isdir(source_path):
        files = [
            os.path.join(source_path, f)
            for f in os.listdir(source_path)
            if os.path.isfile(os.path.join(source_path, f))
        ]
        file_contents = ""
        for file in files:
            with open(file, "r", encoding="utf-8") as f:
                file_contents += f.read() + "\n"
    else:
        with open(source_path, "r", encoding="utf-8") as file:
            file_contents = file.read()

    print(
        f"File contents: {file_contents[:100]}..."
    )  # Print the first 100 characters for verification

    # Generate the summary using the specified model
    prompt = f"""
    {{
        "role": "user",
        "task": "summarize",
        "content": "{file_contents}",
        "instructions": "Please summarize the given content. Ensure the summary includes all the key points and important details, and is presented in bullet points for clarity. Avoid missing any critical information."
    }}
    """
    response = openai.chat.completions.create(
        model=llm_model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
    )

    # Extract the summary text
    summary = response.choices[0].message.content

    # Save the summary to the output file
    with open(output_path, "w", encoding="utf-8") as file:
        file.write(summary)

    # Print the summary
    print(summary)


def cli():
    """
    Command-line interface function to execute the summarization process.
    """
    parser = argparse.ArgumentParser(
        description="Summarize a document using OpenAI's language models."
    )
    parser.add_argument(
        "source_path", type=str, help="Path to the source file or directory."
    )
    parser.add_argument(
        "llm_model",
        type=str,
        choices=["gpt-3.5-turbo", "gpt-4", "gpt-4o"],
        help="The LLM model to use for summarization.",
    )
    parser.add_argument(
        "output_path", type=str, help="Path to save the summary text file."
    )
    parser.add_argument(
        "--env_path", type=str, help="Path to the .env file.", default=None
    )

    args = parser.parse_args()

    summarize_document(
        args.source_path, args.output_path, args.llm_model, args.env_path
    )


def main():
    """
    Main function to execute the summarization process.

    This function uses predefined paths and model for demonstration purposes.
    """
    txt_file = r"Your input document file path - document in"
    summary_f_path = r"Your output file path - summary out"
    env_path = r"C:\github\china_blog_charting\.env"

    source_path = txt_file
    output_path = summary_f_path
    llm_model = "gpt-4o"
    summarize_document(source_path, output_path, llm_model, env_path)


if __name__ == "__main__":
    if len(sys.argv) > 1:
        cli()
    else:
        main()        

If you use GPT-3 or GPT-4, the summaries are generally more succinct; GPT-4o gives fuller and richer summaries. Sometimes I like to take the long version and customize it myself; other times, I just want a few bullets.

It has a CLI (command-line interface) function so that you can execute it from the terminal, which makes it a bit quicker (GPT really makes tasks like adding a CLI nice and easy!)

No. 9: Use LLMs to Write Charting Code

One thing that I really like about doing data analysis in Python instead of Excel is charting. It gives you a lot more flexibility and it's quick if you have to repeat similar charts with different datasets.

For example, my favorite is Altair - it's really fancy and highly configurable (you can layer different charts on top of each other, use different fonts/colors for different axes, and build tree charts, geo charts, etc., all with just a few lines of code.)

However, because of that, its learning curve is steep. You also need to have your data in a flat-file format (i.e., each row has vendor, year, quarter, region, country, market, value in USD, value in local currency, and so on), so you may need to "de-pivot" the data before using it. I used to spend a couple of hours customizing a single chart.
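
The "de-pivot" step itself is also something GPT will happily write for you; here is a minimal pandas.melt sketch with placeholder column names.

import pandas as pd

# Hypothetical wide table: one column per year
wide = pd.DataFrame(
    {
        "Region": ["East", "West"],
        "2022": [100, 80],
        "2023": [110, 90],
    }
)

# "De-pivot" into the flat format Altair prefers: one row per Region/Year observation
flat = wide.melt(id_vars="Region", var_name="Year", value_name="Value")
print(flat)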

Now, you can just ask GPT to write the charting code for you, do some minor tweaking and debugging, and be done in 15 to 30 minutes.

Here is a quick example:

import altair as alt

# ... load and normalize the data

# Create the base pie chart with size and title
chart_title = [
    "Central to Local Government Transfers by Regions (% of Total)",
    "- Selected Programs Only, 2023 Actual",
]  # use list to break up the title into 2 lines

# Create the base pie chart with size properties
base = alt.Chart(data).properties(width=450, height=500)

# Customize the chart with encoding
pie_chart = base.mark_arc(outerRadius=220).encode(
    theta=alt.Theta(field="Percentage of Total", type="quantitative"),
    color=alt.Color(field="Region", type="nominal", legend=None),
    tooltip=["Region", alt.Tooltip(field="Percentage of Total", format=".2%")],
)

# Calculate cumulative percentage for label positioning
data["Cumulative"] = data["Percentage of Total"].cumsum()
data["Cumulative"] = data["Cumulative"] - data["Percentage of Total"] / 2

# Add data labels
label_val = (
    alt.Chart(data)
    .mark_text(size=12, color="black")
    .encode(
        theta=alt.Theta("Cumulative:Q", stack=None),
        radius=alt.value(140),
        text=alt.Text("Percentage of Total:Q", format=".0%"),
    )
)

label_reg = (
    alt.Chart(data)
    .mark_text(radiusOffset=20, size=14, color="#000060", fontWeight="bold")
    .encode(
        theta=alt.Theta("Cumulative:Q", stack=None),
        radius=alt.value(255),  # Place labels outside the pie chart
        text=alt.Text("Region:N"),
    )
)
# Combine the pie chart and labels
final_chart = (
    (pie_chart + label_val + label_reg).properties(
        title={
            "text": chart_title,
            "subtitle": "Source: Ministry of Finance (PRC)",
            "fontSize": 16,
            "anchor": "middle",
            "color": "black",
        }
    )
    # .configure_legend(title=None, orient="right")
)

# Display the chart
final_chart.show()        

Charting output: this pie chart is more complicated than it looks, because there is extensive customization behind the scenes, such as the distance between the labels and the "pies" and the font color/size/style. Without GPT structuring it for me first, building this type of chart with Altair would be quite time-consuming, unless you are an Altair whiz.

No. 10: Don't Always Copy and Paste the Code

My last piece of advice is rather counterintuitive. It's easy to fall into the habit of just copying and pasting code from LLMs. But you have to remember that you are learning by doing.

We rely on LLMs to write code, but we still need to know how to read it. Even if you use tools like Devin, Copilot, or AutoDev, which write code for you automatically (or auto-complete it), building your own "muscle memory" is a huge plus, for two main reasons:

  1. As a market researcher, you often work on different projects, so you have to customize your tools. If you know how the code is built, you can quickly come up with new ideas for LLMs to implement.
  2. Debugging is not going away completely yet. If you have no concept of how the code works, you will end up spending more time trying to "brute force" LLMs into debugging it. Often, LLMs will just regurgitate the same code over and over again.

What I sometimes do is find the code changes GPT has made and manually type in the new code. I also ask GPT to explain the code to me. It may be slower in the beginning, but it saves time down the road.
