S&P Global at Questioning the Answers: LLMs in Boardroom Strategy

Published 07/03/2025, 00:10

On Thursday, 06 March 2025, S&P Global (NYSE: SPGI) hosted "Questioning the Answers: LLMs enter the Boardroom," a conference that explored the strategic use of large language models (LLMs) in assessing executive performance during earnings calls. The event highlighted both promising and challenging aspects of integrating LLMs into financial analysis, with insights into stock performance linked to executive communication styles.

Key Takeaways

  • Executives who are proactive and on-topic can generate 247 basis points of alpha annually, while reactive and off-topic counterparts face negative 256 basis points.
  • The analysis covered the Russell 3000 index, using data from January 2008 to September 2024.
  • Proactive communication is linked to reduced speculative uncertainty and higher valuation multiples.
  • LLMs are becoming more integrated into investment workflows, though adoption is still in early stages.
  • A coding notebook was provided to replicate the research and analysis.

Financial Results

  • Proactive and on-topic managers generate 247 basis points of alpha per year, while reactive and off-topic managers see negative 256 basis points.
  • Russell 3000 long side generates 190 basis points of alpha with a 61% hit rate; Russell 2000 generates 200 basis points with a 62% hit rate.
  • Long-short strategies yield 390 basis points for the Russell 3000 and 450 basis points for the Russell 2000, both with a 63% hit rate.

Operational Updates

  • S&P Global Market Intelligence uses machine-readable transcripts from January 2008 to September 2024, focusing on the Russell 3000 index.
  • Questions and answers from earnings calls are summarized using Snowflake’s Cortex LLM summarization function.
  • LLM embeddings vectorize text to maintain semantic meaning, and cosine similarity scores determine on-topic responses.
  • Portfolios are ranked monthly, with top 20% companies held long and bottom 20% shorted.

Future Outlook

  • Proactive executives signal strong risk management and operational strength, reducing speculative uncertainty.
  • Transparent firms with competitive advantages are more likely to communicate clearly and directly.
  • LLMs are expected to become more integrated into investment management workflows, although still in early adoption stages.

Q&A Highlights

  • Sector variations are controlled by ensuring equal representation in long and short portfolios.
  • Cosine scores are used to assess whether responses are on or off-topic, rather than binary assessments.
  • LLM challenges include structuring input data and fine-tuning for accuracy.
  • Evasive behavior in executives is a negative indicator for future performance.

Readers are invited to refer to the full transcript for a comprehensive understanding of the methodology and findings.

Full transcript - Questioning the Answers: LLMs enter the Boardroom:

Dan Samberg, Moderator, Head of Quant Research, S and P Global Market Intelligence: Good morning. I'm Dan Samberg, and I head Quant Research for S and P Global Market Intelligence. We have an incredible webinar to share with you today, Questioning the Answers: LLMs Enter the Boardroom. In this webinar, we're going to take a look at how large language models can be used to score executive performance on earnings calls to help investors make data informed decisions. First off, as your moderator, I have a few housekeeping reminders. This webinar features closed captioning in English.

To activate, simply click the closed caption icon in the media player. At the conclusion of the session, a brief survey will appear. Completing it takes less than a minute, and your feedback is invaluable to us. If you're joining us for the replay, please use the request demo link found under the related content widget to reach out to us. This widget also includes links to white papers and other relevant collateral. You can also access our webinar replay portal to revisit this session and others on demand.

With that now out of the way, we’re ready to dive in. We have an all star lineup joining us today. Liam Hynes, our Global Head of New Product Development for Quant Research will be walking through the team’s work. Henry Chang, Quant Analyst is going to walk us through the code for this research. And Ronan Feldman, Founder of Pronto NLP, which is now an S and P Global Market Intelligence Company, will discuss how Pronto and this research come together.

Liam, take us away.

Liam Hynes, Global Head of New Product Development for Quant Research, S and P Global Market Intelligence: Great. Thank you, Dan. Okay. We’re going to jump right into it. There’s a lot of content to cover today.

So first of all, we're going to start off with a little bit of a history lesson. On October 16, 2001, Enron held their third quarter 2001 earnings call. In the Q and A section of that call, one of the questions asked was: how confident can we be that these will be the last write-offs? Kenneth Lay, Enron's CEO, responded: if we thought we had any other impaired assets, they would be on this list today. But we do still have at least three areas of uncertainty in the company, which you're aware of.

Of course, one's California, we've got India, and then of course, finally, broadband. Was Kenneth Lay reactive or proactive on this write-off topic in his prepared remarks? When analysts posed questions on write-offs, there were six questions on write-offs from analysts, and write-offs were not mentioned once in the prepared remarks. That meant that Kenneth Lay was being reactive rather than proactive on the write-off topic. Did Kenneth remain on topic to the analyst question? No, he did not. California, India and broadband do not equal write-offs.

Kenneth Lay was pivoting and totally off topic to the question asked. I think folks probably know the rest of the story. Enron collapsed in December 2001, wiping around $74,000,000,000 off its shareholder value, costing thousands of employees their jobs and retirement savings. Lay was actually indicted in 2004 on charges including fraud and conspiracy, and in May 2006, he was convicted on multiple counts. This exchange with the analyst was actually evidence in the case the Department of Justice brought against Lay, as prosecutors argued that Lay's statement was intentionally misleading as he failed to disclose Enron's true financial troubles.

So that gives us a nice segue into the hypotheses that we want to test. The first hypothesis is: do firms that remain on topic when answering questions during earnings call Q and A have superior stock performance compared to those that pivot to adjacent or unrelated topics? And the second: do firms that proactively address key issues in their prepared remarks, before analysts ask about them in the earnings call Q and A, have superior stock performance compared to those that respond reactively? So what we're going to look at today: the data set this analysis was stood up on top of is S and P Market Intelligence machine readable transcripts. The time period we're looking at is January 2008 to September 2024, and we're looking at the Russell 3000.

Okay. So let's have a look at an example here. The first thing we need to do is identify the question and answer pairs in the earnings call and process them. In this example, we show Caterpillar's second quarter 2023 earnings call and a question around adjusted margin on lower sales. What we do is we push this question to Snowflake's Cortex LLM summarization function.

And what that does is it gives us a summarized question: given the 2023 performance, should margins be adjusted on lower sales, and is this due to pricing? Okay, well, why do we do that? Well, summarization does four things: it does noise reduction, it improves semantic matching, it standardizes for comparability, and lastly, there is some computational efficiency to be gained by doing that. So let's look at the question and answer pairs and what we actually do here.
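As a sketch of this step in Python (the table and column names here are hypothetical, though SNOWFLAKE.CORTEX.SUMMARIZE is the Cortex function referenced in the talk), the summarization call can be expressed as a simple query builder:

```python
def summarize_sql(table: str, text_col: str) -> str:
    """Build a query that condenses each question or answer with Snowflake's
    Cortex summarization function before embedding (names are hypothetical)."""
    return (
        f"SELECT id, SNOWFLAKE.CORTEX.SUMMARIZE({text_col}) AS summary "
        f"FROM {table}"
    )

# Example: summarize every question in a hypothetical qa_pairs table.
sql = summarize_sql("qa_pairs", "question_text")
```

The resulting query string would then be executed through whatever Snowflake client the notebook uses.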

So first of all, we start off with the original question and answer in their long form. Then we summarize those questions and answers. And then what we do is we use large language model embeddings. We vector embed the text. So the LLM vector embedding represents text as a numerical vector.

What this does is it preserves the semantic meaning for efficient processing downstream. And by vector embedding the text, we can now identify whether the answer to the question is on or off topic by calculating the cosine similarity score between the question and answer vectors. So a high cosine score indicates that the answer is using concepts and language similar to the question, i.e., it's on topic, and a low cosine score equates to the answer being off topic.

We do this for every question and answer pair in the earnings call, and then we get an average cosine score for the entire call. So what do we do downstream once we've done this feature generation? We want to see if on topic executives outperform their off topic peers. We've just run through the feature engineering. Next, what we'll do is we'll form portfolios.
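The on-topic scoring just described can be sketched in Python, assuming the question and answer embeddings are already available as NumPy vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between a question vector and an answer vector."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def call_on_topic_score(qa_pairs: list) -> float:
    """Average cosine score over every (question, answer) embedding pair
    in one earnings call -- the call-level on-topic feature."""
    return float(np.mean([cosine_similarity(q, a) for q, a in qa_pairs]))
```

A high average indicates the call's answers stayed on topic; a low average flags an executive who tends to pivot away from the questions.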

So what we do is, by taking the average cosine score, we resample it to the end of the month, we rank the companies in each sector, and then we go long the top 20% of companies with a high cosine score, remember that's a high on-topic score, and then we short the companies with a low cosine score, companies who are off topic. We then look at the forward returns, which we have Fama-French adjusted for market, value, size and momentum. And then we've also adjusted the forward returns for some natural language processing signals: sentiment, language complexity and numeric transparency. So here are the results for the Russell baskets. So you can see here on the long side, the Russell 3000 generates statistically significant alpha of 190 basis points, and that's with a hit rate of 61%, and the smaller cap Russell 2000 generates 200 basis points with a hit rate of 62%.
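A minimal pandas sketch of that portfolio formation step, with hypothetical column names; the real backtest also applies the Fama-French and NLP adjustments, which are omitted here:

```python
import pandas as pd

def form_portfolios(scores: pd.DataFrame, quantile: float = 0.20) -> pd.DataFrame:
    """scores: columns ['date', 'sector', 'ticker', 'cosine_score'].
    Take the latest score per ticker each month, rank within each sector,
    go long the top 20% (most on-topic) and short the bottom 20%."""
    monthly = (scores
               .assign(month=scores['date'].dt.to_period('M'))
               .sort_values('date')
               .groupby(['month', 'ticker'], as_index=False)
               .last())
    # Percentile rank within each month/sector bucket.
    monthly['pct'] = monthly.groupby(['month', 'sector'])['cosine_score'].rank(pct=True)
    monthly['position'] = 0
    monthly.loc[monthly['pct'] > 1 - quantile, 'position'] = 1   # long on-topic
    monthly.loc[monthly['pct'] <= quantile, 'position'] = -1     # short off-topic
    return monthly
```

The sector-level ranking mirrors the point made later in the Q&A that long and short books are kept sector-balanced.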

On the long short side, the Russell 3000 generates 390 basis points of alpha with a 63% hit rate, and the Russell 2000 generates 450 basis points, again with a very strong hit rate of 63%. So pure alpha signal from identifying on or off topic executives on an earnings call. So now we're going to move on to how we calculated the proactive and reactive score. For this, we used prompt engineering to get an LLM to pretend it was an executive on an earnings call and to answer the questions. The context that we gave the LLM executive to answer the analyst question was just the prepared remarks and the answers to the preceding questions.
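That context construction can be sketched as a simple prompt builder; the exact wording of the production prompt isn't shown, so the phrasing below is illustrative:

```python
def build_executive_prompt(prepared_remarks: str,
                           prior_answers: list,
                           analyst_question: str) -> str:
    """Role-play prompt for the LLM executive. Its only context is the
    prepared remarks plus answers to all preceding questions, so a topic
    not addressed proactively cannot inform its answer."""
    context = "\n\n".join([prepared_remarks] + prior_answers)
    return (
        "Pretend you are a top executive on an earnings call.\n"
        "Using only the context below, answer the analyst's question.\n\n"
        f"Context:\n{context}\n\n"
        f"Analyst question: {analyst_question}\n"
        "Answer:"
    )
```

The LLM's answer is then embedded and compared (by cosine similarity) to the real executive's answer to score proactiveness.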

So for example, if there were 20 questions on the call, for the twentieth question, we gave the LLM executive the prepared remarks plus the answers to the preceding 19 questions for it to answer. And then we do the exact same process as we did previously. We take those features and we form the portfolios, going long the top 20% of proactive companies and shorting the bottom 20% of reactive companies. And we see the alpha results here. So again, a long short portfolio: on the long side, the Russell 3000 generates statistically significant alpha of 74 basis points, with a hit rate of 53%.

And the smaller cap Russell 2000 generates just under 100 basis points there, 96 basis points, with a hit rate of 61%. And on the long short side, the Russell 3000 generates 173 basis points with a 55% hit rate, and then the smaller cap Russell 2000 generates 240 basis points with a hit rate of 57%. So again, a pure alpha signal from just identifying proactive and reactive executives on an earnings call. So I'm just going to jump into a couple of examples now, going back to the Caterpillar example. One of the questions from the earnings call was: backlog increased by $300,000,000. Was this due to pricing or an increase in order volumes?

Well, higher sales and order volumes were actually covered in the prepared remarks by Andrew Bonfield, the CFO. And when we posed that question to the LLM executive, the answer was: the backlog increase was due to a mix of higher prices and increased order volumes resulting from new orders and dealer inventory changes. That came out with a cosine score of 0.89, putting it in the 80th percentile, and this indicated that the executive was being proactive. So what was the actual original executive answer? The original executive answer was: price had a significant impact, while volumes fluctuate quarter by quarter; backlog remains healthy.

And how did that executive score? The 82nd percentile, and the executive was answering the question and on topic. So we look at how Caterpillar performed two weeks post that earnings call: the S and P 500 was down 2.16% and Caterpillar actually outperformed, returning 7.3% two weeks post the call, so a 9% active return over the S and P 500. On to another example, this is taking a question from Golden Ocean Group's first quarter 2023 earnings call. The question was: could you talk about the FFA markets, as Capes, the large vessels, are in backwardation and smaller vessels are in contango?

Well, the FFA markets were not covered in the prepared remarks by the executive. The LLM answer on the FFA market had a cosine score in the tenth percentile, so the executive was being reactive, i.e., this topic was not covered in the prepared remarks and the executive was reacting to the analyst's question. How did the executive actually answer the question? Our comments are more in the longer term perspective, so I won't be able to comment specifically on the FFA curve.

I would say that that's not even going off topic. I think that's changing the topic or just bluntly not even answering the topic. The cosine score is 0.69, fifteenth percentile. The executive is clearly answering off topic. What happened to Golden Ocean Group's share price?

Two weeks post the call, the S and P 500 was up 1% and Golden Ocean Group dropped 17.3%. Okay. So we've identified two alpha producing signals. One is proactiveness and the second is on-topicness. But what happens if we create four communication styles from those two signals?

So the first one we looked at is a proactive and on topic manager. So this is an executive that is giving the analysts everything that they want to know in the prepared remarks. And when the analysts ask a question, the executive remains on topic to the question asked. The second is proactive and off topic. So proactive in the prepared remarks, but answers the question off topic.

The third is reactive and on topic. So even though they didn’t cover the topics in the pre prepared remarks, when analysts do ask about those topics in the Q and A, executives remain on topic to the question asked. And then you’ve got the entire flip side, you’ve got reactive and off topic. So these are executives where they haven’t given everything that the analysts are looking for in the prepared remarks. And when those analysts go looking for that information with their questions, the executives go off topic.

And the performance is quite telling. So you can see here the blue line, the proactive and on topic executives, significantly outperformed their reactive and off topic peers. This is a backtest that was done over the past sixteen years on the Russell 3000. So proactive and on topic managers generate 247 basis points of alpha per year. And the flip side, their reactive and off topic counterparts, generate negative 256 basis points of alpha per year.

So that's a differential of 503 basis points on the long short side from those two communication styles. So what we're looking at here is a table of those returns. The top left quadrant that you can see here is a proactive and on topic manager and the bottom right quadrant is a reactive and off topic manager. Essentially, what we did is a dependent sort on proactiveness. So we looked at the proactive cosine score and we put that into three buckets, into three tertiles.

And then within each of those tertiles, we tertiled it on the on topic score. And what that does is it gives you nine distinct portfolios. And what you’re looking at here is the return of those portfolios, the T stat in the brackets and the hit rate percentage below that. So firms with the most on topic and proactive executives outperform those with the most off topic and reactive executives by more than 5% per year. The spread of on versus off topic is larger when executives are reactive and the spread of proactive versus reactive is larger when executives are off topic.
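The dependent double sort can be sketched with pandas qcut; column names are hypothetical:

```python
import pandas as pd

def dependent_double_sort(df: pd.DataFrame) -> pd.DataFrame:
    """First tertile firms on the proactive score; then, within each
    proactive tertile, tertile again on the on-topic score. Crossing the
    two labels yields nine distinct portfolios."""
    df = df.copy()
    df['proactive_t'] = pd.qcut(df['proactive_score'], 3, labels=False) + 1
    df['on_topic_t'] = (df.groupby('proactive_t')['on_topic_score']
                          .transform(lambda s: pd.qcut(s, 3, labels=False) + 1))
    return df
```

Because the second sort is conditional on the first, each on-topic tertile is formed relative to peers with a similar level of proactiveness.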

But an interesting exercise to do is to rebase the returns in the table. So if I’m looking at the top left proactive and on topic quadrant here, I can see that 100%. That represents the maximum return that you can get from this strategy. On the bottom right, the 0% there that you can see represents the zero return that you can get from this. So the interesting thing is that it is the combination of the two evasive behaviors, both reactiveness and off topic alignment that signals underperformance.

So managers are significantly penalized if they’re both reactive and coupled with off topic. It makes sense, right, they haven’t covered some key topics in the prepared remarks that analysts are looking for. And then when analysts actually go and try and find information from the executive on those topics, they’re avoiding the question, going off topic and potentially being evasive.
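The rebasing exercise described above amounts to a min-max normalization over the quadrant returns, sketched here with illustrative numbers rather than the actual table values:

```python
import numpy as np

def rebase(quadrant_returns: np.ndarray) -> np.ndarray:
    """Min-max rebase: the best communication style maps to 100% of the
    achievable strategy return, the worst to 0%."""
    lo, hi = quadrant_returns.min(), quadrant_returns.max()
    return (quadrant_returns - lo) / (hi - lo)
```

Under this scaling, a quadrant's rebased value reads directly as "cents on the dollar" of the best style's return, which is how the moderator phrases it in the next exchange.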

Dan Samberg, Moderator, Head of Quant Research, S and P Global Market Intelligence: Hey, Liam, fantastic. So far, we’re getting a lot of action through our Q and A widget here. I see a question has just come in. So just to clarify on this analysis, I’m going to rephrase the question a bit here. So what you’re basically saying is that you could be the most off topic executive, but as long as you’re proactive, you can earn $0.75 on the dollar and as long as you’re on topic, even if you’re the most reactive, you can earn $0.69 on the dollar.

Is that the way to understand this?

Liam Hynes, Global Head of New Product Development for Quant Research, S and P Global Market Intelligence: Yes, that's exactly right, Dan. You can be an executive and be entirely off topic, but as long as you are proactive in the presentation, you're not penalized as much. And the same: you can be an entirely reactive executive on the earnings call, but as long as you're answering the analyst questions and remaining on topic, you're not penalized. It's when both of those characteristics are blended in an executive that they're really penalized. When they're both reactive and off topic, there's significant deterioration in the returns of that company.

Okay, so we've run through and shown empirical results that these two behavioral signals are predictive of forward returns. But why? We've seen the what, but why are firms experiencing these returns? Well, when a manager or an executive is on the earnings call and they are answering questions, what does their one year forward gross profit look like? Reactive and off topic managers, about one year from the earnings call, generated around 12% growth in their gross profit.

But proactive and on topic managers have around 2.5 times that growth. They generate 31% growth in their gross profit one year following the earnings call. So essentially what that means is that you have executives that are exuding some confidence because they understand their business very well, and they can probably see if there are any headwinds or any operational inefficiencies in their organization in the coming twelve months. So their non evasiveness breeds transparency: they're willing to answer every question and remain on topic, and they're not evading or hiding anything from the audience.

Dan Samberg, Moderator, Head of Quant Research, S and P Global Market Intelligence: We’ve got another good inbound here, Liam. A question, did you look at whether there was any multiple expansion within the company? So just sort of dovetailing on what you were saying there, it looks like firms can either appreciate and value via stock price through multiple expansion or through improving actuals. And so you’re showing improving actuals here, which argues that there’s real improvement in the firm. How about perception wise?

Did we see any growth in the multiple that the stock trades on for the proactive and on topic firms?

Liam Hynes, Global Head of New Product Development for Quant Research, S and P Global Market Intelligence: Yes, there was some slight growth in that. The main effect on the multiples basically comes from the next slide I have, but essentially proactively addressing key issues reduces speculative uncertainty, so it actually leads to a lower risk premium. So the more transparent you are, potentially the lower the risk premium you have. And what that can do is result in a higher valuation multiple as investors price in stability and some predictability around future earnings growth. So a very well timed question there from the audience.

Thank you. And then the other economic rationale that we have is strategic foresight and competitive positioning, right? So firms that proactively address investor concerns, they demonstrate strong risk management and strategic foresight. And what that does is it signals operational strength. Confident firms with durable competitive advantages are more likely to engage in clear direct communication and then firms in weaker positions may avoid key topics signaling underlying vulnerabilities.

So like I mentioned on the previous slide with the gross profit growth, if you're an executive and you know that your twelve month outlook isn't that strong, it puts you in a weaker position, and you avoid key topics and may try to avoid difficult or hard questions from analysts. I'm going to hand it over to Henry Chang. Henry is going to run through a little bit of a coding tutorial on how we took the machine readable transcripts, pointed them at the LLM API and came up with these two scores, the proactive score and the on topic score. I'll hand it over to you, Henry, if you want to share your screen.

Dan Samberg, Moderator, Head of Quant Research, S and P Global Market Intelligence: And while Henry is pulling that up, I’ll just let the audience know as when Henry shares his screen, the media player should get larger within the webinar console. So the slides will still be visible on one side, but the demo is going to occur in the media player and that should resize automatically for you. As a reminder, everything in the webinar console is resizable, movable. So if you need to adjust, now is a good time to do that. Go ahead, Henry.

Henry Chang, Quant Analyst, S and P Global Market Intelligence: Thanks, Dan. Thanks, Liam. So in my part, I'll be covering how to actually start from the transcript and derive those two signals that Liam just mentioned. For the purpose of today's demo, I'm pulling up Caterpillar's Q2 2023 earnings call transcript in PDF format, but keep in mind that on the back end, we have 196,000 transcripts in machine readable format to be processed systematically. So let's have a look at this transcript.

So in the transcript, you can have a look at all the participants. There are two executives, the CEO and the CFO, attending this earnings call, and there are also analysts from investment banks that jump onto this earnings call and ask the questions. So there are two sections in an earnings call. First, they start with the presentation. This is when the CEO and the CFO jump onto the call, tell everyone how much money they've made in that quarter, and also talk about other things like risks and outlook, and basically provide details of the company's financials.

And then there's a second section, which is the Q and A. This is the section where the investment banking analysts jump onto the call, having heard the prepared remarks that the executives just gave, and ask questions around those topics. So you can see here that these are different analysts, and the executives take turns to answer those questions. The example Liam just gave in the case study was this question that came from Temi Zakaria. She is an equity researcher from JP Morgan and her question was on the backlog.

So the backlog increased by $300 million, and she was asking if that's purely driven by pricing or an increase in the order volume. So now I'm jumping over to Snowflake to show you how we process this systematically on the back end. So let me pull up the same transcript. It's the second quarter of 2023 and the company is Caterpillar. So in here you can see this transcript.

We are starting from the Q and A pairing table. So essentially what it says is that we have everything from the very first question, which is normally just greetings, paired to the answers. So we have all the question and answer pairs running from the first to the end. And on average, we have about 20 to 30 of these Q and A pairs in an earnings call transcript. On the back end, we paired all of these together so that we can process them, we can do all the vector embedding and the cosine similarity calculation that Liam just mentioned, and here I'm going to show you how powerful our machine readable transcript product is.

So it gives you the details of the transcript. So basic information like the call date, so this earnings call was conducted on August 1, 2023. It's the second quarter of 2023 and you can also see the headlines. So these are all just basic information of the earnings call, right? Nothing too surprising.

But here is the part that really shows how amazing our machine readable transcript is: how we can link each of the questions to the person that's asking the question. We actually have our professionals dataset that allows you to map each of the components from the transcript to the person that's asking a question. So here you can see that Temi Zakaria was on here and you can see her question pro ID. So this allows you to map this person to the estimates.

So it's a very powerful tool that allows you to map the questions to the estimates, so you will be able to analyze things like whether, in an earnings call, an executive is picking more bearish than bullish analysts to ask the questions, or map this to the estimates to help predict the outcome for the company. And you can also do that on the answer side. So you'll be able to map to the person who is answering that question by the pro ID. So you know that some of the answers are coming from the CFO and some are coming from the CEO, and this allows you to analyze the language each person is using in answering these questions. So it opens up tons of NLP analysis work that you can do on the transcript.
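The mapping just described can be sketched as a pandas join; every column name and value below is hypothetical and only illustrates the shape of the linkage between transcript components and the professionals dataset:

```python
import pandas as pd

# Hypothetical transcript components: each question carries a pro ID.
questions = pd.DataFrame({
    'transcript_id': [101, 101],
    'component_order': [24, 25],
    'pro_id': [7, 8],
    'question_text': ['Backlog increase: pricing or volume?',
                      'Margins on lower sales?'],
})

# Hypothetical professionals dataset keyed by the same pro ID.
professionals = pd.DataFrame({
    'pro_id': [7, 8],
    'name': ['Temi Zakaria', 'Analyst B'],
    'firm': ['JP Morgan', 'Bank B'],
})

# Link each question to the professional who asked it.
linked = questions.merge(professionals, on='pro_id', how='left')
```

With the same join on the answer side, each response can be attributed to the CEO or CFO for per-speaker language analysis.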

So jumping back to our signals, what we did was vector embed both the questions and the answers, and then we can calculate the cosine similarity between them. That's the first signal, the on topic signal. And we also have the second signal, which is the proactive and reactive signal, and for that we have to generate the LLM response. And here is the prompt that generates all that. So as Liam mentioned, the prompt first starts with: pretend to be a top executive, please answer these questions that came from analysts. And then we provided it with 60% of the prepared remarks that the executive gave, and then we asked the same question that the analyst asked.

So now let's take the example of that question that came from Temi Zakaria. It's the 24th question in the component order, so let's select that, and we can also see the answer corresponding to that question. So what's happening on the back end is that the prompt I just showed you, we are giving that to an LLM, and the LLM is generating the response on the fly. So here you can see that this was the question, it was on the $300,000,000 increase in backlog, and here's the executive answer that we just saw in the PDF, and now this is the LLM answer.

So as Liam mentioned, we also summarized all three of them for consistency and for standardization. Here is the summary of the question. So indeed, it was talking about the $300,000,000 increase. Here's the answer. The executive answered that the increase was due to both pricing and volume.

The LLM answered similarly: it's about pricing and volume. So from here, you can vector embed both the question and the answers, and here are your signal scores. With that, I'm going to hand back to Dan, and he's going to introduce Ronan from Pronto.

Dan Samberg, Moderator, Head of Quant Research, S and P Global Market Intelligence: Thank you so much, Henry and Liam. That was a fantastic deep dive into how we construct these signals. To quickly recap, we built these signals using machine readable transcripts using an off the shelf large language model. The pre calculated signals for this research are now an integrated part of Pronto NLP, providing actionable insights to investors. Pronto NLP also includes many other signals, including those generated with a fine tuned, purpose built large language model designed specifically for financial applications.

And what we’d like to do now is to help us understand the ProntoNLP approach and how this enhancement aids in the generation of the signals. We are going to turn it over to Ronan Feldman to discuss a little more about Pronto. Ronan, over to you.

Ronan Feldman, Founder, Pronto NLP, S and P Global Market Intelligence Company: Thank you, Dan. Basically, what you see here is we take an earnings call, we break it into its sections, like the presentation part and obviously the Q and A pairs. And then we break it further into sentences. And if the sentences are complex, we break them even into multiple phrases. For each phrase, we identify the sentiment that we have for that phrase: positive, negative or neutral.

Then we also identify the importance, how important that particular phrase is within the context of the whole section. So we have high, medium and low in terms of the importance. When we calculate scores for the signals, we can use the importance in order to provide weights. Then we also have an explanation. The explanation is extremely important because it's like a guardrail to make sure that there are no hallucinations.

We look at the explanation and we see if it really correlates to the actual text. We look at the numbers that are in the explanation to see if they appear inside the text. If we see that there is no connection between the explanation and the actual text, we actually run it again. So that minimizes the chances of any hallucinations that we may get from the LLM. One of the nice things that we get from the LLM is that events are generated automatically.

There is no predefined taxonomy. So unlike a lot of other competing products, where you have a predefined taxonomy and when there is a new topic, you do not discover it until you manually change the taxonomy, we use the LLM to automatically detect new topics. The problem is that there are 2,700,000 topics that were identified by the LLM. That's way too many for any quantitative signal.
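Reducing millions of LLM-discovered topics to a small set of events can be sketched with a minimal k-means over the topic embeddings; the actual clustering algorithm used in production isn't specified, so this is purely illustrative:

```python
import numpy as np

def cluster_topics(embeddings: np.ndarray, k: int, iters: int = 50,
                   seed: int = 0) -> np.ndarray:
    """Minimal k-means: group semantically similar topic embeddings into
    k event clusters (the production system uses around 110 events)."""
    rng = np.random.default_rng(seed)
    centers = embeddings[rng.choice(len(embeddings), k, replace=False)]
    for _ in range(iters):
        # Assign each topic to its nearest event centroid.
        d = np.linalg.norm(embeddings[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned topics.
        for j in range(k):
            if (labels == j).any():
                centers[j] = embeddings[labels == j].mean(axis=0)
    return labels
```

Each cluster then acts as one tradable event category, while the raw LLM topic labels remain available for inspection.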

And this is why we use the embeddings of all the instances that the LLM identified and cluster them into 110 events. When you actually use our platform, which is part of the offering, you do get the original LLM text, so you can see it and consume it using the API. Let's go now to an example that actually utilizes the topics that we identify. So you saw previously that we identify all those events.
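The clustering step described here, collapsing millions of LLM-detected topic instances into roughly 110 events, can be sketched as follows. This is a minimal illustration assuming k-means over topic embeddings; the random vectors stand in for real embedding-model output, and none of the names reflect the actual Pronto NLP pipeline.

```python
# Sketch: collapsing a huge set of LLM-detected topic instances into a fixed
# event taxonomy by clustering their embeddings. The embeddings below are
# random stand-ins for real embedding-model output; names are illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
topic_embeddings = rng.normal(size=(2000, 64))  # stand-in embedding vectors

N_EVENTS = 110  # target number of event clusters, as in the talk
km = KMeans(n_clusters=N_EVENTS, n_init=10, random_state=0)
event_ids = km.fit_predict(topic_embeddings)

# Every raw topic instance is now mapped to one of at most 110 event buckets.
```

The cluster count and algorithm are design choices; the point is that a fixed event taxonomy falls out of the embedding geometry rather than a hand-maintained list.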

And you can see that, combining all the amazing work that Dan, Liam and Henry did with the topics that we identify, you can find interesting peaks for specific topics connected to off-topicness and reactiveness, to see exactly which topics executives like to avoid. Let's look at the first one. In 2021, we remember the supply chain issues during COVID. That was a topic that executives who were off topic did not like to discuss.

The next one we can see is recession, in 2022. There was a big jump in negative sentiment related to a recession, and that again was the main topic that executives wanted to avoid answering direct questions about; when the questions were around recession, they usually tried to find some detour. And the last one is margin, in 2023. Again, for all of those executives below the 0.8 score threshold, we see a big increase in margin-related topics among executives who tried to avoid the questions or find some way around them.

So back to you, Dan.

Dan Samberg, Moderator, Head of Quant Research, S and P Global Market Intelligence: Thank you, Ronan. That was a great deep dive on Pronto NLP and a fantastic webinar so far, folks. We’ve got plenty, plenty of time for Q and A and a lot of questions coming in. So looking forward to that session. Before we jump into the questions, we’ve got a polling question up.

To what extent have you incorporated large language models into your investment management workflows? We'll give everyone just a few seconds to fill that out there. Lots of folks at different parts of the journey and lots of considerations to be made before bringing this in. Okay, we've got another ten seconds here to fill that out. Again, folks, there's a Q and A widget in the webinar console and we'll be sifting through questions coming in in droves here.

That’s great. Keep them coming while you fill out that polling question. All right, let’s take a look at our results. We actively use LLMs only 8.3%, experimenting 25%, exploring 33% and not currently using 33%. So still a very early part of the journey for many folks.

And I think this hopefully will help with getting started on that process. Okay. So let’s jump into some of the questions we’ve received so far. I’ll give the first one over to Liam. A question came in, how do you account for sector specific variations in executive Q and A when constructing your signals?

I think you addressed that, but maybe you could just recap real quick.

Liam Hynes, Global Head of New Product Development for Quant Research, S and P Global Market Intelligence: Sure, sure. So I’m going to be on topic here. The best way to do this is probably repeat the question. So how do we control for sector variations in the portfolio construction? What we do is we obviously construct the on topic scores and the proactive scores.

And then what we do is, at the end of every month, we go into each sector and rank within the sector. So we'll get the top 20% and the bottom 20% in each sector, and then we combine those 11 sectors' top 20% and bottom 20%. And what that ensures is that there's an equal representation of sectors in the long portfolio and in the short portfolio.
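The sector-neutral selection Liam describes can be sketched in a few lines of pandas. This is an illustrative sketch with made-up column names and simulated scores, not the production portfolio construction code.

```python
# Sketch of sector-neutral selection: rank the on-topic score within each
# sector, then take the top and bottom 20% of every sector.
# Column names and sectors here are illustrative, not the real schema.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "ticker": [f"T{i}" for i in range(100)],
    "sector": rng.choice(["Tech", "Energy", "Health", "Financials"], size=100),
    "on_topic_score": rng.uniform(0, 1, size=100),
})

# Percentile rank within each sector (near 0 = worst, 1 = best).
df["sector_pct"] = df.groupby("sector")["on_topic_score"].rank(pct=True)

longs = df[df["sector_pct"] >= 0.8]   # top 20% of each sector
shorts = df[df["sector_pct"] <= 0.2]  # bottom 20% of each sector
```

Because the top-ranked name in every sector always clears the top-quintile cutoff, each sector is represented on both sides of the book, which is what removes the sector tilt.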

Dan Samberg, Moderator, Head of Quant Research, S and P Global Market Intelligence: Fantastic. So sector neutral, equal representation in long and short. So we don’t have a sector tilt in any of those. Here’s another one. I really like this question.

Why not just ask the LLM: is this question on topic, and was it addressed in the prepared remarks?

Liam Hynes, Global Head of New Product Development for Quant Research, S and P Global Market Intelligence: Do you want me to take that one, Dan? Sure. Yes. Well, there's a few reasons why we didn't do that. One is that it's probably too subjective for the LLM, and also it would be a binary outcome for the LLM to determine whether or not the answer was on topic.

So it would either say that the question was on topic or was off topic. So it would be a binary assessment and it’s not a continuous quant assessment like the cosine score. So, the LLM doesn’t necessarily do this well. So if you go back to the Golden Ocean answer, technically the LLM might have said that that was on topic, right, because the executive mentioned the FFA curve, but quantitatively, it was off topic. And, you know, we’ve obviously published this research.

It comes along with a coding notebook, and there's a RAG engine that we stood up on the back end for this, and there's a couple of reasons why we went down that route. We need consistency in responses from the LLM. Technically, you could ask the LLM, is this question on or off topic? And then you could ask the LLM the same question again and it might give you a different answer. So something where you're relying on the LLM to be very subjective gives you very inconsistent results.

So I could run it back just today and tomorrow if I ask the LLM again, I might get varying results altogether. And so the way we did it this way is we ring fenced the LLM to generate a very refined feature and then we pipe that feature into our backtesting framework.
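The "very refined feature" Liam mentions, a continuous cosine similarity between question and answer embeddings rather than a binary LLM verdict, can be sketched as below. The vectors here are hand-made stand-ins for real embedding-model output.

```python
# Sketch of the continuous on-topic score: cosine similarity between the
# embedding of the analyst's question and the embedding of the executive's
# answer. Real embeddings would come from an embedding model; these small
# vectors only illustrate the scale of the score.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity in [-1, 1]; with non-negative embeddings it lies in [0, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

question_vec = np.array([0.9, 0.1, 0.3])
on_topic_answer = np.array([0.8, 0.2, 0.35])    # points the same way -> high score
off_topic_answer = np.array([0.05, 0.95, 0.1])  # different direction -> low score

print(cosine_similarity(question_vec, on_topic_answer))
print(cosine_similarity(question_vec, off_topic_answer))
```

Unlike a yes/no prompt, the score is deterministic for a fixed embedding model, which is what makes it usable in a backtesting framework.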

Dan Samberg, Moderator, Head of Quant Research, S and P Global Market Intelligence: Yes, that's a very comprehensive answer. So to your point, you quantify it as opposed to just labeling it with a binary label. And then, since the cosine score goes from zero to one, or minus one to one depending on whether your vectors are all positive or not, one might think arbitrarily that off topic is less than 0.5 or something. But I think you were finding that most of the off topic answers weren't down near zero; it's really about a 0.8 threshold that kind of divided the universe on the median. Is that right?

Liam Hynes, Global Head of New Product Development for Quant Research, S and P Global Market Intelligence: Yes, that's correct. And actually, we have a slide in the appendix, I just put it up on the screen there, that might kind of help answer this subjectivity question, right? So we did a bit of an experiment where we asked an LLM to look at the Q and A of an earnings call and we gave it two prompts. The first prompt said, you are a financial expert, and then asked it to score the Q and A section from very negative to very positive on a scale of minus two to plus two, so five increments. In the second prompt, we just did not include the financial expert; we just said score the Q and A from a very negative minus two to a very positive plus two. And you can see here in the chart that the financial expert prompt came out with a mean of one.

And then when we excluded the financial expert from the prompt, it actually came out closer to minus one. So with very small changes, the LLM can be very subjective. You don't get this consistency in responses when you ask the LLM a very kind of broad-based question.

Dan Samberg, Moderator, Head of Quant Research, S and P Global Market Intelligence: Makes sense. So the vectorization and the cosine score are all very consistent and well controlled, whereas the LLM responses have more variability. This actually dovetails with another question we've got, and maybe, Ronan, you could jump in on this one as a generic question. What are the biggest challenges in applying LLMs in the financial markets, and how do you address those in both this research and Pronto NLP?

Ronan Feldman, Founder, Pronto NLP, S and P Global Market Intelligence Company: So I think one of the main issues is first you need to structure the input in the right way. In this case, we did it for earnings calls; we actually do it for any kind of financial content, like filings. Then you need to account for a lot of things that need to be filtered. So just the pre-processing, until you get really clean text that is structured in the right way, is very important. The other thing is I want to go back to the point that Dan mentioned before.

We actually do heavy fine-tuning of the Llama models in order to get to a really good financial LLM. So we use over 10,000 tagged paragraphs using a bootstrapping approach, the classic teacher-student framework. And we saw that that alone, if you do it in the right way, gave a huge boost in terms of the alpha that you get out of the signal.
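The teacher-student bootstrapping pattern Ronan refers to can be illustrated on toy data. This is only the generic self-training pattern, shown with scikit-learn on synthetic features; the actual pipeline fine-tunes Llama models on tagged financial paragraphs, which this sketch does not attempt to reproduce.

```python
# Sketch of teacher-student bootstrapping on toy data: a teacher trained on
# a small labelled seed set pseudo-labels a large unlabelled pool, and a
# student is trained on the combined set. All names/data are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Small labelled seed set (stand-in for hand-tagged paragraphs -> labels).
X_seed = rng.normal(size=(200, 16))
y_seed = (X_seed[:, 0] > 0).astype(int)

# Large unlabelled pool (stand-in for untagged paragraphs).
X_pool = rng.normal(size=(5000, 16))

teacher = LogisticRegression().fit(X_seed, y_seed)

# Keep only confident pseudo-labels from the teacher.
proba = teacher.predict_proba(X_pool).max(axis=1)
confident = proba > 0.9
X_aug = np.vstack([X_seed, X_pool[confident]])
y_aug = np.concatenate([y_seed, teacher.predict(X_pool[confident])])

student = LogisticRegression().fit(X_aug, y_aug)
```

The confidence filter is the key design choice: it trades pool coverage for label quality, so the student learns from a much larger, mostly clean training set.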

Dan Samberg, Moderator, Head of Quant Research, S and P Global Market Intelligence: Maybe Liam, coming back to you, we’ve got a few questions here around comparing this approach to traditional sentiment analysis, Did you control for the sentiment of the call when you were calculating your returns and what other sort of considerations might have been made?

Liam Hynes, Global Head of New Product Development for Quant Research, S and P Global Market Intelligence: Yes, let me just pull up the slide with the backtest results. So you can see here on the third bucket there on the backtesting, when we computed the portfolio returns, we looked at the one-month forward returns. We did actually control for sentiment in those forward returns. What we did is we used the Fama-French four-factor residuals, but then we also looked at natural language processing signals, sentiment being net positivity.

So we looked at the Loughran-McDonald dictionary that was published, I think, in 2011, and we looked at all the positive words from that dictionary in the call and all the negative words, and you can look at a ratio of the net positivity. So we actually stripped that positive sentiment out of the signal. So it's a pure alpha signal. It doesn't have that sentiment variable embedded into it.

Dan Samberg, Moderator, Head of Quant Research, S and P Global Market Intelligence: Got it. So how about correlation-wise? Do you see any major correlation between earnings call sentiment and these other signals?

Liam Hynes, Global Head of New Product Development for Quant Research, S and P Global Market Intelligence: No, actually. We ran a Fama-MacBeth regression on it and it actually came out very, very strong. So the on-topic score has got about 80 basis points of a coefficient with a very strong t-stat of around 3.5, and that's after we controlled for seven or eight fundamental factors and three or four sentiment factors. So no, it seems to be kind of a unique on-topic and proactive signal that we're after finding here. It's not correlated to sentiment, which makes sense really when you think about it, right?

Because sentiment is looking at whether or not the executive was speaking positively or negatively, right? But theoretically, I could be answering a question on or off topic and still have a positive or negative angle on it. So I could be answering a question off topic, but be very positive about how I was answering it off topic, right? So it is an accretive signal to sentiment. The way I like to think about it is, this is more of a behavioral signal rather than a sentiment signal, right?

We're identifying behavioral characteristics of the manager on the call, and the behavior is how they answer the question. Are they on topic or off topic? And then what's their presentation like? Are they proactive or reactive? So I would say it's independent of sentiment.
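A Fama-MacBeth check of the kind Liam describes, monthly cross-sectional regressions of forward returns on the on-topic score followed by a t-test on the time series of monthly coefficients, can be sketched on simulated data. The planted effect and all numbers below are illustrative, not the published results.

```python
# Sketch of a Fama-MacBeth test: regress next-month returns on the on-topic
# score cross-sectionally each month, then t-test the monthly coefficients.
# Data are simulated with a small planted effect; magnitudes are illustrative.
import numpy as np

rng = np.random.default_rng(7)
n_months, n_stocks = 120, 300
TRUE_COEF = 0.008  # planted effect, purely illustrative

coefs = []
for _ in range(n_months):
    score = rng.uniform(0, 1, n_stocks)                 # on-topic scores
    ret = TRUE_COEF * score + rng.normal(0, 0.05, n_stocks)  # forward returns
    X = np.column_stack([np.ones(n_stocks), score])
    beta = np.linalg.lstsq(X, ret, rcond=None)[0]
    coefs.append(beta[1])                               # monthly slope

coefs = np.array(coefs)
t_stat = coefs.mean() / (coefs.std(ddof=1) / np.sqrt(n_months))
print(coefs.mean(), t_stat)
```

A real test would also orthogonalize the score against the fundamental and sentiment controls before the cross-sectional step.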

Dan Samberg, Moderator, Head of Quant Research, S and P Global Market Intelligence: Henry, this next question I think is for you here. Can you talk a little bit about the large language model that was used? You mentioned Llama; which Llama model was used? And how much of the proactive and on-topic signal uses the original transcript product, and how much is part of Pronto NLP?

Henry Chang, Quant Analyst, S and P Global Market Intelligence: Absolutely. So we used the Llama 3.1 8B model. And the reason we used this model is because, at that time, we were looking for an LLM with a large enough context length. That was a big thing for us, because if you think about it, an average earnings call transcript has about 10,000 tokens in its prepared remarks and another 10,000 tokens in its answers.

So we needed to find a model large enough to fit all of this text in. GPT-3.5 was an option at that time, but due to the context length constraint, we had to go with Llama. And in terms of whether this is built on top of Pronto: at that time, this was purely done on the machine readable transcripts. But then after we got those results, we could combine this signal with Pronto, which is another value add that strengthens the results.

Dan Samberg, Moderator, Head of Quant Research, S and P Global Market Intelligence: Excellent. Thank you. Liam, coming back to you. On these signals, could an executive deliberately go off topic to avoid discussing bad news? And how often do we see executives going off topic during earnings calls?

Liam Hynes, Global Head of New Product Development for Quant Research, S and P Global Market Intelligence: Good question. So could an executive purposely go off topic to avoid answering questions, is that what you asked?

Dan Samberg, Moderator, Head of Quant Research, S and P Global Market Intelligence: Sure. And I think it's just a clarifying sort of question as to what you're trying to capture.

Liam Hynes, Global Head of New Product Development for Quant Research, S and P Global Market Intelligence: Right. Yes. We're essentially trying to capture evasiveness, right, or the executive not wanting to cover a particular topic. So that's what we're trying to capture when we're looking at the on topic and off topic from the executive.

And it seems, systematically, when you look across the Russell 3000, it is efficacious and it is quite a strong signal. So when executives exhibit just that one characteristic of not being able to answer the question and remain on topic, it's a very poor sign for forward results. And when you think about it, there are two reasons why you might have an executive who goes off topic, right? One is that they just might not have the competency to answer the question, or they might not know. And the second one is that they're purposely pivoting or going off topic.

Both of those reasons are bad, right? One is evasiveness, which is not good news, and the other one is that the executive potentially doesn't understand the question or the business model or the operations associated with it. I'm sorry, Dan, what was the second question?

Dan Samberg, Moderator, Head of Quant Research, S and P Global Market Intelligence: How often do executives go off topic? I think you had a slide towards the end there on the number of off topic and reactive questions and I think there was an absolute sort of number. Do you have a sense of what

Liam Hynes, Global Head of New Product Development for Quant Research, S and P Global Market Intelligence: Yes. So this is quite interesting, actually. If we look at a cosine score threshold of about 0.8 on both the proactive score and the on-topic score, you can see here that it hovers around, let's say, 28% or 30% of executives who are reactive and off topic. So it's quite a sizable chunk, maybe anywhere between a quarter and just under a third. And actually, if you look at this chart, you can see that there are some spikes here in the fourth quarter, right?

And the reason we think there might be some spikes in the fourth quarter is that the fourth quarter earnings call is normally going to be centered around the full-year results. So when executives are delivering their prepared remarks, it's mainly going to be centered around financials, and there might be less room for other topics in there. Hence you get this spike in reactiveness in the fourth quarter, because they're discussing full-year results. And you see the same in them being off topic in the fourth quarter, and that's potentially because they might be focusing on the financial results or prioritizing other components in the earnings call.

So they have less room for spontaneous discussion.
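The quarterly off-topic share discussed here, the fraction of calls falling below the 0.8 cosine threshold in each fiscal quarter, reduces to a simple groupby. The scores below are simulated and the column names illustrative.

```python
# Sketch of the quarterly off-topic share: flag calls whose cosine score
# falls below the 0.8 threshold and compute the fraction per fiscal quarter.
# Scores are simulated; in practice they come from the embedding pipeline.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "quarter": rng.choice(["Q1", "Q2", "Q3", "Q4"], size=2000),
    "on_topic_score": rng.uniform(0.5, 1.0, size=2000),
})

THRESHOLD = 0.8
share = (df["on_topic_score"] < THRESHOLD).groupby(df["quarter"]).mean()
print(share)  # fraction of off-topic calls per quarter
```

Plotting this series over calendar time is what surfaces the Q4 spikes Liam points to.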

Dan Samberg, Moderator, Head of Quant Research, S and P Global Market Intelligence: Fantastic. Maybe one more question. We've got about five minutes left, and all three of you can maybe weigh in. How should those interested in adopting large language models, and generative AI in general, into their process think about the approach? How should they get started?

Liam Hynes, Global Head of New Product Development for Quant Research, S and P Global Market Intelligence: Well, there's one way you can get started. Let me jump to this slide: if you scan this QR code, you'll be able to get access to the paper. In the paper, we also have a coding notebook, and that is very comprehensive: thousands of lines of code in there to stand our readers up on machine readable transcripts, generating the data frame, getting those Q and A pairs, vector-embedding the text, doing the cosine similarity scores, the whole shebang. Everything that we've covered in the presentation today and in the research, you can see in the coding notebook. If you click a link in the PDF, it will bring you to a no-compute coding notebook where you can view all of our code, so it's fully transparent. That's one way you could successfully stand up some LLM integration on your textual data suite.

Dan Samberg, Moderator, Head of Quant Research, S and P Global Market Intelligence: Okay. This has been a fantastic discussion, and I want to thank all of our speakers, Liam, Henry and Ronan, for sharing your expertise today. For those who want to revisit today's session, a webinar replay will be available. You can also check out the related content widget, which should have links to the white paper, the coding notebook, and additional resources. On behalf of S and P Global Market Intelligence, thank you very much for joining us.

We look forward to continuing this conversation. Have a great day ahead, folks.

Liam Hynes, Global Head of New Product Development for Quant Research, S and P Global Market Intelligence: Thanks, Dan. Thanks, folks.

Ronan Feldman, Founder, Pronto NLP, S and P Global Market Intelligence Company: Yes. Thanks, Dan. That was wonderful.
