Patrick Boyle On Finance

Is AI Actually Useful?

February 25, 2024 Patrick Boyle Season 4 Episode 8

A new Harvard Business School study analyzed the impact of giving AI tools to white-collar workers at Boston Consulting Group.

In the study, management consultants who were told to use ChatGPT when carrying out a set of consulting tasks were far more productive than their colleagues who were not given access to the tool. Not only did AI-assisted consultants carry out tasks 25 per cent faster and complete 12 per cent more tasks overall, but their work was also assessed to be 40 per cent higher in quality than that of their unassisted peers.

In today's video we look at the pros and cons of using AI at work.

This Week's Sponsor:
Get Magical AI for free and save 7 hours every week: https://getmagical.com/patrick

Papers Mentioned:
Harvard Paper: https://www.hbs.edu/ris/Publication%20Files/24-013_d9b45b68-9e74-42d6-a1c6-c72fb70c7282.pdf
Nicholas Carlini Blog: https://nicholas.carlini.com/writing/2024/my-benchmark-for-large-language-models.html
Nicholas Carlini Quiz: https://nicholas.carlini.com/writing/llm-forecast/
Effects of AI on Employment Paper: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4527336

Patrick's Books:
Statistics For The Trading Floor:  https://amzn.to/3eerLA0
Derivatives For The Trading Floor:  https://amzn.to/3cjsyPF
Corporate Finance:  https://amzn.to/3fn3rvC

Patreon Page: https://www.patreon.com/PatrickBoyleOnFinance
Buy Me a Coffee: https://buymeacoffee.com/patrickboyle

Visit our website: www.onfinance.org
Follow Patrick on Twitter Here: https://twitter.com/PatrickEBoyle


Jensen Huang, the founder and CEO of NVIDIA, announced on their earnings call this Wednesday that generative AI has hit a tipping point. He went on to say that demand for AI chips was surging worldwide across companies, industries and nations. Nvidia has been the biggest single driver of returns in the S&P 500 so far this year, and most of the other top-performing companies claim to be leaders in AI, so how useful are these new tools, and to what extent do people need to learn how to use them to succeed in the workplace going forward?

Nicholas Carlini, a research scientist at Google DeepMind, published a blog post a few days ago examining his use of a variety of AI large language models to try to understand what these models are good at, and where they fail. I found the post really interesting, as it surprised me how these models can be successful at certain really complex tasks, but then fail at tasks you might imagine would be easy for them. They can do calculus, but then struggle to count. I was surprised to see that the AI was able to write code to create a website showing an American flag that changed color when clicked on, but was then unable to write code showing a drawing of a cake while playing the "happy birthday" song. These two tasks seem to be of similar difficulty, but the outcomes were really different.

Carlini shows that GPT-4 can write code to create a website where you can play tic-tac-toe, but it's unable to find the winning move in a game of tic-tac-toe, which seems like a much simpler task.

The idea that AIs can excel at complex tasks like idea generation, while struggling at tasks that should be easy for machines to do (like basic math), adds to the confusion over how useful they actually are. Given these surprising outcomes, are today's AI models useful in the workplace, or do we need to wait for them to improve before using them?

A quiz on Carlini's website allows you to guess how well AIs performed at various tasks, giving you a feeling for how well you understand their capabilities. A lot of the confusion around the usefulness of these models comes from a combination of people exaggerating their capabilities and the fact that their developers don't provide much guidance on the best ways to use these systems; best practices appear to be learned mostly by trial and error and by sharing experiences online.

There have been a number of significant breakthroughs in AI over the last decade. AlphaGo, an AI-based program, defeated a world champion Go player in 2016, which was quite a big deal and garnered a lot of press attention at the time, but then quickly faded from the public's interest.

Generative AI applications like ChatGPT, GitHub Copilot, Midjourney, and others have captured the imagination of the public much more than AlphaGo did, due to their availability and most importantly their ease of use—almost anyone can create a free account and use these models with next to no training. 

The latest generative AI applications can easily perform routine tasks, which is useful, but it is their ability to write text, compose music, and create digital art that has fascinated the public and drawn them in to experiment with these models in a way that prior technical advances in the field did not. As a result, everyone from small children to professionals has been trying to get to grips with generative AI's impact on business and society, but without much instruction on where these models work and where they fail.

According to McKinsey, generative AI could enable the automation of up to 70 percent of business activities, across almost all occupations, between now and 2030.  They argue that AI is likely to affect hours, tasks, and responsibilities for workers across wage rates and educational backgrounds, but that it will have an especially profound effect on professions traditionally requiring higher levels of education, whose work has been less affected by automation in the past.

A recent working paper from Harvard Business School entitled "Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality" illuminates the usefulness of these models by studying how management consultants given access to generative AI models while working on realistic tasks were more productive and produced higher quality output than those without access to AI. The analysis, importantly, showed the situations where the models failed, and how the consultants interacted with the tools. The consultants with AI access were free to ignore the AI solutions if they wanted to, and the study looked at how often they chose to do so, and how good those decisions were. The paper shows under which circumstances organizations and individuals might benefit from AI usage, and how this might change as the technology improves over time.

The Harvard study was a pre-registered experiment conducted with The Boston Consulting Group, a top-tier consulting firm. It involved 758 highly skilled BCG strategy consultants. After first establishing a performance baseline, subjects were assigned to one of three groups: a group with no AI access, a group with GPT-4 access, and a group with GPT-4 access plus a prompt engineering overview that increased their familiarity with AI.

The prompt engineering overview included instructional videos and documents that outlined effective usage strategies. 

The experiment aimed to understand how AI integration might reshape the workflows of highly skilled knowledge workers who have traditionally worked without AI assistance.

In the experiment the consultants were incentivized to perform as well as possible on the tasks assigned to them and were told that the top 20% of performers would be recognized by the committee that oversees their career development, meaning that there was no incentive to try to skew the results by deliberately underperforming at tasks. Everyone involved wanted to perform as well as possible.

The study involved two distinct experiments. In the first, the consultants were given problems to solve that were well within the capabilities of the AI model they were using; in the second, they were given what were described as "outside the frontier" tasks, which were specifically designed so that the AI couldn't easily complete them through simple copying and pasting of the instructions as a prompt.

Because these models are new and rapidly evolving, it can be a struggle for users to understand where AI can be helpful, and where its use is inappropriate. The authors describe this as the jagged technological frontier, where some tasks are easily done by AI, while others, though seemingly similar in difficulty, are outside the capability of the models. The study was designed both to understand the productivity impact of these models on highly skilled workers and to understand how professionals navigate this jagged frontier.

Some of the confusion in understanding the capability of Large Language Models comes from the fact that these models have surprising capabilities that they weren’t specifically created to have, and for this reason even their developers are not sure where this jagged technological frontier lies.

The Harvard study focused on complex tasks which were developed to be realistic and were designed with the input of professionals. A senior executive at BCG said that these tasks were “very much in line with the daily work activities” of the subjects involved.

The paper explains that versions of these tasks are in fact already used by BCG to screen job applicants, typically from elite academic backgrounds, for roles within the firm.

In each experiment, the participants first undertook a task without the aid of AI, to establish a baseline for performance. After this, participants were randomly assigned to one of the three groups to assess the influence of AI on their work.

The study found that performance in the initial task was predictive of performance in the experimental tasks, meaning that the tasks were of similar difficulty and tested similar skills.  Each task came with a time allocation, but the study focused more on qualitative differences rather than speed improvements brought about by using AI.

For the "inside the frontier" task, meaning the task on which AI could be expected to be helpful, participants first had to come up with a new beverage idea as the control task, and then a new footwear idea for niche markets as the experimental task. The consultants had to delineate every step involved, from prototype description to market segmentation to entering the market. An executive from a leading global footwear company verified that the task design covered the entire process their company typically goes through. The task required creativity, analytical thinking, writing proficiency and persuasiveness in pitching the idea.

The consultants using AI performed 38% better than those without AI, and those using AI who had also received training performed 42.5% better than those with no AI access. So AI improved performance, and AI plus training improved performance even more.

While speed was not the focus of the study, the control group completed on average 82% of their tasks, while the AI group completed 91% of their tasks and the AI-plus-training group completed 93% of theirs. So not only did access to AI improve the quality of the work, it also improved the productivity of the workers.

A very interesting part of this study was that the most significant beneficiaries of AI access were the lower-skilled workers. While both groups received a significant performance boost, the highest performers on the control task saw a 17% boost in their performance when given AI access, whereas the bottom half of performers saw a 43% boost.

This does make some intuitive sense if we think of AI models as excellent summarizers of existing knowledge that is already in the public domain. A highly skilled consultant whose knowledge is already near that limit will benefit less from the use of AI than someone who is less knowledgeable to begin with. But it is interesting to see that AI can reduce the advantage that highly skilled workers have over lower-skilled workers.

Another interesting finding was that while subjects using AI produced higher-quality ideas in the brainstorming sections of the study, there was a marked reduction in the variability of those ideas compared to subjects not using AI. This reminds me of something I read a while ago, which I can't find the source for, where the new AI models were described as being like television cameras pointed at a television: they can spit out good versions of things that already exist, but can't necessarily reach beyond their training data. The Harvard study said that while GPT-4 aids in generating superior content, it might lead to more generic results.

OK, so what about the "outside the frontier" tasks? Well, in this experiment the objective was for subjects to offer actionable strategic recommendations to a hypothetical company. This experiment was specifically designed to be a task that AI would struggle with if users simply copied and pasted the questions as a prompt.

The task was once again the type of business case that BCG uses in job interviews. 

The task was based on an existing business case study that used data in a spreadsheet, along with a file presenting interviews with company insiders. To solve the task correctly, participants had to interpret the data in the spreadsheet using subtle but clear insights from the interviews. The spreadsheet data was designed to appear comprehensive, but the interview notes revealed crucial details. A good consultant would adjust the data in the spreadsheet to incorporate the insights gleaned from the interviews. If you didn't do this, or if you didn't weigh the two sources of information correctly, you would come to the wrong conclusion.

This time around, subjects in the control group, with no AI access, came to the correct conclusion around 85% of the time, while the AI users were right 60% of the time, and those using AI after receiving training were right 70% of the time. The AI users once again completed the task more quickly, but they came to the wrong conclusion more often by relying on the AI tool rather than on their own analysis.

The test subjects were free to ignore the AI's output, or even to cut the AI out entirely, but they rarely did. One takeaway was that while both groups of AI users underperformed, the training did seem to reduce the level of underperformance.

The researchers went on to examine the behavior of the groups that navigated this task well, to see how their behavior differed from that of the group that failed at the task. They found that the consultants who used the AI tool successfully followed one of two strategies. They called one group Centaurs, after the mythical creature that is half human and half horse. This group switched between AI and human tasks, allocating responsibilities based on the strengths and capabilities of each, discerning which tasks were best suited for human intervention and which could be efficiently managed by AI.

The other successful strategy, which the study called Cyborg behavior, involved not just delegating tasks but intertwining their efforts with AI, alternating responsibilities at the subtask level, such as initiating a sentence for the AI to complete or working in tandem with the AI. So certain consultants worked out how to use these tools successfully in situations where the tools struggled.

The study concludes that while AI can boost the performance of highly skilled knowledge workers, the best approaches to using AI are not yet fully understood and need to be studied before these models are implemented carelessly in the workplace. The paper highlights the importance of validating and interrogating AI output, much in the way a senior employee might check the work of a new hire to make sure they haven't made any obvious mistakes. The study shows that the worst outcomes came from those who tended to blindly adopt the AI output without questioning it.

These models are likely to change the way knowledge workers work in the coming years, not unlike when desktop computers appeared on office desks thirty or more years ago. A big difference between now and then is that employers invested significantly in training employees to use computers, while large language models are mostly learned through trial and error, and the errors can be quite embarrassing.

A recent example of a bad implementation of AI is the lawsuit that Air Canada lost this week to a grieving passenger when it tried to walk away from the promises made by its AI-powered chatbot.

The chatbot told a customer that he could buy a ticket at full price now and later apply for the airline's bereavement discount, which was not Air Canada's policy. The customer had to sue the airline to get the discount offered by the errant chatbot.

Large language models are known to make up answers when asked difficult questions, or to make up citations, imitating the style of academic papers while referencing papers that don't actually exist. In a recent Stanford University study, researchers found that questions about federal court cases resulted in an extremely high error rate: ChatGPT fabricated responses 69 per cent of the time, while Meta's Llama 2 model hit an 88 per cent rate of fabrication. To effectively use AI tools in the workplace we need to better understand the jagged technological frontier described in this study, which will of course change over time as the models improve.

A paper published this summer called “The Short-Term Effects of Generative AI on Employment” showed that within a few months of the launch of Large Language Models, copywriters and graphic designers on online freelancing platforms saw a significant drop in the fees they could charge and in the number of jobs that they got. This suggested that AI was not only replacing their work, but also devaluing the work they could still get.

The most qualified and highest earning freelancers did the worst, demonstrating that being highly skilled provided no protection against loss of work or earnings to these freelancers.

The Harvard study showed a similar leveling of workers, where the less skilled consultants saw the greatest boost in performance when given access to AI tools, but only in tasks within the jagged frontier. The second half of the study showed that (at least today) the more multifaceted a task was, the more human judgement and real expertise were needed.

The Harvard study raises additional questions about the future of the workplace. Typically, junior employees gain expertise by working for senior employees and being given progressively more challenging tasks. If these "inside the frontier" tasks are delegated to AI in the future, rather than to junior employees, will it become more difficult for juniors to develop the expertise needed to tackle "outside the frontier" tasks?

The significant effect on brainstorming-type tasks seen in this study implies that AI will be used for this kind of work going forward, but the fact that the outputs are high quality yet possibly generic implies that human-generated ideas might stand out from the crowd in the future due to their distinctiveness.

Overall, I’m not awfully pessimistic about these new tools; I don’t believe that human effort will become obsolete in the future.  Similar arguments were made in the industrial revolution; they were made when computers became widely available and when internet access made knowledge more available to the public.  I think these tools will change the types of work we do, but that doesn’t have to be a bad thing.

I'd love to hear in the comments section whether you use AI tools at work, where you have found them useful, and what you have learned from your experiments.

Thanks for tuning in to this week’s podcast, if you found it interesting send a link to a friend to help the channel grow.  Have a great day and talk to you again soon, bye.

[Music]
