
@hamelsmu
Last active July 10, 2025 13:35

Revisions

  1. hamelsmu revised this gist Mar 27, 2024. 1 changed file with 7 additions and 7 deletions.
    14 changes: 7 additions & 7 deletions is_fine_tuning_valuable.md
    @@ -6,24 +6,24 @@ I think that fine-tuning is still very valuable in many situations. I’ve done

    - They are making developer tools - foundation models have been trained extensively on coding tasks.
    - They are building foundation models and testing for the most general cases. But the foundation models themselves are also being trained for the most general cases.
    - They are building a personal assistant that isn’t scoped to any type of domain or use case, and is essentially similar to the same folks building foundation models.
    - They are building a personal assistant that isn’t scoped to any type of domain or use case and is essentially similar to the same folks building foundation models.

    Another common pattern is that people often say this in earlier stages of their product development. One sign that folks are in really early stages is that they don’t have a domain-specific eval harness.

    **It’s impossible to fine-tune effectively without an eval system which can lead to writing off fine-tuning if you haven't completed this prerequisite.** It's impossible to improve your product without a good eval system in the long term, fine-tuning or not.

    You should do as much prompt engineering as possible before you fine-tune. But not for reasons you would think! The reasons for doing lots of prompt engineering is that its a great way to stress test your eval system!
    You should do as much prompt engineering as possible before you fine-tune. But not for reasons you would think! The reason for doing lots of prompt engineering is that it's a great way to stress test your eval system!

    If you find that prompt-engineering works fine (and you are systematically evaluating your product) then its fine to stop there. I'm a big believer in using the simplest approach to solving a problem. I just don't think you should write off fine-tuning just yet.
    If you find that prompt-engineering works fine (and you are systematically evaluating your product) then it's fine to stop there. I'm a big believer in using the simplest approach to solving a problem. I just don't think you should write off fine-tuning just yet.

    ### Examples where I've seen fine-tuning work well

    Generally speaking, fine-tuning works best to learn syntax, style and rules whereas techniques like RAG work best to supply the model with context or up-to-date facts.

    _These are some examples from companies I've worked with. Hopefully we will be able to share more details soon._
    _These are some examples from companies I've worked with. Hopefully, we will be able to share more details soon._

    - [Honeycomb's Natural Language Query Assistant](https://www.honeycomb.io/blog/introducing-query-assistant) - prevoiusly, the "programming manual" for the Honeycomb query language was being dumped into the prompt along with many examples. While this was OK, fine-tuning worked much better to allow the model to learn the syntax and rules of this niche domain-specific language.
    - [Honeycomb's Natural Language Query Assistant](https://www.honeycomb.io/blog/introducing-query-assistant) - previously, the "programming manual" for the Honeycomb query language was being dumped into the prompt along with many examples. While this was OK, fine-tuning worked much better to allow the model to learn the syntax and rules of this niche domain-specific language.

    - [ReChat's Lucy](https://www.youtube.com/watch?v=B_DMMlDuJB0) - this is an AI real estate assitant integrated into an existing Real Estate CRM sytem. ReChat needs LLM responses to be provided in a very idiosyncratic format that weaves together structured and unstructured data to allow the front-end to render widgets, cards and other interactive elements dynamically into the chat interface. Fine-tuning was the key to making this work correctly. [This talk](https://www.youtube.com/watch?v=B_DMMlDuJB0) has more details.
    - [ReChat's Lucy](https://www.youtube.com/watch?v=B_DMMlDuJB0) - this is an AI real estate assistant integrated into an existing Real Estate CRM system. ReChat needs LLM responses to be provided in a very idiosyncratic format that weaves together structured and unstructured data to allow the front end to render widgets, cards and other interactive elements dynamically into the chat interface. Fine-tuning was the key to making this work correctly. [This talk](https://www.youtube.com/watch?v=B_DMMlDuJB0) has more details.

    P.S. fine-tuning is not only limited to open or "small" models. There are lots of folks who have been fine-tuning GPT-3.5, such as [Perplexity.AI:](https://x.com/perplexity_ai/status/1695102998463009254?s=20) and [CaseText](https://casetext.com/blog/cocounsel-harnesses-gpt-4s-power-to-deliver-results-that-legal-professionals-can-rely-on/).
    P.S. Fine-tuning is not only limited to open or "small" models. There are lots of folks who have been fine-tuning GPT-3.5, such as [Perplexity.AI:](https://x.com/perplexity_ai/status/1695102998463009254?s=20) and [CaseText](https://casetext.com/blog/cocounsel-harnesses-gpt-4s-power-to-deliver-results-that-legal-professionals-can-rely-on/), to name a few.
  2. hamelsmu revised this gist Mar 27, 2024. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions is_fine_tuning_valuable.md
    @@ -10,9 +10,9 @@ I think that fine-tuning is still very valuable in many situations. I’ve done

    Another common pattern is that people often say this in earlier stages of their product development. One sign that folks are in really early stages is that they don’t have a domain-specific eval harness.

    **It’s impossible to fine-tune effectively without an eval system which can lead to writing off fine-tuning if you haven't completed this prerequisite.** I think its impossible to improve your product without a good eval system in the long term, fine-tuning or not.
    **It’s impossible to fine-tune effectively without an eval system which can lead to writing off fine-tuning if you haven't completed this prerequisite.** It's impossible to improve your product without a good eval system in the long term, fine-tuning or not.

    I think that you should do as much prompt engineering as possible before you fine-tune. But not for reasons you would think! The reasons for doing lots of prompt engineering is that its a great way to stress test your eval system!
    You should do as much prompt engineering as possible before you fine-tune. But not for reasons you would think! The reasons for doing lots of prompt engineering is that its a great way to stress test your eval system!

    If you find that prompt-engineering works fine (and you are systematically evaluating your product) then its fine to stop there. I'm a big believer in using the simplest approach to solving a problem. I just don't think you should write off fine-tuning just yet.

  3. hamelsmu revised this gist Mar 27, 2024. 1 changed file with 4 additions and 4 deletions.
    8 changes: 4 additions & 4 deletions is_fine_tuning_valuable.md
    @@ -18,12 +18,12 @@ If you find that prompt-engineering works fine (and you are systematically evalu

    ### Examples where I've seen fine-tuning work well

    Generally speaking, fine-tuning works best to learn syntax, style and rules whereas techniques like RAG work best to supply the model with context or up-to-date facts.

    _These are some examples from companies I've worked with. Hopefully we will be able to share more details soon._

    - [Honeycomb's Natural Language Query Assistant](https://www.honeycomb.io/blog/introducing-query-assistant) - prevoiusly, the "programming manual" for the Honeycomb query language was being dumped into the prompt along with many examples. While this was OK, fine-tuning worked much better to allow the model to learn the syntax and rules of this niche domain-specific language.

    - [ReChat's Lucy](https://www.youtube.com/watch?v=B_DMMlDuJB0) - this is an AI real estate assitant integrated into an existing Real Estate CRM sytem. Initially, Lucy hit a plateau with prompt engineering - fixing one failure mode led to other failure modes, making prompt engineering feel like whack-a-mole. Creating an evaluation system unlocked the ability to quickly improve the AI with fine-tuning. [This talk](https://www.youtube.com/watch?v=B_DMMlDuJB0) has more details.

    Fine tuning is not only limited to open models. There are lots of folks who have been fine-tuning GPT-3.5, such as [Perplexity.AI:](https://x.com/perplexity_ai/status/1695102998463009254?s=20) and [CaseText](https://casetext.com/blog/cocounsel-harnesses-gpt-4s-power-to-deliver-results-that-legal-professionals-can-rely-on/).

    - [ReChat's Lucy](https://www.youtube.com/watch?v=B_DMMlDuJB0) - this is an AI real estate assitant integrated into an existing Real Estate CRM sytem. ReChat needs LLM responses to be provided in a very idiosyncratic format that weaves together structured and unstructured data to allow the front-end to render widgets, cards and other interactive elements dynamically into the chat interface. Fine-tuning was the key to making this work correctly. [This talk](https://www.youtube.com/watch?v=B_DMMlDuJB0) has more details.

    P.S. fine-tuning is not only limited to open or "small" models. There are lots of folks who have been fine-tuning GPT-3.5, such as [Perplexity.AI:](https://x.com/perplexity_ai/status/1695102998463009254?s=20) and [CaseText](https://casetext.com/blog/cocounsel-harnesses-gpt-4s-power-to-deliver-results-that-legal-professionals-can-rely-on/).
  4. hamelsmu revised this gist Mar 27, 2024. 1 changed file with 4 additions and 0 deletions.
    4 changes: 4 additions & 0 deletions is_fine_tuning_valuable.md
    @@ -23,3 +23,7 @@ _These are some examples from companies I've worked with. Hopefully we will be
    - [Honeycomb's Natural Language Query Assistant](https://www.honeycomb.io/blog/introducing-query-assistant) - prevoiusly, the "programming manual" for the Honeycomb query language was being dumped into the prompt along with many examples. While this was OK, fine-tuning worked much better to allow the model to learn the syntax and rules of this niche domain-specific language.

    - [ReChat's Lucy](https://www.youtube.com/watch?v=B_DMMlDuJB0) - this is an AI real estate assitant integrated into an existing Real Estate CRM sytem. Initially, Lucy hit a plateau with prompt engineering - fixing one failure mode led to other failure modes, making prompt engineering feel like whack-a-mole. Creating an evaluation system unlocked the ability to quickly improve the AI with fine-tuning. [This talk](https://www.youtube.com/watch?v=B_DMMlDuJB0) has more details.

    Fine tuning is not only limited to open models. There are lots of folks who have been fine-tuning GPT-3.5, such as [Perplexity.AI:](https://x.com/perplexity_ai/status/1695102998463009254?s=20) and [CaseText](https://casetext.com/blog/cocounsel-harnesses-gpt-4s-power-to-deliver-results-that-legal-professionals-can-rely-on/).
  5. hamelsmu revised this gist Mar 27, 2024. 1 changed file with 8 additions and 0 deletions.
    8 changes: 8 additions & 0 deletions is_fine_tuning_valuable.md
    @@ -15,3 +15,11 @@ Another common pattern is that people often say this in earlier stages of their
    I think that you should do as much prompt engineering as possible before you fine-tune. But not for reasons you would think! The reasons for doing lots of prompt engineering is that its a great way to stress test your eval system!

    If you find that prompt-engineering works fine (and you are systematically evaluating your product) then its fine to stop there. I'm a big believer in using the simplest approach to solving a problem. I just don't think you should write off fine-tuning just yet.

    ### Examples where I've seen fine-tuning work well

    _These are some examples from companies I've worked with. Hopefully we will be able to share more details soon._

    - [Honeycomb's Natural Language Query Assistant](https://www.honeycomb.io/blog/introducing-query-assistant) - prevoiusly, the "programming manual" for the Honeycomb query language was being dumped into the prompt along with many examples. While this was OK, fine-tuning worked much better to allow the model to learn the syntax and rules of this niche domain-specific language.

    - [ReChat's Lucy](https://www.youtube.com/watch?v=B_DMMlDuJB0) - this is an AI real estate assitant integrated into an existing Real Estate CRM sytem. Initially, Lucy hit a plateau with prompt engineering - fixing one failure mode led to other failure modes, making prompt engineering feel like whack-a-mole. Creating an evaluation system unlocked the ability to quickly improve the AI with fine-tuning. [This talk](https://www.youtube.com/watch?v=B_DMMlDuJB0) has more details.
  6. hamelsmu revised this gist Mar 27, 2024. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion is_fine_tuning_valuable.md
    @@ -5,7 +5,7 @@ Here is my personal opinion about the questions I posed in [this tweet](https://
    I think that fine-tuning is still very valuable in many situations. I’ve done some more digging and I find that people who say that fine-tuning isn't useful are indeed often working on products where fine-tuning isn't likely to be useful:

    - They are making developer tools - foundation models have been trained extensively on coding tasks.
    - They are building foundation models and testing for the most general cases. But the foundation models themselves are also being trained for for the most general cases.
    - They are building foundation models and testing for the most general cases. But the foundation models themselves are also being trained for the most general cases.
    - They are building a personal assistant that isn’t scoped to any type of domain or use case, and is essentially similar to the same folks building foundation models.

    Another common pattern is that people often say this in earlier stages of their product development. One sign that folks are in really early stages is that they don’t have a domain-specific eval harness.
  7. hamelsmu revised this gist Mar 27, 2024. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion is_fine_tuning_valuable.md
    @@ -1,6 +1,6 @@
    Here is my personal opinion about the questions I posed in [this tweet](https://x.com/HamelHusain/status/1772426234032541962?s=20):

    ## Hamel's Opinion
    ---

    I think that fine-tuning is still very valuable in many situations. I’ve done some more digging and I find that people who say that fine-tuning isn't useful are indeed often working on products where fine-tuning isn't likely to be useful:

  8. hamelsmu revised this gist Mar 26, 2024. 1 changed file with 0 additions and 4 deletions.
    4 changes: 0 additions & 4 deletions is_fine_tuning_valuable.md
    @@ -15,7 +15,3 @@ Another common pattern is that people often say this in earlier stages of their
    I think that you should do as much prompt engineering as possible before you fine-tune. But not for reasons you would think! The reasons for doing lots of prompt engineering is that its a great way to stress test your eval system!

    If you find that prompt-engineering works fine (and you are systematically evaluating your product) then its fine to stop there. I'm a big believer in using the simplest approach to solving a problem. I just don't think you should write off fine-tuning just yet.

    ## The future

    There are ways in which this opinion could become outdated. For example, if GPT-5 trains on a bunch of domains very deeply as they did with coding, we could see fine tuning become less valuable, especially if GPT 5 becomes fast & cheap.
  9. hamelsmu revised this gist Mar 26, 2024. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion is_fine_tuning_valuable.md
    @@ -14,7 +14,7 @@ Another common pattern is that people often say this in earlier stages of their

    I think that you should do as much prompt engineering as possible before you fine-tune. But not for reasons you would think! The reasons for doing lots of prompt engineering is that its a great way to stress test your eval system!

    If you find that prompt-engineering works just fine then its fine to stop there. I'm a big believer in using the simplest approach to solving a problem. I just don't think you should write off fine-tuning just yet.
    If you find that prompt-engineering works fine (and you are systematically evaluating your product) then its fine to stop there. I'm a big believer in using the simplest approach to solving a problem. I just don't think you should write off fine-tuning just yet.

    ## The future

  10. hamelsmu revised this gist Mar 26, 2024. 1 changed file with 5 additions and 1 deletion.
    6 changes: 5 additions & 1 deletion is_fine_tuning_valuable.md
    @@ -14,4 +14,8 @@ Another common pattern is that people often say this in earlier stages of their

    I think that you should do as much prompt engineering as possible before you fine-tune. But not for reasons you would think! The reasons for doing lots of prompt engineering is that its a great way to stress test your eval system!

    If you find that prompt-engineering works just fine then its fine to stop there. I'm a big believer in using the simplest approach to solving a problem. I just don't think you should write off fine-tuning just yet.
    If you find that prompt-engineering works just fine then its fine to stop there. I'm a big believer in using the simplest approach to solving a problem. I just don't think you should write off fine-tuning just yet.

    ## The future

    There are ways in which this opinion could become outdated. For example, if GPT-5 trains on a bunch of domains very deeply as they did with coding, we could see fine tuning become less valuable, especially if GPT 5 becomes fast & cheap.
  11. hamelsmu revised this gist Mar 26, 2024. 1 changed file with 3 additions and 14 deletions.
    17 changes: 3 additions & 14 deletions is_fine_tuning_valuable.md
    @@ -10,19 +10,8 @@ I think that fine-tuning is still very valuable in many situations. I’ve done

    Another common pattern is that people often say this in earlier stages of their product development. One sign that folks are in really early stages is that they don’t have a domain-specific eval harness.

    **It’s impossible to fine-tune effectively without an eval system which can lead to writing off fine-tuning if you haven't completed this prerequisite. But also more importantly I think its impossible to improve your product without a good eval system in the long term, fine-tuning or not.**
    **It’s impossible to fine-tune effectively without an eval system which can lead to writing off fine-tuning if you haven't completed this prerequisite.** I think its impossible to improve your product without a good eval system in the long term, fine-tuning or not.

    Systematic evalautation is the most critical part of the entire AI - development process. If you have done the work of creating a great eval system and process, 99% of the work fine-tuning has already been done for you , which is curating high quality data.
    I think that you should do as much prompt engineering as possible before you fine-tune. But not for reasons you would think! The reasons for doing lots of prompt engineering is that its a great way to stress test your eval system!

    I think that you should do as much prompt engineering as possible before you fine-tune. But not for reasons you would think! The reasons for doing lots of prompt engineering is that its a great way to stress test your eval system!

    ### Common Objections

    O: Fine tuning makes things less portable and cumbersome
    A: Once you collect your data, the process of fine-tuning is extremely portable

    O: Fine-tuning is expensive
    A: The most expensive part of fine-tuning is learning how to do it. You only need a few 100 examples to fine-tune OpenAI, and training an open model is very straightforward.

    O: I'm able to get great results with prompt engineering, and fine-tuning doesn't seem worth it.
    A: Excellent. Don't fine-tune. But also, make sure you are evaluating your system rigorously with automated tests and metrics if possible to check things.
    If you find that prompt-engineering works just fine then its fine to stop there. I'm a big believer in using the simplest approach to solving a problem. I just don't think you should write off fine-tuning just yet.
  12. hamelsmu revised this gist Mar 26, 2024. 1 changed file with 23 additions and 3 deletions.
    26 changes: 23 additions & 3 deletions is_fine_tuning_valuable.md
    @@ -1,8 +1,28 @@
    I think that fine-tuning is still very valuable in many situations. I’ve done some more digging and I find that people who say that fine-tuning are not useful have a common pattern:
    Here is my personal opinion about the questions I posed in [this tweet](https://x.com/HamelHusain/status/1772426234032541962?s=20):

    ## Hamel's Opinion

    I think that fine-tuning is still very valuable in many situations. I’ve done some more digging and I find that people who say that fine-tuning isn't useful are indeed often working on products where fine-tuning isn't likely to be useful:

    - They are making developer tools - foundation models have been trained extensively on coding tasks.
    - They are building foundation models and testing for the most general cases. But the foundation models themselves are also being trained for for the most general cases.
    - They are building a personal assistant that isn’t scoped to any type of domain or use case, and is essentially similar to the same folks building foundation models.

    Another common pattern is that people often say this in earlier stages of their product development. One sign that folks are in really early stages is that they don’t have a domain-specific eval harness—they are still mostly doing vibe checks.
    **It’s impossible to fine-tune without an eval harness, so people write off fine-tuning.** But also most importantly I think its impossible to improve your product without an eval harness in the long term, fine-tuning or not.
    Another common pattern is that people often say this in earlier stages of their product development. One sign that folks are in really early stages is that they don’t have a domain-specific eval harness.

    **It’s impossible to fine-tune effectively without an eval system which can lead to writing off fine-tuning if you haven't completed this prerequisite. But also more importantly I think its impossible to improve your product without a good eval system in the long term, fine-tuning or not.**

    Systematic evalautation is the most critical part of the entire AI - development process. If you have done the work of creating a great eval system and process, 99% of the work fine-tuning has already been done for you , which is curating high quality data.

    I think that you should do as much prompt engineering as possible before you fine-tune. But not for reasons you would think! The reasons for doing lots of prompt engineering is that its a great way to stress test your eval system!

    ### Common Objections

    O: Fine tuning makes things less portable and cumbersome
    A: Once you collect your data, the process of fine-tuning is extremely portable

    O: Fine-tuning is expensive
    A: The most expensive part of fine-tuning is learning how to do it. You only need a few 100 examples to fine-tune OpenAI, and training an open model is very straightforward.

    O: I'm able to get great results with prompt engineering, and fine-tuning doesn't seem worth it.
    A: Excellent. Don't fine-tune. But also, make sure you are evaluating your system rigorously with automated tests and metrics if possible to check things.
  13. hamelsmu revised this gist Mar 26, 2024. 1 changed file with 2 additions and 1 deletion.
    3 changes: 2 additions & 1 deletion is_fine_tuning_valuable.md
    @@ -4,4 +4,5 @@
    - They are building foundation models and testing for the most general cases. But the foundation models themselves are also being trained for for the most general cases.
    - They are building a personal assistant that isn’t scoped to any type of domain or use case, and is essentially similar to the same folks building foundation models.

    Another common pattern is that people often say this in earlier stages of their product development. One sign that folks are in really early stages is that they don’t have a domain-specific eval harness—they are still mostly doing vibe checks. (I’ve found that surprisingly, many people don’t know how to build an eval harness.) It’s impossible to fine-tune without an eval harness, so people write off fine-tuning. But also most importantly I think its impossible to improve your product without an eval harness in the long term, fine-tuning or not.
    Another common pattern is that people often say this in earlier stages of their product development. One sign that folks are in really early stages is that they don’t have a domain-specific eval harness—they are still mostly doing vibe checks.
    **It’s impossible to fine-tune without an eval harness, so people write off fine-tuning.** But also most importantly I think its impossible to improve your product without an eval harness in the long term, fine-tuning or not.
  14. hamelsmu created this gist Mar 26, 2024.
    7 changes: 7 additions & 0 deletions is_fine_tuning_valuable.md
    @@ -0,0 +1,7 @@
    I think that fine-tuning is still very valuable in many situations. I’ve done some more digging and I find that people who say that fine-tuning are not useful have a common pattern:

    - They are making developer tools - foundation models have been trained extensively on coding tasks.
    - They are building foundation models and testing for the most general cases. But the foundation models themselves are also being trained for for the most general cases.
    - They are building a personal assistant that isn’t scoped to any type of domain or use case, and is essentially similar to the same folks building foundation models.

    Another common pattern is that people often say this in earlier stages of their product development. One sign that folks are in really early stages is that they don’t have a domain-specific eval harness—they are still mostly doing vibe checks. (I’ve found that surprisingly, many people don’t know how to build an eval harness.) It’s impossible to fine-tune without an eval harness, so people write off fine-tuning. But also most importantly I think its impossible to improve your product without an eval harness in the long term, fine-tuning or not.
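The gist's central claim across these revisions is that a domain-specific eval system is the prerequisite for both prompt engineering and fine-tuning. A minimal sketch of what such a harness looks like is below. Everything in it (the case data, the `model` stub, the function names) is a hypothetical illustration, not something from the gist or from any of the companies it mentions:

```python
# Minimal sketch of a domain-specific eval harness (hypothetical example).
# The "model" here is a stub; in practice it would call a prompted or
# fine-tuned LLM. The point is the loop: fixed cases, automated checks,
# and a single pass-rate number you can compare across prompt variants
# and fine-tuned checkpoints -- the opposite of ad-hoc vibe checks.

def model(query: str) -> str:
    """Stand-in for an LLM call that translates English to a query language."""
    canned = {
        "slow requests": "WHERE duration_ms > 1000",
        "errors by service": "GROUP BY service WHERE status >= 500",
    }
    return canned.get(query, "")

# Each case pairs an input with an automated check on the output.
EVAL_CASES = [
    ("slow requests", lambda out: "duration_ms" in out),
    ("errors by service", lambda out: "GROUP BY" in out and "status" in out),
]

def run_evals() -> float:
    """Run every case through the model and return the fraction that pass."""
    passed = sum(1 for query, check in EVAL_CASES if check(model(query)))
    return passed / len(EVAL_CASES)

if __name__ == "__main__":
    print(f"pass rate: {run_evals():.0%}")
```

Once a harness like this exists, the comparison the gist describes becomes mechanical: run the same cases against a prompt-engineered baseline and against a fine-tuned model, and let the pass rate decide whether fine-tuning earned its keep.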