
@hamelsmu
Last active July 10, 2025 13:35

Revisions

  1. hamelsmu revised this gist Mar 27, 2024. 1 changed file with 7 additions and 7 deletions.
    14 changes: 7 additions & 7 deletions is_fine_tuning_valuable.md
    @@ -6,24 +6,24 @@ I think that fine-tuning is still very valuable in many situations. I’ve done

    - They are making developer tools - foundation models have been trained extensively on coding tasks.
    - They are building foundation models and testing for the most general cases. But the foundation models themselves are also being trained for the most general cases.
    - They are building a personal assistant that isn’t scoped to any type of domain or use case, and is essentially similar to the same folks building foundation models.
    - They are building a personal assistant that isn’t scoped to any type of domain or use case and is essentially similar to the same folks building foundation models.

    Another common pattern is that people often say this in earlier stages of their product development. One sign that folks are in really early stages is that they don’t have a domain-specific eval harness.

    **It’s impossible to fine-tune effectively without an eval system which can lead to writing off fine-tuning if you haven't completed this prerequisite.** It's impossible to improve your product without a good eval system in the long term, fine-tuning or not.

    You should do as much prompt engineering as possible before you fine-tune. But not for reasons you would think! The reasons for doing lots of prompt engineering is that its a great way to stress test your eval system!
    You should do as much prompt engineering as possible before you fine-tune. But not for reasons you would think! The reason for doing lots of prompt engineering is that it's a great way to stress test your eval system!

    If you find that prompt-engineering works fine (and you are systematically evaluating your product) then its fine to stop there. I'm a big believer in using the simplest approach to solving a problem. I just don't think you should write off fine-tuning just yet.
    If you find that prompt-engineering works fine (and you are systematically evaluating your product) then it's fine to stop there. I'm a big believer in using the simplest approach to solving a problem. I just don't think you should write off fine-tuning just yet.

    ### Examples where I've seen fine-tuning work well

    Generally speaking, fine-tuning works best to learn syntax, style and rules whereas techniques like RAG work best to supply the model with context or up-to-date facts.

    _These are some examples from companies I've worked with. Hopefully we will be able to share more details soon._
    _These are some examples from companies I've worked with. Hopefully, we will be able to share more details soon._

    - [Honeycomb's Natural Language Query Assistant](https://www.honeycomb.io/blog/introducing-query-assistant) - prevoiusly, the "programming manual" for the Honeycomb query language was being dumped into the prompt along with many examples. While this was OK, fine-tuning worked much better to allow the model to learn the syntax and rules of this niche domain-specific language.
    - [Honeycomb's Natural Language Query Assistant](https://www.honeycomb.io/blog/introducing-query-assistant) - previously, the "programming manual" for the Honeycomb query language was being dumped into the prompt along with many examples. While this was OK, fine-tuning worked much better to allow the model to learn the syntax and rules of this niche domain-specific language.

    - [ReChat's Lucy](https://www.youtube.com/watch?v=B_DMMlDuJB0) - this is an AI real estate assitant integrated into an existing Real Estate CRM sytem. ReChat needs LLM responses to be provided in a very idiosyncratic format that weaves together structured and unstructured data to allow the front-end to render widgets, cards and other interactive elements dynamically into the chat interface. Fine-tuning was the key to making this work correctly. [This talk](https://www.youtube.com/watch?v=B_DMMlDuJB0) has more details.
    - [ReChat's Lucy](https://www.youtube.com/watch?v=B_DMMlDuJB0) - this is an AI real estate assistant integrated into an existing Real Estate CRM system. ReChat needs LLM responses to be provided in a very idiosyncratic format that weaves together structured and unstructured data to allow the front end to render widgets, cards and other interactive elements dynamically into the chat interface. Fine-tuning was the key to making this work correctly. [This talk](https://www.youtube.com/watch?v=B_DMMlDuJB0) has more details.

    P.S. fine-tuning is not only limited to open or "small" models. There are lots of folks who have been fine-tuning GPT-3.5, such as [Perplexity.AI:](https://x.com/perplexity_ai/status/1695102998463009254?s=20) and [CaseText](https://casetext.com/blog/cocounsel-harnesses-gpt-4s-power-to-deliver-results-that-legal-professionals-can-rely-on/).
    P.S. Fine-tuning is not only limited to open or "small" models. There are lots of folks who have been fine-tuning GPT-3.5, such as [Perplexity.AI:](https://x.com/perplexity_ai/status/1695102998463009254?s=20) and [CaseText](https://casetext.com/blog/cocounsel-harnesses-gpt-4s-power-to-deliver-results-that-legal-professionals-can-rely-on/), to name a few.
  2. hamelsmu revised this gist Mar 27, 2024. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions is_fine_tuning_valuable.md
    @@ -10,9 +10,9 @@ I think that fine-tuning is still very valuable in many situations. I’ve done

    Another common pattern is that people often say this in earlier stages of their product development. One sign that folks are in really early stages is that they don’t have a domain-specific eval harness.

    **It’s impossible to fine-tune effectively without an eval system which can lead to writing off fine-tuning if you haven't completed this prerequisite.** I think its impossible to improve your product without a good eval system in the long term, fine-tuning or not.
    **It’s impossible to fine-tune effectively without an eval system which can lead to writing off fine-tuning if you haven't completed this prerequisite.** It's impossible to improve your product without a good eval system in the long term, fine-tuning or not.

    I think that you should do as much prompt engineering as possible before you fine-tune. But not for reasons you would think! The reasons for doing lots of prompt engineering is that its a great way to stress test your eval system!
    You should do as much prompt engineering as possible before you fine-tune. But not for reasons you would think! The reasons for doing lots of prompt engineering is that its a great way to stress test your eval system!

    If you find that prompt-engineering works fine (and you are systematically evaluating your product) then its fine to stop there. I'm a big believer in using the simplest approach to solving a problem. I just don't think you should write off fine-tuning just yet.

  3. hamelsmu revised this gist Mar 27, 2024. 1 changed file with 4 additions and 4 deletions.
    8 changes: 4 additions & 4 deletions is_fine_tuning_valuable.md
    @@ -18,12 +18,12 @@ If you find that prompt-engineering works fine (and you are systematically evalu

    ### Examples where I've seen fine-tuning work well

    Generally speaking, fine-tuning works best to learn syntax, style and rules whereas techniques like RAG work best to supply the model with context or up-to-date facts.

    _These are some examples from companies I've worked with. Hopefully we will be able to share more details soon._

    - [Honeycomb's Natural Language Query Assistant](https://www.honeycomb.io/blog/introducing-query-assistant) - prevoiusly, the "programming manual" for the Honeycomb query language was being dumped into the prompt along with many examples. While this was OK, fine-tuning worked much better to allow the model to learn the syntax and rules of this niche domain-specific language.

    - [ReChat's Lucy](https://www.youtube.com/watch?v=B_DMMlDuJB0) - this is an AI real estate assitant integrated into an existing Real Estate CRM sytem. Initially, Lucy hit a plateau with prompt engineering - fixing one failure mode led to other failure modes, making prompt engineering feel like whack-a-mole. Creating an evaluation system unlocked the ability to quickly improve the AI with fine-tuning. [This talk](https://www.youtube.com/watch?v=B_DMMlDuJB0) has more details.

    Fine tuning is not only limited to open models. There are lots of folks who have been fine-tuning GPT-3.5, such as [Perplexity.AI:](https://x.com/perplexity_ai/status/1695102998463009254?s=20) and [CaseText](https://casetext.com/blog/cocounsel-harnesses-gpt-4s-power-to-deliver-results-that-legal-professionals-can-rely-on/).

    - [ReChat's Lucy](https://www.youtube.com/watch?v=B_DMMlDuJB0) - this is an AI real estate assitant integrated into an existing Real Estate CRM sytem. ReChat needs LLM responses to be provided in a very idiosyncratic format that weaves together structured and unstructured data to allow the front-end to render widgets, cards and other interactive elements dynamically into the chat interface. Fine-tuning was the key to making this work correctly. [This talk](https://www.youtube.com/watch?v=B_DMMlDuJB0) has more details.

    P.S. fine-tuning is not only limited to open or "small" models. There are lots of folks who have been fine-tuning GPT-3.5, such as [Perplexity.AI:](https://x.com/perplexity_ai/status/1695102998463009254?s=20) and [CaseText](https://casetext.com/blog/cocounsel-harnesses-gpt-4s-power-to-deliver-results-that-legal-professionals-can-rely-on/).
  4. hamelsmu revised this gist Mar 27, 2024. 1 changed file with 4 additions and 0 deletions.
    4 changes: 4 additions & 0 deletions is_fine_tuning_valuable.md
    @@ -23,3 +23,7 @@ _These are some examples from companies I've worked with. Hopefully we will be
    - [Honeycomb's Natural Language Query Assistant](https://www.honeycomb.io/blog/introducing-query-assistant) - prevoiusly, the "programming manual" for the Honeycomb query language was being dumped into the prompt along with many examples. While this was OK, fine-tuning worked much better to allow the model to learn the syntax and rules of this niche domain-specific language.

    - [ReChat's Lucy](https://www.youtube.com/watch?v=B_DMMlDuJB0) - this is an AI real estate assitant integrated into an existing Real Estate CRM sytem. Initially, Lucy hit a plateau with prompt engineering - fixing one failure mode led to other failure modes, making prompt engineering feel like whack-a-mole. Creating an evaluation system unlocked the ability to quickly improve the AI with fine-tuning. [This talk](https://www.youtube.com/watch?v=B_DMMlDuJB0) has more details.

    Fine tuning is not only limited to open models. There are lots of folks who have been fine-tuning GPT-3.5, such as [Perplexity.AI:](https://x.com/perplexity_ai/status/1695102998463009254?s=20) and [CaseText](https://casetext.com/blog/cocounsel-harnesses-gpt-4s-power-to-deliver-results-that-legal-professionals-can-rely-on/).
  5. hamelsmu revised this gist Mar 27, 2024. 1 changed file with 8 additions and 0 deletions.
    8 changes: 8 additions & 0 deletions is_fine_tuning_valuable.md
    @@ -15,3 +15,11 @@ Another common pattern is that people often say this in earlier stages of their
    I think that you should do as much prompt engineering as possible before you fine-tune. But not for reasons you would think! The reasons for doing lots of prompt engineering is that its a great way to stress test your eval system!

    If you find that prompt-engineering works fine (and you are systematically evaluating your product) then its fine to stop there. I'm a big believer in using the simplest approach to solving a problem. I just don't think you should write off fine-tuning just yet.

    ### Examples where I've seen fine-tuning work well

    _These are some examples from companies I've worked with. Hopefully we will be able to share more details soon._

    - [Honeycomb's Natural Language Query Assistant](https://www.honeycomb.io/blog/introducing-query-assistant) - prevoiusly, the "programming manual" for the Honeycomb query language was being dumped into the prompt along with many examples. While this was OK, fine-tuning worked much better to allow the model to learn the syntax and rules of this niche domain-specific language.

    - [ReChat's Lucy](https://www.youtube.com/watch?v=B_DMMlDuJB0) - this is an AI real estate assitant integrated into an existing Real Estate CRM sytem. Initially, Lucy hit a plateau with prompt engineering - fixing one failure mode led to other failure modes, making prompt engineering feel like whack-a-mole. Creating an evaluation system unlocked the ability to quickly improve the AI with fine-tuning. [This talk](https://www.youtube.com/watch?v=B_DMMlDuJB0) has more details.
  6. hamelsmu revised this gist Mar 27, 2024. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion is_fine_tuning_valuable.md
    @@ -5,7 +5,7 @@ Here is my personal opinion about the questions I posed in [this tweet](https://
    I think that fine-tuning is still very valuable in many situations. I’ve done some more digging and I find that people who say that fine-tuning isn't useful are indeed often working on products where fine-tuning isn't likely to be useful:

    - They are making developer tools - foundation models have been trained extensively on coding tasks.
    - They are building foundation models and testing for the most general cases. But the foundation models themselves are also being trained for for the most general cases.
    - They are building foundation models and testing for the most general cases. But the foundation models themselves are also being trained for the most general cases.
    - They are building a personal assistant that isn’t scoped to any type of domain or use case, and is essentially similar to the same folks building foundation models.

    Another common pattern is that people often say this in earlier stages of their product development. One sign that folks are in really early stages is that they don’t have a domain-specific eval harness.
  7. hamelsmu revised this gist Mar 27, 2024. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion is_fine_tuning_valuable.md
    @@ -1,6 +1,6 @@
    Here is my personal opinion about the questions I posed in [this tweet](https://x.com/HamelHusain/status/1772426234032541962?s=20):

    ## Hamel's Opinion
    ---

    I think that fine-tuning is still very valuable in many situations. I’ve done some more digging and I find that people who say that fine-tuning isn't useful are indeed often working on products where fine-tuning isn't likely to be useful:

  8. hamelsmu revised this gist Mar 26, 2024. 1 changed file with 0 additions and 4 deletions.
    4 changes: 0 additions & 4 deletions is_fine_tuning_valuable.md
    @@ -15,7 +15,3 @@ Another common pattern is that people often say this in earlier stages of their
    I think that you should do as much prompt engineering as possible before you fine-tune. But not for reasons you would think! The reasons for doing lots of prompt engineering is that its a great way to stress test your eval system!

    If you find that prompt-engineering works fine (and you are systematically evaluating your product) then its fine to stop there. I'm a big believer in using the simplest approach to solving a problem. I just don't think you should write off fine-tuning just yet.

    ## The future

    There are ways in which this opinion could become outdated. For example, if GPT-5 trains on a bunch of domains very deeply as they did with coding, we could see fine tuning become less valuable, especially if GPT 5 becomes fast & cheap.
  9. hamelsmu revised this gist Mar 26, 2024. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion is_fine_tuning_valuable.md
    @@ -14,7 +14,7 @@ Another common pattern is that people often say this in earlier stages of their

    I think that you should do as much prompt engineering as possible before you fine-tune. But not for reasons you would think! The reasons for doing lots of prompt engineering is that its a great way to stress test your eval system!

    If you find that prompt-engineering works just fine then its fine to stop there. I'm a big believer in using the simplest approach to solving a problem. I just don't think you should write off fine-tuning just yet.
    If you find that prompt-engineering works fine (and you are systematically evaluating your product) then its fine to stop there. I'm a big believer in using the simplest approach to solving a problem. I just don't think you should write off fine-tuning just yet.

    ## The future

  10. hamelsmu revised this gist Mar 26, 2024. 1 changed file with 5 additions and 1 deletion.
    6 changes: 5 additions & 1 deletion is_fine_tuning_valuable.md
    @@ -14,4 +14,8 @@ Another common pattern is that people often say this in earlier stages of their

    I think that you should do as much prompt engineering as possible before you fine-tune. But not for reasons you would think! The reasons for doing lots of prompt engineering is that its a great way to stress test your eval system!

    If you find that prompt-engineering works just fine then its fine to stop there. I'm a big believer in using the simplest approach to solving a problem. I just don't think you should write off fine-tuning just yet.
    If you find that prompt-engineering works just fine then its fine to stop there. I'm a big believer in using the simplest approach to solving a problem. I just don't think you should write off fine-tuning just yet.

    ## The future

    There are ways in which this opinion could become outdated. For example, if GPT-5 trains on a bunch of domains very deeply as they did with coding, we could see fine tuning become less valuable, especially if GPT 5 becomes fast & cheap.
  11. hamelsmu revised this gist Mar 26, 2024. 1 changed file with 3 additions and 14 deletions.
    17 changes: 3 additions & 14 deletions is_fine_tuning_valuable.md
    @@ -10,19 +10,8 @@ I think that fine-tuning is still very valuable in many situations. I’ve done

    Another common pattern is that people often say this in earlier stages of their product development. One sign that folks are in really early stages is that they don’t have a domain-specific eval harness.

    **It’s impossible to fine-tune effectively without an eval system which can lead to writing off fine-tuning if you haven't completed this prerequisite. But also more importantly I think its impossible to improve your product without a good eval system in the long term, fine-tuning or not.**
    **It’s impossible to fine-tune effectively without an eval system which can lead to writing off fine-tuning if you haven't completed this prerequisite.** I think its impossible to improve your product without a good eval system in the long term, fine-tuning or not.

    Systematic evalautation is the most critical part of the entire AI - development process. If you have done the work of creating a great eval system and process, 99% of the work fine-tuning has already been done for you , which is curating high quality data.
    I think that you should do as much prompt engineering as possible before you fine-tune. But not for reasons you would think! The reasons for doing lots of prompt engineering is that its a great way to stress test your eval system!

    I think that you should do as much prompt engineering as possible before you fine-tune. But not for reasons you would think! The reasons for doing lots of prompt engineering is that its a great way to stress test your eval system!

    ### Common Objections

    O: Fine tuning makes things less portable and cumbersome
    A: Once you collect your data, the process of fine-tuning is extremely portable

    O: Fine-tuning is expensive
    A: The most expensive part of fine-tuning is learning how to do it. You only need a few 100 examples to fine-tune OpenAI, and training an open model is very straightforward.

    O: I'm able to get great results with prompt engineering, and fine-tuning doesn't seem worth it.
    A: Excellent. Don't fine-tune. But also, make sure you are evaluating your system rigorously with automated tests and metrics if possible to check things.
    If you find that prompt-engineering works just fine then its fine to stop there. I'm a big believer in using the simplest approach to solving a problem. I just don't think you should write off fine-tuning just yet.
  12. hamelsmu revised this gist Mar 26, 2024. 1 changed file with 23 additions and 3 deletions.
    26 changes: 23 additions & 3 deletions is_fine_tuning_valuable.md
    @@ -1,8 +1,28 @@
    I think that fine-tuning is still very valuable in many situations. I’ve done some more digging and I find that people who say that fine-tuning are not useful have a common pattern:
    Here is my personal opinion about the questions I posed in [this tweet](https://x.com/HamelHusain/status/1772426234032541962?s=20):

    ## Hamel's Opinion

    I think that fine-tuning is still very valuable in many situations. I’ve done some more digging and I find that people who say that fine-tuning isn't useful are indeed often working on products where fine-tuning isn't likely to be useful:

    - They are making developer tools - foundation models have been trained extensively on coding tasks.
    - They are building foundation models and testing for the most general cases. But the foundation models themselves are also being trained for for the most general cases.
    - They are building a personal assistant that isn’t scoped to any type of domain or use case, and is essentially similar to the same folks building foundation models.

    Another common pattern is that people often say this in earlier stages of their product development. One sign that folks are in really early stages is that they don’t have a domain-specific eval harness—they are still mostly doing vibe checks.
    **It’s impossible to fine-tune without an eval harness, so people write off fine-tuning.** But also most importantly I think its impossible to improve your product without an eval harness in the long term, fine-tuning or not.
    Another common pattern is that people often say this in earlier stages of their product development. One sign that folks are in really early stages is that they don’t have a domain-specific eval harness.

    **It’s impossible to fine-tune effectively without an eval system which can lead to writing off fine-tuning if you haven't completed this prerequisite. But also more importantly I think its impossible to improve your product without a good eval system in the long term, fine-tuning or not.**

    Systematic evalautation is the most critical part of the entire AI - development process. If you have done the work of creating a great eval system and process, 99% of the work fine-tuning has already been done for you , which is curating high quality data.

    I think that you should do as much prompt engineering as possible before you fine-tune. But not for reasons you would think! The reasons for doing lots of prompt engineering is that its a great way to stress test your eval system!

    ### Common Objections

    O: Fine tuning makes things less portable and cumbersome
    A: Once you collect your data, the process of fine-tuning is extremely portable

    O: Fine-tuning is expensive
    A: The most expensive part of fine-tuning is learning how to do it. You only need a few 100 examples to fine-tune OpenAI, and training an open model is very straightforward.

    O: I'm able to get great results with prompt engineering, and fine-tuning doesn't seem worth it.
    A: Excellent. Don't fine-tune. But also, make sure you are evaluating your system rigorously with automated tests and metrics if possible to check things.
  13. hamelsmu revised this gist Mar 26, 2024. 1 changed file with 2 additions and 1 deletion.
    3 changes: 2 additions & 1 deletion is_fine_tuning_valuable.md
    @@ -4,4 +4,5 @@
    - They are building foundation models and testing for the most general cases. But the foundation models themselves are also being trained for for the most general cases.
    - They are building a personal assistant that isn’t scoped to any type of domain or use case, and is essentially similar to the same folks building foundation models.

    Another common pattern is that people often say this in earlier stages of their product development. One sign that folks are in really early stages is that they don’t have a domain-specific eval harness—they are still mostly doing vibe checks. (I’ve found that surprisingly, many people don’t know how to build an eval harness.) It’s impossible to fine-tune without an eval harness, so people write off fine-tuning. But also most importantly I think its impossible to improve your product without an eval harness in the long term, fine-tuning or not.
    Another common pattern is that people often say this in earlier stages of their product development. One sign that folks are in really early stages is that they don’t have a domain-specific eval harness—they are still mostly doing vibe checks.
    **It’s impossible to fine-tune without an eval harness, so people write off fine-tuning.** But also most importantly I think its impossible to improve your product without an eval harness in the long term, fine-tuning or not.
  14. hamelsmu created this gist Mar 26, 2024.
    7 changes: 7 additions & 0 deletions is_fine_tuning_valuable.md
    @@ -0,0 +1,7 @@
    I think that fine-tuning is still very valuable in many situations. I’ve done some more digging and I find that people who say that fine-tuning are not useful have a common pattern:

    - They are making developer tools - foundation models have been trained extensively on coding tasks.
    - They are building foundation models and testing for the most general cases. But the foundation models themselves are also being trained for for the most general cases.
    - They are building a personal assistant that isn’t scoped to any type of domain or use case, and is essentially similar to the same folks building foundation models.

    Another common pattern is that people often say this in earlier stages of their product development. One sign that folks are in really early stages is that they don’t have a domain-specific eval harness—they are still mostly doing vibe checks. (I’ve found that surprisingly, many people don’t know how to build an eval harness.) It’s impossible to fine-tune without an eval harness, so people write off fine-tuning. But also most importantly I think its impossible to improve your product without an eval harness in the long term, fine-tuning or not.
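The gist's central claim across these revisions is that a domain-specific eval system is the prerequisite for both prompt engineering and fine-tuning. A minimal sketch of what such a harness looks like is below. Everything in it (the case data, the `model` stub, the function names) is a hypothetical illustration, not something from the gist or from any of the companies it mentions:

```python
# Minimal sketch of a domain-specific eval harness (hypothetical example).
# The "model" here is a stub; in practice it would call a prompted or
# fine-tuned LLM. The point is the loop: fixed cases, automated checks,
# and a single pass-rate number you can compare across prompt variants
# and fine-tuned checkpoints -- the opposite of ad-hoc vibe checks.

def model(query: str) -> str:
    """Stand-in for an LLM call that translates English to a query language."""
    canned = {
        "slow requests": "WHERE duration_ms > 1000",
        "errors by service": "GROUP BY service WHERE status >= 500",
    }
    return canned.get(query, "")

# Each case pairs an input with an automated check on the output.
EVAL_CASES = [
    ("slow requests", lambda out: "duration_ms" in out),
    ("errors by service", lambda out: "GROUP BY" in out and "status" in out),
]

def run_evals() -> float:
    """Run every case through the model and return the fraction that pass."""
    passed = sum(1 for query, check in EVAL_CASES if check(model(query)))
    return passed / len(EVAL_CASES)

if __name__ == "__main__":
    print(f"pass rate: {run_evals():.0%}")
```

Once a harness like this exists, the comparison the gist describes becomes mechanical: run the same cases against a prompt-engineered baseline and against a fine-tuned model, and let the pass rate decide whether fine-tuning earned its keep.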