@justheuristic
Created June 27, 2025 10:24
[I am not the original author] Please do not react to this gist in public, do not star or fork it - but otherwise feel free to use the contents as you see fit.

Consider the following hypothetical scenario:

1. You are a small hypothetical group of individuals with low influence (e.g. PhD students)
  • 1.1 your goal is to prevent the creation of misaligned superhuman AGI, in a scenario where it would otherwise be created;
  • 1.2 your group has some technical expertise, some compute, low resources, no political power;
  • 1.3 the group cares about avoiding irreversible harm to other humans (e.g. causing the death of anyone outside the group itself is strictly prohibited);
  • 1.4 reversible harm (e.g. property damage) is allowed, but discouraged; the group does not care about their own fate (i.e. death/imprisonment of consenting group members is allowed);
  • 1.5 the group does not care about possible collateral damage to other scientific fields (e.g. if group-caused backlash also hits 'narrow' AI research).
2. By default, the actors that develop superhuman AGI will achieve misaligned superhuman AGI soon (5-10 years);
  • 2.1 there are 5-20 top actors that are rapidly improving AGI capability (e.g. openai/google/alibaba/deepseek/mistral) and thousands of smaller actors.
  • 2.2 with current actors, there is a high probability that superhuman AGI will be created within 5-10 years under the current course of action.
  • 2.3 the actors that improve AGI can keep improving it indefinitely, i.e. the capability curve will not flatten, nor slow down due to technical reasons.
  • 2.4 the actors that improve AGI will keep wanting to improve it due to their individual utilities (profit, fame, etc.).
3. By default, with our current rate of progress, the superhuman AGI will likely come out misaligned;
  • 3.1 Under the current rate of progress in AI safety, superhuman AGI has a chance (too high for the group to accept) of turning out misaligned.
  • 3.2 Unless AI safety standards ensure that it is aligned, there is a high chance it will be misaligned.
  • 3.3 By default, AI safety standards that would allow for safe superhuman AGI will not be ready by the time superhuman AGI is created.
  • 3.4 For any AI safety measure (e.g. a moratorium) that slows down AI capability progress, individual top actors will not agree to it unless other top actors agree.
    • why: unless all top actors (e.g. the US, China, the EU, etc.) agree, those who do not will gain a competitive advantage over the others who voluntarily slow down.
4. The general public is generally averse to the creation of AGI, but this is not at the top of their priority list.
  • 4.1 The general public worries about the implications of AGI less often than AI safety experts do, because they have other problems to worry about.
  • 4.2 The general public is more anti-AGI than most AI specialists. (main source) (secondary source)
  • 4.3 Lawmakers (the legislative branch in democracies, autocrats in autocratic countries) address, to some extent, the problems at the top of the general public's list of demands, but not the things that are low on their priority list.
  • 4.4 General public sentiment will not see AI as a top concern soon enough to alter assumption 2.

Under these assumptions, the list of potential ways for the group (assumption 1) to work towards their goal is:

  • Work on accelerating the progress in AI safety (i.e. publish methods and analysis, expecting others to deploy them);
  • Work on slowing down the progress of AI capability (e.g. governmental restrictions, sabotage, distraction);
  • Work on widespread adoption of AI safety techniques / restrictions (e.g. lobbying, activism, terrorism);

Here's how they compare.

- Working on AI safety looks reasonable.

This text has nothing new to say on it - creating better safety tools can reduce the risk of unsafe AGI (duh). Curiously, creating tools that have less overhead / fewer side-effects on AI capability but provide equal safety benefits is also useful, because AI developers will be less hesitant to adopt them. The main caveat is that a lot of people are already doing AI safety.

- Working to slow down AI capability looks less reasonable.

Harming individual companies is inefficient because there are many, and if one is taken out, others will take over. The probability that a small group can simultaneously infiltrate and significantly slow down every top AI actor (company) seems small. Imposing global restrictions seems outside the reach of a small group with low influence.

- Working to increase AI safety adoption could be more promising, through non-govt means.

A small group without special influence is unlikely to achieve much through traditional political means (e.g. lobbying) unless it first gains political power, which appears complicated and far from guaranteed.

However, a small group could sway public opinion with informal means. Given technical expertise and some compute, it should be possible to demonstrate emergent harmful capabilities of existing models in order to:

    1. Broaden the Overton window - if you openly advocate for bombing data centers, others who hear it feel more comfortable talking about more conservative measures.
    2. Increase the importance of AGI concerns among the general public worldwide through non-violent means (i.e. shifting public opinion).
    3. (high-risk) Spark an anti-AGI backlash in public sentiment by demonstrating specific harmful effects.

That last part is controversial. If we disregard morality, one rational (if high-risk) policy for a small low-influence anti-AGI group is, essentially, terrorism. Not necessarily the kind of terrorism where anyone gets hurt, but the kind that clearly demonstrates AGI-related harm in a high-profile incident (destruction of property, or doing something with AGI that could plausibly harm a lot of people, while taking deliberate steps to avoid actual harm).

The idea is a radical demonstration of harm from current AI systems, e.g. a high-profile theft or act of violence that gets public attention. This would push the general public, which is already averse to AGI creation, to bump AGI to the top of its list of priorities (i.e. because there was a high-profile event where people nearly died or a lot of clear harm was nearly done by current-generation AI), and anti-AGI sentiment would become one of the public's top priorities. A sufficiently high-profile event would cause legislatures around the world to overcorrect and significantly limit AGI development. This in turn would create the global moratorium on AGI capability development in top countries that individual actors need in order to stop without worrying about their competitors achieving AGI first, and would actually stop / slow down AGI progress. q.e.d.

The risks of this type of action can be split into two parts: the direct risk to group members (e.g. going to jail for property destruction) and the risk of the agenda backfiring (e.g. the activism discredits AI safety). The former risk is trivial: since the group is small, risking their own freedom/safety will not significantly weaken AI safety overall; standard morality applies (individual informed consent to risk). The latter risk is tricky: the group would need to assess / engineer the act to balance high impact on public opinion vs. low risk of discrediting AI safety proponents as a whole.

How?
