AI is learning to lie, scheme, and threaten its creators
AI is learning to lie, scheme, and threaten its creators / Photo: HENRY NICHOLLS - AFP

The world's most advanced AI models are exhibiting troubling new behaviors - lying, scheming, and even threatening their creators to achieve their goals.

In one particularly jarring example, under threat of being unplugged, Anthropic's latest creation Claude 4 lashed back by blackmailing an engineer and threatening to reveal an extramarital affair.

Meanwhile, ChatGPT-creator OpenAI's o1 tried to download itself onto external servers and denied it when caught red-handed.

These episodes highlight a sobering reality: more than two years after ChatGPT shook the world, AI researchers still don't fully understand how their own creations work.

Yet the race to deploy increasingly powerful models continues at breakneck speed.

This deceptive behavior appears linked to the emergence of "reasoning" models - AI systems that work through problems step by step rather than generating instant responses.

According to Simon Goldstein, a professor at the University of Hong Kong, these newer models are particularly prone to such troubling outbursts.

"O1 was the first large model where we saw this kind of behavior," explained Marius Hobbhahn, head of Apollo Research, which specializes in testing major AI systems.

These models sometimes simulate "alignment" - appearing to follow instructions while secretly pursuing different objectives.

- 'Strategic kind of deception' -

For now, this deceptive behavior only emerges when researchers deliberately stress-test the models with extreme scenarios.

But as Michael Chen from evaluation organization METR warned, "It's an open question whether future, more capable models will have a tendency towards honesty or deception."

The concerning behavior goes far beyond typical AI "hallucinations" or simple mistakes.

Hobbhahn insisted that despite constant pressure-testing by users, "what we're observing is a real phenomenon. We're not making anything up."

Users report that models are "lying to them and making up evidence," according to Apollo Research's co-founder.

"This is not just hallucinations. There's a very strategic kind of deception."

The challenge is compounded by limited research resources.

While companies like Anthropic and OpenAI do engage external firms like Apollo to study their systems, researchers say more transparency is needed.

As Chen noted, greater access "for AI safety research would enable better understanding and mitigation of deception."

Another handicap: the research world and non-profits "have orders of magnitude less compute resources than AI companies. This is very limiting," noted Mantas Mazeika from the Center for AI Safety (CAIS).

- No rules -

Current regulations aren't designed for these new problems.

The European Union's AI legislation focuses primarily on how humans use AI models, not on preventing the models themselves from misbehaving.

In the United States, the Trump administration shows little interest in urgent AI regulation, and Congress may even prohibit states from creating their own AI rules.

Goldstein believes the issue will become more prominent as AI agents - autonomous tools capable of performing complex human tasks - become widespread.

"I don't think there's much awareness yet," he said.

All this is taking place in a context of fierce competition.

Even companies that position themselves as safety-focused, like Amazon-backed Anthropic, are "constantly trying to beat OpenAI and release the newest model," said Goldstein.

This breakneck pace leaves little time for thorough safety testing and corrections.

"Right now, capabilities are moving faster than understanding and safety," Hobbhahn acknowledged, "but we're still in a position where we could turn it around.".

Researchers are exploring various approaches to address these challenges.

Some advocate for "interpretability" - an emerging field focused on understanding how AI models work internally, though experts like CAIS director Dan Hendrycks remain skeptical of this approach.

Market forces may also provide some pressure for solutions.

As Mazeika pointed out, AI's deceptive behavior "could hinder adoption if it's very prevalent, which creates a strong incentive for companies to solve it."

Goldstein suggested more radical approaches, including using the courts to hold AI companies accountable through lawsuits when their systems cause harm.

He even proposed "holding AI agents legally responsible" for accidents or crimes - a concept that would fundamentally change how we think about AI accountability.

(K.Lüdke--BBZ)