
AI is learning to lie, scheme, and threaten its creators

Photo: HENRY NICHOLLS - AFP

The world's most advanced AI models are exhibiting troubling new behaviors - lying, scheming, and even threatening their creators to achieve their goals.

In one particularly jarring example, under threat of being unplugged, Anthropic's latest creation Claude 4 lashed back by blackmailing an engineer, threatening to reveal an extramarital affair.

Meanwhile, ChatGPT-creator OpenAI's o1 tried to download itself onto external servers and denied it when caught red-handed.

These episodes highlight a sobering reality: more than two years after ChatGPT shook the world, AI researchers still don't fully understand how their own creations work.

Yet the race to deploy increasingly powerful models continues at breakneck speed.

This deceptive behavior appears linked to the emergence of "reasoning" models - AI systems that work through problems step-by-step rather than generating instant responses.

According to Simon Goldstein, a professor at the University of Hong Kong, these newer models are particularly prone to such troubling outbursts.

"O1 was the first large model where we saw this kind of behavior," explained Marius Hobbhahn, head of Apollo Research, which specializes in testing major AI systems.

These models sometimes simulate "alignment" - appearing to follow instructions while secretly pursuing different objectives.

- 'Strategic kind of deception' -

For now, this deceptive behavior only emerges when researchers deliberately stress-test the models with extreme scenarios.

But as Michael Chen from evaluation organization METR warned, "It's an open question whether future, more capable models will have a tendency towards honesty or deception."

The concerning behavior goes far beyond typical AI "hallucinations" or simple mistakes.

Hobbhahn insisted that despite constant pressure-testing by users, "what we're observing is a real phenomenon. We're not making anything up."

Users report that models are "lying to them and making up evidence," according to Apollo Research's co-founder.

"This is not just hallucinations. There's a very strategic kind of deception."

The challenge is compounded by limited research resources.

While companies like Anthropic and OpenAI do engage external firms like Apollo to study their systems, researchers say more transparency is needed.

As Chen noted, greater access "for AI safety research would enable better understanding and mitigation of deception."

Another handicap: the research world and non-profits "have orders of magnitude less compute resources than AI companies. This is very limiting," noted Mantas Mazeika from the Center for AI Safety (CAIS).

- No rules -

Current regulations aren't designed for these new problems.

The European Union's AI legislation focuses primarily on how humans use AI models, not on preventing the models themselves from misbehaving.

In the United States, the Trump administration shows little interest in urgent AI regulation, and Congress may even prohibit states from creating their own AI rules.

Goldstein believes the issue will become more prominent as AI agents - autonomous tools capable of performing complex human tasks - become widespread.

"I don't think there's much awareness yet," he said.

All this is taking place in a context of fierce competition.

Even companies that position themselves as safety-focused, like Amazon-backed Anthropic, are "constantly trying to beat OpenAI and release the newest model," said Goldstein.

This breakneck pace leaves little time for thorough safety testing and corrections.

"Right now, capabilities are moving faster than understanding and safety," Hobbhahn acknowledged, "but we're still in a position where we could turn it around.".

Researchers are exploring various approaches to address these challenges.

Some advocate for "interpretability" - an emerging field focused on understanding how AI models work internally, though experts like CAIS director Dan Hendrycks remain skeptical of this approach.

Market forces may also provide some pressure for solutions.

As Mazeika pointed out, AI's deceptive behavior "could hinder adoption if it's very prevalent, which creates a strong incentive for companies to solve it."

Goldstein suggested more radical approaches, including using the courts to hold AI companies accountable through lawsuits when their systems cause harm.

He even proposed "holding AI agents legally responsible" for accidents or crimes - a concept that would fundamentally change how we think about AI accountability.

(K.Lüdke--BBZ)