A student and a card-built AI assistant examine a geometric puzzle

35 minutes · zero magic · real mathematics

Will AI replace
our minds?

How to use AI for thinking-intensive tasks without giving up control of the reasoning.

Bartosz Naskręcki · UAM/CCAI · Warsaw, 8 June 2026

Anonymous poll

What do you most often use AI for?

?

Explain

Help me understand a topic that is still unclear.

{ }

Do it

Write code, a text, or a presentation, or solve a problem.

✓

Check

Find the first error, test a case, or ask a question.

Most of us find ourselves in all three situations. This is not about guilt. It is about control.

Today's claim

AI can perform some tasks for us
and provide a ready-made result.

It does not take over your responsibility for understanding, verification, and choice.

The most important question is not “does AI think?” but “do I understand more after using it?”

The basic GPT mechanism

Which token fits best as the next one?

2 3 5 7 11 ?

13: 72% · 12: 16% · 17: 8% · cat: 4%

The model predicts a token from structures learned during training. This is not random guessing, but neither is it a guarantee of understanding.

The percentages are illustrative. A model may sample an answer rather than always choosing the most likely token.
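
The difference between always taking the most likely token and sampling can be sketched in a few lines. This is a minimal illustration that uses the slide's made-up percentages; the function names `greedy` and `sample` are hypothetical and do not come from any model API.

```javascript
// Illustrative next-token distribution from the slide (not real model output).
const distribution = [
  { token: "13", p: 0.72 },
  { token: "12", p: 0.16 },
  { token: "17", p: 0.08 },
  { token: "cat", p: 0.04 },
];

// Greedy decoding: always take the most likely token.
function greedy(dist) {
  return dist.reduce((best, cur) => (cur.p > best.p ? cur : best)).token;
}

// Sampling: pick a token with probability proportional to p.
// `u` is a number in [0, 1); by default it comes from Math.random().
function sample(dist, u = Math.random()) {
  let cumulative = 0;
  for (const { token, p } of dist) {
    cumulative += p;
    if (u < cumulative) return token;
  }
  return dist[dist.length - 1].token; // guard against rounding error
}

console.log(greedy(distribution));       // "13"
console.log(sample(distribution, 0.8));  // "12", because 0.72 <= 0.8 < 0.88
console.log(sample(distribution));       // varies from run to run
```

Passing `u` explicitly makes the sampling step reproducible, which is convenient when checking a small case by hand.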

How does text enter the model?

First, the tokenizer splits the text into tokens

A token may be a word, part of a word, a symbol, or a fragment of code.

mathematics!

The split depends on the tokenizer. A token is not a “concept” in the human sense.

A row of geometric figures connected by arcs of different thickness

Attention mechanism

The model assigns weights to
parts of the context

Alice told Maya that she had won.

At each position, the model computes how strongly to connect it with each earlier position.

This is an algebraic operation, not conscious attention. The weights alone do not fully explain the answer.

One attention head in GPT

The weights determine how much earlier positions contribute

A(Q,K,V) = softmax((QK^T + M) / √d_k)V

Q (queries): what is this position looking for?

K (keys): what can it be compared with?

V (values): what information will be combined?

Q = XW_Q, K = XW_K, V = XW_V

`M` is the causal mask: a position cannot use future tokens.
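
The formula for one head can be followed on a toy example. The sketch below assumes plain arrays instead of a tensor library, takes Q, K, and V as given rather than computing them from X, and uses hypothetical 3-position inputs; it is meant to show the causal mask at work, not to reproduce a real GPT implementation.

```javascript
// Softmax over one row; entries equal to -Infinity get weight exactly 0.
function softmax(row) {
  const max = Math.max(...row);
  const exps = row.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Causal attention weights: softmax((Q K^T + M) / sqrt(d_k)), row by row.
// Q and K are arrays of vectors, one vector per position.
function causalAttentionWeights(Q, K) {
  const dk = K[0].length;
  return Q.map((q, i) => {
    const scores = K.map((k, j) => {
      const dot = q.reduce((acc, qv, t) => acc + qv * k[t], 0);
      const mask = j > i ? -Infinity : 0; // M: future positions are blocked
      return (dot + mask) / Math.sqrt(dk);
    });
    return softmax(scores);
  });
}

// Output: each position receives a weighted combination of the rows of V.
function attention(Q, K, V) {
  return causalAttentionWeights(Q, K).map((weights) =>
    V[0].map((_, c) => weights.reduce((acc, w, j) => acc + w * V[j][c], 0))
  );
}

// Hypothetical toy inputs: 3 positions, d_k = 2.
const Q = [[1, 0], [0, 1], [1, 1]];
const K = [[1, 0], [0, 1], [1, 1]];
const V = [[1, 0], [0, 1], [2, 2]];

console.log(causalAttentionWeights(Q, K)); // first row is [1, 0, 0]
console.log(attention(Q, K, V));           // first row is [1, 0]
```

The first position can only attend to itself, so its output is simply the first row of V. Every later row is a mixture of earlier rows and its own.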

Why can GPT not peek into the future?

The causal mask gives future positions zero weight

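Attention weights (row: current position; column: position it looks at; × = masked future position, weight 0)

        2     3     5     ?
  2    1.0    ×     ×     ×
  3    .3    .7     ×     ×
  5    .2    .3    .5     ×
  ?    .1    .2    .3    .4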

A row describes one weighted combination of earlier values.

Multiple heads can capture different relationships in parallel.

Many layers

One attention operation is only the beginning

Each layer combines contextual information and transforms the vectors representing the tokens.

There is no single drawer labelled “Pythagorean theorem.” Knowledge and procedures are distributed across many parameters and activations.

next-token distribution

attention + feed-forward network

⋮

attention + feed-forward network

tokens + positions

Four different things

What changes the model, and what only changes how it is used?

1

Pretraining

Training on token prediction changes the model's parameters.

Pretraining: L(θ) = −Σ_t log p_θ(x_t | x_{<t})
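
To make the loss concrete, here is a small sketch that evaluates it for one sequence. It assumes we are simply handed the probability the model gave to each token that actually came next; the numbers are hypothetical, and no training takes place.

```javascript
// Negative log-likelihood: L = -(sum over t of log p(x_t | x_<t)).
// `probs` holds the probability assigned to each token that actually came next.
function nextTokenLoss(probs) {
  return probs.reduce((acc, p) => acc - Math.log(p), 0);
}

// Hypothetical probabilities for two four-token sequences.
const confident = [0.72, 0.9, 0.5, 0.99];
const unsure = [0.1, 0.2, 0.05, 0.3];

console.log(nextTokenLoss(confident)); // smaller loss
console.log(nextTokenLoss(unsure));    // larger loss
console.log(nextTokenLoss([1, 1, 1])); // 0: perfect prediction
```

Training adjusts the parameters θ so that this number decreases on the training text.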

2

Post-training

Further training on instructions, evaluations, preferences, and verifiable signals.

Post-training still changes the parameters.

3

Generation

While answering, the model may perform more steps, attempts, and checks.

A conversation does not usually train the model live.

4

Agent

The system lets the model read files, run code, and analyse tool outputs.

This is an action loop around a trained model.

Why are newer models strong at mathematics?

Mathematics provides structure and feedback

①

Structured notation

Definitions, proofs, and code contain recurring rules.

②

Outcome evaluation

A calculation, test, or verifier often provides an unambiguous signal.

③

Multiple attempts

More computation time allows several solution paths to be checked and errors to be corrected.

④

Tools

Python, computer algebra systems, tests, and proof checkers provide an external scratchpad.

Mathematics is a good testing ground for AI. That does not make AI consistently reliable at it.

Codex as an example of an agent

It does more than answer. It can act and verify.

1. Read: task, files, constraints

→

2. Plan: hypothesis and small tests

→

3. Act: edit, calculate, run

→

4. Verify: inspect the result and revise

Codex is one example. The same method can work with other AI agents.

What about ideas?

AI can propose hypotheses and possible solutions. Assessing their value is harder.

Strength

Combining

It combines known techniques and analogies into new proposals.

Strength

Searching

It rapidly explores many variants, examples, and counterexamples.

No guarantee

Novelty and depth

It cannot reliably judge whether a proposal is genuinely new, important, and correct.

No guarantee

Intuition and meaning

It has neither human experience of the problem nor responsibility for the direction of inquiry.

Changing the representation

1 + 2 + 3 + … + 100 = ?

First test 1+…+10.

1 + 100 = 101

2 + 99 = 101

3 + 98 = 101

…

50 + 51 = 101

50 × 101 = 5050

Intuition often means rewriting a problem so that its structure becomes visible.
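
The advice to test 1+…+10 first can be turned into a quick check. The sketch below compares direct summation with the pairing argument; the closed form n(n+1)/2 is the standard generalisation of the pairing shown above, and the function names are hypothetical.

```javascript
// Direct summation: the slow but obviously correct reference.
function sumByLoop(n) {
  let total = 0;
  for (let i = 1; i <= n; i++) total += i;
  return total;
}

// Pairing argument: for even n there are n/2 pairs, each summing to n + 1.
// The same closed form n(n+1)/2 also holds for odd n.
function sumByPairing(n) {
  return (n * (n + 1)) / 2;
}

console.log(sumByLoop(10), sumByPairing(10));   // 55 55
console.log(sumByLoop(100), sumByPairing(100)); // 5050 5050
```

Agreement on small cases is a check, not a proof. The pairing argument is what explains why the two always agree.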

What does an infinite decimal mean?

In the real numbers: 0.999… = 1

0.999… := lim_{n→∞} (1 − 10⁻ⁿ) = 1

Algebraic shortcut after defining the limit

x = 0.999…

10x = 9.999…

9x = 9

x = 1

The limit route

The finite approximations are 1 − 10⁻ⁿ.

As n grows, 10⁻ⁿ → 0.

The limit is 1.
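
The limit route can be watched step by step. This sketch represents each finite approximation as an exact fraction using BigInt, so no floating-point rounding hides the gap; the helper names are hypothetical.

```javascript
// n-th finite approximation 0.99…9 (n nines) as the exact fraction (10^n - 1) / 10^n.
function approximation(n) {
  const den = 10n ** BigInt(n);
  return { num: den - 1n, den };
}

// Exact gap to 1: 1 - (10^n - 1) / 10^n = 1 / 10^n.
function gapToOne(n) {
  const { num, den } = approximation(n);
  return { num: den - num, den };
}

for (const n of [1, 2, 5, 20]) {
  const { num, den } = gapToOne(n);
  console.log(`n=${n}: 1 - s_n = ${num}/${den}`);
}
```

The gap is always exactly 1/10ⁿ: never zero for a finite n, but smaller than any positive number once n is large enough. That is what the limit statement says.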

Puzzle · do not reveal the warning

A “proof” that 1 = 2. Identify the first invalid step.

Assume: a = b ≠ 0

a² = ab

a² − b² = ab − b²

(a − b)(a + b) = b(a − b)

a + b = b

2b = b, therefore 2 = 1

Choose first

Which transition contains the first error?

Do not judge only by the absurd conclusion.

Find the first error, not merely the wrong result

A condition hidden inside ordinary “cancellation”

(a − b)(a + b) = b(a − b)

a + b = b (divide by a − b)

ALARM: a − b = 0

Division by zero is not allowed.

INSTRUCTION FOR THE AGENT: “Check every step. Identify the first error and name the violated condition.”

1. Reads: a = b

→

2. Computes: a − b = 0

→

3. Rejects: division by zero

A good audit checks each step separately and the conditions required for it, not just the final statement.
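
A step-by-step audit of this kind can be sketched as data plus one loop. The encoding below is an assumption made for illustration: each step of the “proof” carries the condition it silently relies on, and the audit reports the first step whose condition fails for concrete values.

```javascript
// Each step lists the condition that must hold for the transition to be valid.
// The steps mirror the slide; the `requires` checks are this sketch's own encoding.
const steps = [
  { text: "a² = ab", operation: "multiply a = b by a", requires: (a, b) => a === b },
  { text: "a² − b² = ab − b²", operation: "subtract b² from both sides", requires: () => true },
  { text: "(a − b)(a + b) = b(a − b)", operation: "factor both sides", requires: () => true },
  { text: "a + b = b", operation: "divide by a − b", requires: (a, b) => a - b !== 0 },
  { text: "2b = b", operation: "substitute a = b", requires: (a, b) => a === b },
];

// Return the first step whose condition fails for the given values, or null.
function firstInvalidStep(a, b) {
  for (let i = 0; i < steps.length; i++) {
    if (!steps[i].requires(a, b)) return { index: i + 1, ...steps[i] };
  }
  return null;
}

const found = firstInvalidStep(3, 3); // the proof assumes a = b ≠ 0
console.log(`step ${found.index}: ${found.text} (${found.operation})`);
```

With a = b the audit stops at the division. With a ≠ b it already stops at the first step, because the starting assumption does not hold. Either way, the chain never reaches 2 = 1 legitimately.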

The most important safeguard

Sounds confident ≠ is true

MATHEMATICS

An elegantly written error

Every line looks familiar, but one violated condition destroys the entire proof.

FACTS

A “convincing quotation”

A model may invent a source, date, or quotation that sounds perfect but does not exist.

sounds confident

very

has been verified

?

A concrete problem

How many rectangles are in an m×k grid of cells?

Problem definition

The board has m rows and k columns. We count rectangles whose sides lie on grid lines; squares count too.

k+1 vertical lines · m+1 horizontal lines

First guess: m·k?

That is the number of cells, not all rectangles.

Choose the boundaries: C(m+1,2)

Two horizontal lines determine the top and bottom sides.

The other direction: C(k+1,2)

Two vertical lines determine the left and right sides.

General formula: R(m,k) = C(m+1,2)·C(k+1,2)

The agent creates a program that can be run and checked

The program finds an off-by-one error. Mathematics explains the formula.

INSTRUCTION: “Write a program for an m×k grid. Test 1×1, 1×2, and 2×3 first.”

Plan

Choose two of the m+1 horizontal lines and two of the k+1 vertical lines.

C(m+1,2) · C(k+1,2)

- range(width)
+ range(width + 1)

codex · rectangle_demo.py

$ ready to run tests

The computer checked: specific instances.
Mathematics explained: each rectangle corresponds to a choice of four lines.
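
A check of this kind might look as follows. This is a sketch in JavaScript, not the `rectangle_demo.py` shown in the talk; the function names are hypothetical. It counts rectangles by brute force and compares the count with the formula on the small cases named in the instruction.

```javascript
// Binomial coefficient C(n, 2): the number of unordered pairs chosen from n lines.
function choose2(n) {
  return (n * (n - 1)) / 2;
}

// Formula from the slide: C(m+1, 2) · C(k+1, 2).
function rectanglesByFormula(m, k) {
  return choose2(m + 1) * choose2(k + 1);
}

// Brute force: enumerate every pair of horizontal lines and every pair of vertical lines.
// Note the loop bounds: m rows have m + 1 horizontal lines, numbered 0..m.
function rectanglesByEnumeration(m, k) {
  let count = 0;
  for (let top = 0; top <= m; top++)
    for (let bottom = top + 1; bottom <= m; bottom++)
      for (let left = 0; left <= k; left++)
        for (let right = left + 1; right <= k; right++) count++;
  return count;
}

for (const [m, k] of [[1, 1], [1, 2], [2, 3], [8, 8]]) {
  console.log(`${m}×${k}:`, rectanglesByEnumeration(m, k), rectanglesByFormula(m, k));
}
```

Writing `< m` instead of `<= m` in the loops reproduces the off-by-one error: the program would then count lines as if there were only m of them, and the 1×1 test would report zero rectangles.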

Many successes do not constitute a proof

What is the first n for which n²+n+41 is not prime?

n²+n+41

For n=0,1,…,39, the values are prime.

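// `classify` is not defined on the slide; this trial-division version is an assumed stand-in.
function classify(value) {
  if (value < 2) return "neither";
  for (let d = 2; d * d <= value; d++) if (value % d === 0) return "composite";
  return "prime";
}
// Stop at the first n whose value is not prime.
for (let n = 0; ; n++) {
  const value = n * n + n + 41;
  if (classify(value) !== "prime") {
    console.log({ n, value });
    break;
  }
}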

$ searching for the first failure…waiting

The computer found: the first counterexample in the tested sequence.
Mathematics still needs: an explanation of why it appears exactly there.

The counterexample is not an accident

Substitute n = c−1

f_c(n) = n²+n+c

f_c(c−1) = (c−1)²+(c−1)+c

= c²

c = 41, n = 40

f₄₁(40) = 41² = 1681

Edge case: c = 1, n = 0

The value 1 is neither prime nor composite.
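
The substitution can be spot-checked numerically before trusting the algebra. The sketch below uses a hypothetical helper `f(c, n)` for f_c(n). A finite check like this supports the identity but does not prove it; the expansion (c−1)² + (c−1) + c = c² is what does.

```javascript
// f_c(n) = n² + n + c, as defined on the slide.
function f(c, n) {
  return n * n + n + c;
}

// Check f_c(c − 1) = c² for every c from 1 to maxC.
function identityHolds(maxC) {
  for (let c = 1; c <= maxC; c++) {
    if (f(c, c - 1) !== c * c) return false;
  }
  return true;
}

console.log(identityHolds(1000)); // true
console.log(f(41, 40), 41 * 41);  // 1681 1681
console.log(f(1, 0));             // 1, which is neither prime nor composite
```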

The same problem · a different role for AI

One diagnostic question. One hint. Stop.

Student: I think the answer for 8×8 is 64. Here is my attempt.

AI: How many vertical lines bound eight columns of cells?

Student: Nine. I choose the left and right boundaries.

AI: How many unordered pairs can be chosen from nine lines?

Student: C(9,2)=36. In both directions: 36·36=1296.

AI: Now apply the method to a 2×3 grid on your own.

SUCCESS CRITERION: The student solves a new variant and explains the method without the AI transcript.

My own attempt: reveals the genuine starting point.
Diagnosis: one question locates the gap.
Hint: does not replace the student's next step.
Verification: a small case tests the method.
Transfer: a new problem demonstrates learning.

A prompt that supports learning

A prompt that does the work versus one that supports learning

WEAK

“Solve this.”

It defines no goal, no role, and no stopping point.

BETTER

“I am learning this topic. Do not solve the whole problem. First ask one diagnostic question. Then give one hint and wait. At the end, ask me to explain the method in my own words. Here is my attempt: …”

Five checking questions

Before you trust an answer

1. What assumption is being used?
2. What is the simplest small example I can use to check this?
3. What is the edge case?
4. Can I check this with another method, tool, or source?
5. Can I explain the result without the AI conversation?

Not only Codex

The same habits help with different AI tools

Ask for questions, not only answers. Ask for explicit assumptions. Ask for edge cases. Separate calculation from proof. Ask: what would change this answer? Verify it yourself, with a teacher, a textbook, or code.

Putting “try first” into practice

Give yourself 3–5 minutes before using AI

03:00

What do I know? Definitions, given information, similar problems.
What am I trying? A diagram, hypothesis, calculation, or small example.
At exactly which step am I stuck? Show AI your own attempt and identify the exact point where you need help.

Use it without losing the learning

AI at school: six rules

Follow the rules of the assignment. If the teacher says “no AI,” do not use AI.
Begin with your own attempt. Keep your draft, questions, and the point where you got stuck.
Ask for a question or hint first. Do not let the tool perform the very part you are meant to learn.
Protect data and other people's material. Do not enter names, grades, health information, passwords, or private conversations. Do not upload material you are not allowed to share.
Disclose assistance when required. State whether AI helped with the idea, code, style, or verification.
Responsibility remains yours. Exams, competitions, and assessed work have their own rules.

Conditions are part of the theorem

Fermat's little theorem and a primality test

THEOREM: p prime ⇒ a^p ≡ a (mod p)

For every integer a.

VERSION FOR gcd = 1: p prime and gcd(a,p) = 1 ⇒ a^(p−1) ≡ 1 (mod p)

This form leads to the Fermat test.

PSEUDOPRIME: n composite, gcd(a,n) = 1

If nevertheless aⁿ⁻¹ ≡ 1 (mod n), then n is a pseudoprime to base a.

Passing a check is not a proof

341 passes the base-2 Fermat test

agent · fermat_test.py · gcd(2,n)=1

Student: Do not confirm the hypothesis. Find a composite n that passes the test.

Agent: I will check gcd(2,n)=1 and the condition 2ⁿ⁻¹≡1 (mod n).

JavaScript · modular exponentiation

$ waiting to run the test

FACTORIZATION: 341 = 11 · 31. The number is composite.

TEST: 2³⁴⁰ ≡ 1 (mod 341). Passing the test does not prove primality.

EXPLANATION

10 | 340 and 2¹⁰≡1 (mod 11); 5 | 340 and 2⁵≡1 (mod 31). Since gcd(11,31)=1, we obtain 2³⁴⁰≡1 (mod 341).
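
The test and the search for a composite number that passes it can be sketched with modular exponentiation. This is one possible version, not the code shown in the talk: it uses BigInt so that intermediate products cannot overflow, and all function names are hypothetical.

```javascript
// Modular exponentiation by repeated squaring, using BigInt to avoid overflow.
function modPow(base, exponent, modulus) {
  let result = 1n % modulus;
  let b = base % modulus;
  let e = exponent;
  while (e > 0n) {
    if (e & 1n) result = (result * b) % modulus;
    b = (b * b) % modulus;
    e >>= 1n;
  }
  return result;
}

// Euclid's algorithm.
function gcd(a, b) {
  while (b !== 0n) [a, b] = [b, a % b];
  return a;
}

// Base-a Fermat test: true means "passes the test", not "is prime".
function passesFermat(n, a = 2n) {
  return gcd(a, n) === 1n && modPow(a, n - 1n, n) === 1n;
}

// Trial division, used only to tell real primes from pseudoprimes.
function isPrime(n) {
  if (n < 2n) return false;
  for (let d = 2n; d * d <= n; d++) if (n % d === 0n) return false;
  return true;
}

// Search the odd numbers for the first composite n that passes the base-2 test.
function firstBase2Pseudoprime(limit) {
  for (let n = 3n; n <= limit; n += 2n) {
    if (passesFermat(n) && !isPrime(n)) return n;
  }
  return null;
}

console.log(modPow(2n, 340n, 341n));                    // 1n
console.log(modPow(2n, 10n, 11n), modPow(2n, 5n, 31n)); // 1n 1n
console.log(firstBase2Pseudoprime(1000n));              // 341n
```

The search confirms that 341 is where the base-2 test first fails. The explanation above says why it fails there.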

Explore after the talk

Slides, sources, and labs

QR code linking to the public presentation website

nasqret.github.io/czy-ai-zastapi-nasza-glowe

Stable principle: verify claims and conditions.
Moving target: product features, models, and benchmarks age quickly.

Students and an AI assistant examine idea seeds growing into a geometric tree

The answer after 35 minutes

Do not hand over
responsibility.

Delegate calculations, drafts, and tests. Keep understanding, control, and the choice of direction.

After using AI: do I understand more, and can I explain it?

Appendix · bibliography

What is this talk based on?

Vaswani et al. (2017): The transformer and attention mechanism
Ouyang et al. (2022): Training models to follow instructions
Lightman et al. (2023): Step-by-step verification
OpenAI (2024): Reasoning and reinforcement learning
OpenAI Codex: Documentation and best practices
Project sources: Full bibliography, caveats, and verification date

Appendix · audience questions

Three distinctions worth preserving

Result versus understanding

A correct answer does not establish whether a model has human intuition or experience of the world.

Candidate versus scientific idea

A new combination may be useful, but novelty, significance, and authorship require independent evaluation.

Verification versus trust

Trust is justified only where we know how to check the result and accept the cost of error.