CPUdev wiki - User contributions [en]

File:Transport-triggered architecture example.png

2025-05-03T21:54:34Z

Demindiro: Demindiro uploaded a new version of File:Transport-triggered architecture example.png

An example of a TTA.

File:Transport-triggered architecture example.png

2025-05-03T21:52:39Z

Demindiro: Demindiro uploaded a new version of File:Transport-triggered architecture example.png

An example of a TTA.

Transport triggered architecture

2025-05-03T21:49:42Z

Demindiro: Created page with "A '''Transport triggered architecture''' ('''TTA''') is a one instruction set computer which only moves data between various function units. It is a very minimal, simple a..."

A '''Transport triggered architecture''' ('''TTA''') is a [[one instruction set computer]] which only moves data between various function units. It is a very minimal, simple and flexible type of architecture.

A typical TTA consists of one or more transport buses, with various units attached to these buses. Each instruction configures the source and destination, i.e. moves data over the buses.
Certain data movements may trigger an attached unit to perform an action, hence ''transport triggered''.
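
The idea can be illustrated with a minimal sketch in Python. The unit names (<code>reg</code>, <code>add</code>), the port names (<code>operand</code>, <code>trigger</code>, <code>result</code>) and the <code>imm</code> pseudo-source are hypothetical and chosen only for this example; a real TTA may organize its units and buses quite differently.

```python
class Registers:
    """Plain storage unit: moves to or from it have no side effects."""

    def __init__(self, count):
        self.values = [0] * count

    def write(self, port, value):
        self.values[port] = value

    def read(self, port):
        return self.values[port]


class Adder:
    """Hypothetical function unit: a move to 'trigger' starts the addition."""

    def __init__(self):
        self.operand = 0
        self.result = 0

    def write(self, port, value):
        if port == "operand":
            self.operand = value          # just latches the value
        elif port == "trigger":
            self.result = self.operand + value  # the transport triggers the operation
        else:
            raise KeyError(port)

    def read(self, port):
        if port == "result":
            return self.result
        raise KeyError(port)


def run(program, units):
    """Every instruction is a single move: (source, destination)."""
    for (src_unit, src_port), (dst_unit, dst_port) in program:
        # "imm" is an assumed pseudo-unit whose port is the immediate value itself.
        value = src_port if src_unit == "imm" else units[src_unit].read(src_port)
        units[dst_unit].write(dst_port, value)


units = {"reg": Registers(4), "add": Adder()}
program = [
    (("imm", 2), ("reg", 0)),
    (("imm", 40), ("reg", 1)),
    (("reg", 0), ("add", "operand")),
    (("reg", 1), ("add", "trigger")),
    (("add", "result"), ("reg", 2)),
]
run(program, units)
```

After running the program, register 2 holds the sum of registers 0 and 1, even though no instruction ever named an "add" operation.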

TTAs expose a lot of microarchitectural details which are otherwise abstracted in more typical architectures. Due to this, (assembly) code written for a specific TTA tends to be non-portable.

[[File:Transport-triggered architecture example.png|thumb|An example of a TTA. Note that many more configurations are possible, including with multiple transport buses.]]

File:Transport-triggered architecture example.png

2025-05-03T21:47:40Z

Demindiro:

An example of a TTA.

User:Demindiro/Ramblings/Massively concurrent Stack-Stack-Machine

2025-04-05T12:45:35Z

Demindiro: Dump of a potentially silly idea

'''Out-of-Order Execution''' and '''Simultaneous Multi-Threading''' have been on my mind for a while, with many ideas on how to address them best. Below is one of my ideas in text form.

== Latency, OoOE, SMT and architectural state ==

As transistors shrink, processors get faster and faster. However, for practical purposes it is still necessary to communicate with external components such as DRAM. Processor speed improves much faster than the latency between the processor and external components. Nowadays processors are so large that even the latency of on-chip cache is a significant factor.

Most high-performance processors address these challenges with out-of-order execution: instructions later in the stream can be executed ahead of earlier instructions so long as they do not depend on them.
Enabling OoOE requires several large and fast structures. How far a processor core can execute ahead depends entirely on the size of these structures.

Another technique is simultaneous multi-threading: one processor core can evaluate multiple instruction streams simultaneously. Hence, if one stream is blocked on long-latency instructions, the other can continue being evaluated.
SMT allows sharing execution resources but requires duplicating '''architectural state''', in particular registers. Depending on the complexity of the core, this may be a large or small cost. In particular, advanced OoOE processors tend to have at least 2-way SMT, with the extra state consuming a relatively small amount of die space.

SMT is not always beneficial: most (all?) implementations try to ensure fair scheduling, interleaving instructions from all streams. This may lead to cache thrashing and severely degrade the performance of some applications. On the other hand, it will perform much better than OoOE alone if threads are often stalled waiting on memory operations to finish.

What if we could mitigate the cache thrashing? An interesting analogy is '''fair''' versus '''unfair''' mutexes: unfair mutexes tend to lead to higher throughput by allowing threads with data already in cache to continue executing, significantly improving cache utilization. On the other hand, it may cause unpredictable delays for certain tasks, which is why fair mutexes, at least by default, are often preferred.

If (user) applications had explicit control over SMT then perhaps it would be practical to have "unfair" scheduling for SMT too, reducing cache thrashing. But what should the interface look like? Ideally there would be no arbitrary limits on the amount of threads an application can schedule at once.
The most significant limit here is the amount of architectural state: having too much both bloats a core and makes task switching costlier, as more state needs to be (explicitly) saved to memory.

What if we required only the absolute minimum amount of state?

== Stack machines and stack of tasks ==

A stack machine requires only a tiny amount of state: a program counter and a stack pointer are sufficient. This makes switching tasks very cheap.

On the other hand, it requires very frequent access to memory, specifically the stack. This makes OoOE impractical. However, with massive and cheap concurrency the need for OoOE might be avoided altogether.

How can we achieve this massive concurrency? We can use a stack of tasks that are ready to run. The top-most entries are removed from the stack if the processor is able to schedule them. Tasks that are waiting on an operation remain in internal processor state.

Since tasks have very little architectural state it is cheap to implement the fork-join paradigm: push a pointer to a stack and an instruction address and off it goes! (TODO: what about joins? And how to cheaply allocate new stack space? Would a bump allocator + copying GC work? Or an arena?)

Hardware Description Language

2025-03-22T11:29:43Z

Demindiro: /* List of HDLs */ Add Calyx

A hardware description language, or HDL, is a tool to describe the behavior of electronic circuits.

While intended for simulation, most (all?) HDLs are also usable to synthesize physical circuits. However, beware some languages support constructs that cannot actually be translated to a physical circuit.

The most common HDL is [[Verilog]], followed by [[VHDL]]. Many other HDLs are translated to Verilog under the hood.

== Simulation ==

[[Verilator]]

== Synthesis ==

[[Yosys]]

== List of HDLs ==

This list is incomplete. Feel free to add more entries.

{|class="wikitable sortable"
|+ Feature matrix of HDLs
|-
! Name !! Simulation !! Synthesis !! Allows non-synthesizable constructs
|-
| [https://calyxir.org/ Calyx]
| ?
| ?
| ?
|-
| [https://www.chisel-lang.org/ Chisel]
| ?
| ?
| ?
|-
| [https://clash-lang.org/ Clash]
| ?
| ?
| ?
|-
| [https://www.myhdl.org/ MyHDL]
| ?
| ?
| ?
|-
| [https://rust-hdl.org/ RustHDL]
| Yes
| Yes
| ?
|-
| [https://github.com/SpinalHDL/SpinalHDL SpinalHDL]
| Yes
| Yes
| ?
|}

== See also ==

=== External references ===

* [https://github.com/drom/awesome-hdl Awesome Hardware Description Languages] - a list of HDLs and associated tools

Hardware Description Language

2025-03-21T20:55:27Z

Demindiro: /* List of HDLs */ replace "type system" with "allows non-synthesizable constructs", which is the more interesting property.

A hardware description language, or HDL, is a tool to describe the behavior of electronic circuits.

While intended for simulation, most (all?) HDLs are also usable to synthesize physical circuits. However, beware some languages support constructs that cannot actually be translated to a physical circuit.

The most common HDL is [[Verilog]], followed by [[VHDL]]. Many other HDLs are translated to Verilog under the hood.

== Simulation ==

[[Verilator]]

== Synthesis ==

[[Yosys]]

== List of HDLs ==

This list is incomplete. Feel free to add more entries.

{|class="wikitable sortable"
|+ Feature matrix of HDLs
|-
! Name !! Simulation !! Synthesis !! Allows non-synthesizable constructs
|-
| [https://www.chisel-lang.org/ Chisel]
| ?
| ?
| ?
|-
| [https://clash-lang.org/ Clash]
| ?
| ?
| ?
|-
| [https://www.myhdl.org/ MyHDL]
| ?
| ?
| ?
|-
| [https://rust-hdl.org/ RustHDL]
| Yes
| Yes
| ?
|-
| [https://github.com/SpinalHDL/SpinalHDL SpinalHDL]
| Yes
| Yes
| ?
|}

== See also ==

=== External references ===

* [https://github.com/drom/awesome-hdl Awesome Hardware Description Languages] - a list of HDLs and associated tools

Hardware Description Language

2025-03-21T20:53:07Z

Demindiro: /* List of HDLs */ sortable table + feature matrix

A hardware description language, or HDL, is a tool to describe the behavior of electronic circuits.

While intended for simulation, most (all?) HDLs are also usable to synthesize physical circuits. However, beware some languages support constructs that cannot actually be translated to a physical circuit.

The most common HDL is [[Verilog]], followed by [[VHDL]]. Many other HDLs are translated to Verilog under the hood.

== Simulation ==

[[Verilator]]

== Synthesis ==

[[Yosys]]

== List of HDLs ==

This list is incomplete. Feel free to add more entries.

{|class="wikitable sortable"
|+ Feature matrix of HDLs
|-
! Name !! Simulation !! Synthesis !! Type system
|-
| [https://www.chisel-lang.org/ Chisel]
| ?
| ?
| ?
|-
| [https://clash-lang.org/ Clash]
| ?
| ?
| ?
|-
| [https://www.myhdl.org/ MyHDL]
| ?
| ?
| ?
|-
| [https://rust-hdl.org/ RustHDL]
| Yes
| Yes
| Static, strong
|-
| [https://github.com/SpinalHDL/SpinalHDL SpinalHDL]
| Yes
| Yes
| Static, strong
|}

== See also ==

=== External references ===

* [https://github.com/drom/awesome-hdl Awesome Hardware Description Languages] - a list of HDLs and associated tools

Hardware Description Language

2025-03-21T20:35:46Z

Demindiro: Created page with "A hardware description language, or HDL, is a tool to describe the behavior of electronic circuits. While intended for simulation, most (all..."

A hardware description language, or HDL, is a tool to describe the behavior of electronic circuits.

While intended for simulation, most (all?) HDLs are also usable to synthesize physical circuits. However, beware some languages support constructs that cannot actually be translated to a physical circuit.

The most common HDL is [[Verilog]], followed by [[VHDL]]. Many other HDLs are translated to Verilog under the hood.

== Simulation ==

[[Verilator]]

== Synthesis ==

[[Yosys]]

== List of HDLs ==

This list is incomplete. Feel free to add more entries.

Please keep the list alphabetic.

* [https://www.chisel-lang.org/ Chisel]
* [https://clash-lang.org/ Clash]
* [https://www.myhdl.org/ MyHDL]
* [https://rust-hdl.org/ RustHDL]
* [https://github.com/SpinalHDL/SpinalHDL SpinalHDL]

== See also ==

=== External references ===

* [https://github.com/drom/awesome-hdl Awesome Hardware Description Languages] - a list of HDLs and associated tools

Instruction encoding

2025-03-19T20:34:00Z

Demindiro: Hardwired zero register

There are many different ways to encode instructions, all with their own tradeoffs.

== Considerations ==

=== Fixed-width or variable-length ===

Even among variable-length ISAs there is a broad spectrum in complexity. For example, x86 is notorious for being difficult to decode. In contrast, RISC-V is also variable-length but trivial to decode.

=== Instruction density ===

Large instructions can encode many things but have poor density, while smaller instructions have good density but may not be able to encode many things.

Of particular note is the amount of directly addressable registers: a larger amount of registers requires more bits to encode, leaving less to encode other things.

=== Instruction formats ===

To reduce decoding complexity it is wise to define a few fixed formats, which all instructions should be defined in terms of.

== Techniques ==

=== Variable-length encoding ===

To keep decoding variable-length instructions simple one can use a fixed prefix of a few bits.

For example, an ISA with 8, 16 and 24-bit instructions could be encoded as:
<source>
       23    16 15     8 7      0
 8-bit                   xxxxxxx0
16-bit          xxxxxxxx xxxxxx01
24-bit xxxxxxxx xxxxxxxx xxxxx011
</source>
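
A decoder for the prefix scheme above might look like the following Python sketch. It assumes the prefix lives in the low bits of the first byte of each instruction and treats the remaining prefix <code>111</code> as reserved; both are assumptions for illustration, as are the function names.

```python
def instruction_length(first_byte):
    """Return the instruction length in bytes, based on the low prefix bits."""
    if (first_byte & 0b1) == 0b0:
        return 1  # xxxxxxx0 -> 8-bit instruction
    if (first_byte & 0b11) == 0b01:
        return 2  # xxxxxx01 -> 16-bit instruction
    if (first_byte & 0b111) == 0b011:
        return 3  # xxxxx011 -> 24-bit instruction
    raise ValueError("reserved prefix")  # xxxxx111 is not assigned in this sketch


def split_instructions(code):
    """Split a byte string into individual instructions."""
    instructions = []
    offset = 0
    while offset < len(code):
        length = instruction_length(code[offset])
        if offset + length > len(code):
            raise ValueError("truncated instruction")
        instructions.append(code[offset:offset + length])
        offset += length
    return instructions
```

Only the first byte needs to be inspected to find the start of the next instruction, which is what keeps decoding simple.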

=== Hardwired zero register ===

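A substantial number of instructions can be omitted by having a register which always reads 0. For example, a register-to-register move can be expressed as an addition with the zero register, and negation as a subtraction from it.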

=== Implicit destination registers ===

Instead of having explicitly named registers, one can instead have a queue of registers and address source registers relative to the "instruction distance".

Aside from being useful for out-of-order processors, this also avoids the need to encode a destination, freeing bits for other purposes.

Pipelining

2025-03-19T19:39:43Z

Demindiro: Fix example clockrate

A limiting factor in CPU performance is transistor '''propagation delay''': the amount of time it takes for a signal to traverse from start to end.
Hence, reducing the delay before a signal can hit a storage cell allows increasing clock speed.

For example, take the time profile of a naive CPU design which executes a single instruction per cycle, at 0.33MHz:
<source>
          0µs    1µs    2µs    3µs    4µs    5µs    6µs    7µs    8µs    9µs
add x1,x5 |--------------------|
ror x2,x4                      |--------------------|
xor x1,x3                                           |--------------------|
</source>
By splitting instruction fetch, decode, and execute into separate stages we might be able to increase the clock rate to 1.00MHz:
<source>
          0µs    1µs    2µs    3µs    4µs    5µs    6µs    7µs    8µs    9µs
add x1,x5 |--IF--|--ID--|--EX--|
ror x2,x4                      |--IF--|--ID--|--EX--|
xor x1,x3                                           |--IF--|--ID--|--EX--|
</source>
Note that the fetch, decode and execute stages are independent. We can overlap these stages...:
<source>
          0µs    1µs    2µs    3µs    4µs    5µs    6µs    7µs    8µs    9µs
add x1,x5 |--IF--|--ID--|--EX--|
ror x2,x4        |--IF--|--ID--|--EX--|
xor x1,x3               |--IF--|--ID--|--EX--|
</source>
... reducing total execution time from 9µs to 5µs!
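
The numbers in this example can be checked with a small Python sketch. It assumes an idealized pipeline in which every stage takes the same time and no instruction ever stalls; the function names are made up for illustration.

```python
def unpipelined_time(instructions, stages, stage_time):
    """Each instruction runs all of its stages before the next one starts."""
    return instructions * stages * stage_time


def pipelined_time(instructions, stages, stage_time):
    """The first instruction fills the pipeline, then one finishes per stage time."""
    return (stages + instructions - 1) * stage_time


# Three instructions, three stages (IF, ID, EX), 1µs per stage.
sequential = unpipelined_time(3, 3, 1)
overlapped = pipelined_time(3, 3, 1)
```

For long instruction streams the pipelined time approaches one instruction per stage time, so the speedup tends towards the number of stages.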

Pipelining

2025-03-18T20:30:09Z

Demindiro: Draft

A limiting factor in CPU performance is transistor '''propagation delay''': the amount of time it takes for a signal to traverse from start to end.
Hence, reducing the delay before a signal can hit a storage cell allows increasing clock speed.

For example, take the time profile of a naive CPU design which executes a single instruction per cycle, at 0.33MHz:
<source>
          0µs    1µs    2µs    3µs    4µs    5µs    6µs    7µs    8µs    9µs
add x1,x5 |--------------------|
ror x2,x4                      |--------------------|
xor x1,x3                                           |--------------------|
</source>
By splitting instruction fetch, decode, and execute into separate stages we might be able to increase the clock rate to 1.00MHz:
<source>
          0µs    1µs    2µs    3µs    4µs    5µs    6µs    7µs    8µs    9µs
add x1,x5 |--IF--|--ID--|--EX--|
ror x2,x4                      |--IF--|--ID--|--EX--|
xor x1,x3                                           |--IF--|--ID--|--EX--|
</source>
Note that the fetch, decode and execute stages are independent. We can overlap these stages...:
<source>
          0µs    1µs    2µs    3µs    4µs    5µs    6µs    7µs    8µs    9µs
add x1,x5 |--IF--|--ID--|--EX--|
ror x2,x4        |--IF--|--ID--|--EX--|
xor x1,x3               |--IF--|--ID--|--EX--|
</source>
... reducing total execution time from 9µs to 5µs!

RISC vs CISC

2025-03-18T20:12:26Z

Demindiro: Draft

Contemporary usage of the terms RISC and CISC is muddy, but they originally referred to the number of operations a single instruction could (or should) perform.

For example, swapping values between a register and a memory location on a RISC machine may be done as:
<source>
ld.w x3,[x2] ; 1 cycle
st.w [x2],x1 ; 1 cycle
mv x1,x3 ; 1 cycle
</source>
Whereas on a CISC machine it might be done as:
<source>
xchg x1,[x2] ; 3 cycles
</source>
RISC cores tend to be smaller and simpler, allowing higher clock speeds to be achieved. CISC cores tend to be more complex but usually have better code density.

In practice, modern ISAs are a hybrid between the two approaches, with both simple and complex instructions.

Instruction encoding

2025-03-18T19:59:37Z

Demindiro: Draft

There are many different ways to encode instructions, all with their own tradeoffs.

== Considerations ==

=== Fixed-width or variable-length ===

Even among variable-length ISAs there is a broad spectrum in complexity. For example, x86 is notorious for being difficult to decode. In contrast, RISC-V is also variable-length but trivial to decode.

=== Instruction density ===

Large instructions can encode many things but have poor density, while smaller instructions have good density but may not be able to encode many things.

Of particular note is the amount of directly addressable registers: a larger amount of registers requires more bits to encode, leaving less to encode other things.

=== Instruction formats ===

To reduce decoding complexity it is wise to define a few fixed formats, which all instructions should be defined in terms of.

== Techniques ==

=== Variable-length encoding ===

To keep decoding variable-length instructions simple one can use a fixed prefix of a few bits.

For example, an ISA with 8, 16 and 24-bit instructions could be encoded as:
<source>
       23    16 15     8 7      0
 8-bit                   xxxxxxx0
16-bit          xxxxxxxx xxxxxx01
24-bit xxxxxxxx xxxxxxxx xxxxx011
</source>

=== Implicit destination registers ===

Instead of having explicitly named registers, one can instead have a queue of registers and address source registers relative to the "instruction distance".

Aside from being useful for out-of-order processors, this also avoids the need to encode a destination, freeing bits for other purposes.

File:Alu.png

2023-02-27T19:10:35Z

Demindiro: Demindiro uploaded a new version of File:Alu.png

Symbolic representation of an ALU

ALU

2023-02-27T18:06:47Z

Demindiro: /* Overview */

An ALU (Arithmetic Logic Unit) is a circuit for performing various integer operations.
It is an essential component of any CPU.

It does not perform operations on floating point numbers, which are handled by an [[FPU]] instead.

= Overview =

[[File:Alu.png|thumb]]

A typical ALU has:

* 2 integer inputs.
* 1 integer output.
* 1 opcode selection input.
* 1 status input.
* 1 status output.

The status signal is used to indicate carry/overflow/...

An ALU may have more than 2 inputs. For example, fused multiply-add requires 3 input operands.
Likewise, an ALU may have more than 1 output, e.g. to calculate both quotient and remainder in one operation.

= Common operations =

== Bitwise AND, OR, XOR ==

These operations simply involve wiring each bit to the corresponding gate.

== Shift/rotate left/right ==

Shifting or rotating a number by a fixed amount of bits simply involves wiring each input bit to its corresponding output bit.

Shifting by a variable amount is trickier, especially for large numbers.
Using a single multiplexer will require an excessive amount of wires and gates to support every possible rotation.

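A sequence of multiple small multiplexers can be used instead, where stage <code>i</code> shifts the number by <code>2<sup>i</sup></code> bits if bit <code>i</code> of the shift amount <code>y</code> is set.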
While this is slower it saves a lot on gates.

 # Right shifter for 8-bit numbers.
 s4 = y[2] ? (x >> 4)  : x
 s2 = y[1] ? (s4 >> 2) : s4
 s1 = y[0] ? (s2 >> 1) : s2
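
The staged shifter above can be modelled in Python to check that three stages cover every shift amount from 0 to 7. This is only a behavioural sketch; the function name is an assumption and the bit indexing <code>y[i]</code> is written out as shifts and masks.

```python
def shift_right_8(x, y):
    """Shift an 8-bit value right by y (0..7) using three 2:1 multiplexer stages."""
    x &= 0xFF
    s4 = (x >> 4) if (y >> 2) & 1 else x    # stage controlled by y[2]
    s2 = (s4 >> 2) if (y >> 1) & 1 else s4  # stage controlled by y[1]
    s1 = (s2 >> 1) if y & 1 else s2         # stage controlled by y[0]
    return s1
```

An 8-bit shifter therefore needs only three multiplexer stages instead of one multiplexer with eight inputs per output bit.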

== Addition ==

An adder is composed of multiple 1-bit adders.
Each of these 1-bit adders has ''three'' inputs and ''two'' outputs.

The addition of two 1-bit numbers has a 2-bit result.
The high bit is called the '''carry'''.
For a multi-bit adder this carry is ''carried'' to the next 1-bit adder.

The LSb (<code>out</code>) can be calculated by XORing all three input bits together.
The MSb (<code>carry_out</code>) can be calculated by testing whether two or more bits are 1.

out = in_a ^ in_b ^ carry_in
carry_out = (in_a & in_b) | (in_b & carry_in) | (carry_in & in_a)

Note that:
* <code>(x & z) | (y & z) = (x | y) & z</code>
* <code>x | y = (x ^ y) | (x & y)</code>
* <code>x | (x & z) = x</code>

From which follows:

(in_a & in_b) | (in_b & carry_in) | (carry_in & in_a)
= (in_a & in_b) | ((in_a | in_b) & carry_in)
= (in_a & in_b) | (((in_a ^ in_b) | (in_a & in_b)) & carry_in)
= (in_a & in_b) | ((in_a ^ in_b) & carry_in) | ((in_a & in_b) & carry_in)
= (in_a & in_b) | ((in_a ^ in_b) & carry_in)

Hence we can save a few gates by using the result of <code>in_a ^ in_b</code>

t = in_a ^ in_b
out = t ^ carry_in
carry_out = (in_a & in_b) | (t & carry_in)
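
As a sanity check, the simplified equations can be modelled in Python and chained into a ripple-carry adder. The function names and the default width of 8 bits are assumptions made for this sketch.

```python
def full_adder(in_a, in_b, carry_in):
    """1-bit full adder using the shared XOR term t."""
    t = in_a ^ in_b
    out = t ^ carry_in
    carry_out = (in_a & in_b) | (t & carry_in)
    return out, carry_out


def ripple_carry_add(x, y, width=8, carry_in=0):
    """Chain `width` full adders, passing the carry from bit i to bit i + 1."""
    result = 0
    carry = carry_in
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result, carry
```

The returned carry is the carry out of the highest bit, which indicates unsigned overflow.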

=== Faster addition ===

While chaining 1-bit adders is a cheap & easy way to implement a multi-bit adder,
it is also slow as the signal must propagate from the lowest to the highest bit.
To speed this up a ''carry-lookahead'' is used.

This lookahead computes the carry for multiple bits up front,
which allows using multiple small multi-bit adders in parallel.

Suppose that <code>g = x & y</code> and <code>p = x ^ y</code>, then:

c = (x & y) | ((x ^ y) & z) = g | (p & z)

Since we chain each 1-bit adder:

c1 = g0 | (p0 & c0)
c2 = g1 | (p1 & c1)
c3 = g2 | (p2 & c2)
...

If we fully work out <code>c</code> on the right side of each expression we get:

c1 = g0 | (p0 & c0)
c2 = g1 | (p1 & c1)
= g1 | (p1 & (g0 | (p0 & c0)))
= g1 | (p1 & g0) | (p1 & p0 & c0)
c3 = g2 | (p2 & c2)
= g2 | (p2 & (g1 | (p1 & g0) | (p1 & p0 & c0)))
= g2 | (p2 & g1) | (p2 & p1 & g0) | (p2 & p1 & p0 & c0)
...

We can group the <code>p</code> and <code>g</code> signals:

P = ... p2 & p1 & p0
G = ... (... p2 & g1) | (... p2 & p1 & g0)

And calculate the final carry signal:

C = G | (P & c0)
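
The grouped signals can be verified against ordinary addition with a short Python sketch. It computes each term of <code>G</code> independently, mirroring the flattened expressions above; the function names and the <code>width</code> parameter are assumptions for illustration.

```python
def group_signals(x, y, width):
    """Return the group propagate (P) and generate (G) signals for `width` bits."""
    g = [(x >> i) & (y >> i) & 1 for i in range(width)]
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(width)]
    group_p = int(all(p))
    group_g = 0
    for i in range(width):
        # Bit i generates a carry that must propagate through all higher bits.
        term = g[i]
        for j in range(i + 1, width):
            term &= p[j]
        group_g |= term
    return group_p, group_g


def carry_out(x, y, c0, width):
    """Carry out of the whole group: C = G | (P & c0)."""
    group_p, group_g = group_signals(x, y, width)
    return group_g | (group_p & c0)
```

In hardware every term of <code>G</code> can be evaluated in parallel, so the carry does not have to ripple through the group bit by bit.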

=== Addition in HDLs & simulators ===

Note that most HDLs & simulators already provide a component for performing addition.
This component should be preferred as it can be synthesized more efficiently than a manual implementation
(e.g. by using built-in adders or other components in FPGAs, such as Xilinx's CARRY4).

== Subtraction ==

For two's complement arithmetic, subtraction can be done with <code>a - b = a + (-b)</code>.
<code>-b</code> is equivalent to <code>~b + 1</code>, i.e. apply bitwise NOT to <code>b</code> and add 1.
Adding 1 can be done by setting the carry high.
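
A small Python sketch of this trick, assuming a fixed bit width (8 by default here) and a made-up function name:

```python
def subtract(a, b, width=8):
    """Compute a - b as a + ~b + 1, where the +1 is the carry-in of the adder."""
    mask = (1 << width) - 1
    total = (a & mask) + (~b & mask) + 1
    result = total & mask      # the low `width` bits are the difference
    carry = total >> width     # carry out of the highest bit
    return result, carry
```

A negative difference wraps around, so <code>3 - 5</code> yields the two's complement encoding of -2.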

== Equality ==

The cheapest way to test for equality is by using bitwise XOR:
if two numbers are equal, their XOR is 0.
If not, it is nonzero.
Then do a bitwise OR of all bits. If 1, the numbers are not equal.
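
In Python this could be modelled as follows. The loop stands in for the OR gate tree that combines all bits; the function name and default width are assumptions.

```python
def is_equal(a, b, width=8):
    """Equality test: XOR the inputs, then OR all result bits together."""
    mask = (1 << width) - 1
    diff = (a ^ b) & mask
    any_bit_set = 0
    for i in range(width):
        any_bit_set |= (diff >> i) & 1
    return any_bit_set == 0
```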

== Less Than ==

For unsigned arithmetic a Less Than test can be done with a subtraction and checking the carry bit.
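
A sketch in Python, assuming the subtraction is implemented as <code>a + ~b + 1</code>. Under that assumption the carry out is 1 when no borrow occurred, so <code>a < b</code> corresponds to a carry of 0. The function name is hypothetical.

```python
def unsigned_less_than(a, b, width=8):
    """a < b for unsigned numbers, derived from the carry of a + ~b + 1."""
    mask = (1 << width) - 1
    total = (a & mask) + (~b & mask) + 1
    carry = total >> width
    return carry == 0  # no carry out means the subtraction borrowed
```

Designs that subtract with an explicit borrow signal instead would test for the borrow being set.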

::'''TODO''' signed arithmetic requires checking carry and high bit IIRC?

== Multiplication ==

== Division ==

= Pitfalls =

== Division operation in Verilog, VHDL et al. ==

It may be tempting to use the division operator directly (e.g. <code>alu_o = alu_a / alu_b</code>).
'''Do not do this''' as it will generate a huge and slow combinatorial circuit.

Instead do each part of the division step by step, one step per clock cycle.
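
One common way to do this is restoring (shift-and-subtract) division, sketched below in Python. Each loop iteration corresponds to the work of one clock cycle. The choice of algorithm and the names are assumptions for illustration; other multi-cycle schemes exist.

```python
def divide(dividend, divisor, width=8):
    """Unsigned restoring division, producing one quotient bit per iteration."""
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    quotient = 0
    remainder = 0
    for cycle in range(width):
        # Shift in the next dividend bit, starting from the most significant bit.
        bit = (dividend >> (width - 1 - cycle)) & 1
        remainder = (remainder << 1) | bit
        quotient <<= 1
        # Subtract the divisor if it fits and record a 1 in the quotient.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
    return quotient, remainder
```

Each step only needs a subtractor and a comparison, which is far smaller than a fully combinatorial divider at the cost of taking <code>width</code> cycles.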

= See Also =

* [[Binary arithmetic]]

File:Alu.png

2023-02-27T17:58:37Z

Demindiro:

Symbolic representation of an ALU

ALU

2023-02-18T15:33:55Z

Demindiro: First draft of ALU page.

An ALU (Arithmetic Logic Unit) is a circuit for performing various integer operations.
It is an essential component of any CPU.

It does not perform operations on floating point numbers, which are handled by an [[FPU]] instead.

= Overview =

= Common operations =

== Bitwise AND, OR, XOR ==

These operations simply involve wiring each bit to the corresponding gate.

== Shift/rotate left/right ==

Shifting or rotating a number by a fixed amount of bits simply involves wiring each input bit to its corresponding output bit.

Shifting by a variable amount is trickier, especially for large numbers.
Using a single multiplexer will require an excessive amount of wires and gates to support every possible rotation.

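A sequence of multiple small multiplexers can be used instead, where stage <code>i</code> shifts the number by <code>2<sup>i</sup></code> bits if bit <code>i</code> of the shift amount <code>y</code> is set.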
While this is slower it saves a lot on gates.

# Right shifter for 8 bits numbers.
s4 = y[2] ? (x >> 4) : x
s2 = y[1] ? (s4 >> 2) : s4
s1 = y[0] ? (s2 >> 1) : s2

== Addition ==

An adder is composed of multiple 1-bit adders.
Each of these 1-bit adders has ''three'' inputs and ''two'' outputs.

The addition of two 1-bit numbers has a 2-bit result.
The high bit is called the '''carry'''.
For a multi-bit adder this carry is ''carried'' to the next 1-bit adder.

The LSb (<code>out</code>) can be calculated by XORing all three input bits together.
The MSb (<code>carry_out</code>) can be calculated by testing whether two or more bits are 1.

out = in_a ^ in_b ^ carry_in
carry_out = (in_a & in_b) | (in_b & carry_in) | (carry_in & in_a)

Note that:
* <code>(x & z) | (y & z) = (x | y) & z</code>
* <code>x | y = (x ^ y) | (x & y)</code>
* <code>x | (x & z) = x</code>

From which follows:

(in_a & in_b) | (in_b & carry_in) | (carry_in & in_a)
= (in_a & in_b) | ((in_a | in_b) & carry_in)
= (in_a & in_b) | (((in_a ^ in_b) | (in_a & in_b)) & carry_in)
= (in_a & in_b) | ((in_a ^ in_b) & carry_in) | ((in_a & in_b) & carry_in)
= (in_a & in_b) | ((in_a ^ in_b) & carry_in)

Hence we can save a few gates by using the result of <code>in_a ^ in_b</code>

t = in_a ^ in_b
out = t ^ carry_in
carry_out = (in_a & in_b) | (t & carry_in)

=== Faster addition ===

While chaining 1-bit adders is a cheap & easy way to implement a multi-bit adder,
it is also slow as the signal must propagate from the lowest to the highest bit.
To speed this up a ''carry-lookahead'' is used.

This lookahead computes the carry for multiple bits up front,
which allows using multiple small multi-bit adders in parallel.

Suppose that <code>g = x & y</code> and <code>p = x ^ y</code>, then:

c = (x & y) | ((x ^ y) & z) = g | (p & z)

Since we chain each 1-bit adder:

c1 = g0 | (p0 & c0)
c2 = g1 | (p1 & c1)
c3 = g2 | (p2 & c2)
...

If we fully work out <code>c</code> on the right side of each expression we get:

c1 = g0 | (p0 & c0)
c2 = g1 | (p1 & c1)
= g1 | (p1 & (g0 | (p0 & c0)))
= g1 | (p1 & g0) | (p1 & p0 & c0)
c3 = g2 | (p2 & c2)
= g2 | (p2 & (g1 | (p1 & g0) | (p1 & p0 & c0)))
= g2 | (p2 & g1) | (p2 & p1 & g0) | (p2 & p1 & p0 & c0)
...

We can group the <code>p</code> and <code>g</code> signals:

P = ... p2 & p1 & p0
G = ... (... p2 & g1) | (... p2 & p1 & g0)

And calculate the final carry signal:

C = G | (P & c0)

=== Addition in HDLs & simulators ===

Note that most HDLs & simulators already provide a component for performing addition.
This component should be preferred as it can be synthesized more efficiently than a manual implementation
(e.g. by using built-in adders or other components in FPGAs, such as Xilinx's CARRY4).

== Subtraction ==

For two's complement arithmetic, subtraction can be done with <code>a - b = a + (-b)</code>.
<code>-b</code> is equivalent to <code>~b + 1</code>, i.e. apply bitwise NOT to <code>b</code> and add 1.
Adding 1 can be done by setting the carry high.

== Equality ==

The cheapest way to test for equality is by using bitwise XOR:
if two numbers are equal, their XOR is 0.
If not, it is nonzero.
Then do a bitwise OR of all bits. If 1, the numbers are not equal.

== Less Than ==

For unsigned arithmetic a Less Than test can be done with a subtraction and checking the carry bit.

::'''TODO''' signed arithmetic requires checking carry and high bit IIRC?

== Multiplication ==

== Division ==

= Pitfalls =

== Division operation in Verilog, VHDL et al. ==

It may be tempting to use the division operator directly (e.g. <code>alu_o = alu_a / alu_b</code>).
'''Do not do this''' as it will generate a huge and slow combinatorial circuit.

Instead do each part of the division step by step, one step per clock cycle.

= See Also =

* [[Binary arithmetic]]

Binary arithmetic

2023-02-18T13:32:04Z

Demindiro: First draft. Written under the assumption that readers are not familiar with binary arithmetic yet, so heavy on examples.

As the name ''digital'' implies, digital computers work on bits with a value of either 0 or 1.
Accordingly, all operations, including arithmetic, operate on 0 or 1 values.

Numbers that consist only of 0s and 1s are called '''binary''' or '''base-2''' numbers.
Binary numbers are often suffixed with a subscript 2, for example 1011011<sub>2</sub>.

= Converting between decimal and binary =

== Decimal as a sum of powers of 10 ==

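Any decimal number can be written as a sum of powers of 10.
For example:

* 42 = 4*10<sup>1</sup> + 2*10<sup>0</sup>
* 57005 = 5*10<sup>4</sup> + 7*10<sup>3</sup> + 0*10<sup>2</sup> + 0*10<sup>1</sup> + 5*10<sup>0</sup>
* 48.879 = 4*10<sup>1</sup> + 8*10<sup>0</sup> + 8*10<sup>-1</sup> + 7*10<sup>-2</sup> + 9*10<sup>-3</sup>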

Binary numbers can be written in the same way though with powers of 2 instead of 10.

== Converting from binary to decimal ==

* 1010<sub>2</sub> = 1*2<sup>3</sup> + 0*2<sup>2</sup> + 1*2<sup>1</sup> + 0*2<sup>0</sup> = 8 + 2 = 10
* 100.01<sub>2</sub> = 2<sup>2</sup> + 2<sup>-2</sup> = 4 + 1/4 = 4.25
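
The same sum of powers of 2 can be written as a short Python sketch. The function name and the use of a text string with an optional <code>.</code> as input are assumptions made for this example.

```python
def binary_to_decimal(text):
    """Sum digit * 2**position for every digit of a binary string."""
    whole, _, fraction = text.partition(".")
    value = 0
    # Digits left of the point have exponents 0, 1, 2, ... counted from the right.
    for exponent, digit in enumerate(reversed(whole)):
        value += int(digit) * 2 ** exponent
    # Digits right of the point have exponents -1, -2, ... counted from the left.
    for position, digit in enumerate(fraction, start=1):
        value += int(digit) * 2 ** -position
    return value
```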

== Converting from decimal to binary ==

# Divide the decimal number by 2.
# Append the remainder to the ''left'' of the binary number.
# Repeat with quotient as dividend until it is 0.

For example:

* 29 = 11101<sub>2</sub>
** 29 / 2 = 14, remainder 1
** 14 / 2 = 7, remainder 0
** 7 / 2 = 3, remainder 1
** 3 / 2 = 1, remainder 1
** 1 / 2 = 0, remainder 1
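
The repeated-division procedure for integers can be sketched in Python as follows. The function name is an assumption, and the special case for 0 is added so the result is never an empty string.

```python
def decimal_to_binary(number):
    """Convert a non-negative integer to a string of binary digits."""
    if number == 0:
        return "0"
    digits = ""
    while number > 0:
        number, remainder = divmod(number, 2)
        digits = str(remainder) + digits  # append to the left
    return digits
```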

The same technique can be used for rational numbers.

# First multiply by some power of 2 to convert it to an integer.
# Apply above procedure.
# Divide by the same power of 2 to get the correct result.

= Mathematical operations =

The same techniques used for addition, multiplication, etc. on decimal numbers can also be used for binary numbers.

== Addition & subtraction ==

   22      10110<sub>2</sub>
 + 19    + 10011<sub>2</sub>
 ----    --------
   41     101001<sub>2</sub>

== Multiplication ==

   12          1100<sub>2</sub>
 *  5        *  101<sub>2</sub>
 ----        ------
   10          1100<sub>2</sub>
 + 50             0<sub>2</sub>
 ----      + 110000<sub>2</sub>
   60      --------
             111100<sub>2</sub>

== Division ==

   372 | 5        101110100 | 101<sub>2</sub>
 - 35  +----    - 101       +---------
 ----  | 74     -----       | 1001010<sub>2</sub>
    22 |            01      |
  - 20 |           - 0      |
  ---- |         -----      |
     2 |             11     |
                    - 0     |
                  -----     |
                     110    |
                   - 101    |
                   -----    |
                       11   |
                      - 0   |
                    -----   |
                       110  |
                     - 101  |
                     -----  |
                         10 |
                        - 0 |
                      ----- |
                         10 |

Ethernet

2023-02-17T20:39:26Z

Demindiro: Quick draft of Ethernet page

Ethernet is a ''wired'' protocol for connecting many devices in a network such that they can transfer data between each other.
It defines both the physical layers (10BASE-T, 100BASE-TX ...) as well as the data format of an Ethernet packet & frame.

= Ethernet packet & frame =

There are two major formats of Ethernet ''packets'':
the regular variant which supports MTUs (Maximum Transmission Unit) of up to 1500 bytes
and "jumbo frames", which support MTUs of up to 9000 bytes.
The latter are only supported for gigabit links and higher.

'''Note:''' each byte is transmitted starting from its LSb to its MSb.

== Ethernet packet ==

Every Ethernet packet starts with a preamble.
This preamble is a repeated pattern of <code>01010101</code> bits (hex: <code>0x55</code>).
This pattern is used to synchronize the clock as well as signal to the receiver a frame is about to be transmitted.
The preamble consists of 7 bytes.

Immediately following the preamble is the SFD (Start Frame Delimiter).
This byte has the value <code>11010101</code> (<code>0xd5</code>) and indicates the start of the Ethernet frame.

After the Ethernet Frame there is an IPG (InterPacket Gap) of 12 bytes during which no data is sent whatsoever.

{| class="wikitable"
|+ Ethernet packet
|-
! Preamble !! SFD !! Ethernet frame !! IPG
|-
| 0x55, 7 times || 0xd5 || ... || 12 bytes of nothing
|}

== Ethernet frame ==

An Ethernet frame is at least 64 bytes long.

Each frame ends with an FCS (Frame Check Sequence), which is a CRC32 of all preceding frame data.

{| class="wikitable"
|+ Ethernet frame
|-
! Destination MAC address !! Source MAC address !! EtherType !! Payload !! FCS
|-
| 6 bytes || 6 bytes || 2 bytes || At least 46 bytes || 4 bytes
|}
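
The packet and frame layout can be sketched in Python as follows. The sketch relies on a few assumptions that are not stated above: that <code>zlib.crc32</code> computes the same CRC32 as the FCS, that the FCS is appended least significant byte first, and that short payloads are padded with zero bytes. The MAC addresses and function names are made up for the example, and the interpacket gap is not modelled.

```python
import struct
import zlib

PREAMBLE = bytes([0x55] * 7)  # 7 bytes of 0x55
SFD = bytes([0xD5])           # start frame delimiter
MIN_PAYLOAD = 46              # minimum payload size without an 802.1Q tag


def build_frame(dst_mac, src_mac, ethertype, payload):
    """Assemble destination, source, EtherType, padded payload and FCS."""
    if len(dst_mac) != 6 or len(src_mac) != 6:
        raise ValueError("MAC addresses must be 6 bytes")
    padded = payload.ljust(MIN_PAYLOAD, b"\x00")  # assumed zero padding
    body = dst_mac + src_mac + struct.pack("!H", ethertype) + padded
    fcs = struct.pack("<I", zlib.crc32(body))     # assumed byte order
    return body + fcs


def build_packet(frame):
    """Prefix a frame with the preamble and SFD."""
    return PREAMBLE + SFD + frame


frame = build_frame(
    bytes.fromhex("ffffffffffff"),  # broadcast destination
    bytes.fromhex("020000000001"),  # made-up source address
    0x0800,
    b"hello",
)
packet = build_packet(frame)
```

With a 5-byte payload the frame is padded up to the 64-byte minimum, and the packet adds 8 more bytes for the preamble and SFD.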

The "EtherType" <code>0x8100</code> indicates that an 802.1Q tag is present, in which case the format is as follows:

{| class="wikitable"
|+ Ethernet frame with 802.1Q tag
|-
! Destination MAC address !! Source MAC address !! 802.1Q tag !! EtherType !! Payload !! FCS
|-
| 6 bytes || 6 bytes || 4 bytes || 2 bytes || At least 42 bytes || 4 bytes
|}

The "real" EtherType is located right after the tag.

{| class="wikitable"
|+ 802.1Q tag
|-
! TPID !! TCI
|-
| 0x8100 || ...
|}

= Interfaces =

There are various types of interfaces to interact with Ethernet controllers.

== Media Independent Interface (MII) ==

The MII is the most common type of interface. Several variants exist for higher speeds or lower pin count.

=== Standard ===

The standard MII interface supports 10BASE-T and 100BASE-TX.

{| class="wikitable"
|+ MII pins
|-
! Pin(s) !! Description
|-
| TX_CLK || Transmission clock. Must be used when sending data.
|-
| TXD[0:3] || 4 bits of data to transmit if TX_EN is high.
|-
| TX_EN || When high, transmit data.
|-
| TX_ER || When high, force a transmission error.
|-
| RX_CLK || Receive clock. Must be used when receiving data.
|-
| RXD[0:3] || 4 bits of received data when RX_DV is high.
|-
| RX_DV || Whether data is currently being received.
|-
| RX_ER || Whether an error occurred during reception.
|-
| CRS || Carrier sense. TODO
|-
| COL || Collision detect. TODO
|-
| MDIO || Management data input/output. TODO
|-
| MDC || Management data clock. TODO
|}

When receiving or sending data, the '''low''' nibble (4 bits) of each byte is sent first, then the high nibble.

It is possible some bytes of the preamble may be missing on reception. This may happen if the controller detects the signal late.

UART

2023-02-17T16:32:09Z

Demindiro: First draft of UART page

UART (Universal Asynchronous Receiver-Transmitter) is a simple protocol for exchanging data between two devices. It uses one line for transmitting and one for receiving data.

= Clock rate & encoding =

The clock rate & encoding must be configured beforehand on both devices, as UART does not provide a way to negotiate settings.

There are several ways to encode the data.
The encoding is usually indicated as <code><Hz> <data bits><parity><stop bits></code>

For example, <code>9600 8N1</code> indicates a clock of 9600Hz with 8 data bits, no parity bit and 1 stop bit. <code>38400 7E2</code> indicates a clock of 38400Hz, 7 data bits, 1 even parity bit and 2 stop bits.

The parity bit is used for error detection. It can detect a single bit flip during transmission.

{| class="wikitable"
|+ Parity bit encodings
|-
! Letter !! Description
|-
| N || No parity bit
|-
| E || Even parity bit. 0 if even amount of data bits with value of 1, otherwise 1.
|-
| O || Odd parity bit. 1 if even amount of data bits with value of 1, otherwise 0.
|-
| M || Mark. The parity bit is always 1.
|-
| S || Space. The parity bit is always 0.
|}

By default ('''idle'''), the line must be pulled high. When data is about to be transmitted, the line must be pulled low ('''start'''). Then the '''data''' and '''parity''' bits must be sent. Finally, the '''stop''' bit(s) must be sent which will pull the line high, completing the transaction.
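
The sequence of line levels for one transaction can be sketched in Python. The sketch assumes the data bits are sent least significant bit first, which is not stated above, and only handles the N, E and O parity settings; the function name is hypothetical.

```python
def uart_frame(value, data_bits=8, parity="N", stop_bits=1):
    """Return the line levels (0 = low, 1 = high) for one UART transaction."""
    # Assumption: the least significant data bit is sent first.
    bits = [(value >> i) & 1 for i in range(data_bits)]
    ones = sum(bits)
    levels = [0] + bits                # start bit pulls the line low
    if parity == "E":
        levels.append(ones % 2)        # 0 if the number of 1 bits is even
    elif parity == "O":
        levels.append(1 - ones % 2)    # 1 if the number of 1 bits is even
    elif parity != "N":
        raise ValueError("unsupported parity setting")
    levels += [1] * stop_bits          # stop bit(s) return the line high
    return levels


frame_8n1 = uart_frame(0x41)                                        # 'A' as 8N1
frame_7e2 = uart_frame(0x41, data_bits=7, parity="E", stop_bits=2)  # 'A' as 7E2
```

Each level is held for one bit period, so an 8N1 transaction occupies 10 bit periods on the line and a 7E2 transaction 11.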

[[File:UART state machine & waveform example.png]]

= Tools =

On Linux, you can use minicom to communicate with a UART-enabled device. It allows you to configure the encoding & data rate and save the settings to a file.

If the device is connected over USB, it will show up as <code>/dev/ttyUSBX</code>, where X is some integer.

File:UART state machine & waveform example.png

2023-02-17T16:29:01Z

Demindiro:

UART state machine & waveform example