Yuhan's blog

Sunday, January 05, 2025

MPI - Quick Tutorial

MPI is a C program that can run at multiple processors, and the processors can be at multiple machines. It will SSH into the remote machines and run the task on the processor

Install MPI

sudo apt install openmpi-bin openmpi-common libopenmpi-dev

MPI has its own compiler and runtime:

mpicc --version

mpiexec --version

Compile MPI

mpicc -o hello-mpi hello-mpi.c

Run MPI

mpiexec hello-mpi

(This will execute hello with 16 processors because my machine has 16 processors. If ./hello-mpi is invoked directly, hello will be executed just once.)

Create a hostfile:

localhost slots=4

Execute the hostfile to run with just 4 processors:

mpirun --hostfile hostfile hello-mpi

This will run in 4 processors.

You can directly run with just 4 processors without specifying a host file:

mpirun -np 4 hello-mpi

If the number of processors exceeds the system has, the request will error out. For example:

mpirun -np 17 hello-mpi

(This resulted in errors on my machine with "There are not enough slots available in the system to satisfy the 17 slots that were requested by the application." Similar error would happen if -np and --hostfile is combined and -np exceeds the slots on a machine defined in hostfile.）

mpirun and mpiexec can be used exchangeable.

How to Get Result from Remote

Results on different machines can be reduced with MPI_Reduce(), usually on machine 0. For example: to sum the result:

double local_result = compute_local_result();

double global_result;

MPI_Reduce(&local_result, &global_result, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

MPI_Allreduce will reduce the result and send to all machines.

Use MPI_Gather to collect results from all nodes as an array of items into a node (node 0) without reducing.

double local_result = compute_local_result();

double all_results[num_processes];

MPI_Gather(&local_result, 1, MPI_DOUBLE, all_results, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

MPI_Allgather will collect results in an array and make available in all machines.

You can also customize the communication among the nodes. For example, to collect results from all other MPI nodes in node 0.

if (rank != 0) {
MPI_Send(&local_result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
} else {
for (int i = 1; i < num_processes; i++)

MPI_Recv(&worker_result, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

}

MPI_Scan will visit the processors in logical order by their rank. (Meanwhile they run in parallel behind the scene.) You can get prefix sum easily by MPI_Scan(&rank, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD).

MPI_Comm_dup and MPI_Comm_split are able to create custom communicators. (Use MPI_Comm_free to free the custom communicators) MPI_Comm_split can be used in divide-and-conquer.

MPI_COMM_SELF can be used to isolate from collective operations.

How to trigger Python with MPI?

Instead of running a C program, use mpiexec to call a python program. For example,

mpirun -np 4 python 1.py

Of course, your python script needs MPI support MPI; otherwise, it would run the same content. Install mpi4py by:

sudo apt-get update

sudo apt-get install libopenmpi-dev

conda install python=3.10

conda install mpi4py

Copy and run the basic example in https://mpi4py.readthedocs.io/en/stable/tutorial.html, and run:

mpiexec -np 2 python 2.py

References:

https://mpitutorial.com/tutorials/mpi-hello-world/

https://mpitutorial.com/tutorials/mpi-reduce-and-allreduce/

https://mpi4py.readthedocs.io/en/stable/tutorial.html

https://www.youtube.com/watch?v=SGs0LzouRDg

Sunday, December 08, 2024

Short Explanation of KL Divergence

What is KL divergence? It measures the distance between 2 Gaussian distributions. Be more precise - given 2 distributions P and Q, and let P be the target distribution. KL divergence is: CrossEntropy(P|Q) - Entropy(P)

What is Entropy and Cross Entropy? Entropy is a special case of Cross Entropy where the two distributions are the same.

As explained in the video tutorial (see the video in the reference), starting with entropy, given a Gaussian distribution with probability function p(x), do 10 actual samples to get a sequence. The probability of getting the sequence is the product of the p(xi) of each step. To make the equation into summation, take a log at both side of the equation, each sample become log(p(xi)). Group the p(xi) of the same values together and divide by total, that should also be p(x), because we are repeating the sampling step. That makes the Entropy(P)= −∑ p(x) ln(p(x))

Note that p(x) is the probability of the actual sampling, and ln(p(x)) is the probability of the distribution.

Get another distribution Q with q(x), replace the ln(p(x)) part with ln(q(x)), it becomes Cross Entropy(P|Q)= −∑ p(x) ln(q(x)). It measures the probability to generate the same sequence with a different distribution. This value is larger than Entropy(P) because it is a different distribution.

KL divergence is CrossEntropy(P|Q) - Entropy(P)

References:

KL Divergence: https://youtu.be/sjgZxuCm_8Q?si=CAU5g6_DGxto0h0J

Short Explanation of Variational Autoencoder (VAE) and Controlled VAE

A Variational Autoencoder is an auto encoder with a twist.

An Autoencoder is network that takes in a large input (ex: an image), encodes it into a smaller vector, and decode it back into the original input. The smaller intermediate vector is called "latent vector".

A Variational Autoencoder takes in a large input and encodes it into a Gaussian distribution in the latent vector space (so it encodes into a mean and standard deviation). To decode it, randomly sample from this Gaussian distribution to get a latent vector, and the decoder should bring it back to the original input.

Note that since the latent vector is not exact every time, the nearby values in the Gaussian distribution should all be able to bring back an output close to the output. If your network was trained to encode 2 images, the point in the middle of the two distributions will have characteristics of both images.

VAE is usually trained with a weighted sum of KL divergence + difference from the desired output (where input and desired output are the same in this case). KL divergence measures how far the sampled latent vector is. The formula in the calculation only considers the one sample point, because one of the distributions was 0 centered and was normalized. So more intuitively, the further away of the sampling, the more tolerance to the difference between the desired output and the actual output.

In case of Controlled VAE, a text is also decoded into latent vector. And the text latent vector and the image latent vector is concatenated as the input to the decode.

References:

VAE: https://youtu.be/X73mKrIuLbs?si=e7tYZRoWm8QO60R1

Monday, November 11, 2024

Study CUDA Programming - Summary

Install Nvidia Driver. Be able to run nvcc compiler. The name of the file matters - the file extension needs to be ".cu" to be compiled by the cvcc compiler. Install C++ compiler, too.

Run in command line:

nvcc hello.cu -o hello

How to define a cuda function:

__global__ void myFunc(float* output, float* input) {
}

How to allocate CUDA memory

How to free CUDA memory

How to copy data from memory to GPU device

cudaMemcpy(pDestination, pSource, byte_size, cudaMemcpyHostToDevice)

How to copy data from GPU device to memory

cudaMemcpy(pDestination, pSource, byte_size, cudaMemcpyDeviceToHost)

How to let CPU wait for GPU execution result:

cudaDeviceSynchronize();

How to invoke the coda function:

myFunc<<<1,1>>>(output, input);

What does <<<1,1>>> mean?

1 block, and 1 thread per block

How to run more in parallel with GPU?

Pass <<<1, 50>>>

This will run 50 threads. Each thread has a thread local variable: threadIdx.x, whose value is an int from [0, 49], local to the thread.

What does <<<1,50>>> mean?

1 block, and 50 threads per block

Is there threadIdx.y in a CUDA function?

Yes. In fact, threadId.x, threadId.y, threadId.z forms a 3D thread block. Likewise, the number of blocks can be in 3D as well. Each block is a collection of a set of 3D threads. In the coda function, it can be referred as blockIdx.x, blockIdx.y, blockIdx.z.

Is it possible to know the thread block size inside a CUDA function?

Yes. Block size is identified by thread local variable: blockDim.x, blockDim.y, blockDim.z. In addition, grid size (collection of blocks) can also be detected by gridDim.x, gridDim.y, gridDim.z.

How to pass 3D grid of 3D thread blocks?

Define dim grid(4,5,6). Define dim3 block(3,2,1). Pass these in the CUDA function call: myFunc<<<grid, block>>>(...)

Can I have just 1 huge block and 1 million threads?

No. Nvidia GPU will limit it at 1024 threads per block.

I want to synchronize threads in a block?

__syncthreads()

If the number of threads is less than 32, then the threads are always in sync.

Keep if-else to value only:

Given that the parallel threads always executes the same instruction, when an if-else state branches the logic, the branched logic must run in different thread-divergence groups. The group with unsatisfied condition will run in no-op, waiting for its turns. But, if-else that chooses constant can be converted to a clover instruction. Ex: if(vector_size < 100) constant1; else constant 2;

Math Trick

Use right shift to detect if an integer is 0 or positive: 1 + x >> 31, because

0 >> 31 = 0
-1 >> 31 = -1
-99 >> 31 = -1

CUDA Memory Model

Global Memory:

- Long latency (400-800 cycles)

- Throughput 200 GBPS

- Shared by all thread blocks

Texture Memory

- Read-only (12KB)

- ~800 GBPS

- Optimized for 2D spatial locality

- Store meta data. Optimized on access to neighboring cell.

Constant Memory

- Read-only (64 KB)

L2 Cache

- 768KB

- Shared among thread blocks

- Fast Atomics

(Useful for synchronize threads as atomic variables)

L1 & Shared Memory

- Local to a thread block, shared within one thread block.

- 64 KB per thread block

- 16 KB shared, and 48KB L1.

- L1 cannot be programmed. Shared memory can be programmed

- Low latency (20-30 cycles)

- 1 TBPS

- Local to one thread, 21 registers per thread

- 32KB, per thread block

- Handled by compiler

- 8 TBPS

Improve bandwidth:

- share / reuse data

- compression

- Recompute than store + fetch

Latency

- I/O causes time to read/write to GPU

- In practice, memory I/O can become bottleneck many times.

- Latency hiding is done by exploiting multi-threading

Why do we need multiple thread blocks?

- Because a thread block can halt when on I/O operations. When the halt happens, other threadBlock can take the GPU resource and run, achieving better throughput. This is called super-linear speedup.

Locality

- Spatial and Temporal locality

- You will want a page to be loaded and used often before switching to another page. So, it is faster to iterate a 2D array row by row, than iterating column by column.

Memory Coalescing

- Store data in memory within close spatial distance to avoid frequent loading to memory. (Since each memory load takes 32 cycles)

- Coalesced vs. strided: strided data have chunks of data for each object. When data is accessed in strided stored, it will require more memory load. Unrelated fields are loaded without being accessed.

- Random access is the worst. This happens especially when the index to the array depends on a variable.

- In CPU context, each thread should access consecutive element in memory chunks (strided). Array of structures is preferred.

- In GPU context, memory coalescing means a chunk should be accessed by consecutive threads (coalesced). Thus, structure of Arrays is preferred. (Ex: each field of a list of records is stored in 1 array. 1 structure represents 1 table.)

- Tip: in the cuda function, use (blockIdx.x * blockDim.x + threadIdx.x) as index to each field.

Shared Memory:

How to define in __global__ function?
__shared__ float a[N];
__shared__ unsigned s;

How to initialize their value:
if (threadId.x == 0) s = 0;

Shared memory is accessible in a thread block. Shared memory is organized into 32 banks. Access in the same bank are sequential (blocking). Exception case: access the same address of the same bank in the same warp doesn't block. It is called broadcast. (Note that if 2 threads access 2 different addresses of the same bank will cause blocking.)

Use this command to allocate the 64KB between L1 cache and shared memory:

cudaDeviceSetCacheConfig(kernelFuncName, cudaFuncCachePreferShared)

Other options:

to give more L1 cache: cudaFuncCachePreferL1
to give equal L1 and shared memory: cudaFuncCachePreferEqual
no preference of L1 and shared memory: cudaFuncCachePreferNone

When your shared memory size is larger than the max allowed shared memory, compilation error will occur

The larger of the shared memory, the less of the L1 cache.

Dynamic Shared Memory:

To vary the shared memory size in a function, pass the shared memory size via calling with 3 parameters: <<<numBlocks, numThreadsPerBlock, sizeOfSharedMemory>>>

Inside the kernel function, define an array pointer for the dynamic shared memory:

__global__ void kernelWithDynamicSharedMemory() {
extern __shared int s[];

Can I define multiple arrays? Yes, but there is only one extern __shared__ array. Split the array by pointer inside the method.

Tex2D memory

Define variable: texture<float, 2 cudaReadModeElementType> texRef;

In main: cudaBindTextureToArray(texRef, cuArray, ...);

In kernel: tex2D(texRef, x, y);

Constant memory

It is read-only, and 64KB per SM. Define

Define variable: __const__ unsigned meta[1];

In main: cudaMemcpyToSymbol(meta, &hmeta, 1*size(unsigned));

In kernel: a = meta[0]

Parallel in block vs in threads:

- Each block is assigned to a different StreamingProcessor (SM). The blocks do not communicate with each other.

- Each thread runs at different cores of a StreamingProcessor. Threads are bundled into warps. Each warp has 32 threads.

Host Memory: the memory accessed by CPU

Device Memory: the memory accessed by GPU

Host Memory is allocated by malloc. Host memory can be mapped to OS memory paging. Use cudaMallocHost to avoid OS memory paging.

To transfer data from Host Memory to Device, use Asynchronous memory transfer, and ideally copy all memory to device at once:

cudaStreamCreate(&stream2)

cudaMemcpyAsync(dest2, src2, size, dir, stream2)

Each SM has 16KB of shared memory. The shared memory is essentially a user managed cache. The latency of the shared memory is comparable to registers.

Each thread has its dedicated registers. A block of threads shares shared memory. Meanwhile, blocks shares device memory.

Use structure of arrays, avoid structure of structures (always use separate arrays, even for x,y,z). For x,y,z, they can have the same float3 pointer to dereference into 3 adjacent values (to avoid dereferencing 3 times).

Memory is divided into banks. Each bank can service 1 thread per cycle. When accesses a blank is in conflict, serialize the accesses. (Here sounds like a lock)

We said shared memory is as fast as register, but that's true only when bank conflict is avoided. When a bank is accessed by 8 threads, the linear access stride = 8. Conflict can be detected by visual profiler (to look for warp_serialize events).

The fast case: if all t threads of a half-warp access different banks, there is no bank conflict. If all threads of a half-wrap read the identical address, there is no bank conflict. (It is called broadcast.)
The slow case: Bank conflict happened - multiple threads in the same half-warp access the same bank. Accesses must be serialized. Cost = max $ of simultaneous access on a single bank.
(What is half-war access? 1 warp is 32 threads. Half-warp is 16 threads. Different warps can take different branches. 1 SM can run many warps of a block. Warps in a block can synchronize.

A Single-instruction can run at multiple threads at the same time. A GPU has thousands of Registers and Registered are partitioned over threads.

Each SM has 16kb of shared memory. The blocks that require less than 8kb of shared memory can be handled in the same SM. Thus, warps can be from different blocks.

Threads per block should be multiple of 32 to have maximum threads per block. More warps per block creates deeper pipeline. And in this way it hide latency. However, then umber of threads is limited by the number of Registers. Use less than 8kb of shared memory to have multiple warps on the same SM. (So, this is a trade-off to the size of shared memory.) Recompute is often faster than loading from memory. Applying recompute is an optimization direction.

Regarding cudaMemCpy (device memory): Every successive 128 bytes (or 32 single precision values) can be access by a warp. All 32 reads should be done in one step. When memory is not in successive 128 blocks of memory, it would take twice of the time to read.

Regarding cudaMemcpytoSymbol (constant memory): copy to __constant__ float var, the cache works the best when a warp reads the same value. (Cuda 6 has a new device memory model)

Latency: in cycles

Throughput: cores / sm

The needed Parallelism: operations / SM = latency * throughput. This is called arithmetic. parallelism.

Consider Instruction-level parallelism (ILP).

ILP is a trick to let a CUDA function to calculate and return multiple values. It is often done by duplicating the logic and duplicate the variable names. It is a tradeoff by using more Registers per thread, and use less threads. ILP is the number of independent instructions in a loop.
As of seen in the "Better Performance at Lower Occupancy" article, larger ILP level requires less threads to achieve 100% utilization. With ILP=3, it only needs half of the threads (from 576 to 256 threads) to achieve 100% utilization. However ILP doesn't scale up to 4.

Another trick was to hide memory latency, by keeping 100KB in the flight. An example is to issue multiple independent reads by reading values in an array into multiple local variables:

__global__ void memcpy( float *dst, float *src )
{
int iblock= blockIdx.x + blockIdx.y * gridDim.x;
int index = threadIdx.x + 2 * iblock * blockDim.x;
float a0 = src[index];
//no latency stall
float a1 = src[index+blockDim.x];
//stall
dst[index] = a0;
dst[index+blockDim.x] = a1;
}

By copying 14 float4 values per thread, the app runs with 4% of occupancy to hide 84 of utilization.

CUDA Function Declarations and their meanings:

__device__ float deviceFunc() // call device from a device;

__global__ float kernelFunc() // call device from host; must return void

__host__ float hostFunc() // call host from host

A function can have multiple function types. Example:

__host__ __device__ void dhfun() { ...

cudaHostAlloc : allocates memory in the Host in page-locked form (which won't be written to the disk). It can be transferred faster between host and device. This memory can also be accessed by GPU via DMA, avoiding a memory copy.

A __global__ function can be called from both device and host.

Note: Variables cannot be declared as __host__. Only __device__ variable. Calling a __device__ variable from host will pass compilation but will reach error in runtime.

Thrust is a parallel algorithms library for CUDA, similar to STL on CPU

NVIDIA/thrust: [ARCHIVED] The C++ parallel algorithms library. See https://github.com/NVIDIA/cccl

Supports vectors and associated transformers.
It is black of where code executes.

Mutual Exclusion Algorithm:

Bakery Algorithm

By assigned token number to each thread
If there exists another thread waiting and holding a token less than my token, wait.
After the processing in my thread is done, register that my thread is done.
N-thread mutually exclusive

Atomic Operations

cuda operations: atomicCAS, atomicMin, atomicAdd, atomicCAS, etc.

atomicCAS: takes in 3 variables: address, compare value, new value.

If compare value == value in the address, new the new value. Otherwise do nothing.
Return old value in the address

(if old value == compare value, that means new value has been taken. Otherwise, it means the new value wasn't taken. Use this to create critical section for a single thread to enter.)

Works on GPU across warps
But it hangs for threads belonging to the same wrap, because 1 warp-thread acquires the lock and waits for other warp threads to reach the instruction, while other threads in the warp await this successful thread in the do-while.

The correct way to is use atomicCAS in do-while, and check with an if statement after entering the critical section. and upon exit of the critical section, unlock by setting the compared address to a value that can match comparison in a different thread.
Example:

do {
old = atomicCAS(&lockvar, 0, 1);
if(old == 0) {
// Do your task in the critical section
*lockvar = 0; // unlock by setting it to 0 so that other thread can pick it up.
}
} while(old != 0)

Note that with CPU, you can unlock outside of the while loop. But with GPU, the unlock needs to happen in the while loop.

Can also be used to enforce to a single run from many threads. (Just one if statement with atomicCAS)

For a complete list of atomic operations, see this: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions
Works with both global and shared memory

Barriers

It is a program point where all threads need to reach before any thread can proceed.
End of kernel is an implicit barrier for all GPU threads (global barrier)
CUDA 9 supports grid.sync() to make explicit global barrier.
Threads in a thread-block can synchronize using __synchthreads()

__synchthreads also creates memory barrier so that write (of shared memory) in one thread is visible to other threads in the block.
__threadfence() ensures that writes to global and shared memory are visible to all other threads on the device.

__threadfence_block(): write visibility within the same block
__threadfence_system(): write visibility across both the device and the host (CPU)
__threadfence(): write visibility across the device
Note that a threadfence only ensures visibility of memory write. It does not block threads.

Reductions

It is process of converting a set of values into fewer values
Reducible operation must satisfy associativity property (which means apply order doesn't change the outcome)

Min, Max, Sum, XOR
Can be implemented with atomics, but that adds sequentiality.

A better approach with improved parallelism in GPU:

Example: sum the adjacent pair of values in different threads. So they don't block each other. Keep doing this until there is only 1 value.
Complexity measurement:

Takes log(n) steps. First step runs n threads. Second step runs n/2 threads, ... , the last step runs 1 thread.

Prefix Sum:

Algorithm 1: for each value in the list, sum it with the previous value. Repeat the process to get the previous value from 2 cells back, 4 cells back, etc.
datarace: a thread is reading a value in memory while another thread is writing to it, there's a datarace. __synchthreads() should be placed in the middle of the read state and write statement.

Debugging:

Use Cuda-gdb https://docs.nvidia.com/cuda/cuda-gdb/index.html

compile with: nvcc -g -G file.cu
Run the output from the previous step with: cuda-gdb a.out

info cuda kernels - shows device status
info threads - shows execution of all threads.
info cuda sms - show streaming multi processors
info cuda warps - show warps
info cuda lanes - show information related to each thread.

set break point by:

break main - set break point in main function
break file.cu:223 - set break point in file at line number
set cuda break_on_launch - kernel entry breakpoint
break file.cu:23 if threadIdx.x == 1 && i < 5 - conditional break point

Profiling:

Time taken by kernels
Memory utilization
Cache misses
Divergence
Coalescing

Use nvprof to run the app. It tells the % of time used by each kernel. And % of time of each cuda command.

Dynamic Parallelism should be unrolled when possible (dynamic parallelism was supported until architecture 35. It can be specified during compilation.

nvcc -arch=sm_35 dynpar.cu

A global function can invoke another global function for parallel processing in a device.

Parent kernel is associated with a parent grid. Child kernels are associated with child grids. Parent and child kernel shares the global and constant memory, but they have distinct local and shared memories. Global memory operation in the parent is visible to the child. All global memory operation in child is visible to parent when parent calls cudaDeviceSynchronize().

Multi-GPU

One host (CPU) controls multiple GPUs
Use cudaSetDevice(i) before performing any cuda operation (ex: cudaMemcpy)

cudaGetDeviceCount
cudaDeviceCanAccessPeer - a device accesses another device

Warp Voting

__ballot - wrap which warp threads satisfy the predicate.

Return the test of the predicate in a warp as a 32-bit number.

__all - all warp threads satisfy the predicate
__any - any warp threads satisfy the predicate
Application:

Example: warp voting for atomics: on if(condition) atomicInc a counter. This can be easily replaced by counting the bits of 1's in the ballot.

Use __popc(mask) to return the number of set bits.
This would allow atomicAdd of 32 threads in 1 operation, reducing the blocking from atomic operation.

References:

https://www.youtube.com/watch?v=cvo3gnInQ7M&list=PL1ysOEBe5977vlocXuRt6KBCYu_sdu1Ru&index=1

https://www.olcf.ornl.gov/wp-content/uploads/2013/02/Intro_to_CUDA_C-TS.pdf

https://www.nvidia.com/content/cudazone/download/Advanced_CUDA_Training_NVISION08.pdf

https://www.nvidia.com/content/gtc-2010/pdfs/2238_gtc2010.pdf

https://www.ce.jhu.edu/dalrymple/classes/602/Class13.pdf

Tuesday, September 10, 2024

Key ideas in the book of: "Reinforcement Learning - An introduction" of Sutton & Barto (Part 3)

Planning

In the previous chapter of the book, Monte Carlo and Temporal Difference were introduced, where real experience is used to learn a prediction model for return. As the training of a value function / policy function is directly based on state transition and the reward in the real world, it is called Direct Reinforcement Learning, or Modeless Learning). An alternative is to add an environment model to emulate the environment for state transition and reward. The emulated transition can also be used by training the prediction model for return. This is called Planning or Model-based Learning.

Train on Randomly Sampled past observations

Since we have a Model for state transition, does that mean all the prediction model for return is trained solely using emulated state transition? The answer is no. The training usually happens with a state transition in the real world, followed by training using n trials of emulated state transitions. The n trials were randomly picked from the previously observed states.

Prioritized Sweeping

Note that most of the observed state transitions will have 0 reward and very small amount of change in return . It is inefficient to train at these states. As a result, randomly picking the state transition for training is not plausible in a large state space. An improvement is by using a priority queue to order the real experience (state, action) pairs: by putting the (state, action) pair with largest errors in the front.

Sunday, August 18, 2024

Key ideas in the book of: "Reinforcement Learning - An introduction" of Sutton & Barto (Part 2)

Temporal Difference vs. Monte Carlo method

In Monte Carlo method, a trajectory (episode) has to end with a terminal state. The final return was calculated aggregating all rewards along the path. Then from the final step backward, assign each previous step with the return discounted by a rate.

Problems with Monte Carlo method:

It has to reach a terminal state. That means the sampling path is longer.
The rewards in the middle is not considered individually. That means the estimation in the middle is not accurate.
Consider having negative reward and when the final return becomes 0: a trajectory is not going to be useful for learning.

In Temporal Difference, instead of requiring a complete episode, a sublist can be used. At each step i, the estimated return at step i is the estimated return of the next step (i+1) plus the actual reward at step i. The future return on is then discounted.

The main difference from Monte Carlo method:

The sampling path does not have to be at a terminal state.
The rewards in the middle are considered at their exact steps. They are excluded individually from the future return.
When there's only 1 reward, the two methods are the same.

The sum of the immediate reward at (at state s) and discounted future return from state s is called Bellman equation.

TD converges faster and reaches better performance at the same amount of training, comparing to the Monte Carlo method.

Batch Update

Instead of updating the neural network at every step. Play several episodes and the update the neural network (optimizer.step() in PyTorch).

In Monte Carlo method, we collect all state-to-return pairs, and take an average of the return values grouping by the states. And train the neural network.

This is a problem for a state that rarely occurs. Consider the this example: state A causes state B but failed to get a reward in 1 episode. While state B on average gives two reward at other episodes. The aggregation on A will falsely result in 0 return. (This is untrue because A should can only cause B, which has return of 2.)

A better way often adopted by TD is to consider state transitions between adjacent states. In the above example, State A is related to the return of the Future states, where B's return contribute to State A.

https://www.youtube.com/watch?v=OOlQj8UEJnI

https://www.youtube.com/watch?v=L64E_NTZJ_0

n-step Bootstrapping

In Temporal Difference, we are looking at the reward at the step i and the future return of the next state (i+1), while in Monte Carlo method we are looking at the rewards of all steps. The middle ground is to look at the rewards at multiple steps and the future return of a further state. Call this n-step boostrapping.

https://www.youtube.com/watch?v=1i5a4yj0Mwg

Monday, August 12, 2024

Important Sampling: intuitive explanation

We'd like to use Monte Carlo method to estimate the integral by randomly sampling from a distribution and take their average. But for large dataset with many dimensions, that would require taking too many samples to achieve a good standard deviation. Importance sampling is a way to reduce the sampling requirement and achieve a good standard deviation.

This overall ideas is to invent a new normal distribution q(x) around position X on the axis, where the product of probability p(x) and value f(x) is large. Then you can randomly sample from the new distribution, and let the value function be f(x) * p(x) / q(x) and then take an average.

Example:

We have weather data of 10 years with temperature by days and its rainfall amount. The goal is to estimate the overall rainfall in a year. Let x be the temperature. p(x) is the probability of temperature given a day. f(x) is the rainfall amount given a temperature. So overall rainfall is ∫ p(x) * f(x) dx. We can iterate through all temperatures to take an average, or by Monte Carlo method - randomly sampling some temperatures (thus by distribution of p(x)) and take an average. Or by Monte Carlo method important sampling to sample from a new distribution q(x) to center at:

- where p(x) is large (where a temperature is common)

- where f(x) is large (where a rain fall is large)

- and hopefully we pick where both p(x) * f(x) are large, aka the importance values.

Take a Gaussian distribution q(x) around that region of x temperature values to centering the important values. Sample from q(x) distribution and take an average of f(x) * p(x) / q(x). (You can take a smaller sample. That's the whole point.)

Intuitively, at the temperature x where q(x) is small (under sampled), 1/q(x) applies a higher weight. At the places where q(x) is large (over sampled), 1/q(x) applies a lower weight.

Questions from the example:

- What happens to negative temperature? Answer: we are getting p(x) for probability of a temperature - not the temperature value itself. So the value will be positive.

- How to pick the distribution q(x)? Answer: it is arbitrary and a manual step.

Reference:

https://www.youtube.com/watch?v=C3p2wI4RAi8

https://www.youtube.com/watch?v=V8kruQjqpuw

https://www.youtube.com/watch?v=3Mw6ivkDVZc

Saturday, July 06, 2024

Key ideas in the book of: "Reinforcement Learning - An introduction" of Sutton & Barto (Part 1)

This is a summary of the key ideas from Sutton and Barto's book on Reinforcement Learning. (Here is the book http://www.incompleteideas.net/book/the-book-2nd.html)

Why do I need Reinforcement Learning when I already have the complete game tree for Tic-tac-toe?

That's because you want to also estimate the other player, not just the game itself. The goal is to find a hack to exploit a bug in the adversary.

If the opponent is smart, adapt to a smart way to not lose the game. If the opponent is dumb, find a dumb way to beat him - you are not looking for the optimal solution.

How to do Reinforcement Learning?

At each step, you have a list of actions that can be made. Each action has a score. When playing seriously, you take the action with the highest score. (You cannot adjust action values in this mode.) When you are not serious and are exploring a new way, take the action by probability by using the weighted value from the action scores.

Not every step has a result (reward). So you are collecting steps as a path.

Once a step reaches a result (reward), recall the steps in the whole path and adjust the action value at each step to be a little higher (and thus lowering the chance of other actions).

Imagine walking in a maze. Once you randomly walk from point A to point B, each action in the step of the path is valuable. You want to adjust the action score at each step in an increasing magnitude when it gets closer to the goal, so that during path following, you can look for the action with higher score.

K-armed Bandit problem

Given a slot machine with k arms (buttons), the jackpot (reward) is normally distributed around a value between 1 to k, and this location may vary over 24 hours (but it was certain for the given hour), RL can learn pick an arm at different hours to maximize the accumulative reward.

Note that in this problem every action (choice of arm) at an hour has no effect to other actions in a different hour. So there's no path involved. There is no parameter other than the timestamp.

If there is parameter to indicate the state of the slot machine (for example, color of the screen vary depending on the center of reward distribution), it is called "Contextual Bandit problem", for having the extra parameter to assist guessing. (Note that there is still no effect between actions)

Tricks to improve RL:

Assign a greater than 0 initial value helps RL to explore at the early stage, as starting from 0 often favors the first path found in the solution space too much. This is called Optimistic Initial value.
Upper Confidence bound problem: (I skipped)
Gradient bandit problem: (I skipped)

Markov Decision Process

Unlike K-armed Bandit problem, when an action change the state, affecting future actions, the problem is now a Markov process.

Return:

Given a timestamp t, Return is the sum of all future reward in the path after time t.

Discounted Return:

Similar to Return, but since we don't know if the reward will happen in the future, the weight of a reward in the future step is reduced by a fraction per step. (so the future steps were power of the fraction)

This is to make the near reward more attractive.

Policy Function:

It is a function taking in current state S and possible action a, return a probability for the action to happen.

Note that this could also be a value and apply softmax to choose an action by weight.

Value Function:

A value function only takes in the current state S, to evaluate the future return / discounted return under a policy.

Note that a value function has to be under a policy since the future return is based on the path that the policy goes through. It is possible to have both value function and policy function in a system. For example: actor-critic algorithm.

Optimal Value Function:

At each state, iterate all possible policies, and take the one policy that gives max Return comparing to other policies. Such a value function is called an optimal value function.

The value from the Optimal Value Function can be used to guide an optimal policy, which can choosing the action with the largest value from the optimal value function.

Episode:

Since not all tasks have an end state, when dealing with a task that goes on indefinitely, the process is usually divided into episodes. Each episode defines its own goal.

Monte Carlo Method

When the value function cannot be modeled, we focus on finding a policy function - to choose an action based on current state. It is usually good (optimal) to find out by random sampling and take an average on the Returns at a given (state, action) pair. (This process is so-called Monte Carlo Method) Once trained, we can use this mapping to choose the action that gives the largest Return at a given state.

The random sampling needs to cover all actions. To help this, we will need to let it try all initial states and all initial actions.

Each training takes in 1 episode. Each episode is a list of (state, action) pairs. Iterate the (state, action) pairs backwards from the goal state to the initial state: the return at each state is the reward at the current state + the discounted return in the future states. Note that at the goal state the discounted return is 0. (So, iterate backward and multiply the accumulated return with discount rate r)

During this "training", just average the return at the (state, action) pair. (That means your sample episodes likely need to cover the station action more than once). Avoid "training" the same (state, action) pair twice in the same episode, by skipping to train only at the first occurrence toward the start of the episode.

Note the "training" here does nothing more than taking an average.

Avoid random initialization

Sometime, it might not be practical to construct all random actions. Instead, we let the stepping to be stochastic - at a small probability to randomly pick an action to explore, while at most of time, it follows a greedy algorithm to pick the action with the largest return.

On-Policy Training vs. Off-Policy Training

So far what is described here is has only 1 policy function, which is known as on-policy training. It is usually faster to train. But it is only a special case of off-policy training. In off-policy training, two policy functions are involved: one policy for running and one policy for exploration. On-policy training is just a method there the target policy is the same as the behavior policy for exploration. By separating the two policies, we are not letting the same policy to be greedy and to be willing to try random actions at the same time.

To find such a policy function, it says the target policy must have a non-zero probability at the actions as the target policy to perform the same action as on the behavior policy.

There is a problem with Monte Carlo sampling at the tail of a distribution, where it usually requires a lot of samples. Importance sampling is a method to apply a second distribution at the end of the tail of the first distribution so that estimation requires less samples.

Off-policy method usually utilize importance sampling to find the behavior policy, where it collects less samples to improve the target policy.

On-policy method could also use importance sampling to learn from the previous experience (from a different policy, as the policy is kept updated), so that past experience can also be replayed in on-policy training. This was demonstrated in the PPO algorithm: https://www.youtube.com/watch?v=IScp-mZ7iS0