Sunday, December 08, 2024

Short Explanation of KL Divergence

What is KL divergence? It measures how different one probability distribution is from another (it is not a true distance, since it is not symmetric). To be more precise: given 2 distributions P and Q, with P being the target distribution, KL divergence is CrossEntropy(P|Q) - Entropy(P).

What are Entropy and Cross Entropy? Entropy is a special case of Cross Entropy where the two distributions are the same.

As explained in the video tutorial (see the reference), start with entropy: given a distribution with probability function p(x), draw some actual samples (say 10) to get a sequence. The probability of getting that sequence is the product of p(xi) over the steps. To turn the product into a summation, take the log of both sides, so each sample contributes log(p(xi)). Group the samples with the same value together and divide by the total count; that fraction is again p(x), because we are repeating the sampling step. That makes Entropy(P) = -∑ p(x) ln(p(x)).

Note that the p(x) outside the log comes from the empirical frequency of the actual samples, while the ln(p(x)) term comes from the distribution itself.

Take another distribution Q with probability function q(x) and replace the ln(p(x)) part with ln(q(x)): CrossEntropy(P|Q) = -∑ p(x) ln(q(x)). It measures how likely a different distribution is to generate the same sequence. This value is at least as large as Entropy(P), because Q is a different distribution from the one the samples came from.

KL divergence is CrossEntropy(P|Q) - Entropy(P)
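In LaTeX notation, a compact restatement of the three quantities (discrete form):

  H(P) = -\sum_x p(x) \ln p(x)

  H(P,Q) = -\sum_x p(x) \ln q(x)

  D_{KL}(P \| Q) = H(P,Q) - H(P) = \sum_x p(x) \ln \frac{p(x)}{q(x)}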


References:

KL Divergence: https://youtu.be/sjgZxuCm_8Q?si=CAU5g6_DGxto0h0J

Short Explanation of Variational Autoencoder (VAE) and Controlled VAE

A Variational Autoencoder is an autoencoder with a twist.

An Autoencoder is a network that takes in a large input (e.g. an image), encodes it into a smaller vector, and decodes it back into the original input. The smaller intermediate vector is called the "latent vector".

A Variational Autoencoder takes in a large input and encodes it into a Gaussian distribution in the latent vector space (so it encodes into a mean and standard deviation). To decode it, randomly sample from this Gaussian distribution to get a latent vector, and the decoder should bring it back to the original input.

Note that since the latent vector is not exactly the same every time, the nearby values in the Gaussian distribution should all decode to an output close to the original input. If your network was trained to encode 2 images, a point halfway between the two distributions will have characteristics of both images.

A VAE is usually trained with a weighted sum of a KL divergence term and the difference between the output and the desired output (where the input and the desired output are the same in this case). The KL divergence measures how far the encoded Gaussian is from a zero-centered, normalized Gaussian, so it can be computed directly from the encoded mean and standard deviation rather than from extra samples. Intuitively, the further away the sample falls, the more tolerance there is for the difference between the desired output and the actual output.
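For reference, the per-sample objective commonly used for a Gaussian-encoder VAE with a standard normal prior (the textbook form; β plays the role of the weight mentioned above):

  L = \| x - \hat{x} \|^2 + \beta \, D_{KL}\big( \mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, I) \big), \qquad D_{KL} = -\tfrac{1}{2} \sum_{j=1}^{d} \big( 1 + \ln \sigma_j^2 - \mu_j^2 - \sigma_j^2 \big)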

In the case of a Controlled VAE, a text is also encoded into a latent vector, and the text latent vector and the image latent vector are concatenated as the input to the decoder.


References:

VAE: https://youtu.be/X73mKrIuLbs?si=e7tYZRoWm8QO60R1


Monday, November 11, 2024

Study CUDA Programming - Summary

Install the Nvidia driver and make sure you can run the nvcc compiler. The name of the file matters - the file extension needs to be ".cu" to be compiled by nvcc. Install a C++ compiler, too.

Run in command line:

nvcc hello.cu -o hello


How to define a cuda function:

__global__ void myFunc(float* output, float* input) {

}

How to allocate CUDA memory

cudaMalloc(&devPtr, byte_size)

How to free CUDA memory

cudaFree(devPtr)

How to copy data from memory to GPU device

cudaMemcpy(pDestination, pSource, byte_size, cudaMemcpyHostToDevice)

How to copy data from GPU device to memory

cudaMemcpy(pDestination, pSource, byte_size, cudaMemcpyDeviceToHost)

How to let CPU wait for GPU execution result:

cudaDeviceSynchronize();

How to invoke the CUDA function:

myFunc<<<1,1>>>(output, input);

What does <<<1,1>>> mean?

1 block, and 1 thread per block

How to run more in parallel with GPU?

Pass <<<1, 50>>> 

This will run 50 threads. Each thread has a thread local variable: threadIdx.x, whose value is an int from [0, 49], local to the thread.

What does <<<1,50>>> mean?

1 block, and 50 threads per block
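Putting the pieces so far together, a minimal end-to-end sketch (the kernel name addOne and the array size are made up for illustration):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void addOne(float* output, float* input) {
    int i = threadIdx.x;                  // 1 block of 50 threads -> i is in [0, 49]
    output[i] = input[i] + 1.0f;
}

int main() {
    const int N = 50;
    const size_t bytes = N * sizeof(float);
    float hostIn[N], hostOut[N];
    for (int i = 0; i < N; ++i) hostIn[i] = (float)i;

    float *devIn, *devOut;
    cudaMalloc(&devIn, bytes);                                   // allocate CUDA memory
    cudaMalloc(&devOut, bytes);
    cudaMemcpy(devIn, hostIn, bytes, cudaMemcpyHostToDevice);    // copy host -> device

    addOne<<<1, 50>>>(devOut, devIn);                            // 1 block, 50 threads per block
    cudaDeviceSynchronize();                                     // let the CPU wait for the GPU

    cudaMemcpy(hostOut, devOut, bytes, cudaMemcpyDeviceToHost);  // copy device -> host
    printf("hostOut[49] = %f\n", hostOut[49]);

    cudaFree(devIn);                                             // free CUDA memory
    cudaFree(devOut);
    return 0;
}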

Is there threadIdx.y in a CUDA function?

Yes. In fact, threadIdx.x, threadIdx.y, and threadIdx.z form a 3D thread block. Likewise, the number of blocks can be in 3D as well; each block is a collection of 3D threads. In the CUDA function, the block coordinates can be referred to as blockIdx.x, blockIdx.y, blockIdx.z.

Is it possible to know the thread block size inside a CUDA function?

Yes. The block size is given by the built-in variables blockDim.x, blockDim.y, blockDim.z. In addition, the grid size (the collection of blocks) can be read from gridDim.x, gridDim.y, gridDim.z.

How to pass 3D grid of 3D thread blocks?

Define dim3 grid(4,5,6) and dim3 block(3,2,1). Pass these in the CUDA function call: myFunc<<<grid, block>>>(...)
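A sketch of such a 3D launch, and one common way to flatten the built-in variables into a unique global index (the kernel body is illustrative only):

__global__ void myFunc(float* output, float* input) {
    // Flatten the 3D block index and the 3D thread index into one linear id.
    int blockId = blockIdx.x + blockIdx.y * gridDim.x
                + blockIdx.z * gridDim.x * gridDim.y;
    int threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;
    int localId = threadIdx.x + threadIdx.y * blockDim.x
                + threadIdx.z * blockDim.x * blockDim.y;
    int globalId = blockId * threadsPerBlock + localId;
    output[globalId] = input[globalId];
}

// in the host code:
dim3 grid(4, 5, 6);    // gridDim  = (4, 5, 6) -> 120 blocks
dim3 block(3, 2, 1);   // blockDim = (3, 2, 1) -> 6 threads per block
myFunc<<<grid, block>>>(output, input);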

Can I have just 1 huge block and 1 million threads?

No. Nvidia GPU will limit it at 1024 threads per block. 

I want to synchronize threads in a block?

__syncthreads()

If the number of threads is less than 32, then the threads are always in sync.


Keep if-else to value only:

Given that parallel threads always execute the same instruction, when an if-else statement branches the logic, the two branches run in different thread-divergence groups. The group whose condition is not satisfied runs as no-ops, waiting for its turn. However, an if-else that merely chooses between constant values can be compiled into a single select (predicated) instruction. Ex: if (vector_size < 100) constant1; else constant2;


Math Trick

Use arithmetic right shift to detect whether an integer is non-negative: 1 + (x >> 31) evaluates to 1 for a non-negative 32-bit int and 0 for a negative one, because

0 >> 31 = 0

-1 >> 31 = -1

-99 >> 31 = -1


CUDA Memory Model


Global Memory:

- Long latency (400-800 cycles)

- Throughput 200 GBPS

- Shared by all thread blocks

Texture Memory

- Read-only (12KB)

- ~800 GBPS

- Optimized for 2D spatial locality

    - Stores metadata; optimized for access to neighboring cells.

Constant Memory

- Read-only (64 KB)

L2 Cache

- 768KB

- Shared among thread blocks

- Fast Atomics

   (Useful for synchronizing threads via atomic variables)

L1 & Shared Memory

- Local to a thread block, shared within one  thread block.

- 64 KB per thread block

- 16 KB shared, and 48KB L1.

- L1 cannot be programmed. Shared memory can be programmed

- Low latency (20-30 cycles)

- 1 TBPS

Register

- Local to one thread, 21 registers per thread

- 32KB, per thread block

- Handled by compiler

- 8 TBPS


Improve bandwidth:

- share / reuse data

- compression

- Recompute rather than store + fetch


Latency

- Memory I/O adds time to read from / write to the GPU

- In practice, memory I/O often becomes the bottleneck.

- Latency hiding is done by exploiting multi-threading


Why do we need multiple thread blocks?

- Because a thread block can stall on memory I/O operations. When the stall happens, another thread block can take the GPU resources and run, achieving better throughput. This is called super-linear speedup.


Locality

- Spatial and Temporal locality

- You will want a page to be loaded and used often before switching to another page. So, it is faster to iterate a 2D array row by row, than iterating column by column.


Memory Coalescing
- Store data that is accessed together close in memory, to avoid frequent memory loads (each memory load takes about 32 cycles).
- Coalesced vs. strided: in a strided layout, each object keeps its own chunk of data. When data is accessed in a strided layout, more memory loads are required, and unrelated fields are loaded without being accessed.
- Random access is the worst. This happens especially when the index into the array depends on a variable.
- In a CPU context, each thread should access consecutive elements within its own memory chunk, so an Array of Structures is preferred.
- In a GPU context, memory coalescing means a chunk should be accessed by consecutive threads, so a Structure of Arrays is preferred. (Ex: each field of a list of records is stored in 1 array; 1 structure represents 1 table.)
    - Tip: in the CUDA function, use (blockIdx.x * blockDim.x + threadIdx.x) as the index into each field array (see the sketch below).
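A sketch contrasting the two layouts (struct and kernel names are made up; the Structure-of-Arrays kernel is the coalesced one):

// Array of Structures (AoS): consecutive threads touch addresses 12 bytes apart -> strided
struct PointAoS { float x, y, z; };
__global__ void scaleXAoS(PointAoS* p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x *= 2.0f;
}

// Structure of Arrays (SoA): consecutive threads touch consecutive floats -> coalesced
struct PointsSoA { float *x, *y, *z; };   // each field of the record list is its own array
__global__ void scaleXSoA(PointsSoA p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p.x[i] *= 2.0f;            // one 128-byte transaction can serve a whole warp
}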


Shared Memory:


How to define in __global__ function?
  __shared__ float a[N];
  __shared__ unsigned s;

How to initialize their value:
   if (threadIdx.x == 0) s = 0;

Shared memory is accessible within a thread block. Shared memory is organized into 32 banks. Accesses to the same bank are serialized (blocking). Exception: accesses to the same address of the same bank within the same warp do not block; this is called a broadcast. (Note that 2 threads accessing 2 different addresses in the same bank will cause blocking.)

Use this call to split the 64KB between L1 cache and shared memory:
  • cudaFuncSetCacheConfig(kernelFuncName, cudaFuncCachePreferShared)
    • Other options:
      • to give more L1 cache: cudaFuncCachePreferL1
      • to give equal L1 and shared memory: cudaFuncCachePreferEqual
      • no preference between L1 and shared memory: cudaFuncCachePreferNone
    • When your shared memory size is larger than the maximum allowed shared memory, a compilation error will occur
  • The larger the shared memory, the smaller the L1 cache.
Dynamic Shared Memory:
To vary the shared memory size in a function, pass the shared memory size via calling with 3 parameters: <<<numBlocks, numThreadsPerBlock, sizeOfSharedMemory>>>

Inside the kernel function, define an array pointer for the dynamic shared memory:

__global__ void kernelWithDynamicSharedMemory() {
    extern __shared__ int s[];
    // ... use s as an array of sizeOfSharedMemory / sizeof(int) elements ...
}

 Can I define multiple arrays? Yes, but there is only one extern __shared__ array. Split the array by pointer inside the kernel (see the sketch below).
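A sketch of the pointer-splitting trick (array names and sizes are made up; the caller passes the combined byte size as the third launch parameter):

__global__ void kernelWithTwoArrays(int nFloats, int nInts) {
    extern __shared__ char smem[];          // the single dynamic shared allocation
    float* fArr = (float*)smem;             // first nFloats floats
    int*   iArr = (int*)&fArr[nFloats];     // the ints start right after the floats
    if (threadIdx.x == 0) { fArr[0] = 0.0f; iArr[0] = 0; }
    __syncthreads();
    // ... use fArr and iArr ...
}

// launch: kernelWithTwoArrays<<<numBlocks, numThreadsPerBlock,
//                               nFloats * sizeof(float) + nInts * sizeof(int)>>>(nFloats, nInts);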

Tex2D memory
Define variable: texture<float, 2, cudaReadModeElementType> texRef;
In main: cudaBindTextureToArray(texRef, cuArray, ...);
In kernel: tex2D(texRef, x, y);

Constant memory
It is read-only, and 64KB per SM.
Define variable: __constant__ unsigned meta[1];
In main: cudaMemcpyToSymbol(meta, &hmeta, 1 * sizeof(unsigned));
In kernel: a = meta[0];
 


Parallel in block vs in threads:
- Each block is assigned to a Streaming Multiprocessor (SM). The blocks do not communicate with each other.
- Each thread runs on a core of a Streaming Multiprocessor. Threads are bundled into warps; each warp has 32 threads.


Host Memory: the memory accessed by CPU 
Device Memory: the memory accessed by GPU
Host memory is allocated by malloc and is subject to OS memory paging. Use cudaMallocHost to get page-locked host memory and avoid OS paging.
To transfer data from Host Memory to Device, use Asynchronous memory transfer, and ideally copy all memory to device at once: 
    cudaStreamCreate(&stream2)
    cudaMemcpyAsync(dest2, src2, size, dir, stream2)
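A sketch of the page-locked + asynchronous transfer pattern described above (buffer names and sizes are made up):

#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;
    float *hostBuf, *devBuf;
    cudaMallocHost(&hostBuf, bytes);     // page-locked host memory, not pageable by the OS
    cudaMalloc(&devBuf, bytes);

    cudaStream_t stream2;
    cudaStreamCreate(&stream2);
    // The async copy returns immediately; work queued on the same stream stays ordered.
    cudaMemcpyAsync(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice, stream2);
    cudaStreamSynchronize(stream2);      // wait for the transfer to finish

    cudaStreamDestroy(stream2);
    cudaFreeHost(hostBuf);
    cudaFree(devBuf);
    return 0;
}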


Each SM has 16KB of shared memory. The shared memory is essentially a user managed cache. The latency of the shared memory is comparable to registers.

Each thread has its dedicated registers. A block of threads shares shared memory. Meanwhile, blocks share device memory.

Use structure of arrays; avoid structure of structures (always use separate arrays, even for x, y, z). Alternatively, x, y, z can be accessed through a single float3 pointer that dereferences into 3 adjacent values (to avoid dereferencing 3 times).

Memory is divided into banks. Each bank can service 1 thread per cycle. When accesses to a bank conflict, the accesses are serialized. (This behaves like a lock.)
    We said shared memory is as fast as registers, but that is true only when bank conflicts are avoided. When a bank is accessed by 8 threads, the linear access stride = 8. Conflicts can be detected with the visual profiler (look for warp_serialize events).
  • The fast case: if all threads of a half-warp access different banks, there is no bank conflict. If all threads of a half-warp read the identical address, there is no bank conflict (this is called a broadcast).
  • The slow case: a bank conflict happens when multiple threads in the same half-warp access the same bank. The accesses must be serialized. Cost = max # of simultaneous accesses to a single bank.
  • (What is half-warp access? 1 warp is 32 threads, so a half-warp is 16 threads. Different warps can take different branches. 1 SM can run many warps of a block. Warps in a block can synchronize.)

A single instruction can run on multiple threads at the same time. A GPU has thousands of registers, and the registers are partitioned over the threads.

Each SM has 16 KB of shared memory. Blocks that each require less than 8 KB of shared memory can be handled by the same SM. Thus, warps can come from different blocks.

Threads per block should be a multiple of 32 to make full use of warps. More warps per block create a deeper pipeline, and in this way hide latency. However, the number of threads is limited by the number of registers. Use less than 8 KB of shared memory per block to keep warps from multiple blocks on the same SM (so this is a trade-off against the size of shared memory). Recomputing is often faster than loading from memory, so applying recomputation is an optimization direction.

Regarding cudaMemcpy (device memory): every successive 128 bytes (or 32 single-precision values) can be accessed by a warp, and all 32 reads should complete in one step. When the memory does not fall within a successive 128-byte block, it takes twice the time to read.
Regarding cudaMemcpyToSymbol (constant memory): copy into a __constant__ float variable; the cache works best when a warp reads the same value. (CUDA 6 introduced a new device memory model.)

Latency: in cycles
Throughput: cores / sm
The needed parallelism: operations / SM = latency * throughput. This is called arithmetic parallelism.


Consider Instruction-level parallelism (ILP).
  • ILP is a trick that lets a single CUDA thread compute and return multiple values. It is often done by duplicating the logic and the variable names. It is a trade-off: each thread uses more registers, so fewer threads are used. ILP is the number of independent instructions in a loop.
  • As seen in the "Better Performance at Lower Occupancy" article, a larger ILP level requires fewer threads to achieve 100% utilization. With ILP=3, fewer than half of the threads are needed (from 576 down to 256 threads) for 100% utilization. However, ILP stops scaling at around 4.
Another trick is to hide memory latency by keeping around 100KB of transfers in flight. An example is to issue multiple independent reads by reading values from an array into multiple local variables:
__global__ void memcpy(float *dst, float *src)
{
    int iblock = blockIdx.x + blockIdx.y * gridDim.x;
    int index  = threadIdx.x + 2 * iblock * blockDim.x;

    float a0 = src[index];
    // no latency stall
    float a1 = src[index + blockDim.x];
    // stall
    dst[index] = a0;
    dst[index + blockDim.x] = a1;
}

By copying 14 float4 values per thread, the app runs at 4% occupancy while still reaching 84% utilization.


CUDA Function Declarations and their meanings:

__device__ float deviceFunc()   // runs on the device, callable from device code

__global__ void kernelFunc()    // runs on the device, launched from the host; must return void

__host__ float hostFunc()       // runs on the host, callable from host code

     A function can have multiple function types. Example:

__host__ __device__ void dhfun() { ...

      cudaHostAlloc: allocates host memory in page-locked form (which won't be paged out to disk). It can be transferred faster between host and device. This memory can also be accessed by the GPU via DMA, avoiding an extra memory copy.

    A __global__ function can be called from both device and host.

    Note: variables cannot be declared as __host__; only __device__ variables exist. Accessing a __device__ variable from host code will pass compilation but will fail at runtime.


Thrust is a parallel algorithms library for CUDA, similar to STL on CPU

NVIDIA/thrust: [ARCHIVED] The C++ parallel algorithms library. See https://github.com/NVIDIA/cccl

  • Supports vectors and the associated transformations (algorithms).
  • It abstracts away where the code executes (host or device).

Mutual Exclusion Algorithm:

Bakery Algorithm

  • Assign a token number to each thread
  • If there exists another thread waiting that holds a token smaller than my token, wait.
  • After the processing in my thread is done, register that my thread is done.
  • This achieves N-thread mutual exclusion

Atomic Operations

  • CUDA operations: atomicCAS, atomicMin, atomicAdd, etc.
    • atomicCAS takes 3 arguments: an address, a compare value, and a new value.
      • If compare value == the value at the address, write the new value. Otherwise do nothing.
      • Returns the old value at the address.
        • (If the old value == compare value, the new value has been written. Otherwise, the new value wasn't taken. Use this to create a critical section that only a single thread enters.)
      • Works on the GPU across warps.
      • But a naive spin lock hangs for threads belonging to the same warp, because the warp thread that acquires the lock waits for the other warp threads to reach the same instruction, while the other threads in the warp spin in the do-while waiting for the successful thread.
        • The correct way is to use atomicCAS in a do-while loop, check with an if statement after entering the critical section, and upon exiting the critical section unlock by setting the compared address back to a value that can match the comparison in a different thread.
        • Example:
          do {
            old = atomicCAS(&lockvar, 0, 1);
            if (old == 0) {
              // Do your task in the critical section
              lockvar = 0;   // unlock by setting it to 0 so that another thread can pick it up
            }
          } while (old != 0);
        • Note that with CPU, you can unlock outside of the while loop. But with GPU, the unlock needs to happen in the while loop.
      • It can also be used to enforce a single run among many threads (just one if statement with atomicCAS; see the sketch below).
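A sketch of that single-run pattern (the flag name is made up; it is a global device variable initialized to 0):

__device__ unsigned runOnceFlag = 0;

__global__ void doWorkOnce() {
    // Only the one thread (across all warps and blocks) that wins the CAS runs the body.
    if (atomicCAS(&runOnceFlag, 0u, 1u) == 0u) {
        // one-time work goes here
    }
}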

Barriers
  • It is a program point where all threads need to reach before any thread can proceed.
  • End of kernel is an implicit barrier for all GPU threads (global barrier)
  • CUDA 9 supports grid.sync() to make explicit global barrier.
  • Threads in a thread-block can synchronize using __syncthreads()
    • __syncthreads() also creates a memory barrier so that a write (to shared memory) in one thread is visible to the other threads in the block.
    • __threadfence() ensures that writes to global and shared memory are visible to all other threads on the device.
      • __threadfence_block(): write visibility within the same block
      • __threadfence_system(): write visibility across both the device and the host (CPU)
      • __threadfence(): write visibility across the device
      • Note that a threadfence only ensures visibility of memory write. It does not block threads.

Reductions
  • It is the process of converting a set of values into fewer values
  • A reducible operation must satisfy the associativity property (which means the grouping of applications doesn't change the outcome)
    • Min, Max, Sum, XOR
    • Can be implemented with atomics, but that adds sequentiality.
  • A better approach with improved parallelism in GPU:
    • Example: sum the adjacent pairs of values in different threads, so they don't block each other. Keep doing this until there is only 1 value left (a sketch follows this list).
    • Complexity measurement:
      • Takes log(n) steps. The first step runs n/2 threads, the second step n/4 threads, ..., and the last step runs 1 thread.
  • Prefix Sum:
    • Algorithm 1: for each value in the list, sum it with the previous value. Repeat the process to get the previous value from 2 cells back, 4 cells back, etc.
    • Datarace: when one thread is reading a value in memory while another thread is writing to it, there's a datarace. __syncthreads() should be placed between the read statement and the write statement.
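A minimal sketch of the pairwise (tree) sum described above, for a single block whose thread count is a power of two (names are made up; the dynamic shared memory size is passed at launch):

__global__ void blockSum(const float* input, float* result, int n) {
    extern __shared__ float partial[];               // one float per thread
    int tid = threadIdx.x;
    partial[tid] = (tid < n) ? input[tid] : 0.0f;
    __syncthreads();

    // Each step halves the number of active threads: n/2, n/4, ..., 1.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();                             // separates this step's reads from the next step's writes
    }
    if (tid == 0) *result = partial[0];
}

// launch, for up to 1024 values: blockSum<<<1, 1024, 1024 * sizeof(float)>>>(d_in, d_out, n);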
Debugging:
  • Use Cuda-gdb https://docs.nvidia.com/cuda/cuda-gdb/index.html
    • compile with: nvcc -g -G file.cu
    • Run the output from the previous step with: cuda-gdb a.out
      • info cuda kernels - shows device status
      • info threads - shows execution of all threads.
      • info cuda sms - show streaming multi processors
      • info cuda warps - show warps
      • info cuda lanes - show information related to each thread.
    • set break point by:
      • break main - set break point in main function
      • break file.cu:223 - set break point in file at line number
      • set cuda break_on_launch - kernel entry breakpoint
      • break file.cu:23 if threadIdx.x == 1 && i < 5 - conditional break point

Profiling:

  • Time taken by kernels
  • Memory utilization
  • Cache misses
  • Divergence
  • Coalescing
Use nvprof to run the app. It reports the % of time used by each kernel, and the % of time of each CUDA API call.

Dynamic parallelism should be unrolled when possible. (Dynamic parallelism is supported from compute architecture 35 onward; the architecture can be specified during compilation.)

nvcc -arch=sm_35 dynpar.cu

A global function can invoke another global function for parallel processing in a device.

Parent kernels are associated with a parent grid, and child kernels are associated with child grids. Parent and child kernels share global and constant memory, but they have distinct local and shared memories. Global memory operations in the parent are visible to the child. All global memory operations in the child are visible to the parent when the parent calls cudaDeviceSynchronize().

Multi-GPU

  • One host (CPU) controls multiple GPUs
  • Use cudaSetDevice(i) before performing any cuda operation (ex: cudaMemcpy)
    • cudaGetDeviceCount
    • cudaDeviceCanAccessPeer - checks whether one device can access another device's memory
Warp Voting
  • __ballot - returns which warp threads satisfy the predicate.
    • Returns the results of the predicate test across a warp as a 32-bit mask.
  • __all - do all warp threads satisfy the predicate?
  • __any - does any warp thread satisfy the predicate?
  • Application:
    • Example: warp voting for atomics, e.g. if(condition) atomicInc a counter. This can easily be replaced by counting the 1 bits in the ballot.
      • Use __popc(mask) to return the number of set bits.
      • This allows the atomicAdd of 32 threads to be done in 1 operation, reducing the blocking from atomic operations (see the sketch below).
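A sketch of the ballot + popc aggregation (the counter and predicate are made up; __ballot_sync and __activemask are the CUDA 9+ forms of the warp vote intrinsics):

__global__ void countPositives(const int* data, int* counter, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    bool pred = (i < n) && (data[i] > 0);                 // the condition being counted
    unsigned mask = __ballot_sync(__activemask(), pred);  // bit k is set if lane k satisfies pred
    if ((threadIdx.x % 32) == 0)
        atomicAdd(counter, __popc(mask));                 // one atomic per warp instead of 32
}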

References:

https://www.youtube.com/watch?v=cvo3gnInQ7M&list=PL1ysOEBe5977vlocXuRt6KBCYu_sdu1Ru&index=1

https://www.olcf.ornl.gov/wp-content/uploads/2013/02/Intro_to_CUDA_C-TS.pdf

https://www.nvidia.com/content/cudazone/download/Advanced_CUDA_Training_NVISION08.pdf

https://www.nvidia.com/content/gtc-2010/pdfs/2238_gtc2010.pdf

https://www.ce.jhu.edu/dalrymple/classes/602/Class13.pdf



Tuesday, September 10, 2024

Key ideas in the book of: "Reinforcement Learning - An introduction" of Sutton & Barto (Part 3)

Planning

In the previous chapters of the book, Monte Carlo and Temporal Difference methods were introduced, where real experience is used to learn a prediction model for the return. Since the training of a value function / policy function is based directly on state transitions and rewards in the real world, it is called Direct Reinforcement Learning (or model-free learning). An alternative is to add an environment model that emulates the environment's state transitions and rewards. The emulated transitions can also be used to train the prediction model for the return. This is called Planning, or model-based learning.


Train on Randomly Sampled past observations

Since we have a model for state transitions, does that mean the prediction model for the return is trained solely on emulated state transitions? The answer is no. Training usually happens with a state transition in the real world, followed by training on n trials of emulated state transitions. The n trials are randomly picked from the previously observed states.

Prioritized Sweeping

Note that most of the observed state transitions will have 0 reward and very small changes in return. It is inefficient to train on these states, so randomly picking state transitions for training does not scale to a large state space. An improvement is to use a priority queue to order the real-experience (state, action) pairs, putting the pairs with the largest errors at the front.



Sunday, August 18, 2024

Key ideas in the book of: "Reinforcement Learning - An introduction" of Sutton & Barto (Part 2)

Temporal Difference vs. Monte Carlo method

In the Monte Carlo method, a trajectory (episode) has to end with a terminal state. The final return is calculated by aggregating all rewards along the path. Then, going backward from the final step, each previous step is assigned the return discounted by a rate.

Problems with Monte Carlo method:

  • It has to reach a terminal state. That means the sampling paths are longer.
  • The rewards in the middle are not considered individually. That means the estimates in the middle are not accurate.
  • Consider having negative rewards such that the final return becomes 0: such a trajectory is not going to be useful for learning.

In Temporal Difference, instead of requiring a complete episode, a sub-sequence can be used. At each step i, the estimated return at step i is the actual reward at step i plus the discounted estimated return of the next step (i+1).

The main difference from Monte Carlo method:

  • The sampling path does not have to end at a terminal state.
  • The rewards in the middle are considered at their exact steps; each is counted there rather than inside the future return.
  • When there is only 1 reward, the two methods are the same.

The equation expressing the value at state s as the immediate reward (at state s) plus the discounted future return from the next state is called the Bellman equation.
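Written out in the usual textbook notation (γ is the discount rate, α the step size):

  V(s) = \mathbb{E}\big[ R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s \big]

  V(S_t) \leftarrow V(S_t) + \alpha \big[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \big]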

TD converges faster and reaches better performance for the same amount of training, compared to the Monte Carlo method.

Batch Update

Instead of updating the neural network at every step, play several episodes and then update the neural network (optimizer.step() in PyTorch).

In the Monte Carlo method, we collect all state-to-return pairs, take an average of the return values grouped by state, and train the neural network.

This is a problem for a state that rarely occurs. Consider this example: state A leads to state B but gets no reward in 1 episode, while state B on average gives a return of 2 in other episodes. The aggregation on A will falsely result in a return of 0. (This is wrong because A can only lead to B, which has a return of 2.)

A better way, often adopted by TD, is to consider state transitions between adjacent states. In the above example, state A is tied to the return of the future states, so B's return contributes to state A.

https://www.youtube.com/watch?v=OOlQj8UEJnI

https://www.youtube.com/watch?v=L64E_NTZJ_0


n-step Bootstrapping

In Temporal Difference, we look at the reward at step i and the future return of the next state (i+1), while in the Monte Carlo method we look at the rewards of all steps. The middle ground is to look at the rewards at multiple steps plus the future return of a further state. This is called n-step bootstrapping.

Monday, August 12, 2024

Importance Sampling: intuitive explanation

We'd like to use the Monte Carlo method to estimate an integral by randomly sampling from a distribution and taking the average. But for a large dataset with many dimensions, that would require too many samples to achieve a good standard deviation. Importance sampling is a way to reduce the number of samples required while still achieving a good standard deviation.

The overall idea is to introduce a new normal distribution q(x), centered on the region of the axis where the product of the probability p(x) and the value f(x) is large. Then you can randomly sample from the new distribution, let the value function be f(x) * p(x) / q(x), and take an average.

Example:

  We have 10 years of weather data with the temperature of each day and its rainfall amount. The goal is to estimate the overall rainfall in a year. Let x be the temperature, p(x) the probability of a temperature on a given day, and f(x) the rainfall amount given a temperature. So the overall rainfall is ∫ p(x) * f(x) dx. We can iterate through all temperatures to take an average, or use the Monte Carlo method - randomly sampling some temperatures (thus following the distribution p(x)) and taking an average. Or use Monte Carlo with importance sampling, sampling from a new distribution q(x) centered:

- where p(x) is large (where a temperature is common)

- where f(x) is large (where a rain fall is large)

- and hopefully we pick where the product p(x) * f(x) is large, a.k.a. the important values.

Take a Gaussian distribution q(x) around that region of temperature values, centered on the important values. Sample from the q(x) distribution and take an average of f(x) * p(x) / q(x). (You can take a smaller sample. That's the whole point.)

Intuitively, at the temperature x where q(x) is small (under sampled), 1/q(x) applies a higher weight. At the places where q(x) is large (over sampled), 1/q(x) applies a lower weight.
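In symbols (the samples x_i are drawn from q rather than p):

  \int p(x) f(x) \, dx = \mathbb{E}_{x \sim q}\Big[ f(x) \frac{p(x)}{q(x)} \Big] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i) \frac{p(x_i)}{q(x_i)}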

Questions from the example:

- What happens to negative temperature? Answer: we are getting p(x) for probability of a temperature - not the temperature value itself. So the value will be positive.

- How to pick the distribution q(x)? Answer: it is arbitrary and a manual step. 


Reference:

https://www.youtube.com/watch?v=C3p2wI4RAi8

https://www.youtube.com/watch?v=V8kruQjqpuw

https://www.youtube.com/watch?v=3Mw6ivkDVZc


Saturday, July 06, 2024

Key ideas in the book of: "Reinforcement Learning - An introduction" of Sutton & Barto (Part 1)

This is a summary of the key ideas from Sutton and Barto's book on Reinforcement Learning. (Here is the book http://www.incompleteideas.net/book/the-book-2nd.html)

Why do I need Reinforcement Learning when I already have the complete game tree for Tic-tac-toe?

That's because you also want to model the other player, not just the game itself. The goal is to find and exploit the weaknesses of the adversary.

If the opponent is smart, adapt to a smart way to not lose the game. If the opponent is dumb, find a dumb way to beat him - you are not looking for the optimal solution.


How to do Reinforcement Learning?

At each step, you have a list of actions that can be taken. Each action has a score. When playing seriously, you take the action with the highest score. (You cannot adjust action values in this mode.) When you are not serious and are exploring a new way, pick the action probabilistically, using weights derived from the action scores.

Not every step has a result (reward). So you are collecting steps as a path.

Once a step reaches a result (reward), recall the steps in the whole path and adjust the action value at each step to be a little higher (and thus lowering the chance of other actions).
  • Imagine walking in a maze. Once you randomly walk from point A to point B, each action in the step of the path is valuable. You want to adjust the action score at each step in an increasing magnitude when it gets closer to the goal, so that during path following, you can look for the action with higher score.

K-armed Bandit problem

Given a slot machine with k arms (buttons), the jackpot (reward) is normally distributed around a value between 1 and k, and this location may vary over 24 hours (but it is fixed within a given hour). RL can learn to pick an arm at different hours to maximize the cumulative reward.

Note that in this problem, every action (choice of arm) at one hour has no effect on actions at a different hour. So there's no path involved. There is no parameter other than the timestamp.

If there is a parameter indicating the state of the slot machine (for example, the color of the screen varies depending on the center of the reward distribution), it is called a "Contextual Bandit problem", for having the extra parameter to assist the guessing. (Note that there is still no effect between actions.)

Tricks to improve RL: 
  • Assigning an initial value greater than 0 helps RL explore at the early stage, as starting from 0 often favors the first path found in the solution space too much. This is called the Optimistic Initial Value trick.
  • Upper Confidence bound problem: (I skipped)
  • Gradient bandit problem: (I skipped)

Markov Decision Process


Unlike the K-armed Bandit problem, when an action changes the state and affects future actions, the problem becomes a Markov decision process.

Return:

  Given a timestamp t, Return is the sum of all future reward in the path after time t.

Discounted Return:

  Similar to the Return, but since we don't know whether a reward will actually arrive in the future, the weight of a reward is reduced by a fraction per step into the future (so future steps are weighted by powers of the fraction).

This is to make the near reward more attractive.
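In symbols, with discount rate γ in [0, 1):

  G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}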


Policy Function:

It is a function that takes in the current state S and a possible action a, and returns the probability of taking that action.

Note that this could also be a value and apply softmax to choose an action by weight.


Value Function:

A value function only takes in the current state S, to evaluate the future return / discounted return under a policy.

Note that a value function has to be under a policy since the future return is based on the path that the policy goes through. It is possible to have both value function and policy function in a system. For example: actor-critic algorithm.


Optimal Value Function:

At each state, iterate over all possible policies, and take the policy that gives the maximum Return compared to the other policies. Such a value function is called an optimal value function.

The values from the Optimal Value Function can be used to guide an optimal policy, which chooses the action with the largest value from the optimal value function.


Episode:

Since not all tasks have an end state, when dealing with a task that goes on indefinitely, the process is usually divided into episodes. Each episode defines its own goal.


Monte Carlo Method

When the value function cannot be modeled, we focus on finding a policy function - choosing an action based on the current state. A usually good (near-optimal) way to find it is to sample randomly and take an average of the Returns at a given (state, action) pair (this process is the so-called Monte Carlo Method). Once trained, we can use this mapping to choose the action that gives the largest Return at a given state.

The random sampling needs to cover all actions. To help this, we will need to let it try all initial states and all initial actions.

Each training pass takes in 1 episode. Each episode is a list of (state, action) pairs. Iterate the (state, action) pairs backwards from the goal state to the initial state: the return at each state is the reward at the current state plus the discounted return of the future states. Note that at the goal state the discounted future return is 0. (So, iterate backward and multiply the accumulated return by the discount rate r.)

During this "training", just average the return at the (state, action) pair. (That means your sample episodes likely need to cover the station action more than once). Avoid "training" the same (state, action) pair twice in the same episode, by skipping to train only at the first occurrence toward the start of the episode.

Note the "training" here does nothing more than taking an average.


Avoid random initialization

Sometimes it might not be practical to construct all random initial states and actions. Instead, we make the stepping stochastic - with a small probability, randomly pick an action to explore, while most of the time following the greedy choice of the action with the largest return.


On-Policy Training vs. Off-Policy Training

So far, what is described here has only 1 policy function, which is known as on-policy training. It is usually faster to train, but it is only a special case of off-policy training. In off-policy training, two policy functions are involved: one policy for running (the target policy) and one policy for exploration (the behavior policy). On-policy training is just the case where the target policy is the same as the behavior policy used for exploration. By separating the two policies, we avoid asking the same policy to be greedy and to be willing to try random actions at the same time.

For such a pair of policies to work, the behavior policy must assign a non-zero probability to every action that the target policy might take (the coverage requirement).

There is a problem with Monte Carlo sampling at the tail of a distribution, where it usually requires a lot of samples. Importance sampling is a method to apply a second distribution over the tail of the first distribution so that the estimation requires fewer samples.

Off-policy methods usually utilize importance sampling to reweight the samples collected by the behavior policy, so that fewer samples are needed to improve the target policy.

On-policy methods can also use importance sampling to learn from previous experience (which effectively came from a different policy, since the policy keeps being updated), so that past experience can be replayed in on-policy training. This was demonstrated in the PPO algorithm: https://www.youtube.com/watch?v=IScp-mZ7iS0




 

Tuesday, April 16, 2024

Quick Tutorial: Tiny Llama Finetuning

Tiny Llama is an LLM small enough to run on your laptop, and it can generate reasonably good replies. People are fine-tuning it for their own purposes. I am going to show you how to play with it.

How to Play with a model on Huggingface ?

Go to the hugging face page, they already included the instructions on how to start the app:  https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0

How to run it locally?

Download the files to your disk from the "Files" tab:

https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0/tree/main

You will have to click on the download icon to download the files; copying and pasting the text leads to corrupted files. Put them all in one folder and run this Python script:


Which one to use?

There are a few models. Use the chat model; it gives better results for implementing a question / answer chatbot.

All models are good at continuing the text that you provide. The chat model separates questions and answers well by distinguishing user input from assistant output. Surround user input with the <|user|> tag, ending it with </s>. Surround the desired reply with <|assistant|>, also ending it with </s>.

You can also use it without tags, but expect the output to look more like creative writing than a conversation.

In addition to "text-generation", the Transformers pipeline has a "conversational" mode, which allows the input to be a list of maps, each map being a message in the conversation. (It converts the list of maps into text with tags for you.)


How to fine tune with peft?

Use this script. It picks a small percentage of certain neural network components to fine-tune.

https://gist.github.com/yuhanz/740ab396f237e8d2f41f07a10ccd8de9

You want to initialize TrainingArguments with push_to_hub=False, so that it doesn't require Hugging Face credentials. Replace the model name with the relative path to the folder containing the model files on your disk.

Prepare the training data into conversations separated by tags. Use tokenizer to convert structured data into text with tags:

text = tokenizer.apply_chat_template(messages, tokenize=False)

Note that the configuration max_seq_length decides size of the input and the generated text. This may depend on the tokenizer for the model.

References:

 https://www.kaggle.com/code/tommyadams/fine-tuning-tinyllama

https://www.databricks.com/blog/efficient-fine-tuning-lora-guide-llms

https://huggingface.co/docs/transformers/main/chat_templating


Thursday, April 04, 2024

Handling AWS API Gateway Error: Invalid HTTP endpoint encoding detected for URI

I encountered this error at work when deploying AWS API Gateway:

 Invalid HTTP endpoint encoding detected for URI (Service: ApiGateway, Status Code: 400

That was because the URI contained encoded %xx characters that AWS API Gateway doesn't like (even though they are valid and work well with curl). Replacing those characters fixed the issue. Here is the list of unwelcome characters that I found from my experiments:

%21 (!)

%24 ($)

%2C (,)

%3A (:)

%0A (newline)

Friday, March 29, 2024

Summary on Outlier's tutorial: Cross Attention

Self attention: we'd like to find a weighted sum of feature vectors for a series of inputs, whether it is a paragraph of text or a list of image segments. The feature vector for each input vector is calculated by the Value matrix. The weight is calculated by the Query matrix (for the vector in focus) multiplied by the Key matrix (for each vector in the input).

Cross attention: now consider having two inputs: one image and one text. The Query matrix applies to the image, and the Key and Value matrices apply to the text. Find the attention between each image fragment and each word. Using these values as weights, apply them to the Value vectors to create the weighted sum. From this vector, you can then multiply by another matrix to transform back to the dimension of the input image and construct a (different) image.
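In the usual notation, this is standard scaled dot-product attention with Q taken from the image tokens and K, V from the text tokens:

  Q = X_{image} W_Q, \quad K = X_{text} W_K, \quad V = X_{text} W_V

  \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\Big( \frac{Q K^{\top}}{\sqrt{d_k}} \Big) V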


 Reference: https://www.youtube.com/watch?v=aw3H-wPuRcw

Thursday, March 28, 2024

Summary on Shusen Wang's tutorial: Item-to-Item (I2I) model

The most common usage is User-Item-Item: a user has a list of favorite items. Use the favorite items to find similar items (a.k.a. Item-to-Item), and recommend these items to the user.

How to measure similarity: two items are similar if users who like one also like the other (similar to Item Collaborative Filtering).

Another way to measure similarity: compare items' feature vectors.

User-User-Item: User 1 is similar to User 2. User 2 likes item A. So recommend item A to User 1.

User-Author-Item: User 1 likes Author 2, and Author 2 has item A. So, recommend item A to user 1.

User-Author-Author-Item: User 1 likes Author 2. Author 2 and Author 3 are similar. So recommend Author 3's items to User 1.

Using multiple retrieval models, each with its own share of the retrieved records, achieves better results than using just one model to retrieve all items.


Summary on Shusen Wang's tutorial: Boost New Items

New items cannot compete with older items. Thus it is necessary to give new items a boost in search when they are created, to guarantee some impressions.

It is difficult to assign the boost score, and it often leads to giving too many impressions to new items. Thus limiting impressions is necessary. 

One way is to vary boost scores according to the impressions that have been served. (Ex: Create a table and vary the boost score by reducing the boost value depending on how close to the deadline and how many impressions that have been served.)

Vary the number of impressions based on the predicted quality of the item - give good item more impressions.



Summary of Shusen Wang's tutorial: DPP to improve variety in Recommendation

 Determinantal Point Processes (DPP)

Let the vectors construct a volume. The larger the volume, the more variety. When the volume is low, then there are items similar to each other.

Let V be a d x k matrix, collecting k vectors of d dimensions. The (squared) volume is calculated as the determinant of V.transpose * V.

DPP math formula:

   argmax [ log det(V.transpose * V) ]

Hulu applied DPP in recommendation system:

   argmax [ a * total reward + (1- a) * log det(V.transpose * V) ]

  To solve this equation, greedy search is applied to find 1 item at a time that has a good reward while not being too close to the already picked items. Hulu provided a more efficient solution for the similarity part, using Cholesky Decomposition.

Cholesky Decomposition factorizes a matrix into a triangular matrix multiplied by its own transpose. In Hulu's proposal, the Cholesky Decomposition at each step is calculated from the Cholesky Decomposition of the previous step to avoid repeated calculation.


References:

https://www.youtube.com/watch?v=HjpJeUSekKs

https://www.youtube.com/watch?v=wi8xVHiZZr4

https://lsrs2017.files.wordpress.com/2017/08/lsrs_2017_lamingchen.pdf



Tuesday, March 12, 2024

How To Search AWS CloudWatch Log by Date?

 

Knowing a timestamp where an activity happens, go to your CloudWatch log and search by prefix using the date:

Locate by last event time that is larger than your timestamp. There will be many candidate logs.



Open each candidate log, search with the keyword "INIT_START" to make sure the log started before your timestamp. If it matches, then you can perform other search by keywords.


Repeat on all candidate log streams.

To locate a particular event in a log stream at a timestamp, search by keyword "END" and it will give you the start and the end of the request. Knowing the time range of the request, then you can perform further search.




Tuesday, March 05, 2024

Summary on Shusen Wang's tutorial: Look-alike Prediction Recommendation

 Start with seed users, whose background we know: age group, education level, income, etc. Then find users who didn't provide much information via similarity, including User Collaborative Filtering.

Focus on the users who interacted with a new item - treat them as seed users. Take an average over the seed users to get a vector. Use this vector to spread to look-alike users (by looking it up in a vector DB). Note that this feature vector needs to be updated as more users interact with the item.


Reference: https://www.youtube.com/watch?v=pjmRo8Uzzqg

Thursday, February 29, 2024

Shusen Wang's video tutorial: LastN feature, DIN model, SIM model

LastN features come from the last N items that a user interacted with. The process is as simple as calculating the embedding vector of each item; having N vectors, take the average, and this becomes a LastN feature of the user. (An alternative to the average could be attention.)

The DIN model is an alternative to the LastN feature: take a weighted average of the feature vectors of the LastN items (this is the same as attention). How to get the weights? Use the 500 items most relevant to the user and take their average to get a vector. Use this vector to calculate the similarity with each vector in the LastN features, apply the similarity scores as weights, and take the weighted sum to get a new vector.

How to use DIN model: given an item recommended to the user, calculate its similarity to lastN items, and then take a weighted (by similarity) sum of the LastN vector. Apply this vector as a user feature.

The DIN model needs to calculate attention N times, so N cannot be very large. It is limited to the short-term interest of a user.

SIM model: to improve the DIN model, quickly remove irrelevant items from LastN to avoid calculating all the attentions. This allows N to be a few thousand. Here is how: for a candidate recommended item, find the k most similar items among the N items (call them the TopK), then calculate attention on those.

   Two steps of the SIM model: 1. search, 2. attention. Hard search: find by category and keyword. Soft search: take the items' embeddings as vectors and search. Measure how long ago the item was interacted with and convert that into a separate feature embedding, concatenated with the item feature. The SIM model needs the timestamp feature because it measures a user's long-term behavior (while the DIN model measures short-term behavior).

During the last step SIM will pick items from the list with the relevance as reward, minus the similarity with the already picked items.

MarginalRelevance (i) = a * reward (i)  -  (1-a) * max[ sim(i,j) for each j item that has been picked ]

When the picked set is big, max[ sim(i,j) ] is close to 1, causing the algorithm to fail. The solution is not to compare similarity against all items in the picked set, but only against the last few in a sliding window.


References:

https://www.youtube.com/watch?v=Stbc9goPKXQ

https://www.youtube.com/watch?v=0hPep80Oy6k

https://www.youtube.com/watch?v=_4J9aF5KR84

Monday, February 26, 2024

Summary on Shusen Wang's video of: SENet & Bilinear Cross

SENet was originally used in computer vision, but was later applied to recommendation system.

Consider discrete attributes of a product. Convert them into embeddings to get features: m features, each expressed as a k-dimensional vector. Take average pooling over each vector to make an m * 1 vector. (? why would that help?)

From m * 1 vector, apply a fully connected layer and a ReLU layer. Output an (m / r) * 1 dimension vector. (r is a compression ratio).

From (m / r) * 1 dimension vector, apply fully connected and a sigmoid layer. Output an m * 1 dimension vector. 

Now use the m * 1 vector from the last step as weights - apply these weights to the initial m*k matrix with row-wise multiplication to get another m*k matrix.

(The intuitive meaning of the operations above is to take away irrelevant features from the matrix.)

The core of SENet is to apply weight at each field. Note that the fields can have features in different dimensions.


Cross two vectors between field i and field j. (All fields are in m-dimension.)

- by inner product:  Xi.transpose * Xj = scalar

- Hadamard product of two vectors (element-wise multiplication) = vector. 


Bilinear Cross:

- by inner product: Xi.transpose * Wij * Xj . (Inserted a matrix in between the inner product)

    Note that there are too many pairs of fields. So we can only take a selective set of field pairs to perform Bilinear Cross.

- Hadamard product of two vectors: Hadamard( Xi , Wij * Xj )

   Likewise, in practice, a selective set of field pairs is used to perform the Bilinear Cross.


FiBiNet:

- Combine SENet and Bilinear Cross.

- Take the embeddings for each field to 3 paths: bilinear network, SENet + bilinear network and the original embeddings concatenated. Concatenate all 3 vectors together and apply another network.



Reference: https://www.youtube.com/watch?v=nF37qtNvw1E


Saturday, February 24, 2024

Summary on Shusen Wang's video of: LHUC Network (Learning Hidden Unit Contributions)

LHUC is good for sorting at the last step of a recommendation system. The network was originally from research in speech recognition.

Consider the case of speech recognition: take the voice signal as one input vector and the features of the speaker as another input. Pass the two inputs through two separate networks (which could be fully connected networks), and take the Hadamard product of the two outputs (obviously the output dimensions need to be the same). Wire the combined result into another fully connected layer.

Take the features of the speaker as input to another fully connected layer, and again perform a Hadamard product with the result on the main path. This process can repeat many times (adding a fully connected layer on the main path and Hadamard-multiplying the transformed speaker input back in).

Note that the network that transforms the speaker features can be multiple fully connected layers ending with a sigmoid function multiplied by 2. The scale of 2 allows features to be exaggerated, adapting to the distinct characteristics of a speaker.

In the case of applying the LHUC network to a recommendation system, the voice signal is replaced with the product features, while the speaker features are replaced with the user features.


Reference: https://www.youtube.com/watch?v=TxIedW94hu0


Summary on Shusen Wang's video of: DCN (Deep & Cross Network)

 DCN (Deep & Cross Network)  is used to replace fully connected layer.

Cross Layer:

- Take an intermediate output vector xi and pass it through a fully connected layer to get y, of the same size as the input x0. Take the Hadamard product of x0 and y, and call it z. Let Xi+1 = z + xi be the output.

Xi+1 = x0 ∘ (W * xi + b) + xi     (∘ denotes the Hadamard product)

Cross Network:

- Apply cross layer multiple times with the same x0

Deep & Cross Network:

- Take all features as a vector to feed in a fully connected layer and a second path of Cross Network. Concatenate the two results and feed into another fully connected layer.

- widely used in industry for recommendation - to sort the items, or used in shared bottom.

Reference: https://www.youtube.com/watch?v=yNeRO5m63JQ


Friday, February 23, 2024

Summary on Shusen Wang's video of: Converting attributes to Features

Here are a few ways to deal with attribute to make them into features.

  • Discrete values: apply one-hot encoding. Then make embeddings from one-hot encoding input.
  • Continuous values: divide value ranges into ranges, and then apply one-hot encoding for each range.
  • Large continuous values: apply log(1+x), and then treat it as a regular Continuous value.
    • An alternative is to convert large values into ratios. For example, number of clicks into click rate.
The thumbnail image of a record can also be converted into a vector to assist classification / prediction tasks.

User attributes and item attributes are relatively static and change little; they are perfect caching candidates. But user stats and item stats may change on every click / user interaction. It is better to separate the stats from the basic attributes.

Feature crosses are non-linear features derived from existing features using non-linear operations. A weighted sum of a vector (x1, x2, ... xd) is a linear operation, while multiplication of features creates "crosses". For an example of second-order feature crossing, a weighted sum of the xi * xj terms introduces non-linear terms. To visualize why this works, consider a vector with quantity and price per unit as features: the multiplication of the two predicts the total price effectively, while a linear combination of the two features cannot accurately describe the price.

n features create on the order of n ** 2 second-order feature crosses. This could become a lot of new dimensions. To reduce the dimensions, we can consider the cross weights as an n * n matrix, and then approximate it as the product of a low-rank matrix and its transpose.

Call the low-rank matrix V. Thus, for the (i,j) position, the weight of the cross term xi * xj is

weight(i,j) = (Vi).transpose * Vj

Using matrix V makes the model a Factorization Machine (FM), which is simply a linear model plus second-order feature crossing.
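In its standard written form (w_0 and w_i are the linear weights, v_i is the i-th row of V):

  \hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j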

Note: FM has become less popular nowadays in recommendation system.


Collaborative Filtering features based on userIds and itemIds can be more accurate than tag matching. However, for a new item that hasn't had any user interaction, it is impossible to estimate Collaborative Filtering features accurately. Thus, during the cold start of a new item, tag matching is used to take the top K similar items and average their Collaborative Filtering features - use that average as the initial guessed value for this feature. Keep using the guessed value until enough users have visited the new item.

Use K-means clustering to divide the items into clusters - there could be thousands of clusters. Use the centroid of each cluster as the index for searching. When a new item is created, it is matched to clusters, for example the top 1000 clusters. Similarly, when a user comes up, take her last N recent items and match clusters. Once the clusters are found, pick items from them.

References:


Wednesday, February 21, 2024

Summary on Shusen Wang's video of: Ways to combine click rate, like rate, favorite rate, etc. into 1 score

 In multi-target prediction, click probability, like rate, favorite rate, etc are generated together for an item recommended to a user. They need to be combined into 1 score to make ordering easy. Here are a few ways to combine the predictions.

Simple weighted sum:

   click_rate + w1 * like_rate + w2 * favorite_rate + ...

Weighted sum with click_rate as precondition: (as like action has to happen after click happens)

  click_rate * (1 + w1 * like_rate + w2 * favorite_rate + ... )

Weighted product instead of weighted sum:

   [(1 + w1 * click_rate) ** a1] * [(1 + w2 * like_rate) ** a2] * ...

Weighted sum based on each prediction's rank in the list sorted by that single dimension. For example:

   w1 / (click_rank ** a1 + b1) + w2 / (like_rank ** a2 + b2) + ...

For online purchase, it has to go through all steps for a transaction to happen. Thus the probability needs to be multiplied to make 1 score:

   ( click_rate ** a1 ) *    ( cart_rate ** a2 ) *  ( pay_rate ** a3 ) * ( price ** a4 )


Reference: https://www.youtube.com/watch?v=D2iqM2puJ2I

Summary on Shusen Wang's video of: Multi-gate Mixture-of-Experts (MMoE)

The idea is to train multiple expert neural networks and apply a weighted sum to their results. The weights are decided by another neural network that takes the same input and is connected to a softmax layer to generate the weights. The vector from the weighted sum is then fed into another neural network layer to predict a regression target.

When multiple predictions (targets) are made, they share the same expert neural networks, but each target has its own neural network for weight generation and its own succeeding neural network to process the vector from the weighted sum.

When 1 weight is close to 1 and the other weights are close to 0, only 1 expert neural network performs work, skipping all other expert networks. This polarization is an unwanted setup that wastes resources. To avoid polarization in the weights, a Dropout operation is performed on the softmax output during training, giving a 10% chance to drop the result from an expert network. As a result, the configuration that puts all logic into 1 expert network leads to poor performance - this in turn encourages multiple expert networks to be trained at the same time.

Reference: https://www.youtube.com/watch?v=JIEwaPARjfk

Summary on Shusen Wang's video of: Reduce negative samples and Adjust prediction Rate

Classifiers are often trained on unequal amounts of positive and negative data. It is a common practice to downsample the negative data (keeping a fraction a of the negatives relative to the positives). Reducing the negative samples causes the predicted value to be larger than the actual occurrence rate. However, the value can be adjusted by:

Let

  P_true = n_positive / (n_positive + n_negative)

  P_pred = n_positive / (n_positive + n_negative * a)

And the adjusted expectation can be derived:

   P_true = a * P_pred / ((1-P_pred)  + a * P_pred )


So the predicted probability can be adjusted back to the scale of the original, unsampled data.
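A quick numeric sanity check of the adjustment formula, using made-up counts and a = 0.1 (i.e., keeping 10% of the negatives):

n_pos, n_neg, a = 1_000, 9_000, 0.1

p_true = n_pos / (n_pos + n_neg)            # what we want to recover: 0.1
p_pred = n_pos / (n_pos + n_neg * a)        # what a model on downsampled data sees: ~0.526
p_adj  = a * p_pred / ((1 - p_pred) + a * p_pred)

print(p_true, p_pred, p_adj)                # p_adj recovers p_true (0.1)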


Reference: https://youtu.be/kY4W46MQqsg?si=U33AhcwpvCvQ3G_2&t=526

Tuesday, February 20, 2024

Summary on Shusen Wang's video of: Deep Retrieval

Deep Retrieval represents the feature of an item as a path through a layered structure. It is an online method to retrieve matching items for a user.

Make a lookup of a list of items by path and a lookup of a list of paths by item. Have a prediction model that matches a path to a user. During online recommendation, given a user, match a set of paths and follow the paths to get the recommended items.

How an item is expressed as a path: each path is a sequential traversal of ordered layers. Each layer has k possible nodes to step on; let the number of layers be 3 for this example. We call this depth = 3 and width = k. A path is then described by 3 nodes, one per layer.

Given a path [a,b,c], multiple items can be retrieved from it. The interest of a user in the path is computed layer by layer:

  • At layer 1, given a user feature vector x, the prediction model for node a is P1(a | x)
  • At layer 2, given user x and node a, the prediction model for node b is P2(b | a; x)
  • At layer 3, given user x and nodes a, b, the prediction model for node c is P3(c | a,b; x)
  • The interest of user x in path [a,b,c] is:
    • P(a,b,c | x) = P1(a | x) * P2(b | a; x) * P3(c | a,b; x)
The input of the prediction model is x, and the output of the prediction model is a k-dimensional vector. A softmax is applied on this output to direct the path walking: the next step can take the node with the largest value, or perform beam search.

The next layer has a different prediction model whose input has dim(x) + k dimensions (x concatenated with the k-dimensional output of the first layer); it likewise applies a softmax on its output.

Likewise for the succeeding layers.
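A hedged sketch of the online retrieval side (beam search over the layers). Here layer_models[d] is a hypothetical callable that returns the k softmax probabilities for layer d, and, as a simplification, only the previous node's one-hot is concatenated to x instead of conditioning on all earlier nodes:

import numpy as np

def beam_search_paths(x, layer_models, width_k, beam=4, depth=3):
    """Return the top `beam` paths and their approximate probabilities for user features x."""
    beams = [([], x, 0.0)]                       # (path so far, input features, log prob)
    for d in range(depth):
        candidates = []
        for path, feats, logp in beams:
            probs = layer_models[d](feats)       # (k,) softmax output of layer d
            for node in range(width_k):
                one_hot = np.eye(width_k)[node]
                candidates.append((path + [node],
                                   np.concatenate([x, one_hot]),   # input of the next layer
                                   logp + np.log(probs[node] + 1e-12)))
        candidates.sort(key=lambda c: c[2], reverse=True)
        beams = candidates[:beam]                # keep only the best `beam` partial paths
    return [(path, float(np.exp(logp))) for path, feats, logp in beams]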

So each path corresponds to a series of nodes, and the interest of user x in it is P(a,b,c | x). For an item, multiply this P by 1 for each user x who clicked on the item and by 0 for each user who didn't click, then sum the P values over all users. This summation is the score of the given path for the item.

Item feature:

Take the summation of the item's scores over a few paths; the negation of this summation is the loss function for the item. We'd like to minimize this negation so that the model gives large scores for the item.

Note that there could be a popular path that gives large scores for many items. To prevent such a popular path, add a penalty on paths that represent many items (how? by subtracting the penalty from the score of the item given the path).

Let reg(path) be the number of items represented by a path. Then to find a path that represents an item with the smallest loss:

   path = argmin_path ( loss(item) + a * reg(path) )


During training, first update the model that predicts a path given a user. Then update the item feature (its paths) using the trained model.


Reference: https://www.youtube.com/watch?v=BYtzZ48hRFM (voiced in Chinese)


Monday, February 19, 2024

Summary on Shusen Wang's video of: Self-supervised Learning

Often we have data that has features (keywords) but no classification labels. For example, videos on a website can have keywords, but we do not yet have user information to know whether a certain feature leads a user to like the video. However, since there are many keywords on the same video, it is reasonable to believe that feature vectors built from different keywords of the same video are close to each other, and more distant from feature vectors built from keywords of different videos. Separating the feature vectors based on this information is a self-supervised learning process.

Random mask: randomly mask the keywords of certain attributes.

Dropout is another technique for collecting features from the same item during training: randomly take away a percentage of the keywords (assuming there are many keywords). In this way, the feature of an item is more generalized.

Complementary features: split the keywords of one item into two sets and map each set into a feature vector, so each vector only sees 50% of the keywords (thus the two are complementary features). Since they come from the same item, the two vectors should have a large cosine similarity.

Mask related features: calculate the mutual information between keywords over the items, MI(U,V) = sum over keyword pairs (u,v) of p(u,v) * log( p(u,v) / (p(u) * p(v)) ), where p(u,v) is the probability that the two keywords appear on the same item. Then, for each keyword k, find the half of the other keywords that are most related to k. Mask that closely related half and keep the 50% of less related keywords.
(In practice, the method of masking related features is hard to calculate and hard to maintain.)

How to train: convert all records into vectors by applying multiple mask techniques and multiplying by a matrix to map each masked record into a vector. Pair the vectors and calculate their cosine similarities, and feed the similarities into a softmax layer. Let the expected similarity of vectors from the same item be 1 and of vectors from different items be 0. Take the cross-entropy loss between the calculated values and the expected values.
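A minimal PyTorch sketch of that training objective, assuming feat_a and feat_b are the vectors produced from the same batch of items under two different mask/dropout schemes (row i of both comes from item i; the temperature is an assumed hyperparameter):

import torch
import torch.nn.functional as F

def self_supervised_loss(feat_a, feat_b, temperature=0.1):
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)
    logits = a @ b.t() / temperature                      # (B, B) cosine similarities
    labels = torch.arange(a.size(0), device=a.device)     # the matching item is the "class"
    # Cross entropy pushes cos(i, i) toward 1 and cos(i, j != i) toward 0.
    return F.cross_entropy(logits, labels)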

It is also possible to combine the loss function of self-supervised learning with the loss function of Contrast Learning by taking a simple summation.

Reference: https://www.youtube.com/watch?v=Ra3MVhneR9E (voiced in Chinese)

Saturday, February 17, 2024

Quick Note: Contrast Learning in DSSM

Contrast learning is a way to set up the loss function. With a point-wise loss function, negative samples and positive samples are considered separately. In Contrast Learning, the loss function (aka pair-wise loss) is constructed by taking 3 items: 2 positively related items and 1 unrelated item. The cosine similarity of the 2 positively related items should be large, and the similarity of the unrelated pair should be small. The difference of the two similarity values is used as the loss function - the further apart the two values, the better. Since the optimization step reduces the loss, the training fits both samples at once.

To separate the positive sample and the negative sample better, a desired margin m is usually defined. If the difference is larger than m, there is no loss; if it is less than m, the loss is how far it falls short of m.

So the loss function is written as 

    Loss(a, b_positive, b_negative) = max{0, cos(a, b_negative) - cos(a, b_positive) + m}

It is also possible to write the function in logistic loss:

    Loss(a, b_positive, b_negative) = log(1 + exp[ sigma * (cos(a, b_negative) - cos(a, b_positive)) ] )


List-wise loss follows a similar idea to the pair-wise loss function, but considers more samples in the same loss function. Each training record consists of one positive pair (a, b) and multiple negative samples (b_neg_1, b_neg_2, etc.). Take the cosine similarity between the pairs: cos(a,b), cos(a, b_neg_1), cos(a, b_neg_2), etc., put all the results into a softmax function, and let the expected result be (1, 0, 0, ...). Train it with the cross-entropy loss between the result and the expected result.
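A short PyTorch sketch of both losses described above (batched tensors; the margin m and the shapes are assumptions for the example):

import torch
import torch.nn.functional as F

def pairwise_hinge_loss(a, b_pos, b_neg, m=0.1):
    """max{0, cos(a, b_neg) - cos(a, b_pos) + m}, averaged over the batch."""
    cos_pos = F.cosine_similarity(a, b_pos, dim=-1)
    cos_neg = F.cosine_similarity(a, b_neg, dim=-1)
    return torch.clamp(cos_neg - cos_pos + m, min=0).mean()

def listwise_loss(a, b_pos, b_negs):
    """Softmax over [cos(a,b), cos(a,b_neg_1), ...] trained toward (1, 0, 0, ...).
    a, b_pos: (B, D); b_negs: (B, N, D)."""
    cands = torch.cat([b_pos.unsqueeze(1), b_negs], dim=1)          # (B, 1+N, D)
    sims = F.cosine_similarity(a.unsqueeze(1), cands, dim=-1)       # (B, 1+N)
    labels = torch.zeros(a.size(0), dtype=torch.long, device=a.device)  # index 0 is the positive
    return F.cross_entropy(sims, labels)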


Note: when using a point-wise loss function (that is, training negative and positive samples separately), the ratio of positive samples to negative samples should be between 1:2 and 1:3.


Reference:

https://www.youtube.com/watch?v=2Mc10LZ-DB0 (voiced in Chinese)

Friday, February 16, 2024

Quick note: Collaborative Filtering vs. Swing

You have probably heard of Collaborative Filtering, which finds the similarity of two items by counting the number of shared interested users. The items are matched via cosine similarity (in other words, a dot product of normalized vectors).

There is a problem: when two users belong to the same small interest group, the items they share may not actually be that similar.

Swing takes one more step: assign a weight to each pair of shared users - if the two users have shared a lot of items, make the weight of this user pair smaller: 1 / (a + overlap(u1, u2)), where a is a positive term to avoid dividing by zero.
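A minimal sketch of that weighting, assuming we already have the user sets per item and the item sets per user (the data structures and names are made up for the example):

from itertools import combinations

def swing_similarity(users_of_i, users_of_j, user_items, alpha=1.0):
    """users_of_i / users_of_j: sets of users who interacted with items i and j.
    user_items: dict user -> set of items, used to compute overlap(u1, u2)."""
    shared = users_of_i & users_of_j
    score = 0.0
    for u1, u2 in combinations(shared, 2):
        overlap = len(user_items[u1] & user_items[u2])
        score += 1.0 / (alpha + overlap)   # down-weight user pairs that co-like many items
    return score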


Collaborative Filtering may also be applied to user similarity: recommendations are made from similar users' lists of items. To avoid popular items appearing in every recommendation, the weight of an item is reduced by its popularity with 1 / log(1 + num_users_like_the_item).


Reference:

https://www.youtube.com/watch?v=DUUMNTDuJ3Q

https://www.youtube.com/watch?v=7O9zFMNdrZ8


Tuesday, February 06, 2024

How to get Active Contour

A summary of the video tutorial Active Contour | Boundary Detection.

1. Draw a circle on the image, roughly around the object. This is the initial contour.

2. Calculate the gradient of the image.

3. Blur the gradient values in the image so that the values spread to other parts of the image, creating a slope that guides a point toward positions with large gradient values.

4. Apply greedy search from the contour to move each control point toward points with large gradient values (see the sketch after this list).

5. It is almost done, but the contour is not smooth. A modification to this solution is to adjust the gradient value with the smoothness and elasticity of the contour, defined via the first and second derivatives along the contour.
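A minimal numpy/scipy sketch of steps 2-4 only (no smoothness/elasticity term); the function name, the 3x3 search window, and sigma are assumptions for the example:

import numpy as np
from scipy import ndimage

def greedy_snake_step(points, image, sigma=5.0):
    """One greedy update of contour control points toward strong edges.
    points: (N, 2) integer (row, col) control points of the current contour."""
    gy, gx = np.gradient(image.astype(float))
    edge = ndimage.gaussian_filter(np.hypot(gx, gy), sigma)   # blurred gradient magnitude (steps 2-3)
    moved = []
    for r, c in points:
        # Examine the 3x3 neighborhood and move to the largest edge value (step 4).
        r0, c0 = max(r - 1, 0), max(c - 1, 0)
        window = edge[r0:r + 2, c0:c + 2]
        dr, dc = np.unravel_index(np.argmax(window), window.shape)
        moved.append((r0 + dr, c0 + dc))
    return np.array(moved)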


Saturday, January 20, 2024

Gaussian Mixture Model in Object Tracking

Summary on this Object Tracking lesson: https://www.youtube.com/watch?v=0nz8JMyFF14

The overall idea is to compute (supporting evidence / standard deviation): if this value is large, the pixel is background; otherwise, foreground.

The evidence is probably the value of the point under the Gaussian model, or it might simply be the distance from the mean of the Gaussian; the lesson didn't state this clearly.

For each pixel there is a Gaussian Mixture Model. Compute a pixel color histogram for the first N frames of a video, normalize the histogram, and model it as a mixture of 3 to 5 Gaussians.

During detection, a value X with | X - gaussian_center | < 2.5 * standard_deviation is considered part of that Gaussian distribution.

To make the model adapt to new data, update the histogram H using the new pixel intensity. If the new histogram differs from the old histogram a lot, refit the GMM.

A GMM can be fitted using Expectation Maximization: start by randomly picking means and standard deviations, cluster the data points using these Gaussian distributions, then refine the means and standard deviations. Repeat until the Gaussian distributions no longer change (converged). Or just use Python's sklearn.mixture.GaussianMixture.
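A small sketch of the sklearn route, fitting directly on one pixel's intensity samples rather than on a histogram (the data here is random and just illustrative):

import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical: intensities of one pixel collected over the first N frames.
pixel_history = np.random.rand(200, 1) * 255

gmm = GaussianMixture(n_components=3).fit(pixel_history)

def is_explained(x, n_std=2.5):
    """A new value x is 'explained' by the model (likely background) if it lies
    within 2.5 standard deviations of any component center."""
    means = gmm.means_.ravel()
    stds = np.sqrt(gmm.covariances_.ravel())
    return bool(np.any(np.abs(x - means) < n_std * stds))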

Tuesday, January 16, 2024

Reinforcement Learning: Actor Critic

Why Reinforcement Learning?

Reinforcement learning is typically used to generate a sequence of actions to achieve a goal. Note that it is different from a classification problem, which answers whether a given feature vector should have a yes/no (or categorical) answer. However, each step of the sequence may be decided similarly to a classification problem: given a feature vector for the current environment state, what is the next action to take to move toward the goal state. Main differences:

  • a sequence of actions (trajectory) vs. a single decision
  • environment may be a black box.

The goal state is assigned a reward to guide the actions. There are 2 approaches: find a policy function that, given a state S, returns the probability of picking each action a; or find a value function that predicts the expected reward from a state, and use it to greedily pick the next action a that gives the most reward (the observed reward of the next step plus the estimated value of the state it moves to).

Note that with a policy function, the actions are picked randomly by their probability. The policy function can be used to randomly sample a few trajectories to test out the rewards, and their average represents how good a state is. In reality there could be many trajectories; we just need a few samples to have a Monte Carlo estimation.

Describe RL in a more human-understandable fashion

It is a maze game: you are asked to find a path from point A to point B.

But you don't know what the maze looks like. However, if a BBQ chicken is put at point B, you can follow the smell from point A to reach point B. The BBQ chicken here is the reward, and the smell is the expected future reward (call it the return).

But the smell isn't there, either. To implement the smell, we need to assign every step toward the food a greater imagined smell value. This is the discounted return.

But we don't know where the food is, either. So we take random actions until reaching the BBQ, and then assign our steps with imagined smell values.

If we step on poop during the random exploration, assign our steps with negative smell values, so that they are visited less often. And try walking again following the imagined smell.

How to assign an imagined smell value to a step? Use a neural network, or any model.

Applications:

  • Chatbot - outputs a sequence of words to provide an answer to a question
  • AlphaGo - the board game Go, uses a sequence of moves to reach the best state
  • Control - plan a trajectory to control a robot / game to reach a certain state.

Q Function

The Q function is the quality of an action in a state S. It is usually estimated from an observed trajectory as the sum of the observed rewards from an intermediate step i to the end state. This sum is also called the discounted return (it is called discounted because the sum is weighted: a reward at a further step has less weight in the sum).
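A tiny sketch of the discounted return, computed backwards over an observed trajectory (gamma is an assumed discount factor):

def discounted_return(rewards, gamma=0.99):
    """G_i = r_i + gamma * r_{i+1} + gamma^2 * r_{i+2} + ...
    Later rewards get smaller weight, hence 'discounted'."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

# Example: discounted_return([0, 0, 1]) -> [0.9801, 0.99, 1.0]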

Actor Critic

In Actor Critic, two networks are trained. The Critic network gives a score for how good an action is under an environment state (so it takes 2 inputs: action and state); it is trained on the rewards from the system. The Actor network chooses the action under each environment state so that the average score from the Critic network is higher (so it takes 1 input: the state); it is trained on the scores from the Critic network.

Proximal Policy Optimization

PPO is an improvement on Actor Critic, where the update to the actor is capped to avoid taking too large a step. This can be done in 2 ways: limit the loss - if it is too large, clip it to the maximum allowed value; or use a (clipped) ratio between the probability of an action under the new policy and the old policy.
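A hedged PyTorch sketch of the clipped-ratio variant (log-probabilities and advantages are assumed to come from elsewhere; eps is the usual clipping range):

import torch

def ppo_clipped_objective(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate objective (to be maximized).
    ratio = pi_new(a|s) / pi_old(a|s); clipping caps how far one update can push the policy."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    return torch.min(unclipped, clipped).mean()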

Trust Region Policy Optimization

In Trust Region Policy Optimization, use a function L to approximate the target function whose parameters we want to find. L and the target function are only similar within a small region. Within this region, find the parameters that maximize L, assuming they are also good parameters for the target function. Then start over by building the approximate function L again. The process alternates between Approximation and Maximization.

The Monte Carlo method can be used to make the approximation L, by taking sample trajectories.

To make sure the new parameters after maximization stay within a small region around the old parameters, KL divergence is used to compare the probability distributions of the policy function. Alternatively, it is possible to measure the distance of the parameters directly.