Transformers are commonly used in recommendation systems and are the key building block for modeling sequences. However, transformers are expensive: the computational cost of a transformer layer is O(n^2 * d + n * d^2), where n is the sequence length and d is the hidden dimension. For a social media platform, it is easy to build a sequence with potentially hundreds of actions, so a sequence of length 1K immediately turns into 1M attention-score computations.
Where this 1M comes from is the self-attention step, when calculating the attention scores QK^T:
Q: n x d, K: n x d => Q @ K^T = (n x d) @ (d x n) = n x n, i.e. n^2 entries.
To avoid this, the Linformer authors observed that the attention matrix is approximately low rank, which means that with dimensionality reduction we can project the key and value matrices down to a fixed dimension r instead of n.
In the sketch below, notice that the attention score matrix is now n x r, which grows linearly with the sequence length. At the very end, when we multiply with the value matrix, the dimension r cancels out and the final output has exactly the same shape as in vanilla attention. If we pick r = 100 when n = 1000, Linformer reduces the computational complexity of this step by 10x.
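Here is a minimal numpy sketch of the idea (the per-head dimension d and the projection matrices E and F are illustrative; in the real model E and F are learned):

import numpy as np

n, d, r = 1000, 64, 100          # sequence length, head dim, projected length

Q = np.random.randn(n, d)
K = np.random.randn(n, d)
V = np.random.randn(n, d)

E = np.random.randn(r, n)        # projection for K (learned in the real model)
F = np.random.randn(r, n)        # projection for V (learned in the real model)

K_proj = E @ K                   # (r, d) instead of (n, d)
V_proj = F @ V                   # (r, d)

scores = Q @ K_proj.T / np.sqrt(d)              # (n, r): linear in n, not n^2
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the r projected keys
out = weights @ V_proj                          # (n, d): same shape as vanilla attention
print(out.shape)                                # (1000, 64)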
Here are my study notes from the inference lecture of Stanford CS336, discussing techniques to improve inference efficiency.
Attention variants that shrink the KV cache:
– Multi-Head Attention (MHA)
– Grouped-Query Attention (GQA): use fewer keys and values
– Multi-Query Attention (MQA)
– Multi-Head Latent Attention (MLA), compressed latent KV: DeepSeek reduces N*H from 16,348 to 512; MLA is not directly compatible with RoPE
– Cross-Layer Attention (CLA): reuse the same K, V projections across different layers
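A quick back-of-the-envelope sketch of why these variants matter for the KV cache (the layer/head/dim numbers below are illustrative, not taken from the lecture):

# KV-cache size per token (bf16 = 2 bytes), comparing MHA, GQA and MQA
n_layers, n_heads, head_dim, bytes_per_val = 32, 32, 128, 2

def kv_cache_per_token(n_kv_heads):
    # 2x for K and V, summed over all layers
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_val

print("MHA:", kv_cache_per_token(32), "bytes/token")   # every head keeps its own K, V
print("GQA:", kv_cache_per_token(8), "bytes/token")    # heads share 8 KV groups
print("MQA:", kv_cache_per_token(1), "bytes/token")    # a single shared KV head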
Local attention:
– variants: full n^2 attention, sliding window attention, dilated sliding window, global + sliding window
– sliding window attention: the KV cache size stays constant as the sequence length grows
– solution: interleave local attention with global attention (hybrid layers), e.g. one global attention layer for every 6 layers
Inference is memory limited:
– lower-dimensional KV cache (GQA, MLA, shared KV cache)
– local attention on some of the layers
Alternatives to transformers:
– state space models: continuous state space, fast discrete representations
– diffusion models
– Mamba
– Jamba: interleave Transformer and Mamba layers at a 1:7 ratio
– BASED: linear attention + local attention
– MiniMax: linear attention + full attention once in a while (456B-parameter MoE)
Quantization: reduce the precision of numbers; need to worry about accuracy
– fp32 (4 bytes): needed for parameters and optimizer states during training
– bf16: default for inference
– int8: inference only, e.g. LLM.int8()
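A toy sketch of the int8 idea (simple symmetric absmax quantization of one weight matrix; real LLM.int8() does vector-wise scaling plus outlier handling, so this only gives the flavor):

import numpy as np

w = np.random.randn(4096, 4096).astype(np.float32)   # fp32 weights: 4 bytes each

scale = np.abs(w).max() / 127.0                       # per-tensor absmax scale
w_int8 = np.round(w / scale).astype(np.int8)          # 1 byte each: 4x smaller
w_dequant = w_int8.astype(np.float32) * scale         # dequantize for compute

err = np.abs(w - w_dequant).mean()
print(f"mean absolute error after int8 round-trip: {err:.5f}")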
model pruning
Speculative decoding and speculative sampling: small-model generation + big-model verification is faster than big-model generation; you get roughly a 2x speedup. Try to make the draft model as close to the target model as possible (e.g. via model distillation).
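A toy sketch of the accept/reject logic over a tiny vocabulary (the two "models" here are fixed distributions; a real setup would run the target model once over all drafted positions, and would also sample a bonus token when every draft is accepted):

import numpy as np

rng = np.random.default_rng(0)
V = 5  # toy vocabulary size

def draft_dist(context):
    # cheap draft model: a fixed next-token distribution (context ignored in this toy)
    return np.array([0.4, 0.3, 0.1, 0.1, 0.1])

def target_dist(context):
    # expensive target model: a slightly different fixed distribution
    return np.array([0.35, 0.25, 0.2, 0.1, 0.1])

def speculative_step(context, k=4):
    # 1) the draft model proposes k tokens autoregressively
    proposed, ctx = [], list(context)
    for _ in range(k):
        q = draft_dist(ctx)
        tok = int(rng.choice(V, p=q))
        proposed.append((tok, q))
        ctx.append(tok)
    # 2) the target model verifies the same positions (in practice: one batched forward pass)
    out = list(context)
    for tok, q in proposed:
        p = target_dist(out)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)                    # accept the draft token
        else:
            residual = np.maximum(p - q, 0.0)  # reject: resample from max(p - q, 0)
            out.append(int(rng.choice(V, p=residual / residual.sum())))
            return out                         # stop at the first rejection
    return out

print(speculative_step([0], k=4))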
I tested it on my local Ubuntu (I have a Windows and Ubuntu dual boot) and it worked great right out of the box!
1. Installation.
Installation of vLLM is as easy as a pip install. Because inference is highly hardware dependent, you need to find the proper installation guide for your hardware, whether you have an NVIDIA card, an AMD card, a TPU, a CPU, etc. Sometimes you even need to build from source directly to ensure a proper installation and the best performance.
2. Check VRAM and GPU
Make sure you have sufficient VRAM to host the model. Here I have a 3090 with 24GB of VRAM.
3. Download the model and Serve
vLLM supports many SOTA models; here is a list of all the models they support. I am testing Phi-3 (microsoft/Phi-3-mini-4k-instruct), which is small enough to set up locally. Meanwhile, by downloading the instruct model, I can later refer to it within the gradio chat window to explore the chat experience.
After the model is served, we can interact with it via API. At this moment, vLLM has 3 endpoints implemented: list models (screenshot below), chat completion, and create completion. The vLLM API is standardized so that it is a drop-in replacement for OpenAI: you can literally take your existing OpenAI code, replace the URL with localhost, and your code should just work.
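For example, using the official openai Python client against the local server (a minimal sketch, assuming vLLM's default port 8000 and no API key requirement):

from openai import OpenAI

# point the standard OpenAI client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="microsoft/Phi-3-mini-4k-instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)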
An interesting observation is that the VRAM of the 3090 is almost fully occupied; I am not sure why (possibly because vLLM pre-allocates most of the GPU memory for the KV cache by default).
4. Chat UI
vLLM comes with many examples. One of the examples sets up a gradio OpenAI chatbot so you can interact with the LLM.
I had heard how easy it is to set up model UIs using gradio. Looking into the source code of the UI script, it is clean and simple: you basically only need to define a predict function that takes the message and the history.
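A rough sketch of the same idea (not the exact vLLM example script; the model name and local endpoint are carried over from the earlier steps, and the history is assumed to arrive in gradio's older [user, assistant] tuple format):

import gradio as gr
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def predict(message, history):
    # rebuild the chat history in OpenAI message format
    messages = []
    for user_msg, bot_msg in history:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": bot_msg})
    messages.append({"role": "user", "content": message})
    resp = client.chat.completions.create(
        model="microsoft/Phi-3-mini-4k-instruct", messages=messages
    )
    return resp.choices[0].message.content

gr.ChatInterface(predict).launch()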
I also tested the throughput by instructing Phi to repeat the same word many times for a while. During the experiment, the average generation throughput was about 70 tokens/s.
Just completed the short course on pretraining LLMs from deeplearning.ai – link
What a good use of one hour on a Wednesday night; otherwise, I would probably be watching youtube shorts while drinking a beer 🙂
In this tutorial, the instructors Sung (CEO) and Lucy (CSO) from Upstage walked through the key steps in training an LLM using huggingface libraries. Most importantly, they introduced several techniques like depth upscaling and downsizing, which accelerate pretraining by leveraging weights from existing models but with a different configuration.
The chapter that I personally liked the most is Model Initialization; this is the step where you customize the configuration and initialize the weights.
For example, here they customized a new model with 16 layers (~308M parameters) and initialized the weights by simply concatenating the first 8 layers of a 12-layer pretrained model with the last 8 layers of the same model. It is like 1,2,3,4,(5,6,7,8),(5,6,7,8),9,10,11,12, and somehow the generated text has some linguistic coherence instead of being complete gibberish.
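A rough sketch of what that initialization could look like with the transformers library, assuming a Llama-style 12-layer source model (the checkpoint name is a placeholder, not necessarily what the course used):

import copy
import torch
from transformers import AutoConfig, AutoModelForCausalLM

src_name = "some-org/some-12-layer-model"          # placeholder source checkpoint
src = AutoModelForCausalLM.from_pretrained(src_name)

cfg = AutoConfig.from_pretrained(src_name)
cfg.num_hidden_layers = 16                          # the new, deeper configuration
new_model = AutoModelForCausalLM.from_config(cfg)   # randomly initialized 16-layer model

# Depth upscaling: reuse layers 1-8 of the source as layers 1-8 of the new model,
# and layers 5-12 of the source as layers 9-16 (the 1,2,...,8,5,6,...,12 pattern above).
src_layers = src.model.layers
new_model.model.layers = torch.nn.ModuleList(
    [copy.deepcopy(src_layers[i]) for i in range(0, 8)]
    + [copy.deepcopy(src_layers[i]) for i in range(4, 12)]
)

# Embeddings, final norm, and LM head are copied over unchanged.
new_model.model.embed_tokens = copy.deepcopy(src.model.embed_tokens)
new_model.model.norm = copy.deepcopy(src.model.norm)
new_model.lm_head = copy.deepcopy(src.lm_head)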
They claim that depth upscaling saves the training cost by a whopping 70%.
Following the first post, RAG in 5 lines, this post covers what is happening behind the scenes in more detail and what knobs you can tune, so you can better understand the working internals of llama_index.
llama-index has done such a good job of abstracting away the complexities behind the index and query commands. I mentioned that it took me almost 90 seconds to index the 10 books.
Data Volume
Now let’s look back at how much data we are working with.
The 10 books occupy about 4.9MB in total in plain text format. Using the cl100k_base BPE, we know they contain about 1.2 million tokens in total; on average, each token corresponds to about 4 bytes.
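As a sanity check, the token count can be reproduced with tiktoken over the downloaded data directory (a small sketch, assuming the ./data layout from the first post):

import glob
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
total_tokens = 0
total_bytes = 0
for path in glob.glob("./data/*.txt"):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    total_tokens += len(enc.encode(text))
    total_bytes += len(text.encode("utf-8"))

print(total_tokens, total_bytes / total_tokens)   # ~1.2M tokens, ~4 bytes per token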
I want to admit that I had a false start at the beginning: initially I wanted to index 100 books, then stopped after noticing how long the wait was and later changed to 10 books. Now, noticing that our API usage shows 1.9 million tokens in total, that roughly lines up with our total of 1.2 million tokens (the additional 0.7 million was probably due to my false start).
The theoretical API throughput should be 5M tokens per minute, and our indexing process ran at about 500K per minute (~10%), which is quite slow by default. Later on, we will discuss why this happens and how to address it.
Chunk
A key step in RAG is to properly break larger documents into smaller pieces and index them accordingly. In the 5-liner example, there is no place to see how each document is broken down into chunks; that is because there are extensive default settings where these configurations live. Users still have the freedom to customize, but if you do not, there is a sensible default for your application.
You can access the default settings by running the following commands:
from llama_index.core import Settings
print(Settings)
In the default settings, you can find a lot of useful info: the default LLM is OpenAI gpt-3.5-turbo and the default embedding model is OpenAI text-embedding-ada-002. You can also find that the sentence splitter uses a chunk_size of 1024 with a 200 overlap.
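If you want to override these defaults, a small sketch of customizing the splitter via Settings (the values below are chosen arbitrarily for illustration):

from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter

# switch from the default 1024/200 splitter to a smaller chunk size, for example
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=100)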
I will not dive too deep into how chunking works, but you should totally check out the ChunkViz from Greg Kamradt.
From the previous section, we know there are about 1.2 million tokens in total. If the chunk size is 1024 with an overlap of 200, each chunk advances by roughly 800 new tokens, which translates to 1.2 million / 800 ≈ 1,500 chunks. Now let’s verify whether that is the case.
>> len(index.docstore.docs)
1583
Document Store
Now we have verified that the raw documents got split into smaller chunks. These chunks are stored in a docstore inside the index object. By default, the index uses what is essentially a Python dictionary, called SimpleDocumentStore, to store the raw text.
The code below iterates through all the docs and prints out the first element in the dictionary with its key, its text, its number of tokens, and its number of characters. Again, we can verify that the token count is within the 1024 chunk-size limit and that there is on average a 4-to-1 relationship between the number of characters and the number of tokens.
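The original snippet is not reproduced here, so below is a minimal reconstruction (assuming tiktoken's cl100k_base tokenizer, as used earlier for the token counts):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for doc_id, doc in index.docstore.docs.items():
    text = doc.get_content()
    print(doc_id)
    print(text[:200])                       # preview of the chunk text
    print("tokens:", len(enc.encode(text)))
    print("characters:", len(text))
    break                                   # only inspect the first chunk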
Like any database/store, the document store supports many operations, mainly getters and setters, just like the CRUD operations on a database.
Embedding
An embedding is a numerical representation of an entity, in this case a chunk. Based on the default settings, we know the chunks are sent to the OpenAI API in batches of 100. Given that we have ~1,500 chunks, this should translate to roughly 16 requests.
Again, it is very likely that we wasted some requests while trying to index 100 books at the beginning; still, the request count verifies our hypothesis that batched indexing works as expected.
All the embeddings are stored in the vector store.
Vector Store
The vector store is yet another place where all the embeddings are stored; in addition, a lot of metadata representing the underlying documents is stored there as well.
import pprint

vs = index.storage_context.vector_store   # the default SimpleVectorStore

for k, v in vs.data.metadata_dict.items():
    print(k)
    pprint.pprint(v)
    break

for k, v in vs.data.embedding_dict.items():
    print(k, v)
    break
Easter egg – embeddings that don’t match
I tried to generate the embedding for a doc myself and compared it to the stored embedding in the vector store; even though they are very similar, I cannot seem to match them 100%. I would not be surprised if there is some pre/post-processing somewhere, such as removing non-printing characters, but this is good enough 🙂
Query
Behind the scenes, query_engine.query converts the query into its embedding and searches it against the vector store to get the best-matching record(s). In fact, a simple query_engine is backed by a retriever, which is effectively a synonym in this case.
The code below demonstrates that query_engine.query passes the query in as a QueryBundle object.
from llama_index.core.schema import QueryBundle

# Step 1: Generate the embedding for the query
# (get_embedding is a small helper around the OpenAI embeddings API)
em_query = get_embedding(question)

# Step 2: Create a QueryBundle from the raw question
query_bundle = QueryBundle(question)

# Step 3: Retrieve nodes using the retriever
retriever = index.as_retriever()
nodes = retriever.retrieve(query_bundle)
Here we directly calculate the embedding for the query using the OpenAI API, and then pass the question to initialize a QueryBundle object that will be used for retrieval.
Now we can make two observations (at least in this simple 5-liner scenario): 1. the query_bundle simply takes the query and embeds it; 2. query_engine.query runs retriever.retrieve behind the scenes.
Now, let’s manually verify the outcome. We generate the embedding for the query via OpenAI, then iterate through the vector store and calculate the pairwise cosine similarities; we store the similarities in a dictionary, sort it, and highlight the top 2 most similar items.
Based on the outcome, we can verify that the two records 3fe and c5b are the top two results, and they match what llama-index retrieved; even the similarity scores match perfectly.
from numpy import dot
from numpy.linalg import norm

dict_similarity = {}
for k, v in vs.data.embedding_dict.items():
    cos_sim = dot(v, em_query) / (norm(v) * norm(em_query))
    dict_similarity[k] = cos_sim

sorted_dict_similarity = dict(sorted(dict_similarity.items(), key=lambda x: x[1], reverse=True))
sorted_dict_similarity
Synthesis
Now that we have located the chunks that are semantically most similar to the query, we still need one more step to refine them into a concise yet relevant answer. Otherwise, the user asks “what constitutes the US congress” and we return a big chunk of text from the constitution for them to read. This step distills the chunk(s) down to one or two sentences; in our example, it can even be as short as “the House of Representatives and the Senate”.
In llama-index, this is achieved via a component called the synthesizer. The retrieved nodes serve as context and, along with the query, are fed directly to an LLM to produce the refined answer.
# llama_index.core.prompts.default_prompts.py
DEFAULT_TEXT_QA_PROMPT_TMPL = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)
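To poke at this step directly, here is a small sketch that feeds the previously retrieved nodes to the default synthesizer (assuming the question and nodes variables from the retriever example above):

from llama_index.core import get_response_synthesizer

synthesizer = get_response_synthesizer()          # uses the default QA prompt shown above
response = synthesizer.synthesize(question, nodes=nodes)
print(response)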
RAG is taking over the world as an augmentation to, or even replacement for, traditional search. By augmenting an LLM with retrieved context via techniques like vector search, it offers a new way to deliver better-quality responses to users. This article demonstrates how RAG works in a bare-minimum example.
Llama-index is one of the most popular libraries these days for building a RAG application, and it has its famous 5-line starter:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("Some question about the data should go here")
print(response)
Before running the code, we need a few preparation steps:
OpenAI Key
To leverage the LLM part, the easiest way is to just call the OpenAI API. Once you create a key and place it in a .env file, you can load it into an environment variable as a best practice.
import os
import dotenv
dotenv.load_dotenv()
print(os.environ['OPENAI_API_KEY'])
Books Data
We need some toy data to work with, and Project Gutenberg is a good place to get free books. The code below downloads the first 10 books from Project Gutenberg by manipulating the URL. The books will be stored in a local directory, data (pre-create it if it does not exist yet).
import requests

number_of_books = 10
for i in range(1, number_of_books + 1):
    url = f'https://www.gutenberg.org/cache/epub/{i}/pg{i}.txt'
    response = requests.get(url)
    if response.status_code == 200:
        with open(f'./data/pg{i}.txt', 'w') as f:
            f.write(response.text)
    else:
        print(f'Failed to download {url}')
Load the Data and Index
docs = SimpleDirectoryReader('data').load_data()
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
The three lines above are all you need to load the data, create an index, and expose the index as a query engine. The index initialization took me a while (1 minute and 30 seconds) to index the 10 books. So let’s put it into action.
Query
The first few books that we downloaded happen to be literature from the founding of the US, including the US constitution, so let’s ask a question and see how it works.
>> response = query_engine.query("what does the congress of the US consists of?")
>> response
Response(response='The Congress of the United States consists of a Senate and House of Representatives.', source_nodes=
voila, we got the answer that we need!
In the next post, we will dive into each of the building blocks and understand what is happening behind the scenes.
For a long time, developers have used RESTful APIs to pull data from services and build websites on top of them. A disadvantage of REST is that users tend to spend a lot of time playing with query parameters and parsing outputs, because the frontend is constrained by the backend; sometimes you only need a single data point from a heavy request, which is quite wasteful. GraphQL enables frontend developers to selectively prescribe what to pull, while the backend only focuses on what to feed, making the communication versatile and efficient. GraphQL is more about the “QL” (query language) and less about the “Graph”. During my first encounter with GraphQL, I found a great resemblance to SQL: you define your schema and users “query it”.
Just like with a SQL table, one needs to know the schema of the data before using it. Above is a screenshot of the GraphQL schema visualized in Apollo Explorer.
We know there are 3 fields (columns) that we can query: the hello field returns helloworld for testing purposes, notes returns a list of all the notes, and note returns a specific note.
In this simple example, there are 3 fields available to query, and each field has 3 attributes that you can select. Depending on the use case, some users may only need the content, some may only need the authors, and some may need all fields; if you take all combinations of use cases into account, it is very hard to anticipate them all. Instead, GraphQL defines the schema with all the ingredients available, and users just query what they need.
The explorer is very helpful for interacting with the API; by nature, a query is just a typical HTTP request with the GraphQL query stored as JSON in the payload. See below: you can choose between inline variables and extracted variables, and both work.
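For example, a minimal Python sketch of such a request (the endpoint URL, field names, and note id are assumptions for illustration):

import requests

query = """
query GetNote($id: ID!) {
  note(id: $id) {
    id
    content
    author
  }
}
"""

resp = requests.post(
    "http://localhost:4000/graphql",                    # assumed local Apollo server endpoint
    json={"query": query, "variables": {"id": "1"}},    # extracted-variables style
)
print(resp.json())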
How to Build it
Define Schema
gql`` in JavaScript is a tagged template literal that parses the GraphQL schema language; the resulting typeDefs object is passed to the Apollo server during initialization.
Define Resolver
The schema defines the “what”, and the resolver takes care of the “how”. In this simple example, the Query block defines how the three queries work.
Try it Yourself
Now that we have seen how the schema and resolver work together, let’s try to add a new field that returns all the notes except the one specified by the user (this could apply to feeding new, unread articles).
Where is the Graph
We just covered the query language part; now let’s think about where the graph is. There is a great article by Bogdan discussing the graph part of GraphQL.
You can also visualize any GraphQL schema using a tool called GraphQL Voyager.
More Questions
If a GraphQL query ends up being a DAG, and different steps take different amounts of time, is resolving it also an optimization problem, with the goal of minimizing total execution time or maximizing total throughput, aka the job scheduling problem?
Luna Dong Podcast – Building the Knowledge Graph at Amazon with Luna Dong
Three key features of a Knowledge Graph: 1. structured data (entities and relationships) 2. canonicalized/rich/clean data 3. data that is connected.
At Amazon, digital products are kept separate from retail products because the former tend to have better, more structured metadata, while retail products require extracting structured data from various kinds of real-life data, for example images and raw text.
“is-a” relationships, “event” information; use the “seed knowledge” to automate the building of training data. The more you know, the faster you learn.
Knowledge extraction: from the web, product descriptions (text, …), and web tables; product data is collected from text and images.
Data integration: “is_a_director_of” is the same as the “director” relationship (bringing the database and NLP communities together); we put things together and decide whether two values conflict by looking for inconsistencies (color, product flavor, data sources). Data fusion decides which version is right – is this person’s birthday on Feb 28th or Mar 28th? Through this process, you can learn embeddings that can be used for downstream tasks like search, recommendation, Q&A, and many others.
Human in the loop is important because we need high-quality data: it is important to seed the training data, annotate the data, and calibrate and analyze the overall performance; it is also important to address last-mile failures if we want to be 99% accurate.
The most inspiring moment from Luna comes from Amazon’s fulfillment centers: combining machine power and human power.
Data acquisition: the product manufacturer’s website contains a lot of information; start with general crawling and sometimes do targeted crawling. It is not binary, it is a mixture.
Embedding: conditional embeddings. Spicy is a valid flavor, but spicy is unlikely to be an ice cream flavor; capture these constraints implicitly – the spicy flavor is generally only found in certain types of products.
Triples – subject-predicate-object. Look at all the triples together and clean the embeddings that way; the embeddings can propagate through the graph. Some products do have the flavor spicy. Graph neural networks are one of the most effective ways to solve this problem.
The knowledge graph is a production system, and the knowledge is generated from a lot of products. There are three major applications: search (intent), recommendation (similarities, but still some differences), and display of information (structured information and structured knowledge, e.g. better comparison tables).
Luna said most knowledge graphs are built and owned by large corporations, and she wishes there were tools for smaller businesses. She said there are three levels: the first being the databases and tooling for storing knowledge graphs, the second being the techniques for entity and relationship extraction.
Open knowledge is an effort to connect and hook up different data sources.