ETL for LLMs: Extracting Knowledge from Content

Archana Vaidheeswaran

March 22, 2024

Companies are awash with data, most of which is unstructured or semi-structured. This data is scattered across various platforms like support tickets, emails, Slack messages, documentation, FAQs, and website content.

While this data is a goldmine of information, leveraging it effectively poses a significant challenge. This is where Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) come into play, especially when integrated into a well-structured ETL (Extract, Transform, Load) pipeline.

The Challenge with Unstructured Data

LLMs excel at analyzing text and extracting useful information. However, their performance is contingent on the relevance and quality of the data they are given as context. Given their limited context length, supplying them with data directly related to the query or task at hand is crucial.

Also, when content is multimodal, such as MP3 audio files or JPEG images, it must first be preprocessed into text before an LLM can analyze it. Audio transcription or image description APIs can be used to generate text from such data.
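As an illustration, here is a minimal sketch of transcribing an audio file with OpenAI's hosted Whisper API via the official Python client. The file name is a placeholder, and any transcription service could be substituted.

```python
# Minimal sketch: transcribe an MP3 so its content becomes text an LLM can use.
# Assumes the `openai` package and an OPENAI_API_KEY in the environment;
# "meeting.mp3" is a hypothetical input file.
from openai import OpenAI

client = OpenAI()

with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",  # hosted Whisper transcription model
        file=audio_file,
    )

print(transcript.text)  # plain text, ready for cleaning and chunking
```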

The Role of RAG in Enhancing LLMs

RAG offers a compelling solution by combining the power of LLMs with a retrieval-based approach. The retrieval stage of RAG fetches relevant information from a vast corpus of data and provides it as context to the LLM before it generates a response, called the LLM completion.

The great thing about this approach is that the contextual data does not have to be structured by the user (or developer): the ETL pipeline turns unstructured data into structured textual information for the LLM.

However, we still need an efficient way to provide the LLM with information most relevant to the query it is processing.

This is where vector databases come in. By representing chunks of text data as high-dimensional vectors (embeddings), a vector database allows fast similarity searches to find the most relevant information for a given query.

The user prompt is encoded into a vector, and an approximate nearest neighbor search is performed to retrieve the text chunks whose vectors are most similar to the query vector. These relevant chunks are then provided to the LLM as context along with the user prompt.
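To make this flow concrete, here is a minimal sketch using Chroma, one of the open-source options mentioned below. The collection name and documents are hypothetical, and Chroma embeds the text with its default model when none is specified.

```python
# Minimal sketch: index a few chunks and retrieve the most similar ones.
# Assumes the `chromadb` package; collection name and documents are hypothetical.
import chromadb

client = chromadb.Client()  # in-memory instance for illustration
collection = client.create_collection("support_docs")

# Chroma computes embeddings with its default model when none is supplied.
collection.add(
    ids=["t1", "t2", "t3"],
    documents=[
        "To reset your password, open Settings and choose Security.",
        "Invoices are emailed on the first business day of each month.",
        "Our API rate limit is 100 requests per minute per key.",
    ],
)

# Encode the user prompt and run an approximate nearest neighbor search.
results = collection.query(
    query_texts=["How do I change my password?"],
    n_results=2,
)
print(results["documents"][0])  # chunks to pass to the LLM as context
```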

Some popular open-source vector databases used for RAG include Weaviate, Qdrant, and Chroma, while managed services such as Pinecone and Azure AI Search also support vector indexing. These systems are optimized for storing and searching through huge volumes of vector data.

ETL Pipelines for LLMs: The Backbone of RAG Applications

To harness the full potential of RAG, the unstructured data must first be transformed into a format that these models can efficiently process. This transformation is achieved through an ETL pipeline, tailored for LLMs. Let's delve into the stages of this pipeline:

1. Extraction

The first step involves extracting text from various sources. This text may come in multiple formats and from disparate sources, each with its unique structure and metadata. Some ETL tools also support text extraction from images using OCR. The extraction process collates this data into a uniform format for further processing; a minimal extraction sketch follows the list below.

Common data sources that ETL pipelines extract from include:

  • Documents (e.g. PDF, DOCX, PPTX)

  • Web pages (via scraping)

  • Audio and podcasts (MP3)

  • Video recordings (Zoom, Loom)

  • Cloud storage (S3, GCS, Azure Blob Storage)

  • SaaS tools (Slack, Notion, Google Drive, etc.)

  • Email (Gmail, Microsoft Outlook)

  • Product management tools (Jira, Linear, GitHub Issues)

  • Code repositories (GitHub, Azure Repos)

  • Data warehouses and lakehouses (Snowflake, Databricks)
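As one example of extraction, here is a minimal sketch that pulls plain text out of a PDF with the pypdf library. The file name is hypothetical, and dedicated ETL tools typically handle many more formats and edge cases.

```python
# Minimal sketch: extract raw text from a PDF for downstream processing.
# Assumes the `pypdf` package; "handbook.pdf" is a hypothetical input file.
from pypdf import PdfReader

reader = PdfReader("handbook.pdf")

# Concatenate the text of every page into one string.
raw_text = "\n".join(page.extract_text() or "" for page in reader.pages)

print(f"Extracted {len(reader.pages)} pages, {len(raw_text)} characters")
```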


2. Transformation

Transformation is pivotal in converting raw data into a structured form that LLMs can utilize. This stage encompasses several sub-processes:

Cleaning

This involves removing irrelevant content, correcting errors, and standardizing the text to ensure consistency. Some common cleaning steps, a few of which are sketched after this list, are:

  • Removing HTML tags, code snippets, and special characters

  • Fixing encoding issues and standardizing to UTF-8

  • Expanding contractions and abbreviations

  • Lowercasing and removing extra whitespace
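Here is a minimal cleaning sketch using only the Python standard library; it covers a few of the steps above. Real pipelines often use dedicated libraries, and whether to lowercase depends on the embedding model.

```python
# Minimal sketch: a cleaning pass covering a few of the steps above.
# Uses only the standard library; real pipelines are usually more thorough.
import html
import re
import unicodedata

def clean_text(raw: str) -> str:
    text = html.unescape(raw)                    # decode entities like &amp;
    text = re.sub(r"<[^>]+>", " ", text)         # strip HTML tags
    text = unicodedata.normalize("NFKC", text)   # standardize Unicode forms
    text = text.lower()                          # lowercase
    text = re.sub(r"\s+", " ", text).strip()     # collapse extra whitespace
    return text

print(clean_text("<p>Hello&nbsp;&amp; welcome&nbsp;  to   our FAQ!</p>"))
# -> "hello & welcome to our faq!"
```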

Chunking

Since LLMs have a finite context window (e.g. 4,096 tokens for GPT-3.5 Turbo), data is broken down into manageable chunks that fit within this window while maintaining coherence. There are various approaches to chunking, the first of which is sketched after this list:

  • Using fixed-size windows with some overlap between chunks

  • Splitting on semantic boundaries like sentences or paragraphs

  • Using progressive summarization to create hierarchical chunks
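The first approach, fixed-size windows with overlap, can be sketched in a few lines. The chunk and overlap sizes below are arbitrary example values, and production chunkers usually split on token counts rather than characters.

```python
# Minimal sketch: fixed-size chunking with overlap, measured in characters.
# Production pipelines usually count tokens instead; sizes here are arbitrary.
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start : start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

chunks = chunk_text("some long document text " * 200)
print(len(chunks), "chunks; first starts:", chunks[0][:40])
```

The overlap ensures that a sentence straddling a chunk boundary still appears intact in at least one chunk, which preserves coherence for retrieval.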

Summarizing

Long documents are condensed into summaries, capturing essential information to aid the LLMs in understanding the context without needing to parse through the entire document.
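A summarization step might look like the following sketch, which asks a hosted LLM to compress a document. The model name is an assumption, and any capable model could be used.

```python
# Minimal sketch: condense a long document with a hosted LLM.
# Assumes the `openai` package and an API key; the model name is an assumption.
from openai import OpenAI

client = OpenAI()

def summarize(document: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Summarize the document in 3 sentences."},
            {"role": "user", "content": document},
        ],
    )
    return response.choices[0].message.content

print(summarize("...long extracted document text..."))
```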

Embedding

This process involves creating vector representations of text chunks, enabling efficient retrieval of relevant information during the RAG process.
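Generating embeddings can be sketched with the sentence-transformers library. The model named below is a common lightweight choice, not a recommendation from any particular pipeline.

```python
# Minimal sketch: turn text chunks into dense vectors for similarity search.
# Assumes the `sentence-transformers` package; the model is one common choice.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

chunks = [
    "To reset your password, open Settings and choose Security.",
    "Invoices are emailed on the first business day of each month.",
]
embeddings = model.encode(chunks)  # numpy array of shape (2, 384)

print(embeddings.shape)
```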


3. Loading

The transformed text chunks and their embeddings are loaded into a vector database, optimized for fast approximate nearest neighbor search and retrieval by the RAG application. The vector database serves as the knowledge base from which the RAG system fetches the most relevant information to augment the LLM's responses. This loading phase is a crucial part of the indexing and preparation steps in the RAG workflow.

When loading data, key things to consider are (a minimal loading sketch follows this list):

  • Identifying and removing any personally identifiable information (PII) or sensitive data from the text chunks before indexing

  • Embedding dimensionality and distance metrics

  • Indexing and sharding for scalability with large datasets

  • Updating and synchronizing embeddings as new data is ingested

  • Storing and aligning chunk metadata for retrieval
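Loading itself can be sketched with the same Chroma client used earlier, storing precomputed embeddings and per-chunk metadata together. The collection name, metadata fields, and vectors below are hypothetical stand-ins.

```python
# Minimal sketch: load chunks, precomputed embeddings, and metadata into Chroma.
# Assumes `chromadb`; collection name, fields, and vectors are hypothetical.
import chromadb

client = chromadb.PersistentClient(path="./rag_index")  # on-disk storage
collection = client.get_or_create_collection(
    "knowledge_base",
    metadata={"hnsw:space": "cosine"},  # distance metric for the ANN index
)

collection.add(
    ids=["doc1-chunk0", "doc1-chunk1"],
    embeddings=[[0.1, 0.2, 0.3], [0.2, 0.1, 0.4]],  # stand-ins for real vectors
    documents=["first chunk text", "second chunk text"],
    metadatas=[
        {"source": "handbook.pdf", "chunk": 0},  # metadata aligned per chunk
        {"source": "handbook.pdf", "chunk": 1},
    ],
)
```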


Bringing It All Together: The Content Workflow Pipeline

Building an effective RAG application relies heavily on a robust ETL pipeline, which ensures that the RAG system can swiftly access the most relevant and concise information from the vector database whenever a query is made; the LLM then uses this information to generate informed, accurate, and relevant responses.

ETL tools and platforms like Graphlit provide managed components across the RAG stack to help streamline the development of these pipelines. This includes scalable data connectors for ingesting data from a wide variety of data sources.

Graphlit provides an end-to-end content workflow pipeline, which eliminates the need to integrate frameworks like LangChain or LlamaIndex, select a vector database like Pinecone or Qdrant, or build ETL pipelines from scratch. Graphlit handles all of this as a managed, usage-priced platform, optimized for RAG-based vertical AI applications and easy for developers to integrate.

By transforming unstructured data into an optimized and accessible format, companies can unlock the true potential of their data assets, leveraging them to power compelling RAG applications. The synergy between modern ETL processes, vector databases, and RAG is crucial for delivering next-generation apps that understand and interact with your company's ever-growing information.