15 VectorDB with R and Python

In order to assist individuals in locating information about occupational diseases from extensive epidemiological reports, there is a pressing need to make the content of such reports searchable. These reports are often stored as PDF files, which can be cumbersome to navigate without proper search capabilities. To address this challenge, a system is being developed to enable efficient and accurate content search within these PDF documents. The main objective is to transform these documents from static, text-based PDF files into an interactive, searchable database.

To achieve this, the project leverages both R and Python programming languages, utilizing each for their strengths in data handling and processing:

R’s role involves managing packages, performing data preprocessing, and generating reports. R is particularly effective for data manipulation and automated document generation, which are crucial for preparing the data and presenting results.
Python’s role includes extracting text from PDFs, segmenting and vectorizing the text, managing a vector database, and implementing a query-answer system. Python excels in handling extensive text processing and natural language processing tasks, making it ideal for the core functionalities of text extraction, vectorization, and database management.

The integration of R and Python allows for a robust system that can preprocess, index, and search large volumes of text efficiently. This system is not only designed to improve access to information within epidemiological reports but also to enhance the responsiveness of querying processes, aiding researchers and public health professionals in their work.

By creating a searchable database of occupational health documents, the project aims to provide a valuable tool for those researching occupational diseases, enabling them to find relevant information swiftly and effectively. This initiative represents a critical step towards leveraging technology to improve accessibility and utility of important health data.

15.1 Install Packages

This setup script installs any missing R packages and then loads them, so that the later steps (data manipulation, calls into Python, and document generation) do not fail on a missing library. Two packages matter most for this chapter: pdftools for working with PDF data and reticulate for calling Python from R. The remaining packages support text tokenization, OpenAI API access, report generation, timing, and HTTP/JSON handling.

# Install & Load R Packages (if necessary)
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("pdftools")) install.packages("pdftools")
if (!require("reticulate")) install.packages("reticulate")
if (!require("tokenizers")) install.packages("tokenizers")
if (!require("TheOpenAIR")) install.packages("TheOpenAIR")
if (!require("rmarkdown")) install.packages("rmarkdown")
if (!require("tictoc")) install.packages("tictoc")
if (!require("httr")) install.packages("httr")
if (!require("jsonlite")) install.packages("jsonlite")

library(tidyverse)
library(pdftools)
library(reticulate)
library(tokenizers)
library(TheOpenAIR)
library(rmarkdown)
library(tictoc)
library(httr)
library(jsonlite)

15.2 Reticulate Packages

R and Python

The reticulate package in R is a comprehensive toolkit designed to create a seamless integration between R and Python. This package allows you to run Python code directly from R, use Python libraries, and manage Python environments. It plays a crucial role in projects where both R’s statistical capabilities and Python’s extensive range of libraries are needed. Here’s an explanation of how reticulate is used in the script provided, followed by detailed instructions typical for a tutorial or lecture notes.

Overview of reticulate

The reticulate package bridges R and Python, enabling the use of Python scripts, modules, and virtual environments within an R session. This integration allows data scientists and analysts to leverage the strengths of both programming languages within a single workflow. Features of reticulate include:

Execution of Python Code: Directly run Python code from an R script.
Python Environment Management: Control and switch between different Python environments.
Data Conversion: Automatically convert data between R and Python, making it easy to pass data back and forth between the two languages.
Library Usage: Access and utilize Python libraries as if they were part of R’s ecosystem.

Specific Usage in the Script

Setting Up a Python Environment

The script uses reticulate to create and manage a Python virtual environment, which is an isolated environment that allows different Python projects to run with their specific dependencies:

virtualenv_create(envname = "./jinha_langchain_env")
use_virtualenv("./jinha_langchain_env")

virtualenv_create(envname = "./jinha_langchain_env"): This command creates a new Python virtual environment named jinha_langchain_env. Virtual environments are essential for managing package versions and dependencies without affecting the global Python installation.
use_virtualenv("./jinha_langchain_env"): This command activates the virtual environment, directing reticulate to use it for subsequent Python operations.
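To make the idea of an isolated environment concrete, the following is a minimal Python sketch of what creating a virtual environment amounts to. It uses the standard-library venv module as a stand-in, and the directory name demo_env is hypothetical; the chapter itself creates ./jinha_langchain_env through reticulate.

```python
import tempfile
import venv
from pathlib import Path

# Hypothetical location for the demo; the chapter uses "./jinha_langchain_env".
env_dir = Path(tempfile.mkdtemp()) / "demo_env"

# with_pip=False keeps the sketch fast; a working environment would want pip.
venv.create(env_dir, with_pip=False)

# Each virtual environment carries a pyvenv.cfg file that marks its root.
has_config = (env_dir / "pyvenv.cfg").exists()
print(has_config)
```

Packages installed into such a directory stay inside it, which is why the global Python installation is left untouched.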

Installing Python Packages

py_install(
  packages = c("langchain", "openai", "pypdf", "bs4", "python-dotenv",
               "chromadb", "tiktoken", "transformers",
               "langchain-community", "langchain-openai"),
  envname = "./jinha_langchain_env",
  pip = TRUE
)

py_install(): Installs the specified Python packages into the environment named by envname. With pip = TRUE the packages are installed with pip, as if pip install had been run inside that environment, so that all necessary libraries are available there.

Running Python Code

reticulate::py_run_string('print("Hello, world!")')

Hello, world!

Setting the OpenAI API Key in R and Python

Sys.setenv(OPENAI_API_KEY ="your key")

Transferring the API key from R to Python

api_key_for_py <- r_to_py(Sys.getenv("OPENAI_API_KEY"))

Download PDF files

if(!dir.exists("data")){dir.create("data")}
files = "https://raw.githubusercontent.com/jinhaslab/opendata/main/data/kcomwel.pdf"
download.file(files, destfile = "data/kcomwel.pdf")

15.3 LangChain

LangChain is a Python library that simplifies the development of applications involving language models and text data. It provides an interface for loading, processing, and interacting with documents, as well as integrating language models for tasks such as question answering, document summarization, and more. The library is particularly useful when dealing with large volumes of text or complex language processing workflows.

Key Features of LangChain

Document Loading and Processing: LangChain supports various methods for loading text data from different sources, including PDF files, plain text files, and more. It provides tools to handle and preprocess these documents to make them suitable for further analysis.
Integration with Language Models: It integrates seamlessly with multiple language models, particularly those from OpenAI (like GPT models). This integration allows users to leverage powerful pre-trained models for a variety of NLP tasks directly within their applications.
Modular Design: The library is built with a modular design, enabling developers to plug in different components as needed. For example, different document loaders, text processors, and model interfaces can be swapped in and out depending on the project requirements.
Ease of Use: LangChain aims to make it easier for developers to implement complex NLP tasks without needing to manage the intricate details of model interactions or text data manipulation.

PDF processing

# Load LangChain module
langchain <- import("langchain")

# Initialize a document loader for PDF files
loader <- langchain$document_loaders$PyPDFLoader("data/kcomwel.pdf")

# Load all pages of the PDF document
all_pages0 <- loader$load()

# Keep pages 67 to 100 as document objects (R lists are 1-based)
kr_all_pages <- all_pages0[67:100]

# Extract the text content of those pages
kr_page_contents = lapply(67:100, function(i) {all_pages0[[i]]$page_content})

Breakdown of the Process:

Importing LangChain: The library is imported into the environment, making its functions and classes available for use.
Document Loader: The PyPDFLoader class is used here. It specializes in loading PDF files and extracting their content. This is particularly useful for handling documents that are in PDF format and need to be converted into a format suitable for NLP tasks.
Loading Documents: The load() method reads the PDF and converts each page into a format that can be processed further. It allows for easy extraction of text data from the document.
Text Extraction: The script specifically extracts text from pages 67 to 100, focusing on a subset of the document that is relevant to the task at hand.
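Because the page range is selected from R, the indices are 1-based and inclusive. If the same selection were made directly in Python, the slice would need to shift by one. The sketch below illustrates this with a plain list of strings standing in for the loaded pages; the list contents are made up for the example.

```python
# Stand-in for loader.load(): one string per page, labelled from page 1.
all_pages0 = [f"page {n}" for n in range(1, 121)]

# R's all_pages0[67:100] is 1-based and includes both ends.
# Python slices are 0-based and exclude the end, so the same pages are [66:100].
kr_all_pages = all_pages0[66:100]

print(len(kr_all_pages))   # 34
print(kr_all_pages[0])     # page 67
print(kr_all_pages[-1])    # page 100
```

The count of 34 matches the loop bounds used later when the translated pages are written back.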

15.4 Embedding

Translate Documents

First, we will translate the contents of a document from Korean to English using the OpenAI API.

source("source/translate_text_with_openai.R")
#en_page_contents =lapply(1:34, function(i) {translate_text_with_openai(kr_page_contents[[i]])})
#saveRDS(en_page_contents, "data/en_page_contents.rds")
en_page_contents = readRDS("data/en_page_contents.rds")

for (i in 1:34){
kr_all_pages[[i]]$page_content = en_page_contents[[i]][1]
}
all_pages = kr_all_pages

Document Splitting

Split documents using LangChain’s recursive and character-based text splitters.

reticulate::py_run_string('
import openai
openai.api_key = r.api_key_for_py  
from langchain.text_splitter import RecursiveCharacterTextSplitter
my_doc_splitter_recursive = RecursiveCharacterTextSplitter()
my_split_docs = my_doc_splitter_recursive.split_documents(r.all_pages)')

Why This Code?

Python Environment Setup: The script begins by importing the openai library and setting openai.api_key from the R variable api_key_for_py. Later steps that call the API, such as creating embeddings, depend on this key.
Document Splitting with RecursiveCharacterTextSplitter: This splitter tries a list of separators in order (by default paragraph breaks, then line breaks, then spaces, then individual characters) and moves to the next, finer separator only when a piece is still longer than the chunk size. Paragraphs and sentences therefore stay together where possible. Because no arguments are passed here, the splitter’s default chunk size and overlap apply.
Purpose: Splitting documents into smaller pieces makes further processing (like embedding or machine learning analysis) more manageable and efficient. It also helps retrieval return short, relevant snippets of text rather than whole pages.
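The recursive strategy can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name recursive_split is hypothetical, and chunk overlap is left out to keep the logic short.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters,
    trying coarse separators before fine ones."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # piece still fits in the open chunk
            continue
        if current:
            chunks.append(current)       # close the open chunk
        if len(piece) <= chunk_size:
            current = piece              # start a new chunk with this piece
        else:
            # Piece alone is too long: retry it with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = ("Para one is short.\n\n"
          "Para two is a bit longer than the limit allows here.\n\n"
          "End.")
chunks = recursive_split(sample, chunk_size=30)
for c in chunks:
    print(repr(c))
```

In the sample, the short paragraphs survive intact and only the long one is broken at word boundaries, which is the behaviour the recursive approach is meant to give.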

Further Document Splitting for Detailed Analysis

reticulate::py_run_string('chunk_size = 1000
chunk_overlap = 150
from langchain.text_splitter import CharacterTextSplitter
c_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap, separator=" ")
c_split_docs = c_splitter.split_documents(r.all_pages)
print(len(c_split_docs))')

CharacterTextSplitter Parameters:

chunk_size: The maximum number of characters each document segment may contain. A size of 1000 characters is typically a good balance between having enough context for analysis and keeping the data volume manageable.
chunk_overlap: The number of characters that consecutive segments share (150 in this case). The overlap reduces the chance that text cut at a chunk boundary loses its context, which matters for retrieval and for analyses such as topic modeling or sentiment analysis.
separator: The string on which the text is split before the pieces are merged into chunks. A single space is used here, so splits fall between words rather than mid-word.
Purpose: This second splitting method is often used for creating a finer division of the text, suitable for detailed NLP tasks like sentiment analysis, topic extraction, or feeding into deep learning models where the size of the input is a limiting factor. The embedding step that follows uses my_split_docs from the recursive splitter rather than c_split_docs.
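The interaction between chunk_size and chunk_overlap can be illustrated with a sliding window over characters. This is a deliberately simplified sketch: the function chunk_with_overlap is hypothetical, and it ignores the separator, whereas CharacterTextSplitter splits on the separator first and then merges pieces.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters, each sharing
    chunk_overlap characters with the window before it."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the last window is not pure overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


text = "".join(str(i % 10) for i in range(2500))   # 2500 made-up characters
chunks = chunk_with_overlap(text, chunk_size=1000, chunk_overlap=150)

print(len(chunks))
print(chunks[0][-150:] == chunks[1][:150])
```

With the chapter's settings, each new chunk starts 850 characters after the previous one, so the last 150 characters of one chunk reappear at the start of the next.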

Embedding and Database Creation

Create embeddings for the documents and store them in a local database.

#embeding.py
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
import os

# Ensure the directory for storing the database exists
chroma_store_directory = "./db/chroma_dbs"
os.makedirs(chroma_store_directory, exist_ok=True)

# Function to create an embedding object using an API key
def get_embedding_function(api_key):
    return OpenAIEmbeddings(api_key=api_key)

# Initialize the embeddings object with the API key
api_key = os.getenv('OPENAI_API_KEY')
if not api_key:
    raise ValueError("API key not found. Please set the OPENAI_API_KEY environment variable.")

embed_object = get_embedding_function(api_key)


vectordb = Chroma.from_documents(
    documents=my_split_docs,
    embedding=embed_object,
    persist_directory=chroma_store_directory
)

# Print the number of embeddings created
print(vectordb._collection.count())

Parameters:

documents: This parameter takes my_split_docs, the list of LangChain document objects (page text plus metadata) produced by the splitter. These are the documents that will be processed and embedded into vector representations.
embedding: The embed_object supplied here is an instance of OpenAIEmbeddings or a similar class that provides a method to convert text documents into vector embeddings. This embedding object handles how text is transformed into a numerical format that can be stored and compared efficiently.
persist_directory: This is the filesystem path (chroma_store_directory) where the Chroma system will save the embedded vectors. This directory is used to store the embeddings persistently, meaning they can be reloaded and used in future sessions without needing to recompute them.
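The benefit of persisting embeddings can be shown with a small offline sketch. Everything here is an assumption made for illustration: toy_embed is a hypothetical stand-in for a real embedding model, and a JSON file stands in for Chroma's own on-disk format, which is different.

```python
import json
import tempfile
from pathlib import Path


def toy_embed(text, dim=8):
    """Hypothetical stand-in for an embedding model: counts characters
    into dim buckets. Real embeddings come from a trained model."""
    vec = [0.0] * dim
    for ch in text.lower():
        vec[ord(ch) % dim] += 1.0
    return vec


docs = ["silica dust exposure", "noise-induced hearing loss"]
store_dir = Path(tempfile.mkdtemp())        # plays the role of persist_directory
store_file = store_dir / "vectors.json"

# First session: embed once and write texts plus vectors to disk.
records = [{"text": d, "vector": toy_embed(d)} for d in docs]
store_file.write_text(json.dumps(records))

# Later session: reload the saved vectors instead of embedding again.
reloaded = json.loads(store_file.read_text())
print(len(reloaded), reloaded[0]["text"])
```

Since each call to a hosted embedding model costs time and money, reloading stored vectors is usually preferable to recomputing them.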

use_python(Sys.which("python"), required = TRUE)

py_run_string(paste0("import os; os.environ['OPENAI_API_KEY'] = '", Sys.getenv("OPENAI_API_KEY"), "'"))

# Now run your Python file
tic()
source_python("source/embeding.py")

toc()

3.355 sec elapsed

vectordb$persist()

#source_python("source/load_db.py")

Query and Retrieve Information

Perform similarity search and query answering on the embedded documents.

my_question = "what is silica exposure level in Racehorse Trainers"
result = py_run_string(sprintf('
my_question = "%s"
sim_docs = vectordb.similarity_search(my_question)
mm_docs = vectordb.max_marginal_relevance_search(my_question, k = 3, fetch_k = 5)
from langchain.chat_models import ChatOpenAI
the_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(the_llm,retriever=vectordb.as_retriever())
answer = qa_chain.run(my_question)
', my_question))

Question Setup

my_question is defined as a string that contains the query: “what is silica exposure level in Racehorse Trainers”. This question is what the system will attempt to find relevant information on from the document corpus.

Document Similarity Search

sim_docs = vectordb.similarity_search(my_question): This function call searches through the vector database (vectordb) to find documents that are semantically similar to the question posed. It returns documents that have the closest embedding vectors to the vector of the query, suggesting they contain related content.
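What "closest embedding vectors" means can be sketched without any external service. The example below assumes a toy bag-of-words embedding and cosine similarity; the function names and the three sample documents are hypothetical, and the vector store in the chapter may use a different distance measure.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumption standing in for OpenAIEmbeddings."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def similarity_search(query, docs, k=2):
    """Return the k documents whose vectors are most similar to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


docs = [
    "silica exposure among racehorse trainers",
    "hearing loss in shipyard workers",
    "crystalline silica exposure standard",
]
top = similarity_search("silica exposure level", docs, k=2)
print(top)
```

A real embedding model captures meaning rather than exact word overlap, but the ranking step works the same way: score every stored vector against the query vector and keep the best k.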

Max Marginal Relevance Search

mm_docs = vectordb.max_marginal_relevance_search(my_question, k = 3, fetch_k = 5): This search method looks for documents that are similar to the query while also keeping the results diverse. It first fetches the fetch_k documents most similar to the query (5 here) and then selects k of them (3 here), each time preferring a document that is relevant to the query but not too similar to those already chosen. This reduces redundancy in the returned documents and gives broader coverage of information related to the query.
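The selection rule behind maximal marginal relevance can be sketched as a greedy loop. This is an illustration under stated assumptions, not LangChain's code: the function mmr, the weighting parameter lambda_mult=0.5, and the two-dimensional vectors are all chosen for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=3, fetch_k=5, lambda_mult=0.5):
    """Return indices of k documents chosen by maximal marginal relevance."""
    # Step 1: keep the fetch_k candidates most similar to the query.
    by_sim = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
    candidates = by_sim[:fetch_k]
    selected = []
    # Step 2: greedily pick the candidate with the best trade-off between
    # relevance to the query and redundancy with documents already selected.
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                             default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.6, -0.8], [0.0, 1.0]]  # docs 0 and 1 are near-duplicates
picked = mmr(query, docs, k=2, fetch_k=3)
print(picked)
```

Plain similarity ranking would return the two near-duplicate documents here; the redundancy penalty makes the second pick a document that adds different information.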

Initialization of the Language Model

from langchain.chat_models import ChatOpenAI
the_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0): Here, an instance of a language model (in this case, GPT-3.5 Turbo from OpenAI) is initialized. The temperature=0 setting typically makes the model’s responses more deterministic and focused.

Retrieval-Based Question Answering

from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(the_llm, retriever=vectordb.as_retriever()): This sets up a retrieval-based question answering chain. Here, the system uses the previously initialized language model (the_llm) and a retriever setup (vectordb.as_retriever()), which utilizes the vector database to fetch relevant information that the language model will use to generate an answer.
answer = qa_chain.run(my_question): This line actually executes the retrieval and question-answering process. The language model uses the documents retrieved as context to generate an answer to the query.
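Conceptually, a retrieval-based chain of this kind retrieves passages, places them in a prompt ahead of the question, and sends that prompt to the model. The sketch below shows the flow offline. The prompt wording, the function names, and fake_llm are assumptions for illustration and do not reproduce LangChain's actual template or the ChatOpenAI interface.

```python
def build_prompt(question, retrieved_docs):
    """Place the retrieved passages ahead of the question in a single prompt."""
    context = "\n\n".join(retrieved_docs)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def fake_llm(prompt):
    """Hypothetical stand-in for ChatOpenAI so the sketch runs offline."""
    return f"(model reply to a prompt of {len(prompt)} characters)"


def answer_question(question, retriever, llm):
    """Retrieve passages, build the prompt, and ask the model."""
    docs = retriever(question)
    return llm(build_prompt(question, docs))


def demo_retriever(question):
    """Made-up retriever returning fixed passages for the demo."""
    return ["Passage about silica dust.", "Passage about horse arenas."]


reply = answer_question(
    "what is silica exposure level in Racehorse Trainers", demo_retriever, fake_llm
)
print(reply)
```

Because the model only sees what the retriever returns, the quality of the final answer depends heavily on the chunking and search steps that come before it.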

answer = py$answer

cat(answer)

The silica exposure level in Racehorse Trainers can vary based on the specific activities they are involved in, such as horse exercise activities in the circular arena. In the case of the horse exercise activities described in the context provided, the concentration of respiratory crystalline silica in total dust was approximately 1.7%. This resulted in estimated concentrations of crystalline silica exceeding the Ministry of Labor’s exposure standard of 0.05 mg/m³. Therefore, it is important for Racehorse Trainers to be aware of the silica exposure levels in their specific work environments and take necessary precautions to minimize exposure.