Using LLaMA to summarize a pdf document

Aug 22, 2023

LLaMa 2 is essentially a pretrained generative text model developed by Meta. The model's parameters range from 7 billion to 70 billion, depending on your choice, and it has been trained on a massive dataset of 1 trillion tokens. Tokens, in this context, refer to text that has been converted into numerical representations in vector space. Initially, this model was developed solely for research purposes. However, as the community has grown, Meta has also made it available for commercial purposes. LLaMa-2 consistently outperforms its competitors in various external benchmarks, demonstrating its superior capabilities in reasoning, coding, proficiency, and knowledge tests.

You can find more information about LLaMa 2 and access it at this link: LLaMa 2

Now, let's dive into how to use it. Since we'll be downloading the models, and to avoid cluttering my local workspace with model binaries, I'll be using Google Colab. In this tutorial, I'll walk you through creating a simple summarizer app using the LLaMa model.

Now lets install all the libraries:

!pip install -q transformers einops accelerate langchain bitsandbytes
!pip install pypdf

Importing all the libraries:

import subprocess
subprocess.run(["huggingface-cli", "login", "--token", "<Your Token>"])
from langchain.llms import HuggingFacePipeline
from transformers import AutoTokenizer
import transformers
import torch
import warnings
warnings.filterwarnings('ignore')

Now let’s get the LLaMA model that we want to use:

model="meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model)

## Creating a pipeline:
pipeline=transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    max_length=10000,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id
    )

Defining the model on how deterministic you want it to be aka setting the temperature value:

llm = HuggingFacePipeline(pipeline=pipeline, model_kwargs={'temperature':0})

As similar to previous post let’s integrate memory so that we are able to save the context during the conversation turns.

##Adding memory
from langchain.memory import ConversationBufferMemory
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
memory = ConversationBufferMemory()
prompt_template_name = PromptTemplate(input_variables =['content'],
                                      template="Can you summarize and polish the language for the text: {content}")
chain = LLMChain(llm=llm, prompt=prompt_template_name, memory=memory)

Now lets load the document using PyPDFLoader library.

## Document loaders
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("/content/wb-axp1.pdf")
pages = loader.load()
pages

Finally running the chain command to get the summary:

chain.run(pages[0].page_content)

output of the content:

Polishing the language of the text can help make it clearer and more concise. Here is an example of how the text could be rewritten with more refined language:
1964: AMERICAN EXPRESS FACES FINANCIAL SCANDAL
In 1964, American Express faced a major financial scandal that threatened its very existence. An investigation revealed that one of its subsidiaries, Allied Business Credit, was involved in a fraudulent scheme and had incurred significant liabilities. Allied had been operating a scheme where it would buy expensive items, such as tanks, fill them with seawater, and then sell them to unsuspecting buyers at inflated prices. When the scheme was discovered, American Express and its subsidiary filed for bankruptcy protection, leaving millions of dollars in liabilities unpaid. The news sent shockwaves through the financial industry, and American Express's stock price plummeted, losing nearly half of its value in a matter of days.
As the scandal unfolded, fears grew that American Express might not survive the fallout. Morale was low, and the company's reputation was in tatters. Shareholders launched a lawsuit against the company's CEO, Howard Clark, after he oversaw the offer of a large settlement to creditors. The move was seen as an unnecessary fulfillment of a moral obligation and was widely criticized. Despite the negative publicity, Buﬀett, a young investor who would later become one of America's most successful entrepreneurs, saw an opportunity beneath the scandal. He began his primary research, speaking to customers, vendors, and competitors to gather insights into the company's performance. His conclusion was that American Express would continue operating as usual, with little change in its use of travelers cheques and credit cards.
The rewritten text is more concise and uses more sophisticated language, making it easier to understand the events described. It also includes additional details and insights into the mindset of Warren Buffett, the young investor who saw an opportunity in the scandal.

Ciao!

lamphu’s Substack

Discussion about this post