Example: Converting Text into Dataframes¶
In this example, we'll demonstrate how to convert a text into dataframes using OpenAI Function Call. We will define the necessary data structures using Pydantic and show how to convert the text into dataframes.
Motivation
Often times when we parse data we have an opportunity to extract structured data, what if we could extract an arbitrary number of tables with arbitrary schemas? By pulling out dataframes we could write tables or .csv files and attach them to our retrieved data.
Defining the Data Structures¶
Let's start by defining the data structures required for this task: RowData
, Dataframe
, and Database
.
from pydantic import Field, BaseModel
from typing import List, Any
class RowData(BaseModel):
row: List[Any] = Field(..., description="The values for each row")
citation: str = Field(
..., description="The citation for this row from the original source data"
)
class Dataframe(BaseModel):
"""
Class representing a dataframe. This class is used to convert
data into a frame that can be used by pandas.
"""
name: str = Field(..., description="The name of the dataframe")
data: List[RowData] = Field(
...,
description="Correct rows of data aligned to column names, Nones are allowed",
)
columns: List[str] = Field(
...,
description="Column names relevant from source data, should be in snake_case",
)
def to_pandas(self):
import pandas as pd
columns = self.columns + ["citation"]
data = [row.row + [row.citation] for row in self.data]
return pd.DataFrame(data=data, columns=columns)
class Database(BaseModel):
"""
A set of correct named and defined tables as dataframes
"""
tables: List[Dataframe] = Field(
...,
description="List of tables in the database",
)
The RowData
class represents a single row of data in the dataframe. It contains a row
attribute for the values in each row and a citation
attribute for the citation from the original source data.
The Dataframe
class represents a dataframe and consists of a name
attribute, a list of RowData
objects in the data
attribute, and a list of column names in the columns
attribute. It also provides a to_pandas
method to convert the dataframe into a Pandas DataFrame.
The Database
class represents a set of tables in a database. It contains a list of Dataframe
objects in the tables
attribute.
Using the Prompt Pipeline¶
To convert a text into dataframes, we'll use the Prompt Pipeline in OpenAI Function Call. We can define a function dataframe
that takes a text as input and returns a Database
object.
import instructor
from openai import OpenAI
# Apply the patch to the OpenAI client
# enables response_model keyword
client = instructor.patch(OpenAI())
def dataframe(data: str) -> Database:
return client.chat.completions.create(
model="gpt-4-0613",
temperature=0.1,
response_model=Database,
messages=[
{
"role": "system",
"content": """Map this data into a dataframe a
nd correctly define the correct columns and rows""",
},
{
"role": "user",
"content": f"{data}",
},
],
max_tokens=1000,
)
The dataframe
function takes a string data
as input and creates a completion using the Prompt Pipeline. It prompts the model to map the data into a dataframe and define the correct columns and rows. The resulting completion is then converted into a Database
object.
Evaluating an Example¶
Let's evaluate the example by converting a text into dataframes using the dataframe
function and print the resulting dataframes.
dfs = dataframe("""My name is John and I am 25 years old. I live in
New York and I like to play basketball. His name is
Mike and he is 30 years old. He lives in San Francisco
and he likes to play baseball. Sarah is 20 years old
and she lives in Los Angeles. She likes to play tennis.
Her name is Mary and she is 35 years old.
She lives in Chicago.
On one team 'Tigers' the captain is John and there are 12 players.
On the other team 'Lions' the captain is Mike and there are 10 players.
""")
for df in dfs.tables:
print(df.name)
print(df.to_pandas())
The output will be: