Structured Data From Unstructured Data: Address Extraction with Graphlit, GPT-4 Turbo
January 23, 2024
Extract Postal Addresses
Many AI-enabled applications need to extract structured data from unstructured data, i.e. web pages, PDFs, or audio transcripts.
Extracting postal addresses from text can be difficult with existing NLP solutions, but LLMs like OpenAI GPT-4 do an excellent job with structured data extraction.
With LLMs such as OpenAI GPT-3.5 and GPT-4, they offer function calling as a way for the model to output a JSON object containing arguments. The OpenAI GPT-4 Turbo 128K (1106) and GPT-3.5 Turbo 16k (1106) also support the model calling multiple functions in parallel.
Note, the LLM does not literally call the function itself. It formats the arguments of a function call, in JSON format, so that the application can call the function themselves.
JSON Schema for Tool
Here is a JSON schema we can use for our data extraction tool. You can see that it describes the type of each property in the address, and a human-readable description of each property, which the LLM uses to understand how to map the property to the text being extracted.
Create Extraction Specification
First, we need to create a
specification to use with data extraction, and define the tools to be executed by the LLM.
Here we are using the OpenAI GPT-4 Turbo 128K model, which in our experience, provides the best quality data extraction, although being somewhat more costly and slower than the other OpenAI models. You can test different models to find the best one for your use case.
In this example, we've ingested a Web page of homes in Seattle.
We may be a realtor, and our goal is to extract all the addresses of the homes on this page.
We can use our specification above with the defined
get_address tool and extract all addresses from this web page.
As you can see, each extraction provides the JSON
value which adheres to the tool schema provided, and references the
endTime where the data was extracted from the source
We have now extracted a complete postal address from this web page, and synchronize with Google Maps or another software application.
This shows off the power of using LLMs, such as OpenAI GPT-4, for extracting structured data from unstructured data, like web pages.
Please email any questions on this tutorial or the Graphlit Platform to firstname.lastname@example.org.