Structured Data From Unstructured Data: Address Extraction with Graphlit, GPT-4 Turbo

Kirk Marple

January 23, 2024

Extract Postal Addresses

Many AI-enabled applications need to extract structured data from unstructured sources such as web pages, PDFs, or audio transcripts.

Extracting postal addresses from text can be difficult with existing NLP solutions, but LLMs like OpenAI GPT-4 do an excellent job with structured data extraction.

Graphlit now offers a GraphQL mutation, extractContents, that makes this easy.

LLMs such as OpenAI GPT-3.5 and GPT-4 offer function calling, a way for the model to output a JSON object containing the arguments for a function. The OpenAI GPT-4 Turbo 128K (1106) and GPT-3.5 Turbo 16K (1106) models also support calling multiple functions in parallel.

Note that the LLM does not literally call the function itself. It formats the arguments of a function call as JSON, so that the application can call the function itself.
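
To make that division of labor concrete, here is a minimal Python sketch of the application side. It assumes the model has already returned a tool name and a JSON string of arguments. The get_address function body, the TOOLS registry, and dispatch_tool_call are hypothetical names for illustration and are not part of the Graphlit or OpenAI APIs.

```python
import json

def get_address(streetAddress, city, state, postalCode, country):
    """Hypothetical application function: joins the parts into one line."""
    return f"{streetAddress}, {city}, {state} {postalCode}, {country}"

# Hypothetical registry mapping tool names to application functions.
TOOLS = {"get_address": get_address}

def dispatch_tool_call(name, arguments_json):
    """Parse the model-formatted JSON arguments and call the matching function."""
    if name not in TOOLS:
        raise KeyError(f"Unknown tool: {name}")
    arguments = json.loads(arguments_json)
    return TOOLS[name](**arguments)

# Stand-in for the JSON arguments a model might produce.
example_arguments = json.dumps({
    "streetAddress": "823 B NE 70th St",
    "city": "Seattle",
    "state": "WA",
    "postalCode": "98115",
    "country": "USA",
})

result = dispatch_tool_call("get_address", example_arguments)
print(result)
```

The model only supplies the name and the arguments; whether and how the function runs stays under the application's control.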


JSON Schema for Tool

Here is a JSON schema we can use for our data extraction tool. You can see that it describes the type of each property in the address, and a human-readable description of each property, which the LLM uses to understand how to map the property to the text being extracted.

{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "streetAddress": {
            "type": "string",
            "description": "The street address, including house number and street name."
        },
        "city": {
            "type": "string",
            "description": "The name of the city."
        },
        "state": {
            "type": "string",
            "description": "The name of the state or province."
        },
        "postalCode": {
            "type": "string",
            "description": "The postal or ZIP code."
        },
        "country": {
            "type": "string",
            "description": "The name of the country."
        }
    },
    "required": ["streetAddress", "city", "state", "postalCode", "country"]
}
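
If you want to sanity-check extracted values against this schema in your own code, a rough Python sketch follows. It only checks required properties and string types, so it is not a full JSON Schema validator; a dedicated validation library would be the more robust choice. The check_address helper is a hypothetical name.

```python
# Same structure as the schema above, with descriptions omitted for brevity.
ADDRESS_SCHEMA = {
    "type": "object",
    "properties": {
        "streetAddress": {"type": "string"},
        "city": {"type": "string"},
        "state": {"type": "string"},
        "postalCode": {"type": "string"},
        "country": {"type": "string"},
    },
    "required": ["streetAddress", "city", "state", "postalCode", "country"],
}

def check_address(candidate, schema=ADDRESS_SCHEMA):
    """Return a list of problems; an empty list means these limited checks passed."""
    problems = []
    for name in schema["required"]:
        if name not in candidate:
            problems.append(f"missing required property: {name}")
    for name, value in candidate.items():
        spec = schema["properties"].get(name)
        if spec is None:
            continue  # properties not described by the schema are ignored here
        if spec["type"] == "string" and not isinstance(value, str):
            problems.append(f"property {name} should be a string")
    return problems

print(check_address({"city": "Seattle", "postalCode": 98115}))
```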


Create Extraction Specification

First, we need to create a specification to use with data extraction, and define the tools to be executed by the LLM.

Here we are using the OpenAI GPT-4 Turbo 128K model, which in our experience provides the best quality data extraction, although it is somewhat more costly and slower than the other OpenAI models. You can test different models to find the best one for your use case.

Mutation:

mutation CreateSpecification($specification: SpecificationInput!) {
  createSpecification(specification: $specification) {
    id
    name
    state
    type
    serviceType
  }
}

Variables:

{
  "specification": {
    "type": "EXTRACTION",
    "serviceType": "OPEN_AI",
    "openAI": {
      "model": "GPT4_TURBO_128K_1106",
      "temperature": 0.1,
      "probability": 0.2
    },
    "tools": [
      {
        "name": "get_address",
        "description": "Extract address properties.",
        "schema": "{\"$schema\":\"http://json-schema.org/draft-07/schema#\",\"type\":\"object\",\"properties\":{\"streetAddress\":{\"type\":\"string\",\"description\":\"The street address, including house number and street name.\"},\"city\":{\"type\":\"string\",\"description\":\"The name of the city.\"},\"state\":{\"type\":\"string\",\"description\":\"The name of the state or province.\"},\"postalCode\":{\"type\":\"string\",\"description\":\"The postal or ZIP code.\"},\"country\":{\"type\":\"string\",\"description\":\"The name of the country.\"}},\"required\":[\"streetAddress\",\"city\",\"state\",\"postalCode\",\"country\"]}"
      }
    ],
    "name": "GPT-4 Extraction"
  }
}
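
Note that the schema field in the variables above is a JSON string, not a nested object, which is why every quote is escaped. Rather than escaping it by hand, you can serialize the schema in code. This is a minimal Python sketch of that step; the descriptions and address_schema names are illustrative only.

```python
import json

# Property descriptions taken from the schema shown earlier.
descriptions = {
    "streetAddress": "The street address, including house number and street name.",
    "city": "The name of the city.",
    "state": "The name of the state or province.",
    "postalCode": "The postal or ZIP code.",
    "country": "The name of the country.",
}

address_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        name: {"type": "string", "description": text}
        for name, text in descriptions.items()
    },
    "required": list(descriptions),
}

# Compact serialization; this string becomes the value of the tool's schema field.
schema_string = json.dumps(address_schema, separators=(",", ":"))

variables = {
    "specification": {
        "type": "EXTRACTION",
        "serviceType": "OPEN_AI",
        "openAI": {"model": "GPT4_TURBO_128K_1106", "temperature": 0.1, "probability": 0.2},
        "tools": [
            {
                "name": "get_address",
                "description": "Extract address properties.",
                "schema": schema_string,
            }
        ],
        "name": "GPT-4 Extraction",
    }
}

# Serializing the variables escapes the inner quotes automatically.
print(json.dumps(variables, indent=2))
```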

Response:

{
  "type": "EXTRACTION",
  "serviceType": "OPEN_AI",
  "id": "3ffd0dcd-208b-465d-afc5-66f3bef7fe40",
  "name": "GPT-4 Extraction",
  "state": "ENABLED"
}


Extract Contents

In this example, we've ingested a web page listing homes in Seattle.

Suppose we are a realtor, and our goal is to extract the addresses of all the homes on this page.

We can use our specification above with the defined get_address tool and extract all addresses from this web page.

Each extraction provides a JSON value that adheres to the tool schema, and references the pageNumber or startTime/endTime where the data was found in the source content.


Mutation:

mutation ExtractContents($prompt: String!, $filter: ContentFilter, $specification: EntityReferenceInput!) {
  extractContents(prompt: $prompt, filter: $filter, specification: $specification) {
    specification {
      id
    }
    content {
      id
    }
    value
    startTime
    endTime
    pageNumber
    error
  }
}

Variables:

{
  "prompt": "Find me all the street addresses.",
  "specification": {
    "id": "3ffd0dcd-208b-465d-afc5-66f3bef7fe40"
  }
}
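
As a sketch of how an application might package this mutation and its variables, the Python below builds a request body using the common GraphQL-over-HTTP convention of a JSON object with query and variables keys. The build_extract_request helper is a hypothetical name, and the endpoint URL and authentication are deliberately left out because they are not covered in this post.

```python
import json

EXTRACT_CONTENTS_MUTATION = """
mutation ExtractContents($prompt: String!, $filter: ContentFilter, $specification: EntityReferenceInput!) {
  extractContents(prompt: $prompt, filter: $filter, specification: $specification) {
    specification { id }
    content { id }
    value
    startTime
    endTime
    pageNumber
    error
  }
}
"""

def build_extract_request(prompt, specification_id, content_filter=None):
    """Return the JSON request body for the extractContents mutation."""
    variables = {"prompt": prompt, "specification": {"id": specification_id}}
    if content_filter is not None:
        # The filter is optional in the mutation signature, so only send it when given.
        variables["filter"] = content_filter
    return json.dumps({"query": EXTRACT_CONTENTS_MUTATION, "variables": variables})

body = build_extract_request(
    "Find me all the street addresses.",
    "3ffd0dcd-208b-465d-afc5-66f3bef7fe40",
)
print(body)
```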

Response:

[
  {
    "specification": {
      "id": "d6f3cce9-2b4c-47f1-9397-58e1f4d6d9c0"
    },
    "content": {
      "id": "726e99d2-d637-4796-8041-148e94ee37ec"
    },
    "value": "{\r\n  \"streetAddress\": \"823 B NE 70th St\",\r\n  \"city\": \"Seattle\",\r\n  \"state\": \"WA\",\r\n  \"postalCode\": \"98115\",\r\n  \"country\": \"USA\"\r\n}",
    "pageNumber": 8
  },
  {
    "specification": {
      "id": "d6f3cce9-2b4c-47f1-9397-58e1f4d6d9c0"
    },
    "content": {
      "id": "726e99d2-d637-4796-8041-148e94ee37ec"
    },
    "value": "{\r\n  \"streetAddress\": \"912 N 100th St #B\",\r\n  \"city\": \"Seattle\",\r\n  \"state\": \"WA\",\r\n  \"postalCode\": \"98133\",\r\n  \"country\": \"USA\"\r\n}",
    "pageNumber": 2
  },
  {
    "specification": {
      "id": "d6f3cce9-2b4c-47f1-9397-58e1f4d6d9c0"
    },
    "content": {
      "id": "726e99d2-d637-4796-8041-148e94ee37ec"
    },
    "value": "{\r\n  \"streetAddress\": \"Greenwood Avenue North and North 85th Street\",\r\n  \"city\": \"Seattle\",\r\n  \"state\": \"WA\",\r\n  \"postalCode\": \"98103\",\r\n  \"country\": \"USA\"\r\n}",
    "pageNumber": 3
  },
  {
    "specification": {
      "id": "d6f3cce9-2b4c-47f1-9397-58e1f4d6d9c0"
    },
    "content": {
      "id": "726e99d2-d637-4796-8041-148e94ee37ec"
    },
    "value": "{\r\n  \"streetAddress\": \"Greenwood Ave\",\r\n  \"city\": \"Seattle\",\r\n  \"state\": \"Washington\",\r\n  \"postalCode\": \"\",\r\n  \"country\": \"USA\"\r\n}",
    "pageNumber": 5
  }
]
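
Each value in the response is itself a JSON string, so the application needs to decode it before using the address. The Python sketch below shows one way to do that, and also flags results with empty fields, such as the last one above, which has no postal code. The parse_extractions and is_complete helpers are hypothetical names, and the sample reuses two entries from the response.

```python
import json

ADDRESS_FIELDS = ("streetAddress", "city", "state", "postalCode", "country")

def parse_extractions(extractions):
    """Decode the JSON string in each extraction's value field."""
    addresses = []
    for item in extractions:
        if item.get("error"):
            continue  # skip extractions that reported an error
        address = json.loads(item["value"])
        address["pageNumber"] = item.get("pageNumber")
        addresses.append(address)
    return addresses

def is_complete(address):
    """True when every address field is a non-empty string."""
    return all(
        isinstance(address.get(field), str) and address[field].strip()
        for field in ADDRESS_FIELDS
    )

sample = [
    {
        "value": '{\r\n  "streetAddress": "823 B NE 70th St",\r\n  "city": "Seattle",\r\n  "state": "WA",\r\n  "postalCode": "98115",\r\n  "country": "USA"\r\n}',
        "pageNumber": 8,
    },
    {
        "value": '{\r\n  "streetAddress": "Greenwood Ave",\r\n  "city": "Seattle",\r\n  "state": "Washington",\r\n  "postalCode": "",\r\n  "country": "USA"\r\n}',
        "pageNumber": 5,
    },
]

addresses = parse_extractions(sample)
complete = [address for address in addresses if is_complete(address)]
print(len(addresses), len(complete))
```

Filtering on completeness matters here because the schema marks every property as required, and the model may satisfy that with an empty string when the source text has no value.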


Extracted Address

We have now extracted a complete postal address from this web page, which can be synchronized with Google Maps or another software application.


{
	"streetAddress": "825 B NE 70th St",
	"city": "Seattle",
	"state": "WA",
	"postalCode": "98115",
	"country": "USA"
}
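
As one illustration of handing the address to another application, the Python sketch below turns it into a Google Maps search link. It assumes the Google Maps URLs search format (api=1 plus a query parameter), and the maps_search_url helper is a hypothetical name. A full synchronization would more likely go through a geocoding or mapping API.

```python
from urllib.parse import urlencode

def maps_search_url(address):
    """Build a Google Maps search link from an extracted address dict."""
    one_line = "{streetAddress}, {city}, {state} {postalCode}, {country}".format(**address)
    # urlencode percent-encodes the commas and spaces in the query text.
    return "https://www.google.com/maps/search/?" + urlencode({"api": 1, "query": one_line})

address = {
    "streetAddress": "825 B NE 70th St",
    "city": "Seattle",
    "state": "WA",
    "postalCode": "98115",
    "country": "USA",
}

url = maps_search_url(address)
print(url)
```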

This shows off the power of using LLMs, such as OpenAI GPT-4, for extracting structured data from unstructured data, like web pages.


Summary

Please email any questions on this tutorial or the Graphlit Platform to questions@graphlit.com.

For more information, you can read our Graphlit Documentation, visit our marketing site, or join our Discord community.