Exploring Market Intelligence Data with Graphlit, Reddit and OpenAI LLMs

Kirk Marple

October 29, 2023

Recap

Our goal is to leverage the unstructured data from Reddit, along with APIs and LLMs from Azure and OpenAI, to build a conversational knowledge graph in Graphlit. This knowledge graph can be used to explore the 'raw' unstructured data in new ways, and gather business insights for your domain-specific applications.

In the previous blog post, we showed how to ingest Reddit posts into Graphlit and assign a content workflow that extracts 'observed entities' from the extracted text and from linked web pages.

For more detail on the GraphQL queries and mutations discussed, please see our Developer Documentation.

We used the createFeed mutation to read 1000 Reddit posts (and comments) from r/ArtificialInteligence.

// Mutation:
mutation CreateFeed($feed: FeedInput!) {
  createFeed(feed: $feed) {
    id
    name
    state
    type
  }
}

// Variables:
{
  "feed": {
    "type": "REDDIT",
    "reddit": {
      "subredditName": "ArtificialInteligence",
      "readLimit": 1000
    },
    "workflow": {
      "id": "d25193ed-aadf-4010-8298-12ead33b2335"
    },
    "schedulePolicy": {
      "recurrenceType": "ONCE"
    },
    "name": "Reddit Feed"
  }
}

// Response:
{
  "type": "REDDIT",
  "id": "6227d3ac-e3a5-4b1e-b638-2a8ab00a0745",
  "name": "Reddit Feed",
  "state": "ENABLED"
}
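If you are integrating from code, the mutation above can be sketched in Python with the `requests` library. This is a minimal sketch: the endpoint URL and the way `session` is authorized are placeholders, not documented API values — see the Developer Documentation for the real endpoint and authentication scheme.

```python
CREATE_FEED_MUTATION = """
mutation CreateFeed($feed: FeedInput!) {
  createFeed(feed: $feed) { id name state type }
}
"""

def build_reddit_feed_input(subreddit: str, read_limit: int,
                            workflow_id: str, name: str) -> dict:
    """Assemble the FeedInput variables shown above."""
    return {
        "feed": {
            "type": "REDDIT",
            "reddit": {"subredditName": subreddit, "readLimit": read_limit},
            "workflow": {"id": workflow_id},
            "schedulePolicy": {"recurrenceType": "ONCE"},
            "name": name,
        }
    }

def create_feed(session, api_url: str, variables: dict) -> dict:
    """POST the mutation; `session` is e.g. a requests.Session
    preconfigured with your authorization headers (placeholder)."""
    response = session.post(
        api_url,
        json={"query": CREATE_FEED_MUTATION, "variables": variables},
    )
    response.raise_for_status()
    return response.json()["data"]["createFeed"]
```

Calling `build_reddit_feed_input("ArtificialInteligence", 1000, "d25193ed-aadf-4010-8298-12ead33b2335", "Reddit Feed")` reproduces the variables payload above.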


Explore with LLMs

Now that we have the Reddit posts extracted, and the knowledge graph created in Graphlit, we can use that data to extract insights and repurpose it for various needs, like summarization or Tweets.

First, we use the createSpecification mutation to create an LLM specification, which configures the LLM service (Azure OpenAI), the LLM model (GPT-3.5 Turbo 16k), and parameters such as the completion token limit. Setting completionTokenLimit to 1024 tells the LLM to produce at most 1024 tokens in its response.

Also, we have set embedCitations to true, which means we will receive an array of cited content sources with our completed conversation message. This lets us refer back to the Reddit post, web page, etc. that the LLM referenced in its response.

// Mutation:
mutation CreateSpecification($specification: SpecificationInput!) {
  createSpecification(specification: $specification) {
    id
    name
    state
    type
    serviceType
  }
}

// Variables:
{
  "specification": {
    "type": "COMPLETION",
    "serviceType": "AZURE_OPEN_AI",
    "searchType": "VECTOR",
    "azureOpenAI": {
      "model": "GPT35_TURBO_16K",
      "temperature": 0.1,
      "probability": 0.2,
      "completionTokenLimit": 1024
    },
    "strategy": {
      "strategyType": "WINDOWED",
      "contentLimit": 10,
      "messageLimit": 10,
      "embedCitations": true
    },
    "name": "Reddit Specification"
  }
}

// Response:
{
  "type": "COMPLETION",
  "serviceType": "AZURE_OPEN_AI",
  "id": "55a5cfe1-e956-4190-860b-f58631f586aa",
  "name": "Reddit Specification",
  "state": "ENABLED"
}
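The specification variables can be built programmatically as well. A minimal sketch, assuming you only vary the token limit and the windowing parameters — the helper name and its defaults are illustrative, not part of the Graphlit API:

```python
def build_specification_input(
    name: str,
    model: str = "GPT35_TURBO_16K",
    completion_token_limit: int = 1024,
    content_limit: int = 10,
    message_limit: int = 10,
) -> dict:
    """Assemble the SpecificationInput variables shown above."""
    if completion_token_limit <= 0:
        raise ValueError("completionTokenLimit must be positive")
    return {
        "specification": {
            "type": "COMPLETION",
            "serviceType": "AZURE_OPEN_AI",
            "searchType": "VECTOR",
            "azureOpenAI": {
                "model": model,
                "temperature": 0.1,
                "probability": 0.2,
                "completionTokenLimit": completion_token_limit,
            },
            "strategy": {
                "strategyType": "WINDOWED",
                "contentLimit": content_limit,
                "messageLimit": message_limit,
                "embedCitations": True,
            },
            "name": name,
        }
    }
```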


Locate Organization via Faceted Query

Next, for this example, we want to ask the GPT model only about content sources where the Organization "OpenAI" was observed (via entity extraction).

We can query content facets to retrieve an ordered list of faceted observable entities, and their counts.

In this example, the Organization "OpenAI" was referenced 51 times across all ingested content.

Even more interestingly, we can use this entity in a content filter to return only the content where the Organization "OpenAI" was mentioned.

Note the Organization ID d913c05e-22fb-49ef-ad51-eddc0e4d7444; we'll use it in the next step.


// Query:
query QueryContents($filter: ContentFilter!, $facets: [ContentFacetInput]) {
  contents(filter: $filter, facets: $facets) {
    facets {
      facet
      count
      type
      observable {
        type
        observable {
          id
          name
        }
      }
    }
  }
}

// Variables:
{
  "facets": [
    {
      "facet": "OBSERVABLE"
    }
  ]
}

// Response:
{
  "facets": [
    // ...
    {
      "facet": "OBSERVABLE",
      "observable": {
        "type": "ORGANIZATION",
        "observable": {
          "name": "OpenAI",
          "id": "d913c05e-22fb-49ef-ad51-eddc0e4d7444"
        }
      },
      "type": "OBJECT",
      "count": 51
    },
	// ...
  ]
}
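Picking the entity ID out of the facets response is easy to script. A small sketch (the helper name is our own) that scans the facets array for a named observable:

```python
def find_observable_id(facets: list, entity_type: str, name: str):
    """Scan a faceted contents response for a named observable entity,
    returning its ID and mention count."""
    for facet in facets:
        observable = facet.get("observable") or {}
        if (observable.get("type") == entity_type
                and observable.get("observable", {}).get("name") == name):
            return observable["observable"]["id"], facet.get("count")
    return None, 0

# The relevant entry from the response above:
facets = [
    {
        "facet": "OBSERVABLE",
        "observable": {
            "type": "ORGANIZATION",
            "observable": {
                "name": "OpenAI",
                "id": "d913c05e-22fb-49ef-ad51-eddc0e4d7444",
            },
        },
        "type": "OBJECT",
        "count": 51,
    },
]

org_id, count = find_observable_id(facets, "ORGANIZATION", "OpenAI")
# org_id → "d913c05e-22fb-49ef-ad51-eddc0e4d7444", count → 51
```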


Create Filtered Conversation

Now we have an LLM specification, which describes how we want the LLM to handle our user prompt, and the Organization entity we want to filter on.

We can create a conversation that uses that specification and sets up an observations filter for that organization.

Essentially, we are now having a conversation with a 'slice' of the Graphlit knowledge graph, filtered to only the content that has an 'edge' in the graph to the OpenAI Organization entity.


// Mutation:
mutation CreateConversation($conversation: ConversationInput!) {
  createConversation(conversation: $conversation) {
    id
    name
    state
    type
  }
}

// Variables:
{
  "conversation": {
    "type": "CONTENT",
    "filter": {
      "observations": [
        {
          "type": "ORGANIZATION",
          "observable": {
            "id": "d913c05e-22fb-49ef-ad51-eddc0e4d7444"
          }
        }
      ]
    },
    "specification": {
      "id": "55a5cfe1-e956-4190-860b-f58631f586aa"
    },
    "name": "Reddit Conversation"
  }
}

// Response:
{
  "type": "CONTENT",
  "id": "c3ffba04-eadf-49e0-9e6b-6057de88ecea",
  "name": "Reddit Conversation",
  "state": "OPENED"
}


Use LLM for Sentiment Analysis

A common use for LLM-based market intelligence is asking for sentiment or perception from the text provided to the LLM.

In this case, we're asking the question: "What is the overall perception of OpenAI, as mentioned in these Reddit posts?"

We can now initiate the conversation by calling promptConversation, passing the newly created conversation ID and our user prompt.

The prompt field is an example of what an end-user may have typed into your chatbot text field.

In the results, we receive a list of citations, which cross-reference the paragraphs of the completion with the content sources they were extracted from or referenced by.



// Mutation:
mutation PromptConversation($prompt: String!, $id: ID) {
  promptConversation(prompt: $prompt, id: $id) {
    conversation {
      id
    }
    message {
      role
      author
      message
      citations {
        content {
          id
        }
        index
        startTime
        endTime
        pageNumber
      }
      tokens
      completionTime
      timestamp
    }
    messageCount
  }
}

// Variables:
{
  "prompt": "What is the overall perception of OpenAI, as mentioned in these Reddit posts?",
  "id": "c3ffba04-eadf-49e0-9e6b-6057de88ecea"
}

// Response:
{
  "conversation": {
    "id": "c3ffba04-eadf-49e0-9e6b-6057de88ecea"
  },
  "message": {
    "role": "ASSISTANT",
    "message": "OpenAI may be losing its lead in AI for the first time since the GPT-2 release in 2019 [0]\n\nOpenAI had to stop working on a new model, codenamed Arrakis, that would presumably make ChatGPT run more efficiently [0]\n\nOpenAI is making $1.3 billion in ARR and has been releasing new models like GPT-4 vision and DALL-E 3 [0]\n\nGoogle's Gemini is posited to beat GPT-4 and potentially surpass OpenAI [0]\n\nThe future of AI is uncertain and the path to AGI is full of obstacles [0]\n\nLeading the AI race is a double-edged sword for OpenAI [0]\n\nOpenAI has some new data to innovate with [1]\n\nAI is rapidly changing various industries and our day-to-day activities [3]\n\nAI's impact on the economy is a concern as more jobs are being overtaken by AI [3]\n\nOpenAI has revised its core values to prioritize building AGI [6]\n\n63% of Americans want regulation to actively prevent superintelligent AI [9]\n\nGPT-4 with vision has limitations in image analysis and faces challenges in accuracy, privacy, and bias [11]",
    "citations": [
      {
        "content": {
          "id": "2080977f-2970-4a32-b389-c8b66d4b951d"
        },
        "index": 0
      },
      {
        "content": {
          "id": "4d9b9222-14ed-4de7-b4df-a040a301076d"
        },
        "index": 1
      },
      {
        "content": {
          "id": "3f03974b-857c-4112-b4d2-cc2f8ff4027b"
        },
        "index": 2
      },
      {
        "content": {
          "id": "5e72648b-7ef6-486c-8daf-a92eccc24f40"
        },
        "index": 3
      },
      // ...
    ],
    "tokens": 611,
    "completionTime": "PT9.2027833S",
    "timestamp": "2023-10-30T04:45:49.299Z"
  },
  "messageCount": 2
}
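Since embedCitations links each bracketed marker like [0] in the completion to an entry in the citations array, you can resolve markers to content IDs when rendering the response. A sketch (the helper name is our own, and the sample message is abbreviated from the response above):

```python
import re

def resolve_citations(message: str, citations: list) -> dict:
    """Map each [n] marker in the completion to its cited content ID."""
    by_index = {c["index"]: c["content"]["id"] for c in citations}
    referenced = {int(m) for m in re.findall(r"\[(\d+)\]", message)}
    return {i: by_index[i] for i in sorted(referenced) if i in by_index}

message = ("OpenAI may be losing its lead in AI [0]\n\n"
           "OpenAI has some new data to innovate with [1]")
citations = [
    {"content": {"id": "2080977f-2970-4a32-b389-c8b66d4b951d"}, "index": 0},
    {"content": {"id": "4d9b9222-14ed-4de7-b4df-a040a301076d"}, "index": 1},
]

resolve_citations(message, citations)
# markers 0 and 1 map to the two content IDs above
```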


Explore Content Citations

We take the first citation ID 2080977f-2970-4a32-b389-c8b66d4b951d and query for all the relevant content fields.

This will be useful for displaying the cited sources in the UI of an application built on top of Graphlit. We can provide the raw extracted text from the Reddit post, or a JSON representation of text extracted from documents, web pages, or audio transcripts, which you can render into your own UI.

Also, you can see the filtered organization "OpenAI" below in the list of observed entities.


// Query:
query QueryContents($filter: ContentFilter!) {
  contents(filter: $filter) {
    results {
      id
      state
      document {
        title
        author
      }
      observations {
        type
        observable {
          id
          name
        }
      }
      text
      originalDate
      type
      fileType
    }
  }
}

// Variables:
{
  "filter": {
    "id": "2080977f-2970-4a32-b389-c8b66d4b951d"
  }
}

// Response:
{
  "results": [
    {
      "type": "POST",
      "originalDate": "2023-10-18T16:53:20Z",
      "text": "[News] Posted by u/AlbertoRomGar on Reddit: OpenAI may be losing its lead in AI for the first time since the GPT-2 release in 2019\n\nWill Google retake the AI throne with Gemini before 2023 ends? The AI race is at a tipping point.\n\nThe Information scooped yesterday that in mid-2023 OpenAI had to stop working on a new model, codenamed Arrakis, that would presumably make ChatGPT run more efficiently.\n\nThis was the company's main ongoing development after finishing GPT-4 in the summer of 2022.\n\nWhy did they stop? Because the model didn't work as expected. In a space where a gap of a few months is the difference between being the leader or not, this was an important setback for OpenAI.\n\nThe young startup is more than fine - this isn't a life-or-death situation. It's making $1.3 billion in ARR, has been releasing new juicy models like GPT-4 vision, DALL-E 3, etc., and has more aces up the sleeve for the DevDay conference on November 6th.\n\nBut this \"failure\" (I guess we can say that if we compare it with the flawless 4-year run it's had so far!) may allow Google to surpass OpenAI with Gemini, which is posited to beat GPT-4.\n\nIf that happens, we can take away two insights from all this:\n\n* Nothing is certain about the future of AI. No one knows which steps are safer or more promising on our way toward AGI. OpenAI's \"aura of invincibility\" as The Information puts it, was an outlier feature that lasted much more than anyone expected. AGI is somewhere ahead of us, awaiting patiently, but the path there is full of obstacles we must overcome. OpenAI couldn't get its next release on time and that's both illuminating and humbling.\n* Leading the AI race is a double-edged sword. OpenAI is seen by everyone as the favorite candidate. If it doesn't manage to overcome this hurdle and maintain its leadership or release soon something much better than GPT-4, people will cease to praise it so fervently. 
They claimed Google dead after ChatGPT was announced - who knows if now they will predict OpenAI's death. OpenAI will have to solve its technical problems with the model, its business pressures with Microsoft, and its social complaints with the general public.\n\nWhat do you think, will OpenAI keep or lose its leadership before 2023 ends?\n\nSources:\n\n[https://www.theinformation.com/articles/openai-dropped-work-on-new-arrakis-ai-model-in-rare-setback](https://www.theinformation.com/articles/openai-dropped-work-on-new-arrakis-ai-model-in-rare-setback)\n\n[https://www.theinformation.com/articles/openais-revenue-crossed-1-3-billion-annualized-rate-ceo-tells-staff](https://www.theinformation.com/articles/openais-revenue-crossed-1-3-billion-annualized-rate-ceo-tells-staff)\n\n[https://openai.com/blog/announcing-openai-devday](https://openai.com/blog/announcing-openai-devday)\n\n*I cover timely events in AI* [*here*](https://thealgorithmicbridge.substack.com/)*.*\n\nu/gopietz commented:\n\nOh honestly, just shut up.\n\nWhat's the point of publishing this bullshit. Creating artificial pressure where there is none. For the time being OpenAI is so far ahead, they could take a nap for 6 months. Instead they're pushing a faster version of GPT-4, GPT-4 Vision, DALLE3, functions, web browsing, plugins.\n\nMeanwhile Google Bard is losing against open source models and AWS had to invests billions to take a far off second place.\n\nWhat the hell are you talking about?\n\nu/agm1984 commented:\n\nOpenAI has some new data from which to innovate, so that bodes positive at least.\n\nu/Ion_GPT commented:\n\nAnother post to promote Google.\n\n",
      "document": {
        "title": "OpenAI may be losing its lead in AI for the first time since the GPT-2 release in 2019",
        "author": "AlbertoRomGar"
      },
      "observations": [
        // ...
        {
          "type": "ORGANIZATION",
          "observable": {
            "name": "Google",
            "id": "86c66ed2-8f3b-48cb-a5f7-247f009b15f8"
          }
        },
        {
          "type": "ORGANIZATION",
          "observable": {
            "name": "OpenAI",
            "id": "d913c05e-22fb-49ef-ad51-eddc0e4d7444"
          }
        },
        // ...
      ],
      "id": "2080977f-2970-4a32-b389-c8b66d4b951d",
      "state": "FINISHED"
    }
  ]
}


You can continue the conversation with this filtered set of data in the knowledge graph by calling promptConversation again with the same conversation ID; the completed messages will be appended to the existing conversation.

In this case, we're asking the question: "Can you elaborate on the risks to OpenAI, in multiple long paragraphs?"

Here we're showing an example of generating a longer-format response, elaborating on the risks to OpenAI that are relevant to the Reddit content sources.



// Mutation:
mutation PromptConversation($prompt: String!, $id: ID) {
  promptConversation(prompt: $prompt, id: $id) {
    conversation {
      id
    }
    message {
      role
      author
      message
      citations {
        content {
          id
        }
        index
        startTime
        endTime
        pageNumber
      }
      tokens
      completionTime
      timestamp
    }
    messageCount
  }
}

// Variables:
{
  "prompt": "Can you elaborate on the risks to OpenAI, in multiple long paragraphs?",
  "id": "c3ffba04-eadf-49e0-9e6b-6057de88ecea"
}

// Response:
{
  "conversation": {
    "id": "c3ffba04-eadf-49e0-9e6b-6057de88ecea"
  },
  "message": {
    "role": "ASSISTANT",
    "message": "OpenAI may be losing its lead in AI for the first time since the GPT-2 release in 2019. The Information reported that OpenAI had to stop working on a new model, codenamed Arrakis, due to it not working as expected. This setback could allow Google to surpass OpenAI with Gemini, which is posited to beat GPT-4. This highlights that nothing is certain about the future of AI and that leading the AI race is a double-edged sword. [0][1]\n\nOpenAI's GPT-4 with vision still has flaws, according to a recent paper. The system has limitations in accurately interpreting images and understanding certain symbols. It also struggles with identifying dangerous substances and exhibits bias in relating responses to body weight and body positivity. Despite these flaws, the vision capabilities of GPT-4 are seen as valuable for data visualizations. [3]\n\nUK Prime Minister Rishi Sunak has warned that AI poses a risk on par with pandemics or nuclear war. He emphasized the need to treat the human extinction threat from AI with the same level of concern. AI is feared to escalate the frequency and complexity of cyber attacks, frauds, and scams by 2025. Additionally, AI could aid in creating bioweapons, enhance attack planning, and spread propaganda. To address these concerns, Sunak called for the formation of a global expert panel and the establishment of the world's first AI safety institute.",
    "citations": [
      {
        "content": {
          "id": "2080977f-2970-4a32-b389-c8b66d4b951d"
        },
        "index": 0
      },
      {
        "content": {
          "id": "cb98e3ac-6592-4d2c-ae82-bb24f7d23b34"
        },
        "index": 1
      },
      {
        "content": {
          "id": "d7990f54-ec0b-49c7-99e1-a0a9b270b4f6"
        },
        "index": 2
      },
      {
        "content": {
          "id": "5e72648b-7ef6-486c-8daf-a92eccc24f40"
        },
        "index": 3
      },
      {
        "content": {
          "id": "ccff8712-b158-4c74-b292-f9b70d6da5d8"
        },
        "index": 4
      }
    ],
    "tokens": 427,
    "completionTime": "PT6.2617331S",
    "timestamp": "2023-10-30T04:49:29.008Z"
  },
  "messageCount": 4
}


Repurpose as Tweet

In this case, we're asking the question: "Please summarize those risks into a single paragraph, which can fit into the length of tweet."


You will need to guide the LLM on the length of the response so that it fits within the maximum number of characters for a Tweet.

(NOTE: Any examples shown are not personal or corporate opinions, and are based solely on Reddit posts ingested into Graphlit.)

From the resulting message, you can copy, paste, and edit the text into an X (formerly Twitter) post.



// Mutation:
mutation PromptConversation($prompt: String!, $id: ID) {
  promptConversation(prompt: $prompt, id: $id) {
    conversation {
      id
    }
    message {
      role
      author
      message
      citations {
        content {
          id
        }
        index
        startTime
        endTime
        pageNumber
      }
      tokens
      completionTime
      timestamp
    }
    messageCount
  }
}

// Variables:
{
  "prompt": "Please summarize those risks into a single paragraph, which can fit into the length of tweet.",
  "id": "c3ffba04-eadf-49e0-9e6b-6057de88ecea"
}

// Response:
{
  "conversation": {
    "id": "c3ffba04-eadf-49e0-9e6b-6057de88ecea"
  },
  "message": {
    "role": "ASSISTANT",
    "message": "AI poses a global risk equivalent to pandemics and nuclear war, with major security threats expected within two years. It could escalate cyber attacks, aid in creating bioweapons, enhance attack planning, and spread propaganda. The UK government is taking measures to tackle the AI threat, including the formation of a global expert panel and the establishment of the world's first AI safety institute. UK Prime Minister Rishi Sunak has called for a global AI summit to address the threats and opportunities of AI. [0]",
    "citations": [
      {
        "content": {
          "id": "ccff8712-b158-4c74-b292-f9b70d6da5d8"
        },
        "index": 0
      }
    ],
    "tokens": 145,
    "completionTime": "PT2.1645656S",
    "timestamp": "2023-10-30T04:52:58.608Z"
  },
  "messageCount": 6
}
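Before posting, it's worth checking that the completion actually fits. A quick sketch, assuming the standard 280-character limit for an X post and that you strip citation markers like [0] before posting:

```python
import re

TWEET_LIMIT = 280  # character limit for a standard X (Twitter) post

def fits_in_tweet(message: str) -> bool:
    """Strip citation markers, then check the remaining text length."""
    text = re.sub(r"\s*\[\d+\]", "", message).strip()
    return len(text) <= TWEET_LIMIT
```

The completion above runs well past 280 characters, so as noted, it would still need manual trimming before posting.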


Summary

With these building blocks, we've shown how to query facets for ingested content, which can be used to chart the highest-frequency observed entities and to select entities for further filtering.

From these filtered sets of content, you can create conversations over that subset and use any configured LLM (via a specification) to ask questions, repurpose content into new formats, and extract additional knowledge from the originally ingested data.

We look forward to hearing how you make use of Graphlit and our conversational knowledge graph in your applications.

If you have any questions, please reach out to us here.