The power of AI is its ability to turn unstructured data into structured data
So why are we spending all our time generating unstructured data?
People are intrigued by LLMs’ ability to generate new text, change styles, and generally chat. While impressive, turning text written in one style into text written in your own voice (or into legalese) is essentially taking unstructured data and turning it into more unstructured data. I’m surprised by the lack of attention to doing the exact opposite: turning unstructured data into structured data.
There are many examples of generating the verbiage in legal docs, but imagine understanding the document and creating structured data, such as:
{
  "type": "NDA",
  "parties": ["Party 1", "Party 2"],
  "arbitration_state": "DE",
  "date_of_signature": "1/1/23",
  "expires_on": "1/1/25"
}
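A minimal sketch of this extraction step: build a prompt asking the model to return JSON, then parse its reply. The prompt wording and the `llm` callable are assumptions, not a real API; the call is stubbed with a canned response so the sketch runs on its own.

```python
import json

# Hypothetical extraction prompt; the field list mirrors the JSON above.
PROMPT = """Extract the following fields from the contract below and
return them as a JSON object: type, parties, arbitration_state,
date_of_signature, expires_on.

Contract:
{document}"""

def extract_fields(document: str, llm=None) -> dict:
    """Send the document to an LLM and parse its JSON reply.

    `llm` is any callable taking a prompt string and returning the
    model's text completion (e.g. a thin wrapper around an API client).
    """
    if llm is None:
        # Stand-in response so the sketch runs without an API key.
        llm = lambda prompt: (
            '{"type": "NDA", "parties": ["Party 1", "Party 2"], '
            '"arbitration_state": "DE", "date_of_signature": "1/1/23", '
            '"expires_on": "1/1/25"}'
        )
    reply = llm(PROMPT.format(document=document))
    return json.loads(reply)  # fails loudly if the model returned non-JSON

record = extract_fields("...full NDA text...")
print(record["expires_on"])
```

In practice you would validate the parsed dict against a schema before loading it into a database, since nothing guarantees the model returns well-formed JSON.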
At best, someone is typing that into a database; at worst, it’s just sitting in the document. But with this type of structured data, we can now easily query the database. As an example, we could compile a list of all of the NDAs which are coming up for expiration in the next three months.
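Once the extracted records land in a database, that expiration query is ordinary SQL. A toy version with an in-memory SQLite table (the table layout, party names, and ISO date format are all assumptions for illustration):

```python
import sqlite3
from datetime import date, timedelta

# Toy table of extracted NDA records; dates stored as ISO strings.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ndas (party TEXT, expires_on TEXT)")
conn.executemany(
    "INSERT INTO ndas VALUES (?, ?)",
    [("Acme Corp", "2023-02-15"), ("Globex", "2024-06-30")],
)

today = date(2023, 1, 1)  # fixed "today" so the example is deterministic
cutoff = today + timedelta(days=90)

# NDAs expiring within roughly the next three months.
rows = conn.execute(
    "SELECT party FROM ndas WHERE expires_on BETWEEN ? AND ?",
    (today.isoformat(), cutoff.isoformat()),
).fetchall()
print(rows)
```

Storing dates as ISO strings keeps the `BETWEEN` comparison correct, since lexicographic and chronological order coincide.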
We can do this type of data operation retroactively, which means we don’t have to know in advance what information we’d like to extract. If, years after receiving the original documents, you have a new question to ask of your data, you can use these natural language understanding techniques to extract the additional fields and then run your queries.
Generating new documents in different styles is a neat parlor trick and has real uses. But the ability to understand text well seems at least as important as generating new (often mediocre) content, and I suspect the true productivity gains will come from turning unstructured data into structured data we can manipulate and query.
Take it one step further: you can skip generating the structured data entirely. With LLMs, you can query unstructured data directly, and they’ll only get better at this over time.
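A sketch of that direct-query idea: hand the raw document and the question to the model in a single prompt, with no intermediate JSON. The prompt format and the `llm` callable are assumptions; a canned model stands in for a real API client.

```python
def ask(document: str, question: str, llm) -> str:
    """Answer a question directly from unstructured text via an LLM."""
    prompt = (
        f"Document:\n{document}\n\n"
        f"Question: {question}\n"
        "Answer briefly using only the document."
    )
    return llm(prompt)

# Canned model for illustration only; swap in a real client in practice.
answer = ask(
    "...full NDA text...",
    "When does this NDA expire?",
    llm=lambda p: "1/1/25",
)
print(answer)
```

The trade-off versus extracting structured data first: each question costs a model call over the full document, and you lose the ability to run cheap, repeatable SQL across thousands of records.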
The key aspect is "ability to understand." Depending on how the LLM is trained and the context, the structured data might be misleading. For example, I asked ChatGPT to create a JSON structure based on these two statements: "Apple is good. Let's invest in it." It came up with the following:
{
  "statements": [
    {
      "text": "Apple is good.",
      "sentiment": "positive"
    },
    {
      "text": "Let's invest in it.",
      "sentiment": "positive"
    }
  ]
}
Which is nice. But because my input lacked context, I still need to provide additional metadata.
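One way to supply that missing metadata is to put the context directly into the prompt. A sketch, where the prompt format and the `llm` callable are assumptions and a canned model stands in for a real one:

```python
def classify(statement: str, context: str, llm) -> str:
    """Classify sentiment of a statement, given explicit context."""
    prompt = (
        f"Context: {context}\n"
        f"Statement: {statement}\n"
        "Return the sentiment as a single lowercase word."
    )
    return llm(prompt)

# Canned model for illustration; with real context a real model could
# distinguish AAPL the stock from apple the fruit.
result = classify(
    "Apple is good.",
    "We are discussing whether to buy AAPL stock.",
    llm=lambda p: "positive",
)
print(result)
```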