I knew LLMs were good at turning unstructured data into structured data. However, it’s always interesting to see what happens when theory meets practice and real-world messiness.
I’m working on a project that extracts art show events as structured JSON from a variety of web pages. It’s been interesting to find all the edge cases and iterate on patterns until the process is refined and reasonably consistent.
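To make the "structured JSON" step concrete, here is a minimal sketch of validating one event object returned by an LLM before it enters the rest of the pipeline. The `ArtShowEvent` fields (`title`, `venue`, `start_date`, `end_date`) and the `parse_event` helper are hypothetical, chosen for illustration only; they are not the project's actual schema.

```python
import json
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ArtShowEvent:
    # Hypothetical fields; a real schema would likely carry more.
    title: str
    venue: str
    start_date: date
    end_date: Optional[date] = None


def _parse_iso_date(value: object, field: str) -> date:
    """Accept only ISO 8601 date strings such as '2025-03-01'."""
    if not isinstance(value, str):
        raise ValueError(f"{field} must be an ISO date string")
    return date.fromisoformat(value)  # raises ValueError on malformed input


def parse_event(raw: str) -> ArtShowEvent:
    """Parse one JSON object and reject anything that fails basic checks."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    title = data.get("title")
    venue = data.get("venue")
    if not isinstance(title, str) or not title.strip():
        raise ValueError("title is missing or empty")
    if not isinstance(venue, str) or not venue.strip():
        raise ValueError("venue is missing or empty")

    start = _parse_iso_date(data.get("start_date"), "start_date")
    end = None
    if data.get("end_date") is not None:
        end = _parse_iso_date(data["end_date"], "end_date")
        if end < start:
            raise ValueError("end_date is before start_date")

    return ArtShowEvent(title.strip(), venue.strip(), start, end)
```

Every failure surfaces as a `ValueError`, so a caller can catch one exception type and then retry the extraction or log the page for review.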
Sometimes I’m improving the LLM side. Often, though, I’m improving basic data processing code, such as normalizing titles in a consistent way or using similarity techniques to identify items that should be merged.
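As an illustration of that kind of cleanup, here is a minimal sketch using only the Python standard library. The function names, the 0.9 threshold, and the choice of `difflib.SequenceMatcher` as the similarity measure are all assumptions for the example, not the project's actual approach.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def is_probable_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Compare normalized titles with a character-level similarity ratio."""
    ratio = SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()
    return ratio >= threshold


def merge_candidates(titles: list[str], threshold: float = 0.9) -> list[list[str]]:
    """Greedily group each title with the first group whose lead title it resembles."""
    groups: list[list[str]] = []
    for title in titles:
        for group in groups:
            if is_probable_duplicate(group[0], title, threshold):
                group.append(title)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([title])
    return groups
```

For example, `merge_candidates(["Spring Open Studio", "spring open studio!", "Winter Sculpture Fair"])` puts the first two titles in one group and the third in its own. The threshold would need tuning against real data, and comparing titles alone can merge distinct events that share a name, so venue and dates are worth checking too.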
The LLM still makes plenty of mistakes, but I think I’ll be able to get things pretty stable. It’s interesting that some of the newer OpenAI models no longer expose a temperature parameter in the API.