This is a research project focused on helping AI coding agents retrieve pesticide label PDFs from the web and convert them into structured JSON. The goal is to turn difficult, inconsistent regulatory documents into a cross-linked, searchable dataset that software can reason about directly.
The dataset is organized around products, crops, chemicals, and pests. Those four layers are what ultimately let the app answer practical questions: which products are sold for a crop, which active ingredients they contain, and which pests they target.
Products are the market-facing label artifacts. They carry formulation details, registrants, label text, directions, restrictions, and application data. Crops are normalized so that broad label language can be mapped onto a stable set of crop concepts. Chemicals are organized around the resistance-action systems used in the industry. Pests are indexed so that label claims can be connected to specific organisms and then back to crops and products.
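To make the four-layer structure concrete, here is a minimal sketch of how such records might link together and support a practical query. All field names, cname values, and the example product are illustrative assumptions, not the project's actual schema:

```python
# Hypothetical record shapes for the four layers; every field name and
# cname value here is an illustrative assumption.
product = {
    "cname": "examplemax-250-ec",          # canonical name ("cname")
    "registrant": "Example Agro Ltd",
    "formulation": "EC",                   # emulsifiable concentrate
    "actives": ["lambda-cyhalothrin"],     # cnames of chemical records
    "applications": [
        {"crop": "apple", "pest": "codling-moth", "rate": "150 mL/ha"},
    ],
}
crop = {"cname": "apple", "scientific_name": "Malus domestica"}
chemical = {"cname": "lambda-cyhalothrin", "irac_group": "3A"}
pest = {"cname": "codling-moth", "scientific_name": "Cydia pomonella"}

def products_for_crop(products, crop_cname):
    """A practical question becomes a join over the layers:
    which products list an application for a given crop?"""
    return [
        p["cname"] for p in products
        if any(a["crop"] == crop_cname for a in p["applications"])
    ]

print(products_for_crop([product], "apple"))  # ['examplemax-250-ec']
```

The point of the sketch is that once records carry stable cnames, cross-layer questions reduce to simple joins rather than free-text matching against label wording.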
A major part of the work is normalizing external standards into machine-usable JSON. The project has structured IRAC, FRAC, HRAC, and IR-4 into linked registries so that products can refer to canonical chemical, crop, and resistance-action concepts instead of raw label wording alone.
That standardization work is what makes the rest of the indexing possible. Once those authority systems are represented cleanly, labels from different companies and regions can be interpreted against the same underlying structure.
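A rough sketch of what a normalized authority registry might look like in practice, with chemicals referring to canonical resistance-action entries. The IRAC group code shown (3A, sodium channel modulators, which covers pyrethroids such as lambda-cyhalothrin) is real, but the record shapes are assumptions for illustration:

```python
# Illustrative authority registry: products and chemicals refer to
# canonical IRAC entries instead of raw label wording. Record shapes
# are assumptions; the 3A group data reflects the published IRAC scheme.
IRAC = {
    "3A": {
        "authority": "IRAC",
        "group": "3A",
        "mode_of_action": "Sodium channel modulators",
    },
}
CHEMICALS = {
    "lambda-cyhalothrin": {"irac_group": "3A"},
}

def resistance_group(chemical_cname):
    """Resolve a chemical cname to its canonical IRAC entry."""
    group = CHEMICALS[chemical_cname]["irac_group"]
    return IRAC[group]

print(resistance_group("lambda-cyhalothrin")["mode_of_action"])
```

With FRAC, HRAC, and IR-4 represented the same way, a label from any company resolves to the same canonical concepts.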
The project stores source material as JSON and connects it through stable canonical names, or cnames. Those cnames act as the internal linking layer between products, crops, pests, chemicals, registrants, and authority standards. The result is a dataset that is easy to validate, cross-reference, and expose through web APIs.
This approach also keeps the indexing logic explicit. Instead of hiding resolution in runtime heuristics, the repo tries to make the mappings visible in curated registry files, normalized authority stacks, and product label data.
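One consequence of explicit cname links in plain JSON files is that validation becomes a mechanical check: load every registry, then confirm each referenced cname exists. The file layout and field names below are assumptions, not the repo's actual structure:

```python
# Sketch of cname-based cross-reference validation over JSON registries.
# Directory layout and field names are illustrative assumptions.
import json
from pathlib import Path

def load_registry(directory):
    """Load every JSON record in a directory, keyed by its cname."""
    registry = {}
    for path in Path(directory).glob("*.json"):
        record = json.loads(path.read_text())
        registry[record["cname"]] = record
    return registry

def dangling_refs(products, crops):
    """Return (product cname, crop cname) pairs whose crop is unknown."""
    missing = []
    for p in products.values():
        for app in p.get("applications", []):
            if app["crop"] not in crops:
                missing.append((p["cname"], app["crop"]))
    return missing
```

Because the mappings live in files rather than runtime heuristics, a broken link is a diff you can see and review, not a silent resolution failure.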
The web app also exposes a public JSON API for the data it serves. Products, crops, pests, chemicals, and the supporting authority structures can all be retrieved through API endpoints, which makes the project useful not just as a website but also as a machine-readable research dataset.
That API is intended to be a first-class interface to the normalized data. It allows agents and external tools to browse the same linked structures that drive the UI, including product applications, crop and pest indexes, chemical authority hierarchies, and related registry metadata.
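A client consuming such an API might look like the following. The base URL and endpoint paths are placeholders, since the actual routes are not documented in this text:

```python
# Hedged sketch of an API client; BASE and the route shapes are
# assumptions, not the project's documented endpoints.
import json
import urllib.request

BASE = "https://example.org/api"  # placeholder base URL

def resource_url(kind, cname=None):
    """Build an endpoint URL for a collection or a single record."""
    return f"{BASE}/{kind}" + (f"/{cname}" if cname else "")

def get_json(path_or_url):
    """Fetch one API resource and decode it as JSON."""
    with urllib.request.urlopen(path_or_url) as resp:
        return json.load(resp)

# An agent would walk the same cname links the UI uses, e.g.:
# products = get_json(resource_url("products"))
# apple = get_json(resource_url("crops", "apple"))
```

Because the API serves the same normalized, cname-linked records that drive the UI, an external tool never needs to re-derive the linking logic.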
Codex and other AI coding agents have been used throughout the workflow: retrieving documents, splitting PDFs into text and image artifacts, extracting fields into systematic structures, recognizing repeated label patterns, normalizing terminology, following authority standards, and looking up scientific names and related references.
The agents are not just writing code. They are also helping with document handling, schema design, data cleanup, linked-reference construction, validation, and the iterative refinement of the web app itself.
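As one concrete example of the cleanup steps listed above, terminology normalization can be sketched as a curated synonym table that maps raw label wording onto crop cnames. The synonym entries here are invented for illustration (real pome-fruit language, for instance, would need more careful handling than a single mapping):

```python
# Illustrative terminology-normalization step: raw label wording is
# lower-cased, whitespace-collapsed, then resolved through a curated
# synonym table. All synonym entries are assumptions.
import re

SYNONYMS = {
    "apples": "apple",
    "pome fruit": "apple",   # deliberately lossy; shown only as a sketch
    "maize": "corn",
}

def normalize_crop(raw):
    """Normalize one raw crop phrase to a candidate cname."""
    key = re.sub(r"\s+", " ", raw.strip().lower())
    return SYNONYMS.get(key, key)

print(normalize_crop("  Pome   Fruit "))  # apple
```

Keeping the table as curated data rather than inline heuristics is what lets agents and humans review and extend the mappings together.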
The long-term goal is broader regional coverage. The project is meant to absorb product labels from multiple markets, connect them to the crops and pests that matter locally, and expose that information in a form that both humans and software agents can use directly. That includes more regional crop-normalization layers, more product coverage, and better tools for comparing equivalent chemistry across different branded products and jurisdictions.