Datasets are at the heart of LangChain's evaluation methods. These are collections of Examples that define the inputs for those evaluations. Each example generally consists of an `inputs` dictionary and, optionally, an `outputs` dictionary that outlines the expected results for performance checks. To create your datasets, you can either curate them manually or build them from historical logs of your application's interactions, which is a great way to create relevant data that reflects real-world performance scenarios.
You can find more guidance on creating datasets in the LangSmith SDK guidelines.
Evaluators are functions in LangChain designed to score how well the system performs on given examples. They use the example's inputs and the application's outputs to return an `EvaluationResult`, which specifies one or more metrics along with their assigned scores. LangChain supports several types of evaluators, such as:
- String evaluators, which compare a run's output against a reference output
- Comparison (pairwise) evaluators, which judge the outputs of two runs against each other
- Trajectory evaluators, which assess the full sequence of actions an agent takes
The flexibility of having multiple evaluators means you can tailor the assessment criteria according to your specific objectives, as discussed in the LangSmith Evaluators documentation.
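For example, a custom evaluator can be a plain function that receives a run and its reference example and returns a metric. The sketch below assumes a hypothetical `exact_match` metric over an `answer` field; the field names and matching logic are illustrative, not part of the library.

```python
from langsmith.schemas import Example, Run


def exact_match(run: Run, example: Example) -> dict:
    """Score 1 if the application's answer exactly matches the reference, else 0."""
    predicted = (run.outputs or {}).get("answer", "")
    expected = (example.outputs or {}).get("answer", "")
    # Returning a dict with "key" and "score" is equivalent to an EvaluationResult.
    return {"key": "exact_match", "score": int(predicted.strip() == expected.strip())}
```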
Here’s a step-by-step breakdown of how evaluations are carried out in LangChain:
1. Create (or select) a dataset of Examples representing the inputs you want to test against.
2. Define the target function, chain, or agent you want to evaluate.
3. Choose or write the evaluators that will score each run.
4. Run the evaluation: the target is invoked on every example, and the evaluators score the resulting outputs.
5. Review the scores and traces in LangSmith to spot regressions and areas for improvement.
You can use functions like `evaluate()` in the LangSmith SDK to run these evaluations seamlessly, as sketched below.
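Here is a minimal sketch of such a run, assuming the dataset and `exact_match` evaluator from the earlier sketches; `my_app` is a hypothetical stand-in for the application under test.

```python
from langsmith.evaluation import evaluate


def my_app(inputs: dict) -> dict:
    """Stand-in for the chain or agent under test (hypothetical)."""
    return {"answer": "Tracing and evaluating LLM applications."}


results = evaluate(
    my_app,                        # target: called once per example with its inputs dict
    data="qa-smoke-test",          # dataset name from the earlier sketch
    evaluators=[exact_match],      # the custom evaluator sketched above
    experiment_prefix="qa-smoke",  # label that groups these runs in the LangSmith UI
)
```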