FlowiseAI
English
English
  • Introduction
  • Get Started
  • Contribution Guide
    • Building Node
  • API Reference
    • Assistants
    • Attachments
    • Chat Message
    • Chatflows
    • Document Store
    • Feedback
    • Leads
    • Ping
    • Prediction
    • Tools
    • Upsert History
    • Variables
    • Vector Upsert
  • Using Flowise
    • Agentflows
      • Multi-Agents
      • Sequential Agents
        • Video Tutorials
    • API
    • Analytic
      • Arize
      • Langfuse
      • Lunary
      • Opik
      • Phoenix
    • Document Stores
    • Embed
    • Monitoring
    • Streaming
    • Telemetry
    • Uploads
    • Variables
    • Workspaces
    • Evaluations
  • Configuration
    • Auth
      • App Level
      • Chatflow Level
    • Databases
    • Deployment
      • AWS
      • Azure
      • Alibaba Cloud
      • Digital Ocean
      • Elestio
      • GCP
      • Hugging Face
      • Kubernetes using Helm
      • Railway
      • Render
      • Replit
      • RepoCloud
      • Sealos
      • Zeabur
    • Environment Variables
    • Rate Limit
    • Running Flowise behind company proxy
    • SSO
    • Running Flowise using Queue
    • Running in Production
  • Integrations
    • LangChain
      • Agents
        • Airtable Agent
        • AutoGPT
        • BabyAGI
        • CSV Agent
        • Conversational Agent
        • Conversational Retrieval Agent
        • MistralAI Tool Agent
        • OpenAI Assistant
          • Threads
        • OpenAI Function Agent
        • OpenAI Tool Agent
        • ReAct Agent Chat
        • ReAct Agent LLM
        • Tool Agent
        • XML Agent
      • Cache
        • InMemory Cache
        • InMemory Embedding Cache
        • Momento Cache
        • Redis Cache
        • Redis Embeddings Cache
        • Upstash Redis Cache
      • Chains
        • GET API Chain
        • OpenAPI Chain
        • POST API Chain
        • Conversation Chain
        • Conversational Retrieval QA Chain
        • LLM Chain
        • Multi Prompt Chain
        • Multi Retrieval QA Chain
        • Retrieval QA Chain
        • Sql Database Chain
        • Vectara QA Chain
        • VectorDB QA Chain
      • Chat Models
        • AWS ChatBedrock
        • Azure ChatOpenAI
        • NVIDIA NIM
        • ChatAnthropic
        • ChatCohere
        • Chat Fireworks
        • ChatGoogleGenerativeAI
        • Google VertexAI
        • ChatHuggingFace
        • ChatLocalAI
        • ChatMistralAI
        • IBM Watsonx
        • ChatOllama
        • ChatOpenAI
        • ChatTogetherAI
        • GroqChat
      • Document Loaders
        • API Loader
        • Airtable
        • Apify Website Content Crawler
        • Cheerio Web Scraper
        • Confluence
        • Csv File
        • Custom Document Loader
        • Document Store
        • Docx File
        • File Loader
        • Figma
        • FireCrawl
        • Folder with Files
        • GitBook
        • Github
        • Json File
        • Json Lines File
        • Notion Database
        • Notion Folder
        • Notion Page
        • PDF Files
        • Plain Text
        • Playwright Web Scraper
        • Puppeteer Web Scraper
        • S3 File Loader
        • SearchApi For Web Search
        • SerpApi For Web Search
        • Spider Web Scraper/Crawler
        • Text File
        • Unstructured File Loader
        • Unstructured Folder Loader
        • VectorStore To Document
      • Embeddings
        • AWS Bedrock Embeddings
        • Azure OpenAI Embeddings
        • Cohere Embeddings
        • Google GenerativeAI Embeddings
        • Google VertexAI Embeddings
        • HuggingFace Inference Embeddings
        • LocalAI Embeddings
        • MistralAI Embeddings
        • Ollama Embeddings
        • OpenAI Embeddings
        • OpenAI Embeddings Custom
        • TogetherAI Embedding
        • VoyageAI Embeddings
      • LLMs
        • AWS Bedrock
        • Azure OpenAI
        • Cohere
        • GoogleVertex AI
        • HuggingFace Inference
        • Ollama
        • OpenAI
        • Replicate
      • Memory
        • Buffer Memory
        • Buffer Window Memory
        • Conversation Summary Memory
        • Conversation Summary Buffer Memory
        • DynamoDB Chat Memory
        • MongoDB Atlas Chat Memory
        • Redis-Backed Chat Memory
        • Upstash Redis-Backed Chat Memory
        • Zep Memory
      • Moderation
        • OpenAI Moderation
        • Simple Prompt Moderation
      • Output Parsers
        • CSV Output Parser
        • Custom List Output Parser
        • Structured Output Parser
        • Advanced Structured Output Parser
      • Prompts
        • Chat Prompt Template
        • Few Shot Prompt Template
        • Prompt Template
      • Record Managers
      • Retrievers
        • Extract Metadata Retriever
        • Custom Retriever
        • Cohere Rerank Retriever
        • Embeddings Filter Retriever
        • HyDE Retriever
        • LLM Filter Retriever
        • Multi Query Retriever
        • Prompt Retriever
        • Reciprocal Rank Fusion Retriever
        • Similarity Score Threshold Retriever
        • Vector Store Retriever
        • Voyage AI Rerank Retriever
      • Text Splitters
        • Character Text Splitter
        • Code Text Splitter
        • Html-To-Markdown Text Splitter
        • Markdown Text Splitter
        • Recursive Character Text Splitter
        • Token Text Splitter
      • Tools
        • BraveSearch API
        • Calculator
        • Chain Tool
        • Chatflow Tool
        • Custom Tool
        • Exa Search
        • Google Custom Search
        • OpenAPI Toolkit
        • Code Interpreter by E2B
        • Read File
        • Request Get
        • Request Post
        • Retriever Tool
        • SearchApi
        • SearXNG
        • Serp API
        • Serper
        • Tavily
        • Web Browser
        • Write File
      • Vector Stores
        • AstraDB
        • Chroma
        • Couchbase
        • Elastic
        • Faiss
        • In-Memory Vector Store
        • Milvus
        • MongoDB Atlas
        • OpenSearch
        • Pinecone
        • Postgres
        • Qdrant
        • Redis
        • SingleStore
        • Supabase
        • Upstash Vector
        • Vectara
        • Weaviate
        • Zep Collection - Open Source
        • Zep Collection - Cloud
    • LiteLLM Proxy
    • LlamaIndex
      • Agents
        • OpenAI Tool Agent
        • Anthropic Tool Agent
      • Chat Models
        • AzureChatOpenAI
        • ChatAnthropic
        • ChatMistral
        • ChatOllama
        • ChatOpenAI
        • ChatTogetherAI
        • ChatGroq
      • Embeddings
        • Azure OpenAI Embeddings
        • OpenAI Embedding
      • Engine
        • Query Engine
        • Simple Chat Engine
        • Context Chat Engine
        • Sub-Question Query Engine
      • Response Synthesizer
        • Refine
        • Compact And Refine
        • Simple Response Builder
        • Tree Summarize
      • Tools
        • Query Engine Tool
      • Vector Stores
        • Pinecone
        • SimpleStore
    • Utilities
      • Custom JS Function
      • Set/Get Variable
      • If Else
      • Sticky Note
    • External Integrations
      • Zapier Zaps
  • Migration Guide
    • Cloud Migration
    • v1.3.0 Migration Guide
    • v1.4.3 Migration Guide
    • v2.1.4 Migration Guide
  • Use Cases
    • Calling Children Flows
    • Calling Webhook
    • Interacting with API
    • Multiple Documents QnA
    • SQL QnA
    • Upserting Data
    • Web Scrape QnA
  • Flowise
    • Flowise GitHub
    • Flowise Cloud
Powered by GitBook
On this page
  • Datasets
  • Evaluators
  • Evaluations
  • Re-run evaluation
  • Video Tutorial
Edit on GitHub
  1. Using Flowise

Evaluations

PreviousWorkspacesNextConfiguration

Last updated 3 months ago

Evaluations are only available for Cloud and Enterprise plan

Evaluations help you monitor and understand the performance of your Chatflow/Agentflow application. On the high level, an evaluation is a process that takes a set of inputs and corresponding outputs from your Chatflow/Agentflow, and generates scores. These scores can be derived by comparing outputs to reference results, such as through string matching, numeric comparison, or even leveraging an LLM as a judge. These evaluations are conducted using Datasets and Evaluators.

Datasets

Datasets are the inputs that will be used to run your Chatflow/Agentflow, along with the corresponding outputs for comparison. User can add the input and anticipated output manually, or upload a CSV file with 2 columns: Input and Output.

Input
Output

What is the capital of UK

Capital of UK is London

How many days are there in a year

There are 365 days in a year

Evaluators

Evaluators are like unit tests. During an evaluation, the inputs from Datasets are ran on the selected flows and the outputs are evaluated using selected evaluators. There are 3 types of evaluators:

  • Text Based: string based checking:

    • Contains Any

    • Contains All

    • Does Not Contains Any

    • Does Not Contains All

    • Starts With

    • Does Not Starts With

  • Numeric Based: numbers type checking:

    • Total Tokens

    • Prompt Tokens

    • Completion Tokens

    • API Latency

    • LLM Latency

    • Chatflow Latency

    • Agentflow Latency (coming)

    • Output Characters Length

  • LLM Based: using another LLM to grade the output

    • Hallucination

    • Correctness

Evaluations

Now that we have Datasets and Evaluators prepared, we can start running an evaluation.

1.) Select dataset and chatflow to evaluate. You can select multiple datasets and chatflows. Using the example below, every inputs from Dataset1 will be ran against 2 chatflows. Since Dataset1 has 2 inputs, a total of 4 outputs will be produced and evaluated.

2.) Select the evaluators. Only string based and numeric based evaluators are available to be selected at this stage.

3.) (Optional) Select LLM Based evaluator. Start Evaluation:

4.) Wait for evaluation to be completed:

5.) After evaluation is completed, click the graph icon at the right side to view the details:

The 3 charts above show the summary of the evaluation:

  • Pass/fail rate

  • Average prompt and completion tokens used

  • Average latency of the request

Table below the charts shows the details of each execution.

Re-run evaluation

When the flows used on evaluation have been updated/modified, a warning message will be shown:

You can re-run the same evaluation using the Re-Run Evaluation button at the top right corner. You will be able to see the different versions:

You can also view and compare the results from different versions:

Video Tutorial