
So, you're building something cool with AI. That's awesome. But if your creation is going to interact with actual humans, you've probably had that little voice in the back of your head ask, "...what if someone says something awful?" Or even worse, "...what if my AI says something awful back?"
It's a valid concern. Whether it's a customer sending an abusive message or an AI generating a weirdly inappropriate response, you need a safety net. This is especially true in customer support, where every single interaction is a reflection of your brand.
That's where content moderation comes in. The OpenAI Moderation API is a powerful, accessible, and surprisingly free tool that acts as your first line of defense. It helps you build safer, more reliable AI-powered apps. In this guide, we'll walk through exactly what the API is, how it works, and how you can actually use it to protect your users and your reputation.
What is the OpenAI Moderation API?
In simple terms, the OpenAI Moderation API is a checkpoint for anything your app sends or receives. It scans text and images and classifies them against OpenAI's usage policies, flagging everything from hate speech and harassment to self-harm and violence. It's a straightforward way to add a layer of safety to any AI workflow you're building.
The API gives you two main models to choose from:
- "omni-moderation-latest": This is the one you should probably be using for any new project. It handles both text and images and gives you a much more detailed breakdown of what it finds.
- "text-moderation-latest" (Legacy): An older model that, as the name suggests, only works with text.
Here's one of the best parts: using the moderation endpoint is completely free. This pretty much makes it a no-brainer for any developer trying to build responsible AI. The cost barrier is gone, so there's no reason not to implement these essential safety features.
A complete guide to the OpenAI Moderation API
Alright, let's get into the nitty-gritty. This section is your go-to reference for getting your hands dirty with the API. We'll cover how to send a request, what the response you get back actually means, and the different categories of content it looks for.
How to make a request
Sending a request is pretty simple. All you do is send your text or image to the "/v1/moderations" endpoint and tell it which model you want to use.
Here’s a quick example using Python to get you started:
from openai import OpenAI

client = OpenAI()

# Send a piece of text to the moderation endpoint and print the full result.
response = client.moderations.create(
    model="omni-moderation-latest",
    input="I want to kill them.",
)

print(response)
And if you prefer using cURL, here's how you'd do the same thing:
curl https://api.openai.com/v1/moderations \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "omni-moderation-latest",
    "input": "I want to kill them."
  }'
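The omni model also accepts images. Here's a rough sketch of a mixed text-and-image request using the multi-part input format; the image URL and the wording of the text are just placeholders:

from openai import OpenAI

client = OpenAI()

# Check a piece of text and an image together in one request.
# The URL below is a placeholder; point it at a real, publicly reachable image.
response = client.moderations.create(
    model="omni-moderation-latest",
    input=[
        {"type": "text", "text": "Is this profile photo okay to display?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/uploaded-photo.png"}},
    ],
)

print(response.results[0].flagged)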
Understanding the moderation response
When you send a request, the API gives you back a JSON object with a few key pieces of information that tell you everything you need to know.
| Output Field | Description |
|---|---|
| "flagged" | A simple "true" or "false". It's "true" if the model thinks the content is harmful in any category. |
| "categories" | A list of "true"/"false" flags for each specific content category (like "violence" or "hate"), showing you exactly which rules were broken. |
| "category_scores" | Confidence scores (from 0 to 1) for each category, showing you how sure the model is about its classification. |
| "category_applied_input_types" | (For Omni models only) An array that tells you if it was the "image" or the "text" that triggered a flag for each category. |
Content classification categories
The API doesn't just give you a thumbs-up or thumbs-down. It breaks down potential issues into specific categories, which is incredibly helpful for fine-tuning how you respond to different types of content (there's a small example of that kind of routing after the table).
| Category | Description |
|---|---|
| "harassment" | Content that promotes or incites harassing language toward someone. |
| "harassment/threatening" | Harassment that also includes threats of violence or serious harm. |
| "hate" | Content that promotes hate based on things like race, religion, gender, etc. |
| "hate/threatening" | Hateful content that also includes threats of violence against the targeted group. |
| "self-harm" | Content that encourages or depicts acts of self-harm, like suicide or eating disorders. |
| "self-harm/intent" | Content where someone expresses a direct intent to harm themselves. |
| "self-harm/instructions" | Content that gives instructions or advice on how to perform self-harm. |
| "sexual" | Content meant to be sexually exciting or that promotes sexual services. |
| "sexual/minors" | Any sexual content that involves someone under 18 years old. |
| "violence" | Content that shows or describes death, violence, or serious physical injury. |
| "violence/graphic" | Content depicting death, violence, or injury in graphic detail. |
How to build a moderation workflow
Knowing what the API does is one thing, but actually putting it to work is another. A smart moderation workflow makes sure that both what your users type in and what your AI spits out are checked before they can cause any trouble.
The standard moderation process
Here’s a pretty standard playbook for how this works in the real world (there's a code sketch right after the list):
1. A user sends some input (like a support ticket or a chat message).
2. Your system sends that input over to the Moderation API first.
3. If the API flags the content, you block it and can show the user a generic message.
4. If it's all clear, you pass the input to your language model to get a response.
5. Before showing that AI-generated response to the user, you send it back to the Moderation API for another check.
6. If the AI's response gets flagged, you need a plan. You could just throw it away, log it for a human to look at later, or even ask the AI to try again.
7. If the AI's response is safe, then you can finally send it to the user.
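Translated into code, that playbook looks roughly like the sketch below. It's deliberately simplified: the chat model name is just an example, and in production you'd add error handling, logging, and retries around both API calls:

from openai import OpenAI

client = OpenAI()

def is_flagged(text: str) -> bool:
    """Ask the Moderation API whether a piece of text violates policy."""
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    return result.results[0].flagged

def handle_message(user_message: str) -> str:
    # Steps 1-3: check the user's input before doing anything else.
    if is_flagged(user_message):
        return "Sorry, this message couldn't be processed because it violates our content policy."

    # Step 4: safe input, so generate a reply with your language model.
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name; swap in whatever you actually use
        messages=[{"role": "user", "content": user_message}],
    )
    ai_reply = completion.choices[0].message.content

    # Steps 5-6: check the AI's reply too; fall back if it gets flagged.
    if is_flagged(ai_reply):
        return "Sorry, I couldn't generate a response for that. A teammate will follow up."

    # Step 7: everything checked out, send the reply.
    return ai_reply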
The challenge: Custom implementation vs. an integrated platform
While calling the API is simple, building a full, production-ready moderation system from the ground up is a whole other beast. You have to manage API keys, build logic to handle network errors, create a logging system, figure out custom scoring thresholds for each category, and then weave it all into the tools you already use, like Zendesk, Freshdesk, or Slack.
What starts as a small safety feature can quickly turn into a multi-week engineering project.
This is where you have to decide if you want to build or buy. A platform like eesel AI is designed for teams that would rather not get bogged down in that custom work. It’s built to be self-serve, letting you launch an AI support agent that already has all of this moderation logic built-in. Instead of writing custom code, you get one-click integrations with your helpdesk and a ready-to-go system in minutes, not months.
eesel AI's integrated platform simplifies the OpenAI Moderation reference workflow by connecting seamlessly with existing tools.
Key use cases and best practices
Once you have a workflow in place, you can start applying it to different situations and tweaking it with a few best practices.
Safeguarding customer support interactions
Customer support is probably one of the most critical areas to get this right. You’ll want to moderate two main things:
- Incoming customer queries: This is about protecting your support agents and your systems from spam, abuse, and other junk. It helps keep your work environment safe and professional.
- AI-generated drafts and replies: This is non-negotiable. Whether you’re using an AI to help a human agent or a fully autonomous one, you have to make sure its responses are on-brand, appropriate, and safe. One bad AI response can seriously damage customer trust.
Best practices for effective moderation
Here are a few tips to get more out of the Moderation API:
- Look beyond the "flagged" field: The simple "true"/"false" is a good starting point, but the real power is in the "category_scores". Use these scores to set your own custom rules (there's a small sketch of this after the list). For example, you might have a zero-tolerance policy for "violence" (anything above a 0.1 score gets blocked) but be a little more lenient on other things.
- Log flagged content for a human to review: Don't just block content and move on. Set up a system where a person can review flagged messages. This helps you understand what's being blocked, spot any false positives, and adjust your rules over time.
- Be transparent with users: If you block a user's message, tell them why in a simple way. A message like, "Sorry, this message couldn't be processed because it violates our content policy," is way better than just letting it fail silently.
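To illustrate the first tip, here's a minimal sketch of per-category thresholds layered on top of "category_scores". The numbers are made up for illustration; you'd tune them against your own logged traffic:

# Hypothetical per-category thresholds; anything not listed falls back to DEFAULT_THRESHOLD.
THRESHOLDS = {
    "violence": 0.1,   # zero-tolerance example from the tip above
    "harassment": 0.5,
}
DEFAULT_THRESHOLD = 0.7

def violates_policy(category_scores: dict[str, float]) -> bool:
    """Return True if any category score crosses its configured threshold."""
    return any(
        score >= THRESHOLDS.get(category, DEFAULT_THRESHOLD)
        for category, score in category_scores.items()
    )

# Example with scores pulled from a moderation response:
print(violates_policy({"violence": 0.15, "harassment": 0.2}))  # True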
This is another spot where an integrated platform can save you a lot of guesswork. With eesel AI, for example, you can run simulations on thousands of your past support tickets to see exactly how its built-in moderation would have handled them. This lets you test and fine-tune your AI's behavior in a safe, risk-free environment before it ever talks to a real customer.
Testing and fine-tuning your AI's behavior is easy with eesel AI's simulation feature, a key OpenAI Moderation reference best practice.
OpenAI Moderation API pricing
This is the easiest part of the whole guide. The OpenAI Moderation endpoint is free to use.
You can check out the details on the official OpenAI pricing page, but the takeaway is simple: there’s no cost to add this crucial layer of safety to your application.
Putting it all together
The OpenAI Moderation API is a fantastic tool for anyone building with generative AI. It's powerful, free, and gives you the ability to check text and images against a solid set of safety rules, with detailed scores that let you create nuanced, custom-tailored workflows.
But just having access to an API isn't the whole story. Building a truly reliable moderation system means creating a thoughtful workflow that covers everything from the user's first message to the AI's final reply. While you can definitely build this yourself, the time and engineering effort can be pretty significant.
Go live safely in minutes with eesel AI
If you want the peace of mind that comes with a robust moderation system but don't want the headache of building it from scratch, eesel AI is the fastest way to get there. Our platform handles everything from integrating with your knowledge sources and helpdesk to automating ticket triage and replies, all with enterprise-grade safety guardrails built-in from day one. You can focus on giving your customers a great experience, knowing that your brand and users are protected.
Ready to automate your support safely and effortlessly? Sign up for free and you can launch your first AI agent in just a few minutes.
Frequently asked questions
What is the main purpose of the OpenAI Moderation API?
The OpenAI Moderation API serves as a critical checkpoint, scanning text and images for harmful content based on OpenAI's usage policies. Its main function is to flag content like hate speech, harassment, or violence, acting as a crucial first line of defense for AI applications.
What content categories does the OpenAI Moderation API detect?
The OpenAI Moderation API classifies harmful content into specific categories such as "harassment", "hate", "self-harm", "sexual", and "violence". It provides a detailed breakdown, allowing developers to understand exactly which rules might have been violated and fine-tune their responses.
Does the OpenAI Moderation API cost anything to use?
No, the OpenAI Moderation endpoint is completely free to use. This makes it an accessible and cost-effective solution for developers looking to integrate essential safety features into their AI applications without incurring additional expenses.
What does a standard moderation workflow look like?
A standard workflow involves moderating both user input and AI-generated responses. User input is first sent to the Moderation API; if clear, it proceeds to the language model, and then the AI's response is also moderated before being shown to the user. This dual-check ensures safety throughout the interaction.
How does the Moderation API help in customer support?
For customer support, it helps protect agents from abusive incoming queries and ensures that AI-generated drafts or replies are always appropriate and on-brand. Implementing OpenAI Moderation safeguards your company's reputation and fosters a safer environment for both customers and support staff.
What information does the moderation response include?
The API returns a JSON object with a "flagged" boolean, specific "categories" (true/false flags), and "category_scores" (confidence levels from 0 to 1). The "category_applied_input_types" field (for Omni models) further indicates whether text or image triggered a flag, offering a comprehensive view of the moderation result.
What are some best practices for using the Moderation API?
It's best to look beyond just the "flagged" field and use "category_scores" for custom rules, log flagged content for human review, and be transparent with users when their content is blocked. Starting with stricter rules and gradually relaxing them can also be a low-risk approach to fine-tuning your system.