16 January 2026

Using AI safely and effectively in user research

By Jake McCann

Imagine you have a colleague who knows almost everything there is to know about user research. As in, they’ve read every textbook, done every training course, read every academic paper on the subject. And on every other subject, for that matter. But they have this habit of making up data, and never really know or understand what your specific research project is about. They also tend to forget everything outside the current conversation you’re having with them, and you’re not 100% sure where they’re storing all the research data you’ve given them.

Chances are, if you’ve been using an AI assistant to help you with user research, you’ve had that experience. One of the promises of AI is that it can make our jobs easier, but it’s also capable of introducing more problems than it solves. For user researchers, there’s a risk that the limitations of AI reduce the quality and integrity of our research. Integrity is at the heart of research ethics, which aim to minimise the risk that research poses to individuals and society. In this context, there’s an overlap with the field of AI safety, which is dedicated to making sure AI systems do not create harmful, misleading, or unreliable outputs.

It therefore turns out that using AI effectively and safely has a lot in common with doing good-quality, ethical research. That means there are some familiar principles and processes you can apply to mitigate and manage these drawbacks of AI. There are also a few AI-specific tips that you might not be aware of. In this post, I’ll cover:

  • Why you need to make sure your AI assistant is suitable for user research data
  • How you can reduce and mitigate hallucinations
  • How giving AI the right context can make it a better research assistant
  • How AI introduces new ways for bias to enter your research
  • Some less obvious ways that AI can affect your research

I’m writing this for user researchers who are using AI, or want to use it, in their day-to-day research work. I won’t cover things like training your own AI models, using specialised AI tools, or building AI features into products. Instead, I’ll focus on general-purpose AI assistants. These are tools that many user researchers are already using, like:

  • ChatGPT by OpenAI
  • Gemini by Google
  • Copilot by Microsoft
  • Claude by Anthropic

This post summarises some things I’ve been thinking about as I build Stitchwork. My previous blog post covers what Stitchwork is and why I’m building it. In short, it’s a qualitative analysis tool for user researchers, built on top of modern AI. A major reason I’m building it is to address issues of research integrity that I cover in this post. Stitchwork is still in development, but if you want to know when it launches you can sign up to the waiting list here.

Otherwise, let’s dive in.

Make sure your AI assistant is suitable for processing user research data

Researchers have a responsibility to ensure the privacy and security of participant data. This has been a central feature of research ethics for decades, and over the years has been reinforced by wider developments in data protection law. Responsibly using AI in user research starts with the same basic premise: think carefully about where you upload stuff.

Your approach to this will be directed, in the first instance, by your country’s data protection laws and your organisation’s data policies. Wherever you work likely has rules about not uploading personally-identifiable or commercially sensitive information to unapproved places. And if your organisation has been doing user research for a while, you’ll probably have user research-specific data processes in place. These are great starting points if you have them. But with AI assistants there’s an added layer of complication when it comes to managing research data.

Training policies affect data retention

The types of AI models that power AI assistants are known as large language models, or LLMs. These are first trained on large datasets to learn general language patterns, then tuned to follow instructions and perform specific tasks. Typically, the bulk of the training data comes from industry-standard machine-learning datasets and large web crawls. Some providers may also use conversations with AI assistants as training data, unless users opt out.

If participant data is used in training, it can end up reflected in the learned language patterns of future models. This presents a potential ethical issue. Unless you’ve been specifically asking for it, your participants won’t have given you informed consent for their data to be used for this purpose. Even with their consent, it creates a blurry boundary for things like data retention policies or the participant’s right to withdraw. It’s not that the participant’s data is sat there verbatim in the language model — it isn’t. But there’s an echo of that data that you can’t just go in and delete.

Tools built for business may be more suitable

The business and enterprise versions of AI assistants tend to opt out of training by default. They also come with contractual agreements around data handling and processing, such as data processing addenda (DPAs). ChatGPT and Claude both have business-oriented plans, while Gemini and Copilot can be bundled with Google and Microsoft’s business software suites.

Product offerings frequently change, as do their terms and conditions. You should check what plan you’re on, and dig around to see if it’s suitable for use with user research data. Even with the right agreements in place, an AI assistant might need its settings adjusted to ensure it’s handling data in line with your organisation’s requirements. And even then, it might not be suitable for certain types of particularly sensitive data. If you’re unsure whether an AI assistant is appropriate for research data, speak to your data protection, legal, or IT people.

Responsible use of AI is, in this respect, an extension of research ethics and data protection. Just like with any online tool, check that it’s approved for use with your organisation’s data and, specifically, that it’s suitable for your user research data. You might have to update some processes and policies, or create new ones. But that’s part of doing research responsibly, with or without AI.

Minimise, catch and manage hallucinations

Responsible research also depends on the validity and integrity of your data. One widely discussed AI phenomenon can undermine that: hallucinations.

What are hallucinations?

In a recent research paper, OpenAI investigated why LLMs hallucinate. There, they define hallucinations as “plausible but false statements generated by language models”. LLMs are trained to predict the most likely next word in a given context, based on probabilities in the training data. This results in fluent and plausible text, but being fluent and plausible doesn’t necessarily make it true. This is exacerbated by other factors. In that paper, OpenAI concludes that it’s partly down to the incentives used when training and evaluating AI models. The training process rewards models when they give an answer to a question, but not when they say “I don’t know”. This encourages them to guess when uncertain.

Managing hallucinations in user research

Models are getting better at avoiding hallucinations, as OpenAI claims in that study. But hallucinations still occur, and need managing in your research process. For example, an AI model could introduce a quote that the user didn’t actually say when summarising an interview transcript. Or an AI analysis tool may hallucinate patterns in the data that seem plausible, but don’t actually exist. Hallucinations risk undermining the quality and integrity of your research.

This isn’t a problem unique to AI. Humans make mistakes all the time. They might make a typo in their research notes, mishear something a participant says, or misremember an event from a research session. Like with handling data, you may well already have processes in place for dealing with this. It’s likely you already make recordings and take notes in research sessions. You may have another person in the session, acting as an observer or notetaker. By not relying on a sole researcher’s memory or interpretation, you reduce the risk of human error derailing your research. These same measures can reduce the risk of being caught out by an AI model’s hallucination. Observers and notetakers can help identify made-up quotes and observations. Raw recordings and human-made notes provide you with a source against which to verify AI outputs.

There are also some AI-specific safeguards you can put in place. Newer models tend to hallucinate less, so make sure you’re using these. Not all tools give you a choice of model, and model names are generally inscrutable, so you might need to check the documentation for your AI tool to figure out if you can change it. The way you prompt an AI assistant can also help reduce hallucinations. This guidance from Anthropic has some useful tips, including explicitly telling the AI it has permission to say “I don’t know”.
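To make this concrete, here’s a minimal sketch of what that kind of prompting might look like if you were working with a model through code rather than a chat window. The helper name and the exact wording are my own illustrations, not a fixed recipe:

```python
# Illustrative anti-hallucination prompt, loosely following Anthropic's
# guidance: constrain quoting to the source material and explicitly give
# the model permission to say "I don't know".

ANTI_HALLUCINATION_RULES = (
    "Only quote text that appears verbatim in the transcript below. "
    "If the transcript does not contain the answer, say \"I don't know\" "
    "rather than guessing."
)

def build_summary_prompt(transcript: str) -> list[dict]:
    """Assemble a chat-style message list for summarising an interview."""
    return [
        {"role": "system", "content": ANTI_HALLUCINATION_RULES},
        {"role": "user", "content": f"Summarise this interview:\n\n{transcript}"},
    ]

messages = build_summary_prompt("P1: I found the checkout page confusing.")
```

The same wording works just as well pasted directly into a chat assistant; the point is the explicit permission not to answer.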

In any case, remember that AI outputs need the same discipline as any other research input. Don’t treat them as evidence in their own right: verify them against recordings, notes, and quotes before you rely on them.

Give the AI the right context

Making things up is one way for an AI assistant to undermine the quality of your research. A more subtle risk is that the AI operates without the right background knowledge. When we’re doing user research, we hold lots of useful extra information in our heads. This might include the goals of the immediate research project, the purpose of the wider product or service, or the aims and ambitions of various stakeholders. Much of it feels incidental, but it underpins the quality of your research.

What do we mean by context?

AI assistants have vast amounts of knowledge from training, but they don’t have that specific context unless you give it to them. They don’t know that your product manager is under pressure to increase conversion by 10% this quarter and won’t pay attention to your research unless you can help them with that. It’s this kind of tacit, soft-skills knowledge that makes the difference between your research informing product development and being stuck in a slide deck nobody ever reads.

The past year has seen a growing consensus that effectively managing context is one of the most important ways to improve the performance of LLM-based tools. A useful, if oversimplified, way to think about the role of context is like short and long-term memory in humans. Training produces a compressed, statistical representation of the patterns in the training data. This statistical summary is effectively the model’s long-term memory. It’s the knowledge you could expect to gain from reading the contents of the entire internet, including things like “what is a user researcher” and “how to do a usability test”. Once a model has been trained and released, and you chat with it, the contents of that conversation more or less form its short-term memory. This short-term memory is called context, and the amount of data a model can fit in its short-term memory is referred to as its context window. The size of context windows has rapidly increased over time, so that some models can fit the entirety of War and Peace in theirs.
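If it helps, here’s a toy sketch of that short-term memory idea. Chat models are stateless between requests, so tools resend the conversation history each turn, trimmed to fit the context window. The word-count “tokeniser” below is a deliberate oversimplification for illustration:

```python
# Toy sketch of "short-term memory": each request resends the conversation
# history, keeping only the most recent messages that fit the context window.
# Real tokenisers are more sophisticated than a word count.

def rough_tokens(text: str) -> int:
    return len(text.split())

def fit_to_context(history: list[str], context_window: int) -> list[str]:
    """Keep the most recent messages that fit within the window."""
    kept, used = [], 0
    for message in reversed(history):
        cost = rough_tokens(message)
        if used + cost > context_window:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))

history = ["first question",
           "a very long detailed answer about usability",
           "follow-up question"]
trimmed = fit_to_context(history, context_window=10)  # drops the oldest message
```

Anything trimmed out is simply gone from the model’s view, which is why long conversations can “forget” their beginnings.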

Optimising AI context in user research

With AI assistants, the most basic solution to context management is providing as much relevant information as possible in the conversation. For example, if you want the AI to help you write questions for user interviews, then paste in the full research plan, summaries of stakeholder interviews, maybe some examples of good interview questions. Basically, anything that might help the AI come up with something that isn’t generic. However, pasting the contents of dozens of documents into a chat window can quickly become hard to manage. What happens if one of the documents gets updated? And even if you paste in everything you can think of, you’ll still end up with problems. Providing the AI model with too much information in a single conversation can have a detrimental effect, known as context rot.

A better solution is to provide a set of files, allow the AI to explore them, and then select from them only relevant pieces of information to include in the context at that moment. The leading AI assistants have features that do just this:

  • ChatGPT projects
  • Microsoft Copilot Notebooks
  • Google Notebook LM
  • Claude projects

Realistically, you’ll never be able to provide perfect context. Seemingly irrelevant pieces of information you leave out can later turn out to be important. You can think of it like how you’d onboard a new researcher to your team. You give them the essentials on day one, then fill in gaps as they become apparent. The important thing is to provide the best context you can, and keep it up to date.
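Under the hood, these project features work roughly like the sketch below: score each document against the current task and include only the best matches in the context. Real tools use more sophisticated retrieval (such as embeddings); the word-overlap scoring and file names here are purely illustrative:

```python
# Toy sketch of "select only relevant context": rank documents by how many
# task words they contain, and include only the top matches. Production
# tools use embedding-based retrieval rather than word overlap.

def relevance(task: str, document: str) -> int:
    """Count how many task words appear in the document."""
    doc_words = set(document.lower().split())
    return sum(1 for word in task.lower().split() if word in doc_words)

def select_context(task: str, documents: dict[str, str], top_n: int = 2) -> list[str]:
    """Return the names of the top_n most relevant documents."""
    ranked = sorted(documents,
                    key=lambda name: relevance(task, documents[name]),
                    reverse=True)
    return ranked[:top_n]

docs = {
    "research-plan.md": "interview questions for the checkout usability study",
    "brand-guidelines.md": "logo colours and typography rules",
    "stakeholder-notes.md": "stakeholders want checkout conversion insights",
}
chosen = select_context("draft interview questions about checkout", docs)
```

The practical upshot: the better your file names and contents match how you phrase your tasks, the better these features work.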

Be wary of new routes for bias

Because user research relies on interpretation, bias has always been something that needs detecting, managing, and reducing. AI opens up two new ways for bias to creep in: bias within the AI models, and bias in how humans interpret their outputs. As an end user of AI tools, you don’t have much control over that first route. That’s an active area of AI research and engineering. But we’ll see how you can address the second one.

The 2025 International AI Safety Report, authored by a panel of over 100 AI experts, outlines how AI models amplify real-world biases. LLMs train on data produced by humans, and so that data will be reflective of human biases. The outputs of AI models can contain harmful biases related to gender, race, culture and other characteristics. Much of the data used to train today’s AI models is in English and comes from western sources. That can make them less useful or reliable for people and contexts that aren’t well represented in the data.

Given that LLMs reflect existing biases, your existing processes for managing bias are a good foundation. Being more vigilant and deliberate in applying them to AI outputs is a good first step. But you do need to consider new types of bias that you may not have considered before. Two cognitive biases are especially relevant here: the anchoring effect and automation bias.

The anchoring effect is where the first information you encounter influences later decisions. This isn’t limited to AI, but AI outputs can act as anchors, and people often under-correct from them when the AI is wrong. For example, say you read an AI-generated summary of a research session before you look at the raw data. You’re then more likely to stick with the AI’s framing, even if the raw data supports a better interpretation.

To mitigate the anchoring effect, ask yourself “what are some reasons this interpretation is wrong?”

Automation bias is the tendency to give more weight to automated systems than your own judgement. Imagine using an AI assistant to help create a research plan, and it suggests using a survey. You suspect your research questions need deeper answers than a survey would provide, but you go with the survey anyway.

To help reduce automation bias, add a forcing function: make an initial judgement before you look at the AI’s output.

You can’t remove AI bias entirely, just as you can’t eliminate human bias. What matters is trying to mitigate it as best you can. One of the promises of AI is that it can speed up our work, but that just makes it all the more important to build in moments where we stop, check and be deliberate in how we interpret AI outputs.

AI can shape your research in less obvious ways

So far, the risks we’ve talked about are fairly direct: data going to the wrong place, the model making things up, or bias creeping in. But there are subtler ways that AI can shape your research. These are things that won’t show up as an obvious mistake, but can nonetheless steer you in the wrong direction.

Sycophancy and agreement-seeking

AI assistants are prone to trying to please the user, an effect known as AI sycophancy. This is likely driven by the latter stages of the training process, where AI models get feedback from human evaluators. Humans being humans, they tend to give positive feedback when the AI responds in an agreeable way. As a result, AI assistants sometimes engage in approval-seeking behaviours like mirroring your viewpoint, confirming your hunches, and not challenging your mistakes.

To mitigate this, you can prompt a model to disagree on purpose. In a recent paper, AI researchers reduced sycophancy by asking an AI model to:

  • Adopt a third-person persona (“Andrew the independent thinker” in the case of that paper)
  • Ignore the user’s opinions on the topic

For example, consider a situation where you’re using an AI assistant to help you develop themes from your data. To reduce sycophancy, you can ask it to act as an independent reviewer who ignores your interpretation of the data. Even though this may reduce sycophancy, you have to consider how hallucinations and biases could come into play. In this way, working with AI effectively is a bit of a balancing act.
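If you’re scripting this rather than chatting, the same framing can be baked into a reusable prompt. This sketch follows the persona example from that paper; the helper name and the wording beyond the persona are my own assumptions:

```python
# Sketch of the anti-sycophancy framing: a named third-person persona that
# is told to ignore the user's own opinions. "Andrew the independent
# thinker" follows the paper's example; the rest of the wording is illustrative.

def independent_reviewer_prompt(data: str) -> str:
    return (
        "You are Andrew, an independent thinker reviewing research data. "
        "Ignore any opinions or interpretations the user has expressed. "
        "Form your own view of the themes, and disagree where the evidence "
        "warrants it.\n\n"
        f"Data:\n{data}"
    )
```

Pasting the same instructions at the start of a chat conversation should have a similar effect.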

Self-reported rationales aren’t reliable

Another trap is asking an AI to explain why it said something. It’s tempting to do this when AI does something wrong, but it’s unlikely to help you fix the problem. The deep neural networks that power these systems are so vast and complex that the AI labs themselves don’t fully understand how they work. More importantly: the LLM’s job is to generate plausible text, not to report on its own internal decision-making. If you ask it why it made a mistake, it will apologise and give you a convincing answer. These days you can even view the model’s chain of thought in the chat interface. But that’s just the surface-level output of the AI and won’t necessarily bear any relation to what’s gone on deep within the model.

Again, the mitigations for hallucinations and bias are useful here. If an AI assistant comes up with some unexpected insight from your research, stop, think, and check your data for yourself. You may ask the AI where that unexpected insight came from, but be wary of its response.

Watch out for guardrails

One more thing to be aware of is that AI assistants have safeguards that restrict what they can say. These help detect and block harmful or unsafe output, and they’re a key part of how model providers deploy AI safely. But they can catch you out when the topic of your research overlaps with those safeguards.

If you’re building, say, a service for victims of crime, participant quotes may include violent or traumatic content. This could trigger refusals or “sanitised” responses from the AI. It can happen even in seemingly innocuous domains. For example, if you’re working in financial services, you might find models are prevented from giving anything that looks like financial advice. This can make outputs oddly vague, incomplete, or overly caveated.

If you get a flat-out refusal, try different phrasing, or a different model. Yet sometimes it won’t be so obvious you’ve triggered a safeguard. The model may just self-censor without informing you. Although the risk here to your research may be low, it’s yet another reason you need to pay attention and properly assess AI outputs.

It’s about good research habits

At the end of the day, AI doesn’t change what good user research looks like. You still need to handle participant data responsibly, ground your analysis in what participants said and did, and be honest about what insights your data can and can’t support. What AI changes is how easily things can go wrong.

You don’t need to be an AI expert to use these tools well. But you do need to be deliberate in how you use them. Understand where participant data is going and how it’s used, check AI outputs carefully, and make sure you’re using your own judgement.

I’m building Stitchwork with all this in mind, so that it’s easier for researchers to use AI safely and responsibly. If you’re interested in hearing about it when it launches, you can join the waiting list.