Data cleaning
Lightweight AI-driven analytics with TrueState
Clean data is the foundation of trustworthy analytics. Whether you’re building dashboards, training models, or answering executive questions, poor-quality data leads to wasted time, broken logic, and unreliable insights.
In traditional workflows, data cleaning is often tedious and repetitive—taking time away from actual analysis. TrueState simplifies this process by combining automatic data profiling with intuitive, no-code transformation nodes that can be configured via natural language.
Data cleaning in TrueState is performed inside a pipeline, where you can add one or more Data Transform nodes to clean and shape your dataset. These nodes are configured through the Pipeline agent, which handles everything from column renaming and null handling to recoding values and flagging outliers.
Why data cleaning matters
Even the best datasets are rarely analysis-ready. Common issues include:
- Inconsistent naming conventions (e.g., “NY”, “New York”, “N.Y.”)
- Missing or null values
- Outliers or invalid data types
- Duplicate records
- Mismatched encodings in categorical fields
- Timestamp formatting issues
These problems can skew metrics, break aggregations, or mislead models and stakeholders alike.
Cleaning early, and with the right tools, avoids compounding problems in downstream dashboards, models, and reports.
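To make the risk concrete, here is a minimal pandas sketch (with invented city and revenue values) showing how inconsistent labels silently split an aggregation:

```python
import pandas as pd

# Made-up sales data with three spellings of the same city
df = pd.DataFrame({
    "city": ["NY", "New York", "N.Y.", "Boston"],
    "revenue": [100, 250, 75, 300],
})

# The three New York variants are treated as separate groups, so any
# per-city metric splits the true New York total (425) across three rows.
print(df.groupby("city")["revenue"].sum())
```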
How cleaning works in TrueState
Data cleaning happens inside the Pipeline section of the platform, where each dataset flows through a series of configurable nodes. To clean data, you’ll use the Data Transform node.
When you create a Transform node:
- The Pipeline agent will help you identify issues and suggest fixes
- You describe your intent in plain language (e.g., “standardise dates” or “remove duplicates”)
- The agent configures the transformation automatically
- You preview and apply the changes interactively
All transformations are recorded as part of the pipeline and can be reordered, modified, or removed as needed.
The Pipeline agent translates natural language into versioned, inspectable logic—so your cleaning steps are transparent and repeatable.
Common data cleaning operations
Here are some of the most frequent cleaning tasks performed with Data Transform nodes:
Standardising formats
Make data consistent across columns and rows.
- Convert dates to ISO format
- Normalise currency or percentage formats
- Enforce consistent casing (e.g., title case)
Example prompt: "Standardise all date columns to ISO format."
Handling missing values
Fill or remove incomplete data based on how it will be used downstream.
- Fill with mean, median, or custom values
- Forward/backward fill for time series
- Drop columns or rows with high null percentages
Example prompt: "Drop columns where more than 40% of values are missing."
Managing outliers
Isolate or remove extreme values that may distort analysis.
- Detect with Z-score or IQR methods
- Cap or remove extreme values
- Flag outliers for later review
Example prompt: "Cap outliers in the spend column at the 95th percentile."
Resolving duplicates
Clean up repeated or near-duplicate entries.
- Deduplicate by specific columns (e.g., email or user ID)
- Use exact or fuzzy matching logic
Example prompt: "Remove duplicates based on user_id and signup_date."
Normalising categories
Merge similar or inconsistent values across rows.
- Map different spellings or encodings to a single label
- Replace placeholder values like “N/A”, “none”, or “null”
Example prompt: "Combine 'New York', 'NY', and 'N.Y.' into one category."
Recasting data types
Ensure each column is correctly typed for downstream analysis.
- Convert strings to numeric
- Parse dates into datetime format
- Fix mixed-type columns
Example prompt: "Convert this column to numeric and drop unparseable rows."
Type issues are flagged automatically as part of profiling. You can fix them manually or ask the agent to assist.
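For intuition, the coercion the agent applies resembles this sketch with a hypothetical amount column:

```python
import pandas as pd

df = pd.DataFrame({"amount": ["12.50", "8", "oops", "30.1"]})

# errors="coerce" turns unparseable entries into NaN instead of raising
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Then drop the rows that could not be parsed
df = df.dropna(subset=["amount"])
```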
Cleaning text data
Some datasets contain one or more text columns—fields like descriptions, feedback, transcripts, or open-ended responses.
Basic cleaning of text data (e.g., trimming whitespace or lowercasing values) can be done via Data Transform nodes. However, advanced tasks—such as sentiment analysis, keyword extraction, and classification—are handled through Text Analytics nodes, which are also available in the pipeline canvas.
For deeper workflows involving natural language, see the Text Analytics guide.
Use a Transform node to prepare your text fields (e.g., removing nulls or cleaning casing), then chain into a Text Analytics node for structured interpretation.
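That preparation step, expressed as an illustrative pandas sketch over a hypothetical feedback column:

```python
import pandas as pd

df = pd.DataFrame({"feedback": ["  Great SERVICE!  ", None, "ok", "Great service!"]})

# Drop nulls, trim whitespace, and lowercase before text analytics
df = df.dropna(subset=["feedback"])
df["feedback"] = df["feedback"].str.strip().str.lower()
# "  Great SERVICE!  " and "Great service!" now share one normalised form
```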
Best practices for working with Transform nodes
- Describe your goal, not the method: The agent will choose the right operation for your intent.
- Iterate in stages: Keep each Transform node focused on a specific task.
- Preview before applying: Each node offers a data preview so you can confirm results before saving.
- Name your nodes clearly: Use labels like “Handle missing values” or “Normalise product names” to improve traceability.
If you’re unsure where to start, say "What cleaning steps do you recommend?"
The agent will make suggestions based on automatic profiling.
Example workflow
- Upload a dataset → Profiling runs automatically in the background
- Add a Transform node → The agent helps detect issues and configure fixes
- Describe your cleaning goals → e.g., "Prepare this for lead scoring"
- Review suggested changes → Inspect the data at each step to confirm the Pipeline agent has implemented the changes as intended
- Continue building → Chain additional Transform or Text Analytics nodes
- Export → Cleaned data is ready for dashboards, AI agents, or models
Pipelines are versioned and re-runnable—meaning you can automate cleaning each time your dataset updates.
Glossary
- Pipeline – A sequence of connected nodes that clean, transform, or analyse a dataset in TrueState.
- Data Transform node – A pipeline node that performs cleaning or formatting operations on structured data.
- Pipeline agent – The AI assistant that helps you configure nodes using natural language.
- Profiling – Automatic background analysis of a dataset’s quality and structure.
- Text Analytics node – A node for extracting structured insights from unstructured text fields.
Common pitfalls to avoid
1. Cleaning without purpose
Always connect cleaning steps to a downstream need such as creating a model, building a dashboard, or working towards an insight.
2. Applying too many changes at once
Break large cleaning jobs into multiple nodes to simplify debugging and iteration.
3. Ignoring column types
Even subtle typing issues (e.g., strings in numeric fields) can cause downstream failures.
4. Failing to label nodes
Clear naming makes pipelines easier to maintain and collaborate on.
Next steps
- Go to the Pipeline section in TrueState
- Upload or select a dataset
- Use the Pipeline agent to apply and refine transformations
- Continue with downstream analytics or modelling steps