Data cleaning
Lightweight AI-driven analytics with TrueState
Clean data is the foundation of trustworthy analytics. Whether you’re building dashboards, training models, or answering executive questions, poor-quality data leads to wasted time, broken logic, and unreliable insights.
In traditional workflows, data cleaning is often tedious and repetitive—taking time away from actual analysis. TrueState simplifies this process by combining automatic data profiling with intuitive, no-code transformation nodes that can be configured via natural language.
Data cleaning in TrueState is performed inside a pipeline, where you can add one or more Data Transform nodes to clean and shape your dataset. These nodes are configured through the Pipeline agent, which handles everything from column renaming and null handling to recoding values and flagging outliers.
Why data cleaning matters
Even the best datasets are rarely analysis-ready. Common issues include:
- Inconsistent naming conventions (e.g., “NY”, “New York”, “N.Y.”)
- Missing or null values
- Outliers or invalid data types
- Duplicate records
- Mismatched encodings in categorical fields
- Timestamp formatting issues
These problems can skew metrics, break aggregations, or mislead models and stakeholders alike.
Cleaning early, and with the right tools, avoids compounding problems in downstream dashboards, models, and reports.
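To make the risk concrete, here is a minimal pandas sketch (with invented city and revenue values) showing how inconsistent labels silently split an aggregation:

```python
import pandas as pd

# Made-up sales data with three spellings of the same city
df = pd.DataFrame({
    "city": ["NY", "New York", "N.Y.", "Boston"],
    "revenue": [100, 250, 75, 300],
})

# The three New York variants are treated as separate groups, so any
# per-city metric splits the true New York total (425) across three rows.
print(df.groupby("city")["revenue"].sum())
```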
How cleaning works in TrueState
Data cleaning happens inside the Pipeline section of the platform, where each dataset flows through a series of configurable nodes. To clean data, you’ll use the Data Transform node.
When you create a Transform node:
- The Pipeline agent will help you identify issues and suggest fixes
- You describe your intent in plain language (e.g., “standardise dates” or “remove duplicates”)
- The agent configures the transformation automatically
- You preview and apply the changes interactively
All transformations are recorded as part of the pipeline and can be reordered, modified, or removed as needed.
The Pipeline agent translates natural language into versioned, inspectable logic—so your cleaning steps are transparent and repeatable.
Common data cleaning operations
Here are some of the most frequent cleaning tasks performed with Data Transform nodes:
Standardising formats
Make data consistent across columns and rows.
- Convert dates to ISO format
- Normalise currency or percentage formats
- Enforce consistent casing (e.g., title case)
Example prompt: "Standardise all date columns to ISO format."
Handling missing values
Fill or remove incomplete data based on how it will be used downstream.
- Fill with mean, median, or custom values
- Forward/backward fill for time series
- Drop columns or rows with high null percentages
Example prompt: "Drop columns where more than 40% of values are missing."
Managing outliers
Isolate or remove extreme values that may distort analysis.
- Detect with Z-score or IQR methods
- Cap or remove extreme values
- Flag outliers for later review
Example prompt: "Cap outliers in the spend column at the 95th percentile."
Resolving duplicates
Clean up repeated or near-duplicate entries.
- Deduplicate by specific columns (e.g., email or user ID)
- Use exact or fuzzy matching logic
Example prompt: "Remove duplicates based on user_id and signup_date."
Normalising categories
Merge similar or inconsistent values across rows.
- Map different spellings or encodings to a single label
- Replace placeholder values like “N/A”, “none”, or “null”
Example prompt: "Combine 'New York', 'NY', and 'N.Y.' into one category."
Recasting data types
Ensure each column is correctly typed for downstream analysis.
- Convert strings to numeric
- Parse dates into datetime format
- Fix mixed-type columns
Example prompt: "Convert this column to numeric and drop unparseable rows."
Type issues are flagged automatically as part of profiling. You can fix them manually or ask the agent to assist.
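For intuition, the coercion the agent applies resembles this sketch with a hypothetical amount column:

```python
import pandas as pd

df = pd.DataFrame({"amount": ["12.50", "8", "oops", "30.1"]})

# errors="coerce" turns unparseable entries into NaN instead of raising
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Then drop the rows that could not be parsed
df = df.dropna(subset=["amount"])
```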
Cleaning text data
Some datasets contain one or more text columns—fields like descriptions, feedback, transcripts, or open-ended responses.
Basic cleaning of text data (e.g., trimming whitespace or lowercasing values) can be done via Data Transform nodes. However, advanced tasks—such as sentiment analysis, keyword extraction, and classification—are handled through Text Analytics nodes, which are also available in the pipeline canvas.
For deeper workflows involving natural language, see the Text Analytics guide.
Use a Transform node to prepare your text fields (e.g., removing nulls or cleaning casing), then chain into a Text Analytics node for structured interpretation.
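That preparation step, expressed as an illustrative pandas sketch over a hypothetical feedback column:

```python
import pandas as pd

df = pd.DataFrame({"feedback": ["  Great SERVICE!  ", None, "ok", "Great service!"]})

# Drop nulls, trim whitespace, and lowercase before text analytics
df = df.dropna(subset=["feedback"])
df["feedback"] = df["feedback"].str.strip().str.lower()
# "  Great SERVICE!  " and "Great service!" now share one normalised form
```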
Best practices for working with Transform nodes
- Describe your goal, not the method: The agent will choose the right operation for your intent.
- Iterate in stages: Keep each Transform node focused on a specific task.
- Preview before applying: Each node offers a data preview so you can confirm results before saving.
- Name your nodes clearly: Use labels like “Handle missing values” or “Normalise product names” to improve traceability.
If you’re unsure where to start, say "What cleaning steps do you recommend?"
The agent will make suggestions based on automatic profiling.
Example workflow
- Upload a dataset → Profiling runs automatically in the background
- Add a Transform node → The agent helps detect issues and configure fixes
- Describe your cleaning goals → e.g., "Prepare this for lead scoring"
- Review suggested changes → Inspect the data at each step to confirm the Pipeline agent has implemented the changes as intended
- Continue building → Chain additional Transform or Text Analytics nodes
- Export → Cleaned data is ready for dashboards, AI agents, or models
Pipelines are versioned and re-runnable—meaning you can automate cleaning each time your dataset updates.
Glossary
- Pipeline – A sequence of connected nodes that clean, transform, or analyse a dataset in TrueState.
- Data Transform node – A pipeline node that performs cleaning or formatting operations on structured data.
- Pipeline agent – The AI assistant that helps you configure nodes using natural language.
- Profiling – Automatic background analysis of a dataset’s quality and structure.
- Text Analytics node – A node for extracting structured insights from unstructured text fields.
Common pitfalls to avoid
1. Cleaning without purpose
Always connect cleaning steps to a downstream need such as creating a model, building a dashboard, or working towards an insight.
2. Applying too many changes at once
Break large cleaning jobs into multiple nodes to simplify debugging and iteration.
3. Ignoring column types
Even subtle typing issues (e.g., strings in numeric fields) can cause downstream failures.
4. Failing to label nodes
Clear naming makes pipelines easier to maintain and collaborate on.
Next steps
- Go to the Pipeline section in TrueState
- Upload or select a dataset
- Use the Pipeline agent to apply and refine transformations
- Continue with downstream analytics or modelling steps