Introduction
Unstructured data makes up the majority of information in organisations—often more than 80%. Unlike structured data in databases, unstructured data includes emails, documents, spreadsheets, images, videos, and even AI-generated content. It is scattered across systems, unmanaged, and frequently contains sensitive personal data.
What Is Unstructured Data?
Unstructured data refers to information that lacks a predefined format or organisational structure. Examples include job applications and CVs stored as free-text documents, event registration forms with open-ended responses, and emails with attachments containing personal identifiers. These files are typically saved on network drives, cloud platforms, or local devices—often without consistent naming conventions, metadata, or retention policies.
The Scale of the Problem
Studies show that unstructured data accounts for up to 90% of all organisational data. Worse, much of it is “dark data”: information that is collected but never used or properly managed. This leads to increased storage costs, compliance risks under regulations like GDPR, and difficulty in locating and securing sensitive information. To give a sense of the scale, one terabyte of unstructured data contains around one million files.
Key figures: more than 80% of organisational data is unstructured, and 1 TB corresponds to roughly 1 million unstructured files.
Why It Matters
Unstructured data often contains personal information, such as health details or national ID numbers. Without proper controls, this data can be exposed, misused, or retained far longer than legally permitted. GDPR mandates data minimisation and purpose limitation: organisations must collect and retain only the data that is necessary for a stated purpose. A real-world example: a folder labelled ‘Project Applications’ contained anaesthesiologist CVs from 2004, including personal IDs. These files had no valid retention basis and should have been deleted or moved to a secure HR system years ago.
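As a rough illustration of how forgotten files like these could be surfaced, the following Python sketch lists files under a folder that have not been modified within a retention window. The function name, the seven-year window in the usage note, and the use of last-modified time as a stand-in for a file's age are all assumptions made for this example, not a statement of what any regulation or tool requires.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86_400


def find_stale_files(root, max_age_days, now=None):
    """Return files under `root` last modified more than `max_age_days` ago.

    Assumption: last-modified time is used as a proxy for age. A real
    retention review would also consider the purpose and legal basis
    for keeping each file.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = []
    for path in Path(root).rglob("*"):
        # Only regular files are considered; directories are skipped.
        if path.is_file() and path.stat().st_mtime < cutoff:
            stale.append(path)
    return sorted(stale)
```

Calling find_stale_files("Project Applications", 7 * 365) would return candidates for review. The result is a list for a human to assess, not a list of files that are safe to delete.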
Content Classification Is Essential
Effective data governance starts with knowing what you are managing. Content classification tools help identify sensitive information, such as personal data, financial records, or confidential terms. Without classification, organisations cannot reliably enforce retention, access, or deletion policies.
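To make the idea concrete, here is a minimal Python sketch of pattern-based classification. It is far simpler than what commercial classification tools do. The labels, the keyword list, and the ID format (six digits, a hyphen, four digits) are hypothetical choices for illustration only.

```python
import re

# Hypothetical patterns for illustration. Real classifiers combine many
# more signals, such as checksums, context, and document type.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    # Assumed ID layout: six digits, a hyphen, four digits.
    "national_id": re.compile(r"\b\d{6}-\d{4}\b"),
    "confidential_term": re.compile(
        r"\b(?:confidential|salary|diagnosis)\b", re.IGNORECASE
    ),
}


def classify_text(text):
    """Return the sorted labels of every pattern that occurs in `text`."""
    return sorted(
        label for label, pattern in PATTERNS.items() if pattern.search(text)
    )
```

A file whose extracted text yields a non-empty label list can then be routed to the appropriate retention, access, or deletion policy. Pattern matching of this kind produces false positives and misses, so it is a starting point for review rather than a final verdict.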
Classification is the foundation for automation, compliance, and risk reduction. In the following small supplier data overview, created with AvePoint Insights and Policies, organisations had thousands of items requiring deeper analysis. Notably, organisations were also sharing data externally, in some cases via anonymous links, and our validation showed that none of these links had expiration dates.
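The check for anonymous links without expiration dates can be sketched in a few lines of Python. This is not how AvePoint Insights and Policies works internally. The SharingLink record, its field names, and the scope values are hypothetical and stand in for whatever export a given platform provides.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class SharingLink:
    """Hypothetical record for one sharing link in an exported report."""
    item: str
    scope: str  # assumed values: "anonymous", "organisation", "specific_people"
    expires: Optional[date]  # None means the link never expires


def risky_links(links):
    """Return anonymous links that have no expiration date."""
    return [
        link for link in links
        if link.scope == "anonymous" and link.expires is None
    ]
```

Feeding an exported list of links through risky_links gives a short, actionable list: each entry is a file that anyone holding the URL can open indefinitely.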
AI Without Data Awareness Is a Risk
AI tools like Microsoft Copilot or ChatGPT offer powerful capabilities—but using them without understanding your data landscape is dangerous. If your unstructured data contains outdated, sensitive, or misclassified content, AI may surface or process it inappropriately. Before deploying AI, ensure your data is classified, governed, and access-controlled.
Environment Limitations
Environments like Microsoft 365 and Google Workspace offer different governance features depending on licensing level. Legacy file servers lack most governance capabilities by default, such as version control, access auditing, and classification. These features can be added using third-party tools, but doing so requires planning and investment.
Conclusion
Unstructured data is not just a technical challenge—it is a strategic imperative.
Organisations must invest in tools, processes, and awareness to ensure that this vast and growing data pool is secure, compliant, and useful. Without content classification and governance, even the best AI tools can become a burden.
Want to explore this topic further? Read our article “Unstructured Data Threats and How Top Experts Say You Can Handle It”.
This is Part 1 of our 4-part series on unstructured data. In the next articles, we’ll go deeper into practical methods for managing and governing it.