Introduction
Legacy data often hides in plain sight—stored in personal folders, shared drives, or temporary locations. These files may contain sensitive information, outdated formats, or undocumented decisions. Identifying what exists, where it resides, and whether it’s still relevant is a critical step in responsible data governance.
Why Legacy Data Matters
Many organisations retain documents far beyond their useful life. These files may lack a valid legal basis for retention and should have been deleted or migrated to a proper application long ago. Such oversights are common and pose compliance risks under the GDPR and other regulations. For example, in one environment we analysed, data aging spanned more than 10 years, and only part of that data was likely still relevant.
When analysing the data, however, we found large portions that were outdated, irrelevant, or duplicates of existing data. Examples include printer driver disks for DOS operating systems, dating back more than 20 years, which are typically unnecessary and simply increase storage and backup costs. The data was analysed with the Arctera Information Governance application.
Legacy data can span 20+ years
Old files only add unnecessary storage and backup costs
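The aging analysis described above can be approximated with a short filesystem scan. The sketch below is a minimal illustration, not a governance tool: it walks a directory tree and reports files whose modification timestamp exceeds an age threshold (the function name and the 10-year default are our own assumptions).

```python
import time
from pathlib import Path

SECONDS_PER_YEAR = 365.25 * 24 * 3600

def find_stale_files(root: str, age_limit_years: float = 10.0):
    """Yield (path, age_in_years, size_bytes) for regular files whose
    modification timestamp is older than the given age limit."""
    cutoff = time.time() - age_limit_years * SECONDS_PER_YEAR
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        st = path.stat()
        if st.st_mtime < cutoff:
            age = (time.time() - st.st_mtime) / SECONDS_PER_YEAR
            yield path, age, st.st_size
```

Summing the third element of each result gives a rough estimate of reclaimable storage; a real assessment would also account for duplicates and backup copies.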
Digitisation and Discovery
Legacy data isn’t limited to paper archives. Digitised content—especially when migrated from old systems—can still be unmanaged. Without proper classification, digitisation simply shifts the problem from one format to another. Organisations must assess whether old data is needed and ensure it is stored securely and appropriately.
Metadata – The First Layer of Insight
Metadata provides essential clues about a file’s origin and lifecycle: creation date and creator, last accessed date and reader, last modified date and modifier, file type, size, and permissions. However, metadata can be unreliable. System migrations, external vendors, or outdated software may corrupt timestamps. Typically, the modification date is the most trustworthy indicator of a file’s age.
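The metadata fields listed above can be read directly from the filesystem. A minimal Python sketch (the function name, returned field names, and date format are our own choices) also illustrates why timestamps can mislead:

```python
import stat
import time
from pathlib import Path

def describe(path: str) -> dict:
    """Return basic lifecycle metadata for a single file.
    Caveat: st_ctime is creation time on Windows but metadata-change
    time on Unix -- one reason these clues can be unreliable."""
    st = Path(path).stat()
    fmt = lambda ts: time.strftime("%Y-%m-%d", time.localtime(ts))
    return {
        "modified": fmt(st.st_mtime),        # usually the most reliable
        "accessed": fmt(st.st_atime),        # often disabled (noatime mounts)
        "changed_or_created": fmt(st.st_ctime),
        "size_bytes": st.st_size,
        "permissions": stat.filemode(st.st_mode),
    }
```

Note that creator and last-reader identities are not part of standard POSIX metadata; capturing them requires platform-specific auditing or a governance tool.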
Content Classification Is Essential
Metadata alone isn’t enough. Organisations must classify content to understand what’s inside each file. Tools can detect personal identifiers, financial data, or confidential terms. Classification enables automation, retention enforcement, and risk mitigation. Without it, legacy data remains a blind spot—and a liability.
AI Without Data Awareness Is a Risk
AI tools like Microsoft Copilot or ChatGPT can process vast amounts of data—but if your legacy files contain outdated, sensitive, or misclassified content, AI may expose it unintentionally. Before deploying AI, ensure your data is classified, governed, and access-controlled. Otherwise, you risk amplifying existing vulnerabilities.
Cloud Migration Doesn’t Equal Modernisation
Migrating legacy data to the cloud “as-is” is common—but dangerous. Moving unmanaged files to Microsoft 365 or Google Workspace does not reduce risk or update content. Old formats (e.g. WordPerfect 5.1) remain unsupported, and sensitive data remains exposed unless actively governed.
Migration must be paired with classification, clean-up, and policy enforcement. For example, in one environment we found thousands of files that could no longer be content-classified; notably, these included hundreds of WordPerfect 5.x files, all from around 1990 and therefore more than 35 years old. Opening such files is difficult, as software support often no longer exists.
Environment Limitations
Governance capabilities vary across platforms. Microsoft 365 and Google Workspace offer different features depending on licensing level. Legacy fileservers, by default, lack classification, access auditing, and retention enforcement. These can be added using third-party tools but require investment and planning.
Conclusion
Legacy data is often overlooked, yet it holds both risks and opportunities. By identifying what exists, analysing its content, and applying clear retention rules, organisations can reduce exposure, improve compliance, and prepare for future innovations like AI.
But remember: migrating data to the cloud without governance doesn’t solve the problem—it simply relocates it.
This is Part 2 of our 4-part series on unstructured data. If you haven’t yet, check out Part 1: Unstructured Data and Its Challenges. In the next articles, we’ll go deeper into practical methods for managing and governing it.
Want to explore this topic further? Read our article “Unstructured Data Threats and How Top Experts Say You Can Handle It”.