In SharePoint you can store files in document libraries. Along with the files themselves, you can attach additional metadata to each file. It is one of the ways to categorize content. In the new SharePoint 2013 platform this is even more important because of its emphasis on search-based solutions. Metadata values are stored differently for different file types:
- for Office documents, metadata is stored in the file itself. This includes the new Open XML formats (docx, xlsx, etc.) and the old binary formats (doc, xls, etc.);
- for other documents, metadata is stored in the content database (there are several mentions on the web that you can change this behavior by installing extensions to SharePoint, but I didn’t find such extensions; if you know of any, please share in the comments).
So, for example, when you copy a Word document (docx) from one document library to another (the libraries may be located in different web applications or on different farms), the metadata will be preserved. But if you copy, e.g., a pdf document, all metadata will be lost. In this article I will show how to clear the metadata from Office files. It can be useful when you have reorganized the content structure and want to start with a clean version, without inheriting the garbage of old metadata (which may even have been deleted in the new version, if we are talking about managed metadata).
First of all we need to understand how metadata is stored in Office documents. I recommend the following article: Document Information Panel and Document Properties in SharePoint Server 2010. It says that metadata is stored inside the “customXml section of the Open XML formats”.
However, the theory doesn’t provide all the necessary information. In order to be able to remove metadata we need to understand it more deeply. So for testing I created a docx file with some test content, uploaded it to a document library that uses a custom content type with several managed metadata fields, and specified some values in these fields. After that I opened the document library in the Explorer view and copied the document back to the file system. Then I changed the extension to zip and unpacked the content of the file. In the files inside the package I found that managed metadata is actually stored in 2 places:
- item3.xml file inside customXml subfolder;
- custom.xml file inside docProps subfolder.
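The rename-and-unpack step can also be scripted, because an Open XML file is an ordinary zip archive. The following is a minimal Python sketch (not part of the original experiment) that lists the parts located under the customXml and docProps folders. The function name `list_metadata_parts` is hypothetical and chosen only for illustration.

```python
import zipfile


def list_metadata_parts(package):
    """Return the names of the parts under customXml/ and docProps/.

    `package` is a path or a binary file-like object holding a
    docx/xlsx/pptx file, which is an ordinary zip archive.
    """
    with zipfile.ZipFile(package) as zf:
        return sorted(
            name
            for name in zf.namelist()
            if name.startswith(("customXml/", "docProps/"))
        )
```

Calling `list_metadata_parts("test.docx")` on a document copied from a library (the file name here is only a placeholder) should show which item files exist in the package. The number in the item file name is not guaranteed to be 3 in every document.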
Metadata is stored differently inside these files. In item3.xml it is stored like this:
For clarity I removed the “http://schemas.microsoft.com/office/infopath/2007/PartnerControls” namespace from some tags. This example shows that the Language field contains the value English. The termId is also stored with the value.
In custom.xml the data is stored in the following way:
Here we also see the value and the term id, but in a different format.
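To inspect what custom.xml holds without opening it by hand, the properties can be read with a few lines of Python. This is only a sketch: the function name `read_custom_properties` is hypothetical, and the sample value below (including the all-zero GUID) is made up for illustration rather than taken from the test document. The namespace is the standard one for the custom properties part.

```python
import xml.etree.ElementTree as ET

# Namespace of the custom file properties part (docProps/custom.xml).
CUSTOM_NS = "http://schemas.openxmlformats.org/officeDocument/2006/custom-properties"

# Made-up sample content; real files carry their own names and values.
SAMPLE = (
    '<Properties xmlns="' + CUSTOM_NS + '" '
    'xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">'
    '<property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="2" name="Language">'
    "<vt:lpwstr>1;#English|00000000-0000-0000-0000-000000000000</vt:lpwstr>"
    "</property>"
    "</Properties>"
)


def read_custom_properties(custom_xml):
    """Map each custom property name to the text of its typed value element."""
    root = ET.fromstring(custom_xml)
    result = {}
    for prop in root.findall(f"{{{CUSTOM_NS}}}property"):
        value = next(iter(prop), None)  # first child, e.g. <vt:lpwstr>
        result[prop.get("name")] = value.text if value is not None else None
    return result


properties = read_custom_properties(SAMPLE)
```

For a real file, the XML can be obtained with `zipfile.ZipFile(path).read("docProps/custom.xml")` and passed to the same function.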
This investigation tells us that we need to remove the metadata from 2 places. But how do we do that, i.e. how do we remove metadata from an Office file programmatically?
First of all we need to download the Open XML SDK and reference the following assembly from it: DocumentFormat.OpenXml.dll. We also need to reference the standard WindowsBase.dll. The code which removes the metadata is below:
Here we remove the metadata from custom.xml first (lines 4-9) and then from the custom XML part of the document, item3.xml (lines 12-31). Removing it from the custom XML part is a little trickier, because you need to read the XML content from the stream in order to find the correct part (a single Office file may contain several such parts).
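The same two-step idea can be sketched without the Open XML SDK by rewriting the zip package directly. The Python sketch below is independent of the SDK-based listing described above and uses only the standard library. It rests on several assumptions: the SharePoint properties part is recognized by a root element named `properties` in the “http://schemas.microsoft.com/office/2006/metadata/properties” namespace, every custom property in custom.xml is removed (not only the SharePoint ones), and the function name `strip_metadata` is hypothetical. ElementTree may also rename namespace prefixes when it writes the XML back, which keeps the XML valid but changes how it looks.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CUSTOM_NS = "http://schemas.openxmlformats.org/officeDocument/2006/custom-properties"
# Assumed root element of the SharePoint properties part (item3.xml in the test file).
SP_ROOT = "{http://schemas.microsoft.com/office/2006/metadata/properties}properties"


def _serialize(root):
    """Write an element back as UTF-8 bytes with an XML declaration."""
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)


def strip_metadata(data):
    """Return a copy of the package bytes with the metadata values removed."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            name = info.filename
            payload = src.read(name)
            if name == "docProps/custom.xml":
                # Place 1: drop every <property> element.
                root = ET.fromstring(payload)
                for prop in root.findall(f"{{{CUSTOM_NS}}}property"):
                    root.remove(prop)
                payload = _serialize(root)
            elif name.startswith("customXml/item") and name.endswith(".xml"):
                # Place 2: several item parts may exist, so the content
                # has to be read to find the one holding the properties.
                root = ET.fromstring(payload)
                if root.tag == SP_ROOT:
                    for section in root:
                        for field in list(section):
                            section.remove(field)
                    payload = _serialize(root)
            dst.writestr(name, payload)
    return out.getvalue()
```

A possible usage is to read the file with `open(path, "rb").read()`, pass the bytes to `strip_metadata`, and write the result to a new file, keeping the original as a backup until the cleaned copy has been opened successfully in Word.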
Run this program on a file which contains metadata and then copy the file into another document library: all metadata will be empty. I hope this helps you in your work.