Friday, December 21, 2012

Remove Sharepoint metadata from MS Office documents

In Sharepoint you may store files in the document libraries. Among with files themselves, it is possible to add additional metadata to each file. It is on of the ways to categorize content. In new Sharepoint 2013 platform it is even more important with their attraction to search-based solutions. Metadata values are stored differently for different files:

  • for Office documents metadata is stored in the file itself. It includes new open xml format (docx, xlsx, etc), and old formats (doc, xls, etc);
  • for other documents metadata is stored in the content database (there are also several mentions in the network that you may change this behavior by installing some extensions to the Sharepoint, but I didn’t find such extensions, if you know them, please share in comments).

So for example when you copy Word document (docx) from one document library to another (document libraries may be located in different web applications on different farms), metadata will be preserved. But if you will copy e.g. pdf document, all metadata will be lost. In this article I will show how to clear office files from the metadata. It can be useful when you reorganized content structure and want to start with clear version, without inheriting the garbage of old metadata (which even can be deleted in new version if we talk about managed metadata).

First of all we need to understand how metadata is stored in the office documents. I recommend the following article: Document Information Panel and Document Properties in SharePoint Server 2010. It says that metadata is stored inside “customXml section of the Open XML formats”:

image

However theory doesn’t provide all necessary information. In order to be able to remove metadata we need to understand it deeper. So for testing I created docx file with some test content, uploaded it to the document library with custom content type with several managed metadata fields and specified some values in these fields. After that I opened the doclib in the explorer view and copied document back to the file system. After that I changed extension to zip and unpacked the content of the file. In the files inside the package I found that managed metadata is stored in 2 places actually:

  • item3.xml file inside customXml subfolder;
  • custom.xml file inside docProps subfolder.

Metadata is stored differently inside these files. In the item3.xml it is stored like this:

   1: <?xml version="1.0" encoding="utf-8"?>
   2: <p:properties xmlns:p="http://schemas.microsoft.com/office/2006/metadata/properties"
   3: xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   4: xmlns:pc="http://schemas.microsoft.com/office/infopath/2007/PartnerControls">
   5:   <documentManagement>
   6:     <AuthorLogin xmlns="..." xsi:nil="true"/>
   7:     <DocLanguage_Hidden xmlns="...">
   8:       <Terms xmlns="...">
   9:         <TermInfo xmlns="...">
  10:           <TermName xmlns="...">English</TermName>
  11:           <TermId xmlns="...">42f6e37f-06b6-4881-946d-fc945753adfa</TermId>
  12:         </TermInfo>
  13:       </Terms>
  14:     </DocLanguage_Hidden>
  15:     ...
  16:   </documentManagement>
  17: </p:properties>

For clarity I removed “http://schemas.microsoft.com/office/infopath/2007/PartnerControls” namespace from the some tags. This example shows that Language field contains English value. Also termId is stored within the value.

In custom.xml data is stored by the following way:

   1: <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
   2: <Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/custom-properties"
   3: xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
   4:   <property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="2" name="ContentTypeId">
   5:     <vt:lpwstr>...</vt:lpwstr>
   6:   </property>
   7:   <property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="9" name="DocLanguage">
   8:     <vt:lpwstr>7;#English|42f6e37f-06b6-4881-946d-fc945753adfa</vt:lpwstr>
   9:   </property>
  10:   ...
  11: </Properties>

Here we also see value and term id, but in different format.

This investigation tells us that we need to remove the metadata from 2 places somehow. But how to do that, i.e. how to remove metadata from the office file programmatically?

First of all we need to download the Open XML SDK. We need to reference the following assembly from this SDK: DocumentFormat.OpenXml.dll. Also we will need to reference standard WindowsBase.dll. The code which removes the metadata is below:

   1: using (var document = WordprocessingDocument.Open("test.docx", true))
   2: {
   3:     // delete from custom properties first
   4:     if (document.CustomFilePropertiesPart != null &&
   5: document.CustomFilePropertiesPart.Properties != null)
   6:     {
   7:         document.CustomFilePropertiesPart.Properties.RemoveAllChildren();
   8:         document.CustomFilePropertiesPart.Properties.Save();
   9:     }
  10:  
  11:     // then from custom xml part "properties"
  12:     if (document.MainDocumentPart != null &&
  13: document.MainDocumentPart.CustomXmlParts != null)
  14:     {
  15:         Func<CustomXmlPart, bool> predicate =
  16:             p =>
  17:                 {
  18:                     using (var reader = new StreamReader(p.GetStream()))
  19:                     {
  20:                         var root = XElement.Load(reader);
  21:                         return (root.Name.LocalName == "properties");
  22:                     }
  23:                 };
  24:  
  25:         var propertiesPart = document.MainDocumentPart.CustomXmlParts
  26:             .FirstOrDefault(p => predicate(p));
  27:         if (propertiesPart != null)
  28:         {
  29:             document.MainDocumentPart.DeletePart(propertiesPart);
  30:         }
  31:     }
  32: }

Here we remove the metadata from the custom.xml first (lines 4-9) and then from custom xml part of the document item3.xml (lines 12-31). Removing from custom xml part is a little bit more tricky because you need to read xml content from the stream in order to find the correct part (single office file may contain several such parts).

Run this program with the file which contains metadata and then copy the file into another document library, all metadata will be empty. Hope that it will help you in your work.

15 comments:

  1. Thank you for posting this. I'm having some trouble getting this to work. Would you be able to post the complete code for this?

    ReplyDelete
  2. hi Todd,
    as far as I remember the code above worked in the console application, i.e. it should work without any changes. What problems do you have?

    ReplyDelete
  3. It isn't recognizing Wordprocessing, CustomXMLPart or XElement. I have referenced the OpenXML and WindowsBase dlls, and added "using DocuentFormat.OpenXML;" Is there something else I should be referencing?

    Thanks!

    ReplyDelete
  4. You need to add references to System.Xml.dll, System.Xml.Linq.dll and DocumentFormat.OpenXml.dll. After that you have to add usings with the following namespaces:
    - System.IO;
    - System.Linq;
    - System.Xml.Linq;
    - DocumentFormat.OpenXml.Packaging;

    Install resharper and it will add necessary namespaces automatically :).

    ReplyDelete
  5. i want to know how to get pdf files matadata,Please let me know what are the extensions need to install in the sharepoint

    ReplyDelete
  6. as far as I know, pdf is not OpenXml format, i.e. metadata are not stored inside pdf itself. They are stored only in content database, which means that you have to get its metadata like in all other cases: file.Item["Field"].

    ReplyDelete
  7. var propertiesPart = document.MainDocumentPart.CustomXmlParts.FirstOrDefault(p => predicate(p));

    The above line gives an error, CustomXmlParts.FirstOrDefault is not recognised.

    ReplyDelete
  8. Srikant Prasad,
    did you add the following namespaces (see above):
    - System.IO;
    - System.Linq;
    - System.Xml.Linq;
    - DocumentFormat.OpenXml.Packaging;
    ?

    ReplyDelete
  9. This seems to corrupt the document, problem seems to be that ItemN.xml reference is not removed from <Relationships in _rels folder

    ReplyDelete
  10. pateketu,
    for our documents this approach worked quite well, but in your cases there may be additional places in the document which require cleanup. You may analyze them and then remove using the same approach with Open XML SDK.

    ReplyDelete
  11. Hi Alexey,

    Many thanks for this article.
    I just have one question, which is as follows
    Can i remove or delete only specific property instead of removing all properties??
    is this possible

    Thanks.

    ReplyDelete
  12. Ketan,
    most probably it is possible. Just check what methods are available in document.CustomFilePropertiesPart.Properties collection instead of RemoveAllChildren(). Also in predicate function you should return true only for those properties which you want to delete: i.e. check not only name of the root element like in my example (root.Name.LocalName == "properties"), but also property name. It will require more deeper analysis of passed xml document in predicate.

    ReplyDelete
  13. This comment has been removed by the author.

    ReplyDelete
  14. Hi Alexey,

    Is is possible through jQuery, Please share if you have some jQuery code sample to remove the properties of the document.

    Thanks,
    Anil

    ReplyDelete
    Replies
    1. Anil,
      I don't think that it is possible to remote it in javascript exactly like in my example, but you may iterate through all metadata of the document and set each field to empty value like shown here: http://sadomovalex.blogspot.com/2015/12/copy-taxonomy-field-values-using.html (this example shows how to manipulate with managed metadata fields via javascript object model, but working with another field types is even more simpler).

      Delete