Understanding OOXML (Part-2): Anatomy of Microsoft Office File Formats

If you landed here directly, this is one part of a multi-part article about OOXML and how you can build your own OOXML framework to reduce license costs for third-party libraries. Link to Part 1: Understanding OOXML (Part-1): The Key to Cross-Platform Office Document Automation

OOXML (Office Open XML) plays a pivotal role in cross-platform office automation and development by enabling consistent, reliable, and flexible document handling across different platforms and operating systems. Since it is an open standard, it allows a wide range of applications beyond Microsoft Office to read, write, and manipulate Office documents in a structured and predictable manner.

What Do Word, PowerPoint, Excel, Outlook, Access, and Visio Files Really Contain?

Office Open XML (OOXML) is a specification developed by Microsoft, and standardized as ECMA-376 and ISO/IEC 29500, to store office documents in an XML-based format. Unlike older binary formats (e.g., .doc, .xls), OOXML files are structured using XML, which offers a transparent way to manipulate and store data. OOXML files are essentially ZIP archives containing multiple XML files and other resources such as images and embedded media.

Below is a detailed breakdown of what these common Office document types contain when saved in the OOXML format.

Word Documents (.docx)

A Word document (.docx) file is made up of several parts:

  • word/document.xml: This file contains the core content of the Word document, such as text, paragraphs, headings, tables, and more. Each paragraph is stored as a separate XML element.
  • word/styles.xml: This file defines the styles used in the document. Styles include formatting like font size, color, headings, alignment, and more.
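  • word/settings.xml: Contains document-wide settings, such as the zoom level, default tab stops, and compatibility options. Page size and margins are not stored here; they belong to the section properties inside document.xml.
  • word/fontTable.xml: Lists the fonts used in the document along with metadata such as font family, character set, and pitch, which helps an application pick a substitute when a font is missing.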
  • word/media/: Folder containing media such as images, audio, and video that have been embedded within the document.
  • word/_rels/: Relationship files linking various parts, such as linking images or charts embedded in the document.
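
To make the role of word/document.xml concrete, here is a minimal Python sketch that reads the paragraph text out of a package using only the standard library. The helper names (read_paragraphs, make_sample_docx) are hypothetical, and the sample package is a stripped-down stand-in built in memory, not a complete .docx that Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's {uri}tag notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_paragraphs(docx_file):
    """Return the text of every <w:p> paragraph found in word/document.xml."""
    with zipfile.ZipFile(docx_file) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W + "p"):
        # The visible text of a paragraph is split across <w:t> elements.
        paragraphs.append("".join(t.text or "" for t in p.iter(W + "t")))
    return paragraphs


def make_sample_docx():
    """Build a tiny in-memory package for illustration (not a full .docx)."""
    document_xml = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body>'
        "<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", document_xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for text in read_paragraphs(make_sample_docx()):
        print(text)
```

With a real file you would pass its path instead of the in-memory buffer, for example read_paragraphs("report.docx"), where the file name is only a placeholder.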

PowerPoint Presentations (.pptx)

A PowerPoint presentation (.pptx) is similarly structured and contains:

  • ppt/presentation.xml: Defines the structure of the presentation, including the slide size and the ordered lists of slide masters and slides.
  • ppt/slides/: Contains individual XML files for each slide in the presentation. Each slide is stored as its own file (slide1.xml, slide2.xml, and so on).
  • ppt/media/: This folder contains images, videos, and other media elements embedded in the slides.
  • ppt/_rels/: Relationship files that link presentation.xml to the slides, slide masters, and theme. Each slide has its own relationship file under ppt/slides/_rels/ for its layout and multimedia content.

Excel Workbooks (.xlsx)

An Excel workbook (.xlsx) file is made of:

  • xl/workbook.xml: Contains the overall structure of the workbook, including references to sheets within the workbook.
  • xl/worksheets/: Each worksheet is stored as an individual XML file, like sheet1.xml, sheet2.xml, containing the data of the cells, rows, and columns.
  • xl/styles.xml: Contains the styles used across the workbook, such as cell formatting (number, date, text alignment).
  • xl/sharedStrings.xml: Contains a table of the text strings used in the workbook. Cells store an index into this table instead of repeating the text, which reduces file size.
  • xl/worksheets/_rels/: Contains relationships for charts, images, or other media within the worksheet.
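
The interplay between a worksheet and xl/sharedStrings.xml is easier to see in code. The following Python sketch resolves cell values, assuming the usual SpreadsheetML convention that a cell marked t="s" stores an index into the shared string table. The function names and the sample data are hypothetical, and the in-memory package is illustrative rather than a complete .xlsx.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace, in ElementTree's {uri}tag notation.
S = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def read_cells(xlsx_file, sheet_part="xl/worksheets/sheet1.xml"):
    """Map cell references such as 'A1' to values, resolving shared strings."""
    with zipfile.ZipFile(xlsx_file) as package:
        shared = []
        if "xl/sharedStrings.xml" in package.namelist():
            table = ET.fromstring(package.read("xl/sharedStrings.xml"))
            for item in table.iter(S + "si"):
                # One <si> entry may hold several <t> runs (rich text).
                shared.append("".join(t.text or "" for t in item.iter(S + "t")))
        sheet = ET.fromstring(package.read(sheet_part))
    cells = {}
    for cell in sheet.iter(S + "c"):
        value = cell.find(S + "v")
        if value is None:
            continue  # empty cell, nothing stored
        if cell.get("t") == "s":
            # The stored value is an index into the shared string table.
            cells[cell.get("r")] = shared[int(value.text)]
        else:
            cells[cell.get("r")] = value.text
    return cells


def make_sample_xlsx():
    """Build a tiny in-memory package for illustration (not a full .xlsx)."""
    ns = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
    strings = '<sst xmlns="%s"><si><t>Name</t></si><si><t>Widget</t></si></sst>' % ns
    sheet = (
        '<worksheet xmlns="%s"><sheetData>'
        '<row r="1"><c r="A1" t="s"><v>0</v></c><c r="B1" t="s"><v>1</v></c></row>'
        '<row r="2"><c r="A2"><v>42</v></c></row>'
        "</sheetData></worksheet>" % ns
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("xl/sharedStrings.xml", strings)
        package.writestr("xl/worksheets/sheet1.xml", sheet)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(read_cells(make_sample_xlsx()))
```

Numeric cells come back as the raw text stored in the file; converting them to numbers or dates depends on the cell's style, which this sketch does not look at.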

Outlook Files (.msg or .pst)

Outlook messages (.msg) and personal store files (.pst) contain email messages, calendar entries, tasks, and contacts. These are binary formats and are not part of OOXML, but their content can be converted to Word documents (.docx) or Excel sheets (.xlsx) using various conversion tools.

Access Files (.accdb)

Microsoft Access databases store relational data in .accdb files. While these are not part of OOXML, data from Access can be exported into OOXML-compatible formats (e.g., Excel) for further manipulation.

Visio Files (.vsdx)

A Visio drawing (.vsdx) file is structured with:

  • visio/document.xml: Contains the core structure of the drawing, including the pages and shapes.
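  • visio/pages/: Contains pages.xml, which lists the pages, plus one XML file per page in the drawing (page1.xml, page2.xml, and so on).
  • visio/_rels/: Contains the relationship files that link document.xml to other parts of the package, such as the pages and masters. Connections between shapes are described inside the page XML itself, not in the relationship files.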

How Can You Open Them and Look Inside Using Tools Like WinRAR?

The OOXML File Structure

OOXML files, like .docx, .pptx, and .xlsx, are essentially ZIP archives. This means that you can inspect the contents of these files using a file archiver such as WinRAR or 7-Zip, or even Windows Explorer once the file has a .zip extension.

When you open a .docx file, for example, you'll find a ZIP archive containing:

  • XML files that define the document's content, styles, and settings.
  • A _rels folder containing relationship files.
  • A media folder containing any images or media files.

Steps to Explore OOXML Files with WinRAR:

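  1. Rename the File: Change the file extension from .docx, .pptx, or .xlsx to .zip. WinRAR and 7-Zip can usually open the original file directly, so this step mainly matters for Windows Explorer.
  2. Open the ZIP Archive: Open it with WinRAR or 7-Zip, and extract it if you want to edit the files.
  3. Inspect the Contents: Browse the folder structure. For a .docx file, expect the word folder with document.xml, styles.xml, the media folder, and the _rels folder described above.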

Inside these XML files, you will find the structure, content, and styles that define the document. This method is the key to understanding how data is stored in OOXML files and allows you to inspect and even modify the content manually.
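
The same inspection can be done without an archiver or any renaming. Here is a minimal Python sketch using the standard zipfile module; list_parts and make_sample_docx are hypothetical helper names, and the sample package contains placeholder bytes rather than real document parts.

```python
import io
import zipfile


def list_parts(office_file):
    """Return the names of all files stored inside an OOXML package."""
    # zipfile only looks at the content, so the .docx/.pptx/.xlsx
    # extension does not need to be changed to .zip first.
    with zipfile.ZipFile(office_file) as package:
        return sorted(package.namelist())


def make_sample_docx():
    """Build an in-memory archive that mimics a .docx layout (placeholder data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        for name in (
            "word/document.xml",
            "word/styles.xml",
            "word/settings.xml",
            "word/media/image1.png",
            "word/_rels/document.xml.rels",
        ):
            package.writestr(name, b"placeholder")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name in list_parts(make_sample_docx()):
        print(name)
```

For a file on disk, pass its path instead of the buffer, for example list_parts("report.docx"), where the file name is only a placeholder.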

For instance, take a multi-page Word document. When you open it using WinRAR, you see the compressed files at the top level and more files inside the 'word' folder.

[Figure: compressed files]
[Figure: inside 'word' folder]

The document contents are stored in document.xml. The media folder contains the images, videos, and any other media added to the file.

[Figure: images in our Word document]

Let's take another example of a PowerPoint presentation with multiple slides.

[Figure: sample presentation. Courtesy: sample template gallery from Microsoft PowerPoint]

Again, this .pptx file has multiple files within.

[Figure: internals of a PowerPoint presentation document]

I hope this gave a glimpse of how office documents are structured and how many physical files are actually created under the hood within one single file.

Now let's see what each of these xml files consists of and how they are all stitched together to make it work.

Exploring Office XML files

Since the underlying concepts behind the various document types are similar, I will focus on explaining the PowerPoint document format in detail. This is my preferred document type because it is more structured and contains a diverse range of object types, which makes it an excellent example for understanding the nuances of Office Open XML. PowerPoint files also tend to be more complex than other document types, offering a richer insight into how these documents are organized and processed.

Exploring the PowerPoint document: Understanding the Structure and Key Components

When you open a PowerPoint document (.pptx), what you’re actually looking at is a compressed archive file (ZIP format) containing various XML files and directories. Each of these files serves a specific purpose in defining the presentation, its slides, themes, media, and relationships. Let’s walk through the key components and how they are structured inside the archive.

1. Base Directory and Key Folders

After unzipping the .pptx file, you will find several folders and XML files. The main components of a PowerPoint document include:

  • /ppt/presentation.xml – Contains presentation properties.
  • /ppt/slides/slide1.xml – Defines the content of individual slides.
  • /ppt/slideMasters/ – Contains slide master templates for consistent formatting.
  • /ppt/theme/ – Holds theme-related resources, like color schemes and fonts.
  • /ppt/media/ – Stores all images and media files used in the presentation.
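  • /_rels/ and /ppt/_rels/ – Define the relationships between different parts of the presentation. The root /_rels/.rels file points to presentation.xml, and /ppt/_rels/presentation.xml.rels points to the slides, slide masters, and theme.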

2. /ppt/presentation.xml – The Heart of the Presentation

The presentation.xml file is the core of a PowerPoint presentation. It contains the high-level properties and references that organize the overall structure of the slides. This file defines:

  • Slide List: Lists all the slides in the presentation in display order. Each entry carries a relationship ID that resolves to the respective slide file (e.g., slide1.xml, slide2.xml).
  • Slide Master List: References the slide masters available to the presentation.
  • Slide Size: Specifies the slide width and height in EMUs (English Metric Units, 914,400 per inch).

Slide transitions and animation timings are not stored here; they live in the individual slide files.

In essence, presentation.xml connects all the individual slides and sets the stage for what will appear in the final presentation. Here's a simplified sample of what this file might look like internally:

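<p:presentation xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main"
                xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
    <p:sldMasterIdLst>
        <p:sldMasterId id="2147483648" r:id="rId1"/>
        <p:sldMasterId id="2147483660" r:id="rId2"/>
    </p:sldMasterIdLst>
    <p:sldIdLst>
        <p:sldId id="256" r:id="rId3"/>
        <p:sldId id="257" r:id="rId4"/>
    </p:sldIdLst>
    <p:sldSz cx="12240000" cy="9180000"/>
</p:presentation>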
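
To show how a program would consume this file, here is a small Python sketch that pulls the slide relationship IDs and the slide size out of a presentation.xml string. It assumes the standard PresentationML element names (p:sldIdLst, p:sldId, p:sldSz); the function name and the sample values are illustrative only.

```python
import xml.etree.ElementTree as ET

# Namespaces in ElementTree's {uri}tag notation.
P = "{http://schemas.openxmlformats.org/presentationml/2006/main}"
R = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}"

EMU_PER_INCH = 914400  # OOXML measures lengths in English Metric Units

SAMPLE_PRESENTATION_XML = """<p:presentation
    xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main"
    xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <p:sldMasterIdLst>
    <p:sldMasterId id="2147483648" r:id="rId1"/>
  </p:sldMasterIdLst>
  <p:sldIdLst>
    <p:sldId id="256" r:id="rId2"/>
    <p:sldId id="257" r:id="rId3"/>
  </p:sldIdLst>
  <p:sldSz cx="12192000" cy="6858000"/>
</p:presentation>"""


def describe_presentation(xml_text):
    """Return (slide relationship IDs in order, (width, height) in inches)."""
    root = ET.fromstring(xml_text)
    # The order of <p:sldId> elements is the order of slides in the deck.
    slide_rids = [sld.get(R + "id") for sld in root.iter(P + "sldId")]
    size = root.find(P + "sldSz")
    width_in = int(size.get("cx")) / EMU_PER_INCH
    height_in = int(size.get("cy")) / EMU_PER_INCH
    return slide_rids, (width_in, height_in)


if __name__ == "__main__":
    rids, (width, height) = describe_presentation(SAMPLE_PRESENTATION_XML)
    print(rids, round(width, 2), round(height, 2))
```

The relationship IDs are only handles at this point. Turning them into file paths requires the matching .rels file, which is covered in the relationships section.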

3. /ppt/slides/slideX.xml – The Individual Slides

Each slide in the presentation is represented by an individual XML file inside the /ppt/slides/ folder, such as slide1.xml, slide2.xml, etc. Each of these files contains:

  • Slide Content: Text, shapes, tables, charts, etc., and their positions on the slide.
  • Slide Elements: The visual elements, such as bullet points, images, or any embedded objects (e.g., videos or Excel charts).
  • Slide Structure: This includes layout details, such as the position of objects (using coordinates), text styles, and formatting.

Each slideX.xml file is referenced from presentation.xml through a relationship ID, which presentation.xml.rels maps to the slide's path. This is how the presentation knows which slides to include and in what order.

4. /ppt/slideMasters/ – The Slide Master Templates

PowerPoint uses Slide Masters to define the overall layout and design for a series of slides. Slide Masters ensure consistency across slides in terms of background images, colors, fonts, and positions for placeholders.

  • The /ppt/slideMasters/ folder contains one or more master slide XML files, typically named something like slideMaster1.xml.
  • These files contain the default styles and settings that are applied to slides unless overridden by individual slides. They include details like:
      • Layout Templates: The predefined structure for where content appears (e.g., title, text, images). The layouts themselves are stored in /ppt/slideLayouts/, and each layout points to its master.
      • Themes: Font styles, color schemes, and effects.

Each slide links to a slide layout, which in turn links to a Slide Master. Through this chain PowerPoint ensures a consistent look across multiple slides, even if their content differs. Here's a simplified example of slide1.xml:

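<p:sld xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main" xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
    <p:cSld><p:spTree>  <!-- required group-shape properties omitted for brevity -->
        <p:sp>
            <p:nvSpPr><p:cNvPr id="2" name="Title 1"/><p:cNvSpPr/><p:nvPr/></p:nvSpPr>
            <p:spPr>
                <a:xfrm><a:off x="100000" y="100000"/><a:ext cx="5000000" cy="2000000"/></a:xfrm>
                <a:prstGeom prst="rect"><a:avLst/></a:prstGeom><a:solidFill><a:srgbClr val="FF0000"/></a:solidFill>
            </p:spPr>
            <p:txBody><a:bodyPr/><a:p><a:r><a:t>Welcome to the Presentation!</a:t></a:r></a:p></p:txBody>
        </p:sp>
    </p:spTree></p:cSld>
</p:sld>

  • p:cSld (common slide data): Contains the shared data for the slide.
  • p:spTree (shape tree): Represents the shapes (objects) on the slide. In this case, a rectangle with text.
  • p:sp: A shape element. Here, the shape has text content ("Welcome to the Presentation!").
  • p:spPr (shape properties): Defines the shape's position, size, and appearance, like color.
  • p:txBody (text body): Contains the text elements for the shape.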
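
The elements described above can be read back with a few lines of Python. This is a minimal sketch that assumes the standard PresentationML and DrawingML names (p:sp, p:spPr, a:xfrm, a:off, a:t); the function name and the sample slide are hypothetical, and the sample omits parts that a real slide requires.

```python
import xml.etree.ElementTree as ET

# Namespaces in ElementTree's {uri}tag notation.
P = "{http://schemas.openxmlformats.org/presentationml/2006/main}"
A = "{http://schemas.openxmlformats.org/drawingml/2006/main}"

SAMPLE_SLIDE_XML = """<p:sld
    xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main"
    xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
  <p:cSld><p:spTree>
    <p:sp>
      <p:spPr>
        <a:xfrm><a:off x="100000" y="100000"/><a:ext cx="5000000" cy="2000000"/></a:xfrm>
      </p:spPr>
      <p:txBody><a:bodyPr/>
        <a:p><a:r><a:t>Welcome to the Presentation!</a:t></a:r></a:p>
      </p:txBody>
    </p:sp>
  </p:spTree></p:cSld>
</p:sld>"""


def shape_summaries(slide_xml):
    """Return (text, (x, y) offset in EMUs or None) for each shape on a slide."""
    root = ET.fromstring(slide_xml)
    summaries = []
    for shape in root.iter(P + "sp"):
        # All text runs of the shape, joined in document order.
        text = "".join(t.text or "" for t in shape.iter(A + "t"))
        offset = shape.find(f"{P}spPr/{A}xfrm/{A}off")
        # Placeholders often inherit their position and have no <a:off>.
        position = None
        if offset is not None:
            position = (int(offset.get("x")), int(offset.get("y")))
        summaries.append((text, position))
    return summaries


if __name__ == "__main__":
    print(shape_summaries(SAMPLE_SLIDE_XML))
```

Pictures, tables, and charts use other elements (for example p:pic and p:graphicFrame), so this sketch only reports plain shapes.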

The slide master is a powerful feature that demonstrates how reusable components can be created, even at the document level.

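<p:sldMaster xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main" xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
    <p:cSld><p:spTree>  <!-- required group-shape properties omitted for brevity -->
        <p:sp>
            <p:nvSpPr><p:cNvPr id="2" name="Background"/><p:cNvSpPr/><p:nvPr/></p:nvSpPr>
            <p:spPr>
                <a:xfrm><a:off x="0" y="0"/><a:ext cx="12240000" cy="9180000"/></a:xfrm>
                <a:solidFill><a:srgbClr val="FFFFFF"/></a:solidFill>
                <a:ln><a:solidFill><a:srgbClr val="FFFFFF"/></a:solidFill></a:ln>
            </p:spPr>
        </p:sp>
    </p:spTree></p:cSld>  <!-- p:clrMap, p:sldLayoutIdLst, and p:txStyles omitted -->
</p:sldMaster>

  • p:cSld (common slide data): Contains layout and styling shared by all slides using this master.
  • p:spTree (shape tree): Contains the shapes and their layout.
  • p:spPr (shape properties): Defines the visual appearance, such as fill color or line styles.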

5. /ppt/theme/ – Theme Resources

The /ppt/theme/ folder contains the files responsible for the visual theme of the presentation. This includes:

  • Theme Colors: The color palette for the slides.
  • Fonts: The typefaces used across the presentation.
  • Effects: Graphic effects applied to text and objects.
  • Formatting: Consistent formatting for titles, text, and backgrounds.

Files here ensure that the overall visual identity of the presentation is cohesive and consistent, as themes are applied uniformly to all slides. Here's a simplified example of the theme color XML:

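<a:theme xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
    <a:themeElements>
        <a:clrScheme name="Office">
            <a:dk1><a:srgbClr val="000000"/></a:dk1>
            <a:lt1><a:srgbClr val="FFFFFF"/></a:lt1>
            <a:dk2><a:srgbClr val="1F4E79"/></a:dk2>
            <!-- a:lt2, a:accent1 to a:accent6, a:hlink, and a:folHlink omitted -->
        </a:clrScheme>  <!-- a:fontScheme and a:fmtScheme omitted -->
    </a:themeElements>
</a:theme>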
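
As an illustration of how the color palette can be read programmatically, the Python sketch below collects the named color slots from a theme file. It assumes the standard DrawingML layout, where a:clrScheme holds slots such as a:dk1, a:lt1, and a:accent1, each wrapping either a:srgbClr or a:sysClr. The function name and the sample colors are hypothetical.

```python
import xml.etree.ElementTree as ET

# DrawingML main namespace, in ElementTree's {uri}tag notation.
A = "{http://schemas.openxmlformats.org/drawingml/2006/main}"

SAMPLE_THEME_XML = """<a:theme
    xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
  <a:themeElements>
    <a:clrScheme name="Office">
      <a:dk1><a:sysClr val="windowText" lastClr="000000"/></a:dk1>
      <a:lt1><a:srgbClr val="FFFFFF"/></a:lt1>
      <a:accent1><a:srgbClr val="4472C4"/></a:accent1>
    </a:clrScheme>
  </a:themeElements>
</a:theme>"""


def theme_colors(theme_xml):
    """Map color slot names (dk1, lt1, accent1, ...) to hex RGB strings."""
    root = ET.fromstring(theme_xml)
    scheme = root.find(f"{A}themeElements/{A}clrScheme")
    colors = {}
    for slot in scheme:
        color = slot[0]  # the single child: <a:srgbClr> or <a:sysClr>
        if color.tag == A + "srgbClr":
            rgb = color.get("val")
        else:
            # System colors keep the last resolved RGB value in lastClr.
            rgb = color.get("lastClr")
        colors[slot.tag.replace(A, "")] = rgb
    return colors


if __name__ == "__main__":
    print(theme_colors(SAMPLE_THEME_XML))
```

A real theme defines twelve color slots; the sample lists only three to keep the example short.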

6. /ppt/media/ – Images and Media

In the /ppt/media/ folder, you’ll find the binary media files used in the presentation, such as images, videos, and audio clips. These files are referenced by the individual slide XML files to display content like:

  • Images: .jpg, .png, .gif, or other image formats.
  • Audio and Video: Multimedia content that is embedded in the presentation.
  • Embedded Objects: Preview images for charts, Excel data, or other embedded objects. The embedded files themselves are stored in /ppt/embeddings/, and charts in /ppt/charts/.

Each media file in this folder has a unique name (e.g., image1.png) and is referenced by the slides through relationship IDs (explained below).

7. /_rels/ – Relationships between Parts

The _rels folders define how different parts of the PowerPoint document are related to each other. There is one at the package root (/_rels/) and one next to each group of parts that needs it (for example, /ppt/_rels/ and /ppt/slides/_rels/). They contain XML files that establish relationships between parts, such as:

  • Slide Relationships: Linking a slide to its slide layout, media files, or external links.
  • Document Relationships: Linking the presentation to its slides, slide masters, theme, and other components within the package.

For example, /ppt/_rels/ contains a file named presentation.xml.rels, which outlines the relationships between the presentation.xml file and other parts of the presentation, such as slides, slide masters, and the theme. This allows the PowerPoint engine to understand how to stitch all the pieces together when rendering the presentation.
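
Here is a minimal Python sketch of how relationship files can be resolved. It relies on the Open Packaging Conventions rule that the relationships of a part live in a _rels folder next to it, in a file named after the part plus ".rels". The function name, the sample package, and the example.com link are hypothetical.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# Package relationships namespace, in ElementTree's {uri}tag notation.
REL = "{http://schemas.openxmlformats.org/package/2006/relationships}"
TYPE_BASE = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/"


def read_relationships(office_file, part_name):
    """Map relationship IDs of one part to (short type, resolved target)."""
    folder, name = posixpath.split(part_name)
    rels_name = posixpath.join(folder, "_rels", name + ".rels")
    with zipfile.ZipFile(office_file) as package:
        if rels_name not in package.namelist():
            return {}
        root = ET.fromstring(package.read(rels_name))
    relationships = {}
    for rel in root.iter(REL + "Relationship"):
        target = rel.get("Target")
        if rel.get("TargetMode") != "External":
            # Internal targets are relative to the folder of the source part.
            target = posixpath.normpath(posixpath.join(folder, target)).lstrip("/")
        short_type = rel.get("Type").rsplit("/", 1)[-1]
        relationships[rel.get("Id")] = (short_type, target)
    return relationships


def make_sample_pptx():
    """Build an in-memory package holding one slide .rels file (illustrative)."""
    rels_xml = (
        '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
        f'<Relationship Id="rId1" Type="{TYPE_BASE}slideLayout" '
        'Target="../slideLayouts/slideLayout1.xml"/>'
        f'<Relationship Id="rId2" Type="{TYPE_BASE}image" Target="../media/image1.png"/>'
        f'<Relationship Id="rId3" Type="{TYPE_BASE}hyperlink" '
        'Target="https://example.com/" TargetMode="External"/>'
        "</Relationships>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("ppt/slides/_rels/slide1.xml.rels", rels_xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    rels = read_relationships(make_sample_pptx(), "ppt/slides/slide1.xml")
    for rel_id, (rel_type, target) in rels.items():
        print(rel_id, rel_type, target)
```

The same function works for "ppt/presentation.xml", because the naming rule is identical for every part in the package.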

This diagram shows how the PowerPoint document is structured, how slides inherit from the Slide Master, and how they refer to themes, layouts, and other resources:

[Figure: visual of a multi-slide PowerPoint document]

How All These Files Work Together

All of these files (and folders) work together in a cohesive structure to define a PowerPoint presentation. The presentation.xml file acts as the master controller, referencing the individual slides stored in /ppt/slides/ and connecting them with the appropriate styles, themes, and media resources. The Slide Masters ensure consistency in design, and the media folder provides the necessary assets like images and videos.

The relationships stored in the _rels folders, such as /ppt/_rels/, are essential for the PowerPoint application (or any other tool working with the OOXML standard) to properly reconstruct the presentation by combining all of the resources correctly.
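
To tie the pieces together, the following Python sketch walks a package the way a reader application would: it takes the slide order from presentation.xml, resolves each relationship ID through presentation.xml.rels, and then reads the text of each slide. All helper names are hypothetical, and the in-memory sample is a stripped-down package that PowerPoint itself would not open.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

NS_P = "http://schemas.openxmlformats.org/presentationml/2006/main"
NS_A = "http://schemas.openxmlformats.org/drawingml/2006/main"
NS_R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
NS_REL = "http://schemas.openxmlformats.org/package/2006/relationships"


def slide_texts_in_order(pptx_file):
    """Return (slide part name, [text runs]) pairs in presentation order."""
    with zipfile.ZipFile(pptx_file) as package:
        rels = ET.fromstring(package.read("ppt/_rels/presentation.xml.rels"))
        targets = {
            rel.get("Id"): rel.get("Target")
            for rel in rels.iter("{%s}Relationship" % NS_REL)
        }
        presentation = ET.fromstring(package.read("ppt/presentation.xml"))
        result = []
        # The slide list in presentation.xml decides the order, not file names.
        for sld_id in presentation.iter("{%s}sldId" % NS_P):
            target = targets[sld_id.get("{%s}id" % NS_R)]
            part = posixpath.normpath(posixpath.join("ppt", target)).lstrip("/")
            slide = ET.fromstring(package.read(part))
            texts = [t.text or "" for t in slide.iter("{%s}t" % NS_A)]
            result.append((part, texts))
    return result


def make_sample_pptx():
    """Build a stripped-down in-memory package (illustrative only)."""
    def slide(text):
        return (
            '<p:sld xmlns:p="%s" xmlns:a="%s"><p:cSld><p:spTree><p:sp><p:txBody>'
            "<a:p><a:r><a:t>%s</a:t></a:r></a:p>"
            "</p:txBody></p:sp></p:spTree></p:cSld></p:sld>" % (NS_P, NS_A, text)
        )

    # slide2.xml is deliberately listed first to show that order comes from here.
    presentation = (
        '<p:presentation xmlns:p="%s" xmlns:r="%s"><p:sldIdLst>'
        '<p:sldId id="256" r:id="rId2"/><p:sldId id="257" r:id="rId1"/>'
        "</p:sldIdLst></p:presentation>" % (NS_P, NS_R)
    )
    rel_type = NS_R + "/slide"
    rels = (
        '<Relationships xmlns="%s">'
        '<Relationship Id="rId1" Type="%s" Target="slides/slide1.xml"/>'
        '<Relationship Id="rId2" Type="%s" Target="slides/slide2.xml"/>'
        "</Relationships>" % (NS_REL, rel_type, rel_type)
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("ppt/presentation.xml", presentation)
        package.writestr("ppt/_rels/presentation.xml.rels", rels)
        package.writestr("ppt/slides/slide1.xml", slide("Welcome"))
        package.writestr("ppt/slides/slide2.xml", slide("Agenda"))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for part, texts in slide_texts_in_order(make_sample_pptx()):
        print(part, texts)
```

In the sample, slide2.xml comes out first because presentation.xml lists it first, which is why a reader should never rely on the numbering in the file names.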

In the next part, I'll walk you through the technical nuances of some of these document types and the underlying components within them. I'll try to show how to create a simple PowerPoint document programmatically using OOXML.



