How to read and write ODF/ODS files (OpenDocument Spreadsheets)
Introduction
The OpenDocument Format (ODF) is an XML-based file format
for representing electronic documents such as spreadsheets, charts,
presentations and word processing documents. The standard was developed by the
OASIS (Organization for the Advancement of Structured Information Standards),
and it is free and open format.
The OpenDocument format is used in free software and in
proprietary software. Originally, format was implemented by the OpenOffice.org office suite and, with Office 2007 SP2, Microsoft also
supports ODF subset.
This article will explain basics of ODF format, and specifically
its implementation in spreadsheet applications (OpenOffice.org Calc and Microsoft
Office Excel 2007 SP2). Presented is a demo application which writes/reads
tabular data to/from .ods files. Application is written in C# using Visual
Studio 2010. Created .ods files can be opened using Excel 2007 SP2 or greater and
OpenOffice.org Calc.
ODF format
OpenDocument format
supports document representation:
-
As a single XML document.
-
As a collection of several subdocuments within a
package.
Office applications use second approach so we will explain
it detail.
Every ODF file is a collection of several subdocuments within
a package (ZIP file), each of which stores part of the complete document. Each
subdocument stores a particular aspect of the document. For example, one
subdocument contains the style information and another subdocument contains the
content of the document.
This approach has following benefits:
-
You don't need to process entire file in order
to extract specific data.
-
Images and multimedia are now encoded in native
format, not as text streams.
-
Files are smaller as a result of compression and
native multimedia storage.
There are four subdocuments in the package that contains
file's data:
-
content.xml
- Document content and automatic styles used in the content.
-
styles.xml
- Styles used in the document content and automatic styles used in the styles
themselves.
-
meta.xml
- Document meta information, such as the author or the time of the last save
action.
-
settings.xml
- Application-specific settings, such as the window size or printer
information.
Besides them, in package can be many other subdocuments like
document thumbnail, images, etc.
In order to read the data from an ODF file you need to:
1.
Open package as a ZIP archive.
2.
Find parts that contain data you want to read.
3.
Read parts you are interested in.
On the other side, if you want to create a new ODF file, you
need to:
1.
Create/get all necessary parts.
2.
Package everything into a ZIP file with
appropriate extension.
Spreadsheet Documents
Spreadsheet document files are the subset of ODF files.
Spreadsheet files have .ods file extensions.
The content (sheets) is stored in content.xml subdocument.
Picture 1: content.xml subdocument
As we can see on Picture 1, sheets are stored as xml
elements. They contain column and row definitions, rows contain cells and so
on… On picture is data from one specific document, but from this we can see
basic structure of content.xml file
(you can also download full ODF
specification).
Implementation
Our demo is Windows Presentation Foundation application (Picture 2), written
in C# using Visual Studio 2010.
Picture 2: Demo application
Application can:
-
Create new Spreadsheet document.
-
Read existing Spreadsheet document.
-
Write created Spreadsheet document.
Creating new document and underlying model of application
Internal, Spreadsheet document is stored as DataSet. Each sheet is represented with DataTable, sheet's row with DataRow, and sheet's column with DataColumn. So, to create a new document
we have to create new DataSet, with DataTables. Each DataTable has number of rows and columns that conforms to our
needs.
To show data from our DataSet
(and to allow editing that data) application dynamically creates tabs with DataGridViews (that are connected to our
DataTables).
Through interface, user can read, write, edit data and add
new rows to Spreadsheet document.
As application, basically, transforms Spreadsheet document to / from DataSet, it can also be used as a reference for Excel to DataSet export / import scenarios.
Zip component and XML parser
Although classes from System.IO.Packaging
namespace (.NET 3.0) provides way to read and write ZIP files, they require
different format of ZIP file. Because of that, our demo uses open source
component called DotNetZip.
Using ZIP component we can extract files, get subdocument,
replace (or add) subdocuments that we want and save that file as .ods file
(which is ZIP file).
For processing documents we have used XmlDocument because it offers easy way to reach part that we want.
Note that, if performance is crucial for you, you should use XmlTextReader and XmlTextWriter. That solution need more work (and code), but
provides better performance.
Reading Spreadsheet Document
To read document we follow these steps:
1.
Extracting .ods file.
2.
Getting content.xml
file (which contains sheets data).
3.
Creating XmlDocument
object from content.xml file.
4.
Creating DataSet
(that represent Spreadsheet file).
5.
With XmlDocument
we select "table:table" elements, and
then we create adequate DataTables.
6.
We parse child's of "table:table" element and fill DataTables
with those data.
7.
At the end, we return DataSet and show it in application's interface.
Although ODF specification provides way to specify default
row, column and cell style, implementations have nasty practice (that specially
applies for Excel) that they rather write sheet as sheet with maximum number of
columns and maximum number of rows, and then they write all cells with their
style. So you could see that your sheet has more than 1000 columns (1024 in
Calc and 16384 in Excel), and even more rows (and each rows contains the number
of cells that is equal to the number of columns), although you only have write
data to first few rows/columns.
ODF specification provides a way that you specify some element
(like column/row/cell) and then you specify the number of time it repeats. So
above behavior doesn't affect the size of the file, but that complicates our
implementation.
Because of that, we can't just read the number of columns
and add equal number of DataColumns
to DataTable (because of performance
issues). In this implementation, we rather read cells and, if they have data,
we first create rows/columns they belong to, and then we add those cells to the
DataTable. So, at the end, we
allocate only space that we need to.
Writing Spreadsheet Document
To write document we follow these steps:
1.
Extracting template.ods
file (.ods file that we use as template).
2.
Getting content.xml
file.
3.
Creating XmlDocument
object from content.xml file.
4.
Erasing all "table:table"
elements from the content.xml
file.
5.
Reading data from our DataSet and composing adequate "table:table"
elements.
6.
Adding "table:table"
elements to content.xml file.
7.
Zipping that file as new .ods file.
In this application, as template, we have use empty
document. But application can be easy modified to use some other template (so,
that you have preserved styles, etc).
Download links
Here you can download the latest version of the demo application
(together with the C# source code): Download OdsReadWrite
Alternative ways
As always in programming, there is more than one method to
achieve the same thing.
ODF files are just a collection of xml files, packed in zip
files so, any of the vast number of tools for handling zip files and XML data
can be used to handle OpenDocument.
As another option, you could use some third party Excel C# / VB.NET component
which has support for ODF format. This will probably cost you some money but
has advantage that usually more than one format (for example: GemBox.Spreadsheet
reads/writes
XLS, XLSX, CSV, HTML and ODS) are supported within the same API, so your
application will be able to target different file formats using the same code.
About the author
Josip Kremenic (Josip Kremenić) works as a developer at GemBox Software. He primarily works on
GemBox.Spreadsheet - an Excel .NET component
for reading and writing XLS, XLSX, CSV, HTML and ODS files. He got interested
in ODF as a part of his work.