Parsing XML using Powershell

PowerShell is an interactive prompt and scripting environment. This document shows how to parse XML using Powershell.

TL&DR: Powershell is a command prompt that uses Cmdlets. Cmdlets are used to parse XML. XML stands for eXtensible Markup Language. XML carries data by defining tags that store values. To validate XML you first define an XSD file. Download Powershell Community Extensions. Run Test-xml: Test-Xml -Path books.xml -SchemaPath books.xsd -Verbose. Use [xml] accelerator to import data to Powershell: [xml]$library = gc books.xml. Parse data and write to csv $library.catalog.book | ConvertTo-Csv -NoTypeInformation | Set-Content eOutput.csv | Write-Output "" | Add-Content eOutput.csv.

Powershell

This is mostly an introduction to using Cmdlets in Powershell. Cmdlets are commands that are used in the Windows PowerShell environment. The following code blocks are cherry picked for Parsing XML.

General commands

These commands will be familiar to most. No explanation here.

clear           # Clears screen
ls              # List child items
dir             # List child items
atom books.xml  # Open atom text editor

Version

$PSVersionTable.PSVersion
# For me this returns...
# Major  Minor  Build  Revision
# -----  -----  -----  --------
# 5      0      10586  117

Create new file

The examples given in this note uses an existing xml file so the creation of books.xml is not actually required. The purpose of showing how to create a new file here is to introduce the "New-item" Cmdlet and its implementation.

New-item -name books.xml -ItemType File

Searching content

get-item books.xml # print file details
get-content books.xml # print each line
Select-string "computer" # find

Pipes ("|")

Pipes let you string commands together so that the outcome of the first command is past to the next and so on.

get-content "results.csv" | Select-string "computer"

Aliases

Aliases provide shorthand commands for Cmdlets. You can create your own aliases but I don't cover how to do that here.

# LONG FORMAT
Get-content "results.csv" | Select-string "computer"

# ALIASED
gc "results.csv" | sls "computer

Type the following to see what aliases are available:

# List aliases
get-alias | format-table Name, Definition -autosize

Useful aliases and commands

gc     # Get-content
sls    # Select-string
select # Select-Object
ft     # Format-table
epcsv  # Export-Csv
%      # ForEach-Object
$_     # Placeholder (used in piping)
@      # Array operator

Using ForEach-Object (alias = %)

This is a for loop construct similar to other languages.

# (0..4) = 0, 1, 2, 3, 4
# $var = variable to print
# I am using Write-Host for testing
ForEach( $var in 0..4)
{
Write-Host $var
}

Use alias to modify numbers from 0 to 4.

# The following code can be read as:
# Send numbers 0, 1,2, 3, and 4 (0..4) one at a time to
# a loop (%) that will multiply each number ($_)
# by 10
(0..4) | % { $_ * 10 }
# Returns
0
10
20
30
40

Array operator

# This is an array - semicolon separated key value pairs
@{fname="Ray"; lname="Blick"}

# Parse the array using the ForEach alias ( % ) and placeholder ( $_ )
@{fname="Ray"; lname="Blick"} | % {$_ }

Name     Value 
fname    Ray
lname    Blick

XML

XML stands for eXtensible Markup Language and it contains information that is wrapped in tags. None of the tags (e.g. catalog or book) are predefined and the final structure of the XML document is up to the person creating it. XML was designed to carry data in a structured way. XML is stored in plain text format and easily shared. XML does not need to be indented, but it helps to make it human readable. Beware the whitespace introduced by indenting can have meaning to some applications. However, most important to our end objective is that whitespace does not influence parsing XML documents.

Books.xml

The sample XML used in this document is provided by Microsoft which contains a catalog of books. The following code block shows what the XML would look like if it only contained the first record.

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
</catalog>

Search xml using Powershell "select-string" Cmdlet

Books.xml contains more than one book. This example prints out every line in the XML that contains "book". Note that we are not parsing information, and simply searching the XML file as if it were a plain text file.

Select-String -path "*.xml" -pattern "book"
# books.xml:3:   <book id="bk101">
# books.xml:11:   </book>
# books.xml:12:   <book id="bk102">
# books.xml:21:   </book>
# books.xml:22:   <book id="bk103">
# books.xml:31:   </book>
# books.xml:32:   <book id="bk104">
# books.xml:42:   </book>
# books.xml:43:   <book id="bk105">
# books.xml:52:   </book>
# books.xml:53:   <book id="bk106">
# books.xml:61:   </book>
# books.xml:62:   <book id="bk107">
# books.xml:70:   </book>
# books.xml:71:   <book id="bk108">
# books.xml:79:   </book>
# books.xml:80:   <book id="bk109">
# books.xml:89:   </book>
# books.xml:90:   <book id="bk110">
# books.xml:98:   </book>
# books.xml:99:   <book id="bk111">
# books.xml:108:   </book>
# books.xml:109:   <book id="bk112">
# books.xml:119:   </book>

Get-Content XML

Use the [XML] type accelerator

[xml]$library = Get-Content -Path books.xml
$library
# xml           catalog
# ---           -------
# version="1.0" catalog

View XML content (dot notation)

$library.catalog.book
# ... snipped
# id           : bk112
# author       : Galos, Mike
# title        : Visual Studio 7: A Comprehensive Guide
# genre        : Computer
# price        : 49.95
# publish_date : 2001-04-16
# description  : Microsoft Visual Studio 7 is explored in depth,
#                     looking at how Visual Basic, Visual C++, C#,
#                     and ASP+ are
#                     integrated into a comprehensive development
#                     environment.

View XML content as formated table

$library.catalog.book | Format-Table -AutoSize
# id    author               title                                  genre           price publish_date description
# --    ------               -----                                  -----           ----- ------------ -----------
# bk101 Gambardella, Matthew XML Developer's Guide                  Computer        44.95 2000-10-01   An in-depth look at creating applications...
# bk102 Ralls, Kim           Midnight Rain                          Fantasy         5.95  2000-12-16   A former architect battles corporate zombies,...
# bk103 Corets, Eva          Maeve Ascendant                        Fantasy         5.95  2000-11-17   After the collapse of a nanotechnology...
# bk104 Corets, Eva          Oberon's Legacy                        Fantasy         5.95  2001-03-10   In post-apocalypse England, the mysterious...
# bk105 Corets, Eva          The Sundered Grail                     Fantasy         5.95  2001-09-10   The two daughters of Maeve, half-sisters,...
# bk106 Randall, Cynthia     Lover Birds                            Romance         4.95  2000-09-02   When Carla meets Paul at an ornithology...
# bk107 Thurman, Paula       Splish Splash                          Romance         4.95  2000-11-02   A deep sea diver finds true love twenty...
# bk108 Knorr, Stefan        Creepy Crawlies                        Horror          4.95  2000-12-06   An anthology of horror stories about roaches,...
# bk109 Kress, Peter         Paradox Lost                           Science Fiction 6.95  2000-11-02   After an inadvertant trip through a Heisenberg...
# bk110 O'Brien, Tim         Microsoft .NET: The Programming Bible  Computer        36.95 2000-12-09   Microsoft's .NET initiative is explored in...
# bk111 O'Brien, Tim         MSXML3: A Comprehensive Guide          Computer        36.95 2000-12-01   The Microsoft MSXML3 parser is covered in...
# bk112 Galos, Mike          Visual Studio 7: A Comprehensive Guide Computer        49.95 2001-04-16   Microsoft Visual Studio 7 is explored in depth,...

Select one column

$library.catalog.book | select title
# title
# -----
# XML Developer's Guide
# Midnight Rain
# Maeve Ascendant
# Oberon's Legacy
# The Sundered Grail
# Lover Birds
# Splish Splash
# Creepy Crawlies
# Paradox Lost
# Microsoft .NET: The Programming Bible
# MSXML3: A Comprehensive Guide
# Visual Studio 7: A Comprehensive Guide

Select first 5 titles

$library.catalog.book | select title | select -First 5
# title
# -----
# XML Developer's Guide
# Midnight Rain
# Maeve Ascendant
# Oberon's Legacy
# The Sundered Grail

Select last 5 titles

$library.catalog.book | select title | select -Last 5
# title
# -----
# Creepy Crawlies
# Paradox Lost
# Microsoft .NET: The Programming Bible
# MSXML3: A Comprehensive Guide
# Visual Studio 7: A Comprehensive Guide

List XML data by property name

$library.catalog.book | % {$_.Author} | Format-Table -AutoSize
# Gambardella, Matthew
# Ralls, Kim
# Corets, Eva
# Corets, Eva
# Corets, Eva
# Randall, Cynthia
# Thurman, Paula
# Knorr, Stefan
# Kress, Peter
# O'Brien, Tim
# O'Brien, Tim
# Galos, Mike

Unique XML properties

$library.catalog.book | % {$_.author} | unique
# or
$library.catalog.book.author | unique
# Corets, Eva
# Galos, Mike
# Gambardella, Matthew
# Knorr, Stefan
# Kress, Peter
# O'Brien, Tim
# Ralls, Kim
# Randall, Cynthia
# Thurman, Paula

Sort XML properties

$library.catalog.book | % {$_.author} | sort
# or
$library.catalog.book.author | sort
# Corets, Eva
# Corets, Eva
# Corets, Eva
# Galos, Mike
# Gambardella, Matthew
# Knorr, Stefan
# Kress, Peter
# O'Brien, Tim
# O'Brien, Tim
# Ralls, Kim
# Randall, Cynthia
# Thurman, Paula

Sort descending XML properties

$library.catalog.book | % {$_.author} | sort -descending
# or
$library.catalog.book.author | sort -descending
# Thurman, Paula
# Randall, Cynthia
# Ralls, Kim
# O'Brien, Tim
# O'Brien, Tim
# Kress, Peter
# Knorr, Stefan
# Gambardella, Matthew
# Galos, Mike
# Corets, Eva
# Corets, Eva
# Corets, Eva

Count XML properties

($library.catalog.book | % {$_.author}).count
# or
($library.catalog.book.author).count
# 12

Group and count

$library.catalog.book | group author | Select-Object @{Label="author"; Expression={$_.Name}},Count
# author               Count
# ------               -----
# Gambardella, Matthew     1
# Ralls, Kim               1
# Corets, Eva              3
# Randall, Cynthia         1
# Thurman, Paula           1
# Knorr, Stefan            1
# Kress, Peter             1
# O'Brien, Tim             2
# Galos, Mike              1

Exporting data to CSV

$library.catalog.book | group author | 
Select-Object @{Label="author"; Expression={$_.Name}},Count | 
ConvertTo-Csv -NoTypeInformation | Set-Content exampleOutput.csv | 
Write-Output "" | Add-Content exampleOutput.csv

XML Validation

The tags that wrap data in an XML is called its schema. If these tags are shuffled by accident or misplaced in the document then errors occur. Validation is the process of checking an XML file for errors.

XML Rules

The following rules are defined by W3School.

  • XML documents must have a root element

  • XML elements must have a closing tag

  • XML tags are case sensitive

  • XML elements must be properly nested

  • XML attribute values must be quoted

Schema definition file (XSD)

A schema definition file (XSD) outlines exactly what your XML should look like. You can use an online tool or define the file yourself. Then download PowerShell Community Extensions from http://pscx.codeplex.com/releases. The extensions includes a Test-xml Cmdlet.

<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="catalog">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="book" maxOccurs="unbounded" minOccurs="0">
          <xs:complexType>
            <xs:sequence>
              <xs:element type="xs:string" name="author"/>
              <xs:element type="xs:string" name="title"/>
              <xs:element type="xs:string" name="genre"/>
              <xs:element type="xs:float" name="price"/>
              <xs:element type="xs:date" name="publish_date"/>
              <xs:element type="xs:string" name="description"/>
            </xs:sequence>
            <xs:attribute type="xs:string" name="id" use="optional"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

Validating XML

Test-Xml -Path books.xml -SchemaPath books.xsd -Verbose
# True

Break XML closing tag (books_broken.xml)

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew
      <title>XML Developer's Guide</title>
    </author> <!-- moved closing tag below title -->
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
<!-- snipped -->

Validate schema:

Test-Xml -Path books_broken.xml -SchemaPath books.xsd -Verbose

VERBOSE: Error: The element 'author' cannot contain child element 'title'
because the parent element's content model is text only. Line 5, Position 8.
VERBOSE: Error: The element 'book' has invalid child element 'genre'. List
of possible elements expected: 'title'. Line 7, Position 8.

Missing open tag

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      XML Developer's Guide</title> <!-- Missing <title> -->
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
<!-- snipped -->

Validate schema:

Test-Xml -Path books_broken.xml -SchemaPath books.xsd -Verbose

VERBOSE: Error: The element 'book' cannot contain text. List of possible
elements expected: 'title'. Line 4, Position 44.
WARNING: books_broken.xml: The 'book' start tag on line 3 position 5 does
not match the end tag of 'title'. Line 5, position 30.

Incorrect datatype

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>test 2000-10-01</publish_date><!-- Added text -->
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
<!-- snipped -->

Validate schema:

Test-Xml -Path books_broken.xml -SchemaPath books.xsd -Verbose

VERBOSE: Error: The 'publish_date' element is invalid - The value 'test
2000-10-01' is invalid according to its datatype
'http://www.w3.org/2001/XMLSchema:date' - The string 'test 2000-10-01' is
not a valid Date value. Line 8, Position 38.

results matching ""

    No results matching ""