Parsing XML using Powershell
PowerShell is an interactive prompt and scripting environment. This document shows how to parse XML using Powershell.
TL&DR: Powershell is a command prompt that uses Cmdlets. Cmdlets are used to parse XML. XML stands for eXtensible Markup Language. XML carries data by defining tags that store values. To validate XML you first define an XSD file. Download Powershell Community Extensions. Run Test-xml:
Test-Xml -Path books.xml -SchemaPath books.xsd -Verbose
. Use [xml] accelerator to import data to Powershell:[xml]$library = gc books.xml
. Parse data and write to csv$library.catalog.book | ConvertTo-Csv -NoTypeInformation | Set-Content eOutput.csv | Write-Output "" | Add-Content eOutput.csv
.
Powershell
This is mostly an introduction to using Cmdlets in Powershell. Cmdlets are commands that are used in the Windows PowerShell environment. The following code blocks are cherry picked for Parsing XML.
General commands
These commands will be familiar to most. No explanation here.
clear # Clears screen
ls # List child items
dir # List child items
atom books.xml # Open atom text editor
Version
$PSVersionTable.PSVersion
# For me this returns...
# Major Minor Build Revision
# ----- ----- ----- --------
# 5 0 10586 117
Create new file
The examples given in this note uses an existing xml file so the creation of books.xml is not actually required. The purpose of showing how to create a new file here is to introduce the "New-item" Cmdlet and its implementation.
New-item -name books.xml -ItemType File
Searching content
get-item books.xml # print file details
get-content books.xml # print each line
Select-string "computer" # find
Pipes ("|")
Pipes let you string commands together so that the outcome of the first command is past to the next and so on.
get-content "results.csv" | Select-string "computer"
Aliases
Aliases provide shorthand commands for Cmdlets. You can create your own aliases but I don't cover how to do that here.
# LONG FORMAT
Get-content "results.csv" | Select-string "computer"
# ALIASED
gc "results.csv" | sls "computer
Type the following to see what aliases are available:
# List aliases
get-alias | format-table Name, Definition -autosize
Useful aliases and commands
gc # Get-content
sls # Select-string
select # Select-Object
ft # Format-table
epcsv # Export-Csv
% # ForEach-Object
$_ # Placeholder (used in piping)
@ # Array operator
Using ForEach-Object (alias = %)
This is a for loop construct similar to other languages.
# (0..4) = 0, 1, 2, 3, 4
# $var = variable to print
# I am using Write-Host for testing
ForEach( $var in 0..4)
{
Write-Host $var
}
Use alias to modify numbers from 0 to 4.
# The following code can be read as:
# Send numbers 0, 1,2, 3, and 4 (0..4) one at a time to
# a loop (%) that will multiply each number ($_)
# by 10
(0..4) | % { $_ * 10 }
# Returns
0
10
20
30
40
Array operator
# This is an array - semicolon separated key value pairs
@{fname="Ray"; lname="Blick"}
# Parse the array using the ForEach alias ( % ) and placeholder ( $_ )
@{fname="Ray"; lname="Blick"} | % {$_ }
Name Value
fname Ray
lname Blick
XML
XML stands for eXtensible Markup Language and it contains information that is wrapped in tags. None of the tags (e.g. catalog or book) are predefined and the final structure of the XML document is up to the person creating it. XML was designed to carry data in a structured way. XML is stored in plain text format and easily shared. XML does not need to be indented, but it helps to make it human readable. Beware the whitespace introduced by indenting can have meaning to some applications. However, most important to our end objective is that whitespace does not influence parsing XML documents.
Books.xml
The sample XML used in this document is provided by Microsoft which contains a catalog of books. The following code block shows what the XML would look like if it only contained the first record.
<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
</catalog>
Search xml using Powershell "select-string" Cmdlet
Books.xml contains more than one book. This example prints out every line in the XML that contains "book". Note that we are not parsing information, and simply searching the XML file as if it were a plain text file.
Select-String -path "*.xml" -pattern "book"
# books.xml:3: <book id="bk101">
# books.xml:11: </book>
# books.xml:12: <book id="bk102">
# books.xml:21: </book>
# books.xml:22: <book id="bk103">
# books.xml:31: </book>
# books.xml:32: <book id="bk104">
# books.xml:42: </book>
# books.xml:43: <book id="bk105">
# books.xml:52: </book>
# books.xml:53: <book id="bk106">
# books.xml:61: </book>
# books.xml:62: <book id="bk107">
# books.xml:70: </book>
# books.xml:71: <book id="bk108">
# books.xml:79: </book>
# books.xml:80: <book id="bk109">
# books.xml:89: </book>
# books.xml:90: <book id="bk110">
# books.xml:98: </book>
# books.xml:99: <book id="bk111">
# books.xml:108: </book>
# books.xml:109: <book id="bk112">
# books.xml:119: </book>
Get-Content XML
Use the [XML] type accelerator
[xml]$library = Get-Content -Path books.xml
$library
# xml catalog
# --- -------
# version="1.0" catalog
View XML content (dot notation)
$library.catalog.book
# ... snipped
# id : bk112
# author : Galos, Mike
# title : Visual Studio 7: A Comprehensive Guide
# genre : Computer
# price : 49.95
# publish_date : 2001-04-16
# description : Microsoft Visual Studio 7 is explored in depth,
# looking at how Visual Basic, Visual C++, C#,
# and ASP+ are
# integrated into a comprehensive development
# environment.
View XML content as formated table
$library.catalog.book | Format-Table -AutoSize
# id author title genre price publish_date description
# -- ------ ----- ----- ----- ------------ -----------
# bk101 Gambardella, Matthew XML Developer's Guide Computer 44.95 2000-10-01 An in-depth look at creating applications...
# bk102 Ralls, Kim Midnight Rain Fantasy 5.95 2000-12-16 A former architect battles corporate zombies,...
# bk103 Corets, Eva Maeve Ascendant Fantasy 5.95 2000-11-17 After the collapse of a nanotechnology...
# bk104 Corets, Eva Oberon's Legacy Fantasy 5.95 2001-03-10 In post-apocalypse England, the mysterious...
# bk105 Corets, Eva The Sundered Grail Fantasy 5.95 2001-09-10 The two daughters of Maeve, half-sisters,...
# bk106 Randall, Cynthia Lover Birds Romance 4.95 2000-09-02 When Carla meets Paul at an ornithology...
# bk107 Thurman, Paula Splish Splash Romance 4.95 2000-11-02 A deep sea diver finds true love twenty...
# bk108 Knorr, Stefan Creepy Crawlies Horror 4.95 2000-12-06 An anthology of horror stories about roaches,...
# bk109 Kress, Peter Paradox Lost Science Fiction 6.95 2000-11-02 After an inadvertant trip through a Heisenberg...
# bk110 O'Brien, Tim Microsoft .NET: The Programming Bible Computer 36.95 2000-12-09 Microsoft's .NET initiative is explored in...
# bk111 O'Brien, Tim MSXML3: A Comprehensive Guide Computer 36.95 2000-12-01 The Microsoft MSXML3 parser is covered in...
# bk112 Galos, Mike Visual Studio 7: A Comprehensive Guide Computer 49.95 2001-04-16 Microsoft Visual Studio 7 is explored in depth,...
Select one column
$library.catalog.book | select title
# title
# -----
# XML Developer's Guide
# Midnight Rain
# Maeve Ascendant
# Oberon's Legacy
# The Sundered Grail
# Lover Birds
# Splish Splash
# Creepy Crawlies
# Paradox Lost
# Microsoft .NET: The Programming Bible
# MSXML3: A Comprehensive Guide
# Visual Studio 7: A Comprehensive Guide
Select first 5 titles
$library.catalog.book | select title | select -First 5
# title
# -----
# XML Developer's Guide
# Midnight Rain
# Maeve Ascendant
# Oberon's Legacy
# The Sundered Grail
Select last 5 titles
$library.catalog.book | select title | select -Last 5
# title
# -----
# Creepy Crawlies
# Paradox Lost
# Microsoft .NET: The Programming Bible
# MSXML3: A Comprehensive Guide
# Visual Studio 7: A Comprehensive Guide
List XML data by property name
$library.catalog.book | % {$_.Author} | Format-Table -AutoSize
# Gambardella, Matthew
# Ralls, Kim
# Corets, Eva
# Corets, Eva
# Corets, Eva
# Randall, Cynthia
# Thurman, Paula
# Knorr, Stefan
# Kress, Peter
# O'Brien, Tim
# O'Brien, Tim
# Galos, Mike
Unique XML properties
$library.catalog.book | % {$_.author} | unique
# or
$library.catalog.book.author | unique
# Corets, Eva
# Galos, Mike
# Gambardella, Matthew
# Knorr, Stefan
# Kress, Peter
# O'Brien, Tim
# Ralls, Kim
# Randall, Cynthia
# Thurman, Paula
Sort XML properties
$library.catalog.book | % {$_.author} | sort
# or
$library.catalog.book.author | sort
# Corets, Eva
# Corets, Eva
# Corets, Eva
# Galos, Mike
# Gambardella, Matthew
# Knorr, Stefan
# Kress, Peter
# O'Brien, Tim
# O'Brien, Tim
# Ralls, Kim
# Randall, Cynthia
# Thurman, Paula
Sort descending XML properties
$library.catalog.book | % {$_.author} | sort -descending
# or
$library.catalog.book.author | sort -descending
# Thurman, Paula
# Randall, Cynthia
# Ralls, Kim
# O'Brien, Tim
# O'Brien, Tim
# Kress, Peter
# Knorr, Stefan
# Gambardella, Matthew
# Galos, Mike
# Corets, Eva
# Corets, Eva
# Corets, Eva
Count XML properties
($library.catalog.book | % {$_.author}).count
# or
($library.catalog.book.author).count
# 12
Group and count
$library.catalog.book | group author | Select-Object @{Label="author"; Expression={$_.Name}},Count
# author Count
# ------ -----
# Gambardella, Matthew 1
# Ralls, Kim 1
# Corets, Eva 3
# Randall, Cynthia 1
# Thurman, Paula 1
# Knorr, Stefan 1
# Kress, Peter 1
# O'Brien, Tim 2
# Galos, Mike 1
Exporting data to CSV
$library.catalog.book | group author |
Select-Object @{Label="author"; Expression={$_.Name}},Count |
ConvertTo-Csv -NoTypeInformation | Set-Content exampleOutput.csv |
Write-Output "" | Add-Content exampleOutput.csv
XML Validation
The tags that wrap data in an XML is called its schema. If these tags are shuffled by accident or misplaced in the document then errors occur. Validation is the process of checking an XML file for errors.
XML Rules
The following rules are defined by W3School.
XML documents must have a root element
XML elements must have a closing tag
XML tags are case sensitive
XML elements must be properly nested
XML attribute values must be quoted
Schema definition file (XSD)
A schema definition file (XSD) outlines exactly what your XML should look like. You can use an online tool or define the file yourself. Then download PowerShell Community Extensions from http://pscx.codeplex.com/releases. The extensions includes a Test-xml Cmdlet.
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="catalog">
<xs:complexType>
<xs:sequence>
<xs:element name="book" maxOccurs="unbounded" minOccurs="0">
<xs:complexType>
<xs:sequence>
<xs:element type="xs:string" name="author"/>
<xs:element type="xs:string" name="title"/>
<xs:element type="xs:string" name="genre"/>
<xs:element type="xs:float" name="price"/>
<xs:element type="xs:date" name="publish_date"/>
<xs:element type="xs:string" name="description"/>
</xs:sequence>
<xs:attribute type="xs:string" name="id" use="optional"/>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
Validating XML
Test-Xml -Path books.xml -SchemaPath books.xsd -Verbose
# True
Break XML closing tag (books_broken.xml)
<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew
<title>XML Developer's Guide</title>
</author> <!-- moved closing tag below title -->
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
<!-- snipped -->
Validate schema:
Test-Xml -Path books_broken.xml -SchemaPath books.xsd -Verbose
VERBOSE: Error: The element 'author' cannot contain child element 'title'
because the parent element's content model is text only. Line 5, Position 8.
VERBOSE: Error: The element 'book' has invalid child element 'genre'. List
of possible elements expected: 'title'. Line 7, Position 8.
Missing open tag
<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
XML Developer's Guide</title> <!-- Missing <title> -->
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
<!-- snipped -->
Validate schema:
Test-Xml -Path books_broken.xml -SchemaPath books.xsd -Verbose
VERBOSE: Error: The element 'book' cannot contain text. List of possible
elements expected: 'title'. Line 4, Position 44.
WARNING: books_broken.xml: The 'book' start tag on line 3 position 5 does
not match the end tag of 'title'. Line 5, position 30.
Incorrect datatype
<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>test 2000-10-01</publish_date><!-- Added text -->
<description>An in-depth look at creating applications
with XML.</description>
</book>
<!-- snipped -->
Validate schema:
Test-Xml -Path books_broken.xml -SchemaPath books.xsd -Verbose
VERBOSE: Error: The 'publish_date' element is invalid - The value 'test
2000-10-01' is invalid according to its datatype
'http://www.w3.org/2001/XMLSchema:date' - The string 'test 2000-10-01' is
not a valid Date value. Line 8, Position 38.