I have a couple of XML files that contain lots of duplicate entries, such as these.
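using System.Linq;
using System.Xml.Linq;

XDocument xDoc = XDocument.Parse(xmlString);
// Within each <annotation>, group the <image> elements by their "location"
// attribute and remove every element after the first in each group.
xDoc.Root.Elements("annotation")
    .SelectMany(s => s.Elements("image")
        .GroupBy(g => g.Attribute("location").Value)
        .SelectMany(m => m.Skip(1)))
    .Remove();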
If your duplicates are always in this form, then you could do this with a bit of XSLT to remove duplicate nodes. The XSLT for this is:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="image[@location = preceding-sibling::image/@location]"/>
</xsl:stylesheet>
If this is something that happens frequently, it might be worth loading that stylesheet into an XslCompiledTransform instance once and reusing it.
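As a rough sketch of what that could look like: the code below loads the stylesheet into an XslCompiledTransform once and then applies it to a document. The sample XML (an annotations root with annotation and image children) and the variable names are assumptions made up for illustration, not taken from the question.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Xsl;

// The stylesheet from this answer, kept in a string for illustration.
const string Stylesheet = @"
<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:template match=""node()|@*"">
    <xsl:copy>
      <xsl:apply-templates select=""node()|@*""/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match=""image[@location = preceding-sibling::image/@location]""/>
</xsl:stylesheet>";

// Hypothetical sample input; element and attribute names are assumed.
const string Sample = @"
<annotations>
  <annotation>
    <image location=""a.png""/>
    <image location=""a.png""/>
    <image location=""b.png""/>
  </annotation>
</annotations>";

// Load (and compile) the stylesheet once; the instance can be reused afterwards.
var transform = new XslCompiledTransform();
using (var xsltReader = XmlReader.Create(new StringReader(Stylesheet)))
{
    transform.Load(xsltReader);
}

// Apply the transform; this is the part you would repeat per document.
var output = new StringWriter();
using (var input = XmlReader.Create(new StringReader(Sample)))
using (var writer = XmlWriter.Create(output, transform.OutputSettings))
{
    transform.Transform(input, writer);
}

XDocument result = XDocument.Parse(output.ToString());
Console.WriteLine(result.Descendants("image").Count());
```

With the sample above, the second a.png entry is dropped and the other two image elements are copied through unchanged.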
Or you can simply get a list of all duplicate nodes using this XPath:
/annotations/annotation/image[@location = preceding-sibling::image/@location]
and remove them from their parent.
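A minimal sketch of that approach with LINQ to XML, assuming the root element really is annotations as in the XPath above. The sample document and variable names are invented for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

// Hypothetical sample input; element and attribute names are assumed.
const string Sample = @"
<annotations>
  <annotation>
    <image location=""a.png""/>
    <image location=""a.png""/>
    <image location=""b.png""/>
  </annotation>
</annotations>";

XDocument doc = XDocument.Parse(Sample);

// Select every <image> whose location matches an earlier sibling's location.
// ToList() snapshots the matches before the tree is modified.
var duplicates = doc
    .XPathSelectElements(
        "/annotations/annotation/image[@location = preceding-sibling::image/@location]")
    .ToList();

// Detach each duplicate from its parent.
foreach (XElement duplicate in duplicates)
{
    duplicate.Remove();
}

Console.WriteLine(doc.Descendants("image").Count());
```

XPathSelectElements is an extension method from System.Xml.XPath, so that using directive is needed even though the document itself is an XDocument.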
There are a couple of things that you could do here. In addition to the other answers so far, note that Distinct() has an overload that takes an IEqualityComparer<T>. You could use something like this ProjectionEqualityComparer:
var images = xdoc.Descendants("image")
    .Distinct(ProjectionEqualityComparer<XElement>.Create(
        xe => xe.Attributes("location").First().Value));
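This gives you one "image" element per distinct location attribute value (the first one encountered for each). Note that it only selects elements; it does not remove the duplicates from the document.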
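The answer does not show the comparer itself, so here is a hypothetical minimal stand-in with the same call shape, just to make the snippet self-contained. It is a sketch under that assumption and not necessarily how the referenced ProjectionEqualityComparer is implemented; the sample XML is also invented.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical sample input; element and attribute names are assumed.
const string Sample = @"
<annotations>
  <annotation>
    <image location=""a.png""/>
    <image location=""a.png""/>
    <image location=""b.png""/>
  </annotation>
</annotations>";

XDocument xdoc = XDocument.Parse(Sample);

// Keep the first <image> seen for each distinct "location" value.
var images = xdoc.Descendants("image")
    .Distinct(ProjectionEqualityComparer<XElement>.Create(
        xe => xe.Attributes("location").First().Value))
    .ToList();

Console.WriteLine(images.Count);

// Hypothetical stand-in: compares two items by a key projected from each.
static class ProjectionEqualityComparer<TSource>
{
    public static IEqualityComparer<TSource> Create<TKey>(Func<TSource, TKey> projection)
        => new Impl<TKey>(projection);

    private sealed class Impl<TKey> : IEqualityComparer<TSource>
    {
        private readonly Func<TSource, TKey> _projection;

        public Impl(Func<TSource, TKey> projection) => _projection = projection;

        // Two items are equal when their projected keys are equal.
        public bool Equals(TSource x, TSource y) =>
            EqualityComparer<TKey>.Default.Equals(_projection(x), _projection(y));

        // Hash on the projected key so equal items land in the same bucket.
        public int GetHashCode(TSource obj) =>
            EqualityComparer<TKey>.Default.GetHashCode(_projection(obj));
    }
}
```

Because Descendants("image") spans the whole document, this treats two images in different annotation elements as duplicates too, which differs from the per-parent behaviour of the preceding-sibling XPath.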