View on GitHub

Web

XLinq to Web

Icon HTML => XML + CSS with XLinq 🤘

Version Downloads License

Read HTML as XML and query it with CSS over XLinq (or HtmlAgilityPack killer 😉).

No need to learn an entirely new object model for a page 🤘. This makes it the most productive and lean library for web scraping using the latest and greatest that .NET can offer.

Usage

using System.Xml.Linq;
using Devlooped.Web;

XDocument page = HtmlDocument.Load("page.html")
IEnumerable<XElement> elements = page.CssSelectElements("div.menuitem");

XElement title = page.CssSelectElement("html head meta[name=title]");

By default, HtmlDocument.Load will skip non-content elements script and style, turn all element names into lower case, and ignore all XML namespaces (useful when loading XHTML, for example) for easier querying. These options as well as granular whitespace handling can be configured using the overloads receiving an HtmlReaderSettings.

The underlying parsing is performed by the amazing SgmlReader library by Microsoft’s Chris Lovett.

In addition, the following extension methods make it easier to work with XML documents where you want to query with CSS or XPath without having to deal with XML namespaces:

using System.Xml;
using System.Xml.Linq;
using Devlooped.Web;

var doc = XDocument.Load("doc.xml")
// Will remove all xmlns declarations, and allow querying elements 
// as if none had namespaces, returns the root element
XElement nons = doc.RemoveNamespaces();

// Alternatively, you can also ignore at the XmlReader level
using var reader = XmlReader.Create("doc.xml").IgnoreNamespaces();
doc = XDocument.Load(reader);

// Finally, you can also skip elements at the reader level
using var reader = XmlReader.Create("doc.xml").SkipElements("foo", "bar");
doc = XDocument.Load(reader);

CSS

At the moment, supports the following CSS selector features:

And all combinators

Non-CSS features:

Dogfooding

CI Version Build

We also produce CI packages from branches and pull requests so you can dogfood builds as quickly as they are produced.

The CI feed is https://pkg.kzu.io/index.json.

The versioning scheme for packages is:

Sponsors

Clarius Org Kirill Osenkov MFB Technologies, Inc. Stephen Shaw Torutek DRIVE.NET, Inc. Ashley Medway Keith Pickford Thomas Bolon Kori Francis Toni Wenzel Giorgi Dalakishvili Uno Platform Dan Siegel Reuben Swartz Jacob Foshee Ix Technologies B.V. David JENNI Jonathan Oleg Kyrylchuk Charley Wu Jakob Tikjøb Andersen Seann Alexander Tino Hager Mark Seemann Ken Bonny Simon Cropp agileworks-eu sorahex Zheyu Shen Vezel ChilliCream 4OTC

Sponsor this project  

Learn more about GitHub Sponsors