In 2005 I attended a lecture at the British Computer Society by Prof. Ian Horrocks about the Semantic Web. If you’ve not heard the term before, the Semantic Web refers to the meaning-based categorisation of websites and related information using a common standard, so that websites and the information they contain can be found easily. It uses standards like RDF (to describe data) and OWL (to define ontologies) to express meaning. The lecture focused mainly on scientific endeavours like medicine, genomics and earth sciences, and it all seemed a bit detached from the web at the time. Indeed the concept, proposed by World Wide Web creator Tim Berners-Lee (himself a physicist) in 1999, started off with scientific data.
Whilst I was quite fascinated by the prospect of being able to navigate the web as if it were a single database, the enormity of what would be required seemed to me unachievable. Five years later, however, it is starting to look closer, thanks to various cultural shifts as the internet has matured. Diverse aspects of the web are beginning to converge.
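To make the "single database" idea a little more concrete, here is a minimal sketch in Python. It assumes data is stored as subject, predicate, object statements, which is the basic shape RDF uses. The `ex:` names and the `match` helper are made up for illustration and are not part of any real vocabulary or library.

```python
# Each statement is a (subject, predicate, object) triple.
# All of the "ex:" identifiers are hypothetical examples.
triples = [
    ("ex:aspirin", "ex:treats", "ex:headache"),
    ("ex:aspirin", "rdf:type", "ex:Drug"),
    ("ex:ibuprofen", "ex:treats", "ex:headache"),
    ("ex:ibuprofen", "rdf:type", "ex:Drug"),
    ("ex:headache", "rdf:type", "ex:Symptom"),
]


def match(pattern, data):
    """Return every triple that fits the pattern; None acts as a wildcard."""
    return [
        triple
        for triple in data
        if all(p is None or p == value for p, value in zip(pattern, triple))
    ]


# "Which things treat a headache?" asked as a query rather than a keyword search.
drugs_for_headache = [
    subject for subject, _, _ in match((None, "ex:treats", "ex:headache"), triples)
]
print(drugs_for_headache)  # ['ex:aspirin', 'ex:ibuprofen']
```

The point is only that once meaning is stored in a common shape, the same query works regardless of which website published each statement.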
A step closer to the Semantic Web
With the growth of social media websites, vast amounts of user data and relationships have been defined and catalogued. Somewhat perversely, the dominance of a few large companies (Google, Facebook, LinkedIn) has started to make some of the required changes possible, because they store and categorise vast amounts of data, meaning and relationships. The driving force at the moment is obviously commercial: advertising revenue for search engines and social media websites.
Despite the dominance of a few large companies, there’s still enough competition in the marketplace to ensure that users still hold the power. Whilst search engines are commercially motivated, their overriding requirement is to provide useful information to users in order to keep their market share.
Alongside this, the W3C and Berners-Lee have continued their work, encouraging governments around the world (who have started to acknowledge the potential of the internet as a source of authoritative information) to provide support.
Tim Berners-Lee on Web 2.0
In an interview at the Web 2.0 Summit in 2009, Berners-Lee explains his belief that, in the same way email started off confined to AOL and expanded to become a company-independent protocol across the internet, the same may happen with social networking sites. Many social media sites have made their data semi-public by releasing an API, and it’s possible to connect between them. Google’s agreements with the likes of Facebook and Twitter to include their results in the SERPs, along with its social search, are another step towards this idea.
RDFa and HTML 5
Google (and to a lesser extent Yahoo) is encouraging website owners to start using RDFa (a way to express RDF data within XHTML) to categorise certain data items, the reward being that they’ll get more coverage in the SERPs and users will receive more accurate, useful data quickly. With both companies showing rich snippets for various data items including events, people and reviews, there’s a good reason to start defining more of your data using RDFa now. The search engines themselves are policing these sources of information, deciding on a case-by-case basis which authority websites are allowed to have rich snippets created, as explained by Matt Cutts.
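As a rough illustration of what this markup looks like and how a machine might read it, the sketch below embeds a small XHTML fragment and pulls out its RDFa `property` values using Python's standard `html.parser`. The `typeof` and `property` attributes are genuine RDFa attributes, but the `ex:` vocabulary, the review content and the `RDFaProperties` class are assumptions made up for this example. It is not how any search engine actually processes pages, and a real RDFa parser handles far more (nesting, `about`, `content`, prefixes and so on).

```python
from html.parser import HTMLParser

# Hypothetical review markup; the "ex:" vocabulary is invented for illustration.
SNIPPET = """
<div xmlns:ex="http://example.org/vocab#" typeof="ex:Review">
  <span property="ex:itemreviewed">Blue Cafe</span>
  reviewed by <span property="ex:reviewer">Jane</span>,
  rating <span property="ex:rating">4</span> out of 5.
</div>
"""


class RDFaProperties(HTMLParser):
    """Collect the text of each element carrying a property attribute.

    Deliberately simplified: it assumes annotated elements are not nested.
    """

    def __init__(self):
        super().__init__()
        self.found = {}
        self._current = None  # property name of the element being read, if any

    def handle_starttag(self, tag, attrs):
        prop = dict(attrs).get("property")
        if prop:
            self._current = prop
            self.found[prop] = ""

    def handle_data(self, data):
        if self._current:
            self.found[self._current] += data

    def handle_endtag(self, tag):
        self._current = None


parser = RDFaProperties()
parser.feed(SNIPPET)
print(parser.found)
# {'ex:itemreviewed': 'Blue Cafe', 'ex:reviewer': 'Jane', 'ex:rating': '4'}
```

The visible text of the page stays the same for human readers; the extra attributes simply tell software which piece of text is the item, the reviewer and the rating.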
There is no question that SEO will change as RDFa is used more and supported by search engines. It’s very likely that with HTML 5 support, browsers will start to utilise this formatting too. Indeed Google has recently released a web-based mobile voice application built in HTML 5, which is now supported by the latest iPhone and Palm webOS, as an alternative to creating an app. YouTube has recently made a step in this direction by using the HTML 5 video tag instead of Flash Player in Chrome and Safari.
It’s all very well allowing users to categorise their data more accurately, but there’s always going to be a problem with dishonesty, especially when there’s commercial gain to be made. The question is how we can rely on data categorised by people who could lie to further their own interests. The answer, according to Semantic Web principles, lies in trust relationships that authenticate information as authoritative, using web-service-based trust agents.
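In very rough terms, the idea can be sketched like this in Python. Everything here is an assumption for illustration: the `TRUSTED_SOURCES` set stands in for whatever a trust agent would vouch for, `is_trusted` stands in for a call to such an agent, and the URLs and statements are invented.

```python
# Hypothetical set of sources that a trust agent has vouched for.
TRUSTED_SOURCES = {"https://gov.example/health", "https://journal.example"}

# Hypothetical claims, each tagged with the site that published it.
claims = [
    {"source": "https://gov.example/health",
     "statement": "ex:aspirin ex:treats ex:headache"},
    {"source": "https://spam.example",
     "statement": "ex:miracle-pill ex:treats ex:everything"},
]


def is_trusted(claim, trusted=TRUSTED_SOURCES):
    """Stand-in for asking a trust agent whether the publisher is authoritative."""
    return claim["source"] in trusted


# Only statements from vouched-for sources are accepted as authoritative.
accepted = [claim["statement"] for claim in claims if is_trusted(claim)]
print(accepted)  # ['ex:aspirin ex:treats ex:headache']
```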
For me the main problem with this approach is that setting up these trust relationships places restrictions on who can create content. Who decides who is authoritative? Does this remove the recently gained power from the masses? There’s a balance needed between giving everyone the power to publish and ensuring publicly available data is factually correct.
Government involvement (or intervention) and the convergence of standards will no doubt improve the reliability and findability of data. Ultimately, however, whilst social media opens the web up to more user-generated content, commercial forces will continue to dictate that the best search engines have the most power and control the content we see, regardless of how well defined it is, even though they are not necessarily in the best position to decide who is authoritative.