SMW Introduction
The Semantic Web is a concept that allows massive, reliable reuse of digital information by people and computers.
Why do we want data sharing and re-use
The simple reason is that it will make our society more authentically inclusive, representative and efficient while creating new levels of participation. Data collected for public institutions is invaluable when creating information on features (built and natural resources and infrastructure), spending and partnerships. These institutions have critical short- and long-term funding problems, and it is impossible and undesirable for them to address every need. Nonprofit and social economy organizations exist, and can sometimes obtain public data, but 'silo' institutions cannot practically serve the number of niches that could be addressed. Opening data up with intentional access techniques and policies leads to more participation: more ability for individual citizens to understand, organize and analyze, including exchanging with professionals. Just as hobbyist astronomers can be key to important breakthroughs, there is tremendous potential in public data if it is shared. Public data should be considered as important for re-use as public infrastructure.
Because gathering and organizing data, functionality, and interested parties on the Internet is easy and cheap, an as yet unnamed new sector of public participation is developing. It includes loosely affiliated individuals and groups such as http://www.visiblegovernment.ca, http://opengovdata.ru, http://www.mysociety.org and http://open.org.nz. This sector includes individuals, physical communities, and communities of interest; it includes real experts, dedicated hobbyists and the casually interested. They try to solve problems and better understand their world, but they need real data. These groups can work reciprocally with our existing institutions to efficiently fill gaps and build our systems. They can crowdsource large tasks, develop and maintain specialized tools, and build infrastructure services.
Why information sharing isn't common today
The Internet is a major part of the explosion in computer use in our age. One of the most remarkable things about the Web is that it is based on HTML, a text format that is highly accessible to people and computers. Every Web page uses the same syntax to indicate what should be displayed, and they all use the same retrieval mechanisms. This was a remarkable and unexpected breakthrough in communications, but the way companies jumped in to make the Web more attractive (Flash-y) and commerce-friendly did little to enable background information exchange. Today's focus on the forthcoming HTML 5, with its built-in facilities for multimedia and interaction, helps mitigate these problems.
There are two main technical requirements for making digital information reusable and available: well-known access mechanisms, and descriptions of how the data is organized that are detailed enough for reliable re-use. Efforts over the years have struggled with complexity and standardization, with major initiatives interfering with each other for technical or market reasons. However, because rich information re-use has proven valuable (for example, in banking applications and partnership programs), many practical ad hoc, de facto and standard methods exist.
Another major concern is the incentive to share information. Today it's common for nonprofit organizations to hoard their information, creating "proprietary databases" they can use to pitch to granting agencies. Ignoring standards also allows efforts to move ahead on their own terms, without having to fit their systems into larger systems, which could slow them down. A further factor is insecurity: an organization may have a perfectly useful database whose implementation does not compare well to the best technical efforts.
Trust is another issue. Many people do not think it's appropriate to share "government data," "hospital data," and so on. Yet within these monolithic descriptions, there are vast swaths of data that do not endanger individuals.
Another factor holding things back is how we use computers today: for the most part, like a typewriter. Not many people embed data from spreadsheets into their email, use automatic facilities for events and contacts, share to-do tasks, and so on. Documents and communications are one-offs, out of date the moment they're sent, and nothing is explicit in them. A semantic approach to computer data will help change this. Data will be more consistent, and when it comes to important statements we should be able to expect more. Increasing digital literacy is an issue here. Services like Facebook and Twitter introduce participation and embedded data and lead the way, along with the popular idea of 'infotainment.' Mapquest pushed ahead with interactive maps, which are information about the public environment, and today these services are better designed and more available than facilities produced by the government. In response to automated, worldwide spam and fraud on the Internet, cultural and technical defenses are emerging that are suitable for mass participation.
The Internet has been mainstream for 15 years, nearly a generation, with new and experienced users, programmers, researchers and others freely using the most advanced systems available around the world. Innovation is amplified by international access and competition. We're starting to see real breakthroughs in Semantic Web type applications. With unlimited room for improvement from building on data rather than hoarding it, and with recognition of the value of a truly participatory society, efforts to withhold public data will have a stunting effect.
Drawbacks and missteps
AI boondoggle
value of efforts despite grand schemes - Dr Tony Shannon, on the OpenEHR mailing list, writes:
If.... if I was to wait for an entirely top-down semantically interoperable solution to my healthcare systems needs then I agree that could be like awaiting a Tower of Babel.
On the other hand, if we have agreed that...
- healthcare systems needs to change
- information management systems are key to improvements
- an international health IT platform to openly share clinically useful components would be a good (if disruptive) thing
- open standards (+/- open source solutions) are needed for that platform
...then *any* effort to evolve healthcare solutions using archetypes from the bottom up, appears to me to be a move in the right direction.
avoid grand schemes, exploit the many key advantages
Approaches to Semantic Web applications
Mining
There are essentially two types of SemWeb applications: mining and intentional semantic development. One mining technique is "scraping," which parses presumably reliable HTML pages. Many citizen projects use this technique to extract public data from recalcitrant government sources, for example, They Work for You. Mashups are related: sites like Housing Maps combine data from disparate sources into one useful interface. However, scraping is easily foiled when a page's low-level structure is obfuscated or changed, intentionally or not.
Another mining approach involves scraping human-oriented text. Open Calais is an infrastructure example of this; Health Base is an end-user application. These sites use patterns in human text to try to derive statements. This technique is easily foiled, leading to incorrect observations. Mining can be used as a way to import non-semantic sites, but occasionally misinterpreted data and unclear re-use policies hamper these efforts.
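As a rough illustration of scraping, the following Python sketch pulls rows out of an HTML table using only the standard library. The page fragment, the member names and the `scrape_rows` helper are hypothetical, made up for this example; they are not taken from any of the projects named above.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # completed rows
        self._row = None        # cells of the row currently being read
        self._in_cell = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:       # header rows contain no <td>, so they are skipped
                self.rows.append(self._row)
            self._row = None


def scrape_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical page fragment; real pages are rarely this tidy.
PAGE = """
<table>
  <tr><th>Member</th><th>Votes attended</th></tr>
  <tr><td>A. Example</td><td>212</td></tr>
  <tr><td>B. Sample</td><td>187</td></tr>
</table>
"""

rows = scrape_rows(PAGE)
```

The scraper depends entirely on the cells being `td` elements. If the publisher switched them to `div` elements, `scrape_rows` would return an empty list without raising any error, which is the fragility described above.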
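To see why pattern-based text mining is easily foiled, consider this toy Python sketch. It is only an assumption-laden illustration with a single hand-written pattern and invented sentences; it does not represent how Open Calais or Health Base work internally.

```python
import re

# Toy pattern: "<Capitalized word> is the capital of <Capitalized word>".
CAPITAL_OF = re.compile(r"\b([A-Z][a-z]+) is the capital of ([A-Z][a-z]+)")


def extract_capitals(text):
    """Return the (city, country) pairs the pattern finds, right or wrong."""
    return CAPITAL_OF.findall(text)


# A plain statement yields a correct pair.
good = extract_capitals("Ottawa is the capital of Canada.")

# The same pattern ignores the surrounding context, so a sentence that
# reports a common mistake yields an incorrect "statement".
bad = extract_capitals(
    "Many people wrongly think Sydney is the capital of Australia."
)
```

Here `good` holds the pair (Ottawa, Canada), while `bad` holds (Sydney, Australia), a statement the source text explicitly called wrong.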
Intentional markup
Intentional semantic development involves explicit markup of text items. Most HTML documents today contain only text and links. Semantically marked up documents have explicit annotations about data objects, indicating them as entities such as people, places, dates, and so on. Relations (links) have explicit meanings.
With FOAF and related link annotations such as XFN's rel="me", we can place "me" links on our home page that point to other representations of ourselves. We can indicate links to friends, business associates, and organizations. It quickly becomes apparent that this enables decentralized Facebook-like sites, where individuals can publish their information wherever they like, using whatever licenses they like, and sites like Facebook can provide their own views of these webs of data, referring to embedded licenses expressed in ccREL.
Standard RSS and Atom syndicated feeds are also gaining rich data, including geolocation, which allows third-party sites to create views based on distributed data.
Using RDFa and Microformats, annotations are added to the elements of regular HTML to give them semantic meaning. A person's information can be marked up with hCard, allowing you to "right click" on a web page to add that person to your address book. Similar formats exist for locations and events.
Google, Yahoo and others use these formats to make their results more reliable. Without them, information is guessed from the overall content of a page. So if you searched for "frames," looking for picture frames, you would be likely to find a page that referred to "frames" in its navigation. RDFa and Microformats allow more reliable markup of subjects. This lets meta directories embed reviews from any cooperating site rather than trying to do everything themselves. Because these reviews link back to the originating site, it's a "win win win" situation for the meta directory, the originating site, and the end user, with richer, less biased results once a critical mass is reached.
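As a sketch of how a third-party site might build a view from such a feed, the Python below reads GeoRSS-Simple points out of an Atom document with the standard library. The feed content and the `located_entries` helper are hypothetical; only the Atom and GeoRSS namespace URIs are real.

```python
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "georss": "http://www.georss.org/georss",
}

# Hypothetical feed; the second entry carries no location.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:georss="http://www.georss.org/georss">
  <title>Hypothetical community events</title>
  <entry>
    <title>Park clean-up</title>
    <georss:point>45.42 -75.69</georss:point>
  </entry>
  <entry>
    <title>Council meeting</title>
  </entry>
</feed>
"""


def located_entries(feed_xml):
    """Return (title, latitude, longitude) for each entry with a georss:point."""
    root = ET.fromstring(feed_xml)
    found = []
    for entry in root.findall("atom:entry", NS):
        point = entry.findtext("georss:point", default=None, namespaces=NS)
        if point is None:
            continue  # entry has no location, so it cannot be mapped
        # GeoRSS-Simple stores a point as "latitude longitude".
        lat, lon = (float(value) for value in point.split())
        title = entry.findtext("atom:title", default="", namespaces=NS)
        found.append((title, lat, lon))
    return found


events = located_entries(FEED)
```

A map view could plot each tuple in `events` directly, without the publisher and the map site ever coordinating beyond the shared feed format.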
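The Python sketch below shows, under simplifying assumptions, how a tool could pull contact details out of hCard markup. The person, the HTML fragment and the `extract_hcards` helper are hypothetical, and only a handful of hCard properties (`fn`, `org`, `email`, `tel`) are handled; a real consumer would follow the full hCard parsing rules.

```python
from html.parser import HTMLParser

HCARD_PROPERTIES = {"fn", "org", "email", "tel"}
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag


class HCardParser(HTMLParser):
    """Collect a few hCard properties from elements inside class="vcard"."""

    def __init__(self):
        super().__init__()
        self.cards = []
        self._depth = 0          # nesting depth inside the current vcard
        self._prop = None        # property whose text is being collected
        self._prop_depth = 0     # depth at which that property element opened

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self._depth:
            self._depth += 1
            hit = classes & HCARD_PROPERTIES
            if hit and self._prop is None:
                self._prop = min(hit)
                self._prop_depth = self._depth
                self.cards[-1].setdefault(self._prop, "")
        elif "vcard" in classes:
            self._depth = 1
            self.cards.append({})

    def handle_data(self, data):
        if self._prop is not None:
            self.cards[-1][self._prop] += data

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        if self._prop is not None and self._depth == self._prop_depth:
            self._prop = None
        self._depth -= 1


def extract_hcards(html):
    """Return one dict of property values per vcard found in the HTML."""
    parser = HCardParser()
    parser.feed(html)
    parser.close()
    return parser.cards


# Hypothetical page fragment.
PAGE = """
<div class="vcard">
  <span class="fn">Jane Example</span>,
  <span class="org">Example Cooperative</span>
  <a class="email" href="mailto:jane@example.org">jane@example.org</a>
</div>
"""

cards = extract_hcards(PAGE)
```

Unlike a scraper, this code does not care which tags or layout the page uses; it relies only on the class names, which is what makes the markup "intentional."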
The heavyweight options are systems such as RDF and Topic Maps. They provide a complex interlinked way to describe arbitrary data. Today they are only used for specific projects, but as their use grows we can expect the web to become more interlinked, allowing an endless assemblage of information using the best references.
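At its core, RDF describes data as subject, predicate, object statements ("triples"). The Python sketch below models that idea with plain tuples and a wildcard match. It is a simplification that assumes every value is a string and does not distinguish URIs from literals as real RDF does. The people and example.org addresses are hypothetical; the property names come from the FOAF vocabulary.

```python
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical resources; only the FOAF property names are real terms.
ALICE = "http://example.org/people/alice#me"
BOB = "http://example.org/people/bob#me"

TRIPLES = {
    (ALICE, FOAF + "name", "Alice"),
    (ALICE, FOAF + "knows", BOB),
    (BOB, FOAF + "name", "Bob"),
    (BOB, FOAF + "homepage", "http://example.org/people/bob/"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


def names_of_acquaintances(triples, person):
    """Follow foaf:knows links, then look up each acquaintance's foaf:name."""
    names = []
    for _, _, friend in match(triples, person, FOAF + "knows"):
        for _, _, name in match(triples, friend, FOAF + "name"):
            names.append(name)
    return names


known = names_of_acquaintances(TRIPLES, ALICE)
```

Because every statement has the same shape, triples published on different sites can be merged into one set and queried together, which is the interlinking described above.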
One way to 'intentionally' create semantic data is Semantic Mediawiki.
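In Semantic MediaWiki, an author annotates a fact in the page text by writing a link with a property name, in the form [[property::value]], optionally followed by a pipe and display text. The Python sketch below extracts such annotations with a regular expression. It is an illustration only, not how Semantic MediaWiki itself parses pages, and the property names and page text are hypothetical.

```python
import re

# Matches [[Property::Value]] and [[Property::Value|display text]].
# Plain links such as [[Gatineau]] have no "::" and are not matched.
ANNOTATION = re.compile(r"\[\[([^\[\]|:]+)::([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def extract_annotations(wikitext):
    """Return (property, value) pairs found in the wikitext."""
    return [(prop.strip(), value.strip())
            for prop, value in ANNOTATION.findall(wikitext)]


# Hypothetical page text with made-up property names.
PAGE_TEXT = (
    "Ottawa is the capital of [[Is capital of::Canada]] and sits on the "
    "[[Located on::Ottawa River|river of the same name]]. "
    "See also [[Gatineau]]."
)

annotations = extract_annotations(PAGE_TEXT)
```

Each extracted pair, together with the page it appears on, forms a statement such as "Ottawa, Is capital of, Canada," so ordinary wiki editing produces structured data as a side effect.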