SMW Introduction: Difference between revisions

The Semantic Web is a concept that allows massive, reliable reuse of data.  


One of the most remarkable things about the Web is that it is based on HTML, a text format that is highly accessible to humans ''and'' computers. Every Web page uses the same syntax to indicate what should be displayed, and they all use the same retrieval mechanisms. This was a remarkable and unexpected breakthrough in communications, but the way companies jumped in to make the Web more attractive and commerce-friendly did little to make the exchange of data easier.


There are two main technical requirements to make data reusable and available: well-known access mechanisms usable by any organization, and schemas/ontologies, that is, descriptions of how the data is organized and detailed for reliable re-use. Efforts over the years have struggled with complexity and standardization, with major initiatives interfering with each other for technical reasons (e.g. Microformats vs RDFa) or while trying to dominate in the market. Settling these details can also be a tremendous effort, particularly for the schemas and ontologies.


Another major concern is the incentive to share information. Today it's common for non-profit organizations to hoard their information and create "proprietary databases" they can use to pitch to granting agencies. Another factor is that ignoring standards allows efforts to move ahead on their own terms, without having to fit their systems into larger systems, which could slow them down. A further factor is insecurity: an organization may have a perfectly useful database, but its implementation may not compare well to the best technical efforts.


Yet the Internet has been mainstream for 15 years, nearly a generation, with new and experienced users, programmers, researchers and others freely using the most advanced systems available around the world. We're starting to see real breakthroughs in Semantic Web type applications. Given the unlimited room for improvement that comes from building on data rather than hoarding it, and the recognition of the value of a truly participatory society, efforts to withhold public data are now the blockers.


== Why do we want data sharing and re-use ==


The simple reason is that it will make our society more authentically inclusive and efficient. Data collected by the government (and other public institutions) is invaluable for building a realistic picture of facets such as features (built and natural infrastructure), spending and partnerships. Today governments complain about short- and long-term funding problems, which lead to problems providing services. It is also impossible, and undesirable, for the government to address every purpose. Non-profit and social economy organizations exist, and can sometimes obtain public data. Opening data up through intentional access policies leads to more participation and gives individual citizens more ability to understand, organize and analyze it, including exchanging with professionals, much like [http://techastronomy.com/article.asp?articleid=58065&7-Great-Discoveries-by-Amateur-Astronomers hobbyist astronomers can be key to important breakthroughs].


Based on the ease and minimal cost of gathering and organizing data, functionality and interested parties on the Internet, an as yet unnamed new sector of public participation is developing, including loosely affiliated individuals and groups such as http://www.visiblegovernment.ca, http://opengovdata.ru, http://www.mysociety.org and http://open.org.nz. This sector spans individuals, physical communities and communities of interest, and includes real experts, dedicated hobbyists and the casually interested. They try to solve problems and better understand their world, but they need real data. These groups can work reciprocally with our existing institutions to efficiently fill gaps and build our systems. They can crowdsource large tasks, develop and maintain specialized tools and build reliable infrastructure services. The cost is making public data re-usable at the institutional level.
 
Another factor holding things back is how we use computers today: for the most part, like a typewriter. Not many people embed data from spreadsheets into their email, use automatic facilities for events and contacts, share to-do tasks, and so on. Documents and communications are one-offs, out of date the moment they're sent, and nothing in them is explicit. A semantic approach to computer data will help change this. Data will be more consistent, and when it comes to important statements we should be able to expect more. Increasing [http://en.wikipedia.org/wiki/Digital_literacy digital literacy] is an issue here.
 
Computer front ends and people's habits will need to change to accommodate this. We can expect to see new usage patterns emerge, just as people learned to use cut and paste. Many previous approaches will hold things back. Too many web design firms create sites like it's 1995 (or emphasize Flash-y presentations that can't even be used by many people); too many people can't remember their passwords; too many people focus on just the newest developments, forgetting that all the others are steadily brewing around the world; and too many organizations make excuses for not pursuing a way that builds on the capacity for fascination and involvement with information that's today called 'infotainment.'


== Approaches to Semantic Web applications ==
There are essentially two types of SemWeb applications: mining and intentional semantic development. One mining technique is "scraping", which parses presumably reliable HTML pages. Many citizen projects use this technique to extract public data from recalcitrant government sources, for example [http://www.theyworkforyou.com They Work for You]. Mashups are related: sites like [http://www.housingmaps.com Housing Maps] combine data from disparate sources into one useful interface. However, scraping can be easily foiled by obfuscating low-level structure, intentionally or not.
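
As a rough illustration of scraping and of how fragile it is, here is a minimal sketch using only Python's standard <code>html.parser</code> module. The page fragments, the member names and the <code>scrape_rows</code> helper are all hypothetical; this is not how any of the projects named above actually work.

```python
from html.parser import HTMLParser


class CellScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # cells of the row being read, or None
        self._in_cell = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self._in_cell = False
            if self._row:       # skip header rows, which hold no <td> cells
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def scrape_rows(html):
    """Hypothetical helper: return the table rows found in an HTML string."""
    scraper = CellScraper()
    scraper.feed(html)
    scraper.close()
    return scraper.rows


# Hypothetical page fragment published as a plain HTML table.
PAGE = """
<table>
  <tr><th>Member</th><th>Riding</th></tr>
  <tr><td>A. Example</td><td>North</td></tr>
  <tr><td>B. Sample</td><td>South</td></tr>
</table>
"""

# The same kind of data after a redesign that replaces the table with <div>s.
REDESIGNED = '<div class="row"><div>A. Example</div><div>North</div></div>'

print(scrape_rows(PAGE))        # the two data rows are recovered
print(scrape_rows(REDESIGNED))  # nothing is recovered: the scraper silently fails
```

The scraper depends entirely on the <code>tr</code>/<code>td</code> structure, so a purely cosmetic redesign is enough to break it, whether or not that was the publisher's intent.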


Another mining approach involves scraping human-oriented text. [http://www.opencalais.com Open Calais] is an infrastructure example of this; [http://healthbase.netbase.com Health Base] is an end-user application. These sites use patterns in human text to try to derive statements. This technique is easily foiled, leading to incorrect observations. Mining can be used as a way to import non-semantic sites, but occasionally misinterpreted data and unclear re-use policies hamper these efforts.
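
To show how pattern-based extraction can go wrong, here is a toy sketch using Python's standard <code>re</code> module. The pattern, the <code>capital_of</code> predicate name and the <code>extract_statements</code> helper are assumptions made up for this illustration; real services such as the ones named above are far more sophisticated, and nothing here reflects their actual methods.

```python
import re

# Toy pattern (an assumption for illustration): "<Name> is the capital of <Name>".
CAPITAL = re.compile(r"\b([A-Z][a-z]+) is the capital of ([A-Z][a-z]+)")


def extract_statements(text):
    """Return (subject, predicate, object) triples for every pattern match."""
    return [(city, "capital_of", country)
            for city, country in CAPITAL.findall(text)]


# A plain factual sentence yields a sensible statement.
print(extract_statements("Ottawa is the capital of Canada."))

# The same pattern also fires inside a sentence that reports a mistaken belief,
# so an incorrect statement is derived.
print(extract_statements(
    "Many people wrongly think Toronto is the capital of Canada."))
```

The second call derives a statement the sentence never asserted, because the pattern sees the words but not the context around them.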


=== Intentional markup ===