GATE track 1 session
A full week of learning [http://en.wikipedia.org/wiki/General_Architecture_for_Text_Engineering GATE]: text mining, information extraction and language processing, plus talks. [http://gate.ac.uk/wiki/TrainingCourseAug2010/ Session wiki]


[[File:GATE_screenshot.png|900px|GATE developer screenshot]]


GATE is written in Java and is very Java-centric. This makes it portable, fast, and heavyweight. A programming library is available. It's 14 years old and has many users and contributors.


= Using GATE developer =


* GATE Developer processes Language Resources (documents, grouped into a Corpus) using Processing Resources. Results are typically saved to a serialized Datastore.
* ANNIE, VG (verb group) processors.
* Saving with preserved formatting embeds annotation tags in the original HTML or XML.
** GATE's graph-based (node/offset) XML and preserved formatting (original XML/HTML) have different strengths


= Information Extraction =


* IR (information retrieval) - retrieve documents
* IE (information extraction) - retrieve structured data


* Knowledge Engineering - rule based
* Learning Systems - statistical


Old Bailey IE project - 17th century English (online)


* POS - assigned in Token (noun, verb, etc)
* Gazetteer - gotcha: the initialization parameter listsURL has to be set before it's loaded. Must also "save and reinitialize."
* The gazetteer creates Lookups, then the transducer creates named entities
* Then the orthomatcher (spelling features in common) coreference associates those
 
* Annotation Key sets and annotation comparing
** Need to list the Key set in the setsToKeep parameter of Document Reset for any pre-annotated texts
 
== Evaluation / Metrics ==
 
* Evaluation metric - machine output scored mathematically against human-annotated text
* Scoring - performance measures for each annotation type
 
* Precision = correct / (correct + spurious)
* Recall = correct / (correct + missing)
* F-measure is the harmonic mean of precision and recall
* F = 2⋅(precision⋅recall) / (precision + recall)
* GATE supports average, strict, lenient
 
* Result types - Correct, missing, spurious, partially correct (overlapped)
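The scoring above can be sketched in a few lines of Python. This is only an illustration, not GATE code: the names <code>Counts</code> and <code>scores</code> are made up for this page, and it assumes that partially correct annotations count as wrong under strict, as right under lenient, and that average is the mean of the strict and lenient figures.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical container for the four result types."""
    correct: int
    partial: int
    missing: int
    spurious: int


def _safe_div(numerator, denominator):
    # Avoid ZeroDivisionError when there are no annotations at all.
    return numerator / denominator if denominator else 0.0


def scores(c, mode="strict"):
    """Return (precision, recall, f_measure) for one annotation type."""
    if mode == "average":
        # Assumption: average is the mean of the strict and lenient scores.
        strict = scores(c, "strict")
        lenient = scores(c, "lenient")
        return tuple((s + l) / 2 for s, l in zip(strict, lenient))

    # Assumption: partial matches are wrong when strict, right when lenient.
    weight = {"strict": 0.0, "lenient": 1.0}[mode]
    matched = c.correct + weight * c.partial

    precision = _safe_div(matched, c.correct + c.partial + c.spurious)
    recall = _safe_div(matched, c.correct + c.partial + c.missing)
    f_measure = _safe_div(2 * precision * recall, precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    example = Counts(correct=6, partial=2, missing=2, spurious=2)
    for mode in ("strict", "lenient", "average"):
        print(mode, scores(example, mode))
```

With no partially correct annotations all three modes give the same numbers, which matches the simple formulas in the list above.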
 
* Tools > Annotations Diff - comparing human vs machine annotation
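To make the result types concrete, here is a rough Python sketch of how a human (key) set and a machine (response) set might be compared by offsets. It is not how GATE's Annotation Diff is implemented; <code>annotation_diff</code> and the (start, end, type) tuples are assumptions made for illustration, and "partially correct" is taken to mean same type with overlapping offsets.

```python
def overlaps(a, b):
    """True if two (start, end, type) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def annotation_diff(key, response):
    """Count correct, partial, missing and spurious annotations.

    key and response are lists of (start, end, type) tuples:
    key is the human annotation, response the machine output.
    """
    counts = {"correct": 0, "partial": 0, "missing": 0, "spurious": 0}
    unmatched_response = list(response)
    unmatched_key = []

    # Pass 1: identical offsets and type count as correct.
    for k in key:
        if k in unmatched_response:
            unmatched_response.remove(k)
            counts["correct"] += 1
        else:
            unmatched_key.append(k)

    # Pass 2: same type with overlapping offsets counts as partially correct;
    # key annotations with no counterpart are missing.
    for k in unmatched_key:
        hit = next(
            (r for r in unmatched_response if r[2] == k[2] and overlaps(k, r)),
            None,
        )
        if hit is not None:
            unmatched_response.remove(hit)
            counts["partial"] += 1
        else:
            counts["missing"] += 1

    # Whatever the machine produced that matched nothing is spurious.
    counts["spurious"] = len(unmatched_response)
    return counts


if __name__ == "__main__":
    key = [(0, 5, "Person"), (10, 20, "Location"), (30, 35, "Person")]
    response = [(0, 5, "Person"), (12, 20, "Location"), (40, 45, "Person")]
    print(annotation_diff(key, response))
```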
 
* Corpus > Corpus quality assurance - compare by type
* (B has to be the generated set)
 
* Annotation set transfer (in tools) - transfer between docs in pipeline
** useful for eg HTML that has boilerplate
 
 
== To investigate ==
 
* markupAware for HTML/XML (keeps tags in editor)
* AnnotationStack
* Advanced Options
 
= JAPE =
 
* Rules based on tokens and lookups
 
Phase: MatchingStyles
Input: Lookup
Options: control = appelt
Rule: Test1
(
({Lookup.majorType == location})?
{Lookup.majorType == loc_key}
):match
-->
:match.Location = {rule=Test1}
 
Copying features: :match.Location = { type = :match.Lookup.minorType}
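As a way of reading the Test1 rule, here is a small Python sketch of what it appears to do: walk the Lookup annotations in order, take an optional location Lookup followed by a loc_key Lookup, and create a Location over the whole match. This is a simplified illustration, not the JAPE engine. The function <code>apply_test1</code> and the dict layout are invented here, Lookups are assumed to be sorted and non-overlapping, and the appelt style is approximated by preferring the longer (two-Lookup) match.

```python
def apply_test1(lookups):
    """Return Location annotations for a list of Lookup dicts.

    Each Lookup is a dict with 'start', 'end' and 'majorType',
    sorted by offset (only Lookups are visible, as in 'Input: Lookup').
    """
    locations = []
    i = 0
    while i < len(lookups):
        first = lookups[i]
        following = lookups[i + 1] if i + 1 < len(lookups) else None

        if (
            first["majorType"] == "location"
            and following is not None
            and following["majorType"] == "loc_key"
        ):
            # Longest match: optional location plus the required loc_key.
            locations.append(
                {"start": first["start"], "end": following["end"], "rule": "Test1"}
            )
            i += 2
        elif first["majorType"] == "loc_key":
            # The optional part is absent; the loc_key matches on its own.
            locations.append(
                {"start": first["start"], "end": first["end"], "rule": "Test1"}
            )
            i += 1
        else:
            # A location with no loc_key after it does not match the rule.
            i += 1
    return locations


if __name__ == "__main__":
    lookups = [
        {"start": 0, "end": 6, "majorType": "location"},
        {"start": 7, "end": 11, "majorType": "loc_key"},
        {"start": 20, "end": 25, "majorType": "loc_key"},
        {"start": 30, "end": 36, "majorType": "location"},
    ]
    print(apply_test1(lookups))
```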
 
== To review, gotchas ==
 
* Rule types: first takes only the first match, excludes compound
** a? b for "a b" will match "a b"
* multiplexor transducers
* multi-constraint statements
* macros
* Reusing newly created annotations has to happen in a separate rule
 
=== Matching types ===
 
[[File:gate-matching.png|800px|Matching styles for JAPE]]
 
= To follow up =
 
* WebSphinx crawler CREOLE plugin
* [http://www.semanticsoftware.info/semantic-assistants-architecture GATE NLP web services]
 
= Other notes =
 
== Lucene data store and ANNIC ==
 
* Use <null> for default set
* Go to Datastore for queries
** eg {Person}({Token})+{Money}
* Useful for debugging JAPE and results
 
[[File:GATE-lucene-person-money.png|800px]]
 
= Demos =
 
* Mímir for querying large volumes of data (uses MG4J)
* Translating parts of speech between languages using Compound editor and Alignment editor
* Predicate extractor (MultiPaX)
** Mixed results at best
* OwlExporter
** NLP ontology
 
= Conclusions =
 
While it can do a lot out of the box and benefits from development time and breadth of connectivity, it needs usability testing to be useful to more than patient specialists. A lot of things are non-obvious or too domain-specific, and with a bit of work could be more broadly useful. Interaction could include a lot more immediate, useful and interesting-looking displays. A web-based version could have these features. However, the team seems somewhat ambivalent about development. :)
 
Looking forward to learning about programming using GATE libraries.
 
{{Blikied|Aug 30, 2010}}
 
[[Category:SemWeb]]