GATE track 1 session

A full week of learning GATE text mining / information extraction / language processing, plus talks. [http://gate.ac.uk/wiki/TrainingCourseAug2010/ Session wiki]
[[GPS Location::{{ #geocode: Montreal, Canada}}]]


[[File:GATE_screenshot.png|900px|GATE developer screenshot]]
{{ #ask: [[GPS Location::+]]
|?GPS Location
|format=exhibit
|views=map
|facets=tags
}}


GATE is written in Java and is very Java-centric, which makes it portable, fast, and heavyweight. A programming library is also available. It's 14 years old and has many users and contributors.
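A minimal sketch of what the library side (GATE Embedded) looks like, assuming a local GATE install and the ANNIE_with_defaults.gapp application that ships with it; the paths and URL below are placeholders, and the exact API can vary between GATE versions:

<syntaxhighlight lang="java">
import gate.*;
import gate.util.persistence.PersistenceManager;
import java.io.File;
import java.net.URL;

public class AnnieSketch {
  public static void main(String[] args) throws Exception {
    // Point GATE at its install directory and initialise the library
    Gate.setGateHome(new File("/path/to/gate"));   // placeholder path
    Gate.init();

    // Load the ready-made ANNIE pipeline shipped with GATE
    CorpusController annie = (CorpusController) PersistenceManager.loadObjectFromFile(
        new File(new File(Gate.getPluginsHome(), "ANNIE"), "ANNIE_with_defaults.gapp"));

    // Build a one-document corpus and run the pipeline over it
    Corpus corpus = Factory.newCorpus("training corpus");
    Document doc = Factory.newDocument(new URL("http://gate.ac.uk/"));  // any URL or file
    corpus.add(doc);
    annie.setCorpus(corpus);
    annie.execute();

    // Named entities land in the document's default annotation set
    System.out.println(doc.getAnnotations().get("Person"));
  }
}
</syntaxhighlight>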
= Using GATE Developer =
 
* GATE Developer is used to process sets of Language Resources in a Corpus using Processing Resources. Results are typically saved to a serialized Datastore (see the sketch after this list).
* ANNIE, VG (verb group) processors.
* "Preserve formatting" embeds tags in the HTML or XML output.
** Different strengths in using GATE's graph-based (node/offset) XML vs. preserved formatting (original XML/HTML)
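A rough sketch of the datastore step in GATE Embedded, assuming a SerialDataStore backed by a local directory (the path is a placeholder); the two-argument adopt() matches the API of this era and differs in later GATE versions:

<syntaxhighlight lang="java">
import gate.*;
import java.io.File;

public class DatastoreSketch {
  public static void main(String[] args) throws Exception {
    Gate.setGateHome(new File("/path/to/gate"));   // placeholder path
    Gate.init();

    // Create a SerialDataStore backed by a directory on disk
    DataStore ds = Factory.createDataStore(
        "gate.persist.SerialDataStore",
        new File("/path/to/datastore").toURI().toURL().toString());

    // Normally this would be a document a pipeline has already annotated
    Document doc = Factory.newDocument("Some processed text.");

    // Adopt the document into the datastore, then sync to write it out
    Document persistentDoc = (Document) ds.adopt(doc, null);
    ds.sync(persistentDoc);
    ds.close();
  }
}
</syntaxhighlight>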
 
= Information Extraction =
 
* IR (information retrieval) - retrieve documents
* IE (information extraction) - retrieve structured data
 
* Knowledge Engineering - rule based
* Learning Systems - statistical
 
Old Bailey IE project - historical English (Old Bailey Online)
 
* POS (part of speech) - assigned as a feature on Token annotations (noun, verb, etc.); see the sketch below
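A small sketch of reading the tags back in code, assuming doc is a Document already run through a POS-tagging pipeline; ANNIE's tagger puts the tag in the Token's "category" feature:

<syntaxhighlight lang="java">
import gate.Annotation;
import gate.Document;

public class PosSketch {
  // Print word/POS pairs from a document a POS tagger has processed
  static void printPos(Document doc) {
    for (Annotation token : gate.Utils.inDocumentOrder(doc.getAnnotations().get("Token"))) {
      String word = gate.Utils.stringFor(doc, token);            // the token's text
      String pos = (String) token.getFeatures().get("category"); // e.g. NN, VBZ
      System.out.println(word + "\t" + pos);
    }
  }
}
</syntaxhighlight>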
 
* Gazetteer - gotcha: the initialization parameter listsURL has to be set before it's loaded; changing it requires "save and reinitialize."
* The Gazetteer creates Lookup annotations, then the transducer creates named entities
* Then the OrthoMatcher (spelling features in common) associates those as coreferent (see the sketch after this list)
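A sketch of wiring those PRs by hand in GATE Embedded instead of loading the packaged ANNIE application; the class names are the ANNIE defaults, the lists.def path is a placeholder, and a real pipeline would also need the tokeniser, sentence splitter and POS tagger in front of these:

<syntaxhighlight lang="java">
import gate.*;
import gate.creole.SerialAnalyserController;
import java.io.File;
import java.net.URL;

public class GazetteerPipelineSketch {
  public static void main(String[] args) throws Exception {
    Gate.setGateHome(new File("/path/to/gate"));   // placeholder path
    Gate.init();
    // Register the ANNIE plugin so its resource classes are available
    Gate.getCreoleRegister().registerDirectories(
        new File(Gate.getPluginsHome(), "ANNIE").toURI().toURL());

    // listsURL is an *init-time* parameter: set it at creation time
    // (changing it later in GATE Developer means "save and reinitialize")
    FeatureMap gazParams = Factory.newFeatureMap();
    gazParams.put("listsURL", new URL("file:/path/to/gazetteer/lists.def"));  // placeholder
    ProcessingResource gazetteer = (ProcessingResource) Factory.createResource(
        "gate.creole.gazetteer.DefaultGazetteer", gazParams);

    // NE transducer turns Lookup annotations into named entities;
    // OrthoMatcher then links orthographically matching mentions
    ProcessingResource transducer = (ProcessingResource) Factory.createResource(
        "gate.creole.ANNIETransducer");
    ProcessingResource orthoMatcher = (ProcessingResource) Factory.createResource(
        "gate.creole.orthomatcher.OrthoMatcher");

    SerialAnalyserController pipeline = (SerialAnalyserController)
        Factory.createResource("gate.creole.SerialAnalyserController");
    pipeline.add(gazetteer);
    pipeline.add(transducer);
    pipeline.add(orthoMatcher);
    // ...then set a corpus and execute() as in the earlier sketch
  }
}
</syntaxhighlight>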
 
* Annotation Key sets and annotation comparison
** Need the setToKeep key in Document Reset for any pre-annotated texts
 
== Evaluation / Metrics ==
 
* Evaluation metrics - measured mathematically against human-annotated text
* Scoring - performance measures per annotation type
 
* Result types - correct, missing, spurious, partially correct (overlapping); see the sketch below
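Those four counts are what the usual measures are built from; a small sketch using the common definitions (strict treats partial matches as wrong, lenient treats them as correct), not necessarily the exact weighting of GATE's "average" mode:

<syntaxhighlight lang="java">
public class MetricsSketch {
  // strict: partially correct counts as wrong; lenient: it counts as correct
  static double precision(int correct, int spurious, int partial, boolean lenient) {
    return (correct + (lenient ? partial : 0)) / (double) (correct + spurious + partial);
  }

  static double recall(int correct, int missing, int partial, boolean lenient) {
    return (correct + (lenient ? partial : 0)) / (double) (correct + missing + partial);
  }

  static double f1(double p, double r) {
    return (p + r == 0) ? 0 : 2 * p * r / (p + r);
  }

  public static void main(String[] args) {
    // made-up counts, purely to show the arithmetic
    int correct = 80, missing = 10, spurious = 15, partial = 5;
    double p = precision(correct, spurious, partial, false);
    double r = recall(correct, missing, partial, false);
    System.out.printf("strict  P=%.3f R=%.3f F1=%.3f%n", p, r, f1(p, r));
    p = precision(correct, spurious, partial, true);
    r = recall(correct, missing, partial, true);
    System.out.printf("lenient P=%.3f R=%.3f F1=%.3f%n", p, r, f1(p, r));
  }
}
</syntaxhighlight>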
 
* Tools > Annotations Diff - comparing human vs machine annotation
 
* Corpus > Corpus quality assurance - compare by type
* (B has to be the generated set)
 
* Annotation Set Transfer (in Tools) - transfers annotations between annotation sets as documents pass through the pipeline; see the sketch below
** useful for e.g. HTML that has boilerplate
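The Annotation Set Transfer PR has its own runtime parameters (not listed here); the underlying idea, using only the core API, looks roughly like this ("Original markups" is GATE's standard set name for the document's original markup):

<syntaxhighlight lang="java">
import gate.Annotation;
import gate.Document;
import gate.util.InvalidOffsetException;

public class TransferSketch {
  // Copy the <p> annotations from "Original markups" into the default set,
  // e.g. so later PRs can restrict themselves to paragraph content
  static void copyParagraphMarkup(Document doc) throws InvalidOffsetException {
    for (Annotation a : doc.getAnnotations("Original markups").get("p")) {
      doc.getAnnotations().add(a.getStartNode().getOffset(),
                               a.getEndNode().getOffset(),
                               a.getType(), a.getFeatures());
    }
  }
}
</syntaxhighlight>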
 
=== To investigate ===
 
* markupAware for HTML/XML (keeps tags in editor)
* AnnotationStack
* Advanced Options
 
{{Blikied|Aug 30, 2010}}
