GATE is written in Java and very Java centric. This makes it portable, fast, and heavyweight. A programming library is available. It's 14 years old and has many users and contributors.
= Using GATE developer =
* GATE Developer is used to process sets of Language Resources, grouped in a Corpus, using Processing Resources. The results are typically saved to a serialized Datastore.
* Preserve formatting embeds tags in HTML or XML.
** GATE's graph-based (node/offset) XML and preserved formatting (original XML/HTML) have different strengths
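The difference between offset-based annotations and preserved formatting can be illustrated with a minimal Python sketch. This is not GATE's file format or API: the tuple layout <code>(start, end, label)</code> and the function name <code>to_inline</code> are assumptions made for illustration, and the sketch assumes the annotations do not overlap.

```python
def to_inline(text, annotations):
    """Embed stand-off (start, end, label) annotations as inline tags.

    Assumes annotations do not overlap; offsets are character positions.
    """
    pieces = []
    pos = 0
    for start, end, label in sorted(annotations):
        pieces.append(text[pos:start])  # untouched text before the span
        pieces.append(f"<{label}>{text[start:end]}</{label}>")
        pos = end
    pieces.append(text[pos:])  # trailing text after the last span
    return "".join(pieces)


text = "Tried in London."
standoff = [(9, 15, "Location")]  # annotations kept separately from the text
inline = to_inline(text, standoff)
print(inline)
```

Keeping the annotations separate leaves the text untouched and allows overlapping spans, while inline tags keep the original markup readable but force the spans to nest cleanly.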
= Information Extraction =
* IR (information retrieval) - retrieve docs
* IE (information extraction) - retrieve structured data
* Knowledge Engineering - rule based
* Learning Systems - statistical
Old Bailey IE project - old English (online)
* POS (part of speech) - assigned on each Token (noun, verb, etc.)
* Gazetteer - gotcha: the initialization parameter listsURL has to be set before it's loaded. Must also "save and reinitialize."
* Gazetteer creates Lookups, then the transducer creates named entities
* Then the orthomatcher (coreference based on spelling features in common) associates those entities
* Annotation Key sets and comparing annotations
** Need the Key set in setsToKeep in Document Reset for any pre-annotated texts
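The gazetteer-then-transducer flow in the list above can be sketched in a few lines of Python. This is a toy illustration, not GATE's implementation: the word lists, the <code>major_type</code> names, the rule table and both function names are hypothetical, and a real transducer applies pattern rules rather than a simple type mapping.

```python
import re

# Hypothetical gazetteer lists: major type -> entries.
GAZETTEER = {
    "location": ["London", "Old Bailey"],
    "person_first": ["John", "Mary"],
}

# Hypothetical stand-in for transducer rules: Lookup type -> entity type.
RULES = {"location": "Location", "person_first": "Person"}


def find_lookups(text, gazetteer):
    """Return sorted (start, end, major_type) tuples for every list entry found."""
    found = []
    for major_type, entries in gazetteer.items():
        for entry in entries:
            pattern = r"\b" + re.escape(entry) + r"\b"
            for match in re.finditer(pattern, text):
                found.append((match.start(), match.end(), major_type))
    return sorted(found)


def to_entities(lookups, rules):
    """Turn Lookup tuples into named-entity tuples using the rule table."""
    return [(s, e, rules[t]) for s, e, t in lookups if t in rules]


text = "John Smith was tried at the Old Bailey in London."
lookups = find_lookups(text, GAZETTEER)
entities = to_entities(lookups, RULES)
for start, end, label in entities:
    print(label, text[start:end])
```

The split mirrors the notes: the first step only marks where list entries occur, and the second step decides what those marks mean.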
== Evaluation / Metrics ==
* Evaluation metric - machine output scored mathematically against human-annotated text
* Scoring - performance measures for annotation types
* Result types - correct, missing, spurious, partially correct (overlapped)
* Tools > Annotation Diff - comparing human vs. machine annotation
* Corpus > Corpus Quality Assurance - compare by type
** B has to be the generated set
* Annotation Set Transfer (in Tools) - transfer annotations between annotation sets in a pipeline
** useful for e.g. HTML that has boilerplate
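The result types above (correct, missing, spurious, partially correct) map onto precision, recall and F-measure. The following Python sketch shows one common way to compute them. It is an assumption-laden illustration, not GATE's code: the tuple layout, the function names, and the <code>partial_weight</code> parameter (0 for strict scoring, 1 for lenient, 0.5 for an average) are all chosen here for the example.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    correct: int = 0
    partial: int = 0
    missing: int = 0
    spurious: int = 0


def overlaps(a, b):
    """Same type and intersecting character spans."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def compare(key, response):
    """Compare human (key) and machine (response) (start, end, type) tuples."""
    counts = Counts()
    unmatched_response = list(response)
    unmatched_key = []
    for k in key:  # first pass: exact matches
        if k in unmatched_response:
            unmatched_response.remove(k)
            counts.correct += 1
        else:
            unmatched_key.append(k)
    for k in unmatched_key:  # second pass: overlapping matches
        hit = next((r for r in unmatched_response if overlaps(k, r)), None)
        if hit is not None:
            unmatched_response.remove(hit)
            counts.partial += 1
        else:
            counts.missing += 1
    counts.spurious = len(unmatched_response)
    return counts


def precision(c, partial_weight=0.5):
    total = c.correct + c.partial + c.spurious
    return (c.correct + partial_weight * c.partial) / total if total else 0.0


def recall(c, partial_weight=0.5):
    total = c.correct + c.partial + c.missing
    return (c.correct + partial_weight * c.partial) / total if total else 0.0


def f_measure(p, r):
    return 2 * p * r / (p + r) if (p + r) else 0.0


key = [(0, 5, "Person"), (10, 20, "Location"), (30, 35, "Date")]
response = [(0, 5, "Person"), (12, 20, "Location"), (40, 44, "Person")]
counts = compare(key, response)
p, r = precision(counts), recall(counts)
print(counts, p, r, f_measure(p, r))
```

In this example the Person annotation is correct, the Location overlaps only partially, the Date is missing, and the second Person is spurious.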
=== To investigate ===