No edit summary |
|||
(25 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
A full week of learning GATE text mining/information extraction language processing and talks. [http://gate.ac.uk/wiki/TrainingCourseAug2010/ Session wiki] | A full week of learning [http://en.wikipedia.org/wiki/General_Architecture_for_Text_Engineering GATE] text mining/information extraction language processing and talks. [http://gate.ac.uk/wiki/TrainingCourseAug2010/ Session wiki] | ||
[[File:GATE_screenshot.png|900px|GATE developer screenshot]] | [[File:GATE_screenshot.png|900px|GATE developer screenshot]] | ||
Line 20: | Line 20: | ||
* Learning Systems - statistical | * Learning Systems - statistical | ||
Old Bailey IE project - | Old Bailey IE project - 17th-century English (online) | ||
* POS - assigned in Token (noun, verb, etc) | * POS - assigned in Token (noun, verb, etc) | ||
* Gazetteer - gotcha, have to set initialization parameter listsURL before it's | * Gazetteer - gotcha, have to set initialization parameter listsURL before it's loaded. Must also "save and reinitialize." | ||
loaded. Must also "save and reinitialize." | |||
* Gazetteer creates Lookups, then transducer creates named entities | * Gazetteer creates Lookups, then transducer creates named entities | ||
* Then orthomatcher (spelling features in common) coreference associates those | * Then orthomatcher (spelling features in common) coreference associates those | ||
Line 34: | Line 33: | ||
== Evaluation / Metrics == | == Evaluation / Metrics == | ||
* Evaluation metric - | * Evaluation metric - computed mathematically against human-annotated text | ||
* Scoring - performance measures for annotation types | * Scoring - performance measures for annotation types | ||
* Precision = correct / (correct + spurious) | |||
* Recall = correct / (correct + missing) | |||
* F-measure is the harmonic mean of precision and recall | |||
* F = 2⋅(precision⋅recall) / (precision + recall) | |||
* GATE supports average, strict, lenient | |||
* Result types - Correct, missing, spurious, partially correct (overlapped) | * Result types - Correct, missing, spurious, partially correct (overlapped) | ||
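
The scoring bullets above can be sketched in a few lines of Python. This is an illustration only, not GATE's own code: the `Counts` class and the function names are made up for this example, and the handling of partially correct annotations (ignored for strict, counted in full for lenient, counted at half weight for average) is my understanding of how GATE defines the three modes. With no partial matches it reduces to the precision and recall formulas in the notes.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Counts:
    """Hypothetical container for the four result types from the notes."""
    correct: int
    partial: int   # overlapping but not identical spans
    missing: int   # only in the human-annotated set
    spurious: int  # only in the generated set


def precision_recall(c: Counts, mode: str = "strict") -> Tuple[float, float]:
    # Assumed weight of a partially correct annotation under each mode.
    weight = {"strict": 0.0, "lenient": 1.0, "average": 0.5}[mode]
    matched = c.correct + weight * c.partial
    generated = c.correct + c.partial + c.spurious  # everything the pipeline produced
    expected = c.correct + c.partial + c.missing    # everything the humans annotated
    precision = matched / generated if generated else 0.0
    recall = matched / expected if expected else 0.0
    return precision, recall


def f_measure(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    counts = Counts(correct=6, partial=2, missing=2, spurious=2)
    for mode in ("strict", "lenient", "average"):
        p, r = precision_recall(counts, mode)
        print(mode, round(p, 3), round(r, 3), round(f_measure(p, r), 3))
```

Computing F from the averaged precision and recall is a choice made for this sketch; GATE may combine the strict and lenient figures differently.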
Line 42: | Line 47: | ||
* Corpus > Corpus quality assurance - compare by type | * Corpus > Corpus quality assurance - compare by type | ||
*( B has to be generated set | * (B has to be the generated set) | ||
* Annotation set transfer (in tools) - transfer between docs in pipeline | * Annotation set transfer (in tools) - transfer between docs in pipeline | ||
** useful for eg | ** useful for e.g. HTML that has boilerplate | ||
== To investigate == | |||
* markupAware for HTML/XML (keeps tags in editor) | * markupAware for HTML/XML (keeps tags in editor) | ||
* AnnotationStack | * AnnotationStack | ||
* Advanced Options | * Advanced Options | ||
= JAPE = | |||
* Rules based on tokens and lookups | |||
Phase: MatchingStyles | |||
Input: Lookup | |||
Options: control = appelt | |||
Rule: Test1 | |||
( | |||
({Lookup.majorType == location})? | |||
{Lookup.majorType == loc_key} | |||
):match | |||
--> | |||
:match.Location = {rule=Test1} | |||
Copying features: :match.Location = { type = :match.Lookup.minorType} | |||
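
To make the Test1 rule above more concrete, here is a rough Python simulation of what it matches. It is a sketch under heavy assumptions, not the GATE API: the document is reduced to a plain list of Lookup majorType values in order (mirroring Input: Lookup, which ignores other annotation types), and the appelt style is approximated as "take the longest match at each position, then continue after it". The function name `match_locations` is hypothetical.

```python
def match_locations(major_types):
    """Return (start, end) index spans matching (location)? loc_key.

    Approximates appelt-style matching: at each position the longest
    possible match is taken, and scanning resumes after that match.
    """
    spans = []
    i = 0
    n = len(major_types)
    while i < n:
        # Longest alternative first: a "location" Lookup followed by "loc_key".
        if (major_types[i] == "location"
                and i + 1 < n
                and major_types[i + 1] == "loc_key"):
            spans.append((i, i + 2))
            i += 2
        # Otherwise the optional part is skipped and "loc_key" matches alone.
        elif major_types[i] == "loc_key":
            spans.append((i, i + 1))
            i += 1
        else:
            i += 1
    return spans


if __name__ == "__main__":
    lookups = ["location", "loc_key", "person", "loc_key", "location"]
    print(match_locations(lookups))  # [(0, 2), (3, 4)]
```

Each returned span corresponds to one Location annotation the rule would create; a trailing "location" with no "loc_key" after it produces nothing.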
== To review, gotchas == | |||
* Rule types: "first" takes only the first match, excludes compound | |||
** a? b for "a b" will match "a b" | |||
* multiplexor transducers | |||
* multi-constraint statements | |||
* macros | |||
* Reusing created annotations has to happen in a separate rule | |||
=== Matching types === | |||
[[File:gate-matching.png|800px|Matching styles for JAPE]] | |||
= To follow up = | |||
* WebSphinx crawler CREOLE plugin | |||
* [http://www.semanticsoftware.info/semantic-assistants-architecture GATE NLP web services] | |||
= Other notes = | |||
== Lucene data store and ANNIC == | |||
* Use <null> for default set | |||
* Go to Datastore for queries | |||
** eg {Person}({Token})+{Money} | |||
* Useful for debugging JAPE and results | |||
[[File:GATE-lucene-person-money.png|800px]] | |||
= Demos = | |||
* Mímir for querying large volumes of data (uses MG4J) | |||
* Translating parts of speech between languages using Compound editor and Alignment editor | |||
* Predicate extractor (MultiPaX) | |||
** Mixed results at best | |||
* OwlExporter | |||
** NLP ontology | |||
= Conclusions = | |||
While it can do a lot out of the box and benefits from development time and breadth of connectivity, it needs usability testing to be useful to more than patient specialists. Many things are non-obvious or overly domain-specific, yet could be more broadly useful with a bit of work. Interaction could include far more immediate, useful and interesting-looking displays. A web-based version could have these features. However, the team seems somewhat ambivalent about development. :) | |||
Looking forward to learning about programming using GATE libraries. | |||
{{Blikied|Aug 30, 2010}} | {{Blikied|Aug 30, 2010}} | ||
[[Category:SemWeb]] |