-
Description, History & Glossary
-
“Federated Content Search” at CLARIN
In short: Content Search over Distributed Resources
Also: Federated “Corpus Query Platform”
-
Search for patterns in distributed text collections
-
No central index!
-
Text resources include annotated corpora, full-texts etc.
-
FCS = interface specification, search infrastructure and software ecosystem
-
Usage of established standards and extensibility!
Interface Specification
-
Description of search protocol (query languages, formats and communication channels)
“for homogeneous access to heterogeneous search engines”
-
RESTful protocol
Search Infrastructure in CLARIN and Text+
-
Central client (search result “Aggregator” and web portal)
-
Decentralized endpoints at the data centers (local search engines on resources)
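
The split between a central client and decentralized endpoints can be illustrated with a minimal sketch. It is not the Aggregator's actual implementation: the endpoint names and the `search` callables are hypothetical stand-ins for real HTTP requests to remote endpoints. The query is sent to every endpoint in parallel and the hits stay grouped per endpoint, since there is no central index that could rank them globally.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def make_endpoint(texts: List[str]) -> Callable[[str], List[str]]:
    """Hypothetical endpoint: full-text search over its own local texts only."""
    def search(query: str) -> List[str]:
        return [t for t in texts if query.lower() in t.lower()]
    return search


# Made-up endpoints and data, for illustration only.
ENDPOINTS: Dict[str, Callable[[str], List[str]]] = {
    "centre-a": make_endpoint(["The cyclists are fast", "A slow train"]),
    "centre-b": make_endpoint(["Fast cars", "Nothing here"]),
}


def aggregate(query: str) -> Dict[str, List[str]]:
    """Send the query to all endpoints in parallel; group hits per endpoint."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(search, query)
                   for name, search in ENDPOINTS.items()}
        return {name: future.result() for name, future in futures.items()}


results = aggregate("fast")
for name, hits in results.items():
    print(name, hits)
```

Each endpoint keeps full control over its own data and search; the client only collects and presents what comes back.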
Software Ecosystem primarily in Java
-
Libraries (Java, Python, …)
-
Tools (Validator, Aggregator, Registry)
-
(Own) text resources
-
“Search engine” on those text resources
-
Minimum: full-text search
-
-
Deployment of publicly accessible FCS endpoint(s)
Pros
-
Integration of many resources, linking and comparison of results
-
Integration with other tools (WebLicht, Registry/VLO, Switchboard, …)
-
Same queries, formats, result presentation
-
No duplicate data storage and hence no inconsistency
Cons
-
No control over resources
-
No deterministic results (e.g. links for publications)
-
No global ranking of results possible
Pros
-
Control over resources and search (ranking, fuzzy, …)
-
No duplication of data, as a central index would require
-
Increased visibility in a larger resource catalog
Cons
-
Deployment of (additional) endpoint necessary
| Aspect | Federated search (FCS) | Central index |
|---|---|---|
| Data | ➕ At the endpoints | ➖ Duplicate data storage, possible inconsistency (age, updates); a transfer may not be legally possible |
| Updates to data | ➕ Endpoints can react quickly | ➖ Difficult, e.g. removal of resources in the event of legal problems; updates entail longer delays, if possible at all |
| Global ranking | ➖ Very difficult/impossible | ➕ Quite possible (?), probably with implicit assumptions and normalization of data for indexing |
| Faceted search | ➖ Difficult (e.g. via external metadata; not explicitly intended) | ➕ Indexing allows clustering/classification according to topics and categories |
-
~ 2011 Started as Working Group in CLARIN
-
May 2011 EDC/FCS Workshop
-
~ 2011–2013 Initial version, now named FCS “Legacy”
-
SRU Scan for resources, BASIC Search (CQL/full-text), KWIC
-
-
April 2013 FCS Workshop
-
~ 2013/2014 Code and Spec for FCS Core 1.0
-
fcs-simple-endpoint:1.0.0, sru-server:1.5.0
-
BASIC Hits Data View, SRU Scan operation not used anymore
-
-
much has disappeared into the annals of history …
-
https://github.com/clarin-eric/fcs-misc/tree/main/historical/documents
-
https://trac.clarin.eu/wiki/FCS/Specification?action=history
-
https://trac.clarin.eu/wiki/Taskforces/FCS/FCS-Specification-Draft?action=history
-
https://www.clarin.eu/event/2013/federated-content-search-workshop
-
EDC: European Demonstrator Case
-
~ 2015/2016 Starting work on and Code for FCS Core 2.0
-
fcs-simple-endpoint:1.3.0, sru-server:1.8.0
-
Advanced Data Views (FCS-QL), …
-
-
June 2017 Official release of FCS Core 2.0 Spec
-
2022 FCS becomes a focus in Text+ (Findability)
-
2023 New FCS maintainer in CLARIN
-
Migration of Source Code to GitHub.com, updated documentation
-
Python FCS endpoint libraries
-
Updated libraries & tools
-
Prototypes for LexFCS extension
-
-
2024
-
Experiments with Entity Search (extension)
-
Rewrite of FCS Endpoint Validator
-
SRU (Search/Retrieval via URL) / OASIS searchRetrieve
-
Standardized by Library of Congress (LoC) / OASIS
-
RESTful
-
Explain: Listing of resources
-
Languages, annotations, supported data views and formats etc.
-
-
SearchRetrieve: Search request
-
-
Data as XML
-
Extensions to the protocol explicitly allowed
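
As a rough illustration of the two operations, the sketch below builds request URLs and reads keyword-in-context lines from a response. Several things are assumptions and should be checked against the SRU and FCS Core 2.0 specifications: the endpoint URL is hypothetical, the parameter names (`operation`, `queryType`, `query`, `maximumRecords`, `x-fcs-endpoint-description`) reflect my reading of those specifications, and the sample XML is hand-written and heavily simplified (a real response wraps each hit in record and resource elements).

```python
from typing import List, Tuple
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint base URL, for illustration only.
ENDPOINT = "https://example.org/fcs"
# Namespace assumed for the Hits data view.
HITS_NS = "{http://clarin.eu/fcs/dataview/hits}"


def explain_url(base: str) -> str:
    """Explain request that also asks for the FCS endpoint description."""
    params = {"operation": "explain", "x-fcs-endpoint-description": "true"}
    return base + "?" + urlencode(params)


def search_url(base: str, query: str, query_type: str = "cql",
               maximum_records: int = 10) -> str:
    """SearchRetrieve request; the query string is percent-encoded."""
    params = {
        "operation": "searchRetrieve",
        "queryType": query_type,
        "query": query,
        "maximumRecords": maximum_records,
    }
    return base + "?" + urlencode(params)


# Hand-written, simplified response fragment (not real endpoint output).
SAMPLE = """<sru:searchRetrieveResponse
    xmlns:sru="http://docs.oasis-open.org/ns/search-ws/sruResponse"
    xmlns:hits="http://clarin.eu/fcs/dataview/hits">
  <sru:numberOfRecords>1</sru:numberOfRecords>
  <hits:Result>The <hits:Hit>cyclists</hits:Hit> are fast</hits:Result>
</sru:searchRetrieveResponse>"""


def kwic_lines(xml_text: str) -> List[Tuple[str, str, str]]:
    """Return (left context, hit, right context) for each marked hit."""
    root = ET.fromstring(xml_text)
    lines = []
    for result in root.iter(HITS_NS + "Result"):
        left = result.text or ""
        for hit in result.findall(HITS_NS + "Hit"):
            right = hit.tail or ""
            lines.append((left.strip(), hit.text or "", right.strip()))
            left = right  # text between two hits is context for both
    return lines


print(explain_url(ENDPOINT))
print(search_url(ENDPOINT, "cyclists"))
print(kwic_lines(SAMPLE))
```

Because requests are plain URLs and responses plain XML, an endpoint can be probed with nothing more than a browser or a standard HTTP and XML library.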
-
Different (optional) annotation layers
| Full-text | The | cyclists | are | fast |
|---|---|---|---|---|
| Part of Speech | DET | NOUN | VERB | ADJ |
| Lemmatisation | The | cyclist | be | fast |
| Phonetic Transcription | … | … | … | … |
| Orthographic Transcription | … | … | … | … |
| […] | | | | |
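
One way to picture such layers is as parallel lists aligned by token position. The sketch below is only an illustration with made-up helper names, not how any particular endpoint stores its data. It mimics what a layer query such as `[lemma = "cyclist"]` (FCS-QL style, as far as I understand the syntax) would select.

```python
from typing import Dict, List

# Token-aligned annotation layers for the example sentence.
# Layer names and this storage layout are assumptions for illustration.
LAYERS: Dict[str, List[str]] = {
    "text":  ["The", "cyclists", "are", "fast"],
    "pos":   ["DET", "NOUN", "VERB", "ADJ"],
    "lemma": ["The", "cyclist", "be", "fast"],
}


def find_tokens(layer: str, value: str) -> List[int]:
    """Return the positions whose annotation on `layer` equals `value`."""
    return [i for i, v in enumerate(LAYERS[layer]) if v == value]


def surface(positions: List[int]) -> List[str]:
    """Map token positions back to the full-text layer."""
    return [LAYERS["text"][i] for i in positions]


# A lemma query finds the inflected form in the text layer.
print(surface(find_tokens("lemma", "cyclist")))
print(surface(find_tokens("pos", "ADJ")))
```

Since every layer is optional, a client has to check which layers an endpoint offers before querying them.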
-
Current version of the specification: FCS Core 2.0
-
Poster at Bazaar @ CLARIN2023 on the current status
-
😎 “Awesome FCS” List: github.com/clarin-eric/awesome-fcs with relevant links to specs, tools, libraries, implementations and much more
-
Additions by Text+ (e.g. on LexFCS/LexCQL/Forks/Software): gitlab.gwdg.de/textplus/ag-fcs-documents/-/blob/main/awesome-fcs.md
-
-
CLARIN specifications: github.com/clarin-eric/fcs-misc
-
Small ecosystem (Code on GitHub/GitLab)
-
Software libraries (SRU/FCS, endpoint + client, Java/Python)
-
Aggregator (Code: GitHub, Text+ Fork)
-
Online Validator for endpoints (fcsvalidator, Code: GitHub (old), GitHub (new))
-
-
Endpoint Registry: centres.clarin.eu/fcs
-
Lexical Resources extension
-
First specification and implementation in Text+
-
Official extension of CLARIN → ~2024 Working Plan
-
-
AAI integration
-
Specification and implementation
-
Goal: Support access-restricted resources
-
Securing the aggregator via Shibboleth → Passing on AAI attributes to endpoints
-
Preliminary work from CLARIAH-DE, part of the Text+ work plan (IDS Mannheim, Uni/SAW Leipzig, preliminary work BBAW)
-
-
Syntactic Search
-
Entity Search
-
Optional metadata for each result
-
CLARIN-EU Taskforce
-
CLARIN ERIC working plan: “extending the protocol to cover additional data types (e.g. lexica) will be explored”
-
on the CLARIN 2024 Working Plan
-
-
Interest expressed from various countries
-
Preliminary work: “RABA” (Estonia): e.g. “Eesti Wordnet”
-
First specification and implementation in Text+
-
Specification on Zenodo: zenodo.org/records/7849754
-
Presentation at eLex 2023: “A Federated Search and Retrieval Platform for Lexical Resources in Text+ and CLARIN”
-
Aggregator: fcs.text-plus.org/?queryType=lex
-
CLARIN (contentsearch.clarin.eu, Registry)
-
209 Resources (94 in Advanced)
in 61 Languages
from 20 Institutions in 12 Countries
Text+ (fcs.text-plus.org)
-
53 Resources (17 in Advanced, 30 in Lexical)
in 6 Languages
from 9 Institutions in Germany
CLARIN
-
Alpha/Beta using Side-Loading in Aggregator
-
Stable/Long-Term: Entry in Centre Registry
-
CLARIN Account + form as a Centre
-
Including monitoring etc.
-
Text+
-
Side-Loading in Aggregator
-
WIP: Registry (index of endpoints)
-
Development of an alternative aggregator frontend as Web Component
-
Code: Vue.js Store + Vuetify Component (Dialog); Demo
-
Use of the Aggregator API
-
Restriction to subset of resources, e.g. for integration on own website
-
Faceting, alternative visualization
-
-
Java: Maven Archetype github.com/clarin-eric/fcs-endpoint-archetype
-
Java & Python (reference implementation Korp):
-
😎 “Awesome FCS” List: github.com/clarin-eric/awesome-fcs
-
List of reference implementations, endpoints, query parsers
-
Code for FCS SRU Aggregator and SRU/FCS Validator
-