pmhfst tokeniser inconsistently tokenises hyphen minus #28

rueter · 2023-03-21T09:54:50Z

lang-fin/tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst

The hyphen-minus is sometimes split off as a separate token and other times kept attached to the word in +Cmp/SplitR situations.

Here are five separate instances:
(1a)
Ruotsin keski- ja eteläosien välille

(1b)
yleisesti Etelä- ja Keski-Suomen alueella

yleisesti                                                                                                                                                   
Etelä-                                                                                                                                                      
ja                                                                                                                                                          
Keski-Suomen                                                                                                                                                
alueella

(2)
Instances of +Cmp/SplitL
Keski-Ruotsin ja -Norjan asumattomille metsäseuduille

Keski-Ruotsin                                                                                                                                               
ja                                                                                                                                                          
 -                                                                                                                                                          
Norjan                                                                                                                                                      
asumattomille                                                                                                                                               
metsäseuduille

In (2), the hyphen-minus is indented and split off as a separate token, which differs from what is found in (1a).

In (3) and (4), it is disturbing to observe that a leading whitespace appears before the hyphen-minus.
The "niin kuin" token is also peculiar.
(3)
tiukoin ottein - niin kuin

(4)
(n. 4200 - 2500 eaa.)

In (5), an extra line has been inserted, but it may be associated with the « quote.
(5)
tarkoituksenmukaisuus - «muoto seuraa

tarkoituksenmukaisuus                                                                                                                                       
 -                                                                                                                                                          
                                                                                                                                                            
«                                                                                                                                                           
muoto                                                                                                                                                       
seuraa
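
The symptoms described in (2) to (5) can be checked mechanically. The following is a minimal sketch, assuming the tokeniser prints one token per line as in the outputs above; `find_suspicious_tokens` is a hypothetical helper name and is not part of hfst-tokenise. Trailing padding is ignored, and only leading whitespace and whitespace-only lines are flagged.

```python
def find_suspicious_tokens(output: str) -> list[tuple[int, str, str]]:
    """Return (line_number, reason, raw_line) for token lines that look wrong.

    Assumes one token per line. Trailing whitespace is ignored on purpose.
    """
    problems = []
    for number, raw in enumerate(output.split("\n"), start=1):
        token = raw.rstrip()  # drop trailing padding only
        if raw and not token:
            # The line consists of whitespace only, as in example (5).
            problems.append((number, "whitespace-only token", raw))
        elif token != token.lstrip():
            # The token starts with whitespace, as in examples (2) to (5).
            problems.append((number, "leading whitespace", raw))
    return problems


# Output of example (5), with trailing padding removed for readability.
sample = "tarkoituksenmukaisuus\n -\n \n«\nmuoto\nseuraa"
for problem in find_suspicious_tokens(sample):
    print(problem)
```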


snomos · 2023-03-21T19:48:20Z

How are you running hfst-tokenise?

rueter · 2023-03-22T04:33:32Z

snomos · 2023-03-22T05:35:06Z

Thanks. But that can’t be how you ran hfst-tokenise in the examples 1-5 above, since the output is different from that in your last comment. Just the commands as text is fine, no need for an image 😊

rueter · 2023-03-22T07:30:10Z

echo 'TEXT' | hfst-tokenise -S tools/...pmhfst
Gives the result shown in (5) and the top part of the screenshot.
Add -g and we see the whitespaces more explicitly.
I'm using an M2 Mac with macOS Ventura, if that helps.

TinoDidriksen · 2023-03-22T07:43:31Z

There are a lot of quirks with tokenization. For Greenlandic we have a helper that wraps around hfst-tokenize and smooths out the quirks - e.g., https://github.com/giellalt/lang-kal/blob/main/tools/shellscripts/kal-tokenise.in#L472 for dashes. Might be a source of inspiration.

snomos · 2023-03-22T14:42:32Z

@flammie see @TinoDidriksen 's comment above re what we talked about earlier today to have a look at improving tokenisation and text analysis.

flammie · 2023-03-24T09:55:43Z

Mm, I know there are a lot of corner cases with tokenisation, some language-specific, others more generic. I'll try to make a list here then:

word-final hyphen should be part of word-token
word-initial hyphen should be part of word-token
sole hyphen between spaces should be just that

I think I'm going to make a test suite, actually; this regresses quite easily and stays unnoticed for a long time...
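
A regression check along these lines could look like the sketch below. It is only an illustration: the cases are taken from the examples earlier in this thread, while `run_cases`, `whitespace_tokenise` and `overeager_tokenise` are hypothetical names, and the two tokenisers are plain-Python stand-ins rather than calls to hfst-tokenise. Each case only checks that the expected hyphen token is present, so it makes no claim about the other tokens (for instance "niin kuin").

```python
# (input text, token that must appear in the tokeniser output)
CASES = [
    ("Ruotsin keski- ja eteläosien välille", "keski-"),                   # word-final hyphen
    ("Keski-Ruotsin ja -Norjan asumattomille metsäseuduille", "-Norjan"),  # word-initial hyphen
    ("tiukoin ottein - niin kuin", "-"),                                   # sole hyphen between spaces
]


def run_cases(tokenise, cases=CASES):
    """Return a list of (text, expected_token, actual_tokens) for failing cases."""
    failures = []
    for text, expected in cases:
        tokens = tokenise(text)
        if expected not in tokens:
            failures.append((text, expected, tokens))
    return failures


def whitespace_tokenise(text):
    """Stand-in tokeniser: split on whitespace only."""
    return text.split()


def overeager_tokenise(text):
    """Stand-in for a regression: every hyphen becomes its own token."""
    return text.replace("-", " - ").split()


print(run_cases(whitespace_tokenise))      # no failures expected
print(len(run_cases(overeager_tokenise)))  # the two word-edge cases fail
```

In a real suite the `tokenise` argument would wrap the actual tokeniser, so the same cases can be rerun whenever the lexc files or tokenisation rules change.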

snomos · 2023-03-25T10:17:35Z

Looks good. One note of caution: word-final and word-initial hyphens could also be errors (i.e. a missing space between hyphen and word), so I would suggest that these hyphens are part of the word token IFF the full token can be analysed as such. If not, I would treat them as two separate tokens. If you don't, you will get an unknown token when you could have had two known tokens.
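
The rule suggested here can be sketched as follows. This is an illustration under stated assumptions, not the way the pmhfst tokeniser implements it: `split_edge_hyphens` is a hypothetical function, and `KNOWN` is a toy set standing in for "the analyser returns an analysis for this form".

```python
# Toy stand-in for the analyser: forms that are assumed to get an analysis.
KNOWN = {"keski-", "-Norjan", "ottein", "niin"}


def split_edge_hyphens(token, is_analysable):
    """Keep an edge hyphen in the token only if the whole token is analysable."""
    if token == "-" or is_analysable(token):
        return [token]
    if len(token) > 1 and token.endswith("-"):
        # Probably a missing space before the hyphen: split it off.
        return [token[:-1], "-"]
    if len(token) > 1 and token.startswith("-"):
        # Probably a missing space after the hyphen: split it off.
        return ["-", token[1:]]
    return [token]


def known(token):
    return token in KNOWN


print(split_edge_hyphens("keski-", known))   # kept whole: the full form is known
print(split_edge_hyphens("ottein-", known))  # split: only "ottein" is known
print(split_edge_hyphens("-niin", known))    # split: only "niin" is known
```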

merisiga · 2023-03-27T07:57:21Z

I am not sure Sjur's approach would work for Estonian. For example, we have a word "industriaalne" (industrial); in a compound it would be truncated: "industriaalmaastik" (industrial landscape). "industriaal" alone is not a legitimate word. Now, when part of a co-ordinated noun phrase, it would be truncated and written with a hyphen: "industriaal- ja linnamaastik" (industrial and city landscape). This means that "industriaal-" is legitimate only together with the final hyphen. This truncation and hyphenation convention is not exceptional.

snomos · 2023-03-27T08:03:52Z

@merisiga it would work well as long as the word form industriaal- would get an analysis in itself, since in that case it would be recognised as a single token including the hyphen. The only case where it would be problematic would be if industriaal- had been misspelled, and thus not analysed.

merisiga · 2023-03-27T08:16:36Z

Ok, good. Now, what would you do with industriaal-- (two trailing hyphens)? This is a spelling error, but it consists of a legitimate token plus a hyphen. I am asking because this seems to be one of those corner cases, and I feel it would be easy to come up with an ad hoc solution which might induce some more ad hoc solutions down the pipeline. In short, I have an uneasy feeling, but cannot point to the exact reason for it...

snomos · 2023-03-27T08:28:32Z

Depends on the task. For tokenisation I would probably just let it be. I am not sure how the tokens would be split; there are several possible outcomes depending on the FST and the tokenisation rules. The easiest solution would be to just add -- as an erroneous variant of -, for all cases, and then be done. That would return one token for industriaal--. In cases where -- is used as an em dash replacement (i.e. with spaces on both sides), it could have an em dash as a (correct?) alternative, and use context disambiguation to select the correct one.

In the case of a grammar checker, I would probably iterate over the errors - two hyphens are usually an error to be corrected to one (again, such error detection and correction can be context dependent), and once corrected, the rest of the sentence including the corrected industriaal- would no longer be an issue.

I.e., by treating -- in a separate step, I don't see this turning into a rabbit hole.
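
As a rough illustration of handling -- in a separate step before tokenisation, the sketch below corrects a double hyphen attached to a word to a single hyphen and replaces a double hyphen with whitespace on both sides by an em dash. `correct_double_hyphens` is a hypothetical name, and the whitespace test is a deliberately simple stand-in for the context disambiguation mentioned above.

```python
import re

# Exactly two hyphens, not part of a longer run of hyphens.
DOUBLE_HYPHEN = re.compile(r"(?<!-)--(?!-)")


def correct_double_hyphens(text):
    """Rewrite '--' before tokenisation (simplified, assumption-based rule)."""

    def replace(match):
        # Treat the start and end of the text like whitespace.
        before = text[match.start() - 1] if match.start() > 0 else " "
        after = text[match.end()] if match.end() < len(text) else " "
        if before.isspace() and after.isspace():
            return "\u2014"  # free-standing '--' read as an em dash
        return "-"  # attached '--' read as a mistyped single hyphen

    return DOUBLE_HYPHEN.sub(replace, text)


print(correct_double_hyphens("industriaal-- ja linnamaastik"))
print(correct_double_hyphens("tiukoin ottein -- niin kuin"))
```

After this step "industriaal-" reaches the tokeniser in its regular form, so the word-final hyphen rule applies unchanged.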

flammie · 2023-04-04T07:55:02Z

The Finnish output now looks like this:

Ruotsin
keski-
ja
eteläosien
välille
yleisesti
Etelä-
ja
Keski-Suomen
alueella
Keski-Ruotsin
ja
-Norjan
asumattomille
metsäseuduille
tiukoin
ottein
-
niin kuin
(
n.
4200
- 2500
eaa.
)
tarkoituksenmukaisuus
-
«
muoto
seuraa

The fixes so far have only touched the lexc files: deleting space and space-hyphen from the analysers and adding the expected hyphens to words. In most cases the tokenisation follows from the analyser.
