pmhfst tokeniser inconsistently tokenises hyphen minus #28

Open
rueter opened this issue Mar 21, 2023 · 13 comments

Comments

@rueter
Member

rueter commented Mar 21, 2023

lang-fin/tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst

The hyphen-minus is sometimes split off as a separate token and sometimes kept attached to the word in +Cmp/SplitR situations.

Here are five separate instances:
(1a)
Ruotsin keski- ja eteläosien välille

Ruotsin                                                                                                                                                     
keski                                                                                                                                                       
-                                                                                                                                                           
ja                                                                                                                                                          
eteläosien                                                                                                                                                  
välille  

(1b)
yleisesti Etelä- ja Keski-Suomen alueella

yleisesti                                                                                                                                                   
Etelä-                                                                                                                                                      
ja                                                                                                                                                          
Keski-Suomen                                                                                                                                                
alueella  

(2)
Instances of +Cmp/SplitL
Keski-Ruotsin ja -Norjan asumattomille metsäseuduille

Keski-Ruotsin                                                                                                                                               
ja                                                                                                                                                          
 -                                                                                                                                                          
Norjan                                                                                                                                                      
asumattomille                                                                                                                                               
metsäseuduille

In (2), the hyphen-minus is both separated from the word and indented, which differs from what is found in (1a).

In (3) and (4), a leading whitespace unexpectedly appears before the hyphen-minus.
The "niin kuin" token in (3) is also peculiar.
(3)
tiukoin ottein - niin kuin

tiukoin                                                                                                                                                     
ottein                                                                                                                                                      
 -                                                                                                                                                          
niin kuin

(4)
(n. 4200 - 2500 eaa.)

(                                                                                                                                                           
n.                                                                                                                                                          
4200                                                                                                                                                        
 -                                                                                                                                                          
2500                                                                                                                                                        
eaa.                                                                                                                                                        
)

In (5), an extra line has been inserted, but it may be associated with the « quote.
(5)
tarkoituksenmukaisuus - «muoto seuraa

tarkoituksenmukaisuus                                                                                                                                       
 -                                                                                                                                                          
                                                                                                                                                            
«                                                                                                                                                           
muoto                                                                                                                                                       
seuraa            
@snomos
Member

snomos commented Mar 21, 2023

How are you running hfst-tokenise?

@rueter
Member Author

rueter commented Mar 22, 2023

Screenshot 2023-03-22 at 6 20 37

@snomos
Member

snomos commented Mar 22, 2023

Thanks. But that can’t be how you ran hfst-tokenise in the examples 1-5 above, since the output is different from that in your last comment. Just the commands as text is fine, no need for an image 😊

@rueter
Member Author

rueter commented Mar 22, 2023

echo 'TEXT' | hfst-tokenise -S tools/...pmhfst
Gives the result shown in (5) and the top part of the screenshot.
Add -g and the whitespace shows up more explicitly.
I'm on an M2 Mac running macOS Ventura, if that helps.
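
For anyone comparing outputs, a small helper can make the stray whitespace visible in text that has already been captured from the tokeniser. This is only a sketch: `show_whitespace` is a hypothetical name, it does not call hfst-tokenise, and the sample string merely mimics the shape of example (5) (the exact number of spaces there is an assumption).

```python
def show_whitespace(output: str) -> list[str]:
    """Return one string per output line with spaces and tabs made visible."""
    marked = []
    for line in output.split("\n"):
        # Replace invisible characters with visible stand-ins.
        marked.append(line.replace(" ", "␣").replace("\t", "→"))
    return marked


# Assumed sample in the shape of example (5): an indented hyphen
# followed by a whitespace-only line before the quote mark.
sample = "tarkoituksenmukaisuus\n -\n \n«\nmuoto\nseuraa"
for line in show_whitespace(sample):
    print(line)
```

Feeding it the text captured from the command above should show at a glance which tokens carry a leading space.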

@TinoDidriksen
Member

There are a lot of quirks with tokenization. For Greenlandic we have a helper that wraps around hfst-tokenize and smooths out the quirks - e.g., https://github.com/giellalt/lang-kal/blob/main/tools/shellscripts/kal-tokenise.in#L472 for dashes. Might be a source of inspiration.
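
As a rough illustration of the wrapper idea (not a reproduction of what kal-tokenise actually does), a post-processing step could strip the stray whitespace around tokens and drop whitespace-only lines. The function name `clean_tokens` and the sample input are assumptions made for this sketch; the input mimics example (5) from the report.

```python
def clean_tokens(raw_lines):
    """Strip surrounding whitespace from each token line and drop empty ones."""
    cleaned = []
    for line in raw_lines:
        token = line.strip()  # internal spaces (e.g. "niin kuin") are kept
        if token:
            cleaned.append(token)
    return cleaned


# Assumed raw output lines in the shape of example (5).
raw = ["tarkoituksenmukaisuus", " -", "  ", "«", "muoto", "seuraa"]
print(clean_tokens(raw))
```

This only hides the symptom; it does not change which strings the tokeniser treats as tokens.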

@snomos
Member

snomos commented Mar 22, 2023

@flammie see @TinoDidriksen 's comment above re what we talked about earlier today to have a look at improving tokenisation and text analysis.

@flammie
Contributor

flammie commented Mar 24, 2023

Mm, I know there are a lot of corner cases with tokenisation, some language-specific, others more generic. I'll try to make a list here then:

  • word-final hyphen should be part of word-token
  • word-initial hyphen should be part of word-token
  • sole hyphen between spaces should be just that

I think I'm actually going to make a test suite; this regresses quite easily and goes unnoticed for a long time...
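
A test suite for the three rules listed above might be shaped roughly like this. It is a minimal sketch: the inputs are shortened, hypothetical variants of the examples in the report, `run_suite` is an invented helper, and `whitespace_tokenise` is a stand-in that only splits on whitespace. A real suite would plug in a function that calls hfst-tokenise instead.

```python
# Expected tokens for hypothetical minimal inputs, one per rule.
CASES = [
    ("keski- ja eteläosien", ["keski-", "ja", "eteläosien"]),          # word-final hyphen
    ("Keski-Ruotsin ja -Norjan", ["Keski-Ruotsin", "ja", "-Norjan"]),  # word-initial hyphen
    ("ottein - niin", ["ottein", "-", "niin"]),                        # sole hyphen between spaces
]


def whitespace_tokenise(text):
    """Stand-in tokeniser: split on whitespace only. NOT hfst-tokenise."""
    return text.split()


def run_suite(tokenise):
    """Return (input, expected, actual) for every case the tokeniser gets wrong."""
    failures = []
    for text, expected in CASES:
        actual = tokenise(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


print(run_suite(whitespace_tokenise))
```

Returning the failures rather than asserting immediately makes it easier to see all regressions in one run.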

@snomos
Member

snomos commented Mar 25, 2023

Looks good. One note of caution: a word-final or word-initial hyphen could also be an error (i.e. a missing space between hyphen and word), so I would suggest that these hyphens are part of the word token if and only if the full token can be analysed as such. If not, I would treat them as two separate tokens. If you don't, you will get one unknown token when you could have had two known tokens.
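
The rule can be sketched as follows. This is an illustration under assumptions, not how the FST-based tokeniser works internally: `split_edge_hyphen` is a hypothetical helper, and `is_known` stands in for "the analyser gives this string an analysis", here faked with a tiny set.

```python
def split_edge_hyphen(token, is_known):
    """Keep a hyphen-edged token whole iff is_known(token); else split off the hyphen."""
    if is_known(token) or len(token) < 2:
        return [token]
    if token.endswith("-"):
        return [token[:-1], "-"]
    if token.startswith("-"):
        return ["-", token[1:]]
    return [token]


# Toy lexicon standing in for the analyser (assumption for this sketch).
KNOWN = {"keski-", "-Norjan"}
is_known = KNOWN.__contains__

print(split_edge_hyphen("keski-", is_known))  # analysable: stays one token
print(split_edge_hyphen("kseki-", is_known))  # not analysable: hyphen is split off
```

With a real analyser behind `is_known`, the second case would yield a known hyphen plus whatever the analyser makes of the remaining word.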

@merisiga
Contributor

I am not sure Sjur's approach would work for Estonian. For example, we have the word "industriaalne" (industrial), which is truncated in a compound: "industriaalmaastik" (industrial landscape). "industriaal" alone is not a legitimate word. Now, when part of a co-ordinated noun phrase, it is truncated and written with a hyphen: "industriaal- ja linnamaastik" (industrial and city landscape). This means that "industriaal-" is legitimate only together with the final hyphen. This truncating and hyphenating convention is not exceptional.

@snomos
Member

snomos commented Mar 27, 2023

@merisiga it would work well as long as the word form industriaal- would get an analysis in itself, since in that case it would be recognised as a single token including the hyphen. The only case where it would be problematic would be if industriaal- had been misspelled, and thus not analysed.

@merisiga
Contributor

Ok, good. Now, what would you do with industriaal-- (two trailing hyphens)? This is a spelling error, but it consists of a legitimate token plus a hyphen. I am asking because this seems to be one of those corner cases, and I feel it would be easy to come up with an ad hoc solution which might induce some more ad hoc solutions down the pipeline. In short, I have an uneasy feeling, but cannot point to the exact reason for it...

@snomos
Member

snomos commented Mar 27, 2023

Depends on the task. For tokenisation I would probably just let it be. I am not sure how the tokens would be split; there are several possible outcomes depending on the FST and the tokenisation rules. The easiest solution would be to just add -- as an erroneous variant of -, for all cases, and then be done. That would return one token for industriaal--. In cases where -- is used as an em dash replacement (i.e. with spaces on both sides), it could have an em dash as a (correct?) alternative, and use context disambiguation to select the correct one.

In the case of a grammar checker, I would probably iterate over the errors - two hyphens are usually an error to be corrected to one (again, such error detection and correction can be context dependent), and once corrected, the rest of the sentence including the corrected industriaal- would no longer be an issue.

I.e. by treating -- in a separate step, I don't see this turning into a rabbit hole.
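
A separate step of that kind could look roughly like this. It is a sketch under assumptions: `normalise_double_hyphen` is a hypothetical name, and the em dash choice is hard-coded for the spaced case, whereas the comment above suggests leaving that choice to context disambiguation.

```python
import re

EM_DASH = "\u2014"


def normalise_double_hyphen(text):
    """Rewrite "--" before tokenisation; runs of three or more hyphens are left alone."""
    # "--" with whitespace on both sides: assume it stands in for an em dash.
    text = re.sub(r"(?<=\s)--(?=\s)", EM_DASH, text)
    # Any other exact "--": treat it as an erroneous variant of "-".
    return re.sub(r"(?<!-)--(?!-)", "-", text)


print(normalise_double_hyphen("industriaal-- ja linnamaastik"))
print(normalise_double_hyphen("ottein -- niin"))
```

After this step, industriaal- reaches the tokeniser in its correct form, so the single-token rule applies to it as usual.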

@flammie
Contributor

flammie commented Apr 4, 2023

The Finnish output is now like:

Ruotsin
keski-
ja
eteläosien
välille
yleisesti
Etelä-
ja
Keski-Suomen
alueella
Keski-Ruotsin
ja
-Norjan
asumattomille
metsäseuduille
tiukoin
ottein
-
niin kuin
(
n.
4200
- 2500
eaa.
)
tarkoituksenmukaisuus
-
«
muoto
seuraa

The fixes so far only touched the lexc files: deleting the space and space-plus-hyphen entries from the analysers and adding the expected hyphens to words. In most cases the tokenisation follows from the analyser.
