Mismatch between available comps in Readme and in rds files #374

Open
BenoitLondon opened this issue May 3, 2024 · 2 comments

Comments

@BenoitLondon

BenoitLondon commented May 3, 2024

Competition list is outdated in the README for load_match_comp_results

Some competitions have changed names in the rds files,
e.g. English Football League Cup is now EFL Cup.
The same applies to Copa America
and the UEFA Euro comps.

More generally, it would be nice to have a function listing all the competition/country/league IDs that are available for each load function, so we could load the data programmatically.

Thanks for this great package!

@BenoitLondon BenoitLondon changed the title Mismatch between available comps in REadme and in rds files Mismatch between available comps in Readme and in rds files May 3, 2024
@tonyelhabr
Collaborator

RE: load functions. I like the idea of having a function to list possible cups, comps, and years. We'll have to think of the right way to automate it. Right now, it wouldn't be hard to write a function to list available country+gender+tier for a given data set.

DATA_REPO <- 'JaseZiv/worldfootballR_data'
get_possible_stashed_data <- function(tag, include_years = FALSE) {
  raw <- piggyback::pb_list(DATA_REPO, tag = tag)
  
  grid <- raw |> 
    tibble::as_tibble() |> 
    dplyr::filter(
      tools::file_ext(file_name) == 'rds'
    ) |> 
    dplyr::select(file_name) |> 
    tidyr::separate_wider_regex(
      file_name,
      c(country = '^[A-Z]+', '_', gender = '[MF]', '_', tier = '1st|2nd', '_', extra = '.*$'),
      cols_remove = FALSE
    ) |> 
    dplyr::select(
      file_name,
      country,
      gender,
      tier
    )
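  
  if (isFALSE(include_years)) {
    return(grid |> dplyr::select(-file_name))
  }
  
  ## Listing years is not implemented: it would require reading in each file.
  ## Fail loudly instead of silently returning NULL.
  stop('`include_years = TRUE` is not supported yet.', call. = FALSE)
}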

possible_data <- get_possible_stashed_data(
  tag = 'fb_match_summary'
)
possible_data
#> # A tibble: 13 × 3
#>    country gender tier 
#>    <chr>   <chr>  <chr>
#>  1 BRA     M      1st  
#>  2 ENG     F      1st  
#>  3 ENG     M      1st  
#>  4 ENG     M      2nd  
#>  5 ESP     M      1st  
#>  6 FRA     M      1st  
#>  7 GER     M      1st  
#>  8 ITA     M      1st  
#>  9 MEX     M      1st  
#> 10 NED     M      1st  
#> 11 POR     M      1st  
#> 12 USA     F      1st  
#> 13 USA     M      1st 

It becomes more involved if you want to list seasons as well, since, as of now, we don't store that in a CSV anywhere, nor in the names of the stashed data files (which do encode country, gender, and tier, which is why those are easy to extract). As things stand now, you'd have to read in each data file, then extract the unique seasons. The data files can be slow to load, so this is not ideal.

I'd have to think of a robust solution to this.
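One possible direction, sketched here under stated assumptions rather than as the package's approach: read each file once, record its unique seasons in a small index, and store that index (e.g. as a CSV) so that later listings never need to load the data files. The helper `build_season_index()`, the `read_fn` argument, the default column name `Season_End_Year`, and the toy file names are all hypothetical; none of them come from worldfootballR.

```r
# Hypothetical helper, not part of worldfootballR.
# `read_fn` is any function that takes a file name and returns a data frame;
# `season_col` is an assumed column name and may differ per data set.
build_season_index <- function(file_names, read_fn, season_col = "Season_End_Year") {
  rows <- lapply(file_names, function(f) {
    dat <- read_fn(f)
    if (!season_col %in% names(dat)) {
      stop("Column '", season_col, "' not found in ", f, call. = FALSE)
    }
    seasons <- sort(unique(dat[[season_col]]))
    # One row per (file, season) pair
    data.frame(
      file_name = rep(f, length(seasons)),
      season = seasons,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Toy in-memory stand-ins for the stashed files (made-up names and values)
toy_files <- list(
  "ENG_M_1st_toy.rds" = data.frame(Season_End_Year = c(2022, 2022, 2023)),
  "USA_F_1st_toy.rds" = data.frame(Season_End_Year = c(2023, 2023))
)

season_index <- build_season_index(
  names(toy_files),
  read_fn = function(f) toy_files[[f]]
)
season_index
```

In practice `read_fn` would be whatever downloads and reads one release file. The slow step would then run only when the index is rebuilt, not on every call that lists what is available.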

@tonyelhabr
Collaborator

RE: mismatched names. Yes, I've seen this kind of thing with MLS team names, where they changed the name of a team at some point (e.g. 'Sporting Kansas City' -> 'Sporting KC'), either in the middle of a season or between seasons.

I'm not sure what the best general solution is for ensuring name consistency over time. Perhaps we could re-scrape data about a year after it occurred, assuming that names are no longer being changed at that point. Obviously this would take a lot of time. Perhaps there are shortcuts for checking self-consistency.
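As a rough illustration of what such a self-consistency shortcut might look like (a sketch only, with a hypothetical helper name and toy data): group rows by a key that should identify one competition or team, and flag any key that maps to more than one distinct name. The choice of key columns is an assumption and would need to suit the actual data.

```r
# Hypothetical helper, not part of worldfootballR.
# Flags keys (built from `key_cols`) that carry more than one distinct
# value in `name_col`, which suggests a rename somewhere in the data.
find_name_changes <- function(dat, key_cols, name_col) {
  key <- do.call(paste, c(unname(as.list(dat[key_cols])), sep = "_"))
  names_by_key <- lapply(
    split(dat[[name_col]], key),
    function(x) sort(unique(x))
  )
  changed <- names_by_key[lengths(names_by_key) > 1]
  # Collapse the competing names into one string per key
  vapply(changed, paste, character(1), collapse = " | ")
}

# Toy data (made up, apart from the EFL Cup rename mentioned in this issue)
toy <- data.frame(
  country = c("ENG", "ENG", "ESP", "ESP"),
  gender = c("M", "M", "M", "M"),
  competition = c(
    "English Football League Cup", "EFL Cup",
    "Toy League", "Toy League"
  ),
  stringsAsFactors = FALSE
)

changes <- find_name_changes(toy, key_cols = c("country", "gender"), name_col = "competition")
changes
```

A check like this only detects that a name changed; deciding which name is canonical would still need a manual mapping or a re-scrape.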
