
is fast_excel thread safe? #27

Open

Mikopet opened this issue Nov 27, 2018 · 4 comments

@Mikopet commented Nov 27, 2018

Hey!

We have some data that we want to query from the DB and export to Excel.

workbook = FastExcel.open(tmp_file_path, constant_memory: true)
sheet = workbook.add_worksheet('name')
sheet.write_row(0, column_names, workbook.bold_cell_format)

The problem is that I want to write it in parallel.

limit = 50_000
count = 523_523 # whatever

0.step(count, limit).each do |offset|
  fork do
    data = db_query(offset, limit)

    data.each.with_index do |row, idx|
      sheet.write_row(
        offset + idx + 1,
        row
      )
    end
  end
end

Process.waitall
workbook.close

But the result is weird: sometimes I get half of the data, sometimes a bit more (but with no empty rows, which is super weird!).
So the question is: is fast_excel thread-unsafe, or did I miss something?

@Paxa (Owner) commented Nov 28, 2018

I believe fast_excel is thread safe for atomic writes (writing one cell), but write_row and append_row are not, because they iterate over an array and update state.

fork will create a new process, and file descriptors cannot be shared between processes; I would generally avoid forking...
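For illustration, a sketch of one pattern that avoids both problems (this is not code from the thread; db_query is the hypothetical helper from the question above): run the queries in threads instead of forked processes, but funnel every write_row call through a single writer thread, so the worksheet is only ever mutated from one place.

require "fast_excel"

workbook = FastExcel.open("export.xlsx", constant_memory: true)
sheet = workbook.add_worksheet("data")

limit = 50_000
count = 523_523
queue = Queue.new  # thread-safe FIFO from Ruby's stdlib

# Producer threads: only query the DB, never touch the worksheet.
producers = 0.step(count, limit).map do |offset|
  Thread.new do
    db_query(offset, limit).each_with_index do |row, idx|
      queue << [offset + idx + 1, row]
    end
  end
end

# Single writer thread: the only place write_row is ever called.
writer = Thread.new do
  while (pair = queue.pop)   # pop returns nil once the queue is closed and drained
    sheet.write_row(pair[0], pair[1])
  end
end

producers.each(&:join)
queue.close
writer.join
workbook.close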

According to my benchmarks, fast_excel is about as fast as the standard CSV library, so you are probably hitting slow ActiveRecord allocations; those can also consume a lot of memory and push the process into swap. I have another project, light_record, that optimizes AR allocations and supports streaming for MySQL (streaming allocates AR objects on demand); with that I can generate about 4000 rows per second with ~20 columns (text, dates, numbers).

How many rows are you trying to write?

@Mikopet (Author) commented Nov 28, 2018

light_record would be great, but I am querying a relatively heavy DB view, and there is no model (yet).

The view is about ~750k rows, and I have bigger data to export too.
Time is not as important as memory consumption.

@Paxa (Owner) commented Nov 28, 2018

Querying with offset/limit may be slow, depending on the query inside the view; usually every subsequent request takes longer. Splitting the query by primary key range will be faster (like id > 1000 && id < 2000).
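For example, a sketch of what that could look like (not from this thread; conn, big_view, and the numeric id column are placeholders):

# Batch by primary-key range instead of OFFSET/LIMIT.
batch  = 50_000
max_id = conn.query("SELECT MAX(id) AS max_id FROM big_view").first["max_id"]

(0...max_id).step(batch) do |from_id|
  rows = conn.query(
    "SELECT * FROM big_view WHERE id > #{from_id} AND id <= #{from_id + batch}"
  )
  rows.each do |row|
    # process / write each batch before fetching the next
  end
end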

I would suggest using the database client directly (you can also get it from ActiveRecord::Base.connection_pool.checkout): run one query to get all rows with streaming enabled, then do type casting where necessary and write to Excel.

It will then fetch records and write them to Excel one by one, which should be very memory efficient.

Mysql: https://github.com/brianmario/mysql2#streaming
Postgres: https://www.rubydoc.info/github/ged/ruby-pg/PG%2FConnection:set_single_row_mode
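Putting that together for MySQL, a minimal sketch (host, credentials, and the view name are placeholders; stream: true and cache_rows: false are the mysql2 options documented at the link above):

require "mysql2"
require "fast_excel"

client = Mysql2::Client.new(
  host: "localhost", username: "user", database: "app"  # placeholders
)

workbook = FastExcel.open("export.xlsx", constant_memory: true)
sheet = workbook.add_worksheet("data")

# stream: true fetches rows one at a time instead of buffering the whole
# result set; cache_rows: false tells mysql2 not to keep them around.
rows = client.query(
  "SELECT * FROM big_view",
  stream: true, cache_rows: false, as: :array
)

rows.each_with_index do |row, idx|
  sheet.write_row(idx, row)  # rows go to disk one by one as they stream in
end

workbook.close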

@Mikopet
Copy link
Author

Mikopet commented Nov 30, 2018

It's a bit better now, thanks for the idea.
