I'd consider myself a 'C++ programmer' - I've used it for years, it works very well for me for what I do with it, etc. However, what I find most frustrating is that nothing is intuitive. And there is always a reason for why it is like that, I know. My favorite example is the erase/remove idiom.
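For anyone who has not run into it, the erase/remove idiom looks roughly like this. It is a minimal sketch; `EraseAll` is a hypothetical helper name used only for illustration.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: removes every element equal to `value` from `v`.
void EraseAll(std::vector<int>& v, int value)
{
  // std::remove only shifts the kept elements to the front and returns the
  // new logical end; the vector's size does not change until erase() runs.
  v.erase(std::remove(v.begin(), v.end(), value), v.end());
}
```

The unintuitive part is that `std::remove` on its own removes nothing from the container, because algorithms work on iterator ranges and cannot resize the vector.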

Anyway, here is a question about C++. Last week I tried to implement a parser for a subset of CSV - one-byte-per-character, fields separated by comma, no quoting, fixed line ending. The only requirement was speed. So I started with a simple C++-style implementation, read one line at a time with std::getline, split with boost tokenizer, copy into vector of vector of string. But it was too slow, so I reimplemented it C-style - boom, first try, 30 times faster. Through some micro-optimisations I got it 30% faster still - copying less, adding some const left and right, caching a pointer. So if anyone is so inclined, how would I make the following more C++-ish and still get the same speed? Using fopen and raw char* to iterate over the memory buffer are what I'd consider the non-C++ aspects of it, but feel free to point out other idiom violations...

    bool CsvParser::OpenCStyle(boost::filesystem::path path)
    {
      FILE* fh = NULL;
      ::fopen_s(&fh, path.string().c_str(), "rb");
      if (fh == NULL) {
        return false;
      }

      const int BUFSIZE = 1024 * 256;
      char buf[BUFSIZE];

      m_Records.reserve(m_EstimatedRecordCount);
      bool at_end_of_file = false;
      std::string carryover_field_data;

      std::vector<std::string>* current_line = NULL;
      bool at_new_line = true;

      const char separator = m_Separator;

      size_t size_read = 0;
      // Loop on fread's return value: while (!feof(fh)) never ends if a read error occurs.
      while ((size_read = ::fread(buf, sizeof(char), BUFSIZE, fh)) > 0) {
        char* start_of_field = buf;
        size_t i = 0;
        for (i = 0 ; i < BUFSIZE && i < size_read ; i++) {
          if (at_new_line) {
            m_Records.push_back(std::vector<std::string>());
            current_line = &m_Records.back();
            at_new_line = false;
          }
          if (buf[i] == separator || (at_new_line = (buf[i] == '\n'))) {
            // Add the field exactly once. If part of it was carried over from
            // the previous buffer, start from that part and append the rest.
            current_line->push_back(std::string());
            if (!carryover_field_data.empty()) {
              current_line->back().swap(carryover_field_data);
            }
            current_line->back().append(start_of_field, &buf[i] - start_of_field);
            start_of_field = &buf[i + 1];
          }
        }
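        // Append rather than assign, so a field spanning more than two buffers is kept.
        carryover_field_data.append(start_of_field, &buf[i] - start_of_field);
      }

      // Flush a final field that is not terminated by a newline.
      if (!carryover_field_data.empty() && current_line != NULL) {
        current_line->push_back(carryover_field_data);
      }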

      fclose(fh);

      return true;
    }
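One possible direction for a more C++-style version, sketched under assumptions: it keeps the same chunked single-pass scan but reads through `std::istream::read` into a heap-allocated `std::vector<char>` instead of `fopen` and a stack array. `ParseCsv`, `Records`, and the `chunk_size` parameter are hypothetical names, not part of the original class, and the sketch assumes C++11. Whether it matches the C-style speed has not been measured here; it would need profiling on the real data.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

typedef std::vector<std::vector<std::string> > Records;

// Hypothetical free function: parses separator-delimited text with
// '\n' line endings and no quoting, reading the stream in fixed-size chunks.
Records ParseCsv(std::istream& in, char separator,
                 std::size_t chunk_size = 256 * 1024)
{
  Records records;
  std::vector<char> buf(chunk_size);  // heap buffer, avoids a large stack array
  std::string carryover;              // partial field left from the previous chunk
  bool at_new_line = true;

  for (;;) {
    in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
    const std::size_t n = static_cast<std::size_t>(in.gcount());
    if (n == 0) break;  // nothing more to read (end of stream or error)

    const char* field_start = buf.data();
    const char* const end = buf.data() + n;
    for (const char* p = buf.data(); p != end; ++p) {
      if (at_new_line) {
        records.emplace_back();
        at_new_line = false;
      }
      if (*p == separator || *p == '\n') {
        records.back().emplace_back();
        std::string& field = records.back().back();
        field.swap(carryover);         // start from any carried-over prefix
        field.append(field_start, p);  // then add the part in this chunk
        field_start = p + 1;
        at_new_line = (*p == '\n');
      }
    }
    // Keep the unfinished field for the next chunk.
    carryover.append(field_start, end);
  }

  // A last field without a trailing newline still belongs to the last record.
  if (!carryover.empty()) {
    records.back().push_back(carryover);
  }
  return records;
}
```

With a file this would be called as `std::ifstream in(path.string().c_str(), std::ios::binary); Records r = ParseCsv(in, ',');`. Taking a `std::istream&` rather than a path also makes the parser testable against a `std::istringstream`.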


As for your getline being slow: that is a known problem.

http://stackoverflow.com/questions/9025093/stdcin-really-slo...