C++ String Interview Question

I was recently in a C++ technical interview, where I was given a bit of simple string manipulation code, which is intended to take a string and return a string that is comprised of the first and last n-characters, and then proceed to correct any bugs and to also make the function as efficient as possible, I came up with the solution below, however the interviewer claimed there was an even faster more optimal way:

Original code:

std::string first_last_n(int n, std::string s)
{
   std::string first_n = s.substr(0,n);
   std::string last_n = s.substr(s.size()-n-1,n);
   return first_n + last_n;
}

My code:

bool first_last_n(const std::size_t& n, const std::string& s, std::string& r)
{
   if (s.size() < n)
      return false;
   r.reserve(2 * n);
   r.resize(0);
   r.append(s.data(),s.data() + n);
   r.append(s.data() + (s.size() - n), s.data() + s.size());
   return true;
}

Summary of my changes:

Changed the interface to take a return string as reference (assuming RVO and rvalues are not available yet)
Removed temporary strings being constructed via substr
Passed input string as a const reference inorder to get around temporary instantiation of input
Fixed off-by-1 error in last_n string
Reduced the number of times each character is touched down to once, or twice (in the event of an overlapping scenario)
Placed a check in the event the string s's size is less than n, returning false for failure.

Assuming only native C++ is allowed, is there some other way of doing the above more efficiently or optimally?

Note 1: The original input string instance is not to be modified.

Note 2: All solutions must pass the following test case, otherwise they are not valid.

void test()
{
   {
      std::string s = "0123456789";
      std::string r = first_last_n(10,s);
      assert(r == "01234567890123456789");
   }

   {
      std::string s = "0123456789ABC0123456789";
      std::string r = first_last_n(10,s);
      assert(r == "01234567890123456789");
   }

   {
      std::string s = "1234321";
      std::string r = first_last_n(5,s);
      assert(r == "1234334321");
   }

}

This implementation should be fast:

inline std::string first_last_n(std::string::size_type n, const std::string& s)
{
    n = std::min(n, s.size());
    std::string ret;
    ret.reserve(2*n);
    ret.append(s.begin(), s.begin() + n);
    ret.append(s.end() - n, s.end());
    return ret;
}

It passes all three unit tests.

When using GNU libstdc++, the line that declares & initializes ret is extremely fast because libstdc++ uses a global "empty string" variable. Thus, it's simply a pointer copy. Calls to begin and end on s are also fast because they will resolve to the const versions of begin and end, begin() const and end() const, so the internal representation of s is not "leaked". With libstdc++, std::string::const_iterator is const char*, which is a pointer type and random access iterator. Thus, when std::string::append<const char*>(const char*, const char*) calls std::distance to obtain the length of the input range, it is a pointer difference operation. Also, std::string::append<const char*>(const char*, const char*) results in something like a memmove. Finally, the reserve operation ensures that enough memory is available for the return value.

EDIT: For the curious, here is the initialization of ret in the assembly output of MinGW g++ 4.5.0:

    movl    $__ZNSs4_Rep20_S_empty_rep_storageE+12, (%ebx)

It's simply copying the pointer to the global "empty representation".

EDIT2: Okay. I have now tested four variants with g++ 4.5.0 and Visual C++ 16.00.30319.01:

Variant 1 (the "c_str variant"):

inline std::string first_last_n(std::string::size_type n, const std::string& s)
{
   std::string::size_type s_size = s.size();
   n = std::min(n, s_size);
   std::string ret;
   ret.reserve(2*n);
   const char *s_cStr = s.c_str(), *s_cStr_end = s_cStr + s_size;
   ret.append(s_cStr, s_cStr + n);
   ret.append(s_cStr_end - n, s_cStr_end);
   return ret;
}

Variant 2 (the "data string" variant):

inline std::string first_last_n(std::string::size_type n, const std::string& s)
{
   std::string::size_type s_size = s.size();
   n = std::min(n, s_size);
   std::string ret;
   ret.reserve(2*n);
   const char *s_data = s.data(), *s_data_end = s_data + s_size;
   ret.append(s_data, s_data + n);
   ret.append(s_data_end - n, s_data_end);
   return ret;
}

Variant 3:

inline std::string first_last_n(std::string::size_type n, const std::string& s)
{
   std::string::size_type s_size = s.size();
   n = std::min(n, s_size);
   std::string ret(s);
   std::string::size_type d = s_size - n;
   return ret.replace(n, d, s, d, n);
}

Variant 4 (my original code):

inline std::string first_last_n(std::string::size_type n, const std::string& s)
{
   n = std::min(n, s.size());
   std::string ret;
   ret.reserve(2*n);
   ret.append(s.begin(), s.begin() + n);
   ret.append(s.end() - n, s.end());
   return ret;
}

The results for g++ 4.5.0 are:

Variant 4 is the fastest
Variant 3 is second (5% slower than variant 4)
Variant 1 is third (2% slower than variant 3)
Variant 2 is fourth (0.2% slower than variant 1)

The results for VC++ 16.00.30319.01 are:

Variant 1 is the fastest
Variant 2 is second (3% slower than variant 1)
Variant 4 is third (4% slower than variant 2)
Variant 3 is fourth (17% slower than variant 4)

Unsurprisingly, the variant that is fastest depends on the compiler. However, not knowing which compiler will be used I think that my variant is best because it is a familiar style of C++, it is the fastest when using g++, and it is not that much slower than variants 1 or 2 when using VC++.

One thing interesting from the VC++ results is that using c_str rather than data is faster. Perhaps that is why your interviewer said that there is a faster way than your implementation.

EDIT3:

Actually, I just thought about another variant:

Variant 5:

inline std::string first_last_n(std::string::size_type n, const std::string& s)
{
   n = std::min(n, s.size());
   std::string ret;
   ret.reserve(2*n);
   std::string::const_iterator s_begin = s.begin(), s_end = s.end();
   ret.append(s_begin, s_begin + n);
   ret.append(s_end - n, s_end);
   return ret;
}

It's just like variant 4 except that the begin and end iterators for s are saved.

When variant 5 is tested, it actually beats out variant 2 (the data string variant) when using VC++:

Variant 1 is the fastest
Variant 5 is second (1.6% slower than variant 1)
Variant 2 is third (1.4% slower than variant 5)
Variant 4 is third (4% slower than variant 2)
Variant 3 is fourth (17% slower than variant 4)

If you don't need to maintain the contents of the original string, then you can copy the last n characters into positions [n+1, 2n] of the original string and truncate it at 2n. You will have to be careful to first expand the string and also be careful not to overwrite any characters before writing to them if the string is shorter than 2n.

This will halve the number of operations to construct the string, as well as remove the need to create a new string. So its theoretically between 2 and 4 times faster. But of course you have just destroyed the original string, which you'd have to ask the interviewer if it is acceptable.

How about removing the middle N-2n characters, where N is the length of the source string?

// compiled with cl /Ox first_last_n.cpp /W4 /EHsc

inline void
first_last_n2(string::size_type n, const std::string &s, string &out)  // method 2
{
  // check against degenerate input
  assert(n > 0);
  assert(n <= s.size());

  out.reserve(2*n);
  out.assign(s, 0, n);
  out.append(s, s.size()-n, n);
}

Times:

method 1:  // original method
2.281
method 2:  // my method
0.687
method 3:  // your code.
0.782

Note: Timing specifically tests "long" strings. I.e. those where short string optimization is not used. (My strings were 100 length).

My only thought is that if this function is only called with C null-terminated strings, you might be requiring the extra construction of a std::string for parameter 's'.

Possibly the 'more' efficient method would be to allow either a std::string or const char *s passed in.

Memcpy is a cheat?

#include <cstring>
#include <iostream>
#include <string>

std::string first_last_n(int n, const std::string& s)
{
  if (s.size() < n)
      return "";

    char str[n*2];
    memcpy(str, s.data(), n);
    memcpy(str+n, s.data() + s.size()-n, n);

    return (const char *)str;
}

int main()
{
    std::cout << first_last_n(2, "123454321") << std::endl;
}

EDIT So I removed the other one. This is not a cheat.

If you must pass the tests then you're going to have to write inefficient code, because you must return a copy of a string. This implies you must use dynamic allocation, possibly multiple times because of the copy.

So change the tests and change the signature.

template<class Out>
Out first_last_n(const std::string::size_type& n, const std::string& s, Out r)
{
    r = copy_n(s.begin(), n, r);
    std::string::const_iterator pos(s.end());
    std::advance(pos, -n);
    return copy_n(pos, n, r);
}

Then call it like so:

std::string s("Hello world!");
char r[5];
r[4] = 0;
first_last_n(2, s, r);

This allows you to use dynamic programming, and it eliminates the need of dynamic allocation in the function.

I like my algorithms minimalistic, and I purposely eliminated the check for n being smaller or equal to the size of the string. I replace the check with a pre-condition for the function. Preconditions are faster than checks: they have zero overhead.

来源：https://stackoverflow.com/questions/4494884/c-string-interview-question

标签

c++

optimization

string

processing-efficiency