Best solutions for Microsoft interview tasks. Pattern Recognition.

Best solutions for Microsoft interview tasks. Pattern Recognition.

Description:

Solution:

It looks like this task is not about algorithms, this is just about writing a simple parser which breaking the given string to parts and writing a function which finds and counts all substrings in the given string.

The only interesting subject which can be discussed in this task is implementation of algorithm for finding substrings. But the task doesn’t require to implement this algorithm, the strings are short, we have no special requirements for search of substrings, so we can use implementation from the standard library. The only thing we should keep in mind is that complexity of this implementation may be very different. Simple algorithm like this one has average complexity about 2*N where N is length of the string and worse complexity O(N*P) where P is length of the pattern, and depends on implementation of std::string.find() method.

int count_substrings(const std::string & s, const std::string& sub, bool overlapped = true) {
    size_t sub_size = sub.size();
    if ( (sub_size == 0) || (sub_size > s.size()) ) return 0;
    int count = 0;
    for (size_t offset = s.find(sub); 
       offset != string::npos;
       offset = s.find(sub, overlapped ? offset + 1 : offset + sub.size()))

    {
        ++count;
    }
    return count;

}
 

For a string like “aaa...(1.000.000 more ‘a’)...aaaB” and overlapped substring “aaa...(999.000 more ‘a’)...aaaB” it will works VERY slow because we will need to compare 1.000.000 * 999.000 characters.

There are many much faster algorithms even with complexity O(N) but all of them are applicable in special cases or has additional requirements. This is the cause why there are so many algorithms were invented and people don’t use only one the best of all. Below I will show a couple of the most famous and general algorithms which are applicable in most of cases and don’t have too many additional requirements.

The most famous is the Knuth–Morris–Pratt algorithm

int KMP_count(const string &s, const string &sub) {
    int num_subs = 0;
    size_t sub_size = sub.size();
    size_t s_size = s.size();
    if ((sub_size == 0) || (sub_size > s.size())) { return 0; }
    vector<int> prefixes(sub_size);
    for (int k = 0, i = 1; i < sub_size; ++i) {
        while ((k > 0) && (sub[i] != sub[k])) {
            k = prefixes[k - 1];
        }
        if (sub[i] == sub[k]) { k++; }
        prefixes[i] = k;
    }

    for (int k = 0, i = 0; i < s_size; ++i) {
        while ((k > 0) && (sub[k] != s[i])) {
            k = prefixes[k - 1];
        }
        if (sub[k] == s[i]) { k++; }

        if (k == sub_size) {
            // to get position of substrings in the string
            // (i - sub_size + 1); 
            ++num_subs;
        }
    }
    return num_subs;
}

There is a very fast algorithm with linear complexity which works similar to the KMP algorithm but looks more understandable IMHO. It is so called Z-function. Like the previous algorithm it requires to allocate an additional array of integers with size S where S is length of string N + length of substring P. Then we save result of calculations of Z-function in the array, pass through this array and recognize positions of searched substrings. Implementation of this algorithm is here:

vector<int> z_function (const string & s) {
    int n = (int) s.size();
    vector<int> z (n);
    for (int i=1, l=0, r=0; i<n; ++i) {
        if (i <= r)
            z[i] = min (r-i+1, z[i-l]);
        while (i+z[i] < n && s[z[i]] == s[i+z[i]])
            ++z[i];
        if (i+z[i]-1 > r)
            l = i,  r = i+z[i]-1;
    }
    return z;
}

int z_count(const std::string & s, const std::string& sub) {
    size_t sub_size = sub.size();
    if ( (sub_size == 0) || (sub_size > s.size()) ) return 0;

    string working_string(sub + s);
    size_t ws_size = working_string.size();

    auto z = z_function(working_string);

    int number_subs = 0;

    for(int i = sub_size; i < ws_size; ++i) {
        if (z[i] >= sub_size) {
// to get exact position of substrings in the string
//            cout << i - sub_size << endl; 
            ++number_subs;
        }
    }
    return number_subs;
}

And finally we should implement the parser which will break the input string to the pattern and the blobs where we will search the patterns.

Unfortunately standard C++ still have no so good tokenizer as strtok from plain C so I will use it. Even if it looks not so safety as pure C++ solution but it allows to tokenize a string in one like instead of tens of lines in pure C++.

string parser(const string & s) {
    string output;
    int total_patterns = 0;

    size_t pos = s.find_first_of(';');

    string pattern = s.substr(0, pos);
    string blobs = s.substr(pos+1);

    for(char *token = strtok((char*)blobs.c_str(), "|"); 
        token != NULL; token = strtok(NULL, "|")) 
    {
        string stoken(token);
//        int num_patterns = count_substrings(stoken, pattern);
//        int num_patterns = z_count(stoken, pattern);
        int num_patterns = KMP_count(stoken, pattern);
        total_patterns += num_patterns;
        output.append(to_string(num_patterns));
        output.append("|");
    }
    output.append(to_string(total_patterns));
    return output;
}

Repository with the full project you can find here: https://github.com/jolly-fellow/microsoft/tree/master/pattern_recognition

Return to the table of contents.


要查看或添加评论,请登录

Alexander Molchevskyi的更多文章

社区洞察

其他会员也浏览了