Lately, I was working on a project that concentrates heavily on the server side, although server-side work isn’t really my expertise. The lucky part is that this project is mostly about processing HTTP headers and returning proper JSON as the response.
After most of my challenges were completed, there was one last problem for me to solve:
“How do I get only the domain name and extension name out of a host name?”
For instance, you’ve got some host names as input:
www.google.com
www.google.co.jp
mail.google.com.hk
The expected output should be:
google.com
google.co.jp
google.com.hk
After some research, I came to know that there’s a builtin function gethostbyname, declared in the standard C header netdb.h, which can be called directly from my program. The function is well explained in the GNU C Library manual: https://www.gnu.org/software/libc/manual/html_node/Host-Names.html.
Just take a look at it: the function fills in an easy-to-use data structure, struct hostent, with the host’s official name, aliases, and addresses, which looked like exactly the answer I was after.
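For reference, here is a minimal sketch of what calling gethostbyname looks like. The helper name first_ipv4_of is hypothetical and only there for illustration. Note that the struct hostent it returns describes the host itself (official name, aliases, addresses), so it is worth checking against the manual whether that really covers the domain-extraction problem.

```c
#include <arpa/inet.h>
#include <netdb.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical helper: resolve `host` and write its first IPv4 address,
 * in dotted notation, into `out`. Returns 0 on success, -1 on failure. */
int first_ipv4_of(const char *host, char *out, size_t outlen)
{
    /* The result points to static storage, so this is not thread-safe. */
    struct hostent *he = gethostbyname(host);
    if (he == NULL || he->h_addrtype != AF_INET || he->h_addr_list[0] == NULL)
    {
        return -1;
    }
    /* he->h_name is the official host name; h_addr_list holds raw addresses. */
    if (inet_ntop(AF_INET, he->h_addr_list[0], out, (socklen_t)outlen) == NULL)
    {
        return -1;
    }
    return 0;
}
```

Calling it with a real host name needs a working resolver, so the result depends on the environment it runs in.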
Actually, before I looked into this builtin function, I had come up with another solution of my own. Here is what I thought.
# My own approach
Just by looking at the input and output, it’s obviously an easy string-processing problem. The first idea I came up with was to cut off the first group of characters (everything up to the first dot) and return the rest.
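That first idea can be sketched in a few lines. The function name strip_first_label is hypothetical and only used for illustration.

```c
#include <string.h>

/* Hypothetical helper for the first idea: skip everything up to and
 * including the first '.', and return a pointer to the rest of `host`.
 * If there is no '.', the host is returned unchanged. */
const char *strip_first_label(const char *host)
{
    const char *dot = strchr(host, '.');
    return dot != NULL ? dot + 1 : host;
}
```

For www.google.com this returns a pointer to google.com inside the original string, so nothing is allocated or copied.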
After running a couple of tests, my idea turned out to be naive, because scenarios like multiple sub-domains weren’t taken into consideration.
For instance, a host name like this one:
one.two.abc.com
Expected result:
abc.com
If I only cut the first group, the result is:
two.abc.com
It’s still far from the correct answer. Then I realized that such an “obviously easy” problem shouldn’t be solved by brute force. There must be a certain pattern in it.
My reasoning told me to list all of the elements of a hostname and then look for aspects worth brainstorming on later.
Alright. What does a hostname consist of?
Domain name + Delimiter + Extension name
For google.com:
- google as domain name
- . as delimiter
- com as extension name
Okay. I think I’ve found something here…
There’s actually a simple formula relating the number of delimiters to the number of sectors:
delimiters = sectors - 1
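As a quick sanity check of that formula, here is a small sketch that counts both. The function names are hypothetical, and the sector count assumes a well-formed host name (non-empty, with no leading, trailing, or doubled dots).

```c
#include <stddef.h>

/* Count the '.' delimiters in a host name. */
size_t count_delimiters(const char *host)
{
    size_t n = 0;
    for (; *host != '\0'; ++host)
    {
        if (*host == '.')
        {
            ++n;
        }
    }
    return n;
}

/* Sectors = delimiters + 1. Assumes a well-formed host name. */
size_t count_sectors(const char *host)
{
    return count_delimiters(host) + 1;
}
```

For www.google.com.hk this gives 3 delimiters and 4 sectors.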
Usually the extension is either 1 sector (.com) or 2 sectors (.co.jp). So if I could enumerate all of the 2-sector extensions, the domain name would be an easy catch.
Why?
As far as I could tell, there aren’t that many kinds of 2-sector extensions, and most of them start with either “co” or “com”, for instance co.jp and com.hk.
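One way to picture this enumeration is a small lookup table. The sketch below is illustrative only: the table name and function name are hypothetical, and the table holds just the two prefixes mentioned here, so 2-sector extensions such as org.uk or ac.jp would not be recognised unless more entries are added.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative table: only the two prefixes named in the text. */
static const char *const TWO_SECTOR_PREFIXES[] = { "co", "com" };

/* Returns 1 if `label` is a known first half of a 2-sector extension,
 * 0 otherwise. */
int is_two_sector_prefix(const char *label)
{
    size_t n = sizeof(TWO_SECTOR_PREFIXES) / sizeof(TWO_SECTOR_PREFIXES[0]);
    for (size_t i = 0; i < n; ++i)
    {
        if (strcmp(label, TWO_SECTOR_PREFIXES[i]) == 0)
        {
            return 1;
        }
    }
    return 0;
}
```

Keeping the prefixes in a table means that covering another extension is a one-line change rather than a new branch in the code.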
So all in all, the solution does the following things:
- Split the hostname by the delimiter “.”
  e.g. www.google.com.hk -> [“www”, “google”, “com”, “hk”]
- Check if the second to last element is “co” or “com”
  - if yes, then take the third to last element as the domain name.
  - if no, then take the second to last element as the domain name.
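The steps above can be sketched compactly. This is a variation on the steps rather than a literal transcription: instead of splitting into an array, it scans the string backwards from the end and counts dots. The names last_labels and domain_from_host are hypothetical, and only the “co” and “com” prefixes are handled.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return a pointer to the start of the last `n`
 * dot-separated labels of `host`, or `host` itself if it has fewer. */
static const char *last_labels(const char *host, int n)
{
    const char *p = host + strlen(host);
    while (p > host)
    {
        --p;
        if (*p == '.' && --n == 0)
        {
            return p + 1;
        }
    }
    return host;
}

/* Writes the domain plus extension of `host` into `out`.
 * Returns 0 on success, -1 if `out` is too small. */
int domain_from_host(const char *host, char *out, size_t outlen)
{
    /* Last two labels, e.g. "com.hk" for "www.google.com.hk". */
    const char *two = last_labels(host, 2);
    /* Is the second to last label "co" or "com"? */
    int two_sector = strncmp(two, "co.", 3) == 0 || strncmp(two, "com.", 4) == 0;
    /* If yes keep three labels, otherwise keep two. */
    const char *start = two_sector ? last_labels(host, 3) : two;
    int written = snprintf(out, outlen, "%s", start);
    return (written >= 0 && (size_t)written < outlen) ? 0 : -1;
}
```

Because the scan starts from the end, the number of sub-domains in front of the domain name does not matter.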
# Implementation
Here is my implementation in C:
    #include <regex.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    // Pattern not shown in the original snippet; assumed to match "co" or "com"
    #define REGEX_DOMAIN_EXT "^(co|com)$"

    // Returns a heap-allocated string (the caller must free it), or NULL on failure
    char *getdomainbyhost(const char *http_host)
    {
        char *res = calloc(128, sizeof(char));
        if (res == NULL)
        {
            return NULL;
        }
        // return directly if Dev env
        if (!strcmp(http_host, "localhost") ||
            !strcmp(http_host, "127.0.0.1") ||
            !strcmp(http_host, "0.0.0.0"))
        {
            snprintf(res, 128, "%s", http_host);
            return res;
        }
        // Get "." occurrence frequency in host
        // e.g. a.b.google.com.hk => 4
        // e.g. b.google.com.hk => 3
        // e.g. google.co.jp => 2
        // e.g. google.com.hk => 2
        // e.g. google.com => 1
        int occur = 0;
        for (int i = 0; http_host[i] != '\0'; ++i)
        {
            if (http_host[i] == '.')
            {
                ++occur;
            }
        }
        // Zero or one "." means there is no sub-domain to strip
        if (occur <= 1)
        {
            snprintf(res, 128, "%s", http_host);
            return res;
        }
        char *arr[32];
        int i = 0;
        char domain_name[128];
        // Bounded copy; hosts longer than 127 characters are truncated
        snprintf(domain_name, sizeof(domain_name), "%s", http_host);
        char *token = strtok(domain_name, ".");
        while (NULL != token && i < 32)
        {
            arr[i++] = token;
            token = strtok(NULL, ".");
        }
        // strtok skips empty words (e.g. "a..b"), so trust the real word count
        occur = i - 1;
        if (occur < 1)
        {
            snprintf(res, 128, "%s", http_host);
            return res;
        }
        // If it's 2-worded domain extension, check the second to last word
        // e.g. google.com.hk => check "com" => google.com.hk
        // e.g. some-service.google.com => check "google" => google.com
        regex_t regex_domain_ext;
        if (regcomp(&regex_domain_ext, REGEX_DOMAIN_EXT, REG_EXTENDED | REG_NOSUB) != 0)
        {
            free(res);
            return NULL;
        }
        int match_result = regexec(&regex_domain_ext, arr[occur - 1], 0, NULL, 0);
        regfree(&regex_domain_ext);
        if (match_result != REG_NOMATCH && occur >= 2)
        {
            snprintf(res, 128, "%s.%s.%s", arr[occur - 2], arr[occur - 1], arr[occur]);
        }
        else
        {
            snprintf(res, 128, "%s.%s", arr[occur - 1], arr[occur]);
        }
        return res;
    }
It’s perhaps not a good solution. It may look a bit buggy and will not pass some of the QA tests. So, as you know, I’ll have to use the builtin function gethostbyname.
But I believe it’s good practice to solve problems in a unique way. Maybe in the future, when I look back on what I’ve done here, I’ll think of myself as a super idiot for coming up with such a dumb solution.