Standardization and Modification of URLs in Excel & Google Docs

Zdeněk Nešpor Excel Leave a Comment

Welcome to a practical tutorial on how to standardize and modify URL datasets in Excel, Google Sheets, or other table processors. I’ll teach you how to clean up trailing slashes, WWW subdomain, change HTTP to HTTPS, and much more useful tricks for trimming or concatenating URLs. You’ll appreciate this information during work with the complex datasets, preparing redirection tables, or experimenting.

Proper knowledge of functions and formulas will boost your work productivity, and you’ll be way more efficient. A manual customization is an option for maybe one or two URLs. Bigger datasets require proper solutions. Let’s dig into it. I’ll teach you how to work with URLs more effectively.

Smart Article Controls

Here you can control options for displayed formulas and functions. It will allow you to copy and paste formulas without any further adjustments. Change delimiter from semicolon to comma. Or change default cell from A1 to whatever you need in your table. Press the button Customize formulas to rewrite all formulas presented in this article and marked with an orange lightbulb icon.

  • Select a formula delimiter. Active delimiter is ;. Change it to .
  • Select a starting cell. Active cell is A1. Change it to .

About Enhanced Articles

The piece you are just reading is an enhanced article or a smart article if you want. It’s a next-level technical content enhanced with JavaScript functions to improve your experience.

I bet you are familiar with the following scenario. You dedicate your time to searching for Excel formulas. And then you must, character by character, recreate those formulas to fit and work in your sheets. Sometimes you need semicolons, sometimes commas, and rarely your data starts in the same cell as is in the example.

No problem. With this enhanced smart article, you have full control over the final formulas. You can tailor and customize the final result to fulfill your requirements.

Table of Contents

Removing HTTP, HTTPS, and WWW

The only function you’ll need at this moment is a SUBSTITUTE. It’s used for substituting a specific string of characters with another string of characters. Let’s do a simple removal of one URL component. Check the formula and the example table below. In column A are raw data. In column B are data modified by the formula.

=SUBSTITUTE(A1;"http://";;1) Formula customization enabled

AB
1http://link-brain.comlink-brain.com

Formula Explanation

From A1 source cell SUBSTITUTE text string “http://” with nothing and apply it only on the first occurrence of the given string.

SUBSTITUTE syntax

=SUBSTITUTE(text; old_text; new_text; instance_number)

  • text – [required] – Text or cell reference where you want to substitute characters.
  • old_text – [required] – A string of characters to lookup for and be replaced, highly recommended to put this in between quotation marks.
  • new_text – [required] – A string of characters to replace old_text, highly recommended to put this in between quotation marks.
  • instance_number – [optional] – Which occurrence of the old_text you want to replace. The starting number is 1. If left empty, all occurrences will be replaced.

SUBSTITUTE nesting

To remove more URL components at the same time, try nesting the same function. Check the formula and the example table below. In column A are raw data. In column B are data modified by the formula.

=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A1;"https://";;1);"http://";;1);"www.";;1) Formula customization enabled

AB
1http://link-brain.comlink-brain.com
2http://www.link-brain.comlink-brain.com
3https://link-brain.comlink-brain.com
4https://www.link-brain.comlink-brain.com
5www.link-brain.comlink-brain.com

Removing trailing slashes

Trailing slash removal is a bit more complicated. You can use the function LEFT to delete the last character of a text string in a given cell. But it’s not a very flexible solution. The goal is to detect appropriate cells and remove only trailing slashes. For that, let’s use functions IF, LEFT, RIGHT, and LEN. Check the formula and the example table below. In column A are raw data. In column B are data modified by the formula.

=IF(RIGHT(A1;1)="/";LEFT(A1;LEN(A1)-1);A1) Formula customization enabled

AB
1http://link-brain.com/http://link-brain.com
2http://link-brain.comhttp://link-brain.com
2http://link-brain.com/test/http://link-brain.com/test
2http://link-brain.com/examplehttp://link-brain.com/example

Formula Explanation

IF it’s true that the first character in cell A1 from RIGHT is a trailing slash, count LEN of cell A1 and return characters from LEFT minus 1 character. If a given condition is not true, copy the whole string from cell A1 without modifications.

RIGHT & LEFT syntax

=LEFT(text; number_of_characters)
=RIGHT(text; number_of_characters)

  • text – [required] – Text or cell reference where you want to substitute characters.
  • number_of_characters – [optional] – A number of characters to extract. If left empty, number 1 will be used as default.

LEN syntax

=LEN(text)

  • text – [required] – Text or cell reference where you want to count length. Lenght = number of characters and spaces.

IF syntax

=IF(logical_test; if_true; if_false)

  • logical_test – [required] – Logical test to check if a given condition is true or false.
  • if_true – [optional] – If the condition is true, return text or do something. The text should be wrapped in between quotation marks. It can also be left empty.
  • if_false – [optional] – If the condition is false, return text or do something. The text should be wrapped in between quotation marks. It can also be left empty.

Standardization of URL

At this moment, you should know all the necessary components to create more complex formulas. Let’s try to create one to remove http://, https://, www, and trailing slashes. It may look not very easy, but it’s just a few nested functions used above. Check the formula and the example table below. In column A are raw data. In column B are data modified by the formula.

=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(IF(RIGHT(A1;1)="/";LEFT(A1;LEN(A1)-1);A1);"https://";;1);"http://";;1);"www.";;1) Formula customization enabled

AB
1https://www.link-brain.comlink-brain.com
2www.link-brain.com/link-brain.com
3https://link-brain.com/link-brain.com
4https://www.link-brain.com/contactlink-brain.com/contact
5https://link-brain.com/contact/link-brain.com/contact

The result is a column with beautiful URLs in the unified state. This can be used as a starting point for many other operations with URLs. Let’s do some more exciting work.

Deduplication and preparation of a redirect table

First things first. There is usually no point in working with duplicates. Let’s clean our URL dataset. I’d recommend copying previous output into a new sheet.

  1. Select the whole column with URLs to deduplicate.
  2. Switch to Data tab.
  3. Choose the option Remove duplicates.
  4. If you have a header cell, don’t forget to check My data has headers.
  5. Click on OK.
  6. Done, your data is clean without duplicates.

This method is how I prepare data for redirection tables. In case you do the same thing, don’t forget to mention or remind the developers or webmasters of all the possible combinations that should be redirected. All are listed below.

  • http://
  • https://
  • http:// + www.
  • https:// + www.
  • http:// + /
  • https:// + /
  • http:// + www. + /
  • https:// + www. + /

URL composing and merging

Sometimes it’s not good enough to prepare only a simple construct with additional instructions. It would be best if you did some URL reconstruction by yourself. I’ll show you how to merge different parts of URLs.

Into one column prepare strings like https://, www. or /. And construct a new formula by using the CONCAT function. Check the formula and the example table below. In column A are URL components. In column B are raw deduplicated data modified by the previously used formula. Cell C is empty to create some space for a better look. And in cell D is a result.

=CONCAT($A$2;B1;$A$4) Formula customization disabled

ABCD
1http://link-brain.comhttps://link-brain.com/
2https://link-brain.com/contacthttps://link-brain.com/contact/
3www.link-brain.com/toolshttps://link-brain.com/tools/
4/link-brain.com/tools/url-openerhttps://link-brain.com/tools/url-opener/

Formula Explanation

CONCAT cells A2, B1, and A4.

Symbol $ before column and row reference marks absolute reference, meaning it will stay as-is referencing to the same cell all the time. If you drag your formula into other cells, absolute references $A$2 and $A$4 will remain the same. Only relative reference to the original cell B1 will be changed dynamically.

CONCAT syntax

=CONCAT(text_1; text_2; text_3)

  • text – [required] – Text strings or cell references to be concatenated.

CONCATENATE syntax

Confused? Don’t be. There are two functions used for the same thing. Both work in major table processors like Excel and Google Sheets. In Excel, I’d recommend using shorter CONCAT. And in Google Sheets, you have to use CONCATENATE if you want your formulas working correctly.

=CONCATENATE (text_1; text_2; text_3)

  • text – [required] – Text strings or cell references to be concatenated.

How to type $

  • Input $ into any function is very easy. Click into the function field and move your cursor to the cell reference you need to fix. Press F4. Both column and row references will have $ in front of them. Press F4 repeatedly to change it or remove dollar signs. This is the Excel way you should learn to utilize.
  • Key 4 in the alphanumerical area of your keyboard under F1-F12.
  • Keyboard shortcut ALT+36.

Replacing domain names, subdomains, or TLDs

Further manipulation with URLs can be handy. There are many options and utilization of changing TLDs, domain names, or subdomains. And you can use SUBSTITUTE with which you should be quite familiar at this moment.

In the following example, I’m removing http://, https://, www., trailing slashes, replacing subdomain, and replacing domain name to a new variant. Don’t be scared by the long formula. It’s nothing more than a few nested substitutions. In column A are raw data. In column B are data modified by the formula.

=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(IF(RIGHT(A1;1)="/";LEFT(A1;LEN(A1)-1);A1);"https://";;1);"http://";;1);"www.";;1);"robin";"batman";1);"link-brain.com";"linkbrain.org";1) Formula customization enabled

AB
1https://www.link-brain.comlinkbrain.org
2http://robin.link-brain.com/batman.linkbrain.org
3https://robin.link-brain.com/contactbatman.linkbrain.org/contact

Replacing parameters, anchors, or fragments

It’s crucial to learn how to work with messy data properly. All the fbclid, gclid, and other parameters are infesting URLs these days. Let’s see how to deal with it—time to introduce a new function MID. It can help tremendously with cutting off parameters ? or hashes #.

=SUBSTITUTE(A1;MID(A1;(FIND("#";A1;1));LEN(A1));;1) Formula customization enabled

AB
1https://link-brain.com#session_id=IwRgJG_FWeqF-7fkbFczhttps://link-brain.com
2https://link-brain.com#anchorhttps://link-brain.com
3https://link-brain.com#!jokerhttps://link-brain.com

=SUBSTITUTE(A1;MID(A1;(FIND("?";A1;1));LEN(A1));;1) Formula customization enabled

AB
1https://link-brain.com?fbclid=IwRgJG_FWeqF-7fkbFczhttps://link-brain.com
2https://link-brain.com/?utm_source=google&utm_medium=cpchttps://link-brain.com
3https://link-brain.com?_escaped_fragment_=riddlerhttps://link-brain.com

Formula Explanation

Function FIND helps you to find a specific string of characters in the middle of a text string. In combination with function MID which works similarly as RIGHT or LEFT it is possible to return a result without any additional parameters or hashes. Some of the formula arguments are replaced, for example, with LEN. This creates dynamic references to process data smoothly and efficiently.

FIND syntax

=FIND(find_text; in_text; starting_number)

  • find_text – [required] – Text or character to find.
  • in_text – [required] – Text or cell reference where to search for a given text.
  • starting_number – [optional] – The starting position for searching. If omited number 1 will be used as a default value.

MID syntax

=MID(text; starting_number; characters_number)

  • text – [required] – Text or cell reference to process.
  • starting_number – [required] – The starting position where text extraction should begin.
  • characters_number – [required] – How many characters should be extracted.

Cleaning URLs from Internet Archive Wayback Machine

Internet Archive Wayback Machine is super useful. Sadly its URLs are hard to work. Let’s use Excel to make very specific Web Archive URLs nicer.

Check this URL: http://web.archive.org/web/20180111064607/https://link-brain.com/. Some facts you can work with. There are five slashes before the URL components you need—useful information but not good enough. There are also at least 27 characters before the fifth slash, which is much better information. 28 characters at a minimum if you count the last slash. Let’s use utilize both. Check the table below. In column A you can see source data. In column B is cleaned URL.

=SUBSTITUTE(A1;LEFT(A1;(FIND("/";A1;28)));;1) Formula customization enabled

AB
1http://web.archive.org/web/20180111064607/https://link-brain.com/https://link-brain.com/

Formula Explanation

In the formula, I’m using FIND to search for slash, which is at least 28 characters from the beginning of the string. Then I use LEFT to return characters from that last slash and SUBSTITUTE the original text string with a new one.

I’d recommend performing URL standardization in a new column. It’s possible to do it all in one step, but the formula is super complicated and hard to debug.

Quick tip

To extract old URLs, you can use free LINK-BRAIN IAWM Extractor. Tool designed to pull data from Wayback Machine API and clean them up.

Overview of used functions

Here are listed all the functions used in this article with links to the official documentation.

Function in ExcelFunction in Google SheetsFunction type
SUBSTITUTESUBSTITUTEtext
IFIFlogical
LENLENtext
LEFTLEFTtext
RIGHTRIGHTtext
MIDMIDtext
FINDFINDtext
CONCATCONCATtext
CONCATENATECONCATENATEtext

Excel file with examples to download

Most of the examples used in this article can be found in the testing table below. Feel free to download it and test how it all works.

Conlusion

Automation and creating proper solutions pay up in the long term. With all the functions and formulas in this article, you will save tons of time. Your work will be more fun and much easier and faster. Always try to figure out minimalistic and straightforward solutions. Do not overcomplicate stuff. Because it’s not so complicated and time-consuming to modify, update, and sustain minimalistic code.

Some final tips. If you hate Excel and have only a small amount of data, you are very welcome to use free LINK-BRAIN URL Builder. It can do most of the stuff mentioned here, with only a few clicks directly in your browser.

The next tip is regarding functions in Excel, which is now translated into many different languages. Originally I wanted to implement some language switcher into the customization. But it’s quite tricky, and Excel already has an official build-in tool for that called Excel Functions Translator.

Do you use some other useful formulas? Please share them in the comments. If you try to solve some specific problem, also leave a comment. I’ll try to help, and we might figure out something together. Any feedback will be much appreciated.

READ. SHARE. REPEAT.
  •  
  •  
  •  
  •  
  •  
  •  

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.