17
String Operations On Phone Numbers In SQL
When dealing with SQL databases, users will have to interact with different data types and schemas depending on the project's requirements. Since databases are used in many different applications and support multiple platforms, devices, and users, there may be differences in how data is entered, even in a simple field like a phone number.
For example, one set of users will enter the phone number with the country code, while the others may simply ignore the country code. Therefore, it is essential for database users to know how to clean data in a SQL database. In this post, we will see how to utilize SQL string operations on phone number fields to properly clean, format, and extract data.
Data comes in all forms and sizes. Most of the time, it will be raw data, without proper formatting. Data cleaning allows users to filter, sort, and format this raw data to extract meaningful information from it. Data cleaning also makes data analysis a much more efficient process as an additional benefit.
Let's assume we have a text field that captures user information and asks users to enter valuable information like phone numbers. We need to clean this text field to make the phone numbers usable. How would we go about doing that? To begin with, we need to understand SQL string operations.
String operations are functions that can be used to manipulate strings. These string operators (also called string functions) take a string as input, process it according to the specified function, and return a string as the output. The availability of these functions will differ depending on the database. For example, the following is a list of commonly used string functions in MySQL:
Function | Usage |
---|---|
LEFT / RIGHT | Extract a specified number of characters from a string (starting from left or right) |
TRIM | Remove leading or trailing spaces |
POSITION / STRPOS | Analogous to the Location function, where a numerical value of the position of the specified substring is provided. |
SUBSTR | Extract the specified substring from a specific location to match the specified length |
CONCAT | Combine two or more strings to make a single string |
COALESCE | Replace null values with a specified value |
FORMAT | Formats the value to the specified format |
REGEXP | Matches string to a regular expression |
REGEXP_SUBSTR | Extracts value for the matching regular expression |
STR | Convert numerical value to string |
REVERSE | Reverses the specified string |
REPLACE | Replaces the specified string with the given value. |
The full list of available string functions and operators is available in the official documentation.
Now we have a basic understanding of string operations and related functions in SQL! So, let's look at how exactly we can utilize some of these functions on phone numbers to extract and format data. For all the examples provided here, we will be using the Arctype SQL client with a MySQL database in a Windows environment with the following dataset.
We can utilize the SUBSTR
command to format phone numbers. It will extract parts of the number and use the CONCAT
to join the specific parts together. In the following example, we have broken down the phone number into three distinct sections and combined them together with the formatting as a new field called num_formatted
:
SELECT
phone_num,
CONCAT(
'(',
SUBSTR(phone_num, 1, 3),
') ',
SUBSTR(phone_num, 4, 3),
'-',
SUBSTR(phone_num, 7)
) AS num_formatted
FROM
EmpDetails
The result of the operation is show below:
You can create a new column to save the formatted number using the ALTER TABLE
command. It will create a new column and update it with the formatted numbers.
ALTER TABLE EmpDetails
ADD formatted_number VARCHAR(255);
UPDATE
EmpDetails
SET
formatted_number = (
SELECT
CONCAT(
'(',
SUBSTR(phone_num, 1, 3),
') ',
SUBSTR(phone_num, 4, 3),
'-',
SUBSTR(phone_num, 7)
) AS num_formatted
)
The result of the above operation is shown below:
We can modify our SQL command to include the REPLACE
function if there are additional characters in the phone number field, such as dashes and brackets. This function can be used to remove unnecessary characters and whitespaces in a string. Now let's further format the formmated_number
field to include the country code.
SELECT
formatted_number,
CONCAT(
'+1 '
'(',
SUBSTR(num_cleansed, 1, 3),
') ',
SUBSTR(num_cleansed, 5, 3),
'-',
SUBSTR(num_cleansed, 7)
) AS num_formatted
FROM
(
SELECT
formatted_number,
REPLACE(REPLACE(REPLACE(REPLACE(formatted_number, '(', ''), ')', ''), '-', ''), ' ','') as num_cleansed
FROM
EmpDetails
) FormattedPhoneNumbers
In the above statement, the REPLACE
function is used to remove the brackets, dashes, and whitespaces from the formatted_number
field before undergoing the formatting process.
We can use the LEFT
and RIGHT
functions to specify and extract different parts of the phone number. In the following example, we will extract the area code of the phone number using the LEFT
function and the last four digits using the RIGHT
function based on the num_formatted
field.
SELECT
REPLACE(LEFT(formatted_number, 4), '(', '') As 'Area Code',
RIGHT(formatted_number, 4) AS 'Last Four Digits'
FROM
EmpDetails
Here, the REPLACE
function is used to remove the bracket from the selected number block. The result of the query is shown below.
One of the most common tasks in the world of data cleaning is extracting phone numbers from a text block. The complexity and feasibility of this task will mostly depend on the composition of the text.
The easiest way to extract phone numbers is to utilize regular expressions to target the specific phone number formats. Extracting data has become far simpler with the introduction of functions like REGEXP_SUBSTR
in MySQL 8.0. We will be populating the details column with some phone numbers in different formats, as shown below.
First, we need to figure out which rows consist of data matching our requirements. In this case, the following regular expressions will be used on the details field.
- Any consecutive 10 digits
- Formatted number (XXX-XXX-XXXX)
SELECT
*
FROM
EmpDetails
WHERE
# Any 10 Digits
details REGEXP '[[:digit:]]{10}'
# Formatted Number (XXX-XXX-XXXX)
OR details REGEXP '[0-9]{3}-[0-9]{3}-[0-9]{4}';
Since we have identified the rows, the next step is to extract the phone numbers. It can be done using the REGEXP_SUBSTR function to extract the substring which matches the given regular expression. As we need to query two different regular expressions, we will be using the CONCAT_WS function to combine the results of both expressions into a single column.
SELECT
emp_id,
name,
email,
CONCAT_WS(
'',
REGEXP_SUBSTR(details, '^[0-9]+$', 1, 1, 'm'),
REGEXP_SUBSTR(details, '[0-9]{3}-[0-9]{3}-[0-9]{4}', 1, 1, 'm')
) AS extracted_phone_numbers
FROM
(
SELECT
*
FROM
EmpDetails
WHERE
details REGEXP '[[:digit:]]{10}'
OR details REGEXP '[0-9]{3}-[0-9]{3}-[0-9]{4}'
) ExtractedDetails
To query results from a single field with multiple numbers, we need to create a stored procedure that loops through the desired field to capture all matching regex patterns. For instance, let's see how to extract multiple phone numbers from the details field of emp_id 1702 (Dan).
DELIMITER $$
CREATE PROCEDURE get_number_of_matches_full()
BEGIN
DECLARE regex_match INT DEFAULT 1;
DECLARE current_number INT DEFAULT 1;
WHILE regex_match >= 1 DO
CREATE TABLE IF NOT EXISTS extracted_numbers_table (
`emp_id` int NOT NULL,
`name` varchar(255) DEFAULT NULL,
`email` varchar(255) DEFAULT NULL,
`extracted_phone_num` varchar(255) DEFAULT NULL
);
INSERT INTO extracted_numbers_table (emp_id, name, email, extracted_phone_num)
SELECT emp_id, name, email, REGEXP_SUBSTR(details, '[0-9]{3}-[0-9]{3}-[0-9]{4}', 1, current_number, 'm') FROM EmpDetails WHERE emp_id = 1702;
SET current_number = current_number + 1;
IF ((SELECT REGEXP_SUBSTR(details, '[0-9]{3}-[0-9]{3}-[0-9]{4}', 1, current_number, 'm') FROM EmpDetails WHERE emp_id = 1702) IS NULL) THEN
SET regex_match = 0;
END IF;
END WHILE;
END $$
DELIMITER;
The result of this operation is shown below.
In the above code block, we have created a stored procedure called get_number_of_matches_full
, which loops through the details field until all the regex matches are found on the specified row of the EmpDetails table. We are using the REGEXP_SUBSTR
function with the position argument to extract different matches. It updates by one at each loop run and saves the results on a newly created extracted_numbers_table
. Finally, the loop exits when a NULL
result is found.
We can call this procedure and view the results by querying the extracted_numbers_table
as shown below.
CALL get_number_of_matches_full;
SELECT * FROM extracted_numbers_table;
String operations in SQL are vital functions to clean and format data within a database. Moreover, string operations are the core component when dealing with valuable information such as phone numbers as they allow users to extract or manipulate phone numbers according to their requirements. However, it's important to remember that the exact functions available will change depending on the database type and version. Therefore, always remember to refer to the documentation to find out the available string operations and implement them to your heart's content.
17