Introduction to the Internet
Learning objectives
- You know what the internet is and what sorts of computers make up the internet.
- You know a set of key protocols and terms (TCP/IP, URI, DNS, HTTP).
What is the internet
The internet is a network through which computers around the world are connected to each other. One possible origin for the term internet is the combination of the words inter-, a prefix that refers to something between or among, and network, which refers to a group of interconnected things. As a combination, the term internet refers to multiple connected networks.
Origins of the internet
The origins of the internet are in the 1960s efforts to build functionality for time-sharing of computers. One of the objectives was to increase access to high-powered research computers and, consequently, reduce the amount of time that such computers were idle. Such computer resources were often geographically distant, and continuously traveling to use them was not a feasible long-term solution.
One of the results of these efforts was a predecessor of the internet, ARPANET, a packet-switching network that made it possible to connect computers that were geographically far from each other. One of its key features was that computers communicating with each other did not need a direct connection; instead, the connection could be formed through a network of computers whose purpose was to pass messages forward. As a part, or perhaps a consequence, of this process, a protocol combination called TCP/IP -- specifying how data should be packaged, addressed, transmitted, routed, and received -- was formed and later standardized.
Partially due to the standardization of the TCP/IP-protocol, adding new networks into the existing group of networks became easier, increasing access to the internet. In the 1980s, most of the points for accessing the internet were still related to universities, while commercial companies -- internet service providers -- that provided access to the internet started to emerge in the late 1980s.
At that point, the internet was still mainly used for text-based communication and running remote processes. Similarly, internet service providers typically provided access to limited services, such as email. In the late 1980s, the internet as we now know it did not exist yet.
World wide web and browsers
By the early 1990s, Tim Berners-Lee (credited for inventing the world wide web) had developed many of the core tools that we currently associate with the world wide web. These included a protocol for transferring text documents between computers (HyperText Transfer Protocol, HTTP), a language for representing structure and data on a web page (HyperText Markup Language, HTML), a web server that could respond to requests, and a web browser that could show the user HTML-pages retrieved from web servers.
Since then, more browsers and servers have emerged. In the late 1990s, the most widely used web browsers were Internet Explorer and Netscape Navigator; the latter was later open-sourced to create the Mozilla browser and, later, the Firefox browser. Currently, the most used browsers include Chrome (from Google), Safari (from Apple), Edge and Internet Explorer (from Microsoft), and Firefox (from the Mozilla Foundation).
As a part of the evolution of browsers, functionality for adding dynamic content to web pages was developed. In particular, this included scripting languages such as JavaScript, which has since become the de facto language for creating dynamic functionality in web pages.
Search engines
While the evolution of browsers (and web servers) has played a significant role in making the web available and enjoyable, one of the key turning points in the history of the internet is the emergence and evolution of search engines.
The early search engines were effectively web servers hosting lists of other web servers that people could visit, and to which web server hosts could add their site. Some crawlers, i.e. software that visited sites and stored some of their information, were developed as early as the early 1990s, but with little success.
One of the early commercial services was the Yahoo! search engine, launched in 1994, which contained a searchable index of addresses with descriptions for each address. Adding a commercial site to the index cost money, while non-commercial sites could be added for free. One of the challenges with the search engine was that the descriptions were hand-written, which took quite a bit of time.
The first search engine that provided the functionality for writing queries in natural language was Altavista, which was launched in 1995. The service both crawled web sites (i.e. visited web sites and downloaded their content) and indexed the sites for search purposes.
Identifying meaningful sites was a challenging task, and purely searching and indexing page content could lead to biased results. For example, at one point web page developers added invisible text to their sites to increase the importance of the site in search engines.
Google search, which was launched in 1997, had a potential solution to the problem of identifying meaningful sites. Its algorithm, called PageRank, in part determined the relevance of a site from the number of quality sites linking to that site, which improved the ranking of search results. Additional algorithms have since been added to the service, including personalization of results based on earlier browsing behavior.
Currently, Google is one of the leading search engines, alongside Bing and Baidu.
Internet access
The adoption and availability of the internet in different countries is partially influenced by national and international policies. As an example, the World Summit on the Information Society in 2003 highlighted that everyone should have the possibility of creating, accessing, using, and sharing information and knowledge.
This, in part, is also related to internet access -- as an example, the United Nations has condemned disrupting or limiting internet access. Countries differ in how they take responsibility for ensuring access to the internet for all. As an example, Finland has declared the availability of internet access a basic right.
Overall, currently, according to Statista, some 60% of the global population are active internet users.
Roles of computers and communication
The internet consists of interconnected computers with different roles. Client software, including but not limited to internet browsers, makes requests through which it asks for information from servers. Server software, including web applications, listens for requests and responds to them based on its internal logic whenever a request is received.
The format of requests and responses is standardized and follows specific guidelines. This means that both the client and the server must adhere to specific protocols if the messages are to be understood. Even missing a single character in how a message is structured can lead to a situation where the message is not sent, is not correctly received, or is not correctly interpreted.
When a client makes a request, the request is passed to the server through a network of computers. This network consists of routers, i.e. computers with the specific purpose of passing a message forward until it reaches its destination.
Key protocols and concepts
The functionality of the internet is based on agreed-upon protocols, i.e. standardized sets of rules for communication between computers (and computer software). By implementing and using existing protocols, applications (and, more precisely, their developers) know how messages should be interpreted and how such messages should be formed when they are sent.
Next, we look into the main protocols that allow the internet to function as it does.
TCP/IP
Two of the main protocols that the internet relies upon are the TCP-protocol (Transmission Control Protocol) and the IP-protocol (Internet Protocol). The TCP-protocol ensures that messages are received as intended, while the IP-protocol ensures that the messages are delivered to the correct address. These protocols are typically referred to jointly using the term TCP/IP.
When a computer sends a message, e.g. a file, to another computer using the TCP-protocol, the data is divided into packets. Each packet contains a small header, up to 60 bytes, and content, up to 65 kilobytes. The header contains information about the sender and the receiver (IP-addresses), a number indicating the order of the particular packet, a checksum of the content, and a few other values.
Receiving a message consists of receiving one or more packets. When a computer receives a packet, it verifies that the content was correctly received by calculating a checksum of the content and comparing it with the checksum in the header. If the checksums do not match, i.e. the content was not correctly received, the receiver requests the content again from the sender. On the other hand, if the checksums match, the receiver informs the sender that the packet was correctly received.
When all packets related to a message have been received, the message will be reconstructed as a whole based on the packet order numbers in the packet headers. In summary, the responsibility of the TCP-protocol is to verify that the message is received, and that the content of the message is received correctly.
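To make the idea concrete, the following sketch mimics this packetization in TypeScript. It is a simplified illustration, not the real TCP implementation: the packet structure, the toy checksum, and all names in the sketch are made up for this example.

```typescript
// A simplified illustration of TCP-style packetization -- not the real
// protocol, just the core idea of order numbers and checksums.
interface Packet {
  order: number; // position of this packet within the message
  checksum: number; // checksum of the content
  content: string; // a slice of the message
}

// A toy checksum: sums the character codes of the content.
function checksum(content: string): number {
  let sum = 0;
  for (const char of content) {
    sum = (sum + char.codePointAt(0)!) % 65536;
  }
  return sum;
}

// Divide a message into packets of at most `size` characters.
function toPackets(message: string, size: number): Packet[] {
  const packets: Packet[] = [];
  for (let i = 0; i < message.length; i += size) {
    const content = message.slice(i, i + size);
    packets.push({ order: packets.length, checksum: checksum(content), content });
  }
  return packets;
}

// Reconstruct the message: verify each checksum, then join in order.
function fromPackets(packets: Packet[]): string {
  for (const packet of packets) {
    if (checksum(packet.content) !== packet.checksum) {
      // In TCP, a failed check leads to the packet being requested again.
      throw new Error(`Packet ${packet.order} was corrupted in transit`);
    }
  }
  return packets
    .sort((a, b) => a.order - b.order)
    .map((packet) => packet.content)
    .join("");
}

// Packets may arrive in any order; the order numbers fix that.
const packets = toPackets("Hello from the internet!", 8);
packets.reverse();
console.log(fromPackets(packets)); // "Hello from the internet!"
```

As the reversed packet list above demonstrates, the order numbers in the headers allow the message to be reconstructed even when the packets arrive out of order.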
The small networks that together form the internet consist of multiple types of computers. For some of the computers, the only role is to forward packets towards the correct address. Such computers are called routers.
As mentioned before, each packet has a header, which also includes the IP-address of the recipient. An IP-address consists of multiple parts, which makes it somewhat analogous to a home address. In a home address, one part of the address is the country, one part is the city, and so on. Similarly, IP-addresses consist of parts, which can be used to narrow down the computer to which the IP-address belongs.
When a packet is sent, it is forwarded to the closest router. Each router has a routing table, which is effectively a database of IP-address parts and the routers that correspond to them. Each router typically has data about the IP-addresses of nearby routers (and computers), but only high-level information about routers that are further away. By high-level information, we mean for example information about which router is responsible for which "country". This information is needed because no single router is connected to all the computers in the world.
When a router receives a packet, it checks the header of the packet for the IP-address of the recipient. The router looks up the IP-address from its routing table, and chooses the best match for passing the packet forward. Similarly, the next router will receive the packet, check it, and pass it forward. This behavior -- i.e. passing packets forward -- is a key part of the core functionality of the internet.
Finally, when the packet is received by the router that is connected to the recipient, the packet is forwarded to the recipient. The packet has then reached its destination. As per the TCP-protocol, the packet is checked, and the recipient responds with a message to the sender, indicating that the packet has been received.
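The following sketch illustrates the routing-table lookup described above. It is a simplified model: real routers match binary address prefixes using specialized data structures, whereas this example uses plain string prefixes and invented next-hop names.

```typescript
// A simplified routing table: maps IP-address prefixes to a next hop.
// The prefixes and next-hop names here are invented for illustration.
const routingTable: Record<string, string> = {
  "192.168.1.": "local network",
  "192.168.": "router A",
  "10.": "router B",
  "": "default gateway", // fallback when nothing more specific matches
};

// Choose the next hop by the longest matching prefix, i.e. the most
// specific route the router knows for the destination address.
function nextHop(destination: string): string {
  let bestPrefix = "";
  for (const prefix of Object.keys(routingTable)) {
    if (destination.startsWith(prefix) && prefix.length > bestPrefix.length) {
      bestPrefix = prefix;
    }
  }
  return routingTable[bestPrefix];
}

console.log(nextHop("192.168.1.42")); // "local network"
console.log(nextHop("10.0.0.7"));     // "router B"
console.log(nextHop("8.8.8.8"));      // "default gateway"
```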
URI
Resources on the internet, e.g. pages, files, and so on, are identified using URIs (Uniform Resource Identifiers). The term resource comes from the time when web applications mostly consisted of static documents and had little dynamic functionality. These days, however, the term can refer to practically any content retrieved from the internet, including content that is created dynamically for a specific request.
A URI contains information about the used protocol, the address of the server, a port of the server, and a path, as well as possible query parameters and an anchor.
protocol://server-address:port/path?param=value&param2=value#anchor
As an example, let's look at the URI https://www.aalto.fi/en/department-of-computer-science. Here, the protocol is https, the server address is www.aalto.fi, and the path is /en/department-of-computer-science. The URI does not contain the server port, query parameters, or an anchor. If a port is not specified, the default port for the protocol is used -- for example, for http the default port is 80, and for https, the default port is 443.
In more detail, the parts of a URI are as follows:
- protocol: the protocol that is used when making the request, for example http or https.
- server-address: the address of the server that is being connected to. This can be either an IP-address or a text-based address such as www.aalto.fi.
- port: a number between 0 and 65535, representing a port on the server in which an application is listening for requests. If the port is omitted, the default port for the used protocol is used.
In addition, a URI may contain a path, query parameters, and an anchor. These are as follows.
- path: the path identifies a resource on the server. The path can contain multiple parts divided by a slash /, and the path may also contain information about a specific document or a file name (e.g. index.html).
- query parameters: query parameters consist of a collection of key-value pairs, which are sent to the server as a part of the request within the URI. In each pair, the key and the value are linked using the equals sign, i.e. key=value. Multiple key-value pairs are separated from each other using an ampersand &, e.g. key1=value1&key2=value2.
- anchor: an anchor can be used to indicate a position in a document (but an anchor can also be used for storing information about the current request or site).
In combination, URIs are used to specify the protocol, the server, and the resource.
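In practice, URIs rarely need to be taken apart by hand. For example, the standard URL class, available in browsers as well as in runtimes such as Node and Deno, parses a URI into the parts listed above. Note that the query parameter and anchor below are added to the Aalto address purely for illustration.

```typescript
// The standard URL class parses a URI into its parts.
const uri = new URL(
  "https://www.aalto.fi/en/department-of-computer-science?lang=en#studies",
);

console.log(uri.protocol); // "https:"
console.log(uri.hostname); // "www.aalto.fi"
console.log(uri.port);     // "" -- not specified, so the default (443) is used
console.log(uri.pathname); // "/en/department-of-computer-science"
console.log(uri.searchParams.get("lang")); // "en"
console.log(uri.hash);     // "#studies"
```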
In general, there are two server addresses that (almost) always correspond to the local machine: both localhost and 127.0.0.1 refer to your own computer. That is, when we make requests in the material to a server at localhost, we assume that you are running the discussed server on your own computer.
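As a small sketch, assuming a server is listening on your own machine on port 8000 (a made-up port for this example), the following two requests end up at the same place:

```typescript
// Both addresses refer to the local machine, so -- assuming a server
// is running locally on port 8000 -- these requests hit the same server.
const first = await fetch("http://localhost:8000/");
const second = await fetch("http://127.0.0.1:8000/");
console.log(first.status, second.status);
```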
DNS
Domain Name System (DNS) servers are a particular type of server that helps the internet function. The TCP/IP-protocol functions based on IP-addresses, while server addresses in URIs can be written as text. When a request to a URI is made, the server address in the URI first needs to be resolved into an IP address. If the server address already is an IP address, no resolution needs to be made. Otherwise, the address needs to be looked up.
Computers typically have a cache (a temporary memory) that contains recent visits to websites, storing both the text-based address and the corresponding IP address. If the address is within the cache, the resolution of the IP address happens locally on the computer. If the address is not in the cache, however, the address needs to be resolved using an external service.
Here, DNS servers come into play. Whenever an address needs to be resolved, the computer sends a message to a DNS server asking for a corresponding IP address. Once the DNS server returns the IP address, the computer stores the IP address into its cache, and then continues by sending the message to the actual IP address.
DNS servers themselves do not know all the addresses and their corresponding IP addresses either. If a DNS server does not know the IP address corresponding to an address, it will query a so-called root name server, which returns information on which DNS servers should know the IP address of the particular domain. While internet service providers and server hosts in general maintain DNS servers, which means that there are plenty of DNS servers, there are just a handful of root name servers. Root name servers are supervised by the Internet Corporation for Assigned Names and Numbers (ICANN), which is also responsible for governing top-level domains, i.e. .com, .org, etc.
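Programming languages and runtimes typically offer ready-made functions for DNS resolution. As a sketch, Node's built-in dns module can resolve a text-based address into IP addresses; the caching and possible root name server queries described above happen behind the scenes:

```typescript
// Resolve a text-based address into IPv4 addresses using Node's
// built-in DNS module.
import { resolve4 } from "node:dns/promises";

const addresses = await resolve4("www.aalto.fi");
console.log(addresses); // a list of IP addresses; the actual values vary
```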
HTTP
HTTP (HyperText Transfer Protocol) is the protocol used by browsers and servers for requests and responses. It is built on top of the TCP/IP-protocol, which means that those working with the HTTP-protocol do not have to worry about underlying concerns such as dividing requests into packets, making sure that packets arrive at the correct destination, making sure that the content of the packets is correct, or making sure that the message that the packets jointly form is correctly reconstructed.
The protocol is based on the client-server model, where each request receives one response (the request-response paradigm). This means that each request is handled as a separate entity, and two subsequent requests from the same source to the same target are not automatically linked to each other. In the client-server model, workload and responsibilities are divided between clients and servers, where clients ask for services or resources from servers. In practice, clients do not share their own resources, but rely on the resources of the servers.
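As a sketch of the request-response paradigm in practice, the standard fetch function sends an HTTP request and returns the server's response. The example outputs in the comments are illustrative, not guaranteed:

```typescript
// Under the hood, fetch sends a text-based HTTP request along the lines of:
//
//   GET /en/department-of-computer-science HTTP/1.1
//   Host: www.aalto.fi
//
// ...and receives a response with a status line, headers, and a body.
const response = await fetch(
  "https://www.aalto.fi/en/department-of-computer-science",
);
console.log(response.status); // e.g. 200, meaning "OK"
console.log(response.headers.get("content-type")); // e.g. "text/html; charset=UTF-8"
const body = await response.text(); // the HTML document itself
console.log(body.length);
```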
At the time of writing this material, the most used HTTP-protocol version is 1.1, which is outlined in the RFC 2616 specification. Newer versions of the protocol, such as HTTP/2 and HTTP/3, have also been proposed -- they are both compatible with HTTP/1.1, which we mostly discuss here.