As discussed in Chapter 1, HTTP
is the standard that allows documents to be communicated and shared over the
Web. From a network perspective, HTTP is an application-layer protocol that is built on top of
TCP/IP. Since the original version, HTTP/0.9, there have only been two revisions
of the HTTP standard. HTTP/1.0 was released as RFC-1945 in May 1996 and
HTTP/1.1 as RFC-2616 in June 1999.
In Chapter 1, we told you that
HTTP is very simple: a client—most conspicuously a web browser—sends a request
for some resource to a web (HTTP) server, and the server sends back a response.
The HTTP response carries the resource—the HTML document or image or whatever—as
its payload back to the client.
Continuing our analogy from the previous section, HTTP is a
kind of cover letter, like a fax cover sheet, that is stored in an envelope and
tells the receiver what language the document is in, how to read the letter,
and how to reply.
D.2.1 Uniform Resource Locators
Uniform resource locators—more commonly known as URLs—are used as the primary naming and
addressing method of the Web. URLs belong to the larger class of uniform resource identifiers (URIs); both identify resources, but URLs
include specific host details that allow connection to a server that holds the
resource.
A URL can be broken into three basic parts: first, the protocol
identifier; second, the host and service identifier; and, last, a resource
identifier that contains a path with optional parameters and an optional query
that identifies the resource. The following example shows a URL that identifies
an HTTP resource:
http://host_domain_name:8080/absolute_path?query
The HTTP standard doesn't place any limit on the length of a
URL, but some older browsers and proxy servers do. The structure of a URL is
formally described by RFC-2396: Uniform Resource Identifiers (URI): Generic
Syntax.
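As a minimal sketch of this breakdown, the following Python fragment splits the example URL into its parts using the standard urllib.parse module. The host name host_domain_name is only the placeholder from the example above, not a real host.

```python
from urllib.parse import urlsplit

# The example URL from the text; host_domain_name is a placeholder.
url = "http://host_domain_name:8080/absolute_path?query"
parts = urlsplit(url)

print(parts.scheme)    # protocol identifier
print(parts.hostname)  # host
print(parts.port)      # service (TCP port), as an integer
print(parts.path)      # path of the resource
print(parts.query)     # optional query
```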
D.2.1.1 Protocol
The first part of the
URL identifies the application protocol. HTTP URLs start with the familiar
http://. Other applications that use URLs to locate resources identify
different protocols; for example, URLs used with the File Transfer Protocol
(FTP) begin with ftp://. URLs that identify HTTP resources served over
connections that are encrypted using the Secure Sockets Layer start with
https://. We discuss the use of the Secure Sockets Layer to protect data
transmitted over the Internet in Chapter 11.
D.2.1.2 Host and service identification
The next part of the
HTTP URL identifies the host on which the web server is running, and the port on
which the server listens for HTTP requests. The host component can be either a
domain name or an IP address. Using the domain name allows user-friendly web
addresses such as:
http://www.w3.org/Protocols/
The equivalent URL using the IP address is:
http://18.29.1.35/Protocols/
Domain names are not case sensitive.
D.2.1.3 Nonstandard TCP ports
By default, an HTTP
server listens for requests on port 80. So, for example, requests for the URL http://www.oreilly.com are made to the host machine www.oreilly.com on port 80. When a nonstandard port is
used, the URL must include the port number so the browser can successfully
connect to the service. For example, the URL http://example.com:8080 connects to the web server
running on port 8080 on the host example.com.
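The port rule can be sketched in a few lines of Python. The helper effective_port( ) is a hypothetical name used only for illustration. Port 80 for HTTP comes from the text, and 443 is the registered default for https:// URLs.

```python
from urllib.parse import urlsplit

# Default ports per protocol: 80 is stated in the text, 443 is the
# registered default for HTTPS.
DEFAULT_PORTS = {"http": 80, "https": 443}

def effective_port(url):
    """Return the TCP port a client would connect to for this URL."""
    parts = urlsplit(url)
    if parts.port is not None:
        # A nonstandard port was given explicitly in the URL.
        return parts.port
    return DEFAULT_PORTS[parts.scheme]

print(effective_port("http://www.oreilly.com"))
print(effective_port("http://example.com:8080"))
```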
D.2.1.4 Resource identification
The remaining URL
components help locate a specific resource. The path, with optional parameters,
and an optional query are processed by the web server to locate or compute a
response.
The path often corresponds to an actual file path on the host's
filesystem. For example, an Apache web server running on a Unix machine that
hosts example.com may store all the web content
under the directory /usr/local/apache2/htdocs and be configured to use
the path component of the URL relative to that directory. In this case, the HTTP
response to the URL http://example.com/marketing/home.html contains the
file /usr/local/apache2/htdocs/marketing/home.html.
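The mapping from URL path to file can be sketched as follows. This is not how Apache itself is implemented. The function url_path_to_file( ) is a hypothetical helper, and a real server applies many more rules (aliases, index files, access control) before a file is chosen.

```python
import posixpath

# Document root from the Apache example in the text.
DOCUMENT_ROOT = "/usr/local/apache2/htdocs"

def url_path_to_file(url_path, root=DOCUMENT_ROOT):
    """Map the path component of a URL onto a file under the document root."""
    # Normalize as an absolute path so "." and ".." segments are collapsed
    # and cannot climb above the root, then make it relative again.
    normalized = posixpath.normpath("/" + url_path.lstrip("/"))
    return posixpath.join(root, normalized.lstrip("/"))

print(url_path_to_file("/marketing/home.html"))
```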
In contrast to domain names, the resource identification
component is usually case sensitive. This is because it often refers to a
directory or file on the web server, and the filesystems of Unix servers (which
host the majority of web sites) are case sensitive.
D.2.1.5 Parameters and queries
The path component of a
URL can include parameters and queries that are used by the web server. A common
example is to include a query as part of the URL that runs a search script. The
following example shows the string q=red as a query that the script
search.php can use:
http://example.com/search.php?q=red
Multiple query terms can be encoded using the &
character as a separator:
http://example.com/search.php?q=red&r=victoria
Parameters allow other information not related to a query to be
encoded. For example, consider the parameter lines=10 in the URL:
http://example.com/search.php;lines=10?q=red
This can be used by the search.php script to modify the
number of lines to display in a result screen.
HTTP provides the distinction between parameters and queries,
but parameters are more complex than described here and are not commonly used in
practice. We discussed how PHP can use query variables encoded into URLs in Chapter 6.
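To illustrate how a program can separate these pieces, here is a short Python sketch using the standard urllib.parse module. The URL combines the parameter and the two query terms from the examples above, which is our own combination rather than a URL taken from the text.

```python
from urllib.parse import urlparse, parse_qs

# Parameter and query terms from the examples above, combined in one URL.
url = "http://example.com/search.php;lines=10?q=red&r=victoria"
parts = urlparse(url)

path = parts.path               # the script that handles the request
params = parts.params           # parameters after the ";" character
query = parse_qs(parts.query)   # query terms, split on "&" and "="

print(path, params, query)
```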
D.2.1.6 Fragment identifiers
A URL can include a fragment identifier that is interpreted by the client
once a requested resource has been received. A fragment identifier is included
at the end of a URL separated from the path by the # character. The
meaning of the fragment identifier depends on the type of the resource. For
example, the following URL includes the fragment identifier tannin for
an HTML document:
http://example.com/documents/glossary.html#tannin
When a web browser receives the HTML resource, it then
positions the rendered document in the display to start at the anchor element
named tannin, if such an anchor exists.
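Because the fragment is interpreted by the client, a client separates it from the rest of the URL before making the request. A minimal Python sketch of that split, using the standard urldefrag( ) function:

```python
from urllib.parse import urldefrag

url = "http://example.com/documents/glossary.html#tannin"

# request_url is what identifies the resource to retrieve; the fragment
# stays with the client and is applied once the document has arrived.
request_url, fragment = urldefrag(url)

print(request_url)
print(fragment)
```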
D.2.1.7 Absolute and relative URLs
The URL general syntax
allows a resource to be specified as an absolute or a relative URL. Absolute URLs identify the protocol
http://, the host, and the path of the resource, and can be used alone to
locate a resource. Here's an example absolute URL:
http://example.com/documents/glossary.html
Relative URLs don't contain all the components and are always
resolved to an absolute URL with respect to a base URL. Typically, a
relative URL contains the path components of a resource and allows related sets
of resources to reference each other in a relative way. This allows path
hierarchies to be readily changed without the need to change every URL embedded
in a set of documents.
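A web browser has two ways to set base URLs when resolving relative URLs. The first method allows a
base URL to be encoded into the HTML using the <base> element. The second method uses the URL of the
page itself as the base. For example, a page might contain three links that use relative URLs:
Read my Curriculum Vitae
Read my employment history
Visit Fred's home page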
Consider what happens if the page that contains the example is
requested with the following URL:
http://example.com/development/dave/home.html
The three relative URLs are resolved to the following absolute
URLs by the browser:
http://example.com/development/dave/cv.html
http://example.com/development/dave/work/emp.html
http://example.com/admin/fred.html
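The same resolution can be reproduced with Python's standard urljoin( ) function. The three relative URLs below are assumptions: they are the simplest relative forms that resolve to the absolute URLs listed above, and the links in the page may have been written differently.

```python
from urllib.parse import urljoin

# The URL used to request the page acts as the base URL.
base = "http://example.com/development/dave/home.html"

# Assumed relative forms of the three links (not taken from the page).
relative_urls = ["cv.html", "work/emp.html", "/admin/fred.html"]

resolved = [urljoin(base, rel) for rel in relative_urls]
for rel, absolute in zip(relative_urls, resolved):
    print(f"{rel:18} -> {absolute}")
```

A relative URL that begins with a slash replaces the whole path of the base URL, while the others replace only the last path segment (home.html).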
Table D-1 shows several relative URLs and how they are resolved
to the corresponding absolute URLs given the base URL
http://example.com/a/b/c.html?foo=bar.
D.2.1.8 URL encoding
The characters used in
resource names, query strings, and parameters must not conflict with the
characters that have special meanings or aren't allowed in a URL. For example, a
question mark character identifies the beginning of a query, and an ampersand
(&) character separates multiple terms in a query.
The meanings of these characters can be escaped using a
hexadecimal encoding consisting of the percent character (%) followed
by the two hexadecimal digits representing the ASCII code of the character.
For example, an ampersand (&) character is encoded as
%26.
The characters that need to be escape-encoded are the control,
space, and reserved characters:
; / ? : @ & = + $ ,
Delimiter characters must also be encoded:
< > # % "
The following characters can cause problems with gateways and
network agents, and should also be encoded:
{} | \ ^ [ ] `
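The encoding rule can be tried out in Python with the standard quote( ) and unquote( ) functions. The wrapper escape( ) is a hypothetical helper. It passes safe="" because quote( ) otherwise leaves the / character unescaped.

```python
from urllib.parse import quote, unquote

def escape(text):
    """Percent-encode every character except letters, digits, and _ . - ~"""
    # safe="" overrides the default, which would leave "/" unescaped.
    return quote(text, safe="")

print(escape("&"))                 # the ampersand example from the text
print(escape("q=red&r=victoria"))  # reserved characters inside a value
print(unquote("%26"))              # decoding reverses the escape
```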
PHP provides the rawurlencode( ) function to encode special characters. For
example, rawurlencode( ) can build the href attribute of an
embedded link:
echo '';
D.2.2 HTTP Requests
The model used for HTTP
requests is to apply methods to identified resources. An HTTP request message contains a method
name, a URL to which the method is to be applied, and header fields. Some
requests can include a body (for example, the data collected in a form) that is
referred to in the HTTP standard as the entity-body.
The following is the example HTTP request we showed you in Chapter 1:
GET /~hugh/index.html HTTP/1.1
Host: goanna.cs.rmit.edu.au
From: hugh@hughwilliams.com (Hugh Williams)
User-agent: Hugh-fake-browser/version-1.0
Accept: text/plain, text/html
The request applies the GET method to the /~hugh/index.html resource. The action is to retrieve
the HTML document stored in the file index.html.
The first line of the message is the request line and contains the
method name GET, the request URL /~hugh/index.html, and the HTTP version
HTTP/1.1, each separated by a space character. The request is followed
by a list of header fields. Each field is represented as a name and value pair
separated with a colon character, and each field is on a separate line.
The header fields are followed by a blank line and then by the
optional body of the message. A POST method request usually contains a
body of text, as we discuss in the next section.
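The message layout just described can be sketched as a small Python parser. This is an illustration of the format only, not a complete HTTP implementation. The function parse_request( ) is a hypothetical helper, and it assumes lines end with a carriage return and line feed, as the HTTP standard specifies.

```python
def parse_request(message):
    """Split a raw HTTP request into its request line, headers, and body."""
    # A blank line separates the header fields from the optional body.
    head, _, body = message.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # The request line holds three parts separated by single spaces.
    method, url, version = lines[0].split(" ")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header field names are case-insensitive, so store them lowercased.
        headers[name.strip().lower()] = value.strip()
    return method, url, version, headers, body

raw = (
    "GET /~hugh/index.html HTTP/1.1\r\n"
    "Host: goanna.cs.rmit.edu.au\r\n"
    "Accept: text/plain, text/html\r\n"
    "\r\n"
)
method, url, version, headers, body = parse_request(raw)
print(method, url, version, headers)
```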
D.2.2.1 Request methods
The HTTP standard divides request methods into those that are
safe and those that aren't. The safe methods—GET and
HEAD—don't have any persistent side effects on the server. The unsafe
methods—POST, PUT, and DELETE—are designed to have
persistent effects on the server. The standard allows for clients to warn users
that a request may be unsafe and, for example, most browsers won't resend a
request with the POST method without user confirmation.
The HTTP standard further classifies methods as idempotent when a request can be repeated many times
and have the same effect as if the method was called once. The GET,
HEAD, PUT, and DELETE methods are classified as
idempotent. The POST method isn't.
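One way a client might use these two classifications is sketched below. The function names are hypothetical, and the policy shown (confirm unsafe requests, repeat only idempotent ones) is one reasonable reading of the standard rather than something it mandates.

```python
# Classification of the methods as described in the text.
SAFE_METHODS = {"GET", "HEAD"}
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE"}

def needs_user_confirmation(method):
    """An unsafe method may change state on the server, so warn the user first."""
    return method not in SAFE_METHODS

def can_repeat_silently(method):
    """Repeating an idempotent request has the same effect as sending it once."""
    return method in IDEMPOTENT_METHODS

for m in ("GET", "PUT", "POST"):
    print(m, needs_user_confirmation(m), can_repeat_silently(m))
```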
D.2.2.2 GET versus POST
The HTTP standard includes the two methods to achieve different
goals. The POST method was intended to create a resource. The contents
of the resource would be encoded into the body of the HTTP request. For example,
an order form might be processed and a new row in a database created.
The GET method is used when a request has no side
effects (such as performing a search) and the POST method is used when
a request has side effects (such as adding a new row to a database). A more
practical issue is that the GET method may result in long URLs, and may
even exceed some browser and server limits on URL length.
Use the POST method if any of the following are
true:
Use the GET method if all the following are true:
D.2.3 HTTP Responses
When a web server
processes a request from a browser, it attempts to apply the method to the
identified resource and create a response. The action of the request may succeed
or fail, but the web server always sends a response message back to the
browser.
An HTTP response message contains a status line, header fields,
and (usually) the requested entity as the body of the message. For example, the
following is the result of a GET method request for a small HTML
file:
HTTP/1.1 200 OK
Date: Sun, 19 Dec 2004 02:54:37 GMT
Server: Apache/2.0.48
Last-Modified: Fri, 19 Dec 2003 02:53:08 GMT
ETag: "4445f-bf-39f4f994"
Content-Length: 321
Accept-Ranges: bytes
Connection: close
Content-Type: text/html
The first line, the status line, begins with the protocol version of the
message, followed by a status code and a reason phrase, each separated by a
space character. The status code is a number and the reason phrase describes its
meaning; these are discussed in the next section. The status line is then
followed by the header fields. As with the request, each field is represented as
a name and value pair separated with a colon character. A blank line separates
the header fields from the body of the response, in this case an HTML
document.
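As a small illustration of the status line format, the following Python sketch splits it into its three parts. The function parse_status_line( ) is a hypothetical helper. It limits the split to two spaces because a reason phrase can itself contain spaces.

```python
def parse_status_line(line):
    """Split a status line into protocol version, status code, and reason phrase."""
    # Split on the first two spaces only; the reason phrase keeps its spaces.
    version, code, reason = line.split(" ", 2)
    return version, int(code), reason

print(parse_status_line("HTTP/1.1 200 OK"))
print(parse_status_line("HTTP/1.1 404 Not Found"))
```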
D.2.3.1 Status codes
HTTP status codes are
used to classify responses to requests. The HTTP status code system is
extensible, with a set of codes described in the standard that are "generally
recognized in current practice". HTTP defines a status code as a three-digit
number, where the first digit is the class of response. The following list shows
the five classes of codes defined by HTTP:
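Because the class is carried in the first digit, a program can classify any status code, including one it doesn't recognize, by integer division. The sketch below uses the class names from RFC-2616. The function status_class( ) is a hypothetical helper.

```python
# The five classes of status code, keyed by the first digit (names per RFC-2616).
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}

def status_class(code):
    """Return the class name for a three-digit HTTP status code."""
    return STATUS_CLASSES[code // 100]

for code in (200, 304, 404, 500):
    print(code, status_class(code))
```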
D.2.4 Caching
Most user agents, such
as web browsers, allow HTTP responses to be cached. HTTP responses are cached by
saving a response to a request in memory. When a browser considers a request, it
first looks to its local cache to see if it has an up-to-date copy of the
response before sending the request to the web server. This can significantly
reduce the number of requests sent to a web server, improving the performance of
the web application and responsiveness to users.
Consider a web site that includes a company logo on the top of
each HTML page:
<img src="/images/logo.gif">
When the browser requests a page that contains the image, a
separate request is sent to retrieve the image /images/logo.gif. If the image resource is cacheable, and browser caching is enabled, the browser
saves the response. A subsequent request for the image is recognized, and the
local copy from the cache is used rather than sending another request to the web
server.
A browser uses a cached response until the response becomes
stale, or the cache becomes full and the response
is displaced by the resources from other requests. The primary mechanism for
determining if a response is stale is comparing the date and time set in the
Expires header field with the date and time of the machine running the
browser. If the date and time are incorrectly set on the machine, a cached
response may expire immediately or be cached longer than intended.
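The comparison against the Expires header field can be sketched in Python as follows. The function is_stale( ) is a hypothetical helper that covers only this one mechanism. The now argument stands in for the clock of the machine running the browser, which is why a wrongly set clock changes the outcome.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def is_stale(expires_header, now=None):
    """Return True if the Expires date has passed according to the local clock."""
    expires = parsedate_to_datetime(expires_header)
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= expires

expires = "Sun, 19 Dec 2004 02:54:37 GMT"
clock_ok = datetime(2004, 12, 19, 2, 0, 0, tzinfo=timezone.utc)
clock_fast = datetime(2004, 12, 19, 3, 0, 0, tzinfo=timezone.utc)

print(is_stale(expires, now=clock_ok))    # still fresh
print(is_stale(expires, now=clock_fast))  # a fast clock expires it early
```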
HTTP describes the conditions that allow a user agent to cache
a response. However, there are many situations in which an application may wish
to prevent a page from being cached, particularly when the content of a response
is dynamically generated, such as in a web database application.
HTTP/1.1 uses the Cache-Control header field as its
basic caching control mechanism. For example, setting the Cache-Control
header field to no-cache in an HTTP response prevents the response from
being cached by an HTTP/1.1 user agent. The header can be used in requests and
responses, but we consider only responses here.
Some HTTP/1.1 Cache-Control settings are directed to
user agents that maintain caches for more than one user, such as proxy servers.
Proxy servers are used to achieve several goals, the most important of which is
to provide caching of responses for a group of users. A local network, such as
that found in a university department, can be configured to send all HTTP
requests to a proxy server. The proxy server forwards requests to the
destination web server and passes back the responses to the originating
client.
Proxy servers can cache responses and thus reduce requests sent
outside the local network. Setting the Cache-Control header field to
public allows a user agent to make the cached response available to any
request. Setting the Cache-Control header field to private
allows a user agent to make the cached response available only to the client who
made the initial request.
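A proxy's decision can be sketched as follows. Both functions are hypothetical helpers, and the check covers only the three settings discussed in this section (no-cache, public, and private), treated the way the text describes them. A real cache must consider other directives and header fields as well.

```python
def parse_cache_control(value):
    """Split a Cache-Control header value into a dictionary of directives."""
    directives = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, arg = item.partition("=")
        # Directives without an argument, such as "private", map to None.
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives

def proxy_may_share(value):
    """Simplified check: may a shared cache serve this response to other clients?"""
    directives = parse_cache_control(value)
    return "private" not in directives and "no-cache" not in directives

print(parse_cache_control("private, max-age=60"))
print(proxy_may_share("public"))
print(proxy_may_share("private"))
```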